* incoming
From: Andrew Morton @ 2021-07-01  1:46 UTC
  To: Linus Torvalds; +Cc: linux-mm, mm-commits


This is the rest of the -mm tree, less 66 patches which are dependent on
things which are (or were recently) in linux-next.  I'll trickle that
material over next week.


192 patches, based on 7cf3dead1ad70c72edb03e2d98e1f3dcd332cdb2 plus the
June 28 sendings.

Subsystems affected by this patch series:

  mm/hugetlb
  mm/userfaultfd
  mm/vmscan
  mm/kconfig
  mm/proc
  mm/z3fold
  mm/zbud
  mm/ras
  mm/mempolicy
  mm/memblock
  mm/migration
  mm/thp
  mm/nommu
  mm/kconfig
  mm/madvise
  mm/memory-hotplug
  mm/zswap
  mm/zsmalloc
  mm/zram
  mm/cleanups
  mm/kfence
  mm/hmm
  procfs
  sysctl
  misc
  core-kernel
  lib
  lz4
  checkpatch
  init
  kprobes
  nilfs2
  hfs
  signals
  exec
  kcov
  selftests
  compress/decompress
  ipc

Subsystem: mm/hugetlb

    Muchun Song <songmuchun@bytedance.com>:
    Patch series "Free some vmemmap pages of HugeTLB page", v23:
      mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c
      mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
      mm: hugetlb: gather discrete indexes of tail page
      mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
      mm: hugetlb: defer freeing of HugeTLB pages
      mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
      mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap
      mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled
      mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate

    Shixin Liu <liushixin2@huawei.com>:
      mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE
      mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake

    Miaohe Lin <linmiaohe@huawei.com>:
    Patch series "Cleanup and fixup for huge_memory:, v3:
      mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK
      mm/huge_memory.c: use page->deferred_list
      mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()
      mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd
      mm/huge_memory.c: don't discard hugepage if other processes are mapping it

    Christophe Leroy <christophe.leroy@csgroup.eu>:
    Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2:
      mm/hugetlb: change parameters of arch_make_huge_pte()
      mm/pgtable: add stubs for {pmd/pud}_{set/clear}_huge
      mm/vmalloc: enable mapping of huge pages at pte level in vmap
      mm/vmalloc: enable mapping of huge pages at pte level in vmalloc
      powerpc/8xx: add support for huge pages on VMAP and VMALLOC

    Nanyong Sun <sunnanyong@huawei.com>:
      khugepaged: selftests: remove debug_cow

    Mina Almasry <almasrymina@google.com>:
      mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY

    Muchun Song <songmuchun@bytedance.com>:
    Patch series "Split huge PMD mapping of vmemmap pages", v4:
      mm: sparsemem: split the huge PMD mapping of vmemmap pages
      mm: sparsemem: use huge PMD mapping for vmemmap pages
      mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON

    Mike Kravetz <mike.kravetz@oracle.com>:
    Patch series "Fix prep_compound_gigantic_page ref count adjustment":
      hugetlb: remove prep_compound_huge_page cleanup
      hugetlb: address ref count racing in prep_compound_gigantic_page

    Naoya Horiguchi <naoya.horiguchi@nec.com>:
      mm/hwpoison: disable pcp for page_handle_poison()

Subsystem: mm/userfaultfd

    Peter Xu <peterx@redhat.com>:
    Patch series "userfaultfd/selftests: A few cleanups", v2:
      userfaultfd/selftests: use user mode only
      userfaultfd/selftests: remove the time() check on delayed uffd
      userfaultfd/selftests: dropping VERIFY check in locking_thread
      userfaultfd/selftests: only dump counts if mode enabled
      userfaultfd/selftests: unify error handling
    Patch series "mm/uffd: Misc fix for uffd-wp and one more test":
      mm/thp: simplify copying of huge zero page pmd when fork
      mm/userfaultfd: fix uffd-wp special cases for fork()
      mm/userfaultfd: fail uffd-wp registration if not supported
      mm/pagemap: export uffd-wp protection information
      userfaultfd/selftests: add pagemap uffd-wp test

    Axel Rasmussen <axelrasmussen@google.com>:
    Patch series "userfaultfd: add minor fault handling for shmem", v6:
      userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte
      userfaultfd/shmem: support minor fault registration for shmem
      userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
      userfaultfd/shmem: advertise shmem minor fault support
      userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()
      userfaultfd/selftests: use memfd_create for shmem test type
      userfaultfd/selftests: create alias mappings in the shmem test
      userfaultfd/selftests: reinitialize test context in each test
      userfaultfd/selftests: exercise minor fault handling shmem support

Subsystem: mm/vmscan

    Yu Zhao <yuzhao@google.com>:
      mm/vmscan.c: fix potential deadlock in reclaim_pages()
      include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low

    Miaohe Lin <linmiaohe@huawei.com>:
      mm: workingset: define macro WORKINGSET_SHIFT

Subsystem: mm/kconfig

    Kefeng Wang <wangkefeng.wang@huawei.com>:
      mm/kconfig: move HOLES_IN_ZONE into mm

Subsystem: mm/proc

    Mike Rapoport <rppt@linux.ibm.com>:
      docs: proc.rst: meminfo: briefly describe gaps in memory accounting

    David Hildenbrand <david@redhat.com>:
    Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3:
      fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
      fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
      fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
      mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline()
      virtio-mem: use page_offline_(start|end) when setting PageOffline()
      fs/proc/kcore: use page_offline_(freeze|thaw)

Subsystem: mm/z3fold

    Miaohe Lin <linmiaohe@huawei.com>:
    Patch series "Cleanup and fixup for z3fold":
      mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS
      mm/z3fold: avoid possible underflow in z3fold_alloc()
      mm/z3fold: remove magic number in z3fold_create_pool()
      mm/z3fold: remove unused function handle_to_z3fold_header()
      mm/z3fold: fix potential memory leak in z3fold_destroy_pool()
      mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page

Subsystem: mm/zbud

    Miaohe Lin <linmiaohe@huawei.com>:
    Patch series "Cleanups for zbud", v2:
      mm/zbud: reuse unbuddied[0] as buddied in zbud_pool
      mm/zbud: don't export any zbud API

Subsystem: mm/ras

    YueHaibing <yuehaibing@huawei.com>:
      mm/compaction: use DEVICE_ATTR_WO macro

    Liu Xiang <liu.xiang@zlingsmart.com>:
      mm: compaction: remove duplicate !list_empty(&sublist) check

    Wonhyuk Yang <vvghjk1234@gmail.com>:
      mm/compaction: fix 'limit' in fast_isolate_freepages

Subsystem: mm/mempolicy

    Feng Tang <feng.tang@intel.com>:
    Patch series "mm/mempolicy: some fix and semantics cleanup", v4:
      mm/mempolicy: cleanup nodemask intersection check for oom
      mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
      mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy

    Yang Shi <shy828301@gmail.com>:
      mm: mempolicy: don't have to split pmd for huge zero page

    Ben Widawsky <ben.widawsky@intel.com>:
      mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies

Subsystem: mm/memblock

    Mike Rapoport <rppt@linux.ibm.com>:
    Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4:
      include/linux/mmzone.h: add documentation for pfn_valid()
      memblock: update initialization of reserved pages
      arm64: decouple check whether pfn is in linear map from pfn_valid()
      arm64: drop pfn_valid_within() and simplify pfn_valid()

    Anshuman Khandual <anshuman.khandual@arm.com>:
      arm64/mm: drop HAVE_ARCH_PFN_VALID

Subsystem: mm/migration

    Muchun Song <songmuchun@bytedance.com>:
      mm: migrate: fix missing update page_private to hugetlb_page_subpool

Subsystem: mm/thp

    Collin Fijalkovich <cfijalkovich@google.com>:
      mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs

    Yang Shi <shy828301@gmail.com>:
      mm: memory: add orig_pmd to struct vm_fault
      mm: memory: make numa_migrate_prep() non-static
      mm: thp: refactor NUMA fault handling
      mm: migrate: account THP NUMA migration counters correctly
      mm: migrate: don't split THP for misplaced NUMA page
      mm: migrate: check mapcount for THP instead of refcount
      mm: thp: skip make PMD PROT_NONE if THP migration is not supported

    Anshuman Khandual <anshuman.khandual@arm.com>:
      mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2

    Yang Shi <shy828301@gmail.com>:
      mm: rmap: make try_to_unmap() void function

    Hugh Dickins <hughd@google.com>:
      mm/thp: remap_page() is only needed on anonymous THP
      mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      mm/thp: fix strncpy warning

Subsystem: mm/nommu

    Chen Li <chenli@uniontech.com>:
      nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc

    Liam Howlett <liam.howlett@oracle.com>:
      mm/nommu: unexport do_munmap()

Subsystem: mm/kconfig

    Kefeng Wang <wangkefeng.wang@huawei.com>:
      mm: generalize ZONE_[DMA|DMA32]

Subsystem: mm/madvise

    David Hildenbrand <david@redhat.com>:
    Patch series "mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables", v2:
      mm: make variable names for populate_vma_page_range() consistent
      mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables
      MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT
      selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore
      selftests/vm: add test for MADV_POPULATE_(READ|WRITE)

Subsystem: mm/memory-hotplug

    Liam Mark <lmark@codeaurora.org>:
      mm/memory_hotplug: rate limit page migration warnings

    Oscar Salvador <osalvador@suse.de>:
      mm,memory_hotplug: drop unneeded locking

Subsystem: mm/zswap

    Miaohe Lin <linmiaohe@huawei.com>:
    Patch series "Cleanup and fixup for zswap":
      mm/zswap.c: remove unused function zswap_debugfs_exit()
      mm/zswap.c: avoid unnecessary copy-in at map time
      mm/zswap.c: fix two bugs in zswap_writeback_entry()

Subsystem: mm/zsmalloc

    Zhaoyang Huang <zhaoyang.huang@unisoc.com>:
      mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep

    Miaohe Lin <linmiaohe@huawei.com>:
    Patch series "Cleanup for zsmalloc":
      mm/zsmalloc.c: remove confusing code in obj_free()
      mm/zsmalloc.c: improve readability for async_free_zspage()

Subsystem: mm/zram

    Yue Hu <huyue2@yulong.com>:
      zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK

Subsystem: mm/cleanups

    Hyeonggon Yoo <42.hyeyoo@gmail.com>:
      mm: fix typos and grammar error in comments

    Anshuman Khandual <anshuman.khandual@arm.com>:
      mm: define default value for FIRST_USER_ADDRESS

    Zhen Lei <thunder.leizhen@huawei.com>:
      mm: fix spelling mistakes

    Mel Gorman <mgorman@techsingularity.net>:
    Patch series "Clean W=1 build warnings for mm/":
      mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages
      mm/vmalloc: include header for prototype of set_iounmap_nonlazy
      mm/page_alloc: make should_fail_alloc_page() static
      mm/mapping_dirty_helpers: remove double Note in kerneldoc
      mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection
      mm/memory_hotplug: fix kerneldoc comment for __try_online_node
      mm/memory_hotplug: fix kerneldoc comment for __remove_memory
      mm/zbud: add kerneldoc fields for zbud_pool
      mm/z3fold: add kerneldoc fields for z3fold_pool
      mm/swap: make swap_address_space an inline function
      mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations
      mm/page_alloc: move prototype for find_suitable_fallback
      mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM

    Anshuman Khandual <anshuman.khandual@arm.com>:
      mm/thp: define default pmd_pgtable()

Subsystem: mm/kfence

    Marco Elver <elver@google.com>:
      kfence: unconditionally use unbound work queue

Subsystem: mm/hmm

    Alistair Popple <apopple@nvidia.com>:
    Patch series "Add support for SVM atomics in Nouveau", v11:
      mm: remove special swap entry functions
      mm/swapops: rework swap entry manipulation code
      mm/rmap: split try_to_munlock from try_to_unmap
      mm/rmap: split migration into its own function
      mm: rename migrate_pgmap_owner
      mm/memory.c: allow different return codes for copy_nonpresent_pte()
      mm: device exclusive memory access
      mm: selftests for exclusive device memory
      nouveau/svm: refactor nouveau_range_fault
      nouveau/svm: implement atomic SVM access

Subsystem: procfs

    Marcelo Henrique Cerri <marcelo.cerri@canonical.com>:
      proc: Avoid mixing integer types in mem_rw()

    ZHOUFENG <zhoufeng.zf@bytedance.com>:
      fs/proc/kcore.c: add mmap interface

    Kalesh Singh <kaleshsingh@google.com>:
      procfs: allow reading fdinfo with PTRACE_MODE_READ
      procfs/dmabuf: add inode number to /proc/*/fdinfo

Subsystem: sysctl

    Jiapeng Chong <jiapeng.chong@linux.alibaba.com>:
      sysctl: remove redundant assignment to first

Subsystem: misc

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      drm: include only needed headers in ascii85.h

Subsystem: core-kernel

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      kernel.h: split out panic and oops helpers

Subsystem: lib

    Zhen Lei <thunder.leizhen@huawei.com>:
      lib: decompress_bunzip2: remove an unneeded semicolon

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
    Patch series "lib/string_helpers: get rid of ugly *_escape_mem_ascii()", v3:
      lib/string_helpers: switch to use BIT() macro
      lib/string_helpers: move ESCAPE_NP check inside 'else' branch in a loop
      lib/string_helpers: drop indentation level in string_escape_mem()
      lib/string_helpers: introduce ESCAPE_NA for escaping non-ASCII
      lib/string_helpers: introduce ESCAPE_NAP to escape non-ASCII and non-printable
      lib/string_helpers: allow to append additional characters to be escaped
      lib/test-string_helpers: print flags in hexadecimal format
      lib/test-string_helpers: get rid of trailing comma in terminators
      lib/test-string_helpers: add test cases for new features
      MAINTAINERS: add myself as designated reviewer for generic string library
      seq_file: introduce seq_escape_mem()
      seq_file: add seq_escape_str() as replica of string_escape_str()
      seq_file: convert seq_escape() to use seq_escape_str()
      nfsd: avoid non-flexible API in seq_quote_mem()
      seq_file: drop unused *_escape_mem_ascii()

    Trent Piepho <tpiepho@gmail.com>:
      lib/math/rational.c: fix divide by zero
      lib/math/rational: add Kunit test cases

    Zhen Lei <thunder.leizhen@huawei.com>:
      lib/decompressors: fix spelling mistakes
      lib/mpi: fix spelling mistakes

    Alexey Dobriyan <adobriyan@gmail.com>:
      lib: memscan() fixlet
      lib: uninline simple_strtoull()

    Matteo Croce <mcroce@microsoft.com>:
      lib/test_string.c: allow module removal

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      kernel.h: split out kstrtox() and simple_strtox() to a separate header

Subsystem: lz4

    Rajat Asthana <thisisrast7@gmail.com>:
      lz4_decompress: declare LZ4_decompress_safe_withPrefix64k static

    Dimitri John Ledkov <dimitri.ledkov@canonical.com>:
      lib/decompress_unlz4.c: correctly handle zero-padding around initrds.

Subsystem: checkpatch

    Guenter Roeck <linux@roeck-us.net>:
      checkpatch: scripts/spdxcheck.py now requires python3

    Joe Perches <joe@perches.com>:
      checkpatch: improve the indented label test

    Guenter Roeck <linux@roeck-us.net>:
      checkpatch: do not complain about positive return values starting with EPOLL

Subsystem: init

    Andrew Halaney <ahalaney@redhat.com>:
      init: print out unknown kernel parameters

Subsystem: kprobes

    Barry Song <song.bao.hua@hisilicon.com>:
      kprobes: remove duplicated strong free_insn_page in x86 and s390

Subsystem: nilfs2

    Colin Ian King <colin.king@canonical.com>:
      nilfs2: remove redundant continue statement in a while-loop

Subsystem: hfs

    Zhen Lei <thunder.leizhen@huawei.com>:
      hfsplus: remove unnecessary oom message

    Chung-Chiang Cheng <shepjeng@gmail.com>:
      hfsplus: report create_date to kstat.btime

Subsystem: signals

    Al Viro <viro@zeniv.linux.org.uk>:
      x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned

Subsystem: exec

    Alexey Dobriyan <adobriyan@gmail.com>:
      exec: remove checks in __register_binfmt()

Subsystem: kcov

    Marco Elver <elver@google.com>:
      kcov: add __no_sanitize_coverage to fix noinstr for all architectures

Subsystem: selftests

    Dave Hansen <dave.hansen@linux.intel.com>:
    Patch series "selftests/vm/pkeys: Bug fixes and a new test":
      selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
      selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
      selftests/vm/pkeys: refill shadow register after implicit kernel write
      selftests/vm/pkeys: exercise x86 XSAVE init state

Subsystem: compress/decompress

    Yu Kuai <yukuai3@huawei.com>:
      lib/decompressors: remove set but not used variable 'level'

Subsystem: ipc

    Vasily Averin <vvs@virtuozzo.com>:
    Patch series "ipc: allocations cleanup", v2:
      ipc sem: use kvmalloc for sem_undo allocation
      ipc: use kmalloc for msg_queue and shmid_kernel

    Manfred Spraul <manfred@colorfullife.com>:
      ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
      ipc/util.c: use binary search for max_idx

 Documentation/admin-guide/kernel-parameters.txt    |   35 
 Documentation/admin-guide/mm/hugetlbpage.rst       |   11 
 Documentation/admin-guide/mm/memory-hotplug.rst    |   13 
 Documentation/admin-guide/mm/pagemap.rst           |    2 
 Documentation/admin-guide/mm/userfaultfd.rst       |    3 
 Documentation/core-api/kernel-api.rst              |    7 
 Documentation/filesystems/proc.rst                 |   48 
 Documentation/vm/hmm.rst                           |   19 
 Documentation/vm/unevictable-lru.rst               |   33 
 MAINTAINERS                                        |   10 
 arch/alpha/Kconfig                                 |    5 
 arch/alpha/include/asm/pgalloc.h                   |    1 
 arch/alpha/include/asm/pgtable.h                   |    1 
 arch/alpha/include/uapi/asm/mman.h                 |    3 
 arch/alpha/kernel/setup.c                          |    2 
 arch/arc/include/asm/pgalloc.h                     |    2 
 arch/arc/include/asm/pgtable.h                     |    8 
 arch/arm/Kconfig                                   |    3 
 arch/arm/include/asm/pgalloc.h                     |    1 
 arch/arm64/Kconfig                                 |   15 
 arch/arm64/include/asm/hugetlb.h                   |    3 
 arch/arm64/include/asm/memory.h                    |    2 
 arch/arm64/include/asm/page.h                      |    4 
 arch/arm64/include/asm/pgalloc.h                   |    1 
 arch/arm64/include/asm/pgtable.h                   |    2 
 arch/arm64/kernel/setup.c                          |    1 
 arch/arm64/kvm/mmu.c                               |    2 
 arch/arm64/mm/hugetlbpage.c                        |    5 
 arch/arm64/mm/init.c                               |   51 
 arch/arm64/mm/ioremap.c                            |    4 
 arch/arm64/mm/mmu.c                                |   22 
 arch/csky/include/asm/pgalloc.h                    |    2 
 arch/csky/include/asm/pgtable.h                    |    1 
 arch/hexagon/include/asm/pgtable.h                 |    4 
 arch/ia64/Kconfig                                  |    7 
 arch/ia64/include/asm/pal.h                        |    1 
 arch/ia64/include/asm/pgalloc.h                    |    1 
 arch/ia64/include/asm/pgtable.h                    |    1 
 arch/m68k/Kconfig                                  |    5 
 arch/m68k/include/asm/mcf_pgalloc.h                |    2 
 arch/m68k/include/asm/mcf_pgtable.h                |    2 
 arch/m68k/include/asm/motorola_pgalloc.h           |    1 
 arch/m68k/include/asm/motorola_pgtable.h           |    2 
 arch/m68k/include/asm/pgtable_mm.h                 |    1 
 arch/m68k/include/asm/sun3_pgalloc.h               |    1 
 arch/microblaze/Kconfig                            |    4 
 arch/microblaze/include/asm/pgalloc.h              |    2 
 arch/microblaze/include/asm/pgtable.h              |    2 
 arch/mips/Kconfig                                  |   10 
 arch/mips/include/asm/pgalloc.h                    |    1 
 arch/mips/include/asm/pgtable-32.h                 |    1 
 arch/mips/include/asm/pgtable-64.h                 |    1 
 arch/mips/include/uapi/asm/mman.h                  |    3 
 arch/mips/kernel/relocate.c                        |    1 
 arch/mips/sgi-ip22/ip22-reset.c                    |    1 
 arch/mips/sgi-ip32/ip32-reset.c                    |    1 
 arch/nds32/include/asm/pgalloc.h                   |    5 
 arch/nios2/include/asm/pgalloc.h                   |    1 
 arch/nios2/include/asm/pgtable.h                   |    2 
 arch/openrisc/include/asm/pgalloc.h                |    2 
 arch/openrisc/include/asm/pgtable.h                |    1 
 arch/parisc/include/asm/pgalloc.h                  |    1 
 arch/parisc/include/asm/pgtable.h                  |    2 
 arch/parisc/include/uapi/asm/mman.h                |    3 
 arch/parisc/kernel/pdc_chassis.c                   |    1 
 arch/powerpc/Kconfig                               |    6 
 arch/powerpc/include/asm/book3s/pgtable.h          |    1 
 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h   |    5 
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h       |   43 
 arch/powerpc/include/asm/nohash/32/pgtable.h       |    1 
 arch/powerpc/include/asm/nohash/64/pgtable.h       |    2 
 arch/powerpc/include/asm/pgalloc.h                 |    5 
 arch/powerpc/include/asm/pgtable.h                 |    6 
 arch/powerpc/kernel/setup-common.c                 |    1 
 arch/powerpc/platforms/Kconfig.cputype             |    1 
 arch/riscv/Kconfig                                 |    5 
 arch/riscv/include/asm/pgalloc.h                   |    2 
 arch/riscv/include/asm/pgtable.h                   |    2 
 arch/s390/Kconfig                                  |    6 
 arch/s390/include/asm/pgalloc.h                    |    3 
 arch/s390/include/asm/pgtable.h                    |    5 
 arch/s390/kernel/ipl.c                             |    1 
 arch/s390/kernel/kprobes.c                         |    5 
 arch/s390/mm/pgtable.c                             |    2 
 arch/sh/include/asm/pgalloc.h                      |    1 
 arch/sh/include/asm/pgtable.h                      |    2 
 arch/sparc/Kconfig                                 |    5 
 arch/sparc/include/asm/pgalloc_32.h                |    1 
 arch/sparc/include/asm/pgalloc_64.h                |    1 
 arch/sparc/include/asm/pgtable_32.h                |    3 
 arch/sparc/include/asm/pgtable_64.h                |    8 
 arch/sparc/kernel/sstate.c                         |    1 
 arch/sparc/mm/hugetlbpage.c                        |    6 
 arch/sparc/mm/init_64.c                            |    1 
 arch/um/drivers/mconsole_kern.c                    |    1 
 arch/um/include/asm/pgalloc.h                      |    1 
 arch/um/include/asm/pgtable-2level.h               |    1 
 arch/um/include/asm/pgtable-3level.h               |    1 
 arch/um/kernel/um_arch.c                           |    1 
 arch/x86/Kconfig                                   |   17 
 arch/x86/include/asm/desc.h                        |    1 
 arch/x86/include/asm/pgalloc.h                     |    2 
 arch/x86/include/asm/pgtable_types.h               |    2 
 arch/x86/kernel/cpu/mshyperv.c                     |    1 
 arch/x86/kernel/kprobes/core.c                     |    6 
 arch/x86/kernel/setup.c                            |    1 
 arch/x86/mm/init_64.c                              |   21 
 arch/x86/mm/pgtable.c                              |   34 
 arch/x86/purgatory/purgatory.c                     |    2 
 arch/x86/xen/enlighten.c                           |    1 
 arch/xtensa/include/asm/pgalloc.h                  |    2 
 arch/xtensa/include/asm/pgtable.h                  |    1 
 arch/xtensa/include/uapi/asm/mman.h                |    3 
 arch/xtensa/platforms/iss/setup.c                  |    1 
 drivers/block/zram/zram_drv.h                      |    2 
 drivers/bus/brcmstb_gisb.c                         |    1 
 drivers/char/ipmi/ipmi_msghandler.c                |    1 
 drivers/clk/analogbits/wrpll-cln28hpc.c            |    4 
 drivers/edac/altera_edac.c                         |    1 
 drivers/firmware/google/gsmi.c                     |    1 
 drivers/gpu/drm/nouveau/include/nvif/if000c.h      |    1 
 drivers/gpu/drm/nouveau/nouveau_svm.c              |  162 ++-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h      |    1 
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c |    6 
 drivers/hv/vmbus_drv.c                             |    1 
 drivers/hwtracing/coresight/coresight-cpu-debug.c  |    1 
 drivers/leds/trigger/ledtrig-activity.c            |    1 
 drivers/leds/trigger/ledtrig-heartbeat.c           |    1 
 drivers/leds/trigger/ledtrig-panic.c               |    1 
 drivers/misc/bcm-vk/bcm_vk_dev.c                   |    1 
 drivers/misc/ibmasm/heartbeat.c                    |    1 
 drivers/misc/pvpanic/pvpanic.c                     |    1 
 drivers/net/ipa/ipa_smp2p.c                        |    1 
 drivers/parisc/power.c                             |    1 
 drivers/power/reset/ltc2952-poweroff.c             |    1 
 drivers/remoteproc/remoteproc_core.c               |    1 
 drivers/s390/char/con3215.c                        |    1 
 drivers/s390/char/con3270.c                        |    1 
 drivers/s390/char/sclp.c                           |    1 
 drivers/s390/char/sclp_con.c                       |    1 
 drivers/s390/char/sclp_vt220.c                     |    1 
 drivers/s390/char/zcore.c                          |    1 
 drivers/soc/bcm/brcmstb/pm/pm-arm.c                |    1 
 drivers/staging/olpc_dcon/olpc_dcon.c              |    1 
 drivers/video/fbdev/hyperv_fb.c                    |    1 
 drivers/virtio/virtio_mem.c                        |    2 
 fs/Kconfig                                         |   15 
 fs/exec.c                                          |    3 
 fs/hfsplus/inode.c                                 |    5 
 fs/hfsplus/xattr.c                                 |    1 
 fs/nfsd/nfs4state.c                                |    2 
 fs/nilfs2/btree.c                                  |    1 
 fs/open.c                                          |   13 
 fs/proc/base.c                                     |    6 
 fs/proc/fd.c                                       |   20 
 fs/proc/kcore.c                                    |  136 ++
 fs/proc/task_mmu.c                                 |   34 
 fs/seq_file.c                                      |   43 
 fs/userfaultfd.c                                   |   15 
 include/asm-generic/bug.h                          |    3 
 include/linux/ascii85.h                            |    3 
 include/linux/bootmem_info.h                       |   68 +
 include/linux/compat.h                             |    2 
 include/linux/compiler-clang.h                     |   17 
 include/linux/compiler-gcc.h                       |    6 
 include/linux/compiler_types.h                     |    2 
 include/linux/huge_mm.h                            |   74 -
 include/linux/hugetlb.h                            |   80 +
 include/linux/hugetlb_cgroup.h                     |   19 
 include/linux/kcore.h                              |    3 
 include/linux/kernel.h                             |  227 ----
 include/linux/kprobes.h                            |    1 
 include/linux/kstrtox.h                            |  155 ++
 include/linux/memblock.h                           |    4 
 include/linux/memory_hotplug.h                     |   27 
 include/linux/mempolicy.h                          |    9 
 include/linux/memremap.h                           |    2 
 include/linux/migrate.h                            |   27 
 include/linux/mm.h                                 |   18 
 include/linux/mm_types.h                           |    2 
 include/linux/mmu_notifier.h                       |   26 
 include/linux/mmzone.h                             |   27 
 include/linux/mpi.h                                |    4 
 include/linux/page-flags.h                         |   22 
 include/linux/panic.h                              |   98 +
 include/linux/panic_notifier.h                     |   12 
 include/linux/pgtable.h                            |   44 
 include/linux/rmap.h                               |   13 
 include/linux/seq_file.h                           |   10 
 include/linux/shmem_fs.h                           |   19 
 include/linux/signal.h                             |    2 
 include/linux/string.h                             |    7 
 include/linux/string_helpers.h                     |   31 
 include/linux/sunrpc/cache.h                       |    1 
 include/linux/swap.h                               |   19 
 include/linux/swapops.h                            |  171 +--
 include/linux/thread_info.h                        |    1 
 include/linux/userfaultfd_k.h                      |    5 
 include/linux/vmalloc.h                            |   15 
 include/linux/zbud.h                               |   23 
 include/trace/events/vmscan.h                      |   41 
 include/uapi/asm-generic/mman-common.h             |    3 
 include/uapi/linux/mempolicy.h                     |    1 
 include/uapi/linux/userfaultfd.h                   |    7 
 init/main.c                                        |   42 
 ipc/msg.c                                          |    6 
 ipc/sem.c                                          |   25 
 ipc/shm.c                                          |    6 
 ipc/util.c                                         |   44 
 ipc/util.h                                         |    3 
 kernel/hung_task.c                                 |    1 
 kernel/kexec_core.c                                |    1 
 kernel/kprobes.c                                   |    2 
 kernel/panic.c                                     |    1 
 kernel/rcu/tree.c                                  |    2 
 kernel/signal.c                                    |   14 
 kernel/sysctl.c                                    |    4 
 kernel/trace/trace.c                               |    1 
 lib/Kconfig.debug                                  |   12 
 lib/decompress_bunzip2.c                           |    6 
 lib/decompress_unlz4.c                             |    8 
 lib/decompress_unlzo.c                             |    3 
 lib/decompress_unxz.c                              |    2 
 lib/decompress_unzstd.c                            |    4 
 lib/kstrtox.c                                      |    5 
 lib/lz4/lz4_decompress.c                           |    2 
 lib/math/Makefile                                  |    1 
 lib/math/rational-test.c                           |   56 +
 lib/math/rational.c                                |   16 
 lib/mpi/longlong.h                                 |    4 
 lib/mpi/mpicoder.c                                 |    6 
 lib/mpi/mpiutil.c                                  |    2 
 lib/parser.c                                       |    1 
 lib/string.c                                       |    2 
 lib/string_helpers.c                               |  142 +-
 lib/test-string_helpers.c                          |  157 ++-
 lib/test_hmm.c                                     |  127 ++
 lib/test_hmm_uapi.h                                |    2 
 lib/test_string.c                                  |    5 
 lib/vsprintf.c                                     |    1 
 lib/xz/xz_dec_bcj.c                                |    2 
 lib/xz/xz_dec_lzma2.c                              |    8 
 lib/zlib_inflate/inffast.c                         |    2 
 lib/zstd/huf.h                                     |    2 
 mm/Kconfig                                         |   16 
 mm/Makefile                                        |    2 
 mm/bootmem_info.c                                  |  127 ++
 mm/compaction.c                                    |   20 
 mm/debug_vm_pgtable.c                              |  109 --
 mm/gup.c                                           |   58 +
 mm/hmm.c                                           |   12 
 mm/huge_memory.c                                   |  269 ++---
 mm/hugetlb.c                                       |  369 +++++--
 mm/hugetlb_vmemmap.c                               |  332 ++++++
 mm/hugetlb_vmemmap.h                               |   53 -
 mm/internal.h                                      |   29 
 mm/kfence/core.c                                   |    4 
 mm/khugepaged.c                                    |   20 
 mm/madvise.c                                       |   66 +
 mm/mapping_dirty_helpers.c                         |    2 
 mm/memblock.c                                      |   28 
 mm/memcontrol.c                                    |    4 
 mm/memory-failure.c                                |   38 
 mm/memory.c                                        |  239 +++-
 mm/memory_hotplug.c                                |  161 ---
 mm/mempolicy.c                                     |  323 ++----
 mm/migrate.c                                       |  268 +----
 mm/mlock.c                                         |   12 
 mm/mmap_lock.c                                     |   59 -
 mm/mprotect.c                                      |   18 
 mm/nommu.c                                         |    5 
 mm/oom_kill.c                                      |    2 
 mm/page_alloc.c                                    |    5 
 mm/page_vma_mapped.c                               |   15 
 mm/rmap.c                                          |  644 +++++++++---
 mm/shmem.c                                         |  125 --
 mm/sparse-vmemmap.c                                |  432 +++++++-
 mm/sparse.c                                        |    1 
 mm/swap.c                                          |    2 
 mm/swapfile.c                                      |    2 
 mm/userfaultfd.c                                   |  249 ++--
 mm/util.c                                          |   40 
 mm/vmalloc.c                                       |   37 
 mm/vmscan.c                                        |   20 
 mm/workingset.c                                    |   10 
 mm/z3fold.c                                        |   39 
 mm/zbud.c                                          |  235 ++--
 mm/zsmalloc.c                                      |    5 
 mm/zswap.c                                         |   26 
 scripts/checkpatch.pl                              |   16 
 tools/testing/selftests/vm/.gitignore              |    3 
 tools/testing/selftests/vm/Makefile                |    5 
 tools/testing/selftests/vm/hmm-tests.c             |  158 +++
 tools/testing/selftests/vm/khugepaged.c            |    4 
 tools/testing/selftests/vm/madv_populate.c         |  342 ++++++
 tools/testing/selftests/vm/pkey-x86.h              |    1 
 tools/testing/selftests/vm/protection_keys.c       |   85 +
 tools/testing/selftests/vm/run_vmtests.sh          |   16 
 tools/testing/selftests/vm/userfaultfd.c           | 1094 ++++++++++-----------
 299 files changed, 6277 insertions(+), 3183 deletions(-)



* [patch 001/192] mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c
From: Andrew Morton @ 2021-07-01  1:47 UTC
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c

Patch series "Free some vmemmap pages of HugeTLB page", v23.

This patch series frees some vmemmap pages (struct page structures)
associated with each HugeTLB page when it is preallocated, in order to
save memory.

To reduce the difficulty of reviewing this first version, we disable
PMD/huge page mapping of the vmemmap when this feature is enabled.  This
eliminates a bunch of complex page table manipulation code.  Once this
patch series is solid, we can add the vmemmap page table manipulation
code in the future.

The struct page structures (page structs) are used to describe a physical
page frame.  By default, there is a one-to-one mapping from a page frame
to its corresponding page struct.

HugeTLB pages consist of multiple base-page-size pages and are
supported by many architectures.  See hugetlbpage.rst in the Documentation
directory for more details.  On the x86 architecture, HugeTLB pages of
size 2MB and 1GB are currently supported.  Since the base page size on x86
is 4KB, a 2MB HugeTLB page consists of 512 base pages and a 1GB HugeTLB
page consists of 4096 base pages.  For each base page, there is a
corresponding page struct.

Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page.  HUGETLB_CGROUP_MIN_ORDER
provides this upper limit.  The only 'useful' information in the remaining
page structs is the compound_head field, and this field is the same for
all tail pages.

By removing redundant page structs for HugeTLB pages, memory can be
returned to the buddy allocator for other uses.

When the system boots up, every 2MB HugeTLB page has 512 struct page
structs, which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    2MB    |                     +-----------+                +-----------+
 |           |                     |     5     | -------------> |     5     |
 |           |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of page->compound_head is the same for all tail pages.  The
first page of page structs (page 0) associated with the HugeTLB page
contains the 4 page structs necessary to describe the HugeTLB.  The only
use of the remaining pages of page structs (page 1 to page 7) is to point
to page->compound_head.  Therefore, we can remap pages 2 to 7 to page 1. 
Only 2 pages of page structs will be used for each HugeTLB page.  This
will allow us to free the remaining 6 pages to the buddy allocator.
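
For reference, resolving a tail page to its head is just a load, a bit
test and a subtraction.  The snippet below is a simplified sketch of the
kernel's compound_head() helper (the real definition lives in
include/linux/page-flags.h); it shows why the stored value is identical
for every tail page of a compound page:

    /*
     * Simplified sketch of compound_head(): a tail page stores the head
     * page's address with bit 0 set in page->compound_head, so all tail
     * pages of one compound page hold the same value.
     */
    static inline struct page *compound_head(struct page *page)
    {
            unsigned long head = READ_ONCE(page->compound_head);

            if (head & 1)           /* bit 0 set: this is a tail page */
                    return (struct page *)(head - 1);
            return page;            /* head (or non-compound) page */
    }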

Here is how things look after remapping.

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    2MB    |                     +-----------+                       | | |
 |           |                     |     5     | ----------------------+ | |
 |           |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we must allocate 6
pages for the vmemmap and restore the previous mapping relationship.

Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page.  It
is handled similarly, and we can use the same approach to free its
vmemmap pages.

In this case, for a 1GB HugeTLB page, we can save 4094 pages.  This is a
very substantial gain.  On our servers we run SPDK/QEMU applications
which use 1024GB of HugeTLB pages.  With this feature enabled, we can
save ~16GB (1GB hugepages) / ~12GB (2MB hugepages) of memory.
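
As a back-of-the-envelope check of those numbers (a sketch assuming 4KB
base pages and a 64-byte struct page, the typical x86-64 values; other
configurations will differ):

    /* Rough estimate of the vmemmap savings above; assumes 4KB base
     * pages and a 64-byte struct page (typical x86-64 values). */
    #include <stdio.h>

    int main(void)
    {
            const unsigned long base = 4096, sp = 64;
            unsigned long vmemmap_2m = (2UL << 20) / base * sp / base; /* 8 */
            unsigned long vmemmap_1g = (1UL << 30) / base * sp / base; /* 4096 */

            /* 2 vmemmap pages are kept per HugeTLB page, the rest are freed. */
            printf("2MB page: %lu of %lu vmemmap pages freed\n",
                   vmemmap_2m - 2, vmemmap_2m);
            printf("1GB page: %lu of %lu vmemmap pages freed\n",
                   vmemmap_1g - 2, vmemmap_1g);

            /* A 1024GB pool: ~16GB freed with 1GB pages, ~12GB with 2MB. */
            printf("1024GB pool (1GB pages): ~%lu MB freed\n",
                   1024 * (vmemmap_1g - 2) * base >> 20);
            printf("1024GB pool (2MB pages): ~%lu MB freed\n",
                   (512UL << 10) * (vmemmap_2m - 2) * base >> 20);
            return 0;
    }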

Because the vmemmap page tables are reconstructed on the
freeing/allocating path, this adds some overhead.  Here is some overhead
analysis.

1) Allocating 10240 2MB HugeTLB pages.

   a) With this patch series applied:
   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.166s
   user     0m0.000s
   sys      0m0.166s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
     kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
     @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [8K, 16K)           5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [16K, 32K)          4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
   [32K, 64K)             4 |                                                    |

   b) Without this patch series:
   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.067s
   user     0m0.000s
   sys      0m0.067s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
     kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
     @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)           10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)             93 |                                                    |

   Summary: allocation with this feature is about ~2x slower than before.

2) Freeing 10240 2MB HugeTLB pages.

   a) With this patch series applied:
   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.213s
   user     0m0.000s
   sys      0m0.213s

   # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
     kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
     @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [8K, 16K)              6 |                                                    |
   [16K, 32K)         10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [32K, 64K)             7 |                                                    |

   b) Without this patch series:
   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.081s
   user     0m0.000s
   sys      0m0.081s

   # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
     kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
     @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)            6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)           3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
   [16K, 32K)             8 |                                                    |

   Summary: freeing via __free_hugepage is about ~2-3x slower than before.

Although the overhead has increased, it is not significant.  As Mike
said [1], "However, remember that the majority of use cases create
HugeTLB pages at or shortly after boot time and add them to the pool.  So,
additional overhead is at pool creation time.  There is no change to
'normal run time' operations of getting a page from or returning a page to
the pool (think page fault/unmap)".

In addition to the memory gains, this series brings another benefit,
measured by Joao Martins (many thanks for his effort): page (un)pinners
see an improvement, presumably because there are fewer memmap pages and
thus the head/tail pages stay in cache more often.

Out of the box Joao saw (when comparing linux-next against linux-next +
this series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):

	get_user_pages(): ~32k -> ~9k
	unpin_user_pages(): ~75k -> ~70k

Usually any tight loop fetching compound_head(), or reading tail page
data (e.g. compound_head), benefits a lot.  There are some unpinning
inefficiencies Joao was fixing [2]; with those fixes added, the gain is
even larger:

	unpin_user_pages(): ~27k -> ~3.8k

[1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/


This patch (of 9):

Move the common bootmem info registration API into a separate
bootmem_info.c.  Later patches will use {get,put}_page_bootmem() to
initialize pages for the vmemmap, or to free vmemmap pages back to the
buddy allocator, so move these helpers out of
CONFIG_MEMORY_HOTPLUG_SPARSE.  This is just code movement without any
functional change.
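
For context, here is a condensed sketch of how the moved helpers get
used, modelled on register_page_bootmem_info_section() in the new
mm/bootmem_info.c below (not a complete function):

    /* Tag each page backing a section's memmap as SECTION_INFO bootmem.
     * get_page_bootmem() stores the type in page->freelist and the
     * section number in page_private(), and takes a page reference;
     * put_page_bootmem() drops it and gives the page back to the buddy
     * allocator when the last reference goes away. */
    page = virt_to_page(memmap);
    mapsize = PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION) >> PAGE_SHIFT;

    for (i = 0; i < mapsize; i++, page++)
            get_page_bootmem(section_nr, page, SECTION_INFO);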

Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/sparc/mm/init_64.c        |    1 
 arch/x86/mm/init_64.c          |    3 
 include/linux/bootmem_info.h   |   40 +++++++++
 include/linux/memory_hotplug.h |   27 ------
 mm/Makefile                    |    1 
 mm/bootmem_info.c              |  127 +++++++++++++++++++++++++++++++
 mm/memory_hotplug.c            |  116 ----------------------------
 mm/sparse.c                    |    1 
 8 files changed, 172 insertions(+), 144 deletions(-)

--- a/arch/sparc/mm/init_64.c~mm-memory_hotplug-factor-out-bootmem-core-functions-to-bootmem_infoc
+++ a/arch/sparc/mm/init_64.c
@@ -27,6 +27,7 @@
 #include <linux/percpu.h>
 #include <linux/mmzone.h>
 #include <linux/gfp.h>
+#include <linux/bootmem_info.h>
 
 #include <asm/head.h>
 #include <asm/page.h>
--- a/arch/x86/mm/init_64.c~mm-memory_hotplug-factor-out-bootmem-core-functions-to-bootmem_infoc
+++ a/arch/x86/mm/init_64.c
@@ -33,6 +33,7 @@
 #include <linux/nmi.h>
 #include <linux/gfp.h>
 #include <linux/kcore.h>
+#include <linux/bootmem_info.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -1623,7 +1624,7 @@ int __meminit vmemmap_populate(unsigned
 	return err;
 }
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HAVE_BOOTMEM_INFO_NODE)
+#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long nr_pages)
 {
--- /dev/null
+++ a/include/linux/bootmem_info.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_BOOTMEM_INFO_H
+#define __LINUX_BOOTMEM_INFO_H
+
+#include <linux/mmzone.h>
+
+/*
+ * Types for free bootmem stored in page->lru.next. These have to be in
+ * some random range in unsigned long space for debugging purposes.
+ */
+enum {
+	MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
+	SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
+	MIX_SECTION_INFO,
+	NODE_INFO,
+	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
+};
+
+#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
+void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
+
+void get_page_bootmem(unsigned long info, struct page *page,
+		      unsigned long type);
+void put_page_bootmem(struct page *page);
+#else
+static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
+{
+}
+
+static inline void put_page_bootmem(struct page *page)
+{
+}
+
+static inline void get_page_bootmem(unsigned long info, struct page *page,
+				    unsigned long type)
+{
+}
+#endif
+
+#endif /* __LINUX_BOOTMEM_INFO_H */
--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-factor-out-bootmem-core-functions-to-bootmem_infoc
+++ a/include/linux/memory_hotplug.h
@@ -18,18 +18,6 @@ struct vmem_altmap;
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
 
-/*
- * Types for free bootmem stored in page->lru.next. These have to be in
- * some random range in unsigned long space for debugging purposes.
- */
-enum {
-	MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
-	SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
-	MIX_SECTION_INFO,
-	NODE_INFO,
-	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
-};
-
 /* Types for control the zone type of onlined and offlined memory */
 enum {
 	/* Offline the memory. */
@@ -222,17 +210,6 @@ static inline void arch_refresh_nodedata
 #endif /* CONFIG_NUMA */
 #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */
 
-#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
-extern void __init register_page_bootmem_info_node(struct pglist_data *pgdat);
-#else
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-#endif
-extern void put_page_bootmem(struct page *page);
-extern void get_page_bootmem(unsigned long ingo, struct page *page,
-			     unsigned long type);
-
 void get_online_mems(void);
 void put_online_mems(void);
 
@@ -260,10 +237,6 @@ static inline void zone_span_writelock(s
 static inline void zone_span_writeunlock(struct zone *zone) {}
 static inline void zone_seqlock_init(struct zone *zone) {}
 
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-
 static inline int try_online_node(int nid)
 {
 	return 0;
--- /dev/null
+++ a/mm/bootmem_info.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Bootmem core functions.
+ *
+ * Copyright (c) 2020, Bytedance.
+ *
+ *     Author: Muchun Song <songmuchun@bytedance.com>
+ *
+ */
+#include <linux/mm.h>
+#include <linux/compiler.h>
+#include <linux/memblock.h>
+#include <linux/bootmem_info.h>
+#include <linux/memory_hotplug.h>
+
+void get_page_bootmem(unsigned long info, struct page *page, unsigned long type)
+{
+	page->freelist = (void *)type;
+	SetPagePrivate(page);
+	set_page_private(page, info);
+	page_ref_inc(page);
+}
+
+void put_page_bootmem(struct page *page)
+{
+	unsigned long type;
+
+	type = (unsigned long) page->freelist;
+	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
+
+	if (page_ref_dec_return(page) == 1) {
+		page->freelist = NULL;
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+		INIT_LIST_HEAD(&page->lru);
+		free_reserved_page(page);
+	}
+}
+
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+	struct mem_section_usage *usage;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	/* Get section's memmap address */
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	/*
+	 * Get page for the memmap's phys address
+	 * XXX: need more consideration for sparse_vmemmap...
+	 */
+	page = virt_to_page(memmap);
+	mapsize = sizeof(struct page) * PAGES_PER_SECTION;
+	mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
+
+	/* remember memmap's page */
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, SECTION_INFO);
+
+	usage = ms->usage;
+	page = virt_to_page(usage);
+
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+
+}
+#else /* CONFIG_SPARSEMEM_VMEMMAP */
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+	unsigned long mapsize, section_nr, i;
+	struct mem_section *ms;
+	struct page *page, *memmap;
+	struct mem_section_usage *usage;
+
+	section_nr = pfn_to_section_nr(start_pfn);
+	ms = __nr_to_section(section_nr);
+
+	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
+
+	usage = ms->usage;
+	page = virt_to_page(usage);
+
+	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
+
+	for (i = 0; i < mapsize; i++, page++)
+		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+}
+#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
+
+void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
+{
+	unsigned long i, pfn, end_pfn, nr_pages;
+	int node = pgdat->node_id;
+	struct page *page;
+
+	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+	page = virt_to_page(pgdat);
+
+	for (i = 0; i < nr_pages; i++, page++)
+		get_page_bootmem(node, page, NODE_INFO);
+
+	pfn = pgdat->node_start_pfn;
+	end_pfn = pgdat_end_pfn(pgdat);
+
+	/* register section info */
+	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+		/*
+		 * Some platforms can assign the same pfn to multiple nodes - on
+		 * node0 as well as nodeN.  To avoid registering a pfn against
+		 * multiple nodes we check that this pfn does not already
+		 * reside in some other nodes.
+		 */
+		if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
+			register_page_bootmem_info_section(pfn);
+	}
+}
--- a/mm/Makefile~mm-memory_hotplug-factor-out-bootmem-core-functions-to-bootmem_infoc
+++ a/mm/Makefile
@@ -125,3 +125,4 @@ obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += m
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
+obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
--- a/mm/memory_hotplug.c~mm-memory_hotplug-factor-out-bootmem-core-functions-to-bootmem_infoc
+++ a/mm/memory_hotplug.c
@@ -154,122 +154,6 @@ static void release_memory_resource(stru
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
-void get_page_bootmem(unsigned long info,  struct page *page,
-		      unsigned long type)
-{
-	page->freelist = (void *)type;
-	SetPagePrivate(page);
-	set_page_private(page, info);
-	page_ref_inc(page);
-}
-
-void put_page_bootmem(struct page *page)
-{
-	unsigned long type;
-
-	type = (unsigned long) page->freelist;
-	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
-	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
-
-	if (page_ref_dec_return(page) == 1) {
-		page->freelist = NULL;
-		ClearPagePrivate(page);
-		set_page_private(page, 0);
-		INIT_LIST_HEAD(&page->lru);
-		free_reserved_page(page);
-	}
-}
-
-#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
-#ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void register_page_bootmem_info_section(unsigned long start_pfn)
-{
-	unsigned long mapsize, section_nr, i;
-	struct mem_section *ms;
-	struct page *page, *memmap;
-	struct mem_section_usage *usage;
-
-	section_nr = pfn_to_section_nr(start_pfn);
-	ms = __nr_to_section(section_nr);
-
-	/* Get section's memmap address */
-	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-
-	/*
-	 * Get page for the memmap's phys address
-	 * XXX: need more consideration for sparse_vmemmap...
-	 */
-	page = virt_to_page(memmap);
-	mapsize = sizeof(struct page) * PAGES_PER_SECTION;
-	mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
-
-	/* remember memmap's page */
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, SECTION_INFO);
-
-	usage = ms->usage;
-	page = virt_to_page(usage);
-
-	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
-
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
-
-}
-#else /* CONFIG_SPARSEMEM_VMEMMAP */
-static void register_page_bootmem_info_section(unsigned long start_pfn)
-{
-	unsigned long mapsize, section_nr, i;
-	struct mem_section *ms;
-	struct page *page, *memmap;
-	struct mem_section_usage *usage;
-
-	section_nr = pfn_to_section_nr(start_pfn);
-	ms = __nr_to_section(section_nr);
-
-	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-
-	register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
-
-	usage = ms->usage;
-	page = virt_to_page(usage);
-
-	mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
-
-	for (i = 0; i < mapsize; i++, page++)
-		get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
-}
-#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
-
-void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-	unsigned long i, pfn, end_pfn, nr_pages;
-	int node = pgdat->node_id;
-	struct page *page;
-
-	nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
-	page = virt_to_page(pgdat);
-
-	for (i = 0; i < nr_pages; i++, page++)
-		get_page_bootmem(node, page, NODE_INFO);
-
-	pfn = pgdat->node_start_pfn;
-	end_pfn = pgdat_end_pfn(pgdat);
-
-	/* register section info */
-	for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
-		/*
-		 * Some platforms can assign the same pfn to multiple nodes - on
-		 * node0 as well as nodeN.  To avoid registering a pfn against
-		 * multiple nodes we check that this pfn does not already
-		 * reside in some other nodes.
-		 */
-		if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
-			register_page_bootmem_info_section(pfn);
-	}
-}
-#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
-
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 		const char *reason)
 {
--- a/mm/sparse.c~mm-memory_hotplug-factor-out-bootmem-core-functions-to-bootmem_infoc
+++ a/mm/sparse.c
@@ -13,6 +13,7 @@
 #include <linux/vmalloc.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/bootmem_info.h>
 
 #include "internal.h"
 #include <asm/dma.h>
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 002/192] mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
  2021-07-01  1:46 incoming Andrew Morton
  2021-07-01  1:47 ` [patch 001/192] mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 003/192] mm: hugetlb: gather discrete indexes of tail page Andrew Morton
                   ` (190 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP

The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of some
vmemmap pages associated with pre-allocated HugeTLB pages.  For example,
on X86_64, 6 vmemmap pages of size 4KB each can be saved for each 2MB
HugeTLB page, and 4094 vmemmap pages of size 4KB each can be saved for
each 1GB HugeTLB page.

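For the record, the arithmetic behind those numbers (assuming the x86-64
defaults of a 4KB base page and a 64-byte struct page, plus the two
vmemmap pages per HugeTLB page that a later patch in this series keeps
mapped):

    2MB HugeTLB page: (2MB / 4KB) * 64B = 32KB =    8 vmemmap pages
    1GB HugeTLB page: (1GB / 4KB) * 64B = 16MB = 4096 vmemmap pages

Keeping 2 of them mapped leaves 8 - 2 = 6 and 4096 - 2 = 4094 pages that
can be freed, matching the figures above.
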
When a HugeTLB page is allocated or freed, the vmemmap array representing
the range associated with the page will need to be remapped.  When a page
is allocated, vmemmap pages are freed after remapping.  When a page is
freed, previously discarded vmemmap pages must be allocated before
remapping.

The config option is introduced early so that supporting code can be
written to depend on the option.  The initial version of the code only
provides support for x86-64.

If config HAVE_BOOTMEM_INFO_NODE is enabled, the vmemmap-freeing code
depends on it to free vmemmap pages.  Otherwise, just use
free_reserved_page() to free vmemmap pages.  The routine
register_page_bootmem_info() is used to register bootmem info.  Therefore,
make sure register_page_bootmem_info is enabled if
HUGETLB_PAGE_FREE_VMEMMAP is defined.

Link: https://lkml.kernel.org/r/20210510030027.56044-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/mm/init_64.c |    2 +-
 fs/Kconfig            |    5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)

--- a/arch/x86/mm/init_64.c~mm-hugetlb-introduce-a-new-config-hugetlb_page_free_vmemmap
+++ a/arch/x86/mm/init_64.c
@@ -1270,7 +1270,7 @@ static struct kcore_list kcore_vsyscall;
 
 static void __init register_page_bootmem_info(void)
 {
-#ifdef CONFIG_NUMA
+#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)
 	int i;
 
 	for_each_online_node(i)
--- a/fs/Kconfig~mm-hugetlb-introduce-a-new-config-hugetlb_page_free_vmemmap
+++ a/fs/Kconfig
@@ -240,6 +240,11 @@ config HUGETLBFS
 config HUGETLB_PAGE
 	def_bool HUGETLBFS
 
+config HUGETLB_PAGE_FREE_VMEMMAP
+	def_bool HUGETLB_PAGE
+	depends on X86_64
+	depends on SPARSEMEM_VMEMMAP
+
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 003/192] mm: hugetlb: gather discrete indexes of tail page
  2021-07-01  1:46 incoming Andrew Morton
  2021-07-01  1:47 ` [patch 001/192] mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c Andrew Morton
  2021-07-01  1:47 ` [patch 002/192] mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 004/192] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page Andrew Morton
                   ` (189 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: hugetlb: gather discrete indexes of tail page

For a HugeTLB page, there is more metadata to save in the struct page
than the head struct page can hold, so we have to abuse tail struct
pages to store the metadata.  In order to avoid conflicts caused by
subsequent use of more tail struct pages, we gather these discrete
indexes of tail struct pages in one place.  This will make it easier to
add a new tail page index later.

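To illustrate the point of the gathered enum: a later consumer of
tail-page metadata only has to add an enum entry and index off the head
page.  A sketch (SUBPAGE_INDEX_FOO and the accessor are made-up names
for illustration, not part of this patch; CONFIG_CGROUP_HUGETLB is
assumed enabled):

	enum {
		SUBPAGE_INDEX_SUBPOOL = 1,	/* reuse page->private */
		SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
		SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
		SUBPAGE_INDEX_FOO,		/* hypothetical new metadata */
		__NR_USED_SUBPAGE,
	};

	static inline void hugetlb_set_page_foo(struct page *hpage,
						unsigned long foo)
	{
		/* no open-coded page[2]/page[3] offsets left to collide with */
		set_page_private(hpage + SUBPAGE_INDEX_FOO, foo);
	}
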
Link: https://lkml.kernel.org/r/20210510030027.56044-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h        |   21 +++++++++++++++++++--
 include/linux/hugetlb_cgroup.h |   19 +++++++++++--------
 2 files changed, 30 insertions(+), 10 deletions(-)

--- a/include/linux/hugetlb_cgroup.h~mm-hugetlb-gather-discrete-indexes-of-tail-page
+++ a/include/linux/hugetlb_cgroup.h
@@ -21,15 +21,16 @@ struct hugetlb_cgroup;
 struct resv_map;
 struct file_region;
 
+#ifdef CONFIG_CGROUP_HUGETLB
 /*
  * Minimum page order trackable by hugetlb cgroup.
  * At least 4 pages are necessary for all the tracking information.
- * The second tail page (hpage[2]) is the fault usage cgroup.
- * The third tail page (hpage[3]) is the reservation usage cgroup.
+ * The second tail page (hpage[SUBPAGE_INDEX_CGROUP]) is the fault
+ * usage cgroup. The third tail page (hpage[SUBPAGE_INDEX_CGROUP_RSVD])
+ * is the reservation usage cgroup.
  */
-#define HUGETLB_CGROUP_MIN_ORDER	2
+#define HUGETLB_CGROUP_MIN_ORDER order_base_2(__MAX_CGROUP_SUBPAGE_INDEX + 1)
 
-#ifdef CONFIG_CGROUP_HUGETLB
 enum hugetlb_memory_event {
 	HUGETLB_MAX,
 	HUGETLB_NR_MEMORY_EVENTS,
@@ -66,9 +67,9 @@ __hugetlb_cgroup_from_page(struct page *
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return NULL;
 	if (rsvd)
-		return (struct hugetlb_cgroup *)page[3].private;
+		return (void *)page_private(page + SUBPAGE_INDEX_CGROUP_RSVD);
 	else
-		return (struct hugetlb_cgroup *)page[2].private;
+		return (void *)page_private(page + SUBPAGE_INDEX_CGROUP);
 }
 
 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
@@ -90,9 +91,11 @@ static inline int __set_hugetlb_cgroup(s
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return -1;
 	if (rsvd)
-		page[3].private = (unsigned long)h_cg;
+		set_page_private(page + SUBPAGE_INDEX_CGROUP_RSVD,
+				 (unsigned long)h_cg);
 	else
-		page[2].private = (unsigned long)h_cg;
+		set_page_private(page + SUBPAGE_INDEX_CGROUP,
+				 (unsigned long)h_cg);
 	return 0;
 }
 
--- a/include/linux/hugetlb.h~mm-hugetlb-gather-discrete-indexes-of-tail-page
+++ a/include/linux/hugetlb.h
@@ -29,6 +29,23 @@ typedef struct { unsigned long pd; } hug
 #include <linux/shm.h>
 #include <asm/tlbflush.h>
 
+/*
+ * For a HugeTLB page, there is more metadata to save in the struct page than
+ * the head struct page can hold, so we have to abuse tail struct pages to
+ * store the metadata. In order to avoid conflicts caused by subsequent use
+ * of more tail struct pages, we gather these discrete indexes of tail struct
+ * pages here.
+ */
+enum {
+	SUBPAGE_INDEX_SUBPOOL = 1,	/* reuse page->private */
+#ifdef CONFIG_CGROUP_HUGETLB
+	SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
+	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
+	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
+#endif
+	__NR_USED_SUBPAGE,
+};
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
@@ -635,13 +652,13 @@ extern unsigned int default_hstate_idx;
  */
 static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
 {
-	return (struct hugepage_subpool *)(hpage+1)->private;
+	return (void *)page_private(hpage + SUBPAGE_INDEX_SUBPOOL);
 }
 
 static inline void hugetlb_set_page_subpool(struct page *hpage,
 					struct hugepage_subpool *subpool)
 {
-	set_page_private(hpage+1, (unsigned long)subpool);
+	set_page_private(hpage + SUBPAGE_INDEX_SUBPOOL, (unsigned long)subpool);
 }
 
 static inline struct hstate *hstate_file(struct file *f)
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 004/192] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
  2021-07-01  1:46 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2021-07-01  1:47 ` [patch 003/192] mm: hugetlb: gather discrete indexes of tail page Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  3:46     ` Linus Torvalds
  2021-07-01  1:47 ` [patch 005/192] mm: hugetlb: defer freeing of HugeTLB pages Andrew Morton
                   ` (188 subsequent siblings)
  192 siblings, 1 reply; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: hugetlb: free the vmemmap pages associated with each HugeTLB page

Every HugeTLB page has more than one struct page structure.  We __know__
that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures
to store metadata associated with each HugeTLB page.

There are a lot of struct page structures associated with each HugeTLB
page.  For tail pages, the value of compound_head is the same, so we can
reuse the first page of the tail page structures.  We map the virtual
addresses of the remaining tail page structures to that first tail page
struct and then free these page frames.  Therefore, we need to reserve
two pages as vmemmap areas.

When we allocate a HugeTLB page from the buddy, we can free some vmemmap
pages associated with each HugeTLB page.  It is more appropriate to do it
in the prep_new_huge_page().

The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
associated with a HugeTLB page can be freed, returns zero for now, which
means the feature is disabled.  We will enable it once all the
infrastructure is there.

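Once that infrastructure lands, the helper is expected to compute
something along these lines (a sketch under the assumptions used in this
series -- 64-byte struct page, 2 reserved vmemmap pages -- not what this
patch returns):

	static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
	{
		/* hypothetical final form; this patch hardcodes 0 */
		unsigned long vmemmap_pages =
			pages_per_huge_page(h) * sizeof(struct page) / PAGE_SIZE;

		return vmemmap_pages - RESERVE_VMEMMAP_NR;
	}
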
[willy@infradead.org: fix documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/bootmem_info.h |   28 ++++
 include/linux/mm.h           |    3 
 mm/Makefile                  |    1 
 mm/hugetlb.c                 |   22 +--
 mm/hugetlb_vmemmap.c         |  218 +++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h         |   20 +++
 mm/sparse-vmemmap.c          |  194 +++++++++++++++++++++++++++++
 7 files changed, 473 insertions(+), 13 deletions(-)

--- a/include/linux/bootmem_info.h~mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/include/linux/bootmem_info.h
@@ -2,7 +2,7 @@
 #ifndef __LINUX_BOOTMEM_INFO_H
 #define __LINUX_BOOTMEM_INFO_H
 
-#include <linux/mmzone.h>
+#include <linux/mm.h>
 
 /*
  * Types for free bootmem stored in page->lru.next. These have to be in
@@ -22,6 +22,27 @@ void __init register_page_bootmem_info_n
 void get_page_bootmem(unsigned long info, struct page *page,
 		      unsigned long type);
 void put_page_bootmem(struct page *page);
+
+/*
+ * Any memory allocated via the memblock allocator and not via the
+ * buddy will be marked reserved already in the memmap. For those
+ * pages, we can call this function to free it to buddy allocator.
+ */
+static inline void free_bootmem_page(struct page *page)
+{
+	unsigned long magic = (unsigned long)page->freelist;
+
+	/*
+	 * The reserve_bootmem_region sets the reserved flag on bootmem
+	 * pages.
+	 */
+	VM_BUG_ON_PAGE(page_ref_count(page) != 2, page);
+
+	if (magic == SECTION_INFO || magic == MIX_SECTION_INFO)
+		put_page_bootmem(page);
+	else
+		VM_BUG_ON_PAGE(1, page);
+}
 #else
 static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
 {
@@ -35,6 +56,11 @@ static inline void get_page_bootmem(unsi
 				    unsigned long type)
 {
 }
+
+static inline void free_bootmem_page(struct page *page)
+{
+	free_reserved_page(page);
+}
 #endif
 
 #endif /* __LINUX_BOOTMEM_INFO_H */
--- a/include/linux/mm.h~mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/include/linux/mm.h
@@ -3076,6 +3076,9 @@ static inline void print_vma_addr(char *
 }
 #endif
 
+void vmemmap_remap_free(unsigned long start, unsigned long end,
+			unsigned long reuse);
+
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
--- a/mm/hugetlb.c~mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/mm/hugetlb.c
@@ -41,6 +41,7 @@
 #include <linux/node.h>
 #include <linux/page_owner.h>
 #include "internal.h"
+#include "hugetlb_vmemmap.h"
 
 int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
@@ -1493,8 +1494,9 @@ static void __prep_account_new_huge_page
 	h->nr_huge_pages_node[nid]++;
 }
 
-static void __prep_new_huge_page(struct page *page)
+static void __prep_new_huge_page(struct hstate *h, struct page *page)
 {
+	free_huge_page_vmemmap(h, page);
 	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	hugetlb_set_page_subpool(page, NULL);
@@ -1504,7 +1506,7 @@ static void __prep_new_huge_page(struct
 
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
-	__prep_new_huge_page(page);
+	__prep_new_huge_page(h, page);
 	spin_lock_irq(&hugetlb_lock);
 	__prep_account_new_huge_page(h, nid);
 	spin_unlock_irq(&hugetlb_lock);
@@ -2351,14 +2353,15 @@ static int alloc_and_dissolve_huge_page(
 
 	/*
 	 * Before dissolving the page, we need to allocate a new one for the
-	 * pool to remain stable. Using alloc_buddy_huge_page() allows us to
-	 * not having to deal with prep_new_huge_page() and avoids dealing of any
-	 * counters. This simplifies and let us do the whole thing under the
-	 * lock.
+	 * pool to remain stable.  Here, we allocate the page and 'prep' it
+	 * by doing everything but actually updating counters and adding to
+	 * the pool.  This simplifies and lets us do most of the processing
+	 * under the lock.
 	 */
 	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
 	if (!new_page)
 		return -ENOMEM;
+	__prep_new_huge_page(h, new_page);
 
 retry:
 	spin_lock_irq(&hugetlb_lock);
@@ -2397,14 +2400,9 @@ retry:
 		remove_hugetlb_page(h, old_page, false);
 
 		/*
-		 * new_page needs to be initialized with the standard hugetlb
-		 * state. This is normally done by prep_new_huge_page() but
-		 * that takes hugetlb_lock which is already held so we need to
-		 * open code it here.
 		 * Reference count trick is needed because allocator gives us
 		 * referenced page but the pool requires pages with 0 refcount.
 		 */
-		__prep_new_huge_page(new_page);
 		__prep_account_new_huge_page(h, nid);
 		page_ref_dec(new_page);
 		enqueue_huge_page(h, new_page);
@@ -2420,7 +2418,7 @@ retry:
 
 free_new:
 	spin_unlock_irq(&hugetlb_lock);
-	__free_pages(new_page, huge_page_order(h));
+	update_and_free_page(h, new_page);
 
 	return ret;
 }
--- /dev/null
+++ a/mm/hugetlb_vmemmap.c
@@ -0,0 +1,218 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Free some vmemmap pages of HugeTLB
+ *
+ * Copyright (c) 2020, Bytedance. All rights reserved.
+ *
+ *     Author: Muchun Song <songmuchun@bytedance.com>
+ *
+ * The struct page structures (page structs) are used to describe a physical
+ * page frame. By default, there is a one-to-one mapping from a page frame to
+ * it's corresponding page struct.
+ *
+ * HugeTLB pages consist of multiple base page size pages and are supported by
+ * many architectures. See hugetlbpage.rst in the Documentation directory for
+ * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
+ * are currently supported. Since the base page size on x86 is 4KB, a 2MB
+ * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
+ * 4096 base pages. For each base page, there is a corresponding page struct.
+ *
+ * Within the HugeTLB subsystem, only the first 4 page structs are used to
+ * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
+ * this upper limit. The only 'useful' information in the remaining page structs
+ * is the compound_head field, and this field is the same for all tail pages.
+ *
+ * By removing redundant page structs for HugeTLB pages, memory can be returned
+ * to the buddy allocator for other uses.
+ *
+ * Different architectures support different HugeTLB pages. For example, the
+ * following table shows the HugeTLB page sizes supported by the x86 and arm64
+ * architectures. Because arm64 supports 4k, 16k, and 64k base pages as well
+ * as contiguous entries, it supports many different sizes of HugeTLB
+ * page.
+ *
+ * +--------------+-----------+-----------------------------------------------+
+ * | Architecture | Page Size |                HugeTLB Page Size              |
+ * +--------------+-----------+-----------+-----------+-----------+-----------+
+ * |    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
+ * +--------------+-----------+-----------+-----------+-----------+-----------+
+ * |              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
+ * |              +-----------+-----------+-----------+-----------+-----------+
+ * |    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
+ * |              +-----------+-----------+-----------+-----------+-----------+
+ * |              |   64KB    |    2MB    |  512MB    |    16GB   |           |
+ * +--------------+-----------+-----------+-----------+-----------+-----------+
+ *
+ * When the system boots up, every HugeTLB page has more than one struct page
+ * struct, whose total size is (unit: pages):
+ *
+ *    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+ *
+ * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
+ * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
+ * relationship.
+ *
+ *    HugeTLB_Size = n * PAGE_SIZE
+ *
+ * Then,
+ *
+ *    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+ *                = n * sizeof(struct page) / PAGE_SIZE
+ *
+ * We can use huge mapping at the pud/pmd level for the HugeTLB page.
+ *
+ * For the HugeTLB page of the pmd level mapping, then
+ *
+ *    struct_size = n * sizeof(struct page) / PAGE_SIZE
+ *                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
+ *                = sizeof(struct page) / sizeof(pte_t)
+ *                = 64 / 8
+ *                = 8 (pages)
+ *
+ * Where n is how many pte entries one page can contain. So the value of
+ * n is (PAGE_SIZE / sizeof(pte_t)).
+ *
+ * This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
+ * is 8. This optimization is also applicable only when the size of struct page
+ * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
+ * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
+ * struct page structs occupy 8 page frames, whose size depends on the size
+ * of the base page.
+ *
+ * For the HugeTLB page of the pud level mapping, then
+ *
+ *    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
+ *                = PAGE_SIZE / 8 * 8 (pages)
+ *                = PAGE_SIZE (pages)
+ *
+ * Where the struct_size(pmd) is the size of the struct page structs of a
+ * HugeTLB page of the pmd level mapping.
+ *
+ * E.g.: A 2MB HugeTLB page on x86_64 consists of 8 page frames while a 1GB
+ * HugeTLB page consists of 4096.
+ *
+ * Next, we take the pmd level mapping of the HugeTLB page as an example to
+ * show the internal implementation of this optimization. There are 8 pages
+ * struct page structs associated with a HugeTLB page which is pmd mapped.
+ *
+ * Here is how things look before optimization.
+ *
+ *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ * |           |                     |     0     | -------------> |     0     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     1     | -------------> |     1     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     2     | -------------> |     2     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     3     | -------------> |     3     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     4     | -------------> |     4     |
+ * |    PMD    |                     +-----------+                +-----------+
+ * |   level   |                     |     5     | -------------> |     5     |
+ * |  mapping  |                     +-----------+                +-----------+
+ * |           |                     |     6     | -------------> |     6     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     7     | -------------> |     7     |
+ * |           |                     +-----------+                +-----------+
+ * |           |
+ * |           |
+ * |           |
+ * +-----------+
+ *
+ * The value of page->compound_head is the same for all tail pages. The first
+ * page of page structs (page 0) associated with the HugeTLB page contains the 4
+ * page structs necessary to describe the HugeTLB. The only use of the remaining
+ * pages of page structs (page 1 to page 7) is to point to page->compound_head.
+ * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
+ * will be used for each HugeTLB page. This will allow us to free the remaining
+ * 6 pages to the buddy allocator.
+ *
+ * Here is how things look after remapping.
+ *
+ *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ * |           |                     |     0     | -------------> |     0     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     1     | -------------> |     1     |
+ * |           |                     +-----------+                +-----------+
+ * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ * |           |                     +-----------+                   | | | | |
+ * |           |                     |     3     | ------------------+ | | | |
+ * |           |                     +-----------+                     | | | |
+ * |           |                     |     4     | --------------------+ | | |
+ * |    PMD    |                     +-----------+                       | | |
+ * |   level   |                     |     5     | ----------------------+ | |
+ * |  mapping  |                     +-----------+                         | |
+ * |           |                     |     6     | ------------------------+ |
+ * |           |                     +-----------+                           |
+ * |           |                     |     7     | --------------------------+
+ * |           |                     +-----------+
+ * |           |
+ * |           |
+ * |           |
+ * +-----------+
+ *
+ * When a HugeTLB page is freed to the buddy system, we should allocate the 6
+ * vmemmap pages again and restore the previous mapping relationship.
+ *
+ * The HugeTLB page of the pud level mapping is similar to the former. We
+ * can also use this approach to free (PAGE_SIZE - 2) vmemmap pages.
+ *
+ * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
+ * (e.g. aarch64) provide a contiguous bit in the translation table entries
+ * that hints to the MMU to indicate that it is one of a contiguous set of
+ * entries that can be cached in a single TLB entry.
+ *
+ * The contiguous bit is used to increase the mapping size at the pmd and pte
+ * (last) level. So this type of HugeTLB page can be optimized only when the
+ * size of its struct page structs is greater than 2 pages.
+ */
+#include "hugetlb_vmemmap.h"
+
+/*
+ * There are a lot of struct page structures associated with each HugeTLB page.
+ * For tail pages, the value of compound_head is the same, so we can reuse the
+ * first page of the tail page structures. We map the virtual addresses of the
+ * remaining tail page structures to that first tail page struct, and then free
+ * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
+ */
+#define RESERVE_VMEMMAP_NR		2U
+#define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
+
+/*
+ * How many vmemmap pages associated with a HugeTLB page that can be freed
+ * to the buddy allocator.
+ *
+ * Todo: Returns zero for now, which means the feature is disabled. We will
+ * enable it once all the infrastructure is there.
+ */
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return 0;
+}
+
+static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
+{
+	return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
+}
+
+void free_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	unsigned long vmemmap_addr = (unsigned long)head;
+	unsigned long vmemmap_end, vmemmap_reuse;
+
+	if (!free_vmemmap_pages_per_hpage(h))
+		return;
+
+	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
+	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
+	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
+
+	/*
+	 * Remap the vmemmap virtual address range [@vmemmap_addr, @vmemmap_end)
+	 * to the page which @vmemmap_reuse is mapped to, then free the pages
+	 * which the range [@vmemmap_addr, @vmemmap_end) is mapped to.
+	 */
+	vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
+}
--- /dev/null
+++ a/mm/hugetlb_vmemmap.h
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Free some vmemmap pages of HugeTLB
+ *
+ * Copyright (c) 2020, Bytedance. All rights reserved.
+ *
+ *     Author: Muchun Song <songmuchun@bytedance.com>
+ */
+#ifndef _LINUX_HUGETLB_VMEMMAP_H
+#define _LINUX_HUGETLB_VMEMMAP_H
+#include <linux/hugetlb.h>
+
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+#else
+static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+}
+#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
+#endif /* _LINUX_HUGETLB_VMEMMAP_H */
--- a/mm/Makefile~mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/mm/Makefile
@@ -75,6 +75,7 @@ obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
 obj-$(CONFIG_HUGETLBFS)	+= hugetlb.o
+obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP)	+= hugetlb_vmemmap.o
 obj-$(CONFIG_NUMA) 	+= mempolicy.o
 obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
--- a/mm/sparse-vmemmap.c~mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/mm/sparse-vmemmap.c
@@ -27,8 +27,202 @@
 #include <linux/spinlock.h>
 #include <linux/vmalloc.h>
 #include <linux/sched.h>
+#include <linux/pgtable.h>
+#include <linux/bootmem_info.h>
+
 #include <asm/dma.h>
 #include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+
+/**
+ * struct vmemmap_remap_walk - walk vmemmap page table
+ *
+ * @remap_pte:		called for each lowest-level entry (PTE).
+ * @reuse_page:		the page which is reused for the tail vmemmap pages.
+ * @reuse_addr:		the virtual address of the @reuse_page page.
+ * @vmemmap_pages:	the list head of the vmemmap pages that can be freed.
+ */
+struct vmemmap_remap_walk {
+	void (*remap_pte)(pte_t *pte, unsigned long addr,
+			  struct vmemmap_remap_walk *walk);
+	struct page *reuse_page;
+	unsigned long reuse_addr;
+	struct list_head *vmemmap_pages;
+};
+
+static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
+			      unsigned long end,
+			      struct vmemmap_remap_walk *walk)
+{
+	pte_t *pte = pte_offset_kernel(pmd, addr);
+
+	/*
+	 * The reuse_page is found 'first' in the table walk before we
+	 * start remapping (i.e. before calling @walk->remap_pte).
+	 */
+	if (!walk->reuse_page) {
+		walk->reuse_page = pte_page(*pte);
+		/*
+		 * Because the reuse address is part of the range that we are
+		 * walking, skip the reuse address range.
+		 */
+		addr += PAGE_SIZE;
+		pte++;
+	}
+
+	for (; addr != end; addr += PAGE_SIZE, pte++)
+		walk->remap_pte(pte, addr, walk);
+}
+
+static void vmemmap_pmd_range(pud_t *pud, unsigned long addr,
+			      unsigned long end,
+			      struct vmemmap_remap_walk *walk)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		BUG_ON(pmd_leaf(*pmd));
+
+		next = pmd_addr_end(addr, end);
+		vmemmap_pte_range(pmd, addr, next, walk);
+	} while (pmd++, addr = next, addr != end);
+}
+
+static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
+			      unsigned long end,
+			      struct vmemmap_remap_walk *walk)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_offset(p4d, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		vmemmap_pmd_range(pud, addr, next, walk);
+	} while (pud++, addr = next, addr != end);
+}
+
+static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
+			      unsigned long end,
+			      struct vmemmap_remap_walk *walk)
+{
+	p4d_t *p4d;
+	unsigned long next;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		vmemmap_pud_range(p4d, addr, next, walk);
+	} while (p4d++, addr = next, addr != end);
+}
+
+static void vmemmap_remap_range(unsigned long start, unsigned long end,
+				struct vmemmap_remap_walk *walk)
+{
+	unsigned long addr = start;
+	unsigned long next;
+	pgd_t *pgd;
+
+	VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
+	VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
+
+	pgd = pgd_offset_k(addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		vmemmap_p4d_range(pgd, addr, next, walk);
+	} while (pgd++, addr = next, addr != end);
+
+	/*
+	 * We only change the mapping of the vmemmap virtual address range
+	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
+	 * belongs to the range.
+	 */
+	flush_tlb_kernel_range(start + PAGE_SIZE, end);
+}
+
+/*
+ * Free a vmemmap page. A vmemmap page can be allocated from the memblock
+ * allocator or buddy allocator. If the PG_reserved flag is set, it means
+ * that it was allocated from the memblock allocator; just free it via
+ * free_bootmem_page(). Otherwise, use __free_page().
+ */
+static inline void free_vmemmap_page(struct page *page)
+{
+	if (PageReserved(page))
+		free_bootmem_page(page);
+	else
+		__free_page(page);
+}
+
+/* Free a list of the vmemmap pages */
+static void free_vmemmap_page_list(struct list_head *list)
+{
+	struct page *page, *next;
+
+	list_for_each_entry_safe(page, next, list, lru) {
+		list_del(&page->lru);
+		free_vmemmap_page(page);
+	}
+}
+
+static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
+			      struct vmemmap_remap_walk *walk)
+{
+	/*
+	 * Remap the tail pages as read-only to catch illegal write operation
+	 * to the tail pages.
+	 */
+	pgprot_t pgprot = PAGE_KERNEL_RO;
+	pte_t entry = mk_pte(walk->reuse_page, pgprot);
+	struct page *page = pte_page(*pte);
+
+	list_add(&page->lru, walk->vmemmap_pages);
+	set_pte_at(&init_mm, addr, pte, entry);
+}
+
+/**
+ * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
+ *			to the page which @reuse is mapped to, then free the
+ *			vmemmap pages which the range is mapped to.
+ * @start:	start address of the vmemmap virtual address range that we want
+ *		to remap.
+ * @end:	end address of the vmemmap virtual address range that we want to
+ *		remap.
+ * @reuse:	reuse address.
+ *
+ * Note: This function depends on vmemmap being base page mapped. Please make
+ * sure that we disable PMD mapping of vmemmap pages when calling this function.
+ */
+void vmemmap_remap_free(unsigned long start, unsigned long end,
+			unsigned long reuse)
+{
+	LIST_HEAD(vmemmap_pages);
+	struct vmemmap_remap_walk walk = {
+		.remap_pte	= vmemmap_remap_pte,
+		.reuse_addr	= reuse,
+		.vmemmap_pages	= &vmemmap_pages,
+	};
+
+	/*
+	 * In order to make the remapping routine most efficient for huge
+	 * pages, the vmemmap page table walking routine has the following
+	 * rules (see vmemmap_pte_range() for more details):
+	 *
+	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
+	 *   should be contiguous.
+	 * - The @reuse address is part of the range [@reuse, @end) that we are
+	 *   walking which is passed to vmemmap_remap_range().
+	 * - The @reuse address is the first in the complete range.
+	 *
+	 * So we need to make sure that @start and @reuse meet the above rules.
+	 */
+	BUG_ON(start - reuse != PAGE_SIZE);
+
+	vmemmap_remap_range(reuse, end, &walk);
+	free_vmemmap_page_list(&vmemmap_pages);
+}
 
 /*
  * Allocate a block of memory to be used to back the virtual memory map
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 005/192] mm: hugetlb: defer freeing of HugeTLB pages
  2021-07-01  1:46 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2021-07-01  1:47 ` [patch 004/192] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 006/192] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page Andrew Morton
                   ` (187 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: hugetlb: defer freeing of HugeTLB pages

In the subsequent patch, we will allocate the vmemmap pages when freeing a
HugeTLB page.  But update_and_free_page() can be called under any context,
so we cannot use GFP_KERNEL to allocate vmemmap pages there.  However, we
can defer the actual freeing to a kworker to avoid having to use GFP_ATOMIC
to allocate the vmemmap pages.

The __update_and_free_page() is where the call to allocate vmemmap pages
will be inserted.

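The deferral machinery follows the standard lockless llist + workqueue
pattern.  A minimal self-contained sketch of that pattern (the names
here are illustrative; the real ones are in the mm/hugetlb.c hunk
below):

	#include <linux/llist.h>
	#include <linux/workqueue.h>

	static LLIST_HEAD(pending_objects);

	static void deferred_workfn(struct work_struct *work)
	{
		struct llist_node *node = llist_del_all(&pending_objects);

		while (node) {
			struct llist_node *next = node->next;

			/* ... free the object embedding 'node' here ... */
			node = next;
		}
	}
	static DECLARE_WORK(deferred_work, deferred_workfn);

	static void defer_free(struct llist_node *node)
	{
		/*
		 * llist_add() returns true only if the list was empty, so
		 * the work gets scheduled exactly once per batch of adds.
		 */
		if (llist_add(node, &pending_objects))
			schedule_work(&deferred_work);
	}
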
Link: https://lkml.kernel.org/r/20210510030027.56044-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c         |   83 +++++++++++++++++++++++++++++++++++++----
 mm/hugetlb_vmemmap.c |   12 -----
 mm/hugetlb_vmemmap.h |   17 ++++++++
 3 files changed, 93 insertions(+), 19 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-defer-freeing-of-hugetlb-pages
+++ a/mm/hugetlb.c
@@ -1376,7 +1376,7 @@ static void remove_hugetlb_page(struct h
 	h->nr_huge_pages_node[nid]--;
 }
 
-static void update_and_free_page(struct hstate *h, struct page *page)
+static void __update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
 	struct page *subpage = page;
@@ -1399,12 +1399,79 @@ static void update_and_free_page(struct
 	}
 }
 
+/*
+ * As update_and_free_page() can be called under any context, we cannot
+ * use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
+ * actual freeing to a workqueue to avoid using GFP_ATOMIC to allocate
+ * the vmemmap pages.
+ *
+ * free_hpage_workfn() locklessly retrieves the linked list of pages to be
+ * freed and frees them one-by-one. As the page->mapping pointer is going
+ * to be cleared in free_hpage_workfn() anyway, it is reused as the llist_node
+ * structure of a lockless linked list of huge pages to be freed.
+ */
+static LLIST_HEAD(hpage_freelist);
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+
+	node = llist_del_all(&hpage_freelist);
+
+	while (node) {
+		struct page *page;
+		struct hstate *h;
+
+		page = container_of((struct address_space **)node,
+				     struct page, mapping);
+		node = node->next;
+		page->mapping = NULL;
+		/*
+		 * The VM_BUG_ON_PAGE(!PageHuge(page), page) in page_hstate()
+		 * is going to trigger because a previous call to
+		 * remove_hugetlb_page() will set_compound_page_dtor(page,
+		 * NULL_COMPOUND_DTOR), so do not use page_hstate() directly.
+		 */
+		h = size_to_hstate(page_size(page));
+
+		__update_and_free_page(h, page);
+
+		cond_resched();
+	}
+}
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+static inline void flush_free_hpage_work(struct hstate *h)
+{
+	if (free_vmemmap_pages_per_hpage(h))
+		flush_work(&free_hpage_work);
+}
+
+static void update_and_free_page(struct hstate *h, struct page *page,
+				 bool atomic)
+{
+	if (!free_vmemmap_pages_per_hpage(h) || !atomic) {
+		__update_and_free_page(h, page);
+		return;
+	}
+
+	/*
+	 * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap pages.
+	 *
+	 * Only call schedule_work() if hpage_freelist was previously
+	 * empty. Otherwise, schedule_work() has already been called but
+	 * the workfn hasn't retrieved the list yet.
+	 */
+	if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
+		schedule_work(&free_hpage_work);
+}
+
 static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list)
 {
 	struct page *page, *t_page;
 
 	list_for_each_entry_safe(page, t_page, list, lru) {
-		update_and_free_page(h, page);
+		update_and_free_page(h, page, false);
 		cond_resched();
 	}
 }
@@ -1471,12 +1538,12 @@ void free_huge_page(struct page *page)
 	if (HPageTemporary(page)) {
 		remove_hugetlb_page(h, page, false);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
-		update_and_free_page(h, page);
+		update_and_free_page(h, page, true);
 	} else if (h->surplus_huge_pages_node[nid]) {
 		/* remove the page from active list */
 		remove_hugetlb_page(h, page, true);
 		spin_unlock_irqrestore(&hugetlb_lock, flags);
-		update_and_free_page(h, page);
+		update_and_free_page(h, page, true);
 	} else {
 		arch_clear_hugepage_flags(page);
 		enqueue_huge_page(h, page);
@@ -1795,7 +1862,7 @@ retry:
 		remove_hugetlb_page(h, head, false);
 		h->max_huge_pages--;
 		spin_unlock_irq(&hugetlb_lock);
-		update_and_free_page(h, head);
+		update_and_free_page(h, head, false);
 		return 0;
 	}
 out:
@@ -2411,14 +2478,14 @@ retry:
 		 * Pages have been replaced, we can safely free the old one.
 		 */
 		spin_unlock_irq(&hugetlb_lock);
-		update_and_free_page(h, old_page);
+		update_and_free_page(h, old_page, false);
 	}
 
 	return ret;
 
 free_new:
 	spin_unlock_irq(&hugetlb_lock);
-	update_and_free_page(h, new_page);
+	update_and_free_page(h, new_page, false);
 
 	return ret;
 }
@@ -2832,6 +2899,7 @@ static int set_max_huge_pages(struct hst
 	 * pages in hstate via the proc/sysfs interfaces.
 	 */
 	mutex_lock(&h->resize_lock);
+	flush_free_hpage_work(h);
 	spin_lock_irq(&hugetlb_lock);
 
 	/*
@@ -2941,6 +3009,7 @@ static int set_max_huge_pages(struct hst
 	/* free the pages after dropping lock */
 	spin_unlock_irq(&hugetlb_lock);
 	update_and_free_pages_bulk(h, &page_list);
+	flush_free_hpage_work(h);
 	spin_lock_irq(&hugetlb_lock);
 
 	while (count < persistent_huge_pages(h)) {
--- a/mm/hugetlb_vmemmap.c~mm-hugetlb-defer-freeing-of-hugetlb-pages
+++ a/mm/hugetlb_vmemmap.c
@@ -180,18 +180,6 @@
 #define RESERVE_VMEMMAP_NR		2U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 
-/*
- * How many vmemmap pages associated with a HugeTLB page that can be freed
- * to the buddy allocator.
- *
- * Todo: Returns zero for now, which means the feature is disabled. We will
- * enable it once all the infrastructure is there.
- */
-static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
-{
-	return 0;
-}
-
 static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
 {
 	return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
--- a/mm/hugetlb_vmemmap.h~mm-hugetlb-defer-freeing-of-hugetlb-pages
+++ a/mm/hugetlb_vmemmap.h
@@ -12,9 +12,26 @@
 
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+
+/*
+ * How many vmemmap pages associated with a HugeTLB page that can be freed
+ * to the buddy allocator.
+ *
+ * Todo: Returns zero for now, which means the feature is disabled. We will
+ * enable it once all the infrastructure is there.
+ */
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return 0;
+}
 #else
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
+
+static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
+{
+	return 0;
+}
 #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */
 #endif /* _LINUX_HUGETLB_VMEMMAP_H */
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 006/192] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
  2021-07-01  1:46 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2021-07-01  1:47 ` [patch 005/192] mm: hugetlb: defer freeing of HugeTLB pages Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 007/192] mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Andrew Morton
                   ` (186 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page

When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it.  However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure.  In
this case, we just refuse to free the HugeTLB page.  This changes behavior
in some corner cases as listed below:

 1) Failing to free a huge page triggered by the user (decrease nr_pages).

    User needs to try again later.

 2) Failing to free a surplus huge page when freed by the application.

    Try again later when freeing a huge page next time.

 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
    offline_pages().

    This can happen when we have plenty of ZONE_MOVABLE memory, but
    not enough kernel memory to allocate vmemmap pages.  We may even
    be able to migrate huge page contents, but will not be able to
    dissolve the source huge page.  This will prevent an offline
    operation and is unfortunate as memory offlining is expected to
    succeed on movable zones.  Users that depend on memory hotplug
    to succeed for movable zones should carefully consider whether the
    memory savings gained from this feature are worth the risk of
    possibly not being able to offline memory in certain situations.

 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
    alloc_contig_range() - once we have that handling in place. Mainly
    affects CMA and virtio-mem.

    Similar to 3). virtio-mem will handle migration errors gracefully.
    CMA might be able to fall back on other free areas within the CMA
    region.

Vmemmap pages are allocated from the page freeing context.  In order for
those allocations not to be disruptive (e.g. not to trigger the OOM
killer), __GFP_NORETRY is used.  hugetlb_lock is dropped for the
allocation because a non-sleeping allocation would be too fragile and it
could fail too easily under memory pressure.  GFP_ATOMIC or other modes
that access memory reserves are not used because we want to prevent
consuming reserves under heavy hugetlb freeing.

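Concretely, the allocation on the freeing path looks roughly like this
(a sketch; the exact flag combination is an assumption here for
illustration, the authoritative version is in the mm/sparse-vmemmap.c
hunk of this patch):

	/*
	 * Allocate one vmemmap page; give up early rather than retry
	 * hard or dip into reserves, so that freeing the HugeTLB page
	 * can simply fail gracefully under memory pressure.
	 */
	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_NORETRY, 0);
	if (!page)
		return -ENOMEM;		/* caller keeps the HugeTLB page */
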
[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
  Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/hugetlbpage.rst    |    8 +
 Documentation/admin-guide/mm/memory-hotplug.rst |   13 +
 include/linux/hugetlb.h                         |    3 
 include/linux/mm.h                              |    2 
 mm/hugetlb.c                                    |   98 +++++++++++---
 mm/hugetlb_vmemmap.c                            |   34 ++++
 mm/hugetlb_vmemmap.h                            |    6 
 mm/migrate.c                                    |    5 
 mm/sparse-vmemmap.c                             |   75 ++++++++++
 9 files changed, 227 insertions(+), 17 deletions(-)

--- a/Documentation/admin-guide/mm/hugetlbpage.rst~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -60,6 +60,10 @@ HugePages_Surp
         the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
         maximum number of surplus huge pages is controlled by
         ``/proc/sys/vm/nr_overcommit_hugepages``.
+	Note: When the feature of freeing unused vmemmap pages associated
+	with each hugetlb page is enabled, the number of surplus huge pages
+	may be temporarily larger than the maximum number of surplus huge
+	pages when the system is under memory pressure.
 Hugepagesize
 	is the default hugepage size (in Kb).
 Hugetlb
@@ -80,6 +84,10 @@ returned to the huge page pool when free
 privileges can dynamically allocate more or free some persistent huge pages
 by increasing or decreasing the value of ``nr_hugepages``.
 
+Note: When the feature of freeing unused vmemmap pages associated with each
+hugetlb page is enabled, freeing huge pages triggered by the user may
+fail when the system is under memory pressure.  Please try again later.
+
 Pages that are used as huge pages are reserved inside the kernel and cannot
 be used for other purposes.  Huge pages cannot be swapped out under
 memory pressure.
--- a/Documentation/admin-guide/mm/memory-hotplug.rst~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -357,6 +357,19 @@ creates ZONE_MOVABLE as following.
    Unfortunately, there is no information to show which memory block belongs
    to ZONE_MOVABLE. This is TBD.
 
+   Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
+   and the feature of freeing unused vmemmap pages associated with each hugetlb
+   page is enabled.
+
+   This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
+   kernel memory to allocate vmemmap pages.  We may even be able to migrate
+   huge page contents, but will not be able to dissolve the source huge page.
+   This will prevent an offline operation and is unfortunate as memory offlining
+   is expected to succeed on movable zones.  Users that depend on memory hotplug
+   to succeed for movable zones should carefully consider whether the memory
+   savings gained from this feature are worth the risk of possibly not being
+   able to offline memory in certain situations.
+
 .. note::
    Techniques that rely on long-term pinnings of memory (especially, RDMA and
    vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
--- a/include/linux/hugetlb.h~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/include/linux/hugetlb.h
@@ -532,12 +532,14 @@ unsigned long hugetlb_get_unmapped_area(
  *	modifications require hugetlb_lock.
  * HPG_freed - Set when page is on the free lists.
  *	Synchronization: hugetlb_lock held for examination and modification.
+ * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
  */
 enum hugetlb_page_flags {
 	HPG_restore_reserve = 0,
 	HPG_migratable,
 	HPG_temporary,
 	HPG_freed,
+	HPG_vmemmap_optimized,
 	__NR_HPAGEFLAGS,
 };
 
@@ -583,6 +585,7 @@ HPAGEFLAG(RestoreReserve, restore_reserv
 HPAGEFLAG(Migratable, migratable)
 HPAGEFLAG(Temporary, temporary)
 HPAGEFLAG(Freed, freed)
+HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
--- a/include/linux/mm.h~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/include/linux/mm.h
@@ -3078,6 +3078,8 @@ static inline void print_vma_addr(char *
 
 void vmemmap_remap_free(unsigned long start, unsigned long end,
 			unsigned long reuse);
+int vmemmap_remap_alloc(unsigned long start, unsigned long end,
+			unsigned long reuse, gfp_t gfp_mask);
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
--- a/mm/hugetlb.c~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/mm/hugetlb.c
@@ -1376,6 +1376,39 @@ static void remove_hugetlb_page(struct h
 	h->nr_huge_pages_node[nid]--;
 }
 
+static void add_hugetlb_page(struct hstate *h, struct page *page,
+			     bool adjust_surplus)
+{
+	int zeroed;
+	int nid = page_to_nid(page);
+
+	VM_BUG_ON_PAGE(!HPageVmemmapOptimized(page), page);
+
+	lockdep_assert_held(&hugetlb_lock);
+
+	INIT_LIST_HEAD(&page->lru);
+	h->nr_huge_pages++;
+	h->nr_huge_pages_node[nid]++;
+
+	if (adjust_surplus) {
+		h->surplus_huge_pages++;
+		h->surplus_huge_pages_node[nid]++;
+	}
+
+	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
+	set_page_private(page, 0);
+	SetHPageVmemmapOptimized(page);
+
+	/*
+	 * This page is now managed by the hugetlb allocator and has
+	 * no users -- drop the last reference.
+	 */
+	zeroed = put_page_testzero(page);
+	VM_BUG_ON_PAGE(!zeroed, page);
+	arch_clear_hugepage_flags(page);
+	enqueue_huge_page(h, page);
+}
+
 static void __update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
@@ -1384,6 +1417,18 @@ static void __update_and_free_page(struc
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
 
+	if (alloc_huge_page_vmemmap(h, page)) {
+		spin_lock_irq(&hugetlb_lock);
+		/*
+		 * If we cannot allocate vmemmap pages, just refuse to free the
+		 * page and put the page back on the hugetlb free list and treat
+		 * as a surplus page.
+		 */
+		add_hugetlb_page(h, page, true);
+		spin_unlock_irq(&hugetlb_lock);
+		return;
+	}
+
 	for (i = 0; i < pages_per_huge_page(h);
 	     i++, subpage = mem_map_next(subpage, page, i)) {
 		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
@@ -1450,7 +1495,7 @@ static inline void flush_free_hpage_work
 static void update_and_free_page(struct hstate *h, struct page *page,
 				 bool atomic)
 {
-	if (!free_vmemmap_pages_per_hpage(h) || !atomic) {
+	if (!HPageVmemmapOptimized(page) || !atomic) {
 		__update_and_free_page(h, page);
 		return;
 	}
@@ -1806,10 +1851,14 @@ static struct page *remove_pool_huge_pag
  * nothing for in-use hugepages and non-hugepages.
  * This function returns values like below:
  *
- *  -EBUSY: failed to dissolved free hugepages or the hugepage is in-use
- *          (allocated or reserved.)
- *       0: successfully dissolved free hugepages or the page is not a
- *          hugepage (considered as already dissolved)
+ *  -ENOMEM: failed to allocate vmemmap pages to free the freed hugepages
+ *           when the system is under memory pressure and the feature of
+ *           freeing unused vmemmap pages associated with each hugetlb page
+ *           is enabled.
+ *  -EBUSY:  failed to dissolve free hugepages or the hugepage is in-use
+ *           (allocated or reserved.)
+ *       0:  successfully dissolved free hugepages or the page is not a
+ *           hugepage (considered as already dissolved)
  */
 int dissolve_free_huge_page(struct page *page)
 {
@@ -1851,19 +1900,38 @@ retry:
 			goto retry;
 		}
 
-		/*
-		 * Move PageHWPoison flag from head page to the raw error page,
-		 * which makes any subpages rather than the error page reusable.
-		 */
-		if (PageHWPoison(head) && page != head) {
-			SetPageHWPoison(page);
-			ClearPageHWPoison(head);
-		}
 		remove_hugetlb_page(h, head, false);
 		h->max_huge_pages--;
 		spin_unlock_irq(&hugetlb_lock);
-		update_and_free_page(h, head, false);
-		return 0;
+
+		/*
+		 * Normally update_and_free_page will allocate required vmemmap
+		 * before freeing the page.  update_and_free_page will fail to
+		 * free the page if it can not allocate required vmemmap.  We
+		 * need to adjust max_huge_pages if the page is not freed.
+		 * Attempt to allocate vmemmap here so that we can take
+		 * appropriate action on failure.
+		 */
+		rc = alloc_huge_page_vmemmap(h, head);
+		if (!rc) {
+			/*
+			 * Move PageHWPoison flag from head page to the raw
+			 * error page, which makes any subpages rather than
+			 * the error page reusable.
+			 */
+			if (PageHWPoison(head) && page != head) {
+				SetPageHWPoison(page);
+				ClearPageHWPoison(head);
+			}
+			update_and_free_page(h, head, false);
+		} else {
+			spin_lock_irq(&hugetlb_lock);
+			add_hugetlb_page(h, head, false);
+			h->max_huge_pages++;
+			spin_unlock_irq(&hugetlb_lock);
+		}
+
+		return rc;
 	}
 out:
 	spin_unlock_irq(&hugetlb_lock);
--- a/mm/hugetlb_vmemmap.c~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/mm/hugetlb_vmemmap.c
@@ -185,6 +185,38 @@ static inline unsigned long free_vmemmap
 	return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
 }
 
+/*
+ * Previously discarded vmemmap pages will be allocated and remapped
+ * after this function returns zero.
+ */
+int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	int ret;
+	unsigned long vmemmap_addr = (unsigned long)head;
+	unsigned long vmemmap_end, vmemmap_reuse;
+
+	if (!HPageVmemmapOptimized(head))
+		return 0;
+
+	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
+	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
+	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
+	/*
+	 * The pages which the vmemmap virtual address range [@vmemmap_addr,
+	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
+	 * the range is mapped to the page which @vmemmap_reuse is mapped to.
+	 * When a HugeTLB page is freed to the buddy allocator, previously
+	 * discarded vmemmap pages must be allocated and remapped.
+	 */
+	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
+				  GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
+
+	if (!ret)
+		ClearHPageVmemmapOptimized(head);
+
+	return ret;
+}
+
 void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 	unsigned long vmemmap_addr = (unsigned long)head;
@@ -203,4 +235,6 @@ void free_huge_page_vmemmap(struct hstat
 	 * which the range [@vmemmap_addr, @vmemmap_end] is mapped to.
 	 */
 	vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
+
+	SetHPageVmemmapOptimized(head);
 }
--- a/mm/hugetlb_vmemmap.h~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/mm/hugetlb_vmemmap.h
@@ -11,6 +11,7 @@
 #include <linux/hugetlb.h>
 
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+int alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
 
 /*
@@ -25,6 +26,11 @@ static inline unsigned int free_vmemmap_
 	return 0;
 }
 #else
+static inline int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
+{
+	return 0;
+}
+
 static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head)
 {
 }
--- a/mm/migrate.c~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/mm/migrate.c
@@ -626,7 +626,10 @@ void migrate_page_states(struct page *ne
 	if (PageSwapCache(page))
 		ClearPageSwapCache(page);
 	ClearPagePrivate(page);
-	set_page_private(page, 0);
+
+	/* page->private contains hugetlb specific flags */
+	if (!PageHuge(page))
+		set_page_private(page, 0);
 
 	/*
 	 * If any waiters have accumulated on the new page then
--- a/mm/sparse-vmemmap.c~mm-hugetlb-alloc-the-vmemmap-pages-associated-with-each-hugetlb-page
+++ a/mm/sparse-vmemmap.c
@@ -40,7 +40,8 @@
  * @remap_pte:		called for each lowest-level entry (PTE).
  * @reuse_page:		the page which is reused for the tail vmemmap pages.
  * @reuse_addr:		the virtual address of the @reuse_page page.
- * @vmemmap_pages:	the list head of the vmemmap pages that can be freed.
+ * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
+ *			or that the range is remapped from.
  */
 struct vmemmap_remap_walk {
 	void (*remap_pte)(pte_t *pte, unsigned long addr,
@@ -224,6 +225,78 @@ void vmemmap_remap_free(unsigned long st
 	free_vmemmap_page_list(&vmemmap_pages);
 }
 
+static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
+				struct vmemmap_remap_walk *walk)
+{
+	pgprot_t pgprot = PAGE_KERNEL;
+	struct page *page;
+	void *to;
+
+	BUG_ON(pte_page(*pte) != walk->reuse_page);
+
+	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
+	list_del(&page->lru);
+	to = page_to_virt(page);
+	copy_page(to, (void *)walk->reuse_addr);
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+}
+
+static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
+				   gfp_t gfp_mask, struct list_head *list)
+{
+	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
+	int nid = page_to_nid((struct page *)start);
+	struct page *page, *next;
+
+	while (nr_pages--) {
+		page = alloc_pages_node(nid, gfp_mask, 0);
+		if (!page)
+			goto out;
+		list_add_tail(&page->lru, list);
+	}
+
+	return 0;
+out:
+	list_for_each_entry_safe(page, next, list, lru)
+		__free_pages(page, 0);
+	return -ENOMEM;
+}
+
+/**
+ * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
+ *			 to the pages which are allocated from
+ *			 @vmemmap_pages.
+ * @start:	start address of the vmemmap virtual address range that we want
+ *		to remap.
+ * @end:	end address of the vmemmap virtual address range that we want to
+ *		remap.
+ * @reuse:	reuse address.
+ * @gfp_mask:	GFP flag for allocating vmemmap pages.
+ */
+int vmemmap_remap_alloc(unsigned long start, unsigned long end,
+			unsigned long reuse, gfp_t gfp_mask)
+{
+	LIST_HEAD(vmemmap_pages);
+	struct vmemmap_remap_walk walk = {
+		.remap_pte	= vmemmap_restore_pte,
+		.reuse_addr	= reuse,
+		.vmemmap_pages	= &vmemmap_pages,
+	};
+
+	/* See the comment in the vmemmap_remap_free(). */
+	BUG_ON(start - reuse != PAGE_SIZE);
+
+	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
+
+	if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
+		return -ENOMEM;
+
+	vmemmap_remap_range(reuse, end, &walk);
+
+	return 0;
+}
+
 /*
  * Allocate a block of memory to be used to back the virtual memory map
  * or to back the page tables that are used to create the mapping.
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 007/192] mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap
  2021-07-01  1:46 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2021-07-01  1:47 ` [patch 006/192] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 008/192] mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled Andrew Morton
                   ` (185 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap

Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
freeing unused vmemmap pages associated with each hugetlb page on boot.

We disable PMD mapping of vmemmap pages for the x86-64 arch when this
feature is enabled, because vmemmap_remap_free() depends on the vmemmap
being base page mapped.
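
For example (sizes and counts below are illustrative, not part of the
patch), the feature is enabled by appending the parameter to the kernel
command line:

	hugetlb_free_vmemmap=on hugepagesz=2M hugepages=512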

Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/kernel-parameters.txt |   17 +++++++++
 Documentation/admin-guide/mm/hugetlbpage.rst    |    3 +
 arch/x86/mm/init_64.c                           |    8 +++-
 include/linux/hugetlb.h                         |   19 +++++++++++
 mm/hugetlb_vmemmap.c                            |   24 ++++++++++++++
 5 files changed, 69 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/init_64.c~mm-hugetlb-add-a-kernel-parameter-hugetlb_free_vmemmap
+++ a/arch/x86/mm/init_64.c
@@ -34,6 +34,7 @@
 #include <linux/gfp.h>
 #include <linux/kcore.h>
 #include <linux/bootmem_info.h>
+#include <linux/hugetlb.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -1609,7 +1610,8 @@ int __meminit vmemmap_populate(unsigned
 	VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
 	VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
 
-	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
+	if ((is_hugetlb_free_vmemmap_enabled() && !altmap) ||
+	    end - start < PAGES_PER_SECTION * sizeof(struct page))
 		err = vmemmap_populate_basepages(start, end, node, NULL);
 	else if (boot_cpu_has(X86_FEATURE_PSE))
 		err = vmemmap_populate_hugepages(start, end, node, altmap);
@@ -1637,6 +1639,8 @@ void register_page_bootmem_memmap(unsign
 	pmd_t *pmd;
 	unsigned int nr_pmd_pages;
 	struct page *page;
+	bool base_mapping = !boot_cpu_has(X86_FEATURE_PSE) ||
+			    is_hugetlb_free_vmemmap_enabled();
 
 	for (; addr < end; addr = next) {
 		pte_t *pte = NULL;
@@ -1662,7 +1666,7 @@ void register_page_bootmem_memmap(unsign
 		}
 		get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
 
-		if (!boot_cpu_has(X86_FEATURE_PSE)) {
+		if (base_mapping) {
 			next = (addr + PAGE_SIZE) & PAGE_MASK;
 			pmd = pmd_offset(pud, addr);
 			if (pmd_none(*pmd))
--- a/Documentation/admin-guide/kernel-parameters.txt~mm-hugetlb-add-a-kernel-parameter-hugetlb_free_vmemmap
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -1567,6 +1567,23 @@
 			Documentation/admin-guide/mm/hugetlbpage.rst.
 			Format: size[KMG]
 
+	hugetlb_free_vmemmap=
+			[KNL] Requires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+			enabled.
+			Allows heavy hugetlb users to free up some more
+			memory (6 * PAGE_SIZE for each 2MB hugetlb page).
+			This feature is not free though. Large page
+			tables are not used to back vmemmap pages which
+			can lead to a performance degradation for some
+			workloads. Also there will be memory allocation
+			required when hugetlb pages are freed from the
+			pool which can lead to corner cases under heavy
+			memory pressure.
+			Format: { on | off (default) }
+
+			on:  enable the feature
+			off: disable the feature
+
 	hung_task_panic=
 			[KNL] Should the hung task detector generate panics.
 			Format: 0 | 1
--- a/Documentation/admin-guide/mm/hugetlbpage.rst~mm-hugetlb-add-a-kernel-parameter-hugetlb_free_vmemmap
+++ a/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -153,6 +153,9 @@ default_hugepagesz
 
 	will all result in 256 2M huge pages being allocated.  Valid default
 	huge page size is architecture dependent.
+hugetlb_free_vmemmap
+	When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
+	unused vmemmap pages associated with each HugeTLB page.
 
 When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
 indicates the current number of pre-allocated huge pages of the default size.
--- a/include/linux/hugetlb.h~mm-hugetlb-add-a-kernel-parameter-hugetlb_free_vmemmap
+++ a/include/linux/hugetlb.h
@@ -892,6 +892,20 @@ static inline void huge_ptep_modify_prot
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+extern bool hugetlb_free_vmemmap_enabled;
+
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+	return hugetlb_free_vmemmap_enabled;
+}
+#else
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+	return false;
+}
+#endif
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
@@ -1046,6 +1060,11 @@ static inline void set_huge_swap_pte_at(
 					pte_t *ptep, pte_t pte, unsigned long sz)
 {
 }
+
+static inline bool is_hugetlb_free_vmemmap_enabled(void)
+{
+	return false;
+}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
--- a/mm/hugetlb_vmemmap.c~mm-hugetlb-add-a-kernel-parameter-hugetlb_free_vmemmap
+++ a/mm/hugetlb_vmemmap.c
@@ -168,6 +168,8 @@
  * (last) level. So this type of HugeTLB page can be optimized only when its
  * size of the struct page structs is greater than 2 pages.
  */
+#define pr_fmt(fmt)	"HugeTLB: " fmt
+
 #include "hugetlb_vmemmap.h"
 
 /*
@@ -180,6 +182,28 @@
 #define RESERVE_VMEMMAP_NR		2U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 
+bool hugetlb_free_vmemmap_enabled;
+
+static int __init early_hugetlb_free_vmemmap_param(char *buf)
+{
+	/* We cannot optimize if a "struct page" crosses page boundaries. */
+	if (!is_power_of_2(sizeof(struct page))) {
+		pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n");
+		return 0;
+	}
+
+	if (!buf)
+		return -EINVAL;
+
+	if (!strcmp(buf, "on"))
+		hugetlb_free_vmemmap_enabled = true;
+	else if (strcmp(buf, "off"))
+		return -EINVAL;
+
+	return 0;
+}
+early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);
+
 static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
 {
 	return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 008/192] mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled
  2021-07-01  1:46 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2021-07-01  1:47 ` [patch 007/192] mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 009/192] mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate Andrew Morton
                   ` (184 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled

The memory_hotplug.memmap_on_memory parameter is not compatible with
hugetlb_free_vmemmap, so disable it when hugetlb_free_vmemmap is enabled.
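
For example, booting with both parameters enabled,

	hugetlb_free_vmemmap=on memory_hotplug.memmap_on_memory=1

means hugetlb_free_vmemmap takes precedence: mhp_supports_memmap_on_memory()
returns false, so hotplugged memory blocks will not place their memmap on
the hotplugged memory itself.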

[akpm@linux-foundation.org: remove unneeded include, per Oscar]
Link: https://lkml.kernel.org/r/20210510030027.56044-9-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/kernel-parameters.txt |    8 ++++++++
 mm/memory_hotplug.c                             |    1 +
 2 files changed, 9 insertions(+)

--- a/Documentation/admin-guide/kernel-parameters.txt~mm-memory_hotplug-disable-memmap_on_memory-when-hugetlb_free_vmemmap-enabled
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -1584,6 +1584,10 @@
 			on:  enable the feature
 			off: disable the feature
 
+			This is not compatible with memory_hotplug.memmap_on_memory.
+			If both parameters are enabled, hugetlb_free_vmemmap takes
+			precedence over memory_hotplug.memmap_on_memory.
+
 	hung_task_panic=
 			[KNL] Should the hung task detector generate panics.
 			Format: 0 | 1
@@ -2850,6 +2854,10 @@
 			Note that even when enabled, there are a few cases where
 			the feature is not effective.
 
+			This is not compatible with hugetlb_free_vmemmap. If
+			both parameters are enabled, hugetlb_free_vmemmap takes
+			precedence over memory_hotplug.memmap_on_memory.
+
 	memtest=	[KNL,X86,ARM,PPC,RISCV] Enable memtest
 			Format: <integer>
 			default : 0 <disable>
--- a/mm/memory_hotplug.c~mm-memory_hotplug-disable-memmap_on_memory-when-hugetlb_free_vmemmap-enabled
+++ a/mm/memory_hotplug.c
@@ -1056,6 +1056,7 @@ bool mhp_supports_memmap_on_memory(unsig
 	 *       populate a single PMD.
 	 */
 	return memmap_on_memory &&
+	       !is_hugetlb_free_vmemmap_enabled() &&
 	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
 	       size == memory_block_size_bytes() &&
 	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 009/192] mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate
  2021-07-01  1:46 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2021-07-01  1:47 ` [patch 008/192] mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 010/192] mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE Andrew Morton
                   ` (183 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, almasrymina, anshuman.khandual, bodeddub, bp, bsingharora,
	chenhuang5, corbet, dave.hansen, david, duanxiongchun, hpa,
	joao.m.martins, jroedel, linmiaohe, linux-mm, luto, mhocko,
	mike.kravetz, mingo, mm-commits, naoya.horiguchi, oneukum,
	osalvador, paulmck, pawan.kumar.gupta, peterz, rdunlap, rientjes,
	song.bao.hua, songmuchun, tglx, torvalds, viro, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate

All the infrastructure is ready, so we introduce the nr_free_vmemmap_pages
field in the hstate to indicate how many vmemmap pages associated with a
HugeTLB page can be freed to the buddy allocator, and initialize it in
hugetlb_vmemmap_init().  This patch is the actual enablement of the
feature.

There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page
structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is enabled,
so add a BUILD_BUG_ON to catch invalid usage of the tail struct page.
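
As a worked example (assuming x86-64 with 4 KiB base pages, a 64-byte
struct page and a 2 MiB huge page): the huge page spans 512 base pages,
so its struct pages occupy 512 * 64 = 32768 bytes, i.e. 8 vmemmap pages.
With RESERVE_VMEMMAP_NR = 2 pages kept (for the head page and the first
tail page), nr_free_vmemmap_pages = 8 - 2 = 6, which matches the
"6 * PAGE_SIZE for each 2MB hugetlb page" figure quoted in the
kernel-parameters documentation.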

Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h |    3 +++
 mm/hugetlb.c            |    1 +
 mm/hugetlb_vmemmap.c    |   33 +++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h    |   10 ++++++----
 4 files changed, 43 insertions(+), 4 deletions(-)

--- a/include/linux/hugetlb.h~mm-hugetlb-introduce-nr_free_vmemmap_pages-in-the-struct-hstate
+++ a/include/linux/hugetlb.h
@@ -608,6 +608,9 @@ struct hstate {
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+	unsigned int nr_free_vmemmap_pages;
+#endif
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
 	struct cftype cgroup_files_dfl[7];
--- a/mm/hugetlb.c~mm-hugetlb-introduce-nr_free_vmemmap_pages-in-the-struct-hstate
+++ a/mm/hugetlb.c
@@ -3585,6 +3585,7 @@ void __init hugetlb_add_hstate(unsigned
 	h->next_nid_to_free = first_memory_node;
 	snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
 					huge_page_size(h)/1024);
+	hugetlb_vmemmap_init(h);
 
 	parsed_hstate = h;
 }
--- a/mm/hugetlb_vmemmap.c~mm-hugetlb-introduce-nr_free_vmemmap_pages-in-the-struct-hstate
+++ a/mm/hugetlb_vmemmap.c
@@ -262,3 +262,36 @@ void free_huge_page_vmemmap(struct hstat
 
 	SetHPageVmemmapOptimized(head);
 }
+
+void __init hugetlb_vmemmap_init(struct hstate *h)
+{
+	unsigned int nr_pages = pages_per_huge_page(h);
+	unsigned int vmemmap_pages;
+
+	/*
+	 * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct
+	 * page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP,
+	 * so add a BUILD_BUG_ON to catch invalid usage of the tail struct page.
+	 */
+	BUILD_BUG_ON(__NR_USED_SUBPAGE >=
+		     RESERVE_VMEMMAP_SIZE / sizeof(struct page));
+
+	if (!hugetlb_free_vmemmap_enabled)
+		return;
+
+	vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
+	/*
+	 * The head page and the first tail page are not to be freed to buddy
+	 * allocator, the other pages will map to the first tail page, so they
+	 * can be freed.
+	 *
+	 * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true
+	 * on some architectures (e.g. aarch64). See Documentation/arm64/
+	 * hugetlbpage.rst for more details.
+	 */
+	if (likely(vmemmap_pages > RESERVE_VMEMMAP_NR))
+		h->nr_free_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR;
+
+	pr_info("can free %d vmemmap pages for %s\n", h->nr_free_vmemmap_pages,
+		h->name);
+}
--- a/mm/hugetlb_vmemmap.h~mm-hugetlb-introduce-nr_free_vmemmap_pages-in-the-struct-hstate
+++ a/mm/hugetlb_vmemmap.h
@@ -13,17 +13,15 @@
 #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
 int alloc_huge_page_vmemmap(struct hstate *h, struct page *head);
 void free_huge_page_vmemmap(struct hstate *h, struct page *head);
+void hugetlb_vmemmap_init(struct hstate *h);
 
 /*
  * How many vmemmap pages associated with a HugeTLB page that can be freed
  * to the buddy allocator.
- *
- * Todo: Returns zero for now, which means the feature is disabled. We will
- * enable it once all the infrastructure is there.
  */
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
 {
-	return 0;
+	return h->nr_free_vmemmap_pages;
 }
 #else
 static inline int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
@@ -35,6 +33,10 @@ static inline void free_huge_page_vmemma
 {
 }
 
+static inline void hugetlb_vmemmap_init(struct hstate *h)
+{
+}
+
 static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
 {
 	return 0;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 010/192] mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE
  2021-07-01  1:46 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2021-07-01  1:47 ` [patch 009/192] mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 011/192] mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake Andrew Morton
                   ` (182 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, anshuman.khandual, linux-mm, liushixin2, mm-commits, torvalds

From: Shixin Liu <liushixin2@huawei.com>
Subject: mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE

The functions {pmd/pud}_set_huge and {pmd/pud}_clear_huge are not
dependent on THP.  Hence move {pmd/pud}_huge_tests out of
CONFIG_TRANSPARENT_HUGEPAGE.

Link: https://lkml.kernel.org/r/20210419071820.750217-1-liushixin2@huawei.com
Signed-off-by: Shixin Liu <liushixin2@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug_vm_pgtable.c |   91 +++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 52 deletions(-)

--- a/mm/debug_vm_pgtable.c~mm-debug_vm_pgtable-move-pmd-pud_huge_tests-out-of-config_transparent_hugepage
+++ a/mm/debug_vm_pgtable.c
@@ -248,29 +248,6 @@ static void __init pmd_leaf_tests(unsign
 	WARN_ON(!pmd_leaf(pmd));
 }
 
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
-{
-	pmd_t pmd;
-
-	if (!arch_vmap_pmd_supported(prot))
-		return;
-
-	pr_debug("Validating PMD huge\n");
-	/*
-	 * X86 defined pmd_set_huge() verifies that the given
-	 * PMD is not a populated non-leaf entry.
-	 */
-	WRITE_ONCE(*pmdp, __pmd(0));
-	WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot));
-	WARN_ON(!pmd_clear_huge(pmdp));
-	pmd = READ_ONCE(*pmdp);
-	WARN_ON(!pmd_none(pmd));
-}
-#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
-static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { }
-#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
-
 static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
 {
 	pmd_t pmd;
@@ -395,30 +372,6 @@ static void __init pud_leaf_tests(unsign
 	pud = pud_mkhuge(pud);
 	WARN_ON(!pud_leaf(pud));
 }
-
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot)
-{
-	pud_t pud;
-
-	if (!arch_vmap_pud_supported(prot))
-		return;
-
-	pr_debug("Validating PUD huge\n");
-	/*
-	 * X86 defined pud_set_huge() verifies that the given
-	 * PUD is not a populated non-leaf entry.
-	 */
-	WRITE_ONCE(*pudp, __pud(0));
-	WARN_ON(!pud_set_huge(pudp, __pfn_to_phys(pfn), prot));
-	WARN_ON(!pud_clear_huge(pudp));
-	pud = READ_ONCE(*pudp);
-	WARN_ON(!pud_none(pud));
-}
-#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
-static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { }
-#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
-
 #else  /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { }
 static void __init pud_advanced_tests(struct mm_struct *mm,
@@ -428,9 +381,6 @@ static void __init pud_advanced_tests(st
 {
 }
 static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { }
-static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot)
-{
-}
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 #else  /* !CONFIG_TRANSPARENT_HUGEPAGE */
 static void __init pmd_basic_tests(unsigned long pfn, int idx) { }
@@ -449,14 +399,51 @@ static void __init pud_advanced_tests(st
 }
 static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) { }
 static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { }
+static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
 static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
 {
+	pmd_t pmd;
+
+	if (!arch_vmap_pmd_supported(prot))
+		return;
+
+	pr_debug("Validating PMD huge\n");
+	/*
+	 * X86 defined pmd_set_huge() verifies that the given
+	 * PMD is not a populated non-leaf entry.
+	 */
+	WRITE_ONCE(*pmdp, __pmd(0));
+	WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot));
+	WARN_ON(!pmd_clear_huge(pmdp));
+	pmd = READ_ONCE(*pmdp);
+	WARN_ON(!pmd_none(pmd));
 }
+
 static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot)
 {
+	pud_t pud;
+
+	if (!arch_vmap_pud_supported(prot))
+		return;
+
+	pr_debug("Validating PUD huge\n");
+	/*
+	 * X86 defined pud_set_huge() verifies that the given
+	 * PUD is not a populated non-leaf entry.
+	 */
+	WRITE_ONCE(*pudp, __pud(0));
+	WARN_ON(!pud_set_huge(pudp, __pfn_to_phys(pfn), prot));
+	WARN_ON(!pud_clear_huge(pudp));
+	pud = READ_ONCE(*pudp);
+	WARN_ON(!pud_none(pud));
 }
-static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { }
+static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { }
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
 static void __init p4d_basic_tests(unsigned long pfn, pgprot_t prot)
 {
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 011/192] mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake
  2021-07-01  1:46 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2021-07-01  1:47 ` [patch 010/192] mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 012/192] mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK Andrew Morton
                   ` (181 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: akpm, anshuman.khandual, linux-mm, liushixin2, mm-commits, torvalds

From: Shixin Liu <liushixin2@huawei.com>
Subject: mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake

Remove the redundant pfn_{pmd/pte}() in {pmd/pte}_advanced_tests() and
adjust pfn_pud() in pud_advanced_tests() to make it consistent with the
other two functions.

In addition, the branch condition should be CONFIG_TRANSPARENT_HUGEPAGE
instead of CONFIG_ARCH_HAS_PTE_DEVMAP.

Link: https://lkml.kernel.org/r/20210419071820.750217-2-liushixin2@huawei.com
Signed-off-by: Shixin Liu <liushixin2@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug_vm_pgtable.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/debug_vm_pgtable.c~mm-debug_vm_pgtable-remove-redundant-pfn_pmd-pte-and-fix-one-comment-mistake
+++ a/mm/debug_vm_pgtable.c
@@ -91,7 +91,7 @@ static void __init pte_advanced_tests(st
 				      unsigned long pfn, unsigned long vaddr,
 				      pgprot_t prot)
 {
-	pte_t pte = pfn_pte(pfn, prot);
+	pte_t pte;
 
 	/*
 	 * Architectures optimize set_pte_at by avoiding TLB flush.
@@ -778,12 +778,12 @@ static void __init pmd_swap_soft_dirty_t
 	WARN_ON(!pmd_swp_soft_dirty(pmd_swp_mksoft_dirty(pmd)));
 	WARN_ON(pmd_swp_soft_dirty(pmd_swp_clear_soft_dirty(pmd)));
 }
-#else  /* !CONFIG_ARCH_HAS_PTE_DEVMAP */
+#else  /* !CONFIG_TRANSPARENT_HUGEPAGE */
 static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot) { }
 static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot)
 {
 }
-#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static void __init pte_swap_tests(unsigned long pfn, pgprot_t prot)
 {
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 012/192] mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK
  2021-07-01  1:46 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2021-07-01  1:47 ` [patch 011/192] mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 013/192] mm/huge_memory.c: use page->deferred_list Andrew Morton
                   ` (180 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: adobriyan, akpm, aneesh.kumar, anshuman.khandual, david, hannes,
	hughd, kirill.shutemov, linmiaohe, linux-mm, mike.kravetz,
	minchan, mm-commits, rcampbell, riel, shy828301, songliubraving,
	torvalds, william.kucharski, willy, ziy

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK

Patch series "Cleanup and fixup for huge_memory", v3.

This series contains cleanups to remove a dedicated macro and remove
unnecessary tlb_remove_page_size() for huge zero pmd.  It also adds
missing read-only THP checking for transparent_hugepage_enabled() and
avoids discarding a hugepage while other processes are mapping it.  More
details can be found in the respective changelogs.


This patch (of 5):

Rewrite the pgoff checking logic to remove the macro
HPAGE_CACHE_INDEX_MASK, which is only used here, in order to simplify the
code.
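
The two forms are equivalent: since HPAGE_PMD_NR is a power of two,
(x & (HPAGE_PMD_NR - 1)) == (y & (HPAGE_PMD_NR - 1)) holds exactly when
x - y is a multiple of HPAGE_PMD_NR, which is what
IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, HPAGE_PMD_NR)
checks, so the dedicated mask macro can be dropped.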

Link: https://lkml.kernel.org/r/20210511134857.1581273-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210511134857.1581273-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/huge_mm.h |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/include/linux/huge_mm.h~mm-huge_memoryc-remove-dedicated-macro-hpage_cache_index_mask
+++ a/include/linux/huge_mm.h
@@ -152,15 +152,13 @@ static inline bool __transparent_hugepag
 
 bool transparent_hugepage_enabled(struct vm_area_struct *vma);
 
-#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
-
 static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 		unsigned long haddr)
 {
 	/* Don't have to check pgoff for anonymous vma */
 	if (!vma_is_anonymous(vma)) {
-		if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
-			(vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+		if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+				HPAGE_PMD_NR))
 			return false;
 	}
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 013/192] mm/huge_memory.c: use page->deferred_list
  2021-07-01  1:46 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2021-07-01  1:47 ` [patch 012/192] mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 014/192] mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled() Andrew Morton
                   ` (179 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: adobriyan, akpm, aneesh.kumar, anshuman.khandual, david, hannes,
	hughd, kirill.shutemov, linmiaohe, linux-mm, mike.kravetz,
	minchan, mm-commits, rcampbell, riel, shy828301, songliubraving,
	torvalds, william.kucharski, willy, ziy

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/huge_memory.c: use page->deferred_list

Now that we can represent the location of ->deferred_list instead of
->mapping + ->index, make use of it to improve readability.

Link: https://lkml.kernel.org/r/20210511134857.1581273-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/huge_memory.c~mm-huge_memoryc-use-page-deferred_list
+++ a/mm/huge_memory.c
@@ -2870,7 +2870,7 @@ static unsigned long deferred_split_scan
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_safe(pos, next, &ds_queue->split_queue) {
-		page = list_entry((void *)pos, struct page, mapping);
+		page = list_entry((void *)pos, struct page, deferred_list);
 		page = compound_head(page);
 		if (get_page_unless_zero(page)) {
 			list_move(page_deferred_list(page), &list);
@@ -2885,7 +2885,7 @@ static unsigned long deferred_split_scan
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
 	list_for_each_safe(pos, next, &list) {
-		page = list_entry((void *)pos, struct page, mapping);
+		page = list_entry((void *)pos, struct page, deferred_list);
 		if (!trylock_page(page))
 			goto next;
 		/* split_huge_page() removes page from list on success */
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 014/192] mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2021-07-01  1:47 ` [patch 013/192] mm/huge_memory.c: use page->deferred_list Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 015/192] mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd Andrew Morton
                   ` (178 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: adobriyan, akpm, aneesh.kumar, anshuman.khandual, david, hannes,
	hughd, kirill.shutemov, linmiaohe, linux-mm, mike.kravetz,
	minchan, mm-commits, rcampbell, riel, shy828301, songliubraving,
	torvalds, william.kucharski, willy, ziy

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()

Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for
(non-shmem) FS"), read-only THP file mapping is supported.  But that
commit forgot to add a corresponding check in
transparent_hugepage_enabled().  To fix it, add a check for read-only THP
file mappings and also introduce the helper transhuge_vma_enabled() to
check whether THP is enabled for a specified vma, reducing duplicated
code.  Rename transparent_hugepage_enabled to transparent_hugepage_active
to make the code easier to follow, as suggested by David Hildenbrand.

[linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
  Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/task_mmu.c      |    2 -
 include/linux/huge_mm.h |   57 +++++++++++++++++++++++---------------
 mm/huge_memory.c        |   11 ++++++-
 mm/khugepaged.c         |    4 --
 mm/shmem.c              |    3 --
 5 files changed, 48 insertions(+), 29 deletions(-)

--- a/fs/proc/task_mmu.c~mm-huge_memoryc-add-missing-read-only-thp-checking-in-transparent_hugepage_enabled
+++ a/fs/proc/task_mmu.c
@@ -832,7 +832,7 @@ static int show_smap(struct seq_file *m,
 	__show_smap(m, &mss, false);
 
 	seq_printf(m, "THPeligible:    %d\n",
-		   transparent_hugepage_enabled(vma));
+		   transparent_hugepage_active(vma));
 
 	if (arch_pkeys_enabled())
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
--- a/include/linux/huge_mm.h~mm-huge_memoryc-add-missing-read-only-thp-checking-in-transparent_hugepage_enabled
+++ a/include/linux/huge_mm.h
@@ -115,9 +115,34 @@ extern struct kobj_attribute shmem_enabl
 
 extern unsigned long transparent_hugepage_flags;
 
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+		unsigned long haddr)
+{
+	/* Don't have to check pgoff for anonymous vma */
+	if (!vma_is_anonymous(vma)) {
+		if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+				HPAGE_PMD_NR))
+			return false;
+	}
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return false;
+	return true;
+}
+
+static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
+					  unsigned long vm_flags)
+{
+	/* Explicitly disabled through madvise. */
+	if ((vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return false;
+	return true;
+}
+
 /*
  * to be used on vmas which are known to support THP.
- * Use transparent_hugepage_enabled otherwise
+ * Use transparent_hugepage_active otherwise
  */
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
@@ -128,15 +153,12 @@ static inline bool __transparent_hugepag
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX))
 		return false;
 
-	if (vma->vm_flags & VM_NOHUGEPAGE)
+	if (!transhuge_vma_enabled(vma, vma->vm_flags))
 		return false;
 
 	if (vma_is_temporary_stack(vma))
 		return false;
 
-	if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
-		return false;
-
 	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
 		return true;
 
@@ -150,22 +172,7 @@ static inline bool __transparent_hugepag
 	return false;
 }
 
-bool transparent_hugepage_enabled(struct vm_area_struct *vma);
-
-static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
-		unsigned long haddr)
-{
-	/* Don't have to check pgoff for anonymous vma */
-	if (!vma_is_anonymous(vma)) {
-		if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
-				HPAGE_PMD_NR))
-			return false;
-	}
-
-	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
-		return false;
-	return true;
-}
+bool transparent_hugepage_active(struct vm_area_struct *vma);
 
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
@@ -352,7 +359,7 @@ static inline bool __transparent_hugepag
 	return false;
 }
 
-static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
+static inline bool transparent_hugepage_active(struct vm_area_struct *vma)
 {
 	return false;
 }
@@ -362,6 +369,12 @@ static inline bool transhuge_vma_suitabl
 {
 	return false;
 }
+
+static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
+					  unsigned long vm_flags)
+{
+	return false;
+}
 
 static inline void prep_transhuge_page(struct page *page) {}
 
--- a/mm/huge_memory.c~mm-huge_memoryc-add-missing-read-only-thp-checking-in-transparent_hugepage_enabled
+++ a/mm/huge_memory.c
@@ -64,7 +64,14 @@ static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
-bool transparent_hugepage_enabled(struct vm_area_struct *vma)
+static inline bool file_thp_enabled(struct vm_area_struct *vma)
+{
+	return transhuge_vma_enabled(vma, vma->vm_flags) && vma->vm_file &&
+	       !inode_is_open_for_write(vma->vm_file->f_inode) &&
+	       (vma->vm_flags & VM_EXEC);
+}
+
+bool transparent_hugepage_active(struct vm_area_struct *vma)
 {
 	/* The addr is used to check if the vma size fits */
 	unsigned long addr = (vma->vm_end & HPAGE_PMD_MASK) - HPAGE_PMD_SIZE;
@@ -75,6 +82,8 @@ bool transparent_hugepage_enabled(struct
 		return __transparent_hugepage_enabled(vma);
 	if (vma_is_shmem(vma))
 		return shmem_huge_enabled(vma);
+	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS))
+		return file_thp_enabled(vma);
 
 	return false;
 }
--- a/mm/khugepaged.c~mm-huge_memoryc-add-missing-read-only-thp-checking-in-transparent_hugepage_enabled
+++ a/mm/khugepaged.c
@@ -442,9 +442,7 @@ static inline int khugepaged_test_exit(s
 static bool hugepage_vma_check(struct vm_area_struct *vma,
 			       unsigned long vm_flags)
 {
-	/* Explicitly disabled through madvise. */
-	if ((vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+	if (!transhuge_vma_enabled(vma, vm_flags))
 		return false;
 
 	/* Enabled via shmem mount options or sysfs settings. */
--- a/mm/shmem.c~mm-huge_memoryc-add-missing-read-only-thp-checking-in-transparent_hugepage_enabled
+++ a/mm/shmem.c
@@ -4040,8 +4040,7 @@ bool shmem_huge_enabled(struct vm_area_s
 	loff_t i_size;
 	pgoff_t off;
 
-	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
-	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+	if (!transhuge_vma_enabled(vma, vma->vm_flags))
 		return false;
 	if (shmem_huge == SHMEM_HUGE_FORCE)
 		return true;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 015/192] mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd
  2021-07-01  1:46 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2021-07-01  1:47 ` [patch 014/192] mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled() Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:47 ` [patch 016/192] mm/huge_memory.c: don't discard hugepage if other processes are mapping it Andrew Morton
                   ` (177 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: adobriyan, akpm, aneesh.kumar, anshuman.khandual, david, hannes,
	hughd, kirill.shutemov, linmiaohe, linux-mm, mike.kravetz,
	minchan, mm-commits, rcampbell, riel, shy828301, songliubraving,
	torvalds, william.kucharski, willy, ziy

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd

Commit aa88b68c3b1d ("thp: keep huge zero page pinned until tlb flush")
introduced tlb_remove_page() for the huge zero page to keep it pinned
until the flush is complete and to prevent the page from being split
under us.  But since commit 6fcb52a56ff6 ("thp: reduce usage of huge
zero page's atomic counter"), the huge zero page is kept pinned until
all relevant mm_users reach zero.  So tlb_remove_page_size() for the
huge zero pmd is now unnecessary.

Link: https://lkml.kernel.org/r/20210511134857.1581273-5-linmiaohe@huawei.com
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    3 ---
 1 file changed, 3 deletions(-)

--- a/mm/huge_memory.c~mm-huge_memoryc-remove-unnecessary-tlb_remove_page_size-for-huge-zero-pmd
+++ a/mm/huge_memory.c
@@ -1686,12 +1686,9 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 		if (arch_needs_pgtable_deposit())
 			zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
-		if (is_huge_zero_pmd(orig_pmd))
-			tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
 		zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
-		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
 	} else {
 		struct page *page = NULL;
 		int flush_needed = 1;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 016/192] mm/huge_memory.c: don't discard hugepage if other processes are mapping it
  2021-07-01  1:46 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2021-07-01  1:47 ` [patch 015/192] mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd Andrew Morton
@ 2021-07-01  1:47 ` Andrew Morton
  2021-07-01  1:48 ` [patch 017/192] mm/hugetlb: change parameters of arch_make_huge_pte() Andrew Morton
                   ` (176 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:47 UTC (permalink / raw)
  To: adobriyan, akpm, aneesh.kumar, anshuman.khandual, david, hannes,
	hughd, kirill.shutemov, linmiaohe, linux-mm, mike.kravetz,
	minchan, mm-commits, rcampbell, riel, shy828301, songliubraving,
	torvalds, william.kucharski, willy, ziy

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/huge_memory.c: don't discard hugepage if other processes are mapping it

If other processes are mapping any other subpages of the hugepage, i.e. 
in the pte-mapped THP case, page_mapcount() will incorrectly return 1. 
Then we would discard the page while other processes are still mapping
it.  Fix it by using total_mapcount(), which can tell whether other
processes are still mapping it.
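
As a simplified model of the distinction (an illustrative sketch, not
the kernel's exact implementation):

	/*
	 * Sketch: page_mapcount() on one subpage sees only that
	 * subpage's PTE mappings plus the compound (PMD) mapping,
	 * while total_mapcount() also sums the PTE-level mapcounts
	 * of every subpage, so a THP that another process has
	 * pte-mapped is never mistaken for an exclusively mapped one.
	 */
	static int total_mapcount_model(struct page *head, int nr_subpages)
	{
		int i, count = compound_mapcount(head);	/* PMD mappings */

		for (i = 0; i < nr_subpages; i++)	/* PTE mappings */
			count += atomic_read(&head[i]._mapcount) + 1;
		return count;
	}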

Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com
Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/huge_memory.c~mm-huge_memoryc-dont-discard-hugepage-if-other-processes-are-mapping-it
+++ a/mm/huge_memory.c
@@ -1613,7 +1613,7 @@ bool madvise_free_huge_pmd(struct mmu_ga
 	 * If other processes are mapping this page, we couldn't discard
 	 * the page unless they all do MADV_FREE so let's skip the page.
 	 */
-	if (page_mapcount(page) != 1)
+	if (total_mapcount(page) != 1)
 		goto out;
 
 	if (!trylock_page(page))
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 017/192] mm/hugetlb: change parameters of arch_make_huge_pte()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2021-07-01  1:47 ` [patch 016/192] mm/huge_memory.c: don't discard hugepage if other processes are mapping it Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:48 ` [patch 018/192] mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge Andrew Morton
                   ` (175 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, linux-mm, mike.kravetz, mm-commits,
	mpe, npiggin, paulus, rppt, torvalds, uladzislau.rezki

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: mm/hugetlb: change parameters of arch_make_huge_pte()

Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2.

This series implements huge VMAP and VMALLOC on powerpc 8xx.

Powerpc 8xx has 4 page sizes:
- 4k
- 16k
- 512k
- 8M

At present, vmalloc and vmap only support huge pages which are
leaf at PMD level.

Here the PMD level is 4M; it doesn't correspond to any supported
page size.

For now, implement the use of 16k and 512k pages, which is done
at PTE level.

Support for 8M pages will be implemented later; it requires the use
of hugepd tables.

To allow this, the architecture provides two functions:
- arch_vmap_pte_range_map_size(), which tells vmap_pte_range() what
page size to use.  A stub returning PAGE_SIZE is provided when the
architecture doesn't provide this function.
- arch_vmap_pte_supported_shift(), which tells __vmalloc_node_range()
what page shift to use for a given area size.  A stub returning
PAGE_SHIFT is provided when the architecture doesn't provide this
function.
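
For reference, the generic fallbacks added later in this series (see
patches 019 and 020 below) simply select base pages:

	#ifndef arch_vmap_pte_range_map_size
	static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
				unsigned long end, u64 pfn, unsigned int max_page_shift)
	{
		return PAGE_SIZE;	/* no pte-level huge pages by default */
	}
	#endif

	#ifndef arch_vmap_pte_supported_shift
	static inline int arch_vmap_pte_supported_shift(unsigned long size)
	{
		return PAGE_SHIFT;	/* map with base pages by default */
	}
	#endif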


This patch (of 5):

At present, arch_make_huge_pte() has the following prototype:

  pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
			   struct page *page, int writable);

vma is used to get the page shift or size.
vma is also used on Sparc to get vm_flags.
page is not used.
writable is not used.

In order to use this function without a vma, replace vma with shift and
flags.  Also remove the unused parameters.
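
For illustration, a typical caller changes like this (names as in the
hunks below):

	/* old: the helper dug the shift and flags out of the vma */
	entry = arch_make_huge_pte(entry, vma, page, writable);

	/* new: the caller derives them and passes them explicitly */
	shift = huge_page_shift(hstate_vma(vma));
	entry = arch_make_huge_pte(entry, shift, vma->vm_flags);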

Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/include/asm/hugetlb.h                 |    3 +--
 arch/arm64/mm/hugetlbpage.c                      |    5 ++---
 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h |    5 ++---
 arch/sparc/include/asm/pgtable_64.h              |    3 +--
 arch/sparc/mm/hugetlbpage.c                      |    6 ++----
 include/linux/hugetlb.h                          |    4 ++--
 mm/hugetlb.c                                     |    6 ++++--
 mm/migrate.c                                     |    4 +++-
 8 files changed, 17 insertions(+), 19 deletions(-)

--- a/arch/arm64/include/asm/hugetlb.h~mm-hugetlb-change-parameters-of-arch_make_huge_pte
+++ a/arch/arm64/include/asm/hugetlb.h
@@ -23,8 +23,7 @@ static inline void arch_clear_hugepage_f
 }
 #define arch_clear_hugepage_flags arch_clear_hugepage_flags
 
-extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
-				struct page *page, int writable);
+pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);
 #define arch_make_huge_pte arch_make_huge_pte
 #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
 extern void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
--- a/arch/arm64/mm/hugetlbpage.c~mm-hugetlb-change-parameters-of-arch_make_huge_pte
+++ a/arch/arm64/mm/hugetlbpage.c
@@ -339,10 +339,9 @@ pte_t *huge_pte_offset(struct mm_struct
 	return NULL;
 }
 
-pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
-			 struct page *page, int writable)
+pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
 {
-	size_t pagesize = huge_page_size(hstate_vma(vma));
+	size_t pagesize = 1UL << shift;
 
 	if (pagesize == CONT_PTE_SIZE) {
 		entry = pte_mkcont(entry);
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h~mm-hugetlb-change-parameters-of-arch_make_huge_pte
+++ a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -66,10 +66,9 @@ static inline void huge_ptep_set_wrprote
 }
 
 #ifdef CONFIG_PPC_4K_PAGES
-static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
-				       struct page *page, int writable)
+static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
 {
-	size_t size = huge_page_size(hstate_vma(vma));
+	size_t size = 1UL << shift;
 
 	if (size == SZ_16K)
 		return __pte(pte_val(entry) & ~_PAGE_HUGE);
--- a/arch/sparc/include/asm/pgtable_64.h~mm-hugetlb-change-parameters-of-arch_make_huge_pte
+++ a/arch/sparc/include/asm/pgtable_64.h
@@ -377,8 +377,7 @@ static inline pgprot_t pgprot_noncached(
 #define pgprot_noncached pgprot_noncached
 
 #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
-extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
-				struct page *page, int writable);
+pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);
 #define arch_make_huge_pte arch_make_huge_pte
 static inline unsigned long __pte_default_huge_mask(void)
 {
--- a/arch/sparc/mm/hugetlbpage.c~mm-hugetlb-change-parameters-of-arch_make_huge_pte
+++ a/arch/sparc/mm/hugetlbpage.c
@@ -177,10 +177,8 @@ static pte_t hugepage_shift_to_tte(pte_t
 		return sun4u_hugepage_shift_to_tte(entry, shift);
 }
 
-pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
-			 struct page *page, int writeable)
+pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
 {
-	unsigned int shift = huge_page_shift(hstate_vma(vma));
 	pte_t pte;
 
 	pte = hugepage_shift_to_tte(entry, shift);
@@ -188,7 +186,7 @@ pte_t arch_make_huge_pte(pte_t entry, st
 #ifdef CONFIG_SPARC64
 	/* If this vma has ADI enabled on it, turn on TTE.mcd
 	 */
-	if (vma->vm_flags & VM_SPARC_ADI)
+	if (flags & VM_SPARC_ADI)
 		return pte_mkmcd(pte);
 	else
 		return pte_mknotmcd(pte);
--- a/include/linux/hugetlb.h~mm-hugetlb-change-parameters-of-arch_make_huge_pte
+++ a/include/linux/hugetlb.h
@@ -741,8 +741,8 @@ static inline void arch_clear_hugepage_f
 #endif
 
 #ifndef arch_make_huge_pte
-static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
-				       struct page *page, int writable)
+static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift,
+				       vm_flags_t flags)
 {
 	return entry;
 }
--- a/mm/hugetlb.c~mm-hugetlb-change-parameters-of-arch_make_huge_pte
+++ a/mm/hugetlb.c
@@ -4060,6 +4060,7 @@ static pte_t make_huge_pte(struct vm_are
 				int writable)
 {
 	pte_t entry;
+	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
 	if (writable) {
 		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
@@ -4070,7 +4071,7 @@ static pte_t make_huge_pte(struct vm_are
 	}
 	entry = pte_mkyoung(entry);
 	entry = pte_mkhuge(entry);
-	entry = arch_make_huge_pte(entry, vma, page, writable);
+	entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
 
 	return entry;
 }
@@ -5468,10 +5469,11 @@ unsigned long hugetlb_change_protection(
 		}
 		if (!huge_pte_none(pte)) {
 			pte_t old_pte;
+			unsigned int shift = huge_page_shift(hstate_vma(vma));
 
 			old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
 			pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
-			pte = arch_make_huge_pte(pte, vma, NULL, 0);
+			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
 			pages++;
 		}
--- a/mm/migrate.c~mm-hugetlb-change-parameters-of-arch_make_huge_pte
+++ a/mm/migrate.c
@@ -226,8 +226,10 @@ static bool remove_migration_pte(struct
 
 #ifdef CONFIG_HUGETLB_PAGE
 		if (PageHuge(new)) {
+			unsigned int shift = huge_page_shift(hstate_vma(vma));
+
 			pte = pte_mkhuge(pte);
-			pte = arch_make_huge_pte(pte, vma, new, 0);
+			pte = arch_make_huge_pte(pte, shift, vma->vm_flags);
 			set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 			if (PageAnon(new))
 				hugepage_add_anon_rmap(new, vma, pvmw.address);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 018/192] mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge
  2021-07-01  1:46 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2021-07-01  1:48 ` [patch 017/192] mm/hugetlb: change parameters of arch_make_huge_pte() Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:48 ` [patch 019/192] mm/vmalloc: enable mapping of huge pages at pte level in vmap Andrew Morton
                   ` (174 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, linux-mm, mike.kravetz, mm-commits,
	mpe, naresh.kamboju, npiggin, paulus, rppt, torvalds,
	uladzislau.rezki

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge

For architectures with no PMD and/or no PUD, add stubs similar to what we
have for architectures without P4D.
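
A hypothetical caller, to illustrate what the stubs buy us (not code
from this patch):

	/*
	 * The stubs return 0 ("no huge mapping installed"), so generic
	 * code can try each level unconditionally and fall back, even
	 * on configurations where the PUD or PMD level is folded away.
	 */
	if (pud_set_huge(pudp, phys, prot))
		return;		/* mapped at PUD level */
	if (pmd_set_huge(pmdp, phys, prot))
		return;		/* mapped at PMD level */
	/* otherwise, map at PTE level */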

[christophe.leroy@csgroup.eu: arm64: define only {pud/pmd}_{set/clear}_huge when useful]
  Link: https://lkml.kernel.org/r/73ec95f40cafbbb69bdfb43a7f53876fd845b0ce.1620990479.git.christophe.leroy@csgroup.eu
[christophe.leroy@csgroup.eu: x86: define only {pud/pmd}_{set/clear}_huge when useful]
  Link: https://lkml.kernel.org/r/7fbf1b6bc3e15c07c24fa45278d57064f14c896b.1620930415.git.christophe.leroy@csgroup.eu
Link: https://lkml.kernel.org/r/5ac5976419350e8e048d463a64cae449eb3ba4b0.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/mm/mmu.c     |   20 ++++++++++++--------
 arch/x86/mm/pgtable.c   |   34 +++++++++++++++++++---------------
 include/linux/pgtable.h |   26 +++++++++++++++++++++++++-
 3 files changed, 56 insertions(+), 24 deletions(-)

--- a/arch/arm64/mm/mmu.c~mm-pgtable-add-stubs-for-pmd-pub_set-clear_huge
+++ a/arch/arm64/mm/mmu.c
@@ -1338,6 +1338,7 @@ void *__init fixmap_remap_fdt(phys_addr_
 	return dt_virt;
 }
 
+#if CONFIG_PGTABLE_LEVELS > 3
 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
 {
 	pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot));
@@ -1352,6 +1353,16 @@ int pud_set_huge(pud_t *pudp, phys_addr_
 	return 1;
 }
 
+int pud_clear_huge(pud_t *pudp)
+{
+	if (!pud_sect(READ_ONCE(*pudp)))
+		return 0;
+	pud_clear(pudp);
+	return 1;
+}
+#endif
+
+#if CONFIG_PGTABLE_LEVELS > 2
 int pmd_set_huge(pmd_t *pmdp, phys_addr_t phys, pgprot_t prot)
 {
 	pmd_t new_pmd = pfn_pmd(__phys_to_pfn(phys), mk_pmd_sect_prot(prot));
@@ -1366,14 +1377,6 @@ int pmd_set_huge(pmd_t *pmdp, phys_addr_
 	return 1;
 }
 
-int pud_clear_huge(pud_t *pudp)
-{
-	if (!pud_sect(READ_ONCE(*pudp)))
-		return 0;
-	pud_clear(pudp);
-	return 1;
-}
-
 int pmd_clear_huge(pmd_t *pmdp)
 {
 	if (!pmd_sect(READ_ONCE(*pmdp)))
@@ -1381,6 +1384,7 @@ int pmd_clear_huge(pmd_t *pmdp)
 	pmd_clear(pmdp);
 	return 1;
 }
+#endif
 
 int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
 {
--- a/arch/x86/mm/pgtable.c~mm-pgtable-add-stubs-for-pmd-pub_set-clear_huge
+++ a/arch/x86/mm/pgtable.c
@@ -682,6 +682,7 @@ int p4d_clear_huge(p4d_t *p4d)
 }
 #endif
 
+#if CONFIG_PGTABLE_LEVELS > 3
 /**
  * pud_set_huge - setup kernel PUD mapping
  *
@@ -721,6 +722,23 @@ int pud_set_huge(pud_t *pud, phys_addr_t
 }
 
 /**
+ * pud_clear_huge - clear kernel PUD mapping when it is set
+ *
+ * Returns 1 on success and 0 on failure (no PUD map is found).
+ */
+int pud_clear_huge(pud_t *pud)
+{
+	if (pud_large(*pud)) {
+		pud_clear(pud);
+		return 1;
+	}
+
+	return 0;
+}
+#endif
+
+#if CONFIG_PGTABLE_LEVELS > 2
+/**
  * pmd_set_huge - setup kernel PMD mapping
  *
  * See text over pud_set_huge() above.
@@ -751,21 +769,6 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t
 }
 
 /**
- * pud_clear_huge - clear kernel PUD mapping when it is set
- *
- * Returns 1 on success and 0 on failure (no PUD map is found).
- */
-int pud_clear_huge(pud_t *pud)
-{
-	if (pud_large(*pud)) {
-		pud_clear(pud);
-		return 1;
-	}
-
-	return 0;
-}
-
-/**
  * pmd_clear_huge - clear kernel PMD mapping when it is set
  *
  * Returns 1 on success and 0 on failure (no PMD map is found).
@@ -779,6 +782,7 @@ int pmd_clear_huge(pmd_t *pmd)
 
 	return 0;
 }
+#endif
 
 #ifdef CONFIG_X86_64
 /**
--- a/include/linux/pgtable.h~mm-pgtable-add-stubs-for-pmd-pub_set-clear_huge
+++ a/include/linux/pgtable.h
@@ -1379,10 +1379,34 @@ static inline int p4d_clear_huge(p4d_t *
 }
 #endif /* !__PAGETABLE_P4D_FOLDED */
 
+#ifndef __PAGETABLE_PUD_FOLDED
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
-int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
 int pud_clear_huge(pud_t *pud);
+#else
+static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
+{
+	return 0;
+}
+static inline int pud_clear_huge(pud_t *pud)
+{
+	return 0;
+}
+#endif /* !__PAGETABLE_PUD_FOLDED */
+
+#ifndef __PAGETABLE_PMD_FOLDED
+int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
 int pmd_clear_huge(pmd_t *pmd);
+#else
+static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
+{
+	return 0;
+}
+static inline int pmd_clear_huge(pmd_t *pmd)
+{
+	return 0;
+}
+#endif /* !__PAGETABLE_PMD_FOLDED */
+
 int p4d_free_pud_page(p4d_t *p4d, unsigned long addr);
 int pud_free_pmd_page(pud_t *pud, unsigned long addr);
 int pmd_free_pte_page(pmd_t *pmd, unsigned long addr);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 019/192] mm/vmalloc: enable mapping of huge pages at pte level in vmap
  2021-07-01  1:46 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2021-07-01  1:48 ` [patch 018/192] mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:48 ` [patch 020/192] mm/vmalloc: enable mapping of huge pages at pte level in vmalloc Andrew Morton
                   ` (173 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, linux-mm, mike.kravetz, mm-commits,
	mpe, npiggin, paulus, rppt, torvalds, uladzislau.rezki

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: mm/vmalloc: enable mapping of huge pages at pte level in vmap

On some architectures like powerpc, there are huge pages that are mapped
at pte level.

Enable it in vmap.

For that, architectures can provide arch_vmap_pte_range_map_size() that
returns the size of pages to map at pte level.
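
As a sketch of the expected contract (modelled on the powerpc 8xx
implementation later in this series, not mandated by this patch), an
architecture should only report a size larger than PAGE_SIZE when the
range and the addresses allow it:

	/* usable only if the size fits the remaining range, respects
	 * max_page_shift, and both the virtual address and the
	 * physical address are aligned to it */
	if (end - addr >= size && (1UL << max_page_shift) >= size &&
	    IS_ALIGNED(addr, size) && IS_ALIGNED(PFN_PHYS(pfn), size))
		return size;
	return PAGE_SIZE;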

Link: https://lkml.kernel.org/r/fb3ccc73377832ac6708181ec419128a2f98ce36.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/vmalloc.h |    8 ++++++++
 mm/vmalloc.c            |   21 ++++++++++++++++++---
 2 files changed, 26 insertions(+), 3 deletions(-)

--- a/include/linux/vmalloc.h~mm-vmalloc-enable-mapping-of-huge-pages-at-pte-level-in-vmap
+++ a/include/linux/vmalloc.h
@@ -104,6 +104,14 @@ static inline bool arch_vmap_pmd_support
 }
 #endif
 
+#ifndef arch_vmap_pte_range_map_size
+static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, unsigned long end,
+							 u64 pfn, unsigned int max_page_shift)
+{
+	return PAGE_SIZE;
+}
+#endif
+
 /*
  *	Highlevel APIs for driver use
  */
--- a/mm/vmalloc.c~mm-vmalloc-enable-mapping-of-huge-pages-at-pte-level-in-vmap
+++ a/mm/vmalloc.c
@@ -36,6 +36,7 @@
 #include <linux/overflow.h>
 #include <linux/pgtable.h>
 #include <linux/uaccess.h>
+#include <linux/hugetlb.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
@@ -83,10 +84,11 @@ static void free_work(struct work_struct
 /*** Page table manipulation functions ***/
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			phys_addr_t phys_addr, pgprot_t prot,
-			pgtbl_mod_mask *mask)
+			unsigned int max_page_shift, pgtbl_mod_mask *mask)
 {
 	pte_t *pte;
 	u64 pfn;
+	unsigned long size = PAGE_SIZE;
 
 	pfn = phys_addr >> PAGE_SHIFT;
 	pte = pte_alloc_kernel_track(pmd, addr, mask);
@@ -94,9 +96,22 @@ static int vmap_pte_range(pmd_t *pmd, un
 		return -ENOMEM;
 	do {
 		BUG_ON(!pte_none(*pte));
+
+#ifdef CONFIG_HUGETLB_PAGE
+		size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift);
+		if (size != PAGE_SIZE) {
+			pte_t entry = pfn_pte(pfn, prot);
+
+			entry = pte_mkhuge(entry);
+			entry = arch_make_huge_pte(entry, ilog2(size), 0);
+			set_huge_pte_at(&init_mm, addr, pte, entry);
+			pfn += PFN_DOWN(size);
+			continue;
+		}
+#endif
 		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
 		pfn++;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+	} while (pte += PFN_DOWN(size), addr += size, addr != end);
 	*mask |= PGTBL_PTE_MODIFIED;
 	return 0;
 }
@@ -145,7 +160,7 @@ static int vmap_pmd_range(pud_t *pud, un
 			continue;
 		}
 
-		if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+		if (vmap_pte_range(pmd, addr, next, phys_addr, prot, max_page_shift, mask))
 			return -ENOMEM;
 	} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 020/192] mm/vmalloc: enable mapping of huge pages at pte level in vmalloc
  2021-07-01  1:46 incoming Andrew Morton
                   ` (18 preceding siblings ...)
  2021-07-01  1:48 ` [patch 019/192] mm/vmalloc: enable mapping of huge pages at pte level in vmap Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:48 ` [patch 021/192] powerpc/8xx: add support for huge pages on VMAP and VMALLOC Andrew Morton
                   ` (172 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, linux-mm, mike.kravetz, mm-commits,
	mpe, npiggin, paulus, rppt, torvalds, uladzislau.rezki

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: mm/vmalloc: enable mapping of huge pages at pte level in vmalloc

On some architectures like powerpc, there are huge pages that are mapped
at pte level.

Enable it in vmalloc.

For that, architectures can provide arch_vmap_pte_supported_shift() that
returns the shift for pages to map at pte level.
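
The effect in __vmalloc_node_range() (taken from the hunk below): when
a PMD-sized mapping is not possible, the arch hook picks a smaller huge
shift, and the allocation's alignment and size follow from it:

	if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
		shift = PMD_SHIFT;
	else
		shift = arch_vmap_pte_supported_shift(size_per_node);

	align = max(real_align, 1UL << shift);
	size = ALIGN(real_size, 1UL << shift);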

Link: https://lkml.kernel.org/r/2c717e3b1fba1894d890feb7669f83025bfa314d.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/vmalloc.h |    7 +++++++
 mm/vmalloc.c            |   13 +++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

--- a/include/linux/vmalloc.h~mm-vmalloc-enable-mapping-of-huge-pages-at-pte-level-in-vmalloc
+++ a/include/linux/vmalloc.h
@@ -112,6 +112,13 @@ static inline unsigned long arch_vmap_pt
 }
 #endif
 
+#ifndef arch_vmap_pte_supported_shift
+static inline int arch_vmap_pte_supported_shift(unsigned long size)
+{
+	return PAGE_SHIFT;
+}
+#endif
+
 /*
  *	Highlevel APIs for driver use
  */
--- a/mm/vmalloc.c~mm-vmalloc-enable-mapping-of-huge-pages-at-pte-level-in-vmalloc
+++ a/mm/vmalloc.c
@@ -2927,8 +2927,7 @@ void *__vmalloc_node_range(unsigned long
 		return NULL;
 	}
 
-	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP) &&
-			arch_vmap_pmd_supported(prot)) {
+	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
 		unsigned long size_per_node;
 
 		/*
@@ -2941,11 +2940,13 @@ void *__vmalloc_node_range(unsigned long
 		size_per_node = size;
 		if (node == NUMA_NO_NODE)
 			size_per_node /= num_online_nodes();
-		if (size_per_node >= PMD_SIZE) {
+		if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
 			shift = PMD_SHIFT;
-			align = max(real_align, 1UL << shift);
-			size = ALIGN(real_size, 1UL << shift);
-		}
+		else
+			shift = arch_vmap_pte_supported_shift(size_per_node);
+
+		align = max(real_align, 1UL << shift);
+		size = ALIGN(real_size, 1UL << shift);
 	}
 
 again:
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 021/192] powerpc/8xx: add support for huge pages on VMAP and VMALLOC
  2021-07-01  1:46 incoming Andrew Morton
                   ` (19 preceding siblings ...)
  2021-07-01  1:48 ` [patch 020/192] mm/vmalloc: enable mapping of huge pages at pte level in vmalloc Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:48 ` [patch 022/192] khugepaged: selftests: remove debug_cow Andrew Morton
                   ` (171 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, linux-mm, mike.kravetz, mm-commits,
	mpe, npiggin, paulus, rppt, torvalds, uladzislau.rezki

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: powerpc/8xx: add support for huge pages on VMAP and VMALLOC

powerpc 8xx has 4 page sizes:
- 4k
- 16k
- 512k
- 8M

At present, vmalloc and vmap only support huge pages which are leaf
at PMD level.

Here the PMD level is 4M; it doesn't correspond to any supported page
size.

For now, implement the use of 16k and 512k pages, which is done at PTE
level.

Support for 8M pages will be implemented later; it requires vmalloc to
support hugepd tables.
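
A worked example of the shift selection implemented below (the values
follow from 2^14 = 16K and 2^19 = 512K):

	arch_vmap_pte_supported_shift(SZ_1M);	/* -> 19: use 512K pages */
	arch_vmap_pte_supported_shift(SZ_64K);	/* -> 14: use 16K pages  */
	arch_vmap_pte_supported_shift(SZ_8K);	/* -> PAGE_SHIFT         */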

Link: https://lkml.kernel.org/r/8b972f1c03fb6bd59953035f0a3e4d26659de4f8.1620795204.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/Kconfig                         |    2 
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h |   43 +++++++++++++++++
 2 files changed, 44 insertions(+), 1 deletion(-)

--- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h~powerpc-8xx-add-support-for-huge-pages-on-vmap-and-vmalloc
+++ a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
@@ -178,6 +178,7 @@
 #ifndef __ASSEMBLY__
 
 #include <linux/mmdebug.h>
+#include <linux/sizes.h>
 
 void mmu_pin_tlb(unsigned long top, bool readonly);
 
@@ -225,6 +226,48 @@ static inline unsigned int mmu_psize_to_
 	BUG();
 }
 
+static inline bool arch_vmap_try_size(unsigned long addr, unsigned long end, u64 pfn,
+				      unsigned int max_page_shift, unsigned long size)
+{
+	if (end - addr < size)
+		return false;
+
+	if ((1UL << max_page_shift) < size)
+		return false;
+
+	if (!IS_ALIGNED(addr, size))
+		return false;
+
+	if (!IS_ALIGNED(PFN_PHYS(pfn), size))
+		return false;
+
+	return true;
+}
+
+static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, unsigned long end,
+							 u64 pfn, unsigned int max_page_shift)
+{
+	if (arch_vmap_try_size(addr, end, pfn, max_page_shift, SZ_512K))
+		return SZ_512K;
+	if (PAGE_SIZE == SZ_16K)
+		return SZ_16K;
+	if (arch_vmap_try_size(addr, end, pfn, max_page_shift, SZ_16K))
+		return SZ_16K;
+	return PAGE_SIZE;
+}
+#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
+
+static inline int arch_vmap_pte_supported_shift(unsigned long size)
+{
+	if (size >= SZ_512K)
+		return 19;
+	else if (size >= SZ_16K)
+		return 14;
+	else
+		return PAGE_SHIFT;
+}
+#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
+
 /* patch sites */
 extern s32 patch__itlbmiss_exit_1, patch__dtlbmiss_exit_1;
 extern s32 patch__itlbmiss_perf, patch__dtlbmiss_perf;
--- a/arch/powerpc/Kconfig~powerpc-8xx-add-support-for-huge-pages-on-vmap-and-vmalloc
+++ a/arch/powerpc/Kconfig
@@ -187,7 +187,7 @@ config PPC
 	select GENERIC_VDSO_TIME_NS
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_HUGE_VMALLOC		if HAVE_ARCH_HUGE_VMAP
-	select HAVE_ARCH_HUGE_VMAP		if PPC_BOOK3S_64 && PPC_RADIX_MMU
+	select HAVE_ARCH_HUGE_VMAP		if PPC_RADIX_MMU || PPC_8xx
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_JUMP_LABEL_RELATIVE
 	select HAVE_ARCH_KASAN			if PPC32 && PPC_PAGE_SHIFT <= 14
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 022/192] khugepaged: selftests: remove debug_cow
  2021-07-01  1:46 incoming Andrew Morton
                   ` (20 preceding siblings ...)
  2021-07-01  1:48 ` [patch 021/192] powerpc/8xx: add support for huge pages on VMAP and VMALLOC Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:48 ` [patch 023/192] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY Andrew Morton
                   ` (170 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, kirill.shutemov, linux-mm, mm-commits, shuah, sunnanyong,
	torvalds, wangkefeng.wang, yang.shi, ziy

From: Nanyong Sun <sunnanyong@huawei.com>
Subject: khugepaged: selftests: remove debug_cow

The debug_cow attribute was removed by commit 4958e4d86ecb01 ("mm:
thp: remove debug_cow switch"), so remove it from the selftest code
too; otherwise the khugepaged test will fail.

Link: https://lkml.kernel.org/r/20210430051117.400189-1-sunnanyong@huawei.com
Fixes: 4958e4d86ecb01 ("mm: thp: remove debug_cow switch")
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/khugepaged.c |    4 ----
 1 file changed, 4 deletions(-)

--- a/tools/testing/selftests/vm/khugepaged.c~khugepaged-selftests-remove-debug_cow
+++ a/tools/testing/selftests/vm/khugepaged.c
@@ -86,7 +86,6 @@ struct settings {
 	enum thp_enabled thp_enabled;
 	enum thp_defrag thp_defrag;
 	enum shmem_enabled shmem_enabled;
-	bool debug_cow;
 	bool use_zero_page;
 	struct khugepaged_settings khugepaged;
 };
@@ -95,7 +94,6 @@ static struct settings default_settings
 	.thp_enabled = THP_MADVISE,
 	.thp_defrag = THP_DEFRAG_ALWAYS,
 	.shmem_enabled = SHMEM_NEVER,
-	.debug_cow = 0,
 	.use_zero_page = 0,
 	.khugepaged = {
 		.defrag = 1,
@@ -268,7 +266,6 @@ static void write_settings(struct settin
 	write_string("defrag", thp_defrag_strings[settings->thp_defrag]);
 	write_string("shmem_enabled",
 			shmem_enabled_strings[settings->shmem_enabled]);
-	write_num("debug_cow", settings->debug_cow);
 	write_num("use_zero_page", settings->use_zero_page);
 
 	write_num("khugepaged/defrag", khugepaged->defrag);
@@ -304,7 +301,6 @@ static void save_settings(void)
 		.thp_defrag = read_string("defrag", thp_defrag_strings),
 		.shmem_enabled =
 			read_string("shmem_enabled", shmem_enabled_strings),
-		.debug_cow = read_num("debug_cow"),
 		.use_zero_page = read_num("use_zero_page"),
 	};
 	saved_settings.khugepaged = (struct khugepaged_settings) {
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 023/192] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
  2021-07-01  1:46 incoming Andrew Morton
                   ` (21 preceding siblings ...)
  2021-07-01  1:48 ` [patch 022/192] khugepaged: selftests: remove debug_cow Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-12 14:48   ` Matthew Wilcox
  2021-07-01  1:48 ` [patch 024/192] mm: sparsemem: split the huge PMD mapping of vmemmap pages Andrew Morton
                   ` (169 subsequent siblings)
  192 siblings, 1 reply; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, almasrymina, axelrasmussen, linux-mm, mike.kravetz,
	mm-commits, peterx, torvalds, yuehaibing

From: Mina Almasry <almasrymina@google.com>
Subject: mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY

On UFFDIO_COPY, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we will drop the mutex and return to the caller
after allocating a page that consumed a reservation.  In this window a
fault may double-consume the reservation.  To handle this, we free the
allocated page, fix the reservations, and allocate a temporary hugetlb
page and return that to the caller.  When the caller does the copy
outside of the lock, we check the page cache again, allocate a page
consuming the reservation, and copy over the contents.
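
In rough outline, the resulting flow is (a condensed sketch of the
hunks below, not the verbatim code):

	/* first attempt, under the hugetlb_fault_mutex: */
	page = alloc_huge_page(dst_vma, dst_addr, 0);	/* may consume a reservation */
	if (copy from userspace fails) {
		restore_reserve_on_error(h, dst_vma, dst_addr, page);
		put_page(page);			/* give the reservation back */
		page = alloc_huge_page_vma(h, dst_vma, dst_addr); /* temporary page */
		*pagep = page;
		return -ENOENT;		/* caller copies outside the lock */
	}

	/* retry, after the caller filled *pagep outside the lock: */
	page = alloc_huge_page(dst_vma, dst_addr, 0);	/* consume a reservation again */
	copy_huge_page(page, *pagep);
	put_page(*pagep);
	*pagep = NULL;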

Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning and the copy_huge_page_from_user() always fails, then:

./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings.  After the test runs, the
number of free/resv hugepages is correct.

[yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
  Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
[almasrymina@google.com: fix allocation error check and copy func name]
  Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com
Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/migrate.h |    4 +++
 mm/hugetlb.c            |   48 +++++++++++++++++++++++++++++-------
 mm/migrate.c            |    2 -
 mm/userfaultfd.c        |   50 --------------------------------------
 4 files changed, 45 insertions(+), 59 deletions(-)

--- a/include/linux/migrate.h~mm-hugetlb-fix-racy-resv_huge_pages-underflow-on-uffdio_copy
+++ a/include/linux/migrate.h
@@ -51,6 +51,7 @@ extern int migrate_huge_page_move_mappin
 				  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page, int extra_count);
+extern void copy_huge_page(struct page *dst, struct page *src);
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
@@ -77,6 +78,9 @@ static inline int migrate_huge_page_move
 	return -ENOSYS;
 }
 
+static inline void copy_huge_page(struct page *dst, struct page *src)
+{
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
--- a/mm/hugetlb.c~mm-hugetlb-fix-racy-resv_huge_pages-underflow-on-uffdio_copy
+++ a/mm/hugetlb.c
@@ -30,6 +30,7 @@
 #include <linux/numa.h>
 #include <linux/llist.h>
 #include <linux/cma.h>
+#include <linux/migrate.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -5076,20 +5077,17 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
 			    struct page **pagep)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
-	struct address_space *mapping;
-	pgoff_t idx;
+	struct hstate *h = hstate_vma(dst_vma);
+	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
-	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
-	int ret;
+	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
 
-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
-
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, idx);
@@ -5118,12 +5116,44 @@ int hugetlb_mcopy_atomic_pte(struct mm_s
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
 			ret = -ENOENT;
+			/* Free the allocated page which may have
+			 * consumed a reservation.
+			 */
+			restore_reserve_on_error(h, dst_vma, dst_addr, page);
+			put_page(page);
+
+			/* Allocate a temporary page to hold the copied
+			 * contents.
+			 */
+			page = alloc_huge_page_vma(h, dst_vma, dst_addr);
+			if (!page) {
+				ret = -ENOMEM;
+				goto out;
+			}
 			*pagep = page;
-			/* don't free the page */
+			/* Set the outparam pagep and return to the caller to
+			 * copy the contents outside the lock. Don't free the
+			 * page.
+			 */
 			goto out;
 		}
 	} else {
-		page = *pagep;
+		if (vm_shared &&
+		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			put_page(*pagep);
+			ret = -EEXIST;
+			*pagep = NULL;
+			goto out;
+		}
+
+		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		if (IS_ERR(page)) {
+			ret = -ENOMEM;
+			*pagep = NULL;
+			goto out;
+		}
+		copy_huge_page(page, *pagep);
+		put_page(*pagep);
 		*pagep = NULL;
 	}
 
--- a/mm/migrate.c~mm-hugetlb-fix-racy-resv_huge_pages-underflow-on-uffdio_copy
+++ a/mm/migrate.c
@@ -553,7 +553,7 @@ static void __copy_gigantic_page(struct
 	}
 }
 
-static void copy_huge_page(struct page *dst, struct page *src)
+void copy_huge_page(struct page *dst, struct page *src)
 {
 	int i;
 	int nr_pages;
--- a/mm/userfaultfd.c~mm-hugetlb-fix-racy-resv_huge_pages-underflow-on-uffdio_copy
+++ a/mm/userfaultfd.c
@@ -209,7 +209,6 @@ static __always_inline ssize_t __mcopy_a
 					      unsigned long len,
 					      enum mcopy_atomic_mode mode)
 {
-	int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
 	ssize_t err;
 	pte_t *dst_pte;
@@ -308,7 +307,6 @@ retry:
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		i_mmap_unlock_read(mapping);
-		vm_alloc_shared = vm_shared;
 
 		cond_resched();
 
@@ -346,54 +344,8 @@ retry:
 out_unlock:
 	mmap_read_unlock(dst_mm);
 out:
-	if (page) {
-		/*
-		 * We encountered an error and are about to free a newly
-		 * allocated huge page.
-		 *
-		 * Reservation handling is very subtle, and is different for
-		 * private and shared mappings.  See the routine
-		 * restore_reserve_on_error for details.  Unfortunately, we
-		 * can not call restore_reserve_on_error now as it would
-		 * require holding mmap_lock.
-		 *
-		 * If a reservation for the page existed in the reservation
-		 * map of a private mapping, the map was modified to indicate
-		 * the reservation was consumed when the page was allocated.
-		 * We clear the HPageRestoreReserve flag now so that the global
-		 * reserve count will not be incremented in free_huge_page.
-		 * The reservation map will still indicate the reservation
-		 * was consumed and possibly prevent later page allocation.
-		 * This is better than leaking a global reservation.  If no
-		 * reservation existed, it is still safe to clear
-		 * HPageRestoreReserve as no adjustments to reservation counts
-		 * were made during allocation.
-		 *
-		 * The reservation map for shared mappings indicates which
-		 * pages have reservations.  When a huge page is allocated
-		 * for an address with a reservation, no change is made to
-		 * the reserve map.  In this case HPageRestoreReserve will be
-		 * set to indicate that the global reservation count should be
-		 * incremented when the page is freed.  This is the desired
-		 * behavior.  However, when a huge page is allocated for an
-		 * address without a reservation a reservation entry is added
-		 * to the reservation map, and HPageRestoreReserve will not be
-		 * set. When the page is freed, the global reserve count will
-		 * NOT be incremented and it will appear as though we have
-		 * leaked reserved page.  In this case, set HPageRestoreReserve
-		 * so that the global reserve count will be incremented to
-		 * match the reservation map entry which was created.
-		 *
-		 * Note that vm_alloc_shared is based on the flags of the vma
-		 * for which the page was originally allocated.  dst_vma could
-		 * be different or NULL on error.
-		 */
-		if (vm_alloc_shared)
-			SetHPageRestoreReserve(page);
-		else
-			ClearHPageRestoreReserve(page);
+	if (page)
 		put_page(page);
-	}
 	BUG_ON(copied < 0);
 	BUG_ON(err > 0);
 	BUG_ON(!copied && !err);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 024/192] mm: sparsemem: split the huge PMD mapping of vmemmap pages
  2021-07-01  1:46 incoming Andrew Morton
                   ` (22 preceding siblings ...)
  2021-07-01  1:48 ` [patch 023/192] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:48 ` [patch 025/192] mm: sparsemem: use huge PMD mapping for " Andrew Morton
                   ` (168 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, chenhuang5, corbet, david, duanxiongchun, linux-mm, mhocko,
	mike.kravetz, mm-commits, osalvador, songmuchun, torvalds

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: sparsemem: split the huge PMD mapping of vmemmap pages

Patch series "Split huge PMD mapping of vmemmap pages", v4.

In order to reduce the difficulty of code review in series [1], we
disabled huge PMD mapping of vmemmap pages when that feature was
enabled.  In this series, we no longer disable huge PMD mapping of
vmemmap pages; instead, we split the huge PMD mapping when needed.
When HugeTLB pages are freed from the pool, we do not attempt to
coalesce and move back to a PMD mapping because it is much more
complex.

[1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/


This patch (of 3):

In [1], PMD mappings of vmemmap pages were disabled if the feature
hugetlb_free_vmemmap was enabled.  This was done to simplify the initial
implementation of vmemmap freeing for hugetlb pages.  Now, remove this
simplification by allowing PMD mapping and switching to PTE mappings as
needed for allocated hugetlb pages.

When a hugetlb page is allocated, the vmemmap page tables are walked to
free vmemmap pages.  During this walk, split huge PMD mappings to PTE
mappings as required.  In the unlikely case PTE pages cannot be
allocated, return an error (ENOMEM) and do not optimize the vmemmap of
the hugetlb page.

When HugeTLB pages are freed from the pool, we do not attempt to
coalesce and move back to a PMD mapping because it is much more complex.

[1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
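
A condensed sketch of the error handling added to vmemmap_remap_free()
below (not the verbatim code): if the walk fails after some ptes have
already been remapped, only that prefix is walked again to restore the
original pages before the error is reported:

	ret = vmemmap_remap_range(reuse, end, &walk);
	if (ret && walk.nr_walked) {
		/* restore the nr_walked ptes remapped before the failure */
		end = reuse + walk.nr_walked * PAGE_SIZE;
		walk.remap_pte = vmemmap_restore_pte;
		vmemmap_remap_range(reuse, end, &walk);
	}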

Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h   |    4 -
 mm/hugetlb_vmemmap.c |    5 -
 mm/sparse-vmemmap.c  |  163 +++++++++++++++++++++++++++++++----------
 3 files changed, 129 insertions(+), 43 deletions(-)

--- a/include/linux/mm.h~mm-sparsemem-split-the-huge-pmd-mapping-of-vmemmap-pages
+++ a/include/linux/mm.h
@@ -3076,8 +3076,8 @@ static inline void print_vma_addr(char *
 }
 #endif
 
-void vmemmap_remap_free(unsigned long start, unsigned long end,
-			unsigned long reuse);
+int vmemmap_remap_free(unsigned long start, unsigned long end,
+		       unsigned long reuse);
 int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 			unsigned long reuse, gfp_t gfp_mask);
 
--- a/mm/hugetlb_vmemmap.c~mm-sparsemem-split-the-huge-pmd-mapping-of-vmemmap-pages
+++ a/mm/hugetlb_vmemmap.c
@@ -258,9 +258,8 @@ void free_huge_page_vmemmap(struct hstat
 	 * to the page which @vmemmap_reuse is mapped to, then free the pages
 	 * which the range [@vmemmap_addr, @vmemmap_end] is mapped to.
 	 */
-	vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
-
-	SetHPageVmemmapOptimized(head);
+	if (!vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse))
+		SetHPageVmemmapOptimized(head);
 }
 
 void __init hugetlb_vmemmap_init(struct hstate *h)
--- a/mm/sparse-vmemmap.c~mm-sparsemem-split-the-huge-pmd-mapping-of-vmemmap-pages
+++ a/mm/sparse-vmemmap.c
@@ -38,6 +38,7 @@
  * struct vmemmap_remap_walk - walk vmemmap page table
  *
  * @remap_pte:		called for each lowest-level entry (PTE).
+ * @nr_walked:		the number of walked pte.
  * @reuse_page:		the page which is reused for the tail vmemmap pages.
  * @reuse_addr:		the virtual address of the @reuse_page page.
  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
@@ -46,11 +47,44 @@
 struct vmemmap_remap_walk {
 	void (*remap_pte)(pte_t *pte, unsigned long addr,
 			  struct vmemmap_remap_walk *walk);
+	unsigned long nr_walked;
 	struct page *reuse_page;
 	unsigned long reuse_addr;
 	struct list_head *vmemmap_pages;
 };
 
+static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
+				  struct vmemmap_remap_walk *walk)
+{
+	pmd_t __pmd;
+	int i;
+	unsigned long addr = start;
+	struct page *page = pmd_page(*pmd);
+	pte_t *pgtable = pte_alloc_one_kernel(&init_mm);
+
+	if (!pgtable)
+		return -ENOMEM;
+
+	pmd_populate_kernel(&init_mm, &__pmd, pgtable);
+
+	for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) {
+		pte_t entry, *pte;
+		pgprot_t pgprot = PAGE_KERNEL;
+
+		entry = mk_pte(page + i, pgprot);
+		pte = pte_offset_kernel(&__pmd, addr);
+		set_pte_at(&init_mm, addr, pte, entry);
+	}
+
+	/* Make pte visible before pmd. See comment in __pte_alloc(). */
+	smp_wmb();
+	pmd_populate_kernel(&init_mm, pmd, pgtable);
+
+	flush_tlb_kernel_range(start, start + PMD_SIZE);
+
+	return 0;
+}
+
 static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
 			      unsigned long end,
 			      struct vmemmap_remap_walk *walk)
@@ -69,58 +103,80 @@ static void vmemmap_pte_range(pmd_t *pmd
 		 */
 		addr += PAGE_SIZE;
 		pte++;
+		walk->nr_walked++;
 	}
 
-	for (; addr != end; addr += PAGE_SIZE, pte++)
+	for (; addr != end; addr += PAGE_SIZE, pte++) {
 		walk->remap_pte(pte, addr, walk);
+		walk->nr_walked++;
+	}
 }
 
-static void vmemmap_pmd_range(pud_t *pud, unsigned long addr,
-			      unsigned long end,
-			      struct vmemmap_remap_walk *walk)
+static int vmemmap_pmd_range(pud_t *pud, unsigned long addr,
+			     unsigned long end,
+			     struct vmemmap_remap_walk *walk)
 {
 	pmd_t *pmd;
 	unsigned long next;
 
 	pmd = pmd_offset(pud, addr);
 	do {
-		BUG_ON(pmd_leaf(*pmd));
+		if (pmd_leaf(*pmd)) {
+			int ret;
 
+			ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, walk);
+			if (ret)
+				return ret;
+		}
 		next = pmd_addr_end(addr, end);
 		vmemmap_pte_range(pmd, addr, next, walk);
 	} while (pmd++, addr = next, addr != end);
+
+	return 0;
 }
 
-static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
-			      unsigned long end,
-			      struct vmemmap_remap_walk *walk)
+static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr,
+			     unsigned long end,
+			     struct vmemmap_remap_walk *walk)
 {
 	pud_t *pud;
 	unsigned long next;
 
 	pud = pud_offset(p4d, addr);
 	do {
+		int ret;
+
 		next = pud_addr_end(addr, end);
-		vmemmap_pmd_range(pud, addr, next, walk);
+		ret = vmemmap_pmd_range(pud, addr, next, walk);
+		if (ret)
+			return ret;
 	} while (pud++, addr = next, addr != end);
+
+	return 0;
 }
 
-static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
-			      unsigned long end,
-			      struct vmemmap_remap_walk *walk)
+static int vmemmap_p4d_range(pgd_t *pgd, unsigned long addr,
+			     unsigned long end,
+			     struct vmemmap_remap_walk *walk)
 {
 	p4d_t *p4d;
 	unsigned long next;
 
 	p4d = p4d_offset(pgd, addr);
 	do {
+		int ret;
+
 		next = p4d_addr_end(addr, end);
-		vmemmap_pud_range(p4d, addr, next, walk);
+		ret = vmemmap_pud_range(p4d, addr, next, walk);
+		if (ret)
+			return ret;
 	} while (p4d++, addr = next, addr != end);
+
+	return 0;
 }
 
-static void vmemmap_remap_range(unsigned long start, unsigned long end,
-				struct vmemmap_remap_walk *walk)
+static int vmemmap_remap_range(unsigned long start, unsigned long end,
+			       struct vmemmap_remap_walk *walk)
 {
 	unsigned long addr = start;
 	unsigned long next;
@@ -131,8 +187,12 @@ static void vmemmap_remap_range(unsigned
 
 	pgd = pgd_offset_k(addr);
 	do {
+		int ret;
+
 		next = pgd_addr_end(addr, end);
-		vmemmap_p4d_range(pgd, addr, next, walk);
+		ret = vmemmap_p4d_range(pgd, addr, next, walk);
+		if (ret)
+			return ret;
 	} while (pgd++, addr = next, addr != end);
 
 	/*
@@ -141,6 +201,8 @@ static void vmemmap_remap_range(unsigned
 	 * belongs to the range.
 	 */
 	flush_tlb_kernel_range(start + PAGE_SIZE, end);
+
+	return 0;
 }
 
 /*
@@ -179,10 +241,27 @@ static void vmemmap_remap_pte(pte_t *pte
 	pte_t entry = mk_pte(walk->reuse_page, pgprot);
 	struct page *page = pte_page(*pte);
 
-	list_add(&page->lru, walk->vmemmap_pages);
+	list_add_tail(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
 
+static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
+				struct vmemmap_remap_walk *walk)
+{
+	pgprot_t pgprot = PAGE_KERNEL;
+	struct page *page;
+	void *to;
+
+	BUG_ON(pte_page(*pte) != walk->reuse_page);
+
+	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
+	list_del(&page->lru);
+	to = page_to_virt(page);
+	copy_page(to, (void *)walk->reuse_addr);
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+}
+
 /**
  * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
  *			to the page which @reuse is mapped to, then free vmemmap
@@ -193,12 +272,12 @@ static void vmemmap_remap_pte(pte_t *pte
  *		remap.
  * @reuse:	reuse address.
  *
- * Note: This function depends on vmemmap being base page mapped. Please make
- * sure that we disable PMD mapping of vmemmap pages when calling this function.
+ * Return: %0 on success, negative error code otherwise.
  */
-void vmemmap_remap_free(unsigned long start, unsigned long end,
-			unsigned long reuse)
+int vmemmap_remap_free(unsigned long start, unsigned long end,
+		       unsigned long reuse)
 {
+	int ret;
 	LIST_HEAD(vmemmap_pages);
 	struct vmemmap_remap_walk walk = {
 		.remap_pte	= vmemmap_remap_pte,
@@ -221,25 +300,31 @@ void vmemmap_remap_free(unsigned long st
 	 */
 	BUG_ON(start - reuse != PAGE_SIZE);
 
-	vmemmap_remap_range(reuse, end, &walk);
-	free_vmemmap_page_list(&vmemmap_pages);
-}
+	mmap_write_lock(&init_mm);
+	ret = vmemmap_remap_range(reuse, end, &walk);
+	mmap_write_downgrade(&init_mm);
 
-static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
-				struct vmemmap_remap_walk *walk)
-{
-	pgprot_t pgprot = PAGE_KERNEL;
-	struct page *page;
-	void *to;
+	if (ret && walk.nr_walked) {
+		end = reuse + walk.nr_walked * PAGE_SIZE;
+		/*
+		 * vmemmap_pages contains pages from the previous
+		 * vmemmap_remap_range call which failed.  These
+		 * are pages which were removed from the vmemmap.
+		 * They will be restored in the following call.
+		 */
+		walk = (struct vmemmap_remap_walk) {
+			.remap_pte	= vmemmap_restore_pte,
+			.reuse_addr	= reuse,
+			.vmemmap_pages	= &vmemmap_pages,
+		};
 
-	BUG_ON(pte_page(*pte) != walk->reuse_page);
+		vmemmap_remap_range(reuse, end, &walk);
+	}
+	mmap_read_unlock(&init_mm);
 
-	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
-	list_del(&page->lru);
-	to = page_to_virt(page);
-	copy_page(to, (void *)walk->reuse_addr);
+	free_vmemmap_page_list(&vmemmap_pages);
 
-	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
+	return ret;
 }
 
 static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
@@ -273,6 +358,8 @@ out:
  *		remap.
  * @reuse:	reuse address.
  * @gfp_mask:	GFP flag for allocating vmemmap pages.
+ *
+ * Return: %0 on success, negative error code otherwise.
  */
 int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 			unsigned long reuse, gfp_t gfp_mask)
@@ -287,12 +374,12 @@ int vmemmap_remap_alloc(unsigned long st
 	/* See the comment in the vmemmap_remap_free(). */
 	BUG_ON(start - reuse != PAGE_SIZE);
 
-	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
-
 	if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
 		return -ENOMEM;
 
+	mmap_read_lock(&init_mm);
 	vmemmap_remap_range(reuse, end, &walk);
+	mmap_read_unlock(&init_mm);
 
 	return 0;
 }
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 025/192] mm: sparsemem: use huge PMD mapping for vmemmap pages
  2021-07-01  1:46 incoming Andrew Morton
                   ` (23 preceding siblings ...)
  2021-07-01  1:48 ` [patch 024/192] mm: sparsemem: split the huge PMD mapping of vmemmap pages Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:48 ` [patch 026/192] mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON Andrew Morton
                   ` (167 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, chenhuang5, corbet, david, duanxiongchun, linux-mm, mhocko,
	mike.kravetz, mm-commits, osalvador, songmuchun, torvalds

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: sparsemem: use huge PMD mapping for vmemmap pages

The preparation for splitting the huge PMD mapping of vmemmap pages is
complete, so switch the mapping from PTE to PMD.

Link: https://lkml.kernel.org/r/20210616094915.34432-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/kernel-parameters.txt |    7 ---
 arch/x86/mm/init_64.c                           |    8 +---
 include/linux/hugetlb.h                         |   25 +++-----------
 mm/memory_hotplug.c                             |    2 -
 4 files changed, 9 insertions(+), 33 deletions(-)

--- a/arch/x86/mm/init_64.c~mm-sparsemem-use-huge-pmd-mapping-for-vmemmap-pages
+++ a/arch/x86/mm/init_64.c
@@ -34,7 +34,6 @@
 #include <linux/gfp.h>
 #include <linux/kcore.h>
 #include <linux/bootmem_info.h>
-#include <linux/hugetlb.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -1610,8 +1609,7 @@ int __meminit vmemmap_populate(unsigned
 	VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
 	VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
 
-	if ((is_hugetlb_free_vmemmap_enabled()  && !altmap) ||
-	    end - start < PAGES_PER_SECTION * sizeof(struct page))
+	if (end - start < PAGES_PER_SECTION * sizeof(struct page))
 		err = vmemmap_populate_basepages(start, end, node, NULL);
 	else if (boot_cpu_has(X86_FEATURE_PSE))
 		err = vmemmap_populate_hugepages(start, end, node, altmap);
@@ -1639,8 +1637,6 @@ void register_page_bootmem_memmap(unsign
 	pmd_t *pmd;
 	unsigned int nr_pmd_pages;
 	struct page *page;
-	bool base_mapping = !boot_cpu_has(X86_FEATURE_PSE) ||
-			    is_hugetlb_free_vmemmap_enabled();
 
 	for (; addr < end; addr = next) {
 		pte_t *pte = NULL;
@@ -1666,7 +1662,7 @@ void register_page_bootmem_memmap(unsign
 		}
 		get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
 
-		if (base_mapping) {
+		if (!boot_cpu_has(X86_FEATURE_PSE)) {
 			next = (addr + PAGE_SIZE) & PAGE_MASK;
 			pmd = pmd_offset(pud, addr);
 			if (pmd_none(*pmd))
--- a/Documentation/admin-guide/kernel-parameters.txt~mm-sparsemem-use-huge-pmd-mapping-for-vmemmap-pages
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -1572,13 +1572,6 @@
 			enabled.
 			Allows heavy hugetlb users to free up some more
 			memory (6 * PAGE_SIZE for each 2MB hugetlb page).
-			This feauture is not free though. Large page
-			tables are not used to back vmemmap pages which
-			can lead to a performance degradation for some
-			workloads. Also there will be memory allocation
-			required when hugetlb pages are freed from the
-			pool which can lead to corner cases under heavy
-			memory pressure.
 			Format: { on | off (default) }
 
 			on:  enable the feature
--- a/include/linux/hugetlb.h~mm-sparsemem-use-huge-pmd-mapping-for-vmemmap-pages
+++ a/include/linux/hugetlb.h
@@ -895,20 +895,6 @@ static inline void huge_ptep_modify_prot
 }
 #endif
 
-#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
-extern bool hugetlb_free_vmemmap_enabled;
-
-static inline bool is_hugetlb_free_vmemmap_enabled(void)
-{
-	return hugetlb_free_vmemmap_enabled;
-}
-#else
-static inline bool is_hugetlb_free_vmemmap_enabled(void)
-{
-	return false;
-}
-#endif
-
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
@@ -1063,13 +1049,14 @@ static inline void set_huge_swap_pte_at(
 					pte_t *ptep, pte_t pte, unsigned long sz)
 {
 }
-
-static inline bool is_hugetlb_free_vmemmap_enabled(void)
-{
-	return false;
-}
 #endif	/* CONFIG_HUGETLB_PAGE */
 
+#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+extern bool hugetlb_free_vmemmap_enabled;
+#else
+#define hugetlb_free_vmemmap_enabled	false
+#endif
+
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
 					struct mm_struct *mm, pte_t *pte)
 {
--- a/mm/memory_hotplug.c~mm-sparsemem-use-huge-pmd-mapping-for-vmemmap-pages
+++ a/mm/memory_hotplug.c
@@ -1056,7 +1056,7 @@ bool mhp_supports_memmap_on_memory(unsig
 	 *       populate a single PMD.
 	 */
 	return memmap_on_memory &&
-	       !is_hugetlb_free_vmemmap_enabled() &&
+	       !hugetlb_free_vmemmap_enabled &&
 	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
 	       size == memory_block_size_bytes() &&
 	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
_

* [patch 026/192] mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, chenhuang5, corbet, david, duanxiongchun, linux-mm, mhocko,
	mike.kravetz, mm-commits, osalvador, songmuchun, torvalds

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON

When using HUGETLB_PAGE_FREE_VMEMMAP, freeing the unused vmemmap pages
associated with each HugeTLB page is off by default.  Now that the
vmemmap is PMD mapped, there is no side effect when the feature is
enabled while no HugeTLB pages exist in the system.  Some users may want
to enable this feature at compile time instead of via the boot command
line, so add a config option that makes it default to on; it can still
be disabled on the command line via hugetlb_free_vmemmap=off.

Link: https://lkml.kernel.org/r/20210616094915.34432-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/kernel-parameters.txt |    3 +++
 fs/Kconfig                                      |   10 ++++++++++
 mm/hugetlb_vmemmap.c                            |    6 ++++--
 3 files changed, 17 insertions(+), 2 deletions(-)

--- a/Documentation/admin-guide/kernel-parameters.txt~mm-hugetlb-introduce-config_hugetlb_page_free_vmemmap_default_on
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -1577,6 +1577,9 @@
 			on:  enable the feature
 			off: disable the feature
 
+			Built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y,
+			the default is on.
+
 			This is not compatible with memory_hotplug.memmap_on_memory.
 			If both parameters are enabled, hugetlb_free_vmemmap takes
 			precedence over memory_hotplug.memmap_on_memory.
--- a/fs/Kconfig~mm-hugetlb-introduce-config_hugetlb_page_free_vmemmap_default_on
+++ a/fs/Kconfig
@@ -245,6 +245,16 @@ config HUGETLB_PAGE_FREE_VMEMMAP
 	depends on X86_64
 	depends on SPARSEMEM_VMEMMAP
 
+config HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON
+	bool "Default freeing vmemmap pages of HugeTLB to on"
+	default n
+	depends on HUGETLB_PAGE_FREE_VMEMMAP
+	help
+	  When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing unused vmemmap
+	  pages associated with each HugeTLB page is default off. Say Y here
+	  to enable freeing vmemmap pages of HugeTLB by default. It can then
+	  be disabled on the command line via hugetlb_free_vmemmap=off.
+
 config MEMFD_CREATE
 	def_bool TMPFS || HUGETLBFS
 
--- a/mm/hugetlb_vmemmap.c~mm-hugetlb-introduce-config_hugetlb_page_free_vmemmap_default_on
+++ a/mm/hugetlb_vmemmap.c
@@ -182,7 +182,7 @@
 #define RESERVE_VMEMMAP_NR		2U
 #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
 
-bool hugetlb_free_vmemmap_enabled;
+bool hugetlb_free_vmemmap_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
 
 static int __init early_hugetlb_free_vmemmap_param(char *buf)
 {
@@ -197,7 +197,9 @@ static int __init early_hugetlb_free_vme
 
 	if (!strcmp(buf, "on"))
 		hugetlb_free_vmemmap_enabled = true;
-	else if (strcmp(buf, "off"))
+	else if (!strcmp(buf, "off"))
+		hugetlb_free_vmemmap_enabled = false;
+	else
 		return -EINVAL;
 
 	return 0;
_

* [patch 027/192] hugetlb: remove prep_compound_huge_page cleanup
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: aarcange, akpm, jack, jannh, jhubbard, kirill, linux-mm, mhocko,
	mike.kravetz, mm-commits, songmuchun, torvalds, willy,
	youquan.song

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: remove prep_compound_huge_page cleanup

Patch series "Fix prep_compound_gigantic_page ref count adjustment".

These patches address the possible race between
prep_compound_gigantic_page and __page_cache_add_speculative as described
by Jann Horn in [1].

The first patch simply removes the unnecessary/obsolete helper routine
prep_compound_huge_page to make the actual fix a little simpler.

The second patch is the actual fix and has a detailed explanation in the
commit message.

This potential issue has existed for almost 10 years and I am unaware of
anyone actually hitting the race.  I did not cc stable, but would be happy
to squash the patches and send to stable if anyone thinks that is a good
idea.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/


This patch (of 2):

I could not think of a reliable way to recreate the issue for testing.
Rather, I 'simulated errors' to exercise all the error paths.

The routine prep_compound_huge_page is a simple wrapper to call either
prep_compound_gigantic_page or prep_compound_page.  However, it is only
called from gather_bootmem_prealloc which only processes gigantic pages. 
Eliminate the routine and call prep_compound_gigantic_page directly.

Link: https://lkml.kernel.org/r/20210622021423.154662-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20210622021423.154662-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Youquan Song <youquan.song@intel.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

--- a/mm/hugetlb.c~hugetlb-remove-prep_compound_huge_page-cleanup
+++ a/mm/hugetlb.c
@@ -1320,8 +1320,6 @@ static struct page *alloc_gigantic_page(
 	return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
 }
 
-static void prep_new_huge_page(struct hstate *h, struct page *page, int nid);
-static void prep_compound_gigantic_page(struct page *page, unsigned int order);
 #else /* !CONFIG_CONTIG_ALLOC */
 static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 					int nid, nodemask_t *nodemask)
@@ -2759,16 +2757,10 @@ found:
 	return 1;
 }
 
-static void __init prep_compound_huge_page(struct page *page,
-		unsigned int order)
-{
-	if (unlikely(order > (MAX_ORDER - 1)))
-		prep_compound_gigantic_page(page, order);
-	else
-		prep_compound_page(page, order);
-}
-
-/* Put bootmem huge pages into the standard lists after mem_map is up */
+/*
+ * Put bootmem huge pages into the standard lists after mem_map is up.
+ * Note: This only applies to gigantic (order > MAX_ORDER) pages.
+ */
 static void __init gather_bootmem_prealloc(void)
 {
 	struct huge_bootmem_page *m;
@@ -2777,20 +2769,19 @@ static void __init gather_bootmem_preall
 		struct page *page = virt_to_page(m);
 		struct hstate *h = m->hstate;
 
+		VM_BUG_ON(!hstate_is_gigantic(h));
 		WARN_ON(page_count(page) != 1);
-		prep_compound_huge_page(page, huge_page_order(h));
+		prep_compound_gigantic_page(page, huge_page_order(h));
 		WARN_ON(PageReserved(page));
 		prep_new_huge_page(h, page, page_to_nid(page));
 		put_page(page); /* free it into the hugepage allocator */
 
 		/*
-		 * If we had gigantic hugepages allocated at boot time, we need
-		 * to restore the 'stolen' pages to totalram_pages in order to
-		 * fix confusing memory reports from free(1) and another
-		 * side-effects, like CommitLimit going negative.
+		 * We need to restore the 'stolen' pages to totalram_pages
+		 * in order to fix confusing memory reports from free(1) and
+		 * other side-effects, like CommitLimit going negative.
 		 */
-		if (hstate_is_gigantic(h))
-			adjust_managed_page_count(page, pages_per_huge_page(h));
+		adjust_managed_page_count(page, pages_per_huge_page(h));
 		cond_resched();
 	}
 }
_

* [patch 028/192] hugetlb: address ref count racing in prep_compound_gigantic_page
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: aarcange, akpm, jack, jannh, jhubbard, kirill, linux-mm, mhocko,
	mike.kravetz, mm-commits, songmuchun, torvalds, willy,
	youquan.song

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: address ref count racing in prep_compound_gigantic_page

In [1], Jann Horn points out a possible race between
prep_compound_gigantic_page and __page_cache_add_speculative.  The root
cause of the possible race is prep_compound_gigantic_page unconditionally
setting the ref count of pages to zero.  It does this because
prep_compound_gigantic_page is handed a 'group' of pages from an allocator
and needs to convert that group of pages to a compound page.  The ref
count of each page in this 'group' is one as set by the allocator. 
However, the ref count of compound page tail pages must be zero.

The potential race comes about when ref counted pages are returned from
the allocator.  When this happens, other mm code could also take a
reference on the page.  __page_cache_add_speculative is one such example. 
Therefore, prep_compound_gigantic_page can not just set the ref count of
pages to zero as it does today.  Doing so would lose the reference taken
by any other code.  This would lead to BUGs in code checking ref counts
and could possibly even lead to memory corruption.

There are two possible ways to address this issue.

1) Make all allocators of gigantic groups of pages be able to return a
   properly constructed compound page.

2) Make prep_compound_gigantic_page be more careful when constructing a
   compound page.

This patch takes approach 2.

In prep_compound_gigantic_page, use cmpxchg to set the ref count to zero
only if it is one.  If the cmpxchg fails, call synchronize_rcu() in the
hope that the extra ref count will be dropped during an RCU grace period.
This is not a performance-critical code path and the wait should be
acceptable.  If the ref count is still inflated after the grace period,
then undo any modifications made and return an error.
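
For reference, the freeze used here is just a cmpxchg on the page ref
count - roughly what include/linux/page_ref.h provides:

	/* Succeeds (and sets _refcount to 0) only when no extra refs exist. */
	static inline int page_ref_freeze(struct page *page, int count)
	{
		return likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
	}

so a speculative reference taken by __page_cache_add_speculative() makes
the freeze fail instead of being silently overwritten.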

Currently prep_compound_gigantic_page is type void and does not return
errors.  Modify the two callers to check for and handle error returns.  On
error, the caller must free the 'group' of pages as they can not be used
to form a gigantic page.  After freeing pages, the runtime caller
(alloc_fresh_huge_page) will retry the allocation once.  Boot time
allocations can not be retried.

The routine prep_compound_page also unconditionally sets the ref count of
compound page tail pages to zero.  However, in this case the buddy
allocator is constructing a compound page from freshly allocated pages. 
The ref count on those freshly allocated pages is already zero, so the
set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
remove it.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
Fixes: 58a84aa92723 ("thp: set compound tail page _count to zero")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Youquan Song <youquan.song@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c    |   72 ++++++++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c |    1 
 2 files changed, 64 insertions(+), 9 deletions(-)

--- a/mm/hugetlb.c~hugetlb-address-ref-count-racing-in-prep_compound_gigantic_page
+++ a/mm/hugetlb.c
@@ -1623,9 +1623,9 @@ static void prep_new_huge_page(struct hs
 	spin_unlock_irq(&hugetlb_lock);
 }
 
-static void prep_compound_gigantic_page(struct page *page, unsigned int order)
+static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
 {
-	int i;
+	int i, j;
 	int nr_pages = 1 << order;
 	struct page *p = page + 1;
 
@@ -1647,11 +1647,48 @@ static void prep_compound_gigantic_page(
 		 * after get_user_pages().
 		 */
 		__ClearPageReserved(p);
+		/*
+		 * Subtle and very unlikely
+		 *
+		 * Gigantic 'page allocators' such as memblock or cma will
+		 * return a set of pages with each page ref counted.  We need
+		 * to turn this set of pages into a compound page with tail
+		 * page ref counts set to zero.  Code such as speculative page
+		 * cache adding could take a ref on a 'to be' tail page.
+		 * We need to respect any increased ref count, and only set
+		 * the ref count to zero if count is currently 1.  If count
+		 * is not 1, we call synchronize_rcu in the hope that a rcu
+		 * grace period will cause ref count to drop and then retry.
+		 * If count is still inflated on retry we return an error and
+		 * must discard the pages.
+		 */
+		if (!page_ref_freeze(p, 1)) {
+			pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n");
+			synchronize_rcu();
+			if (!page_ref_freeze(p, 1))
+				goto out_error;
+		}
 		set_page_count(p, 0);
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
 	atomic_set(compound_pincount_ptr(page), 0);
+	return true;
+
+out_error:
+	/* undo tail page modifications made above */
+	p = page + 1;
+	for (j = 1; j < i; j++, p = mem_map_next(p, page, j)) {
+		clear_compound_head(p);
+		set_page_refcounted(p);
+	}
+	/* need to clear PG_reserved on remaining tail pages  */
+	for (; j < nr_pages; j++, p = mem_map_next(p, page, j))
+		__ClearPageReserved(p);
+	set_compound_order(page, 0);
+	page[1].compound_nr = 0;
+	__ClearPageHead(page);
+	return false;
 }
 
 /*
@@ -1771,7 +1808,9 @@ static struct page *alloc_fresh_huge_pag
 		nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
+	bool retry = false;
 
+retry:
 	if (hstate_is_gigantic(h))
 		page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
 	else
@@ -1780,8 +1819,21 @@ static struct page *alloc_fresh_huge_pag
 	if (!page)
 		return NULL;
 
-	if (hstate_is_gigantic(h))
-		prep_compound_gigantic_page(page, huge_page_order(h));
+	if (hstate_is_gigantic(h)) {
+		if (!prep_compound_gigantic_page(page, huge_page_order(h))) {
+			/*
+			 * Rare failure to convert pages to compound page.
+			 * Free pages and try again - ONCE!
+			 */
+			free_gigantic_page(page, huge_page_order(h));
+			if (!retry) {
+				retry = true;
+				goto retry;
+			}
+			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+			return NULL;
+		}
+	}
 	prep_new_huge_page(h, page, page_to_nid(page));
 
 	return page;
@@ -2771,10 +2823,14 @@ static void __init gather_bootmem_preall
 
 		VM_BUG_ON(!hstate_is_gigantic(h));
 		WARN_ON(page_count(page) != 1);
-		prep_compound_gigantic_page(page, huge_page_order(h));
-		WARN_ON(PageReserved(page));
-		prep_new_huge_page(h, page, page_to_nid(page));
-		put_page(page); /* free it into the hugepage allocator */
+		if (prep_compound_gigantic_page(page, huge_page_order(h))) {
+			WARN_ON(PageReserved(page));
+			prep_new_huge_page(h, page, page_to_nid(page));
+			put_page(page); /* add to the hugepage allocator */
+		} else {
+			free_gigantic_page(page, huge_page_order(h));
+			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+		}
 
 		/*
 		 * We need to restore the 'stolen' pages to totalram_pages
--- a/mm/page_alloc.c~hugetlb-address-ref-count-racing-in-prep_compound_gigantic_page
+++ a/mm/page_alloc.c
@@ -754,7 +754,6 @@ void prep_compound_page(struct page *pag
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
-		set_page_count(p, 0);
 		p->mapping = TAIL_MAPPING;
 		set_compound_head(p, page);
 	}
_

* [patch 029/192] mm/hwpoison: disable pcp for page_handle_poison()
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: akpm, david, linux-mm, mgorman, mhocko, mike.kravetz, mm-commits,
	naoya.horiguchi, osalvador, torvalds

From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: mm/hwpoison: disable pcp for page_handle_poison()

Recent changes from the patch "mm/page_alloc: allow high-order pages to
be stored on the per-cpu lists" make the kernel decide whether to use
the pcp via pcp_allowed_order(), which breaks soft-offline for hugetlb
pages.

Soft-offline dissolves a migration source page, then removes it from the
buddy free list, so it is assumed that each subpage of the soft-offlined
hugepage is recognized as a buddy page just after returning from
dissolve_free_huge_page().  pcp_allowed_order() returns true for
hugetlb, so this assumption is no longer true.

So disable the pcp during dissolve_free_huge_page() and
take_page_off_buddy() to prevent soft-offlined hugepages from being
linked onto pcp lists.  Soft-offline events should not be common, so the
performance impact should be minimal.  And since the optimization in
Mel's patch could also benefit hugetlb, zone_pcp_disable() is called
only in the hwpoison context.
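
For context, the check introduced by Mel's series looks roughly like
this; with CONFIG_TRANSPARENT_HUGEPAGE, an order-9 page (pageblock_order
on x86_64, i.e. a freed 2MB hugetlb page) now qualifies for the pcp
lists, so take_page_off_buddy() can miss it:

	/* Sketch of pcp_allowed_order(), per Mel's patch. */
	static inline bool pcp_allowed_order(unsigned int order)
	{
		if (order <= PAGE_ALLOC_COSTLY_ORDER)
			return true;
	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
		if (order == pageblock_order)
			return true;
	#endif
		return false;
	}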

Link: https://lkml.kernel.org/r/20210617092626.291006-1-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

--- a/mm/memory-failure.c~mm-hwpoison-disable-pcp-for-page_handle_poison
+++ a/mm/memory-failure.c
@@ -66,6 +66,19 @@ int sysctl_memory_failure_recovery __rea
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+static bool __page_handle_poison(struct page *page)
+{
+	bool ret;
+
+	zone_pcp_disable(page_zone(page));
+	ret = dissolve_free_huge_page(page);
+	if (!ret)
+		ret = take_page_off_buddy(page);
+	zone_pcp_enable(page_zone(page));
+
+	return ret;
+}
+
 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
 {
 	if (hugepage_or_freepage) {
@@ -73,7 +86,7 @@ static bool page_handle_poison(struct pa
 		 * Doing this check for free pages is also fine since dissolve_free_huge_page
 		 * returns 0 for non-hugetlb pages as well.
 		 */
-		if (dissolve_free_huge_page(page) || !take_page_off_buddy(page))
+		if (!__page_handle_poison(page))
 			/*
 			 * We could fail to take off the target page from buddy
 			 * for example due to racy page allocation, but that's
@@ -985,7 +998,7 @@ static int me_huge_page(struct page *p,
 		 */
 		if (PageAnon(hpage))
 			put_page(hpage);
-		if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) {
+		if (__page_handle_poison(p)) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		}
@@ -1446,7 +1459,7 @@ static int memory_failure_hugetlb(unsign
 			}
 			unlock_page(head);
 			res = MF_FAILED;
-			if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) {
+			if (__page_handle_poison(p)) {
 				page_ref_inc(p);
 				res = MF_RECOVERED;
 			}
_

* [patch 030/192] userfaultfd/selftests: use user mode only
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: userfaultfd/selftests: use user mode only

Patch series "userfaultfd/selftests: A few cleanups", v2.

I have wanted to clean up the fault handling in userfaultfd.c for a long
time; the longer it stays uncleaned, the more new code grows the file
and the more there is to clean...  This is my attempt to clean up fault
handling in the userfaultfd selftest by using an err() macro instead of
fprintf() or perror() followed by a separate exit() call.

The huge cleanup is done in the last patch.  The first 4 patches are some
other standalone cleanups for the same file, so I put them together.


This patch (of 5):

The userfaultfd selftest does not need to handle kernel-initiated
faults.  Set user mode so the test can be run even when
unprivileged_userfaultfd=0 (which is the default).
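
A minimal sketch of the resulting open call (the helper name is
illustrative; the selftest's userfaultfd_open_ext() additionally does
the UFFDIO_API handshake).  With UFFD_USER_MODE_ONLY the descriptor only
ever reports user-mode faults, so no privilege or sysctl change is
needed:

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/userfaultfd.h>

	static int open_uffd_unprivileged(void)
	{
		/* Traps user-mode faults only; OK with unprivileged_userfaultfd=0. */
		return syscall(__NR_userfaultfd,
			       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	}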

Link: https://lkml.kernel.org/r/20210412232753.1012412-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-use-user-mode-only
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -831,7 +831,7 @@ static int userfaultfd_open_ext(uint64_t
 {
 	struct uffdio_api uffdio_api;
 
-	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
 	if (uffd < 0) {
 		fprintf(stderr,
 			"userfaultfd syscall not available in this kernel\n");
_

* [patch 031/192] userfaultfd/selftests: remove the time() check on delayed uffd
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: userfaultfd/selftests: remove the time() check on delayed uffd

There is no guarantee that time() will return the same value for the two
calls even when there is no real delay, e.g. when a fault happens to
cross a second boundary.  Meanwhile, the message is not very helpful
either, since a delay can have many causes, e.g. scheduling latency of
the resolving thread; it does not necessarily indicate a uffd problem.

Nor have I ever seen this error triggered in past runs.  Even if it did
trigger, it would be drowned out by the rest of the test logs.  Remove
it.
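
If such a timing check is ever wanted again, a monotonic clock would at
least avoid the second-boundary artifact.  A hypothetical sketch, not
part of this patch:

	#include <time.h>

	/* Hypothetical helper: seconds elapsed on a monotonic clock. */
	static double monotonic_since(const struct timespec *t0)
	{
		struct timespec t1;

		clock_gettime(CLOCK_MONOTONIC, &t1);
		return (t1.tv_sec - t0->tv_sec) +
		       (t1.tv_nsec - t0->tv_nsec) / 1e9;
	}

The caller would sample a timespec before the faulting access and
compare monotonic_since(&t0) against a threshold.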

Link: https://lkml.kernel.org/r/20210412232753.1012412-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |    8 --------
 1 file changed, 8 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-remove-the-time-check-on-delayed-uffd
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -395,7 +395,6 @@ static void *locking_thread(void *arg)
 	unsigned long long count;
 	char randstate[64];
 	unsigned int seed;
-	time_t start;
 
 	if (bounces & BOUNCE_RANDOM) {
 		seed = (unsigned int) time(NULL) - bounces;
@@ -432,7 +431,6 @@ static void *locking_thread(void *arg)
 			page_nr += 1;
 		page_nr %= nr_pages;
 
-		start = time(NULL);
 		if (bounces & BOUNCE_VERIFY) {
 			count = *area_count(area_dst, page_nr);
 			if (!count) {
@@ -495,12 +493,6 @@ static void *locking_thread(void *arg)
 		count++;
 		*area_count(area_dst, page_nr) = count_verify[page_nr] = count;
 		pthread_mutex_unlock(area_mutex(area_dst, page_nr));
-
-		if (time(NULL) - start > 1)
-			fprintf(stderr,
-				"userfault too slow %ld "
-				"possible false positive with overcommit\n",
-				time(NULL) - start);
 	}
 
 	return NULL;
_

* [patch 032/192] userfaultfd/selftests: dropping VERIFY check in locking_thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: userfaultfd/selftests: dropping VERIFY check in locking_thread

The VERIFY check compares the page against all zeros and loops quite a
few times.  However, right after that we verify the same page against
count_verify, which can never be zero, so a zero page would be detected
anyway by the code below.

There is yet another place where we check the fault flag conditionally -
just do it unconditionally.

Link: https://lkml.kernel.org/r/20210412232753.1012412-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   55 ---------------------
 1 file changed, 1 insertion(+), 54 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-dropping-verify-check-in-locking_thread
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -430,58 +430,6 @@ static void *locking_thread(void *arg)
 		} else
 			page_nr += 1;
 		page_nr %= nr_pages;
-
-		if (bounces & BOUNCE_VERIFY) {
-			count = *area_count(area_dst, page_nr);
-			if (!count) {
-				fprintf(stderr,
-					"page_nr %lu wrong count %Lu %Lu\n",
-					page_nr, count,
-					count_verify[page_nr]);
-				exit(1);
-			}
-
-
-			/*
-			 * We can't use bcmp (or memcmp) because that
-			 * returns 0 erroneously if the memory is
-			 * changing under it (even if the end of the
-			 * page is never changing and always
-			 * different).
-			 */
-#if 1
-			if (!my_bcmp(area_dst + page_nr * page_size, zeropage,
-				     page_size)) {
-				fprintf(stderr,
-					"my_bcmp page_nr %lu wrong count %Lu %Lu\n",
-					page_nr, count, count_verify[page_nr]);
-				exit(1);
-			}
-#else
-			unsigned long loops;
-
-			loops = 0;
-			/* uncomment the below line to test with mutex */
-			/* pthread_mutex_lock(area_mutex(area_dst, page_nr)); */
-			while (!bcmp(area_dst + page_nr * page_size, zeropage,
-				     page_size)) {
-				loops += 1;
-				if (loops > 10)
-					break;
-			}
-			/* uncomment below line to test with mutex */
-			/* pthread_mutex_unlock(area_mutex(area_dst, page_nr)); */
-			if (loops) {
-				fprintf(stderr,
-					"page_nr %lu all zero thread %lu %p %lu\n",
-					page_nr, cpu, area_dst + page_nr * page_size,
-					loops);
-				if (loops > 10)
-					exit(1);
-			}
-#endif
-		}
-
 		pthread_mutex_lock(area_mutex(area_dst, page_nr));
 		count = *area_count(area_dst, page_nr);
 		if (count != count_verify[page_nr]) {
@@ -613,8 +561,7 @@ static void uffd_handle_page_fault(struc
 		stats->minor_faults++;
 	} else {
 		/* Missing page faults */
-		if (bounces & BOUNCE_VERIFY &&
-		    msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) {
+		if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) {
 			fprintf(stderr, "unexpected write fault\n");
 			exit(1);
 		}
_

* [patch 033/192] userfaultfd/selftests: only dump counts if mode enabled
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: userfaultfd/selftests: only dump counts if mode enabled

WP and MINOR modes are only conditionally enabled for specific memory
types.  Avoid dumping tons of zero counts in the cases where a mode is
not supported at all.

Link: https://lkml.kernel.org/r/20210412232753.1012412-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   30 ++++++++++++++-------
 1 file changed, 20 insertions(+), 10 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-only-dump-counts-if-mode-enabled
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -171,16 +171,26 @@ static void uffd_stats_report(struct uff
 		minor_total += stats[i].minor_faults;
 	}
 
-	printf("userfaults: %llu missing (", miss_total);
-	for (i = 0; i < n_cpus; i++)
-		printf("%lu+", stats[i].missing_faults);
-	printf("\b), %llu wp (", wp_total);
-	for (i = 0; i < n_cpus; i++)
-		printf("%lu+", stats[i].wp_faults);
-	printf("\b), %llu minor (", minor_total);
-	for (i = 0; i < n_cpus; i++)
-		printf("%lu+", stats[i].minor_faults);
-	printf("\b)\n");
+	printf("userfaults: ");
+	if (miss_total) {
+		printf("%llu missing (", miss_total);
+		for (i = 0; i < n_cpus; i++)
+			printf("%lu+", stats[i].missing_faults);
+		printf("\b) ");
+	}
+	if (wp_total) {
+		printf("%llu wp (", wp_total);
+		for (i = 0; i < n_cpus; i++)
+			printf("%lu+", stats[i].wp_faults);
+		printf("\b) ");
+	}
+	if (minor_total) {
+		printf("%llu minor (", minor_total);
+		for (i = 0; i < n_cpus; i++)
+			printf("%lu+", stats[i].minor_faults);
+		printf("\b)");
+	}
+	printf("\n");
 }
 
 static int anon_release_pages(char *rel_area)
_

* [patch 034/192] userfaultfd/selftests: unify error handling
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: userfaultfd/selftests: unify error handling

Introduce err()/_err() and replace all the different ways of failing the
program - mostly fprintf() and perror() followed by tons of exit()
calls.  Always stop the test program on any failure.

Link: https://lkml.kernel.org/r/20210412232753.1012412-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |  556 +++++++--------------
 1 file changed, 187 insertions(+), 369 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-unify-error-handling
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -140,11 +140,18 @@ static void usage(void)
 	exit(1);
 }
 
-#define uffd_error(code, fmt, ...)                                             \
-	do {                                                                   \
-		fprintf(stderr, fmt, ##__VA_ARGS__);                           \
-		fprintf(stderr, ": %" PRId64 "\n", (int64_t)(code));           \
-		exit(1);                                                       \
+#define _err(fmt, ...)						\
+	do {							\
+		int ret = errno;				\
+		fprintf(stderr, "ERROR: " fmt, ##__VA_ARGS__);	\
+		fprintf(stderr, " (errno=%d, line=%d)\n",	\
+			ret, __LINE__);				\
+	} while (0)
+
+#define err(fmt, ...)				\
+	do {					\
+		_err(fmt, ##__VA_ARGS__);	\
+		exit(1);			\
 	} while (0)
 
 static void uffd_stats_reset(struct uffd_stats *uffd_stats,
@@ -193,44 +200,28 @@ static void uffd_stats_report(struct uff
 	printf("\n");
 }
 
-static int anon_release_pages(char *rel_area)
+static void anon_release_pages(char *rel_area)
 {
-	int ret = 0;
-
-	if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED)) {
-		perror("madvise");
-		ret = 1;
-	}
-
-	return ret;
+	if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED))
+		err("madvise(MADV_DONTNEED) failed");
 }
 
 static void anon_allocate_area(void **alloc_area)
 {
-	if (posix_memalign(alloc_area, page_size, nr_pages * page_size)) {
-		fprintf(stderr, "out of memory\n");
-		*alloc_area = NULL;
-	}
+	if (posix_memalign(alloc_area, page_size, nr_pages * page_size))
+		err("posix_memalign() failed");
 }
 
 static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
 {
 }
 
-/* HugeTLB memory */
-static int hugetlb_release_pages(char *rel_area)
+static void hugetlb_release_pages(char *rel_area)
 {
-	int ret = 0;
-
 	if (fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-				rel_area == huge_fd_off0 ? 0 :
-				nr_pages * page_size,
-				nr_pages * page_size)) {
-		perror("fallocate");
-		ret = 1;
-	}
-
-	return ret;
+		      rel_area == huge_fd_off0 ? 0 : nr_pages * page_size,
+		      nr_pages * page_size))
+		err("fallocate() failed");
 }
 
 static void hugetlb_allocate_area(void **alloc_area)
@@ -243,20 +234,16 @@ static void hugetlb_allocate_area(void *
 			   MAP_HUGETLB,
 			   huge_fd, *alloc_area == area_src ? 0 :
 			   nr_pages * page_size);
-	if (*alloc_area == MAP_FAILED) {
-		perror("mmap of hugetlbfs file failed");
-		goto fail;
-	}
+	if (*alloc_area == MAP_FAILED)
+		err("mmap of hugetlbfs file failed");
 
 	if (map_shared) {
 		area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
 				  MAP_SHARED | MAP_HUGETLB,
 				  huge_fd, *alloc_area == area_src ? 0 :
 				  nr_pages * page_size);
-		if (area_alias == MAP_FAILED) {
-			perror("mmap of hugetlb file alias failed");
-			goto fail_munmap;
-		}
+		if (area_alias == MAP_FAILED)
+			err("mmap of hugetlb file alias failed");
 	}
 
 	if (*alloc_area == area_src) {
@@ -267,16 +254,6 @@ static void hugetlb_allocate_area(void *
 	}
 	if (area_alias)
 		*alloc_area_alias = area_alias;
-
-	return;
-
-fail_munmap:
-	if (munmap(*alloc_area, nr_pages * page_size) < 0) {
-		perror("hugetlb munmap");
-		exit(1);
-	}
-fail:
-	*alloc_area = NULL;
 }
 
 static void hugetlb_alias_mapping(__u64 *start, size_t len, unsigned long offset)
@@ -292,33 +269,24 @@ static void hugetlb_alias_mapping(__u64
 	*start = (unsigned long) area_dst_alias + offset;
 }
 
-/* Shared memory */
-static int shmem_release_pages(char *rel_area)
+static void shmem_release_pages(char *rel_area)
 {
-	int ret = 0;
-
-	if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE)) {
-		perror("madvise");
-		ret = 1;
-	}
-
-	return ret;
+	if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
+		err("madvise(MADV_REMOVE) failed");
 }
 
 static void shmem_allocate_area(void **alloc_area)
 {
 	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
 			   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
-	if (*alloc_area == MAP_FAILED) {
-		fprintf(stderr, "shared memory mmap failed\n");
-		*alloc_area = NULL;
-	}
+	if (*alloc_area == MAP_FAILED)
+		err("mmap of memfd failed");
 }
 
 struct uffd_test_ops {
 	unsigned long expected_ioctls;
 	void (*allocate_area)(void **alloc_area);
-	int (*release_pages)(char *rel_area);
+	void (*release_pages)(char *rel_area);
 	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
 };
 
@@ -373,11 +341,8 @@ static void wp_range(int ufd, __u64 star
 	/* Undo write-protect, do wakeup after that */
 	prms.mode = wp ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
 
-	if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms)) {
-		fprintf(stderr, "clear WP failed for address 0x%" PRIx64 "\n",
-			(uint64_t)start);
-		exit(1);
-	}
+	if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms))
+		err("clear WP failed: address=0x%"PRIx64, (uint64_t)start);
 }
 
 static void continue_range(int ufd, __u64 start, __u64 len)
@@ -388,12 +353,9 @@ static void continue_range(int ufd, __u6
 	req.range.len = len;
 	req.mode = 0;
 
-	if (ioctl(ufd, UFFDIO_CONTINUE, &req)) {
-		fprintf(stderr,
-			"UFFDIO_CONTINUE failed for address 0x%" PRIx64 "\n",
-			(uint64_t)start);
-		exit(1);
-	}
+	if (ioctl(ufd, UFFDIO_CONTINUE, &req))
+		err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
+		    (uint64_t)start);
 }
 
 static void *locking_thread(void *arg)
@@ -412,10 +374,8 @@ static void *locking_thread(void *arg)
 			seed += cpu;
 		bzero(&rand, sizeof(rand));
 		bzero(&randstate, sizeof(randstate));
-		if (initstate_r(seed, randstate, sizeof(randstate), &rand)) {
-			fprintf(stderr, "srandom_r error\n");
-			exit(1);
-		}
+		if (initstate_r(seed, randstate, sizeof(randstate), &rand))
+			err("initstate_r failed");
 	} else {
 		page_nr = -bounces;
 		if (!(bounces & BOUNCE_RACINGFAULTS))
@@ -424,16 +384,12 @@ static void *locking_thread(void *arg)
 
 	while (!finished) {
 		if (bounces & BOUNCE_RANDOM) {
-			if (random_r(&rand, &rand_nr)) {
-				fprintf(stderr, "random_r 1 error\n");
-				exit(1);
-			}
+			if (random_r(&rand, &rand_nr))
+				err("random_r failed");
 			page_nr = rand_nr;
 			if (sizeof(page_nr) > sizeof(rand_nr)) {
-				if (random_r(&rand, &rand_nr)) {
-					fprintf(stderr, "random_r 2 error\n");
-					exit(1);
-				}
+				if (random_r(&rand, &rand_nr))
+					err("random_r failed");
 				page_nr |= (((unsigned long) rand_nr) << 16) <<
 					   16;
 			}
@@ -442,12 +398,9 @@ static void *locking_thread(void *arg)
 		page_nr %= nr_pages;
 		pthread_mutex_lock(area_mutex(area_dst, page_nr));
 		count = *area_count(area_dst, page_nr);
-		if (count != count_verify[page_nr]) {
-			fprintf(stderr,
-				"page_nr %lu memory corruption %Lu %Lu\n",
-				page_nr, count,
-				count_verify[page_nr]); exit(1);
-		}
+		if (count != count_verify[page_nr])
+			err("page_nr %lu memory corruption %llu %llu",
+			    page_nr, count, count_verify[page_nr]);
 		count++;
 		*area_count(area_dst, page_nr) = count_verify[page_nr] = count;
 		pthread_mutex_unlock(area_mutex(area_dst, page_nr));
@@ -464,22 +417,21 @@ static void retry_copy_page(int ufd, str
 				     offset);
 	if (ioctl(ufd, UFFDIO_COPY, uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
-		if (uffdio_copy->copy != -EEXIST) {
-			uffd_error(uffdio_copy->copy,
-				   "UFFDIO_COPY retry error");
-		}
-	} else
-		uffd_error(uffdio_copy->copy, "UFFDIO_COPY retry unexpected");
+		if (uffdio_copy->copy != -EEXIST)
+			err("UFFDIO_COPY retry error: %"PRId64,
+			    (int64_t)uffdio_copy->copy);
+	} else {
+		err("UFFDIO_COPY retry unexpected: %"PRId64,
+		    (int64_t)uffdio_copy->copy);
+	}
 }
 
 static int __copy_page(int ufd, unsigned long offset, bool retry)
 {
 	struct uffdio_copy uffdio_copy;
 
-	if (offset >= nr_pages * page_size) {
-		fprintf(stderr, "unexpected offset %lu\n", offset);
-		exit(1);
-	}
+	if (offset >= nr_pages * page_size)
+		err("unexpected offset %lu\n", offset);
 	uffdio_copy.dst = (unsigned long) area_dst + offset;
 	uffdio_copy.src = (unsigned long) area_src + offset;
 	uffdio_copy.len = page_size;
@@ -491,9 +443,10 @@ static int __copy_page(int ufd, unsigned
 	if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) {
 		/* real retval in ufdio_copy.copy */
 		if (uffdio_copy.copy != -EEXIST)
-			uffd_error(uffdio_copy.copy, "UFFDIO_COPY error");
+			err("UFFDIO_COPY error: %"PRId64,
+			    (int64_t)uffdio_copy.copy);
 	} else if (uffdio_copy.copy != page_size) {
-		uffd_error(uffdio_copy.copy, "UFFDIO_COPY unexpected copy");
+		err("UFFDIO_COPY error: %"PRId64, (int64_t)uffdio_copy.copy);
 	} else {
 		if (test_uffdio_copy_eexist && retry) {
 			test_uffdio_copy_eexist = false;
@@ -522,11 +475,10 @@ static int uffd_read_msg(int ufd, struct
 		if (ret < 0) {
 			if (errno == EAGAIN)
 				return 1;
-			perror("blocking read error");
+			err("blocking read error");
 		} else {
-			fprintf(stderr, "short read\n");
+			err("short read");
 		}
-		exit(1);
 	}
 
 	return 0;
@@ -537,10 +489,8 @@ static void uffd_handle_page_fault(struc
 {
 	unsigned long offset;
 
-	if (msg->event != UFFD_EVENT_PAGEFAULT) {
-		fprintf(stderr, "unexpected msg event %u\n", msg->event);
-		exit(1);
-	}
+	if (msg->event != UFFD_EVENT_PAGEFAULT)
+		err("unexpected msg event %u", msg->event);
 
 	if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {
 		/* Write protect page faults */
@@ -571,10 +521,8 @@ static void uffd_handle_page_fault(struc
 		stats->minor_faults++;
 	} else {
 		/* Missing page faults */
-		if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) {
-			fprintf(stderr, "unexpected write fault\n");
-			exit(1);
-		}
+		if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+			err("unexpected write fault");
 
 		offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst;
 		offset &= ~(page_size-1);
@@ -601,32 +549,20 @@ static void *uffd_poll_thread(void *arg)
 
 	for (;;) {
 		ret = poll(pollfd, 2, -1);
-		if (!ret) {
-			fprintf(stderr, "poll error %d\n", ret);
-			exit(1);
-		}
-		if (ret < 0) {
-			perror("poll");
-			exit(1);
-		}
+		if (ret <= 0)
+			err("poll error: %d", ret);
 		if (pollfd[1].revents & POLLIN) {
-			if (read(pollfd[1].fd, &tmp_chr, 1) != 1) {
-				fprintf(stderr, "read pipefd error\n");
-				exit(1);
-			}
+			if (read(pollfd[1].fd, &tmp_chr, 1) != 1)
+				err("read pipefd error");
 			break;
 		}
-		if (!(pollfd[0].revents & POLLIN)) {
-			fprintf(stderr, "pollfd[0].revents %d\n",
-				pollfd[0].revents);
-			exit(1);
-		}
+		if (!(pollfd[0].revents & POLLIN))
+			err("pollfd[0].revents %d", pollfd[0].revents);
 		if (uffd_read_msg(uffd, &msg))
 			continue;
 		switch (msg.event) {
 		default:
-			fprintf(stderr, "unexpected msg event %u\n",
-				msg.event); exit(1);
+			err("unexpected msg event %u\n", msg.event);
 			break;
 		case UFFD_EVENT_PAGEFAULT:
 			uffd_handle_page_fault(&msg, stats);
@@ -640,10 +576,8 @@ static void *uffd_poll_thread(void *arg)
 			uffd_reg.range.start = msg.arg.remove.start;
 			uffd_reg.range.len = msg.arg.remove.end -
 				msg.arg.remove.start;
-			if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range)) {
-				fprintf(stderr, "remove failure\n");
-				exit(1);
-			}
+			if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range))
+				err("remove failure");
 			break;
 		case UFFD_EVENT_REMAP:
 			area_dst = (char *)(unsigned long)msg.arg.remap.to;
@@ -746,9 +680,7 @@ static int stress(struct uffd_stats *uff
 	 * UFFDIO_COPY without writing zero pages into area_dst
 	 * because the background threads already completed).
 	 */
-	if (uffd_test_ops->release_pages(area_src))
-		return 1;
-
+	uffd_test_ops->release_pages(area_src);
 
 	finished = 1;
 	for (cpu = 0; cpu < nr_cpus; cpu++)
@@ -758,10 +690,8 @@ static int stress(struct uffd_stats *uff
 	for (cpu = 0; cpu < nr_cpus; cpu++) {
 		char c;
 		if (bounces & BOUNCE_POLL) {
-			if (write(pipefd[cpu*2+1], &c, 1) != 1) {
-				fprintf(stderr, "pipefd write error\n");
-				return 1;
-			}
+			if (write(pipefd[cpu*2+1], &c, 1) != 1)
+				err("pipefd write error");
 			if (pthread_join(uffd_threads[cpu],
 					 (void *)&uffd_stats[cpu]))
 				return 1;
@@ -861,10 +791,8 @@ static int faulting_process(int signal_t
 		memset(&act, 0, sizeof(act));
 		act.sa_sigaction = sighndl;
 		act.sa_flags = SA_SIGINFO;
-		if (sigaction(SIGBUS, &act, 0)) {
-			perror("sigaction");
-			return 1;
-		}
+		if (sigaction(SIGBUS, &act, 0))
+			err("sigaction");
 		lastnr = (unsigned long)-1;
 	}
 
@@ -874,10 +802,8 @@ static int faulting_process(int signal_t
 
 		if (signal_test) {
 			if (sigsetjmp(*sigbuf, 1) != 0) {
-				if (steps == 1 && nr == lastnr) {
-					fprintf(stderr, "Signal repeated\n");
-					return 1;
-				}
+				if (steps == 1 && nr == lastnr)
+					err("Signal repeated");
 
 				lastnr = nr;
 				if (signal_test == 1) {
@@ -902,12 +828,9 @@ static int faulting_process(int signal_t
 		}
 
 		count = *area_count(area_dst, nr);
-		if (count != count_verify[nr]) {
-			fprintf(stderr,
-				"nr %lu memory corruption %Lu %Lu\n",
-				nr, count,
-				count_verify[nr]);
-	        }
+		if (count != count_verify[nr])
+			err("nr %lu memory corruption %llu %llu\n",
+			    nr, count, count_verify[nr]);
 		/*
 		 * Trigger write protection if there is by writing
 		 * the same value back.
@@ -923,18 +846,14 @@ static int faulting_process(int signal_t
 
 	area_dst = mremap(area_dst, nr_pages * page_size,  nr_pages * page_size,
 			  MREMAP_MAYMOVE | MREMAP_FIXED, area_src);
-	if (area_dst == MAP_FAILED) {
-		perror("mremap");
-		exit(1);
-	}
+	if (area_dst == MAP_FAILED)
+		err("mremap");
 
 	for (; nr < nr_pages; nr++) {
 		count = *area_count(area_dst, nr);
 		if (count != count_verify[nr]) {
-			fprintf(stderr,
-				"nr %lu memory corruption %Lu %Lu\n",
-				nr, count,
-				count_verify[nr]); exit(1);
+			err("nr %lu memory corruption %llu %llu\n",
+			    nr, count, count_verify[nr]);
 		}
 		/*
 		 * Trigger write protection if there is by writing
@@ -943,15 +862,11 @@ static int faulting_process(int signal_t
 		*area_count(area_dst, nr) = count;
 	}
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
+	uffd_test_ops->release_pages(area_dst);
 
-	for (nr = 0; nr < nr_pages; nr++) {
-		if (my_bcmp(area_dst + nr * page_size, zeropage, page_size)) {
-			fprintf(stderr, "nr %lu is not zero\n", nr);
-			exit(1);
-		}
-	}
+	for (nr = 0; nr < nr_pages; nr++)
+		if (my_bcmp(area_dst + nr * page_size, zeropage, page_size))
+			err("nr %lu is not zero", nr);
 
 	return 0;
 }
@@ -964,13 +879,12 @@ static void retry_uffdio_zeropage(int uf
 				     uffdio_zeropage->range.len,
 				     offset);
 	if (ioctl(ufd, UFFDIO_ZEROPAGE, uffdio_zeropage)) {
-		if (uffdio_zeropage->zeropage != -EEXIST) {
-			uffd_error(uffdio_zeropage->zeropage,
-				   "UFFDIO_ZEROPAGE retry error");
-		}
+		if (uffdio_zeropage->zeropage != -EEXIST)
+			err("UFFDIO_ZEROPAGE error: %"PRId64,
+			    (int64_t)uffdio_zeropage->zeropage);
 	} else {
-		uffd_error(uffdio_zeropage->zeropage,
-			   "UFFDIO_ZEROPAGE retry unexpected");
+		err("UFFDIO_ZEROPAGE error: %"PRId64,
+		    (int64_t)uffdio_zeropage->zeropage);
 	}
 }
 
@@ -983,10 +897,8 @@ static int __uffdio_zeropage(int ufd, un
 
 	has_zeropage = uffd_test_ops->expected_ioctls & (1 << _UFFDIO_ZEROPAGE);
 
-	if (offset >= nr_pages * page_size) {
-		fprintf(stderr, "unexpected offset %lu\n", offset);
-		exit(1);
-	}
+	if (offset >= nr_pages * page_size)
+		err("unexpected offset %lu", offset);
 	uffdio_zeropage.range.start = (unsigned long) area_dst + offset;
 	uffdio_zeropage.range.len = page_size;
 	uffdio_zeropage.mode = 0;
@@ -994,14 +906,13 @@ static int __uffdio_zeropage(int ufd, un
 	res = uffdio_zeropage.zeropage;
 	if (ret) {
 		/* real retval in ufdio_zeropage.zeropage */
-		if (has_zeropage) {
-			uffd_error(res, "UFFDIO_ZEROPAGE %s",
-				   res == -EEXIST ? "-EEXIST" : "error");
-		} else if (res != -EINVAL)
-			uffd_error(res, "UFFDIO_ZEROPAGE not -EINVAL");
+		if (has_zeropage)
+			err("UFFDIO_ZEROPAGE error: %"PRId64, (int64_t)res);
+		else if (res != -EINVAL)
+			err("UFFDIO_ZEROPAGE not -EINVAL");
 	} else if (has_zeropage) {
 		if (res != page_size) {
-			uffd_error(res, "UFFDIO_ZEROPAGE unexpected");
+			err("UFFDIO_ZEROPAGE unexpected size");
 		} else {
 			if (test_uffdio_zeropage_eexist && retry) {
 				test_uffdio_zeropage_eexist = false;
@@ -1011,7 +922,7 @@ static int __uffdio_zeropage(int ufd, un
 			return 1;
 		}
 	} else
-		uffd_error(res, "UFFDIO_ZEROPAGE succeeded");
+		err("UFFDIO_ZEROPAGE succeeded");
 
 	return 0;
 }
@@ -1030,8 +941,7 @@ static int userfaultfd_zeropage_test(voi
 	printf("testing UFFDIO_ZEROPAGE: ");
 	fflush(stdout);
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
+	uffd_test_ops->release_pages(area_dst);
 
 	if (userfaultfd_open(0))
 		return 1;
@@ -1040,25 +950,16 @@ static int userfaultfd_zeropage_test(voi
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
 	if (test_uffdio_wp)
 		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
-	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
-		fprintf(stderr, "register failure\n");
-		exit(1);
-	}
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		err("register failure");
 
 	expected_ioctls = uffd_test_ops->expected_ioctls;
-	if ((uffdio_register.ioctls & expected_ioctls) !=
-	    expected_ioctls) {
-		fprintf(stderr,
-			"unexpected missing ioctl for anon memory\n");
-		exit(1);
-	}
+	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls)
+		err("unexpected missing ioctl for anon memory");
 
-	if (uffdio_zeropage(uffd, 0)) {
-		if (my_bcmp(area_dst, zeropage, page_size)) {
-			fprintf(stderr, "zeropage is not zero\n");
-			exit(1);
-		}
-	}
+	if (uffdio_zeropage(uffd, 0))
+		if (my_bcmp(area_dst, zeropage, page_size))
+			err("zeropage is not zero");
 
 	close(uffd);
 	printf("done.\n");
@@ -1078,8 +979,7 @@ static int userfaultfd_events_test(void)
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
+	uffd_test_ops->release_pages(area_dst);
 
 	features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |
 		UFFD_FEATURE_EVENT_REMOVE;
@@ -1092,41 +992,28 @@ static int userfaultfd_events_test(void)
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
 	if (test_uffdio_wp)
 		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
-	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
-		fprintf(stderr, "register failure\n");
-		exit(1);
-	}
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		err("register failure");
 
 	expected_ioctls = uffd_test_ops->expected_ioctls;
-	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
-		fprintf(stderr, "unexpected missing ioctl for anon memory\n");
-		exit(1);
-	}
+	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls)
+		err("unexpected missing ioctl for anon memory");
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) {
-		perror("uffd_poll_thread create");
-		exit(1);
-	}
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
+		err("uffd_poll_thread create");
 
 	pid = fork();
-	if (pid < 0) {
-		perror("fork");
-		exit(1);
-	}
+	if (pid < 0)
+		err("fork");
 
 	if (!pid)
 		exit(faulting_process(0));
 
 	waitpid(pid, &err, 0);
-	if (err) {
-		fprintf(stderr, "faulting process failed\n");
-		exit(1);
-	}
-
-	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) {
-		perror("pipe write");
-		exit(1);
-	}
+	if (err)
+		err("faulting process failed");
+	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
 	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
@@ -1151,8 +1038,7 @@ static int userfaultfd_sig_test(void)
 	printf("testing signal delivery: ");
 	fflush(stdout);
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
+	uffd_test_ops->release_pages(area_dst);
 
 	features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS;
 	if (userfaultfd_open(features))
@@ -1164,57 +1050,41 @@ static int userfaultfd_sig_test(void)
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
 	if (test_uffdio_wp)
 		uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
-	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
-		fprintf(stderr, "register failure\n");
-		exit(1);
-	}
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		err("register failure");
 
 	expected_ioctls = uffd_test_ops->expected_ioctls;
-	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
-		fprintf(stderr, "unexpected missing ioctl for anon memory\n");
-		exit(1);
-	}
+	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls)
+		err("unexpected missing ioctl for anon memory");
 
-	if (faulting_process(1)) {
-		fprintf(stderr, "faulting process failed\n");
-		exit(1);
-	}
+	if (faulting_process(1))
+		err("faulting process failed");
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
+	uffd_test_ops->release_pages(area_dst);
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) {
-		perror("uffd_poll_thread create");
-		exit(1);
-	}
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
+		err("uffd_poll_thread create");
 
 	pid = fork();
-	if (pid < 0) {
-		perror("fork");
-		exit(1);
-	}
+	if (pid < 0)
+		err("fork");
 
 	if (!pid)
 		exit(faulting_process(2));
 
 	waitpid(pid, &err, 0);
-	if (err) {
-		fprintf(stderr, "faulting process failed\n");
-		exit(1);
-	}
-
-	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) {
-		perror("pipe write");
-		exit(1);
-	}
+	if (err)
+		err("faulting process failed");
+	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
 	if (pthread_join(uffd_mon, (void **)&userfaults))
 		return 1;
 
 	printf("done.\n");
 	if (userfaults)
-		fprintf(stderr, "Signal test failed, userfaults: %ld\n",
-			userfaults);
+		err("Signal test failed, userfaults: %ld", userfaults);
 	close(uffd);
+
 	return userfaults != 0;
 }
 
@@ -1236,8 +1106,7 @@ static int userfaultfd_minor_test(void)
 	printf("testing minor faults: ");
 	fflush(stdout);
 
-	if (uffd_test_ops->release_pages(area_dst))
-		return 1;
+	uffd_test_ops->release_pages(area_dst);
 
 	if (userfaultfd_open_ext(&features))
 		return 1;
@@ -1251,17 +1120,13 @@ static int userfaultfd_minor_test(void)
 	uffdio_register.range.start = (unsigned long)area_dst_alias;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR;
-	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
-		fprintf(stderr, "register failure\n");
-		exit(1);
-	}
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		err("register failure");
 
 	expected_ioctls = uffd_test_ops->expected_ioctls;
 	expected_ioctls |= 1 << _UFFDIO_CONTINUE;
-	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) {
-		fprintf(stderr, "unexpected missing ioctl(s)\n");
-		exit(1);
-	}
+	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls)
+		err("unexpected missing ioctl(s)");
 
 	/*
 	 * After registering with UFFD, populate the non-UFFD-registered side of
@@ -1272,10 +1137,8 @@ static int userfaultfd_minor_test(void)
 		       page_size);
 	}
 
-	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) {
-		perror("uffd_poll_thread create");
-		exit(1);
-	}
+	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
+		err("uffd_poll_thread create");
 
 	/*
 	 * Read each of the pages back using the UFFD-registered mapping. We
@@ -1284,26 +1147,19 @@ static int userfaultfd_minor_test(void)
 	 * page's contents, and then issuing a CONTINUE ioctl.
 	 */
 
-	if (posix_memalign(&expected_page, page_size, page_size)) {
-		fprintf(stderr, "out of memory\n");
-		return 1;
-	}
+	if (posix_memalign(&expected_page, page_size, page_size))
+		err("out of memory");
 
 	for (p = 0; p < nr_pages; ++p) {
 		expected_byte = ~((uint8_t)(p % ((uint8_t)-1)));
 		memset(expected_page, expected_byte, page_size);
 		if (my_bcmp(expected_page, area_dst_alias + (p * page_size),
-			    page_size)) {
-			fprintf(stderr,
-				"unexpected page contents after minor fault\n");
-			exit(1);
-		}
+			    page_size))
+			err("unexpected page contents after minor fault");
 	}
 
-	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) {
-		perror("pipe write");
-		exit(1);
-	}
+	if (write(pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
 	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
@@ -1321,7 +1177,6 @@ static int userfaultfd_stress(void)
 	unsigned long nr;
 	struct uffdio_register uffdio_register;
 	unsigned long cpu;
-	int err;
 	struct uffd_stats uffd_stats[nr_cpus];
 
 	uffd_test_ops->allocate_area((void **)&area_src);
@@ -1366,10 +1221,8 @@ static int userfaultfd_stress(void)
 		}
 	}
 
-	if (posix_memalign(&area, page_size, page_size)) {
-		fprintf(stderr, "out of memory\n");
-		return 1;
-	}
+	if (posix_memalign(&area, page_size, page_size))
+		err("out of memory");
 	zeropage = area;
 	bzero(zeropage, page_size);
 
@@ -1378,7 +1231,6 @@ static int userfaultfd_stress(void)
 	pthread_attr_init(&attr);
 	pthread_attr_setstacksize(&attr, 16*1024*1024);
 
-	err = 0;
 	while (bounces--) {
 		unsigned long expected_ioctls;
 
@@ -1407,25 +1259,18 @@ static int userfaultfd_stress(void)
 		uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
 		if (test_uffdio_wp)
 			uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
-		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
-			fprintf(stderr, "register failure\n");
-			return 1;
-		}
+		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+			err("register failure");
 		expected_ioctls = uffd_test_ops->expected_ioctls;
 		if ((uffdio_register.ioctls & expected_ioctls) !=
-		    expected_ioctls) {
-			fprintf(stderr,
-				"unexpected missing ioctl for anon memory\n");
-			return 1;
-		}
+		    expected_ioctls)
+			err("unexpected missing ioctl for anon memory");
 
 		if (area_dst_alias) {
 			uffdio_register.range.start = (unsigned long)
 				area_dst_alias;
-			if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
-				fprintf(stderr, "register failure alias\n");
-				return 1;
-			}
+			if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+				err("register failure alias");
 		}
 
 		/*
@@ -1452,8 +1297,7 @@ static int userfaultfd_stress(void)
 		 * MADV_DONTNEED only after the UFFDIO_REGISTER, so it's
 		 * required to MADV_DONTNEED here.
 		 */
-		if (uffd_test_ops->release_pages(area_dst))
-			return 1;
+		uffd_test_ops->release_pages(area_dst);
 
 		uffd_stats_reset(uffd_stats, nr_cpus);
 
@@ -1467,33 +1311,22 @@ static int userfaultfd_stress(void)
 				 nr_pages * page_size, false);
 
 		/* unregister */
-		if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) {
-			fprintf(stderr, "unregister failure\n");
-			return 1;
-		}
+		if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range))
+			err("unregister failure");
 		if (area_dst_alias) {
 			uffdio_register.range.start = (unsigned long) area_dst;
 			if (ioctl(uffd, UFFDIO_UNREGISTER,
-				  &uffdio_register.range)) {
-				fprintf(stderr, "unregister failure alias\n");
-				return 1;
-			}
+				  &uffdio_register.range))
+				err("unregister failure alias");
 		}
 
 		/* verification */
-		if (bounces & BOUNCE_VERIFY) {
-			for (nr = 0; nr < nr_pages; nr++) {
-				if (*area_count(area_dst, nr) != count_verify[nr]) {
-					fprintf(stderr,
-						"error area_count %Lu %Lu %lu\n",
-						*area_count(area_src, nr),
-						count_verify[nr],
-						nr);
-					err = 1;
-					bounces = 0;
-				}
-			}
-		}
+		if (bounces & BOUNCE_VERIFY)
+			for (nr = 0; nr < nr_pages; nr++)
+				if (*area_count(area_dst, nr) != count_verify[nr])
+					err("error area_count %llu %llu %lu\n",
+					    *area_count(area_src, nr),
+					    count_verify[nr], nr);
 
 		/* prepare next bounce */
 		tmp_area = area_src;
@@ -1507,9 +1340,6 @@ static int userfaultfd_stress(void)
 		uffd_stats_report(uffd_stats, nr_cpus);
 	}
 
-	if (err)
-		return err;
-
 	close(uffd);
 	return userfaultfd_zeropage_test() || userfaultfd_sig_test()
 		|| userfaultfd_events_test() || userfaultfd_minor_test();
@@ -1560,7 +1390,7 @@ static void set_test_type(const char *ty
 		test_type = TEST_SHMEM;
 		uffd_test_ops = &shmem_uffd_test_ops;
 	} else {
-		fprintf(stderr, "Unknown test type: %s\n", type); exit(1);
+		err("Unknown test type: %s", type);
 	}
 
 	if (test_type == TEST_HUGETLB)
@@ -1568,15 +1398,11 @@ static void set_test_type(const char *ty
 	else
 		page_size = sysconf(_SC_PAGE_SIZE);
 
-	if (!page_size) {
-		fprintf(stderr, "Unable to determine page size\n");
-		exit(2);
-	}
+	if (!page_size)
+		err("Unable to determine page size");
 	if ((unsigned long) area_count(NULL, 0) + sizeof(unsigned long long) * 2
-	    > page_size) {
-		fprintf(stderr, "Impossible to run this test\n");
-		exit(2);
-	}
+	    > page_size)
+		err("Impossible to run this test");
 }
 
 static void sigalrm(int sig)
@@ -1593,10 +1419,8 @@ int main(int argc, char **argv)
 	if (argc < 4)
 		usage();
 
-	if (signal(SIGALRM, sigalrm) == SIG_ERR) {
-		fprintf(stderr, "failed to arm SIGALRM");
-		exit(1);
-	}
+	if (signal(SIGALRM, sigalrm) == SIG_ERR)
+		err("failed to arm SIGALRM");
 	alarm(ALARM_INTERVAL_SECS);
 
 	set_test_type(argv[1]);
@@ -1605,13 +1429,13 @@ int main(int argc, char **argv)
 	nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size /
 		nr_cpus;
 	if (!nr_pages_per_cpu) {
-		fprintf(stderr, "invalid MiB\n");
+		_err("invalid MiB");
 		usage();
 	}
 
 	bounces = atoi(argv[3]);
 	if (bounces <= 0) {
-		fprintf(stderr, "invalid bounces\n");
+		_err("invalid bounces");
 		usage();
 	}
 	nr_pages = nr_pages_per_cpu * nr_cpus;
@@ -1620,16 +1444,10 @@ int main(int argc, char **argv)
 		if (argc < 5)
 			usage();
 		huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
-		if (huge_fd < 0) {
-			fprintf(stderr, "Open of %s failed", argv[3]);
-			perror("open");
-			exit(1);
-		}
-		if (ftruncate(huge_fd, 0)) {
-			fprintf(stderr, "ftruncate %s to size 0 failed", argv[3]);
-			perror("ftruncate");
-			exit(1);
-		}
+		if (huge_fd < 0)
+			err("Open of %s failed", argv[4]);
+		if (ftruncate(huge_fd, 0))
+			err("ftruncate %s to size 0 failed", argv[4]);
 	}
 	printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
 	       nr_pages, nr_pages_per_cpu);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 035/192] mm/thp: simplify copying of huge zero page pmd when fork
  2021-07-01  1:46 incoming Andrew Morton
                   ` (33 preceding siblings ...)
  2021-07-01  1:48 ` [patch 034/192] userfaultfd/selftests: unify error handling Andrew Morton
@ 2021-07-01  1:48 ` Andrew Morton
  2021-07-01  1:49 ` [patch 036/192] mm/userfaultfd: fix uffd-wp special cases for fork() Andrew Morton
                   ` (157 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:48 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: mm/thp: simplify copying of huge zero page pmd when fork

Patch series "mm/uffd: Misc fix for uffd-wp and one more test".

This series tries to fix some corner-case bugs for uffd-wp on either thp
or fork().  Then it introduces a new test with pagemap/pageout.

Patch layout:

Patch 1:    cleanup for THP, it'll slightly simplify the follow up patches
Patch 2-4:  misc fixes for uffd-wp here and there; please refer to each patch
Patch 5:    add pagemap support for uffd-wp
Patch 6:    add pagemap/pageout test for uffd-wp

The last test introduced can also verify some of the fixes in the previous
patches, as the test will fail without them.  However, it's not easy to
verify all the changes in patches 2-4; hopefully they can still be
properly reviewed.

Note that, considering the ongoing uffd-wp shmem & hugetlbfs work, patch 5
is incomplete as it is missing e.g. the hugetlbfs part and the special
swap pte detection.  However that's not needed in this series, and since
that work is still under review, this series does not depend on it (the
last test only runs with anonymous memory, not file-backed), so this
series can be merged even before it.


This patch (of 6):

The huge zero page is handled in a special path in copy_huge_pmd(); however,
it should share most code with a normal thp page.  Try to share more code
with it by removing the special path.  The only leftover so far is the
huge zero page refcounting (mm_get_huge_zero_page()), because that is done
separately with a global counter.

This prepares for a future patch to modify the huge pmd to be installed,
so that we don't need to duplicate it explicitly into the huge zero page
case too.

Link: https://lkml.kernel.org/r/20210428225030.9708-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210428225030.9708-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mike Kravetz <mike.kravetz@oracle.com>, peterx@redhat.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/mm/huge_memory.c~mm-thp-simplify-copying-of-huge-zero-page-pmd-when-fork
+++ a/mm/huge_memory.c
@@ -1088,17 +1088,13 @@ int copy_huge_pmd(struct mm_struct *dst_
 	 * a page table.
 	 */
 	if (is_huge_zero_pmd(pmd)) {
-		struct page *zero_page;
 		/*
 		 * get_huge_zero_page() will never allocate a new page here,
 		 * since we already have a zero page to copy. It just takes a
 		 * reference.
 		 */
-		zero_page = mm_get_huge_zero_page(dst_mm);
-		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
-				zero_page);
-		ret = 0;
-		goto out_unlock;
+		mm_get_huge_zero_page(dst_mm);
+		goto out_zero_page;
 	}
 
 	src_page = pmd_page(pmd);
@@ -1122,6 +1118,7 @@ int copy_huge_pmd(struct mm_struct *dst_
 	get_page(src_page);
 	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+out_zero_page:
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 036/192] mm/userfaultfd: fix uffd-wp special cases for fork()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (34 preceding siblings ...)
  2021-07-01  1:48 ` [patch 035/192] mm/thp: simplify copying of huge zero page pmd when fork Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 037/192] mm/userfaultfd: fail uffd-wp registration if not supported Andrew Morton
                   ` (156 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: mm/userfaultfd: fix uffd-wp special cases for fork()

We tried to do something similar in b569a1760782 ("userfaultfd: wp: drop
_PAGE_UFFD_WP properly when fork") previously, but it didn't get everything
right.  A few fixes around the code path:

1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather
   than the new vma.  That was overlooked in b569a1760782, so it didn't
   work as expected.  Thanks to the recent rework of the fork code
   (7a4830c380f3a8b3), we can easily get the new vma now, so switch the
   checks to that.

2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
   huge pmd is a migration huge pmd.  When it happens, instead of using
   pmd_uffd_wp(), we should use pmd_swp_uffd_wp().  The fix is simply to
   handle them separately.

3. We forgot to carry over the uffd-wp bit for a write migration huge
   pmd entry.  This also happens in copy_huge_pmd(), where we convert a
   write huge migration entry into a read one.

4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.

5. In copy_present_page(), when COW is enforced during fork(), we also
   need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new
   vma, and when the pte to be copied has the uffd-wp bit set.

Remove the comment in copy_present_pte() about this.  Commenting only
there wouldn't help much, and commenting everywhere would be overkill;
let's assume the commit message will suffice.
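
The rule all these fixes enforce, distilled into a sketch (illustrative
pseudocode built from the real helpers, not the literal diff below):

	/*
	 * The uffd-wp bit may travel with a copied entry only if the
	 * *destination* vma has uffd-wp armed; otherwise drop it.  Present
	 * ptes and swap ptes need their respective helpers.
	 */
	if (!userfaultfd_wp(dst_vma)) {
		if (is_swap_pte(pte))
			pte = pte_swp_clear_uffd_wp(pte);
		else
			pte = pte_clear_uffd_wp(pte);
	}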

[peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
  Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
Fixes: b569a1760782f ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/huge_mm.h |    2 +-
 include/linux/swapops.h |    2 ++
 mm/huge_memory.c        |   27 ++++++++++++++-------------
 mm/memory.c             |   25 +++++++++++++------------
 4 files changed, 30 insertions(+), 26 deletions(-)

--- a/include/linux/huge_mm.h~mm-userfaultfd-fix-uffd-wp-special-cases-for-fork
+++ a/include/linux/huge_mm.h
@@ -10,7 +10,7 @@
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
-		  struct vm_area_struct *vma);
+		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
--- a/include/linux/swapops.h~mm-userfaultfd-fix-uffd-wp-special-cases-for-fork
+++ a/include/linux/swapops.h
@@ -265,6 +265,8 @@ static inline swp_entry_t pmd_to_swp_ent
 
 	if (pmd_swp_soft_dirty(pmd))
 		pmd = pmd_swp_clear_soft_dirty(pmd);
+	if (pmd_swp_uffd_wp(pmd))
+		pmd = pmd_swp_clear_uffd_wp(pmd);
 	arch_entry = __pmd_to_swp_entry(pmd);
 	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
 }
--- a/mm/huge_memory.c~mm-userfaultfd-fix-uffd-wp-special-cases-for-fork
+++ a/mm/huge_memory.c
@@ -1026,7 +1026,7 @@ struct page *follow_devmap_pmd(struct vm
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
-		  struct vm_area_struct *vma)
+		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	spinlock_t *dst_ptl, *src_ptl;
 	struct page *src_page;
@@ -1035,7 +1035,7 @@ int copy_huge_pmd(struct mm_struct *dst_
 	int ret = -ENOMEM;
 
 	/* Skip if can be re-fill on fault */
-	if (!vma_is_anonymous(vma))
+	if (!vma_is_anonymous(dst_vma))
 		return 0;
 
 	pgtable = pte_alloc_one(dst_mm);
@@ -1049,14 +1049,6 @@ int copy_huge_pmd(struct mm_struct *dst_
 	ret = -EAGAIN;
 	pmd = *src_pmd;
 
-	/*
-	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
-	 * does not have the VM_UFFD_WP, which means that the uffd
-	 * fork event is not enabled.
-	 */
-	if (!(vma->vm_flags & VM_UFFD_WP))
-		pmd = pmd_clear_uffd_wp(pmd);
-
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
@@ -1067,11 +1059,15 @@ int copy_huge_pmd(struct mm_struct *dst_
 			pmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
+			if (pmd_swp_uffd_wp(*src_pmd))
+				pmd = pmd_swp_mkuffd_wp(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
 		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+		if (!userfaultfd_wp(dst_vma))
+			pmd = pmd_swp_clear_uffd_wp(pmd);
 		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 		ret = 0;
 		goto out_unlock;
@@ -1107,11 +1103,11 @@ int copy_huge_pmd(struct mm_struct *dst_
 	 * best effort that the pinned pages won't be replaced by another
 	 * random page during the coming copy-on-write.
 	 */
-	if (unlikely(page_needs_cow_for_dma(vma, src_page))) {
+	if (unlikely(page_needs_cow_for_dma(src_vma, src_page))) {
 		pte_free(dst_mm, pgtable);
 		spin_unlock(src_ptl);
 		spin_unlock(dst_ptl);
-		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
+		__split_huge_pmd(src_vma, src_pmd, addr, false, NULL);
 		return -EAGAIN;
 	}
 
@@ -1121,8 +1117,9 @@ int copy_huge_pmd(struct mm_struct *dst_
 out_zero_page:
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	if (!userfaultfd_wp(dst_vma))
+		pmd = pmd_clear_uffd_wp(pmd);
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
@@ -1835,6 +1832,8 @@ int change_huge_pmd(struct vm_area_struc
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
+			if (pmd_swp_uffd_wp(*pmd))
+				newpmd = pmd_swp_mkuffd_wp(newpmd);
 			set_pmd_at(mm, addr, pmd, newpmd);
 		}
 		goto unlock;
@@ -3245,6 +3244,8 @@ void remove_migration_pmd(struct page_vm
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_write_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
+	if (pmd_swp_uffd_wp(*pvmw->pmd))
+		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
 
 	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
 	if (PageAnon(new))
--- a/mm/memory.c~mm-userfaultfd-fix-uffd-wp-special-cases-for-fork
+++ a/mm/memory.c
@@ -707,10 +707,10 @@ out:
 
 static unsigned long
 copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma, unsigned long addr, int *rss)
 {
-	unsigned long vm_flags = vma->vm_flags;
+	unsigned long vm_flags = dst_vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
 	swp_entry_t entry = pte_to_swp_entry(pte);
@@ -779,6 +779,8 @@ copy_nonpresent_pte(struct mm_struct *ds
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
 	}
+	if (!userfaultfd_wp(dst_vma))
+		pte = pte_swp_clear_uffd_wp(pte);
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return 0;
 }
@@ -844,6 +846,9 @@ copy_present_page(struct vm_area_struct
 	/* All done, just insert the new page copy in the child */
 	pte = mk_pte(new_page, dst_vma->vm_page_prot);
 	pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
+	if (userfaultfd_pte_wp(dst_vma, *src_pte))
+		/* Uffd-wp needs to be delivered to dest pte as well */
+		pte = pte_wrprotect(pte_mkuffd_wp(pte));
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
 	return 0;
 }
@@ -893,12 +898,7 @@ copy_present_pte(struct vm_area_struct *
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
-	/*
-	 * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA
-	 * does not have the VM_UFFD_WP, which means that the uffd
-	 * fork event is not enabled.
-	 */
-	if (!(vm_flags & VM_UFFD_WP))
+	if (!userfaultfd_wp(dst_vma))
 		pte = pte_clear_uffd_wp(pte);
 
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
@@ -973,7 +973,8 @@ again:
 		if (unlikely(!pte_present(*src_pte))) {
 			entry.val = copy_nonpresent_pte(dst_mm, src_mm,
 							dst_pte, src_pte,
-							src_vma, addr, rss);
+							dst_vma, src_vma,
+							addr, rss);
 			if (entry.val)
 				break;
 			progress += 8;
@@ -1050,8 +1051,8 @@ copy_pmd_range(struct vm_area_struct *ds
 			|| pmd_devmap(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
-			err = copy_huge_pmd(dst_mm, src_mm,
-					    dst_pmd, src_pmd, addr, src_vma);
+			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
+					    addr, dst_vma, src_vma);
 			if (err == -ENOMEM)
 				return -ENOMEM;
 			if (!err)
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 037/192] mm/userfaultfd: fail uffd-wp registration if not supported
  2021-07-01  1:46 incoming Andrew Morton
                   ` (35 preceding siblings ...)
  2021-07-01  1:49 ` [patch 036/192] mm/userfaultfd: fix uffd-wp special cases for fork() Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 038/192] mm/pagemap: export uffd-wp protection information Andrew Morton
                   ` (155 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: mm/userfaultfd: fail uffd-wp registration if not supported

We should fail uffd-wp registration immediately if the arch does not even
have CONFIG_HAVE_ARCH_USERFAULTFD_WP defined.  That also blocks the
relevant ioctls, e.g. UFFDIO_WRITEPROTECT, because they check against
VM_UFFD_WP, which can only be set by a successful registration.

Also remove the WP feature bit for those archs when handling the UFFDIO_API
ioctl.
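
For illustration, here is a minimal userspace probe (a sketch with error
handling trimmed; userfaultfd(2), UFFDIO_API and
UFFD_FEATURE_PAGEFAULT_FLAG_WP are the existing uAPI) showing how an
application can detect whether the running kernel advertises uffd-wp
after this change:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int main(void)
{
	/* Create a userfaultfd and negotiate the API. */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		return 1;
	/* Archs without HAVE_ARCH_USERFAULTFD_WP no longer set this bit. */
	printf("uffd-wp %ssupported\n",
	       (api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP) ? "" : "not ");
	close(uffd);
	return 0;
}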

Link: https://lkml.kernel.org/r/20210428225030.9708-5-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/userfaultfd.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/fs/userfaultfd.c~mm-userfaultfd-fail-uffd-wp-registeration-if-not-supported
+++ a/fs/userfaultfd.c
@@ -1304,8 +1304,12 @@ static int userfaultfd_register(struct u
 	vm_flags = 0;
 	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING)
 		vm_flags |= VM_UFFD_MISSING;
-	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)
+	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
+#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+		goto out;
+#endif
 		vm_flags |= VM_UFFD_WP;
+	}
 	if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR) {
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		goto out;
@@ -1943,6 +1947,9 @@ static int userfaultfd_api(struct userfa
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 	uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS;
 #endif
+#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+#endif
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
 	if (copy_to_user(buf, &uffdio_api, sizeof(uffdio_api)))
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 038/192] mm/pagemap: export uffd-wp protection information
  2021-07-01  1:46 incoming Andrew Morton
                   ` (36 preceding siblings ...)
  2021-07-01  1:49 ` [patch 037/192] mm/userfaultfd: fail uffd-wp registration if not supported Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 039/192] userfaultfd/selftests: add pagemap uffd-wp test Andrew Morton
                   ` (154 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: mm/pagemap: export uffd-wp protection information

Export the PTE/PMD status of uffd-wp to pagemap too.
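
For example, userspace can now check the bit with something like the
sketch below (assuming the layout documented in this patch, where bit 57
of each 64-bit pagemap entry is PM_UFFD_WP):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Return 1 if the pte backing vaddr is uffd-wp protected, 0 if not,
 * -1 on error.
 */
static int vaddr_is_uffd_wp(void *vaddr)
{
	uint64_t entry;
	long psize = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return -1;
	if (pread(fd, &entry, sizeof(entry),
		  ((uintptr_t)vaddr / psize) * sizeof(entry)) != sizeof(entry)) {
		close(fd);
		return -1;
	}
	close(fd);
	return !!(entry & (1ULL << 57));	/* PM_UFFD_WP */
}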

Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/pagemap.rst |    2 ++
 fs/proc/task_mmu.c                       |    9 +++++++++
 2 files changed, 11 insertions(+)

--- a/Documentation/admin-guide/mm/pagemap.rst~mm-pagemap-export-uffd-wp-protection-information
+++ a/Documentation/admin-guide/mm/pagemap.rst
@@ -21,6 +21,8 @@ There are four components to pagemap:
     * Bit  55    pte is soft-dirty (see
       :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
     * Bit  56    page exclusively mapped (since 4.2)
+    * Bit  57    pte is uffd-wp write-protected (since 5.13) (see
+      :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
     * Bits 57-60 zero
     * Bit  61    page is file-page or shared-anon (since 3.5)
     * Bit  62    page swapped
--- a/fs/proc/task_mmu.c~mm-pagemap-export-uffd-wp-protection-information
+++ a/fs/proc/task_mmu.c
@@ -1302,6 +1302,7 @@ struct pagemapread {
 #define PM_PFRAME_MASK		GENMASK_ULL(PM_PFRAME_BITS - 1, 0)
 #define PM_SOFT_DIRTY		BIT_ULL(55)
 #define PM_MMAP_EXCLUSIVE	BIT_ULL(56)
+#define PM_UFFD_WP		BIT_ULL(57)
 #define PM_FILE			BIT_ULL(61)
 #define PM_SWAP			BIT_ULL(62)
 #define PM_PRESENT		BIT_ULL(63)
@@ -1375,10 +1376,14 @@ static pagemap_entry_t pte_to_pagemap_en
 		page = vm_normal_page(vma, addr, pte);
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
+		if (pte_uffd_wp(pte))
+			flags |= PM_UFFD_WP;
 	} else if (is_swap_pte(pte)) {
 		swp_entry_t entry;
 		if (pte_swp_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
+		if (pte_swp_uffd_wp(pte))
+			flags |= PM_UFFD_WP;
 		entry = pte_to_swp_entry(pte);
 		if (pm->show_pfn)
 			frame = swp_type(entry) |
@@ -1426,6 +1431,8 @@ static int pagemap_pmd_range(pmd_t *pmdp
 			flags |= PM_PRESENT;
 			if (pmd_soft_dirty(pmd))
 				flags |= PM_SOFT_DIRTY;
+			if (pmd_uffd_wp(pmd))
+				flags |= PM_UFFD_WP;
 			if (pm->show_pfn)
 				frame = pmd_pfn(pmd) +
 					((addr & ~PMD_MASK) >> PAGE_SHIFT);
@@ -1444,6 +1451,8 @@ static int pagemap_pmd_range(pmd_t *pmdp
 			flags |= PM_SWAP;
 			if (pmd_swp_soft_dirty(pmd))
 				flags |= PM_SOFT_DIRTY;
+			if (pmd_swp_uffd_wp(pmd))
+				flags |= PM_UFFD_WP;
 			VM_BUG_ON(!is_pmd_migration_entry(pmd));
 			page = migration_entry_to_page(entry);
 		}
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 039/192] userfaultfd/selftests: add pagemap uffd-wp test
  2021-07-01  1:46 incoming Andrew Morton
                   ` (37 preceding siblings ...)
  2021-07-01  1:49 ` [patch 038/192] mm/pagemap: export uffd-wp protection information Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 040/192] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte Andrew Morton
                   ` (153 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Peter Xu <peterx@redhat.com>
Subject: userfaultfd/selftests: add pagemap uffd-wp test

Add one anonymous-memory-specific test that starts using pagemap.  With
pagemap support, we can directly read the uffd-wp bit from the pgtable
without triggering any fault, so it's easier to do sanity checks in unit
tests.

Meanwhile this test also leverages the newly introduced MADV_PAGEOUT
madvise function to test swap ptes with the uffd-wp bit set, and across
fork()s.
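
(The wp_range() helper used by the test boils down to a
UFFDIO_WRITEPROTECT ioctl; roughly, as a sketch where uffd, addr, len
and enable come from the surrounding test:)

	struct uffdio_writeprotect wp = {
		.range = { .start = addr, .len = len },
		/* set the mode bit to protect, clear it to unprotect */
		.mode = enable ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
	};
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		err("UFFDIO_WRITEPROTECT failed");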

Link: https://lkml.kernel.org/r/20210428225030.9708-7-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |  154 +++++++++++++++++++++
 1 file changed, 154 insertions(+)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-add-pagemap-uffd-wp-test
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -1170,6 +1170,144 @@ static int userfaultfd_minor_test(void)
 	return stats.missing_faults != 0 || stats.minor_faults != nr_pages;
 }
 
+#define BIT_ULL(nr)                   (1ULL << (nr))
+#define PM_SOFT_DIRTY                 BIT_ULL(55)
+#define PM_MMAP_EXCLUSIVE             BIT_ULL(56)
+#define PM_UFFD_WP                    BIT_ULL(57)
+#define PM_FILE                       BIT_ULL(61)
+#define PM_SWAP                       BIT_ULL(62)
+#define PM_PRESENT                    BIT_ULL(63)
+
+static int pagemap_open(void)
+{
+	int fd = open("/proc/self/pagemap", O_RDONLY);
+
+	if (fd < 0)
+		err("open pagemap");
+
+	return fd;
+}
+
+static uint64_t pagemap_read_vaddr(int fd, void *vaddr)
+{
+	uint64_t value;
+	int ret;
+
+	ret = pread(fd, &value, sizeof(uint64_t),
+		    ((uint64_t)vaddr >> 12) * sizeof(uint64_t));
+	if (ret != sizeof(uint64_t))
+		err("pread() on pagemap failed");
+
+	return value;
+}
+
+/* This macro let __LINE__ works in err() */
+#define  pagemap_check_wp(value, wp) do {				\
+		if (!!(value & PM_UFFD_WP) != wp)			\
+			err("pagemap uffd-wp bit error: 0x%"PRIx64, value); \
+	} while (0)
+
+static int pagemap_test_fork(bool present)
+{
+	pid_t child = fork();
+	uint64_t value;
+	int fd, result;
+
+	if (!child) {
+		/* Open the pagemap fd of the child itself */
+		fd = pagemap_open();
+		value = pagemap_read_vaddr(fd, area_dst);
+		/*
+		 * After fork() uffd-wp bit should be gone as long as we're
+		 * without UFFD_FEATURE_EVENT_FORK
+		 */
+		pagemap_check_wp(value, false);
+		/* Succeed */
+		exit(0);
+	}
+	waitpid(child, &result, 0);
+	return result;
+}
+
+static void userfaultfd_pagemap_test(unsigned int test_pgsize)
+{
+	struct uffdio_register uffdio_register;
+	int pagemap_fd;
+	uint64_t value;
+
+	/* Pagemap tests uffd-wp only */
+	if (!test_uffdio_wp)
+		return;
+
+	/* Not enough memory to test this page size */
+	if (test_pgsize > nr_pages * page_size)
+		return;
+
+	printf("testing uffd-wp with pagemap (pgsize=%u): ", test_pgsize);
+	/* Flush so it doesn't flush twice in parent/child later */
+	fflush(stdout);
+
+	uffd_test_ops->release_pages(area_dst);
+
+	if (test_pgsize > page_size) {
+		/* This is a thp test */
+		if (madvise(area_dst, nr_pages * page_size, MADV_HUGEPAGE))
+			err("madvise(MADV_HUGEPAGE) failed");
+	} else if (test_pgsize == page_size) {
+		/* This is normal page test; force no thp */
+		if (madvise(area_dst, nr_pages * page_size, MADV_NOHUGEPAGE))
+			err("madvise(MADV_NOHUGEPAGE) failed");
+	}
+
+	if (userfaultfd_open(0))
+		err("userfaultfd_open");
+
+	uffdio_register.range.start = (unsigned long) area_dst;
+	uffdio_register.range.len = nr_pages * page_size;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
+		err("register failed");
+
+	pagemap_fd = pagemap_open();
+
+	/* Touch the page */
+	*area_dst = 1;
+	wp_range(uffd, (uint64_t)area_dst, test_pgsize, true);
+	value = pagemap_read_vaddr(pagemap_fd, area_dst);
+	pagemap_check_wp(value, true);
+	/* Make sure uffd-wp bit dropped when fork */
+	if (pagemap_test_fork(true))
+		err("Detected stall uffd-wp bit in child");
+
+	/* Exclusive required or PAGEOUT won't work */
+	if (!(value & PM_MMAP_EXCLUSIVE))
+		err("multiple mapping detected: 0x%"PRIx64, value);
+
+	if (madvise(area_dst, test_pgsize, MADV_PAGEOUT))
+		err("madvise(MADV_PAGEOUT) failed");
+
+	/* Uffd-wp should persist even swapped out */
+	value = pagemap_read_vaddr(pagemap_fd, area_dst);
+	pagemap_check_wp(value, true);
+	/* Make sure uffd-wp bit dropped when fork */
+	if (pagemap_test_fork(false))
+		err("Detected stall uffd-wp bit in child");
+
+	/* Unprotect; this tests swap pte modifications */
+	wp_range(uffd, (uint64_t)area_dst, page_size, false);
+	value = pagemap_read_vaddr(pagemap_fd, area_dst);
+	pagemap_check_wp(value, false);
+
+	/* Fault in the page from disk */
+	*area_dst = 2;
+	value = pagemap_read_vaddr(pagemap_fd, area_dst);
+	pagemap_check_wp(value, false);
+
+	close(pagemap_fd);
+	close(uffd);
+	printf("done\n");
+}
+
 static int userfaultfd_stress(void)
 {
 	void *area;
@@ -1341,6 +1479,22 @@ static int userfaultfd_stress(void)
 	}
 
 	close(uffd);
+
+	if (test_type == TEST_ANON) {
+		/*
+		 * shmem/hugetlb won't be able to run since they have different
+		 * behavior on fork() (file-backed memory normally drops ptes
+		 * directly when fork), meanwhile the pagemap test will verify
+		 * pgtable entry of fork()ed child.
+		 */
+		userfaultfd_pagemap_test(page_size);
+		/*
+		 * Hard-code for x86_64 for now for 2M THP, as x86_64 is
+		 * currently the only one that supports uffd-wp
+		 */
+		userfaultfd_pagemap_test(page_size * 512);
+	}
+
 	return userfaultfd_zeropage_test() || userfaultfd_sig_test()
 		|| userfaultfd_events_test() || userfaultfd_minor_test();
 }
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 040/192] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte
  2021-07-01  1:46 incoming Andrew Morton
                   ` (38 preceding siblings ...)
  2021-07-01  1:49 ` [patch 039/192] userfaultfd/selftests: add pagemap uffd-wp test Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 041/192] userfaultfd/shmem: support minor fault registration for shmem Andrew Morton
                   ` (152 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte

Patch series "userfaultfd: add minor fault handling for shmem", v6.

Overview
========

See the series which added minor faults for hugetlbfs [3] for a detailed
overview of minor fault handling in general.  This series adds the same
support for shmem-backed areas.

This series is structured as follows:

- Commits 1 and 2 are cleanups.
- Commits 3 and 4 implement the new feature (minor fault handling for shmem).
- Commit 5 advertises that the feature is now available since at this point it's
  fully implemented.
- Commit 6 is a final cleanup, modifying an existing code path to re-use a new
  helper we've introduced.
- Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature.

Use Case
========

In some cases it is useful to have VM memory backed by tmpfs instead of
hugetlbfs.  So, this feature will be used to support the same VM live
migration use case described in my original series.

Additionally, Android folks (Lokesh Gidra <lokeshgidra@google.com>) hope
to optimize the Android Runtime garbage collector using this feature:

"The plan is to use userfaultfd for concurrently compacting the heap. 
With this feature, the heap can be shared-mapped at another location where
the GC-thread(s) could continue the compaction operation without the need
to invoke userfault ioctl(UFFDIO_COPY) each time.  OTOH, if and when Java
threads get faults on the heap, UFFDIO_CONTINUE can be used to resume
execution.  Furthermore, this feature enables updating references in the
'non-moving' portion of the heap efficiently.  Without this feature,
unnecessary page copying (ioctl(UFFDIO_COPY)) would be required."

[1] https://lore.kernel.org/patchwork/cover/1388144/
[2] https://lore.kernel.org/patchwork/patch/1408161/
[3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t


This patch (of 9):

Previously, we did a dance where we had one calling path in userfaultfd.c
(mfill_atomic_pte), but then we split it into two in shmem_fs.h
(shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined them into a
single shared function in shmem.c (shmem_mfill_atomic_pte).

This is all a bit overly complex.  Just call the single combined shmem
function directly, allowing us to clean up various branches, boilerplate,
etc.

While we're touching this function, two other small cleanup changes:
- offset is equivalent to pgoff, so we can get rid of offset entirely.
- Split two VM_BUG_ON cases into two statements. This means the line
  number reported when the BUG is hit specifies exactly which condition
  was true.

Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/shmem_fs.h |   19 +++++--------
 mm/shmem.c               |   52 +++++++++++--------------------------
 mm/userfaultfd.c         |   10 ++-----
 3 files changed, 27 insertions(+), 54 deletions(-)

--- a/include/linux/shmem_fs.h~userfaultfd-shmem-combine-shmem_mcopy_atomicmfill_zeropage_pte
+++ a/include/linux/shmem_fs.h
@@ -122,21 +122,18 @@ static inline bool shmem_file(struct fil
 extern bool shmem_charge(struct inode *inode, long pages);
 extern void shmem_uncharge(struct inode *inode, long pages);
 
+#ifdef CONFIG_USERFAULTFD
 #ifdef CONFIG_SHMEM
-extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				  struct vm_area_struct *dst_vma,
 				  unsigned long dst_addr,
 				  unsigned long src_addr,
+				  bool zeropage,
 				  struct page **pagep);
-extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
-				    pmd_t *dst_pmd,
-				    struct vm_area_struct *dst_vma,
-				    unsigned long dst_addr);
-#else
-#define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \
-			       src_addr, pagep)        ({ BUG(); 0; })
-#define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \
-				 dst_addr)      ({ BUG(); 0; })
-#endif
+#else /* !CONFIG_SHMEM */
+#define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
+			       src_addr, zeropage, pagep)       ({ BUG(); 0; })
+#endif /* CONFIG_SHMEM */
+#endif /* CONFIG_USERFAULTFD */
 
 #endif
--- a/mm/shmem.c~userfaultfd-shmem-combine-shmem_mcopy_atomicmfill_zeropage_pte
+++ a/mm/shmem.c
@@ -2352,13 +2352,14 @@ static struct inode *shmem_get_inode(str
 	return inode;
 }
 
-static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
-				  pmd_t *dst_pmd,
-				  struct vm_area_struct *dst_vma,
-				  unsigned long dst_addr,
-				  unsigned long src_addr,
-				  bool zeropage,
-				  struct page **pagep)
+#ifdef CONFIG_USERFAULTFD
+int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
+			   pmd_t *dst_pmd,
+			   struct vm_area_struct *dst_vma,
+			   unsigned long dst_addr,
+			   unsigned long src_addr,
+			   bool zeropage,
+			   struct page **pagep)
 {
 	struct inode *inode = file_inode(dst_vma->vm_file);
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -2370,7 +2371,7 @@ static int shmem_mfill_atomic_pte(struct
 	struct page *page;
 	pte_t _dst_pte, *dst_pte;
 	int ret;
-	pgoff_t offset, max_off;
+	pgoff_t max_off;
 
 	ret = -ENOMEM;
 	if (!shmem_inode_acct_block(inode, 1)) {
@@ -2391,7 +2392,7 @@ static int shmem_mfill_atomic_pte(struct
 		if (!page)
 			goto out_unacct_blocks;
 
-		if (!zeropage) {	/* mcopy_atomic */
+		if (!zeropage) {	/* COPY */
 			page_kaddr = kmap_atomic(page);
 			ret = copy_from_user(page_kaddr,
 					     (const void __user *)src_addr,
@@ -2405,7 +2406,7 @@ static int shmem_mfill_atomic_pte(struct
 				/* don't free the page */
 				return -ENOENT;
 			}
-		} else {		/* mfill_zeropage_atomic */
+		} else {		/* ZEROPAGE */
 			clear_highpage(page);
 		}
 	} else {
@@ -2413,15 +2414,15 @@ static int shmem_mfill_atomic_pte(struct
 		*pagep = NULL;
 	}
 
-	VM_BUG_ON(PageLocked(page) || PageSwapBacked(page));
+	VM_BUG_ON(PageLocked(page));
+	VM_BUG_ON(PageSwapBacked(page));
 	__SetPageLocked(page);
 	__SetPageSwapBacked(page);
 	__SetPageUptodate(page);
 
 	ret = -EFAULT;
-	offset = linear_page_index(dst_vma, dst_addr);
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(offset >= max_off))
+	if (unlikely(pgoff >= max_off))
 		goto out_release;
 
 	ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
@@ -2447,7 +2448,7 @@ static int shmem_mfill_atomic_pte(struct
 
 	ret = -EFAULT;
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(offset >= max_off))
+	if (unlikely(pgoff >= max_off))
 		goto out_release_unlock;
 
 	ret = -EEXIST;
@@ -2484,28 +2485,7 @@ out_unacct_blocks:
 	shmem_inode_unacct_blocks(inode, 1);
 	goto out;
 }
-
-int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
-			   pmd_t *dst_pmd,
-			   struct vm_area_struct *dst_vma,
-			   unsigned long dst_addr,
-			   unsigned long src_addr,
-			   struct page **pagep)
-{
-	return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
-				      dst_addr, src_addr, false, pagep);
-}
-
-int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
-			     pmd_t *dst_pmd,
-			     struct vm_area_struct *dst_vma,
-			     unsigned long dst_addr)
-{
-	struct page *page = NULL;
-
-	return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
-				      dst_addr, 0, true, &page);
-}
+#endif /* CONFIG_USERFAULTFD */
 
 #ifdef CONFIG_TMPFS
 static const struct inode_operations shmem_symlink_inode_operations;
--- a/mm/userfaultfd.c~userfaultfd-shmem-combine-shmem_mcopy_atomicmfill_zeropage_pte
+++ a/mm/userfaultfd.c
@@ -392,13 +392,9 @@ static __always_inline ssize_t mfill_ato
 						 dst_vma, dst_addr);
 	} else {
 		VM_WARN_ON_ONCE(wp_copy);
-		if (!zeropage)
-			err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd,
-						     dst_vma, dst_addr,
-						     src_addr, page);
-		else
-			err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd,
-						       dst_vma, dst_addr);
+		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
+					     dst_addr, src_addr, zeropage,
+					     page);
 	}
 
 	return err;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 041/192] userfaultfd/shmem: support minor fault registration for shmem
  2021-07-01  1:46 incoming Andrew Morton
                   ` (39 preceding siblings ...)
  2021-07-01  1:49 ` [patch 040/192] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 042/192] userfaultfd/shmem: support UFFDIO_CONTINUE " Andrew Morton
                   ` (151 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/shmem: support minor fault registration for shmem

This patch allows shmem-backed VMAs to be registered for minor faults. 
Minor faults are appropriately relayed to userspace in the fault path for
VMAs with the relevant flag.

This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
minor faults, though, so userspace doesn't yet have a way to resolve such
faults.

Because of this, we also don't yet advertise this as a supported feature. 
That will be done in a separate commit when the feature is fully
implemented.
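
As a sketch of what registration looks like from userspace once this
lands (a hypothetical helper; uffd is an already-opened userfaultfd and
area points into a shmem-backed mapping):

#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Register a shmem-backed range for minor-fault interception. */
static int register_minor(int uffd, void *area, unsigned long len)
{
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};

	/* Before this patch, the kernel rejected this for shmem vmas. */
	return ioctl(uffd, UFFDIO_REGISTER, &reg);
}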

Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/userfaultfd.c |    3 +--
 mm/memory.c      |    8 +++++---
 mm/shmem.c       |   12 +++++++++++-
 3 files changed, 17 insertions(+), 6 deletions(-)

--- a/fs/userfaultfd.c~userfaultfd-shmem-support-minor-fault-registration-for-shmem
+++ a/fs/userfaultfd.c
@@ -1267,8 +1267,7 @@ static inline bool vma_can_userfault(str
 	}
 
 	if (vm_flags & VM_UFFD_MINOR) {
-		/* FIXME: Add minor fault interception for shmem. */
-		if (!is_vm_hugetlb_page(vma))
+		if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
 			return false;
 	}
 
--- a/mm/memory.c~userfaultfd-shmem-support-minor-fault-registration-for-shmem
+++ a/mm/memory.c
@@ -4026,9 +4026,11 @@ static vm_fault_t do_read_fault(struct v
 	 * something).
 	 */
 	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		ret = do_fault_around(vmf);
-		if (ret)
-			return ret;
+		if (likely(!userfaultfd_minor(vmf->vma))) {
+			ret = do_fault_around(vmf);
+			if (ret)
+				return ret;
+		}
 	}
 
 	ret = __do_fault(vmf);
--- a/mm/shmem.c~userfaultfd-shmem-support-minor-fault-registration-for-shmem
+++ a/mm/shmem.c
@@ -1797,7 +1797,7 @@ unlock:
  * vm. If we swap it in we mark it dirty since we also free the swap
  * entry since a page cannot live in both the swap and page cache.
  *
- * vmf and fault_type are only supplied by shmem_fault:
+ * vma, vmf, and fault_type are only supplied by shmem_fault:
  * otherwise they are NULL.
  */
 static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
@@ -1832,6 +1832,16 @@ repeat:
 
 	page = pagecache_get_page(mapping, index,
 					FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
+
+	if (page && vma && userfaultfd_minor(vma)) {
+		if (!xa_is_value(page)) {
+			unlock_page(page);
+			put_page(page);
+		}
+		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
+		return 0;
+	}
+
 	if (xa_is_value(page)) {
 		error = shmem_swapin_page(inode, index, &page,
 					  sgp, gfp, vma, fault_type);
_

* [patch 042/192] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem
  2021-07-01  1:46 incoming Andrew Morton
                   ` (40 preceding siblings ...)
  2021-07-01  1:49 ` [patch 041/192] userfaultfd/shmem: support minor fault registration for shmem Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 043/192] userfaultfd/shmem: advertise shmem minor fault support Andrew Morton
                   ` (150 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

With this change, userspace can resolve a minor fault within a
shmem-backed area with a UFFDIO_CONTINUE ioctl.  The semantics for this
match those for hugetlbfs - we look up the existing page in the page
cache, and install a PTE for it.

This commit introduces a new helper: mfill_atomic_install_pte.

Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
shmem.c?  The existing userfault implementation only relies on shmem.c for
VM_SHARED VMAs.  However, minor fault handling / CONTINUE work just fine
for !VM_SHARED VMAs as well.  We'd prefer to handle CONTINUE for shmem in
one place, regardless of shared/private (to reduce code duplication).

Why add a new mfill_atomic_install_pte helper?  A problem we have with
continue is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are
*close* to what we want, but not exactly.  We do want to set up the PTEs
in a CONTINUE operation, but we don't want to e.g. allocate a new page,
charge it (e.g. to the shmem inode), manipulate various flags, etc.  Also,
we have the problem stated above: shmem_mfill_atomic_pte() and
mcopy_atomic_pte() each handle only one half of the cases (shared vs.
private) that CONTINUE cares about.  So, introduce mcontinue_atomic_pte()
to handle all of the shmem CONTINUE cases, and introduce the
mfill_atomic_install_pte() helper so it doesn't duplicate code with
mcopy_atomic_pte().

In a future commit, shmem_mfill_atomic_pte() will also be modified to use
this new helper.  However, since this is a bigger refactor, it seems most
clear to do it as a separate change.
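
For context, once this lands, a monitor thread resolving a shmem minor
fault would do something like the following (sketch only; "uffd",
"fault_addr" and "page_size" are assumed to come from the fault message
and test setup, and handle_error() is a placeholder):

    #include <sys/ioctl.h>
    #include <linux/userfaultfd.h>

    /* The page cache already holds up-to-date contents for the faulting
     * range; just ask the kernel to install PTEs pointing at it.
     */
    struct uffdio_continue req = {
        .range = { .start = fault_addr & ~(page_size - 1),
                   .len = page_size },
        .mode = 0,
    };

    if (ioctl(uffd, UFFDIO_CONTINUE, &req))
        /* on failure, req.mapped holds a negative errno, e.g. -EEXIST */
        handle_error(req.mapped);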

Link: https://lkml.kernel.org/r/20210503180737.2487560-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/userfaultfd.c |  172 +++++++++++++++++++++++++++++++++------------
 1 file changed, 127 insertions(+), 45 deletions(-)

--- a/mm/userfaultfd.c~userfaultfd-shmem-support-uffdio_continue-for-shmem
+++ a/mm/userfaultfd.c
@@ -48,6 +48,83 @@ struct vm_area_struct *find_dst_vma(stru
 	return dst_vma;
 }
 
+/*
+ * Install PTEs, to map dst_addr (within dst_vma) to page.
+ *
+ * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
+ * whether or not dst_vma is VM_SHARED. It also handles the more general
+ * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
+ * backed, or not).
+ *
+ * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
+ * shmem_mcopy_atomic_pte instead.
+ */
+static int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+				    struct vm_area_struct *dst_vma,
+				    unsigned long dst_addr, struct page *page,
+				    bool newly_allocated, bool wp_copy)
+{
+	int ret;
+	pte_t _dst_pte, *dst_pte;
+	bool writable = dst_vma->vm_flags & VM_WRITE;
+	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
+	bool page_in_cache = page->mapping;
+	spinlock_t *ptl;
+	struct inode *inode;
+	pgoff_t offset, max_off;
+
+	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	if (page_in_cache && !vm_shared)
+		writable = false;
+	if (writable || !page_in_cache)
+		_dst_pte = pte_mkdirty(_dst_pte);
+	if (writable) {
+		if (wp_copy)
+			_dst_pte = pte_mkuffd_wp(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	}
+
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	if (vma_is_shmem(dst_vma)) {
+		/* serialize against truncate with the page table lock */
+		inode = dst_vma->vm_file->f_inode;
+		offset = linear_page_index(dst_vma, dst_addr);
+		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+		ret = -EFAULT;
+		if (unlikely(offset >= max_off))
+			goto out_unlock;
+	}
+
+	ret = -EEXIST;
+	if (!pte_none(*dst_pte))
+		goto out_unlock;
+
+	if (page_in_cache)
+		page_add_file_rmap(page, false);
+	else
+		page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+
+	/*
+	 * Must happen after rmap, as mm_counter() checks mapping (via
+	 * PageAnon()), which is set by __page_set_anon_rmap().
+	 */
+	inc_mm_counter(dst_mm, mm_counter(page));
+
+	if (newly_allocated)
+		lru_cache_add_inactive_or_unevictable(page, dst_vma);
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
 static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    pmd_t *dst_pmd,
 			    struct vm_area_struct *dst_vma,
@@ -56,13 +133,9 @@ static int mcopy_atomic_pte(struct mm_st
 			    struct page **pagep,
 			    bool wp_copy)
 {
-	pte_t _dst_pte, *dst_pte;
-	spinlock_t *ptl;
 	void *page_kaddr;
 	int ret;
 	struct page *page;
-	pgoff_t offset, max_off;
-	struct inode *inode;
 
 	if (!*pagep) {
 		ret = -ENOMEM;
@@ -99,43 +172,12 @@ static int mcopy_atomic_pte(struct mm_st
 	if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL))
 		goto out_release;
 
-	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
-	if (dst_vma->vm_flags & VM_WRITE) {
-		if (wp_copy)
-			_dst_pte = pte_mkuffd_wp(_dst_pte);
-		else
-			_dst_pte = pte_mkwrite(_dst_pte);
-	}
-
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
-	if (dst_vma->vm_file) {
-		/* the shmem MAP_PRIVATE case requires checking the i_size */
-		inode = dst_vma->vm_file->f_inode;
-		offset = linear_page_index(dst_vma, dst_addr);
-		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-		ret = -EFAULT;
-		if (unlikely(offset >= max_off))
-			goto out_release_uncharge_unlock;
-	}
-	ret = -EEXIST;
-	if (!pte_none(*dst_pte))
-		goto out_release_uncharge_unlock;
-
-	inc_mm_counter(dst_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
-	lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
-	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
-
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(dst_vma, dst_addr, dst_pte);
-
-	pte_unmap_unlock(dst_pte, ptl);
-	ret = 0;
+	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+				       page, true, wp_copy);
+	if (ret)
+		goto out_release;
 out:
 	return ret;
-out_release_uncharge_unlock:
-	pte_unmap_unlock(dst_pte, ptl);
 out_release:
 	put_page(page);
 	goto out;
@@ -176,6 +218,41 @@ out_unlock:
 	return ret;
 }
 
+/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
+static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
+				pmd_t *dst_pmd,
+				struct vm_area_struct *dst_vma,
+				unsigned long dst_addr,
+				bool wp_copy)
+{
+	struct inode *inode = file_inode(dst_vma->vm_file);
+	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
+	if (ret)
+		goto out;
+	if (!page) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+				       page, false, wp_copy);
+	if (ret)
+		goto out_release;
+
+	unlock_page(page);
+	ret = 0;
+out:
+	return ret;
+out_release:
+	unlock_page(page);
+	put_page(page);
+	goto out;
+}
+
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
@@ -367,11 +444,16 @@ static __always_inline ssize_t mfill_ato
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						struct page **page,
-						bool zeropage,
+						enum mcopy_atomic_mode mode,
 						bool wp_copy)
 {
 	ssize_t err;
 
+	if (mode == MCOPY_ATOMIC_CONTINUE) {
+		return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+					    wp_copy);
+	}
+
 	/*
 	 * The normal page fault path for a shmem will invoke the
 	 * fault, fill the hole in the file and COW it right away. The
@@ -383,7 +465,7 @@ static __always_inline ssize_t mfill_ato
 	 * and not in the radix tree.
 	 */
 	if (!(dst_vma->vm_flags & VM_SHARED)) {
-		if (!zeropage)
+		if (mode == MCOPY_ATOMIC_NORMAL)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
 					       wp_copy);
@@ -393,7 +475,8 @@ static __always_inline ssize_t mfill_ato
 	} else {
 		VM_WARN_ON_ONCE(wp_copy);
 		err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					     dst_addr, src_addr, zeropage,
+					     dst_addr, src_addr,
+					     mode != MCOPY_ATOMIC_NORMAL,
 					     page);
 	}
 
@@ -415,7 +498,6 @@ static __always_inline ssize_t __mcopy_a
 	long copied;
 	struct page *page;
 	bool wp_copy;
-	bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);
 
 	/*
 	 * Sanitize the command parameters:
@@ -478,7 +560,7 @@ retry:
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
-	if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
+	if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE)
 		goto out_unlock;
 
 	/*
@@ -526,7 +608,7 @@ retry:
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, zeropage, wp_copy);
+				       src_addr, &page, mcopy_mode, wp_copy);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
_

* [patch 043/192] userfaultfd/shmem: advertise shmem minor fault support
  2021-07-01  1:46 incoming Andrew Morton
                   ` (41 preceding siblings ...)
  2021-07-01  1:49 ` [patch 042/192] userfaultfd/shmem: support UFFDIO_CONTINUE " Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 044/192] userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte() Andrew Morton
                   ` (149 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/shmem: advertise shmem minor fault support

Now that the feature is fully implemented (the faulting path hooks exist
so userspace is notified, and the ioctl to resolve such faults is
available), advertise this as a supported feature.
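
A userspace consumer can now probe for this at UFFDIO_API time before
relying on it, along these lines (illustrative sketch; requesting an
unsupported feature makes the ioctl fail, and the returned features mask
can be checked as well):

    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_MINOR_SHMEM,
    };

    if (ioctl(uffd, UFFDIO_API, &api) ||
        !(api.features & UFFD_FEATURE_MINOR_SHMEM)) {
        /* kernel too old, or feature not available: fall back */
    }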

Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/userfaultfd.rst |    3 ++-
 fs/userfaultfd.c                             |    3 ++-
 include/uapi/linux/userfaultfd.h             |    7 ++++++-
 3 files changed, 10 insertions(+), 3 deletions(-)

--- a/Documentation/admin-guide/mm/userfaultfd.rst~userfaultfd-shmem-advertise-shmem-minor-fault-support
+++ a/Documentation/admin-guide/mm/userfaultfd.rst
@@ -77,7 +77,8 @@ events, except page fault notifications,
 
 - ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
   ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
-  areas.
+  areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
+  support for shmem virtual memory areas.
 
 The userland application should set the feature flags it intends to use
 when invoking the ``UFFDIO_API`` ioctl, to request that those features be
--- a/fs/userfaultfd.c~userfaultfd-shmem-advertise-shmem-minor-fault-support
+++ a/fs/userfaultfd.c
@@ -1944,7 +1944,8 @@ static int userfaultfd_api(struct userfa
 	/* report all available features and ioctls to userland */
 	uffdio_api.features = UFFD_API_FEATURES;
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
-	uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS;
+	uffdio_api.features &=
+		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
 #endif
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
--- a/include/uapi/linux/userfaultfd.h~userfaultfd-shmem-advertise-shmem-minor-fault-support
+++ a/include/uapi/linux/userfaultfd.h
@@ -31,7 +31,8 @@
 			   UFFD_FEATURE_MISSING_SHMEM |		\
 			   UFFD_FEATURE_SIGBUS |		\
 			   UFFD_FEATURE_THREAD_ID |		\
-			   UFFD_FEATURE_MINOR_HUGETLBFS)
+			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
+			   UFFD_FEATURE_MINOR_SHMEM)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -185,6 +186,9 @@ struct uffdio_api {
 	 * UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults
 	 * can be intercepted (via REGISTER_MODE_MINOR) for
 	 * hugetlbfs-backed pages.
+	 *
+	 * UFFD_FEATURE_MINOR_SHMEM indicates the same support as
+	 * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -196,6 +200,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_SIGBUS			(1<<7)
 #define UFFD_FEATURE_THREAD_ID			(1<<8)
 #define UFFD_FEATURE_MINOR_HUGETLBFS		(1<<9)
+#define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 	__u64 features;
 
 	__u64 ioctls;
_

* [patch 044/192] userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (42 preceding siblings ...)
  2021-07-01  1:49 ` [patch 043/192] userfaultfd/shmem: advertise shmem minor fault support Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 045/192] userfaultfd/selftests: use memfd_create for shmem test type Andrew Morton
                   ` (148 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte()

In a previous commit, we added the mfill_atomic_install_pte() helper. 
This helper does the job of setting up PTEs for an existing page, to map
it into a given VMA.  It deals with both the anon and shmem cases, as well
as the shared and private cases.

In other words, shmem_mfill_atomic_pte() duplicates a case the new helper
already handles.  So, expose the helper, and let shmem_mfill_atomic_pte()
use it directly, to reduce code duplication.

This requires that we refactor shmem_mfill_atomic_pte() a bit:

Instead of doing accounting (shmem_recalc_inode() et al) part-way through
the PTE setup, do it afterward.  This frees up mfill_atomic_install_pte()
from having to care about this accounting, and means we don't need to e.g.
shmem_uncharge() in the error path.

A side effect is that this switches shmem_mfill_atomic_pte() to use
lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add().
This wrapper does some extra accounting in the exceptional case where the
page turns out to be unevictable (e.g. mlocked), so it's actually the more
correct thing to use.

Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/userfaultfd_k.h |    5 ++
 mm/shmem.c                    |   58 ++++++--------------------------
 mm/userfaultfd.c              |   17 +++------
 3 files changed, 23 insertions(+), 57 deletions(-)

--- a/include/linux/userfaultfd_k.h~userfaultfd-shmem-modify-shmem_mfill_atomic_pte-to-use-install_pte
+++ a/include/linux/userfaultfd_k.h
@@ -53,6 +53,11 @@ enum mcopy_atomic_mode {
 	MCOPY_ATOMIC_CONTINUE,
 };
 
+extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+				    struct vm_area_struct *dst_vma,
+				    unsigned long dst_addr, struct page *page,
+				    bool newly_allocated, bool wp_copy);
+
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len,
 			    bool *mmap_changing, __u64 mode);
--- a/mm/shmem.c~userfaultfd-shmem-modify-shmem_mfill_atomic_pte-to-use-install_pte
+++ a/mm/shmem.c
@@ -2376,14 +2376,11 @@ int shmem_mfill_atomic_pte(struct mm_str
 	struct address_space *mapping = inode->i_mapping;
 	gfp_t gfp = mapping_gfp_mask(mapping);
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
-	spinlock_t *ptl;
 	void *page_kaddr;
 	struct page *page;
-	pte_t _dst_pte, *dst_pte;
 	int ret;
 	pgoff_t max_off;
 
-	ret = -ENOMEM;
 	if (!shmem_inode_acct_block(inode, 1)) {
 		/*
 		 * We may have got a page, returned -ENOENT triggering a retry,
@@ -2394,10 +2391,11 @@ int shmem_mfill_atomic_pte(struct mm_str
 			put_page(*pagep);
 			*pagep = NULL;
 		}
-		goto out;
+		return -ENOMEM;
 	}
 
 	if (!*pagep) {
+		ret = -ENOMEM;
 		page = shmem_alloc_page(gfp, info, pgoff);
 		if (!page)
 			goto out_unacct_blocks;
@@ -2412,9 +2410,9 @@ int shmem_mfill_atomic_pte(struct mm_str
 			/* fallback to copy_from_user outside mmap_lock */
 			if (unlikely(ret)) {
 				*pagep = page;
-				shmem_inode_unacct_blocks(inode, 1);
+				ret = -ENOENT;
 				/* don't free the page */
-				return -ENOENT;
+				goto out_unacct_blocks;
 			}
 		} else {		/* ZEROPAGE */
 			clear_highpage(page);
@@ -2440,32 +2438,10 @@ int shmem_mfill_atomic_pte(struct mm_str
 	if (ret)
 		goto out_release;
 
-	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	if (dst_vma->vm_flags & VM_WRITE)
-		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
-	else {
-		/*
-		 * We don't set the pte dirty if the vma has no
-		 * VM_WRITE permission, so mark the page dirty or it
-		 * could be freed from under us. We could do it
-		 * unconditionally before unlock_page(), but doing it
-		 * only if VM_WRITE is not set is faster.
-		 */
-		set_page_dirty(page);
-	}
-
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
-
-	ret = -EFAULT;
-	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(pgoff >= max_off))
-		goto out_release_unlock;
-
-	ret = -EEXIST;
-	if (!pte_none(*dst_pte))
-		goto out_release_unlock;
-
-	lru_cache_add(page);
+	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+				       page, true, false);
+	if (ret)
+		goto out_delete_from_cache;
 
 	spin_lock_irq(&info->lock);
 	info->alloced++;
@@ -2473,27 +2449,17 @@ int shmem_mfill_atomic_pte(struct mm_str
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 
-	inc_mm_counter(dst_mm, mm_counter_file(page));
-	page_add_file_rmap(page, false);
-	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
-
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(dst_vma, dst_addr, dst_pte);
-	pte_unmap_unlock(dst_pte, ptl);
+	SetPageDirty(page);
 	unlock_page(page);
-	ret = 0;
-out:
-	return ret;
-out_release_unlock:
-	pte_unmap_unlock(dst_pte, ptl);
-	ClearPageDirty(page);
+	return 0;
+out_delete_from_cache:
 	delete_from_page_cache(page);
 out_release:
 	unlock_page(page);
 	put_page(page);
 out_unacct_blocks:
 	shmem_inode_unacct_blocks(inode, 1);
-	goto out;
+	return ret;
 }
 #endif /* CONFIG_USERFAULTFD */
 
--- a/mm/userfaultfd.c~userfaultfd-shmem-modify-shmem_mfill_atomic_pte-to-use-install_pte
+++ a/mm/userfaultfd.c
@@ -51,18 +51,13 @@ struct vm_area_struct *find_dst_vma(stru
 /*
  * Install PTEs, to map dst_addr (within dst_vma) to page.
  *
- * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
- * whether or not dst_vma is VM_SHARED. It also handles the more general
- * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
- * backed, or not).
- *
- * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
- * shmem_mcopy_atomic_pte instead.
+ * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem
+ * and anon, and for both shared and private VMAs.
  */
-static int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
-				    struct vm_area_struct *dst_vma,
-				    unsigned long dst_addr, struct page *page,
-				    bool newly_allocated, bool wp_copy)
+int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+			     struct vm_area_struct *dst_vma,
+			     unsigned long dst_addr, struct page *page,
+			     bool newly_allocated, bool wp_copy)
 {
 	int ret;
 	pte_t _dst_pte, *dst_pte;
_

* [patch 045/192] userfaultfd/selftests: use memfd_create for shmem test type
  2021-07-01  1:46 incoming Andrew Morton
                   ` (43 preceding siblings ...)
  2021-07-01  1:49 ` [patch 044/192] userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte() Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 046/192] userfaultfd/selftests: create alias mappings in the shmem test Andrew Morton
                   ` (147 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/selftests: use memfd_create for shmem test type

This is a preparatory commit.  In the future, we want to be able to set up
alias mappings for area_src and area_dst in the shmem test, like we do in
the hugetlb_shared test.  With a VMA obtained via mmap(MAP_ANONYMOUS |
MAP_SHARED), it isn't clear how to do this.

So, mmap() with an fd, which lets us create alias mappings.  Use memfd_create
instead of actually passing in a tmpfs path like hugetlb does, since it's
more convenient / simpler to run, and works just as well.

Future commits will:

1. Setup the alias mappings.
2. Extend our tests to actually take advantage of this, to test new
   userfaultfd behavior being introduced in this series.

Also, a small fix in the area we're changing: when the hugetlb setup fails
in main(), pass in the right argv[] so we actually print out the hugetlb
file path.
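
The resulting setup is roughly equivalent to this standalone sketch
("nr_pages" and "page_size" stand in for the test globals; error handling
and the hole punch are omitted):

    #define _GNU_SOURCE         /* for memfd_create() in glibc */
    #include <sys/mman.h>
    #include <unistd.h>

    int shm_fd = memfd_create("uffd-test", 0);
    size_t area_size = nr_pages * page_size;

    ftruncate(shm_fd, area_size * 2);   /* src half + dst half */

    /* Unlike MAP_ANONYMOUS | MAP_SHARED, an fd-backed mapping can be
     * mmap()ed again later at the same offset to create an alias VMA.
     */
    char *area_src = mmap(NULL, area_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, shm_fd, 0);
    char *area_dst = mmap(NULL, area_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, shm_fd, area_size);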

Link: https://lkml.kernel.org/r/20210503180737.2487560-8-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-use-memfd_create-for-shmem-test-type
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -85,6 +85,7 @@ static bool test_uffdio_wp = false;
 static bool test_uffdio_minor = false;
 
 static bool map_shared;
+static int shm_fd;
 static int huge_fd;
 static char *huge_fd_off0;
 static unsigned long long *count_verify;
@@ -277,8 +278,11 @@ static void shmem_release_pages(char *re
 
 static void shmem_allocate_area(void **alloc_area)
 {
+	unsigned long offset =
+		alloc_area == (void **)&area_src ? 0 : nr_pages * page_size;
+
 	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-			   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+			   MAP_SHARED, shm_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of memfd failed");
 }
@@ -1602,6 +1606,16 @@ int main(int argc, char **argv)
 			err("Open of %s failed", argv[4]);
 		if (ftruncate(huge_fd, 0))
 			err("ftruncate %s to size 0 failed", argv[4]);
+	} else if (test_type == TEST_SHMEM) {
+		shm_fd = memfd_create(argv[0], 0);
+		if (shm_fd < 0)
+			err("memfd_create");
+		if (ftruncate(shm_fd, nr_pages * page_size * 2))
+			err("ftruncate");
+		if (fallocate(shm_fd,
+			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
+			      nr_pages * page_size * 2))
+			err("fallocate");
 	}
 	printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
 	       nr_pages, nr_pages_per_cpu);
_

* [patch 046/192] userfaultfd/selftests: create alias mappings in the shmem test
  2021-07-01  1:46 incoming Andrew Morton
                   ` (44 preceding siblings ...)
  2021-07-01  1:49 ` [patch 045/192] userfaultfd/selftests: use memfd_create for shmem test type Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 047/192] userfaultfd/selftests: reinitialize test context in each test Andrew Morton
                   ` (146 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/selftests: create alias mappings in the shmem test

Previously, we just allocated two shm areas: area_src and area_dst.  With
this commit, change this so we also allocate area_src_alias and
area_dst_alias.

area_*_alias and area_* (respectively) point to the same underlying
physical pages, but are different VMAs.  In a future commit in this
series, we'll leverage this setup to exercise minor fault handling support
for shmem, just like we do in the hugetlb_shared test.
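
In isolation, the effect of an alias mapping is (sketch; "fd" is the memfd
from the previous patch, "len" an assumed length):

    /* Two VMAs backed by the same memfd offset => same physical pages. */
    char *area  = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    char *alias = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

    alias[0] = 42;              /* a write through one VMA...        */
    assert(area[0] == 42);      /* ...is visible through the other.  */

This is what lets a test populate page cache contents through the alias
while the registered VMA still has no PTEs - exactly the minor fault
scenario.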

Link: https://lkml.kernel.org/r/20210503180737.2487560-9-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   22 ++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-create-alias-mappings-in-the-shmem-test
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -278,13 +278,29 @@ static void shmem_release_pages(char *re
 
 static void shmem_allocate_area(void **alloc_area)
 {
-	unsigned long offset =
-		alloc_area == (void **)&area_src ? 0 : nr_pages * page_size;
+	void *area_alias = NULL;
+	bool is_src = alloc_area == (void **)&area_src;
+	unsigned long offset = is_src ? 0 : nr_pages * page_size;
 
 	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
 			   MAP_SHARED, shm_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of memfd failed");
+
+	area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
+			  MAP_SHARED, shm_fd, offset);
+	if (area_alias == MAP_FAILED)
+		err("mmap of memfd alias failed");
+
+	if (is_src)
+		area_src_alias = area_alias;
+	else
+		area_dst_alias = area_alias;
+}
+
+static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset)
+{
+	*start = (unsigned long)area_dst_alias + offset;
 }
 
 struct uffd_test_ops {
@@ -314,7 +330,7 @@ static struct uffd_test_ops shmem_uffd_t
 	.expected_ioctls = SHMEM_EXPECTED_IOCTLS,
 	.allocate_area	= shmem_allocate_area,
 	.release_pages	= shmem_release_pages,
-	.alias_mapping = noop_alias_mapping,
+	.alias_mapping = shmem_alias_mapping,
 };
 
 static struct uffd_test_ops hugetlb_uffd_test_ops = {
_

* [patch 047/192] userfaultfd/selftests: reinitialize test context in each test
  2021-07-01  1:46 incoming Andrew Morton
                   ` (45 preceding siblings ...)
  2021-07-01  1:49 ` [patch 046/192] userfaultfd/selftests: create alias mappings in the shmem test Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 048/192] userfaultfd/selftests: exercise minor fault handling shmem support Andrew Morton
                   ` (145 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/selftests: reinitialize test context in each test

Currently, the context (fds, mmap-ed areas, etc.) is global.  Each test
mutates this state in some way, in some cases really "clobbering it"
(e.g., the events test mremap-ing area_dst over the top of area_src, or
the minor faults tests overwriting the count_verify values in the test
areas).  We run the tests in a particular order, and each test is careful to
make the right assumptions about its starting state, etc.

But, this is fragile.  It's better for a test's success or failure to not
depend on what some other prior test case did to the global state.

To that end, clear and reinitialize the test context at the start of each
test case, so whatever prior test cases did doesn't affect future tests.

This is particularly relevant to this series because the events test's
mremap of area_dst screws up assumptions the minor fault test was relying
on.  This wasn't a problem for hugetlb, as we don't mremap in that case.

[peterx@redhat.com: fix conflict between this patch and the uffd pagemap series]
  Link: https://lkml.kernel.org/r/YKQqKrl+/cQ1utrb@t490s
Link: https://lkml.kernel.org/r/20210503180737.2487560-10-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |  222 +++++++++++----------
 1 file changed, 117 insertions(+), 105 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-reinitialize-test-context-in-each-test
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -89,7 +89,8 @@ static int shm_fd;
 static int huge_fd;
 static char *huge_fd_off0;
 static unsigned long long *count_verify;
-static int uffd, uffd_flags, finished, *pipefd;
+static int uffd = -1;
+static int uffd_flags, finished, *pipefd;
 static char *area_src, *area_src_alias, *area_dst, *area_dst_alias;
 static char *zeropage;
 pthread_attr_t attr;
@@ -342,6 +343,111 @@ static struct uffd_test_ops hugetlb_uffd
 
 static struct uffd_test_ops *uffd_test_ops;
 
+static void userfaultfd_open(uint64_t *features)
+{
+	struct uffdio_api uffdio_api;
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
+	if (uffd < 0)
+		err("userfaultfd syscall not available in this kernel");
+	uffd_flags = fcntl(uffd, F_GETFD, NULL);
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = *features;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api))
+		err("UFFDIO_API failed.\nPlease make sure to "
+		    "run with either root or ptrace capability.");
+	if (uffdio_api.api != UFFD_API)
+		err("UFFDIO_API error: %" PRIu64, (uint64_t)uffdio_api.api);
+
+	*features = uffdio_api.features;
+}
+
+static inline void munmap_area(void **area)
+{
+	if (*area)
+		if (munmap(*area, nr_pages * page_size))
+			err("munmap");
+
+	*area = NULL;
+}
+
+static void uffd_test_ctx_clear(void)
+{
+	size_t i;
+
+	if (pipefd) {
+		for (i = 0; i < nr_cpus * 2; ++i) {
+			if (close(pipefd[i]))
+				err("close pipefd");
+		}
+		free(pipefd);
+		pipefd = NULL;
+	}
+
+	if (count_verify) {
+		free(count_verify);
+		count_verify = NULL;
+	}
+
+	if (uffd != -1) {
+		if (close(uffd))
+			err("close uffd");
+		uffd = -1;
+	}
+
+	huge_fd_off0 = NULL;
+	munmap_area((void **)&area_src);
+	munmap_area((void **)&area_src_alias);
+	munmap_area((void **)&area_dst);
+	munmap_area((void **)&area_dst_alias);
+}
+
+static void uffd_test_ctx_init_ext(uint64_t *features)
+{
+	unsigned long nr, cpu;
+
+	uffd_test_ctx_clear();
+
+	uffd_test_ops->allocate_area((void **)&area_src);
+	uffd_test_ops->allocate_area((void **)&area_dst);
+
+	uffd_test_ops->release_pages(area_src);
+	uffd_test_ops->release_pages(area_dst);
+
+	userfaultfd_open(features);
+
+	count_verify = malloc(nr_pages * sizeof(unsigned long long));
+	if (!count_verify)
+		err("count_verify");
+
+	for (nr = 0; nr < nr_pages; nr++) {
+		*area_mutex(area_src, nr) =
+			(pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER;
+		count_verify[nr] = *area_count(area_src, nr) = 1;
+		/*
+		 * In the transition between 255 to 256, powerpc will
+		 * read out of order in my_bcmp and see both bytes as
+		 * zero, so leave a placeholder below always non-zero
+		 * after the count, to avoid my_bcmp to trigger false
+		 * positives.
+		 */
+		*(area_count(area_src, nr) + 1) = 1;
+	}
+
+	pipefd = malloc(sizeof(int) * nr_cpus * 2);
+	if (!pipefd)
+		err("pipefd");
+	for (cpu = 0; cpu < nr_cpus; cpu++)
+		if (pipe2(&pipefd[cpu * 2], O_CLOEXEC | O_NONBLOCK))
+			err("pipe");
+}
+
+static inline void uffd_test_ctx_init(uint64_t features)
+{
+	uffd_test_ctx_init_ext(&features);
+}
+
 static int my_bcmp(char *str1, char *str2, size_t n)
 {
 	unsigned long i;
@@ -726,40 +832,6 @@ static int stress(struct uffd_stats *uff
 	return 0;
 }
 
-static int userfaultfd_open_ext(uint64_t *features)
-{
-	struct uffdio_api uffdio_api;
-
-	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
-	if (uffd < 0) {
-		fprintf(stderr,
-			"userfaultfd syscall not available in this kernel\n");
-		return 1;
-	}
-	uffd_flags = fcntl(uffd, F_GETFD, NULL);
-
-	uffdio_api.api = UFFD_API;
-	uffdio_api.features = *features;
-	if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {
-		fprintf(stderr, "UFFDIO_API failed.\nPlease make sure to "
-			"run with either root or ptrace capability.\n");
-		return 1;
-	}
-	if (uffdio_api.api != UFFD_API) {
-		fprintf(stderr, "UFFDIO_API error: %" PRIu64 "\n",
-			(uint64_t)uffdio_api.api);
-		return 1;
-	}
-
-	*features = uffdio_api.features;
-	return 0;
-}
-
-static int userfaultfd_open(uint64_t features)
-{
-	return userfaultfd_open_ext(&features);
-}
-
 sigjmp_buf jbuf, *sigbuf;
 
 static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
@@ -868,6 +940,8 @@ static int faulting_process(int signal_t
 			  MREMAP_MAYMOVE | MREMAP_FIXED, area_src);
 	if (area_dst == MAP_FAILED)
 		err("mremap");
+	/* Reset area_src since we just clobbered it */
+	area_src = NULL;
 
 	for (; nr < nr_pages; nr++) {
 		count = *area_count(area_dst, nr);
@@ -961,10 +1035,8 @@ static int userfaultfd_zeropage_test(voi
 	printf("testing UFFDIO_ZEROPAGE: ");
 	fflush(stdout);
 
-	uffd_test_ops->release_pages(area_dst);
+	uffd_test_ctx_init(0);
 
-	if (userfaultfd_open(0))
-		return 1;
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
@@ -981,7 +1053,6 @@ static int userfaultfd_zeropage_test(voi
 		if (my_bcmp(area_dst, zeropage, page_size))
 			err("zeropage is not zero");
 
-	close(uffd);
 	printf("done.\n");
 	return 0;
 }
@@ -999,12 +1070,10 @@ static int userfaultfd_events_test(void)
 	printf("testing events (fork, remap, remove): ");
 	fflush(stdout);
 
-	uffd_test_ops->release_pages(area_dst);
-
 	features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |
 		UFFD_FEATURE_EVENT_REMOVE;
-	if (userfaultfd_open(features))
-		return 1;
+	uffd_test_ctx_init(features);
+
 	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
 
 	uffdio_register.range.start = (unsigned long) area_dst;
@@ -1037,8 +1106,6 @@ static int userfaultfd_events_test(void)
 	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
-	close(uffd);
-
 	uffd_stats_report(&stats, 1);
 
 	return stats.missing_faults != nr_pages;
@@ -1058,11 +1125,9 @@ static int userfaultfd_sig_test(void)
 	printf("testing signal delivery: ");
 	fflush(stdout);
 
-	uffd_test_ops->release_pages(area_dst);
-
 	features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS;
-	if (userfaultfd_open(features))
-		return 1;
+	uffd_test_ctx_init(features);
+
 	fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
 
 	uffdio_register.range.start = (unsigned long) area_dst;
@@ -1103,7 +1168,6 @@ static int userfaultfd_sig_test(void)
 	printf("done.\n");
 	if (userfaults)
 		err("Signal test failed, userfaults: %ld", userfaults);
-	close(uffd);
 
 	return userfaults != 0;
 }
@@ -1126,10 +1190,7 @@ static int userfaultfd_minor_test(void)
 	printf("testing minor faults: ");
 	fflush(stdout);
 
-	uffd_test_ops->release_pages(area_dst);
-
-	if (userfaultfd_open_ext(&features))
-		return 1;
+	uffd_test_ctx_init_ext(&features);
 	/* If kernel reports the feature isn't supported, skip the test. */
 	if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) {
 		printf("skipping test due to lack of feature support\n");
@@ -1183,8 +1244,6 @@ static int userfaultfd_minor_test(void)
 	if (pthread_join(uffd_mon, NULL))
 		return 1;
 
-	close(uffd);
-
 	uffd_stats_report(&stats, 1);
 
 	return stats.missing_faults != 0 || stats.minor_faults != nr_pages;
@@ -1267,7 +1326,7 @@ static void userfaultfd_pagemap_test(uns
 	/* Flush so it doesn't flush twice in parent/child later */
 	fflush(stdout);
 
-	uffd_test_ops->release_pages(area_dst);
+	uffd_test_ctx_init(0);
 
 	if (test_pgsize > page_size) {
 		/* This is a thp test */
@@ -1279,9 +1338,6 @@ static void userfaultfd_pagemap_test(uns
 			err("madvise(MADV_NOHUGEPAGE) failed");
 	}
 
-	if (userfaultfd_open(0))
-		err("userfaultfd_open");
-
 	uffdio_register.range.start = (unsigned long) area_dst;
 	uffdio_register.range.len = nr_pages * page_size;
 	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
@@ -1324,7 +1380,6 @@ static void userfaultfd_pagemap_test(uns
 	pagemap_check_wp(value, false);
 
 	close(pagemap_fd);
-	close(uffd);
 	printf("done\n");
 }
 
@@ -1334,50 +1389,9 @@ static int userfaultfd_stress(void)
 	char *tmp_area;
 	unsigned long nr;
 	struct uffdio_register uffdio_register;
-	unsigned long cpu;
 	struct uffd_stats uffd_stats[nr_cpus];
 
-	uffd_test_ops->allocate_area((void **)&area_src);
-	if (!area_src)
-		return 1;
-	uffd_test_ops->allocate_area((void **)&area_dst);
-	if (!area_dst)
-		return 1;
-
-	if (userfaultfd_open(0))
-		return 1;
-
-	count_verify = malloc(nr_pages * sizeof(unsigned long long));
-	if (!count_verify) {
-		perror("count_verify");
-		return 1;
-	}
-
-	for (nr = 0; nr < nr_pages; nr++) {
-		*area_mutex(area_src, nr) = (pthread_mutex_t)
-			PTHREAD_MUTEX_INITIALIZER;
-		count_verify[nr] = *area_count(area_src, nr) = 1;
-		/*
-		 * In the transition between 255 to 256, powerpc will
-		 * read out of order in my_bcmp and see both bytes as
-		 * zero, so leave a placeholder below always non-zero
-		 * after the count, to avoid my_bcmp to trigger false
-		 * positives.
-		 */
-		*(area_count(area_src, nr) + 1) = 1;
-	}
-
-	pipefd = malloc(sizeof(int) * nr_cpus * 2);
-	if (!pipefd) {
-		perror("pipefd");
-		return 1;
-	}
-	for (cpu = 0; cpu < nr_cpus; cpu++) {
-		if (pipe2(&pipefd[cpu*2], O_CLOEXEC | O_NONBLOCK)) {
-			perror("pipe");
-			return 1;
-		}
-	}
+	uffd_test_ctx_init(0);
 
 	if (posix_memalign(&area, page_size, page_size))
 		err("out of memory");
@@ -1498,8 +1512,6 @@ static int userfaultfd_stress(void)
 		uffd_stats_report(uffd_stats, nr_cpus);
 	}
 
-	close(uffd);
-
 	if (test_type == TEST_ANON) {
 		/*
 		 * shmem/hugetlb won't be able to run since they have different
_

* [patch 048/192] userfaultfd/selftests: exercise minor fault handling shmem support
  2021-07-01  1:46 incoming Andrew Morton
                   ` (46 preceding siblings ...)
  2021-07-01  1:49 ` [patch 047/192] userfaultfd/selftests: reinitialize test context in each test Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 049/192] mm/vmscan.c: fix potential deadlock in reclaim_pages() Andrew Morton
                   ` (144 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: aarcange, akpm, almasrymina, axelrasmussen, bgeffon, dgilbert,
	hughd, jglisse, joe, kirill, linux-mm, lokeshgidra, mike.kravetz,
	mm-commits, oupton, peterx, rppt, sfr, shli, shuah, torvalds,
	viro, wangqing

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/selftests: exercise minor fault handling shmem support

Enable test_uffdio_minor for test_type == TEST_SHMEM, and modify the test
slightly to pass in / check for the right feature flags.

Link: https://lkml.kernel.org/r/20210503180737.2487560-11-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   29 ++++++++++++++++++---
 1 file changed, 25 insertions(+), 4 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-exercise-minor-fault-handling-shmem-support
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -474,6 +474,7 @@ static void wp_range(int ufd, __u64 star
 static void continue_range(int ufd, __u64 start, __u64 len)
 {
 	struct uffdio_continue req;
+	int ret;
 
 	req.range.start = start;
 	req.range.len = len;
@@ -482,6 +483,17 @@ static void continue_range(int ufd, __u6
 	if (ioctl(ufd, UFFDIO_CONTINUE, &req))
 		err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
 		    (uint64_t)start);
+
+	/*
+	 * Error handling within the kernel for continue is subtly different
+	 * from copy or zeropage, so it may be a source of bugs. Trigger an
+	 * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG.
+	 */
+	req.mapped = 0;
+	ret = ioctl(ufd, UFFDIO_CONTINUE, &req);
+	if (ret >= 0 || req.mapped != -EEXIST)
+		err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64,
+		    ret, (int64_t) req.mapped);
 }
 
 static void *locking_thread(void *arg)
@@ -1182,7 +1194,7 @@ static int userfaultfd_minor_test(void)
 	void *expected_page;
 	char c;
 	struct uffd_stats stats = { 0 };
-	uint64_t features = UFFD_FEATURE_MINOR_HUGETLBFS;
+	uint64_t req_features, features_out;
 
 	if (!test_uffdio_minor)
 		return 0;
@@ -1190,9 +1202,17 @@ static int userfaultfd_minor_test(void)
 	printf("testing minor faults: ");
 	fflush(stdout);
 
-	uffd_test_ctx_init_ext(&features);
-	/* If kernel reports the feature isn't supported, skip the test. */
-	if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) {
+	if (test_type == TEST_HUGETLB)
+		req_features = UFFD_FEATURE_MINOR_HUGETLBFS;
+	else if (test_type == TEST_SHMEM)
+		req_features = UFFD_FEATURE_MINOR_SHMEM;
+	else
+		return 1;
+
+	features_out = req_features;
+	uffd_test_ctx_init_ext(&features_out);
+	/* If kernel reports required features aren't supported, skip test. */
+	if ((features_out & req_features) != req_features) {
 		printf("skipping test due to lack of feature support\n");
 		fflush(stdout);
 		return 0;
@@ -1575,6 +1595,7 @@ static void set_test_type(const char *ty
 		map_shared = true;
 		test_type = TEST_SHMEM;
 		uffd_test_ops = &shmem_uffd_test_ops;
+		test_uffdio_minor = true;
 	} else {
 		err("Unknown test type: %s", type);
 	}
_

* [patch 049/192] mm/vmscan.c: fix potential deadlock in reclaim_pages()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (47 preceding siblings ...)
  2021-07-01  1:49 ` [patch 048/192] userfaultfd/selftests: exercise minor fault handling shmem support Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 050/192] include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low Andrew Morton
                   ` (143 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: akpm, linux-mm, minchan, mm-commits, torvalds, yuzhao

From: Yu Zhao <yuzhao@google.com>
Subject: mm/vmscan.c: fix potential deadlock in reclaim_pages()

Theoretically, without the protection of memalloc_noreclaim_save() and
memalloc_noreclaim_restore(), reclaim_pages() can recurse into the block
I/O layer and deadlock.

Querying 'reclaim_pages' in our kernel crash databases didn't yield
any results. So the deadlock seems unlikely to happen. A possible
explanation is that the only user of reclaim_pages(), i.e.,
MADV_PAGEOUT, is usually called before memory pressure builds up,
e.g., on Android and Chrome OS. Under such a condition, allocations in
the block I/O layer can be fulfilled without diverting to direct
reclaim and therefore the recursion is avoided.
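
The guard being added is the standard kernel pattern (a sketch of the
kernel-internal API from <linux/sched/mm.h>):

    unsigned int noreclaim_flag;

    noreclaim_flag = memalloc_noreclaim_save();  /* sets PF_MEMALLOC */
    /*
     * Work here may allocate, e.g. while submitting I/O in the block
     * layer; with PF_MEMALLOC set, those allocations won't recurse
     * into direct reclaim.
     */
    memalloc_noreclaim_restore(noreclaim_flag);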

Link: https://lkml.kernel.org/r/20210622074642.785473-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20210614194727.2684053-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

--- a/mm/vmscan.c~mm-vmscanc-fix-potential-deadlock-in-reclaim_pages
+++ a/mm/vmscan.c
@@ -1701,6 +1701,7 @@ unsigned int reclaim_clean_pages_from_li
 	unsigned int nr_reclaimed;
 	struct page *page, *next;
 	LIST_HEAD(clean_pages);
+	unsigned int noreclaim_flag;
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
 		if (!PageHuge(page) && page_is_file_lru(page) &&
@@ -1711,8 +1712,17 @@ unsigned int reclaim_clean_pages_from_li
 		}
 	}
 
+	/*
+	 * We should be safe here since we are only dealing with file pages and
+	 * we are not kswapd and therefore cannot write dirty file pages. But
+	 * call memalloc_noreclaim_save() anyway, just in case these conditions
+	 * change in the future.
+	 */
+	noreclaim_flag = memalloc_noreclaim_save();
 	nr_reclaimed = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 					&stat, true);
+	memalloc_noreclaim_restore(noreclaim_flag);
+
 	list_splice(&clean_pages, page_list);
 	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE,
 			    -(long)nr_reclaimed);
@@ -2306,6 +2316,7 @@ unsigned long reclaim_pages(struct list_
 	LIST_HEAD(node_page_list);
 	struct reclaim_stat dummy_stat;
 	struct page *page;
+	unsigned int noreclaim_flag;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.priority = DEF_PRIORITY,
@@ -2314,6 +2325,8 @@ unsigned long reclaim_pages(struct list_
 		.may_swap = 1,
 	};
 
+	noreclaim_flag = memalloc_noreclaim_save();
+
 	while (!list_empty(page_list)) {
 		page = lru_to_page(page_list);
 		if (nid == NUMA_NO_NODE) {
@@ -2350,6 +2363,8 @@ unsigned long reclaim_pages(struct list_
 		}
 	}
 
+	memalloc_noreclaim_restore(noreclaim_flag);
+
 	return nr_reclaimed;
 }
 
_

* [patch 050/192] include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low
  2021-07-01  1:46 incoming Andrew Morton
                   ` (48 preceding siblings ...)
  2021-07-01  1:49 ` [patch 049/192] mm/vmscan.c: fix potential deadlock in reclaim_pages() Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 051/192] mm: workingset: define macro WORKINGSET_SHIFT Andrew Morton
                   ` (142 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: akpm, hannes, linux-mm, mm-commits, torvalds, yuzhao

From: Yu Zhao <yuzhao@google.com>
Subject: include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low

mm_vmscan_inactive_list_is_low has no users after commit b91ac374346b
("mm: vmscan: enforce inactive:active ratio at the reclaim root").

Remove it.

Link: https://lkml.kernel.org/r/20210614194554.2683395-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/trace/events/vmscan.h |   41 --------------------------------
 1 file changed, 41 deletions(-)

--- a/include/trace/events/vmscan.h~include-trace-events-vmscanh-remove-mm_vmscan_inactive_list_is_low
+++ a/include/trace/events/vmscan.h
@@ -423,47 +423,6 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active,
 		show_reclaim_flags(__entry->reclaim_flags))
 );
 
-TRACE_EVENT(mm_vmscan_inactive_list_is_low,
-
-	TP_PROTO(int nid, int reclaim_idx,
-		unsigned long total_inactive, unsigned long inactive,
-		unsigned long total_active, unsigned long active,
-		unsigned long ratio, int file),
-
-	TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file),
-
-	TP_STRUCT__entry(
-		__field(int, nid)
-		__field(int, reclaim_idx)
-		__field(unsigned long, total_inactive)
-		__field(unsigned long, inactive)
-		__field(unsigned long, total_active)
-		__field(unsigned long, active)
-		__field(unsigned long, ratio)
-		__field(int, reclaim_flags)
-	),
-
-	TP_fast_assign(
-		__entry->nid = nid;
-		__entry->reclaim_idx = reclaim_idx;
-		__entry->total_inactive = total_inactive;
-		__entry->inactive = inactive;
-		__entry->total_active = total_active;
-		__entry->active = active;
-		__entry->ratio = ratio;
-		__entry->reclaim_flags = trace_reclaim_flags(file) &
-					 RECLAIM_WB_LRU;
-	),
-
-	TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s",
-		__entry->nid,
-		__entry->reclaim_idx,
-		__entry->total_inactive, __entry->inactive,
-		__entry->total_active, __entry->active,
-		__entry->ratio,
-		show_reclaim_flags(__entry->reclaim_flags))
-);
-
 TRACE_EVENT(mm_vmscan_node_reclaim_begin,
 
 	TP_PROTO(int nid, int order, gfp_t gfp_flags),
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 051/192] mm: workingset: define macro WORKINGSET_SHIFT
  2021-07-01  1:46 incoming Andrew Morton
                   ` (49 preceding siblings ...)
  2021-07-01  1:49 ` [patch 050/192] include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:49 ` [patch 052/192] mm/kconfig: move HOLES_IN_ZONE into mm Andrew Morton
                   ` (141 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: akpm, hannes, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm: workingset: define macro WORKINGSET_SHIFT

The magic number 1 is used in several places in workingset.c.  Define a
macro WORKINGSET_SHIFT for it to improve code readability.
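
For reference, pack_shadow() below composes the shadow entry from high to
low bits as:

	[ eviction | memcgid | nid | workingset ]

where the field widths are the eviction counter, MEM_CGROUP_ID_SHIFT,
NODES_SHIFT and WORKINGSET_SHIFT bits respectively, so the workingset flag
occupies the low WORKINGSET_SHIFT bit(s) that unpack_shadow() masks off
first.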

Link: https://lkml.kernel.org/r/20210624122307.1759342-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/workingset.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/mm/workingset.c~mm-workingset-define-macro-workingset_shift
+++ a/mm/workingset.c
@@ -168,8 +168,10 @@
  * refault distance will immediately activate the refaulting page.
  */
 
+#define WORKINGSET_SHIFT 1
 #define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) +	\
-			 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT)
+			 WORKINGSET_SHIFT + NODES_SHIFT + \
+			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -189,7 +191,7 @@ static void *pack_shadow(int memcgid, pg
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
-	eviction = (eviction << 1) | workingset;
+	eviction = (eviction << WORKINGSET_SHIFT) | workingset;
 
 	return xa_mk_value(eviction);
 }
@@ -201,8 +203,8 @@ static void unpack_shadow(void *shadow,
 	int memcgid, nid;
 	bool workingset;
 
-	workingset = entry & 1;
-	entry >>= 1;
+	workingset = entry & ((1UL << WORKINGSET_SHIFT) - 1);
+	entry >>= WORKINGSET_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
 	entry >>= NODES_SHIFT;
 	memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 052/192] mm/kconfig: move HOLES_IN_ZONE into mm
  2021-07-01  1:46 incoming Andrew Morton
                   ` (50 preceding siblings ...)
  2021-07-01  1:49 ` [patch 051/192] mm: workingset: define macro WORKINGSET_SHIFT Andrew Morton
@ 2021-07-01  1:49 ` Andrew Morton
  2021-07-01  1:50 ` [patch 053/192] docs: proc.rst: meminfo: briefly describe gaps in memory accounting Andrew Morton
                   ` (140 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:49 UTC (permalink / raw)
  To: akpm, catalin.marinas, linux-mm, mm-commits, torvalds, tsbogend,
	wangkefeng.wang, will

From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: mm/kconfig: move HOLES_IN_ZONE into mm

Commit a55749639dc1 ("ia64: drop marked broken DISCONTIGMEM and
VIRTUAL_MEM_MAP") dropped VIRTUAL_MEM_MAP, so there is no longer any need
for HOLES_IN_ZONE on ia64.

Also move HOLES_IN_ZONE into mm/Kconfig; an architecture that needs this
feature now selects it.

Link: https://lkml.kernel.org/r/20210417075946.181402-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/Kconfig |    4 +---
 arch/ia64/Kconfig  |    3 ---
 arch/mips/Kconfig  |    3 ---
 mm/Kconfig         |    3 +++
 4 files changed, 4 insertions(+), 9 deletions(-)

--- a/arch/arm64/Kconfig~mm-move-holes_in_zone-into-mm
+++ a/arch/arm64/Kconfig
@@ -201,6 +201,7 @@ config ARM64
 	select HAVE_KPROBES
 	select HAVE_KRETPROBES
 	select HAVE_GENERIC_VDSO
+	select HOLES_IN_ZONE
 	select IOMMU_DMA if IOMMU_SUPPORT
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
@@ -1052,9 +1053,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
 	def_bool y
 	depends on NUMA
 
-config HOLES_IN_ZONE
-	def_bool y
-
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
--- a/arch/ia64/Kconfig~mm-move-holes_in_zone-into-mm
+++ a/arch/ia64/Kconfig
@@ -308,9 +308,6 @@ config NODES_SHIFT
 	  MAX_NUMNODES will be 2^(This value).
 	  If in doubt, use the default.
 
-config HOLES_IN_ZONE
-	bool
-
 config HAVE_ARCH_NODEDATA_EXTENSION
 	def_bool y
 	depends on NUMA
--- a/arch/mips/Kconfig~mm-move-holes_in_zone-into-mm
+++ a/arch/mips/Kconfig
@@ -1233,9 +1233,6 @@ config HAVE_PLAT_MEMCPY
 config ISA_DMA_API
 	bool
 
-config HOLES_IN_ZONE
-	bool
-
 config SYS_SUPPORTS_RELOCATABLE
 	bool
 	help
--- a/mm/Kconfig~mm-move-holes_in_zone-into-mm
+++ a/mm/Kconfig
@@ -96,6 +96,9 @@ config HAVE_FAST_GUP
 	depends on MMU
 	bool
 
+config HOLES_IN_ZONE
+	bool
+
 # Don't discard allocated memory used to track "memory" and "reserved" memblocks
 # after early boot, so it can still be used to test for validity of memory.
 # Also, memblocks are updated with memory hot(un)plug.
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 053/192] docs: proc.rst: meminfo: briefly describe gaps in memory accounting
  2021-07-01  1:46 incoming Andrew Morton
                   ` (51 preceding siblings ...)
  2021-07-01  1:49 ` [patch 052/192] mm/kconfig: move HOLES_IN_ZONE into mm Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 054/192] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER Andrew Morton
                   ` (139 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: adobriyan, akpm, corbet, eric.dumazet, linux-mm, mhocko,
	mm-commits, rppt, torvalds, vbabka, willy

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: docs: proc.rst: meminfo: briefly describe gaps in memory accounting

Add a paragraph explaining that the counters in /proc/meminfo may not add
up to the overall memory usage.

Link: https://lkml.kernel.org/r/20210421061127.1182723-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/filesystems/proc.rst |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/Documentation/filesystems/proc.rst~docs-procrst-meminfo-briefly-describe-gaps-in-memory-accounting
+++ a/Documentation/filesystems/proc.rst
@@ -933,8 +933,15 @@ meminfo
 ~~~~~~~
 
 Provides information about distribution and utilization of memory.  This
-varies by architecture and compile options.  The following is from a
-16GB PIII, which has highmem enabled.  You may not have all of these fields.
+varies by architecture and compile options.  Some of the counters reported
+here overlap.  The memory reported by the non overlapping counters may not
+add up to the overall memory usage and the difference for some workloads
+can be substantial.  In many cases there are other means to find out
+additional memory using subsystem specific interfaces, for instance
+/proc/net/sockstat for TCP memory allocations.
+
+The following is from a 16GB PIII, which has highmem enabled.
+You may not have all of these fields.
 
 ::
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 054/192] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER
  2021-07-01  1:46 incoming Andrew Morton
                   ` (52 preceding siblings ...)
  2021-07-01  1:50 ` [patch 053/192] docs: proc.rst: meminfo: briefly describe gaps in memory accounting Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 055/192] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM Andrew Morton
                   ` (138 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: adobriyan, akpm, alex.shi, david, guro, haiyangz, jasowang,
	jbohac, kys, linux-mm, mhocko, mike.kravetz, mm-commits, mst,
	naoya.horiguchi, osalvador, rppt, steven.price, sthemmin,
	torvalds, wei.liu, willy, yaoaili

From: David Hildenbrand <david@redhat.com>
Subject: fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER

Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3.

Looking for places where the kernel might unconditionally read
PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
needs some more love to not touch some other pages we really don't want to
read -- i.e., hwpoisoned ones.

Examples for PageOffline() pages are pages inflated in a balloon, memory
unplugged via virtio-mem, and partially-present sections in memory added
by the Hyper-V balloon.

When reading pages inflated in a balloon, we essentially produce
unnecessary load in the hypervisor; holes in partially present sections in
case of Hyper-V are not accessible and already were a problem for
/proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages.  In
the future, virtio-mem might disallow reading unplugged memory -- marked
as PageOffline() -- in some environments, resulting in undefined behavior
when accessed; therefore, I'm trying to identify and rework all these
(corner) cases.

With this series, there is really only access via /dev/mem, /proc/vmcore
and kdb left after I ripped out /dev/kmem.  kdb is an advanced corner-case
use case -- we won't care for now if someone explicitly tries to do nasty
things by reading from/writing to physical addresses we better not touch. 
/dev/mem is a use case we won't support for virtio-mem, at least for now,
so we'll simply disallow mapping any virtio-mem memory via /dev/mem next. 
/proc/vmcore is really only a problem when dumping the old kernel via
something that's not makedumpfile (read: basically never), however, we'll
try sanitizing that as well in the second kernel in the future.

Tested via kcore_dump:
	https://github.com/schlafwandler/kcore_dump
	


This patch (of 6):

Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()")
removed the last user of KCORE_REMAP.

Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
dumping vsyscall user page") removed the last user of KCORE_OTHER.

Let's drop both types.  While at it, also drop vaddr in "struct
kcore_list", used by KCORE_REMAP only.

Link: https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210526093041.8800-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/kcore.c       |    7 ++-----
 include/linux/kcore.h |    3 ---
 2 files changed, 2 insertions(+), 8 deletions(-)

--- a/fs/proc/kcore.c~fs-proc-kcore-drop-kcore_remap-and-kcore_other
+++ a/fs/proc/kcore.c
@@ -380,11 +380,8 @@ read_kcore(struct file *file, char __use
 			phdr->p_type = PT_LOAD;
 			phdr->p_flags = PF_R | PF_W | PF_X;
 			phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
-			if (m->type == KCORE_REMAP)
-				phdr->p_vaddr = (size_t)m->vaddr;
-			else
-				phdr->p_vaddr = (size_t)m->addr;
-			if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
+			phdr->p_vaddr = (size_t)m->addr;
+			if (m->type == KCORE_RAM)
 				phdr->p_paddr = __pa(m->addr);
 			else if (m->type == KCORE_TEXT)
 				phdr->p_paddr = __pa_symbol(m->addr);
--- a/include/linux/kcore.h~fs-proc-kcore-drop-kcore_remap-and-kcore_other
+++ a/include/linux/kcore.h
@@ -11,14 +11,11 @@ enum kcore_type {
 	KCORE_RAM,
 	KCORE_VMEMMAP,
 	KCORE_USER,
-	KCORE_OTHER,
-	KCORE_REMAP,
 };
 
 struct kcore_list {
 	struct list_head list;
 	unsigned long addr;
-	unsigned long vaddr;
 	size_t size;
 	int type;
 };
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 055/192] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM
  2021-07-01  1:46 incoming Andrew Morton
                   ` (53 preceding siblings ...)
  2021-07-01  1:50 ` [patch 054/192] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 056/192] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages Andrew Morton
                   ` (137 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: adobriyan, akpm, alex.shi, david, guro, haiyangz, jasowang,
	jbohac, kys, linux-mm, mhocko, mike.kravetz, mm-commits, mst,
	naoya.horiguchi, osalvador, rppt, steven.price, sthemmin,
	torvalds, wei.liu, willy, yaoaili

From: David Hildenbrand <david@redhat.com>
Subject: fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM

Let's restructure the code using a switch-case, checking pfn_is_ram()
only when we are dealing with KCORE_RAM.

Link: https://lkml.kernel.org/r/20210526093041.8800-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/kcore.c |   35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

--- a/fs/proc/kcore.c~fs-proc-kcore-pfn_is_ram-check-only-applies-to-kcore_ram
+++ a/fs/proc/kcore.c
@@ -483,25 +483,36 @@ read_kcore(struct file *file, char __use
 				goto out;
 			}
 			m = NULL;	/* skip the list anchor */
-		} else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
-			if (clear_user(buffer, tsz)) {
-				ret = -EFAULT;
-				goto out;
-			}
-		} else if (m->type == KCORE_VMALLOC) {
+			goto skip;
+		}
+
+		switch (m->type) {
+		case KCORE_VMALLOC:
 			vread(buf, (char *)start, tsz);
 			/* we have to zero-fill user buffer even if no read */
 			if (copy_to_user(buffer, buf, tsz)) {
 				ret = -EFAULT;
 				goto out;
 			}
-		} else if (m->type == KCORE_USER) {
+			break;
+		case KCORE_USER:
 			/* User page is handled prior to normal kernel page: */
 			if (copy_to_user(buffer, (char *)start, tsz)) {
 				ret = -EFAULT;
 				goto out;
 			}
-		} else {
+			break;
+		case KCORE_RAM:
+			if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+				if (clear_user(buffer, tsz)) {
+					ret = -EFAULT;
+					goto out;
+				}
+				break;
+			}
+			fallthrough;
+		case KCORE_VMEMMAP:
+		case KCORE_TEXT:
 			if (kern_addr_valid(start)) {
 				/*
 				 * Using bounce buffer to bypass the
@@ -525,7 +536,15 @@ read_kcore(struct file *file, char __use
 					goto out;
 				}
 			}
+			break;
+		default:
+			pr_warn_once("Unhandled KCORE type: %d\n", m->type);
+			if (clear_user(buffer, tsz)) {
+				ret = -EFAULT;
+				goto out;
+			}
 		}
+skip:
 		buflen -= tsz;
 		*fpos += tsz;
 		buffer += tsz;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 056/192] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages
  2021-07-01  1:46 incoming Andrew Morton
                   ` (54 preceding siblings ...)
  2021-07-01  1:50 ` [patch 055/192] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 057/192] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() Andrew Morton
                   ` (136 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: adobriyan, akpm, alex.shi, david, guro, haiyangz, jasowang,
	jbohac, kys, linux-mm, mhocko, mike.kravetz, mm-commits, mst,
	naoya.horiguchi, osalvador, rppt, steven.price, sthemmin,
	torvalds, wei.liu, willy, yaoaili

From: David Hildenbrand <david@redhat.com>
Subject: fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages

Let's avoid reading:

1) Offline memory sections: the content of offline memory sections is
   stale as the memory is effectively unused by the kernel.  On s390x with
   standby memory, offline memory sections (belonging to offline storage
   increments) are not accessible.  With virtio-mem and the hyper-v
   balloon, we can have unavailable memory chunks that should not be
   accessed inside offline memory sections.  Last but not least, offline
   memory sections might contain hwpoisoned pages which we can no longer
   identify because the memmap is stale.

2) PG_offline pages: logically offline pages that are documented as
   "The content of these pages is effectively stale.  Such pages should
   not be touched (read/write/dump/save) except by their owner.". 
   Examples include pages inflated in a balloon or unavailable memory
   ranges inside hotplugged memory sections with virtio-mem or the hyper-v
   balloon.

3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal. 
   As documented: "Accessing is not safe since it may cause another
   machine check.  Don't touch!"

Introduce is_page_hwpoison(), adding a comment that it is inherently racy
but best we can really do.

Reading /proc/kcore now performs checks similar to those used when reading
/proc/vmcore for kdump via makedumpfile: problematic pages are excluded.
It's also similar to hibernation code, however, we don't skip hwpoisoned
pages when processing pages in kernel/power/snapshot.c:saveable_page()
yet.

Note 1: we can race against memory offlining code, especially memory going
offline and getting unplugged: however, we will properly tear down the
identity mapping and handle faults gracefully when accessing this memory
from kcore code.

Note 2: we can race against drivers setting PageOffline() and turning
memory inaccessible in the hypervisor.  We'll handle this in a follow-up
patch.

Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/kcore.c            |   14 +++++++++++++-
 include/linux/page-flags.h |   12 ++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

--- a/fs/proc/kcore.c~fs-proc-kcore-dont-read-offline-sections-logically-offline-pages-and-hwpoisoned-pages
+++ a/fs/proc/kcore.c
@@ -465,6 +465,9 @@ read_kcore(struct file *file, char __use
 
 	m = NULL;
 	while (buflen) {
+		struct page *page;
+		unsigned long pfn;
+
 		/*
 		 * If this is the first iteration or the address is not within
 		 * the previous entry, search for a matching entry.
@@ -503,7 +506,16 @@ read_kcore(struct file *file, char __use
 			}
 			break;
 		case KCORE_RAM:
-			if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) {
+			pfn = __pa(start) >> PAGE_SHIFT;
+			page = pfn_to_online_page(pfn);
+
+			/*
+			 * Don't read offline sections, logically offline pages
+			 * (e.g., inflated in a balloon), hwpoisoned pages,
+			 * and explicitly excluded physical ranges.
+			 */
+			if (!page || PageOffline(page) ||
+			    is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
 				if (clear_user(buffer, tsz)) {
 					ret = -EFAULT;
 					goto out;
--- a/include/linux/page-flags.h~fs-proc-kcore-dont-read-offline-sections-logically-offline-pages-and-hwpoisoned-pages
+++ a/include/linux/page-flags.h
@@ -695,6 +695,18 @@ PAGEFLAG_FALSE(DoubleMap)
 #endif
 
 /*
+ * Check if a page is currently marked HWPoisoned. Note that this check is
+ * best effort only and inherently racy: there is no way to synchronize with
+ * failing hardware.
+ */
+static inline bool is_page_hwpoison(struct page *page)
+{
+	if (PageHWPoison(page))
+		return true;
+	return PageHuge(page) && PageHWPoison(compound_head(page));
+}
+
+/*
  * For pages that are never mapped to userspace (and aren't PageSlab),
  * page_type may be used.  Because it is initialised to -1, we invert the
  * sense of the bit, so __SetPageFoo *clears* the bit used for PageFoo, and
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 057/192] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (55 preceding siblings ...)
  2021-07-01  1:50 ` [patch 056/192] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 058/192] virtio-mem: use page_offline_(start|end) when " Andrew Morton
                   ` (135 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: adobriyan, akpm, alex.shi, david, guro, haiyangz, jasowang,
	jbohac, kys, linux-mm, mhocko, mike.kravetz, mm-commits, mst,
	naoya.horiguchi, osalvador, rppt, steven.price, sthemmin,
	torvalds, wei.liu, willy, yaoaili

From: David Hildenbrand <david@redhat.com>
Subject: mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline()

A driver might set a page logically offline -- PageOffline() -- and turn
the page inaccessible in the hypervisor; after that, access to page
content can be fatal.  One example is virtio-mem; while unplugged memory
-- marked as PageOffline() -- can currently be read in the hypervisor, this
will no longer be the case in the future; for example, when having a
virtio-mem device backed by huge pages in the hypervisor.

Some special PFN walkers -- i.e., /proc/kcore -- read content of random
pages after checking PageOffline(); however, these PFN walkers can race
with drivers that set PageOffline().

Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing.

page_offline_freeze()/page_offline_thaw() allows for a subsystem to
synchronize with such drivers, achieving that a page cannot be set
PageOffline() while frozen.

page_offline_begin()/page_offline_end() is used by drivers that care about
such races when setting a page PageOffline().

For simplicity, use a rwsem for now; neither drivers nor users are
performance sensitive.
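
As a rough usage sketch (hedged; not verbatim kernel code, and
read_page_content() is a hypothetical helper):

	/* PFN walker side, e.g. /proc/kcore: */
	page_offline_freeze();
	if (page && !PageOffline(page))
		read_page_content(page);	/* hypothetical helper */
	page_offline_thaw();

	/* Driver side, e.g. virtio-mem, when marking pages offline: */
	page_offline_begin();
	__SetPageOffline(page);
	page_offline_end();

With the rwsem underneath, a page cannot become PageOffline() between the
walker's freeze and thaw.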

Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-flags.h |   10 ++++++++
 mm/util.c                  |   40 +++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

--- a/include/linux/page-flags.h~mm-introduce-page_offline_beginendfreezethaw-to-synchronize-setting-pageoffline
+++ a/include/linux/page-flags.h
@@ -769,9 +769,19 @@ PAGE_TYPE_OPS(Buddy, buddy)
  * relies on this feature is aware that re-onlining the memory block will
  * require to re-set the pages PageOffline() and not giving them to the
  * buddy via online_page_callback_t.
+ *
+ * There are drivers that mark a page PageOffline() and expect there won't be
+ * any further access to page content. PFN walkers that read content of random
+ * pages should check PageOffline() and synchronize with such drivers using
+ * page_offline_freeze()/page_offline_thaw().
  */
 PAGE_TYPE_OPS(Offline, offline)
 
+extern void page_offline_freeze(void);
+extern void page_offline_thaw(void);
+extern void page_offline_begin(void);
+extern void page_offline_end(void);
+
 /*
  * Marks pages in use as page tables.
  */
--- a/mm/util.c~mm-introduce-page_offline_beginendfreezethaw-to-synchronize-setting-pageoffline
+++ a/mm/util.c
@@ -1010,3 +1010,43 @@ void mem_dump_obj(void *object)
 }
 EXPORT_SYMBOL_GPL(mem_dump_obj);
 #endif
+
+/*
+ * A driver might set a page logically offline -- PageOffline() -- and
+ * turn the page inaccessible in the hypervisor; after that, access to page
+ * content can be fatal.
+ *
+ * Some special PFN walkers -- i.e., /proc/kcore -- read content of random
+ * pages after checking PageOffline(); however, these PFN walkers can race
+ * with drivers that set PageOffline().
+ *
+ * page_offline_freeze()/page_offline_thaw() allows for a subsystem to
+ * synchronize with such drivers, achieving that a page cannot be set
+ * PageOffline() while frozen.
+ *
+ * page_offline_begin()/page_offline_end() is used by drivers that care about
+ * such races when setting a page PageOffline().
+ */
+static DECLARE_RWSEM(page_offline_rwsem);
+
+void page_offline_freeze(void)
+{
+	down_read(&page_offline_rwsem);
+}
+
+void page_offline_thaw(void)
+{
+	up_read(&page_offline_rwsem);
+}
+
+void page_offline_begin(void)
+{
+	down_write(&page_offline_rwsem);
+}
+EXPORT_SYMBOL(page_offline_begin);
+
+void page_offline_end(void)
+{
+	up_write(&page_offline_rwsem);
+}
+EXPORT_SYMBOL(page_offline_end);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 058/192] virtio-mem: use page_offline_(start|end) when setting PageOffline()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (56 preceding siblings ...)
  2021-07-01  1:50 ` [patch 057/192] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 059/192] fs/proc/kcore: use page_offline_(freeze|thaw) Andrew Morton
                   ` (134 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: adobriyan, akpm, alex.shi, david, guro, haiyangz, jasowang,
	jbohac, kys, linux-mm, mhocko, mike.kravetz, mm-commits, mst,
	naoya.horiguchi, osalvador, rppt, steven.price, sthemmin,
	torvalds, wei.liu, willy, yaoaili

From: David Hildenbrand <david@redhat.com>
Subject: virtio-mem: use page_offline_(start|end) when setting PageOffline()

Let's properly use page_offline_(start|end) to synchronize setting
PageOffline(), so we won't have valid page access to unplugged memory
regions from /proc/kcore.

Existing balloon implementations usually allow reading inflated memory;
doing so might result in unnecessary overhead in the hypervisor, which is
currently the case with virtio-mem.

For future virtio-mem use cases, it will be different when using shmem,
huge pages, !anonymous private mappings, ...  as backing storage for a VM.
virtio-mem unplugged memory must no longer be accessed and access might
result in undefined behavior.  There will be a virtio spec extension to
document this change, including a new feature flag indicating the changed
behavior.  We really don't want to race against PFN walkers reading random
page content.

Link: https://lkml.kernel.org/r/20210526093041.8800-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/virtio/virtio_mem.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/drivers/virtio/virtio_mem.c~virtio-mem-use-page_offline_startend-when-setting-pageoffline
+++ a/drivers/virtio/virtio_mem.c
@@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb
 static void virtio_mem_set_fake_offline(unsigned long pfn,
 					unsigned long nr_pages, bool onlined)
 {
+	page_offline_begin();
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
@@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(
 			ClearPageReserved(page);
 		}
 	}
+	page_offline_end();
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 059/192] fs/proc/kcore: use page_offline_(freeze|thaw)
  2021-07-01  1:46 incoming Andrew Morton
                   ` (57 preceding siblings ...)
  2021-07-01  1:50 ` [patch 058/192] virtio-mem: use page_offline_(start|end) when " Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 060/192] mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS Andrew Morton
                   ` (133 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: adobriyan, akpm, alex.shi, david, guro, haiyangz, jasowang,
	jbohac, kys, linux-mm, mhocko, mike.kravetz, mm-commits, mst,
	naoya.horiguchi, osalvador, rppt, steven.price, sthemmin,
	torvalds, wei.liu, willy, yaoaili

From: David Hildenbrand <david@redhat.com>
Subject: fs/proc/kcore: use page_offline_(freeze|thaw)

Let's properly synchronize with drivers that set PageOffline(). 
Unfreeze (thaw) every now and then, so drivers that want to set
PageOffline() can make progress.

Link: https://lkml.kernel.org/r/20210526093041.8800-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/kcore.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- a/fs/proc/kcore.c~fs-proc-kcore-use-page_offline_freezethaw
+++ a/fs/proc/kcore.c
@@ -313,6 +313,7 @@ read_kcore(struct file *file, char __use
 {
 	char *buf = file->private_data;
 	size_t phdrs_offset, notes_offset, data_offset;
+	size_t page_offline_frozen = 1;
 	size_t phdrs_len, notes_len;
 	struct kcore_list *m;
 	size_t tsz;
@@ -322,6 +323,11 @@ read_kcore(struct file *file, char __use
 	int ret = 0;
 
 	down_read(&kclist_lock);
+	/*
+	 * Don't race against drivers that set PageOffline() and expect no
+	 * further page access.
+	 */
+	page_offline_freeze();
 
 	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
 	phdrs_offset = sizeof(struct elfhdr);
@@ -480,6 +486,12 @@ read_kcore(struct file *file, char __use
 			}
 		}
 
+		if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
+			page_offline_thaw();
+			cond_resched();
+			page_offline_freeze();
+		}
+
 		if (&m->list == &kclist_head) {
 			if (clear_user(buffer, tsz)) {
 				ret = -EFAULT;
@@ -565,6 +577,7 @@ skip:
 	}
 
 out:
+	page_offline_thaw();
 	up_read(&kclist_lock);
 	if (ret)
 		return ret;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 060/192] mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS
  2021-07-01  1:46 incoming Andrew Morton
                   ` (58 preceding siblings ...)
  2021-07-01  1:50 ` [patch 059/192] fs/proc/kcore: use page_offline_(freeze|thaw) Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 061/192] mm/z3fold: avoid possible underflow in z3fold_alloc() Andrew Morton
                   ` (132 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, hdanton, linmiaohe, linux-mm, mm-commits, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS

Patch series "Cleanup and fixup for z3fold".

This series contains cleanups to remove an unused function, redefine a
macro to improve readability, and so on.  It also fixes several bugs in
z3fold, such as a memory leak in z3fold_destroy_pool().  More details can
be found in the respective changelogs.


This patch (of 6):

To improve code readability, we can define the macro NCHUNKS as
TOTAL_CHUNKS - ZHDR_CHUNKS.  No functional change intended.
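
The two definitions are equivalent: ZHDR_SIZE_ALIGNED is rounded up to
CHUNK_SIZE, so the shift distributes over the subtraction:

	NCHUNKS = TOTAL_CHUNKS - ZHDR_CHUNKS
	        = (PAGE_SIZE >> CHUNK_SHIFT) - (ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT)
	        = (PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT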

Link: https://lkml.kernel.org/r/20210619093151.1492174-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210619093151.1492174-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/z3fold.c~mm-z3fold-define-macro-nchunks-as-total_chunks-zhdr_chunks
+++ a/mm/z3fold.c
@@ -62,7 +62,7 @@
 #define ZHDR_SIZE_ALIGNED round_up(sizeof(struct z3fold_header), CHUNK_SIZE)
 #define ZHDR_CHUNKS	(ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT)
 #define TOTAL_CHUNKS	(PAGE_SIZE >> CHUNK_SHIFT)
-#define NCHUNKS		((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)
+#define NCHUNKS		(TOTAL_CHUNKS - ZHDR_CHUNKS)
 
 #define BUDDY_MASK	(0x3)
 #define BUDDY_SHIFT	2
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 061/192] mm/z3fold: avoid possible underflow in z3fold_alloc()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (59 preceding siblings ...)
  2021-07-01  1:50 ` [patch 060/192] mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 062/192] mm/z3fold: remove magic number in z3fold_create_pool() Andrew Morton
                   ` (131 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, hdanton, linmiaohe, linux-mm, mm-commits, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/z3fold: avoid possible underflow in z3fold_alloc()

It is not enough to just make sure the z3fold header is not larger than
the page size.  When the z3fold header is equal to PAGE_SIZE, we would
underflow when checking the alloc size against PAGE_SIZE -
ZHDR_SIZE_ALIGNED - CHUNK_SIZE in z3fold_alloc().  Make sure there is
remaining space for its buddy to fix this theoretical issue.
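
A hedged illustration of the underflow (not the exact z3fold_alloc() code;
all quantities are unsigned):

	size_t limit = PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE;
	/*
	 * If ZHDR_SIZE_ALIGNED == PAGE_SIZE, limit wraps around to a huge
	 * value, so a "size > limit" rejection can never trigger.  The
	 * strengthened BUILD_BUG_ON() below rules that layout out at
	 * compile time.
	 */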

Link: https://lkml.kernel.org/r/20210619093151.1492174-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

--- a/mm/z3fold.c~mm-z3fold-avoid-possible-underflow-in-z3fold_alloc
+++ a/mm/z3fold.c
@@ -1803,8 +1803,11 @@ static int __init init_z3fold(void)
 {
 	int ret;
 
-	/* Make sure the z3fold header is not larger than the page size */
-	BUILD_BUG_ON(ZHDR_SIZE_ALIGNED > PAGE_SIZE);
+	/*
+	 * Make sure the z3fold header is not larger than the page size and
+	 * there has remaining spaces for its buddy.
+	 */
+	BUILD_BUG_ON(ZHDR_SIZE_ALIGNED > PAGE_SIZE - CHUNK_SIZE);
 	ret = z3fold_mount();
 	if (ret)
 		return ret;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 062/192] mm/z3fold: remove magic number in z3fold_create_pool()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (60 preceding siblings ...)
  2021-07-01  1:50 ` [patch 061/192] mm/z3fold: avoid possible underflow in z3fold_alloc() Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 063/192] mm/z3fold: remove unused function handle_to_z3fold_header() Andrew Morton
                   ` (130 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, hdanton, linmiaohe, linux-mm, mm-commits, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/z3fold: remove magic number in z3fold_create_pool()

It's meaningless to pass the magic number 2 to __alloc_percpu(), as it has
a minimum alignment size of PCPU_MIN_ALLOC_SIZE (> 2).  Also, there is no
special alignment requirement for unbuddied.  So we can replace this magic
number with the natural alignment, i.e.  __alignof__(struct list_head), to
improve readability.

Link: https://lkml.kernel.org/r/20210619093151.1492174-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/z3fold.c~mm-z3fold-remove-magic-number-in-z3fold_create_pool
+++ a/mm/z3fold.c
@@ -998,7 +998,8 @@ static struct z3fold_pool *z3fold_create
 		goto out_c;
 	spin_lock_init(&pool->lock);
 	spin_lock_init(&pool->stale_lock);
-	pool->unbuddied = __alloc_percpu(sizeof(struct list_head)*NCHUNKS, 2);
+	pool->unbuddied = __alloc_percpu(sizeof(struct list_head) * NCHUNKS,
+					 __alignof__(struct list_head));
 	if (!pool->unbuddied)
 		goto out_pool;
 	for_each_possible_cpu(cpu) {
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 063/192] mm/z3fold: remove unused function handle_to_z3fold_header()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (61 preceding siblings ...)
  2021-07-01  1:50 ` [patch 062/192] mm/z3fold: remove magic number in z3fold_create_pool() Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 064/192] mm/z3fold: fix potential memory leak in z3fold_destroy_pool() Andrew Morton
                   ` (129 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, hdanton, linmiaohe, linux-mm, mm-commits, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/z3fold: remove unused function handle_to_z3fold_header()

handle_to_z3fold_header() is unused now, so we can remove it.  As a
result, get_z3fold_header() becomes the only caller of
__get_z3fold_header(), and the argument lock is always true.  Therefore we
can fold __get_z3fold_header() into get_z3fold_header() with lock = true.

Link: https://lkml.kernel.org/r/20210619093151.1492174-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |   22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

--- a/mm/z3fold.c~mm-z3fold-remove-unused-function-handle_to_z3fold_header
+++ a/mm/z3fold.c
@@ -253,9 +253,8 @@ static inline void z3fold_page_unlock(st
 	spin_unlock(&zhdr->page_lock);
 }
 
-
-static inline struct z3fold_header *__get_z3fold_header(unsigned long handle,
-							bool lock)
+/* return locked z3fold page if it's not headless */
+static inline struct z3fold_header *get_z3fold_header(unsigned long handle)
 {
 	struct z3fold_buddy_slots *slots;
 	struct z3fold_header *zhdr;
@@ -269,13 +268,12 @@ static inline struct z3fold_header *__ge
 			read_lock(&slots->lock);
 			addr = *(unsigned long *)handle;
 			zhdr = (struct z3fold_header *)(addr & PAGE_MASK);
-			if (lock)
-				locked = z3fold_page_trylock(zhdr);
+			locked = z3fold_page_trylock(zhdr);
 			read_unlock(&slots->lock);
 			if (locked)
 				break;
 			cpu_relax();
-		} while (lock);
+		} while (true);
 	} else {
 		zhdr = (struct z3fold_header *)(handle & PAGE_MASK);
 	}
@@ -283,18 +281,6 @@ static inline struct z3fold_header *__ge
 	return zhdr;
 }
 
-/* Returns the z3fold page where a given handle is stored */
-static inline struct z3fold_header *handle_to_z3fold_header(unsigned long h)
-{
-	return __get_z3fold_header(h, false);
-}
-
-/* return locked z3fold page if it's not headless */
-static inline struct z3fold_header *get_z3fold_header(unsigned long h)
-{
-	return __get_z3fold_header(h, true);
-}
-
 static inline void put_z3fold_header(struct z3fold_header *zhdr)
 {
 	struct page *page = virt_to_page(zhdr);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 064/192] mm/z3fold: fix potential memory leak in z3fold_destroy_pool()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (62 preceding siblings ...)
  2021-07-01  1:50 ` [patch 063/192] mm/z3fold: remove unused function handle_to_z3fold_header() Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 065/192] mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page Andrew Morton
                   ` (128 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, hdanton, linmiaohe, linux-mm, mm-commits, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/z3fold: fix potential memory leak in z3fold_destroy_pool()

There is a memory leak in z3fold_destroy_pool(), as it forgets to
free_percpu() pool->unbuddied.  Call free_percpu() on pool->unbuddied to
fix this issue.
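
A condensed sketch of the now-paired allocation and release (taken from
the create/destroy paths in this series):

	/* z3fold_create_pool(): */
	pool->unbuddied = __alloc_percpu(sizeof(struct list_head) * NCHUNKS,
					 __alignof__(struct list_head));
	...
	/* z3fold_destroy_pool(), which previously only did the kfree(): */
	free_percpu(pool->unbuddied);
	kfree(pool);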

Link: https://lkml.kernel.org/r/20210619093151.1492174-6-linmiaohe@huawei.com
Fixes: d30561c56f41 ("z3fold: use per-cpu unbuddied lists")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |    1 +
 1 file changed, 1 insertion(+)

--- a/mm/z3fold.c~mm-z3fold-fix-potential-memory-leak-in-z3fold_destroy_pool
+++ a/mm/z3fold.c
@@ -1046,6 +1046,7 @@ static void z3fold_destroy_pool(struct z
 	destroy_workqueue(pool->compact_wq);
 	destroy_workqueue(pool->release_wq);
 	z3fold_unregister_migration(pool);
+	free_percpu(pool->unbuddied);
 	kfree(pool);
 }
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 065/192] mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page
  2021-07-01  1:46 incoming Andrew Morton
                   ` (63 preceding siblings ...)
  2021-07-01  1:50 ` [patch 064/192] mm/z3fold: fix potential memory leak in z3fold_destroy_pool() Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 066/192] mm/zbud: reuse unbuddied[0] as buddied in zbud_pool Andrew Morton
                   ` (127 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, hdanton, linmiaohe, linux-mm, mm-commits, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page

We should use release_z3fold_page_locked() to release a z3fold page when
it's locked, although it looks harmless to use release_z3fold_page() now.

Link: https://lkml.kernel.org/r/20210619093151.1492174-7-linmiaohe@huawei.com
Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/z3fold.c~mm-z3fold-use-release_z3fold_page_locked-to-release-locked-z3fold-page
+++ a/mm/z3fold.c
@@ -1370,7 +1370,7 @@ static int z3fold_reclaim_page(struct z3
 			if (zhdr->foreign_handles ||
 			    test_and_set_bit(PAGE_CLAIMED, &page->private)) {
 				if (kref_put(&zhdr->refcount,
-						release_z3fold_page))
+						release_z3fold_page_locked))
 					atomic64_dec(&pool->pages_nr);
 				else
 					z3fold_page_unlock(zhdr);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 066/192] mm/zbud: reuse unbuddied[0] as buddied in zbud_pool
  2021-07-01  1:46 incoming Andrew Morton
                   ` (64 preceding siblings ...)
  2021-07-01  1:50 ` [patch 065/192] mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 067/192] mm/zbud: don't export any zbud API Andrew Morton
                   ` (126 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, ddstreet, linmiaohe, linux-mm, mm-commits, sjenning, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/zbud: reuse unbuddied[0] as buddied in zbud_pool

Patch series "Cleanups for zbud", v2.

This series contains just cleanups to save some possible memory in
zbud_pool and to avoid exporting any unneeded zbud API.  More details can
be found in the respective changelogs.


This patch (of 2):

Since commit 9d8c5b5284e4 ("mm: zbud: fix condition check on allocation
size"), zbud_pool.unbuddied[0] is always unused.  We can reuse it as the
buddied field to save some possible memory.

Link: https://lkml.kernel.org/r/20210608114515.206992-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210608114515.206992-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zbud.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/mm/zbud.c~mm-zbud-reuse-unbuddied-as-buddied-in-zbud_pool
+++ a/mm/zbud.c
@@ -93,8 +93,14 @@
  */
 struct zbud_pool {
 	spinlock_t lock;
-	struct list_head unbuddied[NCHUNKS];
-	struct list_head buddied;
+	union {
+		/*
+		 * Reuse unbuddied[0] as buddied on the ground that
+		 * unbuddied[0] is unused.
+		 */
+		struct list_head buddied;
+		struct list_head unbuddied[NCHUNKS];
+	};
 	struct list_head lru;
 	u64 pages_nr;
 	const struct zbud_ops *ops;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 067/192] mm/zbud: don't export any zbud API
  2021-07-01  1:46 incoming Andrew Morton
                   ` (65 preceding siblings ...)
  2021-07-01  1:50 ` [patch 066/192] mm/zbud: reuse unbuddied[0] as buddied in zbud_pool Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 068/192] mm/compaction: use DEVICE_ATTR_WO macro Andrew Morton
                   ` (125 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, ddstreet, linmiaohe, linux-mm, mm-commits, nathan,
	sjenning, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/zbud: don't export any zbud API

zbud doesn't need to export any API; it has been meant to be used via the
zpool API since commit 12d79d64bfd3 ("mm/zpool: update zswap to use
zpool").  So we can remove the unneeded zbud.h and move the zpool API code
down in the file to avoid any forward declarations.

[linmiaohe@huawei.com: fix unused function warnings when CONFIG_ZPOOL is disabled]
  Link: https://lkml.kernel.org/r/20210619025508.1239386-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210608114515.206992-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS          |    1 
 include/linux/zbud.h |   23 ----
 mm/Kconfig           |    1 
 mm/zbud.c            |  223 ++++++++++++++++++++---------------------
 4 files changed, 110 insertions(+), 138 deletions(-)

--- a/include/linux/zbud.h
+++ /dev/null
@@ -1,23 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ZBUD_H_
-#define _ZBUD_H_
-
-#include <linux/types.h>
-
-struct zbud_pool;
-
-struct zbud_ops {
-	int (*evict)(struct zbud_pool *pool, unsigned long handle);
-};
-
-struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops);
-void zbud_destroy_pool(struct zbud_pool *pool);
-int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp,
-	unsigned long *handle);
-void zbud_free(struct zbud_pool *pool, unsigned long handle);
-int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries);
-void *zbud_map(struct zbud_pool *pool, unsigned long handle);
-void zbud_unmap(struct zbud_pool *pool, unsigned long handle);
-u64 zbud_get_pool_size(struct zbud_pool *pool);
-
-#endif /* _ZBUD_H_ */
--- a/MAINTAINERS~mm-zbud-dont-export-any-zbud-api
+++ a/MAINTAINERS
@@ -20172,7 +20172,6 @@ M:	Seth Jennings <sjenning@redhat.com>
 M:	Dan Streetman <ddstreet@ieee.org>
 L:	linux-mm@kvack.org
 S:	Maintained
-F:	include/linux/zbud.h
 F:	mm/zbud.c
 
 ZD1211RW WIRELESS DRIVER
--- a/mm/Kconfig~mm-zbud-dont-export-any-zbud-api
+++ a/mm/Kconfig
@@ -674,6 +674,7 @@ config ZPOOL
 
 config ZBUD
 	tristate "Low (Up to 2x) density storage for compressed pages"
+	depends on ZPOOL
 	help
 	  A special purpose allocator for storing compressed pages.
 	  It is designed to store up to two compressed pages per physical
--- a/mm/zbud.c~mm-zbud-dont-export-any-zbud-api
+++ a/mm/zbud.c
@@ -51,7 +51,6 @@
 #include <linux/preempt.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
-#include <linux/zbud.h>
 #include <linux/zpool.h>
 
 /*****************
@@ -73,6 +72,12 @@
 #define ZHDR_SIZE_ALIGNED CHUNK_SIZE
 #define NCHUNKS		((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT)
 
+struct zbud_pool;
+
+struct zbud_ops {
+	int (*evict)(struct zbud_pool *pool, unsigned long handle);
+};
+
 /**
  * struct zbud_pool - stores metadata for each zbud pool
  * @lock:	protects all pool fields and first|last_chunk fields of any
@@ -104,10 +109,8 @@ struct zbud_pool {
 	struct list_head lru;
 	u64 pages_nr;
 	const struct zbud_ops *ops;
-#ifdef CONFIG_ZPOOL
 	struct zpool *zpool;
 	const struct zpool_ops *zpool_ops;
-#endif
 };
 
 /*
@@ -127,104 +130,6 @@ struct zbud_header {
 };
 
 /*****************
- * zpool
- ****************/
-
-#ifdef CONFIG_ZPOOL
-
-static int zbud_zpool_evict(struct zbud_pool *pool, unsigned long handle)
-{
-	if (pool->zpool && pool->zpool_ops && pool->zpool_ops->evict)
-		return pool->zpool_ops->evict(pool->zpool, handle);
-	else
-		return -ENOENT;
-}
-
-static const struct zbud_ops zbud_zpool_ops = {
-	.evict =	zbud_zpool_evict
-};
-
-static void *zbud_zpool_create(const char *name, gfp_t gfp,
-			       const struct zpool_ops *zpool_ops,
-			       struct zpool *zpool)
-{
-	struct zbud_pool *pool;
-
-	pool = zbud_create_pool(gfp, zpool_ops ? &zbud_zpool_ops : NULL);
-	if (pool) {
-		pool->zpool = zpool;
-		pool->zpool_ops = zpool_ops;
-	}
-	return pool;
-}
-
-static void zbud_zpool_destroy(void *pool)
-{
-	zbud_destroy_pool(pool);
-}
-
-static int zbud_zpool_malloc(void *pool, size_t size, gfp_t gfp,
-			unsigned long *handle)
-{
-	return zbud_alloc(pool, size, gfp, handle);
-}
-static void zbud_zpool_free(void *pool, unsigned long handle)
-{
-	zbud_free(pool, handle);
-}
-
-static int zbud_zpool_shrink(void *pool, unsigned int pages,
-			unsigned int *reclaimed)
-{
-	unsigned int total = 0;
-	int ret = -EINVAL;
-
-	while (total < pages) {
-		ret = zbud_reclaim_page(pool, 8);
-		if (ret < 0)
-			break;
-		total++;
-	}
-
-	if (reclaimed)
-		*reclaimed = total;
-
-	return ret;
-}
-
-static void *zbud_zpool_map(void *pool, unsigned long handle,
-			enum zpool_mapmode mm)
-{
-	return zbud_map(pool, handle);
-}
-static void zbud_zpool_unmap(void *pool, unsigned long handle)
-{
-	zbud_unmap(pool, handle);
-}
-
-static u64 zbud_zpool_total_size(void *pool)
-{
-	return zbud_get_pool_size(pool) * PAGE_SIZE;
-}
-
-static struct zpool_driver zbud_zpool_driver = {
-	.type =		"zbud",
-	.sleep_mapped = true,
-	.owner =	THIS_MODULE,
-	.create =	zbud_zpool_create,
-	.destroy =	zbud_zpool_destroy,
-	.malloc =	zbud_zpool_malloc,
-	.free =		zbud_zpool_free,
-	.shrink =	zbud_zpool_shrink,
-	.map =		zbud_zpool_map,
-	.unmap =	zbud_zpool_unmap,
-	.total_size =	zbud_zpool_total_size,
-};
-
-MODULE_ALIAS("zpool-zbud");
-#endif /* CONFIG_ZPOOL */
-
-/*****************
  * Helpers
 *****************/
 /* Just to make the code easier to read */
@@ -310,7 +215,7 @@ static int num_free_chunks(struct zbud_h
  * Return: pointer to the new zbud pool or NULL if the metadata allocation
  * failed.
  */
-struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops)
+static struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops)
 {
 	struct zbud_pool *pool;
 	int i;
@@ -334,7 +239,7 @@ struct zbud_pool *zbud_create_pool(gfp_t
  *
  * The pool should be emptied before this function is called.
  */
-void zbud_destroy_pool(struct zbud_pool *pool)
+static void zbud_destroy_pool(struct zbud_pool *pool)
 {
 	kfree(pool);
 }
@@ -358,7 +263,7 @@ void zbud_destroy_pool(struct zbud_pool
  * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate
  * a new page.
  */
-int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp,
+static int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp,
 			unsigned long *handle)
 {
 	int chunks, i, freechunks;
@@ -433,7 +338,7 @@ found:
  * only sets the first|last_chunks to 0.  The page is actually freed
  * once both buddies are evicted (see zbud_reclaim_page() below).
  */
-void zbud_free(struct zbud_pool *pool, unsigned long handle)
+static void zbud_free(struct zbud_pool *pool, unsigned long handle)
 {
 	struct zbud_header *zhdr;
 	int freechunks;
@@ -505,7 +410,7 @@ void zbud_free(struct zbud_pool *pool, u
  * no pages to evict or an eviction handler is not registered, -EAGAIN if
  * the retry limit was hit.
  */
-int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
+static int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries)
 {
 	int i, ret, freechunks;
 	struct zbud_header *zhdr;
@@ -587,7 +492,7 @@ next:
  *
  * Returns: a pointer to the mapped allocation
  */
-void *zbud_map(struct zbud_pool *pool, unsigned long handle)
+static void *zbud_map(struct zbud_pool *pool, unsigned long handle)
 {
 	return (void *)(handle);
 }
@@ -597,7 +502,7 @@ void *zbud_map(struct zbud_pool *pool, u
  * @pool:	pool in which the allocation resides
  * @handle:	handle associated with the allocation to be unmapped
  */
-void zbud_unmap(struct zbud_pool *pool, unsigned long handle)
+static void zbud_unmap(struct zbud_pool *pool, unsigned long handle)
 {
 }
 
@@ -608,30 +513,120 @@ void zbud_unmap(struct zbud_pool *pool,
  * Returns: size in pages of the given pool.  The pool lock need not be
  * taken to access pages_nr.
  */
-u64 zbud_get_pool_size(struct zbud_pool *pool)
+static u64 zbud_get_pool_size(struct zbud_pool *pool)
 {
 	return pool->pages_nr;
 }
 
+/*****************
+ * zpool
+ ****************/
+
+static int zbud_zpool_evict(struct zbud_pool *pool, unsigned long handle)
+{
+	if (pool->zpool && pool->zpool_ops && pool->zpool_ops->evict)
+		return pool->zpool_ops->evict(pool->zpool, handle);
+	else
+		return -ENOENT;
+}
+
+static const struct zbud_ops zbud_zpool_ops = {
+	.evict =	zbud_zpool_evict
+};
+
+static void *zbud_zpool_create(const char *name, gfp_t gfp,
+			       const struct zpool_ops *zpool_ops,
+			       struct zpool *zpool)
+{
+	struct zbud_pool *pool;
+
+	pool = zbud_create_pool(gfp, zpool_ops ? &zbud_zpool_ops : NULL);
+	if (pool) {
+		pool->zpool = zpool;
+		pool->zpool_ops = zpool_ops;
+	}
+	return pool;
+}
+
+static void zbud_zpool_destroy(void *pool)
+{
+	zbud_destroy_pool(pool);
+}
+
+static int zbud_zpool_malloc(void *pool, size_t size, gfp_t gfp,
+			unsigned long *handle)
+{
+	return zbud_alloc(pool, size, gfp, handle);
+}
+static void zbud_zpool_free(void *pool, unsigned long handle)
+{
+	zbud_free(pool, handle);
+}
+
+static int zbud_zpool_shrink(void *pool, unsigned int pages,
+			unsigned int *reclaimed)
+{
+	unsigned int total = 0;
+	int ret = -EINVAL;
+
+	while (total < pages) {
+		ret = zbud_reclaim_page(pool, 8);
+		if (ret < 0)
+			break;
+		total++;
+	}
+
+	if (reclaimed)
+		*reclaimed = total;
+
+	return ret;
+}
+
+static void *zbud_zpool_map(void *pool, unsigned long handle,
+			enum zpool_mapmode mm)
+{
+	return zbud_map(pool, handle);
+}
+static void zbud_zpool_unmap(void *pool, unsigned long handle)
+{
+	zbud_unmap(pool, handle);
+}
+
+static u64 zbud_zpool_total_size(void *pool)
+{
+	return zbud_get_pool_size(pool) * PAGE_SIZE;
+}
+
+static struct zpool_driver zbud_zpool_driver = {
+	.type =		"zbud",
+	.sleep_mapped = true,
+	.owner =	THIS_MODULE,
+	.create =	zbud_zpool_create,
+	.destroy =	zbud_zpool_destroy,
+	.malloc =	zbud_zpool_malloc,
+	.free =		zbud_zpool_free,
+	.shrink =	zbud_zpool_shrink,
+	.map =		zbud_zpool_map,
+	.unmap =	zbud_zpool_unmap,
+	.total_size =	zbud_zpool_total_size,
+};
+
+MODULE_ALIAS("zpool-zbud");
+
 static int __init init_zbud(void)
 {
 	/* Make sure the zbud header will fit in one chunk */
 	BUILD_BUG_ON(sizeof(struct zbud_header) > ZHDR_SIZE_ALIGNED);
 	pr_info("loaded\n");
 
-#ifdef CONFIG_ZPOOL
 	zpool_register_driver(&zbud_zpool_driver);
-#endif
 
 	return 0;
 }
 
 static void __exit exit_zbud(void)
 {
-#ifdef CONFIG_ZPOOL
 	zpool_unregister_driver(&zbud_zpool_driver);
-#endif
-
 	pr_info("unloaded\n");
 }
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 068/192] mm/compaction: use DEVICE_ATTR_WO macro
  2021-07-01  1:46 incoming Andrew Morton
                   ` (66 preceding siblings ...)
  2021-07-01  1:50 ` [patch 067/192] mm/zbud: don't export any zbud API Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 069/192] mm: compaction: remove duplicate !list_empty(&sublist) check Andrew Morton
                   ` (124 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, yuehaibing

From: YueHaibing <yuehaibing@huawei.com>
Subject: mm/compaction: use DEVICE_ATTR_WO macro

Use DEVICE_ATTR_WO helper instead of plain DEVICE_ATTR, which makes the
code a bit shorter and easier to read.
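
For context, DEVICE_ATTR_WO(name) declares a write-only attribute whose store
callback must be named name_store, which is why the handler is renamed to
compact_store() below.  A simplified sketch of the macro family (see
<linux/device.h> for the exact definitions):

	#define DEVICE_ATTR_WO(_name) \
		struct device_attribute dev_attr_##_name = __ATTR_WO(_name)

	/* __ATTR_WO(_name) sets .attr.mode = 0200 and .store = _name##_store */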

Link: https://lkml.kernel.org/r/20210523064521.32912-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/compaction.c~mm-compaction-use-device_attr_wo-macro
+++ a/mm/compaction.c
@@ -2722,9 +2722,9 @@ int sysctl_compaction_handler(struct ctl
 }
 
 #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
-static ssize_t sysfs_compact_node(struct device *dev,
-			struct device_attribute *attr,
-			const char *buf, size_t count)
+static ssize_t compact_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
 {
 	int nid = dev->id;
 
@@ -2737,7 +2737,7 @@ static ssize_t sysfs_compact_node(struct
 
 	return count;
 }
-static DEVICE_ATTR(compact, 0200, NULL, sysfs_compact_node);
+static DEVICE_ATTR_WO(compact);
 
 int compaction_register_node(struct node *node)
 {
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 069/192] mm: compaction: remove duplicate !list_empty(&sublist) check
  2021-07-01  1:46 incoming Andrew Morton
                   ` (67 preceding siblings ...)
  2021-07-01  1:50 ` [patch 068/192] mm/compaction: use DEVICE_ATTR_WO macro Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 070/192] mm/compaction: fix 'limit' in fast_isolate_freepages Andrew Morton
                   ` (123 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, david, linux-mm, liu.xiang, mm-commits, torvalds

From: Liu Xiang <liu.xiang@zlingsmart.com>
Subject: mm: compaction: remove duplicate !list_empty(&sublist) check

list_splice_tail(&sublist, freelist) already performs the !list_empty(&sublist)
check internally, so remove the duplicate call.
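
For reference, list_splice_tail() in <linux/list.h> already guards against an
empty source list; a sketch of the upstream helper:

	static inline void list_splice_tail(struct list_head *list,
					    struct list_head *head)
	{
		if (!list_empty(list))
			__list_splice(list, head->prev, head);
	}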

Link: https://lkml.kernel.org/r/20210609095409.19920-1-liu.xiang@zlingsmart.com
Signed-off-by: Liu Xiang <liu.xiang@zlingsmart.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/mm/compaction.c~mm-compaction-remove-duplicate-list_emptysublist-check
+++ a/mm/compaction.c
@@ -1297,8 +1297,7 @@ move_freelist_head(struct list_head *fre
 
 	if (!list_is_last(freelist, &freepage->lru)) {
 		list_cut_before(&sublist, freelist, &freepage->lru);
-		if (!list_empty(&sublist))
-			list_splice_tail(&sublist, freelist);
+		list_splice_tail(&sublist, freelist);
 	}
 }
 
@@ -1315,8 +1314,7 @@ move_freelist_tail(struct list_head *fre
 
 	if (!list_is_first(freelist, &freepage->lru)) {
 		list_cut_position(&sublist, freelist, &freepage->lru);
-		if (!list_empty(&sublist))
-			list_splice_tail(&sublist, freelist);
+		list_splice_tail(&sublist, freelist);
 	}
 }
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 070/192] mm/compaction: fix 'limit' in fast_isolate_freepages
  2021-07-01  1:46 incoming Andrew Morton
                   ` (68 preceding siblings ...)
  2021-07-01  1:50 ` [patch 069/192] mm: compaction: remove duplicate !list_empty(&sublist) check Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:50 ` [patch 071/192] mm/mempolicy: cleanup nodemask intersection check for oom Andrew Morton
                   ` (122 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: akpm, linux-mm, mgorman, mm-commits, torvalds, vbabka, vvghjk1234

From: Wonhyuk Yang <vvghjk1234@gmail.com>
Subject: mm/compaction: fix 'limit' in fast_isolate_freepages

Because of 'min(1U, ...)', fast_isolate_freepages() sets 'limit' to either 0
or 1, which takes away the opportunity to find candidate pages.  Making
enough scans available instead increases the probability of finding an
appropriate freepage.
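
A quick illustration of the arithmetic, with a hypothetical value for
freelist_scan_limit():

	unsigned int scan_limit = 32;			/* freelist_scan_limit(cc) */
	unsigned int old = min(1U, scan_limit >> 1);	/* always 0 or 1 */
	unsigned int new = max(1U, scan_limit >> 1);	/* 16; clamped to >= 1 */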

Tested on thpscale; the results are as follows.

                                        5.12.0                 5.12.0
                                       vanilla                patched
Amean     fault-both-1       598.15 (   0.00%)      592.56 (   0.93%)
Amean     fault-both-3      1494.47 (   0.00%)     1514.35 (  -1.33%)
Amean     fault-both-5      2519.48 (   0.00%)     2471.76 (   1.89%)
Amean     fault-both-7      3173.85 (   0.00%)     3079.19 (   2.98%)
Amean     fault-both-12     8063.83 (   0.00%)     7858.29 (   2.55%)
Amean     fault-both-18     8781.20 (   0.00%)     7827.70 *  10.86%*
Amean     fault-both-24    12576.44 (   0.00%)    12250.20 (   2.59%)
Amean     fault-both-30    18503.27 (   0.00%)    17528.11 *   5.27%*
Amean     fault-both-32    16133.69 (   0.00%)    13874.24 *  14.00%*

                                           5.12.0         5.12.0
                                          vanilla        patched
Ops Compaction migrate scanned         6547133.00     5963901.00
Ops Compaction free scanned           32452453.00    26609101.00

                        5.12        5.12
                     vanilla     patched
Duration User          27.99       28.84
Duration System       244.08      236.76
Duration Elapsed       78.27       78.38

Link: https://lkml.kernel.org/r/20210626082443.22547-1-vvghjk1234@gmail.com
Fixes: 5a811889de10f ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/compaction.c~mm-compaction-fix-limit-in-fast_isolate_freepages
+++ a/mm/compaction.c
@@ -1378,7 +1378,7 @@ static int next_search_order(struct comp
 static unsigned long
 fast_isolate_freepages(struct compact_control *cc)
 {
-	unsigned int limit = min(1U, freelist_scan_limit(cc) >> 1);
+	unsigned int limit = max(1U, freelist_scan_limit(cc) >> 1);
 	unsigned int nr_scanned = 0;
 	unsigned long low_pfn, min_pfn, highest = 0;
 	unsigned long nr_isolated = 0;
@@ -1490,11 +1490,11 @@ fast_isolate_freepages(struct compact_co
 		spin_unlock_irqrestore(&cc->zone->lock, flags);
 
 		/*
-		 * Smaller scan on next order so the total scan ig related
+		 * Smaller scan on next order so the total scan is related
 		 * to freelist_scan_limit.
 		 */
 		if (order_scanned >= limit)
-			limit = min(1U, limit >> 1);
+			limit = max(1U, limit >> 1);
 	}
 
 	if (!page) {
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 071/192] mm/mempolicy: cleanup nodemask intersection check for oom
  2021-07-01  1:46 incoming Andrew Morton
                   ` (69 preceding siblings ...)
  2021-07-01  1:50 ` [patch 070/192] mm/compaction: fix 'limit' in fast_isolate_freepages Andrew Morton
@ 2021-07-01  1:50 ` Andrew Morton
  2021-07-01  1:51 ` [patch 072/192] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy Andrew Morton
                   ` (121 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:50 UTC (permalink / raw)
  To: aarcange, ak, akpm, ben.widawsky, dan.j.williams, dave.hansen,
	feng.tang, linux-mm, mgorman, mhocko, mike.kravetz, mm-commits,
	rdunlap, rientjes, torvalds, vbabka, ying.huang

From: Feng Tang <feng.tang@intel.com>
Subject: mm/mempolicy: cleanup nodemask intersection check for oom

Patch series "mm/mempolicy: some fix and semantics cleanup", v4.

Current memory policy code has some confusing and ambiguous parts around the
MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one, and there
are many places that have to distinguish the two.  Also, the nodemask
intersection check needs cleanup to be more explicit for OOM use and to
handle MPOL_INTERLEAVE correctly.  This patchset cleans these up and unifies
the parameter sanity check for mbind() and set_mempolicy().


This patch (of 3):

mempolicy_nodemask_intersects() seems to be a general-purpose mempolicy
function.  In fact it is partially tailored for the OOM purpose instead.
The oom proper is the only existing user, so rename the function to make
that purpose explicit.

While at it, drop the MPOL_INTERLEAVE case, as those allocations never have
a nodemask defined (see alloc_page_interleave), so this is dead code, and a
confusing one because MPOL_INTERLEAVE is a hint rather than a hard
requirement and therefore shouldn't be considered during OOM.

The final code can be reduced to a check for MPOL_BIND which is the
only memory policy that is a hard requirement and thus relevant to a
constrained OOM logic.

[mhocko@suse.com: changelog edits]
Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mempolicy.h |    2 +-
 mm/mempolicy.c            |   34 +++++++++-------------------------
 mm/oom_kill.c             |    2 +-
 3 files changed, 11 insertions(+), 27 deletions(-)

--- a/include/linux/mempolicy.h~mm-mempolicy-cleanup-nodemask-intersection-check-for-oom
+++ a/include/linux/mempolicy.h
@@ -150,7 +150,7 @@ extern int huge_node(struct vm_area_stru
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
-extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
 				const nodemask_t *mask);
 extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
 
--- a/mm/mempolicy.c~mm-mempolicy-cleanup-nodemask-intersection-check-for-oom
+++ a/mm/mempolicy.c
@@ -2094,16 +2094,16 @@ bool init_nodemask_of_mempolicy(nodemask
 #endif
 
 /*
- * mempolicy_nodemask_intersects
+ * mempolicy_in_oom_domain
  *
- * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
- * policy.  Otherwise, check for intersection between mask and the policy
- * nodemask for 'bind' or 'interleave' policy.  For 'preferred' or 'local'
- * policy, always return true since it may allocate elsewhere on fallback.
+ * If tsk's mempolicy is "bind", check for intersection between mask and
+ * the policy nodemask. Otherwise, return true for all other policies
+ * including "interleave", as a tsk with "interleave" policy may have
+ * memory allocated from all nodes in system.
  *
  * Takes task_lock(tsk) to prevent freeing of its mempolicy.
  */
-bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+bool mempolicy_in_oom_domain(struct task_struct *tsk,
 					const nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
@@ -2111,29 +2111,13 @@ bool mempolicy_nodemask_intersects(struc
 
 	if (!mask)
 		return ret;
+
 	task_lock(tsk);
 	mempolicy = tsk->mempolicy;
-	if (!mempolicy)
-		goto out;
-
-	switch (mempolicy->mode) {
-	case MPOL_PREFERRED:
-		/*
-		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
-		 * allocate from, they may fallback to other nodes when oom.
-		 * Thus, it's possible for tsk to have allocated memory from
-		 * nodes in mask.
-		 */
-		break;
-	case MPOL_BIND:
-	case MPOL_INTERLEAVE:
+	if (mempolicy && mempolicy->mode == MPOL_BIND)
 		ret = nodes_intersects(mempolicy->v.nodes, *mask);
-		break;
-	default:
-		BUG();
-	}
-out:
 	task_unlock(tsk);
+
 	return ret;
 }
 
--- a/mm/oom_kill.c~mm-mempolicy-cleanup-nodemask-intersection-check-for-oom
+++ a/mm/oom_kill.c
@@ -104,7 +104,7 @@ static bool oom_cpuset_eligible(struct t
 			 * mempolicy intersects current, otherwise it may be
 			 * needlessly killed.
 			 */
-			ret = mempolicy_nodemask_intersects(tsk, mask);
+			ret = mempolicy_in_oom_domain(tsk, mask);
 		} else {
 			/*
 			 * This is not a mempolicy constrained oom, so only
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 072/192] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
  2021-07-01  1:46 incoming Andrew Morton
                   ` (70 preceding siblings ...)
  2021-07-01  1:50 ` [patch 071/192] mm/mempolicy: cleanup nodemask intersection check for oom Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 073/192] mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy Andrew Morton
                   ` (120 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: aarcange, ak, akpm, ben.widawsky, dan.j.williams, dave.hansen,
	feng.tang, linux-mm, mgorman, mhocko, mhocko, mike.kravetz,
	mm-commits, rdunlap, rientjes, torvalds, vbabka, ying.huang

From: Feng Tang <feng.tang@intel.com>
Subject: mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy

The MPOL_LOCAL policy has been set up as a real policy, but it is still
handled like a faked MPOL_PREFERRED policy with the internal MPOL_F_LOCAL
flag bit set, and there are many places that have to distinguish the real
'prefer' policy from the 'local' one, which is quite confusing.

In the current code, there are 4 cases in which MPOL_LOCAL is used:

1. user specifies 'local' policy

2. user specifies 'prefer' policy, but with empty nodemask

3. system 'default' policy is used

4. 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES
   flag set; when it is rebound to a nodemask which doesn't contain
   the 'preferred' node, it will behave as the 'local' policy

So make 'local' a real policy instead of a fake 'prefer' one, and kill the
MPOL_F_LOCAL bit, which greatly reduces the confusion when reading the code.
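
To illustrate cases 1 and 2 from userspace (a hedged sketch; error handling
omitted):

	#include <numaif.h>

	/* case 1: an explicit 'local' policy */
	set_mempolicy(MPOL_LOCAL, NULL, 0);

	/* case 2: 'prefer' with an empty nodemask, which the old code
	 * silently turned into the faked local policy */
	set_mempolicy(MPOL_PREFERRED, NULL, 0);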

For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal
Hocko pointed out:

: I do believe that rebinding preferred policy is just bogus and it should
: be dropped altogether on the ground that a preference is a mere hint from
: userspace where to start the allocation.  Unless I am missing something
: cpusets will be always authoritative for the final placement.  The
: preferred node just acts as a starting point and it should be really
: preserved when cpusets changes.  Otherwise we have a very subtle behavior
: corner cases.

So drop all the tricky transformations between 'prefer' and 'local', and
just record the new nodemask for rebinding.

[feng.tang@intel.com: fix a problem in mpol_set_nodemask(), per Michal Hocko]
  Link: https://lkml.kernel.org/r/1622560492-1294-3-git-send-email-feng.tang@intel.com
[feng.tang@intel.com: refine code and comments of mpol_set_nodemask(), per Michal]
  Link: https://lkml.kernel.org/r/20210603081807.GE56979@shbuild999.sh.intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-3-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/uapi/linux/mempolicy.h |    1 
 mm/mempolicy.c                 |  136 ++++++++++++-------------------
 2 files changed, 56 insertions(+), 81 deletions(-)

--- a/include/uapi/linux/mempolicy.h~mm-mempolicy-dont-handle-mpol_local-like-a-fake-mpol_preferred-policy
+++ a/include/uapi/linux/mempolicy.h
@@ -60,7 +60,6 @@ enum {
  * are never OR'ed into the mode in mempolicy API arguments.
  */
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
-#define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
 
--- a/mm/mempolicy.c~mm-mempolicy-dont-handle-mpol_local-like-a-fake-mpol_preferred-policy
+++ a/mm/mempolicy.c
@@ -121,8 +121,7 @@ enum zone_type policy_zone = 0;
  */
 static struct mempolicy default_policy = {
 	.refcnt = ATOMIC_INIT(1), /* never free it */
-	.mode = MPOL_PREFERRED,
-	.flags = MPOL_F_LOCAL,
+	.mode = MPOL_LOCAL,
 };
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
@@ -200,12 +199,9 @@ static int mpol_new_interleave(struct me
 
 static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 {
-	if (!nodes)
-		pol->flags |= MPOL_F_LOCAL;	/* local allocation */
-	else if (nodes_empty(*nodes))
-		return -EINVAL;			/*  no allowed nodes */
-	else
-		pol->v.preferred_node = first_node(*nodes);
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+	pol->v.preferred_node = first_node(*nodes);
 	return 0;
 }
 
@@ -220,8 +216,7 @@ static int mpol_new_bind(struct mempolic
 /*
  * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
  * any, for the new policy.  mpol_new() has already validated the nodes
- * parameter with respect to the policy mode and flags.  But, we need to
- * handle an empty nodemask with MPOL_PREFERRED here.
+ * parameter with respect to the policy mode and flags.
  *
  * Must be called holding task's alloc_lock to protect task's mems_allowed
  * and mempolicy.  May also be called holding the mmap_lock for write.
@@ -231,33 +226,31 @@ static int mpol_set_nodemask(struct memp
 {
 	int ret;
 
-	/* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
-	if (pol == NULL)
+	/*
+	 * Default (pol==NULL) resp. local memory policies are not a
+	 * subject of any remapping. They also do not need any special
+	 * constructor.
+	 */
+	if (!pol || pol->mode == MPOL_LOCAL)
 		return 0;
+
 	/* Check N_MEMORY */
 	nodes_and(nsc->mask1,
 		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
 
 	VM_BUG_ON(!nodes);
-	if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
-		nodes = NULL;	/* explicit local allocation */
-	else {
-		if (pol->flags & MPOL_F_RELATIVE_NODES)
-			mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
-		else
-			nodes_and(nsc->mask2, *nodes, nsc->mask1);
 
-		if (mpol_store_user_nodemask(pol))
-			pol->w.user_nodemask = *nodes;
-		else
-			pol->w.cpuset_mems_allowed =
-						cpuset_current_mems_allowed;
-	}
+	if (pol->flags & MPOL_F_RELATIVE_NODES)
+		mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
+	else
+		nodes_and(nsc->mask2, *nodes, nsc->mask1);
 
-	if (nodes)
-		ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
+	if (mpol_store_user_nodemask(pol))
+		pol->w.user_nodemask = *nodes;
 	else
-		ret = mpol_ops[pol->mode].create(pol, NULL);
+		pol->w.cpuset_mems_allowed = cpuset_current_mems_allowed;
+
+	ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
 	return ret;
 }
 
@@ -290,13 +283,14 @@ static struct mempolicy *mpol_new(unsign
 			if (((flags & MPOL_F_STATIC_NODES) ||
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
+
+			mode = MPOL_LOCAL;
 		}
 	} else if (mode == MPOL_LOCAL) {
 		if (!nodes_empty(*nodes) ||
 		    (flags & MPOL_F_STATIC_NODES) ||
 		    (flags & MPOL_F_RELATIVE_NODES))
 			return ERR_PTR(-EINVAL);
-		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -344,25 +338,7 @@ static void mpol_rebind_nodemask(struct
 static void mpol_rebind_preferred(struct mempolicy *pol,
 						const nodemask_t *nodes)
 {
-	nodemask_t tmp;
-
-	if (pol->flags & MPOL_F_STATIC_NODES) {
-		int node = first_node(pol->w.user_nodemask);
-
-		if (node_isset(node, *nodes)) {
-			pol->v.preferred_node = node;
-			pol->flags &= ~MPOL_F_LOCAL;
-		} else
-			pol->flags |= MPOL_F_LOCAL;
-	} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
-		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
-		pol->v.preferred_node = first_node(tmp);
-	} else if (!(pol->flags & MPOL_F_LOCAL)) {
-		pol->v.preferred_node = node_remap(pol->v.preferred_node,
-						   pol->w.cpuset_mems_allowed,
-						   *nodes);
-		pol->w.cpuset_mems_allowed = *nodes;
-	}
+	pol->w.cpuset_mems_allowed = *nodes;
 }
 
 /*
@@ -376,7 +352,7 @@ static void mpol_rebind_policy(struct me
 {
 	if (!pol)
 		return;
-	if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
+	if (!mpol_store_user_nodemask(pol) &&
 	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
 		return;
 
@@ -427,6 +403,9 @@ static const struct mempolicy_operations
 		.create = mpol_new_bind,
 		.rebind = mpol_rebind_nodemask,
 	},
+	[MPOL_LOCAL] = {
+		.rebind = mpol_rebind_default,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -919,10 +898,12 @@ static void get_policy_nodemask(struct m
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
+	case MPOL_LOCAL:
+		/* return empty node mask for local allocation */
+		break;
+
 	case MPOL_PREFERRED:
-		if (!(p->flags & MPOL_F_LOCAL))
-			node_set(p->v.preferred_node, *nodes);
-		/* else return empty node mask for local allocation */
+		node_set(p->v.preferred_node, *nodes);
 		break;
 	default:
 		BUG();
@@ -1894,9 +1875,9 @@ nodemask_t *policy_nodemask(gfp_t gfp, s
 /* Return the node id preferred by the given mempolicy, or the given id */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
-	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
+	if (policy->mode == MPOL_PREFERRED) {
 		nd = policy->v.preferred_node;
-	else {
+	} else {
 		/*
 		 * __GFP_THISNODE shouldn't even be used with the bind policy
 		 * because we might easily break the expectation to stay on the
@@ -1933,14 +1914,11 @@ unsigned int mempolicy_slab_node(void)
 		return node;
 
 	policy = current->mempolicy;
-	if (!policy || policy->flags & MPOL_F_LOCAL)
+	if (!policy)
 		return node;
 
 	switch (policy->mode) {
 	case MPOL_PREFERRED:
-		/*
-		 * handled MPOL_F_LOCAL above
-		 */
 		return policy->v.preferred_node;
 
 	case MPOL_INTERLEAVE:
@@ -1960,6 +1938,8 @@ unsigned int mempolicy_slab_node(void)
 							&policy->v.nodes);
 		return z->zone ? zone_to_nid(z->zone) : node;
 	}
+	case MPOL_LOCAL:
+		return node;
 
 	default:
 		BUG();
@@ -2072,16 +2052,18 @@ bool init_nodemask_of_mempolicy(nodemask
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
-		if (mempolicy->flags & MPOL_F_LOCAL)
-			nid = numa_node_id();
-		else
-			nid = mempolicy->v.preferred_node;
+		nid = mempolicy->v.preferred_node;
 		init_nodemask_of_node(mask, nid);
 		break;
 
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		*mask =  mempolicy->v.nodes;
+		*mask = mempolicy->v.nodes;
+		break;
+
+	case MPOL_LOCAL:
+		nid = numa_node_id();
+		init_nodemask_of_node(mask, nid);
 		break;
 
 	default:
@@ -2188,7 +2170,7 @@ struct page *alloc_pages_vma(gfp_t gfp,
 		 * If the policy is interleave, or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+		if (pol->mode == MPOL_PREFERRED)
 			hpage_node = pol->v.preferred_node;
 
 		nmask = policy_nodemask(gfp, pol);
@@ -2324,10 +2306,9 @@ bool __mpol_equal(struct mempolicy *a, s
 	case MPOL_INTERLEAVE:
 		return !!nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
-		/* a's ->flags is the same as b's */
-		if (a->flags & MPOL_F_LOCAL)
-			return true;
 		return a->v.preferred_node == b->v.preferred_node;
+	case MPOL_LOCAL:
+		return true;
 	default:
 		BUG();
 		return false;
@@ -2465,10 +2446,11 @@ int mpol_misplaced(struct page *page, st
 		break;
 
 	case MPOL_PREFERRED:
-		if (pol->flags & MPOL_F_LOCAL)
-			polnid = numa_node_id();
-		else
-			polnid = pol->v.preferred_node;
+		polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_LOCAL:
+		polnid = numa_node_id();
 		break;
 
 	case MPOL_BIND:
@@ -2835,9 +2817,6 @@ void numa_default_policy(void)
  * Parse and format mempolicy from/to strings
  */
 
-/*
- * "local" is implemented internally by MPOL_PREFERRED with MPOL_F_LOCAL flag.
- */
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2915,7 +2894,6 @@ int mpol_parse_str(char *str, struct mem
 		 */
 		if (nodelist)
 			goto out;
-		mode = MPOL_PREFERRED;
 		break;
 	case MPOL_DEFAULT:
 		/*
@@ -2959,7 +2937,7 @@ int mpol_parse_str(char *str, struct mem
 	else if (nodelist)
 		new->v.preferred_node = first_node(nodes);
 	else
-		new->flags |= MPOL_F_LOCAL;
+		new->mode = MPOL_LOCAL;
 
 	/*
 	 * Save nodes for contextualization: this will be used to "clone"
@@ -3005,12 +2983,10 @@ void mpol_to_str(char *buffer, int maxle
 
 	switch (mode) {
 	case MPOL_DEFAULT:
+	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
-		if (flags & MPOL_F_LOCAL)
-			mode = MPOL_LOCAL;
-		else
-			node_set(pol->v.preferred_node, nodes);
+		node_set(pol->v.preferred_node, nodes);
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 073/192] mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy
  2021-07-01  1:46 incoming Andrew Morton
                   ` (71 preceding siblings ...)
  2021-07-01  1:51 ` [patch 072/192] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 074/192] mm: mempolicy: don't have to split pmd for huge zero page Andrew Morton
                   ` (119 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: aarcange, ak, akpm, ben.widawsky, dan.j.williams, dave.hansen,
	feng.tang, linux-mm, mgorman, mhocko, mike.kravetz, mm-commits,
	rdunlap, rientjes, torvalds, vbabka, ying.huang

From: Feng Tang <feng.tang@intel.com>
Subject: mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy

Currently kernel_mbind() and kernel_set_mempolicy() perform almost the same
parameter sanity checks.

Add a helper function to unify the code, reduce the redundancy, and make it
easier to change the sanity-check code in the future.

[thanks to David Rientjes for suggesting a helper function instead of a
macro]
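
As a reminder of what is being sanitized: userspace ORs optional MPOL_F_*
mode flags into the high bits of the mode argument, e.g. (illustrative call):

	/* mode word = policy | optional mode flags */
	set_mempolicy(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES, nodemask, maxnode);

	/* sanitize_mpol_flags() splits this back into mode and flags, and
	 * rejects MPOL_F_STATIC_NODES combined with MPOL_F_RELATIVE_NODES */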

[feng.tang@intel.com: add comment]
  Link: https://lkml.kernel.org/r/1622560492-1294-4-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-4-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |   48 +++++++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 18 deletions(-)

--- a/mm/mempolicy.c~mm-mempolicy-unify-the-parameter-sanity-check-for-mbind-and-set_mempolicy
+++ a/mm/mempolicy.c
@@ -1441,26 +1441,38 @@ static int copy_nodes_to_user(unsigned l
 	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
 }
 
+/* Basic parameter sanity check used by both mbind() and set_mempolicy() */
+static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
+{
+	*flags = *mode & MPOL_MODE_FLAGS;
+	*mode &= ~MPOL_MODE_FLAGS;
+	if ((unsigned int)(*mode) >= MPOL_MAX)
+		return -EINVAL;
+	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
+		return -EINVAL;
+
+	return 0;
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
 {
+	unsigned short mode_flags;
 	nodemask_t nodes;
+	int lmode = mode;
 	int err;
-	unsigned short mode_flags;
 
 	start = untagged_addr(start);
-	mode_flags = mode & MPOL_MODE_FLAGS;
-	mode &= ~MPOL_MODE_FLAGS;
-	if (mode >= MPOL_MAX)
-		return -EINVAL;
-	if ((mode_flags & MPOL_F_STATIC_NODES) &&
-	    (mode_flags & MPOL_F_RELATIVE_NODES))
-		return -EINVAL;
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return err;
+
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-	return do_mbind(start, len, mode, mode_flags, &nodes, flags);
+
+	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
 }
 
 SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
@@ -1474,20 +1486,20 @@ SYSCALL_DEFINE6(mbind, unsigned long, st
 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 				 unsigned long maxnode)
 {
-	int err;
+	unsigned short mode_flags;
 	nodemask_t nodes;
-	unsigned short flags;
+	int lmode = mode;
+	int err;
+
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return err;
 
-	flags = mode & MPOL_MODE_FLAGS;
-	mode &= ~MPOL_MODE_FLAGS;
-	if ((unsigned int)mode >= MPOL_MAX)
-		return -EINVAL;
-	if ((flags & MPOL_F_STATIC_NODES) && (flags & MPOL_F_RELATIVE_NODES))
-		return -EINVAL;
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-	return do_set_mempolicy(mode, flags, &nodes);
+
+	return do_set_mempolicy(lmode, mode_flags, &nodes);
 }
 
 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 074/192] mm: mempolicy: don't have to split pmd for huge zero page
  2021-07-01  1:46 incoming Andrew Morton
                   ` (72 preceding siblings ...)
  2021-07-01  1:51 ` [patch 073/192] mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 075/192] mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies Andrew Morton
                   ` (118 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, hughd, kirill.shutemov, linux-mm, mhocko, mm-commits,
	nao.horiguchi, shy828301, torvalds, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: mempolicy: don't have to split pmd for huge zero page

When trying to migrate pages to obey mempolicy, the huge zero page is split
by inserting the base zero pfn into all PTEs; the page table walk then falls
back to PTE level and just skips the zero page.  Skipping the zero page for
mempolicy has been the kernel's behavior since v2.6.16 due to commit
f4598c8b3678 ("[PATCH] migration: make sure there is no attempt to migrate
reserved pages.").  So it seems pointless to split the huge zero page; it
can simply be skipped like the base zero page.

Set ACTION_CONTINUE to prevent walk_page_range() from splitting the pmd in
this case.
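
For context, a ->pmd_entry() pagewalk callback steers walk_page_range()
through walk->action; a sketch of the action values from <linux/pagewalk.h>:

	walk->action = ACTION_SUBTREE;	/* default: split if needed, walk PTEs */
	walk->action = ACTION_CONTINUE;	/* skip: keep the pmd intact, move on */
	walk->action = ACTION_AGAIN;	/* re-walk this level */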

Link: https://lkml.kernel.org/r/20210609172146.3594-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20210604203513.240709-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/mempolicy.c~mm-mempolicy-dont-have-to-split-pmd-for-huge-zero-page
+++ a/mm/mempolicy.c
@@ -437,7 +437,8 @@ static inline bool queue_pages_required(
 
 /*
  * queue_pages_pmd() has four possible return values:
- * 0 - pages are placed on the right node or queued successfully.
+ * 0 - pages are placed on the right node or queued successfully, or
+ *     special page is met, i.e. huge zero page.
  * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
  *     specified.
  * 2 - THP was split.
@@ -461,8 +462,7 @@ static int queue_pages_pmd(pmd_t *pmd, s
 	page = pmd_page(*pmd);
 	if (is_huge_zero_page(page)) {
 		spin_unlock(ptl);
-		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
-		ret = 2;
+		walk->action = ACTION_CONTINUE;
 		goto out;
 	}
 	if (!queue_pages_required(page, qp))
@@ -489,7 +489,8 @@ out:
  * and move them to the pagelist if they do.
  *
  * queue_pages_pte_range() has three possible return values:
- * 0 - pages are placed on the right node or queued successfully.
+ * 0 - pages are placed on the right node or queued successfully, or
+ *     special page is met, i.e. zero page.
  * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
  *     specified.
  * -EIO - only MPOL_MF_STRICT was specified and an existing page was already
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 075/192] mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies
  2021-07-01  1:46 incoming Andrew Morton
                   ` (73 preceding siblings ...)
  2021-07-01  1:51 ` [patch 074/192] mm: mempolicy: don't have to split pmd for huge zero page Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 076/192] include/linux/mmzone.h: add documentation for pfn_valid() Andrew Morton
                   ` (117 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: aarcange, ak, akpm, ben.widawsky, dave.hansen, feng.tang,
	linux-mm, mgorman, mhocko, mike.kravetz, mm-commits, rientjes,
	torvalds, vbabka

From: Ben Widawsky <ben.widawsky@intel.com>
Subject: mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies

The current 'mempolicy' structure uses a union to store the node info for
the bind/interleave/prefer policies.

	union {
		short 		 preferred_node; /* preferred */
		nodemask_t	 nodes;		/* interleave/bind */
		/* undefined for default */
	} v;

Since a preferred node can also be represented by a nodemask_t with only one
bit set, unify these policies by using a single nodemask_t 'nodes', which
removes the union, simplifies the code, and makes it easier to support the
node info of future policies.
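
A sketch of the equivalence the unification relies on: a preferred node is
just a one-bit nodemask.

	nodemask_t nodes;

	nodes_clear(nodes);
	node_set(nid, nodes);		/* encode the preferred node */
	nid = first_node(nodes);	/* ...and recover it later */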

Link: https://lore.kernel.org/r/20200630212517.308045-7-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1623399825-75651-1-git-send-email-feng.tang@intel.com
Co-developed-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mempolicy.h |    7 --
 mm/mempolicy.c            |   96 ++++++++++++++++--------------------
 2 files changed, 46 insertions(+), 57 deletions(-)

--- a/include/linux/mempolicy.h~mm-mempolicy-use-unified-nodes-for-bind-interleave-prefer-policies
+++ a/include/linux/mempolicy.h
@@ -46,11 +46,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	unsigned short mode; 	/* See MPOL_* above */
 	unsigned short flags;	/* See set_mempolicy() MPOL_F_* above */
-	union {
-		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave/bind */
-		/* undefined for default */
-	} v;
+	nodemask_t nodes;	/* interleave/bind/prefer */
+
 	union {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
--- a/mm/mempolicy.c~mm-mempolicy-use-unified-nodes-for-bind-interleave-prefer-policies
+++ a/mm/mempolicy.c
@@ -193,7 +193,7 @@ static int mpol_new_interleave(struct me
 {
 	if (nodes_empty(*nodes))
 		return -EINVAL;
-	pol->v.nodes = *nodes;
+	pol->nodes = *nodes;
 	return 0;
 }
 
@@ -201,7 +201,9 @@ static int mpol_new_preferred(struct mem
 {
 	if (nodes_empty(*nodes))
 		return -EINVAL;
-	pol->v.preferred_node = first_node(*nodes);
+
+	nodes_clear(pol->nodes);
+	node_set(first_node(*nodes), pol->nodes);
 	return 0;
 }
 
@@ -209,7 +211,7 @@ static int mpol_new_bind(struct mempolic
 {
 	if (nodes_empty(*nodes))
 		return -EINVAL;
-	pol->v.nodes = *nodes;
+	pol->nodes = *nodes;
 	return 0;
 }
 
@@ -324,7 +326,7 @@ static void mpol_rebind_nodemask(struct
 	else if (pol->flags & MPOL_F_RELATIVE_NODES)
 		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
 	else {
-		nodes_remap(tmp, pol->v.nodes, pol->w.cpuset_mems_allowed,
+		nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed,
 								*nodes);
 		pol->w.cpuset_mems_allowed = *nodes;
 	}
@@ -332,7 +334,7 @@ static void mpol_rebind_nodemask(struct
 	if (nodes_empty(tmp))
 		tmp = *nodes;
 
-	pol->v.nodes = tmp;
+	pol->nodes = tmp;
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -897,15 +899,12 @@ static void get_policy_nodemask(struct m
 	switch (p->mode) {
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		*nodes = p->v.nodes;
+	case MPOL_PREFERRED:
+		*nodes = p->nodes;
 		break;
 	case MPOL_LOCAL:
 		/* return empty node mask for local allocation */
 		break;
-
-	case MPOL_PREFERRED:
-		node_set(p->v.preferred_node, *nodes);
-		break;
 	default:
 		BUG();
 	}
@@ -989,7 +988,7 @@ static long do_get_mempolicy(int *policy
 			*policy = err;
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
-			*policy = next_node_in(current->il_prev, pol->v.nodes);
+			*policy = next_node_in(current->il_prev, pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1857,14 +1856,14 @@ static int apply_policy_zone(struct memp
 	BUG_ON(dynamic_policy_zone == ZONE_MOVABLE);
 
 	/*
-	 * if policy->v.nodes has movable memory only,
+	 * if policy->nodes has movable memory only,
 	 * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only.
 	 *
-	 * policy->v.nodes is intersect with node_states[N_MEMORY].
+	 * policy->nodes is intersect with node_states[N_MEMORY].
 	 * so if the following test fails, it implies
-	 * policy->v.nodes has movable memory only.
+	 * policy->nodes has movable memory only.
 	 */
-	if (!nodes_intersects(policy->v.nodes, node_states[N_HIGH_MEMORY]))
+	if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY]))
 		dynamic_policy_zone = ZONE_MOVABLE;
 
 	return zone >= dynamic_policy_zone;
@@ -1879,8 +1878,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, s
 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
 	if (unlikely(policy->mode == MPOL_BIND) &&
 			apply_policy_zone(policy, gfp_zone(gfp)) &&
-			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
-		return &policy->v.nodes;
+			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
+		return &policy->nodes;
 
 	return NULL;
 }
@@ -1889,7 +1888,7 @@ nodemask_t *policy_nodemask(gfp_t gfp, s
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
 	if (policy->mode == MPOL_PREFERRED) {
-		nd = policy->v.preferred_node;
+		nd = first_node(policy->nodes);
 	} else {
 		/*
 		 * __GFP_THISNODE shouldn't even be used with the bind policy
@@ -1908,7 +1907,7 @@ static unsigned interleave_nodes(struct
 	unsigned next;
 	struct task_struct *me = current;
 
-	next = next_node_in(me->il_prev, policy->v.nodes);
+	next = next_node_in(me->il_prev, policy->nodes);
 	if (next < MAX_NUMNODES)
 		me->il_prev = next;
 	return next;
@@ -1932,7 +1931,7 @@ unsigned int mempolicy_slab_node(void)
 
 	switch (policy->mode) {
 	case MPOL_PREFERRED:
-		return policy->v.preferred_node;
+		return first_node(policy->nodes);
 
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
@@ -1948,7 +1947,7 @@ unsigned int mempolicy_slab_node(void)
 		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
 		zonelist = &NODE_DATA(node)->node_zonelists[ZONELIST_FALLBACK];
 		z = first_zones_zonelist(zonelist, highest_zoneidx,
-							&policy->v.nodes);
+							&policy->nodes);
 		return z->zone ? zone_to_nid(z->zone) : node;
 	}
 	case MPOL_LOCAL:
@@ -1961,12 +1960,12 @@ unsigned int mempolicy_slab_node(void)
 
 /*
  * Do static interleaving for a VMA with known offset @n.  Returns the n'th
- * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the
+ * node in pol->nodes (starting from n=0), wrapping around if n exceeds the
  * number of present nodes.
  */
 static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 {
-	unsigned nnodes = nodes_weight(pol->v.nodes);
+	unsigned nnodes = nodes_weight(pol->nodes);
 	unsigned target;
 	int i;
 	int nid;
@@ -1974,9 +1973,9 @@ static unsigned offset_il_node(struct me
 	if (!nnodes)
 		return numa_node_id();
 	target = (unsigned int)n % nnodes;
-	nid = first_node(pol->v.nodes);
+	nid = first_node(pol->nodes);
 	for (i = 0; i < target; i++)
-		nid = next_node(nid, pol->v.nodes);
+		nid = next_node(nid, pol->nodes);
 	return nid;
 }
 
@@ -2032,7 +2031,7 @@ int huge_node(struct vm_area_struct *vma
 	} else {
 		nid = policy_node(gfp_flags, *mpol, numa_node_id());
 		if ((*mpol)->mode == MPOL_BIND)
-			*nodemask = &(*mpol)->v.nodes;
+			*nodemask = &(*mpol)->nodes;
 	}
 	return nid;
 }
@@ -2056,7 +2055,6 @@ int huge_node(struct vm_area_struct *vma
 bool init_nodemask_of_mempolicy(nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
-	int nid;
 
 	if (!(mask && current->mempolicy))
 		return false;
@@ -2065,18 +2063,13 @@ bool init_nodemask_of_mempolicy(nodemask
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
-		nid = mempolicy->v.preferred_node;
-		init_nodemask_of_node(mask, nid);
-		break;
-
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		*mask = mempolicy->v.nodes;
+		*mask = mempolicy->nodes;
 		break;
 
 	case MPOL_LOCAL:
-		nid = numa_node_id();
-		init_nodemask_of_node(mask, nid);
+		init_nodemask_of_node(mask, numa_node_id());
 		break;
 
 	default:
@@ -2110,7 +2103,7 @@ bool mempolicy_in_oom_domain(struct task
 	task_lock(tsk);
 	mempolicy = tsk->mempolicy;
 	if (mempolicy && mempolicy->mode == MPOL_BIND)
-		ret = nodes_intersects(mempolicy->v.nodes, *mask);
+		ret = nodes_intersects(mempolicy->nodes, *mask);
 	task_unlock(tsk);
 
 	return ret;
@@ -2184,7 +2177,7 @@ struct page *alloc_pages_vma(gfp_t gfp,
 		 * node in its nodemask, we allocate the standard way.
 		 */
 		if (pol->mode == MPOL_PREFERRED)
-			hpage_node = pol->v.preferred_node;
+			hpage_node = first_node(pol->nodes);
 
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
@@ -2317,9 +2310,8 @@ bool __mpol_equal(struct mempolicy *a, s
 	switch (a->mode) {
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		return !!nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
-		return a->v.preferred_node == b->v.preferred_node;
+		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
 	default:
@@ -2459,7 +2451,7 @@ int mpol_misplaced(struct page *page, st
 		break;
 
 	case MPOL_PREFERRED:
-		polnid = pol->v.preferred_node;
+		polnid = first_node(pol->nodes);
 		break;
 
 	case MPOL_LOCAL:
@@ -2469,7 +2461,7 @@ int mpol_misplaced(struct page *page, st
 	case MPOL_BIND:
 		/* Optimize placement among multiple nodes via NUMA balancing */
 		if (pol->flags & MPOL_F_MORON) {
-			if (node_isset(thisnid, pol->v.nodes))
+			if (node_isset(thisnid, pol->nodes))
 				break;
 			goto out;
 		}
@@ -2480,12 +2472,12 @@ int mpol_misplaced(struct page *page, st
 		 * else select nearest allowed node, if any.
 		 * If no allowed nodes, use current [!misplaced].
 		 */
-		if (node_isset(curnid, pol->v.nodes))
+		if (node_isset(curnid, pol->nodes))
 			goto out;
 		z = first_zones_zonelist(
 				node_zonelist(numa_node_id(), GFP_HIGHUSER),
 				gfp_zone(GFP_HIGHUSER),
-				&pol->v.nodes);
+				&pol->nodes);
 		polnid = zone_to_nid(z->zone);
 		break;
 
@@ -2688,7 +2680,7 @@ int mpol_set_shared_policy(struct shared
 		 vma->vm_pgoff,
 		 sz, npol ? npol->mode : -1,
 		 npol ? npol->flags : -1,
-		 npol ? nodes_addr(npol->v.nodes)[0] : NUMA_NO_NODE);
+		 npol ? nodes_addr(npol->nodes)[0] : NUMA_NO_NODE);
 
 	if (npol) {
 		new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
@@ -2786,7 +2778,7 @@ void __init numa_policy_init(void)
 			.refcnt = ATOMIC_INIT(1),
 			.mode = MPOL_PREFERRED,
 			.flags = MPOL_F_MOF | MPOL_F_MORON,
-			.v = { .preferred_node = nid, },
+			.nodes = nodemask_of_node(nid),
 		};
 	}
 
@@ -2945,12 +2937,14 @@ int mpol_parse_str(char *str, struct mem
 	 * Save nodes for mpol_to_str() to show the tmpfs mount options
 	 * for /proc/mounts, /proc/pid/mounts and /proc/pid/mountinfo.
 	 */
-	if (mode != MPOL_PREFERRED)
-		new->v.nodes = nodes;
-	else if (nodelist)
-		new->v.preferred_node = first_node(nodes);
-	else
+	if (mode != MPOL_PREFERRED) {
+		new->nodes = nodes;
+	} else if (nodelist) {
+		nodes_clear(new->nodes);
+		node_set(first_node(nodes), new->nodes);
+	} else {
 		new->mode = MPOL_LOCAL;
+	}
 
 	/*
 	 * Save nodes for contextualization: this will be used to "clone"
@@ -2999,11 +2993,9 @@ void mpol_to_str(char *buffer, int maxle
 	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
-		node_set(pol->v.preferred_node, nodes);
-		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		nodes = pol->v.nodes;
+		nodes = pol->nodes;
 		break;
 	default:
 		WARN_ON_ONCE(1);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 076/192] include/linux/mmzone.h: add documentation for pfn_valid()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (74 preceding siblings ...)
  2021-07-01  1:51 ` [patch 075/192] mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 077/192] memblock: update initialization of reserved pages Andrew Morton
                   ` (116 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, anshuman.khandual, ardb, catalin.marinas, david, linux-mm,
	mark.rutland, maz, mm-commits, rppt, torvalds, wangkefeng.wang,
	will

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: include/linux/mmzone.h: add documentation for pfn_valid()

Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4.

These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire
pfn_valid_within() to 1.  

The idea is to mark NOMAP pages as reserved in the memory map and restore
the intended semantics of pfn_valid() to designate availability of struct
page for a pfn.

With this, the core mm will be able to cope with the fact that it cannot use
NOMAP pages, and the holes created by NOMAP ranges within MAX_ORDER blocks
will be treated correctly even without pfn_valid_within().


This patch (of 4):

Add a comment describing the semantics of pfn_valid() that clarifies that
pfn_valid() only checks for the availability of a memory map entry (i.e. a
struct page) for a PFN, rather than the availability of usable memory
backing that PFN.

The most "generic" version of pfn_valid(), used by configurations with
SPARSEMEM enabled, resides in include/linux/mmzone.h, so this is the most
suitable place for documentation about the semantics of pfn_valid().
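
The distinction matters for callers; a hedged usage sketch:

	if (pfn_valid(pfn)) {
		struct page *page = pfn_to_page(pfn);	/* map entry exists */

		/* ...but the pfn may still back a hole or an unusable
		 * frame, so e.g. PageReserved(page) must be checked
		 * separately before touching the memory. */
	}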

Link: https://lkml.kernel.org/r/20210511100550.28178-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210511100550.28178-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/include/linux/mmzone.h~include-linux-mmzoneh-add-documentation-for-pfn_valid
+++ a/include/linux/mmzone.h
@@ -1445,6 +1445,17 @@ static inline int pfn_section_valid(stru
 #endif
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
+/**
+ * pfn_valid - check if there is a valid memory map entry for a PFN
+ * @pfn: the page frame number to check
+ *
+ * Check if there is a valid memory map entry aka struct page for the @pfn.
+ * Note, that availability of the memory map entry does not imply that
+ * there is actual usable memory at that @pfn. The struct page may
+ * represent a hole or an unusable page frame.
+ *
+ * Return: 1 for PFNs that have memory map entries and 0 otherwise
+ */
 static inline int pfn_valid(unsigned long pfn)
 {
 	struct mem_section *ms;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 077/192] memblock: update initialization of reserved pages
  2021-07-01  1:46 incoming Andrew Morton
                   ` (75 preceding siblings ...)
  2021-07-01  1:51 ` [patch 076/192] include/linux/mmzone.h: add documentation for pfn_valid() Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 078/192] arm64: decouple check whether pfn is in linear map from pfn_valid() Andrew Morton
                   ` (115 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, anshuman.khandual, ardb, catalin.marinas, david, linux-mm,
	mark.rutland, maz, mm-commits, rppt, torvalds, wangkefeng.wang,
	will

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: memblock: update initialization of reserved pages

The struct pages representing a reserved memory region are initialized using
the reserve_bootmem_region() function.  This function is called for each
reserved region just before the memory is freed from memblock to the buddy
page allocator.

The struct pages for MEMBLOCK_NOMAP regions are kept with the default values
set by the memory map initialization, which makes it necessary to treat such
pages specially in pfn_valid() and pfn_valid_within().

Split out the initialization of the reserved pages into a function with a
meaningful name, treat the MEMBLOCK_NOMAP regions the same way as the
reserved regions, and mark the struct pages for the NOMAP regions as
PageReserved.
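
For reference, reserve_bootmem_region() walks the pfn range and marks every
valid struct page reserved; a simplified sketch of its core loop (the
upstream version also handles deferred struct page initialization):

	unsigned long pfn;

	for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
		if (pfn_valid(pfn)) {
			struct page *page = pfn_to_page(pfn);

			INIT_LIST_HEAD(&page->lru);
			__SetPageReserved(page);
		}
	}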

Link: https://lkml.kernel.org/r/20210511100550.28178-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memblock.h |    4 +++-
 mm/memblock.c            |   28 ++++++++++++++++++++++++++--
 2 files changed, 29 insertions(+), 3 deletions(-)

--- a/include/linux/memblock.h~memblock-update-initialization-of-reserved-pages
+++ a/include/linux/memblock.h
@@ -30,7 +30,9 @@ extern unsigned long long max_possible_p
  * @MEMBLOCK_NONE: no special request
  * @MEMBLOCK_HOTPLUG: hotpluggable region
  * @MEMBLOCK_MIRROR: mirrored region
- * @MEMBLOCK_NOMAP: don't add to kernel direct mapping
+ * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
+ * reserved in the memory map; refer to memblock_mark_nomap() description
+ * for further details
  */
 enum memblock_flags {
 	MEMBLOCK_NONE		= 0x0,	/* No special request */
--- a/mm/memblock.c~memblock-update-initialization-of-reserved-pages
+++ a/mm/memblock.c
@@ -906,6 +906,11 @@ int __init_memblock memblock_mark_mirror
  * @base: the base phys addr of the region
  * @size: the size of the region
  *
+ * The memory regions marked with %MEMBLOCK_NOMAP will not be added to the
+ * direct mapping of the physical memory. These regions will still be
+ * covered by the memory map. The struct page representing NOMAP memory
+ * frames in the memory map will be PageReserved()
+ *
  * Return: 0 on success, -errno on failure.
  */
 int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size)
@@ -2002,6 +2007,26 @@ static unsigned long __init __free_memor
 	return end_pfn - start_pfn;
 }
 
+static void __init memmap_init_reserved_pages(void)
+{
+	struct memblock_region *region;
+	phys_addr_t start, end;
+	u64 i;
+
+	/* initialize struct pages for the reserved regions */
+	for_each_reserved_mem_range(i, &start, &end)
+		reserve_bootmem_region(start, end);
+
+	/* and also treat struct pages for the NOMAP regions as PageReserved */
+	for_each_mem_region(region) {
+		if (memblock_is_nomap(region)) {
+			start = region->base;
+			end = start + region->size;
+			reserve_bootmem_region(start, end);
+		}
+	}
+}
+
 static unsigned long __init free_low_memory_core_early(void)
 {
 	unsigned long count = 0;
@@ -2010,8 +2035,7 @@ static unsigned long __init free_low_mem
 
 	memblock_clear_hotplug(0, -1);
 
-	for_each_reserved_mem_range(i, &start, &end)
-		reserve_bootmem_region(start, end);
+	memmap_init_reserved_pages();
 
 	/*
 	 * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id
_

* [patch 078/192] arm64: decouple check whether pfn is in linear map from pfn_valid()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (76 preceding siblings ...)
  2021-07-01  1:51 ` [patch 077/192] memblock: update initialization of reserved pages Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 079/192] arm64: drop pfn_valid_within() and simplify pfn_valid() Andrew Morton
                   ` (114 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, anshuman.khandual, ardb, catalin.marinas, david, linux-mm,
	mark.rutland, maz, mm-commits, rppt, torvalds, wangkefeng.wang,
	will

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: arm64: decouple check whether pfn is in linear map from pfn_valid()

The intended semantics of pfn_valid() is to verify whether there is a
struct page for the pfn in question and nothing else.

Yet, on arm64 it is used to distinguish memory areas that are mapped in
the linear map from those that require ioremap() to access them.

Introduce a dedicated pfn_is_map_memory() wrapper for
memblock_is_map_memory() to perform such a check, and use it where
appropriate.

Using a wrapper allows us to avoid cyclic include dependencies.

While here, also update the style of the pfn_valid() declaration so
that it is consistent with pfn_is_map_memory().
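
A hedged sketch of the resulting split in responsibilities, condensed
from the ioremap_cache() hunk below (the helper name access_phys() is
made up):

	/* pfn_valid():         is there a struct page for this pfn?  */
	/* pfn_is_map_memory(): is this pfn in the kernel linear map? */
	static void __iomem *access_phys(phys_addr_t phys_addr, size_t size)
	{
		if (pfn_is_map_memory(__phys_to_pfn(phys_addr)))
			/* normal memory: already mapped, cacheable */
			return (void __iomem *)__phys_to_virt(phys_addr);
		/* device/NOMAP memory: needs an explicit mapping */
		return ioremap(phys_addr, size);
	}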

Link: https://lkml.kernel.org/r/20210511100550.28178-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/include/asm/memory.h |    2 +-
 arch/arm64/include/asm/page.h   |    3 ++-
 arch/arm64/kvm/mmu.c            |    2 +-
 arch/arm64/mm/init.c            |   12 ++++++++++++
 arch/arm64/mm/ioremap.c         |    4 ++--
 arch/arm64/mm/mmu.c             |    2 +-
 6 files changed, 19 insertions(+), 6 deletions(-)

--- a/arch/arm64/include/asm/memory.h~arm64-decouple-check-whether-pfn-is-in-linear-map-from-pfn_valid
+++ a/arch/arm64/include/asm/memory.h
@@ -369,7 +369,7 @@ static inline void *phys_to_virt(phys_ad
 
 #define virt_addr_valid(addr)	({					\
 	__typeof__(addr) __addr = __tag_reset(addr);			\
-	__is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr));	\
+	__is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr));	\
 })
 
 void dump_mem_limit(void);
--- a/arch/arm64/include/asm/page.h~arm64-decouple-check-whether-pfn-is-in-linear-map-from-pfn_valid
+++ a/arch/arm64/include/asm/page.h
@@ -37,7 +37,8 @@ void copy_highpage(struct page *to, stru
 
 typedef struct page *pgtable_t;
 
-extern int pfn_valid(unsigned long);
+int pfn_valid(unsigned long pfn);
+int pfn_is_map_memory(unsigned long pfn);
 
 #include <asm/memory.h>
 
--- a/arch/arm64/kvm/mmu.c~arm64-decouple-check-whether-pfn-is-in-linear-map-from-pfn_valid
+++ a/arch/arm64/kvm/mmu.c
@@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *k
 
 static bool kvm_is_device_pfn(unsigned long pfn)
 {
-	return !pfn_valid(pfn);
+	return !pfn_is_map_memory(pfn);
 }
 
 static void *stage2_memcache_zalloc_page(void *arg)
--- a/arch/arm64/mm/init.c~arm64-decouple-check-whether-pfn-is-in-linear-map-from-pfn_valid
+++ a/arch/arm64/mm/init.c
@@ -256,6 +256,18 @@ int pfn_valid(unsigned long pfn)
 }
 EXPORT_SYMBOL(pfn_valid);
 
+int pfn_is_map_memory(unsigned long pfn)
+{
+	phys_addr_t addr = PFN_PHYS(pfn);
+
+	/* avoid false positives for bogus PFNs, see comment in pfn_valid() */
+	if (PHYS_PFN(addr) != pfn)
+		return 0;
+
+	return memblock_is_map_memory(addr);
+}
+EXPORT_SYMBOL(pfn_is_map_memory);
+
 static phys_addr_t memory_limit = PHYS_ADDR_MAX;
 
 /*
--- a/arch/arm64/mm/ioremap.c~arm64-decouple-check-whether-pfn-is-in-linear-map-from-pfn_valid
+++ a/arch/arm64/mm/ioremap.c
@@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(ph
 	/*
 	 * Don't allow RAM to be mapped.
 	 */
-	if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr))))
+	if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr))))
 		return NULL;
 
 	area = get_vm_area_caller(size, VM_IOREMAP, caller);
@@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap);
 void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size)
 {
 	/* For normal memory we already have a cacheable mapping. */
-	if (pfn_valid(__phys_to_pfn(phys_addr)))
+	if (pfn_is_map_memory(__phys_to_pfn(phys_addr)))
 		return (void __iomem *)__phys_to_virt(phys_addr);
 
 	return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL),
--- a/arch/arm64/mm/mmu.c~arm64-decouple-check-whether-pfn-is-in-linear-map-from-pfn_valid
+++ a/arch/arm64/mm/mmu.c
@@ -82,7 +82,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 			      unsigned long size, pgprot_t vma_prot)
 {
-	if (!pfn_valid(pfn))
+	if (!pfn_is_map_memory(pfn))
 		return pgprot_noncached(vma_prot);
 	else if (file->f_flags & O_SYNC)
 		return pgprot_writecombine(vma_prot);
_

* [patch 079/192] arm64: drop pfn_valid_within() and simplify pfn_valid()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (77 preceding siblings ...)
  2021-07-01  1:51 ` [patch 078/192] arm64: decouple check whether pfn is in linear map from pfn_valid() Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 080/192] arm64/mm: drop HAVE_ARCH_PFN_VALID Andrew Morton
                   ` (113 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, anshuman.khandual, ardb, catalin.marinas, david, linux-mm,
	mark.rutland, maz, mm-commits, rppt, torvalds, wangkefeng.wang,
	will

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: arm64: drop pfn_valid_within() and simplify pfn_valid()

arm64's version of pfn_valid() differs from the generic one for two
reasons:

* Parts of the memory map are freed during boot.  This makes it
  necessary to verify that there is actual physical memory that
  corresponds to a pfn, which is done by querying memblock.

* There are NOMAP memory regions. These regions are not mapped in the
  linear map and until the previous commit the struct pages representing
  these areas had default values.

As a consequence of the absence of special treatment for NOMAP regions
in the memory map, it was necessary to use memblock_is_map_memory() in
pfn_valid() and to alias pfn_valid_within() to pfn_valid() so that
generic mm functionality would not treat a NOMAP page as a normal page.

Since the NOMAP regions are now marked as PageReserved(), pfn walkers and
the rest of core mm will treat them as unusable memory and thus
pfn_valid_within() is no longer required at all and can be disabled on
arm64.

pfn_valid() can be slightly simplified by replacing
memblock_is_map_memory() with memblock_is_memory().
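
For reference, the semantic difference between the two memblock queries
(annotations are mine, not from the patch):

	memblock_is_memory(addr);	/* addr is in a memory region,
					 * MEMBLOCK_NOMAP ones included */
	memblock_is_map_memory(addr);	/* addr is in a memory region
					 * that is not MEMBLOCK_NOMAP */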

[rppt@kernel.org: fix merge fix]
  Link: https://lkml.kernel.org/r/YJtoQhidtIJOhYsV@kernel.org
Link: https://lkml.kernel.org/r/20210511100550.28178-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/Kconfig   |    1 -
 arch/arm64/mm/init.c |    2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

--- a/arch/arm64/Kconfig~arm64-drop-pfn_valid_within-and-simplify-pfn_valid
+++ a/arch/arm64/Kconfig
@@ -201,7 +201,6 @@ config ARM64
 	select HAVE_KPROBES
 	select HAVE_KRETPROBES
 	select HAVE_GENERIC_VDSO
-	select HOLES_IN_ZONE
 	select IOMMU_DMA if IOMMU_SUPPORT
 	select IRQ_DOMAIN
 	select IRQ_FORCED_THREADING
--- a/arch/arm64/mm/init.c~arm64-drop-pfn_valid_within-and-simplify-pfn_valid
+++ a/arch/arm64/mm/init.c
@@ -252,7 +252,7 @@ int pfn_valid(unsigned long pfn)
 	if (!early_section(ms))
 		return pfn_section_valid(ms, pfn);
 
-	return memblock_is_map_memory(addr);
+	return memblock_is_memory(addr);
 }
 EXPORT_SYMBOL(pfn_valid);
 
_

* [patch 080/192] arm64/mm: drop HAVE_ARCH_PFN_VALID
  2021-07-01  1:46 incoming Andrew Morton
                   ` (78 preceding siblings ...)
  2021-07-01  1:51 ` [patch 079/192] arm64: drop pfn_valid_within() and simplify pfn_valid() Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 081/192] mm: migrate: fix missing update page_private to hugetlb_page_subpool Andrew Morton
                   ` (112 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, anshuman.khandual, catalin.marinas, david, linux-mm,
	mm-commits, rppt, rppt, torvalds, will

From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: arm64/mm: drop HAVE_ARCH_PFN_VALID

CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64
platforms and free_unused_memmap() would just return without creating any
holes in the memmap mapping.  There is no need for any special handling in
pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped.  This also moves
the pfn upper bits sanity check into generic pfn_valid().
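
A worked example of the sanity check that moves into the generic
pfn_valid() (assuming PAGE_SHIFT == 12 and a 64-bit phys_addr_t):

	unsigned long pfn = (1UL << 52) | 0x1000;  /* bogus upper bit set */
	phys_addr_t addr = PFN_PHYS(pfn);  /* bit 52 shifts out of range */
	/* PHYS_PFN(addr) == 0x1000 != pfn, so pfn_valid() returns 0 */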

Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/Kconfig            |    1 
 arch/arm64/include/asm/page.h |    1 
 arch/arm64/mm/init.c          |   37 --------------------------------
 include/linux/mmzone.h        |    9 +++++++
 4 files changed, 9 insertions(+), 39 deletions(-)

--- a/arch/arm64/include/asm/page.h~arm64-mm-drop-have_arch_pfn_valid
+++ a/arch/arm64/include/asm/page.h
@@ -37,7 +37,6 @@ void copy_highpage(struct page *to, stru
 
 typedef struct page *pgtable_t;
 
-int pfn_valid(unsigned long pfn);
 int pfn_is_map_memory(unsigned long pfn);
 
 #include <asm/memory.h>
--- a/arch/arm64/Kconfig~arm64-mm-drop-have_arch_pfn_valid
+++ a/arch/arm64/Kconfig
@@ -154,7 +154,6 @@ config ARM64
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_MMAP_RND_BITS
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
-	select HAVE_ARCH_PFN_VALID
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	select HAVE_ARCH_SECCOMP_FILTER
--- a/arch/arm64/mm/init.c~arm64-mm-drop-have_arch_pfn_valid
+++ a/arch/arm64/mm/init.c
@@ -219,43 +219,6 @@ static void __init zone_sizes_init(unsig
 	free_area_init(max_zone_pfns);
 }
 
-int pfn_valid(unsigned long pfn)
-{
-	phys_addr_t addr = PFN_PHYS(pfn);
-	struct mem_section *ms;
-
-	/*
-	 * Ensure the upper PAGE_SHIFT bits are clear in the
-	 * pfn. Else it might lead to false positives when
-	 * some of the upper bits are set, but the lower bits
-	 * match a valid pfn.
-	 */
-	if (PHYS_PFN(addr) != pfn)
-		return 0;
-
-	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
-		return 0;
-
-	ms = __pfn_to_section(pfn);
-	if (!valid_section(ms))
-		return 0;
-
-	/*
-	 * ZONE_DEVICE memory does not have the memblock entries.
-	 * memblock_is_map_memory() check for ZONE_DEVICE based
-	 * addresses will always fail. Even the normal hotplugged
-	 * memory will never have MEMBLOCK_NOMAP flag set in their
-	 * memblock entries. Skip memblock search for all non early
-	 * memory sections covering all of hotplug memory including
-	 * both normal and ZONE_DEVICE based.
-	 */
-	if (!early_section(ms))
-		return pfn_section_valid(ms, pfn);
-
-	return memblock_is_memory(addr);
-}
-EXPORT_SYMBOL(pfn_valid);
-
 int pfn_is_map_memory(unsigned long pfn)
 {
 	phys_addr_t addr = PFN_PHYS(pfn);
--- a/include/linux/mmzone.h~arm64-mm-drop-have_arch_pfn_valid
+++ a/include/linux/mmzone.h
@@ -1460,6 +1460,15 @@ static inline int pfn_valid(unsigned lon
 {
 	struct mem_section *ms;
 
+	/*
+	 * Ensure the upper PAGE_SHIFT bits are clear in the
+	 * pfn. Else it might lead to false positives when
+	 * some of the upper bits are set, but the lower bits
+	 * match a valid pfn.
+	 */
+	if (PHYS_PFN(PFN_PHYS(pfn)) != pfn)
+		return 0;
+
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
 	ms = __nr_to_section(pfn_to_section_nr(pfn));
_

* [patch 081/192] mm: migrate: fix missing update page_private to hugetlb_page_subpool
  2021-07-01  1:46 incoming Andrew Morton
                   ` (79 preceding siblings ...)
  2021-07-01  1:51 ` [patch 080/192] arm64/mm: drop HAVE_ARCH_PFN_VALID Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 082/192] mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs Andrew Morton
                   ` (111 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, anshuman.khandual, david, duanxiongchun, linux-mm, mhocko,
	mike.kravetz, mm-commits, osalvador, songmuchun, torvalds, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: migrate: fix missing update page_private to hugetlb_page_subpool

Commit d6995da31122 ("hugetlb: use page.private for hugetlb specific
page flags") converted page.private to hold hugetlb-specific page flags,
so we should use hugetlb_page_subpool() to get the subpool pointer
instead of page_private().

This 'could' prevent the migration of hugetlb pages.  page_private(hpage)
is now used for hugetlb page specific flags.  At migration time, the only
flag which could be set is HPageVmemmapOptimized.  This flag will only be
set if the new vmemmap reduction feature is enabled.  In addition,
!page_mapping() implies an anonymous mapping.  So, this will prevent
migration of hugetlb pages in anonymous mappings if the vmemmap reduction
feature is enabled.

In addition, that if statement checked for the rare race condition of a
page being migrated while in the process of being freed.  Since that check
is now wrong, we could leak hugetlb subpool usage counts.

The commit forgot to update the page migration routine accordingly, so
fix it there.
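
For context, a sketch of why the old test broke (illustrative sequence;
the setter name is assumed from the flag name via the usual HPAGEFLAG
convention):

	struct hugepage_subpool *spool;

	/* page.private of a hugetlb head page now carries flags: */
	SetHPageVmemmapOptimized(hpage);   /* sets a bit in hpage->private */
	/*
	 * A non-zero page_private(hpage) therefore no longer implies a
	 * subpool; the subpool pointer must be read via the helper:
	 */
	spool = hugetlb_page_subpool(hpage);	/* may be NULL regardless */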

[songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy]
  Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com
Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Anshuman Khandual <anshuman.khandual@arm.com>	[arm64]
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h |    5 +++++
 mm/migrate.c            |    2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

--- a/include/linux/hugetlb.h~mm-migrate-fix-missing-update-page_private-to-hugetlb_page_subpool
+++ a/include/linux/hugetlb.h
@@ -898,6 +898,11 @@ static inline void huge_ptep_modify_prot
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
+{
+	return NULL;
+}
+
 static inline int isolate_or_dissolve_huge_page(struct page *page,
 						struct list_head *list)
 {
--- a/mm/migrate.c~mm-migrate-fix-missing-update-page_private-to-hugetlb_page_subpool
+++ a/mm/migrate.c
@@ -1293,7 +1293,7 @@ static int unmap_and_move_huge_page(new_
 	 * page_mapping() set, hugetlbfs specific move page routine will not
 	 * be called and we could leak usage counts for subpools.
 	 */
-	if (page_private(hpage) && !page_mapping(hpage)) {
+	if (hugetlb_page_subpool(hpage) && !page_mapping(hpage)) {
 		rc = -EBUSY;
 		goto out_unlock;
 	}
_

* [patch 082/192] mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs
  2021-07-01  1:46 incoming Andrew Morton
                   ` (80 preceding siblings ...)
  2021-07-01  1:51 ` [patch 081/192] mm: migrate: fix missing update page_private to hugetlb_page_subpool Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 083/192] mm: memory: add orig_pmd to struct vm_fault Andrew Morton
                   ` (110 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, cfijalkovich, hridya, hughd, kaleshsingh, linux-mm,
	mm-commits, song, surenb, timmurray, torvalds, viro,
	william.kucharski, willy

From: Collin Fijalkovich <cfijalkovich@google.com>
Subject: mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs

Transparent huge pages are supported for read-only non-shmem files, but
are only used for vmas with VM_DENYWRITE.  This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY).  Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open().  Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().

Systems that make heavy use of shared libraries (e.g.  Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.

This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()).  It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.

Restricting the use of THPs to executable mappings eliminates the risk
that a read-only file later opened for write would encounter significant
latencies due to page cache truncation.

The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout.  The dynamic linker must
map these files' segments at a hugepage-size-aligned vma for the mapping
to be backed with THPs.

Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.

4KB Pages:
==========

count              event_name            # count / runtime
598,995,035,942    cpu-cycles            # 1.800861 GHz
 81,195,620,851    raw-stall-frontend    # 244.112 M/sec
347,754,466,597    iTLB-loads            # 1.046 G/sec
  2,970,248,900    iTLB-load-misses      # 0.854122% miss rate

Total test time: 332.854998 seconds.

2MB Pages:
==========

count              event_name            # count / runtime
592,872,663,047    cpu-cycles            # 1.800358 GHz
 76,485,624,143    raw-stall-frontend    # 232.261 M/sec
350,478,413,710    iTLB-loads            # 1.064 G/sec
    803,233,322    iTLB-load-misses      # 0.229182% miss rate

Total test time: 329.826087 seconds

A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:

/apex/com.android.art/lib64/libart.so
FilePmdMapped:      4096 kB

/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped:      2048 kB

Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/open.c       |   13 +++++++++++--
 mm/khugepaged.c |   16 +++++++++++++++-
 2 files changed, 26 insertions(+), 3 deletions(-)

--- a/fs/open.c~mm-thp-relax-the-vm_denywrite-constraint-on-file-backed-thps
+++ a/fs/open.c
@@ -852,8 +852,17 @@ static int do_dentry_open(struct file *f
 	 * XXX: Huge page cache doesn't support writing yet. Drop all page
 	 * cache for this file before processing writes.
 	 */
-	if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping))
-		truncate_pagecache(inode, 0);
+	if (f->f_mode & FMODE_WRITE) {
+		/*
+		 * Paired with smp_mb() in collapse_file() to ensure nr_thps
+		 * is up to date and the update to i_writecount by
+		 * get_write_access() is visible. Ensures subsequent insertion
+		 * of THPs into the page cache will fail.
+		 */
+		smp_mb();
+		if (filemap_nr_thps(inode->i_mapping))
+			truncate_pagecache(inode, 0);
+	}
 
 	return 0;
 
--- a/mm/khugepaged.c~mm-thp-relax-the-vm_denywrite-constraint-on-file-backed-thps
+++ a/mm/khugepaged.c
@@ -457,7 +457,8 @@ static bool hugepage_vma_check(struct vm
 
 	/* Read-only file mappings need to be aligned for THP to work. */
 	if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
-	    (vm_flags & VM_DENYWRITE)) {
+	    !inode_is_open_for_write(vma->vm_file->f_inode) &&
+	    (vm_flags & VM_EXEC)) {
 		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
 				HPAGE_PMD_NR);
 	}
@@ -1862,6 +1863,19 @@ out_unlock:
 	else {
 		__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
 		filemap_nr_thps_inc(mapping);
+		/*
+		 * Paired with smp_mb() in do_dentry_open() to ensure
+		 * i_writecount is up to date and the update to nr_thps is
+		 * visible. Ensures the page cache will be truncated if the
+		 * file is opened writable.
+		 */
+		smp_mb();
+		if (inode_is_open_for_write(mapping->host)) {
+			result = SCAN_FAIL;
+			__mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
+			filemap_nr_thps_dec(mapping);
+			goto xa_locked;
+		}
 	}
 
 	if (nr_none) {
_
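
The barrier pairing added above is the classic store-buffering pattern;
a condensed sketch of the two racing paths (not literal code):

	/*
	 * CPU0 (do_dentry_open)          CPU1 (collapse_file)
	 * ---------------------          --------------------
	 * get_write_access()             filemap_nr_thps_inc()
	 * smp_mb()                       smp_mb()
	 * filemap_nr_thps()?             inode_is_open_for_write()?
	 *   -> truncate_pagecache()        -> roll back the new THP
	 *
	 * At least one CPU is guaranteed to observe the other's store, so
	 * a file that is open for write can never keep a freshly
	 * collapsed THP in its page cache.
	 */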

* [patch 083/192] mm: memory: add orig_pmd to struct vm_fault
  2021-07-01  1:46 incoming Andrew Morton
                   ` (81 preceding siblings ...)
  2021-07-01  1:51 ` [patch 082/192] mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 084/192] mm: memory: make numa_migrate_prep() non-static Andrew Morton
                   ` (109 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, borntraeger, gerald.schaefer, gor, hca, hughd,
	kirill.shutemov, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, ying.huang, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: memory: add orig_pmd to struct vm_fault

Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.

When THP NUMA fault support was added, THP migration was not supported
yet, so an ad hoc THP migration path was implemented in the NUMA fault
handling.  THP migration has been supported since v4.14, so it doesn't
make much sense to keep a second THP migration implementation rather
than using the generic migration code.  It is definitely a maintenance
burden to keep two THP migration implementations for different code
paths, and it is more error prone.  Using the generic THP migration
implementation allows us to remove the duplicate code and some hacks
needed by the old ad hoc implementation.

A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support
both THP and NUMA balancing.  Most of them support THP migration except
for S390.  Zi Yan tried to add THP migration support for S390 before,
but it was not accepted due to the design of the S390 PMD.  For the
discussion, please see: https://lkml.org/lkml/2018/4/27/953.

Per the discussion with Gerald Schaefer in v1, it is acceptable to skip
huge PMDs for S390 for now.

I saw there were some hacks around gup in the git history, but I didn't
figure out whether they have all been removed; I found FOLL_NUMA code in
the current gup implementation and it seems useful.

Patch #1 ~ #2 are preparation patches.
Patch #3 is the real meat.
Patch #4 ~ #6 keep consistent counters and behaviors with before.
Patch #7 skips changing huge PMDs to prot_none if THP migration is not supported.


Test
----
Did some tests to measure the latency of do_huge_pmd_numa_page.  The test
VM has 80 vcpus and 64G memory.  The test would create 2 processes to
consume 128G memory together which would incur memory pressure to cause
THP splits.  And it also creates 80 processes to hog cpu, and the memory
consumer processes are bound to different nodes periodically in order to
increase NUMA faults.

The below test script is used:

echo 3 > /proc/sys/vm/drop_caches

# Run stress-ng for 24 hours
./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
PID=$!

./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &

# Wait for vm stressors forked
sleep 5

PID_1=`pgrep -P $PID | awk 'NR == 1'`
PID_2=`pgrep -P $PID | awk 'NR == 2'`

JOB1=`pgrep -P $PID_1`
JOB2=`pgrep -P $PID_2`

# Bind load jobs to different nodes periodically to force generate
# cross node memory access
while [ -d "/proc/$PID" ]
do
        taskset -apc 8 $JOB1
        taskset -apc 8 $JOB2
        sleep 300
        taskset -apc 58 $JOB1
        taskset -apc 58 $JOB2
        sleep 300
done

With the above test, the histogram of the latency of
do_huge_pmd_numa_page is as shown below.  Since the number of
do_huge_pmd_numa_page calls varies drastically for each run (probably
due to the scheduler), I converted the raw numbers to percentages.

                             patched               base
@us[stress-ng]:
[0]                          3.57%                 0.16%
[1]                          55.68%                18.36%
[2, 4)                       10.46%                40.44%
[4, 8)                       7.26%                 17.82%
[8, 16)                      21.12%                13.41%
[16, 32)                     1.06%                 4.27%
[32, 64)                     0.56%                 4.07%
[64, 128)                    0.16%                 0.35%
[128, 256)                   < 0.1%                < 0.1%
[256, 512)                   < 0.1%                < 0.1%
[512, 1K)                    < 0.1%                < 0.1%
[1K, 2K)                     < 0.1%                < 0.1%
[2K, 4K)                     < 0.1%                < 0.1%
[4K, 8K)                     < 0.1%                < 0.1%
[8K, 16K)                    < 0.1%                < 0.1%
[16K, 32K)                   < 0.1%                < 0.1%
[32K, 64K)                   < 0.1%                < 0.1%

Per the result, the patched kernel is even slightly better than the base
kernel.  I think this is because lock contention against THP split is
lower than in the base kernel due to the refactor.

To exclude the effect of THP splits, I also tested w/o memory pressure.
No obvious regression was spotted.  Below is the test result *w/o*
memory pressure.

                           patched                  base
@us[stress-ng]:
[0]                        7.97%                   18.4%
[1]                        69.63%                  58.24%
[2, 4)                     4.18%                   2.63%
[4, 8)                     0.22%                   0.17%
[8, 16)                    1.03%                   0.92%
[16, 32)                   0.14%                   < 0.1%
[32, 64)                   < 0.1%                  < 0.1%
[64, 128)                  < 0.1%                  < 0.1%
[128, 256)                 < 0.1%                  < 0.1%
[256, 512)                 0.45%                   1.19%
[512, 1K)                  15.45%                  17.27%
[1K, 2K)                   < 0.1%                  < 0.1%
[2K, 4K)                   < 0.1%                  < 0.1%
[4K, 8K)                   < 0.1%                  < 0.1%
[8K, 16K)                  0.86%                   0.88%
[16K, 32K)                 < 0.1%                  0.15%
[32K, 64K)                 < 0.1%                  < 0.1%
[64K, 128K)                < 0.1%                  < 0.1%
[128K, 256K)               < 0.1%                  < 0.1%

The series also survived a series of tests that exercise NUMA balancing
migrations by Mel.


This patch (of 7):

Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge
page fault could be removed, just like its PTE counterpart does.
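
Condensed from the mm/memory.c hunk below, the usage pattern this
enables: the PMD value is snapshotted once into the vm_fault and the
downstream helpers read it from there instead of taking a parameter:

	vmf.orig_pmd = *vmf.pmd;	/* snapshot at fault time */
	barrier();
	if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
		return do_huge_pmd_numa_page(&vmf);	/* no pmd argument */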

Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/huge_mm.h |    9 ++++-----
 include/linux/mm.h      |    7 ++++++-
 mm/huge_memory.c        |    9 ++++++---
 mm/memory.c             |   26 +++++++++++++-------------
 4 files changed, 29 insertions(+), 22 deletions(-)

--- a/include/linux/huge_mm.h~mm-memory-add-orig_pmd-to-struct-vm_fault
+++ a/include/linux/huge_mm.h
@@ -11,7 +11,7 @@ vm_fault_t do_huge_pmd_anonymous_page(st
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
+void huge_pmd_set_accessed(struct vm_fault *vmf);
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma);
@@ -24,7 +24,7 @@ static inline void huge_pud_set_accessed
 }
 #endif
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmd,
 				   unsigned int flags);
@@ -288,7 +288,7 @@ struct page *follow_devmap_pmd(struct vm
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
 extern struct page *huge_zero_page;
 extern unsigned long huge_zero_pfn;
@@ -441,8 +441,7 @@ static inline spinlock_t *pud_trans_huge
 	return NULL;
 }
 
-static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf,
-		pmd_t orig_pmd)
+static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	return 0;
 }
--- a/include/linux/mm.h~mm-memory-add-orig_pmd-to-struct-vm_fault
+++ a/include/linux/mm.h
@@ -550,7 +550,12 @@ struct vm_fault {
 	pud_t *pud;			/* Pointer to pud entry matching
 					 * the 'address'
 					 */
-	pte_t orig_pte;			/* Value of PTE at the time of fault */
+	union {
+		pte_t orig_pte;		/* Value of PTE at the time of fault */
+		pmd_t orig_pmd;		/* Value of PMD at the time of fault,
+					 * used by PMD fault only.
+					 */
+	};
 
 	struct page *cow_page;		/* Page handler may use for COW fault */
 	struct page *page;		/* ->fault handlers should return a
--- a/mm/huge_memory.c~mm-memory-add-orig_pmd-to-struct-vm_fault
+++ a/mm/huge_memory.c
@@ -1257,11 +1257,12 @@ unlock:
 }
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
+void huge_pmd_set_accessed(struct vm_fault *vmf)
 {
 	pmd_t entry;
 	unsigned long haddr;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	pmd_t orig_pmd = vmf->orig_pmd;
 
 	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
 	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
@@ -1278,11 +1279,12 @@ unlock:
 	spin_unlock(vmf->ptl);
 }
 
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	pmd_t orig_pmd = vmf->orig_pmd;
 
 	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
@@ -1418,9 +1420,10 @@ out:
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
+vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	pmd_t pmd = vmf->orig_pmd;
 	struct anon_vma *anon_vma = NULL;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
--- a/mm/memory.c~mm-memory-add-orig_pmd-to-struct-vm_fault
+++ a/mm/memory.c
@@ -4298,12 +4298,12 @@ static inline vm_fault_t create_huge_pmd
 }
 
 /* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
+static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
+		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
-		return do_huge_pmd_wp_page(vmf, orig_pmd);
+		return do_huge_pmd_wp_page(vmf);
 	}
 	if (vmf->vma->vm_ops->huge_fault) {
 		vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
@@ -4530,26 +4530,26 @@ retry_pud:
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	} else {
-		pmd_t orig_pmd = *vmf.pmd;
+		vmf.orig_pmd = *vmf.pmd;
 
 		barrier();
-		if (unlikely(is_swap_pmd(orig_pmd))) {
+		if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
 			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(orig_pmd));
-			if (is_pmd_migration_entry(orig_pmd))
+					  !is_pmd_migration_entry(vmf.orig_pmd));
+			if (is_pmd_migration_entry(vmf.orig_pmd))
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
 		}
-		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
-			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
-				return do_huge_pmd_numa_page(&vmf, orig_pmd);
+		if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
+				return do_huge_pmd_numa_page(&vmf);
 
-			if (dirty && !pmd_write(orig_pmd)) {
-				ret = wp_huge_pmd(&vmf, orig_pmd);
+			if (dirty && !pmd_write(vmf.orig_pmd)) {
+				ret = wp_huge_pmd(&vmf);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
 			} else {
-				huge_pmd_set_accessed(&vmf, orig_pmd);
+				huge_pmd_set_accessed(&vmf);
 				return 0;
 			}
 		}
_

* [patch 084/192] mm: memory: make numa_migrate_prep() non-static
  2021-07-01  1:46 incoming Andrew Morton
                   ` (82 preceding siblings ...)
  2021-07-01  1:51 ` [patch 083/192] mm: memory: add orig_pmd to struct vm_fault Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 085/192] mm: thp: refactor NUMA fault handling Andrew Morton
                   ` (108 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, borntraeger, gerald.schaefer, gor, hca, hughd,
	kirill.shutemov, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, ying.huang, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: memory: make numa_migrate_prep() non-static

numa_migrate_prep() will be used by the huge NUMA fault path as well in
the following patch, so make it non-static.

Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/internal.h |    3 +++
 mm/memory.c   |    5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

--- a/mm/internal.h~mm-memory-make-numa_migrate_prep-non-static
+++ a/mm/internal.h
@@ -672,4 +672,7 @@ int vmap_pages_range_noflush(unsigned lo
 
 void vunmap_range_noflush(unsigned long start, unsigned long end);
 
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+		      unsigned long addr, int page_nid, int *flags);
+
 #endif	/* __MM_INTERNAL_H */
--- a/mm/memory.c~mm-memory-make-numa_migrate_prep-non-static
+++ a/mm/memory.c
@@ -4175,9 +4175,8 @@ static vm_fault_t do_fault(struct vm_fau
 	return ret;
 }
 
-static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
-				unsigned long addr, int page_nid,
-				int *flags)
+int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+		      unsigned long addr, int page_nid, int *flags)
 {
 	get_page(page);
 
_

* [patch 085/192] mm: thp: refactor NUMA fault handling
  2021-07-01  1:46 incoming Andrew Morton
                   ` (83 preceding siblings ...)
  2021-07-01  1:51 ` [patch 084/192] mm: memory: make numa_migrate_prep() non-static Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 086/192] mm: migrate: account THP NUMA migration counters correctly Andrew Morton
                   ` (107 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, borntraeger, dan.carpenter, gerald.schaefer, gor, hca,
	hughd, kirill.shutemov, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, ying.huang, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: thp: refactor NUMA fault handling

When THP NUMA fault support was added, THP migration was not supported
yet, so an ad hoc THP migration path was implemented in the NUMA fault
handling.  THP migration has been supported since v4.14, so it doesn't
make much sense to keep a second THP migration implementation rather
than using the generic migration code.

This patch reworks the NUMA fault handling to use the generic migration
implementation to migrate misplaced pages.  There is no functional change.

After the refactor the flow of NUMA fault handling looks just like its
PTE counterpart:
  Acquire ptl
  Prepare for migration (elevate page refcount)
  Release ptl
  Isolate page from lru and elevate page refcount
  Migrate the misplaced THP

If migration fails, just restore the old normal PMD.
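
A condensed sketch of the reworked do_huge_pmd_numa_page() following
that flow (error paths elided; see the full hunk below):

	/* acquire ptl; bail out if the PMD changed under us */
	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
	/* prepare: pick a target node and take a page reference */
	target_nid = numa_migrate_prep(page, vma, haddr, page_nid, &flags);
	/* release ptl before the potentially sleeping migration */
	spin_unlock(vmf->ptl);
	/* isolate the page from the LRU and migrate the misplaced THP */
	migrated = migrate_misplaced_page(page, vma, target_nid);
	if (!migrated) {
		/* relock and restore the old, normal PMD (out_map label) */
	}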

In the old code the anon_vma lock was needed to serialize THP migration
against THP split, but the THP code has been reworked a lot since then
and it seems the anon_vma lock is no longer required to avoid the race.

Elevating the page refcount while holding the ptl should prevent the
THP from being split.

Use migrate_misplaced_page() for both base page and THP NUMA hinting
faults and remove all the dead and duplicate code.

[dan.carpenter@oracle.com: fix a double unlock bug]
  Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda
Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/migrate.h |   23 ----
 mm/huge_memory.c        |  145 +++++++++----------------------
 mm/internal.h           |   18 ---
 mm/migrate.c            |  177 +++++++-------------------------------
 4 files changed, 77 insertions(+), 286 deletions(-)

--- a/include/linux/migrate.h~mm-thp-refactor-numa-fault-handling
+++ a/include/linux/migrate.h
@@ -99,14 +99,9 @@ static inline void __ClearPageMovable(st
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
 				  struct vm_area_struct *vma, int node);
 #else
-static inline bool pmd_trans_migrating(pmd_t pmd)
-{
-	return false;
-}
 static inline int migrate_misplaced_page(struct page *page,
 					 struct vm_area_struct *vma, int node)
 {
@@ -114,24 +109,6 @@ static inline int migrate_misplaced_page
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-extern int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-			struct vm_area_struct *vma,
-			pmd_t *pmd, pmd_t entry,
-			unsigned long address,
-			struct page *page, int node);
-#else
-static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-			struct vm_area_struct *vma,
-			pmd_t *pmd, pmd_t entry,
-			unsigned long address,
-			struct page *page, int node)
-{
-	return -EAGAIN;
-}
-#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
-
-
 #ifdef CONFIG_MIGRATION
 
 /*
--- a/mm/huge_memory.c~mm-thp-refactor-numa-fault-handling
+++ a/mm/huge_memory.c
@@ -1423,94 +1423,22 @@ out:
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	pmd_t pmd = vmf->orig_pmd;
-	struct anon_vma *anon_vma = NULL;
+	pmd_t oldpmd = vmf->orig_pmd;
+	pmd_t pmd;
 	struct page *page;
 	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
-	int page_nid = NUMA_NO_NODE, this_nid = numa_node_id();
+	int page_nid = NUMA_NO_NODE;
 	int target_nid, last_cpupid = -1;
-	bool page_locked;
 	bool migrated = false;
-	bool was_writable;
+	bool was_writable = pmd_savedwrite(oldpmd);
 	int flags = 0;
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-	if (unlikely(!pmd_same(pmd, *vmf->pmd)))
-		goto out_unlock;
-
-	/*
-	 * If there are potential migrations, wait for completion and retry
-	 * without disrupting NUMA hinting information. Do not relock and
-	 * check_same as the page may no longer be mapped.
-	 */
-	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
-		page = pmd_page(*vmf->pmd);
-		if (!get_page_unless_zero(page))
-			goto out_unlock;
+	if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
 		spin_unlock(vmf->ptl);
-		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
 		goto out;
 	}
 
-	page = pmd_page(pmd);
-	BUG_ON(is_huge_zero_page(page));
-	page_nid = page_to_nid(page);
-	last_cpupid = page_cpupid_last(page);
-	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (page_nid == this_nid) {
-		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
-		flags |= TNF_FAULT_LOCAL;
-	}
-
-	/* See similar comment in do_numa_page for explanation */
-	if (!pmd_savedwrite(pmd))
-		flags |= TNF_NO_GROUP;
-
-	/*
-	 * Acquire the page lock to serialise THP migrations but avoid dropping
-	 * page_table_lock if at all possible
-	 */
-	page_locked = trylock_page(page);
-	target_nid = mpol_misplaced(page, vma, haddr);
-	/* Migration could have started since the pmd_trans_migrating check */
-	if (!page_locked) {
-		page_nid = NUMA_NO_NODE;
-		if (!get_page_unless_zero(page))
-			goto out_unlock;
-		spin_unlock(vmf->ptl);
-		put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE);
-		goto out;
-	} else if (target_nid == NUMA_NO_NODE) {
-		/* There are no parallel migrations and page is in the right
-		 * node. Clear the numa hinting info in this pmd.
-		 */
-		goto clear_pmdnuma;
-	}
-
-	/*
-	 * Page is misplaced. Page lock serialises migrations. Acquire anon_vma
-	 * to serialises splits
-	 */
-	get_page(page);
-	spin_unlock(vmf->ptl);
-	anon_vma = page_lock_anon_vma_read(page);
-
-	/* Confirm the PMD did not change while page_table_lock was released */
-	spin_lock(vmf->ptl);
-	if (unlikely(!pmd_same(pmd, *vmf->pmd))) {
-		unlock_page(page);
-		put_page(page);
-		page_nid = NUMA_NO_NODE;
-		goto out_unlock;
-	}
-
-	/* Bail if we fail to protect against THP splits for any reason */
-	if (unlikely(!anon_vma)) {
-		put_page(page);
-		page_nid = NUMA_NO_NODE;
-		goto clear_pmdnuma;
-	}
-
 	/*
 	 * Since we took the NUMA fault, we must have observed the !accessible
 	 * bit. Make sure all other CPUs agree with that, to avoid them
@@ -1537,43 +1465,58 @@ vm_fault_t do_huge_pmd_numa_page(struct
 					      haddr + HPAGE_PMD_SIZE);
 	}
 
-	/*
-	 * Migrate the THP to the requested node, returns with page unlocked
-	 * and access rights restored.
-	 */
+	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
+	page = vm_normal_page_pmd(vma, haddr, pmd);
+	if (!page)
+		goto out_map;
+
+	/* See similar comment in do_numa_page for explanation */
+	if (!was_writable)
+		flags |= TNF_NO_GROUP;
+
+	page_nid = page_to_nid(page);
+	last_cpupid = page_cpupid_last(page);
+	target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
+				       &flags);
+
+	if (target_nid == NUMA_NO_NODE) {
+		put_page(page);
+		goto out_map;
+	}
+
 	spin_unlock(vmf->ptl);
 
-	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
-				vmf->pmd, pmd, vmf->address, page, target_nid);
+	migrated = migrate_misplaced_page(page, vma, target_nid);
 	if (migrated) {
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
-	} else
+	} else {
 		flags |= TNF_MIGRATE_FAIL;
-
-	goto out;
-clear_pmdnuma:
-	BUG_ON(!PageLocked(page));
-	was_writable = pmd_savedwrite(pmd);
-	pmd = pmd_modify(pmd, vma->vm_page_prot);
-	pmd = pmd_mkyoung(pmd);
-	if (was_writable)
-		pmd = pmd_mkwrite(pmd);
-	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
-	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
-	unlock_page(page);
-out_unlock:
-	spin_unlock(vmf->ptl);
+		vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+		if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
+			spin_unlock(vmf->ptl);
+			goto out;
+		}
+		goto out_map;
+	}
 
 out:
-	if (anon_vma)
-		page_unlock_anon_vma_read(anon_vma);
-
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
 				flags);
 
 	return 0;
+
+out_map:
+	/* Restore the PMD */
+	pmd = pmd_modify(oldpmd, vma->vm_page_prot);
+	pmd = pmd_mkyoung(pmd);
+	if (was_writable)
+		pmd = pmd_mkwrite(pmd);
+	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
+	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+	spin_unlock(vmf->ptl);
+	goto out;
 }
 
 /*
--- a/mm/internal.h~mm-thp-refactor-numa-fault-handling
+++ a/mm/internal.h
@@ -369,23 +369,6 @@ extern unsigned int munlock_vma_page(str
  */
 extern void clear_page_mlock(struct page *page);
 
-/*
- * mlock_migrate_page - called only from migrate_misplaced_transhuge_page()
- * (because that does not go through the full procedure of migration ptes):
- * to migrate the Mlocked page flag; update statistics.
- */
-static inline void mlock_migrate_page(struct page *newpage, struct page *page)
-{
-	if (TestClearPageMlocked(page)) {
-		int nr_pages = thp_nr_pages(page);
-
-		/* Holding pmd lock, no change in irq context: __mod is safe */
-		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-		SetPageMlocked(newpage);
-		__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
-	}
-}
-
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
 /*
@@ -461,7 +444,6 @@ static inline struct file *maybe_unlock_
 #else /* !CONFIG_MMU */
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_migrate_page(struct page *new, struct page *old) { }
 static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
 {
 }
--- a/mm/migrate.c~mm-thp-refactor-numa-fault-handling
+++ a/mm/migrate.c
@@ -2048,6 +2048,23 @@ static struct page *alloc_misplaced_dst_
 	return newpage;
 }
 
+static struct page *alloc_misplaced_dst_page_thp(struct page *page,
+						 unsigned long data)
+{
+	int nid = (int) data;
+	struct page *newpage;
+
+	newpage = alloc_pages_node(nid, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
+				   HPAGE_PMD_ORDER);
+	if (!newpage)
+		goto out;
+
+	prep_transhuge_page(newpage);
+
+out:
+	return newpage;
+}
+
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
@@ -2086,12 +2103,6 @@ static int numamigrate_isolate_page(pg_d
 	return 1;
 }
 
-bool pmd_trans_migrating(pmd_t pmd)
-{
-	struct page *page = pmd_page(pmd);
-	return PageLocked(page);
-}
-
 /*
  * Attempt to migrate a misplaced page to the specified destination
  * node. Caller is expected to have an elevated reference count on
@@ -2104,6 +2115,20 @@ int migrate_misplaced_page(struct page *
 	int isolated;
 	int nr_remaining;
 	LIST_HEAD(migratepages);
+	new_page_t *new;
+	bool compound;
+
+	/*
+	 * PTE mapped THP or HugeTLB page can't reach here so the page could
+	 * be either base page or THP.  And it must be head page if it is
+	 * THP.
+	 */
+	compound = PageTransHuge(page);
+
+	if (compound)
+		new = alloc_misplaced_dst_page_thp;
+	else
+		new = alloc_misplaced_dst_page;
 
 	/*
 	 * Don't migrate file pages that are mapped in multiple processes
@@ -2125,9 +2150,8 @@ int migrate_misplaced_page(struct page *
 		goto out;
 
 	list_add(&page->lru, &migratepages);
-	nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
-				     NULL, node, MIGRATE_ASYNC,
-				     MR_NUMA_MISPLACED);
+	nr_remaining = migrate_pages(&migratepages, *new, NULL, node,
+				     MIGRATE_ASYNC, MR_NUMA_MISPLACED);
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
@@ -2146,141 +2170,6 @@ out:
 	return 0;
 }
 #endif /* CONFIG_NUMA_BALANCING */
-
-#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-/*
- * Migrates a THP to a given target node. page must be locked and is unlocked
- * before returning.
- */
-int migrate_misplaced_transhuge_page(struct mm_struct *mm,
-				struct vm_area_struct *vma,
-				pmd_t *pmd, pmd_t entry,
-				unsigned long address,
-				struct page *page, int node)
-{
-	spinlock_t *ptl;
-	pg_data_t *pgdat = NODE_DATA(node);
-	int isolated = 0;
-	struct page *new_page = NULL;
-	int page_lru = page_is_file_lru(page);
-	unsigned long start = address & HPAGE_PMD_MASK;
-
-	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
-		HPAGE_PMD_ORDER);
-	if (!new_page)
-		goto out_fail;
-	prep_transhuge_page(new_page);
-
-	isolated = numamigrate_isolate_page(pgdat, page);
-	if (!isolated) {
-		put_page(new_page);
-		goto out_fail;
-	}
-
-	/* Prepare a page as a migration target */
-	__SetPageLocked(new_page);
-	if (PageSwapBacked(page))
-		__SetPageSwapBacked(new_page);
-
-	/* anon mapping, we can simply copy page->mapping to the new page: */
-	new_page->mapping = page->mapping;
-	new_page->index = page->index;
-	/* flush the cache before copying using the kernel virtual address */
-	flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
-	migrate_page_copy(new_page, page);
-	WARN_ON(PageLRU(new_page));
-
-	/* Recheck the target PMD */
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) {
-		spin_unlock(ptl);
-
-		/* Reverse changes made by migrate_page_copy() */
-		if (TestClearPageActive(new_page))
-			SetPageActive(page);
-		if (TestClearPageUnevictable(new_page))
-			SetPageUnevictable(page);
-
-		unlock_page(new_page);
-		put_page(new_page);		/* Free it */
-
-		/* Retake the callers reference and putback on LRU */
-		get_page(page);
-		putback_lru_page(page);
-		mod_node_page_state(page_pgdat(page),
-			 NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
-
-		goto out_unlock;
-	}
-
-	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
-	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-
-	/*
-	 * Overwrite the old entry under pagetable lock and establish
-	 * the new PTE. Any parallel GUP will either observe the old
-	 * page blocking on the page lock, block on the page table
-	 * lock or observe the new page. The SetPageUptodate on the
-	 * new page and page_add_new_anon_rmap guarantee the copy is
-	 * visible before the pagetable update.
-	 */
-	page_add_anon_rmap(new_page, vma, start, true);
-	/*
-	 * At this point the pmd is numa/protnone (i.e. non present) and the TLB
-	 * has already been flushed globally.  So no TLB can be currently
-	 * caching this non present pmd mapping.  There's no need to clear the
-	 * pmd before doing set_pmd_at(), nor to flush the TLB after
-	 * set_pmd_at().  Clearing the pmd here would introduce a race
-	 * condition against MADV_DONTNEED, because MADV_DONTNEED only holds the
-	 * mmap_lock for reading.  If the pmd is set to NULL at any given time,
-	 * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this
-	 * pmd.
-	 */
-	set_pmd_at(mm, start, pmd, entry);
-	update_mmu_cache_pmd(vma, address, &entry);
-
-	page_ref_unfreeze(page, 2);
-	mlock_migrate_page(new_page, page);
-	page_remove_rmap(page, true);
-	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
-
-	spin_unlock(ptl);
-
-	/* Take an "isolate" reference and put new page on the LRU. */
-	get_page(new_page);
-	putback_lru_page(new_page);
-
-	unlock_page(new_page);
-	unlock_page(page);
-	put_page(page);			/* Drop the rmap reference */
-	put_page(page);			/* Drop the LRU isolation reference */
-
-	count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
-	count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
-
-	mod_node_page_state(page_pgdat(page),
-			NR_ISOLATED_ANON + page_lru,
-			-HPAGE_PMD_NR);
-	return isolated;
-
-out_fail:
-	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-	ptl = pmd_lock(mm, pmd);
-	if (pmd_same(*pmd, entry)) {
-		entry = pmd_modify(entry, vma->vm_page_prot);
-		set_pmd_at(mm, start, pmd, entry);
-		update_mmu_cache_pmd(vma, address, &entry);
-	}
-	spin_unlock(ptl);
-
-out_unlock:
-	unlock_page(page);
-	put_page(page);
-	return 0;
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_DEVICE_PRIVATE
_


* [patch 086/192] mm: migrate: account THP NUMA migration counters correctly
  2021-07-01  1:46 incoming Andrew Morton
                   ` (84 preceding siblings ...)
  2021-07-01  1:51 ` [patch 085/192] mm: thp: refactor NUMA fault handling Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 087/192] mm: migrate: don't split THP for misplaced NUMA page Andrew Morton
                   ` (106 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, borntraeger, gerald.schaefer, gor, hca, hughd,
	kirill.shutemov, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, ying.huang, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: migrate: account THP NUMA migration counters correctly

Now that both base page and THP NUMA migration are done via
migrate_misplaced_page(), keep the counters correct for THPs by
accounting the actual number of base pages.
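
(Not from the patch itself -- a hedged before/after sketch using the
names from mm/migrate.c.  For a THP, thp_nr_pages() returns HPAGE_PMD_NR
base pages, so the isolation counter must move by that amount rather
than by one.)

	unsigned int nr_pages = thp_nr_pages(page); /* 1, or HPAGE_PMD_NR for a THP */

	/* old: always accounted a single base page */
	dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_lru(page));

	/* new: accounts every base page backing the (possibly huge) page */
	mod_node_page_state(page_pgdat(page),
			    NR_ISOLATED_ANON + page_is_file_lru(page), -nr_pages);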

Link: https://lkml.kernel.org/r/20210518200801.7413-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/mm/migrate.c~mm-migrate-account-thp-numa-migration-counters-correctly
+++ a/mm/migrate.c
@@ -2117,6 +2117,7 @@ int migrate_misplaced_page(struct page *
 	LIST_HEAD(migratepages);
 	new_page_t *new;
 	bool compound;
+	unsigned int nr_pages = thp_nr_pages(page);
 
 	/*
 	 * PTE mapped THP or HugeTLB page can't reach here so the page could
@@ -2155,13 +2156,13 @@ int migrate_misplaced_page(struct page *
 	if (nr_remaining) {
 		if (!list_empty(&migratepages)) {
 			list_del(&page->lru);
-			dec_node_page_state(page, NR_ISOLATED_ANON +
-					page_is_file_lru(page));
+			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+					page_is_file_lru(page), -nr_pages);
 			putback_lru_page(page);
 		}
 		isolated = 0;
 	} else
-		count_vm_numa_event(NUMA_PAGE_MIGRATE);
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages);
 	BUG_ON(!list_empty(&migratepages));
 	return isolated;
 
_


* [patch 087/192] mm: migrate: don't split THP for misplaced NUMA page
  2021-07-01  1:46 incoming Andrew Morton
                   ` (85 preceding siblings ...)
  2021-07-01  1:51 ` [patch 086/192] mm: migrate: account THP NUMA migration counters correctly Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 088/192] mm: migrate: check mapcount for THP instead of refcount Andrew Morton
                   ` (105 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, borntraeger, gerald.schaefer, gor, hca, hughd,
	kirill.shutemov, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, ying.huang, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: migrate: don't split THP for misplaced NUMA page

The old behavior didn't split a THP if its migration failed due to lack
of memory on the target node, but the generic THP migration path does
split it.  Keep the old no-split behavior for misplaced NUMA page
migration.

Link: https://lkml.kernel.org/r/20210518200801.7413-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/migrate.c~mm-migrate-dont-split-thp-for-misplaced-numa-page
+++ a/mm/migrate.c
@@ -1423,6 +1423,7 @@ int migrate_pages(struct list_head *from
 	int swapwrite = current->flags & PF_SWAPWRITE;
 	int rc, nr_subpages;
 	LIST_HEAD(ret_pages);
+	bool nosplit = (reason == MR_NUMA_MISPLACED);
 
 	trace_mm_migrate_pages_start(mode, reason);
 
@@ -1494,8 +1495,9 @@ retry:
 				/*
 				 * When memory is low, don't bother to try to migrate
 				 * other pages, just exit.
+				 * THP NUMA faulting doesn't split THP to retry.
 				 */
-				if (is_thp) {
+				if (is_thp && !nosplit) {
 					if (!try_split_thp(page, &page2, from)) {
 						nr_thp_split++;
 						goto retry;
_


* [patch 088/192] mm: migrate: check mapcount for THP instead of refcount
  2021-07-01  1:46 incoming Andrew Morton
                   ` (86 preceding siblings ...)
  2021-07-01  1:51 ` [patch 087/192] mm: migrate: don't split THP for misplaced NUMA page Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 089/192] mm: thp: skip make PMD PROT_NONE if THP migration is not supported Andrew Morton
                   ` (104 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, borntraeger, gerald.schaefer, gor, hca, hughd,
	kirill.shutemov, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, ying.huang, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: migrate: check mapcount for THP instead of refcount

The generic migration path already checks the refcount, so there is no
need to check it here.  But the old code actually prevented a shared THP
(one mapped by multiple processes) from being migrated, so bail out
early if the mapcount is > 1 to keep that behavior.

Link: https://lkml.kernel.org/r/20210518200801.7413-7-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |   16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

--- a/mm/migrate.c~mm-migrate-check-mapcount-for-thp-instead-of-refcount
+++ a/mm/migrate.c
@@ -2073,6 +2073,10 @@ static int numamigrate_isolate_page(pg_d
 
 	VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
 
+	/* Do not migrate THP mapped by multiple processes */
+	if (PageTransHuge(page) && total_mapcount(page) > 1)
+		return 0;
+
 	/* Avoid migrating to a node that is nearly full */
 	if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
 		return 0;
@@ -2080,18 +2084,6 @@ static int numamigrate_isolate_page(pg_d
 	if (isolate_lru_page(page))
 		return 0;
 
-	/*
-	 * migrate_misplaced_transhuge_page() skips page migration's usual
-	 * check on page_count(), so we must do it here, now that the page
-	 * has been isolated: a GUP pin, or any other pin, prevents migration.
-	 * The expected page count is 3: 1 for page's mapcount and 1 for the
-	 * caller's pin and 1 for the reference taken by isolate_lru_page().
-	 */
-	if (PageTransHuge(page) && page_count(page) != 3) {
-		putback_lru_page(page);
-		return 0;
-	}
-
 	page_lru = page_is_file_lru(page);
 	mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
 				thp_nr_pages(page));
_


* [patch 089/192] mm: thp: skip make PMD PROT_NONE if THP migration is not supported
  2021-07-01  1:46 incoming Andrew Morton
                   ` (87 preceding siblings ...)
  2021-07-01  1:51 ` [patch 088/192] mm: migrate: check mapcount for THP instead of refcount Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:51 ` [patch 090/192] mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2 Andrew Morton
                   ` (103 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, borntraeger, gerald.schaefer, gor, hca, hughd,
	kirill.shutemov, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, ying.huang, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: thp: skip make PMD PROT_NONE if THP migration is not supported

A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390 support
both NUMA balancing and THP.  But S390 doesn't support THP migration, so
NUMA balancing actually can't migrate any misplaced THPs.

Skip making the PMD PROT_NONE in that case, otherwise CPU cycles may be
wasted by pointless NUMA hinting faults on S390.

Link: https://lkml.kernel.org/r/20210518200801.7413-8-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/huge_memory.c~mm-thp-skip-make-pmd-prot_none-if-thp-migration-is-not-supported
+++ a/mm/huge_memory.c
@@ -1742,6 +1742,7 @@ bool move_huge_pmd(struct vm_area_struct
  * Returns
  *  - 0 if PMD could not be locked
  *  - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ *      or if prot_numa but THP migration is not supported
  *  - HPAGE_PMD_NR if protections changed and TLB flush necessary
  */
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -1756,6 +1757,9 @@ int change_huge_pmd(struct vm_area_struc
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 
+	if (prot_numa && !thp_migration_supported())
+		return 1;
+
 	ptl = __pmd_trans_huge_lock(pmd, vma);
 	if (!ptl)
 		return 0;
_


* [patch 090/192] mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2
  2021-07-01  1:46 incoming Andrew Morton
                   ` (88 preceding siblings ...)
  2021-07-01  1:51 ` [patch 089/192] mm: thp: skip make PMD PROT_NONE if THP migration is not supported Andrew Morton
@ 2021-07-01  1:51 ` Andrew Morton
  2021-07-01  1:52 ` [patch 091/192] mm: rmap: make try_to_unmap() void function Andrew Morton
                   ` (102 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:51 UTC (permalink / raw)
  To: akpm, anshuman.khandual, gerald.schaefer, gor, hca, linux-mm,
	mingo, mm-commits, tglx, torvalds

From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2

ARCH_ENABLE_SPLIT_PMD_PTLOCK is irrelevant unless there are more than
two page table levels, including the PMD (also per
Documentation/vm/split_page_table_lock.rst).  Make this dependency
explicit on the remaining platforms, i.e. x86 and s390, where
ARCH_ENABLE_SPLIT_PMD_PTLOCK is subscribed.

Link: https://lkml.kernel.org/r/1622013501-20409-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/Kconfig |    2 +-
 arch/x86/Kconfig  |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/arch/s390/Kconfig~mm-thp-make-arch_enable_split_pmd_ptlock-dependent-on-pgtable_levels-2
+++ a/arch/s390/Kconfig
@@ -62,7 +62,7 @@ config S390
 	select ARCH_BINFMT_ELF_STATE
 	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM
 	select ARCH_ENABLE_MEMORY_HOTREMOVE
-	select ARCH_ENABLE_SPLIT_PMD_PTLOCK
+	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DEBUG_WX
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
--- a/arch/x86/Kconfig~mm-thp-make-arch_enable_split_pmd_ptlock-dependent-on-pgtable_levels-2
+++ a/arch/x86/Kconfig
@@ -63,7 +63,7 @@ config X86
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM)
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
-	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if X86_64 || X86_PAE
+	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE)
 	select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
 	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
 	select ARCH_HAS_CACHE_LINE_SIZE
_


* [patch 091/192] mm: rmap: make try_to_unmap() void function
  2021-07-01  1:46 incoming Andrew Morton
                   ` (89 preceding siblings ...)
  2021-07-01  1:51 ` [patch 090/192] mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2 Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 092/192] mm/thp: remap_page() is only needed on anonymous THP Andrew Morton
                   ` (101 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, apopple, hughd, jack, juew, kirill.shutemov, linmiaohe,
	linux-mm, minchan, mm-commits, naoya.horiguchi, osalvador,
	peterx, rcampbell, shakeelb, shy828301, torvalds, wangyugui,
	willy, ziy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: rmap: make try_to_unmap() void function

Currently try_to_unmap() returns a bool by checking page_mapcount();
however, this may yield a false positive since page_mapcount() doesn't
check all subpages of a compound page.  total_mapcount() could be used
instead, but it costs more since it traverses all subpages.

Actually, most callers of try_to_unmap() don't care about the return
value at all.  Callers that do care just need to check whether the page
is still mapped via page_mapped() when necessary; page_mapped() bails
out early once it finds a mapped subpage.
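
A minimal sketch of the resulting caller pattern (with TTU_SYNC only
where the subsequent check must not race with a concurrent unmap;
compare the mm/vmscan.c hunk below):

	try_to_unmap(page, flags);
	if (page_mapped(page)) {
		/* some mapping remains: account/handle the failure */
	}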

Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.com
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/rmap.h |    2 +-
 mm/memory-failure.c  |   15 +++++++--------
 mm/rmap.c            |   15 ++++-----------
 mm/vmscan.c          |    3 ++-
 4 files changed, 14 insertions(+), 21 deletions(-)

--- a/include/linux/rmap.h~mm-rmap-make-try_to_unmap-void-function
+++ a/include/linux/rmap.h
@@ -195,7 +195,7 @@ static inline void page_dup_rmap(struct
 int page_referenced(struct page *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
 
-bool try_to_unmap(struct page *, enum ttu_flags flags);
+void try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
--- a/mm/memory-failure.c~mm-rmap-make-try_to_unmap-void-function
+++ a/mm/memory-failure.c
@@ -1269,7 +1269,7 @@ static bool hwpoison_user_mappings(struc
 	enum ttu_flags ttu = TTU_IGNORE_MLOCK;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
-	bool unmap_success = true;
+	bool unmap_success;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
@@ -1332,7 +1332,7 @@ static bool hwpoison_user_mappings(struc
 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
 	if (!PageHuge(hpage)) {
-		unmap_success = try_to_unmap(hpage, ttu);
+		try_to_unmap(hpage, ttu);
 	} else {
 		if (!PageAnon(hpage)) {
 			/*
@@ -1344,17 +1344,16 @@ static bool hwpoison_user_mappings(struc
 			 */
 			mapping = hugetlb_page_mapping_lock_write(hpage);
 			if (mapping) {
-				unmap_success = try_to_unmap(hpage,
-						     ttu|TTU_RMAP_LOCKED);
+				try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
 				i_mmap_unlock_write(mapping);
-			} else {
+			} else
 				pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn);
-				unmap_success = false;
-			}
 		} else {
-			unmap_success = try_to_unmap(hpage, ttu);
+			try_to_unmap(hpage, ttu);
 		}
 	}
+
+	unmap_success = !page_mapped(hpage);
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
 		       pfn, page_mapcount(hpage));
--- a/mm/rmap.c~mm-rmap-make-try_to_unmap-void-function
+++ a/mm/rmap.c
@@ -1405,7 +1405,7 @@ static bool try_to_unmap_one(struct page
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
 	 * in between its ptep_get_and_clear_full() and page_remove_rmap(),
-	 * try_to_unmap() may return false when it is about to become true,
+	 * try_to_unmap() may return before page_mapped() has become false,
 	 * if page table locking is skipped: use TTU_SYNC to wait for that.
 	 */
 	if (flags & TTU_SYNC)
@@ -1756,9 +1756,10 @@ static int page_not_mapped(struct page *
  * Tries to remove all the page table entries which are mapping this
  * page, used in the pageout path.  Caller must hold the page lock.
  *
- * If unmap is successful, return true. Otherwise, false.
+ * It is the caller's responsibility to check if the page is still
+ * mapped when needed (use TTU_SYNC to prevent accounting races).
  */
-bool try_to_unmap(struct page *page, enum ttu_flags flags)
+void try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
@@ -1783,14 +1784,6 @@ bool try_to_unmap(struct page *page, enu
 		rmap_walk_locked(page, &rwc);
 	else
 		rmap_walk(page, &rwc);
-
-	/*
-	 * When racing against e.g. zap_pte_range() on another cpu,
-	 * in between its ptep_get_and_clear_full() and page_remove_rmap(),
-	 * try_to_unmap() may return false when it is about to become true,
-	 * if page table locking is skipped: use TTU_SYNC to wait for that.
-	 */
-	return !page_mapcount(page);
 }
 
 /**
--- a/mm/vmscan.c~mm-rmap-make-try_to_unmap-void-function
+++ a/mm/vmscan.c
@@ -1499,7 +1499,8 @@ static unsigned int shrink_page_list(str
 			if (unlikely(PageTransHuge(page)))
 				flags |= TTU_SPLIT_HUGE_PMD;
 
-			if (!try_to_unmap(page, flags)) {
+			try_to_unmap(page, flags);
+			if (page_mapped(page)) {
 				stat->nr_unmap_fail += nr_pages;
 				if (!was_swapbacked && PageSwapBacked(page))
 					stat->nr_lazyfree_fail += nr_pages;
_


* [patch 092/192] mm/thp: remap_page() is only needed on anonymous THP
  2021-07-01  1:46 incoming Andrew Morton
                   ` (90 preceding siblings ...)
  2021-07-01  1:52 ` [patch 091/192] mm: rmap: make try_to_unmap() void function Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 093/192] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC Andrew Morton
                   ` (100 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, apopple, hughd, jack, juew, kirill.shutemov, linmiaohe,
	linux-mm, minchan, mm-commits, naoya.horiguchi, osalvador,
	peterx, rcampbell, shakeelb, shy828301, torvalds, wangyugui,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm/thp: remap_page() is only needed on anonymous THP

THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon, and
migration entries are only inserted when TTU_MIGRATION (unused here) or
TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page() to
search for migration entries to remove when !PageAnon.

Link: https://lkml.kernel.org/r/f987bc44-f28e-688d-2424-b4722153ed8@google.com
Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/mm/huge_memory.c~mm-thp-remap_page-is-only-needed-on-anonymous-thp
+++ a/mm/huge_memory.c
@@ -2307,6 +2307,7 @@ static void unmap_page(struct page *page
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	/* If TTU_SPLIT_FREEZE is ever extended to file, update remap_page() */
 	if (PageAnon(page))
 		ttu_flags |= TTU_SPLIT_FREEZE;
 
@@ -2318,6 +2319,10 @@ static void unmap_page(struct page *page
 static void remap_page(struct page *page, unsigned int nr)
 {
 	int i;
+
+	/* If TTU_SPLIT_FREEZE is ever extended to file, remove this check */
+	if (!PageAnon(page))
+		return;
 	if (PageTransHuge(page)) {
 		remove_migration_ptes(page, page, true);
 	} else {
_


* [patch 093/192] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC
  2021-07-01  1:46 incoming Andrew Morton
                   ` (91 preceding siblings ...)
  2021-07-01  1:52 ` [patch 092/192] mm/thp: remap_page() is only needed on anonymous THP Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 094/192] mm/thp: fix strncpy warning Andrew Morton
                   ` (99 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, apopple, hughd, jack, juew, kirill.shutemov, linmiaohe,
	linux-mm, minchan, mm-commits, naoya.horiguchi, osalvador,
	peterx, rcampbell, shakeelb, shy828301, torvalds, wangyugui,
	willy, ziy

From: Hugh Dickins <hughd@google.com>
Subject: mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC

TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly
before the page is accounted as unmapped.  It is unlikely to coincide with
hwpoisoning, but now that we have the flag, hwpoison_user_mappings() would
do well to use it.

Link: https://lkml.kernel.org/r/329c28ed-95df-9a2c-8893-b444d8a6d340@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memory-failure.c~mm-hwpoison_user_mappings-try_to_unmap-with-ttu_sync
+++ a/mm/memory-failure.c
@@ -1266,7 +1266,7 @@ static int get_hwpoison_page(struct page
 static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 				  int flags, struct page **hpagep)
 {
-	enum ttu_flags ttu = TTU_IGNORE_MLOCK;
+	enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	bool unmap_success;
_


* [patch 094/192] mm/thp: fix strncpy warning
  2021-07-01  1:46 incoming Andrew Morton
                   ` (92 preceding siblings ...)
  2021-07-01  1:52 ` [patch 093/192] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 095/192] nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc Andrew Morton
                   ` (98 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, linux-mm, mike.kravetz, mm-commits, torvalds, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm/thp: fix strncpy warning

Using MAX_INPUT_BUF_SZ as the maximum length of the string makes fortify
complain as it thinks the string might be longer than the buffer, and if
it is, we will end up with a "string" that is missing a NUL terminator. 
It's trivial to show that 'tok' points to a NUL-terminated string which is
less than MAX_INPUT_BUF_SZ in length, so we may as well just use strcpy()
and avoid the warning.
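
A hedged sketch of that invariant, assuming (as in mm/huge_memory.c at
this point) that both buffers are MAX_INPUT_BUF_SZ bytes and input_buf
was NUL-terminated after the copy from userspace:

	char input_buf[MAX_INPUT_BUF_SZ];  /* NUL-terminated, strlen() < MAX_INPUT_BUF_SZ */
	char file_path[MAX_INPUT_BUF_SZ];
	char *buf = input_buf;
	char *tok = strsep(&buf, ",");     /* tok points into input_buf, never grows it */

	if (tok)
		strcpy(file_path, tok);    /* cannot overflow: strlen(tok) < MAX_INPUT_BUF_SZ */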

Link: https://lkml.kernel.org/r/20210615200242.1716568-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/huge_memory.c~mm-thp-fix-strncpy-warning
+++ a/mm/huge_memory.c
@@ -3101,7 +3101,7 @@ static ssize_t split_huge_pages_write(st
 
 		tok = strsep(&buf, ",");
 		if (tok) {
-			strncpy(file_path, tok, MAX_INPUT_BUF_SZ);
+			strcpy(file_path, tok);
 		} else {
 			ret = -EINVAL;
 			goto out;
_


* [patch 095/192] nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc
  2021-07-01  1:46 incoming Andrew Morton
                   ` (93 preceding siblings ...)
  2021-07-01  1:52 ` [patch 094/192] mm/thp: fix strncpy warning Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 096/192] mm/nommu: unexport do_munmap() Andrew Morton
                   ` (97 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, chenli, david, geert, gerg, linux-mm, mm-commits, torvalds, willy

From: Chen Li <chenli@uniontech.com>
Subject: nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc

mm/nommu.c:
void *__vmalloc(unsigned long size, gfp_t gfp_mask)
{
	/*
	 *  You can't specify __GFP_HIGHMEM with kmalloc() since kmalloc()
	 * returns only a logical address.
	 */
	return kmalloc(size, (gfp_mask | __GFP_COMP) & ~__GFP_HIGHMEM);
}

nommu's __vmalloc just uses kmalloc internally and eliminates
__GFP_HIGHMEM, so it makes no sense to add __GFP_HIGHMEM for nommu's
vmalloc/vzalloc.

[akpm@linux-foundation.org: coding style fixes]
Link: https://lkml.kernel.org/r/875z00rnp8.wl-chenli@uniontech.com
Signed-off-by: Chen Li <chenli@uniontech.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/nommu.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/nommu.c~nommu-remove-__gfp_highmem-in-vmalloc-vzalloc
+++ a/mm/nommu.c
@@ -223,7 +223,7 @@ long vread(char *buf, char *addr, unsign
  */
 void *vmalloc(unsigned long size)
 {
-       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM);
+	return __vmalloc(size, GFP_KERNEL);
 }
 EXPORT_SYMBOL(vmalloc);
 
@@ -241,7 +241,7 @@ EXPORT_SYMBOL(vmalloc);
  */
 void *vzalloc(unsigned long size)
 {
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
+	return __vmalloc(size, GFP_KERNEL | __GFP_ZERO);
 }
 EXPORT_SYMBOL(vzalloc);
 
_


* [patch 096/192] mm/nommu: unexport do_munmap()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (94 preceding siblings ...)
  2021-07-01  1:52 ` [patch 095/192] nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 097/192] mm: generalize ZONE_[DMA|DMA32] Andrew Morton
                   ` (96 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, david, Liam.Howlett, linux-mm, mm-commits, torvalds, willy

From: Liam Howlett <liam.howlett@oracle.com>
Subject: mm/nommu: unexport do_munmap()

do_munmap() does not take the mmap_write_lock().  vm_munmap() should be
used instead.

Link: https://lkml.kernel.org/r/20210604194002.648037-1-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/nommu.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/nommu.c~mm-nommu-unexport-do_munmap
+++ a/mm/nommu.c
@@ -1501,7 +1501,6 @@ erase_whole_vma:
 	delete_vma(mm, vma);
 	return 0;
 }
-EXPORT_SYMBOL(do_munmap);
 
 int vm_munmap(unsigned long addr, size_t len)
 {
_


* [patch 097/192] mm: generalize ZONE_[DMA|DMA32]
  2021-07-01  1:46 incoming Andrew Morton
                   ` (95 preceding siblings ...)
  2021-07-01  1:52 ` [patch 096/192] mm/nommu: unexport do_munmap() Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  2:46   ` Linus Torvalds
  2021-07-01  1:52 ` [patch 098/192] mm: make variable names for populate_vma_page_range() consistent Andrew Morton
                   ` (95 subsequent siblings)
  192 siblings, 1 reply; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, bp, catalin.marinas, davem, geert, linux-mm, linux,
	michal.simek, mingo, mm-commits, mpe, palmerdabbelt, rppt, rth,
	torvalds, tsbogend, wangkefeng.wang, will

From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: mm: generalize ZONE_[DMA|DMA32]

ZONE_[DMA|DMA32] configs have duplicate definitions on platforms that
subscribe to them.  Instead, just make them generic options which can be
selected on applicable platforms.

Also, only the x86 and arm64 architectures could previously enable both
ZONE_DMA and ZONE_DMA32 under EXPERT; add ARCH_HAS_ZONE_DMA_SET to keep
the DMA zones configurable and visible on those two architectures.
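
(For context: ZONE_DMA exists so that drivers for devices with narrow
addressing limits -- per the x86 help text removed below, within the
first 16MB of address space -- can still get reachable memory.  A
hedged, illustrative-only sketch:)

	/* a driver for a legacy device with 24-bit DMA addressing */
	void *buf = kmalloc(512, GFP_KERNEL | GFP_DMA);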

Link: https://lkml.kernel.org/r/20210528074557.17768-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>	[RISC-V]
Acked-by: Michal Simek <michal.simek@xilinx.com>	[microblaze]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/Kconfig                     |    5 +----
 arch/arm/Kconfig                       |    3 ---
 arch/arm64/Kconfig                     |    9 +--------
 arch/ia64/Kconfig                      |    4 +---
 arch/m68k/Kconfig                      |    5 +----
 arch/microblaze/Kconfig                |    4 +---
 arch/mips/Kconfig                      |    7 -------
 arch/powerpc/Kconfig                   |    4 ----
 arch/powerpc/platforms/Kconfig.cputype |    1 +
 arch/riscv/Kconfig                     |    5 +----
 arch/s390/Kconfig                      |    4 +---
 arch/sparc/Kconfig                     |    5 +----
 arch/x86/Kconfig                       |   15 ++-------------
 mm/Kconfig                             |   12 ++++++++++++
 14 files changed, 23 insertions(+), 60 deletions(-)

--- a/arch/alpha/Kconfig~mm-generalize-zone_
+++ a/arch/alpha/Kconfig
@@ -40,6 +40,7 @@ config ALPHA
 	select MMU_GATHER_NO_RANGE
 	select SET_FS
 	select SPARSEMEM_EXTREME if SPARSEMEM
+	select ZONE_DMA
 	help
 	  The Alpha is a 64-bit general-purpose processor designed and
 	  marketed by the Digital Equipment Corporation of blessed memory,
@@ -65,10 +66,6 @@ config GENERIC_CALIBRATE_DELAY
 	bool
 	default y
 
-config ZONE_DMA
-	bool
-	default y
-
 config GENERIC_ISA_DMA
 	bool
 	default y
--- a/arch/arm64/Kconfig~mm-generalize-zone_
+++ a/arch/arm64/Kconfig
@@ -42,6 +42,7 @@ config ARM64
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_ELF_PROT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_INLINE_READ_LOCK if !PREEMPTION
@@ -306,14 +307,6 @@ config GENERIC_CSUM
 config GENERIC_CALIBRATE_DELAY
 	def_bool y
 
-config ZONE_DMA
-	bool "Support DMA zone" if EXPERT
-	default y
-
-config ZONE_DMA32
-	bool "Support DMA32 zone" if EXPERT
-	default y
-
 config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	def_bool y
 
--- a/arch/arm/Kconfig~mm-generalize-zone_
+++ a/arch/arm/Kconfig
@@ -218,9 +218,6 @@ config GENERIC_CALIBRATE_DELAY
 config ARCH_MAY_HAVE_PC_FDC
 	bool
 
-config ZONE_DMA
-	bool
-
 config ARCH_SUPPORTS_UPROBES
 	def_bool y
 
--- a/arch/ia64/Kconfig~mm-generalize-zone_
+++ a/arch/ia64/Kconfig
@@ -60,6 +60,7 @@ config IA64
 	select NUMA if !FLATMEM
 	select PCI_MSI_ARCH_FALLBACKS if PCI_MSI
 	select SET_FS
+	select ZONE_DMA32
 	default y
 	help
 	  The Itanium Processor Family is Intel's 64-bit successor to
@@ -72,9 +73,6 @@ config 64BIT
 	select ATA_NONSTANDARD if ATA
 	default y
 
-config ZONE_DMA32
-	def_bool y
-
 config MMU
 	bool
 	default y
--- a/arch/m68k/Kconfig~mm-generalize-zone_
+++ a/arch/m68k/Kconfig
@@ -34,6 +34,7 @@ config M68K
 	select SET_FS
 	select UACCESS_MEMCPY if !MMU
 	select VIRT_TO_BUS
+	select ZONE_DMA
 
 config CPU_BIG_ENDIAN
 	def_bool y
@@ -62,10 +63,6 @@ config TIME_LOW_RES
 config NO_IOPORT_MAP
 	def_bool y
 
-config ZONE_DMA
-	bool
-	default y
-
 config HZ
 	int
 	default 1000 if CLEOPATRA
--- a/arch/microblaze/Kconfig~mm-generalize-zone_
+++ a/arch/microblaze/Kconfig
@@ -43,6 +43,7 @@ config MICROBLAZE
 	select MMU_GATHER_NO_RANGE
 	select SPARSE_IRQ
 	select SET_FS
+	select ZONE_DMA
 
 # Endianness selection
 choice
@@ -60,9 +61,6 @@ config CPU_LITTLE_ENDIAN
 
 endchoice
 
-config ZONE_DMA
-	def_bool y
-
 config ARCH_HAS_ILOG2_U32
 	def_bool n
 
--- a/arch/mips/Kconfig~mm-generalize-zone_
+++ a/arch/mips/Kconfig
@@ -3274,13 +3274,6 @@ config I8253
 	select CLKSRC_I8253
 	select CLKEVT_I8253
 	select MIPS_EXTERNAL_TIMER
-
-config ZONE_DMA
-	bool
-
-config ZONE_DMA32
-	bool
-
 endmenu
 
 config TRAD_SIGNALS
--- a/arch/powerpc/Kconfig~mm-generalize-zone_
+++ a/arch/powerpc/Kconfig
@@ -403,10 +403,6 @@ config PPC_ADV_DEBUG_DAC_RANGE
 config PPC_DAWR
 	bool
 
-config ZONE_DMA
-	bool
-	default y if PPC_BOOK3E_64
-
 config PGTABLE_LEVELS
 	int
 	default 2 if !PPC64
--- a/arch/powerpc/platforms/Kconfig.cputype~mm-generalize-zone_
+++ a/arch/powerpc/platforms/Kconfig.cputype
@@ -111,6 +111,7 @@ config PPC_BOOK3E_64
 	select PPC_FPU # Make it a choice ?
 	select PPC_SMP_MUXED_IPI
 	select PPC_DOORBELL
+	select ZONE_DMA
 
 endchoice
 
--- a/arch/riscv/Kconfig~mm-generalize-zone_
+++ a/arch/riscv/Kconfig
@@ -104,6 +104,7 @@ config RISCV
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
 	select UACCESS_MEMCPY if !MMU
+	select ZONE_DMA32 if 64BIT
 
 config ARCH_MMAP_RND_BITS_MIN
 	default 18 if 64BIT
@@ -133,10 +134,6 @@ config MMU
 	  Select if you want MMU-based virtualised addressing space
 	  support by paged memory management. If unsure, say 'Y'.
 
-config ZONE_DMA32
-	bool
-	default y if 64BIT
-
 config VA_BITS
 	int
 	default 32 if 32BIT
--- a/arch/s390/Kconfig~mm-generalize-zone_
+++ a/arch/s390/Kconfig
@@ -2,9 +2,6 @@
 config MMU
 	def_bool y
 
-config ZONE_DMA
-	def_bool y
-
 config CPU_BIG_ENDIAN
 	def_bool y
 
@@ -210,6 +207,7 @@ config S390
 	select THREAD_INFO_IN_TASK
 	select TTY
 	select VIRT_CPU_ACCOUNTING
+	select ZONE_DMA
 	# Note: keep the above list sorted alphabetically
 
 config SCHED_OMIT_FRAME_POINTER
--- a/arch/sparc/Kconfig~mm-generalize-zone_
+++ a/arch/sparc/Kconfig
@@ -59,6 +59,7 @@ config SPARC32
 	select CLZ_TAB
 	select HAVE_UID16
 	select OLD_SIGACTION
+	select ZONE_DMA
 
 config SPARC64
 	def_bool 64BIT
@@ -141,10 +142,6 @@ config HIGHMEM
 	default y if SPARC32
 	select KMAP_LOCAL
 
-config ZONE_DMA
-	bool
-	default y if SPARC32
-
 config GENERIC_ISA_DMA
 	bool
 	default y if SPARC32
--- a/arch/x86/Kconfig~mm-generalize-zone_
+++ a/arch/x86/Kconfig
@@ -33,6 +33,7 @@ config X86_64
 	select NEED_DMA_MAP_STATE
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
+	select ZONE_DMA32
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
@@ -93,6 +94,7 @@ config X86
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_DEBUG_WX
+	select ARCH_HAS_ZONE_DMA_SET if EXPERT
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
 	select ARCH_MIGHT_HAVE_PC_PARPORT
@@ -343,9 +345,6 @@ config ARCH_SUSPEND_POSSIBLE
 config ARCH_WANT_GENERAL_HUGETLB
 	def_bool y
 
-config ZONE_DMA32
-	def_bool y if X86_64
-
 config AUDIT_ARCH
 	def_bool y if X86_64
 
@@ -393,16 +392,6 @@ config CC_HAS_SANE_STACKPROTECTOR
 
 menu "Processor type and features"
 
-config ZONE_DMA
-	bool "DMA memory allocation support" if EXPERT
-	default y
-	help
-	  DMA memory allocation support allows devices with less than 32-bit
-	  addressing to allocate within the first 16MB of address space.
-	  Disable if no such devices will be used.
-
-	  If unsure, say Y.
-
 config SMP
 	bool "Symmetric multi-processing support"
 	help
--- a/mm/Kconfig~mm-generalize-zone_
+++ a/mm/Kconfig
@@ -761,6 +761,18 @@ config ARCH_HAS_CACHE_LINE_SIZE
 config ARCH_HAS_PTE_DEVMAP
 	bool
 
+config ARCH_HAS_ZONE_DMA_SET
+	bool
+
+config ZONE_DMA
+	bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET
+	default y if ARM64 || X86
+
+config ZONE_DMA32
+	bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET
+	depends on !X86_32
+	default y if ARM64
+
 config ZONE_DEVICE
 	bool "Device memory (pmem, HMM, etc...) hotplug support"
 	depends on MEMORY_HOTPLUG
_



* [patch 098/192] mm: make variable names for populate_vma_page_range() consistent
  2021-07-01  1:46 incoming Andrew Morton
                   ` (96 preceding siblings ...)
  2021-07-01  1:52 ` [patch 097/192] mm: generalize ZONE_[DMA|DMA32] Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 099/192] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables Andrew Morton
                   ` (94 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: aarcange, akpm, arnd, chris, dave.hansen, david, deller,
	eike-kernel, hughd, ink, James.Bottomley, jannh, jcmvbkbc, jgg,
	kirill.shutemov, linux-mm, linuxram, mattst88, mhocko,
	mike.kravetz, minchan, mm-commits, mst, osalvador, peterx, riel,
	rth, shuah, torvalds, tsbogend, vbabka, willy

From: David Hildenbrand <david@redhat.com>
Subject: mm: make variable names for populate_vma_page_range() consistent

Patch series "mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables", v2.

Excessive details on MADV_POPULATE_(READ|WRITE) can be found in patch #2.


This patch (of 5):

Let's make the variable names in the function declaration match the
variable names used in the definition.

Link: https://lkml.kernel.org/r/20210419135443.12822-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210419135443.12822-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/internal.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/internal.h~mm-make-variable-names-for-populate_vma_page_range-consistent
+++ a/mm/internal.h
@@ -344,7 +344,7 @@ void __vma_unlink_list(struct mm_struct
 
 #ifdef CONFIG_MMU
 extern long populate_vma_page_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, int *nonblocking);
+		unsigned long start, unsigned long end, int *locked);
 extern void munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
_


* [patch 099/192] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables
  2021-07-01  1:46 incoming Andrew Morton
                   ` (97 preceding siblings ...)
  2021-07-01  1:52 ` [patch 098/192] mm: make variable names for populate_vma_page_range() consistent Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 100/192] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT Andrew Morton
                   ` (93 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: aarcange, akpm, arnd, chris, dave.hansen, david, deller,
	eike-kernel, hughd, ink, James.Bottomley, jannh, jcmvbkbc, jgg,
	kirill.shutemov, linux-mm, linuxram, mattst88, mhocko,
	mike.kravetz, minchan, mm-commits, mst, osalvador, peterx, riel,
	rth, shuah, torvalds, tsbogend, vbabka, willy

From: David Hildenbrand <david@redhat.com>
Subject: mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables

I. Background: Sparse Memory Mappings

When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region.  Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators.  In addition, we want
to fail in a nice way (instead of generating SIGBUS) if populating does
not succeed because we are out of backend memory (which can happen easily
with file-based mappings, especially tmpfs and hugetlbfs).

While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
reliably discarding memory for most mapping types, there is no generic
approach to populate page tables and preallocate memory.

Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to populate/discard dynamically
and avoid expensive/problematic remappings.  In addition, we never
actually report errors during the final populate phase - it is best-effort
only.

fallocate() can be used to preallocate file-based memory and fail in a
safe way.  However, it cannot really be used for any private mappings on
anonymous files via memfd due to COW semantics.  In addition, fallocate()
does not actually populate page tables, so we still always get pagefaults
on first access - which is sometimes undesired (i.e., real-time workloads)
and requires real prefaulting of page tables, not just a preallocation of
backend storage.  There might be interesting use cases for sparse memory
regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
as it does not prefault page tables.

II. On preallocation/prefaulting from user space

Because we don't have a proper interface, what applications (like QEMU
and databases) end up doing is touching all individual pages (i.e.,
reading and then writing back one byte, so as not to overwrite existing
data), as sketched below.
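
A hedged sketch of that manual touch loop (hypothetical userspace code,
not taken from any of the applications above):

	#include <unistd.h>

	/* touch every page of [addr, addr + length) */
	static void touch_pages(unsigned char *addr, size_t length)
	{
		size_t pagesize = sysconf(_SC_PAGESIZE);
		size_t i;

		for (i = 0; i < length; i += pagesize) {
			volatile unsigned char *p = addr + i;

			*p = *p;	/* read, then write back, one byte per page */
		}
	}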

However, that approach
1) Can result in wear on storage backing, because we end up reading/writing
   each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple
   threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case
   of hugetlbfs/shmem/file-backed memory. For example, this is
   problematic in hypervisors like QEMU where SIGBUS handlers might already
   be used by other subsystems concurrently to, e.g., handle hardware
   errors.  "Simply" doing the preallocation concurrently from another
   thread is not that easy.

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and
   "might be a good idea to read some pages" vs. "Definitely populate/
   preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
   don't want populate/prealloc semantics. They treat this rather as a hint
   to give a little performance boost without too much overhead - and don't
   expect that a lot of memory might get consumed or a lot of time
   might be spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
MAP_POPULATE, with the following semantics (a usage sketch follows the
list):
1. MADV_POPULATE_READ can be used to prefault page tables just like
   manually reading each individual page. This will not break any COW
   mappings. The shared zero page might get mapped and no backend storage
   might get preallocated -- allocation might be deferred to
   write-fault time. Especially shared file mappings require an explicit
   fallocate() upfront to actually preallocate backend memory (blocks in
   the file system) in case the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
   (prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
   prefault page tables just like manually writing (or
   reading+writing) each individual page. This will break any COW
   mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
   (prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
   mappings marked with VM_PFNMAP and VM_IO. Also, proper access
   permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
   mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
   might have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
   when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
   cannot protect from the OOM (Out Of Memory) handler killing the
   process.
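
As the usage sketch referenced above (a hedged, hypothetical helper; the
MADV_POPULATE_WRITE value matches the uapi addition in this series):

	#include <sys/mman.h>
	#include <errno.h>

	#ifndef MADV_POPULATE_WRITE
	#define MADV_POPULATE_WRITE 23
	#endif

	/* Preallocate + prefault [addr, addr + len) writable. */
	static int prefault_writable(void *addr, size_t len)
	{
		if (madvise(addr, len, MADV_POPULATE_WRITE))
			return -errno;	/* e.g. EINVAL (VM_PFNMAP/VM_IO), EHWPOISON */
		return 0;		/* all page tables populated writable once */
	}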

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), one issue is that
whenever we prefault pages writable, the pages have to be marked dirty,
because the CPU could dirty them at any time.  While not a real problem for
hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
page will be marked dirty and has to be written back later when evicting.

MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
mapping from backend storage without marking it dirty, such that eviction
won't have to write it back.  As discussed above, shared file mappings
might require an explicit fallocate() upfront to achieve
preallocation+prepopulation.

Although sparse memory mappings are the primary use case, this will also
be useful for other preallocate/prefault use cases where MAP_POPULATE is
not desired or the semantics of MAP_POPULATE are not sufficient: as one
example, QEMU users can trigger preallocation/prefaulting of guest RAM
after the mapping was created -- and don't want errors to be silently
suppressed.

Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
however, the main motivation back then was performance improvements --
which should also still be the case.

V. Single-threaded performance comparison

I did a short experiment, prefaulting page tables on completely *empty
mappings/files* and repeated the experiment 10 times.  The results
correspond to the shortest execution time.  In general, the performance
benefit for huge pages is negligible with small mappings.

V.1: Private mappings

POPULATE_READ and POPULATE_WRITE are fastest.  Note that
Reading/POPULATE_READ will populate the shared zeropage where applicable
-- which results in short population times.

The fastest way to allocate backend storage (here: swap or huge pages) and
prefault page tables is POPULATE_WRITE.

V.2: Shared mappings

fallocate() is fastest; however, it doesn't prefault page tables.
POPULATE_WRITE is faster than simple writes and read/writes. 
POPULATE_READ is faster than simple reads.

Without a fd, the fastest way to allocate backend storage and prefault
page tables is POPULATE_WRITE.  With an fd, the fastest way is usually
FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
exception is actual files: FALLOCATE+Read is slightly faster than
FALLOCATE+POPULATE_READ.

The fastest way to allocate backend storage and prefault page tables is
FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
dirty.

V.3: Detailed results

==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB     : Read                     :     0.119 ms
Anon 4 KiB     : Write                    :     0.222 ms
Anon 4 KiB     : Read/Write               :     0.380 ms
Anon 4 KiB     : POPULATE_READ            :     0.060 ms
Anon 4 KiB     : POPULATE_WRITE           :     0.158 ms
Memfd 4 KiB    : Read                     :     0.034 ms
Memfd 4 KiB    : Write                    :     0.310 ms
Memfd 4 KiB    : Read/Write               :     0.362 ms
Memfd 4 KiB    : POPULATE_READ            :     0.039 ms
Memfd 4 KiB    : POPULATE_WRITE           :     0.229 ms
Memfd 2 MiB    : Read                     :     0.030 ms
Memfd 2 MiB    : Write                    :     0.030 ms
Memfd 2 MiB    : Read/Write               :     0.030 ms
Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
tmpfs          : Read                     :     0.033 ms
tmpfs          : Write                    :     0.313 ms
tmpfs          : Read/Write               :     0.406 ms
tmpfs          : POPULATE_READ            :     0.039 ms
tmpfs          : POPULATE_WRITE           :     0.285 ms
file           : Read                     :     0.033 ms
file           : Write                    :     0.351 ms
file           : Read/Write               :     0.408 ms
file           : POPULATE_READ            :     0.039 ms
file           : POPULATE_WRITE           :     0.290 ms
hugetlbfs      : Read                     :     0.030 ms
hugetlbfs      : Write                    :     0.030 ms
hugetlbfs      : Read/Write               :     0.030 ms
hugetlbfs      : POPULATE_READ            :     0.030 ms
hugetlbfs      : POPULATE_WRITE           :     0.030 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB     : Read                     :   237.940 ms
Anon 4 KiB     : Write                    :   708.409 ms
Anon 4 KiB     : Read/Write               :  1054.041 ms
Anon 4 KiB     : POPULATE_READ            :   124.310 ms
Anon 4 KiB     : POPULATE_WRITE           :   572.582 ms
Memfd 4 KiB    : Read                     :   136.928 ms
Memfd 4 KiB    : Write                    :   963.898 ms
Memfd 4 KiB    : Read/Write               :  1106.561 ms
Memfd 4 KiB    : POPULATE_READ            :    78.450 ms
Memfd 4 KiB    : POPULATE_WRITE           :   805.881 ms
Memfd 2 MiB    : Read                     :   357.116 ms
Memfd 2 MiB    : Write                    :   357.210 ms
Memfd 2 MiB    : Read/Write               :   357.606 ms
Memfd 2 MiB    : POPULATE_READ            :   356.094 ms
Memfd 2 MiB    : POPULATE_WRITE           :   356.937 ms
tmpfs          : Read                     :   137.536 ms
tmpfs          : Write                    :   954.362 ms
tmpfs          : Read/Write               :  1105.954 ms
tmpfs          : POPULATE_READ            :    80.289 ms
tmpfs          : POPULATE_WRITE           :   822.826 ms
file           : Read                     :   137.874 ms
file           : Write                    :   987.025 ms
file           : Read/Write               :  1107.439 ms
file           : POPULATE_READ            :    80.413 ms
file           : POPULATE_WRITE           :   857.622 ms
hugetlbfs      : Read                     :   355.607 ms
hugetlbfs      : Write                    :   355.729 ms
hugetlbfs      : Read/Write               :   356.127 ms
hugetlbfs      : POPULATE_READ            :   354.585 ms
hugetlbfs      : POPULATE_WRITE           :   355.138 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anon 4 KiB     : Read                     :     0.394 ms
Anon 4 KiB     : Write                    :     0.348 ms
Anon 4 KiB     : Read/Write               :     0.400 ms
Anon 4 KiB     : POPULATE_READ            :     0.326 ms
Anon 4 KiB     : POPULATE_WRITE           :     0.273 ms
Anon 2 MiB     : Read                     :     0.030 ms
Anon 2 MiB     : Write                    :     0.030 ms
Anon 2 MiB     : Read/Write               :     0.030 ms
Anon 2 MiB     : POPULATE_READ            :     0.030 ms
Anon 2 MiB     : POPULATE_WRITE           :     0.030 ms
Memfd 4 KiB    : Read                     :     0.412 ms
Memfd 4 KiB    : Write                    :     0.372 ms
Memfd 4 KiB    : Read/Write               :     0.419 ms
Memfd 4 KiB    : POPULATE_READ            :     0.343 ms
Memfd 4 KiB    : POPULATE_WRITE           :     0.288 ms
Memfd 4 KiB    : FALLOCATE                :     0.137 ms
Memfd 4 KiB    : FALLOCATE+Read           :     0.446 ms
Memfd 4 KiB    : FALLOCATE+Write          :     0.330 ms
Memfd 4 KiB    : FALLOCATE+Read/Write     :     0.454 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :     0.379 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :     0.268 ms
Memfd 2 MiB    : Read                     :     0.030 ms
Memfd 2 MiB    : Write                    :     0.030 ms
Memfd 2 MiB    : Read/Write               :     0.030 ms
Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
Memfd 2 MiB    : FALLOCATE                :     0.030 ms
Memfd 2 MiB    : FALLOCATE+Read           :     0.031 ms
Memfd 2 MiB    : FALLOCATE+Write          :     0.031 ms
Memfd 2 MiB    : FALLOCATE+Read/Write     :     0.031 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :     0.030 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :     0.030 ms
tmpfs          : Read                     :     0.416 ms
tmpfs          : Write                    :     0.369 ms
tmpfs          : Read/Write               :     0.425 ms
tmpfs          : POPULATE_READ            :     0.346 ms
tmpfs          : POPULATE_WRITE           :     0.295 ms
tmpfs          : FALLOCATE                :     0.139 ms
tmpfs          : FALLOCATE+Read           :     0.447 ms
tmpfs          : FALLOCATE+Write          :     0.333 ms
tmpfs          : FALLOCATE+Read/Write     :     0.454 ms
tmpfs          : FALLOCATE+POPULATE_READ  :     0.380 ms
tmpfs          : FALLOCATE+POPULATE_WRITE :     0.272 ms
file           : Read                     :     0.191 ms
file           : Write                    :     0.511 ms
file           : Read/Write               :     0.524 ms
file           : POPULATE_READ            :     0.196 ms
file           : POPULATE_WRITE           :     0.434 ms
file           : FALLOCATE                :     0.004 ms
file           : FALLOCATE+Read           :     0.197 ms
file           : FALLOCATE+Write          :     0.554 ms
file           : FALLOCATE+Read/Write     :     0.480 ms
file           : FALLOCATE+POPULATE_READ  :     0.201 ms
file           : FALLOCATE+POPULATE_WRITE :     0.381 ms
hugetlbfs      : Read                     :     0.030 ms
hugetlbfs      : Write                    :     0.030 ms
hugetlbfs      : Read/Write               :     0.030 ms
hugetlbfs      : POPULATE_READ            :     0.030 ms
hugetlbfs      : POPULATE_WRITE           :     0.030 ms
hugetlbfs      : FALLOCATE                :     0.030 ms
hugetlbfs      : FALLOCATE+Read           :     0.031 ms
hugetlbfs      : FALLOCATE+Write          :     0.031 ms
hugetlbfs      : FALLOCATE+Read/Write     :     0.030 ms
hugetlbfs      : FALLOCATE+POPULATE_READ  :     0.030 ms
hugetlbfs      : FALLOCATE+POPULATE_WRITE :     0.030 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anon 4 KiB     : Read                     :  1053.090 ms
Anon 4 KiB     : Write                    :   913.642 ms
Anon 4 KiB     : Read/Write               :  1060.350 ms
Anon 4 KiB     : POPULATE_READ            :   893.691 ms
Anon 4 KiB     : POPULATE_WRITE           :   782.885 ms
Anon 2 MiB     : Read                     :   358.553 ms
Anon 2 MiB     : Write                    :   358.419 ms
Anon 2 MiB     : Read/Write               :   357.992 ms
Anon 2 MiB     : POPULATE_READ            :   357.533 ms
Anon 2 MiB     : POPULATE_WRITE           :   357.808 ms
Memfd 4 KiB    : Read                     :  1078.144 ms
Memfd 4 KiB    : Write                    :   942.036 ms
Memfd 4 KiB    : Read/Write               :  1100.391 ms
Memfd 4 KiB    : POPULATE_READ            :   925.829 ms
Memfd 4 KiB    : POPULATE_WRITE           :   804.394 ms
Memfd 4 KiB    : FALLOCATE                :   304.632 ms
Memfd 4 KiB    : FALLOCATE+Read           :  1163.359 ms
Memfd 4 KiB    : FALLOCATE+Write          :   933.186 ms
Memfd 4 KiB    : FALLOCATE+Read/Write     :  1187.304 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :  1013.660 ms
Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :   794.560 ms
Memfd 2 MiB    : Read                     :   358.131 ms
Memfd 2 MiB    : Write                    :   358.099 ms
Memfd 2 MiB    : Read/Write               :   358.250 ms
Memfd 2 MiB    : POPULATE_READ            :   357.563 ms
Memfd 2 MiB    : POPULATE_WRITE           :   357.334 ms
Memfd 2 MiB    : FALLOCATE                :   356.735 ms
Memfd 2 MiB    : FALLOCATE+Read           :   358.152 ms
Memfd 2 MiB    : FALLOCATE+Write          :   358.331 ms
Memfd 2 MiB    : FALLOCATE+Read/Write     :   358.018 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :   357.286 ms
Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :   357.523 ms
tmpfs          : Read                     :  1087.265 ms
tmpfs          : Write                    :   950.840 ms
tmpfs          : Read/Write               :  1107.567 ms
tmpfs          : POPULATE_READ            :   922.605 ms
tmpfs          : POPULATE_WRITE           :   810.094 ms
tmpfs          : FALLOCATE                :   306.320 ms
tmpfs          : FALLOCATE+Read           :  1169.796 ms
tmpfs          : FALLOCATE+Write          :   933.730 ms
tmpfs          : FALLOCATE+Read/Write     :  1191.610 ms
tmpfs          : FALLOCATE+POPULATE_READ  :  1020.474 ms
tmpfs          : FALLOCATE+POPULATE_WRITE :   798.945 ms
file           : Read                     :   654.101 ms
file           : Write                    :  1259.142 ms
file           : Read/Write               :  1289.509 ms
file           : POPULATE_READ            :   661.642 ms
file           : POPULATE_WRITE           :  1106.816 ms
file           : FALLOCATE                :     1.864 ms
file           : FALLOCATE+Read           :   656.328 ms
file           : FALLOCATE+Write          :  1153.300 ms
file           : FALLOCATE+Read/Write     :  1180.613 ms
file           : FALLOCATE+POPULATE_READ  :   668.347 ms
file           : FALLOCATE+POPULATE_WRITE :   996.143 ms
hugetlbfs      : Read                     :   357.245 ms
hugetlbfs      : Write                    :   357.413 ms
hugetlbfs      : Read/Write               :   357.120 ms
hugetlbfs      : POPULATE_READ            :   356.321 ms
hugetlbfs      : POPULATE_WRITE           :   356.693 ms
hugetlbfs      : FALLOCATE                :   355.927 ms
hugetlbfs      : FALLOCATE+Read           :   357.074 ms
hugetlbfs      : FALLOCATE+Write          :   357.120 ms
hugetlbfs      : FALLOCATE+Read/Write     :   356.983 ms
hugetlbfs      : FALLOCATE+POPULATE_READ  :   356.413 ms
hugetlbfs      : FALLOCATE+POPULATE_WRITE :   356.266 ms
**************************************************
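
For illustration (not part of the patch), a minimal userspace sketch of the
new interface; the fallback #defines mirror the uapi values added below,
and on kernels without support madvise() simply fails with EINVAL:

	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MADV_POPULATE_READ
	#define MADV_POPULATE_READ	22
	#endif
	#ifndef MADV_POPULATE_WRITE
	#define MADV_POPULATE_WRITE	23
	#endif

	int main(void)
	{
		const size_t size = 2 * 1024 * 1024;
		char *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/*
		 * Prefault the whole range writable without writing to it:
		 * fails with EINVAL on incompatible mappings/permissions
		 * and with ENOMEM on holes in the range.
		 */
		if (madvise(addr, size, MADV_POPULATE_WRITE)) {
			perror("madvise(MADV_POPULATE_WRITE)");
			return 1;
		}

		munmap(addr, size);
		return 0;
	}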

[1] https://lkml.org/lkml/2013/6/27/698

[akpm@linux-foundation.org: coding style fixes]
Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/include/uapi/asm/mman.h     |    3 +
 arch/mips/include/uapi/asm/mman.h      |    3 +
 arch/parisc/include/uapi/asm/mman.h    |    3 +
 arch/xtensa/include/uapi/asm/mman.h    |    3 +
 include/uapi/asm-generic/mman-common.h |    3 +
 mm/gup.c                               |   58 ++++++++++++++++++++
 mm/internal.h                          |    3 +
 mm/madvise.c                           |   66 +++++++++++++++++++++++
 8 files changed, 142 insertions(+)

--- a/arch/alpha/include/uapi/asm/mman.h~mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables
+++ a/arch/alpha/include/uapi/asm/mman.h
@@ -71,6 +71,9 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
--- a/arch/mips/include/uapi/asm/mman.h~mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables
+++ a/arch/mips/include/uapi/asm/mman.h
@@ -98,6 +98,9 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
--- a/arch/parisc/include/uapi/asm/mman.h~mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables
+++ a/arch/parisc/include/uapi/asm/mman.h
@@ -52,6 +52,9 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
 
--- a/arch/xtensa/include/uapi/asm/mman.h~mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables
+++ a/arch/xtensa/include/uapi/asm/mman.h
@@ -106,6 +106,9 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
--- a/include/uapi/asm-generic/mman-common.h~mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables
+++ a/include/uapi/asm-generic/mman-common.h
@@ -72,6 +72,9 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE_READ	22	/* populate (prefault) page tables readable */
+#define MADV_POPULATE_WRITE	23	/* populate (prefault) page tables writable */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
--- a/mm/gup.c~mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables
+++ a/mm/gup.c
@@ -1501,6 +1501,64 @@ long populate_vma_page_range(struct vm_a
 }
 
 /*
+ * faultin_vma_page_range() - populate (prefault) page tables inside the
+ *			      given VMA range readable/writable
+ *
+ * This takes care of mlocking the pages, too, if VM_LOCKED is set.
+ *
+ * @vma: target vma
+ * @start: start address
+ * @end: end address
+ * @write: whether to prefault readable or writable
+ * @locked: whether the mmap_lock is still held
+ *
+ * Returns either number of processed pages in the vma, or a negative error
+ * code on error (see __get_user_pages()).
+ *
+ * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and
+ * covered by the VMA.
+ *
+ * If @locked is NULL, it may be held for read or write and will be unperturbed.
+ *
+ * If @locked is non-NULL, it must be held for read only and may be released.  If
+ * it's released, *@locked will be set to 0.
+ */
+long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end, bool write, int *locked)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long nr_pages = (end - start) / PAGE_SIZE;
+	int gup_flags;
+
+	VM_BUG_ON(!PAGE_ALIGNED(start));
+	VM_BUG_ON(!PAGE_ALIGNED(end));
+	VM_BUG_ON_VMA(start < vma->vm_start, vma);
+	VM_BUG_ON_VMA(end > vma->vm_end, vma);
+	mmap_assert_locked(mm);
+
+	/*
+	 * FOLL_TOUCH: Mark page accessed and thereby young; will also mark
+	 *	       the page dirty with FOLL_WRITE -- which doesn't make a
+	 *	       difference with !FOLL_FORCE, because the page is writable
+	 *	       in the page table.
+	 * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
+	 *		  a poisoned page.
+	 * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
+	 * !FOLL_FORCE: Require proper access permissions.
+	 */
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+	if (write)
+		gup_flags |= FOLL_WRITE;
+
+	/*
+	 * See check_vma_flags(): Will return -EFAULT on incompatible mappings
+	 * or with insufficient permissions.
+	 */
+	return __get_user_pages(mm, start, nr_pages, gup_flags,
+				NULL, NULL, locked);
+}
+
+/*
  * __mm_populate - populate and/or mlock pages within a range of address space.
  *
  * This is used to implement mlock() and the MAP_POPULATE / MAP_LOCKED mmap
--- a/mm/internal.h~mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables
+++ a/mm/internal.h
@@ -345,6 +345,9 @@ void __vma_unlink_list(struct mm_struct
 #ifdef CONFIG_MMU
 extern long populate_vma_page_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, int *locked);
+extern long faultin_vma_page_range(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end,
+				   bool write, int *locked);
 extern void munlock_vma_pages_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
 static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
--- a/mm/madvise.c~mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables
+++ a/mm/madvise.c
@@ -53,6 +53,8 @@ static int madvise_need_mmap_write(int b
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_FREE:
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -822,6 +824,61 @@ static long madvise_dontneed_free(struct
 		return -EINVAL;
 }
 
+static long madvise_populate(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end,
+			     int behavior)
+{
+	const bool write = behavior == MADV_POPULATE_WRITE;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long tmp_end;
+	int locked = 1;
+	long pages;
+
+	*prev = vma;
+
+	while (start < end) {
+		/*
+		 * We might have temporarily dropped the lock. For example,
+		 * our VMA might have been split.
+		 */
+		if (!vma || start >= vma->vm_end) {
+			vma = find_vma(mm, start);
+			if (!vma || start < vma->vm_start)
+				return -ENOMEM;
+		}
+
+		tmp_end = min_t(unsigned long, end, vma->vm_end);
+		/* Populate (prefault) page tables readable/writable. */
+		pages = faultin_vma_page_range(vma, start, tmp_end, write,
+					       &locked);
+		if (!locked) {
+			mmap_read_lock(mm);
+			locked = 1;
+			*prev = NULL;
+			vma = NULL;
+		}
+		if (pages < 0) {
+			switch (pages) {
+			case -EINTR:
+				return -EINTR;
+			case -EFAULT: /* Incompatible mappings / permissions. */
+				return -EINVAL;
+			case -EHWPOISON:
+				return -EHWPOISON;
+			default:
+				pr_warn_once("%s: unhandled return value: %ld\n",
+					     __func__, pages);
+				fallthrough;
+			case -ENOMEM:
+				return -ENOMEM;
+			}
+		}
+		start += pages * PAGE_SIZE;
+	}
+	return 0;
+}
+
 /*
  * Application wants to free up the pages and associated backing store.
  * This is effectively punching a hole into the middle of a file.
@@ -935,6 +992,9 @@ madvise_vma(struct vm_area_struct *vma,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
+		return madvise_populate(vma, prev, start, end, behavior);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -955,6 +1015,8 @@ madvise_behavior_valid(int behavior)
 	case MADV_FREE:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_POPULATE_READ:
+	case MADV_POPULATE_WRITE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
@@ -1042,6 +1104,10 @@ process_madvise_behavior_valid(int behav
  *		easily if memory pressure happens.
  *  MADV_PAGEOUT - the application is not expected to use this memory soon,
  *		page out the pages in this range immediately.
+ *  MADV_POPULATE_READ - populate (prefault) page tables readable by
+ *		triggering read faults if required
+ *  MADV_POPULATE_WRITE - populate (prefault) page tables writable by
+ *		triggering write faults if required
  *
  * return values:
  *  zero    - success
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 100/192] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT
  2021-07-01  1:46 incoming Andrew Morton
                   ` (98 preceding siblings ...)
  2021-07-01  1:52 ` [patch 099/192] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 101/192] selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore Andrew Morton
                   ` (92 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: aarcange, akpm, arnd, chris, dave.hansen, david, deller,
	eike-kernel, hughd, ink, James.Bottomley, jannh, jcmvbkbc, jgg,
	kirill.shutemov, linux-mm, linuxram, mattst88, mhocko,
	mike.kravetz, minchan, mm-commits, mst, osalvador, peterx, riel,
	rppt, rth, shuah, torvalds, tsbogend, vbabka, willy

From: David Hildenbrand <david@redhat.com>
Subject: MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT

MEMORY MANAGEMENT seems to be a good fit.

Link: https://lkml.kernel.org/r/20210419135443.12822-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |    1 +
 1 file changed, 1 insertion(+)

--- a/MAINTAINERS~maintainers-add-tools-testing-selftests-vm-to-memory-management
+++ a/MAINTAINERS
@@ -11828,6 +11828,7 @@ F:	include/linux/mmzone.h
 F:	include/linux/pagewalk.h
 F:	include/linux/vmalloc.h
 F:	mm/
+F:	tools/testing/selftests/vm/
 
 MEMORY TECHNOLOGY DEVICES (MTD)
 M:	Miquel Raynal <miquel.raynal@bootlin.com>
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 101/192] selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore
  2021-07-01  1:46 incoming Andrew Morton
                   ` (99 preceding siblings ...)
  2021-07-01  1:52 ` [patch 100/192] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 102/192] selftests/vm: add test for MADV_POPULATE_(READ|WRITE) Andrew Morton
                   ` (91 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: aarcange, akpm, arnd, chris, dave.hansen, david, deller,
	eike-kernel, hughd, ink, James.Bottomley, jannh, jcmvbkbc, jgg,
	kirill.shutemov, linux-mm, linuxram, mattst88, mhocko,
	mike.kravetz, minchan, mm-commits, mst, osalvador, peterx, riel,
	rth, shuah, torvalds, tsbogend, vbabka, willy

From: David Hildenbrand <david@redhat.com>
Subject: selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore

We missed adding two binaries to gitignore.

Link: https://lkml.kernel.org/r/20210419135443.12822-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/.gitignore |    2 ++
 1 file changed, 2 insertions(+)

--- a/tools/testing/selftests/vm/.gitignore~selftests-vm-add-protection_keys_32-protection_keys_64-to-gitignore
+++ a/tools/testing/selftests/vm/.gitignore
@@ -12,6 +12,8 @@ mremap_test
 on-fault-limit
 transhuge-stress
 protection_keys
+protection_keys_32
+protection_keys_64
 userfaultfd
 mlock-intersect-test
 mlock-random-test
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 102/192] selftests/vm: add test for MADV_POPULATE_(READ|WRITE)
  2021-07-01  1:46 incoming Andrew Morton
                   ` (100 preceding siblings ...)
  2021-07-01  1:52 ` [patch 101/192] selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 103/192] mm/memory_hotplug: rate limit page migration warnings Andrew Morton
                   ` (90 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: aarcange, akpm, arnd, chris, dave.hansen, david, deller,
	eike-kernel, hughd, ink, James.Bottomley, jannh, jcmvbkbc, jgg,
	kirill.shutemov, linux-mm, linuxram, mattst88, mhocko,
	mike.kravetz, minchan, mm-commits, mst, osalvador, peterx, riel,
	rth, shuah, torvalds, tsbogend, vbabka, willy

From: David Hildenbrand <david@redhat.com>
Subject: selftests/vm: add test for MADV_POPULATE_(READ|WRITE)

Let's add a simple test for MADV_POPULATE_READ and MADV_POPULATE_WRITE,
verifying some error handling, that population works, and that softdirty
tracking works as expected.  For now, limit the test to private anonymous
memory.

Link: https://lkml.kernel.org/r/20210419135443.12822-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/.gitignore      |    1 
 tools/testing/selftests/vm/Makefile        |    1 
 tools/testing/selftests/vm/madv_populate.c |  342 +++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh  |   16 
 4 files changed, 360 insertions(+)

--- a/tools/testing/selftests/vm/.gitignore~selftests-vm-add-test-for-madv_populate_readwrite
+++ a/tools/testing/selftests/vm/.gitignore
@@ -14,6 +14,7 @@ transhuge-stress
 protection_keys
 protection_keys_32
 protection_keys_64
+madv_populate
 userfaultfd
 mlock-intersect-test
 mlock-random-test
--- /dev/null
+++ a/tools/testing/selftests/vm/madv_populate.c
@@ -0,0 +1,342 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MADV_POPULATE_READ and MADV_POPULATE_WRITE tests
+ *
+ * Copyright 2021, Red Hat, Inc.
+ *
+ * Author(s): David Hildenbrand <david@redhat.com>
+ */
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+
+#include "../kselftest.h"
+
+#if defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE)
+
+/*
+ * For now, we're using 2 MiB of private anonymous memory for all tests.
+ */
+#define SIZE (2 * 1024 * 1024)
+
+static size_t pagesize;
+
+static uint64_t pagemap_get_entry(int fd, char *start)
+{
+	const unsigned long pfn = (unsigned long)start / pagesize;
+	uint64_t entry;
+	int ret;
+
+	ret = pread(fd, &entry, sizeof(entry), pfn * sizeof(entry));
+	if (ret != sizeof(entry))
+		ksft_exit_fail_msg("reading pagemap failed\n");
+	return entry;
+}
+
+static bool pagemap_is_populated(int fd, char *start)
+{
+	uint64_t entry = pagemap_get_entry(fd, start);
+
+	/* Bit 63 (page present) or bit 62 (page swapped). */
+	return entry & 0xc000000000000000ull;
+}
+
+static bool pagemap_is_softdirty(int fd, char *start)
+{
+	uint64_t entry = pagemap_get_entry(fd, start);
+
+	return entry & 0x0080000000000000ull;	/* bit 55: soft-dirty */
+}
+
+static void sense_support(void)
+{
+	char *addr;
+	int ret;
+
+	addr = mmap(0, pagesize, PROT_READ | PROT_WRITE,
+		    MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+	if (addr == MAP_FAILED)
+		ksft_exit_fail_msg("mmap failed\n");
+
+	ret = madvise(addr, pagesize, MADV_POPULATE_READ);
+	if (ret)
+		ksft_exit_skip("MADV_POPULATE_READ is not available\n");
+
+	ret = madvise(addr, pagesize, MADV_POPULATE_WRITE);
+	if (ret)
+		ksft_exit_skip("MADV_POPULATE_WRITE is not available\n");
+
+	munmap(addr, pagesize);
+}
+
+static void test_prot_read(void)
+{
+	char *addr;
+	int ret;
+
+	ksft_print_msg("[RUN] %s\n", __func__);
+
+	addr = mmap(0, SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+	if (addr == MAP_FAILED)
+		ksft_exit_fail_msg("mmap failed\n");
+
+	ret = madvise(addr, SIZE, MADV_POPULATE_READ);
+	ksft_test_result(!ret, "MADV_POPULATE_READ with PROT_READ\n");
+
+	ret = madvise(addr, SIZE, MADV_POPULATE_WRITE);
+	ksft_test_result(ret == -1 && errno == EINVAL,
+			 "MADV_POPULATE_WRITE with PROT_READ\n");
+
+	munmap(addr, SIZE);
+}
+
+static void test_prot_write(void)
+{
+	char *addr;
+	int ret;
+
+	ksft_print_msg("[RUN] %s\n", __func__);
+
+	addr = mmap(0, SIZE, PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+	if (addr == MAP_FAILED)
+		ksft_exit_fail_msg("mmap failed\n");
+
+	ret = madvise(addr, SIZE, MADV_POPULATE_READ);
+	ksft_test_result(ret == -1 && errno == EINVAL,
+			 "MADV_POPULATE_READ with PROT_WRITE\n");
+
+	ret = madvise(addr, SIZE, MADV_POPULATE_WRITE);
+	ksft_test_result(!ret, "MADV_POPULATE_WRITE with PROT_WRITE\n");
+
+	munmap(addr, SIZE);
+}
+
+static void test_holes(void)
+{
+	char *addr;
+	int ret;
+
+	ksft_print_msg("[RUN] %s\n", __func__);
+
+	addr = mmap(0, SIZE, PROT_READ | PROT_WRITE,
+		    MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+	if (addr == MAP_FAILED)
+		ksft_exit_fail_msg("mmap failed\n");
+	ret = munmap(addr + pagesize, pagesize);
+	if (ret)
+		ksft_exit_fail_msg("munmap failed\n");
+
+	/* Hole in the middle */
+	ret = madvise(addr, SIZE, MADV_POPULATE_READ);
+	ksft_test_result(ret == -1 && errno == ENOMEM,
+			 "MADV_POPULATE_READ with holes in the middle\n");
+	ret = madvise(addr, SIZE, MADV_POPULATE_WRITE);
+	ksft_test_result(ret == -1 && errno == ENOMEM,
+			 "MADV_POPULATE_WRITE with holes in the middle\n");
+
+	/* Hole at end */
+	ret = madvise(addr, 2 * pagesize, MADV_POPULATE_READ);
+	ksft_test_result(ret == -1 && errno == ENOMEM,
+			 "MADV_POPULATE_READ with holes at the end\n");
+	ret = madvise(addr, 2 * pagesize, MADV_POPULATE_WRITE);
+	ksft_test_result(ret == -1 && errno == ENOMEM,
+			 "MADV_POPULATE_WRITE with holes at the end\n");
+
+	/* Hole at beginning */
+	ret = madvise(addr + pagesize, pagesize, MADV_POPULATE_READ);
+	ksft_test_result(ret == -1 && errno == ENOMEM,
+			 "MADV_POPULATE_READ with holes at the beginning\n");
+	ret = madvise(addr + pagesize, pagesize, MADV_POPULATE_WRITE);
+	ksft_test_result(ret == -1 && errno == ENOMEM,
+			 "MADV_POPULATE_WRITE with holes at the beginning\n");
+
+	munmap(addr, SIZE);
+}
+
+static bool range_is_populated(char *start, ssize_t size)
+{
+	int fd = open("/proc/self/pagemap", O_RDONLY);
+	bool ret = true;
+
+	if (fd < 0)
+		ksft_exit_fail_msg("opening pagemap failed\n");
+	for (; size > 0 && ret; size -= pagesize, start += pagesize)
+		if (!pagemap_is_populated(fd, start))
+			ret = false;
+	close(fd);
+	return ret;
+}
+
+static bool range_is_not_populated(char *start, ssize_t size)
+{
+	int fd = open("/proc/self/pagemap", O_RDONLY);
+	bool ret = true;
+
+	if (fd < 0)
+		ksft_exit_fail_msg("opening pagemap failed\n");
+	for (; size > 0 && ret; size -= pagesize, start += pagesize)
+		if (pagemap_is_populated(fd, start))
+			ret = false;
+	close(fd);
+	return ret;
+}
+
+static void test_populate_read(void)
+{
+	char *addr;
+	int ret;
+
+	ksft_print_msg("[RUN] %s\n", __func__);
+
+	addr = mmap(0, SIZE, PROT_READ | PROT_WRITE,
+		    MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+	if (addr == MAP_FAILED)
+		ksft_exit_fail_msg("mmap failed\n");
+	ksft_test_result(range_is_not_populated(addr, SIZE),
+			 "range initially not populated\n");
+
+	ret = madvise(addr, SIZE, MADV_POPULATE_READ);
+	ksft_test_result(!ret, "MADV_POPULATE_READ\n");
+	ksft_test_result(range_is_populated(addr, SIZE),
+			 "range is populated\n");
+
+	munmap(addr, SIZE);
+}
+
+static void test_populate_write(void)
+{
+	char *addr;
+	int ret;
+
+	ksft_print_msg("[RUN] %s\n", __func__);
+
+	addr = mmap(0, SIZE, PROT_READ | PROT_WRITE,
+		    MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+	if (addr == MAP_FAILED)
+		ksft_exit_fail_msg("mmap failed\n");
+	ksft_test_result(range_is_not_populated(addr, SIZE),
+			 "range initially not populated\n");
+
+	ret = madvise(addr, SIZE, MADV_POPULATE_WRITE);
+	ksft_test_result(!ret, "MADV_POPULATE_WRITE\n");
+	ksft_test_result(range_is_populated(addr, SIZE),
+			 "range is populated\n");
+
+	munmap(addr, SIZE);
+}
+
+static bool range_is_softdirty(char *start, ssize_t size)
+{
+	int fd = open("/proc/self/pagemap", O_RDONLY);
+	bool ret = true;
+
+	if (fd < 0)
+		ksft_exit_fail_msg("opening pagemap failed\n");
+	for (; size > 0 && ret; size -= pagesize, start += pagesize)
+		if (!pagemap_is_softdirty(fd, start))
+			ret = false;
+	close(fd);
+	return ret;
+}
+
+static bool range_is_not_softdirty(char *start, ssize_t size)
+{
+	int fd = open("/proc/self/pagemap", O_RDONLY);
+	bool ret = true;
+
+	if (fd < 0)
+		ksft_exit_fail_msg("opening pagemap failed\n");
+	for (; size > 0 && ret; size -= pagesize, start += pagesize)
+		if (pagemap_is_softdirty(fd, start))
+			ret = false;
+	close(fd);
+	return ret;
+}
+
+static void clear_softdirty(void)
+{
+	int fd = open("/proc/self/clear_refs", O_WRONLY);
+	const char *ctrl = "4";
+	int ret;
+
+	if (fd < 0)
+		ksft_exit_fail_msg("opening clear_refs failed\n");
+	ret = write(fd, ctrl, strlen(ctrl));
+	if (ret != strlen(ctrl))
+		ksft_exit_fail_msg("writing clear_refs failed\n");
+	close(fd);
+}
+
+static void test_softdirty(void)
+{
+	char *addr;
+	int ret;
+
+	ksft_print_msg("[RUN] %s\n", __func__);
+
+	addr = mmap(0, SIZE, PROT_READ | PROT_WRITE,
+		    MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+	if (addr == MAP_FAILED)
+		ksft_exit_fail_msg("mmap failed\n");
+
+	/* Clear any softdirty bits. */
+	clear_softdirty();
+	ksft_test_result(range_is_not_softdirty(addr, SIZE),
+			 "range is not softdirty\n");
+
+	/* Populating READ should not set softdirty. */
+	ret = madvise(addr, SIZE, MADV_POPULATE_READ);
+	ksft_test_result(!ret, "MADV_POPULATE_READ\n");
+	ksft_test_result(range_is_not_softdirty(addr, SIZE),
+			 "range is not softdirty\n");
+
+	/* Populating WRITE should set softdirty. */
+	ret = madvise(addr, SIZE, MADV_POPULATE_WRITE);
+	ksft_test_result(!ret, "MADV_POPULATE_WRITE\n");
+	ksft_test_result(range_is_softdirty(addr, SIZE),
+			 "range is softdirty\n");
+
+	munmap(addr, SIZE);
+}
+
+int main(int argc, char **argv)
+{
+	int err;
+
+	pagesize = getpagesize();
+
+	ksft_print_header();
+	ksft_set_plan(21);
+
+	sense_support();
+	test_prot_read();
+	test_prot_write();
+	test_holes();
+	test_populate_read();
+	test_populate_write();
+	test_softdirty();
+
+	err = ksft_get_fail_cnt();
+	if (err)
+		ksft_exit_fail_msg("%d out of %d tests failed\n",
+				   err, ksft_test_num());
+	return ksft_exit_pass();
+}
+
+#else /* defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) */
+
+#warning "missing MADV_POPULATE_READ or MADV_POPULATE_WRITE definition"
+
+int main(int argc, char **argv)
+{
+	ksft_print_header();
+	ksft_exit_skip("MADV_POPULATE_READ or MADV_POPULATE_WRITE not defined\n");
+}
+
+#endif /* defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) */
--- a/tools/testing/selftests/vm/Makefile~selftests-vm-add-test-for-madv_populate_readwrite
+++ a/tools/testing/selftests/vm/Makefile
@@ -31,6 +31,7 @@ TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += khugepaged
+TEST_GEN_FILES += madv_populate
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
--- a/tools/testing/selftests/vm/run_vmtests.sh~selftests-vm-add-test-for-madv_populate_readwrite
+++ a/tools/testing/selftests/vm/run_vmtests.sh
@@ -346,4 +346,20 @@ else
 	exitcode=1
 fi
 
+echo "--------------------------------------------------------"
+echo "running MADV_POPULATE_READ and MADV_POPULATE_WRITE tests"
+echo "--------------------------------------------------------"
+./madv_populate
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
 exit $exitcode
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 103/192] mm/memory_hotplug: rate limit page migration warnings
  2021-07-01  1:46 incoming Andrew Morton
                   ` (101 preceding siblings ...)
  2021-07-01  1:52 ` [patch 102/192] selftests/vm: add test for MADV_POPULATE_(READ|WRITE) Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 104/192] mm,memory_hotplug: drop unneeded locking Andrew Morton
                   ` (89 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, david, georgi.djakov, linux-mm, lmark, mm-commits, torvalds

From: Liam Mark <lmark@codeaurora.org>
Subject: mm/memory_hotplug: rate limit page migration warnings

When offlining memory the system can attempt to migrate a lot of pages; if
there are problems with migration, this can flood the logs.  Printing all
the data hogs the CPU and causes some RT threads to run for a long time,
which may have bad consequences.

Rate limit the page migration warnings in order to avoid this.
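
The pattern is the stock printk ratelimit helper; as a reference sketch
(report_isolation_failure() is a hypothetical wrapper, not code from this
patch):

	#include <linux/ratelimit.h>
	#include <linux/printk.h>

	static void report_isolation_failure(unsigned long pfn)
	{
		/*
		 * Allow bursts of up to DEFAULT_RATELIMIT_BURST (10)
		 * messages per DEFAULT_RATELIMIT_INTERVAL (5 * HZ); once
		 * exceeded, __ratelimit() returns false and the warning
		 * is suppressed until the interval expires.
		 */
		static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
					      DEFAULT_RATELIMIT_BURST);

		if (__ratelimit(&rs))
			pr_warn("failed to isolate pfn %lx\n", pfn);
	}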

Link: https://lkml.kernel.org/r/20210505140542.24935-1-georgi.djakov@linaro.org
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-rate-limit-page-migration-warnings
+++ a/mm/memory_hotplug.c
@@ -1406,6 +1406,8 @@ do_migrate_range(unsigned long start_pfn
 	struct page *page, *head;
 	int ret = 0;
 	LIST_HEAD(source);
+	static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		if (!pfn_valid(pfn))
@@ -1452,8 +1454,10 @@ do_migrate_range(unsigned long start_pfn
 						    page_is_file_lru(page));
 
 		} else {
-			pr_warn("failed to isolate pfn %lx\n", pfn);
-			dump_page(page, "isolation failed");
+			if (__ratelimit(&migrate_rs)) {
+				pr_warn("failed to isolate pfn %lx\n", pfn);
+				dump_page(page, "isolation failed");
+			}
 		}
 		put_page(page);
 	}
@@ -1482,9 +1486,11 @@ do_migrate_range(unsigned long start_pfn
 			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
 		if (ret) {
 			list_for_each_entry(page, &source, lru) {
-				pr_warn("migrating pfn %lx failed ret:%d ",
-				       page_to_pfn(page), ret);
-				dump_page(page, "migration failure");
+				if (__ratelimit(&migrate_rs)) {
+					pr_warn("migrating pfn %lx failed ret:%d\n",
+						page_to_pfn(page), ret);
+					dump_page(page, "migration failure");
+				}
 			}
 			putback_movable_pages(&source);
 		}
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 104/192] mm,memory_hotplug: drop unneeded locking
  2021-07-01  1:46 incoming Andrew Morton
                   ` (102 preceding siblings ...)
  2021-07-01  1:52 ` [patch 103/192] mm/memory_hotplug: rate limit page migration warnings Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 105/192] mm/zswap.c: remove unused function zswap_debugfs_exit() Andrew Morton
                   ` (88 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, anshuman.khandual, david, linux-mm, mhocko, mm-commits,
	osalvador, pasha.tatashin, torvalds, vbabka

From: Oscar Salvador <osalvador@suse.de>
Subject: mm,memory_hotplug: drop unneeded locking

Currently, memory-hotplug code takes zone's span_writelock and pgdat's
resize_lock when resizing the node/zone's spanned pages via
{move_pfn_range_to_zone(),remove_pfn_range_from_zone()} and when resizing
node and zone's present pages via adjust_present_page_count().

These locks are also taken during the initialization of the system at boot
time, where they protect parallel struct page initialization, but they
should not really be needed in memory-hotplug, where all operations are a)
synchronized on the device level and b) serialized by mem_hotplug_lock.

[akpm@linux-foundation.org: remove now-unused locals]
Link: https://lkml.kernel.org/r/20210531093958.15021-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   16 +---------------
 1 file changed, 1 insertion(+), 15 deletions(-)

--- a/mm/memory_hotplug.c~mmmemory_hotplug-drop-unneeded-locking
+++ a/mm/memory_hotplug.c
@@ -329,7 +329,6 @@ static void shrink_zone_span(struct zone
 	unsigned long pfn;
 	int nid = zone_to_nid(zone);
 
-	zone_span_writelock(zone);
 	if (zone->zone_start_pfn == start_pfn) {
 		/*
 		 * If the section is smallest section in the zone, it need
@@ -362,7 +361,6 @@ static void shrink_zone_span(struct zone
 			zone->spanned_pages = 0;
 		}
 	}
-	zone_span_writeunlock(zone);
 }
 
 static void update_pgdat_span(struct pglist_data *pgdat)
@@ -399,7 +397,7 @@ void __ref remove_pfn_range_from_zone(st
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
-	unsigned long pfn, cur_nr_pages, flags;
+	unsigned long pfn, cur_nr_pages;
 
 	/* Poison struct pages because they are now uninitialized again. */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) {
@@ -424,10 +422,8 @@ void __ref remove_pfn_range_from_zone(st
 
 	clear_zone_contiguous(zone);
 
-	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
 	update_pgdat_span(pgdat);
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
 	set_zone_contiguous(zone);
 }
@@ -634,19 +630,13 @@ void __ref move_pfn_range_to_zone(struct
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int nid = pgdat->node_id;
-	unsigned long flags;
 
 	clear_zone_contiguous(zone);
 
-	/* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */
-	pgdat_resize_lock(pgdat, &flags);
-	zone_span_writelock(zone);
 	if (zone_is_empty(zone))
 		init_currently_empty_zone(zone, start_pfn, nr_pages);
 	resize_zone_range(zone, start_pfn, nr_pages);
-	zone_span_writeunlock(zone);
 	resize_pgdat_range(pgdat, start_pfn, nr_pages);
-	pgdat_resize_unlock(pgdat, &flags);
 
 	/*
 	 * Subsection population requires care in pfn_to_online_page().
@@ -736,12 +726,8 @@ struct zone *zone_for_pfn_range(int onli
  */
 void adjust_present_page_count(struct zone *zone, long nr_pages)
 {
-	unsigned long flags;
-
 	zone->present_pages += nr_pages;
-	pgdat_resize_lock(zone->zone_pgdat, &flags);
 	zone->zone_pgdat->node_present_pages += nr_pages;
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
 int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 105/192] mm/zswap.c: remove unused function zswap_debugfs_exit()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (103 preceding siblings ...)
  2021-07-01  1:52 ` [patch 104/192] mm,memory_hotplug: drop unneeded locking Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 106/192] mm/zswap.c: avoid unnecessary copy-in at map time Andrew Morton
                   ` (87 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, bigeasy, colin.king, ddstreet, linmiaohe, linux-mm,
	mm-commits, nathan, sjenning, tiantao6, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/zswap.c: remove unused function zswap_debugfs_exit()

Patch series "Cleanup and fixup for zswap".

This series contains cleanups to remove an unused function and avoid an
unnecessary copy-in at map time.  It also fixes two bugs in the function
zswap_writeback_entry().  More details can be found in the respective
changelogs.


This patch (of 3):

zswap_debugfs_exit() is unused, remove it.

Link: https://lkml.kernel.org/r/20210522092242.3233191-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210522092242.3233191-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zswap.c |    7 -------
 1 file changed, 7 deletions(-)

--- a/mm/zswap.c~mm-zswapc-remove-unused-function-zswap_debugfs_exit
+++ a/mm/zswap.c
@@ -1427,18 +1427,11 @@ static int __init zswap_debugfs_init(voi
 
 	return 0;
 }
-
-static void __exit zswap_debugfs_exit(void)
-{
-	debugfs_remove_recursive(zswap_debugfs_root);
-}
 #else
 static int __init zswap_debugfs_init(void)
 {
 	return 0;
 }
-
-static void __exit zswap_debugfs_exit(void) { }
 #endif
 
 /*********************************
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 106/192] mm/zswap.c: avoid unnecessary copy-in at map time
  2021-07-01  1:46 incoming Andrew Morton
                   ` (104 preceding siblings ...)
  2021-07-01  1:52 ` [patch 105/192] mm/zswap.c: remove unused function zswap_debugfs_exit() Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 107/192] mm/zswap.c: fix two bugs in zswap_writeback_entry() Andrew Morton
                   ` (86 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, bigeasy, colin.king, ddstreet, linmiaohe, linux-mm,
	mm-commits, nathan, sjenning, tiantao6, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/zswap.c: avoid unnecessary copy-in at map time

The buf mapped via zpool_map_handle() is only used to store the compressed
page buffer, and there is no information to extract from it.  So we can
use ZPOOL_MM_WO instead to avoid an unnecessary copy-in at map time.
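
As context, a small sketch of how the zpool map modes are used
(store_compressed() is a hypothetical helper, not code from this patch):

	#include <linux/zpool.h>
	#include <linux/string.h>

	static void store_compressed(struct zpool *pool, unsigned long handle,
				     const void *src, size_t len)
	{
		/*
		 * ZPOOL_MM_WO tells the backend that the current contents
		 * will be overwritten, so nothing has to be copied in at
		 * map time (unlike ZPOOL_MM_RW); ZPOOL_MM_RO is the
		 * read-side analogue, skipping the copy-out at unmap.
		 */
		void *buf = zpool_map_handle(pool, handle, ZPOOL_MM_WO);

		memcpy(buf, src, len);
		zpool_unmap_handle(pool, handle);
	}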

Link: https://lkml.kernel.org/r/20210522092242.3233191-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Tian Tao <tiantao6@hisilicon.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zswap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/zswap.c~mm-zswapc-avoid-unnecessary-copy-in-at-map-time
+++ a/mm/zswap.c
@@ -1203,7 +1203,7 @@ static int zswap_frontswap_store(unsigne
 		zswap_reject_alloc_fail++;
 		goto put_dstmem;
 	}
-	buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_RW);
+	buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_WO);
 	memcpy(buf, &zhdr, hlen);
 	memcpy(buf + hlen, dst, dlen);
 	zpool_unmap_handle(entry->pool->zpool, handle);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 107/192] mm/zswap.c: fix two bugs in zswap_writeback_entry()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (105 preceding siblings ...)
  2021-07-01  1:52 ` [patch 106/192] mm/zswap.c: avoid unnecessary copy-in at map time Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01  1:52 ` [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep Andrew Morton
                   ` (85 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, bigeasy, colin.king, ddstreet, linmiaohe, linux-mm,
	mm-commits, nathan, sjenning, tiantao6, torvalds, vitaly.wool

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/zswap.c: fix two bugs in zswap_writeback_entry()

In the ZSWAP_SWAPCACHE_FAIL and ZSWAP_SWAPCACHE_EXIST cases, we forgot to
call zpool_unmap_handle() when the zpool can't sleep.  We might also sleep
in zswap_get_swap_cache_page() while holding an atomic zpool mapping.  To
fix both issues, call zpool_unmap_handle() before
zswap_get_swap_cache_page() when the zpool can't sleep.

Link: https://lkml.kernel.org/r/20210522092242.3233191-4-linmiaohe@huawei.com
Fixes: fc6697a89f56 ("mm/zswap: add the flag can_sleep_mapped")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Tian Tao <tiantao6@hisilicon.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zswap.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

--- a/mm/zswap.c~mm-zswapc-fix-two-bugs-in-zswap_writeback_entry
+++ a/mm/zswap.c
@@ -967,6 +967,13 @@ static int zswap_writeback_entry(struct
 	spin_unlock(&tree->lock);
 	BUG_ON(offset != entry->offset);
 
+	src = (u8 *)zhdr + sizeof(struct zswap_header);
+	if (!zpool_can_sleep_mapped(pool)) {
+		memcpy(tmp, src, entry->length);
+		src = tmp;
+		zpool_unmap_handle(pool, handle);
+	}
+
 	/* try to allocate swap cache page */
 	switch (zswap_get_swap_cache_page(swpentry, &page)) {
 	case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
@@ -982,17 +989,7 @@ static int zswap_writeback_entry(struct
 	case ZSWAP_SWAPCACHE_NEW: /* page is locked */
 		/* decompress */
 		acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
-
 		dlen = PAGE_SIZE;
-		src = (u8 *)zhdr + sizeof(struct zswap_header);
-
-		if (!zpool_can_sleep_mapped(pool)) {
-
-			memcpy(tmp, src, entry->length);
-			src = tmp;
-
-			zpool_unmap_handle(pool, handle);
-		}
 
 		mutex_lock(acomp_ctx->mutex);
 		sg_init_one(&input, src, entry->length);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
  2021-07-01  1:46 incoming Andrew Morton
                   ` (106 preceding siblings ...)
  2021-07-01  1:52 ` [patch 107/192] mm/zswap.c: fix two bugs in zswap_writeback_entry() Andrew Morton
@ 2021-07-01  1:52 ` Andrew Morton
  2021-07-01 14:55   ` Minchan Kim
  2021-07-01  1:53 ` [patch 109/192] mm/zsmalloc.c: remove confusing code in obj_free() Andrew Morton
                   ` (84 subsequent siblings)
  192 siblings, 1 reply; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:52 UTC (permalink / raw)
  To: akpm, linux-mm, minchan, mm-commits, senozhatsky, torvalds,
	zhaoyang.huang

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Subject: mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep

zspage_cachep was found to be merged with other kmem caches during testing,
which is not good for debugging (zs_pool->zspage_cachep shows up as another
kmem cache in a memory dumpfile).  Marking it reclaimable is also necessary,
as a shrinker has been registered for zspage.

Adding this flag helps the kernel calculate SLAB_RECLAIMABLE correctly.

Link: https://lkml.kernel.org/r/1623137297-29685-1-git-send-email-huangzhaoyang@gmail.com
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zsmalloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/zsmalloc.c~mm-zram-amend-slab_reclaim_account-on-zspage_cachep
+++ a/mm/zsmalloc.c
@@ -328,7 +328,7 @@ static int create_cache(struct zs_pool *
 		return 1;
 
 	pool->zspage_cachep = kmem_cache_create("zspage", sizeof(struct zspage),
-					0, 0, NULL);
+					0, SLAB_RECLAIM_ACCOUNT, NULL);
 	if (!pool->zspage_cachep) {
 		kmem_cache_destroy(pool->handle_cachep);
 		pool->handle_cachep = NULL;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 109/192] mm/zsmalloc.c: remove confusing code in obj_free()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (107 preceding siblings ...)
  2021-07-01  1:52 ` [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 110/192] mm/zsmalloc.c: improve readability for async_free_zspage() Andrew Morton
                   ` (83 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, minchan, mm-commits, ngupta,
	senozhatsky, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/zsmalloc.c: remove confusing code in obj_free()

Patch series "Cleanup for zsmalloc".

This series contains cleanups to remove confusing code in obj_free(),
combine two atomic ops and improve readability for async_free_zspage(). 
More details can be found in the respective changelogs.


This patch (of 2):

OBJ_ALLOCATED_TAG is only set on the handle to indicate an allocated
object; it is irrelevant to obj.  So remove this misleading code to
improve readability.

Link: https://lkml.kernel.org/r/20210624123930.1769093-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210624123930.1769093-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zsmalloc.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/zsmalloc.c~mm-zsmallocc-remove-confusing-code-in-obj_free
+++ a/mm/zsmalloc.c
@@ -1471,7 +1471,6 @@ static void obj_free(struct size_class *
 	unsigned int f_objidx;
 	void *vaddr;
 
-	obj &= ~OBJ_ALLOCATED_TAG;
 	obj_to_location(obj, &f_page, &f_objidx);
 	f_offset = (class->size * f_objidx) & ~PAGE_MASK;
 	zspage = get_zspage(f_page);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 110/192] mm/zsmalloc.c: improve readability for async_free_zspage()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (108 preceding siblings ...)
  2021-07-01  1:53 ` [patch 109/192] mm/zsmalloc.c: remove confusing code in obj_free() Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 111/192] zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK Andrew Morton
                   ` (82 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, minchan, mm-commits, ngupta,
	senozhatsky, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/zsmalloc.c: improve readability for async_free_zspage()

The class is extracted from pool->size_class[class_idx] again before
calling __free_zspage(), which makes it look like class might change after
we take the class lock.  But this is misleading, as class stays unchanged.

Link: https://lkml.kernel.org/r/20210624123930.1769093-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zsmalloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/zsmalloc.c~mm-zsmallocc-improve-readability-for-async_free_zspage
+++ a/mm/zsmalloc.c
@@ -2162,7 +2162,7 @@ static void async_free_zspage(struct wor
 		VM_BUG_ON(fullness != ZS_EMPTY);
 		class = pool->size_class[class_idx];
 		spin_lock(&class->lock);
-		__free_zspage(pool, pool->size_class[class_idx], zspage);
+		__free_zspage(pool, class, zspage);
 		spin_unlock(&class->lock);
 	}
 };
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 111/192] zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK
  2021-07-01  1:46 incoming Andrew Morton
                   ` (109 preceding siblings ...)
  2021-07-01  1:53 ` [patch 110/192] mm/zsmalloc.c: improve readability for async_free_zspage() Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 112/192] mm: fix typos and grammar error in comments Andrew Morton
                   ` (81 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, huyue2, linux-mm, minchan, mm-commits, senozhatsky,
	sergey.senozhatsky, torvalds

From: Yue Hu <huyue2@yulong.com>
Subject: zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK

backing_dev is never used when CONFIG_ZRAM_WRITEBACK is not enabled; it
was introduced by the writeback feature.  So in that case it is needless
and also hurts readability.

Link: https://lkml.kernel.org/r/20210521060544.2385-1-zbestahu@gmail.com
Signed-off-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/zram/zram_drv.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/block/zram/zram_drv.h~zram-move-backing_dev-under-macro-config_zram_writeback
+++ a/drivers/block/zram/zram_drv.h
@@ -113,8 +113,8 @@ struct zram {
 	 * zram is claimed so open request will be failed
 	 */
 	bool claim; /* Protected by bdev->bd_mutex */
-	struct file *backing_dev;
 #ifdef CONFIG_ZRAM_WRITEBACK
+	struct file *backing_dev;
 	spinlock_t wb_limit_lock;
 	bool wb_limit_enable;
 	u64 bd_wb_limit;
_

* [patch 112/192] mm: fix typos and grammar error in comments
  2021-07-01  1:46 incoming Andrew Morton
                   ` (110 preceding siblings ...)
  2021-07-01  1:53 ` [patch 111/192] zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 113/192] mm: define default value for FIRST_USER_ADDRESS Andrew Morton
                   ` (80 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: 42.hyeyoo, akpm, linux-mm, mm-commits, torvalds

From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Subject: mm: fix typos and grammar error in comments

We moves tha -> We move that in mm/swap.c
statments -> statements in include/linux/mm.h

Link: https://lkml.kernel.org/r/20210509063444.GA24745@hyeyoo
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    2 +-
 mm/swap.c          |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/mm.h~mm-fix-typos-and-grammar-error-in-comments
+++ a/include/linux/mm.h
@@ -155,7 +155,7 @@ extern int mmap_rnd_compat_bits __read_m
 /* This function must be updated when the size of struct page grows above 80
  * or reduces below 56. The idea that compiler optimizes out switch()
  * statement, and only leaves move/store instructions. Also the compiler can
- * combine write statments if they are both assignments and can be reordered,
+ * combine write statements if they are both assignments and can be reordered,
  * this can result in several of the writes here being dropped.
  */
 #define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
--- a/mm/swap.c~mm-fix-typos-and-grammar-error-in-comments
+++ a/mm/swap.c
@@ -554,7 +554,7 @@ static void lru_deactivate_file_fn(struc
 	} else {
 		/*
 		 * The page's writeback ends up during pagevec
-		 * We moves tha page into tail of inactive.
+		 * We move that page into tail of inactive.
 		 */
 		add_page_to_lru_list_tail(page, lruvec);
 		__count_vm_events(PGROTATED, nr_pages);
_

* [patch 113/192] mm: define default value for FIRST_USER_ADDRESS
  2021-07-01  1:46 incoming Andrew Morton
                   ` (111 preceding siblings ...)
  2021-07-01  1:53 ` [patch 112/192] mm: fix typos and grammar error in comments Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 114/192] mm: fix spelling mistakes Andrew Morton
                   ` (79 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, anshuman.khandual, bcain, catalin.marinas, chris,
	christophe.leroy, davem, geert, guoren, hca, James.Bottomley,
	jdike, jonas, ley.foon.tan, linux-mm, mm-commits, monstr, mpe,
	palmerdabbelt, paul.walmsley, rppt, rth, shorne,
	stefan.kristiansson, tglx, torvalds, tsbogend, vgupta, will,
	ysato

From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm: define default value for FIRST_USER_ADDRESS

Currently most platforms define FIRST_USER_ADDRESS as 0UL, duplicating
the same code all over.  Instead just define a generic default value
(i.e. 0UL) for FIRST_USER_ADDRESS and let platforms override it when
required.  This is much cleaner and reduces code.

The default FIRST_USER_ADDRESS here would be skipped in <linux/pgtable.h>
when the given platform overrides its value via <asm/pgtable.h>.
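A minimal sketch of the override mechanism (hypothetical platform, not
the exact kernel layout): <asm/pgtable.h> is included before the
fallback is tested, so a platform-provided definition wins:

    /* arch/foo/include/asm/pgtable.h -- hypothetical platform override */
    #define FIRST_USER_ADDRESS	PAGE_SIZE

    /* include/linux/pgtable.h */
    #include <asm/pgtable.h>
    #ifndef FIRST_USER_ADDRESS
    #define FIRST_USER_ADDRESS	0UL	/* generic default */
    #endif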

Link: https://lkml.kernel.org/r/1620615725-24623-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Acked-by: Guo Ren <guoren@kernel.org>			[csky]
Acked-by: Stafford Horne <shorne@gmail.com>		[openrisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>	[RISC-V]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/include/asm/pgtable.h             |    1 -
 arch/arc/include/asm/pgtable.h               |    6 ------
 arch/arm64/include/asm/pgtable.h             |    2 --
 arch/csky/include/asm/pgtable.h              |    1 -
 arch/hexagon/include/asm/pgtable.h           |    3 ---
 arch/ia64/include/asm/pgtable.h              |    1 -
 arch/m68k/include/asm/pgtable_mm.h           |    1 -
 arch/microblaze/include/asm/pgtable.h        |    2 --
 arch/mips/include/asm/pgtable-32.h           |    1 -
 arch/mips/include/asm/pgtable-64.h           |    1 -
 arch/nios2/include/asm/pgtable.h             |    2 --
 arch/openrisc/include/asm/pgtable.h          |    1 -
 arch/parisc/include/asm/pgtable.h            |    2 --
 arch/powerpc/include/asm/book3s/pgtable.h    |    1 -
 arch/powerpc/include/asm/nohash/32/pgtable.h |    1 -
 arch/powerpc/include/asm/nohash/64/pgtable.h |    2 --
 arch/riscv/include/asm/pgtable.h             |    2 --
 arch/s390/include/asm/pgtable.h              |    2 --
 arch/sh/include/asm/pgtable.h                |    2 --
 arch/sparc/include/asm/pgtable_32.h          |    1 -
 arch/sparc/include/asm/pgtable_64.h          |    3 ---
 arch/um/include/asm/pgtable-2level.h         |    1 -
 arch/um/include/asm/pgtable-3level.h         |    1 -
 arch/x86/include/asm/pgtable_types.h         |    2 --
 arch/xtensa/include/asm/pgtable.h            |    1 -
 include/linux/pgtable.h                      |    9 +++++++++
 26 files changed, 9 insertions(+), 43 deletions(-)

--- a/arch/alpha/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/alpha/include/asm/pgtable.h
@@ -46,7 +46,6 @@ struct vm_area_struct;
 #define PTRS_PER_PMD	(1UL << (PAGE_SHIFT-3))
 #define PTRS_PER_PGD	(1UL << (PAGE_SHIFT-3))
 #define USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
-#define FIRST_USER_ADDRESS	0UL
 
 /* Number of pointers that fit on a page:  this will go away. */
 #define PTRS_PER_PAGE	(1UL << (PAGE_SHIFT-3))
--- a/arch/arc/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/arc/include/asm/pgtable.h
@@ -222,12 +222,6 @@
  */
 #define	USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
 
-/*
- * No special requirements for lowest virtual address we permit any user space
- * mapping to be mapped at.
- */
-#define FIRST_USER_ADDRESS      0UL
-
 
 /****************************************************************
  * Bucket load of VM Helpers
--- a/arch/arm64/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/arm64/include/asm/pgtable.h
@@ -26,8 +26,6 @@
 
 #define vmemmap			((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
 
-#define FIRST_USER_ADDRESS	0UL
-
 #ifndef __ASSEMBLY__
 
 #include <asm/cmpxchg.h>
--- a/arch/csky/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/csky/include/asm/pgtable.h
@@ -14,7 +14,6 @@
 #define PGDIR_MASK		(~(PGDIR_SIZE-1))
 
 #define USER_PTRS_PER_PGD	(PAGE_OFFSET/PGDIR_SIZE)
-#define FIRST_USER_ADDRESS	0UL
 
 /*
  * C-SKY is two-level paging structure:
--- a/arch/hexagon/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/hexagon/include/asm/pgtable.h
@@ -155,9 +155,6 @@ extern unsigned long _dflt_cache_att;
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];  /* located in head.S */
 
-/* Seems to be zero even in architectures where the zero page is firewalled? */
-#define FIRST_USER_ADDRESS 0UL
-
 /*  HUGETLB not working currently  */
 #ifdef CONFIG_HUGETLB_PAGE
 #define pte_mkhuge(pte) __pte((pte_val(pte) & ~0x3) | HVM_HUGEPAGE_SIZE)
--- a/arch/ia64/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/ia64/include/asm/pgtable.h
@@ -128,7 +128,6 @@
 #define PTRS_PER_PGD_SHIFT	PTRS_PER_PTD_SHIFT
 #define PTRS_PER_PGD		(1UL << PTRS_PER_PGD_SHIFT)
 #define USER_PTRS_PER_PGD	(5*PTRS_PER_PGD/8)	/* regions 0-4 are user regions */
-#define FIRST_USER_ADDRESS	0UL
 
 /*
  * All the normal masks have the "page accessed" bits on, as any time
--- a/arch/m68k/include/asm/pgtable_mm.h~mm-define-default-value-for-first_user_address
+++ a/arch/m68k/include/asm/pgtable_mm.h
@@ -72,7 +72,6 @@
 #define PTRS_PER_PGD	128
 #endif
 #define USER_PTRS_PER_PGD	(TASK_SIZE/PGDIR_SIZE)
-#define FIRST_USER_ADDRESS	0UL
 
 /* Virtual address region for use by kernel_map() */
 #ifdef CONFIG_SUN3
--- a/arch/microblaze/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/microblaze/include/asm/pgtable.h
@@ -25,8 +25,6 @@ extern int mem_init_done;
 #include <asm/mmu.h>
 #include <asm/page.h>
 
-#define FIRST_USER_ADDRESS	0UL
-
 extern unsigned long va_to_phys(unsigned long address);
 extern pte_t *va_to_pte(unsigned long address);
 
--- a/arch/mips/include/asm/pgtable-32.h~mm-define-default-value-for-first_user_address
+++ a/arch/mips/include/asm/pgtable-32.h
@@ -93,7 +93,6 @@ extern int add_temporary_entry(unsigned
 #endif
 
 #define USER_PTRS_PER_PGD	(0x80000000UL/PGDIR_SIZE)
-#define FIRST_USER_ADDRESS	0UL
 
 #define VMALLOC_START	  MAP_BASE
 
--- a/arch/mips/include/asm/pgtable-64.h~mm-define-default-value-for-first_user_address
+++ a/arch/mips/include/asm/pgtable-64.h
@@ -137,7 +137,6 @@
 #define PTRS_PER_PTE	((PAGE_SIZE << PTE_ORDER) / sizeof(pte_t))
 
 #define USER_PTRS_PER_PGD       ((TASK_SIZE64 / PGDIR_SIZE)?(TASK_SIZE64 / PGDIR_SIZE):1)
-#define FIRST_USER_ADDRESS	0UL
 
 /*
  * TLB refill handlers also map the vmalloc area into xuseg.  Avoid
--- a/arch/nios2/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/nios2/include/asm/pgtable.h
@@ -24,8 +24,6 @@
 #include <asm/pgtable-bits.h>
 #include <asm-generic/pgtable-nopmd.h>
 
-#define FIRST_USER_ADDRESS	0UL
-
 #define VMALLOC_START		CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
 #define VMALLOC_END		(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
 
--- a/arch/openrisc/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/openrisc/include/asm/pgtable.h
@@ -73,7 +73,6 @@ extern void paging_init(void);
  */
 
 #define USER_PTRS_PER_PGD       (TASK_SIZE/PGDIR_SIZE)
-#define FIRST_USER_ADDRESS      0UL
 
 /*
  * Kernels own virtual memory area.
--- a/arch/parisc/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/parisc/include/asm/pgtable.h
@@ -171,8 +171,6 @@ static inline void purge_tlb_entries(str
  * pgd entries used up by user/kernel:
  */
 
-#define FIRST_USER_ADDRESS	0UL
-
 /* NB: The tlb miss handlers make certain assumptions about the order */
 /*     of the following bits, so be careful (One example, bits 25-31  */
 /*     are moved together in one instruction).                        */
--- a/arch/powerpc/include/asm/book3s/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/powerpc/include/asm/book3s/pgtable.h
@@ -8,7 +8,6 @@
 #include <asm/book3s/32/pgtable.h>
 #endif
 
-#define FIRST_USER_ADDRESS	0UL
 #ifndef __ASSEMBLY__
 /* Insert a PTE, top-level function is out of line. It uses an inline
  * low level function in the respective pgtable-* files
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -54,7 +54,6 @@ extern int icache_44x_need_flush;
 #define PGD_MASKED_BITS		0
 
 #define USER_PTRS_PER_PGD	(TASK_SIZE / PGDIR_SIZE)
-#define FIRST_USER_ADDRESS	0UL
 
 #define pte_ERROR(e) \
 	pr_err("%s:%d: bad pte %llx.\n", __FILE__, __LINE__, \
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -12,8 +12,6 @@
 #include <asm/barrier.h>
 #include <asm/asm-const.h>
 
-#define FIRST_USER_ADDRESS	0UL
-
 /*
  * Size of EA range mapped by our pagetables.
  */
--- a/arch/riscv/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/riscv/include/asm/pgtable.h
@@ -536,8 +536,6 @@ void setup_bootmem(void);
 void paging_init(void);
 void misc_mem_init(void);
 
-#define FIRST_USER_ADDRESS  0
-
 /*
  * ZERO_PAGE is a global shared page that is always zero,
  * used for zero-mapped memory areas, etc.
--- a/arch/s390/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/s390/include/asm/pgtable.h
@@ -65,8 +65,6 @@ extern unsigned long zero_page_mask;
 
 /* TODO: s390 cannot support io_remap_pfn_range... */
 
-#define FIRST_USER_ADDRESS  0UL
-
 #define pte_ERROR(e) \
 	printk("%s:%d: bad pte %p.\n", __FILE__, __LINE__, (void *) pte_val(e))
 #define pmd_ERROR(e) \
--- a/arch/sh/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/sh/include/asm/pgtable.h
@@ -59,8 +59,6 @@ static inline unsigned long long neff_si
 /* Entries per level */
 #define PTRS_PER_PTE	(PAGE_SIZE / (1 << PTE_MAGNITUDE))
 
-#define FIRST_USER_ADDRESS	0UL
-
 #define PHYS_ADDR_MASK29		0x1fffffff
 #define PHYS_ADDR_MASK32		0xffffffff
 
--- a/arch/sparc/include/asm/pgtable_32.h~mm-define-default-value-for-first_user_address
+++ a/arch/sparc/include/asm/pgtable_32.h
@@ -48,7 +48,6 @@ unsigned long __init bootmem_init(unsign
 #define PTRS_PER_PMD    	64
 #define PTRS_PER_PGD    	256
 #define USER_PTRS_PER_PGD	PAGE_OFFSET / PGDIR_SIZE
-#define FIRST_USER_ADDRESS	0UL
 #define PTE_SIZE		(PTRS_PER_PTE*4)
 
 #define PAGE_NONE	SRMMU_PAGE_NONE
--- a/arch/sparc/include/asm/pgtable_64.h~mm-define-default-value-for-first_user_address
+++ a/arch/sparc/include/asm/pgtable_64.h
@@ -95,9 +95,6 @@ bool kern_addr_valid(unsigned long addr)
 #define PTRS_PER_PUD	(1UL << PUD_BITS)
 #define PTRS_PER_PGD	(1UL << PGDIR_BITS)
 
-/* Kernel has a separate 44bit address space. */
-#define FIRST_USER_ADDRESS	0UL
-
 #define pmd_ERROR(e)							\
 	pr_err("%s:%d: bad pmd %p(%016lx) seen at (%pS)\n",		\
 	       __FILE__, __LINE__, &(e), pmd_val(e), __builtin_return_address(0))
--- a/arch/um/include/asm/pgtable-2level.h~mm-define-default-value-for-first_user_address
+++ a/arch/um/include/asm/pgtable-2level.h
@@ -23,7 +23,6 @@
 #define PTRS_PER_PTE	1024
 #define USER_PTRS_PER_PGD ((TASK_SIZE + (PGDIR_SIZE - 1)) / PGDIR_SIZE)
 #define PTRS_PER_PGD	1024
-#define FIRST_USER_ADDRESS	0UL
 
 #define pte_ERROR(e) \
         printk("%s:%d: bad pte %p(%08lx).\n", __FILE__, __LINE__, &(e), \
--- a/arch/um/include/asm/pgtable-3level.h~mm-define-default-value-for-first_user_address
+++ a/arch/um/include/asm/pgtable-3level.h
@@ -41,7 +41,6 @@
 #endif
 
 #define USER_PTRS_PER_PGD ((TASK_SIZE + (PGDIR_SIZE - 1)) / PGDIR_SIZE)
-#define FIRST_USER_ADDRESS	0UL
 
 #define pte_ERROR(e) \
         printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), \
--- a/arch/x86/include/asm/pgtable_types.h~mm-define-default-value-for-first_user_address
+++ a/arch/x86/include/asm/pgtable_types.h
@@ -7,8 +7,6 @@
 
 #include <asm/page_types.h>
 
-#define FIRST_USER_ADDRESS	0UL
-
 #define _PAGE_BIT_PRESENT	0	/* is present */
 #define _PAGE_BIT_RW		1	/* writeable */
 #define _PAGE_BIT_USER		2	/* userspace addressable */
--- a/arch/xtensa/include/asm/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/arch/xtensa/include/asm/pgtable.h
@@ -59,7 +59,6 @@
 #define PTRS_PER_PGD		1024
 #define PGD_ORDER		0
 #define USER_PTRS_PER_PGD	(TASK_SIZE/PGDIR_SIZE)
-#define FIRST_USER_ADDRESS	0UL
 #define FIRST_USER_PGD_NR	(FIRST_USER_ADDRESS >> PGDIR_SHIFT)
 
 #ifdef CONFIG_MMU
--- a/include/linux/pgtable.h~mm-define-default-value-for-first_user_address
+++ a/include/linux/pgtable.h
@@ -29,6 +29,15 @@
 #endif
 
 /*
+ * This defines the first usable user address. Platforms
+ * can override its value with custom FIRST_USER_ADDRESS
+ * defined in their respective <asm/pgtable.h>.
+ */
+#ifndef FIRST_USER_ADDRESS
+#define FIRST_USER_ADDRESS	0UL
+#endif
+
+/*
  * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
  *
  * The pXx_index() functions return the index of the entry in the page
_

* [patch 114/192] mm: fix spelling mistakes
  2021-07-01  1:46 incoming Andrew Morton
                   ` (112 preceding siblings ...)
  2021-07-01  1:53 ` [patch 113/192] mm: define default value for FIRST_USER_ADDRESS Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 115/192] mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages Andrew Morton
                   ` (78 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, jrdr.linux, linux-mm, mm-commits, thunder.leizhen, torvalds

From: Zhen Lei <thunder.leizhen@huawei.com>
Subject: mm: fix spelling mistakes

Fix some spelling mistakes in comments:
each having differents usage ==> each has a different usage
statments ==> statements
adresses ==> addresses
aggresive ==> aggressive
datas ==> data
posion ==> poison
higer ==> higher
precisly ==> precisely
wont ==> won't
We moves tha ==> We move the
endianess ==> endianness

Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memremap.h |    2 +-
 include/linux/mm_types.h |    2 +-
 include/linux/mmzone.h   |    2 +-
 mm/memory-failure.c      |    2 +-
 mm/memory_hotplug.c      |    4 ++--
 mm/page_alloc.c          |    2 +-
 mm/swapfile.c            |    2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

--- a/include/linux/memremap.h~mm-clear-spelling-mistakes
+++ a/include/linux/memremap.h
@@ -26,7 +26,7 @@ struct vmem_altmap {
 };
 
 /*
- * Specialize ZONE_DEVICE memory into multiple types each having differents
+ * Specialize ZONE_DEVICE memory into multiple types each has a different
  * usage.
  *
  * MEMORY_DEVICE_PRIVATE:
--- a/include/linux/mm_types.h~mm-clear-spelling-mistakes
+++ a/include/linux/mm_types.h
@@ -397,7 +397,7 @@ struct mm_struct {
 		unsigned long mmap_base;	/* base of mmap area */
 		unsigned long mmap_legacy_base;	/* base of mmap area in bottom-up allocations */
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
-		/* Base adresses for compatible mmap() */
+		/* Base addresses for compatible mmap() */
 		unsigned long mmap_compat_base;
 		unsigned long mmap_compat_legacy_base;
 #endif
--- a/include/linux/mmzone.h~mm-clear-spelling-mistakes
+++ a/include/linux/mmzone.h
@@ -114,7 +114,7 @@ static inline bool free_area_empty(struc
 struct pglist_data;
 
 /*
- * Add a wild amount of padding here to ensure datas fall into separate
+ * Add a wild amount of padding here to ensure data fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
  */
--- a/mm/memory-failure.c~mm-clear-spelling-mistakes
+++ a/mm/memory-failure.c
@@ -1340,7 +1340,7 @@ static bool hwpoison_user_mappings(struc
 			 * could potentially call huge_pmd_unshare.  Because of
 			 * this, take semaphore in write mode here and set
 			 * TTU_RMAP_LOCKED to indicate we have taken the lock
-			 * at this higer level.
+			 * at this higher level.
 			 */
 			mapping = hugetlb_page_mapping_lock_write(hpage);
 			if (mapping) {
--- a/mm/memory_hotplug.c~mm-clear-spelling-mistakes
+++ a/mm/memory_hotplug.c
@@ -783,7 +783,7 @@ int __ref online_pages(unsigned long pfn
 
 	/*
 	 * {on,off}lining is constrained to full memory sections (or more
-	 * precisly to memory blocks from the user space POV).
+	 * precisely to memory blocks from the user space POV).
 	 * memmap_on_memory is an exception because it reserves initial part
 	 * of the physical memory space for vmemmaps. That space is pageblock
 	 * aligned.
@@ -1580,7 +1580,7 @@ int __ref offline_pages(unsigned long st
 
 	/*
 	 * {on,off}lining is constrained to full memory sections (or more
-	 * precisly to memory blocks from the user space POV).
+	 * precisely to memory blocks from the user space POV).
 	 * memmap_on_memory is an exception because it reserves initial part
 	 * of the physical memory space for vmemmaps. That space is pageblock
 	 * aligned.
--- a/mm/page_alloc.c~mm-clear-spelling-mistakes
+++ a/mm/page_alloc.c
@@ -3180,7 +3180,7 @@ static void __drain_all_pages(struct zon
 	int cpu;
 
 	/*
-	 * Allocate in the BSS so we wont require allocation in
+	 * Allocate in the BSS so we won't require allocation in
 	 * direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y
 	 */
 	static cpumask_t cpus_with_pcps;
--- a/mm/swapfile.c~mm-clear-spelling-mistakes
+++ a/mm/swapfile.c
@@ -2967,7 +2967,7 @@ static unsigned long read_swap_header(st
 		return 0;
 	}
 
-	/* swap partition endianess hack... */
+	/* swap partition endianness hack... */
 	if (swab32(swap_header->info.version) == 1) {
 		swab32s(&swap_header->info.version);
 		swab32s(&swap_header->info.last_page);
_

* [patch 115/192] mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages
  2021-07-01  1:46 incoming Andrew Morton
                   ` (113 preceding siblings ...)
  2021-07-01  1:53 ` [patch 114/192] mm: fix spelling mistakes Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 116/192] mm/vmalloc: include header for prototype of set_iounmap_nonlazy Andrew Morton
                   ` (77 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages

Patch series "Clean W=1 build warnings for mm/".

This is janitorial only.  During development of a tool to catch build
warnings early to avoid tripping the Intel lkp-robot, I noticed that mm/
is not clean for W=1.  This is generally harmless but there is no harm in
cleaning it up.  It disrupts git blame a little but on relatively obvious
lines that are unlikely to be git blame targets.


This patch (of 13):

make W=1 generates the following warning for vmscan.c

    mm/vmscan.c:1814: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst

It is not a kerneldoc comment and isolate_lru_pages() is a static
function.  While the detailed comment is nice, it does not need to be
exposed via kernel-doc.
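For reference, the opening marker alone decides whether
scripts/kernel-doc parses a comment (hypothetical function name):

    /**
     * frob() - parsed as kernel-doc, so it must follow the kernel-doc layout.
     */

    /*
     * Plain comment: free-form and ignored by kernel-doc, which suits a
     * detailed note on a static helper such as isolate_lru_pages().
     */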

Link: https://lkml.kernel.org/r/20210520084809.8576-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210520084809.8576-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmscan.c~mm-vmscan-remove-kerneldoc-like-comment-from-isolate_lru_pages
+++ a/mm/vmscan.c
@@ -1821,7 +1821,7 @@ static __always_inline void update_lru_s
 
 }
 
-/**
+/*
  * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
  *
  * lruvec->lru_lock is heavily contended.  Some of the functions that
_

* [patch 116/192] mm/vmalloc: include header for prototype of set_iounmap_nonlazy
  2021-07-01  1:46 incoming Andrew Morton
                   ` (114 preceding siblings ...)
  2021-07-01  1:53 ` [patch 115/192] mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 117/192] mm/page_alloc: make should_fail_alloc_page() static Andrew Morton
                   ` (76 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/vmalloc: include header for prototype of set_iounmap_nonlazy

make W=1 generates the following warning for mm/vmalloc.c

  mm/vmalloc.c:1599:6: warning: no previous prototype for `set_iounmap_nonlazy' [-Wmissing-prototypes]
   void set_iounmap_nonlazy(void)
        ^~~~~~~~~~~~~~~~~~~

This function lives in generic code but is only used by x86; on other
arches it is dead code.  Include the header that carries its
declaration and make the definition x86-64 specific.
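The general shape of this kind of fix: -Wmissing-prototypes fires when
a non-static function is defined without a declaration in scope, so
pull in the declaring header (and here, compile the definition only
where it is used).  A sketch, assuming the declaration is reached via
<linux/io.h>:

    /* header reached via <linux/io.h>: the declaration */
    void set_iounmap_nonlazy(void);

    /* mm/vmalloc.c: the definition now has a previous prototype */
    #include <linux/io.h>

    #ifdef CONFIG_X86_64
    void set_iounmap_nonlazy(void)
    {
            atomic_long_set(&vmap_lazy_nr, lazy_max_pages() + 1);
    }
    #endif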

Link: https://lkml.kernel.org/r/20210520084809.8576-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/mm/vmalloc.c~mm-vmalloc-include-header-for-prototype-of-set_iounmap_nonlazy
+++ a/mm/vmalloc.c
@@ -25,6 +25,7 @@
 #include <linux/notifier.h>
 #include <linux/rbtree.h>
 #include <linux/xarray.h>
+#include <linux/io.h>
 #include <linux/rcupdate.h>
 #include <linux/pfn.h>
 #include <linux/kmemleak.h>
@@ -1607,6 +1608,7 @@ static DEFINE_MUTEX(vmap_purge_lock);
 /* for per-CPU blocks */
 static void purge_fragmented_blocks_allcpus(void);
 
+#ifdef CONFIG_X86_64
 /*
  * called before a call to iounmap() if the caller wants vm_area_struct's
  * immediately freed.
@@ -1615,6 +1617,7 @@ void set_iounmap_nonlazy(void)
 {
 	atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1);
 }
+#endif /* CONFIG_X86_64 */
 
 /*
  * Purges all lazily-freed vmap areas.
_

* [patch 117/192] mm/page_alloc: make should_fail_alloc_page() static
  2021-07-01  1:46 incoming Andrew Morton
                   ` (115 preceding siblings ...)
  2021-07-01  1:53 ` [patch 116/192] mm/vmalloc: include header for prototype of set_iounmap_nonlazy Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 118/192] mm/mapping_dirty_helpers: remove double Note in kerneldoc Andrew Morton
                   ` (75 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/page_alloc: make should_fail_alloc_page() static

make W=1 generates the following warning for mm/page_alloc.c

  mm/page_alloc.c:3651:15: warning: no previous prototype for `should_fail_alloc_page' [-Wmissing-prototypes]
   noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
                 ^~~~~~~~~~~~~~~~~~~~~~

This function is deliberately split out so that BPF can inject errors
into it.  It is not used anywhere else, so it is local to the file.
Make it static; error injection still works on a static function,
similar to how block/blk-core.c:should_fail_bio() works.
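A sketch of that pattern (hypothetical function and helper;
should_fail_bio() is the real-world model): the function stays static
but keeps a symbol via noinline and is explicitly marked injectable, so
BPF can still override its return value by name:

    #include <linux/error-injection.h>

    static noinline bool should_fail_frob(struct frob *f)
    {
            return __should_fail_frob(f);   /* hypothetical split-out check */
    }
    ALLOW_ERROR_INJECTION(should_fail_frob, TRUE);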

Link: https://lkml.kernel.org/r/20210520084809.8576-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-page_alloc-make-should_fail_alloc_page-a-static-function-should_fail_alloc_page-static
+++ a/mm/page_alloc.c
@@ -3819,7 +3819,7 @@ static inline bool __should_fail_alloc_p
 
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
-noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
+static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 {
 	return __should_fail_alloc_page(gfp_mask, order);
 }
_

* [patch 118/192] mm/mapping_dirty_helpers: remove double Note in kerneldoc
  2021-07-01  1:46 incoming Andrew Morton
                   ` (116 preceding siblings ...)
  2021-07-01  1:53 ` [patch 117/192] mm/page_alloc: make should_fail_alloc_page() static Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 119/192] mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection Andrew Morton
                   ` (74 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/mapping_dirty_helpers: remove double Note in kerneldoc

make W=1 generates the following warning for mm/mapping_dirty_helpers.c

mm/mapping_dirty_helpers.c:325: warning: duplicate section name 'Note'

The helper function is very specific to one driver -- vmwgfx.  While
the two notes cover separate points, all of it needs to be taken into
account when using the helper, so merge them into one note.

Link: https://lkml.kernel.org/r/20210520084809.8576-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mapping_dirty_helpers.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/mapping_dirty_helpers.c~mm-mapping_dirty_helpers-remove-double-note-in-kerneldoc
+++ a/mm/mapping_dirty_helpers.c
@@ -317,7 +317,7 @@ EXPORT_SYMBOL_GPL(wp_shared_mapping_rang
  * pfn_mkwrite(). And then after a TLB flush following the write-protection
  * pick up all dirty bits.
  *
- * Note: This function currently skips transhuge page-table entries, since
+ * This function currently skips transhuge page-table entries, since
  * it's intended for dirty-tracking on the PTE level. It will warn on
  * encountering transhuge dirty entries, though, and can easily be extended
  * to handle them as well.
_

* [patch 119/192] mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection
  2021-07-01  1:46 incoming Andrew Morton
                   ` (117 preceding siblings ...)
  2021-07-01  1:53 ` [patch 118/192] mm/mapping_dirty_helpers: remove double Note in kerneldoc Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 120/192] mm/memory_hotplug: fix kerneldoc comment for __try_online_node Andrew Morton
                   ` (73 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, chris, david, ddstreet, linux-mm, mgorman, mhocko,
	mm-commits, shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection

make W=1 generates the following warning for mem_cgroup_calculate_protection

  mm/memcontrol.c:6468: warning: expecting prototype for mem_cgroup_protected(). Prototype was for mem_cgroup_calculate_protection() instead

Commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from
protection checks") changed the function definition but not the associated
kerneldoc comment.
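Generic illustration (hypothetical function): kernel-doc matches the
name on the first line of the comment against the definition that
follows, so renaming the function means updating both:

    /**
     * frob_calculate_protection() - the name here must match the definition
     *	below, or kernel-doc warns "expecting prototype for ...".
     * @f: object being checked
     */
    static void frob_calculate_protection(struct frob *f)
    {
    }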

Link: https://lkml.kernel.org/r/20210520084809.8576-7-mgorman@techsingularity.net
Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memcontrol.c~mm-memcontrolc-fix-kerneldoc-comment-for-mem_cgroup_calculate_protection
+++ a/mm/memcontrol.c
@@ -6639,7 +6639,7 @@ static unsigned long effective_protectio
 }
 
 /**
- * mem_cgroup_protected - check if memory consumption is in the normal range
+ * mem_cgroup_calculate_protection - check if memory consumption is in the normal range
  * @root: the top ancestor of the sub-tree being checked
  * @memcg: the memory cgroup to check
  *
_

* [patch 120/192] mm/memory_hotplug: fix kerneldoc comment for __try_online_node
  2021-07-01  1:46 incoming Andrew Morton
                   ` (118 preceding siblings ...)
  2021-07-01  1:53 ` [patch 119/192] mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 121/192] mm/memory_hotplug: fix kerneldoc comment for __remove_memory Andrew Morton
                   ` (72 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/memory_hotplug: fix kerneldoc comment for __try_online_node

make W=1 generates the following warning for try_online_node

mm/memory_hotplug.c:1087: warning: expecting prototype for try_online_node(). Prototype was for __try_online_node() instead

Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use
__try_online_node") renamed the function but did not update the associated
kerneldoc.  The function is static and somewhat specialised, so it is
not clear that moving the comment to try_online_node() just to keep it
as kerneldoc is warranted.  Hence, leave the internal helper's comment
in place, take it out of kerneldoc, and correct the function name in
it.

Link: https://lkml.kernel.org/r/20210520084809.8576-8-mgorman@techsingularity.net
Fixes: b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-fix-kerneldoc-comment-for-__try_online_node
+++ a/mm/memory_hotplug.c
@@ -942,8 +942,8 @@ static void rollback_node_hotadd(int nid
 }
 
 
-/**
- * try_online_node - online a node if offlined
+/*
+ * __try_online_node - online a node if offlined
  * @nid: the node ID
  * @set_node_online: Whether we want to online the node
  * called by cpu_up() to online a node without onlined memory.
_

* [patch 121/192] mm/memory_hotplug: fix kerneldoc comment for __remove_memory
  2021-07-01  1:46 incoming Andrew Morton
                   ` (119 preceding siblings ...)
  2021-07-01  1:53 ` [patch 120/192] mm/memory_hotplug: fix kerneldoc comment for __try_online_node Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 122/192] mm/zbud: add kerneldoc fields for zbud_pool Andrew Morton
                   ` (71 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/memory_hotplug: fix kerneldoc comment for __remove_memory

make W=1 generates the following warning for __remove_memory

  mm/memory_hotplug.c:2044: warning: expecting prototype for remove_memory(). Prototype was for __remove_memory() instead

Commit eca499ab3749 ("mm/hotplug: make remove_memory() interface usable")
introduced the kerneldoc comment and function but the kerneldoc name and
function name did not match.

Link: https://lkml.kernel.org/r/20210520084809.8576-9-mgorman@techsingularity.net
Fixes: eca499ab3749 ("mm/hotplug: make remove_memory() interface usable")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-fix-kerneldoc-comment-for-__remove_memory
+++ a/mm/memory_hotplug.c
@@ -1908,7 +1908,7 @@ static int __ref try_remove_memory(int n
 }
 
 /**
- * remove_memory
+ * __remove_memory - Remove memory if every memory block is offline
  * @nid: the node ID
  * @start: physical address of the region to remove
  * @size: size of the region to remove
_

* [patch 122/192] mm/zbud: add kerneldoc fields for zbud_pool
  2021-07-01  1:46 incoming Andrew Morton
                   ` (120 preceding siblings ...)
  2021-07-01  1:53 ` [patch 121/192] mm/memory_hotplug: fix kerneldoc comment for __remove_memory Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 123/192] mm/z3fold: add kerneldoc fields for z3fold_pool Andrew Morton
                   ` (70 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/zbud: add kerneldoc fields for zbud_pool

make W=1 generates the following warning for zbud_pool

  mm/zbud.c:105: warning: Function parameter or member 'zpool' not described in 'zbud_pool'
  mm/zbud.c:105: warning: Function parameter or member 'zpool_ops' not described in 'zbud_pool'

Commit 479305fd7172 ("zpool: remove zpool_evict()") removed the
zpool_evict helper and added the associated zpool and operations structure
in struct zbud_pool but did not add documentation for the fields.  Add
rudimentary documentation.
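The convention being satisfied (hypothetical struct, real field names
from the diff below): every member of a struct documented with
kernel-doc needs a matching @name: line:

    /**
     * struct frob_pool - tracks a pool of frob pages
     * @lock:	protects all fields of this structure
     * @zpool:	zpool driver
     * @zpool_ops:	zpool operations structure with an evict callback
     */
    struct frob_pool {
            spinlock_t lock;
            struct zpool *zpool;
            const struct zpool_ops *zpool_ops;
    };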

Link: https://lkml.kernel.org/r/20210520084809.8576-10-mgorman@techsingularity.net
Fixes: 479305fd7172 ("zpool: remove zpool_evict()")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zbud.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/mm/zbud.c~mm-zbud-add-kerneldoc-fields-for-zbud_pool
+++ a/mm/zbud.c
@@ -92,6 +92,8 @@ struct zbud_ops {
  * @pages_nr:	number of zbud pages in the pool.
  * @ops:	pointer to a structure of user defined operations specified at
  *		pool creation time.
+ * @zpool:	zpool driver
+ * @zpool_ops:	zpool operations structure with an evict callback
  *
  * This structure is allocated at pool creation time and maintains metadata
  * pertaining to a particular zbud pool.
_

* [patch 123/192] mm/z3fold: add kerneldoc fields for z3fold_pool
  2021-07-01  1:46 incoming Andrew Morton
                   ` (121 preceding siblings ...)
  2021-07-01  1:53 ` [patch 122/192] mm/zbud: add kerneldoc fields for zbud_pool Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 124/192] mm/swap: make swap_address_space an inline function Andrew Morton
                   ` (69 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/z3fold: add kerneldoc fields for z3fold_pool

make W=1 generates the following warning for z3fold_pool

  mm/z3fold.c:171: warning: Function parameter or member 'zpool' not described in 'z3fold_pool'
  mm/z3fold.c:171: warning: Function parameter or member 'zpool_ops' not described in 'z3fold_pool'

Commit 9a001fc19ccc ("z3fold: the 3-fold allocator for compressed pages")
simply did not document the fields at the time.  Add rudimentary
documentation.

Link: https://lkml.kernel.org/r/20210520084809.8576-11-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/mm/z3fold.c~mm-z3fold-add-kerneldoc-fields-for-z3fold_pool
+++ a/mm/z3fold.c
@@ -144,6 +144,8 @@ struct z3fold_header {
  * @c_handle:	cache for z3fold_buddy_slots allocation
  * @ops:	pointer to a structure of user defined operations specified at
  *		pool creation time.
+ * @zpool:	zpool driver
+ * @zpool_ops:	zpool operations structure with an evict callback
  * @compact_wq:	workqueue for page layout background optimization
  * @release_wq:	workqueue for safe page release
  * @work:	work_struct for safe page release
_

* [patch 124/192] mm/swap: make swap_address_space an inline function
  2021-07-01  1:46 incoming Andrew Morton
                   ` (122 preceding siblings ...)
  2021-07-01  1:53 ` [patch 123/192] mm/z3fold: add kerneldoc fields for z3fold_pool Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 125/192] mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations Andrew Morton
                   ` (68 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/swap: make swap_address_space an inline function

make W=1 generates the following warning in page_mapping() for allnoconfig

  mm/util.c:700:15: warning: variable `entry' set but not used [-Wunused-but-set-variable]
     swp_entry_t entry;
                 ^~~~~

swap_address_space() is a #define on !CONFIG_SWAP configurations, so
`entry' is assigned but never used.  Make the helper an inline function
to suppress the warning, add type checking, and keep evaluating any
side effects in the argument.
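Illustrative comparison (this mirrors the hunk below): the macro throws
its argument away unevaluated and untyped, while the inline function
type-checks swp_entry_t and evaluates whatever expression is passed in:

    /* before: the argument is discarded entirely */
    #define swap_address_space(entry)		(NULL)

    /* after: the argument is typed and evaluated */
    static inline struct address_space *swap_address_space(swp_entry_t entry)
    {
            return NULL;
    }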

Link: https://lkml.kernel.org/r/20210520084809.8576-12-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swap.h |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- a/include/linux/swap.h~mm-swap-make-swap_address_space-an-inline-function
+++ a/include/linux/swap.h
@@ -537,7 +537,11 @@ static inline void put_swap_device(struc
 {
 }
 
-#define swap_address_space(entry)		(NULL)
+static inline struct address_space *swap_address_space(swp_entry_t entry)
+{
+	return NULL;
+}
+
 #define get_nr_swap_pages()			0L
 #define total_swap_pages			0L
 #define total_swapcache_pages()			0UL
_

* [patch 125/192] mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations
  2021-07-01  1:46 incoming Andrew Morton
                   ` (123 preceding siblings ...)
  2021-07-01  1:53 ` [patch 124/192] mm/swap: make swap_address_space an inline function Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 126/192] mm/page_alloc: move prototype for find_suitable_fallback Andrew Morton
                   ` (67 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, cuibixuan, david, ddstreet, linux-mm, mgorman, mhocko,
	mm-commits, shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations

make W=1 generates the following warning in mmap_lock.c for allnoconfig

  mm/mmap_lock.c:213:6: warning: no previous prototype for `__mmap_lock_do_trace_start_locking' [-Wmissing-prototypes]
   void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/mmap_lock.c:219:6: warning: no previous prototype for `__mmap_lock_do_trace_acquire_returned' [-Wmissing-prototypes]
   void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/mmap_lock.c:226:6: warning: no previous prototype for `__mmap_lock_do_trace_released' [-Wmissing-prototypes]
   void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)

On !CONFIG_TRACING configurations, the code is dead so put it behind an
#ifdef.
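The crux, heavily abbreviated (the real file also splits on
CONFIG_MEMCG for the memcg path-buffer helpers):

    #ifdef CONFIG_TRACING
    /* ...TRACE_MMAP_LOCK_EVENT() and the helpers it needs... */

    void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
    {
            TRACE_MMAP_LOCK_EVENT(start_locking, mm, write);
    }
    /* ...acquire_returned and released variants likewise... */
    #endif /* CONFIG_TRACING */

With CONFIG_TRACING disabled the definitions vanish along with the
tracepoints, so nothing is left to warn about.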

[cuibixuan@huawei.com: fix warning when CONFIG_TRACING is not defined]
  Link: https://lkml.kernel.org/r/20210531033426.74031-1-cuibixuan@huawei.com
Link: https://lkml.kernel.org/r/20210520084809.8576-13-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap_lock.c |   59 +++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 27 deletions(-)

--- a/mm/mmap_lock.c~mm-mmap_lock-remove-dead-code-for-config_tracing-configurations
+++ a/mm/mmap_lock.c
@@ -153,6 +153,37 @@ static inline void put_memcg_path_buf(vo
 	rcu_read_unlock();
 }
 
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
+	do {                                                                   \
+		const char *memcg_path;                                        \
+		preempt_disable();                                             \
+		memcg_path = get_mm_memcg_path(mm);                            \
+		trace_mmap_lock_##type(mm,                                     \
+				       memcg_path != NULL ? memcg_path : "",   \
+				       ##__VA_ARGS__);                         \
+		if (likely(memcg_path != NULL))                                \
+			put_memcg_path_buf();                                  \
+		preempt_enable();                                              \
+	} while (0)
+
+#else /* !CONFIG_MEMCG */
+
+int trace_mmap_lock_reg(void)
+{
+	return 0;
+}
+
+void trace_mmap_lock_unreg(void)
+{
+}
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
+	trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
+
+#endif /* CONFIG_MEMCG */
+
+#ifdef CONFIG_TRACING
+#ifdef CONFIG_MEMCG
 /*
  * Write the given mm_struct's memcg path to a percpu buffer, and return a
  * pointer to it. If the path cannot be determined, or no buffer was available
@@ -187,33 +218,6 @@ out:
 	return buf;
 }
 
-#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
-	do {                                                                   \
-		const char *memcg_path;                                        \
-		local_lock(&memcg_paths.lock);				       \
-		memcg_path = get_mm_memcg_path(mm);                            \
-		trace_mmap_lock_##type(mm,                                     \
-				       memcg_path != NULL ? memcg_path : "",   \
-				       ##__VA_ARGS__);                         \
-		if (likely(memcg_path != NULL))                                \
-			put_memcg_path_buf();                                  \
-		local_unlock(&memcg_paths.lock);			       \
-	} while (0)
-
-#else /* !CONFIG_MEMCG */
-
-int trace_mmap_lock_reg(void)
-{
-	return 0;
-}
-
-void trace_mmap_lock_unreg(void)
-{
-}
-
-#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
-	trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
-
 #endif /* CONFIG_MEMCG */
 
 /*
@@ -239,3 +243,4 @@ void __mmap_lock_do_trace_released(struc
 	TRACE_MMAP_LOCK_EVENT(released, mm, write);
 }
 EXPORT_SYMBOL(__mmap_lock_do_trace_released);
+#endif /* CONFIG_TRACING */
_

* [patch 126/192] mm/page_alloc: move prototype for find_suitable_fallback
  2021-07-01  1:46 incoming Andrew Morton
                   ` (124 preceding siblings ...)
  2021-07-01  1:53 ` [patch 125/192] mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 127/192] mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM Andrew Morton
                   ` (66 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/page_alloc: move prototype for find_suitable_fallback

make W=1 generates the following warning in page_alloc.c for allnoconfig

  mm/page_alloc.c:2670:5: warning: no previous prototype for `find_suitable_fallback' [-Wmissing-prototypes]
   int find_suitable_fallback(struct free_area *area, unsigned int order,
       ^~~~~~~~~~~~~~~~~~~~~~

find_suitable_fallback() is only shared outside of page_alloc.c for
CONFIG_COMPACTION, but to suppress the warning, move the prototype
outside of the CONFIG_COMPACTION guard.  It is not worth the effort at
this time to find a clever way for compaction.c to share the code, or
to avoid the use entirely, as the function is called on relatively slow
paths.

Link: https://lkml.kernel.org/r/20210520084809.8576-14-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/internal.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/internal.h~mm-page_alloc-move-prototype-for-find_suitable_fallback
+++ a/mm/internal.h
@@ -274,11 +274,10 @@ isolate_freepages_range(struct compact_c
 int
 isolate_migratepages_range(struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
+#endif
 int find_suitable_fallback(struct free_area *area, unsigned int order,
 			int migratetype, bool only_stealable, bool *can_steal);
 
-#endif
-
 /*
  * This function returns the order of a free page in the buddy system. In
  * general, page_zone(page)->lock must be held by the caller to prevent the
_

* [patch 127/192] mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM
  2021-07-01  1:46 incoming Andrew Morton
                   ` (125 preceding siblings ...)
  2021-07-01  1:53 ` [patch 126/192] mm/page_alloc: move prototype for find_suitable_fallback Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:53 ` [patch 128/192] mm/thp: define default pmd_pgtable() Andrew Morton
                   ` (65 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, david, ddstreet, linux-mm, mgorman, mhocko, mm-commits,
	shy828301, torvalds, vbabka

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM

make W=1 generates the following warning in mm/workingset.c for allnoconfig

  mm/workingset.c: In function `unpack_shadow':
  mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable]
    int memcgid, nid;
                 ^~~

On FLATMEM, NODE_DATA returns a global pglist_data without referencing
nid.  Make the helper an inline function to suppress the warning, add
type checking, and keep evaluating any side effects in the argument.

Link: https://lkml.kernel.org/r/20210520084809.8576-15-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/include/linux/mmzone.h~mm-swap-make-node_data-an-inline-function-on-config_flatmem
+++ a/include/linux/mmzone.h
@@ -1064,7 +1064,10 @@ extern char numa_zonelist_order[];
 #ifndef CONFIG_NUMA
 
 extern struct pglist_data contig_page_data;
-#define NODE_DATA(nid)		(&contig_page_data)
+static inline struct pglist_data *NODE_DATA(int nid)
+{
+	return &contig_page_data;
+}
 #define NODE_MEM_MAP(nid)	mem_map
 
 #else /* CONFIG_NUMA */
_

* [patch 128/192] mm/thp: define default pmd_pgtable()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (126 preceding siblings ...)
  2021-07-01  1:53 ` [patch 127/192] mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM Andrew Morton
@ 2021-07-01  1:53 ` Andrew Morton
  2021-07-01  1:54 ` [patch 129/192] kfence: unconditionally use unbound work queue Andrew Morton
                   ` (64 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:53 UTC (permalink / raw)
  To: akpm, anshuman.khandual, bcain, catalin.marinas, chris,
	christophe.leroy, davem, geert, guoren, hca, James.Bottomley,
	jdike, jonas, ley.foon.tan, linux-mm, mm-commits, monstr, mpe,
	nickhu, palmer, paul.walmsley, rppt, rth, shorne,
	stefan.kristiansson, tglx, torvalds, tsbogend, vgupta, will,
	ysato

From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/thp: define default pmd_pgtable()

Currently most platforms define pmd_pgtable() as pmd_page(), duplicating
the same code all over.  Instead, define a default implementation, i.e.
pmd_page(), for pmd_pgtable() and let platforms override it when required
via <asm/pgtable.h>.  All the existing platforms that override
pmd_pgtable() have had their definitions moved into their respective
<asm/pgtable.h> headers so that they precede the new generic definition.
This makes the code much cleaner and smaller.
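
The resulting pattern, both halves taken from the diff below (powerpc shown
as the override example):

	/* arch/powerpc/include/asm/pgtable.h: arch-specific override,
	 * defined before the generic header's fallback is seen. */
	#define pmd_pgtable pmd_pgtable
	static inline pgtable_t pmd_pgtable(pmd_t pmd)
	{
		return (pgtable_t)pmd_page_vaddr(pmd);
	}

	/* include/linux/pgtable.h: generic default for everyone else. */
	#ifndef pmd_pgtable
	#define pmd_pgtable(pmd) pmd_page(pmd)
	#endif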

Link: https://lkml.kernel.org/r/1623646133-20306-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/include/asm/pgalloc.h         |    1 -
 arch/arc/include/asm/pgalloc.h           |    2 --
 arch/arc/include/asm/pgtable.h           |    2 ++
 arch/arm/include/asm/pgalloc.h           |    1 -
 arch/arm64/include/asm/pgalloc.h         |    1 -
 arch/csky/include/asm/pgalloc.h          |    2 --
 arch/hexagon/include/asm/pgtable.h       |    1 -
 arch/ia64/include/asm/pgalloc.h          |    1 -
 arch/m68k/include/asm/mcf_pgalloc.h      |    2 --
 arch/m68k/include/asm/mcf_pgtable.h      |    2 ++
 arch/m68k/include/asm/motorola_pgalloc.h |    1 -
 arch/m68k/include/asm/motorola_pgtable.h |    2 ++
 arch/m68k/include/asm/sun3_pgalloc.h     |    1 -
 arch/microblaze/include/asm/pgalloc.h    |    2 --
 arch/mips/include/asm/pgalloc.h          |    1 -
 arch/nds32/include/asm/pgalloc.h         |    5 -----
 arch/nios2/include/asm/pgalloc.h         |    1 -
 arch/openrisc/include/asm/pgalloc.h      |    2 --
 arch/parisc/include/asm/pgalloc.h        |    1 -
 arch/powerpc/include/asm/pgalloc.h       |    5 -----
 arch/powerpc/include/asm/pgtable.h       |    6 ++++++
 arch/riscv/include/asm/pgalloc.h         |    2 --
 arch/s390/include/asm/pgalloc.h          |    3 ---
 arch/s390/include/asm/pgtable.h          |    3 +++
 arch/sh/include/asm/pgalloc.h            |    1 -
 arch/sparc/include/asm/pgalloc_32.h      |    1 -
 arch/sparc/include/asm/pgalloc_64.h      |    1 -
 arch/sparc/include/asm/pgtable_32.h      |    2 ++
 arch/sparc/include/asm/pgtable_64.h      |    2 ++
 arch/um/include/asm/pgalloc.h            |    1 -
 arch/x86/include/asm/pgalloc.h           |    2 --
 arch/xtensa/include/asm/pgalloc.h        |    2 --
 include/linux/pgtable.h                  |    9 +++++++++
 33 files changed, 28 insertions(+), 43 deletions(-)

--- a/arch/alpha/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/alpha/include/asm/pgalloc.h
@@ -18,7 +18,6 @@ pmd_populate(struct mm_struct *mm, pmd_t
 {
 	pmd_set(pmd, (pte_t *)(page_to_pa(pte) + PAGE_OFFSET));
 }
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 static inline void
 pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
--- a/arch/arc/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/arc/include/asm/pgalloc.h
@@ -129,6 +129,4 @@ static inline void pte_free(struct mm_st
 
 #define __pte_free_tlb(tlb, pte, addr)  pte_free((tlb)->mm, pte)
 
-#define pmd_pgtable(pmd)	((pgtable_t) pmd_page_vaddr(pmd))
-
 #endif /* _ASM_ARC_PGALLOC_H */
--- a/arch/arc/include/asm/pgtable.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/arc/include/asm/pgtable.h
@@ -350,6 +350,8 @@ void update_mmu_cache(struct vm_area_str
 
 #define kern_addr_valid(addr)	(1)
 
+#define pmd_pgtable(pmd)       ((pgtable_t) pmd_page_vaddr(pmd))
+
 /*
  * remap a physical page `pfn' of size `size' with page protection `prot'
  * into virtual address `from'
--- a/arch/arm64/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/arm64/include/asm/pgalloc.h
@@ -86,6 +86,5 @@ pmd_populate(struct mm_struct *mm, pmd_t
 	VM_BUG_ON(mm == &init_mm);
 	__pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN);
 }
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 #endif
--- a/arch/arm/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/arm/include/asm/pgalloc.h
@@ -143,7 +143,6 @@ pmd_populate(struct mm_struct *mm, pmd_t
 
 	__pmd_populate(pmdp, page_to_phys(ptep), prot);
 }
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 #endif /* CONFIG_MMU */
 
--- a/arch/csky/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/csky/include/asm/pgalloc.h
@@ -22,8 +22,6 @@ static inline void pmd_populate(struct m
 	set_pmd(pmd, __pmd(__pa(page_address(pte))));
 }
 
-#define pmd_pgtable(pmd) pmd_page(pmd)
-
 extern void pgd_init(unsigned long *p);
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
--- a/arch/hexagon/include/asm/pgtable.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/hexagon/include/asm/pgtable.h
@@ -239,7 +239,6 @@ static inline int pmd_bad(pmd_t pmd)
  * pmd_page - converts a PMD entry to a page pointer
  */
 #define pmd_page(pmd)  (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 /**
  * pte_none - check if pte is mapped
--- a/arch/ia64/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/ia64/include/asm/pgalloc.h
@@ -52,7 +52,6 @@ pmd_populate(struct mm_struct *mm, pmd_t
 {
 	pmd_val(*pmd_entry) = page_to_phys(pte);
 }
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 static inline void
 pmd_populate_kernel(struct mm_struct *mm, pmd_t * pmd_entry, pte_t * pte)
--- a/arch/m68k/include/asm/mcf_pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/m68k/include/asm/mcf_pgalloc.h
@@ -32,8 +32,6 @@ extern inline pmd_t *pmd_alloc_kernel(pg
 
 #define pmd_populate_kernel pmd_populate
 
-#define pmd_pgtable(pmd) pfn_to_virt(pmd_val(pmd) >> PAGE_SHIFT)
-
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable,
 				  unsigned long address)
 {
--- a/arch/m68k/include/asm/mcf_pgtable.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/m68k/include/asm/mcf_pgtable.h
@@ -150,6 +150,8 @@
 
 #ifndef __ASSEMBLY__
 
+#define pmd_pgtable(pmd) pfn_to_virt(pmd_val(pmd) >> PAGE_SHIFT)
+
 /*
  * Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
--- a/arch/m68k/include/asm/motorola_pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/m68k/include/asm/motorola_pgalloc.h
@@ -88,7 +88,6 @@ static inline void pmd_populate(struct m
 {
 	pmd_set(pmd, page);
 }
-#define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd))
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 {
--- a/arch/m68k/include/asm/motorola_pgtable.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/m68k/include/asm/motorola_pgtable.h
@@ -105,6 +105,8 @@ extern unsigned long mm_cachebits;
 #define __S110	PAGE_SHARED_C
 #define __S111	PAGE_SHARED_C
 
+#define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd))
+
 /*
  * Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
--- a/arch/m68k/include/asm/sun3_pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/m68k/include/asm/sun3_pgalloc.h
@@ -32,7 +32,6 @@ static inline void pmd_populate(struct m
 {
 	pmd_val(*pmd) = __pa((unsigned long)page_address(page));
 }
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 /*
  * allocating and freeing a pmd is trivial: the 1-entry pmd is
--- a/arch/microblaze/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/microblaze/include/asm/pgalloc.h
@@ -28,8 +28,6 @@ static inline pgd_t *get_pgd(void)
 
 #define pgd_alloc(mm)		get_pgd()
 
-#define pmd_pgtable(pmd)	pmd_page(pmd)
-
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
 
 #define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, (pte))
--- a/arch/mips/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/mips/include/asm/pgalloc.h
@@ -28,7 +28,6 @@ static inline void pmd_populate(struct m
 {
 	set_pmd(pmd, __pmd((unsigned long)page_address(pte)));
 }
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 /*
  * Initialize a new pmd table with invalid pointers.
--- a/arch/nds32/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/nds32/include/asm/pgalloc.h
@@ -12,11 +12,6 @@
 #define __HAVE_ARCH_PTE_ALLOC_ONE
 #include <asm-generic/pgalloc.h>	/* for pte_{alloc,free}_one */
 
-/*
- * Since we have only two-level page tables, these are trivial
- */
-#define pmd_pgtable(pmd) pmd_page(pmd)
-
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 extern void pgd_free(struct mm_struct *mm, pgd_t * pgd);
 
--- a/arch/nios2/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/nios2/include/asm/pgalloc.h
@@ -25,7 +25,6 @@ static inline void pmd_populate(struct m
 {
 	set_pmd(pmd, __pmd((unsigned long)page_address(pte)));
 }
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 /*
  * Initialize a new pmd table with invalid pointers.
--- a/arch/openrisc/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/openrisc/include/asm/pgalloc.h
@@ -72,6 +72,4 @@ do {					\
 	tlb_remove_page((tlb), (pte));	\
 } while (0)
 
-#define pmd_pgtable(pmd) pmd_page(pmd)
-
 #endif
--- a/arch/parisc/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/parisc/include/asm/pgalloc.h
@@ -69,6 +69,5 @@ pmd_populate_kernel(struct mm_struct *mm
 
 #define pmd_populate(mm, pmd, pte_page) \
 	pmd_populate_kernel(mm, pmd, page_address(pte_page))
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 #endif
--- a/arch/powerpc/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/powerpc/include/asm/pgalloc.h
@@ -70,9 +70,4 @@ extern struct kmem_cache *pgtable_cache[
 #include <asm/nohash/pgalloc.h>
 #endif
 
-static inline pgtable_t pmd_pgtable(pmd_t pmd)
-{
-	return (pgtable_t)pmd_page_vaddr(pmd);
-}
-
 #endif /* _ASM_POWERPC_PGALLOC_H */
--- a/arch/powerpc/include/asm/pgtable.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/powerpc/include/asm/pgtable.h
@@ -152,6 +152,12 @@ static inline bool p4d_is_leaf(p4d_t p4d
 }
 #endif
 
+#define pmd_pgtable pmd_pgtable
+static inline pgtable_t pmd_pgtable(pmd_t pmd)
+{
+	return (pgtable_t)pmd_page_vaddr(pmd);
+}
+
 #ifdef CONFIG_PPC64
 #define is_ioremap_addr is_ioremap_addr
 static inline bool is_ioremap_addr(const void *x)
--- a/arch/riscv/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/riscv/include/asm/pgalloc.h
@@ -38,8 +38,6 @@ static inline void pud_populate(struct m
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-#define pmd_pgtable(pmd)	pmd_page(pmd)
-
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
 	pgd_t *pgd;
--- a/arch/s390/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/s390/include/asm/pgalloc.h
@@ -134,9 +134,6 @@ static inline void pmd_populate(struct m
 
 #define pmd_populate_kernel(mm, pmd, pte) pmd_populate(mm, pmd, pte)
 
-#define pmd_pgtable(pmd) \
-	((pgtable_t)__va(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE))
-
 /*
  * page table entry allocation/free routines.
  */
--- a/arch/s390/include/asm/pgtable.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/s390/include/asm/pgtable.h
@@ -1709,4 +1709,7 @@ extern void s390_reset_cmma(struct mm_st
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
 
+#define pmd_pgtable(pmd) \
+	((pgtable_t)__va(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE))
+
 #endif /* _S390_PAGE_H */
--- a/arch/sh/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/sh/include/asm/pgalloc.h
@@ -30,7 +30,6 @@ static inline void pmd_populate(struct m
 {
 	set_pmd(pmd, __pmd((unsigned long)page_address(pte)));
 }
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 #define __pte_free_tlb(tlb,pte,addr)			\
 do {							\
--- a/arch/sparc/include/asm/pgalloc_32.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/sparc/include/asm/pgalloc_32.h
@@ -51,7 +51,6 @@ static inline void free_pmd_fast(pmd_t *
 #define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
 
 #define pmd_populate(mm, pmd, pte)	pmd_set(pmd, pte)
-#define pmd_pgtable(pmd)		(pgtable_t)__pmd_page(pmd)
 
 void pmd_set(pmd_t *pmdp, pte_t *ptep);
 #define pmd_populate_kernel		pmd_populate
--- a/arch/sparc/include/asm/pgalloc_64.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/sparc/include/asm/pgalloc_64.h
@@ -67,7 +67,6 @@ void pte_free(struct mm_struct *mm, pgta
 
 #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE)		pmd_set(MM, PMD, PTE)
-#define pmd_pgtable(PMD)			((pte_t *)pmd_page_vaddr(PMD))
 
 void pgtable_free(void *table, bool is_page);
 
--- a/arch/sparc/include/asm/pgtable_32.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/sparc/include/asm/pgtable_32.h
@@ -432,4 +432,6 @@ static inline int io_remap_pfn_range(str
 /* We provide our own get_unmapped_area to cope with VA holes for userland */
 #define HAVE_ARCH_UNMAPPED_AREA
 
+#define pmd_pgtable(pmd)	((pgtable_t)__pmd_page(pmd))
+
 #endif /* !(_SPARC_PGTABLE_H) */
--- a/arch/sparc/include/asm/pgtable_64.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/sparc/include/asm/pgtable_64.h
@@ -1117,6 +1117,8 @@ extern unsigned long cmdline_memory_size
 
 asmlinkage void do_sparc64_fault(struct pt_regs *regs);
 
+#define pmd_pgtable(PMD)	((pte_t *)pmd_page_vaddr(PMD))
+
 #ifdef CONFIG_HUGETLB_PAGE
 
 #define pud_leaf_size pud_leaf_size
--- a/arch/um/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/um/include/asm/pgalloc.h
@@ -19,7 +19,6 @@
 	set_pmd(pmd, __pmd(_PAGE_TABLE +			\
 		((unsigned long long)page_to_pfn(pte) <<	\
 			(unsigned long long) PAGE_SHIFT)))
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 /*
  * Allocate and free page tables.
--- a/arch/x86/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/x86/include/asm/pgalloc.h
@@ -84,8 +84,6 @@ static inline void pmd_populate(struct m
 	set_pmd(pmd, __pmd(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
 }
 
-#define pmd_pgtable(pmd) pmd_page(pmd)
-
 #if CONFIG_PGTABLE_LEVELS > 2
 extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd);
 
--- a/arch/xtensa/include/asm/pgalloc.h~mm-thp-define-default-pmd_pgtable
+++ a/arch/xtensa/include/asm/pgalloc.h
@@ -25,7 +25,6 @@
 	(pmd_val(*(pmdp)) = ((unsigned long)ptep))
 #define pmd_populate(mm, pmdp, page)					     \
 	(pmd_val(*(pmdp)) = ((unsigned long)page_to_virt(page)))
-#define pmd_pgtable(pmd) pmd_page(pmd)
 
 static inline pgd_t*
 pgd_alloc(struct mm_struct *mm)
@@ -63,7 +62,6 @@ static inline pgtable_t pte_alloc_one(st
 	return page;
 }
 
-#define pmd_pgtable(pmd) pmd_page(pmd)
 #endif /* CONFIG_MMU */
 
 #endif /* _XTENSA_PGALLOC_H */
--- a/include/linux/pgtable.h~mm-thp-define-default-pmd_pgtable
+++ a/include/linux/pgtable.h
@@ -38,6 +38,15 @@
 #endif
 
 /*
+ * This defines the generic helper for accessing PMD page
+ * table page. Although platforms can still override this
+ * via their respective <asm/pgtable.h>.
+ */
+#ifndef pmd_pgtable
+#define pmd_pgtable(pmd) pmd_page(pmd)
+#endif
+
+/*
  * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
  *
  * The pXx_index() functions return the index of the entry in the page
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 129/192] kfence: unconditionally use unbound work queue
  2021-07-01  1:46 incoming Andrew Morton
                   ` (127 preceding siblings ...)
  2021-07-01  1:53 ` [patch 128/192] mm/thp: define default pmd_pgtable() Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 130/192] mm: remove special swap entry functions Andrew Morton
                   ` (63 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, hdanton, linux-mm, mm-commits, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: unconditionally use unbound work queue

Unconditionally use the unbound work queue, and not just when
wq_power_efficient is true.  If the system is idle, KFENCE may wait, and
by running on the unbound work queue we permit the scheduler to make
better scheduling decisions and avoid pinning KFENCE to the same CPU upon
waking up.
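
Concretely, the change is just the target workqueue (see the diff below):

	/* Before: only unbound when workqueue.power_efficient is set. */
	queue_delayed_work(system_power_efficient_wq, &kfence_timer,
			   msecs_to_jiffies(kfence_sample_interval));

	/* After: always unbound, so the scheduler is free to pick a CPU
	 * when the work wakes. */
	queue_delayed_work(system_unbound_wq, &kfence_timer,
			   msecs_to_jiffies(kfence_sample_interval));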

Link: https://lkml.kernel.org/r/20210521111630.472579-1-elver@google.com
Fixes: 36f0b35d0894 ("kfence: use power-efficient work queue to run delayed work")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Hillf Danton <hdanton@sina.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/kfence/core.c~kfence-unconditionally-use-unbound-work-queue
+++ a/mm/kfence/core.c
@@ -636,7 +636,7 @@ static void toggle_allocation_gate(struc
 	/* Disable static key and reset timer. */
 	static_branch_disable(&kfence_allocation_key);
 #endif
-	queue_delayed_work(system_power_efficient_wq, &kfence_timer,
+	queue_delayed_work(system_unbound_wq, &kfence_timer,
 			   msecs_to_jiffies(kfence_sample_interval));
 }
 static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate);
@@ -666,7 +666,7 @@ void __init kfence_init(void)
 	}
 
 	WRITE_ONCE(kfence_enabled, true);
-	queue_delayed_work(system_power_efficient_wq, &kfence_timer, 0);
+	queue_delayed_work(system_unbound_wq, &kfence_timer, 0);
 	pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", KFENCE_POOL_SIZE,
 		CONFIG_KFENCE_NUM_OBJECTS, (void *)__kfence_pool,
 		(void *)(__kfence_pool + KFENCE_POOL_SIZE));
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 130/192] mm: remove special swap entry functions
  2021-07-01  1:46 incoming Andrew Morton
                   ` (128 preceding siblings ...)
  2021-07-01  1:54 ` [patch 129/192] kfence: unconditionally use unbound work queue Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 131/192] mm/swapops: rework swap entry manipulation code Andrew Morton
                   ` (62 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: mm: remove special swap entry functions

Patch series "Add support for SVM atomics in Nouveau", v11.

Introduction
============

Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory.  To support atomic operations on
a shared virtual memory page, such a device needs access to that page
which is exclusive of the CPU.  This series introduces a mechanism to
temporarily unmap pages, granting exclusive access to a device.

These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag.  A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .

Implementation
==============

Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry.  The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.

Restoring the entry triggers calls to MMU notifiers, which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.

Patches
=======

Patches 1 & 2 refactor existing migration and device private entry
functions.

Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().

Patch 5 renames some existing code but does not introduce functionality.

Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

Patch 7 contains the bulk of the implementation for device exclusive
memory.

Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.

Patch 9 is a cleanup for the Nouveau SVM implementation.

Patch 10 contains the implementation of atomic access for the Nouveau
driver.

Testing
=======

This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic. 
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes.  For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/

Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.


This patch (of 10):

Remove multiple similar inline functions for dealing with different types
of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn.  Instead of multiple inline functions to obtain a struct page
for each swap entry type, use a common function, pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions, as this results in
shorter code that is easier to understand.
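
Callers that previously had to distinguish the entry type can now use the
common helper, as in the fs/proc/task_mmu.c hunks below:

	struct page *page = NULL;
	swp_entry_t swpent = pte_to_swp_entry(*pte);

	/* Both migration and device private entries store a pfn in the
	 * swap offset, so one helper replaces the two type-specific
	 * conversions. */
	if (is_pfn_swap_entry(swpent))
		page = pfn_swap_entry_to_page(swpent);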

Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/mm/pgtable.c  |    2 -
 fs/proc/task_mmu.c      |   23 ++++--------
 include/linux/swap.h    |    4 +-
 include/linux/swapops.h |   69 ++++++++++++--------------------------
 mm/hmm.c                |    5 +-
 mm/huge_memory.c        |    6 +--
 mm/memcontrol.c         |    2 -
 mm/memory.c             |   10 ++---
 mm/migrate.c            |    6 +--
 mm/page_vma_mapped.c    |    6 +--
 10 files changed, 51 insertions(+), 82 deletions(-)

--- a/arch/s390/mm/pgtable.c~mm-remove-special-swap-entry-functions
+++ a/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct m
 	if (!non_swap_entry(entry))
 		dec_mm_counter(mm, MM_SWAPENTS);
 	else if (is_migration_entry(entry)) {
-		struct page *page = migration_entry_to_page(entry);
+		struct page *page = pfn_swap_entry_to_page(entry);
 
 		dec_mm_counter(mm, mm_counter(page));
 	}
--- a/fs/proc/task_mmu.c~mm-remove-special-swap-entry-functions
+++ a/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte,
 			} else {
 				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
 			}
-		} else if (is_migration_entry(swpent))
-			page = migration_entry_to_page(swpent);
-		else if (is_device_private_entry(swpent))
-			page = device_private_entry_to_page(swpent);
+		} else if (is_pfn_swap_entry(swpent))
+			page = pfn_swap_entry_to_page(swpent);
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = xa_load(&vma->vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd,
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
 		if (is_migration_entry(entry))
-			page = migration_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 	}
 	if (IS_ERR_OR_NULL(page))
 		return;
@@ -694,10 +692,8 @@ static int smaps_hugetlb_range(pte_t *pt
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
 
-		if (is_migration_entry(swpent))
-			page = migration_entry_to_page(swpent);
-		else if (is_device_private_entry(swpent))
-			page = device_private_entry_to_page(swpent);
+		if (is_pfn_swap_entry(swpent))
+			page = pfn_swap_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1389,11 +1385,8 @@ static pagemap_entry_t pte_to_pagemap_en
 			frame = swp_type(entry) |
 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
 		flags |= PM_SWAP;
-		if (is_migration_entry(entry))
-			page = migration_entry_to_page(entry);
-
-		if (is_device_private_entry(entry))
-			page = device_private_entry_to_page(entry);
+		if (is_pfn_swap_entry(entry))
+			page = pfn_swap_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
@@ -1454,7 +1447,7 @@ static int pagemap_pmd_range(pmd_t *pmdp
 			if (pmd_swp_uffd_wp(pmd))
 				flags |= PM_UFFD_WP;
 			VM_BUG_ON(!is_pmd_migration_entry(pmd));
-			page = migration_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 		}
 #endif
 
--- a/include/linux/swap.h~mm-remove-special-swap-entry-functions
+++ a/include/linux/swap.h
@@ -564,8 +564,8 @@ static inline void show_swap_cache_info(
 {
 }
 
-#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
-#define swapcache_prepare(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
+/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
+#define free_swap_and_cache(e) is_pfn_swap_entry(e)
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
--- a/include/linux/swapops.h~mm-remove-special-swap-entry-functions
+++ a/include/linux/swapops.h
@@ -128,16 +128,6 @@ static inline bool is_write_device_priva
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
-
-static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry)
-{
-	return swp_offset(entry);
-}
-
-static inline struct page *device_private_entry_to_page(swp_entry_t entry)
-{
-	return pfn_to_page(swp_offset(entry));
-}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
 {
@@ -157,16 +147,6 @@ static inline bool is_write_device_priva
 {
 	return false;
 }
-
-static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry)
-{
-	return 0;
-}
-
-static inline struct page *device_private_entry_to_page(swp_entry_t entry)
-{
-	return NULL;
-}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
@@ -189,22 +169,6 @@ static inline int is_write_migration_ent
 	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline unsigned long migration_entry_to_pfn(swp_entry_t entry)
-{
-	return swp_offset(entry);
-}
-
-static inline struct page *migration_entry_to_page(swp_entry_t entry)
-{
-	struct page *p = pfn_to_page(swp_offset(entry));
-	/*
-	 * Any use of migration entries may only occur while the
-	 * corresponding page is locked
-	 */
-	BUG_ON(!PageLocked(compound_head(p)));
-	return p;
-}
-
 static inline void make_migration_entry_read(swp_entry_t *entry)
 {
 	*entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
@@ -224,16 +188,6 @@ static inline int is_migration_entry(swp
 	return 0;
 }
 
-static inline unsigned long migration_entry_to_pfn(swp_entry_t entry)
-{
-	return 0;
-}
-
-static inline struct page *migration_entry_to_page(swp_entry_t entry)
-{
-	return NULL;
-}
-
 static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl) { }
@@ -248,6 +202,29 @@ static inline int is_write_migration_ent
 
 #endif
 
+static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
+{
+	struct page *p = pfn_to_page(swp_offset(entry));
+
+	/*
+	 * Any use of migration entries may only occur while the
+	 * corresponding page is locked
+	 */
+	BUG_ON(is_migration_entry(entry) && !PageLocked(p));
+
+	return p;
+}
+
+/*
+ * A pfn swap entry is a special type of swap entry that always has a pfn stored
+ * in the swap offset. They are used to represent unaddressable device memory
+ * and to restrict access to a page undergoing migration.
+ */
+static inline bool is_pfn_swap_entry(swp_entry_t entry)
+{
+	return is_migration_entry(entry) || is_device_private_entry(entry);
+}
+
 struct page_vma_mapped_walk;
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
--- a/mm/hmm.c~mm-remove-special-swap-entry-functions
+++ a/mm/hmm.c
@@ -214,7 +214,7 @@ static inline bool hmm_is_device_private
 		swp_entry_t entry)
 {
 	return is_device_private_entry(entry) &&
-		device_private_entry_to_page(entry)->pgmap->owner ==
+		pfn_swap_entry_to_page(entry)->pgmap->owner ==
 		range->dev_private_owner;
 }
 
@@ -257,8 +257,7 @@ static int hmm_vma_handle_pte(struct mm_
 			cpu_flags = HMM_PFN_VALID;
 			if (is_write_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
-			*hmm_pfn = device_private_entry_to_pfn(entry) |
-					cpu_flags;
+			*hmm_pfn = swp_offset(entry) | cpu_flags;
 			return 0;
 		}
 
--- a/mm/huge_memory.c~mm-remove-special-swap-entry-functions
+++ a/mm/huge_memory.c
@@ -1643,7 +1643,7 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 
 			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
 			entry = pmd_to_swp_entry(orig_pmd);
-			page = migration_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 			flush_needed = 0;
 		} else
 			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
@@ -2012,7 +2012,7 @@ static void __split_huge_pmd_locked(stru
 			swp_entry_t entry;
 
 			entry = pmd_to_swp_entry(old_pmd);
-			page = migration_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 		} else {
 			page = pmd_page(old_pmd);
 			if (!PageDirty(page) && pmd_dirty(old_pmd))
@@ -2066,7 +2066,7 @@ static void __split_huge_pmd_locked(stru
 		swp_entry_t entry;
 
 		entry = pmd_to_swp_entry(old_pmd);
-		page = migration_entry_to_page(entry);
+		page = pfn_swap_entry_to_page(entry);
 		write = is_write_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
--- a/mm/memcontrol.c~mm-remove-special-swap-entry-functions
+++ a/mm/memcontrol.c
@@ -5532,7 +5532,7 @@ static struct page *mc_handle_swap_pte(s
 	 * as special swap entry in the CPU page table.
 	 */
 	if (is_device_private_entry(ent)) {
-		page = device_private_entry_to_page(ent);
+		page = pfn_swap_entry_to_page(ent);
 		/*
 		 * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have
 		 * a refcount of 1 when free (unlike normal page)
--- a/mm/memory.c~mm-remove-special-swap-entry-functions
+++ a/mm/memory.c
@@ -729,7 +729,7 @@ copy_nonpresent_pte(struct mm_struct *ds
 		}
 		rss[MM_SWAPENTS]++;
 	} else if (is_migration_entry(entry)) {
-		page = migration_entry_to_page(entry);
+		page = pfn_swap_entry_to_page(entry);
 
 		rss[mm_counter(page)]++;
 
@@ -748,7 +748,7 @@ copy_nonpresent_pte(struct mm_struct *ds
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
 	} else if (is_device_private_entry(entry)) {
-		page = device_private_entry_to_page(entry);
+		page = pfn_swap_entry_to_page(entry);
 
 		/*
 		 * Update rss count even for unaddressable pages, as
@@ -1280,7 +1280,7 @@ again:
 
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry)) {
-			struct page *page = device_private_entry_to_page(entry);
+			struct page *page = pfn_swap_entry_to_page(entry);
 
 			if (unlikely(details && details->check_mapping)) {
 				/*
@@ -1309,7 +1309,7 @@ again:
 		else if (is_migration_entry(entry)) {
 			struct page *page;
 
-			page = migration_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 			rss[mm_counter(page)]--;
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
@@ -3372,7 +3372,7 @@ vm_fault_t do_swap_page(struct vm_fault
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
 		} else if (is_device_private_entry(entry)) {
-			vmf->page = device_private_entry_to_page(entry);
+			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
--- a/mm/migrate.c~mm-remove-special-swap-entry-functions
+++ a/mm/migrate.c
@@ -296,7 +296,7 @@ void __migration_entry_wait(struct mm_st
 	if (!is_migration_entry(entry))
 		goto out;
 
-	page = migration_entry_to_page(entry);
+	page = pfn_swap_entry_to_page(entry);
 	page = compound_head(page);
 
 	/*
@@ -337,7 +337,7 @@ void pmd_migration_entry_wait(struct mm_
 	ptl = pmd_lock(mm, pmd);
 	if (!is_pmd_migration_entry(*pmd))
 		goto unlock;
-	page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
+	page = pfn_swap_entry_to_page(pmd_to_swp_entry(*pmd));
 	if (!get_page_unless_zero(page))
 		goto unlock;
 	spin_unlock(ptl);
@@ -2289,7 +2289,7 @@ again:
 			if (!is_device_private_entry(entry))
 				goto next;
 
-			page = device_private_entry_to_page(entry);
+			page = pfn_swap_entry_to_page(entry);
 			if (!(migrate->flags &
 				MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
 			    page->pgmap->owner != migrate->pgmap_owner)
--- a/mm/page_vma_mapped.c~mm-remove-special-swap-entry-functions
+++ a/mm/page_vma_mapped.c
@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_ma
 		if (!is_migration_entry(entry))
 			return false;
 
-		pfn = migration_entry_to_pfn(entry);
+		pfn = swp_offset(entry);
 	} else if (is_swap_pte(*pvmw->pte)) {
 		swp_entry_t entry;
 
@@ -105,7 +105,7 @@ static bool check_pte(struct page_vma_ma
 		if (!is_device_private_entry(entry))
 			return false;
 
-		pfn = device_private_entry_to_pfn(entry);
+		pfn = swp_offset(entry);
 	} else {
 		if (!pte_present(*pvmw->pte))
 			return false;
@@ -233,7 +233,7 @@ restart:
 					return not_found(pvmw);
 				entry = pmd_to_swp_entry(pmde);
 				if (!is_migration_entry(entry) ||
-				    migration_entry_to_page(entry) != page)
+				    pfn_swap_entry_to_page(entry) != page)
 					return not_found(pvmw);
 				return true;
 			}
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 131/192] mm/swapops: rework swap entry manipulation code
  2021-07-01  1:46 incoming Andrew Morton
                   ` (129 preceding siblings ...)
  2021-07-01  1:54 ` [patch 130/192] mm: remove special swap entry functions Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 132/192] mm/rmap: split try_to_munlock from try_to_unmap Andrew Morton
                   ` (61 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: mm/swapops: rework swap entry manipulation code

Both migration and device private pages use special swap entries that are
manipulated by a range of inline functions.  The arguments to these are
somewhat inconsistent, so rework them to remove the flag-type arguments
and to make the arguments similar for both read and write entry creation.
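
The conversion pattern, as applied throughout the diff below (the
mm/rmap.c case shown here):

	/* Old style: a flag argument selected the entry type. */
	entry = make_migration_entry(subpage, pte_write(pteval));

	/* New style: explicit read/write helpers taking a pfn. */
	if (pte_write(pteval))
		entry = make_writable_migration_entry(
					page_to_pfn(subpage));
	else
		entry = make_readable_migration_entry(
					page_to_pfn(subpage));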

Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swapops.h |   56 ++++++++++++++++++++------------------
 mm/debug_vm_pgtable.c   |   12 ++++----
 mm/hmm.c                |    2 -
 mm/huge_memory.c        |   26 ++++++++++++-----
 mm/hugetlb.c            |   10 ++++--
 mm/memory.c             |   10 ++++--
 mm/migrate.c            |   26 +++++++++++++----
 mm/mprotect.c           |   10 ++++--
 mm/rmap.c               |   10 ++++--
 9 files changed, 100 insertions(+), 62 deletions(-)

--- a/include/linux/swapops.h~mm-swapops-rework-swap-entry-manipulation-code
+++ a/include/linux/swapops.h
@@ -107,35 +107,35 @@ static inline void *swp_to_radix_entry(s
 }
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
-	return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
-			 page_to_pfn(page));
+	return swp_entry(SWP_DEVICE_READ, offset);
 }
 
-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
-	int type = swp_type(entry);
-	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+	return swp_entry(SWP_DEVICE_WRITE, offset);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
 {
-	*entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+	int type = swp_type(entry);
+	return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 #else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
 	return swp_entry(0, 0);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
+	return swp_entry(0, 0);
 }
 
 static inline bool is_device_private_entry(swp_entry_t entry)
@@ -143,35 +143,32 @@ static inline bool is_device_private_ent
 	return false;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return false;
 }
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
-	BUG_ON(!PageLocked(compound_head(page)));
-
-	return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
-			page_to_pfn(page));
-}
-
 static inline int is_migration_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
 			swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
-	*entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+	return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -181,21 +178,28 @@ extern void migration_entry_wait(struct
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
 		struct mm_struct *mm, pte_t *pte);
 #else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
 
-#define make_migration_entry(page, write) swp_entry(0, 0)
 static inline int is_migration_entry(swp_entry_t swp)
 {
 	return 0;
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl) { }
 static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					 unsigned long address) { }
 static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
 		struct mm_struct *mm, pte_t *pte) { }
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
 	return 0;
 }
--- a/mm/debug_vm_pgtable.c~mm-swapops-rework-swap-entry-manipulation-code
+++ a/mm/debug_vm_pgtable.c
@@ -843,17 +843,17 @@ static void __init swap_migration_tests(
 	 * locked, otherwise it stumbles upon a BUG_ON().
 	 */
 	__SetPageLocked(page);
-	swp = make_migration_entry(page, 1);
+	swp = make_writable_migration_entry(page_to_pfn(page));
 	WARN_ON(!is_migration_entry(swp));
-	WARN_ON(!is_write_migration_entry(swp));
+	WARN_ON(!is_writable_migration_entry(swp));
 
-	make_migration_entry_read(&swp);
+	swp = make_readable_migration_entry(swp_offset(swp));
 	WARN_ON(!is_migration_entry(swp));
-	WARN_ON(is_write_migration_entry(swp));
+	WARN_ON(is_writable_migration_entry(swp));
 
-	swp = make_migration_entry(page, 0);
+	swp = make_readable_migration_entry(page_to_pfn(page));
 	WARN_ON(!is_migration_entry(swp));
-	WARN_ON(is_write_migration_entry(swp));
+	WARN_ON(is_writable_migration_entry(swp));
 	__ClearPageLocked(page);
 	__free_page(page);
 }
--- a/mm/hmm.c~mm-swapops-rework-swap-entry-manipulation-code
+++ a/mm/hmm.c
@@ -255,7 +255,7 @@ static int hmm_vma_handle_pte(struct mm_
 		 */
 		if (hmm_is_device_private_entry(range, entry)) {
 			cpu_flags = HMM_PFN_VALID;
-			if (is_write_device_private_entry(entry))
+			if (is_writable_device_private_entry(entry))
 				cpu_flags |= HMM_PFN_WRITE;
 			*hmm_pfn = swp_offset(entry) | cpu_flags;
 			return 0;
--- a/mm/huge_memory.c~mm-swapops-rework-swap-entry-manipulation-code
+++ a/mm/huge_memory.c
@@ -1054,8 +1054,9 @@ int copy_huge_pmd(struct mm_struct *dst_
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
 		VM_BUG_ON(!is_pmd_migration_entry(pmd));
-		if (is_write_migration_entry(entry)) {
-			make_migration_entry_read(&entry);
+		if (is_writable_migration_entry(entry)) {
+			entry = make_readable_migration_entry(
+							swp_offset(entry));
 			pmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
@@ -1772,13 +1773,14 @@ int change_huge_pmd(struct vm_area_struc
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
 		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
-		if (is_write_migration_entry(entry)) {
+		if (is_writable_migration_entry(entry)) {
 			pmd_t newpmd;
 			/*
 			 * A protection check is difficult so
 			 * just be safe and disable write
 			 */
-			make_migration_entry_read(&entry);
+			entry = make_readable_migration_entry(
+							swp_offset(entry));
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
@@ -2067,7 +2069,7 @@ static void __split_huge_pmd_locked(stru
 
 		entry = pmd_to_swp_entry(old_pmd);
 		page = pfn_swap_entry_to_page(entry);
-		write = is_write_migration_entry(entry);
+		write = is_writable_migration_entry(entry);
 		young = false;
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
@@ -2099,7 +2101,12 @@ static void __split_huge_pmd_locked(stru
 		 */
 		if (freeze || pmd_migration) {
 			swp_entry_t swp_entry;
-			swp_entry = make_migration_entry(page + i, write);
+			if (write)
+				swp_entry = make_writable_migration_entry(
+							page_to_pfn(page + i));
+			else
+				swp_entry = make_readable_migration_entry(
+							page_to_pfn(page + i));
 			entry = swp_entry_to_pte(swp_entry);
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
@@ -3171,7 +3178,10 @@ void set_pmd_migration_entry(struct page
 	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
 	if (pmd_dirty(pmdval))
 		set_page_dirty(page);
-	entry = make_migration_entry(page, pmd_write(pmdval));
+	if (pmd_write(pmdval))
+		entry = make_writable_migration_entry(page_to_pfn(page));
+	else
+		entry = make_readable_migration_entry(page_to_pfn(page));
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3197,7 +3207,7 @@ void remove_migration_pmd(struct page_vm
 	pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot));
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
-	if (is_write_migration_entry(entry))
+	if (is_writable_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
 	if (pmd_swp_uffd_wp(*pvmw->pmd))
 		pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
--- a/mm/hugetlb.c~mm-swapops-rework-swap-entry-manipulation-code
+++ a/mm/hugetlb.c
@@ -4242,12 +4242,13 @@ again:
 				    is_hugetlb_entry_hwpoisoned(entry))) {
 			swp_entry_t swp_entry = pte_to_swp_entry(entry);
 
-			if (is_write_migration_entry(swp_entry) && cow) {
+			if (is_writable_migration_entry(swp_entry) && cow) {
 				/*
 				 * COW mappings require pages in both
 				 * parent and child to be set to read.
 				 */
-				make_migration_entry_read(&swp_entry);
+				swp_entry = make_readable_migration_entry(
+							swp_offset(swp_entry));
 				entry = swp_entry_to_pte(swp_entry);
 				set_huge_swap_pte_at(src, addr, src_pte,
 						     entry, sz);
@@ -5532,10 +5533,11 @@ unsigned long hugetlb_change_protection(
 		if (unlikely(is_hugetlb_entry_migration(pte))) {
 			swp_entry_t entry = pte_to_swp_entry(pte);
 
-			if (is_write_migration_entry(entry)) {
+			if (is_writable_migration_entry(entry)) {
 				pte_t newpte;
 
-				make_migration_entry_read(&entry);
+				entry = make_readable_migration_entry(
+							swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				set_huge_swap_pte_at(mm, address, ptep,
 						     newpte, huge_page_size(h));
--- a/mm/memory.c~mm-swapops-rework-swap-entry-manipulation-code
+++ a/mm/memory.c
@@ -733,13 +733,14 @@ copy_nonpresent_pte(struct mm_struct *ds
 
 		rss[mm_counter(page)]++;
 
-		if (is_write_migration_entry(entry) &&
+		if (is_writable_migration_entry(entry) &&
 				is_cow_mapping(vm_flags)) {
 			/*
 			 * COW mappings require pages in both
 			 * parent and child to be set to read.
 			 */
-			make_migration_entry_read(&entry);
+			entry = make_readable_migration_entry(
+							swp_offset(entry));
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_soft_dirty(*src_pte))
 				pte = pte_swp_mksoft_dirty(pte);
@@ -770,9 +771,10 @@ copy_nonpresent_pte(struct mm_struct *ds
 		 * when a device driver is involved (you cannot easily
 		 * save and restore device driver state).
 		 */
-		if (is_write_device_private_entry(entry) &&
+		if (is_writable_device_private_entry(entry) &&
 		    is_cow_mapping(vm_flags)) {
-			make_device_private_entry_read(&entry);
+			entry = make_readable_device_private_entry(
+							swp_offset(entry));
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_uffd_wp(*src_pte))
 				pte = pte_swp_mkuffd_wp(pte);
--- a/mm/migrate.c~mm-swapops-rework-swap-entry-manipulation-code
+++ a/mm/migrate.c
@@ -210,13 +210,18 @@ static bool remove_migration_pte(struct
 		 * Recheck VMA as permissions can change since migration started
 		 */
 		entry = pte_to_swp_entry(*pvmw.pte);
-		if (is_write_migration_entry(entry))
+		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else if (pte_swp_uffd_wp(*pvmw.pte))
 			pte = pte_mkuffd_wp(pte);
 
 		if (unlikely(is_device_private_page(new))) {
-			entry = make_device_private_entry(new, pte_write(pte));
+			if (pte_write(pte))
+				entry = make_writable_device_private_entry(
+							page_to_pfn(new));
+			else
+				entry = make_readable_device_private_entry(
+							page_to_pfn(new));
 			pte = swp_entry_to_pte(entry);
 			if (pte_swp_soft_dirty(*pvmw.pte))
 				pte = pte_swp_mksoft_dirty(pte);
@@ -2297,7 +2302,7 @@ again:
 
 			mpfn = migrate_pfn(page_to_pfn(page)) |
 					MIGRATE_PFN_MIGRATE;
-			if (is_write_device_private_entry(entry))
+			if (is_writable_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
 			if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
@@ -2343,8 +2348,12 @@ again:
 			ptep_get_and_clear(mm, addr, ptep);
 
 			/* Setup special migration page table entry */
-			entry = make_migration_entry(page, mpfn &
-						     MIGRATE_PFN_WRITE);
+			if (mpfn & MIGRATE_PFN_WRITE)
+				entry = make_writable_migration_entry(
+							page_to_pfn(page));
+			else
+				entry = make_readable_migration_entry(
+							page_to_pfn(page));
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_present(pte)) {
 				if (pte_soft_dirty(pte))
@@ -2817,7 +2826,12 @@ static void migrate_vma_insert_page(stru
 		if (is_device_private_page(page)) {
 			swp_entry_t swp_entry;
 
-			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+			if (vma->vm_flags & VM_WRITE)
+				swp_entry = make_writable_device_private_entry(
+							page_to_pfn(page));
+			else
+				swp_entry = make_readable_device_private_entry(
+							page_to_pfn(page));
 			entry = swp_entry_to_pte(swp_entry);
 		} else {
 			/*
--- a/mm/mprotect.c~mm-swapops-rework-swap-entry-manipulation-code
+++ a/mm/mprotect.c
@@ -143,23 +143,25 @@ static unsigned long change_pte_range(st
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 			pte_t newpte;
 
-			if (is_write_migration_entry(entry)) {
+			if (is_writable_migration_entry(entry)) {
 				/*
 				 * A protection check is difficult so
 				 * just be safe and disable write
 				 */
-				make_migration_entry_read(&entry);
+				entry = make_readable_migration_entry(
+							swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_soft_dirty(oldpte))
 					newpte = pte_swp_mksoft_dirty(newpte);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
-			} else if (is_write_device_private_entry(entry)) {
+			} else if (is_writable_device_private_entry(entry)) {
 				/*
 				 * We do not preserve soft-dirtiness. See
 				 * copy_one_pte() for explanation.
 				 */
-				make_device_private_entry_read(&entry);
+				entry = make_readable_device_private_entry(
+							swp_offset(entry));
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
--- a/mm/rmap.c~mm-swapops-rework-swap-entry-manipulation-code
+++ a/mm/rmap.c
@@ -1533,7 +1533,7 @@ static bool try_to_unmap_one(struct page
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			entry = make_migration_entry(page, 0);
+			entry = make_readable_migration_entry(page_to_pfn(page));
 			swp_pte = swp_entry_to_pte(entry);
 
 			/*
@@ -1629,8 +1629,12 @@ static bool try_to_unmap_one(struct page
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			entry = make_migration_entry(subpage,
-					pte_write(pteval));
+			if (pte_write(pteval))
+				entry = make_writable_migration_entry(
+							page_to_pfn(subpage));
+			else
+				entry = make_readable_migration_entry(
+							page_to_pfn(subpage));
 			swp_pte = swp_entry_to_pte(entry);
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 132/192] mm/rmap: split try_to_munlock from try_to_unmap
  2021-07-01  1:46 incoming Andrew Morton
                   ` (130 preceding siblings ...)
  2021-07-01  1:54 ` [patch 131/192] mm/swapops: rework swap entry manipulation code Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 133/192] mm/rmap: split migration into its own function Andrew Morton
                   ` (60 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: mm/rmap: split try_to_munlock from try_to_unmap

The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used in
different combinations.

TTU_MUNLOCK is one such flag.  However, it is exclusively used by
try_to_munlock(), which specifies no other flags.  Therefore, rather than
overloading try_to_unmap_one() with unrelated behaviour, split this out
into its own function and remove the flag.
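
At call sites the change is a straight rename, as in the mm/mlock.c hunk
below:

	/* Before: funnelled through try_to_unmap_one() via TTU_MUNLOCK. */
	try_to_munlock(page);

	/* After: a dedicated reverse map walker, no flag plumbing. */
	page_mlock(page);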

Link: https://lkml.kernel.org/r/20210616105937.23201-4-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/unevictable-lru.rst |   33 ++++--------
 include/linux/rmap.h                 |    3 -
 mm/mlock.c                           |   12 ++--
 mm/rmap.c                            |   66 ++++++++++++++++++-------
 4 files changed, 69 insertions(+), 45 deletions(-)

--- a/Documentation/vm/unevictable-lru.rst~mm-rmap-split-try_to_munlock-from-try_to_unmap
+++ a/Documentation/vm/unevictable-lru.rst
@@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that
 mlocked pages.  Note, however, that at this point we haven't checked whether
 the page is mapped by other VM_LOCKED VMAs.
 
-We can't call try_to_munlock(), the function that walks the reverse map to
+We can't call page_mlock(), the function that walks the reverse map to
 check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
-try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+page_mlock() is a variant of try_to_unmap() and thus requires that the page
 not be on an LRU list [more on these below].  However, the call to
-isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  So,
+isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
 we go ahead and clear PG_mlocked up front, as this might be the only chance we
-have.  If we can successfully isolate the page, we go ahead and
-try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+have.  If we can successfully isolate the page, we go ahead and call
+page_mlock(), which will restore the PG_mlocked flag and update the zone
 page statistics if it finds another VMA holding the page mlocked.  If we fail
 to isolate the page, we'll have left a potentially mlocked page on the LRU.
 This is fine, because we'll catch it later if and if vmscan tries to reclaim
@@ -545,31 +545,24 @@ munlock or munmap system calls, mm teard
 holepunching, and truncation of file pages and their anonymous COWed pages.
 
 
-try_to_munlock() Reverse Map Scan
+page_mlock() Reverse Map Scan
 ---------------------------------
 
-.. warning::
-   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
-   page_referenced() reverse map walker.
-
 When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
 Handling <munlock_munlockall_handling>` above] tries to munlock a
 page, it needs to determine whether or not the page is mapped by any
 VM_LOCKED VMA without actually attempting to unmap all PTEs from the
 page.  For this purpose, the unevictable/mlock infrastructure
-introduced a variant of try_to_unmap() called try_to_munlock().
+introduced a variant of try_to_unmap() called page_mlock().
 
-try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
-mapped file and KSM pages with a flag argument specifying unlock versus unmap
-processing.  Again, these functions walk the respective reverse maps looking
-for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
-the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  This
-undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
+page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When
+such a VMA is found, the page is mlocked via mlock_vma_page(). This undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page().
 
-Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+Note that page_mlock()'s reverse map walk must visit every VMA in a page's
 reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
 However, the scan can terminate when it encounters a VM_LOCKED VMA.
-Although try_to_munlock() might be called a great many times when munlocking a
+Although page_mlock() might be called a great many times when munlocking a
 large region or tearing down a large address space that has been mlocked via
 mlockall(), overall this is a fairly rare event.
 
@@ -602,7 +595,7 @@ inactive lists to the appropriate node's
 shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
 after shrink_active_list() had moved them to the inactive list, or pages mapped
 into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
-recheck via try_to_munlock().  shrink_inactive_list() won't notice the latter,
+recheck via page_mlock().  shrink_inactive_list() won't notice the latter,
 but will pass on to shrink_page_list().
 
 shrink_page_list() again culls obviously unevictable pages that it could
--- a/include/linux/rmap.h~mm-rmap-split-try_to_munlock-from-try_to_unmap
+++ a/include/linux/rmap.h
@@ -87,7 +87,6 @@ struct anon_vma_chain {
 
 enum ttu_flags {
 	TTU_MIGRATION		= 0x1,	/* migration mode */
-	TTU_MUNLOCK		= 0x2,	/* munlock mode */
 
 	TTU_SPLIT_HUGE_PMD	= 0x4,	/* split huge PMD if any */
 	TTU_IGNORE_MLOCK	= 0x8,	/* ignore mlock */
@@ -240,7 +239,7 @@ int page_mkclean(struct page *);
  * called in munlock()/munmap() path to check for other vmas holding
  * the page mlocked.
  */
-void try_to_munlock(struct page *);
+void page_mlock(struct page *page);
 
 void remove_migration_ptes(struct page *old, struct page *new, bool locked);
 
--- a/mm/mlock.c~mm-rmap-split-try_to_munlock-from-try_to_unmap
+++ a/mm/mlock.c
@@ -108,7 +108,7 @@ void mlock_vma_page(struct page *page)
 /*
  * Finish munlock after successful page isolation
  *
- * Page must be locked. This is a wrapper for try_to_munlock()
+ * Page must be locked. This is a wrapper for page_mlock()
  * and putback_lru_page() with munlock accounting.
  */
 static void __munlock_isolated_page(struct page *page)
@@ -118,7 +118,7 @@ static void __munlock_isolated_page(stru
 	 * and we don't need to check all the other vmas.
 	 */
 	if (page_mapcount(page) > 1)
-		try_to_munlock(page);
+		page_mlock(page);
 
 	/* Did page_mlock() succeed or punt? */
 	if (!PageMlocked(page))
@@ -158,7 +158,7 @@ static void __munlock_isolation_failed(s
  * munlock()ed or munmap()ed, we want to check whether other vmas hold the
  * page locked so that we can leave it on the unevictable lru list and not
  * bother vmscan with it.  However, to walk the page's rmap list in
- * try_to_munlock() we must isolate the page from the LRU.  If some other
+ * page_mlock() we must isolate the page from the LRU.  If some other
  * task has removed the page from the LRU, we won't be able to do that.
  * So we clear the PageMlocked as we might not get another chance.  If we
  * can't isolate the page, we leave it for putback_lru_page() and vmscan
@@ -168,7 +168,7 @@ unsigned int munlock_vma_page(struct pag
 {
 	int nr_pages;
 
-	/* For try_to_munlock() and to serialize with page migration */
+	/* For page_mlock() and to serialize with page migration */
 	BUG_ON(!PageLocked(page));
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -205,7 +205,7 @@ static int __mlock_posix_error_return(lo
  *
  * The fast path is available only for evictable pages with single mapping.
  * Then we can bypass the per-cpu pvec and get better performance.
- * when mapcount > 1 we need try_to_munlock() which can fail.
+ * when mapcount > 1 we need page_mlock() which can fail.
  * when !page_evictable(), we need the full redo logic of putback_lru_page to
  * avoid leaving evictable page in unevictable list.
  *
@@ -414,7 +414,7 @@ static unsigned long __munlock_pagevec_f
  *
  * We don't save and restore VM_LOCKED here because pages are
  * still on lru.  In unmap path, pages might be scanned by reclaim
- * and re-mlocked by try_to_{munlock|unmap} before we unmap and
+ * and re-mlocked by page_mlock/try_to_unmap before we unmap and
  * free them.  This will result in freeing mlocked pages.
  */
 void munlock_vma_pages_range(struct vm_area_struct *vma,
--- a/mm/rmap.c~mm-rmap-split-try_to_munlock-from-try_to_unmap
+++ a/mm/rmap.c
@@ -1411,10 +1411,6 @@ static bool try_to_unmap_one(struct page
 	if (flags & TTU_SYNC)
 		pvmw.flags = PVMW_SYNC;
 
-	/* munlock has nothing to gain from examining un-locked vmas */
-	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
-		return true;
-
 	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
 	    is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
@@ -1476,8 +1472,6 @@ static bool try_to_unmap_one(struct page
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
-			if (flags & TTU_MUNLOCK)
-				continue;
 		}
 
 		/* Unexpected PMD-mapped THP? */
@@ -1790,20 +1784,58 @@ void try_to_unmap(struct page *page, enu
 		rmap_walk(page, &rwc);
 }
 
+/*
+ * Walks the VMAs mapping a page and mlocks the page if any VM_LOCKED VMAs
+ * are found. Once one is found the page is mlocked and the scan can be
+ * terminated.
+ */
+static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
+				 unsigned long address, void *unused)
+{
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+
+	/* An un-locked vma doesn't have any pages to lock, continue the scan */
+	if (!(vma->vm_flags & VM_LOCKED))
+		return true;
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		/*
+		 * Need to recheck under the ptl to serialise with
+		 * __munlock_pagevec_fill() after VM_LOCKED is cleared in
+		 * munlock_vma_pages_range().
+		 */
+		if (vma->vm_flags & VM_LOCKED) {
+			/* PTE-mapped THP are never mlocked */
+			if (!PageTransCompound(page))
+				mlock_vma_page(page);
+			page_vma_mapped_walk_done(&pvmw);
+		}
+
+		/*
+		 * No need to continue scanning other VMAs if the page has
+		 * been mlocked.
+		 */
+		return false;
+	}
+
+	return true;
+}
+
 /**
- * try_to_munlock - try to munlock a page
- * @page: the page to be munlocked
+ * page_mlock - try to mlock a page
+ * @page: the page to be mlocked
  *
- * Called from munlock code.  Checks all of the VMAs mapping the page
- * to make sure nobody else has this page mlocked. The page will be
- * returned with PG_mlocked cleared if no other vmas have it mlocked.
+ * Called from munlock code. Checks all of the VMAs mapping the page and mlocks
+ * the page if any VM_LOCKED VMAs are found. The page will be returned with
+ * PG_mlocked cleared if it is not mapped by any VM_LOCKED VMAs.
  */
-
-void try_to_munlock(struct page *page)
+void page_mlock(struct page *page)
 {
 	struct rmap_walk_control rwc = {
-		.rmap_one = try_to_unmap_one,
-		.arg = (void *)TTU_MUNLOCK,
+		.rmap_one = page_mlock_one,
 		.done = page_not_mapped,
 		.anon_lock = page_lock_anon_vma_read,
 
@@ -1855,7 +1887,7 @@ static struct anon_vma *rmap_walk_anon_l
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the anon_vma struct it points to.
  *
- * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
+ * When called from page_mlock(), the mmap_lock of the mm containing the vma
  * where the page was found will be held for write.  So, we won't recheck
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * LOCKED.
@@ -1908,7 +1940,7 @@ static void rmap_walk_anon(struct page *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
  *
- * When called from try_to_munlock(), the mmap_lock of the mm containing the vma
+ * When called from page_mlock(), the mmap_lock of the mm containing the vma
  * where the page was found will be held for write.  So, we won't recheck
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * LOCKED.
_
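
For orientation, a minimal editorial sketch of how the munlock path drives
page_mlock() after this change; accounting details are elided and the
function name is invented for illustration:

	/* Sketch: munlock a page already isolated from the LRU. */
	static void munlock_isolated_page_sketch(struct page *page)
	{
		/*
		 * With a single mapping no other VMA can hold the page
		 * mlocked, so the reverse map walk can be skipped.
		 */
		if (page_mapcount(page) > 1)
			page_mlock(page); /* re-sets PG_mlocked if a VM_LOCKED VMA maps it */

		/*
		 * Put the page back on the LRU either way; vmscan moves it
		 * to the unevictable list if PG_mlocked was restored.
		 */
		putback_lru_page(page);
	}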

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 133/192] mm/rmap: split migration into its own function
  2021-07-01  1:46 incoming Andrew Morton
                   ` (131 preceding siblings ...)
  2021-07-01  1:54 ` [patch 132/192] mm/rmap: split try_to_munlock from try_to_unmap Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 134/192] mm: rename migrate_pgmap_owner Andrew Morton
                   ` (59 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: mm/rmap: split migration into its own function

Migration is currently implemented as a mode of operation for
try_to_unmap_one(), generally specified by passing the TTU_MIGRATION flag
or, in the case of splitting a huge anonymous page, TTU_SPLIT_FREEZE.

However, migration does not have much in common with the rest of the unmap
functionality of try_to_unmap_one(), so splitting it into a separate
function reduces the complexity of try_to_unmap_one() and makes it more
readable.

Several simplifications can also be made in try_to_migrate_one() based on
the following observations:

 - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
 - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
 - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page.  This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
for PageAnon or try_to_unmap().
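
To make the new calling convention concrete, here is a minimal sketch
mirroring the unmap_page() hunk below (the function name is illustrative):

	/* Sketch: THP split chooses the rmap walker by page type. */
	static void unmap_for_split_sketch(struct page *head)
	{
		enum ttu_flags flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
			TTU_SYNC;

		if (PageAnon(head))
			/* anon pages need migration entries to preserve them */
			try_to_migrate(head, flags);
		else
			/* file pages can simply be faulted back on demand */
			try_to_unmap(head, flags | TTU_IGNORE_MLOCK);
	}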

Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/rmap.h |    4 
 mm/huge_memory.c     |   16 +
 mm/migrate.c         |    9 -
 mm/rmap.c            |  367 ++++++++++++++++++++++++++++++-----------
 4 files changed, 289 insertions(+), 107 deletions(-)

--- a/include/linux/rmap.h~mm-rmap-split-migration-into-its-own-function
+++ a/include/linux/rmap.h
@@ -86,8 +86,6 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-	TTU_MIGRATION		= 0x1,	/* migration mode */
-
 	TTU_SPLIT_HUGE_PMD	= 0x4,	/* split huge PMD if any */
 	TTU_IGNORE_MLOCK	= 0x8,	/* ignore mlock */
 	TTU_SYNC		= 0x10,	/* avoid racy checks with PVMW_SYNC */
@@ -97,7 +95,6 @@ enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
-	TTU_SPLIT_FREEZE	= 0x100,		/* freeze pte under splitting thp */
 };
 
 #ifdef CONFIG_MMU
@@ -194,6 +191,7 @@ static inline void page_dup_rmap(struct
 int page_referenced(struct page *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
 
+void try_to_migrate(struct page *page, enum ttu_flags flags);
 void try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
--- a/mm/huge_memory.c~mm-rmap-split-migration-into-its-own-function
+++ a/mm/huge_memory.c
@@ -2309,16 +2309,20 @@ void vma_adjust_trans_huge(struct vm_are
 
 static void unmap_page(struct page *page)
 {
-	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC |
-		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
+		TTU_SYNC;
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
-	/* If TTU_SPLIT_FREEZE is ever extended to file, update remap_page() */
+	/*
+	 * Anon pages need migration entries to preserve them, but file
+	 * pages can simply be left unmapped, then faulted back on demand.
+	 * If that is ever changed (perhaps for mlock), update remap_page().
+	 */
 	if (PageAnon(page))
-		ttu_flags |= TTU_SPLIT_FREEZE;
-
-	try_to_unmap(page, ttu_flags);
+		try_to_migrate(page, ttu_flags);
+	else
+		try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
 
 	VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);
 }
--- a/mm/migrate.c~mm-rmap-split-migration-into-its-own-function
+++ a/mm/migrate.c
@@ -1109,7 +1109,7 @@ static int __unmap_and_move(struct page
 		/* Establish migration ptes */
 		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
 				page);
-		try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
+		try_to_migrate(page, 0);
 		page_was_mapped = 1;
 	}
 
@@ -1311,7 +1311,7 @@ static int unmap_and_move_huge_page(new_
 
 	if (page_mapped(hpage)) {
 		bool mapping_locked = false;
-		enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK;
+		enum ttu_flags ttu = 0;
 
 		if (!PageAnon(hpage)) {
 			/*
@@ -1328,7 +1328,7 @@ static int unmap_and_move_huge_page(new_
 			ttu |= TTU_RMAP_LOCKED;
 		}
 
-		try_to_unmap(hpage, ttu);
+		try_to_migrate(hpage, ttu);
 		page_was_mapped = 1;
 
 		if (mapping_locked)
@@ -2602,7 +2602,6 @@ static void migrate_vma_prepare(struct m
  */
 static void migrate_vma_unmap(struct migrate_vma *migrate)
 {
-	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK;
 	const unsigned long npages = migrate->npages;
 	const unsigned long start = migrate->start;
 	unsigned long addr, i, restore = 0;
@@ -2614,7 +2613,7 @@ static void migrate_vma_unmap(struct mig
 			continue;
 
 		if (page_mapped(page)) {
-			try_to_unmap(page, flags);
+			try_to_migrate(page, 0);
 			if (page_mapped(page))
 				goto restore;
 		}
--- a/mm/rmap.c~mm-rmap-split-migration-into-its-own-function
+++ a/mm/rmap.c
@@ -1411,14 +1411,8 @@ static bool try_to_unmap_one(struct page
 	if (flags & TTU_SYNC)
 		pvmw.flags = PVMW_SYNC;
 
-	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
-	    is_zone_device_page(page) && !is_device_private_page(page))
-		return true;
-
-	if (flags & TTU_SPLIT_HUGE_PMD) {
-		split_huge_pmd_address(vma, address,
-				flags & TTU_SPLIT_FREEZE, page);
-	}
+	if (flags & TTU_SPLIT_HUGE_PMD)
+		split_huge_pmd_address(vma, address, false, page);
 
 	/*
 	 * For THP, we have to assume the worse case ie pmd for invalidation.
@@ -1443,16 +1437,6 @@ static bool try_to_unmap_one(struct page
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-		/* PMD-mapped THP migration entry */
-		if (!pvmw.pte && (flags & TTU_MIGRATION)) {
-			VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
-
-			set_pmd_migration_entry(&pvmw, page);
-			continue;
-		}
-#endif
-
 		/*
 		 * If the page is mlock()d, we cannot swap it out.
 		 * If it's recently referenced (perhaps page_referenced
@@ -1514,46 +1498,6 @@ static bool try_to_unmap_one(struct page
 			}
 		}
 
-		if (IS_ENABLED(CONFIG_MIGRATION) &&
-		    (flags & TTU_MIGRATION) &&
-		    is_zone_device_page(page)) {
-			swp_entry_t entry;
-			pte_t swp_pte;
-
-			pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
-
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			entry = make_readable_migration_entry(page_to_pfn(page));
-			swp_pte = swp_entry_to_pte(entry);
-
-			/*
-			 * pteval maps a zone device page and is therefore
-			 * a swap pte.
-			 */
-			if (pte_swp_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_swp_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 *
-			 * The assignment to subpage above was computed from a
-			 * swap PTE which results in an invalid pointer.
-			 * Since only PAGE_SIZE pages can currently be
-			 * migrated, just set it to page. This will need to be
-			 * changed when hugepage migrations to device private
-			 * memory are supported.
-			 */
-			subpage = page;
-			goto discard;
-		}
-
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
 		if (should_defer_flush(mm, flags)) {
@@ -1606,39 +1550,6 @@ static bool try_to_unmap_one(struct page
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
-		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
-				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
-			swp_entry_t entry;
-			pte_t swp_pte;
-
-			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
-
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			if (pte_write(pteval))
-				entry = make_writable_migration_entry(
-							page_to_pfn(subpage));
-			else
-				entry = make_readable_migration_entry(
-							page_to_pfn(subpage));
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, address, pvmw.pte, swp_pte);
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 */
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1766,6 +1677,277 @@ void try_to_unmap(struct page *page, enu
 		.anon_lock = page_lock_anon_vma_read,
 	};
 
+	if (flags & TTU_RMAP_LOCKED)
+		rmap_walk_locked(page, &rwc);
+	else
+		rmap_walk(page, &rwc);
+}
+
+/*
+ * @arg: enum ttu_flags will be passed to this argument.
+ *
+ * If TTU_SPLIT_HUGE_PMD is specified, any PMD mappings will be split into
+ * PTEs containing migration entries. This, TTU_RMAP_LOCKED and TTU_SYNC are
+ * the only supported flags.
+ */
+static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
+		     unsigned long address, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+	pte_t pteval;
+	struct page *subpage;
+	bool ret = true;
+	struct mmu_notifier_range range;
+	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+
+	if (is_zone_device_page(page) && !is_device_private_page(page))
+		return true;
+
+	/*
+	 * When racing against e.g. zap_pte_range() on another cpu,
+	 * in between its ptep_get_and_clear_full() and page_remove_rmap(),
+	 * try_to_migrate() may return before page_mapped() has become false,
+	 * if page table locking is skipped: use TTU_SYNC to wait for that.
+	 */
+	if (flags & TTU_SYNC)
+		pvmw.flags = PVMW_SYNC;
+
+	/*
+	 * unmap_page() in mm/huge_memory.c is the only user of migration with
+	 * TTU_SPLIT_HUGE_PMD and it wants to freeze.
+	 */
+	if (flags & TTU_SPLIT_HUGE_PMD)
+		split_huge_pmd_address(vma, address, true, page);
+
+	/*
+	 * For THP, we have to assume the worst case, i.e. pmd, for
+	 * invalidation.  For hugetlb, it could be much worse if we need to
+	 * do pud invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page cannot be freed in this function, as the caller
+	 * of try_to_migrate() must hold a reference on the page.
+	 */
+	range.end = PageKsm(page) ?
+			address + PAGE_SIZE : vma_address_end(page, vma);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				address, range.end);
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &range.start,
+						     &range.end);
+	}
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+		/* PMD-mapped THP migration entry */
+		if (!pvmw.pte) {
+			VM_BUG_ON_PAGE(PageHuge(page) ||
+				       !PageTransCompound(page), page);
+
+			set_pmd_migration_entry(&pvmw, page);
+			continue;
+		}
+#endif
+
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		address = pvmw.address;
+
+		if (PageHuge(page) && !PageAnon(page)) {
+			/*
+			 * To call huge_pmd_unshare, i_mmap_rwsem must be
+			 * held in write mode.  Caller needs to explicitly
+			 * do this outside rmap routines.
+			 */
+			VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+			if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, range.start, range.end);
+				flush_tlb_range(vma, range.start, range.end);
+				mmu_notifier_invalidate_range(mm, range.start,
+							      range.end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
+
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+		/* Move the dirty bit to the page. Now the pte is gone. */
+		if (pte_dirty(pteval))
+			set_page_dirty(page);
+
+		/* Update high watermark before we lower rss */
+		update_hiwater_rss(mm);
+
+		if (is_zone_device_page(page)) {
+			swp_entry_t entry;
+			pte_t swp_pte;
+
+			/*
+			 * Store the pfn of the page in a special migration
+			 * pte. do_swap_page() will wait until the migration
+			 * pte is removed and then restart fault handling.
+			 */
+			entry = make_readable_migration_entry(
+							page_to_pfn(page));
+			swp_pte = swp_entry_to_pte(entry);
+
+			/*
+			 * pteval maps a zone device page and is therefore
+			 * a swap pte.
+			 */
+			if (pte_swp_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_swp_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here, it will synchronize
+			 * against the special swap migration pte.
+			 *
+			 * The assignment to subpage above was computed from a
+			 * swap PTE which results in an invalid pointer.
+			 * Since only PAGE_SIZE pages can currently be
+			 * migrated, just set it to page. This will need to be
+			 * changed when hugepage migrations to device private
+			 * memory are supported.
+			 */
+			subpage = page;
+		} else if (PageHWPoison(page)) {
+			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+			if (PageHuge(page)) {
+				hugetlb_count_sub(compound_nr(page), mm);
+				set_huge_swap_pte_at(mm, address,
+						     pvmw.pte, pteval,
+						     vma_mmu_pagesize(vma));
+			} else {
+				dec_mm_counter(mm, mm_counter(page));
+				set_pte_at(mm, address, pvmw.pte, pteval);
+			}
+
+		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+			/*
+			 * The guest indicated that the page content is of no
+			 * interest anymore. Simply discard the pte, vmscan
+			 * will take care of the rest.
+			 * A future reference will then fault in a new zero
+			 * page. When userfaultfd is active, we must not drop
+			 * this page though, as its main user (postcopy
+			 * migration) will not expect userfaults on already
+			 * copied pages.
+			 */
+			dec_mm_counter(mm, mm_counter(page));
+			/* We have to invalidate as we cleared the pte */
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
+		} else {
+			swp_entry_t entry;
+			pte_t swp_pte;
+
+			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+				set_pte_at(mm, address, pvmw.pte, pteval);
+				ret = false;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+
+			/*
+			 * Store the pfn of the page in a special migration
+			 * pte. do_swap_page() will wait until the migration
+			 * pte is removed and then restart fault handling.
+			 */
+			if (pte_write(pteval))
+				entry = make_writable_migration_entry(
+							page_to_pfn(subpage));
+			else
+				entry = make_readable_migration_entry(
+							page_to_pfn(subpage));
+
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			if (pte_uffd_wp(pteval))
+				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			/*
+			 * No need to invalidate here, it will synchronize
+			 * against the special swap migration pte.
+			 */
+		}
+
+		/*
+		 * No need to call mmu_notifier_invalidate_range() as it has
+		 * been done above for all cases requiring it to happen under
+		 * page table lock before mmu_notifier_invalidate_range_end()
+		 *
+		 * See Documentation/vm/mmu_notifier.rst
+		 */
+		page_remove_rmap(subpage, PageHuge(page));
+		put_page(page);
+	}
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+/**
+ * try_to_migrate - try to replace all page table mappings with swap entries
+ * @page: the page to replace page table entries for
+ * @flags: action and flags
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special swap entries. Caller must hold the page lock.
+ *
+ * try_to_migrate() itself returns no value; callers can recheck
+ * page_mapped() to see whether all mappings were replaced.
+ */
+void try_to_migrate(struct page *page, enum ttu_flags flags)
+{
+	struct rmap_walk_control rwc = {
+		.rmap_one = try_to_migrate_one,
+		.arg = (void *)flags,
+		.done = page_not_mapped,
+		.anon_lock = page_lock_anon_vma_read,
+	};
+
+	/*
+	 * Migration always ignores mlock and only supports the TTU_RMAP_LOCKED,
+	 * TTU_SPLIT_HUGE_PMD and TTU_SYNC flags.
+	 */
+	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
+					TTU_SYNC)))
+		return;
+
 	/*
 	 * During exec, a temporary VMA is setup and later moved.
 	 * The VMA is moved under the anon_vma lock but not the
@@ -1774,8 +1956,7 @@ void try_to_unmap(struct page *page, enu
 	 * locking requirements of exec(), migration skips
 	 * temporary VMAs until after exec() completes.
 	 */
-	if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))
-	    && !PageKsm(page) && PageAnon(page))
+	if (!PageKsm(page) && PageAnon(page))
 		rwc.invalid_vma = invalid_migration_vma;
 
 	if (flags & TTU_RMAP_LOCKED)
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 134/192] mm: rename migrate_pgmap_owner
  2021-07-01  1:46 incoming Andrew Morton
                   ` (132 preceding siblings ...)
  2021-07-01  1:54 ` [patch 133/192] mm/rmap: split migration into its own function Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 135/192] mm/memory.c: allow different return codes for copy_nonpresent_pte() Andrew Morton
                   ` (58 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: mm: rename migrate_pgmap_owner

MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer.  This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events.  Other notifier event types
can also benefit from this filtering, so rename the 'migrate_pgmap_owner'
field to 'owner' and create a new notifier initialisation function to
initialise this field.
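
As an illustration, a sketch of a driver interval-notifier callback
filtering on the renamed field, along the lines of the nouveau and
test_hmm hunks below; 'my_dev' and the callback name are hypothetical:

	static void *my_dev;	/* hypothetical driver-private cookie */

	/* Sketch: skip invalidations this driver triggered itself. */
	static bool my_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
	{
		if (range->event == MMU_NOTIFY_MIGRATE && range->owner == my_dev)
			return true;	/* our own migration, nothing to do */

		mmu_interval_set_seq(mni, cur_seq);
		/* ... invalidate device mappings for range->start..end ... */
		return true;
	}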

Link: https://lkml.kernel.org/r/20210616105937.23201-6-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/hmm.rst              |    2 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c |    2 +-
 include/linux/mmu_notifier.h          |   20 ++++++++++----------
 lib/test_hmm.c                        |    2 +-
 mm/migrate.c                          |   10 +++++-----
 5 files changed, 18 insertions(+), 18 deletions(-)

--- a/Documentation/vm/hmm.rst~mm-rename-migrate_pgmap_owner
+++ a/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and
    walks to fill in the ``args->src`` array with PFNs to be migrated.
    The ``invalidate_range_start()`` callback is passed a
    ``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
   the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This
   allows the device driver to skip the invalidation callback and only
    invalidate device private MMU mappings that are actually migrating.
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c~mm-rename-migrate_pgmap_owner
+++ a/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(stru
 	 * the invalidation is handled as part of the migration process.
 	 */
 	if (update->event == MMU_NOTIFY_MIGRATE &&
-	    update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+	    update->owner == svmm->vmm->cli->drm->dev)
 		goto out;
 
 	if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
--- a/include/linux/mmu_notifier.h~mm-rename-migrate_pgmap_owner
+++ a/include/linux/mmu_notifier.h
@@ -41,7 +41,7 @@ struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
@@ -269,7 +269,7 @@ struct mmu_notifier_range {
 	unsigned long end;
 	unsigned flags;
 	enum mmu_notifier_event event;
-	void *migrate_pgmap_owner;
+	void *owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +521,14 @@ static inline void mmu_notifier_range_in
 	range->flags = flags;
 }
 
-static inline void mmu_notifier_range_init_migrate(
-			struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+			struct mmu_notifier_range *range,
+			enum mmu_notifier_event event, unsigned int flags,
 			struct vm_area_struct *vma, struct mm_struct *mm,
-			unsigned long start, unsigned long end, void *pgmap)
+			unsigned long start, unsigned long end, void *owner)
 {
-	mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
-				start, end);
-	range->migrate_pgmap_owner = pgmap;
+	mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+	range->owner = owner;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
@@ -655,8 +655,8 @@ static inline void _mmu_notifier_range_i
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
 	_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
-					pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+					end, owner) \
 	_mmu_notifier_range_init(range, start, end)
 
 static inline bool
--- a/lib/test_hmm.c~mm-rename-migrate_pgmap_owner
+++ a/lib/test_hmm.c
@@ -218,7 +218,7 @@ static bool dmirror_interval_invalidate(
 	 * the invalidation is handled as part of the migration process.
 	 */
 	if (range->event == MMU_NOTIFY_MIGRATE &&
-	    range->migrate_pgmap_owner == dmirror->mdevice)
+	    range->owner == dmirror->mdevice)
 		return true;
 
 	if (mmu_notifier_range_blockable(range))
--- a/mm/migrate.c~mm-rename-migrate_pgmap_owner
+++ a/mm/migrate.c
@@ -2416,8 +2416,8 @@ static void migrate_vma_collect(struct m
 	 * that the registered device driver can skip invalidating device
 	 * private page mappings that won't be migrated.
 	 */
-	mmu_notifier_range_init_migrate(&range, 0, migrate->vma,
-		migrate->vma->vm_mm, migrate->start, migrate->end,
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
+		migrate->vma, migrate->vma->vm_mm, migrate->start, migrate->end,
 		migrate->pgmap_owner);
 	mmu_notifier_invalidate_range_start(&range);
 
@@ -2927,9 +2927,9 @@ void migrate_vma_pages(struct migrate_vm
 			if (!notified) {
 				notified = true;
 
-				mmu_notifier_range_init_migrate(&range, 0,
-					migrate->vma, migrate->vma->vm_mm,
-					addr, migrate->end,
+				mmu_notifier_range_init_owner(&range,
+					MMU_NOTIFY_MIGRATE, 0, migrate->vma,
+					migrate->vma->vm_mm, addr, migrate->end,
 					migrate->pgmap_owner);
 				mmu_notifier_invalidate_range_start(&range);
 			}
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 135/192] mm/memory.c: allow different return codes for copy_nonpresent_pte()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (133 preceding siblings ...)
  2021-07-01  1:54 ` [patch 134/192] mm: rename migrate_pgmap_owner Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 136/192] mm: device exclusive memory access Andrew Morton
                   ` (57 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: mm/memory.c: allow different return codes for copy_nonpresent_pte()

Currently, if copy_nonpresent_pte() returns a non-zero value it is assumed
to be a swap entry which requires further processing outside the loop in
copy_pte_range() after dropping locks.  This prevents other values from
being returned to signal conditions such as failure, which a subsequent
change requires.

Instead make copy_nonpresent_pte() return an error code if further
processing is required and read the value for the swap entry in the main
loop under the ptl.
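
Condensed, the new contract looks as follows (a fragment of the
copy_pte_range() loop paraphrasing the hunk below, not new behaviour):

	ret = copy_nonpresent_pte(dst_mm, src_mm, dst_pte, src_pte,
				  dst_vma, src_vma, addr, rss);
	if (ret == -EIO) {
		/*
		 * swap_duplicate() needs a continuation: re-read the
		 * entry under the ptl and handle it after unlocking.
		 */
		entry = pte_to_swp_entry(*src_pte);
		break;
	}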

Link: https://lkml.kernel.org/r/20210616105937.23201-7-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |   28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

--- a/mm/memory.c~mm-memoryc-allow-different-return-codes-for-copy_nonpresent_pte
+++ a/mm/memory.c
@@ -717,7 +717,7 @@ copy_nonpresent_pte(struct mm_struct *ds
 
 	if (likely(!non_swap_entry(entry))) {
 		if (swap_duplicate(entry) < 0)
-			return entry.val;
+			return -EIO;
 
 		/* make sure dst_mm is on swapoff's mmlist. */
 		if (unlikely(list_empty(&dst_mm->mmlist))) {
@@ -973,12 +973,14 @@ again:
 			continue;
 		}
 		if (unlikely(!pte_present(*src_pte))) {
-			entry.val = copy_nonpresent_pte(dst_mm, src_mm,
-							dst_pte, src_pte,
-							dst_vma, src_vma,
-							addr, rss);
-			if (entry.val)
+			ret = copy_nonpresent_pte(dst_mm, src_mm,
+						  dst_pte, src_pte,
+						  dst_vma, src_vma,
+						  addr, rss);
+			if (ret == -EIO) {
+				entry = pte_to_swp_entry(*src_pte);
 				break;
+			}
 			progress += 8;
 			continue;
 		}
@@ -1011,20 +1013,24 @@ again:
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
 
-	if (entry.val) {
+	if (ret == -EIO) {
+		VM_WARN_ON_ONCE(!entry.val);
 		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
 			ret = -ENOMEM;
 			goto out;
 		}
 		entry.val = 0;
-	} else if (ret) {
-		WARN_ON_ONCE(ret != -EAGAIN);
+	} else if (ret ==  -EAGAIN) {
 		prealloc = page_copy_prealloc(src_mm, src_vma, addr);
 		if (!prealloc)
 			return -ENOMEM;
-		/* We've captured and resolved the error. Reset, try again. */
-		ret = 0;
+	} else if (ret) {
+		VM_WARN_ON_ONCE(1);
 	}
+
+	/* We've captured and resolved the error. Reset, try again. */
+	ret = 0;
+
 	if (addr != end)
 		goto again;
 out:
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 136/192] mm: device exclusive memory access
  2021-07-01  1:46 incoming Andrew Morton
                   ` (134 preceding siblings ...)
  2021-07-01  1:54 ` [patch 135/192] mm/memory.c: allow different return codes for copy_nonpresent_pte() Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 137/192] mm: selftests for exclusive device memory Andrew Morton
                   ` (56 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, dan.carpenter, hch, hughd, jgg, jhubbard,
	linux-mm, mm-commits, peterx, rcampbell, shakeelb, torvalds,
	willy

From: Alistair Popple <apopple@nvidia.com>
Subject: mm: device exclusive memory access

Some devices require exclusive write access to shared virtual memory (SVM)
ranges to perform atomic operations on that memory.  This requires CPU
page tables to be updated to deny access whilst atomic operations are
occurring.

In order to do this, introduce a new swap entry type
(SWP_DEVICE_EXCLUSIVE).  When an SVM range needs to be marked for exclusive
access by a device, all page table mappings for the particular range are
replaced with device exclusive swap entries.  This causes any CPU access
to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping.  This results in MMU notifiers being called, which a driver uses
to update access permissions, such as revoking atomic access.  After the
notifiers have been called the device will no longer have exclusive access
to the region.

Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk.  A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised.  However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.

[dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
  Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda
Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/hmm.rst     |   17 +++
 include/linux/mmu_notifier.h |    6 +
 include/linux/rmap.h         |    4 
 include/linux/swap.h         |    9 +
 include/linux/swapops.h      |   44 +++++++
 mm/hmm.c                     |    5 
 mm/memory.c                  |  127 +++++++++++++++++++++-
 mm/mprotect.c                |    8 +
 mm/page_vma_mapped.c         |    9 +
 mm/rmap.c                    |  186 +++++++++++++++++++++++++++++++++
 10 files changed, 405 insertions(+), 10 deletions(-)

--- a/Documentation/vm/hmm.rst~mm-device-exclusive-memory-access
+++ a/Documentation/vm/hmm.rst
@@ -405,6 +405,23 @@ between device driver specific code and
 
    The lock can now be released.
 
+Exclusive access memory
+=======================
+
+Some devices have features such as atomic PTE bits that can be used to implement
+atomic access to system memory. To support atomic operations to a shared virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resolved by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guaranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 ========================================
 
--- a/include/linux/mmu_notifier.h~mm-device-exclusive-memory-access
+++ a/include/linux/mmu_notifier.h
@@ -42,6 +42,11 @@ struct mmu_interval_notifier;
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
  * owner field matches the driver's device private pgmap owner.
+ *
+ * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
+ * longer have exclusive access to the page. When sent during creation of an
+ * exclusive range the owner will be initialised to the value provided by the
+ * caller of make_device_exclusive_range(), otherwise the owner will be NULL.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
@@ -51,6 +56,7 @@ enum mmu_notifier_event {
 	MMU_NOTIFY_SOFT_DIRTY,
 	MMU_NOTIFY_RELEASE,
 	MMU_NOTIFY_MIGRATE,
+	MMU_NOTIFY_EXCLUSIVE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
--- a/include/linux/rmap.h~mm-device-exclusive-memory-access
+++ a/include/linux/rmap.h
@@ -194,6 +194,10 @@ int page_referenced(struct page *, int i
 void try_to_migrate(struct page *page, enum ttu_flags flags);
 void try_to_unmap(struct page *, enum ttu_flags flags);
 
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *arg);
+
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
--- a/include/linux/swap.h~mm-device-exclusive-memory-access
+++ a/include/linux/swap.h
@@ -62,12 +62,17 @@ static inline int current_is_kswapd(void
  * migrate part of a process memory to device memory.
  *
  * When a page is migrated from CPU to device, we set the CPU page table entry
- * to a special SWP_DEVICE_* entry.
+ * to a special SWP_DEVICE_{READ|WRITE} entry.
+ *
+ * When a page is mapped by the device for exclusive access we set the CPU page
+ * table entries to special SWP_DEVICE_EXCLUSIVE_* entries.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_NUM 4
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
+#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
--- a/include/linux/swapops.h~mm-device-exclusive-memory-access
+++ a/include/linux/swapops.h
@@ -127,6 +127,27 @@ static inline bool is_writable_device_pr
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
+		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
+}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
@@ -147,6 +168,26 @@ static inline bool is_writable_device_pr
 {
 	return false;
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
@@ -226,7 +267,8 @@ static inline struct page *pfn_swap_entr
  */
 static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
-	return is_migration_entry(entry) || is_device_private_entry(entry);
+	return is_migration_entry(entry) || is_device_private_entry(entry) ||
+	       is_device_exclusive_entry(entry);
 }
 
 struct page_vma_mapped_walk;
--- a/mm/hmm.c~mm-device-exclusive-memory-access
+++ a/mm/hmm.c
@@ -26,6 +26,8 @@
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
+#include "internal.h"
+
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
@@ -271,6 +273,9 @@ static int hmm_vma_handle_pte(struct mm_
 		if (!non_swap_entry(entry))
 			goto fault;
 
+		if (is_device_exclusive_entry(entry))
+			goto fault;
+
 		if (is_migration_entry(entry)) {
 			pte_unmap(ptep);
 			hmm_vma_walk->last = addr;
--- a/mm/memory.c~mm-device-exclusive-memory-access
+++ a/mm/memory.c
@@ -699,6 +699,68 @@ out:
 }
 #endif
 
+static void restore_exclusive_pte(struct vm_area_struct *vma,
+				  struct page *page, unsigned long address,
+				  pte_t *ptep)
+{
+	pte_t pte;
+	swp_entry_t entry;
+
+	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+	if (pte_swp_soft_dirty(*ptep))
+		pte = pte_mksoft_dirty(pte);
+
+	entry = pte_to_swp_entry(*ptep);
+	if (pte_swp_uffd_wp(*ptep))
+		pte = pte_mkuffd_wp(pte);
+	else if (is_writable_device_exclusive_entry(entry))
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+	set_pte_at(vma->vm_mm, address, ptep, pte);
+
+	/*
+	 * No need to take a page reference as one was already
+	 * created when the swap entry was made.
+	 */
+	if (PageAnon(page))
+		page_add_anon_rmap(page, vma, address, false);
+	else
+		/*
+		 * Currently device exclusive access only supports anonymous
+		 * memory, so the entry shouldn't point to a file-backed page.
+		 */
+		WARN_ON_ONCE(!PageAnon(page));
+
+	if (vma->vm_flags & VM_LOCKED)
+		mlock_vma_page(page);
+
+	/*
+	 * No need to invalidate - it was non-present before. However
+	 * secondary CPUs may have mappings that need invalidating.
+	 */
+	update_mmu_cache(vma, address, ptep);
+}
+
+/*
+ * Tries to restore an exclusive pte if the page lock can be acquired without
+ * sleeping.
+ */
+static int
+try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	swp_entry_t entry = pte_to_swp_entry(*src_pte);
+	struct page *page = pfn_swap_entry_to_page(entry);
+
+	if (trylock_page(page)) {
+		restore_exclusive_pte(vma, page, addr, src_pte);
+		unlock_page(page);
+		return 0;
+	}
+
+	return -EBUSY;
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -780,6 +842,17 @@ copy_nonpresent_pte(struct mm_struct *ds
 				pte = pte_swp_mkuffd_wp(pte);
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
+	} else if (is_device_exclusive_entry(entry)) {
+		/*
+		 * Make device exclusive entries present by restoring the
+		 * original entry then copying as for a present pte. Device
+		 * exclusive entries currently only support private writable
+		 * (ie. COW) mappings.
+		 */
+		VM_BUG_ON(!is_cow_mapping(src_vma->vm_flags));
+		if (try_restore_exclusive_pte(src_pte, src_vma, addr))
+			return -EBUSY;
+		return -ENOENT;
 	}
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_swp_clear_uffd_wp(pte);
@@ -980,9 +1053,18 @@ again:
 			if (ret == -EIO) {
 				entry = pte_to_swp_entry(*src_pte);
 				break;
+			} else if (ret == -EBUSY) {
+				break;
+			} else if (!ret) {
+				progress += 8;
+				continue;
 			}
-			progress += 8;
-			continue;
+
+			/*
+			 * Device exclusive entry restored, continue by copying
+			 * the now present pte.
+			 */
+			WARN_ON_ONCE(ret != -ENOENT);
 		}
 		/* copy_present_pte() will clear `*prealloc' if consumed */
 		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
@@ -1020,6 +1102,8 @@ again:
 			goto out;
 		}
 		entry.val = 0;
+	} else if (ret == -EBUSY) {
+		goto out;
 	} else if (ret ==  -EAGAIN) {
 		prealloc = page_copy_prealloc(src_mm, src_vma, addr);
 		if (!prealloc)
@@ -1287,7 +1371,8 @@ again:
 		}
 
 		entry = pte_to_swp_entry(ptent);
-		if (is_device_private_entry(entry)) {
+		if (is_device_private_entry(entry) ||
+		    is_device_exclusive_entry(entry)) {
 			struct page *page = pfn_swap_entry_to_page(entry);
 
 			if (unlikely(details && details->check_mapping)) {
@@ -1303,7 +1388,10 @@ again:
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+
+			if (is_device_private_entry(entry))
+				page_remove_rmap(page, false);
+
 			put_page(page);
 			continue;
 		}
@@ -3352,6 +3440,34 @@ void unmap_mapping_range(struct address_
 EXPORT_SYMBOL(unmap_mapping_range);
 
 /*
+ * Restore a potential device exclusive pte to a working pte entry
+ */
+static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	struct vm_area_struct *vma = vmf->vma;
+	struct mmu_notifier_range range;
+
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+		return VM_FAULT_RETRY;
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+				vma->vm_mm, vmf->address & PAGE_MASK,
+				(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
+	mmu_notifier_invalidate_range_start(&range);
+
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+				&vmf->ptl);
+	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+		restore_exclusive_pte(vma, page, vmf->address, vmf->pte);
+
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	unlock_page(page);
+
+	mmu_notifier_invalidate_range_end(&range);
+	return 0;
+}
+
+/*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with pte unmapped and unlocked.
@@ -3379,6 +3495,9 @@ vm_fault_t do_swap_page(struct vm_fault
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_exclusive_entry(entry)) {
+			vmf->page = pfn_swap_entry_to_page(entry);
+			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
--- a/mm/mprotect.c~mm-device-exclusive-memory-access
+++ a/mm/mprotect.c
@@ -165,6 +165,14 @@ static unsigned long change_pte_range(st
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
+			} else if (is_writable_device_exclusive_entry(entry)) {
+				entry = make_readable_device_exclusive_entry(
+							swp_offset(entry));
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				if (pte_swp_uffd_wp(oldpte))
+					newpte = pte_swp_mkuffd_wp(newpte);
 			} else {
 				newpte = oldpte;
 			}
--- a/mm/page_vma_mapped.c~mm-device-exclusive-memory-access
+++ a/mm/page_vma_mapped.c
@@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapp
 
 				/* Handle un-addressable ZONE_DEVICE memory */
 				entry = pte_to_swp_entry(*pvmw->pte);
-				if (!is_device_private_entry(entry))
+				if (!is_device_private_entry(entry) &&
+				    !is_device_exclusive_entry(entry))
 					return false;
 			} else if (!pte_present(*pvmw->pte))
 				return false;
@@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_ma
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
 
-		if (!is_migration_entry(entry))
+		if (!is_migration_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_ma
 
 		/* Handle un-addressable ZONE_DEVICE memory */
 		entry = pte_to_swp_entry(*pvmw->pte);
-		if (!is_device_private_entry(entry))
+		if (!is_device_private_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
--- a/mm/rmap.c~mm-device-exclusive-memory-access
+++ a/mm/rmap.c
@@ -2028,6 +2028,192 @@ void page_mlock(struct page *page)
 	rmap_walk(page, &rwc);
 }
 
+#ifdef CONFIG_DEVICE_PRIVATE
+struct make_exclusive_args {
+	struct mm_struct *mm;
+	unsigned long address;
+	void *owner;
+	bool valid;
+};
+
+static bool page_make_device_exclusive_one(struct page *page,
+		struct vm_area_struct *vma, unsigned long address, void *priv)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+	struct make_exclusive_args *args = priv;
+	pte_t pteval;
+	struct page *subpage;
+	bool ret = true;
+	struct mmu_notifier_range range;
+	swp_entry_t entry;
+	pte_t swp_pte;
+
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+				      vma->vm_mm, address, min(vma->vm_end,
+				      address + page_size(page)), args->owner);
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+		if (!pte_present(*pvmw.pte)) {
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		address = pvmw.address;
+
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+		/* Move the dirty bit to the page. Now the pte is gone. */
+		if (pte_dirty(pteval))
+			set_page_dirty(page);
+
+		/*
+		 * Check that our target page is still mapped at the expected
+		 * address.
+		 */
+		if (args->mm == mm && args->address == address &&
+		    pte_write(pteval))
+			args->valid = true;
+
+		/*
+		 * Store the pfn of the page in a special device exclusive
+		 * pte. A CPU fault on the entry is resolved by restoring the
+		 * original mapping (see remove_device_exclusive_entry()).
+		 */
+		if (pte_write(pteval))
+			entry = make_writable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		else
+			entry = make_readable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		if (pte_uffd_wp(pteval))
+			swp_pte = pte_swp_mkuffd_wp(swp_pte);
+
+		set_pte_at(mm, address, pvmw.pte, swp_pte);
+
+		/*
+		 * There is a reference on the page for the swap entry which
+		 * has been removed, so we shouldn't take another.
+		 */
+		page_remove_rmap(subpage, false);
+	}
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+/**
+ * page_make_device_exclusive - mark the page exclusively owned by a device
+ * @page: the page to replace page table entries for
+ * @mm: the mm_struct where the page is expected to be mapped
+ * @address: address where the page is expected to be mapped
+ * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special device exclusive swap entries to grant a device
+ * exclusive access to the page. Caller must hold the page lock.
+ *
+ * Returns false if the page is still mapped, or if it could not be unmapped
+ * from the expected address. Otherwise returns true (success).
+ */
+static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
+				unsigned long address, void *owner)
+{
+	struct make_exclusive_args args = {
+		.mm = mm,
+		.address = address,
+		.owner = owner,
+		.valid = false,
+	};
+	struct rmap_walk_control rwc = {
+		.rmap_one = page_make_device_exclusive_one,
+		.done = page_not_mapped,
+		.anon_lock = page_lock_anon_vma_read,
+		.arg = &args,
+	};
+
+	/*
+	 * Restrict to anonymous pages for now to avoid potential writeback
+	 * issues. Also tail pages shouldn't be passed to rmap_walk so skip
+	 * those.
+	 */
+	if (!PageAnon(page) || PageTail(page))
+		return false;
+
+	rmap_walk(page, &rwc);
+
+	return args.valid && !page_mapcount(page);
+}
+
+/**
+ * make_device_exclusive_range() - Mark a range for exclusive use by a device
+ * @mm: mm_struct of the associated target process
+ * @start: start of the region to mark for exclusive device access
+ * @end: end address of region
+ * @pages: returns the pages which were successfully marked for exclusive access
+ * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
+ *
+ * Returns: number of pages found in the range by GUP. A page is marked for
+ * exclusive access only if the page pointer is non-NULL.
+ *
+ * This function finds the ptes mapping page(s) in the given address range,
+ * locks the pages and replaces the mappings with special device exclusive
+ * swap entries that prevent CPU access from userspace. On fault these entries
+ * are replaced with the original mapping after calling MMU notifiers.
+ *
+ * A driver using this to program access from a device must use an mmu
+ * notifier critical section to hold a device specific lock during
+ * programming. Once programming is complete it should drop the page lock and
+ * reference, after which any CPU access to the page will revoke it.
+ */
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *owner)
+{
+	long npages = (end - start) >> PAGE_SHIFT;
+	long i;
+
+	npages = get_user_pages_remote(mm, start, npages,
+				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+				       pages, NULL, NULL);
+	if (npages < 0)
+		return npages;
+
+	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
+		if (!trylock_page(pages[i])) {
+			put_page(pages[i]);
+			pages[i] = NULL;
+			continue;
+		}
+
+		if (!page_make_device_exclusive(pages[i], mm, start, owner)) {
+			unlock_page(pages[i]);
+			put_page(pages[i]);
+			pages[i] = NULL;
+		}
+	}
+
+	return npages;
+}
+EXPORT_SYMBOL_GPL(make_device_exclusive_range);
+#endif
+
 void __put_anon_vma(struct anon_vma *anon_vma)
 {
 	struct anon_vma *root = anon_vma->root;
_
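
The kernel-doc above states the driver-side contract in prose; the
following is a minimal, hypothetical sketch of that usage pattern.  The
struct my_dev, my_dev_map_page() and dev->lock names are illustrative
stand-ins, not real APIs, and the interval-notifier retry loop a
production driver needs (see the nouveau patch 139 later in this
series) is elided for brevity:

static int my_dev_make_page_exclusive(struct my_dev *dev,
				      struct mm_struct *mm,
				      unsigned long addr)
{
	struct page *page;
	int ret;

	mmap_read_lock(mm);
	ret = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
					  &page, dev);
	mmap_read_unlock(mm);
	if (ret <= 0 || !page)
		return -EBUSY;		/* entry could not be replaced */

	/*
	 * Program the device under a driver lock which the driver's MMU
	 * notifier also takes, so a concurrent invalidation cannot race
	 * with the device mapping being established.
	 */
	mutex_lock(&dev->lock);
	my_dev_map_page(dev, addr, page);	/* illustrative stand-in */
	mutex_unlock(&dev->lock);

	/*
	 * Drop the page lock and reference taken on our behalf; from this
	 * point any CPU access to the page revokes the exclusive access.
	 */
	unlock_page(page);
	put_page(page);
	return 0;
}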

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 137/192] mm: selftests for exclusive device memory
  2021-07-01  1:46 incoming Andrew Morton
                   ` (135 preceding siblings ...)
  2021-07-01  1:54 ` [patch 136/192] mm: device exclusive memory access Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 138/192] nouveau/svm: refactor nouveau_range_fault Andrew Morton
                   ` (55 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: mm: selftests for exclusive device memory

Adds some selftests for exclusive device memory.

Link: https://lkml.kernel.org/r/20210616105937.23201-9-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_hmm.c                         |  125 ++++++++++++++++++
 lib/test_hmm_uapi.h                    |    2 
 tools/testing/selftests/vm/hmm-tests.c |  158 +++++++++++++++++++++++
 3 files changed, 285 insertions(+)

--- a/lib/test_hmm.c~mm-selftests-for-exclusive-device-memory
+++ a/lib/test_hmm.c
@@ -25,6 +25,7 @@
 #include <linux/swapops.h>
 #include <linux/sched/mm.h>
 #include <linux/platform_device.h>
+#include <linux/rmap.h>
 
 #include "test_hmm_uapi.h"
 
@@ -46,6 +47,7 @@ struct dmirror_bounce {
 	unsigned long		cpages;
 };
 
+#define DPT_XA_TAG_ATOMIC 1UL
 #define DPT_XA_TAG_WRITE 3UL
 
 /*
@@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_co
 	}
 }
 
+static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
+			     unsigned long end)
+{
+	unsigned long pfn;
+
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+		void *entry;
+		struct page *page;
+
+		entry = xa_load(&dmirror->pt, pfn);
+		page = xa_untag_pointer(entry);
+		if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC)
+			return -EPERM;
+	}
+
+	return 0;
+}
+
+static int dmirror_atomic_map(unsigned long start, unsigned long end,
+			      struct page **pages, struct dmirror *dmirror)
+{
+	unsigned long pfn, mapped = 0;
+	int i;
+
+	/* Map the migrated pages into the device's page tables. */
+	mutex_lock(&dmirror->mutex);
+
+	for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) {
+		void *entry;
+
+		if (!pages[i])
+			continue;
+
+		entry = pages[i];
+		entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
+		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+		if (xa_is_err(entry)) {
+			mutex_unlock(&dmirror->mutex);
+			return xa_err(entry);
+		}
+
+		mapped++;
+	}
+
+	mutex_unlock(&dmirror->mutex);
+	return mapped;
+}
+
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 					    struct dmirror *dmirror)
 {
@@ -661,6 +711,72 @@ static int dmirror_migrate_finalize_and_
 	return 0;
 }
 
+static int dmirror_exclusive(struct dmirror *dmirror,
+			     struct hmm_dmirror_cmd *cmd)
+{
+	unsigned long start, end, addr;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	struct mm_struct *mm = dmirror->notifier.mm;
+	struct page *pages[64];
+	struct dmirror_bounce bounce;
+	unsigned long next;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return -EINVAL;
+
+	mmap_read_lock(mm);
+	for (addr = start; addr < end; addr = next) {
+		unsigned long mapped;
+		int i;
+
+		if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
+			next = end;
+		else
+			next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);
+
+		ret = make_device_exclusive_range(mm, addr, next, pages, NULL);
+		mapped = dmirror_atomic_map(addr, next, pages, dmirror);
+		for (i = 0; i < ret; i++) {
+			if (pages[i]) {
+				unlock_page(pages[i]);
+				put_page(pages[i]);
+			}
+		}
+
+		if (addr + (mapped << PAGE_SHIFT) < next) {
+			mmap_read_unlock(mm);
+			mmput(mm);
+			return -EBUSY;
+		}
+	}
+	mmap_read_unlock(mm);
+	mmput(mm);
+
+	/* Return the migrated data for verification. */
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_read(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0) {
+		if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+				 bounce.size))
+			ret = -EFAULT;
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
 static int dmirror_migrate(struct dmirror *dmirror,
 			   struct hmm_dmirror_cmd *cmd)
 {
@@ -948,6 +1064,15 @@ static long dmirror_fops_unlocked_ioctl(
 		ret = dmirror_migrate(dmirror, &cmd);
 		break;
 
+	case HMM_DMIRROR_EXCLUSIVE:
+		ret = dmirror_exclusive(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_CHECK_EXCLUSIVE:
+		ret = dmirror_check_atomic(dmirror, cmd.addr,
+					cmd.addr + (cmd.npages << PAGE_SHIFT));
+		break;
+
 	case HMM_DMIRROR_SNAPSHOT:
 		ret = dmirror_snapshot(dmirror, &cmd);
 		break;
--- a/lib/test_hmm_uapi.h~mm-selftests-for-exclusive-device-memory
+++ a/lib/test_hmm_uapi.h
@@ -33,6 +33,8 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_EXCLUSIVE		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x05, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
--- a/tools/testing/selftests/vm/hmm-tests.c~mm-selftests-for-exclusive-device-memory
+++ a/tools/testing/selftests/vm/hmm-tests.c
@@ -1485,4 +1485,162 @@ TEST_F(hmm2, double_map)
 	hmm_buffer_free(buffer);
 }
 
+/*
+ * Basic check of exclusive faulting.
+ */
+TEST_F(hmm, exclusive)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Map memory exclusively for device access. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i]++, i);
+
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i+1);
+
+	/* Check atomic access revoked */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_CHECK_EXCLUSIVE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+
+	hmm_buffer_free(buffer);
+}
+
+TEST_F(hmm, exclusive_mprotect)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Map memory exclusively for device access. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, -EPERM);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Check copy-on-write works.
+ */
+TEST_F(hmm, exclusive_cow)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Map memory exclusively for device access. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	fork();
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i]++, i);
+
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i+1);
+
+	hmm_buffer_free(buffer);
+}
+
 TEST_HARNESS_MAIN
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 138/192] nouveau/svm: refactor nouveau_range_fault
  2021-07-01  1:46 incoming Andrew Morton
                   ` (136 preceding siblings ...)
  2021-07-01  1:54 ` [patch 137/192] mm: selftests for exclusive device memory Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 139/192] nouveau/svm: implement atomic SVM access Andrew Morton
                   ` (54 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: nouveau/svm: refactor nouveau_range_fault

Call mmu_interval_notifier_insert() as part of nouveau_range_fault(). 
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.

Link: https://lkml.kernel.org/r/20210616105937.23201-10-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/gpu/drm/nouveau/nouveau_svm.c |   34 ++++++++++++++----------
 1 file changed, 20 insertions(+), 14 deletions(-)

--- a/drivers/gpu/drm/nouveau/nouveau_svm.c~nouveau-svm-refactor-nouveau_range_fault
+++ a/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -567,18 +567,27 @@ static int nouveau_range_fault(struct no
 	unsigned long hmm_pfns[1];
 	struct hmm_range range = {
 		.notifier = &notifier->notifier,
-		.start = notifier->notifier.interval_tree.start,
-		.end = notifier->notifier.interval_tree.last + 1,
 		.default_flags = hmm_flags,
 		.hmm_pfns = hmm_pfns,
 		.dev_private_owner = drm->dev,
 	};
-	struct mm_struct *mm = notifier->notifier.mm;
+	struct mm_struct *mm = svmm->notifier.mm;
 	int ret;
 
+	ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+					args->p.addr, args->p.size,
+					&nouveau_svm_mni_ops);
+	if (ret)
+		return ret;
+
+	range.start = notifier->notifier.interval_tree.start;
+	range.end = notifier->notifier.interval_tree.last + 1;
+
 	while (true) {
-		if (time_after(jiffies, timeout))
-			return -EBUSY;
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
 
 		range.notifier_seq = mmu_interval_read_begin(range.notifier);
 		mmap_read_lock(mm);
@@ -587,7 +596,7 @@ static int nouveau_range_fault(struct no
 		if (ret) {
 			if (ret == -EBUSY)
 				continue;
-			return ret;
+			goto out;
 		}
 
 		mutex_lock(&svmm->mutex);
@@ -606,6 +615,9 @@ static int nouveau_range_fault(struct no
 	svmm->vmm->vmm.object.client->super = false;
 	mutex_unlock(&svmm->mutex);
 
+out:
+	mmu_interval_notifier_remove(&notifier->notifier);
+
 	return ret;
 }
 
@@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *no
 		}
 
 		notifier.svmm = svmm;
-		ret = mmu_interval_notifier_insert(&notifier.notifier, mm,
-						   args.i.p.addr, args.i.p.size,
-						   &nouveau_svm_mni_ops);
-		if (!ret) {
-			ret = nouveau_range_fault(svmm, svm->drm, &args.i,
-				sizeof(args), hmm_flags, &notifier);
-			mmu_interval_notifier_remove(&notifier.notifier);
-		}
+		ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+					sizeof(args), hmm_flags, &notifier);
 		mmput(mm);
 
 		limit = args.i.p.addr + args.i.p.size;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 139/192] nouveau/svm: implement atomic SVM access
  2021-07-01  1:46 incoming Andrew Morton
                   ` (137 preceding siblings ...)
  2021-07-01  1:54 ` [patch 138/192] nouveau/svm: refactor nouveau_range_fault Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 140/192] proc: Avoid mixing integer types in mem_rw() Andrew Morton
                   ` (53 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, apopple, bskeggs, hch, hughd, jgg, jhubbard, linux-mm,
	mm-commits, peterx, rcampbell, shakeelb, torvalds, willy

From: Alistair Popple <apopple@nvidia.com>
Subject: nouveau/svm: implement atomic SVM access

Some NVIDIA GPUs do not support direct atomic access to system memory via
PCIe.  Instead this must be emulated by granting the GPU exclusive access
to the memory.  This is achieved by replacing CPU page table entries with
special swap entries that fault on userspace access.

The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables.  When CPU access to the page is
required a CPU fault is raised which calls into the device driver via MMU
notifiers to revoke the atomic access.  The original page table entries
are then restored allowing CPU access to proceed.
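
As a sketch of the revocation side (hypothetical my_dev names; the real
nouveau callback is in the diff below), the driver's interval-notifier
invalidate callback filters out the MMU_NOTIFY_EXCLUSIVE event it
raised itself and revokes device access on everything else:

static bool my_dev_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct my_dev *dev = container_of(mni, struct my_dev, notifier);

	/*
	 * Our own make_device_exclusive_range() call: nothing to revoke
	 * while we are the one establishing the exclusive entry.
	 */
	if (range->event == MMU_NOTIFY_EXCLUSIVE && range->owner == dev)
		return true;

	if (!mmu_notifier_range_blockable(range))
		return false;

	/* Any other invalidation (e.g. a CPU fault) revokes device access. */
	mutex_lock(&dev->lock);
	mmu_interval_set_seq(mni, cur_seq);
	my_dev_unmap_range(dev, range->start, range->end); /* stand-in */
	mutex_unlock(&dev->lock);
	return true;
}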

Link: https://lkml.kernel.org/r/20210616105937.23201-11-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ben Skeggs <bskeggs@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/gpu/drm/nouveau/include/nvif/if000c.h      |    1 
 drivers/gpu/drm/nouveau/nouveau_svm.c              |  126 ++++++++++-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h      |    1 
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c |    6 
 4 files changed, 123 insertions(+), 11 deletions(-)

--- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h~nouveau-svm-implement-atomic-svm-access
+++ a/drivers/gpu/drm/nouveau/include/nvif/if000c.h
@@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 {
 #define NVIF_VMM_PFNMAP_V0_APER                           0x00000000000000f0ULL
 #define NVIF_VMM_PFNMAP_V0_HOST                           0x0000000000000000ULL
 #define NVIF_VMM_PFNMAP_V0_VRAM                           0x0000000000000010ULL
+#define NVIF_VMM_PFNMAP_V0_A				  0x0000000000000004ULL
 #define NVIF_VMM_PFNMAP_V0_W                              0x0000000000000002ULL
 #define NVIF_VMM_PFNMAP_V0_V                              0x0000000000000001ULL
 #define NVIF_VMM_PFNMAP_V0_NONE                           0x0000000000000000ULL
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c~nouveau-svm-implement-atomic-svm-access
+++ a/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include <linux/sched/mm.h>
 #include <linux/sort.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
 
 struct nouveau_svm {
 	struct nouveau_drm *drm;
@@ -67,6 +68,11 @@ struct nouveau_svm {
 	} buffer[1];
 };
 
+#define FAULT_ACCESS_READ 0
+#define FAULT_ACCESS_WRITE 1
+#define FAULT_ACCESS_ATOMIC 2
+#define FAULT_ACCESS_PREFETCH 3
+
 #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a)
 #define SVM_ERR(s,f,a...) NV_WARN((s)->drm, "svm: "f"\n", ##a)
 
@@ -412,6 +418,24 @@ nouveau_svm_fault_cancel_fault(struct no
 }
 
 static int
+nouveau_svm_fault_priority(u8 fault)
+{
+	switch (fault) {
+	case FAULT_ACCESS_PREFETCH:
+		return 0;
+	case FAULT_ACCESS_READ:
+		return 1;
+	case FAULT_ACCESS_WRITE:
+		return 2;
+	case FAULT_ACCESS_ATOMIC:
+		return 3;
+	default:
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+}
+
+static int
 nouveau_svm_fault_cmp(const void *a, const void *b)
 {
 	const struct nouveau_svm_fault *fa = *(struct nouveau_svm_fault **)a;
@@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, con
 		return ret;
 	if ((ret = (s64)fa->addr - fb->addr))
 		return ret;
-	/*XXX: atomic? */
-	return (fa->access == 0 || fa->access == 3) -
-	       (fb->access == 0 || fb->access == 3);
+	return nouveau_svm_fault_priority(fa->access) -
+		nouveau_svm_fault_priority(fb->access);
 }
 
 static void
@@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate
 	struct svm_notifier *sn =
 		container_of(mni, struct svm_notifier, notifier);
 
+	if (range->event == MMU_NOTIFY_EXCLUSIVE &&
+	    range->owner == sn->svmm->vmm->cli->drm->dev)
+		return true;
+
 	/*
 	 * serializes the update to mni->invalidate_seq done by caller and
 	 * prevents invalidation of the PTE from progressing while HW is being
@@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(stru
 		args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W;
 }
 
+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
+			       struct nouveau_drm *drm,
+			       struct nouveau_pfnmap_args *args, u32 size,
+			       struct svm_notifier *notifier)
+{
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	struct mm_struct *mm = svmm->notifier.mm;
+	struct page *page;
+	unsigned long start = args->p.addr;
+	unsigned long notifier_seq;
+	int ret = 0;
+
+	ret = mmu_interval_notifier_insert(&notifier->notifier, mm,
+					args->p.addr, args->p.size,
+					&nouveau_svm_mni_ops);
+	if (ret)
+		return ret;
+
+	while (true) {
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
+		mmap_read_lock(mm);
+		ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE,
+					    &page, drm->dev);
+		mmap_read_unlock(mm);
+		if (ret <= 0 || !page) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		mutex_lock(&svmm->mutex);
+		if (!mmu_interval_read_retry(&notifier->notifier,
+					     notifier_seq))
+			break;
+		mutex_unlock(&svmm->mutex);
+	}
+
+	/* Map the page on the GPU. */
+	args->p.page = 12;
+	args->p.size = PAGE_SIZE;
+	args->p.addr = start;
+	args->p.phys[0] = page_to_phys(page) |
+		NVIF_VMM_PFNMAP_V0_V |
+		NVIF_VMM_PFNMAP_V0_W |
+		NVIF_VMM_PFNMAP_V0_A |
+		NVIF_VMM_PFNMAP_V0_HOST;
+
+	svmm->vmm->vmm.object.client->super = true;
+	ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL);
+	svmm->vmm->vmm.object.client->super = false;
+	mutex_unlock(&svmm->mutex);
+
+	unlock_page(page);
+	put_page(page);
+
+out:
+	mmu_interval_notifier_remove(&notifier->notifier);
+	return ret;
+}
+
 static int nouveau_range_fault(struct nouveau_svmm *svmm,
 			       struct nouveau_drm *drm,
 			       struct nouveau_pfnmap_args *args, u32 size,
@@ -637,7 +729,7 @@ nouveau_svm_fault(struct nvif_notify *no
 	unsigned long hmm_flags;
 	u64 inst, start, limit;
 	int fi, fn;
-	int replay = 0, ret;
+	int replay = 0, atomic = 0, ret;
 
 	/* Parse available fault buffer entries into a cache, and update
 	 * the GET pointer so HW can reuse the entries.
@@ -718,12 +810,14 @@ nouveau_svm_fault(struct nvif_notify *no
 		/*
 		 * Determine required permissions based on GPU fault
 		 * access flags.
-		 * XXX: atomic?
 		 */
 		switch (buffer->fault[fi]->access) {
 		case 0: /* READ. */
 			hmm_flags = HMM_PFN_REQ_FAULT;
 			break;
+		case 2: /* ATOMIC. */
+			atomic = true;
+			break;
 		case 3: /* PREFETCH. */
 			hmm_flags = 0;
 			break;
@@ -739,8 +833,14 @@ nouveau_svm_fault(struct nvif_notify *no
 		}
 
 		notifier.svmm = svmm;
-		ret = nouveau_range_fault(svmm, svm->drm, &args.i,
-					sizeof(args), hmm_flags, &notifier);
+		if (atomic)
+			ret = nouveau_atomic_range_fault(svmm, svm->drm,
+							 &args.i, sizeof(args),
+							 &notifier);
+		else
+			ret = nouveau_range_fault(svmm, svm->drm, &args.i,
+						  sizeof(args), hmm_flags,
+						  &notifier);
 		mmput(mm);
 
 		limit = args.i.p.addr + args.i.p.size;
@@ -756,11 +856,15 @@ nouveau_svm_fault(struct nvif_notify *no
 			 */
 			if (buffer->fault[fn]->svmm != svmm ||
 			    buffer->fault[fn]->addr >= limit ||
-			    (buffer->fault[fi]->access == 0 /* READ. */ &&
+			    (buffer->fault[fi]->access == FAULT_ACCESS_READ &&
 			     !(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) ||
-			    (buffer->fault[fi]->access != 0 /* READ. */ &&
-			     buffer->fault[fi]->access != 3 /* PREFETCH. */ &&
-			     !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)))
+			    (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
+			     buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
+			     !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) ||
+			    (buffer->fault[fi]->access != FAULT_ACCESS_READ &&
+			     buffer->fault[fi]->access != FAULT_ACCESS_WRITE &&
+			     buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH &&
+			     !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A)))
 				break;
 		}
 
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c~nouveau-svm-implement-atomic-svm-access
+++ a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
@@ -88,6 +88,9 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm,
 		if (!(*map->pfn & NVKM_VMM_PFN_W))
 			data |= BIT_ULL(6); /* RO. */
 
+		if (!(*map->pfn & NVKM_VMM_PFN_A))
+			data |= BIT_ULL(7); /* Atomic disable. */
+
 		if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
 			addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
 			addr = dma_map_page(dev, pfn_to_page(addr), 0,
@@ -322,6 +325,9 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm,
 		if (!(*map->pfn & NVKM_VMM_PFN_W))
 			data |= BIT_ULL(6); /* RO. */
 
+		if (!(*map->pfn & NVKM_VMM_PFN_A))
+			data |= BIT_ULL(7); /* Atomic disable. */
+
 		if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) {
 			addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT;
 			addr = dma_map_page(dev, pfn_to_page(addr), 0,
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h~nouveau-svm-implement-atomic-svm-access
+++ a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
@@ -178,6 +178,7 @@ void nvkm_vmm_unmap_region(struct nvkm_v
 #define NVKM_VMM_PFN_APER                                 0x00000000000000f0ULL
 #define NVKM_VMM_PFN_HOST                                 0x0000000000000000ULL
 #define NVKM_VMM_PFN_VRAM                                 0x0000000000000010ULL
+#define NVKM_VMM_PFN_A					  0x0000000000000004ULL
 #define NVKM_VMM_PFN_W                                    0x0000000000000002ULL
 #define NVKM_VMM_PFN_V                                    0x0000000000000001ULL
 #define NVKM_VMM_PFN_NONE                                 0x0000000000000000ULL
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 140/192] proc: Avoid mixing integer types in mem_rw()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (138 preceding siblings ...)
  2021-07-01  1:54 ` [patch 139/192] nouveau/svm: implement atomic SVM access Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 141/192] fs/proc/kcore.c: add mmap interface Andrew Morton
                   ` (52 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: adobriyan, akpm, cascardo, christian.brauner, ddiss, deller,
	linux-mm, lstoakes, marcelo.cerri, mm-commits, oleg, torvalds,
	walken

From: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Subject: proc: Avoid mixing integer types in mem_rw()

Use size_t when capping the count argument received by mem_rw().  Since
count is size_t, min_t(int, ...) truncates it to int first, which can
produce a negative value that is later passed to access_remote_vm() and
can cause unexpected behavior.

Since the value is capped at PAGE_SIZE, the subsequent conversion from
size_t to int when passing it to access_remote_vm() as "len" is not a
problem.
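
To illustrate the truncation, a small standalone program (min_t() here
is simplified from the kernel's <linux/minmax.h>, which likewise casts
both operands to the requested type before comparing):

#include <limits.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

int main(void)
{
	size_t count = 0x80000000UL;	/* a 2 GiB read request */

	/* (int)count is INT_MIN on typical systems, so min_t(int, ...)
	 * selects the negative value instead of PAGE_SIZE. */
	printf("min_t(int, ...)    = %d\n", min_t(int, count, PAGE_SIZE));
	/* capping in size_t keeps the intended PAGE_SIZE bound */
	printf("min_t(size_t, ...) = %zu\n", min_t(size_t, count, PAGE_SIZE));
	return 0;
}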

Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com
Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Souza Cascardo <cascardo@canonical.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/base.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/proc/base.c~proc-avoid-mixing-integer-types-in-mem_rw
+++ a/fs/proc/base.c
@@ -854,7 +854,7 @@ static ssize_t mem_rw(struct file *file,
 	flags = FOLL_FORCE | (write ? FOLL_WRITE : 0);
 
 	while (count > 0) {
-		int this_len = min_t(int, count, PAGE_SIZE);
+		size_t this_len = min_t(size_t, count, PAGE_SIZE);
 
 		if (write && copy_from_user(page, buf, this_len)) {
 			copied = -EFAULT;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 141/192] fs/proc/kcore.c: add mmap interface
  2021-07-01  1:46 incoming Andrew Morton
                   ` (139 preceding siblings ...)
  2021-07-01  1:54 ` [patch 140/192] proc: Avoid mixing integer types in mem_rw() Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  3:32     ` Linus Torvalds
  2021-07-01  1:54 ` [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ Andrew Morton
                   ` (51 subsequent siblings)
  192 siblings, 1 reply; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: adobriyan, akpm, chenying.kernel, linux-mm, mm-commits, rppt,
	songmuchun, torvalds, zhouchengming, zhoufeng.zf

From: ZHOUFENG <zhoufeng.zf@bytedance.com>
Subject: fs/proc/kcore.c: add mmap interface

While monitoring the kernel we use DRGN
(https://github.com/osandov/drgn) to access kernel data structures and
found that it issues a very large number of system calls.  DRGN works by
reading /proc/kcore.  Looking at the kcore code, kcore does not implement
mmap, so every access goes through read() and triggers frequent context
switches.  Therefore, add an mmap interface to improve performance.
Since the vmalloc and module areas change with allocation and release,
consistency cannot be guaranteed there, so the mmap interface only maps
the KCORE_TEXT and KCORE_RAM areas.

The test results:
1. the default version of kcore
real 11.00
user 8.53
sys 3.59

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
99.64  128.578319          12  11168701           pread64
...
------ ----------- ----------- --------- --------- ----------------
100.00  129.042853              11193748       966 total

2. added kcore for the mmap interface
real 6.44
user 7.32
sys 0.24

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
32.94    0.130120          24      5317       315 futex
11.66    0.046077          21      2231         1 lstat
 9.23    0.036449         177       206           mmap
...
------ ----------- ----------- --------- --------- ----------------
100.00    0.395077                 25435       971 total

The test results show that the number of system calls and time consumption
are significantly reduced.
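
For reference, a minimal userspace sketch of the new interface.  A real
consumer such as drgn first parses the ELF program headers of
/proc/kcore to translate a kernel virtual address into a file offset;
the 'off' parameter below stands for such an offset:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int read_kcore_mapped(off_t off, size_t len)
{
	int fd = open("/proc/kcore", O_RDONLY);
	void *p;

	if (fd < 0)
		return -1;
	/* Read-only: the kernel refuses VM_WRITE/VM_EXEC mappings. */
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	close(fd);
	if (p == MAP_FAILED)
		return -1;
	/* ... walk kernel data structures directly through p ... */
	munmap(p, len);
	return 0;
}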

Thanks to Andrew Morton for the advice.

[akpm@linux-foundation.org: KCORE_REMAP is no more]
Link: https://lkml.kernel.org/r/20210601082241.13378-1-zhoufeng.zf@bytedance.com
Co-developed-by: CHENYING <chenying.kernel@bytedance.com>
Signed-off-by: CHENYING <chenying.kernel@bytedance.com>
Signed-off-by: ZHOUFENG <zhoufeng.zf@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/kcore.c |   67 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

--- a/fs/proc/kcore.c~fs-proc-kcorec-add-mmap-interface
+++ a/fs/proc/kcore.c
@@ -614,11 +614,78 @@ static int release_kcore(struct inode *i
 	return 0;
 }
 
+static vm_fault_t mmap_kcore_fault(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct kcore_mmap_ops = {
+	.fault = mmap_kcore_fault,
+};
+
+static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
+{
+	size_t size = vma->vm_end - vma->vm_start;
+	u64 start, pfn;
+	int nphdr;
+	size_t data_offset;
+	size_t phdrs_len, notes_len;
+	struct kcore_list *m = NULL;
+	int ret = 0;
+
+	down_read(&kclist_lock);
+
+	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
+
+	start = kc_offset_to_vaddr(((u64)vma->vm_pgoff << PAGE_SHIFT) -
+		((data_offset >> PAGE_SHIFT) << PAGE_SHIFT));
+
+	list_for_each_entry(m, &kclist_head, list) {
+		if (start >= m->addr && size <= m->size)
+			break;
+	}
+
+	if (&m->list == &kclist_head) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &kcore_mmap_ops;
+
+	if (kern_addr_valid(start)) {
+		if (m->type == KCORE_RAM)
+			pfn = __pa(start) >> PAGE_SHIFT;
+		else if (m->type == KCORE_TEXT)
+			pfn = __pa_symbol(start) >> PAGE_SHIFT;
+		else {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		ret = remap_pfn_range(vma, vma->vm_start, pfn, size,
+				vma->vm_page_prot);
+	} else {
+		ret = -EFAULT;
+	}
+
+out:
+	up_read(&kclist_lock);
+	return ret;
+}
+
 static const struct proc_ops kcore_proc_ops = {
 	.proc_read	= read_kcore,
 	.proc_open	= open_kcore,
 	.proc_release	= release_kcore,
 	.proc_lseek	= default_llseek,
+	.proc_mmap	= mmap_kcore,
 };
 
 /* just remember that we have to update kcore */
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ
  2021-07-01  1:46 incoming Andrew Morton
                   ` (140 preceding siblings ...)
  2021-07-01  1:54 ` [patch 141/192] fs/proc/kcore.c: add mmap interface Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-02 14:54   ` Christian Brauner
  2021-07-02 18:43   ` Kees Cook
  2021-07-01  1:54 ` [patch 143/192] procfs/dmabuf: add inode number to /proc/*/fdinfo Andrew Morton
                   ` (50 subsequent siblings)
  192 siblings, 2 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: adobriyan, akpm, avagin, bernd.edlinger, christian.brauner,
	christian.koenig, corbet, deller, ebiederm, gladkov.alexey,
	hridya, jamorris, jannh, jeffv, kaleshsingh, keescook, linux-mm,
	mchehab+huawei, mhocko, minchan, mm-commits, rdunlap, surenb,
	szabolcs.nagy, torvalds, viro, walken, willy

From: Kalesh Singh <kaleshsingh@google.com>
Subject: procfs: allow reading fdinfo with PTRACE_MODE_READ

Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers.  In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting.  Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.

Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both only readable by the process owner -- as
follows (see the sketch after this list):

  1. Do a readlink on each FD.
  2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
  3. stat the file to get the dmabuf inode number.
  4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
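
A userspace sketch of those four steps for a single fd (error handling
trimmed; the "/dmabuf" prefix check and "size:" parsing are
illustrative of the steps above, not a complete accounting tool):

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void account_dmabuf_fd(int pid, int fd)
{
	char path[64], target[256], line[256];
	unsigned long size = 0;
	struct stat st;
	ssize_t n;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/fd/%d", pid, fd);
	n = readlink(path, target, sizeof(target) - 1);		/* step 1 */
	if (n < 0)
		return;
	target[n] = '\0';
	if (strncmp(target, "/dmabuf", 7) != 0)			/* step 2 */
		return;
	if (stat(path, &st) != 0)				/* step 3 */
		return;
	snprintf(path, sizeof(path), "/proc/%d/fdinfo/%d", pid, fd);
	f = fopen(path, "r");					/* step 4 */
	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "size: %lu", &size) == 1)
			break;
	fclose(f);
	printf("dmabuf ino %lu size %lu\n", (unsigned long)st.st_ino, size);
}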

Accessing other processes' fdinfo requires root privileges.  This limits
the use of the interface to debugging environments and is not suitable for
production builds.  Granting root privileges even to a system process
increases the attack surface and is highly undesirable.

Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCREDS.

Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Jann Horn <jannh@google.com>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/base.c |    4 ++--
 fs/proc/fd.c   |   15 ++++++++++++++-
 2 files changed, 16 insertions(+), 3 deletions(-)

--- a/fs/proc/base.c~procfs-allow-reading-fdinfo-with-ptrace_mode_read
+++ a/fs/proc/base.c
@@ -3172,7 +3172,7 @@ static const struct pid_entry tgid_base_
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
-	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdinfo",     S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3517,7 +3517,7 @@ static const struct inode_operations pro
  */
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
-	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdinfo",    S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
--- a/fs/proc/fd.c~procfs-allow-reading-fdinfo-with-ptrace_mode_read
+++ a/fs/proc/fd.c
@@ -6,6 +6,7 @@
 #include <linux/fdtable.h>
 #include <linux/namei.h>
 #include <linux/pid.h>
+#include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/file.h>
 #include <linux/seq_file.h>
@@ -72,6 +73,18 @@ out:
 
 static int seq_fdinfo_open(struct inode *inode, struct file *file)
 {
+	bool allowed = false;
+	struct task_struct *task = get_proc_task(inode);
+
+	if (!task)
+		return -ESRCH;
+
+	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+	put_task_struct(task);
+
+	if (!allowed)
+		return -EACCES;
+
 	return single_open(file, seq_show, inode);
 }
 
@@ -308,7 +321,7 @@ static struct dentry *proc_fdinfo_instan
 	struct proc_inode *ei;
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUSR);
+	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUGO);
 	if (!inode)
 		return ERR_PTR(-ENOENT);
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 143/192] procfs/dmabuf: add inode number to /proc/*/fdinfo
  2021-07-01  1:46 incoming Andrew Morton
                   ` (141 preceding siblings ...)
  2021-07-01  1:54 ` [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 144/192] sysctl: remove redundant assignment to first Andrew Morton
                   ` (49 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: adobriyan, akpm, avagin, bernd.edlinger, christian.brauner,
	christian.koenig, corbet, deller, ebiederm, gladkov.alexey,
	hridya, jamorris, jannh, jeffv, kaleshsingh, keescook, linux-mm,
	mchehab+huawei, mhocko, minchan, mm-commits, rdunlap, surenb,
	szabolcs.nagy, torvalds, viro, walken, willy

From: Kalesh Singh <kaleshsingh@google.com>
Subject: procfs/dmabuf: add inode number to /proc/*/fdinfo

Add an 'ino' field to /proc/<pid>/fdinfo/<FD> and
/proc/<pid>/task/<tid>/fdinfo/<FD>.

The inode numbers can be used to uniquely identify DMA buffers in user
space and avoids a dependency on /proc/<pid>/fd/* when accounting
per-process DMA buffer sizes.

Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Christian König <christian.koenig@amd.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/filesystems/proc.rst |   37 ++++++++++++++++++++++-----
 fs/proc/fd.c                       |    5 ++-
 2 files changed, 34 insertions(+), 8 deletions(-)

--- a/Documentation/filesystems/proc.rst~procfs-dmabuf-add-inode-number-to-proc-fdinfo
+++ a/Documentation/filesystems/proc.rst
@@ -1920,18 +1920,20 @@ if precise results are needed.
 3.8	/proc/<pid>/fdinfo/<fd> - Information about opened file
 ---------------------------------------------------------------
 This file provides information associated with an opened file. The regular
-files have at least three fields -- 'pos', 'flags' and 'mnt_id'. The 'pos'
-represents the current offset of the opened file in decimal form [see lseek(2)
-for details], 'flags' denotes the octal O_xxx mask the file has been
-created with [see open(2) for details] and 'mnt_id' represents mount ID of
-the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo
-for details].
+files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
+The 'pos' represents the current offset of the opened file in decimal
+form [see lseek(2) for details], 'flags' denotes the octal O_xxx mask the
+file has been created with [see open(2) for details] and 'mnt_id' represents
+mount ID of the file system containing the opened file [see 3.5
+/proc/<pid>/mountinfo for details]. 'ino' represents the inode number of
+the file.
 
 A typical output is::
 
 	pos:	0
 	flags:	0100002
 	mnt_id:	19
+	ino:	63107
 
 All locks associated with a file descriptor are shown in its fdinfo too::
 
@@ -1948,6 +1950,7 @@ Eventfd files
 	pos:	0
 	flags:	04002
 	mnt_id:	9
+	ino:	63107
 	eventfd-count:	5a
 
 where 'eventfd-count' is hex value of a counter.
@@ -1960,6 +1963,7 @@ Signalfd files
 	pos:	0
 	flags:	04002
 	mnt_id:	9
+	ino:	63107
 	sigmask:	0000000000000200
 
 where 'sigmask' is hex value of the signal mask associated
@@ -1973,6 +1977,7 @@ Epoll files
 	pos:	0
 	flags:	02
 	mnt_id:	9
+	ino:	63107
 	tfd:        5 events:       1d data: ffffffffffffffff pos:0 ino:61af sdev:7
 
 where 'tfd' is a target file descriptor number in decimal form,
@@ -1989,6 +1994,8 @@ For inotify files the format is the foll
 
 	pos:	0
 	flags:	02000000
+	mnt_id:	9
+	ino:	63107
 	inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
 
 where 'wd' is a watch descriptor in decimal form, i.e. a target file
@@ -2011,6 +2018,7 @@ For fanotify files the format is::
 	pos:	0
 	flags:	02
 	mnt_id:	9
+	ino:	63107
 	fanotify flags:10 event-flags:0
 	fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
 	fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
@@ -2035,6 +2043,7 @@ Timerfd files
 	pos:	0
 	flags:	02
 	mnt_id:	9
+	ino:	63107
 	clockid: 0
 	ticks: 0
 	settime flags: 01
@@ -2049,6 +2058,22 @@ details]. 'it_value' is remaining time u
 with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value'
 still exhibits timer's remaining time.
 
+DMA Buffer files
+~~~~~~~~~~~~~~~~
+
+::
+
+	pos:	0
+	flags:	04002
+	mnt_id:	9
+	ino:	63107
+	size:   32768
+	count:  2
+	exp_name:  system-heap
+
+where 'size' is the size of the DMA buffer in bytes. 'count' is the file count of
+the DMA buffer file. 'exp_name' is the name of the DMA buffer exporter.
+
 3.9	/proc/<pid>/map_files - Information about memory mapped files
 ---------------------------------------------------------------------
 This directory contains symbolic links which represent memory mapped files
--- a/fs/proc/fd.c~procfs-dmabuf-add-inode-number-to-proc-fdinfo
+++ a/fs/proc/fd.c
@@ -54,9 +54,10 @@ static int seq_show(struct seq_file *m,
 	if (ret)
 		return ret;
 
-	seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n",
+	seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\nino:\t%lu\n",
 		   (long long)file->f_pos, f_flags,
-		   real_mount(file->f_path.mnt)->mnt_id);
+		   real_mount(file->f_path.mnt)->mnt_id,
+		   file_inode(file)->i_ino);
 
 	/* show_fd_locks() never deferences files so a stale value is safe */
 	show_fd_locks(m, file, files);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 144/192] sysctl: remove redundant assignment to first
  2021-07-01  1:46 incoming Andrew Morton
                   ` (142 preceding siblings ...)
  2021-07-01  1:54 ` [patch 143/192] procfs/dmabuf: add inode number to /proc/*/fdinfo Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 145/192] drm: include only needed headers in ascii85.h Andrew Morton
                   ` (48 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: abaci, akpm, andrii, ast, daniel, jiapeng.chong, john.fastabend,
	kafai, keescook, kpsingh, linux-mm, mcgrof, mm-commits,
	songliubraving, torvalds, yhs, yzaikin

From: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Subject: sysctl: remove redundant assignment to first

The variable 'first' is set to '0' in the large-bitmap branch, but that
value is never read, hence the assignment is redundant and can be
removed; 'first' is moved into the branch that actually uses it.

Clean up the following clang-analyzer warning:

kernel/sysctl.c:1562:4: warning: Value stored to 'first' is never read
[clang-analyzer-deadcode.DeadStores].

Link: https://lkml.kernel.org/r/1620469990-22182-1-git-send-email-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/sysctl.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/sysctl.c~sysctl-remove-redundant-assignment-to-first
+++ a/kernel/sysctl.c
@@ -1494,7 +1494,6 @@ int proc_do_large_bitmap(struct ctl_tabl
 			 void *buffer, size_t *lenp, loff_t *ppos)
 {
 	int err = 0;
-	bool first = 1;
 	size_t left = *lenp;
 	unsigned long bitmap_len = table->maxlen;
 	unsigned long *bitmap = *(unsigned long **) table->data;
@@ -1579,12 +1578,12 @@ int proc_do_large_bitmap(struct ctl_tabl
 			}
 
 			bitmap_set(tmp_bitmap, val_a, val_b - val_a + 1);
-			first = 0;
 			proc_skip_char(&p, &left, '\n');
 		}
 		left += skipped;
 	} else {
 		unsigned long bit_a, bit_b = 0;
+		bool first = 1;
 
 		while (left) {
 			bit_a = find_next_bit(bitmap, bitmap_len, bit_b);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 145/192] drm: include only needed headers in ascii85.h
  2021-07-01  1:46 incoming Andrew Morton
                   ` (143 preceding siblings ...)
  2021-07-01  1:54 ` [patch 144/192] sysctl: remove redundant assignment to first Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:54 ` [patch 146/192] kernel.h: split out panic and oops helpers Andrew Morton
                   ` (47 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, jani.nikula, linux-mm, mm-commits, torvalds

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: drm: include only needed headers in ascii85.h

ascii85.h uses exactly two headers, i.e. math.h and types.h.  There is
no need to pull in the entire kernel.h.

Link: https://lkml.kernel.org/r/20210611185915.44181-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/ascii85.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/linux/ascii85.h~drm-include-only-needed-headers-in-ascii85h
+++ a/include/linux/ascii85.h
@@ -8,7 +8,8 @@
 #ifndef _ASCII85_H_
 #define _ASCII85_H_
 
-#include <linux/kernel.h>
+#include <linux/math.h>
+#include <linux/types.h>
 
 #define ASCII85_BUFSZ 6
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 146/192] kernel.h: split out panic and oops helpers
  2021-07-01  1:46 incoming Andrew Morton
                   ` (144 preceding siblings ...)
  2021-07-01  1:54 ` [patch 145/192] drm: include only needed headers in ascii85.h Andrew Morton
@ 2021-07-01  1:54 ` Andrew Morton
  2021-07-01  1:55 ` [patch 147/192] lib: decompress_bunzip2: remove an unneeded semicolon Andrew Morton
                   ` (46 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:54 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, arnd, bjorn.andersson,
	christian.brauner, cminyard, deller, keescook, linux-mm, linux,
	mcgrof, mm-commits, rppt, sboyd, sre, torvalds, tsbogend,
	wei.liu

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: kernel.h: split out panic and oops helpers

kernel.h has been used as a dumping ground for all kinds of stuff for a
long time.  Here is an attempt to start cleaning it up by splitting out
the panic and oops helpers.

Doing this serves several purposes:
- dropping the dependency in bug.h
- breaking an include loop by moving panic_notifier.h out
- unloading kernel.h of something which has its own domain

At the same time, convert users tree-wide to the new headers, although
for the time being the new header is included back into kernel.h to
avoid twisted indirect includes for existing users.
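
As an example of the result, a hypothetical module registering a panic
callback now only needs the dedicated headers instead of all of
kernel.h (a minimal sketch; my_panic_* names are illustrative):

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/panic_notifier.h>

static int my_panic_event(struct notifier_block *nb, unsigned long action,
			  void *data)
{
	/* quiesce hardware, log state, etc. */
	return NOTIFY_DONE;
}

static struct notifier_block my_panic_block = {
	.notifier_call = my_panic_event,
};

static int __init my_init(void)
{
	atomic_notifier_chain_register(&panic_notifier_list,
				       &my_panic_block);
	return 0;
}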

[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
  Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/kernel/setup.c                         |    2 
 arch/arm64/kernel/setup.c                         |    1 
 arch/ia64/include/asm/pal.h                       |    1 
 arch/mips/kernel/relocate.c                       |    1 
 arch/mips/sgi-ip22/ip22-reset.c                   |    1 
 arch/mips/sgi-ip32/ip32-reset.c                   |    1 
 arch/parisc/kernel/pdc_chassis.c                  |    1 
 arch/powerpc/kernel/setup-common.c                |    1 
 arch/s390/kernel/ipl.c                            |    1 
 arch/sparc/kernel/sstate.c                        |    1 
 arch/um/drivers/mconsole_kern.c                   |    1 
 arch/um/kernel/um_arch.c                          |    1 
 arch/x86/include/asm/desc.h                       |    1 
 arch/x86/kernel/cpu/mshyperv.c                    |    1 
 arch/x86/kernel/setup.c                           |    1 
 arch/x86/purgatory/purgatory.c                    |    2 
 arch/x86/xen/enlighten.c                          |    1 
 arch/xtensa/platforms/iss/setup.c                 |    1 
 drivers/bus/brcmstb_gisb.c                        |    1 
 drivers/char/ipmi/ipmi_msghandler.c               |    1 
 drivers/clk/analogbits/wrpll-cln28hpc.c           |    4 
 drivers/edac/altera_edac.c                        |    1 
 drivers/firmware/google/gsmi.c                    |    1 
 drivers/hv/vmbus_drv.c                            |    1 
 drivers/hwtracing/coresight/coresight-cpu-debug.c |    1 
 drivers/leds/trigger/ledtrig-activity.c           |    1 
 drivers/leds/trigger/ledtrig-heartbeat.c          |    1 
 drivers/leds/trigger/ledtrig-panic.c              |    1 
 drivers/misc/bcm-vk/bcm_vk_dev.c                  |    1 
 drivers/misc/ibmasm/heartbeat.c                   |    1 
 drivers/misc/pvpanic/pvpanic.c                    |    1 
 drivers/net/ipa/ipa_smp2p.c                       |    1 
 drivers/parisc/power.c                            |    1 
 drivers/power/reset/ltc2952-poweroff.c            |    1 
 drivers/remoteproc/remoteproc_core.c              |    1 
 drivers/s390/char/con3215.c                       |    1 
 drivers/s390/char/con3270.c                       |    1 
 drivers/s390/char/sclp.c                          |    1 
 drivers/s390/char/sclp_con.c                      |    1 
 drivers/s390/char/sclp_vt220.c                    |    1 
 drivers/s390/char/zcore.c                         |    1 
 drivers/soc/bcm/brcmstb/pm/pm-arm.c               |    1 
 drivers/staging/olpc_dcon/olpc_dcon.c             |    1 
 drivers/video/fbdev/hyperv_fb.c                   |    1 
 include/asm-generic/bug.h                         |    3 
 include/linux/kernel.h                            |   84 ----------
 include/linux/panic.h                             |   98 ++++++++++++
 include/linux/panic_notifier.h                    |   12 +
 include/linux/thread_info.h                       |    1 
 kernel/hung_task.c                                |    1 
 kernel/kexec_core.c                               |    1 
 kernel/panic.c                                    |    1 
 kernel/rcu/tree.c                                 |    2 
 kernel/sysctl.c                                   |    1 
 kernel/trace/trace.c                              |    1 
 55 files changed, 169 insertions(+), 85 deletions(-)

--- a/arch/alpha/kernel/setup.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/alpha/kernel/setup.c
@@ -28,6 +28,7 @@
 #include <linux/init.h>
 #include <linux/string.h>
 #include <linux/ioport.h>
+#include <linux/panic_notifier.h>
 #include <linux/platform_device.h>
 #include <linux/memblock.h>
 #include <linux/pci.h>
@@ -46,7 +47,6 @@
 #include <linux/log2.h>
 #include <linux/export.h>
 
-extern struct atomic_notifier_head panic_notifier_list;
 static int alpha_panic_event(struct notifier_block *, unsigned long, void *);
 static struct notifier_block alpha_panic_block = {
 	alpha_panic_event,
--- a/arch/arm64/kernel/setup.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/arm64/kernel/setup.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/smp.h>
 #include <linux/fs.h>
+#include <linux/panic_notifier.h>
 #include <linux/proc_fs.h>
 #include <linux/memblock.h>
 #include <linux/of_fdt.h>
--- a/arch/ia64/include/asm/pal.h~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/ia64/include/asm/pal.h
@@ -99,6 +99,7 @@
 
 #include <linux/types.h>
 #include <asm/fpu.h>
+#include <asm/intrinsics.h>
 
 /*
  * Data types needed to pass information into PAL procedures and
--- a/arch/mips/kernel/relocate.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/mips/kernel/relocate.c
@@ -18,6 +18,7 @@
 #include <linux/kernel.h>
 #include <linux/libfdt.h>
 #include <linux/of_fdt.h>
+#include <linux/panic_notifier.h>
 #include <linux/sched/task.h>
 #include <linux/start_kernel.h>
 #include <linux/string.h>
--- a/arch/mips/sgi-ip22/ip22-reset.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/mips/sgi-ip22/ip22-reset.c
@@ -12,6 +12,7 @@
 #include <linux/kernel.h>
 #include <linux/sched/signal.h>
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include <linux/pm.h>
 #include <linux/timer.h>
 
--- a/arch/mips/sgi-ip32/ip32-reset.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/mips/sgi-ip32/ip32-reset.c
@@ -12,6 +12,7 @@
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/panic_notifier.h>
 #include <linux/sched.h>
 #include <linux/sched/signal.h>
 #include <linux/notifier.h>
--- a/arch/parisc/kernel/pdc_chassis.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/parisc/kernel/pdc_chassis.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/kernel.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 #include <linux/notifier.h>
 #include <linux/cache.h>
--- a/arch/powerpc/kernel/setup-common.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/powerpc/kernel/setup-common.c
@@ -9,6 +9,7 @@
 #undef DEBUG
 
 #include <linux/export.h>
+#include <linux/panic_notifier.h>
 #include <linux/string.h>
 #include <linux/sched.h>
 #include <linux/init.h>
--- a/arch/s390/kernel/ipl.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/s390/kernel/ipl.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/device.h>
 #include <linux/delay.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 #include <linux/ctype.h>
 #include <linux/fs.h>
--- a/arch/sparc/kernel/sstate.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/sparc/kernel/sstate.c
@@ -6,6 +6,7 @@
 
 #include <linux/kernel.h>
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 #include <linux/init.h>
 
--- a/arch/um/drivers/mconsole_kern.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/um/drivers/mconsole_kern.c
@@ -12,6 +12,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 #include <linux/sched/debug.h>
 #include <linux/proc_fs.h>
--- a/arch/um/kernel/um_arch.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/um/kernel/um_arch.c
@@ -7,6 +7,7 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/module.h>
+#include <linux/panic_notifier.h>
 #include <linux/seq_file.h>
 #include <linux/string.h>
 #include <linux/utsname.h>
--- a/arch/x86/include/asm/desc.h~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/x86/include/asm/desc.h
@@ -9,6 +9,7 @@
 #include <asm/irq_vectors.h>
 #include <asm/cpu_entry_area.h>
 
+#include <linux/debug_locks.h>
 #include <linux/smp.h>
 #include <linux/percpu.h>
 
--- a/arch/x86/kernel/cpu/mshyperv.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/x86/kernel/cpu/mshyperv.c
@@ -17,6 +17,7 @@
 #include <linux/irq.h>
 #include <linux/kexec.h>
 #include <linux/i8253.h>
+#include <linux/panic_notifier.h>
 #include <linux/random.h>
 #include <asm/processor.h>
 #include <asm/hypervisor.h>
--- a/arch/x86/kernel/setup.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/x86/kernel/setup.c
@@ -14,6 +14,7 @@
 #include <linux/initrd.h>
 #include <linux/iscsi_ibft.h>
 #include <linux/memblock.h>
+#include <linux/panic_notifier.h>
 #include <linux/pci.h>
 #include <linux/root_dev.h>
 #include <linux/hugetlb.h>
--- a/arch/x86/purgatory/purgatory.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/x86/purgatory/purgatory.c
@@ -9,6 +9,8 @@
  */
 
 #include <linux/bug.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
 #include <crypto/sha2.h>
 #include <asm/purgatory.h>
 
--- a/arch/x86/xen/enlighten.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/x86/xen/enlighten.c
@@ -6,6 +6,7 @@
 #include <linux/cpu.h>
 #include <linux/kexec.h>
 #include <linux/slab.h>
+#include <linux/panic_notifier.h>
 
 #include <xen/xen.h>
 #include <xen/features.h>
--- a/arch/xtensa/platforms/iss/setup.c~kernelh-split-out-panic-and-oops-helpers
+++ a/arch/xtensa/platforms/iss/setup.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include <linux/printk.h>
 #include <linux/string.h>
 
--- a/drivers/bus/brcmstb_gisb.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/bus/brcmstb_gisb.c
@@ -6,6 +6,7 @@
 #include <linux/init.h>
 #include <linux/types.h>
 #include <linux/module.h>
+#include <linux/panic_notifier.h>
 #include <linux/platform_device.h>
 #include <linux/interrupt.h>
 #include <linux/sysfs.h>
--- a/drivers/char/ipmi/ipmi_msghandler.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/char/ipmi/ipmi_msghandler.c
@@ -16,6 +16,7 @@
 
 #include <linux/module.h>
 #include <linux/errno.h>
+#include <linux/panic_notifier.h>
 #include <linux/poll.h>
 #include <linux/sched.h>
 #include <linux/seq_file.h>
--- a/drivers/clk/analogbits/wrpll-cln28hpc.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/clk/analogbits/wrpll-cln28hpc.c
@@ -23,8 +23,12 @@
 
 #include <linux/bug.h>
 #include <linux/err.h>
+#include <linux/limits.h>
 #include <linux/log2.h>
 #include <linux/math64.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+
 #include <linux/clk/analogbits-wrpll-cln28hpc.h>
 
 /* MIN_INPUT_FREQ: minimum input clock frequency, in Hz (Fref_min) */
--- a/drivers/edac/altera_edac.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/edac/altera_edac.c
@@ -20,6 +20,7 @@
 #include <linux/of_address.h>
 #include <linux/of_irq.h>
 #include <linux/of_platform.h>
+#include <linux/panic_notifier.h>
 #include <linux/platform_device.h>
 #include <linux/regmap.h>
 #include <linux/types.h>
--- a/drivers/firmware/google/gsmi.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/firmware/google/gsmi.c
@@ -19,6 +19,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
+#include <linux/panic_notifier.h>
 #include <linux/ioctl.h>
 #include <linux/acpi.h>
 #include <linux/io.h>
--- a/drivers/hv/vmbus_drv.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/hv/vmbus_drv.c
@@ -25,6 +25,7 @@
 
 #include <linux/delay.h>
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include <linux/ptrace.h>
 #include <linux/screen_info.h>
 #include <linux/kdebug.h>
--- a/drivers/hwtracing/coresight/coresight-cpu-debug.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/hwtracing/coresight/coresight-cpu-debug.c
@@ -17,6 +17,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
+#include <linux/panic_notifier.h>
 #include <linux/pm_qos.h>
 #include <linux/slab.h>
 #include <linux/smp.h>
--- a/drivers/leds/trigger/ledtrig-activity.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/leds/trigger/ledtrig-activity.c
@@ -11,6 +11,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/leds.h>
 #include <linux/module.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
--- a/drivers/leds/trigger/ledtrig-heartbeat.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/leds/trigger/ledtrig-heartbeat.c
@@ -11,6 +11,7 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
+#include <linux/panic_notifier.h>
 #include <linux/slab.h>
 #include <linux/timer.h>
 #include <linux/sched.h>
--- a/drivers/leds/trigger/ledtrig-panic.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/leds/trigger/ledtrig-panic.c
@@ -8,6 +8,7 @@
 #include <linux/kernel.h>
 #include <linux/init.h>
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include <linux/leds.h>
 #include "../leds.h"
 
--- a/drivers/misc/bcm-vk/bcm_vk_dev.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/misc/bcm-vk/bcm_vk_dev.c
@@ -9,6 +9,7 @@
 #include <linux/fs.h>
 #include <linux/idr.h>
 #include <linux/interrupt.h>
+#include <linux/panic_notifier.h>
 #include <linux/kref.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
--- a/drivers/misc/ibmasm/heartbeat.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/misc/ibmasm/heartbeat.c
@@ -9,6 +9,7 @@
  */
 
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include "ibmasm.h"
 #include "dot_command.h"
 #include "lowlevel.h"
--- a/drivers/misc/pvpanic/pvpanic.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/misc/pvpanic/pvpanic.c
@@ -13,6 +13,7 @@
 #include <linux/mod_devicetable.h>
 #include <linux/module.h>
 #include <linux/platform_device.h>
+#include <linux/panic_notifier.h>
 #include <linux/types.h>
 #include <linux/cdev.h>
 #include <linux/list.h>
--- a/drivers/net/ipa/ipa_smp2p.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/net/ipa/ipa_smp2p.c
@@ -8,6 +8,7 @@
 #include <linux/device.h>
 #include <linux/interrupt.h>
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include <linux/soc/qcom/smem.h>
 #include <linux/soc/qcom/smem_state.h>
 
--- a/drivers/parisc/power.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/parisc/power.c
@@ -38,6 +38,7 @@
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/notifier.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 #include <linux/sched/signal.h>
 #include <linux/kthread.h>
--- a/drivers/power/reset/ltc2952-poweroff.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/power/reset/ltc2952-poweroff.c
@@ -52,6 +52,7 @@
 #include <linux/slab.h>
 #include <linux/kmod.h>
 #include <linux/module.h>
+#include <linux/panic_notifier.h>
 #include <linux/mod_devicetable.h>
 #include <linux/gpio/consumer.h>
 #include <linux/reboot.h>
--- a/drivers/remoteproc/remoteproc_core.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/remoteproc/remoteproc_core.c
@@ -20,6 +20,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/device.h>
+#include <linux/panic_notifier.h>
 #include <linux/slab.h>
 #include <linux/mutex.h>
 #include <linux/dma-map-ops.h>
--- a/drivers/s390/char/con3215.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/s390/char/con3215.c
@@ -19,6 +19,7 @@
 #include <linux/console.h>
 #include <linux/interrupt.h>
 #include <linux/err.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 #include <linux/serial.h> /* ASYNC_* flags */
 #include <linux/slab.h>
--- a/drivers/s390/char/con3270.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/s390/char/con3270.c
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/interrupt.h>
 #include <linux/list.h>
+#include <linux/panic_notifier.h>
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/err.h>
--- a/drivers/s390/char/sclp.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/s390/char/sclp.c
@@ -11,6 +11,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/module.h>
 #include <linux/err.h>
+#include <linux/panic_notifier.h>
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 #include <linux/timer.h>
--- a/drivers/s390/char/sclp_con.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/s390/char/sclp_con.c
@@ -10,6 +10,7 @@
 #include <linux/kmod.h>
 #include <linux/console.h>
 #include <linux/init.h>
+#include <linux/panic_notifier.h>
 #include <linux/timer.h>
 #include <linux/jiffies.h>
 #include <linux/termios.h>
--- a/drivers/s390/char/sclp_vt220.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/s390/char/sclp_vt220.c
@@ -9,6 +9,7 @@
 
 #include <linux/module.h>
 #include <linux/spinlock.h>
+#include <linux/panic_notifier.h>
 #include <linux/list.h>
 #include <linux/wait.h>
 #include <linux/timer.h>
--- a/drivers/s390/char/zcore.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/s390/char/zcore.c
@@ -15,6 +15,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/debugfs.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 
 #include <asm/asm-offsets.h>
--- a/drivers/soc/bcm/brcmstb/pm/pm-arm.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/soc/bcm/brcmstb/pm/pm-arm.c
@@ -28,6 +28,7 @@
 #include <linux/notifier.h>
 #include <linux/of.h>
 #include <linux/of_address.h>
+#include <linux/panic_notifier.h>
 #include <linux/platform_device.h>
 #include <linux/pm.h>
 #include <linux/printk.h>
--- a/drivers/staging/olpc_dcon/olpc_dcon.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/staging/olpc_dcon/olpc_dcon.c
@@ -22,6 +22,7 @@
 #include <linux/device.h>
 #include <linux/uaccess.h>
 #include <linux/ctype.h>
+#include <linux/panic_notifier.h>
 #include <linux/reboot.h>
 #include <linux/olpc-ec.h>
 #include <asm/tsc.h>
--- a/drivers/video/fbdev/hyperv_fb.c~kernelh-split-out-panic-and-oops-helpers
+++ a/drivers/video/fbdev/hyperv_fb.c
@@ -52,6 +52,7 @@
 #include <linux/completion.h>
 #include <linux/fb.h>
 #include <linux/pci.h>
+#include <linux/panic_notifier.h>
 #include <linux/efi.h>
 #include <linux/console.h>
 
--- a/include/asm-generic/bug.h~kernelh-split-out-panic-and-oops-helpers
+++ a/include/asm-generic/bug.h
@@ -17,7 +17,8 @@
 #endif
 
 #ifndef __ASSEMBLY__
-#include <linux/kernel.h>
+#include <linux/panic.h>
+#include <linux/printk.h>
 
 #ifdef CONFIG_BUG
 
--- a/include/linux/kernel.h~kernelh-split-out-panic-and-oops-helpers
+++ a/include/linux/kernel.h
@@ -14,6 +14,7 @@
 #include <linux/math.h>
 #include <linux/minmax.h>
 #include <linux/typecheck.h>
+#include <linux/panic.h>
 #include <linux/printk.h>
 #include <linux/build_bug.h>
 #include <linux/static_call_types.h>
@@ -72,7 +73,6 @@
 #define lower_32_bits(n) ((u32)((n) & 0xffffffff))
 
 struct completion;
-struct pt_regs;
 struct user;
 
 #ifdef CONFIG_PREEMPT_VOLUNTARY
@@ -177,14 +177,6 @@ void __might_fault(const char *file, int
 static inline void might_fault(void) { }
 #endif
 
-extern struct atomic_notifier_head panic_notifier_list;
-extern long (*panic_blink)(int state);
-__printf(1, 2)
-void panic(const char *fmt, ...) __noreturn __cold;
-void nmi_panic(struct pt_regs *regs, const char *msg);
-extern void oops_enter(void);
-extern void oops_exit(void);
-extern bool oops_may_print(void);
 void do_exit(long error_code) __noreturn;
 void complete_and_exit(struct completion *, long) __noreturn;
 
@@ -372,52 +364,8 @@ extern int __kernel_text_address(unsigne
 extern int kernel_text_address(unsigned long addr);
 extern int func_ptr_is_kernel_text(void *ptr);
 
-#ifdef CONFIG_SMP
-extern unsigned int sysctl_oops_all_cpu_backtrace;
-#else
-#define sysctl_oops_all_cpu_backtrace 0
-#endif /* CONFIG_SMP */
-
 extern void bust_spinlocks(int yes);
-extern int panic_timeout;
-extern unsigned long panic_print;
-extern int panic_on_oops;
-extern int panic_on_unrecovered_nmi;
-extern int panic_on_io_nmi;
-extern int panic_on_warn;
-extern unsigned long panic_on_taint;
-extern bool panic_on_taint_nousertaint;
-extern int sysctl_panic_on_rcu_stall;
-extern int sysctl_max_rcu_stall_to_panic;
-extern int sysctl_panic_on_stackoverflow;
-
-extern bool crash_kexec_post_notifiers;
-
-/*
- * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
- * holds a CPU number which is executing panic() currently. A value of
- * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec().
- */
-extern atomic_t panic_cpu;
-#define PANIC_CPU_INVALID	-1
 
-/*
- * Only to be used by arch init code. If the user over-wrote the default
- * CONFIG_PANIC_TIMEOUT, honor it.
- */
-static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout)
-{
-	if (panic_timeout == arch_default_timeout)
-		panic_timeout = timeout;
-}
-extern const char *print_tainted(void);
-enum lockdep_ok {
-	LOCKDEP_STILL_OK,
-	LOCKDEP_NOW_UNRELIABLE
-};
-extern void add_taint(unsigned flag, enum lockdep_ok);
-extern int test_taint(unsigned flag);
-extern unsigned long get_taint(void);
 extern int root_mountflags;
 
 extern bool early_boot_irqs_disabled;
@@ -436,36 +384,6 @@ extern enum system_states {
 	SYSTEM_SUSPEND,
 } system_state;
 
-/* This cannot be an enum because some may be used in assembly source. */
-#define TAINT_PROPRIETARY_MODULE	0
-#define TAINT_FORCED_MODULE		1
-#define TAINT_CPU_OUT_OF_SPEC		2
-#define TAINT_FORCED_RMMOD		3
-#define TAINT_MACHINE_CHECK		4
-#define TAINT_BAD_PAGE			5
-#define TAINT_USER			6
-#define TAINT_DIE			7
-#define TAINT_OVERRIDDEN_ACPI_TABLE	8
-#define TAINT_WARN			9
-#define TAINT_CRAP			10
-#define TAINT_FIRMWARE_WORKAROUND	11
-#define TAINT_OOT_MODULE		12
-#define TAINT_UNSIGNED_MODULE		13
-#define TAINT_SOFTLOCKUP		14
-#define TAINT_LIVEPATCH			15
-#define TAINT_AUX			16
-#define TAINT_RANDSTRUCT		17
-#define TAINT_FLAGS_COUNT		18
-#define TAINT_FLAGS_MAX			((1UL << TAINT_FLAGS_COUNT) - 1)
-
-struct taint_flag {
-	char c_true;	/* character printed when tainted */
-	char c_false;	/* character printed when not tainted */
-	bool module;	/* also show as a per-module taint flag */
-};
-
-extern const struct taint_flag taint_flags[TAINT_FLAGS_COUNT];
-
 extern const char hex_asc[];
 #define hex_asc_lo(x)	hex_asc[((x) & 0x0f)]
 #define hex_asc_hi(x)	hex_asc[((x) & 0xf0) >> 4]
--- /dev/null
+++ a/include/linux/panic.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PANIC_H
+#define _LINUX_PANIC_H
+
+#include <linux/compiler_attributes.h>
+#include <linux/types.h>
+
+struct pt_regs;
+
+extern long (*panic_blink)(int state);
+__printf(1, 2)
+void panic(const char *fmt, ...) __noreturn __cold;
+void nmi_panic(struct pt_regs *regs, const char *msg);
+extern void oops_enter(void);
+extern void oops_exit(void);
+extern bool oops_may_print(void);
+
+#ifdef CONFIG_SMP
+extern unsigned int sysctl_oops_all_cpu_backtrace;
+#else
+#define sysctl_oops_all_cpu_backtrace 0
+#endif /* CONFIG_SMP */
+
+extern int panic_timeout;
+extern unsigned long panic_print;
+extern int panic_on_oops;
+extern int panic_on_unrecovered_nmi;
+extern int panic_on_io_nmi;
+extern int panic_on_warn;
+
+extern unsigned long panic_on_taint;
+extern bool panic_on_taint_nousertaint;
+
+extern int sysctl_panic_on_rcu_stall;
+extern int sysctl_max_rcu_stall_to_panic;
+extern int sysctl_panic_on_stackoverflow;
+
+extern bool crash_kexec_post_notifiers;
+
+/*
+ * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
+ * holds a CPU number which is executing panic() currently. A value of
+ * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec().
+ */
+extern atomic_t panic_cpu;
+#define PANIC_CPU_INVALID	-1
+
+/*
+ * Only to be used by arch init code. If the user over-wrote the default
+ * CONFIG_PANIC_TIMEOUT, honor it.
+ */
+static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout)
+{
+	if (panic_timeout == arch_default_timeout)
+		panic_timeout = timeout;
+}
+
+/* This cannot be an enum because some may be used in assembly source. */
+#define TAINT_PROPRIETARY_MODULE	0
+#define TAINT_FORCED_MODULE		1
+#define TAINT_CPU_OUT_OF_SPEC		2
+#define TAINT_FORCED_RMMOD		3
+#define TAINT_MACHINE_CHECK		4
+#define TAINT_BAD_PAGE			5
+#define TAINT_USER			6
+#define TAINT_DIE			7
+#define TAINT_OVERRIDDEN_ACPI_TABLE	8
+#define TAINT_WARN			9
+#define TAINT_CRAP			10
+#define TAINT_FIRMWARE_WORKAROUND	11
+#define TAINT_OOT_MODULE		12
+#define TAINT_UNSIGNED_MODULE		13
+#define TAINT_SOFTLOCKUP		14
+#define TAINT_LIVEPATCH			15
+#define TAINT_AUX			16
+#define TAINT_RANDSTRUCT		17
+#define TAINT_FLAGS_COUNT		18
+#define TAINT_FLAGS_MAX			((1UL << TAINT_FLAGS_COUNT) - 1)
+
+struct taint_flag {
+	char c_true;	/* character printed when tainted */
+	char c_false;	/* character printed when not tainted */
+	bool module;	/* also show as a per-module taint flag */
+};
+
+extern const struct taint_flag taint_flags[TAINT_FLAGS_COUNT];
+
+enum lockdep_ok {
+	LOCKDEP_STILL_OK,
+	LOCKDEP_NOW_UNRELIABLE,
+};
+
+extern const char *print_tainted(void);
+extern void add_taint(unsigned flag, enum lockdep_ok);
+extern int test_taint(unsigned flag);
+extern unsigned long get_taint(void);
+
+#endif	/* _LINUX_PANIC_H */
--- /dev/null
+++ a/include/linux/panic_notifier.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PANIC_NOTIFIERS_H
+#define _LINUX_PANIC_NOTIFIERS_H
+
+#include <linux/notifier.h>
+#include <linux/types.h>
+
+extern struct atomic_notifier_head panic_notifier_list;
+
+extern bool crash_kexec_post_notifiers;
+
+#endif	/* _LINUX_PANIC_NOTIFIERS_H */
--- a/include/linux/thread_info.h~kernelh-split-out-panic-and-oops-helpers
+++ a/include/linux/thread_info.h
@@ -9,6 +9,7 @@
 #define _LINUX_THREAD_INFO_H
 
 #include <linux/types.h>
+#include <linux/limits.h>
 #include <linux/bug.h>
 #include <linux/restart_block.h>
 #include <linux/errno.h>
--- a/kernel/hung_task.c~kernelh-split-out-panic-and-oops-helpers
+++ a/kernel/hung_task.c
@@ -15,6 +15,7 @@
 #include <linux/kthread.h>
 #include <linux/lockdep.h>
 #include <linux/export.h>
+#include <linux/panic_notifier.h>
 #include <linux/sysctl.h>
 #include <linux/suspend.h>
 #include <linux/utsname.h>
--- a/kernel/kexec_core.c~kernelh-split-out-panic-and-oops-helpers
+++ a/kernel/kexec_core.c
@@ -26,6 +26,7 @@
 #include <linux/suspend.h>
 #include <linux/device.h>
 #include <linux/freezer.h>
+#include <linux/panic_notifier.h>
 #include <linux/pm.h>
 #include <linux/cpu.h>
 #include <linux/uaccess.h>
--- a/kernel/panic.c~kernelh-split-out-panic-and-oops-helpers
+++ a/kernel/panic.c
@@ -23,6 +23,7 @@
 #include <linux/reboot.h>
 #include <linux/delay.h>
 #include <linux/kexec.h>
+#include <linux/panic_notifier.h>
 #include <linux/sched.h>
 #include <linux/sysrq.h>
 #include <linux/init.h>
--- a/kernel/rcu/tree.c~kernelh-split-out-panic-and-oops-helpers
+++ a/kernel/rcu/tree.c
@@ -32,6 +32,8 @@
 #include <linux/export.h>
 #include <linux/completion.h>
 #include <linux/moduleparam.h>
+#include <linux/panic.h>
+#include <linux/panic_notifier.h>
 #include <linux/percpu.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
--- a/kernel/sysctl.c~kernelh-split-out-panic-and-oops-helpers
+++ a/kernel/sysctl.c
@@ -27,6 +27,7 @@
 #include <linux/sysctl.h>
 #include <linux/bitmap.h>
 #include <linux/signal.h>
+#include <linux/panic.h>
 #include <linux/printk.h>
 #include <linux/proc_fs.h>
 #include <linux/security.h>
--- a/kernel/trace/trace.c~kernelh-split-out-panic-and-oops-helpers
+++ a/kernel/trace/trace.c
@@ -39,6 +39,7 @@
 #include <linux/slab.h>
 #include <linux/ctype.h>
 #include <linux/init.h>
+#include <linux/panic_notifier.h>
 #include <linux/poll.h>
 #include <linux/nmi.h>
 #include <linux/fs.h>
_
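
As an illustration of the split (a minimal sketch, not part of the patch;
names are hypothetical), a driver registering a panic callback now pulls in
<linux/panic_notifier.h> for panic_notifier_list instead of getting it via
<linux/kernel.h>:

	#include <linux/init.h>
	#include <linux/notifier.h>
	#include <linux/panic_notifier.h>

	static int example_panic_event(struct notifier_block *nb,
				       unsigned long action, void *data)
	{
		/* runs in panic context: keep it short, do not sleep */
		return NOTIFY_DONE;
	}

	static struct notifier_block example_panic_block = {
		.notifier_call = example_panic_event,
	};

	static int __init example_init(void)
	{
		atomic_notifier_chain_register(&panic_notifier_list,
					       &example_panic_block);
		return 0;
	}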

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 147/192] lib: decompress_bunzip2: remove an unneeded semicolon
  2021-07-01  1:46 incoming Andrew Morton
                   ` (145 preceding siblings ...)
  2021-07-01  1:54 ` [patch 146/192] kernel.h: split out panic and oops helpers Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 148/192] lib/string_helpers: switch to use BIT() macro Andrew Morton
                   ` (45 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, thunder.leizhen, torvalds

From: Zhen Lei <thunder.leizhen@huawei.com>
Subject: lib: decompress_bunzip2: remove an unneeded semicolon

The semicolon immediately following '}' is unneeded.

Link: https://lkml.kernel.org/r/20210508094926.2889-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/decompress_bunzip2.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/decompress_bunzip2.c~lib-decompress_bunzip2-remove-an-unneeded-semicolon
+++ a/lib/decompress_bunzip2.c
@@ -385,7 +385,7 @@ static int INIT get_next_block(struct bu
 			bd->inbufBits =
 				(bd->inbufBits << 8)|bd->inbuf[bd->inbufPos++];
 			bd->inbufBitCount += 8;
-		};
+		}
 		bd->inbufBitCount -= hufGroup->maxLen;
 		j = (bd->inbufBits >> bd->inbufBitCount)&
 			((1 << hufGroup->maxLen)-1);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 148/192] lib/string_helpers: switch to use BIT() macro
  2021-07-01  1:46 incoming Andrew Morton
                   ` (146 preceding siblings ...)
  2021-07-01  1:55 ` [patch 147/192] lib: decompress_bunzip2: remove an unneeded semicolon Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 149/192] lib/string_helpers: move ESCAPE_NP check inside 'else' branch in a loop Andrew Morton
                   ` (44 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/string_helpers: switch to use BIT() macro

Patch series "lib/string_helpers: get rid of ugly *_escape_mem_ascii()", v3.

Get rid of the ugly *_escape_mem_ascii() API, since it is not flexible and
has only a single user.  Provide a better approach based on use of
string_escape_mem() with appropriate flags.

Test cases have been expanded accordingly to cover the new functionality.


This patch (of 15):

Switch to using the BIT() macro for flag definitions.  No functional
changes intended.
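
As an illustrative aside (not part of the patch): BIT(n) from
<linux/bits.h> expands to (1UL << (n)), so every flag keeps its previous
numeric value.  A hypothetical compile-time check along these lines would
confirm it:

	#include <linux/build_bug.h>
	#include <linux/string_helpers.h>

	/* hypothetical sanity checks; the flag values are unchanged */
	static_assert(UNESCAPE_SPACE == 0x01);
	static_assert(UNESCAPE_SPECIAL == 0x08);
	static_assert(ESCAPE_HEX == 0x20);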

Link: https://lkml.kernel.org/r/20210504180819.73127-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210504180819.73127-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/string_helpers.h |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

--- a/include/linux/string_helpers.h~lib-string_helpers-switch-to-use-bit-macro
+++ a/include/linux/string_helpers.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_STRING_HELPERS_H_
 #define _LINUX_STRING_HELPERS_H_
 
+#include <linux/bits.h>
 #include <linux/ctype.h>
 #include <linux/types.h>
 
@@ -18,10 +19,10 @@ enum string_size_units {
 void string_get_size(u64 size, u64 blk_size, enum string_size_units units,
 		     char *buf, int len);
 
-#define UNESCAPE_SPACE		0x01
-#define UNESCAPE_OCTAL		0x02
-#define UNESCAPE_HEX		0x04
-#define UNESCAPE_SPECIAL	0x08
+#define UNESCAPE_SPACE		BIT(0)
+#define UNESCAPE_OCTAL		BIT(1)
+#define UNESCAPE_HEX		BIT(2)
+#define UNESCAPE_SPECIAL	BIT(3)
 #define UNESCAPE_ANY		\
 	(UNESCAPE_SPACE | UNESCAPE_OCTAL | UNESCAPE_HEX | UNESCAPE_SPECIAL)
 
@@ -42,15 +43,15 @@ static inline int string_unescape_any_in
 	return string_unescape_any(buf, buf, 0);
 }
 
-#define ESCAPE_SPACE		0x01
-#define ESCAPE_SPECIAL		0x02
-#define ESCAPE_NULL		0x04
-#define ESCAPE_OCTAL		0x08
+#define ESCAPE_SPACE		BIT(0)
+#define ESCAPE_SPECIAL		BIT(1)
+#define ESCAPE_NULL		BIT(2)
+#define ESCAPE_OCTAL		BIT(3)
 #define ESCAPE_ANY		\
 	(ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_SPECIAL | ESCAPE_NULL)
-#define ESCAPE_NP		0x10
+#define ESCAPE_NP		BIT(4)
 #define ESCAPE_ANY_NP		(ESCAPE_ANY | ESCAPE_NP)
-#define ESCAPE_HEX		0x20
+#define ESCAPE_HEX		BIT(5)
 
 int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz,
 		unsigned int flags, const char *only);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 149/192] lib/string_helpers: move ESCAPE_NP check inside 'else' branch in a loop
  2021-07-01  1:46 incoming Andrew Morton
                   ` (147 preceding siblings ...)
  2021-07-01  1:55 ` [patch 148/192] lib/string_helpers: switch to use BIT() macro Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 150/192] lib/string_helpers: drop indentation level in string_escape_mem() Andrew Morton
                   ` (43 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/string_helpers: move ESCAPE_NP check inside 'else' branch in a loop

Refactor the code for better readability by moving the ESCAPE_NP handling
inside the 'else' branch in the loop.

No functional change intended.
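
The preserved behaviour can be seen in a small usage sketch (illustrative
buffer and input, not from the patch): with %ESCAPE_NP set, printable
characters pass through and the escaping classes apply only to the rest:

	#include <linux/string_helpers.h>

	static int np_example(void)
	{
		char out[8];

		/* 'a' passes through; '\n' is escaped as "\x0a" */
		return string_escape_mem("a\n", 2, out, sizeof(out),
					 ESCAPE_NP | ESCAPE_HEX, NULL);
	}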

Link: https://lkml.kernel.org/r/20210504180819.73127-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/string_helpers.c |   17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

--- a/lib/string_helpers.c~lib-string_helpers-move-escape_np-check-inside-else-branch-in-a-loop
+++ a/lib/string_helpers.c
@@ -452,10 +452,10 @@ static bool escape_hex(unsigned char c,
  * The process of escaping byte buffer includes several parts. They are applied
  * in the following sequence.
  *
- *	1. The character is matched to the printable class, if asked, and in
- *	   case of match it passes through to the output.
- *	2. The character is not matched to the one from @only string and thus
+ *	1. The character is not matched to the one from @only string and thus
  *	   must go as-is to the output.
+ *	2. The character is matched to the printable class, if asked, and in
+ *	   case of match it passes through to the output.
  *	3. The character is checked if it falls into the class given by @flags.
  *	   %ESCAPE_OCTAL and %ESCAPE_HEX are going last since they cover any
  *	   character. Note that they actually can't go together, otherwise
@@ -506,19 +506,22 @@ int string_escape_mem(const char *src, s
 
 		/*
 		 * Apply rules in the following sequence:
-		 *	- the character is printable, when @flags has
-		 *	  %ESCAPE_NP bit set
 		 *	- the @only string is supplied and does not contain a
 		 *	  character under question
+		 *	- the character is printable, when @flags has
+		 *	  %ESCAPE_NP bit set
 		 *	- the character doesn't fall into a class of symbols
 		 *	  defined by given @flags
 		 * In these cases we just pass through a character to the
 		 * output buffer.
 		 */
-		if ((flags & ESCAPE_NP && isprint(c)) ||
-		    (is_dict && !strchr(only, c))) {
+		if (is_dict && !strchr(only, c)) {
 			/* do nothing */
 		} else {
+			if (isprint(c) &&
+			    flags & ESCAPE_NP && escape_passthrough(c, &p, end))
+				continue;
+
 			if (flags & ESCAPE_SPACE && escape_space(c, &p, end))
 				continue;
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 150/192] lib/string_helpers: drop indentation level in string_escape_mem()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (148 preceding siblings ...)
  2021-07-01  1:55 ` [patch 149/192] lib/string_helpers: move ESCAPE_NP check inside 'else' branch in a loop Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 151/192] lib/string_helpers: introduce ESCAPE_NA for escaping non-ASCII Andrew Morton
                   ` (42 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/string_helpers: drop indentation level in string_escape_mem()

Only one conditional is left at the upper level; move the rest to the
same level and drop an indentation level.  No functional changes.

Link: https://lkml.kernel.org/r/20210504180819.73127-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/string_helpers.c |   46 ++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

--- a/lib/string_helpers.c~lib-string_helpers-drop-indentation-level-in-string_escape_mem
+++ a/lib/string_helpers.c
@@ -515,29 +515,29 @@ int string_escape_mem(const char *src, s
 		 * In these cases we just pass through a character to the
 		 * output buffer.
 		 */
-		if (is_dict && !strchr(only, c)) {
-			/* do nothing */
-		} else {
-			if (isprint(c) &&
-			    flags & ESCAPE_NP && escape_passthrough(c, &p, end))
-				continue;
-
-			if (flags & ESCAPE_SPACE && escape_space(c, &p, end))
-				continue;
-
-			if (flags & ESCAPE_SPECIAL && escape_special(c, &p, end))
-				continue;
-
-			if (flags & ESCAPE_NULL && escape_null(c, &p, end))
-				continue;
-
-			/* ESCAPE_OCTAL and ESCAPE_HEX always go last */
-			if (flags & ESCAPE_OCTAL && escape_octal(c, &p, end))
-				continue;
-
-			if (flags & ESCAPE_HEX && escape_hex(c, &p, end))
-				continue;
-		}
+		if (is_dict && !strchr(only, c) &&
+					  escape_passthrough(c, &p, end))
+			continue;
+
+		if (isprint(c) &&
+		    flags & ESCAPE_NP && escape_passthrough(c, &p, end))
+			continue;
+
+		if (flags & ESCAPE_SPACE && escape_space(c, &p, end))
+			continue;
+
+		if (flags & ESCAPE_SPECIAL && escape_special(c, &p, end))
+			continue;
+
+		if (flags & ESCAPE_NULL && escape_null(c, &p, end))
+			continue;
+
+		/* ESCAPE_OCTAL and ESCAPE_HEX always go last */
+		if (flags & ESCAPE_OCTAL && escape_octal(c, &p, end))
+			continue;
+
+		if (flags & ESCAPE_HEX && escape_hex(c, &p, end))
+			continue;
 
 		escape_passthrough(c, &p, end);
 	}
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 151/192] lib/string_helpers: introduce ESCAPE_NA for escaping non-ASCII
  2021-07-01  1:46 incoming Andrew Morton
                   ` (149 preceding siblings ...)
  2021-07-01  1:55 ` [patch 150/192] lib/string_helpers: drop indentation level in string_escape_mem() Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 152/192] lib/string_helpers: introduce ESCAPE_NAP to escape non-ASCII and non-printable Andrew Morton
                   ` (41 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/string_helpers: introduce ESCAPE_NA for escaping non-ASCII

Some users may want an ASCII-based filter, provided by the isascii()
function.  Add such a flag.
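
A usage sketch of the new flag (illustrative input, not from the patch):
with %ESCAPE_HEX | %ESCAPE_NA, ASCII bytes pass through and only bytes
outside the ASCII range are escaped:

	#include <linux/string_helpers.h>

	static int na_example(void)
	{
		char out[8];

		/* 'a' is ASCII and passes through; 0xCF becomes "\xcf" */
		return string_escape_mem("a\xCF", 2, out, sizeof(out),
					 ESCAPE_HEX | ESCAPE_NA, NULL);
	}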

Link: https://lkml.kernel.org/r/20210504180819.73127-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/string_helpers.h |    1 +
 lib/string_helpers.c           |   21 +++++++++++++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

--- a/include/linux/string_helpers.h~lib-string_helpers-introduce-escape_na-for-escaping-non-ascii
+++ a/include/linux/string_helpers.h
@@ -52,6 +52,7 @@ static inline int string_unescape_any_in
 #define ESCAPE_NP		BIT(4)
 #define ESCAPE_ANY_NP		(ESCAPE_ANY | ESCAPE_NP)
 #define ESCAPE_HEX		BIT(5)
+#define ESCAPE_NA		BIT(6)
 
 int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz,
 		unsigned int flags, const char *only);
--- a/lib/string_helpers.c~lib-string_helpers-introduce-escape_na-for-escaping-non-ascii
+++ a/lib/string_helpers.c
@@ -454,8 +454,8 @@ static bool escape_hex(unsigned char c,
  *
  *	1. The character is not matched to the one from @only string and thus
  *	   must go as-is to the output.
- *	2. The character is matched to the printable class, if asked, and in
- *	   case of match it passes through to the output.
+ *	2. The character is matched to the printable or ASCII class, if asked,
+ *	   and in case of match it passes through to the output.
  *	3. The character is checked if it falls into the class given by @flags.
  *	   %ESCAPE_OCTAL and %ESCAPE_HEX are going last since they cover any
  *	   character. Note that they actually can't go together, otherwise
@@ -463,7 +463,7 @@ static bool escape_hex(unsigned char c,
  *
  * Caller must provide valid source and destination pointers. Be aware that
  * destination buffer will not be NULL-terminated, thus caller have to append
- * it if needs.   The supported flags are::
+ * it if needs. The supported flags are::
  *
  *	%ESCAPE_SPACE: (special white space, not space itself)
  *		'\f' - form feed
@@ -482,11 +482,18 @@ static bool escape_hex(unsigned char c,
  *	%ESCAPE_ANY:
  *		all previous together
  *	%ESCAPE_NP:
- *		escape only non-printable characters (checked by isprint)
+ *		escape only non-printable characters, checked by isprint()
  *	%ESCAPE_ANY_NP:
  *		all previous together
  *	%ESCAPE_HEX:
  *		'\xHH' - byte with hexadecimal value HH (2 digits)
+ *	%ESCAPE_NA:
+ *		escape only non-ascii characters, checked by isascii()
+ *
+ * One notable caveat, the %ESCAPE_NP and %ESCAPE_NA have higher priority
+ * than the rest of the flags (%ESCAPE_NP is higher than %ESCAPE_NA).
+ * It doesn't make much sense to use either of them without %ESCAPE_OCTAL
+ * or %ESCAPE_HEX, because they cover most of the other character classes.
  *
  * Return:
  * The total size of the escaped output that would be generated for
@@ -510,6 +517,8 @@ int string_escape_mem(const char *src, s
 		 *	  character under question
 		 *	- the character is printable, when @flags has
 		 *	  %ESCAPE_NP bit set
+		 *	- the character is ASCII, when @flags has
+		 *	  %ESCAPE_NA bit set
 		 *	- the character doesn't fall into a class of symbols
 		 *	  defined by given @flags
 		 * In these cases we just pass through a character to the
@@ -523,6 +532,10 @@ int string_escape_mem(const char *src, s
 		    flags & ESCAPE_NP && escape_passthrough(c, &p, end))
 			continue;
 
+		if (isascii(c) &&
+		    flags & ESCAPE_NA && escape_passthrough(c, &p, end))
+			continue;
+
 		if (flags & ESCAPE_SPACE && escape_space(c, &p, end))
 			continue;
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 152/192] lib/string_helpers: introduce ESCAPE_NAP to escape non-ASCII and non-printable
  2021-07-01  1:46 incoming Andrew Morton
                   ` (150 preceding siblings ...)
  2021-07-01  1:55 ` [patch 151/192] lib/string_helpers: introduce ESCAPE_NA for escaping non-ASCII Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 153/192] lib/string_helpers: allow to append additional characters to be escaped Andrew Morton
                   ` (40 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/string_helpers: introduce ESCAPE_NAP to escape non-ASCII and non-printable

Some users may want an ASCII-based filter for printable-only characters,
provided by the conjunction of the isascii() and isprint() functions.

Add such a flag.
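
A usage sketch (illustrative input, not from the patch): with
%ESCAPE_HEX | %ESCAPE_NAP, a byte passes through only if it is both ASCII
and printable:

	#include <linux/string_helpers.h>

	static int nap_example(void)
	{
		char out[12];

		/*
		 * 'a' passes through; 0xCF (non-ASCII) -> "\xcf",
		 * '\r' (non-printable) -> "\x0d"
		 */
		return string_escape_mem("a\xCF\r", 3, out, sizeof(out),
					 ESCAPE_HEX | ESCAPE_NAP, NULL);
	}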

Link: https://lkml.kernel.org/r/20210504180819.73127-6-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/string_helpers.h |    1 +
 lib/string_helpers.c           |   20 ++++++++++++++++----
 2 files changed, 17 insertions(+), 4 deletions(-)

--- a/include/linux/string_helpers.h~lib-string_helpers-introduce-escape_nap-to-escape-non-ascii-and-non-printable
+++ a/include/linux/string_helpers.h
@@ -53,6 +53,7 @@ static inline int string_unescape_any_in
 #define ESCAPE_ANY_NP		(ESCAPE_ANY | ESCAPE_NP)
 #define ESCAPE_HEX		BIT(5)
 #define ESCAPE_NA		BIT(6)
+#define ESCAPE_NAP		BIT(7)
 
 int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz,
 		unsigned int flags, const char *only);
--- a/lib/string_helpers.c~lib-string_helpers-introduce-escape_nap-to-escape-non-ascii-and-non-printable
+++ a/lib/string_helpers.c
@@ -454,9 +454,11 @@ static bool escape_hex(unsigned char c,
  *
  *	1. The character is not matched to the one from @only string and thus
  *	   must go as-is to the output.
- *	2. The character is matched to the printable or ASCII class, if asked,
+ *	2. The character is matched to the printable and ASCII classes, if asked,
  *	   and in case of match it passes through to the output.
- *	3. The character is checked if it falls into the class given by @flags.
+ *	3. The character is matched to the printable or ASCII class, if asked,
+ *	   and in case of match it passes through to the output.
+ *	4. The character is checked if it falls into the class given by @flags.
  *	   %ESCAPE_OCTAL and %ESCAPE_HEX are going last since they cover any
  *	   character. Note that they actually can't go together, otherwise
  *	   %ESCAPE_HEX will be ignored.
@@ -489,11 +491,15 @@ static bool escape_hex(unsigned char c,
  *		'\xHH' - byte with hexadecimal value HH (2 digits)
  *	%ESCAPE_NA:
  *		escape only non-ascii characters, checked by isascii()
+ *	%ESCAPE_NAP:
+ *		escape only non-printable or non-ascii characters
  *
- * One notable caveat, the %ESCAPE_NP and %ESCAPE_NA have higher priority
- * than the rest of the flags (%ESCAPE_NP is higher than %ESCAPE_NA).
+ * One notable caveat, the %ESCAPE_NAP, %ESCAPE_NP and %ESCAPE_NA have the
+ * higher priority than the rest of the flags (%ESCAPE_NAP is the highest).
  * It doesn't make much sense to use either of them without %ESCAPE_OCTAL
  * or %ESCAPE_HEX, because they cover most of the other character classes.
+ * %ESCAPE_NAP can utilize %ESCAPE_SPACE or %ESCAPE_SPECIAL in addition to
+ * the above.
  *
  * Return:
  * The total size of the escaped output that would be generated for
@@ -515,6 +521,8 @@ int string_escape_mem(const char *src, s
 		 * Apply rules in the following sequence:
 		 *	- the @only string is supplied and does not contain a
 		 *	  character under question
+		 *	- the character is printable and ASCII, when @flags has
+		 *	  %ESCAPE_NAP bit set
 		 *	- the character is printable, when @flags has
 		 *	  %ESCAPE_NP bit set
 		 *	- the character is ASCII, when @flags has
@@ -528,6 +536,10 @@ int string_escape_mem(const char *src, s
 					  escape_passthrough(c, &p, end))
 			continue;
 
+		if (isascii(c) && isprint(c) &&
+		    flags & ESCAPE_NAP && escape_passthrough(c, &p, end))
+			continue;
+
 		if (isprint(c) &&
 		    flags & ESCAPE_NP && escape_passthrough(c, &p, end))
 			continue;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 153/192] lib/string_helpers: allow to append additional characters to be escaped
  2021-07-01  1:46 incoming Andrew Morton
                   ` (151 preceding siblings ...)
  2021-07-01  1:55 ` [patch 152/192] lib/string_helpers: introduce ESCAPE_NAP to escape non-ASCII and non-printable Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 154/192] lib/test-string_helpers: print flags in hexadecimal format Andrew Morton
                   ` (39 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/string_helpers: allow to append additional characters to be escaped

Introduce a new flag to append additional characters, passed in the
'only' parameter, to be escaped if they fall into the corresponding class.
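
A usage sketch (illustrative input, not from the patch): %ESCAPE_APPEND
makes the characters listed in @only escapable even when a pass-through
class such as %ESCAPE_NP would otherwise let them through:

	#include <linux/string_helpers.h>

	static int append_example(void)
	{
		char out[8];

		/* ' ' is printable, but it is in @only -> escaped as "\x20" */
		return string_escape_mem("a b", 3, out, sizeof(out),
					 ESCAPE_HEX | ESCAPE_NP | ESCAPE_APPEND,
					 " ");
	}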

Link: https://lkml.kernel.org/r/20210504180819.73127-7-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/string_helpers.h |    1 +
 lib/string_helpers.c           |   19 +++++++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

--- a/include/linux/string_helpers.h~lib-string_helpers-allow-to-append-additional-characters-to-be-escaped
+++ a/include/linux/string_helpers.h
@@ -54,6 +54,7 @@ static inline int string_unescape_any_in
 #define ESCAPE_HEX		BIT(5)
 #define ESCAPE_NA		BIT(6)
 #define ESCAPE_NAP		BIT(7)
+#define ESCAPE_APPEND		BIT(8)
 
 int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz,
 		unsigned int flags, const char *only);
--- a/lib/string_helpers.c~lib-string_helpers-allow-to-append-additional-characters-to-be-escaped
+++ a/lib/string_helpers.c
@@ -493,6 +493,11 @@ static bool escape_hex(unsigned char c,
  *		escape only non-ascii characters, checked by isascii()
  *	%ESCAPE_NAP:
  *		escape only non-printable or non-ascii characters
+ *	%ESCAPE_APPEND:
+ *		append characters from @only to be escaped by the given classes
+ *
+ * %ESCAPE_APPEND would help to pass additional characters to the escaped, when
+ * one of %ESCAPE_NP, %ESCAPE_NA, or %ESCAPE_NAP is provided.
  *
  * One notable caveat, the %ESCAPE_NAP, %ESCAPE_NP and %ESCAPE_NA have the
  * higher priority than the rest of the flags (%ESCAPE_NAP is the highest).
@@ -513,9 +518,11 @@ int string_escape_mem(const char *src, s
 	char *p = dst;
 	char *end = p + osz;
 	bool is_dict = only && *only;
+	bool is_append = flags & ESCAPE_APPEND;
 
 	while (isz--) {
 		unsigned char c = *src++;
+		bool in_dict = is_dict && strchr(only, c);
 
 		/*
 		 * Apply rules in the following sequence:
@@ -531,20 +538,24 @@ int string_escape_mem(const char *src, s
 		 *	  defined by given @flags
 		 * In these cases we just pass through a character to the
 		 * output buffer.
+		 *
+		 * When %ESCAPE_APPEND is passed, the characters from @only
+		 * have been excluded from the %ESCAPE_NAP, %ESCAPE_NP, and
+		 * %ESCAPE_NA cases.
 		 */
-		if (is_dict && !strchr(only, c) &&
+		if (!(is_append || in_dict) && is_dict &&
 					  escape_passthrough(c, &p, end))
 			continue;
 
-		if (isascii(c) && isprint(c) &&
+		if (!(is_append && in_dict) && isascii(c) && isprint(c) &&
 		    flags & ESCAPE_NAP && escape_passthrough(c, &p, end))
 			continue;
 
-		if (isprint(c) &&
+		if (!(is_append && in_dict) && isprint(c) &&
 		    flags & ESCAPE_NP && escape_passthrough(c, &p, end))
 			continue;
 
-		if (isascii(c) &&
+		if (!(is_append && in_dict) && isascii(c) &&
 		    flags & ESCAPE_NA && escape_passthrough(c, &p, end))
 			continue;
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 154/192] lib/test-string_helpers: print flags in hexadecimal format
  2021-07-01  1:46 incoming Andrew Morton
                   ` (152 preceding siblings ...)
  2021-07-01  1:55 ` [patch 153/192] lib/string_helpers: allow to append additional characters to be escaped Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 155/192] lib/test-string_helpers: get rid of trailing comma in terminators Andrew Morton
                   ` (38 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/test-string_helpers: print flags in hexadecimal format

Since flags are bitmapped, it's better to print them in hexadecimal
format.
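
For instance (illustrative values, given the flag definitions above),
ESCAPE_HEX | ESCAPE_NA is 0x20 | 0x40, so the failure line now reads as a
recognisable bitmask:

	/* old format: "flags = 96"; new format: "flags = 0x60" */
	pr_warn("Test '%s' failed: flags = %#x\n", name, flags);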

Link: https://lkml.kernel.org/r/20210504180819.73127-8-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test-string_helpers.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/lib/test-string_helpers.c~lib-test-string_helpers-print-flags-in-hexadecimal-format
+++ a/lib/test-string_helpers.c
@@ -19,7 +19,7 @@ static __init bool test_string_check_buf
 	if (q_real == q_test && !memcmp(out_test, out_real, q_test))
 		return true;
 
-	pr_warn("Test '%s' failed: flags = %u\n", name, flags);
+	pr_warn("Test '%s' failed: flags = %#x\n", name, flags);
 
 	print_hex_dump(KERN_WARNING, "Input: ", DUMP_PREFIX_NONE, 16, 1,
 		       in, p, true);
@@ -290,7 +290,7 @@ test_string_escape_overflow(const char *
 
 	q_real = string_escape_mem(in, p, NULL, 0, flags, esc);
 	if (q_real != q_test)
-		pr_warn("Test '%s' failed: flags = %u, osz = 0, expected %d, got %d\n",
+		pr_warn("Test '%s' failed: flags = %#x, osz = 0, expected %d, got %d\n",
 			name, flags, q_test, q_real);
 }
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 155/192] lib/test-string_helpers: get rid of trailing comma in terminators
  2021-07-01  1:46 incoming Andrew Morton
                   ` (153 preceding siblings ...)
  2021-07-01  1:55 ` [patch 154/192] lib/test-string_helpers: print flags in hexadecimal format Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 156/192] lib/test-string_helpers: add test cases for new features Andrew Morton
                   ` (37 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/test-string_helpers: get rid of trailing comma in terminators

Terminators by definition shouldn't accept anything after them.  Make
them robust by removing the trailing commas.

Link: https://lkml.kernel.org/r/20210504180819.73127-9-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test-string_helpers.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/lib/test-string_helpers.c~lib-test-string_helpers-get-rid-of-trailing-comma-in-terminators
+++ a/lib/test-string_helpers.c
@@ -136,7 +136,7 @@ static const struct test_string_2 escape
 		.flags = ESCAPE_SPACE | ESCAPE_HEX,
 	},{
 		/* terminator */
-	}},
+	}}
 },{
 	.in = "\\h\\\"\a\e\\",
 	.s1 = {{
@@ -150,7 +150,7 @@ static const struct test_string_2 escape
 		.flags = ESCAPE_SPECIAL | ESCAPE_HEX,
 	},{
 		/* terminator */
-	}},
+	}}
 },{
 	.in = "\eb \\C\007\"\x90\r]",
 	.s1 = {{
@@ -201,7 +201,7 @@ static const struct test_string_2 escape
 		.flags = ESCAPE_NP | ESCAPE_HEX,
 	},{
 		/* terminator */
-	}},
+	}}
 },{
 	/* terminator */
 }};
@@ -217,7 +217,7 @@ static const struct test_string_2 escape
 		.flags = ESCAPE_HEX,
 	},{
 		/* terminator */
-	}},
+	}}
 },{
 	.in = "\\h\\\"\a\e\\",
 	.s1 = {{
@@ -225,7 +225,7 @@ static const struct test_string_2 escape
 		.flags = ESCAPE_OCTAL,
 	},{
 		/* terminator */
-	}},
+	}}
 },{
 	.in = "\eb \\C\007\"\x90\r]",
 	.s1 = {{
@@ -233,7 +233,7 @@ static const struct test_string_2 escape
 		.flags = ESCAPE_OCTAL,
 	},{
 		/* terminator */
-	}},
+	}}
 },{
 	/* terminator */
 }};
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 156/192] lib/test-string_helpers: add test cases for new features
  2021-07-01  1:46 incoming Andrew Morton
                   ` (154 preceding siblings ...)
  2021-07-01  1:55 ` [patch 155/192] lib/test-string_helpers: get rid of trailing comma in terminators Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 157/192] MAINTAINERS: add myself as designated reviewer for generic string library Andrew Morton
                   ` (36 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/test-string_helpers: add test cases for new features

We have new flags and hence new features in string_escape_mem().
Add test cases for them.

Link: https://lkml.kernel.org/r/20210504180819.73127-10-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/string_helpers.h |    4 
 lib/test-string_helpers.c      |  141 +++++++++++++++++++++++++++++--
 2 files changed, 137 insertions(+), 8 deletions(-)

--- a/include/linux/string_helpers.h~lib-test-string_helpers-add-test-cases-for-new-features
+++ a/include/linux/string_helpers.h
@@ -26,6 +26,8 @@ void string_get_size(u64 size, u64 blk_s
 #define UNESCAPE_ANY		\
 	(UNESCAPE_SPACE | UNESCAPE_OCTAL | UNESCAPE_HEX | UNESCAPE_SPECIAL)
 
+#define UNESCAPE_ALL_MASK	GENMASK(3, 0)
+
 int string_unescape(char *src, char *dst, size_t size, unsigned int flags);
 
 static inline int string_unescape_inplace(char *buf, unsigned int flags)
@@ -56,6 +58,8 @@ static inline int string_unescape_any_in
 #define ESCAPE_NAP		BIT(7)
 #define ESCAPE_APPEND		BIT(8)
 
+#define ESCAPE_ALL_MASK		GENMASK(8, 0)
+
 int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz,
 		unsigned int flags, const char *only);
 
--- a/lib/test-string_helpers.c~lib-test-string_helpers-add-test-cases-for-new-features
+++ a/lib/test-string_helpers.c
@@ -203,10 +203,24 @@ static const struct test_string_2 escape
 		/* terminator */
 	}}
 },{
+	.in = "\007 \eb\"\x90\xCF\r",
+	.s1 = {{
+		.out = "\007 \eb\"\\220\\317\r",
+		.flags = ESCAPE_OCTAL | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\\x90\\xcf\r",
+		.flags = ESCAPE_HEX | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\xCF\r",
+		.flags = ESCAPE_NA,
+	},{
+		/* terminator */
+	}}
+},{
 	/* terminator */
 }};
 
-#define	TEST_STRING_2_DICT_1		"b\\ \t\r"
+#define	TEST_STRING_2_DICT_1		"b\\ \t\r\xCF"
 static const struct test_string_2 escape1[] __initconst = {{
 	.in = "\f\\ \n\r\t\v",
 	.s1 = {{
@@ -216,14 +230,38 @@ static const struct test_string_2 escape
 		.out = "\f\\x5c\\x20\n\\x0d\\x09\v",
 		.flags = ESCAPE_HEX,
 	},{
+		.out = "\f\\134\\040\n\\015\\011\v",
+		.flags = ESCAPE_ANY | ESCAPE_APPEND,
+	},{
+		.out = "\\014\\134\\040\\012\\015\\011\\013",
+		.flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NAP,
+	},{
+		.out = "\\x0c\\x5c\\x20\\x0a\\x0d\\x09\\x0b",
+		.flags = ESCAPE_HEX | ESCAPE_APPEND | ESCAPE_NAP,
+	},{
+		.out = "\f\\134\\040\n\\015\\011\v",
+		.flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NA,
+	},{
+		.out = "\f\\x5c\\x20\n\\x0d\\x09\v",
+		.flags = ESCAPE_HEX | ESCAPE_APPEND | ESCAPE_NA,
+	},{
 		/* terminator */
 	}}
 },{
-	.in = "\\h\\\"\a\e\\",
+	.in = "\\h\\\"\a\xCF\e\\",
 	.s1 = {{
-		.out = "\\134h\\134\"\a\e\\134",
+		.out = "\\134h\\134\"\a\\317\e\\134",
 		.flags = ESCAPE_OCTAL,
 	},{
+		.out = "\\134h\\134\"\a\\317\e\\134",
+		.flags = ESCAPE_ANY | ESCAPE_APPEND,
+	},{
+		.out = "\\134h\\134\"\\007\\317\\033\\134",
+		.flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NAP,
+	},{
+		.out = "\\134h\\134\"\a\\317\e\\134",
+		.flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NA,
+	},{
 		/* terminator */
 	}}
 },{
@@ -235,6 +273,88 @@ static const struct test_string_2 escape
 		/* terminator */
 	}}
 },{
+	.in = "\007 \eb\"\x90\xCF\r",
+	.s1 = {{
+		.out = "\007 \eb\"\x90\xCF\r",
+		.flags = ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\xCF\r",
+		.flags = ESCAPE_SPACE | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\xCF\r",
+		.flags = ESCAPE_SPECIAL | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\xCF\r",
+		.flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\\317\r",
+		.flags = ESCAPE_OCTAL | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\\317\r",
+		.flags = ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\\317\r",
+		.flags = ESCAPE_SPECIAL | ESCAPE_OCTAL | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\\317\r",
+		.flags = ESCAPE_ANY | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\\xcf\r",
+		.flags = ESCAPE_HEX | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\\xcf\r",
+		.flags = ESCAPE_SPACE | ESCAPE_HEX | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\\xcf\r",
+		.flags = ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NA,
+	},{
+		.out = "\007 \eb\"\x90\\xcf\r",
+		.flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NA,
+	},{
+		/* terminator */
+	}}
+},{
+	.in = "\007 \eb\"\x90\xCF\r",
+	.s1 = {{
+		.out = "\007 \eb\"\x90\xCF\r",
+		.flags = ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\xCF\\r",
+		.flags = ESCAPE_SPACE | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\xCF\r",
+		.flags = ESCAPE_SPECIAL | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\xCF\\r",
+		.flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\\317\\015",
+		.flags = ESCAPE_OCTAL | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\\317\\r",
+		.flags = ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\\317\\015",
+		.flags = ESCAPE_SPECIAL | ESCAPE_OCTAL | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\\317\r",
+		.flags = ESCAPE_ANY | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\\xcf\\x0d",
+		.flags = ESCAPE_HEX | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\\xcf\\r",
+		.flags = ESCAPE_SPACE | ESCAPE_HEX | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\\xcf\\x0d",
+		.flags = ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NAP,
+	},{
+		.out = "\007 \eb\"\x90\\xcf\\r",
+		.flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NAP,
+	},{
+		/* terminator */
+	}}
+},{
 	/* terminator */
 }};
 
@@ -315,8 +435,13 @@ static __init void test_string_escape(co
 		/* NULL injection */
 		if (flags & ESCAPE_NULL) {
 			in[p++] = '\0';
-			out_test[q_test++] = '\\';
-			out_test[q_test++] = '0';
+			/* '\0' passes isascii() test */
+			if (flags & ESCAPE_NA && !(flags & ESCAPE_APPEND && esc)) {
+				out_test[q_test++] = '\0';
+			} else {
+				out_test[q_test++] = '\\';
+				out_test[q_test++] = '0';
+			}
 		}
 
 		/* Don't try strings that have no output */
@@ -459,17 +584,17 @@ static int __init test_string_helpers_in
 	unsigned int i;
 
 	pr_info("Running tests...\n");
-	for (i = 0; i < UNESCAPE_ANY + 1; i++)
+	for (i = 0; i < UNESCAPE_ALL_MASK + 1; i++)
 		test_string_unescape("unescape", i, false);
 	test_string_unescape("unescape inplace",
 			     get_random_int() % (UNESCAPE_ANY + 1), true);
 
 	/* Without dictionary */
-	for (i = 0; i < (ESCAPE_ANY_NP | ESCAPE_HEX) + 1; i++)
+	for (i = 0; i < ESCAPE_ALL_MASK + 1; i++)
 		test_string_escape("escape 0", escape0, i, TEST_STRING_2_DICT_0);
 
 	/* With dictionary */
-	for (i = 0; i < (ESCAPE_ANY_NP | ESCAPE_HEX) + 1; i++)
+	for (i = 0; i < ESCAPE_ALL_MASK + 1; i++)
 		test_string_escape("escape 1", escape1, i, TEST_STRING_2_DICT_1);
 
 	/* Test string_get_size() */
_

* [patch 157/192] MAINTAINERS: add myself as designated reviewer for generic string library
  2021-07-01  1:46 incoming Andrew Morton
                   ` (155 preceding siblings ...)
  2021-07-01  1:55 ` [patch 156/192] lib/test-string_helpers: add test cases for new features Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 158/192] seq_file: introduce seq_escape_mem() Andrew Morton
                   ` (35 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: MAINTAINERS: add myself as designated reviewer for generic string library

Add myself as designated reviewer for generic string library.

Link: https://lkml.kernel.org/r/20210504180819.73127-11-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/MAINTAINERS~maintainers-add-myself-as-designated-reviewer-for-generic-string-library
+++ a/MAINTAINERS
@@ -7648,6 +7648,14 @@ L:	linux-input@vger.kernel.org
 S:	Maintained
 F:	drivers/input/touchscreen/resistive-adc-touch.c
 
+GENERIC STRING LIBRARY
+R:	Andy Shevchenko <andy@kernel.org>
+S:	Maintained
+F:	lib/string.c
+F:	lib/string_helpers.c
+F:	lib/test_string.c
+F:	lib/test-string_helpers.c
+
 GENERIC UIO DRIVER FOR PCI DEVICES
 M:	"Michael S. Tsirkin" <mst@redhat.com>
 L:	kvm@vger.kernel.org
_

* [patch 158/192] seq_file: introduce seq_escape_mem()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (156 preceding siblings ...)
  2021-07-01  1:55 ` [patch 157/192] MAINTAINERS: add myself as designated reviewer for generic string library Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 159/192] seq_file: add seq_escape_str() as replica of string_escape_str() Andrew Morton
                   ` (34 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: seq_file: introduce seq_escape_mem()

Introduce seq_escape_mem() to allow users to pass additional parameters to
string_escape_mem().
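
A sketch of the intended use (demo_show(), blob and blob_len are
illustrative names, not part of this patch):

	static int demo_show(struct seq_file *m, void *v)
	{
		/* Octal-escape only double quotes and backslashes. */
		seq_escape_mem(m, blob, blob_len, ESCAPE_OCTAL, "\"\\");
		return 0;
	}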

Link: https://lkml.kernel.org/r/20210504180819.73127-12-andriy.shevchenko@linux.intel.com
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/seq_file.c            |   25 +++++++++++++++++++++++++
 include/linux/seq_file.h |    2 ++
 2 files changed, 27 insertions(+)

--- a/fs/seq_file.c~seq_file-introduce-seq_escape_mem
+++ a/fs/seq_file.c
@@ -356,6 +356,31 @@ int seq_release(struct inode *inode, str
 EXPORT_SYMBOL(seq_release);
 
 /**
+ * seq_escape_mem - print data into buffer, escaping some characters
+ * @m: target buffer
+ * @src: source buffer
+ * @len: size of source buffer
+ * @flags: flags to pass to string_escape_mem()
+ * @esc: set of characters that need escaping
+ *
+ * Puts data into buffer, replacing each occurrence of character from
+ * given class (defined by @flags and @esc) with printable escaped sequence.
+ *
+ * Use seq_has_overflowed() to check for errors.
+ */
+void seq_escape_mem(struct seq_file *m, const char *src, size_t len,
+		    unsigned int flags, const char *esc)
+{
+	char *buf;
+	size_t size = seq_get_buf(m, &buf);
+	int ret;
+
+	ret = string_escape_mem(src, len, buf, size, flags, esc);
+	seq_commit(m, ret < size ? ret : -1);
+}
+EXPORT_SYMBOL(seq_escape_mem);
+
+/**
  *	seq_escape -	print string into buffer, escaping some characters
  *	@m:	target buffer
  *	@s:	string
--- a/include/linux/seq_file.h~seq_file-introduce-seq_escape_mem
+++ a/include/linux/seq_file.h
@@ -126,6 +126,8 @@ void seq_put_decimal_ll(struct seq_file
 void seq_put_hex_ll(struct seq_file *m, const char *delimiter,
 		    unsigned long long v, unsigned int width);
 
+void seq_escape_mem(struct seq_file *m, const char *src, size_t len,
+		    unsigned int flags, const char *esc);
 void seq_escape(struct seq_file *m, const char *s, const char *esc);
 void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz);
 
_

* [patch 159/192] seq_file: add seq_escape_str() as replica of string_escape_str()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (157 preceding siblings ...)
  2021-07-01  1:55 ` [patch 158/192] seq_file: introduce seq_escape_mem() Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 160/192] seq_file: convert seq_escape() to use seq_escape_str() Andrew Morton
                   ` (33 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: seq_file: add seq_escape_str() as replica of string_escape_str()

In some cases we want to escape characters from NULL-terminated strings. 
Add seq_escape_str() as a replica of string_escape_str() for that.
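
For example (illustrative only), octal-escaping whitespace and
backslashes in a NULL-terminated path becomes a one-liner:

	seq_escape_str(m, path, ESCAPE_OCTAL, " \t\n\\");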

Link: https://lkml.kernel.org/r/20210504180819.73127-13-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/seq_file.h |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/include/linux/seq_file.h~seq_file-add-seq_escape_str-as-replica-of-string_escape_str
+++ a/include/linux/seq_file.h
@@ -128,6 +128,13 @@ void seq_put_hex_ll(struct seq_file *m,
 
 void seq_escape_mem(struct seq_file *m, const char *src, size_t len,
 		    unsigned int flags, const char *esc);
+
+static inline void seq_escape_str(struct seq_file *m, const char *src,
+				  unsigned int flags, const char *esc)
+{
+	seq_escape_mem(m, src, strlen(src), flags, esc);
+}
+
 void seq_escape(struct seq_file *m, const char *s, const char *esc);
 void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz);
 
_

* [patch 160/192] seq_file: convert seq_escape() to use seq_escape_str()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (158 preceding siblings ...)
  2021-07-01  1:55 ` [patch 159/192] seq_file: add seq_escape_str() as replica of string_escape_str() Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 161/192] nfsd: avoid non-flexible API in seq_quote_mem() Andrew Morton
                   ` (32 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: seq_file: convert seq_escape() to use seq_escape_str()

Convert seq_escape() to use seq_escape_str() rather than open coding it.

Note, for now we leave it as an exported symbol due to some old code that
can't tolerate ctype.h being (indirectly) included.

Link: https://lkml.kernel.org/r/20210504180819.73127-14-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/seq_file.c |    7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

--- a/fs/seq_file.c~seq_file-convert-seq_escape-to-use-seq_escape_str
+++ a/fs/seq_file.c
@@ -392,12 +392,7 @@ EXPORT_SYMBOL(seq_escape_mem);
  */
 void seq_escape(struct seq_file *m, const char *s, const char *esc)
 {
-	char *buf;
-	size_t size = seq_get_buf(m, &buf);
-	int ret;
-
-	ret = string_escape_str(s, buf, size, ESCAPE_OCTAL, esc);
-	seq_commit(m, ret < size ? ret : -1);
+	seq_escape_str(m, s, ESCAPE_OCTAL, esc);
 }
 EXPORT_SYMBOL(seq_escape);
 
_

* [patch 161/192] nfsd: avoid non-flexible API in seq_quote_mem()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (159 preceding siblings ...)
  2021-07-01  1:55 ` [patch 160/192] seq_file: convert seq_escape() to use seq_escape_str() Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 162/192] seq_file: drop unused *_escape_mem_ascii() Andrew Morton
                   ` (31 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: nfsd: avoid non-flexible API in seq_quote_mem()

seq_escape_mem_ascii() is completely inflexible and shouldn't be used. 
Replace it with a suitably parameterized call to seq_escape_mem(): with
ESCAPE_HEX | ESCAPE_NAP | ESCAPE_APPEND and the "\"\\" dictionary,
non-printable and non-ASCII bytes are hex-escaped and quotes and
backslashes are escaped as well, matching the old output.

Link: https://lkml.kernel.org/r/20210504180819.73127-15-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nfsd/nfs4state.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/nfsd/nfs4state.c~nfsd-avoid-non-flexible-api-in-seq_quote_mem
+++ a/fs/nfsd/nfs4state.c
@@ -2351,7 +2351,7 @@ static struct nfs4_client *get_nfsdfs_cl
 static void seq_quote_mem(struct seq_file *m, char *data, int len)
 {
 	seq_printf(m, "\"");
-	seq_escape_mem_ascii(m, data, len);
+	seq_escape_mem(m, data, len, ESCAPE_HEX | ESCAPE_NAP | ESCAPE_APPEND, "\"\\");
 	seq_printf(m, "\"");
 }
 
_

* [patch 162/192] seq_file: drop unused *_escape_mem_ascii()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (160 preceding siblings ...)
  2021-07-01  1:55 ` [patch 161/192] nfsd: avoid non-flexible API in seq_quote_mem() Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 163/192] lib/math/rational.c: fix divide by zero Andrew Morton
                   ` (30 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, bfields, chuck.lever, linux-mm,
	mm-commits, torvalds, viro

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: seq_file: drop unused *_escape_mem_ascii()

There are no more users of seq_escape_mem_ascii(), nor of the
string_escape_mem_ascii() helper backing it.

Remove them for good.

Link: https://lkml.kernel.org/r/20210504180819.73127-16-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/seq_file.c                  |   11 -----------
 include/linux/seq_file.h       |    1 -
 include/linux/string_helpers.h |    3 ---
 lib/string_helpers.c           |   19 -------------------
 4 files changed, 34 deletions(-)

--- a/fs/seq_file.c~seq_file-drop-unused-_escape_mem_ascii
+++ a/fs/seq_file.c
@@ -396,17 +396,6 @@ void seq_escape(struct seq_file *m, cons
 }
 EXPORT_SYMBOL(seq_escape);
 
-void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz)
-{
-	char *buf;
-	size_t size = seq_get_buf(m, &buf);
-	int ret;
-
-	ret = string_escape_mem_ascii(src, isz, buf, size);
-	seq_commit(m, ret < size ? ret : -1);
-}
-EXPORT_SYMBOL(seq_escape_mem_ascii);
-
 void seq_vprintf(struct seq_file *m, const char *f, va_list args)
 {
 	int len;
--- a/include/linux/seq_file.h~seq_file-drop-unused-_escape_mem_ascii
+++ a/include/linux/seq_file.h
@@ -136,7 +136,6 @@ static inline void seq_escape_str(struct
 }
 
 void seq_escape(struct seq_file *m, const char *s, const char *esc);
-void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz);
 
 void seq_hex_dump(struct seq_file *m, const char *prefix_str, int prefix_type,
 		  int rowsize, int groupsize, const void *buf, size_t len,
--- a/include/linux/string_helpers.h~seq_file-drop-unused-_escape_mem_ascii
+++ a/include/linux/string_helpers.h
@@ -63,9 +63,6 @@ static inline int string_unescape_any_in
 int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz,
 		unsigned int flags, const char *only);
 
-int string_escape_mem_ascii(const char *src, size_t isz, char *dst,
-					size_t osz);
-
 static inline int string_escape_mem_any_np(const char *src, size_t isz,
 		char *dst, size_t osz, const char *only)
 {
--- a/lib/string_helpers.c~seq_file-drop-unused-_escape_mem_ascii
+++ a/lib/string_helpers.c
@@ -582,25 +582,6 @@ int string_escape_mem(const char *src, s
 }
 EXPORT_SYMBOL(string_escape_mem);
 
-int string_escape_mem_ascii(const char *src, size_t isz, char *dst,
-					size_t osz)
-{
-	char *p = dst;
-	char *end = p + osz;
-
-	while (isz--) {
-		unsigned char c = *src++;
-
-		if (!isprint(c) || !isascii(c) || c == '"' || c == '\\')
-			escape_hex(c, &p, end);
-		else
-			escape_passthrough(c, &p, end);
-	}
-
-	return p - dst;
-}
-EXPORT_SYMBOL(string_escape_mem_ascii);
-
 /*
  * Return an allocated string that has been escaped of special characters
  * and double quotes, making it safe to log in quotes.
_

* [patch 163/192] lib/math/rational.c: fix divide by zero
  2021-07-01  1:46 incoming Andrew Morton
                   ` (161 preceding siblings ...)
  2021-07-01  1:55 ` [patch 162/192] seq_file: drop unused *_escape_mem_ascii() Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 164/192] lib/math/rational: add Kunit test cases Andrew Morton
                   ` (29 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, dlatypov, linux-mm, mm-commits, oskar,
	torvalds, tpiepho, yguoaz

From: Trent Piepho <tpiepho@gmail.com>
Subject: lib/math/rational.c: fix divide by zero

If the input is out of the range of the allowed values, either larger than
the largest value or closer to zero than the smallest non-zero allowed
value, then a division by zero would occur.

In the case of an input that is too large, the division by zero would
occur on the first iteration.  The best result (the largest allowed
value) is found by always choosing the semi-convergent and excluding the
denominator-based limit when computing it.

In the case of an input that is too small, the division by zero would
occur on the second iteration.  The numerator-based semi-convergent must
not be calculated, to avoid the division by zero.  But the semi-convergent
vs previous convergent test is still needed, and it effectively chooses
between 0 (the previous convergent) and the smallest allowed fraction
(the best semi-convergent) as the result.
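
For illustration, two inputs of the kinds described above (taken from
the KUnit cases added in the next patch):

	unsigned long n, d;

	/* Too large: always choosing the semi-convergent gives 100/1. */
	rational_best_approximation(1230, 10, 100, 20, &n, &d);

	/* Too close to zero: the previous convergent 0/1 is the result. */
	rational_best_approximation(1, 30, 100, 10, &n, &d);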

Link: https://lkml.kernel.org/r/20210525144250.214670-1-tpiepho@gmail.com
Fixes: 323dd2c3ed0 ("lib/math/rational.c: fix possible incorrect result from rational fractions helper")
Signed-off-by: Trent Piepho <tpiepho@gmail.com>
Reported-by: Yiyuan Guo <yguoaz@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Oskar Schirmer <oskar@scara.com>
Cc: Daniel Latypov <dlatypov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/math/rational.c |   16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

--- a/lib/math/rational.c~lib-math-rationalc-fix-divide-by-zero
+++ a/lib/math/rational.c
@@ -12,6 +12,7 @@
 #include <linux/compiler.h>
 #include <linux/export.h>
 #include <linux/minmax.h>
+#include <linux/limits.h>
 
 /*
  * calculate best rational approximation for a given fraction
@@ -78,13 +79,18 @@ void rational_best_approximation(
 		 * found below as 't'.
 		 */
 		if ((n2 > max_numerator) || (d2 > max_denominator)) {
-			unsigned long t = min((max_numerator - n0) / n1,
-					      (max_denominator - d0) / d1);
+			unsigned long t = ULONG_MAX;
 
-			/* This tests if the semi-convergent is closer
-			 * than the previous convergent.
+			if (d1)
+				t = (max_denominator - d0) / d1;
+			if (n1)
+				t = min(t, (max_numerator - n0) / n1);
+
+			/* This tests if the semi-convergent is closer than the previous
+			 * convergent.  If d1 is zero there is no previous convergent as this
+			 * is the 1st iteration, so always choose the semi-convergent.
 			 */
-			if (2u * t > a || (2u * t == a && d0 * dp > d1 * d)) {
+			if (!d1 || 2u * t > a || (2u * t == a && d0 * dp > d1 * d)) {
 				n1 = n0 + t * n1;
 				d1 = d0 + t * d1;
 			}
_

* [patch 164/192] lib/math/rational: add Kunit test cases
  2021-07-01  1:46 incoming Andrew Morton
                   ` (162 preceding siblings ...)
  2021-07-01  1:55 ` [patch 163/192] lib/math/rational.c: fix divide by zero Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 165/192] lib/decompressors: fix spelling mistakes Andrew Morton
                   ` (28 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, colin.king, dlatypov, linux-mm,
	mm-commits, oskar, torvalds, tpiepho, yguoaz

From: Trent Piepho <tpiepho@gmail.com>
Subject: lib/math/rational: add Kunit test cases

Add a number of test cases that cover a range of possible code paths.

[akpm@linux-foundation.org: remove non-ascii characters, fix whitespace]
[colin.king@canonical.com: fix spelling mistake "demominator" -> "denominator"]
  Link: https://lkml.kernel.org/r/20210526085049.6393-1-colin.king@canonical.com
Link: https://lkml.kernel.org/r/20210525144250.214670-2-tpiepho@gmail.com
Signed-off-by: Trent Piepho <tpiepho@gmail.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Latypov <dlatypov@google.com>
Cc: Oskar Schirmer <oskar@scara.com>
Cc: Yiyuan Guo <yguoaz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig.debug        |   12 +++++++
 lib/math/Makefile        |    1 
 lib/math/rational-test.c |   56 +++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+)

--- a/lib/Kconfig.debug~lib-math-rational-add-kunit-test-cases
+++ a/lib/Kconfig.debug
@@ -2444,6 +2444,18 @@ config SLUB_KUNIT_TEST
 
 	  If unsure, say N.
 
+config RATIONAL_KUNIT_TEST
+	tristate "KUnit test for rational.c" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	select RATIONAL
+	default KUNIT_ALL_TESTS
+	help
+	  This builds the rational math unit test.
+	  For more information on KUnit and unit tests in general please refer
+	  to the KUnit documentation in Documentation/dev-tools/kunit/.
+
+	  If unsure, say N.
+
 config TEST_UDELAY
 	tristate "udelay test driver"
 	help
--- a/lib/math/Makefile~lib-math-rational-add-kunit-test-cases
+++ a/lib/math/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_PRIME_NUMBERS)	+= prime_num
 obj-$(CONFIG_RATIONAL)		+= rational.o
 
 obj-$(CONFIG_TEST_DIV64)	+= test_div64.o
+obj-$(CONFIG_RATIONAL_KUNIT_TEST) += rational-test.o
--- /dev/null
+++ a/lib/math/rational-test.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <kunit/test.h>
+
+#include <linux/rational.h>
+
+struct rational_test_param {
+	unsigned long num, den;
+	unsigned long max_num, max_den;
+	unsigned long exp_num, exp_den;
+
+	const char *name;
+};
+
+static const struct rational_test_param test_parameters[] = {
+	{ 1230,	10,	100, 20,	100, 1,    "Exceeds bounds, semi-convergent term > 1/2 last term" },
+	{ 34567,100, 	120, 20,	120, 1,    "Exceeds bounds, semi-convergent term < 1/2 last term" },
+	{ 1, 30,	100, 10,	0, 1,	   "Closest to zero" },
+	{ 1, 19,	100, 10,	1, 10,     "Closest to smallest non-zero" },
+	{ 27,32,	16, 16,		11, 13,    "Use convergent" },
+	{ 1155, 7735,	255, 255,	33, 221,   "Exact answer" },
+	{ 87, 32,	70, 32,		68, 25,    "Semiconvergent, numerator limit" },
+	{ 14533, 4626,	15000, 2400,	7433, 2366, "Semiconvergent, denominator limit" },
+};
+
+static void get_desc(const struct rational_test_param *param, char *desc)
+{
+	strscpy(desc, param->name, KUNIT_PARAM_DESC_SIZE);
+}
+
+/* Creates function rational_gen_params */
+KUNIT_ARRAY_PARAM(rational, test_parameters, get_desc);
+
+static void rational_test(struct kunit *test)
+{
+	const struct rational_test_param *param = (const struct rational_test_param *)test->param_value;
+	unsigned long n = 0, d = 0;
+
+	rational_best_approximation(param->num, param->den, param->max_num, param->max_den, &n, &d);
+	KUNIT_EXPECT_EQ(test, n, param->exp_num);
+	KUNIT_EXPECT_EQ(test, d, param->exp_den);
+}
+
+static struct kunit_case rational_test_cases[] = {
+	KUNIT_CASE_PARAM(rational_test, rational_gen_params),
+	{}
+};
+
+static struct kunit_suite rational_test_suite = {
+	.name = "rational",
+	.test_cases = rational_test_cases,
+};
+
+kunit_test_suites(&rational_test_suite);
+
+MODULE_LICENSE("GPL v2");
_

* [patch 165/192] lib/decompressors: fix spelling mistakes
  2021-07-01  1:46 incoming Andrew Morton
                   ` (163 preceding siblings ...)
  2021-07-01  1:55 ` [patch 164/192] lib/math/rational: add Kunit test cases Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:55 ` [patch 166/192] lib/mpi: " Andrew Morton
                   ` (27 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, jkosina, linux-mm, mm-commits, thunder.leizhen, torvalds, tsbogend

From: Zhen Lei <thunder.leizhen@huawei.com>
Subject: lib/decompressors: fix spelling mistakes

Fix some spelling mistakes in comments:
sentinal ==> sentinel
compresed ==> compressed
dependeny ==> dependency
immediatelly ==> immediately
dervied ==> derived
splitted ==> split
nore ==> not
independed ==> independent
asumed ==> assumed

Link: https://lkml.kernel.org/r/20210604085656.12257-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/decompress_bunzip2.c   |    4 ++--
 lib/decompress_unxz.c      |    2 +-
 lib/decompress_unzstd.c    |    4 ++--
 lib/xz/xz_dec_bcj.c        |    2 +-
 lib/xz/xz_dec_lzma2.c      |    8 ++++----
 lib/zlib_inflate/inffast.c |    2 +-
 lib/zstd/huf.h             |    2 +-
 7 files changed, 12 insertions(+), 12 deletions(-)

--- a/lib/decompress_bunzip2.c~lib-decompressors-fix-spelling-mistakes
+++ a/lib/decompress_bunzip2.c
@@ -80,7 +80,7 @@
 
 /* This is what we know about each Huffman coding group */
 struct group_data {
-	/* We have an extra slot at the end of limit[] for a sentinal value. */
+	/* We have an extra slot at the end of limit[] for a sentinel value. */
 	int limit[MAX_HUFCODE_BITS+1];
 	int base[MAX_HUFCODE_BITS];
 	int permute[MAX_SYMBOLS];
@@ -337,7 +337,7 @@ static int INIT get_next_block(struct bu
 			pp <<= 1;
 			base[i+1] = pp-(t += temp[i]);
 		}
-		limit[maxLen+1] = INT_MAX; /* Sentinal value for
+		limit[maxLen+1] = INT_MAX; /* Sentinel value for
 					    * reading next sym. */
 		limit[maxLen] = pp+temp[maxLen]-1;
 		base[minLen] = 0;
--- a/lib/decompress_unxz.c~lib-decompressors-fix-spelling-mistakes
+++ a/lib/decompress_unxz.c
@@ -23,7 +23,7 @@
  * uncompressible. Thus, we must look for worst-case expansion when the
  * compressor is encoding uncompressible data.
  *
- * The structure of the .xz file in case of a compresed kernel is as follows.
+ * The structure of the .xz file in case of a compressed kernel is as follows.
  * Sizes (as bytes) of the fields are in parenthesis.
  *
  *    Stream Header (12)
--- a/lib/decompress_unzstd.c~lib-decompressors-fix-spelling-mistakes
+++ a/lib/decompress_unzstd.c
@@ -16,7 +16,7 @@
  * uncompressible. Thus, we must look for worst-case expansion when the
  * compressor is encoding uncompressible data.
  *
- * The structure of the .zst file in case of a compresed kernel is as follows.
+ * The structure of the .zst file in case of a compressed kernel is as follows.
  * Maximum sizes (as bytes) of the fields are in parenthesis.
  *
  *    Frame Header: (18)
@@ -56,7 +56,7 @@
 /*
  * Preboot environments #include "path/to/decompress_unzstd.c".
  * All of the source files we depend on must be #included.
- * zstd's only source dependeny is xxhash, which has no source
+ * zstd's only source dependency is xxhash, which has no source
  * dependencies.
  *
  * When UNZSTD_PREBOOT is defined we declare __decompress(), which is
--- a/lib/xz/xz_dec_bcj.c~lib-decompressors-fix-spelling-mistakes
+++ a/lib/xz/xz_dec_bcj.c
@@ -422,7 +422,7 @@ XZ_EXTERN enum xz_ret xz_dec_bcj_run(str
 
 	/*
 	 * Flush pending already filtered data to the output buffer. Return
-	 * immediatelly if we couldn't flush everything, or if the next
+	 * immediately if we couldn't flush everything, or if the next
 	 * filter in the chain had already returned XZ_STREAM_END.
 	 */
 	if (s->temp.filtered > 0) {
--- a/lib/xz/xz_dec_lzma2.c~lib-decompressors-fix-spelling-mistakes
+++ a/lib/xz/xz_dec_lzma2.c
@@ -147,8 +147,8 @@ struct lzma_dec {
 
 	/*
 	 * LZMA properties or related bit masks (number of literal
-	 * context bits, a mask dervied from the number of literal
-	 * position bits, and a mask dervied from the number
+	 * context bits, a mask derived from the number of literal
+	 * position bits, and a mask derived from the number
 	 * position bits)
 	 */
 	uint32_t lc;
@@ -484,7 +484,7 @@ static __always_inline void rc_normalize
 }
 
 /*
- * Decode one bit. In some versions, this function has been splitted in three
+ * Decode one bit. In some versions, this function has been split in three
  * functions so that the compiler is supposed to be able to more easily avoid
  * an extra branch. In this particular version of the LZMA decoder, this
  * doesn't seem to be a good idea (tested with GCC 3.3.6, 3.4.6, and 4.3.3
@@ -761,7 +761,7 @@ static bool lzma_main(struct xz_dec_lzma
 }
 
 /*
- * Reset the LZMA decoder and range decoder state. Dictionary is nore reset
+ * Reset the LZMA decoder and range decoder state. Dictionary is not reset
  * here, because LZMA state may be reset without resetting the dictionary.
  */
 static void lzma_reset(struct xz_dec_lzma2 *s)
--- a/lib/zlib_inflate/inffast.c~lib-decompressors-fix-spelling-mistakes
+++ a/lib/zlib_inflate/inffast.c
@@ -15,7 +15,7 @@ union uu {
 	unsigned char b[2];
 };
 
-/* Endian independed version */
+/* Endian independent version */
 static inline unsigned short
 get_unaligned16(const unsigned short *p)
 {
--- a/lib/zstd/huf.h~lib-decompressors-fix-spelling-mistakes
+++ a/lib/zstd/huf.h
@@ -134,7 +134,7 @@ typedef enum {
 	HUF_repeat_none,  /**< Cannot use the previous table */
 	HUF_repeat_check, /**< Can use the previous table but it must be checked. Note : The previous table must have been constructed by HUF_compress{1,
 			     4}X_repeat */
-	HUF_repeat_valid  /**< Can use the previous table and it is asumed to be valid */
+	HUF_repeat_valid  /**< Can use the previous table and it is assumed to be valid */
 } HUF_repeat;
 /** HUF_compress4X_repeat() :
 *   Same as HUF_compress4X_wksp(), but considers using hufTable if *repeat != HUF_repeat_none.
_

* [patch 166/192] lib/mpi: fix spelling mistakes
  2021-07-01  1:46 incoming Andrew Morton
                   ` (164 preceding siblings ...)
  2021-07-01  1:55 ` [patch 165/192] lib/decompressors: fix spelling mistakes Andrew Morton
@ 2021-07-01  1:55 ` Andrew Morton
  2021-07-01  1:56 ` [patch 167/192] lib: memscan() fixlet Andrew Morton
                   ` (26 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:55 UTC (permalink / raw)
  To: akpm, herbert, linux-mm, mm-commits, thunder.leizhen, torvalds

From: Zhen Lei <thunder.leizhen@huawei.com>
Subject: lib/mpi: fix spelling mistakes

Fix some spelling mistakes in comments:
flaged ==> flagged
bufer ==> buffer
multipler ==> multiplier
MULTIPLER ==> MULTIPLIER
leaset ==> least
chnage ==> change

Link: https://lkml.kernel.org/r/20210604074401.12198-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mpi.h |    4 ++--
 lib/mpi/longlong.h  |    4 ++--
 lib/mpi/mpicoder.c  |    6 +++---
 lib/mpi/mpiutil.c   |    2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

--- a/include/linux/mpi.h~lib-mpi-fix-spelling-mistakes
+++ a/include/linux/mpi.h
@@ -200,7 +200,7 @@ struct mpi_ec_ctx {
 	unsigned int nbits;            /* Number of bits.  */
 
 	/* Domain parameters.  Note that they may not all be set and if set
-	 * the MPIs may be flaged as constant.
+	 * the MPIs may be flagged as constant.
 	 */
 	MPI p;         /* Prime specifying the field GF(p).  */
 	MPI a;         /* First coefficient of the Weierstrass equation.  */
@@ -267,7 +267,7 @@ int mpi_ec_curve_point(MPI_POINT point,
 /**
  * mpi_get_size() - returns max size required to store the number
  *
- * @a:	A multi precision integer for which we want to allocate a bufer
+ * @a:	A multi precision integer for which we want to allocate a buffer
  *
  * Return: size required to store the number
  */
--- a/lib/mpi/longlong.h~lib-mpi-fix-spelling-mistakes
+++ a/lib/mpi/longlong.h
@@ -48,8 +48,8 @@
 
 /* Define auxiliary asm macros.
  *
- * 1) umul_ppmm(high_prod, low_prod, multipler, multiplicand) multiplies two
- * UWtype integers MULTIPLER and MULTIPLICAND, and generates a two UWtype
+ * 1) umul_ppmm(high_prod, low_prod, multiplier, multiplicand) multiplies two
+ * UWtype integers MULTIPLIER and MULTIPLICAND, and generates a two UWtype
  * word product in HIGH_PROD and LOW_PROD.
  *
  * 2) __umulsidi3(a,b) multiplies two UWtype integers A and B, and returns a
--- a/lib/mpi/mpicoder.c~lib-mpi-fix-spelling-mistakes
+++ a/lib/mpi/mpicoder.c
@@ -234,11 +234,11 @@ static int count_lzeros(MPI a)
 }
 
 /**
- * mpi_read_buffer() - read MPI to a bufer provided by user (msb first)
+ * mpi_read_buffer() - read MPI to a buffer provided by user (msb first)
  *
  * @a:		a multi precision integer
- * @buf:	bufer to which the output will be written to. Needs to be at
- *		leaset mpi_get_size(a) long.
+ * @buf:	buffer to which the output will be written to. Needs to be at
+ *		least mpi_get_size(a) long.
  * @buf_len:	size of the buf.
  * @nbytes:	receives the actual length of the data written on success and
  *		the data to-be-written on -EOVERFLOW in case buf_len was too
--- a/lib/mpi/mpiutil.c~lib-mpi-fix-spelling-mistakes
+++ a/lib/mpi/mpiutil.c
@@ -80,7 +80,7 @@ EXPORT_SYMBOL_GPL(mpi_const);
 /****************
  * Note:  It was a bad idea to use the number of limbs to allocate
  *	  because on a alpha the limbs are large but we normally need
- *	  integers of n bits - So we should chnage this to bits (or bytes).
+ *	  integers of n bits - So we should change this to bits (or bytes).
  *
  *	  But mpi_alloc is used in a lot of places :-)
  */
_

* [patch 167/192] lib: memscan() fixlet
  2021-07-01  1:46 incoming Andrew Morton
                   ` (165 preceding siblings ...)
  2021-07-01  1:55 ` [patch 166/192] lib/mpi: " Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 168/192] lib: uninline simple_strtoull() Andrew Morton
                   ` (25 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: lib: memscan() fixlet

The generic version doesn't truncate the second argument to char.

Its older brother memchr() does, as do the s390, sparc and i386 assembly
versions.

Fortunately, no code passes c >= 256.
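
A contrived illustration (not from the patch) of what changes:

	char buf[] = "abc";

	/* With the cast, the high bits of c are ignored, as memchr()
	 * and the assembly versions already do, so this finds 'b'. */
	void *p = memscan(buf, 'b' | 0x100, sizeof(buf) - 1);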

Link: https://lkml.kernel.org/r/YLv4cCf0t5UPdyK+@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/string.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/string.c~lib-memscan-fixlet
+++ a/lib/string.c
@@ -977,7 +977,7 @@ void *memscan(void *addr, int c, size_t
 	unsigned char *p = addr;
 
 	while (size) {
-		if (*p == c)
+		if (*p == (unsigned char)c)
 			return (void *)p;
 		p++;
 		size--;
_

* [patch 168/192] lib: uninline simple_strtoull()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (166 preceding siblings ...)
  2021-07-01  1:56 ` [patch 167/192] lib: memscan() fixlet Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 169/192] lib/test_string.c: allow module removal Andrew Morton
                   ` (24 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: lib: uninline simple_strtoull()

Gcc inlines simple_strtoull() too aggressively.

Given that all 4 signatures match, everything very efficiently calls or
tailcalls into simple_strtoull():

	ffffffff81da0240 <simple_strtoll>:
	ffffffff81da0240:       80 3f 2d                cmp    BYTE PTR [rdi],0x2d
	ffffffff81da0243:       74 05                   je     ffffffff81da024a <simple_strtoll+0xa>
	ffffffff81da0245:       e9 76 ff ff ff          jmp    simple_strtoull
	ffffffff81da024a:       48 83 c7 01             add    rdi,0x1
	ffffffff81da024e:       e8 6d ff ff ff          call   simple_strtoull
	ffffffff81da0253:       48 f7 d8                neg    rax
	ffffffff81da0256:       c3                      ret

Space savings (on F34-ish .config)

	add/remove: 0/0 grow/shrink: 1/3 up/down: 52/-313 (-261)
	Function                                     old     new   delta
	vsscanf                                     2167    2219     +52
	simple_strtoul                                72       2     -70
	simple_strtoll                               143      23    -120
	simple_strtol                                143      20    -123

Link: https://lkml.kernel.org/r/YMO2zoOQk2eF34tn@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/vsprintf.c |    1 +
 1 file changed, 1 insertion(+)

--- a/lib/vsprintf.c~lib-uninline-simple_strtoull
+++ a/lib/vsprintf.c
@@ -61,6 +61,7 @@
  *
  * This function has caveats. Please use kstrtoull instead.
  */
+noinline
 unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base)
 {
 	unsigned long long result;
_

* [patch 169/192] lib/test_string.c: allow module removal
  2021-07-01  1:46 incoming Andrew Morton
                   ` (167 preceding siblings ...)
  2021-07-01  1:56 ` [patch 168/192] lib: uninline simple_strtoull() Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 170/192] kernel.h: split out kstrtox() and simple_strtox() to a separate header Andrew Morton
                   ` (23 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, linux-mm, mcroce, mm-commits, torvalds

From: Matteo Croce <mcroce@microsoft.com>
Subject: lib/test_string.c: allow module removal

The test_string module can't be removed because it lacks an exit hook. 
Since there is no reason for it to be permanent, add an empty one to allow
module removal.

Link: https://lkml.kernel.org/r/20210616234503.28678-1-mcroce@linux.microsoft.com
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_string.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/lib/test_string.c~lib-test_stringc-allow-mode-removal
+++ a/lib/test_string.c
@@ -179,6 +179,10 @@ static __init int strnchr_selftest(void)
 	return 0;
 }
 
+static __exit void string_selftest_remove(void)
+{
+}
+
 static __init int string_selftest_init(void)
 {
 	int test, subtest;
@@ -216,4 +220,5 @@ fail:
 }
 
 module_init(string_selftest_init);
+module_exit(string_selftest_remove);
 MODULE_LICENSE("GPL v2");
_

* [patch 170/192] kernel.h: split out kstrtox() and simple_strtox() to a separate header
  2021-07-01  1:46 incoming Andrew Morton
                   ` (168 preceding siblings ...)
  2021-07-01  1:56 ` [patch 169/192] lib/test_string.c: allow module removal Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 171/192] lz4_decompress: declare LZ4_decompress_safe_withPrefix64k static Andrew Morton
                   ` (22 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, anna.schumaker, bfields, chuck.lever,
	Jonathan.Cameron, kerneldev, laniel_francis, linux-mm,
	mm-commits, rdunlap, torvalds, trond.myklebust

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: kernel.h: split out kstrtox() and simple_strtox() to a separate header

kernel.h has been used as a dumping ground for all kinds of stuff for a
long time.  Here is an attempt to start cleaning it up by splitting out
the kstrtox() and simple_strtox() helpers.

At the same time, convert users in the header and lib folders to use the
new header.  For the time being, though, include the new header back into
kernel.h to avoid twisted indirect includes for existing users.
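
After this change a user needing only the string conversion helpers can
do, e.g. (illustrative snippet):

	#include <linux/kstrtox.h>

	unsigned long val;

	if (kstrtoul(buf, 0, &val))
		return -EINVAL;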

[andy.shevchenko@gmail.com: fix documentation references]
  Link: https://lkml.kernel.org/r/20210615220003.377901-1-andy.shevchenko@gmail.com
Link: https://lkml.kernel.org/r/20210611185815.44103-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Francis Laniel <laniel_francis@privacyrequired.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Kars Mulder <kerneldev@karsmulder.nl>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/kernel-api.rst |    7 -
 include/linux/kernel.h                |  143 ----------------------
 include/linux/kstrtox.h               |  155 ++++++++++++++++++++++++
 include/linux/string.h                |    7 -
 include/linux/sunrpc/cache.h          |    1 
 lib/kstrtox.c                         |    5 
 lib/parser.c                          |    1 
 7 files changed, 163 insertions(+), 156 deletions(-)

--- a/Documentation/core-api/kernel-api.rst~kernelh-split-out-kstrtox-and-simple_strtox-to-a-separate-header
+++ a/Documentation/core-api/kernel-api.rst
@@ -24,11 +24,8 @@ String Conversions
 .. kernel-doc:: lib/vsprintf.c
    :export:
 
-.. kernel-doc:: include/linux/kernel.h
-   :functions: kstrtol
-
-.. kernel-doc:: include/linux/kernel.h
-   :functions: kstrtoul
+.. kernel-doc:: include/linux/kstrtox.h
+   :functions: kstrtol kstrtoul
 
 .. kernel-doc:: lib/kstrtox.c
    :export:
--- a/include/linux/kernel.h~kernelh-split-out-kstrtox-and-simple_strtox-to-a-separate-header
+++ a/include/linux/kernel.h
@@ -10,6 +10,7 @@
 #include <linux/types.h>
 #include <linux/compiler.h>
 #include <linux/bitops.h>
+#include <linux/kstrtox.h>
 #include <linux/log2.h>
 #include <linux/math.h>
 #include <linux/minmax.h>
@@ -180,148 +181,6 @@ static inline void might_fault(void) { }
 void do_exit(long error_code) __noreturn;
 void complete_and_exit(struct completion *, long) __noreturn;
 
-/* Internal, do not use. */
-int __must_check _kstrtoul(const char *s, unsigned int base, unsigned long *res);
-int __must_check _kstrtol(const char *s, unsigned int base, long *res);
-
-int __must_check kstrtoull(const char *s, unsigned int base, unsigned long long *res);
-int __must_check kstrtoll(const char *s, unsigned int base, long long *res);
-
-/**
- * kstrtoul - convert a string to an unsigned long
- * @s: The start of the string. The string must be null-terminated, and may also
- *  include a single newline before its terminating null. The first character
- *  may also be a plus sign, but not a minus sign.
- * @base: The number base to use. The maximum supported base is 16. If base is
- *  given as 0, then the base of the string is automatically detected with the
- *  conventional semantics - If it begins with 0x the number will be parsed as a
- *  hexadecimal (case insensitive), if it otherwise begins with 0, it will be
- *  parsed as an octal number. Otherwise it will be parsed as a decimal.
- * @res: Where to write the result of the conversion on success.
- *
- * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error.
- * Preferred over simple_strtoul(). Return code must be checked.
-*/
-static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res)
-{
-	/*
-	 * We want to shortcut function call, but
-	 * __builtin_types_compatible_p(unsigned long, unsigned long long) = 0.
-	 */
-	if (sizeof(unsigned long) == sizeof(unsigned long long) &&
-	    __alignof__(unsigned long) == __alignof__(unsigned long long))
-		return kstrtoull(s, base, (unsigned long long *)res);
-	else
-		return _kstrtoul(s, base, res);
-}
-
-/**
- * kstrtol - convert a string to a long
- * @s: The start of the string. The string must be null-terminated, and may also
- *  include a single newline before its terminating null. The first character
- *  may also be a plus sign or a minus sign.
- * @base: The number base to use. The maximum supported base is 16. If base is
- *  given as 0, then the base of the string is automatically detected with the
- *  conventional semantics - If it begins with 0x the number will be parsed as a
- *  hexadecimal (case insensitive), if it otherwise begins with 0, it will be
- *  parsed as an octal number. Otherwise it will be parsed as a decimal.
- * @res: Where to write the result of the conversion on success.
- *
- * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error.
- * Preferred over simple_strtol(). Return code must be checked.
- */
-static inline int __must_check kstrtol(const char *s, unsigned int base, long *res)
-{
-	/*
-	 * We want to shortcut function call, but
-	 * __builtin_types_compatible_p(long, long long) = 0.
-	 */
-	if (sizeof(long) == sizeof(long long) &&
-	    __alignof__(long) == __alignof__(long long))
-		return kstrtoll(s, base, (long long *)res);
-	else
-		return _kstrtol(s, base, res);
-}
-
-int __must_check kstrtouint(const char *s, unsigned int base, unsigned int *res);
-int __must_check kstrtoint(const char *s, unsigned int base, int *res);
-
-static inline int __must_check kstrtou64(const char *s, unsigned int base, u64 *res)
-{
-	return kstrtoull(s, base, res);
-}
-
-static inline int __must_check kstrtos64(const char *s, unsigned int base, s64 *res)
-{
-	return kstrtoll(s, base, res);
-}
-
-static inline int __must_check kstrtou32(const char *s, unsigned int base, u32 *res)
-{
-	return kstrtouint(s, base, res);
-}
-
-static inline int __must_check kstrtos32(const char *s, unsigned int base, s32 *res)
-{
-	return kstrtoint(s, base, res);
-}
-
-int __must_check kstrtou16(const char *s, unsigned int base, u16 *res);
-int __must_check kstrtos16(const char *s, unsigned int base, s16 *res);
-int __must_check kstrtou8(const char *s, unsigned int base, u8 *res);
-int __must_check kstrtos8(const char *s, unsigned int base, s8 *res);
-int __must_check kstrtobool(const char *s, bool *res);
-
-int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res);
-int __must_check kstrtoll_from_user(const char __user *s, size_t count, unsigned int base, long long *res);
-int __must_check kstrtoul_from_user(const char __user *s, size_t count, unsigned int base, unsigned long *res);
-int __must_check kstrtol_from_user(const char __user *s, size_t count, unsigned int base, long *res);
-int __must_check kstrtouint_from_user(const char __user *s, size_t count, unsigned int base, unsigned int *res);
-int __must_check kstrtoint_from_user(const char __user *s, size_t count, unsigned int base, int *res);
-int __must_check kstrtou16_from_user(const char __user *s, size_t count, unsigned int base, u16 *res);
-int __must_check kstrtos16_from_user(const char __user *s, size_t count, unsigned int base, s16 *res);
-int __must_check kstrtou8_from_user(const char __user *s, size_t count, unsigned int base, u8 *res);
-int __must_check kstrtos8_from_user(const char __user *s, size_t count, unsigned int base, s8 *res);
-int __must_check kstrtobool_from_user(const char __user *s, size_t count, bool *res);
-
-static inline int __must_check kstrtou64_from_user(const char __user *s, size_t count, unsigned int base, u64 *res)
-{
-	return kstrtoull_from_user(s, count, base, res);
-}
-
-static inline int __must_check kstrtos64_from_user(const char __user *s, size_t count, unsigned int base, s64 *res)
-{
-	return kstrtoll_from_user(s, count, base, res);
-}
-
-static inline int __must_check kstrtou32_from_user(const char __user *s, size_t count, unsigned int base, u32 *res)
-{
-	return kstrtouint_from_user(s, count, base, res);
-}
-
-static inline int __must_check kstrtos32_from_user(const char __user *s, size_t count, unsigned int base, s32 *res)
-{
-	return kstrtoint_from_user(s, count, base, res);
-}
-
-/*
- * Use kstrto<foo> instead.
- *
- * NOTE: simple_strto<foo> does not check for the range overflow and,
- *	 depending on the input, may give interesting results.
- *
- * Use these functions if and only if you cannot use kstrto<foo>, because
- * the conversion ends on the first non-digit character, which may be far
- * beyond the supported range. It might be useful to parse the strings like
- * 10x50 or 12:21 without altering original string or temporary buffer in use.
- * Keep in mind above caveat.
- */
-
-extern unsigned long simple_strtoul(const char *,char **,unsigned int);
-extern long simple_strtol(const char *,char **,unsigned int);
-extern unsigned long long simple_strtoull(const char *,char **,unsigned int);
-extern long long simple_strtoll(const char *,char **,unsigned int);
-
 extern int num_to_str(char *buf, int size,
 		      unsigned long long num, unsigned int width);
 
--- /dev/null
+++ a/include/linux/kstrtox.h
@@ -0,0 +1,155 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_KSTRTOX_H
+#define _LINUX_KSTRTOX_H
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/* Internal, do not use. */
+int __must_check _kstrtoul(const char *s, unsigned int base, unsigned long *res);
+int __must_check _kstrtol(const char *s, unsigned int base, long *res);
+
+int __must_check kstrtoull(const char *s, unsigned int base, unsigned long long *res);
+int __must_check kstrtoll(const char *s, unsigned int base, long long *res);
+
+/**
+ * kstrtoul - convert a string to an unsigned long
+ * @s: The start of the string. The string must be null-terminated, and may also
+ *  include a single newline before its terminating null. The first character
+ *  may also be a plus sign, but not a minus sign.
+ * @base: The number base to use. The maximum supported base is 16. If base is
+ *  given as 0, then the base of the string is automatically detected with the
+ *  conventional semantics - If it begins with 0x the number will be parsed as a
+ *  hexadecimal (case insensitive), if it otherwise begins with 0, it will be
+ *  parsed as an octal number. Otherwise it will be parsed as a decimal.
+ * @res: Where to write the result of the conversion on success.
+ *
+ * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error.
+ * Preferred over simple_strtoul(). Return code must be checked.
+*/
+static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res)
+{
+	/*
+	 * We want to shortcut function call, but
+	 * __builtin_types_compatible_p(unsigned long, unsigned long long) = 0.
+	 */
+	if (sizeof(unsigned long) == sizeof(unsigned long long) &&
+	    __alignof__(unsigned long) == __alignof__(unsigned long long))
+		return kstrtoull(s, base, (unsigned long long *)res);
+	else
+		return _kstrtoul(s, base, res);
+}
+
+/**
+ * kstrtol - convert a string to a long
+ * @s: The start of the string. The string must be null-terminated, and may also
+ *  include a single newline before its terminating null. The first character
+ *  may also be a plus sign or a minus sign.
+ * @base: The number base to use. The maximum supported base is 16. If base is
+ *  given as 0, then the base of the string is automatically detected with the
+ *  conventional semantics - If it begins with 0x the number will be parsed as a
+ *  hexadecimal (case insensitive), if it otherwise begins with 0, it will be
+ *  parsed as an octal number. Otherwise it will be parsed as a decimal.
+ * @res: Where to write the result of the conversion on success.
+ *
+ * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error.
+ * Preferred over simple_strtol(). Return code must be checked.
+ */
+static inline int __must_check kstrtol(const char *s, unsigned int base, long *res)
+{
+	/*
+	 * We want to shortcut function call, but
+	 * __builtin_types_compatible_p(long, long long) = 0.
+	 */
+	if (sizeof(long) == sizeof(long long) &&
+	    __alignof__(long) == __alignof__(long long))
+		return kstrtoll(s, base, (long long *)res);
+	else
+		return _kstrtol(s, base, res);
+}
+
+int __must_check kstrtouint(const char *s, unsigned int base, unsigned int *res);
+int __must_check kstrtoint(const char *s, unsigned int base, int *res);
+
+static inline int __must_check kstrtou64(const char *s, unsigned int base, u64 *res)
+{
+	return kstrtoull(s, base, res);
+}
+
+static inline int __must_check kstrtos64(const char *s, unsigned int base, s64 *res)
+{
+	return kstrtoll(s, base, res);
+}
+
+static inline int __must_check kstrtou32(const char *s, unsigned int base, u32 *res)
+{
+	return kstrtouint(s, base, res);
+}
+
+static inline int __must_check kstrtos32(const char *s, unsigned int base, s32 *res)
+{
+	return kstrtoint(s, base, res);
+}
+
+int __must_check kstrtou16(const char *s, unsigned int base, u16 *res);
+int __must_check kstrtos16(const char *s, unsigned int base, s16 *res);
+int __must_check kstrtou8(const char *s, unsigned int base, u8 *res);
+int __must_check kstrtos8(const char *s, unsigned int base, s8 *res);
+int __must_check kstrtobool(const char *s, bool *res);
+
+int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res);
+int __must_check kstrtoll_from_user(const char __user *s, size_t count, unsigned int base, long long *res);
+int __must_check kstrtoul_from_user(const char __user *s, size_t count, unsigned int base, unsigned long *res);
+int __must_check kstrtol_from_user(const char __user *s, size_t count, unsigned int base, long *res);
+int __must_check kstrtouint_from_user(const char __user *s, size_t count, unsigned int base, unsigned int *res);
+int __must_check kstrtoint_from_user(const char __user *s, size_t count, unsigned int base, int *res);
+int __must_check kstrtou16_from_user(const char __user *s, size_t count, unsigned int base, u16 *res);
+int __must_check kstrtos16_from_user(const char __user *s, size_t count, unsigned int base, s16 *res);
+int __must_check kstrtou8_from_user(const char __user *s, size_t count, unsigned int base, u8 *res);
+int __must_check kstrtos8_from_user(const char __user *s, size_t count, unsigned int base, s8 *res);
+int __must_check kstrtobool_from_user(const char __user *s, size_t count, bool *res);
+
+static inline int __must_check kstrtou64_from_user(const char __user *s, size_t count, unsigned int base, u64 *res)
+{
+	return kstrtoull_from_user(s, count, base, res);
+}
+
+static inline int __must_check kstrtos64_from_user(const char __user *s, size_t count, unsigned int base, s64 *res)
+{
+	return kstrtoll_from_user(s, count, base, res);
+}
+
+static inline int __must_check kstrtou32_from_user(const char __user *s, size_t count, unsigned int base, u32 *res)
+{
+	return kstrtouint_from_user(s, count, base, res);
+}
+
+static inline int __must_check kstrtos32_from_user(const char __user *s, size_t count, unsigned int base, s32 *res)
+{
+	return kstrtoint_from_user(s, count, base, res);
+}
+
+/*
+ * Use kstrto<foo> instead.
+ *
+ * NOTE: simple_strto<foo> does not check for the range overflow and,
+ *	 depending on the input, may give interesting results.
+ *
+ * Use these functions if and only if you cannot use kstrto<foo>, because
+ * the conversion ends on the first non-digit character, which may be far
+ * beyond the supported range. It might be useful to parse the strings like
+ * 10x50 or 12:21 without altering original string or temporary buffer in use.
+ * Keep in mind above caveat.
+ */
+
+extern unsigned long simple_strtoul(const char *,char **,unsigned int);
+extern long simple_strtol(const char *,char **,unsigned int);
+extern unsigned long long simple_strtoull(const char *,char **,unsigned int);
+extern long long simple_strtoll(const char *,char **,unsigned int);
+
+static inline int strtobool(const char *s, bool *res)
+{
+	return kstrtobool(s, res);
+}
+
+#endif	/* _LINUX_KSTRTOX_H */
--- a/include/linux/string.h~kernelh-split-out-kstrtox-and-simple_strtox-to-a-separate-header
+++ a/include/linux/string.h
@@ -2,7 +2,6 @@
 #ifndef _LINUX_STRING_H_
 #define _LINUX_STRING_H_
 
-
 #include <linux/compiler.h>	/* for inline */
 #include <linux/types.h>	/* for size_t */
 #include <linux/stddef.h>	/* for NULL */
@@ -184,12 +183,6 @@ extern char **argv_split(gfp_t gfp, cons
 extern void argv_free(char **argv);
 
 extern bool sysfs_streq(const char *s1, const char *s2);
-extern int kstrtobool(const char *s, bool *res);
-static inline int strtobool(const char *s, bool *res)
-{
-	return kstrtobool(s, res);
-}
-
 int match_string(const char * const *array, size_t n, const char *string);
 int __sysfs_match_string(const char * const *array, size_t n, const char *s);
 
--- a/include/linux/sunrpc/cache.h~kernelh-split-out-kstrtox-and-simple_strtox-to-a-separate-header
+++ a/include/linux/sunrpc/cache.h
@@ -14,6 +14,7 @@
 #include <linux/kref.h>
 #include <linux/slab.h>
 #include <linux/atomic.h>
+#include <linux/kstrtox.h>
 #include <linux/proc_fs.h>
 
 /*
--- a/lib/kstrtox.c~kernelh-split-out-kstrtox-and-simple_strtox-to-a-separate-header
+++ a/lib/kstrtox.c
@@ -14,11 +14,12 @@
  */
 #include <linux/ctype.h>
 #include <linux/errno.h>
-#include <linux/kernel.h>
-#include <linux/math64.h>
 #include <linux/export.h>
+#include <linux/kstrtox.h>
+#include <linux/math64.h>
 #include <linux/types.h>
 #include <linux/uaccess.h>
+
 #include "kstrtox.h"
 
 const char *_parse_integer_fixup_radix(const char *s, unsigned int *base)
--- a/lib/parser.c~kernelh-split-out-kstrtox-and-simple_strtox-to-a-separate-header
+++ a/lib/parser.c
@@ -6,6 +6,7 @@
 #include <linux/ctype.h>
 #include <linux/types.h>
 #include <linux/export.h>
+#include <linux/kstrtox.h>
 #include <linux/parser.h>
 #include <linux/slab.h>
 #include <linux/string.h>
_
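
For context, the calling conventions documented in the new header are
unchanged by the split.  A minimal sketch of a typical caller (the sysfs
attribute and variable names below are illustrative, not part of the
patch):

	/* assumes <linux/kstrtox.h> (or <linux/kernel.h>, which still
	 * pulls it in for compatibility) */
	static unsigned long threshold;

	static ssize_t threshold_store(struct device *dev,
				       struct device_attribute *attr,
				       const char *buf, size_t count)
	{
		unsigned long val;
		int ret;

		/* base 0 auto-detects 0x-prefixed hex and leading-0 octal */
		ret = kstrtoul(buf, 0, &val);
		if (ret)
			return ret;	/* -ERANGE on overflow, -EINVAL otherwise */

		threshold = val;
		return count;
	}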

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 171/192] lz4_decompress: declare LZ4_decompress_safe_withPrefix64k static
  2021-07-01  1:46 incoming Andrew Morton
                   ` (169 preceding siblings ...)
  2021-07-01  1:56 ` [patch 170/192] kernel.h: split out kstrtox() and simple_strtox() to a separate header Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 172/192] lib/decompress_unlz4.c: correctly handle zero-padding around initrds Andrew Morton
                   ` (21 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, hsiangkao, joe, linux-mm, mm-commits, terrelln,
	thisisrast7, torvalds

From: Rajat Asthana <thisisrast7@gmail.com>
Subject: lz4_decompress: declare LZ4_decompress_safe_withPrefix64k static

Declare LZ4_decompress_safe_withPrefix64k as static to fix sparse
warning:

> warning: symbol 'LZ4_decompress_safe_withPrefix64k' was not declared.
> Should it be static?

Link: https://lkml.kernel.org/r/20210511154345.610569-1-thisisrast7@gmail.com
Signed-off-by: Rajat Asthana <thisisrast7@gmail.com>
Reviewed-by: Nick Terrell <terrelln@fb.com>
Cc: Gao Xiang <hsiangkao@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/lz4/lz4_decompress.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/lz4/lz4_decompress.c~lz4_decompress-declare-lz4_decompress_safe_withprefix64k-static
+++ a/lib/lz4/lz4_decompress.c
@@ -481,7 +481,7 @@ int LZ4_decompress_fast(const char *sour
 
 /* ===== Instantiate a few more decoding cases, used more than once. ===== */
 
-int LZ4_decompress_safe_withPrefix64k(const char *source, char *dest,
+static int LZ4_decompress_safe_withPrefix64k(const char *source, char *dest,
 				      int compressedSize, int maxOutputSize)
 {
 	return LZ4_decompress_generic(source, dest,
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 172/192] lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
  2021-07-01  1:46 incoming Andrew Morton
                   ` (170 preceding siblings ...)
  2021-07-01  1:56 ` [patch 171/192] lz4_decompress: declare LZ4_decompress_safe_withPrefix64k static Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 173/192] checkpatch: scripts/spdxcheck.py now requires python3 Andrew Morton
                   ` (20 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: 4sschmid, akpm, bongkyu.kim, dimitri.ledkov, hsiangkao, keescook,
	kyungsik.lee, linux-mm, mm-commits, terrelln, thisisrast7,
	torvalds, yinghai

From: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Subject: lib/decompress_unlz4.c: correctly handle zero-padding around initrds.

The lz4-compatible decompressor is simple.  The format is underspecified
and relies on EOF notification to determine when to stop.  The initramfs
buffer format[1] explicitly states that it can have an arbitrary amount
of zero padding.  Thus, when operating without a fill function, be extra
careful to ensure that sizes less than 4, or apparently empty chunksizes,
are treated as EOF.

To test this I have created two cpio initrds: first a normal one,
main.cpio, and a second one, second.cpio, containing just a single
/test-file with the content "second".  Then I compressed both of them
with gzip and with lz4 -l, and created 4 bytes of padding (dd
if=/dev/zero of=pad4 bs=1 count=4), to create four testcase initrds:

 1) main.cpio.gzip + extra.cpio.gzip = pad0.gzip
 2) main.cpio.lz4  + extra.cpio.lz4 = pad0.lz4
 3) main.cpio.gzip + pad4 + extra.cpio.gzip = pad4.gzip
 4) main.cpio.lz4  + pad4 + extra.cpio.lz4 = pad4.lz4

The pad4 test-cases replicate the initrd load by grub, as it pads and
aligns every initrd it loads.

All of the above boot; however, /test-file was not accessible in the
initrd for testcase #4, as decoding in the lz4 decompressor failed.  An
error message was also printed, which is usually harmless.

With a patched kernel, all of the above testcases now pass, and
/test-file is accessible.

This fixes the lz4 initrd decompress warning on every boot with grub,
and more importantly it fixes the inability to load multiple lz4
compressed initrds with grub.  This patch has been shipping in Ubuntu
kernels since January 2021.

[1] ./Documentation/driver-api/early-userspace/buffer-format.rst
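
For readers unfamiliar with the lz4 -l legacy framing, a rough sketch of
the parsing loop (hedged and simplified; decompress_chunk() is an
illustrative helper, and the magic value is the legacy-frame constant):

	/* inp: u8 * cursor into the buffer; bytes_left: bytes remaining */
	while (bytes_left >= 4) {
		u32 chunksize = get_unaligned_le32(inp);

		if (chunksize == ARCHIVE_MAGICNUMBER) {	/* 0x184C2102 */
			inp += 4;	/* a concatenated initrd follows */
			bytes_left -= 4;
			continue;
		}
		if (chunksize == 0)
			break;	/* zero padding reads as an empty chunk: EOF */

		decompress_chunk(inp + 4, chunksize);
		inp += 4 + chunksize;
		bytes_left -= 4 + chunksize;
	}
	/* bytes_left < 4, i.e. short trailing padding, also ends cleanly */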

BugLink: https://bugs.launchpad.net/bugs/1835660
Link: https://lore.kernel.org/lkml/20210114200256.196589-1-xnox@ubuntu.com/ # v0
Link: https://lkml.kernel.org/r/20210513104831.432975-1-dimitri.ledkov@canonical.com
Signed-off-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com>
Cc: Kyungsik Lee <kyungsik.lee@lge.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Bongkyu Kim <bongkyu.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sven Schmidt <4sschmid@informatik.uni-hamburg.de>
Cc: Rajat Asthana <thisisrast7@gmail.com>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Gao Xiang <hsiangkao@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/decompress_unlz4.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/lib/decompress_unlz4.c~lib-decompress_unlz4c-correctly-handle-zero-padding-around-initrds
+++ a/lib/decompress_unlz4.c
@@ -112,6 +112,9 @@ STATIC inline int INIT unlz4(u8 *input,
 				error("data corrupted");
 				goto exit_2;
 			}
+		} else if (size < 4) {
+			/* empty or end-of-file */
+			goto exit_3;
 		}
 
 		chunksize = get_unaligned_le32(inp);
@@ -125,6 +128,10 @@ STATIC inline int INIT unlz4(u8 *input,
 			continue;
 		}
 
+		if (!fill && chunksize == 0) {
+			/* empty or end-of-file */
+			goto exit_3;
+		}
 
 		if (posp)
 			*posp += 4;
@@ -184,6 +191,7 @@ STATIC inline int INIT unlz4(u8 *input,
 		}
 	}
 
+exit_3:
 	ret = 0;
 exit_2:
 	if (!input)
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 173/192] checkpatch: scripts/spdxcheck.py now requires python3
  2021-07-01  1:46 incoming Andrew Morton
                   ` (171 preceding siblings ...)
  2021-07-01  1:56 ` [patch 172/192] lib/decompress_unlz4.c: correctly handle zero-padding around initrds Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 174/192] checkpatch: improve the indented label test Andrew Morton
                   ` (19 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, bert, dwaipayanray1, joe, linux-mm, linux, lukas.bulwahn,
	mm-commits, torvalds

From: Guenter Roeck <linux@roeck-us.net>
Subject: checkpatch: scripts/spdxcheck.py now requires python3

Since commit d0259c42abff ("spdxcheck.py: Use Python 3"), spdxcheck.py
explicitly expects to run as a python3 script.  If "python" still points to
python v2.7 and the script is executed with "python scripts/spdxcheck.py",
the following error may be seen even if git-python is installed for
python3.

Traceback (most recent call last):
  File "scripts/spdxcheck.py", line 10, in <module>
    import git
ImportError: No module named git

To fix the problem, check for the existence of python3, check that the
script is executable rather than merely present, and execute it directly.

Link: https://lkml.kernel.org/r/20210505211720.447111-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Joe Perches <joe@perches.com>
Cc: Bert Vermeulen <bert@biot.com>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Bert Vermeulen <bert@biot.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-scripts-spdxcheckpy-now-requires-python3
+++ a/scripts/checkpatch.pl
@@ -1084,10 +1084,10 @@ sub is_maintained_obsolete {
 sub is_SPDX_License_valid {
 	my ($license) = @_;
 
-	return 1 if (!$tree || which("python") eq "" || !(-e "$root/scripts/spdxcheck.py") || !(-e "$gitroot"));
+	return 1 if (!$tree || which("python3") eq "" || !(-x "$root/scripts/spdxcheck.py") || !(-e "$gitroot"));
 
 	my $root_path = abs_path($root);
-	my $status = `cd "$root_path"; echo "$license" | python scripts/spdxcheck.py -`;
+	my $status = `cd "$root_path"; echo "$license" | scripts/spdxcheck.py -`;
 	return 0 if ($status ne "");
 	return 1;
 }
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 174/192] checkpatch: improve the indented label test
  2021-07-01  1:46 incoming Andrew Morton
                   ` (172 preceding siblings ...)
  2021-07-01  1:56 ` [patch 173/192] checkpatch: scripts/spdxcheck.py now requires python3 Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 175/192] checkpatch: do not complain about positive return values starting with EPOLL Andrew Morton
                   ` (18 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, elder, gregkh, joe, linux-mm, manikishanghantasala,
	mm-commits, torvalds

From: Joe Perches <joe@perches.com>
Subject: checkpatch: improve the indented label test

checkpatch identifies a label only when a terminating colon
immediately follows an identifier.

Bitfield definitions can appear to be labels, so ignore any spaces
between the identifier's terminating colon and any digit that may be
used to define a bitfield length.
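
For illustration (the identifiers are made up, not from the patch), the
two constructs the test must tell apart look like:

	/* an indented goto label - checkpatch should warn: */
		out_unlock:
			mutex_unlock(&lock);

	/* a bitfield definition - similar shape, but must be ignored: */
	struct f {
		unsigned int	active:1;
		unsigned int	depth: 3;	/* space after the ':' is handled too */
	};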

Miscellanea:

o Improve the initial checkpatch comment
o Use the more typical '&&' instead of 'and'
o Require the initial label character to be a non-digit
  (Can't use $Ident here because $Ident allows ## concatenation)
o Use $sline instead of $line to ignore comments
o Use '$sline !~ /.../' instead of '!($line =~ /.../)'

Link: https://lkml.kernel.org/r/b54d673e7cde7de5de0c9ba4dd57dd0858580ca4.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Manikishan Ghantasala <manikishanghantasala@gmail.com>
Cc: Alex Elder <elder@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-improve-the-indented-label-test
+++ a/scripts/checkpatch.pl
@@ -5361,9 +5361,13 @@ sub process {
 			}
 		}
 
-#goto labels aren't indented, allow a single space however
-		if ($line=~/^.\s+[A-Za-z\d_]+:(?![0-9]+)/ and
-		   !($line=~/^. [A-Za-z\d_]+:/) and !($line=~/^.\s+default:/)) {
+# check that goto labels aren't indented (allow a single space indentation)
+# and ignore bitfield definitions like foo:1
+# Strictly, labels can have whitespace after the identifier and before the :
+# but this is not allowed here as many ?: uses would appear to be labels
+		if ($sline =~ /^.\s+[A-Za-z_][A-Za-z\d_]*:(?!\s*\d+)/ &&
+		    $sline !~ /^. [A-Za-z\d_][A-Za-z\d_]*:/ &&
+		    $sline !~ /^.\s+default:/) {
 			if (WARN("INDENTED_LABEL",
 				 "labels should not be indented\n" . $herecurr) &&
 			    $fix) {
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 175/192] checkpatch: do not complain about positive return values starting with EPOLL
  2021-07-01  1:46 incoming Andrew Morton
                   ` (173 preceding siblings ...)
  2021-07-01  1:56 ` [patch 174/192] checkpatch: improve the indented label test Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 176/192] init: print out unknown kernel parameters Andrew Morton
                   ` (17 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, joe, linux-mm, linux, mm-commits, ribalda, torvalds

From: Guenter Roeck <linux@roeck-us.net>
Subject: checkpatch: do not complain about positive return values starting with EPOLL

checkpatch complains about positive return values of poll functions. 
Example:

WARNING: return of an errno should typically be negative (ie: return -EPOLLIN)
+		return EPOLLIN;

Poll functions return positive values.  The defines for the return
values of poll functions all start with EPOLL, resulting in a number of
false positives.  An often-used workaround is to assign the poll
function's return value to a variable and return that variable, but that
is a less than perfect solution.

There is no error definition which starts with EPOLL, so it is safe to
omit the warning for return values starting with EPOLL.
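
As a hedged illustration (the driver, struct and helper names are made
up), a typical poll method that used to trip the warning:

	static __poll_t foo_poll(struct file *file, poll_table *wait)
	{
		struct foo_dev *fdev = file->private_data;
		__poll_t mask = 0;

		poll_wait(file, &fdev->waitq, wait);
		if (foo_data_ready(fdev))
			mask |= EPOLLIN | EPOLLRDNORM;

		return mask;	/* a positive EPOLL* value, not an errno */
	}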

Link: https://lkml.kernel.org/r/20210622004334.638680-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Joe Perches <joe@perches.com>
Cc: Ricardo Ribalda <ribalda@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/scripts/checkpatch.pl~checkpatch-do-not-complain-about-positive-return-values-starting-with-epoll
+++ a/scripts/checkpatch.pl
@@ -5462,7 +5462,7 @@ sub process {
 # Return of what appears to be an errno should normally be negative
 		if ($sline =~ /\breturn(?:\s*\(+\s*|\s+)(E[A-Z]+)(?:\s*\)+\s*|\s*)[;:,]/) {
 			my $name = $1;
-			if ($name ne 'EOF' && $name ne 'ERROR') {
+			if ($name ne 'EOF' && $name ne 'ERROR' && $name !~ /^EPOLL/) {
 				WARN("USE_NEGATIVE_ERRNO",
 				     "return of an errno should typically be negative (ie: return -$1)\n" . $herecurr);
 			}
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 176/192] init: print out unknown kernel parameters
  2021-07-01  1:46 incoming Andrew Morton
                   ` (174 preceding siblings ...)
  2021-07-01  1:56 ` [patch 175/192] checkpatch: do not complain about positive return values starting with EPOLL Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 177/192] kprobes: remove duplicated strong free_insn_page in x86 and s390 Andrew Morton
                   ` (16 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: ahalaney, akpm, bp, linux-mm, mm-commits, rostedt, torvalds

From: Andrew Halaney <ahalaney@redhat.com>
Subject: init: print out unknown kernel parameters

It is easy to foobar setting a kernel parameter on the command line
without realizing it; by default there's not much output that you can
use to assess what the kernel did with that parameter.

Make it a little more explicit which parameters on the command line
_looked_ like a valid parameter for the kernel, but did not match anything
and ultimately got tossed to init.  This is very similar to the unknown
parameter message received when loading a module.

This assumes the parameters are processed in a normal fashion, some
parameters (dyndbg= for example) don't register their parameter with the
rest of the kernel's parameters, and therefore always show up in this list
(and are also given to init - like the rest of this list).

Another example is BOOT_IMAGE=, which is highlighted as an offender; it
technically is one, but it is passed by LILO and GRUB, so most systems
will see that complaint.

An example output where "foobared" and "unrecognized" are intentionally
invalid parameters:

  Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo
  Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo

Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Borislav Petkov <bp@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/main.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

--- a/init/main.c~init-print-out-unknown-kernel-parameters
+++ a/init/main.c
@@ -872,6 +872,47 @@ void __init __weak arch_call_rest_init(v
 	rest_init();
 }
 
+static void __init print_unknown_bootoptions(void)
+{
+	char *unknown_options;
+	char *end;
+	const char *const *p;
+	size_t len;
+
+	if (panic_later || (!argv_init[1] && !envp_init[2]))
+		return;
+
+	/*
+	 * Determine how many options we have to print out, plus a space
+	 * before each
+	 */
+	len = 1; /* null terminator */
+	for (p = &argv_init[1]; *p; p++) {
+		len++;
+		len += strlen(*p);
+	}
+	for (p = &envp_init[2]; *p; p++) {
+		len++;
+		len += strlen(*p);
+	}
+
+	unknown_options = memblock_alloc(len, SMP_CACHE_BYTES);
+	if (!unknown_options) {
+		pr_err("%s: Failed to allocate %zu bytes\n",
+			__func__, len);
+		return;
+	}
+	end = unknown_options;
+
+	for (p = &argv_init[1]; *p; p++)
+		end += sprintf(end, " %s", *p);
+	for (p = &envp_init[2]; *p; p++)
+		end += sprintf(end, " %s", *p);
+
+	pr_notice("Unknown command line parameters:%s\n", unknown_options);
+	memblock_free(__pa(unknown_options), len);
+}
+
 asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 {
 	char *command_line;
@@ -913,6 +954,7 @@ asmlinkage __visible void __init __no_sa
 				  static_command_line, __start___param,
 				  __stop___param - __start___param,
 				  -1, -1, NULL, &unknown_bootoption);
+	print_unknown_bootoptions();
 	if (!IS_ERR_OR_NULL(after_dashes))
 		parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
 			   NULL, set_init_arg);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 177/192] kprobes: remove duplicated strong free_insn_page in x86 and s390
  2021-07-01  1:46 incoming Andrew Morton
                   ` (175 preceding siblings ...)
  2021-07-01  1:56 ` [patch 176/192] init: print out unknown kernel parameters Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 178/192] nilfs2: remove redundant continue statement in a while-loop Andrew Morton
                   ` (15 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, anil.s.keshavamurthy, borntraeger, bp, davem, gor, hca,
	hch, hpa, linux-mm, liuqi115, mhiramat, mingo, mm-commits,
	naveen.n.rao, song.bao.hua, tglx, torvalds

From: Barry Song <song.bao.hua@hisilicon.com>
Subject: kprobes: remove duplicated strong free_insn_page in x86 and s390

free_insn_page() in x86 and s390 is the same as the common weak function
in kernel/kprobes.c.  Also, the comment "Recover page to RW mode before
releasing it" in x86 makes no sense there, since resetting the mapping is
done by common code in vfree(), called from module_memfree().  So drop
these two duplicated strong functions and the related comment, and make
the common one in kernel/kprobes.c static.
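
As a reminder of the mechanism (a hedged sketch, not the patched code):
a strong definition in arch code overrides the __weak default at link
time, which is why the identical strong copies were pure duplication:

	/* kernel/kprobes.c: generic default, overridable */
	void __weak *alloc_insn_page(void)
	{
		return module_alloc(PAGE_SIZE);
	}

	/* an arch can provide a strong definition that wins at link time */
	void *alloc_insn_page(void)
	{
		void *page = module_alloc(PAGE_SIZE);

		/* arch-specific fixups, e.g. making the page read-only+exec */
		return page;
	}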

Link: https://lkml.kernel.org/r/20210608065736.32656-1-song.bao.hua@hisilicon.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Qi Liu <liuqi115@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/kernel/kprobes.c     |    5 -----
 arch/x86/kernel/kprobes/core.c |    6 ------
 include/linux/kprobes.h        |    1 -
 kernel/kprobes.c               |    2 +-
 4 files changed, 1 insertion(+), 13 deletions(-)

--- a/arch/s390/kernel/kprobes.c~kprobes-remove-duplicated-strong-free_insn_page-in-x86-and-s390
+++ a/arch/s390/kernel/kprobes.c
@@ -44,11 +44,6 @@ void *alloc_insn_page(void)
 	return page;
 }
 
-void free_insn_page(void *page)
-{
-	module_memfree(page);
-}
-
 static void *alloc_s390_insn_page(void)
 {
 	if (xchg(&insn_page_in_use, 1) == 1)
--- a/arch/x86/kernel/kprobes/core.c~kprobes-remove-duplicated-strong-free_insn_page-in-x86-and-s390
+++ a/arch/x86/kernel/kprobes/core.c
@@ -422,12 +422,6 @@ void *alloc_insn_page(void)
 	return page;
 }
 
-/* Recover page to RW mode before releasing it */
-void free_insn_page(void *page)
-{
-	module_memfree(page);
-}
-
 /* Kprobe x86 instruction emulation - only regs->ip or IF flag modifiers */
 
 static void kprobe_emulate_ifmodifiers(struct kprobe *p, struct pt_regs *regs)
--- a/include/linux/kprobes.h~kprobes-remove-duplicated-strong-free_insn_page-in-x86-and-s390
+++ a/include/linux/kprobes.h
@@ -407,7 +407,6 @@ int enable_kprobe(struct kprobe *kp);
 void dump_kprobe(struct kprobe *kp);
 
 void *alloc_insn_page(void);
-void free_insn_page(void *page);
 
 int kprobe_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
 		       char *sym);
--- a/kernel/kprobes.c~kprobes-remove-duplicated-strong-free_insn_page-in-x86-and-s390
+++ a/kernel/kprobes.c
@@ -106,7 +106,7 @@ void __weak *alloc_insn_page(void)
 	return module_alloc(PAGE_SIZE);
 }
 
-void __weak free_insn_page(void *page)
+static void free_insn_page(void *page)
 {
 	module_memfree(page);
 }
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 178/192] nilfs2: remove redundant continue statement in a while-loop
  2021-07-01  1:46 incoming Andrew Morton
                   ` (176 preceding siblings ...)
  2021-07-01  1:56 ` [patch 177/192] kprobes: remove duplicated strong free_insn_page in x86 and s390 Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 179/192] hfsplus: remove unnecessary oom message Andrew Morton
                   ` (14 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, colin.king, konishi.ryusuke, linux-mm, mm-commits, torvalds

From: Colin Ian King <colin.king@canonical.com>
Subject: nilfs2: remove redundant continue statement in a while-loop

The continue statement at the end of the while-loop is redundant;
remove it.

Addresses-Coverity: ("Continue has no effect")
Link: https://lkml.kernel.org/r/20210621100519.10257-1-colin.king@canonical.com
Link: https://lkml.kernel.org/r/1624557664-17159-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/btree.c |    1 -
 1 file changed, 1 deletion(-)

--- a/fs/nilfs2/btree.c~nilfs2-remove-redundant-continue-statement-in-a-while-loop
+++ a/fs/nilfs2/btree.c
@@ -738,7 +738,6 @@ static int nilfs_btree_lookup_contig(con
 			if (ptr2 != ptr + cnt || ++cnt == maxblocks)
 				goto end;
 			index++;
-			continue;
 		}
 		if (level == maxlevel)
 			break;
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 179/192] hfsplus: remove unnecessary oom message
  2021-07-01  1:46 incoming Andrew Morton
                   ` (177 preceding siblings ...)
  2021-07-01  1:56 ` [patch 178/192] nilfs2: remove redundant continue statement in a while-loop Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 180/192] hfsplus: report create_date to kstat.btime Andrew Morton
                   ` (13 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, thunder.leizhen, torvalds

From: Zhen Lei <thunder.leizhen@huawei.com>
Subject: hfsplus: remove unnecessary oom message

Fixes scripts/checkpatch.pl warning:
WARNING: Possible unnecessary 'out of memory' message

Removing it can help us save a bit of memory.

Link: https://lkml.kernel.org/r/20210617084944.1279-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/hfsplus/xattr.c |    1 -
 1 file changed, 1 deletion(-)

--- a/fs/hfsplus/xattr.c~hfsplus-remove-unnecessary-oom-message
+++ a/fs/hfsplus/xattr.c
@@ -204,7 +204,6 @@ check_attr_tree_state_again:
 
 	buf = kzalloc(node_size, GFP_NOFS);
 	if (!buf) {
-		pr_err("failed to allocate memory for header node\n");
 		err = -ENOMEM;
 		goto end_attr_file_creation;
 	}
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 180/192] hfsplus: report create_date to kstat.btime
  2021-07-01  1:46 incoming Andrew Morton
                   ` (178 preceding siblings ...)
  2021-07-01  1:56 ` [patch 179/192] hfsplus: remove unnecessary oom message Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 181/192] x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned Andrew Morton
                   ` (12 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, axboe, cccheng, christian.brauner, jamorris, linux-mm,
	mm-commits, shepjeng, slava, torvalds

From: Chung-Chiang Cheng <shepjeng@gmail.com>
Subject: hfsplus: report create_date to kstat.btime

The create_date field of an hfsplus inode corresponds to kstat.btime
and can be reported via statx.
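
From userspace the new field becomes visible via statx(2).  A minimal
sketch, assuming a glibc (2.28 or later) that provides the statx()
wrapper (the path is illustrative):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/stat.h>

	int main(void)
	{
		struct statx stx;

		if (statx(AT_FDCWD, "/mnt/hfsplus/somefile", 0, STATX_BTIME, &stx))
			return 1;
		if (stx.stx_mask & STATX_BTIME)	/* filesystem filled in btime */
			printf("btime: %lld\n", (long long)stx.stx_btime.tv_sec);
		return 0;
	}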

Link: https://lkml.kernel.org/r/20210416172147.8736-1-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/hfsplus/inode.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/fs/hfsplus/inode.c~hfsplus-report-create_date-to-kstatbtime
+++ a/fs/hfsplus/inode.c
@@ -281,6 +281,11 @@ int hfsplus_getattr(struct user_namespac
 	struct inode *inode = d_inode(path->dentry);
 	struct hfsplus_inode_info *hip = HFSPLUS_I(inode);
 
+	if (request_mask & STATX_BTIME) {
+		stat->result_mask |= STATX_BTIME;
+		stat->btime = hfsp_mt2ut(hip->create_date);
+	}
+
 	if (inode->i_flags & S_APPEND)
 		stat->attributes |= STATX_ATTR_APPEND;
 	if (inode->i_flags & S_IMMUTABLE)
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 181/192] x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
  2021-07-01  1:46 incoming Andrew Morton
                   ` (179 preceding siblings ...)
  2021-07-01  1:56 ` [patch 180/192] hfsplus: report create_date to kstat.btime Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 182/192] exec: remove checks in __register_bimfmt() Andrew Morton
                   ` (11 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, natechancellor, ndesaulniers, oleg,
	torvalds, viro

From: Al Viro <viro@zeniv.linux.org.uk>
Subject: x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned

Currently we handle SS_AUTODISARM as soon as we have stored the altstack
settings into sigframe - that's the point when we have set the things up
for eventual sigreturn to restore the old settings.  And if we manage to
set the sigframe up (we are not done with that yet), everything's fine. 
However, in case of failure we end up with sigframe-to-be abandoned and
SIGSEGV force-delivered.  And in that case we end up with inconsistent
rules - late failures have altstack reset, early ones do not.

It's trivial to get consistent behaviour - just handle SS_AUTODISARM once
we have set the sigframe up and are committed to entering the handler,
i.e.  in signal_delivered().
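
For reference, SS_AUTODISARM is requested from userspace roughly like
this (a hedged sketch; the handler installation is elided):

	#define _GNU_SOURCE
	#include <signal.h>

	static char altstack[64 * 1024];

	int main(void)
	{
		stack_t ss = {
			.ss_sp		= altstack,
			.ss_size	= sizeof(altstack),
			.ss_flags	= SS_AUTODISARM,
		};

		if (sigaltstack(&ss, NULL))
			return 1;
		/* install a SA_ONSTACK handler and take signals; the altstack
		 * setting is now cleared only once the kernel has committed
		 * to entering the handler, and sigreturn restores it */
		return 0;
	}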

Link: https://lore.kernel.org/lkml/20200404170604.GN23230@ZenIV.linux.org.uk/
Link: https://github.com/ClangBuiltLinux/linux/issues/876
Link: https://lkml.kernel.org/r/20210422230846.1756380-1-ndesaulniers@google.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/compat.h |    2 --
 include/linux/signal.h |    2 --
 kernel/signal.c        |   14 ++++----------
 3 files changed, 4 insertions(+), 14 deletions(-)

--- a/include/linux/compat.h~x86-signal-dont-do-sas_ss_reset-until-we-are-certain-that-sigframe-wont-be-abandoned
+++ a/include/linux/compat.h
@@ -532,8 +532,6 @@ int __compat_save_altstack(compat_stack_
 			&__uss->ss_sp, label); \
 	unsafe_put_user(t->sas_ss_flags, &__uss->ss_flags, label); \
 	unsafe_put_user(t->sas_ss_size, &__uss->ss_size, label); \
-	if (t->sas_ss_flags & SS_AUTODISARM) \
-		sas_ss_reset(t); \
 } while (0);
 
 /*
--- a/include/linux/signal.h~x86-signal-dont-do-sas_ss_reset-until-we-are-certain-that-sigframe-wont-be-abandoned
+++ a/include/linux/signal.h
@@ -462,8 +462,6 @@ int __save_altstack(stack_t __user *, un
 	unsafe_put_user((void __user *)t->sas_ss_sp, &__uss->ss_sp, label); \
 	unsafe_put_user(t->sas_ss_flags, &__uss->ss_flags, label); \
 	unsafe_put_user(t->sas_ss_size, &__uss->ss_size, label); \
-	if (t->sas_ss_flags & SS_AUTODISARM) \
-		sas_ss_reset(t); \
 } while (0);
 
 #ifdef CONFIG_PROC_FS
--- a/kernel/signal.c~x86-signal-dont-do-sas_ss_reset-until-we-are-certain-that-sigframe-wont-be-abandoned
+++ a/kernel/signal.c
@@ -2829,6 +2829,8 @@ static void signal_delivered(struct ksig
 	if (!(ksig->ka.sa.sa_flags & SA_NODEFER))
 		sigaddset(&blocked, ksig->sig);
 	set_current_blocked(&blocked);
+	if (current->sas_ss_flags & SS_AUTODISARM)
+		sas_ss_reset(current);
 	tracehook_signal_handler(stepping);
 }
 
@@ -4147,11 +4149,7 @@ int __save_altstack(stack_t __user *uss,
 	int err = __put_user((void __user *)t->sas_ss_sp, &uss->ss_sp) |
 		__put_user(t->sas_ss_flags, &uss->ss_flags) |
 		__put_user(t->sas_ss_size, &uss->ss_size);
-	if (err)
-		return err;
-	if (t->sas_ss_flags & SS_AUTODISARM)
-		sas_ss_reset(t);
-	return 0;
+	return err;
 }
 
 #ifdef CONFIG_COMPAT
@@ -4206,11 +4204,7 @@ int __compat_save_altstack(compat_stack_
 			 &uss->ss_sp) |
 		__put_user(t->sas_ss_flags, &uss->ss_flags) |
 		__put_user(t->sas_ss_size, &uss->ss_size);
-	if (err)
-		return err;
-	if (t->sas_ss_flags & SS_AUTODISARM)
-		sas_ss_reset(t);
-	return 0;
+	return err;
 }
 #endif
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 182/192] exec: remove checks in __register_bimfmt()
  2021-07-01  1:46 incoming Andrew Morton
                   ` (180 preceding siblings ...)
  2021-07-01  1:56 ` [patch 181/192] x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 183/192] kcov: add __no_sanitize_coverage to fix noinstr for all architectures Andrew Morton
                   ` (10 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: exec: remove checks in __register_bimfmt()

Delete NULL check, all callers pass valid pointer.

Delete ->load_binary check -- failure to provide hook in a custom module
will be very noticeable at the very first execve call.

Link: https://lkml.kernel.org/r/YK1Gy1qXaLAR+tPl@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/exec.c |    3 ---
 1 file changed, 3 deletions(-)

--- a/fs/exec.c~exec-remove-checks-in-__register_bimfmt
+++ a/fs/exec.c
@@ -84,9 +84,6 @@ static DEFINE_RWLOCK(binfmt_lock);
 
 void __register_binfmt(struct linux_binfmt * fmt, int insert)
 {
-	BUG_ON(!fmt);
-	if (WARN_ON(!fmt->load_binary))
-		return;
 	write_lock(&binfmt_lock);
 	insert ? list_add(&fmt->lh, &formats) :
 		 list_add_tail(&fmt->lh, &formats);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 183/192] kcov: add __no_sanitize_coverage to fix noinstr for all architectures
  2021-07-01  1:46 incoming Andrew Morton
                   ` (181 preceding siblings ...)
  2021-07-01  1:56 ` [patch 182/192] exec: remove checks in __register_bimfmt() Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 184/192] selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random Andrew Morton
                   ` (9 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, ardb, arnd, dvyukov, elver, keescook, linux-mm,
	luc.vanoostenryck, mark.rutland, masahiroy, mm-commits, nathan,
	ndesaulniers, nivedita, ojeda, peterz, samitolvanen, torvalds,
	will

From: Marco Elver <elver@google.com>
Subject: kcov: add __no_sanitize_coverage to fix noinstr for all architectures

Until now, no compiler has supported an attribute to disable coverage
instrumentation as used by KCOV.

To work around this limitation on x86, noinstr functions have their
coverage instrumentation turned into nops by objtool.  However, this
solution doesn't scale automatically to other architectures, such as
arm64, which are migrating to use the generic entry code.

Clang [1] and GCC [2] have added support for the attribute recently.
[1] https://github.com/llvm/llvm-project/commit/280333021e9550d80f5c1152a34e33e81df1e178
[2] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=cec4d4a6782c9bd8d071839c50a239c49caca689
The changes will appear in Clang 13 and GCC 12.

Add __no_sanitize_coverage for both compilers, and add it to noinstr.

Note: In the Clang case, __has_feature(coverage_sanitizer) is only true if
the feature is enabled, and therefore we do not require an additional
defined(CONFIG_KCOV) (like in the GCC case where __has_attribute(..) is
always true) to avoid adding redundant attributes to functions if KCOV is
off.  That being said, compilers that support the attribute will not
generate errors/warnings if the attribute is redundantly used; however,
where possible let's avoid it as it reduces preprocessed code size and
associated compile-time overheads.
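
The net effect, as a hedged illustration (the function name is made up):

	/* with CONFIG_KCOV=y, noinstr now also expands to
	 *   Clang >= 13: __attribute__((no_sanitize("coverage")))
	 *   GCC   >= 12: __attribute__((no_sanitize_coverage))
	 * so no __sanitizer_cov_trace_pc() callbacks are emitted here */
	noinstr void arch_example_entry(void)
	{
		/* code that must run before instrumentation is safe */
	}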

[elver@google.com: Implement __has_feature(coverage_sanitizer) in Clang]
  Link: https://lkml.kernel.org/r/20210527162655.3246381-1-elver@google.com
[elver@google.com: add comment explaining __has_feature() in Clang]
  Link: https://lkml.kernel.org/r/20210527194448.3470080-1-elver@google.com
Link: https://lkml.kernel.org/r/20210525175819.699786-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/compiler-clang.h |   17 +++++++++++++++++
 include/linux/compiler-gcc.h   |    6 ++++++
 include/linux/compiler_types.h |    2 +-
 3 files changed, 24 insertions(+), 1 deletion(-)

--- a/include/linux/compiler-clang.h~kcov-add-__no_sanitize_coverage-to-fix-noinstr-for-all-architectures
+++ a/include/linux/compiler-clang.h
@@ -13,6 +13,12 @@
 /* all clang versions usable with the kernel support KASAN ABI version 5 */
 #define KASAN_ABI_VERSION 5
 
+/*
+ * Note: Checking __has_feature(*_sanitizer) is only true if the feature is
+ * enabled. Therefore it is not required to additionally check defined(CONFIG_*)
+ * to avoid adding redundant attributes in other configurations.
+ */
+
 #if __has_feature(address_sanitizer) || __has_feature(hwaddress_sanitizer)
 /* Emulate GCC's __SANITIZE_ADDRESS__ flag */
 #define __SANITIZE_ADDRESS__
@@ -46,6 +52,17 @@
 #endif
 
 /*
+ * Support for __has_feature(coverage_sanitizer) was added in Clang 13 together
+ * with no_sanitize("coverage"). Prior versions of Clang support coverage
+ * instrumentation, but cannot be queried for support by the preprocessor.
+ */
+#if __has_feature(coverage_sanitizer)
+#define __no_sanitize_coverage __attribute__((no_sanitize("coverage")))
+#else
+#define __no_sanitize_coverage
+#endif
+
+/*
  * Not all versions of clang implement the type-generic versions
  * of the builtin overflow checkers. Fortunately, clang implements
  * __has_builtin allowing us to avoid awkward version
--- a/include/linux/compiler-gcc.h~kcov-add-__no_sanitize_coverage-to-fix-noinstr-for-all-architectures
+++ a/include/linux/compiler-gcc.h
@@ -122,6 +122,12 @@
 #define __no_sanitize_undefined
 #endif
 
+#if defined(CONFIG_KCOV) && __has_attribute(__no_sanitize_coverage__)
+#define __no_sanitize_coverage __attribute__((no_sanitize_coverage))
+#else
+#define __no_sanitize_coverage
+#endif
+
 #if GCC_VERSION >= 50100
 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
 #endif
--- a/include/linux/compiler_types.h~kcov-add-__no_sanitize_coverage-to-fix-noinstr-for-all-architectures
+++ a/include/linux/compiler_types.h
@@ -210,7 +210,7 @@ struct ftrace_likely_data {
 /* Section for code which can't be instrumented at all */
 #define noinstr								\
 	noinline notrace __attribute((__section__(".noinstr.text")))	\
-	__no_kcsan __no_sanitize_address
+	__no_kcsan __no_sanitize_address __no_sanitize_coverage
 
 #endif /* __KERNEL__ */
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 184/192] selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
  2021-07-01  1:46 incoming Andrew Morton
                   ` (182 preceding siblings ...)
  2021-07-01  1:56 ` [patch 183/192] kcov: add __no_sanitize_coverage to fix noinstr for all architectures Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 185/192] selftests/vm/pkeys: handle negative sys_pkey_alloc() return code Andrew Morton
                   ` (8 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, aneesh.kumar, bauerman, dave.hansen, desnesn, fweimer,
	linux-mm, linuxram, mhocko, mingo, mm-commits, mpe, msuchanek,
	sandipan, shuah, tglx, torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>
Subject: selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random

Patch series "selftests/vm/pkeys: Bug fixes and a new test".

There has been a lot of activity on the x86 front around the XSAVE
architecture which is used to context-switch processor state (among other
things).  In addition, AMD has recently joined the protection keys club by
adding processor support for PKU.

The AMD implementation helped uncover a kernel bug around the PKRU "init
state", which actually applied to Intel's implementation but was just
harder to hit.  This series adds a test which is expected to help find
this class of bug both on AMD and Intel.  All the work around pkeys on x86
also uncovered a few bugs in the selftest.


This patch (of 4):

The "random" pkey allocation code currently does the good old:

	srand((unsigned int)time(NULL));

*But*, it unfortunately does this on every random pkey allocation.

There may be thousands of these a second.  time() has a one second
resolution.  So, each time alloc_random_pkey() is called, the PRNG is
*RESET* to time().  This is nasty.  Normally, if you do:

	srand(<ANYTHING>);
	foo = rand();
	bar = rand();

You'll be quite guaranteed that 'foo' and 'bar' are different.  But, if
you do:

	srand(1);
	foo = rand();
	srand(1);
	bar = rand();

You are quite guaranteed that 'foo' and 'bar' are the *SAME*.  The recent
"fix" effectively forced the test case to use the same "random" pkey for
the whole test, unless the test run crossed a second boundary.

Only run srand() once at program startup.

This explains some very odd and persistent test failures I've been seeing.

Link: https://lkml.kernel.org/r/20210611164153.91B76FB8@viggo.jf.intel.com
Link: https://lkml.kernel.org/r/20210611164155.192D00FF@viggo.jf.intel.com
Fixes: 6e373263ce07 ("selftests/vm/pkeys: fix alloc_random_pkey() to make it really random")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "Desnes A. Nunes do Rosario" <desnesn@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Suchanek <msuchanek@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/protection_keys.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/protection_keys.c~selftests-vm-pkeys-fix-alloc_random_pkey-to-make-it-really-really-random
+++ a/tools/testing/selftests/vm/protection_keys.c
@@ -561,7 +561,6 @@ int alloc_random_pkey(void)
 	int nr_alloced = 0;
 	int random_index;
 	memset(alloced_pkeys, 0, sizeof(alloced_pkeys));
-	srand((unsigned int)time(NULL));
 
 	/* allocate every possible key and make a note of which ones we got */
 	max_nr_pkey_allocs = NR_PKEYS;
@@ -1552,6 +1551,8 @@ int main(void)
 	int nr_iterations = 22;
 	int pkeys_supported = is_pkeys_supported();
 
+	srand((unsigned int)time(NULL));
+
 	setup_handlers();
 
 	printf("has pkeys: %d\n", pkeys_supported);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 185/192] selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
  2021-07-01  1:46 incoming Andrew Morton
                   ` (183 preceding siblings ...)
  2021-07-01  1:56 ` [patch 184/192] selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:56 ` [patch 186/192] selftests/vm/pkeys: refill shadow register after implicit kernel write Andrew Morton
                   ` (7 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, aneesh.kumar, bauerman, dave.hansen, desnesn, fweimer,
	linux-mm, linuxram, mhocko, mingo, mm-commits, mpe, msuchanek,
	sandipan, shuah, tglx, torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>
Subject: selftests/vm/pkeys: handle negative sys_pkey_alloc() return code

The alloc_pkey() sefltest function wraps the sys_pkey_alloc() system call.
On success, it updates its "shadow" register value because
sys_pkey_alloc() updates the real register.

But, the success check is wrong.  pkey_alloc() considers any non-zero
return code to indicate success where the pkey register will be modified. 
This fails to take negative return codes into account.

Consider only a positive return value as a successful call.

Link: https://lkml.kernel.org/r/20210611164157.87AB4246@viggo.jf.intel.com
Fixes: 5f23f6d082a9 ("x86/pkeys: Add self-tests")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "Desnes A. Nunes do Rosario" <desnesn@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Suchanek <msuchanek@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/protection_keys.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/protection_keys.c~selftests-vm-pkeys-handle-negative-sys_pkey_alloc-return-code
+++ a/tools/testing/selftests/vm/protection_keys.c
@@ -510,7 +510,7 @@ int alloc_pkey(void)
 			" shadow: 0x%016llx\n",
 			__func__, __LINE__, ret, __read_pkey_reg(),
 			shadow_pkey_reg);
-	if (ret) {
+	if (ret > 0) {
 		/* clear both the bits: */
 		shadow_pkey_reg = set_pkey_bits(shadow_pkey_reg, ret,
 						~PKEY_MASK);
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 186/192] selftests/vm/pkeys: refill shadow register after implicit kernel write
  2021-07-01  1:46 incoming Andrew Morton
                   ` (184 preceding siblings ...)
  2021-07-01  1:56 ` [patch 185/192] selftests/vm/pkeys: handle negative sys_pkey_alloc() return code Andrew Morton
@ 2021-07-01  1:56 ` Andrew Morton
  2021-07-01  1:57 ` [patch 187/192] selftests/vm/pkeys: exercise x86 XSAVE init state Andrew Morton
                   ` (6 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:56 UTC (permalink / raw)
  To: akpm, aneesh.kumar, bauerman, dave.hansen, desnesn, fweimer,
	linux-mm, linuxram, mhocko, mingo, mm-commits, mpe, msuchanek,
	sandipan, shuah, tglx, torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>
Subject: selftests/vm/pkeys: refill shadow register after implicit kernel write

The pkey test code keeps a "shadow" of the pkey register around.  This
ensures that any bugs which might write to the register can be caught more
quickly.

Generally, userspace has a good idea when the kernel is going to write to
the register.  For instance, alloc_pkey() is passed a permission mask. 
The caller of alloc_pkey() can update the shadow based on the return value
and the mask.

But, the kernel can also modify the pkey register in a more sneaky way. 
For mprotect(PROT_EXEC) mappings, the kernel will allocate a pkey and
write the pkey register to create an execute-only mapping.  The kernel
never tells userspace what key it uses for this.

This can cause the test to fail with messages like:

	protection_keys_64.2: pkey-helpers.h:132: _read_pkey_reg: Assertion `pkey_reg == shadow_pkey_reg' failed.

because the shadow was not updated with the new kernel-set value.

Forcibly update the shadow value immediately after an mprotect().

Link: https://lkml.kernel.org/r/20210611164200.EF76AB73@viggo.jf.intel.com
Fixes: 6af17cf89e99 ("x86/pkeys/selftests: Add PROT_EXEC test")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "Desnes A. Nunes do Rosario" <desnesn@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Suchanek <msuchanek@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/protection_keys.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/tools/testing/selftests/vm/protection_keys.c~selftests-vm-pkeys-refill-shadow-register-after-implicit-kernel-write
+++ a/tools/testing/selftests/vm/protection_keys.c
@@ -1448,6 +1448,13 @@ void test_implicit_mprotect_exec_only_me
 	ret = mprotect(p1, PAGE_SIZE, PROT_EXEC);
 	pkey_assert(!ret);
 
+	/*
+	 * Reset the shadow, assuming that the above mprotect()
+	 * correctly changed PKRU, but to an unknown value since
+	 * the actual allocated pkey is unknown.
+	 */
+	shadow_pkey_reg = __read_pkey_reg();
+
 	dprintf2("pkey_reg: %016llx\n", read_pkey_reg());
 
 	/* Make sure this is an *instruction* fault */
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 187/192] selftests/vm/pkeys: exercise x86 XSAVE init state
  2021-07-01  1:46 incoming Andrew Morton
                   ` (185 preceding siblings ...)
  2021-07-01  1:56 ` [patch 186/192] selftests/vm/pkeys: refill shadow register after implicit kernel write Andrew Morton
@ 2021-07-01  1:57 ` Andrew Morton
  2021-07-01  1:57 ` [patch 188/192] lib/decompressors: remove set but not used variabled 'level' Andrew Morton
                   ` (5 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:57 UTC (permalink / raw)
  To: akpm, aneesh.kumar, bauerman, dave.hansen, desnesn, fweimer,
	linux-mm, linuxram, mhocko, mingo, mm-commits, mpe, msuchanek,
	sandipan, shuah, tglx, torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>
Subject: selftests/vm/pkeys: exercise x86 XSAVE init state

On x86, there is a set of instructions used to save and restore register
state collectively known as the XSAVE architecture.  There are about a
dozen different features managed with XSAVE.  The protection keys
register, PKRU, is one of those features.

The hardware optimizes XSAVE by tracking when the state has not changed
from its initial (init) state.  In this case, it can avoid the cost of
writing state to memory (it would usually just be a bunch of 0's).

When the pkey register is 0x0, the hardware may optionally track the
register as being in the init state (and optimize away the writes).  AMD
CPUs do this more aggressively than Intel CPUs.

On x86, PKRU is rarely in its (very permissive) init state.  Instead, the
value defaults to something very restrictive.  It is not surprising that
bugs have popped up in the rare cases when PKRU reaches its init state.

Add a protection key selftest which gets the protection keys register into
its init state in a way that should work on Intel and AMD.  Then, do a
bunch of pkey register reads to watch for inadvertent changes.

This adds "-mxsave" to CFLAGS for all the x86 vm selftests in order to
allow use of the XSAVE instruction __builtin functions.  This will make
the builtins available on all of the vm selftests, but is expected to be
harmless.

Link: https://lkml.kernel.org/r/20210611164202.1849B712@viggo.jf.intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "Desnes A. Nunes do Rosario" <desnesn@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Suchanek <msuchanek@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/Makefile          |    4 
 tools/testing/selftests/vm/pkey-x86.h        |    1 
 tools/testing/selftests/vm/protection_keys.c |   73 +++++++++++++++++
 3 files changed, 76 insertions(+), 2 deletions(-)

--- a/tools/testing/selftests/vm/Makefile~selftests-vm-pkeys-exercise-x86-xsave-init-state
+++ a/tools/testing/selftests/vm/Makefile
@@ -101,7 +101,7 @@ $(1) $(1)_64: $(OUTPUT)/$(1)_64
 endef
 
 ifeq ($(CAN_BUILD_I386),1)
-$(BINARIES_32): CFLAGS += -m32
+$(BINARIES_32): CFLAGS += -m32 -mxsave
 $(BINARIES_32): LDLIBS += -lrt -ldl -lm
 $(BINARIES_32): $(OUTPUT)/%_32: %.c
 	$(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(notdir $^) $(LDLIBS) -o $@
@@ -109,7 +109,7 @@ $(foreach t,$(TARGETS),$(eval $(call gen
 endif
 
 ifeq ($(CAN_BUILD_X86_64),1)
-$(BINARIES_64): CFLAGS += -m64
+$(BINARIES_64): CFLAGS += -m64 -mxsave
 $(BINARIES_64): LDLIBS += -lrt -ldl
 $(BINARIES_64): $(OUTPUT)/%_64: %.c
 	$(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(notdir $^) $(LDLIBS) -o $@
--- a/tools/testing/selftests/vm/pkey-x86.h~selftests-vm-pkeys-exercise-x86-xsave-init-state
+++ a/tools/testing/selftests/vm/pkey-x86.h
@@ -126,6 +126,7 @@ static inline u32 pkey_bit_position(int
 
 #define XSTATE_PKEY_BIT	(9)
 #define XSTATE_PKEY	0x200
+#define XSTATE_BV_OFFSET	512
 
 int pkey_reg_xstate_offset(void)
 {
--- a/tools/testing/selftests/vm/protection_keys.c~selftests-vm-pkeys-exercise-x86-xsave-init-state
+++ a/tools/testing/selftests/vm/protection_keys.c
@@ -1277,6 +1277,78 @@ void test_pkey_alloc_exhaust(int *ptr, u
 	}
 }
 
+void arch_force_pkey_reg_init(void)
+{
+#if defined(__i386__) || defined(__x86_64__) /* arch */
+	u64 *buf;
+
+	/*
+	 * All keys should be allocated and set to allow reads and
+	 * writes, so the register should be all 0.  If not, just
+	 * skip the test.
+	 */
+	if (read_pkey_reg())
+		return;
+
+	/*
+	 * Just allocate an absurd amount of memory rather than
+	 * doing the XSAVE size enumeration dance.
+	 */
+	buf = mmap(NULL, 1*MB, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
+
+	/* These __builtins require compiling with -mxsave */
+
+	/* XSAVE to build a valid buffer: */
+	__builtin_ia32_xsave(buf, XSTATE_PKEY);
+	/* Clear XSTATE_BV[PKRU]: */
+	buf[XSTATE_BV_OFFSET/sizeof(u64)] &= ~XSTATE_PKEY;
+	/* XRSTOR will likely get PKRU back to the init state: */
+	__builtin_ia32_xrstor(buf, XSTATE_PKEY);
+
+	munmap(buf, 1*MB);
+#endif
+}
+
+
+/*
+ * This is mostly useless on ppc for now.  But it will not
+ * hurt anything and should give some better coverage as
+ * a long-running test that continually checks the pkey
+ * register.
+ */
+void test_pkey_init_state(int *ptr, u16 pkey)
+{
+	int err;
+	int allocated_pkeys[NR_PKEYS] = {0};
+	int nr_allocated_pkeys = 0;
+	int i;
+
+	for (i = 0; i < NR_PKEYS; i++) {
+		int new_pkey = alloc_pkey();
+
+		if (new_pkey < 0)
+			continue;
+		allocated_pkeys[nr_allocated_pkeys++] = new_pkey;
+	}
+
+	dprintf3("%s()::%d\n", __func__, __LINE__);
+
+	arch_force_pkey_reg_init();
+
+	/*
+	 * Loop for a bit, hoping to exercise the kernel
+	 * context switch code.
+	 */
+	for (i = 0; i < 1000000; i++)
+		read_pkey_reg();
+
+	for (i = 0; i < nr_allocated_pkeys; i++) {
+		err = sys_pkey_free(allocated_pkeys[i]);
+		pkey_assert(!err);
+		read_pkey_reg(); /* for shadow checking */
+	}
+}
+
 /*
  * pkey 0 is special.  It is allocated by default, so you do not
  * have to call pkey_alloc() to use it first.  Make sure that it
@@ -1508,6 +1580,7 @@ void (*pkey_tests[])(int *ptr, u16 pkey)
 	test_implicit_mprotect_exec_only_memory,
 	test_mprotect_with_pkey_0,
 	test_ptrace_of_child,
+	test_pkey_init_state,
 	test_pkey_syscalls_on_non_allocated_pkey,
 	test_pkey_syscalls_bad_args,
 	test_pkey_alloc_exhaust,
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 188/192] lib/decompressors: remove set but not used variable 'level'
  2021-07-01  1:46 incoming Andrew Morton
                   ` (186 preceding siblings ...)
  2021-07-01  1:57 ` [patch 187/192] selftests/vm/pkeys: exercise x86 XSAVE init state Andrew Morton
@ 2021-07-01  1:57 ` Andrew Morton
  2021-07-01  1:57 ` [patch 189/192] ipc sem: use kvmalloc for sem_undo allocation Andrew Morton
                   ` (4 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:57 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, yukuai3

From: Yu Kuai <yukuai3@huawei.com>
Subject: lib/decompressors: remove set but not used variable 'level'

Fixes gcc '-Wunused-but-set-variable' warning:

lib/decompress_unlzo.c:46:5: warning: variable `level' set but
not used [-Wunused-but-set-variable]

It is never used and so can be removed.

[akpm@linux-foundation.org: warning: value computed is not used]
Link: https://lkml.kernel.org/r/20210514062050.3532344-1-yukuai3@huawei.com
Fixes: 7dd65feb6c60 ("lib: add support for LZO-compressed kernels")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/decompress_unlzo.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/lib/decompress_unlzo.c~lib-decompressors-remove-set-but-not-used-variabled-level
+++ a/lib/decompress_unlzo.c
@@ -43,7 +43,6 @@ STATIC inline long INIT parse_header(u8
 	int l;
 	u8 *parse = input;
 	u8 *end = input + in_len;
-	u8 level = 0;
 	u16 version;
 
 	/*
@@ -65,7 +64,7 @@ STATIC inline long INIT parse_header(u8
 	version = get_unaligned_be16(parse);
 	parse += 7;
 	if (version >= 0x0940)
-		level = *parse++;
+		parse++;
 	if (get_unaligned_be32(parse) & HEADER_HAS_FILTER)
 		parse += 8; /* flags + filter info */
 	else
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 189/192] ipc sem: use kvmalloc for sem_undo allocation
  2021-07-01  1:46 incoming Andrew Morton
                   ` (187 preceding siblings ...)
  2021-07-01  1:57 ` [patch 188/192] lib/decompressors: remove set but not used variable 'level' Andrew Morton
@ 2021-07-01  1:57 ` Andrew Morton
  2021-07-01  1:57 ` [patch 190/192] ipc: use kmalloc for msg_queue and shmid_kernel Andrew Morton
                   ` (3 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:57 UTC (permalink / raw)
  To: 0x7f454c46, adobriyan, akpm, dave, guro, hannes, linux-mm,
	manfred, mhocko, mm-commits, shakeelb, torvalds, vdavydov.dev,
	vvs

From: Vasily Averin <vvs@virtuozzo.com>
Subject: ipc sem: use kvmalloc for sem_undo allocation

Patch series "ipc: allocations cleanup", v2.

Some ipc objects use the wrong allocation functions: small objects are
allocated with kvmalloc() where kmalloc() would do, and vice versa, a
potentially large object is allocated with kmalloc() where it should use
kvmalloc().


This patch (of 2):

The size of sem_undo can exceed one page: with the maximum possible nsems =
32000 it can grow up to 64KB.  Let's switch its allocation to kvmalloc() to
avoid user-triggered disruptive actions, like the OOM killer, in case of
high-order memory shortage.

User-triggerable high-order allocations are quite a problem on heavily
fragmented systems.  They can be a DoS vector.
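
A minimal sketch of the pattern being adopted (illustrative only, not the
actual diff): kvmalloc() may return either kmalloc()- or vmalloc()-backed
memory, so such allocations must be released with kvfree()/kvfree_rcu(),
never plain kfree():

	/* a ~64KB allocation that may fall back to vmalloc on fragmented systems */
	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short) * nsems,
		       GFP_KERNEL);
	if (!new)
		return ERR_PTR(-ENOMEM);
	/* ... use the object ... */
	kvfree(new);	/* correct for both kmalloc- and vmalloc-backed memory */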

Link: https://lkml.kernel.org/r/ebc3ac79-3190-520d-81ce-22ad194986ec@virtuozzo.com
Link: https://lkml.kernel.org/r/a6354fd9-2d55-2e63-dd4d-fa7dc1d11134@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/sem.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/ipc/sem.c~ipc-sem-use-kvmalloc-for-sem_undo-allocation
+++ a/ipc/sem.c
@@ -1154,7 +1154,7 @@ static void freeary(struct ipc_namespace
 		un->semid = -1;
 		list_del_rcu(&un->list_proc);
 		spin_unlock(&un->ulp->lock);
-		kfree_rcu(un, rcu);
+		kvfree_rcu(un, rcu);
 	}
 
 	/* Wake up all pending processes and let them fail with EIDRM. */
@@ -1937,7 +1937,8 @@ static struct sem_undo *find_alloc_undo(
 	rcu_read_unlock();
 
 	/* step 2: allocate new undo structure */
-	new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL);
+	new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems,
+		       GFP_KERNEL);
 	if (!new) {
 		ipc_rcu_putref(&sma->sem_perm, sem_rcu_free);
 		return ERR_PTR(-ENOMEM);
@@ -1949,7 +1950,7 @@ static struct sem_undo *find_alloc_undo(
 	if (!ipc_valid_object(&sma->sem_perm)) {
 		sem_unlock(sma, -1);
 		rcu_read_unlock();
-		kfree(new);
+		kvfree(new);
 		un = ERR_PTR(-EIDRM);
 		goto out;
 	}
@@ -1960,7 +1961,7 @@ static struct sem_undo *find_alloc_undo(
 	 */
 	un = lookup_undo(ulp, semid);
 	if (un) {
-		kfree(new);
+		kvfree(new);
 		goto success;
 	}
 	/* step 5: initialize & link new undo structure */
@@ -2420,7 +2421,7 @@ void exit_sem(struct task_struct *tsk)
 		rcu_read_unlock();
 		wake_up_q(&wake_q);
 
-		kfree_rcu(un, rcu);
+		kvfree_rcu(un, rcu);
 	}
 	kfree(ulp);
 }
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 190/192] ipc: use kmalloc for msg_queue and shmid_kernel
  2021-07-01  1:46 incoming Andrew Morton
                   ` (188 preceding siblings ...)
  2021-07-01  1:57 ` [patch 189/192] ipc sem: use kvmalloc for sem_undo allocation Andrew Morton
@ 2021-07-01  1:57 ` Andrew Morton
  2021-07-01  1:57 ` [patch 191/192] ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock Andrew Morton
                   ` (2 subsequent siblings)
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:57 UTC (permalink / raw)
  To: 0x7f454c46, adobriyan, akpm, dave, guro, hannes, linux-mm,
	manfred, mhocko, mm-commits, shakeelb, torvalds, vdavydov.dev,
	vvs

From: Vasily Averin <vvs@virtuozzo.com>
Subject: ipc: use kmalloc for msg_queue and shmid_kernel

msg_queue and shmid_kernel are quite small objects, so there is no need to
use kvmalloc() for them.  mhocko@: "Both of them are 256B on most 64b
systems."

Previously these objects were allocated via ipc_alloc()/ipc_rcu_alloc(), a
common function for several ipc objects which had a kvmalloc() call inside.
Later this function went away and was eventually replaced by a direct
kvmalloc() call, and now we can use the more suitable kmalloc()/kfree() for
them.
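
(For scale: a 256-byte object fits the kmalloc-256 slab cache, 16 objects
per 4KB page, so the kvmalloc() vmalloc fallback could never usefully
trigger for allocations of this size anyway.)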

Link: https://lkml.kernel.org/r/0d0b6c9b-8af3-29d8-34e2-a565c53780f3@virtuozzo.com
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/msg.c |    6 +++---
 ipc/shm.c |    6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

--- a/ipc/msg.c~ipc-use-kmalloc-for-msg_queue-and-shmid_kernel
+++ a/ipc/msg.c
@@ -130,7 +130,7 @@ static void msg_rcu_free(struct rcu_head
 	struct msg_queue *msq = container_of(p, struct msg_queue, q_perm);
 
 	security_msg_queue_free(&msq->q_perm);
-	kvfree(msq);
+	kfree(msq);
 }
 
 /**
@@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *
 	key_t key = params->key;
 	int msgflg = params->flg;
 
-	msq = kvmalloc(sizeof(*msq), GFP_KERNEL);
+	msq = kmalloc(sizeof(*msq), GFP_KERNEL);
 	if (unlikely(!msq))
 		return -ENOMEM;
 
@@ -157,7 +157,7 @@ static int newque(struct ipc_namespace *
 	msq->q_perm.security = NULL;
 	retval = security_msg_queue_alloc(&msq->q_perm);
 	if (retval) {
-		kvfree(msq);
+		kfree(msq);
 		return retval;
 	}
 
--- a/ipc/shm.c~ipc-use-kmalloc-for-msg_queue-and-shmid_kernel
+++ a/ipc/shm.c
@@ -222,7 +222,7 @@ static void shm_rcu_free(struct rcu_head
 	struct shmid_kernel *shp = container_of(ptr, struct shmid_kernel,
 							shm_perm);
 	security_shm_free(&shp->shm_perm);
-	kvfree(shp);
+	kfree(shp);
 }
 
 static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s)
@@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *
 			ns->shm_tot + numpages > ns->shm_ctlall)
 		return -ENOSPC;
 
-	shp = kvmalloc(sizeof(*shp), GFP_KERNEL);
+	shp = kmalloc(sizeof(*shp), GFP_KERNEL);
 	if (unlikely(!shp))
 		return -ENOMEM;
 
@@ -630,7 +630,7 @@ static int newseg(struct ipc_namespace *
 	shp->shm_perm.security = NULL;
 	error = security_shm_alloc(&shp->shm_perm);
 	if (error) {
-		kvfree(shp);
+		kfree(shp);
 		return error;
 	}
 
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 191/192] ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
  2021-07-01  1:46 incoming Andrew Morton
                   ` (189 preceding siblings ...)
  2021-07-01  1:57 ` [patch 190/192] ipc: use kmalloc for msg_queue and shmid_kernel Andrew Morton
@ 2021-07-01  1:57 ` Andrew Morton
  2021-07-01  1:57 ` [patch 192/192] ipc/util.c: use binary search for max_idx Andrew Morton
  2021-07-03  0:28 ` incoming Linus Torvalds
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:57 UTC (permalink / raw)
  To: 1vier1, akpm, dbueso, linux-mm, manfred, mm-commits, paulmck, torvalds

From: Manfred Spraul <manfred@colorfullife.com>
Subject: ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock

The patch solves three weaknesses in ipc/sem.c:

1) The initial read of use_global_lock in sem_lock() is an intentional
   race.  KCSAN detects these accesses and prints a warning.

2) The code assumes that plain C reads/writes are not mangled by the CPU
   or the compiler.

3) The comment in sysvipc_sem_proc_show() was hard to understand: the
   rest of the comments in ipc/sem.c speak about sem_perm.lock, and
   suddenly this function speaks about ipc_lock_object().

To solve 1) and 2), use READ_ONCE()/WRITE_ONCE().  Plain C reads are used
in code that owns sma->sem_perm.lock.

The comment is updated to solve 3).
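
A sketch of the resulting convention (illustrative; it mirrors the hunks in
the diff below):

	/* lockless fast path: marked read, value may be stale but never torn */
	if (!READ_ONCE(sma->use_global_lock)) {
		/* ... try the per-semaphore lock ... */
	}

	/* writer side, under sma->sem_perm.lock, where plain reads stay fine */
	WRITE_ONCE(sma->use_global_lock, sma->use_global_lock - 1);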

[manfred@colorfullife.com: use READ_ONCE()/WRITE_ONCE() for use_global_lock]
  Link: https://lkml.kernel.org/r/20210627161919.3196-3-manfred@colorfullife.com
Link: https://lkml.kernel.org/r/20210514175319.12195-1-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: <1vier1@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/sem.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

--- a/ipc/sem.c~ipc-semc-use-read_once-write_once-for-use_global_lock
+++ a/ipc/sem.c
@@ -217,6 +217,8 @@ static int sysvipc_sem_proc_show(struct
  * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
  * is inside a spin_lock() and after a write from 0 to non-zero a
  * spin_lock()+spin_unlock() is done.
+ * To prevent the compiler/cpu temporarily writing 0 to use_global_lock,
+ * READ_ONCE()/WRITE_ONCE() is used.
  *
  * 2) queue.status: (SEM_BARRIER_2)
  * Initialization is done while holding sem_lock(), so no further barrier is
@@ -342,10 +344,10 @@ static void complexmode_enter(struct sem
 		 * Nothing to do, just reset the
 		 * counter until we return to simple mode.
 		 */
-		sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
+		WRITE_ONCE(sma->use_global_lock, USE_GLOBAL_LOCK_HYSTERESIS);
 		return;
 	}
-	sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS;
+	WRITE_ONCE(sma->use_global_lock, USE_GLOBAL_LOCK_HYSTERESIS);
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = &sma->sems[i];
@@ -371,7 +373,8 @@ static void complexmode_tryleave(struct
 		/* See SEM_BARRIER_1 for purpose/pairing */
 		smp_store_release(&sma->use_global_lock, 0);
 	} else {
-		sma->use_global_lock--;
+		WRITE_ONCE(sma->use_global_lock,
+				sma->use_global_lock-1);
 	}
 }
 
@@ -412,7 +415,7 @@ static inline int sem_lock(struct sem_ar
 	 * Initial check for use_global_lock. Just an optimization,
 	 * no locking, no memory barrier.
 	 */
-	if (!sma->use_global_lock) {
+	if (!READ_ONCE(sma->use_global_lock)) {
 		/*
 		 * It appears that no complex operation is around.
 		 * Acquire the per-semaphore lock.
@@ -2436,7 +2439,8 @@ static int sysvipc_sem_proc_show(struct
 
 	/*
 	 * The proc interface isn't aware of sem_lock(), it calls
-	 * ipc_lock_object() directly (in sysvipc_find_ipc).
+	 * ipc_lock_object(), i.e. spin_lock(&sma->sem_perm.lock).
+	 * (in sysvipc_find_ipc)
 	 * In order to stay compatible with sem_lock(), we must
 	 * enter / leave complex_mode.
 	 */
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* [patch 192/192] ipc/util.c: use binary search for max_idx
  2021-07-01  1:46 incoming Andrew Morton
                   ` (190 preceding siblings ...)
  2021-07-01  1:57 ` [patch 191/192] ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock Andrew Morton
@ 2021-07-01  1:57 ` Andrew Morton
  2021-07-03  0:28 ` incoming Linus Torvalds
  192 siblings, 0 replies; 228+ messages in thread
From: Andrew Morton @ 2021-07-01  1:57 UTC (permalink / raw)
  To: 1vier1, akpm, dbueso, linux-mm, manfred, mm-commits, torvalds

From: Manfred Spraul <manfred@colorfullife.com>
Subject: ipc/util.c: use binary search for max_idx

If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
MSG_INFO or SHM_INFO, then the return value is the highest used index in
the kernel's internal array recording information about all
SysV objects of the requested type for the current namespace.  (This
information can be used with repeated ..._STAT or ..._STAT_ANY operations
to obtain information about all SysV objects on the system.)

There is a cache for this value.  But when the cache needs to be updated,
then the highest used index is determined by looping over all possible
values.  With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
loop over 16 million entries.  And due to /proc/sys/kernel/*next_id, the
index values do not need to be consecutive.

With <write 16000000 to msg_next_id>, msgget(), msgctl(,IPC_RMID) in a
loop, I have observed a performance increase of around a factor of 13000.

As there is no get_last() function for idr structures, implement a
"get_last()" using a binary search.

As far as I see, ipc is the only user that needs get_last(), thus
implement it in ipc/util.c and not in a central location.
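
As a worked example (illustrative numbers, following the ipc_search_maxidx()
loop in the diff below): suppose the highest assigned index is 9 and limit is
15, so the loop starts at i = ilog2(16) = 4.  The probes via idr_get_next()
are at 15 (nothing there, bit 4 stays clear), 7 (index 9 exists, retval
becomes 8), 11 (nothing), 9 (exists, retval becomes 10) and 10 (nothing); the
function then returns retval - 1 = 9 after 5 probes instead of up to 16
linear ones.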

[akpm@linux-foundation.org: tweak comment, fix typo]
Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: <1vier1@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/util.c |   44 +++++++++++++++++++++++++++++++++++++++-----
 ipc/util.h |    3 +++
 2 files changed, 42 insertions(+), 5 deletions(-)

--- a/ipc/util.c~ipc-utilc-use-binary-search-for-max_idx
+++ a/ipc/util.c
@@ -64,6 +64,7 @@
 #include <linux/memory.h>
 #include <linux/ipc_namespace.h>
 #include <linux/rhashtable.h>
+#include <linux/log2.h>
 
 #include <asm/unistd.h>
 
@@ -451,6 +452,41 @@ static void ipc_kht_remove(struct ipc_id
 }
 
 /**
+ * ipc_search_maxidx - search for the highest assigned index
+ * @ids: ipc identifier set
+ * @limit: known upper limit for highest assigned index
+ *
+ * The function determines the highest assigned index in @ids. It is intended
+ * to be called when ids->max_idx needs to be updated.
+ * Updating ids->max_idx is necessary when the current highest index ipc
+ * object is deleted.
+ * If no ipc object is allocated, then -1 is returned.
+ *
+ * ipc_ids.rwsem needs to be held by the caller.
+ */
+static int ipc_search_maxidx(struct ipc_ids *ids, int limit)
+{
+	int tmpidx;
+	int i;
+	int retval;
+
+	i = ilog2(limit+1);
+
+	retval = 0;
+	for (; i >= 0; i--) {
+		tmpidx = retval | (1<<i);
+		/*
+		 * "0" is a possible index value, thus search using
+		 * e.g. 15,7,3,1,0 instead of 16,8,4,2,1.
+		 */
+		tmpidx = tmpidx-1;
+		if (idr_get_next(&ids->ipcs_idr, &tmpidx))
+			retval |= (1<<i);
+	}
+	return retval - 1;
+}
+
+/**
  * ipc_rmid - remove an ipc identifier
  * @ids: ipc identifier set
  * @ipcp: ipc perm structure containing the identifier to remove
@@ -468,11 +504,9 @@ void ipc_rmid(struct ipc_ids *ids, struc
 	ipcp->deleted = true;
 
 	if (unlikely(idx == ids->max_idx)) {
-		do {
-			idx--;
-			if (idx == -1)
-				break;
-		} while (!idr_find(&ids->ipcs_idr, idx));
+		idx = ids->max_idx-1;
+		if (idx >= 0)
+			idx = ipc_search_maxidx(ids, idx);
 		ids->max_idx = idx;
 	}
 }
--- a/ipc/util.h~ipc-utilc-use-binary-search-for-max_idx
+++ a/ipc/util.h
@@ -145,6 +145,9 @@ int ipcperms(struct ipc_namespace *ns, s
  * ipc_get_maxidx - get the highest assigned index
  * @ids: ipc identifier set
  *
+ * The function returns the highest assigned index for @ids. The function
+ * doesn't scan the idr tree, it uses a cached value.
+ *
  * Called with ipc_ids.rwsem held for reading.
  */
 static inline int ipc_get_maxidx(struct ipc_ids *ids)
_

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 097/192] mm: generalize ZONE_[DMA|DMA32]
  2021-07-01  1:52 ` [patch 097/192] mm: generalize ZONE_[DMA|DMA32] Andrew Morton
@ 2021-07-01  2:46   ` Linus Torvalds
  2021-07-01  4:29     ` Konstantin Ryabitsev
  0 siblings, 1 reply; 228+ messages in thread
From: Linus Torvalds @ 2021-07-01  2:46 UTC (permalink / raw)
  To: Andrew Morton, Konstantin Ryabitsev; +Cc: mm-commits

On Wed, Jun 30, 2021 at 6:52 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Kefeng Wang <wangkefeng.wang@huawei.com>
> Subject: mm: generalize ZONE_[DMA|DMA32]
>
> ZONE_[DMA|DMA32] configs have duplicate definitions on platforms that
> subscribe to them.  Instead, just make them generic options which can be
> selected on applicable platforms.

Hmm. 'b4' doesn't pick up this patch, for some reason.

Once again, it seems to be a "missing from the lists" case, even
though I see the "cc: mm-commits".

I don't see _why_, though.

I've picked it up from my own mailbox, but it's a bit annoying how b4
_almost_ makes this all pain-free - but not quite.

                  Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 141/192] fs/proc/kcore.c: add mmap interface
  2021-07-01  1:54 ` [patch 141/192] fs/proc/kcore.c: add mmap interface Andrew Morton
@ 2021-07-01  3:32     ` Linus Torvalds
  0 siblings, 0 replies; 228+ messages in thread
From: Linus Torvalds @ 2021-07-01  3:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexey Dobriyan, chenying.kernel, Linux-MM, mm-commits,
	Mike Rapoport, Muchun Song, zhouchengming, zhoufeng.zf

On Wed, Jun 30, 2021 at 6:54 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> When we do kernel monitoring, we use DRGN
> (https://github.com/osandov/drgn) to access kernel data structures, and
> found that it makes a lot of system calls.  DRGN is implemented by reading
> /proc/kcore.  After looking at the kcore code, we found that kcore does not
> implement mmap, resulting in frequent context switches triggered by read.
> Therefore, we want to add an mmap interface to optimize performance.

Ok, this is funky, but I'm going to drop this patch because I think
it's buggy as is.

 Since

> +static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
> +{
> +       size_t size = vma->vm_end - vma->vm_start;

Ok.

But then:

> +       start = kc_offset_to_vaddr(((u64)vma->vm_pgoff << PAGE_SHIFT) -
> +               ((data_offset >> PAGE_SHIFT) << PAGE_SHIFT));

Not only is that

        ((data_offset >> PAGE_SHIFT) << PAGE_SHIFT)

a very strange calculation (did you mean "data_offset & PAGE_MASK"?),
but I don't see anything that protects against underflow in that
calculation. pg_off can easily be arbitrarily small (eg zero), so that
subtraction can underflow afaik.

So that needs a test, and return -EINVAL or whatever.

But even if that is fixed, this test is entirely broken:

> +       list_for_each_entry(m, &kclist_head, list) {
> +               if (start >= m->addr && size <= m->size)
> +                       break;
> +       }

No, that's wrong.

You allow 'size' to be as big as 'm->size', but you do that even if
'start' isn't 'm->start'.

The proper check would be something like

       u64 end = start + size;

        if (start >= m->addr && end <= m->addr+m->size) ..

or similar (and that should check that "start+size" hasn't overflowed).
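
(One overflow-safe way to spell that, a sketch using the kernel's
check_add_overflow() helper from <linux/overflow.h>; not code from the
patch or from Linus:)

	u64 end;

	if (check_add_overflow(start, (u64)size, &end))
		return -EINVAL;
	list_for_each_entry(m, &kclist_head, list) {
		if (start >= m->addr && end <= m->addr + m->size)
			break;	/* range fully contained in this kclist entry */
	}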

So I see what appears to be multiple problems, and while I hand-waved
some fixes for them, those are very much "maybe something like this",
and I'm going to drop this patch. Not for 5.14.

           Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 004/192] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
  2021-07-01  1:47 ` [patch 004/192] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page Andrew Morton
@ 2021-07-01  3:46     ` Linus Torvalds
  0 siblings, 0 replies; 228+ messages in thread
From: Linus Torvalds @ 2021-07-01  3:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mina Almasry, Anshuman Khandual, bodeddub, Borislav Petkov,
	bsingharora, chenhuang5, Jonathan Corbet, Dave Hansen,
	David Hildenbrand, duanxiongchun, Peter Anvin, joao.m.martins,
	Joerg Roedel, Miaohe Lin, Linux-MM, Andrew Lutomirski,
	Michal Hocko, Mike Kravetz, Ingo Molnar, mm-commits,
	naoya.horiguchi, oneukum, Oscar Salvador, Paul E. McKenney,
	Pawan Gupta, Peter Zijlstra, Randy Dunlap, David Rientjes,
	song.bao.hua, Muchun Song, Thomas Gleixner, Al Viro,
	Matthew Wilcox

On Wed, Jun 30, 2021 at 6:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Muchun Song <songmuchun@bytedance.com>
> Subject: mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
>
> Every HugeTLB has more than one struct page structure.  We __know__ that
> we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
> store metadata associated with each HugeTLB.
>
> There are a lot of struct page structures associated with each HugeTLB
> page.  For tail pages, the value of compound_head is the same.  So we can
> reuse first page of tail page structures.   [..]

I think this means to say that we can reuse the _second_ page of the
tail page structures, since the first page is special and also
contains the first (non-tail) 'struct page'.

Or maybe the intent is to say that that second page is the "first page
of purely tail page structures"?

Anyway, this HugeTLB 'struct page' vmemmap patch-series doesn't look
_wrong_ to me, but it does look like it is a nightmare to debug if
something ever goes wrong. And it looks like a lot of things _could_
go wrong. It all looks very subtle.

Put another way: I'm not objecting to this series, but it does make me
nervous, and I just want to give a heads-up that if we start seeing
problems with this, I think people need to be ready to very
aggressively revert it unless the fixes are obvious.

How much testing has this series gotten on loads that are heavy users
of hugetlb?

                Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 097/192] mm: generalize ZONE_[DMA|DMA32]
  2021-07-01  2:46   ` Linus Torvalds
@ 2021-07-01  4:29     ` Konstantin Ryabitsev
  0 siblings, 0 replies; 228+ messages in thread
From: Konstantin Ryabitsev @ 2021-07-01  4:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, mm-commits

On Wed, Jun 30, 2021 at 07:46:10PM -0700, Linus Torvalds wrote:
> > ZONE_[DMA|DMA32] configs have duplicate definitions on platforms that
> > subscribe to them.  Instead, just make them generic options which can be
> > selected on applicable platforms.
> 
> Hmm. 'b4' doesn't pick up this patch, for some reason.
> 
> Once again, it seems to be a "missing from the lists" case, even
> though I see the "cc: mm-commits".
> 
> I don't see _why_, though.
> 
> I've picked it up from my own mailbox, but it's a bit annoying how b4
> _almost_ makes this all painfree - but not quite.

It seems that 097/192 didn't make it to any of the archival services either
for linux-mm or mm-commits, which is really bizarre, since they are hosted at
two different listserv hosts. E.g. marc.info doesn't have 097 either:

https://marc.info/?l=linux-mm&r=4&b=202107&w=2
https://marc.info/?l=linux-mm-commits&r=4&b=202107&w=2

Since both kvack.org and vger.kernel.org use majordomo, I would guess that
something about that message caused them to drop it. Unfortunately, I don't
have anything insightful to add here other than "b4 is only as good as what
lore.kernel.org actually receives."

-K

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [External] Re: [patch 004/192] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
  2021-07-01  3:46     ` Linus Torvalds
  (?)
@ 2021-07-01  6:29     ` Muchun Song
  2021-07-01 18:25       ` Linus Torvalds
  -1 siblings, 1 reply; 228+ messages in thread
From: Muchun Song @ 2021-07-01  6:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Mina Almasry, Anshuman Khandual, Bodeddula,
	Balasubramaniam, Borislav Petkov, Singh, Balbir, Chen Huang,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, Xiongchun duan,
	Peter Anvin, Joao Martins, Joerg Roedel, Miaohe Lin, Linux-MM,
	Andrew Lutomirski, Michal Hocko, Mike Kravetz, Ingo Molnar,
	mm-commits, HORIGUCHI NAOYA(堀口 直也),
	oneukum, Oscar Salvador, Paul E. McKenney, Pawan Gupta,
	Peter Zijlstra, Randy Dunlap, David Rientjes,
	Song Bao Hua (Barry Song),
	Thomas Gleixner, Al Viro, Matthew Wilcox

On Thu, Jul 1, 2021 at 11:46 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, Jun 30, 2021 at 6:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Muchun Song <songmuchun@bytedance.com>
> > Subject: mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
> >
> > Every HugeTLB has more than one struct page structure.  We __know__ that
> > we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
> > store metadata associated with each HugeTLB.
> >
> > There are a lot of struct page structures associated with each HugeTLB
> > page.  For tail pages, the value of compound_head is the same.  So we can
> > reuse first page of tail page structures.   [..]
>
> I think this means to say that we can reuse the _second_ page of the
> tail page structures, since the first page is special and also
> contains the first (non-tail) 'struct page'.
>
> Or maybe the intent is to say that that second page is the "first page
> of purely tail page structures"?

Hi Linus,

Right. This is what I mean. Every 2MB hugepage has 8 vmemmap
pages (32KB); the 2nd vmemmap page is reused here. The remapping
details can be found at the top of mm/hugetlb_vmemmap.c.
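
(Arithmetic check, assuming 4KB base pages and a 64-byte struct page: a 2MB
hugepage spans 512 base pages, 512 * 64 = 32KB of struct pages, and
32KB / 4KB = 8 vmemmap pages.)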

>
> Anyway, this HugeTLB 'struct page' vmemmap patch-series doesn't look
> _wrong_ to me, but it does look like it is a nightmare to debug if
> something ever goes wrong. And it looks like a lot of things _could_
> go wrong. It all looks very subtle.

In order to make things work well, some vmemmap addresses are also
mapped read-only to catch invalid usage from other modules (e.g. a write
operation). I didn't get the point of "a lot of things _could_ go
wrong". Would you like to describe the details? Thanks.

>
> Put another way: I'm not objecting to this series, but it does make me
> nervous, and I just want to give a heads-up that if we start seeing
> problems with this, I think people need to be ready to very
> aggressively revert it unless the fixes are obvious.
>
> How much testing has this series gotten on loads that are heavy users
> of hugetlb?

This series was tested by Huawei, AWS and Bytedance. In our company,
this feature was proposed in 2020.03, and we have tested it for several
months on our servers (we have a lot of virtual machines). We didn't find
any issues.

Thanks.

>
>                 Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [External] Re: [patch 141/192] fs/proc/kcore.c: add mmap interface
  2021-07-01  3:32     ` Linus Torvalds
  (?)
@ 2021-07-01  6:35     ` zhoufeng
  -1 siblings, 0 replies; 228+ messages in thread
From: zhoufeng @ 2021-07-01  6:35 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Alexey Dobriyan, chenying.kernel, Linux-MM, mm-commits,
	Mike Rapoport, Muchun Song, zhouchengming, duanxiongchun



On 2021/7/1 11:32 AM, Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 6:54 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> When we do kernel monitoring, we use DRGN
>> (https://github.com/osandov/drgn) to access kernel data structures, and
>> found that it makes a lot of system calls.  DRGN is implemented by reading
>> /proc/kcore.  After looking at the kcore code, we found that kcore does not
>> implement mmap, resulting in frequent context switches triggered by read.
>> Therefore, we want to add an mmap interface to optimize performance.
> 
> Ok, this is funky, but I'm going to drop this patch because I think
> it's buggy as is.
> 
>   Since
> 
>> +static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
>> +{
>> +       size_t size = vma->vm_end - vma->vm_start;
> 
> Ok.
> 
> But then:
> 
>> +       start = kc_offset_to_vaddr(((u64)vma->vm_pgoff << PAGE_SHIFT) -
>> +               ((data_offset >> PAGE_SHIFT) << PAGE_SHIFT));
> 
> Not only is that
> 
>          ((data_offset >> PAGE_SHIFT) << PAGE_SHIFT)
> 
> a very strange calculation (did you mean "data_offset & PAGE_MASK"?),
> but I don't see anything that protects against underflow in that
> calculation. pg_off can easily be arbitrarily small (eg zero), so that
> subtraction can underflow afaik.

Sorry, the calculations here are really confusing. The reason is that
when DRGN reads /proc/kcore, the ELF file header is set up as:
phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset;
and when DRGN calls mmap with phdr->p_offset passed in, I need to
subtract "data_offset".

> 
> So that needs a test, and return -EINVAL or whatever.
>
There's a problem with not validating "start". I will fix it in v3.

> But even if that is fixed, this test is entirely broken:
> 
>> +       list_for_each_entry(m, &kclist_head, list) {
>> +               if (start >= m->addr && size <= m->size)
>> +                       break;
>> +       }
> 
> No, that's wrong.
> 

Yes, this is indeed wrong. I will fix it in v3.


> You allow 'size' to be as big as 'm->size', but you do that even if
> 'start' isn't 'm->start'.
> 
> The proper check would be something like
> 
>         u64 end = start + size;
> 
>          if (start >= m->addr && end <= m->addr+m->size) ..
> 
> or similar (and that should check that "start+size" hasn't overflowed).
> 
> So I see what appears to be multiple problems, and while I hand-waved
> some fixes for them, those are very much "maybe something like this",
> and I'm going to drop this patch. Not for 5.14.
> 
>             Linus
> 

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
  2021-07-01  1:52 ` [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep Andrew Morton
@ 2021-07-01 14:55   ` Minchan Kim
  2021-07-01 18:07       ` Linus Torvalds
  2021-07-02  2:45       ` Zhaoyang Huang
  0 siblings, 2 replies; 228+ messages in thread
From: Minchan Kim @ 2021-07-01 14:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, mm-commits, senozhatsky, torvalds, zhaoyang.huang

On Wed, Jun 30, 2021 at 06:52:58PM -0700, Andrew Morton wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> Subject: mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
> 
> Zspage_cachep was found to be merged with another kmem cache during testing,
> which is not good for debugging (zs_pool->zspage_cachep appears as another
> kmem cache in a memory dumpfile).  It is also necessary to do so as a
> shrinker has been registered for zspage.
> 
> Amending this flag can help the kernel to calculate SLAB_RECLAIMABLE correctly.
> 
> Link: https://lkml.kernel.org/r/1623137297-29685-1-git-send-email-huangzhaoyang@gmail.com
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Sorry for the late reply. I don't think this is correct.

It's true that "struct zspage" can be freed by zsmalloc's compaction, which is
registered as a slab shrinker, so it is tempting to make it SLAB_RECLAIM_ACCOUNT.
However, that only works when objects in the zspage are heavily fragmented.
Once compaction is done, zspages are never discardable until objects become
fragmented again. This means it could hurt reclaim of other reclaimable slab
pages, since the zspage slab object pins the page.

> ---
> 
>  mm/zsmalloc.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/mm/zsmalloc.c~mm-zram-amend-slab_reclaim_account-on-zspage_cachep
> +++ a/mm/zsmalloc.c
> @@ -328,7 +328,7 @@ static int create_cache(struct zs_pool *
>  		return 1;
>  
>  	pool->zspage_cachep = kmem_cache_create("zspage", sizeof(struct zspage),
> -					0, 0, NULL);
> +					0, SLAB_RECLAIM_ACCOUNT, NULL);
>  	if (!pool->zspage_cachep) {
>  		kmem_cache_destroy(pool->handle_cachep);
>  		pool->handle_cachep = NULL;
> _

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
  2021-07-01 14:55   ` Minchan Kim
@ 2021-07-01 18:07       ` Linus Torvalds
  2021-07-02  2:45       ` Zhaoyang Huang
  1 sibling, 0 replies; 228+ messages in thread
From: Linus Torvalds @ 2021-07-01 18:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Linux-MM, mm-commits, Sergey Senozhatsky, zhaoyang.huang

On Thu, Jul 1, 2021 at 7:55 AM Minchan Kim <minchan@kernel.org> wrote:
>
> Sorry for the late. I don't think this is correct.

Not _too_ late - I had applied the series to my tree already, but I
try to delay merging my akpm branch overnight exactly to see if there
are any replies to Andrew's sending of the series.

So I've dropped this patch from that branch.

              Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [External] Re: [patch 004/192] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
  2021-07-01  6:29     ` [External] " Muchun Song
@ 2021-07-01 18:25       ` Linus Torvalds
  0 siblings, 0 replies; 228+ messages in thread
From: Linus Torvalds @ 2021-07-01 18:25 UTC (permalink / raw)
  To: Muchun Song
  Cc: Andrew Morton, Mina Almasry, Anshuman Khandual, Bodeddula,
	Balasubramaniam, Borislav Petkov, Singh, Balbir, Chen Huang,
	Jonathan Corbet, Dave Hansen, David Hildenbrand, Xiongchun duan,
	Peter Anvin, Joao Martins, Joerg Roedel, Miaohe Lin, Linux-MM,
	Andrew Lutomirski, Michal Hocko, Mike Kravetz, Ingo Molnar,
	mm-commits, HORIGUCHI NAOYA(堀口 直也),
	oneukum, Oscar Salvador, Paul E. McKenney, Pawan Gupta,
	Peter Zijlstra, Randy Dunlap, David Rientjes,
	Song Bao Hua (Barry Song),
	Thomas Gleixner, Al Viro, Matthew Wilcox

On Wed, Jun 30, 2021 at 11:30 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Thu, Jul 1, 2021 at 11:46 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Anyway, this HugeTLB 'struct page' vmemmap patch-series doesn't look
> > _wrong_ to me, but it does look like it is a nightmare to debug if
> > something ever goes wrong. And it looks like a lot of things _could_
> > go wrong. It all looks very subtle.
>
> In order to make things work well, some addresses of vmemmap are
> also mapped with read only to catch invalid usage from other modules
> (e.g. write operation). I didn't get the point of "a lot of things _could_ go
> wrong". Would you like to describe the details? Thanks.

I just worry about the subtlety.

Things like "oh, now I can't free the page because I need allocations
for the mapping pages" is a very new condition for hugetlb pages.

And if the page table mapping ever gets out-of-sync, debugging it
sounds nightmarish. The real horror: missed TLB invalidates or things
like that, where even if the page tables themselves are updated, the CPU
actually uses something else.

So I didn't see any bugs, but honestly, while I read through all the
patches that was really just that: "reading patches". I just want
people to be very ready to revert, because I suspect that any
potential bugs will just result in very subtle behavior problems.

                  Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
  2021-07-01 14:55   ` Minchan Kim
@ 2021-07-02  2:45       ` Zhaoyang Huang
  2021-07-02  2:45       ` Zhaoyang Huang
  1 sibling, 0 replies; 228+ messages in thread
From: Zhaoyang Huang @ 2021-07-02  2:45 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, open list:MEMORY MANAGEMENT, mm-commits,
	Sergey Senozhatsky, torvalds, Zhaoyang Huang

On Thu, Jul 1, 2021 at 10:56 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Jun 30, 2021 at 06:52:58PM -0700, Andrew Morton wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > Subject: mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
> >
> > Zspage_cachep was found to be merged with another kmem cache during testing,
> > which is not good for debugging (zs_pool->zspage_cachep appears as another
> > kmem cache in a memory dumpfile).  It is also necessary to do so as a
> > shrinker has been registered for zspage.
> >
> > Amending this flag can help the kernel to calculate SLAB_RECLAIMABLE correctly.
> >
> > Link: https://lkml.kernel.org/r/1623137297-29685-1-git-send-email-huangzhaoyang@gmail.com
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>
> Sorry for the late reply. I don't think this is correct.
>
> It's true that "struct zspage" can be freed by zsmalloc's compaction, which is
> registered as a slab shrinker, so it is tempting to make it SLAB_RECLAIM_ACCOUNT.
> However, that only works when objects in the zspage are heavily fragmented.
> Once compaction is done, zspages are never discardable until objects become
> fragmented again. This means it could hurt reclaim of other reclaimable slab
> pages, since the zspage slab object pins the page.
IMHO, kmem cache reclaiming is NOT affected by SLAB_RECLAIM_ACCOUNT.
This flag just affects kmem cache merging[1], the slab page's migrate
type[2] and the page's statistics. Actually, zspage's cache DOES get merged
with others even without SLAB_RECLAIM_ACCOUNT currently, which may
cause zspage's objects to NEVER be discarded. (SLAB_MERGE_SAME
introduces confusion, as people believe the cache will merge with
others when it is set, and vice versa.)

[1]
 struct kmem_cache *find_mergeable(size_t size, size_t align, unsigned
long flags, const char *name, void (*ctor)(void *))
...
    if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
     continue;

[2]
if (s->flags & SLAB_RECLAIM_ACCOUNT)
    s->allocflags |= __GFP_RECLAIMABLE;

>
> > ---
> >
> >  mm/zsmalloc.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > --- a/mm/zsmalloc.c~mm-zram-amend-slab_reclaim_account-on-zspage_cachep
> > +++ a/mm/zsmalloc.c
> > @@ -328,7 +328,7 @@ static int create_cache(struct zs_pool *
> >               return 1;
> >
> >       pool->zspage_cachep = kmem_cache_create("zspage", sizeof(struct zspage),
> > -                                     0, 0, NULL);
> > +                                     0, SLAB_RECLAIM_ACCOUNT, NULL);
> >       if (!pool->zspage_cachep) {
> >               kmem_cache_destroy(pool->handle_cachep);
> >               pool->handle_cachep = NULL;
> > _

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
  2021-07-02  2:45       ` Zhaoyang Huang
  (?)
@ 2021-07-02  5:47       ` Minchan Kim
  2021-07-02  6:20           ` Zhaoyang Huang
  -1 siblings, 1 reply; 228+ messages in thread
From: Minchan Kim @ 2021-07-02  5:47 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: LKML, Andrew Morton, open list:MEMORY MANAGEMENT, mm-commits,
	Sergey Senozhatsky, torvalds, Zhaoyang Huang

On Fri, Jul 02, 2021 at 10:45:09AM +0800, Zhaoyang Huang wrote:
> On Thu, Jul 1, 2021 at 10:56 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Wed, Jun 30, 2021 at 06:52:58PM -0700, Andrew Morton wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > Subject: mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
> > >
> > > Zspage_cachep was found to be merged with another kmem cache during testing,
> > > which is not good for debugging (zs_pool->zspage_cachep appears as another
> > > kmem cache in a memory dumpfile).  It is also necessary to do so as a
> > > shrinker has been registered for zspage.
> > >
> > > Amending this flag can help the kernel to calculate SLAB_RECLAIMABLE correctly.
> > >
> > > Link: https://lkml.kernel.org/r/1623137297-29685-1-git-send-email-huangzhaoyang@gmail.com
> > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >
> > Sorry for the late reply. I don't think this is correct.
> >
> > It's true that "struct zspage" can be freed by zsmalloc's compaction, which is
> > registered as a slab shrinker, so it is tempting to make it SLAB_RECLAIM_ACCOUNT.
> > However, that only works when objects in the zspage are heavily fragmented.
> > Once compaction is done, zspages are never discardable until objects become
> > fragmented again. This means it could hurt reclaim of other reclaimable slab
> > pages, since the zspage slab object pins the page.
> IMHO, kmem cache reclaiming is NOT affected by SLAB_RECLAIM_ACCOUNT.
> This flag just affects kmem cache merging[1], the slab page's migrate
> type[2] and the page's statistics. Actually, zspage's cache DOES get merged
> with others even without SLAB_RECLAIM_ACCOUNT currently, which may
> cause zspage's objects to NEVER be discarded. (SLAB_MERGE_SAME
> introduces confusion, as people believe the cache will merge with
> others when it is set, and vice versa.)
> 
> [1]
>  struct kmem_cache *find_mergeable(size_t size, size_t align, unsigned
> long flags, const char *name, void (*ctor)(void *))
> ...
>     if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
>      continue;
> 
> [2]
> if (s->flags & SLAB_RECLAIM_ACCOUNT)
>     s->allocflags |= __GFP_RECLAIMABLE;

That's the point here. With SLAB_RECLAIM_ACCOUNT, the page allocator
tries to allocate pages from MIGRATE_RECLAIMABLE in the belief that those
objects are easily reclaimable. Say a page has objects A, B, C, D
and E. A-D are easily reclaimable but E is hard. What happens is that
the VM can't reclaim the page in the end due to E, even though it
already reclaimed A-D. And such fragmentation could spread across
entire MIGRATE_RECLAIMABLE pageblocks over time.
That's why I'd like to put zspage into MIGRATE_UNMOVABLE from the
beginning, since I don't think it's easily reclaimable once compaction
is done.

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
  2021-07-02  5:47       ` Minchan Kim
@ 2021-07-02  6:20           ` Zhaoyang Huang
  0 siblings, 0 replies; 228+ messages in thread
From: Zhaoyang Huang @ 2021-07-02  6:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, open list:MEMORY MANAGEMENT, mm-commits,
	Sergey Senozhatsky, torvalds, Zhaoyang Huang

On Fri, Jul 2, 2021 at 1:47 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Fri, Jul 02, 2021 at 10:45:09AM +0800, Zhaoyang Huang wrote:
> > On Thu, Jul 1, 2021 at 10:56 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Wed, Jun 30, 2021 at 06:52:58PM -0700, Andrew Morton wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > Subject: mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
> > > >
> > > > Zspage_cachep was found to be merged with another kmem cache during testing,
> > > > which is not good for debugging (zs_pool->zspage_cachep appears as another
> > > > kmem cache in a memory dumpfile).  It is also necessary to do so as a
> > > > shrinker has been registered for zspage.
> > > >
> > > > Amending this flag can help the kernel to calculate SLAB_RECLAIMABLE correctly.
> > > >
> > > > Link: https://lkml.kernel.org/r/1623137297-29685-1-git-send-email-huangzhaoyang@gmail.com
> > > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > Cc: Minchan Kim <minchan@kernel.org>
> > > > Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
> > > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > >
> > > Sorry for the late reply. I don't think this is correct.
> > >
> > > It's true that "struct zspage" can be freed by zsmalloc's compaction, which is
> > > registered as a slab shrinker, so it is tempting to make it SLAB_RECLAIM_ACCOUNT.
> > > However, that only works when objects in the zspage are heavily fragmented.
> > > Once compaction is done, zspages are never discardable until objects become
> > > fragmented again. This means it could hurt reclaim of other reclaimable slab
> > > pages, since the zspage slab object pins the page.
> > IMHO, a kmem cache's reclaiming is NOT affected by SLAB_RECLAIM_ACCOUNT.
> > This flag just affects kmem cache merging[1], the slab page's migrate
> > type[2] and the page's statistics. Actually, zspage's cache DOES get merged
> > with others even without SLAB_RECLAIM_ACCOUNT currently, which may
> > cause zspage's objects to NEVER be discarded. (SLAB_MERGE_SAME
> > introduces confusion, as people believe the cache will merge with
> > others when it is set, and vice versa.)
> >
> > [1]
> >  struct kmem_cache *find_mergeable(size_t size, size_t align, unsigned
> > long flags, const char *name, void (*ctor)(void *))
> > ...
> >     if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
> >      continue;
> >
> > [2]
> > if (s->flags & SLAB_RECLAIM_ACCOUNT)
> >     s->allocflags |= __GFP_RECLAIMABLE;
>
> That's the point here. With SLAB_RECLAIM_ACCOUNT, the page allocator
> tries to allocate pages from MIGRATE_RECLAIMABLE in the belief that those
> objects are easily reclaimable. Say a page has objects A, B, C, D
> and E. A-D are easily reclaimable but E is hard. What happens is the
> VM can't reclaim the page in the end due to E, even though it
> already reclaimed A-D. And such fragmentation could spread across
> entire MIGRATE_RECLAIMABLE pageblocks over time.
> That's why I'd like to put zspages into MIGRATE_UNMOVABLE from the
> beginning, since I don't think they're easily reclaimable once compaction
> is done.
The slab page could fall back to any migrate type even when allocated with
__GFP_RECLAIMABLE, and there is only one page per slab within zspage's
cache, which will not be affected by compaction, so I think that
doesn't make sense.
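
(For context: the allocator's fallback order is a static table; a rough
sketch of mm/page_alloc.c's fallbacks[] from around this time -- exact
contents vary by release:

    static int fallbacks[MIGRATE_TYPES][3] = {
            [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
            [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
            [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
    };

MIGRATE_TYPES terminates the search, so a reclaimable request that finds
no MIGRATE_RECLAIMABLE block falls back to the other two types.)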

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
  2021-07-02  6:20           ` Zhaoyang Huang
@ 2021-07-02  7:33           ` Minchan Kim
  -1 siblings, 0 replies; 228+ messages in thread
From: Minchan Kim @ 2021-07-02  7:33 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: LKML, Andrew Morton, open list:MEMORY MANAGEMENT, mm-commits,
	Sergey Senozhatsky, torvalds, Zhaoyang Huang

On Fri, Jul 02, 2021 at 02:20:42PM +0800, Zhaoyang Huang wrote:
> On Fri, Jul 2, 2021 at 1:47 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Fri, Jul 02, 2021 at 10:45:09AM +0800, Zhaoyang Huang wrote:
> > > On Thu, Jul 1, 2021 at 10:56 PM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > On Wed, Jun 30, 2021 at 06:52:58PM -0700, Andrew Morton wrote:
> > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > Subject: mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep
> > > > >
> > > > > Zspage_cachep is found to be merged with other kmem caches during testing,
> > > > > which is not good for debugging (zs_pool->zspage_cachep appears as another
> > > > > kmem cache in a memory dumpfile).  It is also necessary to do so as a
> > > > > shrinker has been registered for zspage.
> > > > >
> > > > > Amending this flag helps the kernel calculate SLAB_RECLAIMABLE correctly.
> > > > >
> > > > > Link: https://lkml.kernel.org/r/1623137297-29685-1-git-send-email-huangzhaoyang@gmail.com
> > > > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > Cc: Minchan Kim <minchan@kernel.org>
> > > > > Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
> > > > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > >
> > > > Sorry for the late reply. I don't think this is correct.
> > > >
> > > > It's true that "struct zspage" can be freed by zsmalloc's compaction, registered
> > > > via the slab shrinker, so it's tempting to make it SLAB_RECLAIM_ACCOUNT. However, it's
> > > > quite limited: it works only when objects in the zspage are heavily fragmented.
> > > > Once compaction is done, zspages are never discardable until objects become
> > > > fragmented again. This means it could hurt reclaim of other reclaimable slab pages,
> > > > since the zspage slab object pins the page.
> > > IMHO, a kmem cache's reclaiming is NOT affected by SLAB_RECLAIM_ACCOUNT.
> > > This flag just affects kmem cache merging[1], the slab page's migrate
> > > type[2] and the page's statistics. Actually, zspage's cache DOES get merged
> > > with others even without SLAB_RECLAIM_ACCOUNT currently, which may
> > > cause zspage's objects to NEVER be discarded. (SLAB_MERGE_SAME
> > > introduces confusion, as people believe the cache will merge with
> > > others when it is set, and vice versa.)
> > >
> > > [1]
> > >  struct kmem_cache *find_mergeable(size_t size, size_t align, unsigned
> > > long flags, const char *name, void (*ctor)(void *))
> > > ...
> > >     if ((flags & SLAB_MERGE_SAME) != (s->flags & SLAB_MERGE_SAME))
> > >      continue;
> > >
> > > [2]
> > > if (s->flags & SLAB_RECLAIM_ACCOUNT)
> > >     s->allocflags |= __GFP_RECLAIMABLE;
> >
> > That's the point here. With SLAB_RECLAIM_ACCOUNT, the page allocator
> > tries to allocate pages from MIGRATE_RECLAIMABLE in the belief that those
> > objects are easily reclaimable. Say a page has objects A, B, C, D
> > and E. A-D are easily reclaimable but E is hard. What happens is the
> > VM can't reclaim the page in the end due to E, even though it
> > already reclaimed A-D. And such fragmentation could spread across
> > entire MIGRATE_RECLAIMABLE pageblocks over time.
> > That's why I'd like to put zspages into MIGRATE_UNMOVABLE from the
> > beginning, since I don't think they're easily reclaimable once compaction
> > is done.
> The slab page could fall back to any migrate type even when allocated with

It's true, but it can't justify allocating objects from any
migration type. We should try to select the right type. Please see below.

> __GFP_RECLAIMABLE, and there is only one page per slab within zspage's
> cache, which will not be affected by compaction, so I think that
> doesn't make sense.

You shouldn't rely on how many pages the slab has, since that's an internal
implementation detail, and the zspage size could also change in the future.
And please think about external fragmentation as well as internal fragmentation.

What we want to do with the allocation type is to group objects with
similar lifetimes together in a pageblock, to reduce external fragmentation
for high-order allocations. Think about what happens if an unreclaimable
object is located in a reclaimable pageblock. The block can't be
merged into a high-order page in the end, so it causes more compaction
and fewer available high-order pages in the system.
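
As a concrete example: with 4K pages and pageblock_order = 9 (the usual
x86-64 configuration), a pageblock is 512 pages, i.e. 2MB. A single
pinned, unreclaimable slab page inside a MIGRATE_RECLAIMABLE block is
enough to keep that whole 2MB range from ever coalescing into a free
high-order page.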

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ
  2021-07-01  1:54 ` [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ Andrew Morton
@ 2021-07-02 14:54   ` Christian Brauner
  2021-07-02 18:43   ` Kees Cook
  1 sibling, 0 replies; 228+ messages in thread
From: Christian Brauner @ 2021-07-02 14:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: adobriyan, avagin, bernd.edlinger, christian.koenig, corbet,
	deller, ebiederm, gladkov.alexey, hridya, jamorris, jannh, jeffv,
	kaleshsingh, keescook, linux-mm, mchehab+huawei, mhocko, minchan,
	mm-commits, rdunlap, surenb, szabolcs.nagy, torvalds, viro,
	walken, willy

On Wed, Jun 30, 2021 at 06:54:44PM -0700, Andrew Morton wrote:
> From: Kalesh Singh <kaleshsingh@google.com>
> Subject: procfs: allow reading fdinfo with PTRACE_MODE_READ
> 
> Android captures per-process system memory state when certain low memory
> events (e.g. a foreground app kill) occur, to identify potential memory
> hoggers.  In order to measure how much memory a process actually consumes,
> it is necessary to include the DMA buffer sizes for that process in the
> memory accounting.  Since the handles to DMA buffers are raw FDs, it is
> important to be able to identify which processes have FD references to a
> DMA buffer.
> 
> Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> /proc/<pid>/fdinfo -- both are only readable by the process owner, as
> follows:
> 
>   1. Do a readlink on each FD.
>   2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
>   3. stat the file to get the dmabuf inode number.
>   4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
> 
> Accessing other processes' fdinfo requires root privileges.  This limits
> the use of the interface to debugging environments and is not suitable for
> production builds.  Granting root privileges even to a system process
> increases the attack surface and is highly undesirable.
> 
> Since fdinfo doesn't permit reading process memory and manipulating
> process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCREDS.
> 
> Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> Suggested-by: Jann Horn <jannh@google.com>
> Acked-by: Christian König <christian.koenig@amd.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Hridya Valsaraju <hridya@google.com>
> Cc: James Morris <jamorris@linux.microsoft.com>
> Cc: Jeff Vander Stoep <jeffv@google.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---

Rather useful (also for CRIU and others).
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ
  2021-07-01  1:54 ` [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ Andrew Morton
  2021-07-02 14:54   ` Christian Brauner
@ 2021-07-02 18:43   ` Kees Cook
  2021-07-02 19:00       ` Linus Torvalds
  1 sibling, 1 reply; 228+ messages in thread
From: Kees Cook @ 2021-07-02 18:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: adobriyan, avagin, bernd.edlinger, christian.brauner,
	christian.koenig, corbet, deller, ebiederm, gladkov.alexey,
	hridya, jamorris, jannh, jeffv, kaleshsingh, linux-mm,
	mchehab+huawei, mhocko, minchan, mm-commits, rdunlap, surenb,
	szabolcs.nagy, torvalds, viro, walken, willy

On Wed, Jun 30, 2021 at 06:54:44PM -0700, Andrew Morton wrote:
> From: Kalesh Singh <kaleshsingh@google.com>
> Subject: procfs: allow reading fdinfo with PTRACE_MODE_READ
> 
> Android captures per-process system memory state when certain low memory
> events (e.g. a foreground app kill) occur, to identify potential memory
> hoggers.  In order to measure how much memory a process actually consumes,
> it is necessary to include the DMA buffer sizes for that process in the
> memory accounting.  Since the handles to DMA buffers are raw FDs, it is
> important to be able to identify which processes have FD references to a
> DMA buffer.
> 
> Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> /proc/<pid>/fdinfo -- both are only readable by the process owner, as
> follows:
> 
>   1. Do a readlink on each FD.
>   2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
>   3. stat the file to get the dmabuf inode number.
>   4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
> 
> Accessing other processes' fdinfo requires root privileges.  This limits
> the use of the interface to debugging environments and is not suitable for
> production builds.  Granting root privileges even to a system process
> increases the attack surface and is highly undesirable.
> 
> Since fdinfo doesn't permit reading process memory and manipulating
> process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCREDS.
> 
> Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> Suggested-by: Jann Horn <jannh@google.com>
> Acked-by: Christian König <christian.koenig@amd.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Hridya Valsaraju <hridya@google.com>
> Cc: James Morris <jamorris@linux.microsoft.com>
> Cc: Jeff Vander Stoep <jeffv@google.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  fs/proc/base.c |    4 ++--
>  fs/proc/fd.c   |   15 ++++++++++++++-
>  2 files changed, 16 insertions(+), 3 deletions(-)
> 
> --- a/fs/proc/base.c~procfs-allow-reading-fdinfo-with-ptrace_mode_read
> +++ a/fs/proc/base.c
> @@ -3172,7 +3172,7 @@ static const struct pid_entry tgid_base_
>  	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
>  	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
>  	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
> -	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
> +	DIR("fdinfo",     S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
>  	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
>  #ifdef CONFIG_NET
>  	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
> @@ -3517,7 +3517,7 @@ static const struct inode_operations pro
>   */
>  static const struct pid_entry tid_base_stuff[] = {
>  	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
> -	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
> +	DIR("fdinfo",    S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations),
>  	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
>  #ifdef CONFIG_NET
>  	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
> --- a/fs/proc/fd.c~procfs-allow-reading-fdinfo-with-ptrace_mode_read
> +++ a/fs/proc/fd.c
> @@ -6,6 +6,7 @@
>  #include <linux/fdtable.h>
>  #include <linux/namei.h>
>  #include <linux/pid.h>
> +#include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/file.h>
>  #include <linux/seq_file.h>
> @@ -72,6 +73,18 @@ out:
>  
>  static int seq_fdinfo_open(struct inode *inode, struct file *file)
>  {
> +	bool allowed = false;
> +	struct task_struct *task = get_proc_task(inode);
> +
> +	if (!task)
> +		return -ESRCH;
> +
> +	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
> +	put_task_struct(task);
> +
> +	if (!allowed)
> +		return -EACCES;

Uhm, this is only checked in open(), and never again? Is this safe in
the face of exec or pid re-use?

-Kees

> +
>  	return single_open(file, seq_show, inode);
>  }
>  
> @@ -308,7 +321,7 @@ static struct dentry *proc_fdinfo_instan
>  	struct proc_inode *ei;
>  	struct inode *inode;
>  
> -	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUSR);
> +	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUGO);
>  	if (!inode)
>  		return ERR_PTR(-ENOENT);
>  
> _

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ
  2021-07-02 18:43   ` Kees Cook
@ 2021-07-02 19:00       ` Linus Torvalds
  0 siblings, 0 replies; 228+ messages in thread
From: Linus Torvalds @ 2021-07-02 19:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Alexey Dobriyan, Andrei Vagin, Bernd Edlinger,
	Christian Brauner, Christian Koenig, Jonathan Corbet,
	Helge Deller, Eric W. Biederman, Alexey Gladkov, hridya,
	jamorris, Jann Horn, Jeff Vander Stoep, Kalesh Singh, Linux-MM,
	Mauro Carvalho Chehab, Michal Hocko, Minchan Kim, mm-commits,
	Randy Dunlap, Suren Baghdasaryan, Szabolcs Nagy, Al Viro,
	Michel Lespinasse, Matthew Wilcox

On Fri, Jul 2, 2021 at 11:43 AM Kees Cook <keescook@chromium.org> wrote:
>
> Uhm, this is only checked in open(), and never again? Is this safe in
> the face of exec or pid re-use?

Interesting question, but not really all that valid for this particular patch.

Why? Because we already only check for owner permissions on open, and
never again. So if we have fdinfo issues across a suid exec or pid
re-use, they are pre-existing..

But yes, it would probably be a good idea to think about readdir() on
that directory. If somebody reminds me after the merge window is over,
I'll come back to this, but if somebody else wants to think about it
before then, that would be great.

              Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ
  2021-07-02 19:00       ` Linus Torvalds
@ 2021-07-02 20:40         ` Eric W. Biederman
  -1 siblings, 0 replies; 228+ messages in thread
From: Eric W. Biederman @ 2021-07-02 20:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Andrew Morton, Alexey Dobriyan, Andrei Vagin,
	Bernd Edlinger, Christian Brauner, Christian Koenig,
	Jonathan Corbet, Helge Deller, Alexey Gladkov, hridya, jamorris,
	Jann Horn, Jeff Vander Stoep, Kalesh Singh, Linux-MM,
	Mauro Carvalho Chehab, Michal Hocko, Minchan Kim, mm-commits,
	Randy Dunlap, Suren Baghdasaryan, Szabolcs Nagy, Al Viro,
	Michel Lespinasse, Matthew Wilcox

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Jul 2, 2021 at 11:43 AM Kees Cook <keescook@chromium.org> wrote:
>>
>> Uhm, this is only checked in open(), and never again? Is this safe in
>> the face of exec or pid re-use?

Exec does not change the file descriptor table.

The open holds a reference to the proc inode.  The proc inode holds the
struct pid of the task and the file descriptor number.  References using
struct pid do not suffer from userspace pid rollover issues.

So the only issue I see is file descriptor reuse after an exec
that changes the process's struct cred.

Assuming we care, it would probably be worth a bug-fix patch to check
something.
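
Something like re-doing the ptrace check at show time rather than only at
open time would close it; an untested sketch against fs/proc/fd.c (the
inode is stashed in m->private by single_open()):

	static int seq_show(struct seq_file *m, void *v)
	{
		struct task_struct *task = get_proc_task(m->private);
		bool allowed;

		if (!task)
			return -ESRCH;

		/* re-check against the task's current creds on every read */
		allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
		put_task_struct(task);
		if (!allowed)
			return -EACCES;

		/* ... existing fdinfo printing ... */
		return 0;
	}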


Eric

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ
  2021-07-02 20:40         ` Eric W. Biederman
@ 2021-07-02 23:31         ` Kees Cook
  2021-07-03  0:15             ` Linus Torvalds
  -1 siblings, 1 reply; 228+ messages in thread
From: Kees Cook @ 2021-07-02 23:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Andrew Morton, Alexey Dobriyan, Andrei Vagin,
	Bernd Edlinger, Christian Brauner, Christian Koenig,
	Jonathan Corbet, Helge Deller, Alexey Gladkov, hridya, jamorris,
	Jann Horn, Jeff Vander Stoep, Kalesh Singh, Linux-MM,
	Mauro Carvalho Chehab, Michal Hocko, Minchan Kim, mm-commits,
	Randy Dunlap, Suren Baghdasaryan, Szabolcs Nagy, Al Viro,
	Michel Lespinasse, Matthew Wilcox

On Fri, Jul 02, 2021 at 03:40:49PM -0500, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Fri, Jul 2, 2021 at 11:43 AM Kees Cook <keescook@chromium.org> wrote:
> >>
> >> Uhm, this is only checked in open(), and never again? Is this safe in
> >> the face of exec or pid re-use?
> 
> Exec does not change the file descriptor table.

Ah yeah, good point. I've been thinking too much about vmas.

> The open holds a reference to the proc inode.  The proc inode holds the
> struct pid of the task and the file descriptor number.  References using
> struct pid do not suffer from userspace pid rollover issues.

Okay, cool.

> So the only issue I see is file descriptor reuse after an exec
> that changes the process's struct cred.

Right -- the info leak would be snooping on what a privileged process
was doing with a given fd? Similar stuff has been used to do typing
pattern analysis with login passwords, but that's a stretch here, I
think. Hmm.

> Assuming we care, it would probably be worth a bug-fix patch to check
> something.

Sounds good.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ
  2021-07-02 23:31         ` Kees Cook
@ 2021-07-03  0:15             ` Linus Torvalds
  0 siblings, 0 replies; 228+ messages in thread
From: Linus Torvalds @ 2021-07-03  0:15 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, Andrew Morton, Alexey Dobriyan, Andrei Vagin,
	Bernd Edlinger, Christian Brauner, Christian Koenig,
	Jonathan Corbet, Helge Deller, Alexey Gladkov, Hridya Valsaraju,
	jamorris, Jann Horn, Jeff Vander Stoep, Kalesh Singh, Linux-MM,
	Mauro Carvalho Chehab, Michal Hocko, Minchan Kim, mm-commits,
	Randy Dunlap, Suren Baghdasaryan, Szabolcs Nagy, Al Viro,
	Michel Lespinasse, Matthew Wilcox

On Fri, Jul 2, 2021 at 4:31 PM Kees Cook <keescook@chromium.org> wrote:
>
> Right -- the info leak would be snooping on what a privileged process
> was doing with a given fd? Similar stuff has been used to do typing
> pattern analysis with login passwords, but that's a stretch here, I
> think. Hmm.

So I think you'd see the directory list, but generally that's just the
file descriptor numbers.

Which is information you shouldn't have access to, but it's probably
not very *interesting* information.

I think it would be worth fixing but possibly not a very high priority.

             Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: incoming
  2021-07-01  1:46 incoming Andrew Morton
                   ` (191 preceding siblings ...)
  2021-07-01  1:57 ` [patch 192/192] ipc/util.c: use binary search for max_idx Andrew Morton
@ 2021-07-03  0:28 ` Linus Torvalds
  2021-07-03  1:06   ` incoming Linus Torvalds
  192 siblings, 1 reply; 228+ messages in thread
From: Linus Torvalds @ 2021-07-03  0:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux-MM, mm-commits

On Wed, Jun 30, 2021 at 6:46 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> This is the rest of the -mm tree, less 66 patches which are dependent on
> things which are (or were recently) in linux-next.  I'll trickle that
> material over next week.

I haven't bisected this yet, but with the current -git I'm getting

   watchdog: BUG: soft lockup - CPU#41 stuck for 49s!

and the common call chain seems to be in flush_tlb_mm_range ->
on_each_cpu_cond_mask.

Commit e058a84bfddc42ba356a2316f2cf1141974625c9 is good, and looking
at the pulls and merges I've done since, this -mm series looks like
the obvious culprit.

I'll go start bisection, but I thought I'd give a heads-up in case
somebody else has seen TLB-flush-related lockups and already figured
out the guilty party..
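
(Kicking that off is just:

    $ git bisect start
    $ git bisect bad HEAD
    $ git bisect good e058a84bfddc42ba356a2316f2cf1141974625c9

then build, boot and test at each step git picks.)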

                 Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: incoming
  2021-07-03  0:28 ` incoming Linus Torvalds
@ 2021-07-03  1:06   ` Linus Torvalds
  0 siblings, 0 replies; 228+ messages in thread
From: Linus Torvalds @ 2021-07-03  1:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux-MM, mm-commits

On Fri, Jul 2, 2021 at 5:28 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Commit e058a84bfddc42ba356a2316f2cf1141974625c9 is good, and looking
> at the pulls and merges I've done since, this -mm series looks like
> the obvious culprit.

No, unless my bisection is wrong, the -mm branch is innocent, and was
discarded from the suspects on the very first bisection trial.

So never mind.

             Linus

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ
  2021-07-03  0:15             ` Linus Torvalds
@ 2021-07-03 21:43               ` Eric W. Biederman
  -1 siblings, 0 replies; 228+ messages in thread
From: Eric W. Biederman @ 2021-07-03 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kees Cook, Andrew Morton, Alexey Dobriyan, Andrei Vagin,
	Bernd Edlinger, Christian Brauner, Christian Koenig,
	Jonathan Corbet, Helge Deller, Alexey Gladkov, Hridya Valsaraju,
	jamorris, Jann Horn, Jeff Vander Stoep, Kalesh Singh, Linux-MM,
	Mauro Carvalho Chehab, Michal Hocko, Minchan Kim, mm-commits,
	Randy Dunlap, Suren Baghdasaryan, Szabolcs Nagy, Al Viro,
	Michel Lespinasse, Matthew Wilcox

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Jul 2, 2021 at 4:31 PM Kees Cook <keescook@chromium.org> wrote:
>>
>> Right -- the info leak would be snooping on what a privileged process
>> was doing with a given fd? Similar stuff has been used to do typing
>> pattern analysis with login passwords, but that's a stretch here, I
>> think. Hmm.
>
> So I think you'd see the directory list, but generally that's just the
> file descriptor numbers.
>
> Which is information you shouldn't have access to, but it's probably
> not very *interesting* information.
>
> I think it would be worth fixing but possibly not a very high
> priority.

It is not just the directory whose permissions changed but the individual
files in that directory.

You can also see the position, flags, mnt_id, and soon the inode number
of fdinfo files you open before a suid exec.

Knowing what file someone is reading on a particular file descriptor
number and how far they are in reading that file sounds like a side
channel someone can do something with.
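
For example, a plain read of one of those files gives something like
(values illustrative):

    $ cat /proc/$pid/fdinfo/3
    pos:	4096
    flags:	0100002
    mnt_id:	28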

Eric

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 023/192] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
  2021-07-01  1:48 ` [patch 023/192] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY Andrew Morton
@ 2021-07-12 14:48   ` Matthew Wilcox
  2021-07-12 16:58     ` Mike Kravetz
  0 siblings, 1 reply; 228+ messages in thread
From: Matthew Wilcox @ 2021-07-12 14:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: almasrymina, axelrasmussen, linux-mm, mike.kravetz, mm-commits,
	peterx, torvalds, yuehaibing

On Wed, Jun 30, 2021 at 06:48:19PM -0700, Andrew Morton wrote:
> From: Mina Almasry <almasrymina@google.com>
> Subject: mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
> 
> On UFFDIO_COPY, if we fail to copy the page contents while holding the
> hugetlb_fault_mutex, we will drop the mutex and return to the caller after
> allocating a page that consumed a reservation.  In this case there may be
> a fault that double consumes the reservation.  To handle this, we free the
> allocated page, fix the reservations, and allocate a temporary hugetlb
> page and return that to the caller.  When the caller does the copy outside
> of the lock, we again check the cache, and allocate a page consuming the
> reservation, and copy over the contents.

But you only copy over the contents *IF* CONFIG_MIGRATION is enabled!
Now, maybe there aren't many configs out there that enable HUGETLBFS
and disable MIGRATION, but this is sloppy.

> +++ a/include/linux/migrate.h
> @@ -51,6 +51,7 @@ extern int migrate_huge_page_move_mappin
>  				  struct page *newpage, struct page *page);
>  extern int migrate_page_move_mapping(struct address_space *mapping,
>  		struct page *newpage, struct page *page, int extra_count);
> +extern void copy_huge_page(struct page *dst, struct page *src);
>  #else
>  
>  static inline void putback_movable_pages(struct list_head *l) {}
> @@ -77,6 +78,9 @@ static inline int migrate_huge_page_move
>  	return -ENOSYS;
>  }
>  
> +static inline void copy_huge_page(struct page *dst, struct page *src)
> +{
> +}
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_COMPACTION

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 023/192] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
  2021-07-12 14:48   ` Matthew Wilcox
@ 2021-07-12 16:58     ` Mike Kravetz
  2021-07-12 19:28         ` Mina Almasry
  0 siblings, 1 reply; 228+ messages in thread
From: Mike Kravetz @ 2021-07-12 16:58 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: almasrymina, axelrasmussen, linux-mm, mm-commits, peterx,
	torvalds, yuehaibing

On 7/12/21 7:48 AM, Matthew Wilcox wrote:
> On Wed, Jun 30, 2021 at 06:48:19PM -0700, Andrew Morton wrote:
>> From: Mina Almasry <almasrymina@google.com>
>> Subject: mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
>>
>> On UFFDIO_COPY, if we fail to copy the page contents while holding the
>> hugetlb_fault_mutex, we will drop the mutex and return to the caller after
>> allocating a page that consumed a reservation.  In this case there may be
>> a fault that double consumes the reservation.  To handle this, we free the
>> allocated page, fix the reservations, and allocate a temporary hugetlb
>> page and return that to the caller.  When the caller does the copy outside
>> of the lock, we again check the cache, and allocate a page consuming the
>> reservation, and copy over the contents.
> 
> But you only copy over the contents *IF* CONFIG_MIGRATION is enabled!
> Now, maybe there aren't many configs out there that enable HUGETLBFS
> and disable MIGRATION, but this is sloppy.
> 

Thanks Matthew!

Not copying the contents is also a security exposure.  We rely on copying
the contents to clear the page's previous contents.

I suggested using copy_huge_page here as a previous version of the patch
replicated the code.  The empty !CONFIG_MIGRATION stub slipped by me when
reviewing.  Perhaps it would be best to move those copy_huge_page routines
to huge_memory.c as they are used by both THP and hugetlbfs.
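
(The real version is essentially a per-subpage loop; a simplified sketch,
ignoring the gigantic-page case where the mem_map may not be contiguous:

	void copy_huge_page(struct page *dst, struct page *src)
	{
		int i;
		int nr_pages = compound_nr(src);

		for (i = 0; i < nr_pages; i++) {
			cond_resched();
			copy_highpage(dst + i, src + i);
		}
	}

so the !CONFIG_MIGRATION stub silently skipping it is easy to miss.)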

Mina, can you look into fixing this?
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 228+ messages in thread

* Re: [patch 023/192] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
  2021-07-12 16:58     ` Mike Kravetz
@ 2021-07-12 19:28         ` Mina Almasry
  0 siblings, 0 replies; 228+ messages in thread
From: Mina Almasry @ 2021-07-12 19:28 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Matthew Wilcox, Andrew Morton, axelrasmussen, linux-mm,
	mm-commits, peterx, torvalds, yuehaibing

On Mon, Jul 12, 2021 at 9:58 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 7/12/21 7:48 AM, Matthew Wilcox wrote:
> > On Wed, Jun 30, 2021 at 06:48:19PM -0700, Andrew Morton wrote:
> >> From: Mina Almasry <almasrymina@google.com>
> >> Subject: mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
> >>
> >> On UFFDIO_COPY, if we fail to copy the page contents while holding the
> >> hugetlb_fault_mutex, we will drop the mutex and return to the caller after
> >> allocating a page that consumed a reservation.  In this case there may be
> >> a fault that double consumes the reservation.  To handle this, we free the
> >> allocated page, fix the reservations, and allocate a temporary hugetlb
> >> page and return that to the caller.  When the caller does the copy outside
> >> of the lock, we again check the cache, and allocate a page consuming the
> >> reservation, and copy over the contents.
> >
> > But you only copy over the contents *IF* CONFIG_MIGRATION is enabled!
> > Now, maybe there aren't many configs out there that enable HUGETLBFS
> > and disable MIGRATION, but this is sloppy.
> >
>
> Thanks Matthew!
>
> Not copying the contents is also a security exposure.  We rely on copying
> the contents to clear the page's previous contents.
>
> I suggested using copy_huge_page here as a previous version of the patch
> replicated the code.  The empty !CONFIG_MIGRATION stub slipped by me when
> reviewing.  Perhaps it would be best to move those copy_huge_page routines
> to huge_memory.c as they are used by both THP and hugetlbfs.
>
> Mina, can you look into fixing this?

Gah, sorry, I missed that the function is a no-op if CONFIG_MIGRATION
is not set. I'll send a follow-up fix for this. Thanks for catching!

> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 228+ messages in thread

end of thread, other threads:[~2021-07-12 19:28 UTC | newest]

Thread overview: 228+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-01  1:46 incoming Andrew Morton
2021-07-01  1:47 ` [patch 001/192] mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c Andrew Morton
2021-07-01  1:47 ` [patch 002/192] mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP Andrew Morton
2021-07-01  1:47 ` [patch 003/192] mm: hugetlb: gather discrete indexes of tail page Andrew Morton
2021-07-01  1:47 ` [patch 004/192] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page Andrew Morton
2021-07-01  3:46   ` Linus Torvalds
2021-07-01  3:46     ` Linus Torvalds
2021-07-01  6:29     ` [External] " Muchun Song
2021-07-01 18:25       ` Linus Torvalds
2021-07-01  1:47 ` [patch 005/192] mm: hugetlb: defer freeing of HugeTLB pages Andrew Morton
2021-07-01  1:47 ` [patch 006/192] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page Andrew Morton
2021-07-01  1:47 ` [patch 007/192] mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Andrew Morton
2021-07-01  1:47 ` [patch 008/192] mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled Andrew Morton
2021-07-01  1:47 ` [patch 009/192] mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate Andrew Morton
2021-07-01  1:47 ` [patch 010/192] mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE Andrew Morton
2021-07-01  1:47 ` [patch 011/192] mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake Andrew Morton
2021-07-01  1:47 ` [patch 012/192] mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK Andrew Morton
2021-07-01  1:47 ` [patch 013/192] mm/huge_memory.c: use page->deferred_list Andrew Morton
2021-07-01  1:47 ` [patch 014/192] mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled() Andrew Morton
2021-07-01  1:47 ` [patch 015/192] mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd Andrew Morton
2021-07-01  1:47 ` [patch 016/192] mm/huge_memory.c: don't discard hugepage if other processes are mapping it Andrew Morton
2021-07-01  1:48 ` [patch 017/192] mm/hugetlb: change parameters of arch_make_huge_pte() Andrew Morton
2021-07-01  1:48 ` [patch 018/192] mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge Andrew Morton
2021-07-01  1:48 ` [patch 019/192] mm/vmalloc: enable mapping of huge pages at pte level in vmap Andrew Morton
2021-07-01  1:48 ` [patch 020/192] mm/vmalloc: enable mapping of huge pages at pte level in vmalloc Andrew Morton
2021-07-01  1:48 ` [patch 021/192] powerpc/8xx: add support for huge pages on VMAP and VMALLOC Andrew Morton
2021-07-01  1:48 ` [patch 022/192] khugepaged: selftests: remove debug_cow Andrew Morton
2021-07-01  1:48 ` [patch 023/192] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY Andrew Morton
2021-07-12 14:48   ` Matthew Wilcox
2021-07-12 16:58     ` Mike Kravetz
2021-07-12 19:28       ` Mina Almasry
2021-07-12 19:28         ` Mina Almasry
2021-07-01  1:48 ` [patch 024/192] mm: sparsemem: split the huge PMD mapping of vmemmap pages Andrew Morton
2021-07-01  1:48 ` [patch 025/192] mm: sparsemem: use huge PMD mapping for " Andrew Morton
2021-07-01  1:48 ` [patch 026/192] mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON Andrew Morton
2021-07-01  1:48 ` [patch 027/192] hugetlb: remove prep_compound_huge_page cleanup Andrew Morton
2021-07-01  1:48 ` [patch 028/192] hugetlb: address ref count racing in prep_compound_gigantic_page Andrew Morton
2021-07-01  1:48 ` [patch 029/192] mm/hwpoison: disable pcp for page_handle_poison() Andrew Morton
2021-07-01  1:48 ` [patch 030/192] userfaultfd/selftests: use user mode only Andrew Morton
2021-07-01  1:48 ` [patch 031/192] userfaultfd/selftests: remove the time() check on delayed uffd Andrew Morton
2021-07-01  1:48 ` [patch 032/192] userfaultfd/selftests: dropping VERIFY check in locking_thread Andrew Morton
2021-07-01  1:48 ` [patch 033/192] userfaultfd/selftests: only dump counts if mode enabled Andrew Morton
2021-07-01  1:48 ` [patch 034/192] userfaultfd/selftests: unify error handling Andrew Morton
2021-07-01  1:48 ` [patch 035/192] mm/thp: simplify copying of huge zero page pmd when fork Andrew Morton
2021-07-01  1:49 ` [patch 036/192] mm/userfaultfd: fix uffd-wp special cases for fork() Andrew Morton
2021-07-01  1:49 ` [patch 037/192] mm/userfaultfd: fail uffd-wp registration if not supported Andrew Morton
2021-07-01  1:49 ` [patch 038/192] mm/pagemap: export uffd-wp protection information Andrew Morton
2021-07-01  1:49 ` [patch 039/192] userfaultfd/selftests: add pagemap uffd-wp test Andrew Morton
2021-07-01  1:49 ` [patch 040/192] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte Andrew Morton
2021-07-01  1:49 ` [patch 041/192] userfaultfd/shmem: support minor fault registration for shmem Andrew Morton
2021-07-01  1:49 ` [patch 042/192] userfaultfd/shmem: support UFFDIO_CONTINUE " Andrew Morton
2021-07-01  1:49 ` [patch 043/192] userfaultfd/shmem: advertise shmem minor fault support Andrew Morton
2021-07-01  1:49 ` [patch 044/192] userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte() Andrew Morton
2021-07-01  1:49 ` [patch 045/192] userfaultfd/selftests: use memfd_create for shmem test type Andrew Morton
2021-07-01  1:49 ` [patch 046/192] userfaultfd/selftests: create alias mappings in the shmem test Andrew Morton
2021-07-01  1:49 ` [patch 047/192] userfaultfd/selftests: reinitialize test context in each test Andrew Morton
2021-07-01  1:49 ` [patch 048/192] userfaultfd/selftests: exercise minor fault handling shmem support Andrew Morton
2021-07-01  1:49 ` [patch 049/192] mm/vmscan.c: fix potential deadlock in reclaim_pages() Andrew Morton
2021-07-01  1:49 ` [patch 050/192] include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low Andrew Morton
2021-07-01  1:49 ` [patch 051/192] mm: workingset: define macro WORKINGSET_SHIFT Andrew Morton
2021-07-01  1:49 ` [patch 052/192] mm/kconfig: move HOLES_IN_ZONE into mm Andrew Morton
2021-07-01  1:50 ` [patch 053/192] docs: proc.rst: meminfo: briefly describe gaps in memory accounting Andrew Morton
2021-07-01  1:50 ` [patch 054/192] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER Andrew Morton
2021-07-01  1:50 ` [patch 055/192] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM Andrew Morton
2021-07-01  1:50 ` [patch 056/192] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages Andrew Morton
2021-07-01  1:50 ` [patch 057/192] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() Andrew Morton
2021-07-01  1:50 ` [patch 058/192] virtio-mem: use page_offline_(start|end) when " Andrew Morton
2021-07-01  1:50 ` [patch 059/192] fs/proc/kcore: use page_offline_(freeze|thaw) Andrew Morton
2021-07-01  1:50 ` [patch 060/192] mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS Andrew Morton
2021-07-01  1:50 ` [patch 061/192] mm/z3fold: avoid possible underflow in z3fold_alloc() Andrew Morton
2021-07-01  1:50 ` [patch 062/192] mm/z3fold: remove magic number in z3fold_create_pool() Andrew Morton
2021-07-01  1:50 ` [patch 063/192] mm/z3fold: remove unused function handle_to_z3fold_header() Andrew Morton
2021-07-01  1:50 ` [patch 064/192] mm/z3fold: fix potential memory leak in z3fold_destroy_pool() Andrew Morton
2021-07-01  1:50 ` [patch 065/192] mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page Andrew Morton
2021-07-01  1:50 ` [patch 066/192] mm/zbud: reuse unbuddied[0] as buddied in zbud_pool Andrew Morton
2021-07-01  1:50 ` [patch 067/192] mm/zbud: don't export any zbud API Andrew Morton
2021-07-01  1:50 ` [patch 068/192] mm/compaction: use DEVICE_ATTR_WO macro Andrew Morton
2021-07-01  1:50 ` [patch 069/192] mm: compaction: remove duplicate !list_empty(&sublist) check Andrew Morton
2021-07-01  1:50 ` [patch 070/192] mm/compaction: fix 'limit' in fast_isolate_freepages Andrew Morton
2021-07-01  1:50 ` [patch 071/192] mm/mempolicy: cleanup nodemask intersection check for oom Andrew Morton
2021-07-01  1:51 ` [patch 072/192] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy Andrew Morton
2021-07-01  1:51 ` [patch 073/192] mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy Andrew Morton
2021-07-01  1:51 ` [patch 074/192] mm: mempolicy: don't have to split pmd for huge zero page Andrew Morton
2021-07-01  1:51 ` [patch 075/192] mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies Andrew Morton
2021-07-01  1:51 ` [patch 076/192] include/linux/mmzone.h: add documentation for pfn_valid() Andrew Morton
2021-07-01  1:51 ` [patch 077/192] memblock: update initialization of reserved pages Andrew Morton
2021-07-01  1:51 ` [patch 078/192] arm64: decouple check whether pfn is in linear map from pfn_valid() Andrew Morton
2021-07-01  1:51 ` [patch 079/192] arm64: drop pfn_valid_within() and simplify pfn_valid() Andrew Morton
2021-07-01  1:51 ` [patch 080/192] arm64/mm: drop HAVE_ARCH_PFN_VALID Andrew Morton
2021-07-01  1:51 ` [patch 081/192] mm: migrate: fix missing update page_private to hugetlb_page_subpool Andrew Morton
2021-07-01  1:51 ` [patch 082/192] mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs Andrew Morton
2021-07-01  1:51 ` [patch 083/192] mm: memory: add orig_pmd to struct vm_fault Andrew Morton
2021-07-01  1:51 ` [patch 084/192] mm: memory: make numa_migrate_prep() non-static Andrew Morton
2021-07-01  1:51 ` [patch 085/192] mm: thp: refactor NUMA fault handling Andrew Morton
2021-07-01  1:51 ` [patch 086/192] mm: migrate: account THP NUMA migration counters correctly Andrew Morton
2021-07-01  1:51 ` [patch 087/192] mm: migrate: don't split THP for misplaced NUMA page Andrew Morton
2021-07-01  1:51 ` [patch 088/192] mm: migrate: check mapcount for THP instead of refcount Andrew Morton
2021-07-01  1:51 ` [patch 089/192] mm: thp: skip make PMD PROT_NONE if THP migration is not supported Andrew Morton
2021-07-01  1:51 ` [patch 090/192] mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2 Andrew Morton
2021-07-01  1:52 ` [patch 091/192] mm: rmap: make try_to_unmap() void function Andrew Morton
2021-07-01  1:52 ` [patch 092/192] mm/thp: remap_page() is only needed on anonymous THP Andrew Morton
2021-07-01  1:52 ` [patch 093/192] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC Andrew Morton
2021-07-01  1:52 ` [patch 094/192] mm/thp: fix strncpy warning Andrew Morton
2021-07-01  1:52 ` [patch 095/192] nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc Andrew Morton
2021-07-01  1:52 ` [patch 096/192] mm/nommu: unexport do_munmap() Andrew Morton
2021-07-01  1:52 ` [patch 097/192] mm: generalize ZONE_[DMA|DMA32] Andrew Morton
2021-07-01  2:46   ` Linus Torvalds
2021-07-01  4:29     ` Konstantin Ryabitsev
2021-07-01  1:52 ` [patch 098/192] mm: make variable names for populate_vma_page_range() consistent Andrew Morton
2021-07-01  1:52 ` [patch 099/192] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables Andrew Morton
2021-07-01  1:52 ` [patch 100/192] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT Andrew Morton
2021-07-01  1:52 ` [patch 101/192] selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore Andrew Morton
2021-07-01  1:52 ` [patch 102/192] selftests/vm: add test for MADV_POPULATE_(READ|WRITE) Andrew Morton
2021-07-01  1:52 ` [patch 103/192] mm/memory_hotplug: rate limit page migration warnings Andrew Morton
2021-07-01  1:52 ` [patch 104/192] mm,memory_hotplug: drop unneeded locking Andrew Morton
2021-07-01  1:52 ` [patch 105/192] mm/zswap.c: remove unused function zswap_debugfs_exit() Andrew Morton
2021-07-01  1:52 ` [patch 106/192] mm/zswap.c: avoid unnecessary copy-in at map time Andrew Morton
2021-07-01  1:52 ` [patch 107/192] mm/zswap.c: fix two bugs in zswap_writeback_entry() Andrew Morton
2021-07-01  1:52 ` [patch 108/192] mm: zram: amend SLAB_RECLAIM_ACCOUNT on zspage_cachep Andrew Morton
2021-07-01 14:55   ` Minchan Kim
2021-07-01 18:07     ` Linus Torvalds
2021-07-01 18:07       ` Linus Torvalds
2021-07-02  2:45     ` Zhaoyang Huang
2021-07-02  2:45       ` Zhaoyang Huang
2021-07-02  5:47       ` Minchan Kim
2021-07-02  6:20         ` Zhaoyang Huang
2021-07-02  7:33           ` Minchan Kim
2021-07-01  1:53 ` [patch 109/192] mm/zsmalloc.c: remove confusing code in obj_free() Andrew Morton
2021-07-01  1:53 ` [patch 110/192] mm/zsmalloc.c: improve readability for async_free_zspage() Andrew Morton
2021-07-01  1:53 ` [patch 111/192] zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK Andrew Morton
2021-07-01  1:53 ` [patch 112/192] mm: fix typos and grammar error in comments Andrew Morton
2021-07-01  1:53 ` [patch 113/192] mm: define default value for FIRST_USER_ADDRESS Andrew Morton
2021-07-01  1:53 ` [patch 114/192] mm: fix spelling mistakes Andrew Morton
2021-07-01  1:53 ` [patch 115/192] mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages Andrew Morton
2021-07-01  1:53 ` [patch 116/192] mm/vmalloc: include header for prototype of set_iounmap_nonlazy Andrew Morton
2021-07-01  1:53 ` [patch 117/192] mm/page_alloc: make should_fail_alloc_page() static Andrew Morton
2021-07-01  1:53 ` [patch 118/192] mm/mapping_dirty_helpers: remove double Note in kerneldoc Andrew Morton
2021-07-01  1:53 ` [patch 119/192] mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection Andrew Morton
2021-07-01  1:53 ` [patch 120/192] mm/memory_hotplug: fix kerneldoc comment for __try_online_node Andrew Morton
2021-07-01  1:53 ` [patch 121/192] mm/memory_hotplug: fix kerneldoc comment for __remove_memory Andrew Morton
2021-07-01  1:53 ` [patch 122/192] mm/zbud: add kerneldoc fields for zbud_pool Andrew Morton
2021-07-01  1:53 ` [patch 123/192] mm/z3fold: add kerneldoc fields for z3fold_pool Andrew Morton
2021-07-01  1:53 ` [patch 124/192] mm/swap: make swap_address_space an inline function Andrew Morton
2021-07-01  1:53 ` [patch 125/192] mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations Andrew Morton
2021-07-01  1:53 ` [patch 126/192] mm/page_alloc: move prototype for find_suitable_fallback Andrew Morton
2021-07-01  1:53 ` [patch 127/192] mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM Andrew Morton
2021-07-01  1:53 ` [patch 128/192] mm/thp: define default pmd_pgtable() Andrew Morton
2021-07-01  1:54 ` [patch 129/192] kfence: unconditionally use unbound work queue Andrew Morton
2021-07-01  1:54 ` [patch 130/192] mm: remove special swap entry functions Andrew Morton
2021-07-01  1:54 ` [patch 131/192] mm/swapops: rework swap entry manipulation code Andrew Morton
2021-07-01  1:54 ` [patch 132/192] mm/rmap: split try_to_munlock from try_to_unmap Andrew Morton
2021-07-01  1:54 ` [patch 133/192] mm/rmap: split migration into its own function Andrew Morton
2021-07-01  1:54 ` [patch 134/192] mm: rename migrate_pgmap_owner Andrew Morton
2021-07-01  1:54 ` [patch 135/192] mm/memory.c: allow different return codes for copy_nonpresent_pte() Andrew Morton
2021-07-01  1:54 ` [patch 136/192] mm: device exclusive memory access Andrew Morton
2021-07-01  1:54 ` [patch 137/192] mm: selftests for exclusive device memory Andrew Morton
2021-07-01  1:54 ` [patch 138/192] nouveau/svm: refactor nouveau_range_fault Andrew Morton
2021-07-01  1:54 ` [patch 139/192] nouveau/svm: implement atomic SVM access Andrew Morton
2021-07-01  1:54 ` [patch 140/192] proc: Avoid mixing integer types in mem_rw() Andrew Morton
2021-07-01  1:54 ` [patch 141/192] fs/proc/kcore.c: add mmap interface Andrew Morton
2021-07-01  3:32   ` Linus Torvalds
2021-07-01  6:35     ` [External] " zhoufeng
2021-07-01  1:54 ` [patch 142/192] procfs: allow reading fdinfo with PTRACE_MODE_READ Andrew Morton
2021-07-02 14:54   ` Christian Brauner
2021-07-02 18:43   ` Kees Cook
2021-07-02 19:00     ` Linus Torvalds
2021-07-02 20:40       ` Eric W. Biederman
2021-07-02 23:31         ` Kees Cook
2021-07-03  0:15           ` Linus Torvalds
2021-07-03 21:43             ` Eric W. Biederman
2021-07-01  1:54 ` [patch 143/192] procfs/dmabuf: add inode number to /proc/*/fdinfo Andrew Morton
2021-07-01  1:54 ` [patch 144/192] sysctl: remove redundant assignment to first Andrew Morton
2021-07-01  1:54 ` [patch 145/192] drm: include only needed headers in ascii85.h Andrew Morton
2021-07-01  1:54 ` [patch 146/192] kernel.h: split out panic and oops helpers Andrew Morton
2021-07-01  1:55 ` [patch 147/192] lib: decompress_bunzip2: remove an unneeded semicolon Andrew Morton
2021-07-01  1:55 ` [patch 148/192] lib/string_helpers: switch to use BIT() macro Andrew Morton
2021-07-01  1:55 ` [patch 149/192] lib/string_helpers: move ESCAPE_NP check inside 'else' branch in a loop Andrew Morton
2021-07-01  1:55 ` [patch 150/192] lib/string_helpers: drop indentation level in string_escape_mem() Andrew Morton
2021-07-01  1:55 ` [patch 151/192] lib/string_helpers: introduce ESCAPE_NA for escaping non-ASCII Andrew Morton
2021-07-01  1:55 ` [patch 152/192] lib/string_helpers: introduce ESCAPE_NAP to escape non-ASCII and non-printable Andrew Morton
2021-07-01  1:55 ` [patch 153/192] lib/string_helpers: allow to append additional characters to be escaped Andrew Morton
2021-07-01  1:55 ` [patch 154/192] lib/test-string_helpers: print flags in hexadecimal format Andrew Morton
2021-07-01  1:55 ` [patch 155/192] lib/test-string_helpers: get rid of trailing comma in terminators Andrew Morton
2021-07-01  1:55 ` [patch 156/192] lib/test-string_helpers: add test cases for new features Andrew Morton
2021-07-01  1:55 ` [patch 157/192] MAINTAINERS: add myself as designated reviewer for generic string library Andrew Morton
2021-07-01  1:55 ` [patch 158/192] seq_file: introduce seq_escape_mem() Andrew Morton
2021-07-01  1:55 ` [patch 159/192] seq_file: add seq_escape_str() as replica of string_escape_str() Andrew Morton
2021-07-01  1:55 ` [patch 160/192] seq_file: convert seq_escape() to use seq_escape_str() Andrew Morton
2021-07-01  1:55 ` [patch 161/192] nfsd: avoid non-flexible API in seq_quote_mem() Andrew Morton
2021-07-01  1:55 ` [patch 162/192] seq_file: drop unused *_escape_mem_ascii() Andrew Morton
2021-07-01  1:55 ` [patch 163/192] lib/math/rational.c: fix divide by zero Andrew Morton
2021-07-01  1:55 ` [patch 164/192] lib/math/rational: add Kunit test cases Andrew Morton
2021-07-01  1:55 ` [patch 165/192] lib/decompressors: fix spelling mistakes Andrew Morton
2021-07-01  1:55 ` [patch 166/192] lib/mpi: " Andrew Morton
2021-07-01  1:56 ` [patch 167/192] lib: memscan() fixlet Andrew Morton
2021-07-01  1:56 ` [patch 168/192] lib: uninline simple_strtoull() Andrew Morton
2021-07-01  1:56 ` [patch 169/192] lib/test_string.c: allow module removal Andrew Morton
2021-07-01  1:56 ` [patch 170/192] kernel.h: split out kstrtox() and simple_strtox() to a separate header Andrew Morton
2021-07-01  1:56 ` [patch 171/192] lz4_decompress: declare LZ4_decompress_safe_withPrefix64k static Andrew Morton
2021-07-01  1:56 ` [patch 172/192] lib/decompress_unlz4.c: correctly handle zero-padding around initrds Andrew Morton
2021-07-01  1:56 ` [patch 173/192] checkpatch: scripts/spdxcheck.py now requires python3 Andrew Morton
2021-07-01  1:56 ` [patch 174/192] checkpatch: improve the indented label test Andrew Morton
2021-07-01  1:56 ` [patch 175/192] checkpatch: do not complain about positive return values starting with EPOLL Andrew Morton
2021-07-01  1:56 ` [patch 176/192] init: print out unknown kernel parameters Andrew Morton
2021-07-01  1:56 ` [patch 177/192] kprobes: remove duplicated strong free_insn_page in x86 and s390 Andrew Morton
2021-07-01  1:56 ` [patch 178/192] nilfs2: remove redundant continue statement in a while-loop Andrew Morton
2021-07-01  1:56 ` [patch 179/192] hfsplus: remove unnecessary oom message Andrew Morton
2021-07-01  1:56 ` [patch 180/192] hfsplus: report create_date to kstat.btime Andrew Morton
2021-07-01  1:56 ` [patch 181/192] x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned Andrew Morton
2021-07-01  1:56 ` [patch 182/192] exec: remove checks in __register_binfmt() Andrew Morton
2021-07-01  1:56 ` [patch 183/192] kcov: add __no_sanitize_coverage to fix noinstr for all architectures Andrew Morton
2021-07-01  1:56 ` [patch 184/192] selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random Andrew Morton
2021-07-01  1:56 ` [patch 185/192] selftests/vm/pkeys: handle negative sys_pkey_alloc() return code Andrew Morton
2021-07-01  1:56 ` [patch 186/192] selftests/vm/pkeys: refill shadow register after implicit kernel write Andrew Morton
2021-07-01  1:57 ` [patch 187/192] selftests/vm/pkeys: exercise x86 XSAVE init state Andrew Morton
2021-07-01  1:57 ` [patch 188/192] lib/decompressors: remove set but not used variable 'level' Andrew Morton
2021-07-01  1:57 ` [patch 189/192] ipc sem: use kvmalloc for sem_undo allocation Andrew Morton
2021-07-01  1:57 ` [patch 190/192] ipc: use kmalloc for msg_queue and shmid_kernel Andrew Morton
2021-07-01  1:57 ` [patch 191/192] ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock Andrew Morton
2021-07-01  1:57 ` [patch 192/192] ipc/util.c: use binary search for max_idx Andrew Morton
2021-07-03  0:28 ` incoming Linus Torvalds
2021-07-03  1:06   ` incoming Linus Torvalds
