* incoming
From: Andrew Morton @ 2020-02-04  1:33 UTC
  To: Linus Torvalds; +Cc: mm-commits, linux-mm


The rest of MM and the rest of everything else.


Subsystems affected by this patch series:

  hotfixes
  mm/pagealloc
  mm/memory-hotplug
  ipc
  misc
  mm/cleanups
  mm/pagemap
  procfs
  lib
  cleanups
  arm

Subsystem: hotfixes

    Gang He <GHe@suse.com>:
      ocfs2: fix oops when writing cloned file

    David Hildenbrand <david@redhat.com>:
    Patch series "mm: fix max_pfn not falling on section boundary", v2:
      mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section
      fs/proc/page.c: allow inspection of last section and fix end detection
      mm/page_alloc.c: initialize memmap of unavailable memory directly

Subsystem: mm/pagealloc

    David Hildenbrand <david@redhat.com>:
      mm/page_alloc: fix and rework pfn handling in memmap_init_zone()
      mm: factor out next_present_section_nr()

Subsystem: mm/memory-hotplug

    "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>:
    Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6:
      mm/memmap_init: update variable name in memmap_init_zone

    David Hildenbrand <david@redhat.com>:
      mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()
      mm/memory_hotplug: we always have a zone in find_(smallest|biggest)_section_pfn
      mm/memory_hotplug: don't check for "all holes" in shrink_zone_span()
      mm/memory_hotplug: drop local variables in shrink_zone_span()
      mm/memory_hotplug: cleanup __remove_pages()
      mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone()

Subsystem: ipc

    Manfred Spraul <manfred@colorfullife.com>:
      smp_mb__{before,after}_atomic(): update Documentation

    Davidlohr Bueso <dave@stgolabs.net>:
      ipc/mqueue.c: remove duplicated code

    Manfred Spraul <manfred@colorfullife.com>:
      ipc/mqueue.c: update/document memory barriers
      ipc/msg.c: update and document memory barriers
      ipc/sem.c: document and update memory barriers

    Lu Shuaibing <shuaibinglu@126.com>:
      ipc/msg.c: consolidate all xxxctl_down() functions

Subsystem: misc

    Andrew Morton <akpm@linux-foundation.org>:
      drivers/block/null_blk_main.c: fix layout
      drivers/block/null_blk_main.c: fix uninitialized var warnings

    Randy Dunlap <rdunlap@infradead.org>:
      pinctrl: fix pxa2xx.c build warnings

Subsystem: mm/cleanups

    Florian Westphal <fw@strlen.de>:
      mm: remove __krealloc

Subsystem: mm/pagemap

    Steven Price <steven.price@arm.com>:
    Patch series "Generic page walk and ptdump", v17:
      mm: add generic p?d_leaf() macros
      arc: mm: add p?d_leaf() definitions
      arm: mm: add p?d_leaf() definitions
      arm64: mm: add p?d_leaf() definitions
      mips: mm: add p?d_leaf() definitions
      powerpc: mm: add p?d_leaf() definitions
      riscv: mm: add p?d_leaf() definitions
      s390: mm: add p?d_leaf() definitions
      sparc: mm: add p?d_leaf() definitions
      x86: mm: add p?d_leaf() definitions
      mm: pagewalk: add p4d_entry() and pgd_entry()
      mm: pagewalk: allow walking without vma
      mm: pagewalk: don't lock PTEs for walk_page_range_novma()
      mm: pagewalk: fix termination condition in walk_pte_range()
      mm: pagewalk: add 'depth' parameter to pte_hole
      x86: mm: point to struct seq_file from struct pg_state
      x86: mm+efi: convert ptdump_walk_pgd_level() to take a mm_struct
      x86: mm: convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
      mm: add generic ptdump
      x86: mm: convert dump_pagetables to use walk_page_range
      arm64: mm: convert mm/dump.c to use walk_page_range()
      arm64: mm: display non-present entries in ptdump
      mm: ptdump: reduce level numbers by 1 in note_page()
      x86: mm: avoid allocating struct mm_struct on the stack

    "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>:
    Patch series "Fixup page directory freeing", v4:
      powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

    Peter Zijlstra <peterz@infradead.org>:
      mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
      asm-generic/tlb: avoid potential double flush
      asm-gemeric/tlb: remove stray function declarations
      asm-generic/tlb: add missing CONFIG symbol
      asm-generic/tlb: rename HAVE_RCU_TABLE_FREE
      asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE
      asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER
      asm-generic/tlb: provide MMU_GATHER_TABLE_FREE

Subsystem: procfs

    Alexey Dobriyan <adobriyan@gmail.com>:
      proc: decouple proc from VFS with "struct proc_ops"
      proc: convert everything to "struct proc_ops"

Subsystem: lib

    Yury Norov <yury.norov@gmail.com>:
    Patch series "lib: rework bitmap_parse", v5:
      lib/string: add strnchrnul()
      bitops: more BITS_TO_* macros
      lib: add test for bitmap_parse()
      lib: make bitmap_parse_user a wrapper on bitmap_parse
      lib: rework bitmap_parse()
      lib: new testcases for bitmap_parse{_user}
      include/linux/cpumask.h: don't calculate length of the input string

Subsystem: cleanups

    Masahiro Yamada <masahiroy@kernel.org>:
      treewide: remove redundant IS_ERR() before error code check

Subsystem: arm

    Chen-Yu Tsai <wens@csie.org>:
      ARM: dma-api: fix max_pfn off-by-one error in __dma_supported()

 Documentation/memory-barriers.txt                     |   14 
 arch/Kconfig                                          |   17 
 arch/alpha/kernel/srm_env.c                           |   17 
 arch/arc/include/asm/pgtable.h                        |    1 
 arch/arm/Kconfig                                      |    2 
 arch/arm/include/asm/pgtable-2level.h                 |    1 
 arch/arm/include/asm/pgtable-3level.h                 |    1 
 arch/arm/include/asm/tlb.h                            |    6 
 arch/arm/kernel/atags_proc.c                          |    8 
 arch/arm/mm/alignment.c                               |   14 
 arch/arm/mm/dma-mapping.c                             |    2 
 arch/arm64/Kconfig                                    |    3 
 arch/arm64/Kconfig.debug                              |   19 
 arch/arm64/include/asm/pgtable.h                      |    2 
 arch/arm64/include/asm/ptdump.h                       |    8 
 arch/arm64/mm/Makefile                                |    4 
 arch/arm64/mm/dump.c                                  |  152 ++----
 arch/arm64/mm/mmu.c                                   |    4 
 arch/arm64/mm/ptdump_debugfs.c                        |    2 
 arch/ia64/kernel/salinfo.c                            |   24 -
 arch/m68k/kernel/bootinfo_proc.c                      |    8 
 arch/mips/include/asm/pgtable.h                       |    5 
 arch/mips/lasat/picvue_proc.c                         |   31 -
 arch/powerpc/Kconfig                                  |    7 
 arch/powerpc/include/asm/book3s/32/pgalloc.h          |    8 
 arch/powerpc/include/asm/book3s/64/pgalloc.h          |    2 
 arch/powerpc/include/asm/book3s/64/pgtable.h          |    3 
 arch/powerpc/include/asm/nohash/pgalloc.h             |    8 
 arch/powerpc/include/asm/tlb.h                        |   11 
 arch/powerpc/kernel/proc_powerpc.c                    |   10 
 arch/powerpc/kernel/rtas-proc.c                       |   70 +--
 arch/powerpc/kernel/rtas_flash.c                      |   34 -
 arch/powerpc/kernel/rtasd.c                           |   14 
 arch/powerpc/mm/book3s64/pgtable.c                    |    7 
 arch/powerpc/mm/numa.c                                |   12 
 arch/powerpc/platforms/pseries/lpar.c                 |   24 -
 arch/powerpc/platforms/pseries/lparcfg.c              |   14 
 arch/powerpc/platforms/pseries/reconfig.c             |    8 
 arch/powerpc/platforms/pseries/scanlog.c              |   15 
 arch/riscv/include/asm/pgtable-64.h                   |    7 
 arch/riscv/include/asm/pgtable.h                      |    7 
 arch/s390/Kconfig                                     |    4 
 arch/s390/include/asm/pgtable.h                       |    2 
 arch/sh/mm/alignment.c                                |   17 
 arch/sparc/Kconfig                                    |    3 
 arch/sparc/include/asm/pgtable_64.h                   |    2 
 arch/sparc/include/asm/tlb_64.h                       |   11 
 arch/sparc/kernel/led.c                               |   15 
 arch/um/drivers/mconsole_kern.c                       |    9 
 arch/um/kernel/exitcode.c                             |   15 
 arch/um/kernel/process.c                              |   15 
 arch/x86/Kconfig                                      |    3 
 arch/x86/Kconfig.debug                                |   20 
 arch/x86/include/asm/pgtable.h                        |   10 
 arch/x86/include/asm/tlb.h                            |    4 
 arch/x86/kernel/cpu/mtrr/if.c                         |   21 
 arch/x86/mm/Makefile                                  |    4 
 arch/x86/mm/debug_pagetables.c                        |   18 
 arch/x86/mm/dump_pagetables.c                         |  418 +++++-------------
 arch/x86/platform/efi/efi_32.c                        |    2 
 arch/x86/platform/efi/efi_64.c                        |    4 
 arch/x86/platform/uv/tlb_uv.c                         |   14 
 arch/xtensa/platforms/iss/simdisk.c                   |   10 
 crypto/af_alg.c                                       |    2 
 drivers/acpi/battery.c                                |   15 
 drivers/acpi/proc.c                                   |   15 
 drivers/acpi/scan.c                                   |    2 
 drivers/base/memory.c                                 |    9 
 drivers/block/null_blk_main.c                         |   58 +-
 drivers/char/hw_random/bcm2835-rng.c                  |    2 
 drivers/char/hw_random/omap-rng.c                     |    4 
 drivers/clk/clk.c                                     |    2 
 drivers/dma/mv_xor_v2.c                               |    2 
 drivers/firmware/efi/arm-runtime.c                    |    2 
 drivers/gpio/gpiolib-devres.c                         |    2 
 drivers/gpio/gpiolib-of.c                             |    8 
 drivers/gpio/gpiolib.c                                |    2 
 drivers/hwmon/dell-smm-hwmon.c                        |   15 
 drivers/i2c/busses/i2c-mv64xxx.c                      |    5 
 drivers/i2c/busses/i2c-synquacer.c                    |    2 
 drivers/ide/ide-proc.c                                |   19 
 drivers/input/input.c                                 |   28 -
 drivers/isdn/capi/kcapi_proc.c                        |    6 
 drivers/macintosh/via-pmu.c                           |   17 
 drivers/md/md.c                                       |   15 
 drivers/misc/sgi-gru/gruprocfs.c                      |   42 -
 drivers/mtd/ubi/build.c                               |    2 
 drivers/net/wireless/cisco/airo.c                     |  126 ++---
 drivers/net/wireless/intel/ipw2x00/libipw_module.c    |   15 
 drivers/net/wireless/intersil/hostap/hostap_hw.c      |    4 
 drivers/net/wireless/intersil/hostap/hostap_proc.c    |   14 
 drivers/net/wireless/intersil/hostap/hostap_wlan.h    |    2 
 drivers/net/wireless/ray_cs.c                         |   20 
 drivers/of/device.c                                   |    2 
 drivers/parisc/led.c                                  |   17 
 drivers/pci/controller/pci-tegra.c                    |    2 
 drivers/pci/proc.c                                    |   25 -
 drivers/phy/phy-core.c                                |    4 
 drivers/pinctrl/pxa/pinctrl-pxa2xx.c                  |    1 
 drivers/platform/x86/thinkpad_acpi.c                  |   15 
 drivers/platform/x86/toshiba_acpi.c                   |   60 +-
 drivers/pnp/isapnp/proc.c                             |    9 
 drivers/pnp/pnpbios/proc.c                            |   17 
 drivers/s390/block/dasd_proc.c                        |   15 
 drivers/s390/cio/blacklist.c                          |   14 
 drivers/s390/cio/css.c                                |   11 
 drivers/scsi/esas2r/esas2r_main.c                     |    9 
 drivers/scsi/scsi_devinfo.c                           |   15 
 drivers/scsi/scsi_proc.c                              |   29 -
 drivers/scsi/sg.c                                     |   30 -
 drivers/spi/spi-orion.c                               |    3 
 drivers/staging/rtl8192u/ieee80211/ieee80211_module.c |   14 
 drivers/tty/sysrq.c                                   |    8 
 drivers/usb/gadget/function/rndis.c                   |   17 
 drivers/video/fbdev/imxfb.c                           |    2 
 drivers/video/fbdev/via/viafbdev.c                    |  105 ++--
 drivers/zorro/proc.c                                  |    9 
 fs/cifs/cifs_debug.c                                  |  108 ++--
 fs/cifs/dfs_cache.c                                   |   13 
 fs/cifs/dfs_cache.h                                   |    2 
 fs/ext4/super.c                                       |    2 
 fs/f2fs/node.c                                        |    2 
 fs/fscache/internal.h                                 |    2 
 fs/fscache/object-list.c                              |   11 
 fs/fscache/proc.c                                     |    2 
 fs/jbd2/journal.c                                     |   13 
 fs/jfs/jfs_debug.c                                    |   14 
 fs/lockd/procfs.c                                     |   12 
 fs/nfsd/nfsctl.c                                      |   13 
 fs/nfsd/stats.c                                       |   12 
 fs/ocfs2/file.c                                       |   14 
 fs/ocfs2/suballoc.c                                   |    2 
 fs/proc/cpuinfo.c                                     |   12 
 fs/proc/generic.c                                     |   38 -
 fs/proc/inode.c                                       |   76 +--
 fs/proc/internal.h                                    |    5 
 fs/proc/kcore.c                                       |   13 
 fs/proc/kmsg.c                                        |   14 
 fs/proc/page.c                                        |   54 +-
 fs/proc/proc_net.c                                    |   32 -
 fs/proc/proc_sysctl.c                                 |    2 
 fs/proc/root.c                                        |    2 
 fs/proc/stat.c                                        |   12 
 fs/proc/task_mmu.c                                    |    4 
 fs/proc/vmcore.c                                      |   10 
 fs/sysfs/group.c                                      |    2 
 include/asm-generic/pgtable.h                         |   20 
 include/asm-generic/tlb.h                             |  138 +++--
 include/linux/bitmap.h                                |    8 
 include/linux/bitops.h                                |    4 
 include/linux/cpumask.h                               |    4 
 include/linux/memory_hotplug.h                        |    4 
 include/linux/mm.h                                    |    6 
 include/linux/mmzone.h                                |   10 
 include/linux/pagewalk.h                              |   49 +-
 include/linux/proc_fs.h                               |   23 
 include/linux/ptdump.h                                |   24 -
 include/linux/seq_file.h                              |   13 
 include/linux/slab.h                                  |    1 
 include/linux/string.h                                |    1 
 include/linux/sunrpc/stats.h                          |    4 
 ipc/mqueue.c                                          |  123 ++++-
 ipc/msg.c                                             |   62 +-
 ipc/sem.c                                             |   66 +-
 ipc/util.c                                            |   14 
 kernel/configs.c                                      |    9 
 kernel/irq/proc.c                                     |   42 -
 kernel/kallsyms.c                                     |   12 
 kernel/latencytop.c                                   |   14 
 kernel/locking/lockdep_proc.c                         |   15 
 kernel/module.c                                       |   12 
 kernel/profile.c                                      |   24 -
 kernel/sched/psi.c                                    |   48 +-
 lib/bitmap.c                                          |  195 ++++----
 lib/string.c                                          |   17 
 lib/test_bitmap.c                                     |  105 ++++
 mm/Kconfig.debug                                      |   21 
 mm/Makefile                                           |    1 
 mm/gup.c                                              |    2 
 mm/hmm.c                                              |   66 +-
 mm/memory_hotplug.c                                   |  104 +---
 mm/memremap.c                                         |    2 
 mm/migrate.c                                          |    5 
 mm/mincore.c                                          |    1 
 mm/mmu_gather.c                                       |  158 ++++--
 mm/page_alloc.c                                       |   75 +--
 mm/pagewalk.c                                         |  167 +++++--
 mm/ptdump.c                                           |  159 ++++++
 mm/slab_common.c                                      |   37 -
 mm/sparse.c                                           |   10 
 mm/swapfile.c                                         |   14 
 net/atm/mpoa_proc.c                                   |   17 
 net/atm/proc.c                                        |    8 
 net/core/dev.c                                        |    2 
 net/core/filter.c                                     |    2 
 net/core/pktgen.c                                     |   44 -
 net/ipv4/ipconfig.c                                   |   10 
 net/ipv4/netfilter/ipt_CLUSTERIP.c                    |   16 
 net/ipv4/route.c                                      |   24 -
 net/netfilter/xt_recent.c                             |   17 
 net/sunrpc/auth_gss/svcauth_gss.c                     |   10 
 net/sunrpc/cache.c                                    |   45 -
 net/sunrpc/stats.c                                    |   21 
 net/xfrm/xfrm_policy.c                                |    2 
 samples/kfifo/bytestream-example.c                    |   11 
 samples/kfifo/inttype-example.c                       |   11 
 samples/kfifo/record-example.c                        |   11 
 scripts/coccinelle/free/devm_free.cocci               |    4 
 sound/core/info.c                                     |   34 -
 sound/soc/codecs/ak4104.c                             |    3 
 sound/soc/codecs/cs4270.c                             |    3 
 sound/soc/codecs/tlv320aic32x4.c                      |    6 
 sound/soc/sunxi/sun4i-spdif.c                         |    2 
 tools/include/linux/bitops.h                          |    9 
 214 files changed, 2589 insertions(+), 2227 deletions(-)

* [patch 01/67] ocfs2: fix oops when writing cloned file
From: Andrew Morton @ 2020-02-04  1:33 UTC
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, stable, torvalds

From: Gang He <GHe@suse.com>
Subject: ocfs2: fix oops when writing cloned file

Writing a cloned file triggers a kernel oops, and the user-space command
process is also killed by the system.  The bug can be reproduced reliably
via:

1) Create a file under an ocfs2 filesystem directory:

  journalctl -b > aa.txt

2) Create a cloned file (reflink) of this file:

  reflink aa.txt bb.txt

3) Write to the cloned file with the dd command:

  dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc

The dd command is killed by the kernel; the oops message can then be seen
via the dmesg command.

[  463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028
[  463.875413] #PF: supervisor read access in kernel mode
[  463.875416] #PF: error_code(0x0000) - not-present page
[  463.875418] PGD 0 P4D 0
[  463.875425] Oops: 0000 [#1] SMP PTI
[  463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G           OE     5.3.16-2-default
[  463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2]
[  463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28
[  463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202
[  463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0
[  463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000
[  463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000
[  463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048
[  463.875526] FS:  00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000
[  463.875529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0
[  463.875546] Call Trace:
[  463.875596]  ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2]
[  463.875648]  ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2]
[  463.875672]  new_sync_write+0x12d/0x1d0
[  463.875688]  vfs_write+0xad/0x1a0
[  463.875697]  ksys_write+0xa1/0xe0
[  463.875710]  do_syscall_64+0x60/0x1f0
[  463.875743]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  463.875758] RIP: 0033:0x7f8f4482ed44
[  463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00
[  463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44
[  463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001
[  463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003
[  463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000
[  463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000

This regression was introduced by commit e74540b28556 ("ocfs2: protect
extent tree in ocfs2_prepare_inode_for_write()").

Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com
Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()")
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/file.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

--- a/fs/ocfs2/file.c~ocfs2-fix-the-oops-problem-when-write-cloned-file
+++ a/fs/ocfs2/file.c
@@ -2101,17 +2101,15 @@ static int ocfs2_is_io_unaligned(struct
 static int ocfs2_inode_lock_for_extent_tree(struct inode *inode,
 					    struct buffer_head **di_bh,
 					    int meta_level,
-					    int overwrite_io,
 					    int write_sem,
 					    int wait)
 {
 	int ret = 0;
 
 	if (wait)
-		ret = ocfs2_inode_lock(inode, NULL, meta_level);
+		ret = ocfs2_inode_lock(inode, di_bh, meta_level);
 	else
-		ret = ocfs2_try_inode_lock(inode,
-			overwrite_io ? NULL : di_bh, meta_level);
+		ret = ocfs2_try_inode_lock(inode, di_bh, meta_level);
 	if (ret < 0)
 		goto out;
 
@@ -2136,6 +2134,7 @@ static int ocfs2_inode_lock_for_extent_t
 
 out_unlock:
 	brelse(*di_bh);
+	*di_bh = NULL;
 	ocfs2_inode_unlock(inode, meta_level);
 out:
 	return ret;
@@ -2177,7 +2176,6 @@ static int ocfs2_prepare_inode_for_write
 		ret = ocfs2_inode_lock_for_extent_tree(inode,
 						       &di_bh,
 						       meta_level,
-						       overwrite_io,
 						       write_sem,
 						       wait);
 		if (ret < 0) {
@@ -2233,13 +2231,13 @@ static int ocfs2_prepare_inode_for_write
 							   &di_bh,
 							   meta_level,
 							   write_sem);
+			meta_level = 1;
+			write_sem = 1;
 			ret = ocfs2_inode_lock_for_extent_tree(inode,
 							       &di_bh,
 							       meta_level,
-							       overwrite_io,
-							       1,
+							       write_sem,
 							       wait);
-			write_sem = 1;
 			if (ret < 0) {
 				if (ret != -EAGAIN)
 					mlog_errno(ret);
_

* [patch 02/67] mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section
From: Andrew Morton @ 2020-02-04  1:33 UTC
  To: adobriyan, akpm, bob.picco, dan.j.williams, daniel.m.jordan,
	david, linux-mm, mhocko, mhocko, mm-commits, n-horiguchi,
	osalvador, pasha.tatashin, sfr, stable, steven.sistare, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section

Patch series "mm: fix max_pfn not falling on section boundary", v2.

Playing with different memory sizes for an x86-64 guest, I discovered that
some memmaps (highest section if max_mem does not fall on a section
boundary) are marked as being valid and online, but contain garbage.  We
have to properly initialize these memmaps.

Looking at /proc/kpageflags and friends, I found some more issues,
partially related to this.


This patch (of 3):

If max_pfn is not aligned to a section boundary, we can easily run into
BUGs.  This can, e.g., be triggered on x86-64 under QEMU by specifying a
memory size that is not a multiple of 128MB (e.g., 4097MB, but also
4160MB).  I was told that we can easily have this scenario on real HW
(indeed, it was one of the main reasons sub-section hotadd of devmem was
added).

The issue is that we have a valid memmap (pfn_valid()) for the whole
section, and the whole section will be marked "online".
pfn_to_online_page() will succeed, but the memmap contains garbage.
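
To make the misalignment concrete, here is a small stand-alone user-space
sketch; PAGES_PER_SECTION = 32768 is the assumed x86-64 value
(1 << (SECTION_SIZE_BITS - PAGE_SHIFT)), and 0x144000 is the max_pfn of the
4160MB guest discussed later in this series:

#include <stdio.h>

/* Assumed x86-64 value: one sparsemem section covers 128 MiB, i.e.
 * 1 << (SECTION_SIZE_BITS - PAGE_SHIFT) = 1 << (27 - 12) pages. */
#define PAGES_PER_SECTION	32768UL

int main(void)
{
	unsigned long max_pfn = 0x144000;	/* the 4160MB guest */
	unsigned long aligned_end = (max_pfn + PAGES_PER_SECTION - 1) &
				    ~(PAGES_PER_SECTION - 1);

	/* prints 16384: max_pfn sits in the middle of section 40 */
	printf("pages into last section: %lu\n", max_pfn % PAGES_PER_SECTION);
	/* prints 0x148000: the whole early section has a memmap up to here */
	printf("section-aligned end: %#lx\n", aligned_end);
	return 0;
}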

E.g., doing a "./page-types -r -a 0x144001" (see tools/vm/page-types.c)
when QEMU was started with "-m 4160M":

[  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
[  200.477500] #PF: supervisor read access in kernel mode
[  200.478334] #PF: error_code(0x0000) - not-present page
[  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
[  200.479557] Oops: 0000 [#4] SMP NOPTI
[  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
[  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
[  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
[  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
[  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
[  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
[  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
[  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
[  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
[  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
[  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
[  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
[  200.488897] Call Trace:
[  200.489115]  kpageflags_read+0xe9/0x140
[  200.489447]  proc_reg_read+0x3c/0x60
[  200.489755]  vfs_read+0xc2/0x170
[  200.490037]  ksys_pread64+0x65/0xa0
[  200.490352]  do_syscall_64+0x5c/0xa0
[  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

But it can be triggered much more easily via "cat /proc/kpageflags >
/dev/null" after cold/hot plugging a DIMM into such a system:

[root@localhost ~]# cat /proc/kpageflags > /dev/null
[  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
[  111.517907] #PF: supervisor read access in kernel mode
[  111.518333] #PF: error_code(0x0000) - not-present page
[  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

This patch fixes that by at least zeroing out that memmap (so, e.g.,
page_to_pfn() will not crash).  Commit 907ec5fca3dc ("mm: zero remaining
unavailable struct pages") tried to fix a similar issue, but forgot to
consider this special case.

After this patch, there are still problems to solve.  E.g., not all of
these pages falling into a memory hole will actually get initialized later
and set PageReserved - they are only zeroed out - but at least the
immediate crashes are gone.  A follow-up patch will take care of this.

Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>	[4.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/mm/page_alloc.c~mm-fix-uninitialized-memmaps-on-a-partially-populated-last-section
+++ a/mm/page_alloc.c
@@ -6947,7 +6947,8 @@ static u64 zero_pfn_range(unsigned long
  * This function also addresses a similar issue where struct pages are left
  * uninitialized because the physical address range is not covered by
  * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=.
+ * layout is manually configured via memmap=, or when the highest physical
+ * address (max_pfn) does not end on a section boundary.
  */
 void __init zero_resv_unavail(void)
 {
@@ -6965,7 +6966,16 @@ void __init zero_resv_unavail(void)
 			pgcnt += zero_pfn_range(PFN_DOWN(next), PFN_UP(start));
 		next = end;
 	}
-	pgcnt += zero_pfn_range(PFN_DOWN(next), max_pfn);
+
+	/*
+	 * Early sections always have a fully populated memmap for the whole
+	 * section - see pfn_valid(). If the last section has holes at the
+	 * end and that section is marked "online", the memmap will be
+	 * considered initialized. Make sure that memmap has a well defined
+	 * state.
+	 */
+	pgcnt += zero_pfn_range(PFN_DOWN(next),
+				round_up(max_pfn, PAGES_PER_SECTION));
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
_

* [patch 03/67] fs/proc/page.c: allow inspection of last section and fix end detection
From: Andrew Morton @ 2020-02-04  1:33 UTC
  To: adobriyan, akpm, bob.picco, dan.j.williams, daniel.m.jordan,
	david, linux-mm, mhocko, mhocko, mm-commits, n-horiguchi,
	osalvador, pasha.tatashin, sfr, steven.sistare, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: fs/proc/page.c: allow inspection of last section and fix end detection

If max_pfn does not fall onto a section boundary, it is possible to
inspect PFNs up to max_pfn and PFNs above max_pfn; however, max_pfn
itself can't be inspected.  We can have a valid (and online) memmap at and
above max_pfn if max_pfn is not aligned to a section boundary.  The whole
early section has a memmap and is marked online.  Being able to inspect
the state of these PFNs is valuable for debugging, especially because
max_pfn can change on memory hotplug and expose these memmaps.

Also, querying page flags via "./page-types -r -a 0x144001,"
(tools/vm/page-types.c) inside an x86-64 guest with 4160MB under QEMU
results in an (almost) endless loop in user space, because the end is not
detected properly when starting after max_pfn.

Instead, let's allow inspecting all pages in the highest section and
return 0 directly if we try to access pages above that section.

While at it, check the count before adjusting it, to avoid masking user
errors.
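
As a reminder of the interface these read handlers implement, here is a
minimal user-space sketch of how such a file is accessed (roughly what
tools/vm/page-types does): /proc/kpageflags holds one u64 of flags per pfn,
so a pfn's file offset is pfn * 8, and with this patch a read starting
above the last section returns 0 (EOF):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	unsigned long pfn = 0x144001;	/* the example pfn from above */
	uint64_t flags;
	ssize_t n;
	int fd = open("/proc/kpageflags", O_RDONLY);

	if (fd < 0)
		return 1;
	/* one u64 entry per pfn -> seek to pfn * sizeof(u64) */
	n = pread(fd, &flags, sizeof(flags), pfn * sizeof(flags));
	if (n == sizeof(flags))
		printf("pfn %#lx: flags %#llx\n", pfn,
		       (unsigned long long)flags);
	else if (n == 0)
		printf("pfn %#lx: above the dumpable range\n", pfn);
	close(fd);
	return 0;
}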

Link: http://lkml.kernel.org/r/20191211163201.17179-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/page.c |   30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

--- a/fs/proc/page.c~fs-proc-pagec-allow-inspection-of-last-section-and-fix-end-detection
+++ a/fs/proc/page.c
@@ -21,6 +21,21 @@
 #define KPMMASK (KPMSIZE - 1)
 #define KPMBITS (KPMSIZE * BITS_PER_BYTE)
 
+static inline unsigned long get_max_dump_pfn(void)
+{
+#ifdef CONFIG_SPARSEMEM
+	/*
+	 * The memmap of early sections is completely populated and marked
+	 * online even if max_pfn does not fall on a section boundary -
+	 * pfn_to_online_page() will succeed on all pages. Allow inspecting
+	 * these memmaps.
+	 */
+	return round_up(max_pfn, PAGES_PER_SECTION);
+#else
+	return max_pfn;
+#endif
+}
+
 /* /proc/kpagecount - an array exposing page counts
  *
  * Each entry is a u64 representing the corresponding
@@ -29,6 +44,7 @@
 static ssize_t kpagecount_read(struct file *file, char __user *buf,
 			     size_t count, loff_t *ppos)
 {
+	const unsigned long max_dump_pfn = get_max_dump_pfn();
 	u64 __user *out = (u64 __user *)buf;
 	struct page *ppage;
 	unsigned long src = *ppos;
@@ -37,9 +53,11 @@ static ssize_t kpagecount_read(struct fi
 	u64 pcount;
 
 	pfn = src / KPMSIZE;
-	count = min_t(size_t, count, (max_pfn * KPMSIZE) - src);
 	if (src & KPMMASK || count & KPMMASK)
 		return -EINVAL;
+	if (src >= max_dump_pfn * KPMSIZE)
+		return 0;
+	count = min_t(unsigned long, count, (max_dump_pfn * KPMSIZE) - src);
 
 	while (count > 0) {
 		/*
@@ -206,6 +224,7 @@ u64 stable_page_flags(struct page *page)
 static ssize_t kpageflags_read(struct file *file, char __user *buf,
 			     size_t count, loff_t *ppos)
 {
+	const unsigned long max_dump_pfn = get_max_dump_pfn();
 	u64 __user *out = (u64 __user *)buf;
 	struct page *ppage;
 	unsigned long src = *ppos;
@@ -213,9 +232,11 @@ static ssize_t kpageflags_read(struct fi
 	ssize_t ret = 0;
 
 	pfn = src / KPMSIZE;
-	count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
 	if (src & KPMMASK || count & KPMMASK)
 		return -EINVAL;
+	if (src >= max_dump_pfn * KPMSIZE)
+		return 0;
+	count = min_t(unsigned long, count, (max_dump_pfn * KPMSIZE) - src);
 
 	while (count > 0) {
 		/*
@@ -251,6 +272,7 @@ static const struct file_operations proc
 static ssize_t kpagecgroup_read(struct file *file, char __user *buf,
 				size_t count, loff_t *ppos)
 {
+	const unsigned long max_dump_pfn = get_max_dump_pfn();
 	u64 __user *out = (u64 __user *)buf;
 	struct page *ppage;
 	unsigned long src = *ppos;
@@ -259,9 +281,11 @@ static ssize_t kpagecgroup_read(struct f
 	u64 ino;
 
 	pfn = src / KPMSIZE;
-	count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
 	if (src & KPMMASK || count & KPMMASK)
 		return -EINVAL;
+	if (src >= max_dump_pfn * KPMSIZE)
+		return 0;
+	count = min_t(unsigned long, count, (max_dump_pfn * KPMSIZE) - src);
 
 	while (count > 0) {
 		/*
_

* [patch 04/67] mm/page_alloc.c: initialize memmap of unavailable memory directly
From: Andrew Morton @ 2020-02-04  1:33 UTC
  To: adobriyan, akpm, bob.picco, dan.j.williams, daniel.m.jordan,
	david, linux-mm, mhocko, mhocko, mm-commits, n-horiguchi,
	osalvador, pasha.tatashin, sfr, steven.sistare, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/page_alloc.c: initialize memmap of unavailable memory directly

Let's make sure that all memory holes are actually marked PageReserved(),
that page_to_pfn() produces reliable results, and that these pages are not
detected as "mmap" pages due to the mapcount.

E.g., booting an x86-64 QEMU guest with 4160 MB:

[    0.010585] Early memory node ranges
[    0.010586]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.010588]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
[    0.010589]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]

max_pfn is 0x144000.

Before this change:

[root@localhost ~]# ./page-types -r -a 0x144000,
             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000800           16384       64  ___________M_______________________________        mmap
             total           16384       64

After this change:

[root@localhost ~]# ./page-types -r -a 0x144000,
             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000100000000           16384       64  ___________________________r_______________        reserved
             total           16384       64

IOW, especially the unavailable physical memory ("memory hole") in the
last section would not get properly marked PageReserved() and would be
reported as "mmap" memory.

Drop the traces of that function from include/linux/mm.h - nobody else
needs it - and rename it accordingly.

Note: The fake zone/node might not be covered by the zone/node span.  This
is not an urgent issue (for now, we had the same node/zone due to the
zeroing).  We'll need a clean way to mark memory holes (e.g., using a page
type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
marking these memory holes PageReserved().

Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    6 ------
 mm/page_alloc.c    |   33 ++++++++++++++++++++++-----------
 2 files changed, 22 insertions(+), 17 deletions(-)

--- a/include/linux/mm.h~mm-initialize-memmap-of-unavailable-memory-directly
+++ a/include/linux/mm.h
@@ -2182,12 +2182,6 @@ extern int __meminit __early_pfn_to_nid(
 					struct mminit_pfnnid_cache *state);
 #endif
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
-void zero_resv_unavail(void);
-#else
-static inline void zero_resv_unavail(void) {}
-#endif
-
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
 		enum memmap_context, struct vmem_altmap *);
--- a/mm/page_alloc.c~mm-initialize-memmap-of-unavailable-memory-directly
+++ a/mm/page_alloc.c
@@ -6916,10 +6916,10 @@ void __init free_area_init_node(int nid,
 
 #if !defined(CONFIG_FLAT_NODE_MEM_MAP)
 /*
- * Zero all valid struct pages in range [spfn, epfn), return number of struct
- * pages zeroed
+ * Initialize all valid struct pages in the range [spfn, epfn) and mark them
+ * PageReserved(). Return the number of struct pages that were initialized.
  */
-static u64 zero_pfn_range(unsigned long spfn, unsigned long epfn)
+static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
 {
 	unsigned long pfn;
 	u64 pgcnt = 0;
@@ -6930,7 +6930,13 @@ static u64 zero_pfn_range(unsigned long
 				+ pageblock_nr_pages - 1;
 			continue;
 		}
-		mm_zero_struct_page(pfn_to_page(pfn));
+		/*
+		 * Use a fake node/zone (0) for now. Some of these pages
+		 * (in memblock.reserved but not in memblock.memory) will
+		 * get re-initialized via reserve_bootmem_region() later.
+		 */
+		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+		__SetPageReserved(pfn_to_page(pfn));
 		pgcnt++;
 	}
 
@@ -6942,7 +6948,7 @@ static u64 zero_pfn_range(unsigned long
  * initialized by going through __init_single_page(). But, there are some
  * struct pages which are reserved in memblock allocator and their fields
  * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly zero those struct pages.
+ * flags). We must explicitly initialize those struct pages.
  *
  * This function also addresses a similar issue where struct pages are left
  * uninitialized because the physical address range is not covered by
@@ -6950,7 +6956,7 @@ static u64 zero_pfn_range(unsigned long
  * layout is manually configured via memmap=, or when the highest physical
  * address (max_pfn) does not end on a section boundary.
  */
-void __init zero_resv_unavail(void)
+static void __init init_unavailable_mem(void)
 {
 	phys_addr_t start, end;
 	u64 i, pgcnt;
@@ -6963,7 +6969,8 @@ void __init zero_resv_unavail(void)
 	for_each_mem_range(i, &memblock.memory, NULL,
 			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
 		if (next < start)
-			pgcnt += zero_pfn_range(PFN_DOWN(next), PFN_UP(start));
+			pgcnt += init_unavailable_range(PFN_DOWN(next),
+							PFN_UP(start));
 		next = end;
 	}
 
@@ -6974,8 +6981,8 @@ void __init zero_resv_unavail(void)
 	 * considered initialized. Make sure that memmap has a well defined
 	 * state.
 	 */
-	pgcnt += zero_pfn_range(PFN_DOWN(next),
-				round_up(max_pfn, PAGES_PER_SECTION));
+	pgcnt += init_unavailable_range(PFN_DOWN(next),
+					round_up(max_pfn, PAGES_PER_SECTION));
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
@@ -6984,6 +6991,10 @@ void __init zero_resv_unavail(void)
 	if (pgcnt)
 		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
 }
+#else
+static inline void __init init_unavailable_mem(void)
+{
+}
 #endif /* !CONFIG_FLAT_NODE_MEM_MAP */
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
@@ -7413,7 +7424,7 @@ void __init free_area_init_nodes(unsigne
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	zero_resv_unavail();
+	init_unavailable_mem();
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 		free_area_init_node(nid, NULL,
@@ -7608,7 +7619,7 @@ void __init set_dma_reserve(unsigned lon
 
 void __init free_area_init(unsigned long *zones_size)
 {
-	zero_resv_unavail();
+	init_unavailable_mem();
 	free_area_init_node(0, zones_size,
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
 }
_

* [patch 05/67] mm/page_alloc: fix and rework pfn handling in memmap_init_zone()
From: Andrew Morton @ 2020-02-04  1:33 UTC
  To: akpm, alexander.h.duyck, bhe, dan.j.williams, david,
	kirill.shutemov, kirill, linux-mm, mgorman, mhocko, mhocko,
	mm-commits, osalvador, pasha.tatashin, torvalds, vbabka, zhi.jin

From: David Hildenbrand <david@redhat.com>
Subject: mm/page_alloc: fix and rework pfn handling in memmap_init_zone()

Let's update the pfn manually whenever we continue the loop.  This makes
the code easier to read but also less error prone (and we can directly fix
one issue).

When overlap_memmap_init() returns true, pfn is updated to
"memblock_region_memory_end_pfn(r)".  So it already points at the *next*
pfn to process.  Incrementing the pfn another time is wrong, we might
leave one uninitialized.  I spotted this by inspecting the code, so I have
no idea if this is relevant in practice (with kernelcore=mirror).
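
The hazard is easier to see in a stand-alone sketch (the helper and the
[8, 12) range below are made up for illustration); like
overlap_memmap_init(), the helper leaves pfn pointing at the next pfn to
process, so a for-loop increment after "continue" would skip one pfn:

#include <stdbool.h>
#include <stdio.h>

/* Mimics overlap_memmap_init(): on a hit, *pfn is advanced past the
 * range and already points at the *next* pfn to process.  The range
 * [8, 12) is hypothetical. */
static bool skip_range(unsigned long *pfn)
{
	if (*pfn >= 8 && *pfn < 12) {
		*pfn = 12;
		return true;
	}
	return false;
}

int main(void)
{
	unsigned long pfn;

	for (pfn = 0; pfn < 16; ) {
		if (skip_range(&pfn))
			continue;	/* pfn already updated, no increment */
		printf("init pfn %lu\n", pfn);
		pfn++;			/* manual update on the normal path */
	}
	return 0;
}

This prints pfns 0-7 and 12-15; with the increment left in the for
statement, pfn 12 would have been skipped.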

Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
Fixes: a9a9e77fbf27 ("mm: move mirrored memory specific code outside of memmap_init_zone")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/mm/page_alloc.c~mm-page_alloc-fix-and-rework-pfn-handling-in-memmap_init_zone
+++ a/mm/page_alloc.c
@@ -5905,18 +5905,20 @@ void __meminit memmap_init_zone(unsigned
 	}
 #endif
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; ) {
 		/*
 		 * There can be holes in boot-time mem_map[]s handed to this
 		 * function.  They do not exist on hotplugged memory.
 		 */
 		if (context == MEMMAP_EARLY) {
 			if (!early_pfn_valid(pfn)) {
-				pfn = next_pfn(pfn) - 1;
+				pfn = next_pfn(pfn);
 				continue;
 			}
-			if (!early_pfn_in_nid(pfn, nid))
+			if (!early_pfn_in_nid(pfn, nid)) {
+				pfn++;
 				continue;
+			}
 			if (overlap_memmap_init(zone, &pfn))
 				continue;
 			if (defer_init(nid, pfn, end_pfn))
@@ -5944,6 +5946,7 @@ void __meminit memmap_init_zone(unsigned
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 			cond_resched();
 		}
+		pfn++;
 	}
 }
 
_

* [patch 06/67] mm: factor out next_present_section_nr()
From: Andrew Morton @ 2020-02-04  1:34 UTC
  To: akpm, bhe, dan.j.williams, david, kirill.shutemov, kirill,
	linux-mm, mgorman, mhocko, mhocko, mm-commits, osalvador,
	pasha.tatashin, torvalds, vbabka, zhi.jin

From: David Hildenbrand <david@redhat.com>
Subject: mm: factor out next_present_section_nr()

Let's move it to the header and use the shorter variant from
mm/page_alloc.c (the original one will also check
"__highest_present_section_nr + 1", which is not necessary).  While at it,
make the section_nr in next_pfn() const.

In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
exceed __highest_present_section_nr, which doesn't make a difference in
the caller as it is big enough (>= all sane end_pfn).
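
To illustrate why it is big enough - section_nr_to_pfn() is a plain left
shift, and PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT = 15 is the
assumed x86-64 value in this sketch:

#include <stdio.h>

#define PFN_SECTION_SHIFT 15	/* assumed x86-64 value */

int main(void)
{
	/* section_nr_to_pfn(sec) is sec << PFN_SECTION_SHIFT */
	unsigned long pfn = (unsigned long)-1 << PFN_SECTION_SHIFT;

	/* 0xffffffffffff8000 on 64-bit: far above any sane end_pfn, so
	 * the caller's "pfn < end_pfn" check still terminates the loop. */
	printf("section_nr_to_pfn(-1) = %#lx\n", pfn);
	return 0;
}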

Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   10 ++++++++++
 mm/page_alloc.c        |   11 ++---------
 mm/sparse.c            |   10 ----------
 3 files changed, 12 insertions(+), 19 deletions(-)

--- a/include/linux/mmzone.h~mm-factor-out-next_present_section_nr
+++ a/include/linux/mmzone.h
@@ -1379,6 +1379,16 @@ static inline int pfn_present(unsigned l
 	return present_section(__nr_to_section(pfn_to_section_nr(pfn)));
 }
 
+static inline unsigned long next_present_section_nr(unsigned long section_nr)
+{
+	while (++section_nr <= __highest_present_section_nr) {
+		if (present_section_nr(section_nr))
+			return section_nr;
+	}
+
+	return -1;
+}
+
 /*
  * These are _only_ used during initialisation, therefore they
  * can use __initdata ...  They could have names to indicate
--- a/mm/page_alloc.c~mm-factor-out-next_present_section_nr
+++ a/mm/page_alloc.c
@@ -5852,18 +5852,11 @@ overlap_memmap_init(unsigned long zone,
 /* Skip PFNs that belong to non-present sections */
 static inline __meminit unsigned long next_pfn(unsigned long pfn)
 {
-	unsigned long section_nr;
+	const unsigned long section_nr = pfn_to_section_nr(++pfn);
 
-	section_nr = pfn_to_section_nr(++pfn);
 	if (present_section_nr(section_nr))
 		return pfn;
-
-	while (++section_nr <= __highest_present_section_nr) {
-		if (present_section_nr(section_nr))
-			return section_nr_to_pfn(section_nr);
-	}
-
-	return -1;
+	return section_nr_to_pfn(next_present_section_nr(section_nr));
 }
 #else
 static inline __meminit unsigned long next_pfn(unsigned long pfn)
--- a/mm/sparse.c~mm-factor-out-next_present_section_nr
+++ a/mm/sparse.c
@@ -198,16 +198,6 @@ static void section_mark_present(struct
 	ms->section_mem_map |= SECTION_MARKED_PRESENT;
 }
 
-static inline unsigned long next_present_section_nr(unsigned long section_nr)
-{
-	do {
-		section_nr++;
-		if (present_section_nr(section_nr))
-			return section_nr;
-	} while ((section_nr <= __highest_present_section_nr));
-
-	return -1;
-}
_

* [patch 07/67] mm/memmap_init: update variable name in memmap_init_zone
From: Andrew Morton @ 2020-02-04  1:34 UTC
  To: akpm, aneesh.kumar, dan.j.williams, david, gregkh, linux-mm,
	logang, mhocko, mm-commits, osalvador, pagupta, pasha.tatashin,
	torvalds, willy

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Subject: mm/memmap_init: update variable name in memmap_init_zone

Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.


This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory.  Also, it contains all fixes for
crashes that can be triggered when removing certain namespaces using
memunmap_pages() (ZONE_DEVICE), reported by Aneesh.

We stop trying to shrink ZONE_DEVICE, as it's buggy; fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect the
ZONE_DEVICE as contiguous).

We continue shrinking !ZONE_DEVICE zones; however, I reduced the amount of
code to a minimum.  Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of DIMMs
at zone boundaries.

--------------------------------------------------------------------------

Zones are now properly shrunk when offlining memory blocks or when
onlining failed.  This allows zones to be properly shrunk on memory
unplug, even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.

Example:

:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  0
        present  0
        managed  0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  98304
        present  65536
        managed  65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  32768
        present  32768
        managed  32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone  Movable
        spanned  0
        present  0
        managed  0


This patch (of 6):

The third argument is actually the number of pages.  Change the variable
name from size to nr_pages to indicate this better.

No functional change in this patch.

Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/page_alloc.c~mm-memmap_init-update-variable-name-in-memmap_init_zone
+++ a/mm/page_alloc.c
@@ -5946,10 +5946,10 @@ void __meminit memmap_init_zone(unsigned
 #ifdef CONFIG_ZONE_DEVICE
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
-				   unsigned long size,
+				   unsigned long nr_pages,
 				   struct dev_pagemap *pgmap)
 {
-	unsigned long pfn, end_pfn = start_pfn + size;
+	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
@@ -5966,7 +5966,7 @@ void __ref memmap_init_zone_device(struc
 	 */
 	if (altmap) {
 		start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
-		size = end_pfn - start_pfn;
+		nr_pages = end_pfn - start_pfn;
 	}
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
@@ -6013,7 +6013,7 @@ void __ref memmap_init_zone_device(struc
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-		size, jiffies_to_msecs(jiffies - start));
+		nr_pages, jiffies_to_msecs(jiffies - start));
 }
 
 #endif
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 08/67] mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2020-02-04  1:34 ` [patch 07/67] mm/memmap_init: update variable name in memmap_init_zone Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 09/67] mm/memory_hotplug: we always have a zone in find_(smallest|biggest)_section_pfn Andrew Morton
                   ` (230 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, aneesh.kumar, dan.j.williams, david, gregkh, linux-mm,
	logang, mhocko, mm-commits, osalvador, pagupta, pasha.tatashin,
	torvalds, willy

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()

Let's poison the pages similar to when adding new memory in
sparse_add_section().  Also call remove_pfn_range_from_zone() from
memunmap_pages(), so we can poison the memmap from there as well.
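
For background, page_init_poison() is essentially a guarded memset of
the struct page range; simplified from mm/debug.c (quoted from memory,
so treat as a sketch):

	void page_init_poison(struct page *page, size_t size)
	{
		if (page_init_poisoning)
			memset(page, PAGE_POISON_PATTERN, size);
	}

Any later access to the poisoned memmap is then caught by the usual
poison checks instead of silently reading stale data.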

Link: http://lkml.kernel.org/r/20191006085646.5768-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    3 +++
 mm/memremap.c       |    2 ++
 2 files changed, 5 insertions(+)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-poison-memmap-in-remove_pfn_range_from_zone
+++ a/mm/memory_hotplug.c
@@ -490,6 +490,9 @@ void __ref remove_pfn_range_from_zone(st
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	unsigned long flags;
 
+	/* Poison struct pages because they are now uninitialized again. */
+	page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * nr_pages);
+
 #ifdef CONFIG_ZONE_DEVICE
 	/*
 	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
--- a/mm/memremap.c~mm-memory_hotplug-poison-memmap-in-remove_pfn_range_from_zone
+++ a/mm/memremap.c
@@ -120,6 +120,8 @@ void memunmap_pages(struct dev_pagemap *
 	nid = page_to_nid(first_page);
 
 	mem_hotplug_begin();
+	remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(res->start),
+				   PHYS_PFN(resource_size(res)));
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		__remove_pages(PHYS_PFN(res->start),
 			       PHYS_PFN(resource_size(res)), NULL);
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 09/67] mm/memory_hotplug: we always have a zone in find_(smallest|biggest)_section_pfn
  2020-02-04  1:33 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2020-02-04  1:34 ` [patch 08/67] mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone() Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 10/67] mm/memory_hotplug: don't check for "all holes" in shrink_zone_span() Andrew Morton
                   ` (229 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, aneesh.kumar, dan.j.williams, david, gregkh, linux-mm,
	logang, mhocko, mm-commits, osalvador, pagupta, pasha.tatashin,
	torvalds, willy

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: we always have a zone in find_(smallest|biggest)_section_pfn

With shrink_pgdat_span() out of the way, we now always have a valid zone.

Link: http://lkml.kernel.org/r/20191006085646.5768-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-we-always-have-a-zone-in-find_smallestbiggest_section_pfn
+++ a/mm/memory_hotplug.c
@@ -355,7 +355,7 @@ static unsigned long find_smallest_secti
 		if (unlikely(pfn_to_nid(start_pfn) != nid))
 			continue;
 
-		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+		if (zone != page_zone(pfn_to_page(start_pfn)))
 			continue;
 
 		return start_pfn;
@@ -380,7 +380,7 @@ static unsigned long find_biggest_sectio
 		if (unlikely(pfn_to_nid(pfn) != nid))
 			continue;
 
-		if (zone && zone != page_zone(pfn_to_page(pfn)))
+		if (zone != page_zone(pfn_to_page(pfn)))
 			continue;
 
 		return pfn;
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 10/67] mm/memory_hotplug: don't check for "all holes" in shrink_zone_span()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2020-02-04  1:34 ` [patch 09/67] mm/memory_hotplug: we always have a zone in find_(smallest|biggest)_section_pfn Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 11/67] mm/memory_hotplug: drop local variables " Andrew Morton
                   ` (228 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, aneesh.kumar, dan.j.williams, david, gregkh, linux-mm,
	logang, mhocko, mm-commits, osalvador, pagupta, pasha.tatashin,
	torvalds, willy

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: don't check for "all holes" in shrink_zone_span()

If we have holes, the holes will automatically get detected and skipped
once we remove the next bigger/smaller section: the
find_(smallest|biggest)_section_pfn() helpers only return pfns that are
online and belong to the zone.  The extra "all holes" scan can go.

Link: http://lkml.kernel.org/r/20191006085646.5768-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   34 +++++++---------------------------
 1 file changed, 7 insertions(+), 27 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-dont-check-for-all-holes-in-shrink_zone_span
+++ a/mm/memory_hotplug.c
@@ -411,6 +411,9 @@ static void shrink_zone_span(struct zone
 		if (pfn) {
 			zone->zone_start_pfn = pfn;
 			zone->spanned_pages = zone_end_pfn - pfn;
+		} else {
+			zone->zone_start_pfn = 0;
+			zone->spanned_pages = 0;
 		}
 	} else if (zone_end_pfn == end_pfn) {
 		/*
@@ -423,34 +426,11 @@ static void shrink_zone_span(struct zone
 					       start_pfn);
 		if (pfn)
 			zone->spanned_pages = pfn - zone_start_pfn + 1;
+		else {
+			zone->zone_start_pfn = 0;
+			zone->spanned_pages = 0;
+		}
 	}
-
-	/*
-	 * The section is not biggest or smallest mem_section in the zone, it
-	 * only creates a hole in the zone. So in this case, we need not
-	 * change the zone. But perhaps, the zone has only hole data. Thus
-	 * it check the zone has only hole or not.
-	 */
-	pfn = zone_start_pfn;
-	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
-		if (unlikely(!pfn_to_online_page(pfn)))
-			continue;
-
-		if (page_zone(pfn_to_page(pfn)) != zone)
-			continue;
-
-		/* Skip range to be removed */
-		if (pfn >= start_pfn && pfn < end_pfn)
-			continue;
-
-		/* If we find valid section, we have nothing to do */
-		zone_span_writeunlock(zone);
-		return;
-	}

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 11/67] mm/memory_hotplug: drop local variables in shrink_zone_span()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2020-02-04  1:34 ` [patch 10/67] mm/memory_hotplug: don't check for "all holes" in shrink_zone_span() Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 12/67] mm/memory_hotplug: cleanup __remove_pages() Andrew Morton
                   ` (227 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, aneesh.kumar, dan.j.williams, david, gregkh, linux-mm,
	logang, mhocko, mm-commits, osalvador, pagupta, pasha.tatashin,
	torvalds, willy

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: drop local variables in shrink_zone_span()

Get rid of the unnecessary local variables.

Link: http://lkml.kernel.org/r/20191006085646.5768-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-drop-local-variables-in-shrink_zone_span
+++ a/mm/memory_hotplug.c
@@ -392,14 +392,11 @@ static unsigned long find_biggest_sectio
 static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 			     unsigned long end_pfn)
 {
-	unsigned long zone_start_pfn = zone->zone_start_pfn;
-	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
-	unsigned long zone_end_pfn = z;
 	unsigned long pfn;
 	int nid = zone_to_nid(zone);
 
 	zone_span_writelock(zone);
-	if (zone_start_pfn == start_pfn) {
+	if (zone->zone_start_pfn == start_pfn) {
 		/*
 		 * If the section is smallest section in the zone, it need
 		 * shrink zone->zone_start_pfn and zone->zone_spanned_pages.
@@ -407,25 +404,25 @@ static void shrink_zone_span(struct zone
 		 * for shrinking zone.
 		 */
 		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
-						zone_end_pfn);
+						zone_end_pfn(zone));
 		if (pfn) {
+			zone->spanned_pages = zone_end_pfn(zone) - pfn;
 			zone->zone_start_pfn = pfn;
-			zone->spanned_pages = zone_end_pfn - pfn;
 		} else {
 			zone->zone_start_pfn = 0;
 			zone->spanned_pages = 0;
 		}
-	} else if (zone_end_pfn == end_pfn) {
+	} else if (zone_end_pfn(zone) == end_pfn) {
 		/*
 		 * If the section is biggest section in the zone, it need
 		 * shrink zone->spanned_pages.
 		 * In this case, we find second biggest valid mem_section for
 		 * shrinking zone.
 		 */
-		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+		pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
 					       start_pfn);
 		if (pfn)
-			zone->spanned_pages = pfn - zone_start_pfn + 1;
+			zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
 		else {
 			zone->zone_start_pfn = 0;
 			zone->spanned_pages = 0;
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 12/67] mm/memory_hotplug: cleanup __remove_pages()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2020-02-04  1:34 ` [patch 11/67] mm/memory_hotplug: drop local variables " Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 13/67] mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone() Andrew Morton
                   ` (226 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, aneesh.kumar, dan.j.williams, david, gregkh, linux-mm,
	logang, mhocko, mm-commits, osalvador, pagupta, pasha.tatashin,
	torvalds, willy

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: cleanup __remove_pages()

Let's drop the basically unused section stuff and simplify.

Also, let's use a shorter variant to calculate the number of pages to
the next section boundary.
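
For reference, the resulting loop computes the per-iteration count
roughly as follows (reconstructed from memory, so treat as a sketch;
the diff below does not show the replacement in full):

	for (; pfn < end_pfn; pfn += cur_nr_pages) {
		cond_resched();
		/* select all remaining pages up to the next section
		 * boundary, capped at the end of the range */
		cur_nr_pages = min(end_pfn - pfn,
				   SECTION_ALIGN_UP(pfn + 1) - pfn);
		__remove_section(pfn, cur_nr_pages, map_offset, altmap);
		map_offset = 0;
	}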

Link: http://lkml.kernel.org/r/20191006085646.5768-11-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |   17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-cleanup-__remove_pages
+++ a/mm/memory_hotplug.c
@@ -516,25 +516,20 @@ static void __remove_section(unsigned lo
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
 		    struct vmem_altmap *altmap)
 {
+	const unsigned long end_pfn = pfn + nr_pages;
+	unsigned long cur_nr_pages;
 	unsigned long map_offset = 0;
-	unsigned long nr, start_sec, end_sec;
 
 	map_offset = vmem_altmap_offset(altmap);
 
 	if (check_pfn_span(pfn, nr_pages, "remove"))
 		return;
 
-	start_sec = pfn_to_section_nr(pfn);
-	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
-	for (nr = start_sec; nr <= end_sec; nr++) {
-		unsigned long pfns;

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 13/67] mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2020-02-04  1:34 ` [patch 12/67] mm/memory_hotplug: cleanup __remove_pages() Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 14/67] smp_mb__{before,after}_atomic(): update Documentation Andrew Morton
                   ` (225 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, david, gregkh, linux-mm, mhocko, mm-commits, osalvador,
	rafael, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone()

The callers are only interested in the actual zone, they don't care about
boundaries.  Return the zone instead to simplify.
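
The simplified calling convention then reads (as used by
__offline_pages() in the diff below):

	zone = test_pages_in_a_zone(start_pfn, end_pfn);
	if (!zone) {
		/* the range spans more than one zone */
		ret = -EINVAL;
		reason = "multizone range";
	}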

Link: http://lkml.kernel.org/r/20200110183308.11849-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c          |    9 ++++-----
 include/linux/memory_hotplug.h |    4 ++--
 mm/memory_hotplug.c            |   31 +++++++++----------------------
 3 files changed, 15 insertions(+), 29 deletions(-)

--- a/drivers/base/memory.c~mm-memory_hotplug-drop-valid_start-valid_end-from-test_pages_in_a_zone
+++ a/drivers/base/memory.c
@@ -376,7 +376,6 @@ static ssize_t valid_zones_show(struct d
 	struct memory_block *mem = to_memory_block(dev);
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
-	unsigned long valid_start_pfn, valid_end_pfn;
 	struct zone *default_zone;
 	int nid;
 
@@ -389,11 +388,11 @@ static ssize_t valid_zones_show(struct d
 		 * The block contains more than one zone can not be offlined.
 		 * This can happen e.g. for ZONE_DMA and ZONE_DMA32
 		 */
-		if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages,
-					  &valid_start_pfn, &valid_end_pfn))
+		default_zone = test_pages_in_a_zone(start_pfn,
+						    start_pfn + nr_pages);
+		if (!default_zone)
 			return sprintf(buf, "none\n");
-		start_pfn = valid_start_pfn;
-		strcat(buf, page_zone(pfn_to_page(start_pfn))->name);
+		strcat(buf, default_zone->name);
 		goto out;
 	}
 
--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-drop-valid_start-valid_end-from-test_pages_in_a_zone
+++ a/include/linux/memory_hotplug.h
@@ -96,8 +96,8 @@ extern int add_one_highpage(struct page
 /* VM interface that may be used by firmware interface */
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
 			int online_type, int nid);
-extern int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
-	unsigned long *valid_start, unsigned long *valid_end);
+extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
+					 unsigned long end_pfn);
 extern unsigned long __offline_isolated_pages(unsigned long start_pfn,
 						unsigned long end_pfn);
 
--- a/mm/memory_hotplug.c~mm-memory_hotplug-drop-valid_start-valid_end-from-test_pages_in_a_zone
+++ a/mm/memory_hotplug.c
@@ -1172,14 +1172,13 @@ bool is_mem_section_removable(unsigned l
 }
 
 /*
- * Confirm all pages in a range [start, end) belong to the same zone.
- * When true, return its valid [start, end).
+ * Confirm all pages in a range [start, end) belong to the same zone (skipping
+ * memory holes). When true, return the zone.
  */
-int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
-			 unsigned long *valid_start, unsigned long *valid_end)
+struct zone *test_pages_in_a_zone(unsigned long start_pfn,
+				  unsigned long end_pfn)
 {
 	unsigned long pfn, sec_end_pfn;
-	unsigned long start, end;
 	struct zone *zone = NULL;
 	struct page *page;
 	int i;
@@ -1200,24 +1199,15 @@ int test_pages_in_a_zone(unsigned long s
 				continue;
 			/* Check if we got outside of the zone */
 			if (zone && !zone_spans_pfn(zone, pfn + i))
-				return 0;
+				return NULL;
 			page = pfn_to_page(pfn + i);
 			if (zone && page_zone(page) != zone)
-				return 0;
-			if (!zone)
-				start = pfn + i;
+				return NULL;
 			zone = page_zone(page);
-			end = pfn + MAX_ORDER_NR_PAGES;
 		}
 	}
 
-	if (zone) {
-		*valid_start = start;
-		*valid_end = min(end, end_pfn);
-		return 1;
-	} else {
-		return 0;
-	}
+	return zone;
 }
 
 /*
@@ -1462,7 +1452,6 @@ static int __ref __offline_pages(unsigne
 	unsigned long offlined_pages = 0;
 	int ret, node, nr_isolate_pageblock;
 	unsigned long flags;
-	unsigned long valid_start, valid_end;
 	struct zone *zone;
 	struct memory_notify arg;
 	char *reason;
@@ -1487,14 +1476,12 @@ static int __ref __offline_pages(unsigne
 
 	/* This makes hotplug much easier...and readable.
 	   we assume this for now. .*/
-	if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
-				  &valid_end)) {
+	zone = test_pages_in_a_zone(start_pfn, end_pfn);
+	if (!zone) {
 		ret = -EINVAL;
 		reason = "multizone range";
 		goto failed_removal;
 	}

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 14/67] smp_mb__{before,after}_atomic(): update Documentation
  2020-02-04  1:33 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2020-02-04  1:34 ` [patch 13/67] mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone() Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 15/67] ipc/mqueue.c: remove duplicated code Andrew Morton
                   ` (224 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: 1vier1, akpm, dave, linux-mm, longman, manfred, mm-commits,
	peterz, torvalds, will.deacon

From: Manfred Spraul <manfred@colorfullife.com>
Subject: smp_mb__{before,after}_atomic(): update Documentation

When adding the _{acquire|release|relaxed}() variants of some atomic
operations, it was forgotten to update Documentation/memory-barriers.txt:

smp_mb__{before,after}_atomic() is now intended for all RMW operations
that do not imply a memory barrier.

1)
	smp_mb__before_atomic();
	atomic_add();

2)
	smp_mb__before_atomic();
	atomic_xchg_relaxed();

3)
	smp_mb__before_atomic();
	atomic_fetch_add_relaxed();

Invalid would be:
	smp_mb__before_atomic();
	atomic_set();

In addition, the patch splits the long sentence into multiple shorter
sentences.
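
For reference, the existing example in memory-barriers.txt that this
text leads into looks roughly like this (quoted from memory, so treat
as a sketch):

	obj->dead = 1;
	smp_mb__before_atomic();	/* full barrier before the RMW */
	atomic_dec(&obj->ref_count);

The barrier ensures the death mark on the object is visible to other
CPUs before the reference count is decremented.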

Link: http://lkml.kernel.org/r/20191020123305.14715-2-manfred@colorfullife.com
Fixes: 654672d4ba1a ("locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations")
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <1vier1@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/memory-barriers.txt |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

--- a/Documentation/memory-barriers.txt~smp_mb__beforeafter_atomic-update-documentation
+++ a/Documentation/memory-barriers.txt
@@ -1868,12 +1868,16 @@ There are some more advanced barrier fun
  (*) smp_mb__before_atomic();
  (*) smp_mb__after_atomic();
 
-     These are for use with atomic (such as add, subtract, increment and
-     decrement) functions that don't return a value, especially when used for
-     reference counting.  These functions do not imply memory barriers.
+     These are for use with atomic RMW functions that do not imply memory
+     barriers, but where the code needs a memory barrier. Examples for atomic
+     RMW functions that do not imply a memory barrier are e.g. add,
+     subtract, (failed) conditional operations, _relaxed functions,
+     but not atomic_read or atomic_set. A common example where a memory
+     barrier may be required is when atomic ops are used for reference
+     counting.
 
-     These are also used for atomic bitop functions that do not return a
-     value (such as set_bit and clear_bit).
+     These are also used for atomic RMW bitop functions that do not imply a
+     memory barrier (such as set_bit and clear_bit).
 
      As an example, consider a piece of code that marks an object as being dead
      and then decrements the object's reference count:
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 15/67] ipc/mqueue.c: remove duplicated code
  2020-02-04  1:33 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2020-02-04  1:34 ` [patch 14/67] smp_mb__{before,after}_atomic(): update Documentation Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 16/67] ipc/mqueue.c: update/document memory barriers Andrew Morton
                   ` (223 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: 1vier1, akpm, dave, linux-mm, longman, manfred, mm-commits,
	peterz, torvalds, will.deacon

From: Davidlohr Bueso <dave@stgolabs.net>
Subject: ipc/mqueue.c: remove duplicated code

pipelined_send() and pipelined_receive() end in an identical wake-up
sequence, so factor it out into a common __pipelined_op() helper.

[manfred@colorfullife.com: add changelog]
Link: http://lkml.kernel.org/r/20191020123305.14715-3-manfred@colorfullife.com
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/mqueue.c |   31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

--- a/ipc/mqueue.c~ipc-mqueuec-remove-duplicated-code
+++ a/ipc/mqueue.c
@@ -918,17 +918,12 @@ out_name:
  * The same algorithm is used for senders.
  */
 
-/* pipelined_send() - send a message directly to the task waiting in
- * sys_mq_timedreceive() (without inserting message into a queue).
- */
-static inline void pipelined_send(struct wake_q_head *wake_q,
+static inline void __pipelined_op(struct wake_q_head *wake_q,
 				  struct mqueue_inode_info *info,
-				  struct msg_msg *message,
-				  struct ext_wait_queue *receiver)
+				  struct ext_wait_queue *this)
 {
-	receiver->msg = message;
-	list_del(&receiver->list);
-	wake_q_add(wake_q, receiver->task);
+	list_del(&this->list);
+	wake_q_add(wake_q, this->task);
 	/*
 	 * Rely on the implicit cmpxchg barrier from wake_q_add such
 	 * that we can ensure that updating receiver->state is the last
@@ -937,7 +932,19 @@ static inline void pipelined_send(struct
 	 * yet, at that point we can later have a use-after-free
 	 * condition and bogus wakeup.
 	 */
-	receiver->state = STATE_READY;
+	this->state = STATE_READY;
+}
+
+/* pipelined_send() - send a message directly to the task waiting in
+ * sys_mq_timedreceive() (without inserting message into a queue).
+ */
+static inline void pipelined_send(struct wake_q_head *wake_q,
+				  struct mqueue_inode_info *info,
+				  struct msg_msg *message,
+				  struct ext_wait_queue *receiver)
+{
+	receiver->msg = message;
+	__pipelined_op(wake_q, info, receiver);
 }
 
 /* pipelined_receive() - if there is task waiting in sys_mq_timedsend()
@@ -955,9 +962,7 @@ static inline void pipelined_receive(str
 	if (msg_insert(sender->msg, info))
 		return;
 
-	list_del(&sender->list);
-	wake_q_add(wake_q, sender->task);
-	sender->state = STATE_READY;
+	__pipelined_op(wake_q, info, sender);
 }
 
 static int do_mq_timedsend(mqd_t mqdes, const char __user *u_msg_ptr,
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 16/67] ipc/mqueue.c: update/document memory barriers
  2020-02-04  1:33 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2020-02-04  1:34 ` [patch 15/67] ipc/mqueue.c: remove duplicated code Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 17/67] ipc/msg.c: update and document " Andrew Morton
                   ` (222 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: 1vier1, akpm, dbueso, linux-mm, longman, manfred, mm-commits,
	peterz, torvalds, will.deacon

From: Manfred Spraul <manfred@colorfullife.com>
Subject: ipc/mqueue.c: update/document memory barriers

Update and document memory barriers for mqueue.c:

- ewp->state is read without any locks, thus READ_ONCE is required.

- add smp_acquire__after_ctrl_dep() after the READ_ONCE; we need
  acquire semantics if the value is STATE_READY.

- use wake_q_add_safe()

- document why __set_current_state() may be used:
  Reading task->state cannot happen before the wake_q_add() call,
  which happens while holding info->lock. Thus the spin_unlock()
  is the RELEASE, and the spin_lock() is the ACQUIRE.

For completeness: there is also a 3-CPU scenario, if the task to be
woken up is already on another wake_q.
Then:
- CPU1: spin_unlock() of the task that goes to sleep is the RELEASE
- CPU2: the spin_lock() of the waker is the ACQUIRE
- CPU2: smp_mb__before_atomic inside wake_q_add() is the RELEASE
- CPU3: smp_mb__after_spinlock() inside try_to_wake_up() is the ACQUIRE
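
Condensed, the release/acquire pairing that the new MQ_BARRIER block
documents looks like this (both halves appear in the diff below):

	/* waker, info->lock held */
	get_task_struct(this->task);
	smp_store_release(&this->state, STATE_READY);	/* RELEASE */
	wake_q_add_safe(wake_q, this->task);

	/* sleeper, lockless exit path */
	if (READ_ONCE(ewp->state) == STATE_READY) {
		smp_acquire__after_ctrl_dep();		/* ACQUIRE */
		retval = 0;
		goto out;
	}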

Link: http://lkml.kernel.org/r/20191020123305.14715-4-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Waiman Long <longman@redhat.com>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/mqueue.c |   92 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 78 insertions(+), 14 deletions(-)

--- a/ipc/mqueue.c~ipc-mqueuec-update-document-memory-barriers
+++ a/ipc/mqueue.c
@@ -63,6 +63,66 @@ struct posix_msg_tree_node {
 	int			priority;
 };
 
+/*
+ * Locking:
+ *
+ * Accesses to a message queue are synchronized by acquiring info->lock.
+ *
+ * There are two notable exceptions:
+ * - The actual wakeup of a sleeping task is performed using the wake_q
+ *   framework. info->lock is already released when wake_up_q is called.
+ * - The exit codepaths after sleeping check ext_wait_queue->state without
+ *   any locks. If it is STATE_READY, then the syscall is completed without
+ *   acquiring info->lock.
+ *
+ * MQ_BARRIER:
+ * To achieve proper release/acquire memory barrier pairing, the state is set to
+ * STATE_READY with smp_store_release(), and it is read with READ_ONCE followed
+ * by smp_acquire__after_ctrl_dep(). In addition, wake_q_add_safe() is used.
+ *
+ * This prevents the following races:
+ *
+ * 1) With the simple wake_q_add(), the task could be gone already before
+ *    the increase of the reference happens
+ * Thread A
+ *				Thread B
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ *				wake_q_add(A)
+ *				if (cmpxchg()) // success
+ *				   ->state = STATE_READY (reordered)
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * sysret to user space
+ * sys_exit()
+ *				get_task_struct() // UaF
+ *
+ * Solution: Use wake_q_add_safe() and perform the get_task_struct() before
+ * the smp_store_release() that does ->state = STATE_READY.
+ *
+ * 2) Without proper _release/_acquire barriers, the woken up task
+ *    could read stale data
+ *
+ * Thread A
+ *				Thread B
+ * do_mq_timedreceive
+ * WRITE_ONCE(wait.state, STATE_NONE);
+ * schedule_hrtimeout()
+ *				state = STATE_READY;
+ * <timeout returns>
+ * if (wait.state == STATE_READY) return;
+ * msg_ptr = wait.msg;		// Access to stale data!
+ *				receiver->msg = message; (reordered)
+ *
+ * Solution: use _release and _acquire barriers.
+ *
+ * 3) There is intentionally no barrier when setting current->state
+ *    to TASK_INTERRUPTIBLE: spin_unlock(&info->lock) provides the
+ *    release memory barrier, and the wakeup is triggered when holding
+ *    info->lock, i.e. spin_lock(&info->lock) provided a pairing
+ *    acquire memory barrier.
+ */
+
 struct ext_wait_queue {		/* queue of sleeping tasks */
 	struct task_struct *task;
 	struct list_head list;
@@ -646,18 +706,23 @@ static int wq_sleep(struct mqueue_inode_
 	wq_add(info, sr, ewp);
 
 	for (;;) {
+		/* memory barrier not required, we hold info->lock */
 		__set_current_state(TASK_INTERRUPTIBLE);
 
 		spin_unlock(&info->lock);
 		time = schedule_hrtimeout_range_clock(timeout, 0,
 			HRTIMER_MODE_ABS, CLOCK_REALTIME);
 
-		if (ewp->state == STATE_READY) {
+		if (READ_ONCE(ewp->state) == STATE_READY) {
+			/* see MQ_BARRIER for purpose/pairing */
+			smp_acquire__after_ctrl_dep();
 			retval = 0;
 			goto out;
 		}
 		spin_lock(&info->lock);
-		if (ewp->state == STATE_READY) {
+
+		/* we hold info->lock, so no memory barrier required */
+		if (READ_ONCE(ewp->state) == STATE_READY) {
 			retval = 0;
 			goto out_unlock;
 		}
@@ -923,16 +988,11 @@ static inline void __pipelined_op(struct
 				  struct ext_wait_queue *this)
 {
 	list_del(&this->list);
-	wake_q_add(wake_q, this->task);
-	/*
-	 * Rely on the implicit cmpxchg barrier from wake_q_add such
-	 * that we can ensure that updating receiver->state is the last
-	 * write operation: As once set, the receiver can continue,
-	 * and if we don't have the reference count from the wake_q,
-	 * yet, at that point we can later have a use-after-free
-	 * condition and bogus wakeup.
-	 */
-	this->state = STATE_READY;
+	get_task_struct(this->task);
+
+	/* see MQ_BARRIER for purpose/pairing */
+	smp_store_release(&this->state, STATE_READY);
+	wake_q_add_safe(wake_q, this->task);
 }
 
 /* pipelined_send() - send a message directly to the task waiting in
@@ -1049,7 +1109,9 @@ static int do_mq_timedsend(mqd_t mqdes,
 		} else {
 			wait.task = current;
 			wait.msg = (void *) msg_ptr;
-			wait.state = STATE_NONE;
+
+			/* memory barrier not required, we hold info->lock */
+			WRITE_ONCE(wait.state, STATE_NONE);
 			ret = wq_sleep(info, SEND, timeout, &wait);
 			/*
 			 * wq_sleep must be called with info->lock held, and
@@ -1152,7 +1214,9 @@ static int do_mq_timedreceive(mqd_t mqde
 			ret = -EAGAIN;
 		} else {
 			wait.task = current;
-			wait.state = STATE_NONE;
+
+			/* memory barrier not required, we hold info->lock */
+			WRITE_ONCE(wait.state, STATE_NONE);
 			ret = wq_sleep(info, RECV, timeout, &wait);
 			msg_ptr = wait.msg;
 		}
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 17/67] ipc/msg.c: update and document memory barriers
  2020-02-04  1:33 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2020-02-04  1:34 ` [patch 16/67] ipc/mqueue.c: update/document memory barriers Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 18/67] ipc/sem.c: document and update " Andrew Morton
                   ` (221 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: 1vier1, akpm, dave, linux-mm, longman, manfred, mm-commits,
	peterz, torvalds, will.deacon

From: Manfred Spraul <manfred@colorfullife.com>
Subject: ipc/msg.c: update and document memory barriers

Transfer findings from ipc/mqueue.c:

- A control barrier was missing for the lockless receive case.  So in
  theory, not yet initialized data may have been copied to user space -
  obviously only for architectures where control barriers are not a NOP.

- use smp_store_release().  In theory, the refcount may have been
  decreased to 0 already when wake_q_add() tries to get a reference.
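
Condensed, the fixed lockless receive path (shown in full in the diff
below):

	msg = READ_ONCE(msr_d.r_msg);
	if (msg != ERR_PTR(-EAGAIN)) {
		/* pairs with smp_store_release() in the sender */
		smp_acquire__after_ctrl_dep();
		goto out_unlock1;
	}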

Link: http://lkml.kernel.org/r/20191020123305.14715-5-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/msg.c |   43 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

--- a/ipc/msg.c~ipc-msgc-update-and-document-memory-barriers
+++ a/ipc/msg.c
@@ -61,6 +61,16 @@ struct msg_queue {
 	struct list_head q_senders;
 } __randomize_layout;
 
+/*
+ * MSG_BARRIER Locking:
+ *
+ * Similar to the optimization used in ipc/mqueue.c, one syscall return path
+ * does not acquire any locks when it sees that a message exists in
+ * msg_receiver.r_msg. Therefore r_msg is set using smp_store_release()
+ * and accessed using READ_ONCE()+smp_acquire__after_ctrl_dep(). In addition,
+ * wake_q_add_safe() is used. See ipc/mqueue.c for more details
+ */
+
 /* one msg_receiver structure for each sleeping receiver */
 struct msg_receiver {
 	struct list_head	r_list;
@@ -184,6 +194,10 @@ static inline void ss_add(struct msg_que
 {
 	mss->tsk = current;
 	mss->msgsz = msgsz;
+	/*
+	 * No memory barrier required: we did ipc_lock_object(),
+	 * and the waker obtains that lock before calling wake_q_add().
+	 */
 	__set_current_state(TASK_INTERRUPTIBLE);
 	list_add_tail(&mss->list, &msq->q_senders);
 }
@@ -237,8 +251,11 @@ static void expunge_all(struct msg_queue
 	struct msg_receiver *msr, *t;
 
 	list_for_each_entry_safe(msr, t, &msq->q_receivers, r_list) {
-		wake_q_add(wake_q, msr->r_tsk);
-		WRITE_ONCE(msr->r_msg, ERR_PTR(res));
+		get_task_struct(msr->r_tsk);
+
+		/* see MSG_BARRIER for purpose/pairing */
+		smp_store_release(&msr->r_msg, ERR_PTR(res));
+		wake_q_add_safe(wake_q, msr->r_tsk);
 	}
 }
 
@@ -798,13 +815,17 @@ static inline int pipelined_send(struct
 			list_del(&msr->r_list);
 			if (msr->r_maxsize < msg->m_ts) {
 				wake_q_add(wake_q, msr->r_tsk);
-				WRITE_ONCE(msr->r_msg, ERR_PTR(-E2BIG));
+
+				/* See expunge_all regarding memory barrier */
+				smp_store_release(&msr->r_msg, ERR_PTR(-E2BIG));
 			} else {
 				ipc_update_pid(&msq->q_lrpid, task_pid(msr->r_tsk));
 				msq->q_rtime = ktime_get_real_seconds();
 
 				wake_q_add(wake_q, msr->r_tsk);
-				WRITE_ONCE(msr->r_msg, msg);
+
+				/* See expunge_all regarding memory barrier */
+				smp_store_release(&msr->r_msg, msg);
 				return 1;
 			}
 		}
@@ -1154,7 +1175,11 @@ static long do_msgrcv(int msqid, void __
 			msr_d.r_maxsize = INT_MAX;
 		else
 			msr_d.r_maxsize = bufsz;
-		msr_d.r_msg = ERR_PTR(-EAGAIN);
+
+		/* memory barrier not required due to ipc_lock_object() */
+		WRITE_ONCE(msr_d.r_msg, ERR_PTR(-EAGAIN));
+
+		/* memory barrier not required, we own ipc_lock_object() */
 		__set_current_state(TASK_INTERRUPTIBLE);
 
 		ipc_unlock_object(&msq->q_perm);
@@ -1183,8 +1208,12 @@ static long do_msgrcv(int msqid, void __
 		 * signal) it will either see the message and continue ...
 		 */
 		msg = READ_ONCE(msr_d.r_msg);
-		if (msg != ERR_PTR(-EAGAIN))
+		if (msg != ERR_PTR(-EAGAIN)) {
+			/* see MSG_BARRIER for purpose/pairing */
+			smp_acquire__after_ctrl_dep();
+
 			goto out_unlock1;
+		}
 
 		 /*
 		  * ... or see -EAGAIN, acquire the lock to check the message
@@ -1192,7 +1221,7 @@ static long do_msgrcv(int msqid, void __
 		  */
 		ipc_lock_object(&msq->q_perm);
 
-		msg = msr_d.r_msg;
+		msg = READ_ONCE(msr_d.r_msg);
 		if (msg != ERR_PTR(-EAGAIN))
 			goto out_unlock0;
 
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 18/67] ipc/sem.c: document and update memory barriers
  2020-02-04  1:33 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2020-02-04  1:34 ` [patch 17/67] ipc/msg.c: update and document " Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 19/67] ipc/msg.c: consolidate all xxxctl_down() functions Andrew Morton
                   ` (220 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: 1vier1, akpm, dave, linux-mm, longman, manfred, mm-commits,
	peterz, torvalds, will.deacon

From: Manfred Spraul <manfred@colorfullife.com>
Subject: ipc/sem.c: document and update memory barriers

Document and update the memory barriers in ipc/sem.c:

- Add smp_store_release() to wake_up_sem_queue_prepare() and
  document why it is needed.

- Read q->status using READ_ONCE+smp_acquire__after_ctrl_dep().
  as the pair for the barrier inside wake_up_sem_queue_prepare().

- Add comments to all barriers, and mention the rules in the block
  regarding locking.

- Switch to using wake_q_add_safe().
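
The resulting pairing, condensed from the diff below:

	/* waker */
	get_task_struct(q->sleeper);
	smp_store_release(&q->status, error);	/* SEM_BARRIER_2, RELEASE */
	wake_q_add_safe(wake_q, q->sleeper);

	/* sleeper, lockless exit (case a) */
	error = READ_ONCE(queue.status);
	if (error != -EINTR) {
		smp_acquire__after_ctrl_dep();	/* pairs with the RELEASE */
		goto out_free;
	}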

Link: http://lkml.kernel.org/r/20191020123305.14715-6-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <1vier1@web.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/sem.c |   66 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 25 deletions(-)

--- a/ipc/sem.c~ipc-semc-document-and-update-memory-barriers
+++ a/ipc/sem.c
@@ -205,15 +205,38 @@ static int sysvipc_sem_proc_show(struct
  *
  * Memory ordering:
  * Most ordering is enforced by using spin_lock() and spin_unlock().
- * The special case is use_global_lock:
+ *
+ * Exceptions:
+ * 1) use_global_lock: (SEM_BARRIER_1)
  * Setting it from non-zero to 0 is a RELEASE, this is ensured by
- * using smp_store_release().
+ * using smp_store_release(): Immediately after setting it to 0,
+ * a simple op can start.
  * Testing if it is non-zero is an ACQUIRE, this is ensured by using
  * smp_load_acquire().
  * Setting it from 0 to non-zero must be ordered with regards to
  * this smp_load_acquire(), this is guaranteed because the smp_load_acquire()
  * is inside a spin_lock() and after a write from 0 to non-zero a
  * spin_lock()+spin_unlock() is done.
+ *
+ * 2) queue.status: (SEM_BARRIER_2)
+ * Initialization is done while holding sem_lock(), so no further barrier is
+ * required.
+ * Setting it to a result code is a RELEASE, this is ensured by both a
+ * smp_store_release() (for case a) and while holding sem_lock()
+ * (for case b).
+ * The ACQUIRE when reading the result code without holding sem_lock() is
+ * achieved by using READ_ONCE() + smp_acquire__after_ctrl_dep().
+ * (case a above).
+ * Reading the result code while holding sem_lock() needs no further barriers,
+ * the locks inside sem_lock() enforce ordering (case b above)
+ *
+ * 3) current->state:
+ * current->state is set to TASK_INTERRUPTIBLE while holding sem_lock().
+ * The wakeup is handled using the wake_q infrastructure. wake_q wakeups may
+ * happen immediately after calling wake_q_add. As wake_q_add_safe() is called
+ * when holding sem_lock(), no further barriers are required.
+ *
+ * See also ipc/mqueue.c for more details on the covered races.
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -344,12 +367,8 @@ static void complexmode_tryleave(struct
 		return;
 	}
 	if (sma->use_global_lock == 1) {
-		/*
-		 * Immediately after setting use_global_lock to 0,
-		 * a simple op can start. Thus: all memory writes
-		 * performed by the current operation must be visible
-		 * before we set use_global_lock to 0.
-		 */
+
+		/* See SEM_BARRIER_1 for purpose/pairing */
 		smp_store_release(&sma->use_global_lock, 0);
 	} else {
 		sma->use_global_lock--;
@@ -400,7 +419,7 @@ static inline int sem_lock(struct sem_ar
 		 */
 		spin_lock(&sem->lock);
 
-		/* pairs with smp_store_release() */
+		/* see SEM_BARRIER_1 for purpose/pairing */
 		if (!smp_load_acquire(&sma->use_global_lock)) {
 			/* fast path successful! */
 			return sops->sem_num;
@@ -766,15 +785,12 @@ would_block:
 static inline void wake_up_sem_queue_prepare(struct sem_queue *q, int error,
 					     struct wake_q_head *wake_q)
 {
-	wake_q_add(wake_q, q->sleeper);
-	/*
-	 * Rely on the above implicit barrier, such that we can
-	 * ensure that we hold reference to the task before setting
-	 * q->status. Otherwise we could race with do_exit if the
-	 * task is awoken by an external event before calling
-	 * wake_up_process().
-	 */
-	WRITE_ONCE(q->status, error);
+	get_task_struct(q->sleeper);
+
+	/* see SEM_BARRIER_2 for purpose/pairing */
+	smp_store_release(&q->status, error);
+
+	wake_q_add_safe(wake_q, q->sleeper);
 }
 
 static void unlink_queue(struct sem_array *sma, struct sem_queue *q)
@@ -2148,9 +2164,11 @@ static long do_semtimedop(int semid, str
 	}
 
 	do {
+		/* memory ordering ensured by the lock in sem_lock() */
 		WRITE_ONCE(queue.status, -EINTR);
 		queue.sleeper = current;
 
+		/* memory ordering is ensured by the lock in sem_lock() */
 		__set_current_state(TASK_INTERRUPTIBLE);
 		sem_unlock(sma, locknum);
 		rcu_read_unlock();
@@ -2173,13 +2191,8 @@ static long do_semtimedop(int semid, str
 		 */
 		error = READ_ONCE(queue.status);
 		if (error != -EINTR) {
-			/*
-			 * User space could assume that semop() is a memory
-			 * barrier: Without the mb(), the cpu could
-			 * speculatively read in userspace stale data that was
-			 * overwritten by the previous owner of the semaphore.
-			 */
-			smp_mb();
+			/* see SEM_BARRIER_2 for purpose/pairing */
+			smp_acquire__after_ctrl_dep();
 			goto out_free;
 		}
 
@@ -2189,6 +2202,9 @@ static long do_semtimedop(int semid, str
 		if (!ipc_valid_object(&sma->sem_perm))
 			goto out_unlock_free;
 
+		/*
+		 * No necessity for any barrier: we are protected by sem_lock().
+		 */
 		error = READ_ONCE(queue.status);
 
 		/*
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 19/67] ipc/msg.c: consolidate all xxxctl_down() functions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2020-02-04  1:34 ` [patch 18/67] ipc/sem.c: document and update " Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 20/67] drivers/block/null_blk_main.c: fix layout Andrew Morton
                   ` (219 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, arnd, axboe, dave, linux-mm, manfred, mm-commits,
	natechancellor, neilb, shli, shuaibinglu, torvalds

From: Lu Shuaibing <shuaibinglu@126.com>
Subject: ipc/msg.c: consolidate all xxxctl_down() functions

A use of uninitialized memory occurs in msgctl_down() because msqid64 in
ksys_msgctl() hasn't been initialized: the local variable msqid64 is
created in ksys_msgctl() and then passed into msgctl_down().  Along the
way msqid64 is never initialized before msgctl_down() checks
msqid64->msg_qbytes, yet the IPC_RMID case used to fall through into
that code.

KUMSAN (KernelUninitializedMemorySanitizer, a new error detection tool)
reports:

==================================================================
BUG: KUMSAN: use of uninitialized memory in msgctl_down+0x94/0x300
Read of size 8 at addr ffff88806bb97eb8 by task syz-executor707/2022

CPU: 0 PID: 2022 Comm: syz-executor707 Not tainted 5.2.0-rc4+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x75/0xae
 __kumsan_report+0x17c/0x3e6
 kumsan_report+0xe/0x20
 msgctl_down+0x94/0x300
 ksys_msgctl.constprop.14+0xef/0x260
 do_syscall_64+0x7e/0x1f0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4400e9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd869e0598 EFLAGS: 00000246 ORIG_RAX: 0000000000000047
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004400e9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000401970
R13: 0000000000401a00 R14: 0000000000000000 R15: 0000000000000000

The buggy address belongs to the page:
page:ffffea0001aee5c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 0000000000000000 ffffffff01ae0101 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kumsan: bad access detected
==================================================================

Syzkaller reproducer:
msgctl$IPC_RMID(0x0, 0x0)

C reproducer:
// autogenerated by syzkaller (https://github.com/google/syzkaller)

int main(void)
{
  syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
  syscall(__NR_msgctl, 0, 0, 0);
  return 0;
}

[natechancellor@gmail.com: adjust indentation in ksys_msgctl]
  Link: https://github.com/ClangBuiltLinux/linux/issues/829
  Link: http://lkml.kernel.org/r/20191218032932.37479-1-natechancellor@gmail.com
Link: http://lkml.kernel.org/r/20190613014044.24234-1-shuaibinglu@126.com
Signed-off-by: Lu Shuaibing <shuaibinglu@126.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/msg.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

--- a/ipc/msg.c~ipc-consolidate-all-xxxctl_down-functions
+++ a/ipc/msg.c
@@ -394,7 +394,7 @@ copy_msqid_from_user(struct msqid64_ds *
  * NOTE: no locks must be held, the rwsem is taken inside this function.
  */
 static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
-			struct msqid64_ds *msqid64)
+			struct ipc64_perm *perm, int msg_qbytes)
 {
 	struct kern_ipc_perm *ipcp;
 	struct msg_queue *msq;
@@ -404,7 +404,7 @@ static int msgctl_down(struct ipc_namesp
 	rcu_read_lock();
 
 	ipcp = ipcctl_obtain_check(ns, &msg_ids(ns), msqid, cmd,
-				      &msqid64->msg_perm, msqid64->msg_qbytes);
+				      perm, msg_qbytes);
 	if (IS_ERR(ipcp)) {
 		err = PTR_ERR(ipcp);
 		goto out_unlock1;
@@ -426,18 +426,18 @@ static int msgctl_down(struct ipc_namesp
 	{
 		DEFINE_WAKE_Q(wake_q);
 
-		if (msqid64->msg_qbytes > ns->msg_ctlmnb &&
+		if (msg_qbytes > ns->msg_ctlmnb &&
 		    !capable(CAP_SYS_RESOURCE)) {
 			err = -EPERM;
 			goto out_unlock1;
 		}
 
 		ipc_lock_object(&msq->q_perm);
-		err = ipc_update_perm(&msqid64->msg_perm, ipcp);
+		err = ipc_update_perm(perm, ipcp);
 		if (err)
 			goto out_unlock0;
 
-		msq->q_qbytes = msqid64->msg_qbytes;
+		msq->q_qbytes = msg_qbytes;
 
 		msq->q_ctime = ktime_get_real_seconds();
 		/*
@@ -618,9 +618,10 @@ static long ksys_msgctl(int msqid, int c
 	case IPC_SET:
 		if (copy_msqid_from_user(&msqid64, buf, version))
 			return -EFAULT;
-		/* fallthru */
+		return msgctl_down(ns, msqid, cmd, &msqid64.msg_perm,
+				   msqid64.msg_qbytes);
 	case IPC_RMID:
-		return msgctl_down(ns, msqid, cmd, &msqid64);
+		return msgctl_down(ns, msqid, cmd, NULL, 0);
 	default:
 		return  -EINVAL;
 	}
@@ -752,9 +753,9 @@ static long compat_ksys_msgctl(int msqid
 	case IPC_SET:
 		if (copy_compat_msqid_from_user(&msqid64, uptr, version))
 			return -EFAULT;
-		/* fallthru */
+		return msgctl_down(ns, msqid, cmd, &msqid64.msg_perm, msqid64.msg_qbytes);
 	case IPC_RMID:
-		return msgctl_down(ns, msqid, cmd, &msqid64);
+		return msgctl_down(ns, msqid, cmd, NULL, 0);
 	default:
 		return -EINVAL;
 	}
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 20/67] drivers/block/null_blk_main.c: fix layout
  2020-02-04  1:33 incoming Andrew Morton
                   ` (18 preceding siblings ...)
  2020-02-04  1:34 ` [patch 19/67] ipc/msg.c: consolidate all xxxctl_down() functions Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 21/67] drivers/block/null_blk_main.c: fix uninitialized var warnings Andrew Morton
                   ` (218 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, axboe, linux-mm, mm-commits, shli, torvalds

From: Andrew Morton <akpm@linux-foundation.org>
Subject: drivers/block/null_blk_main.c: fix layout

Each line here overflows 80 cols by exactly one character.  Delete one tab
per line to fix.

Cc: Shaohua Li <shli@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/null_blk_main.c |   56 ++++++++++++++++----------------
 1 file changed, 28 insertions(+), 28 deletions(-)

--- a/drivers/block/null_blk_main.c~drivers-block-null_blk_mainc-fix-layout
+++ a/drivers/block/null_blk_main.c
@@ -263,34 +263,34 @@ static ssize_t nullb_device_bool_attr_st
 }
 
 /* The following macro should only be used with TYPE = {uint, ulong, bool}. */
-#define NULLB_DEVICE_ATTR(NAME, TYPE, APPLY)					\
-static ssize_t									\
-nullb_device_##NAME##_show(struct config_item *item, char *page)		\
-{										\
-	return nullb_device_##TYPE##_attr_show(					\
-				to_nullb_device(item)->NAME, page);		\
-}										\
-static ssize_t									\
-nullb_device_##NAME##_store(struct config_item *item, const char *page,		\
-			    size_t count)					\
-{										\
-	int (*apply_fn)(struct nullb_device *dev, TYPE new_value) = APPLY;	\
-	struct nullb_device *dev = to_nullb_device(item);			\
-	TYPE new_value;								\
-	int ret;								\
-										\
-	ret = nullb_device_##TYPE##_attr_store(&new_value, page, count);	\
-	if (ret < 0)								\
-		return ret;							\
-	if (apply_fn)								\
-		ret = apply_fn(dev, new_value);					\
-	else if (test_bit(NULLB_DEV_FL_CONFIGURED, &dev->flags)) 		\
-		ret = -EBUSY;							\
-	if (ret < 0)								\
-		return ret;							\
-	dev->NAME = new_value;							\
-	return count;								\
-}										\
+#define NULLB_DEVICE_ATTR(NAME, TYPE, APPLY)				\
+static ssize_t								\
+nullb_device_##NAME##_show(struct config_item *item, char *page)	\
+{									\
+	return nullb_device_##TYPE##_attr_show(				\
+				to_nullb_device(item)->NAME, page);	\
+}									\
+static ssize_t								\
+nullb_device_##NAME##_store(struct config_item *item, const char *page,	\
+			    size_t count)				\
+{									\
+	int (*apply_fn)(struct nullb_device *dev, TYPE new_value) = APPLY;\
+	struct nullb_device *dev = to_nullb_device(item);		\
+	TYPE new_value;							\
+	int ret;							\
+									\
+	ret = nullb_device_##TYPE##_attr_store(&new_value, page, count);\
+	if (ret < 0)							\
+		return ret;						\
+	if (apply_fn)							\
+		ret = apply_fn(dev, new_value);				\
+	else if (test_bit(NULLB_DEV_FL_CONFIGURED, &dev->flags)) 	\
+		ret = -EBUSY;						\
+	if (ret < 0)							\
+		return ret;						\
+	dev->NAME = new_value;						\
+	return count;							\
+}									\
 CONFIGFS_ATTR(nullb_device_, NAME);
 
 static int nullb_apply_submit_queues(struct nullb_device *dev,
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 21/67] drivers/block/null_blk_main.c: fix uninitialized var warnings
  2020-02-04  1:33 incoming Andrew Morton
                   ` (19 preceding siblings ...)
  2020-02-04  1:34 ` [patch 20/67] drivers/block/null_blk_main.c: fix layout Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 22/67] pinctrl: fix pxa2xx.c build warnings Andrew Morton
                   ` (217 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, axboe, linux-mm, mm-commits, shli, torvalds

From: Andrew Morton <akpm@linux-foundation.org>
Subject: drivers/block/null_blk_main.c: fix uninitialized var warnings

With gcc-7.2, many instances of

drivers/block/null_blk_main.c: In function ‘nullb_device_zone_nr_conv_store’:
drivers/block/null_blk_main.c:291:12: warning: ‘new_value’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  dev->NAME = new_value;      \
            ^
drivers/block/null_blk_main.c:279:7: note: ‘new_value’ was declared here
  TYPE new_value;       \
       ^

Presumably notabug, so use uninitialized_var() to suppress them.
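
For context, uninitialized_var() was at the time a self-assignment trick
that silences the warning without generating extra code; roughly (a
sketch of the then-current include/linux/compiler* definition, from
memory):

	#define uninitialized_var(x) x = x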

Cc: Shaohua Li <shli@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/null_blk_main.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/block/null_blk_main.c~drivers-block-null_blk_mainc-fix-uninitialized-var-warnings
+++ a/drivers/block/null_blk_main.c
@@ -276,7 +276,7 @@ nullb_device_##NAME##_store(struct confi
 {									\
 	int (*apply_fn)(struct nullb_device *dev, TYPE new_value) = APPLY;\
 	struct nullb_device *dev = to_nullb_device(item);		\
-	TYPE new_value;							\
+	TYPE uninitialized_var(new_value);				\
 	int ret;							\
 									\
 	ret = nullb_device_##TYPE##_attr_store(&new_value, page, count);\
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 22/67] pinctrl: fix pxa2xx.c build warnings
  2020-02-04  1:33 incoming Andrew Morton
                   ` (20 preceding siblings ...)
  2020-02-04  1:34 ` [patch 21/67] drivers/block/null_blk_main.c: fix uninitialized var warnings Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:34 ` [patch 23/67] mm: remove __krealloc Andrew Morton
                   ` (216 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rdunlap, robert.jarzmik, torvalds

From: Randy Dunlap <rdunlap@infradead.org>
Subject: pinctrl: fix pxa2xx.c build warnings

Add #include of <linux/pinctrl/machine.h> to fix build
warnings in pinctrl-pxa2xx.c.  Fixes these warnings:

In file included from ../drivers/pinctrl/pxa/pinctrl-pxa2xx.c:24:0:
../drivers/pinctrl/pxa/../pinctrl-utils.h:36:8: warning: `enum pinctrl_map_type' declared inside parameter list [enabled by default]
   enum pinctrl_map_type type);
        ^
../drivers/pinctrl/pxa/../pinctrl-utils.h:36:8: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
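
For background, this warning fires whenever an enum type is first named
inside a function prototype; a minimal standalone illustration (nothing
pxa-specific about it):

	/* no declaration of enum pinctrl_map_type in scope yet */
	void f(enum pinctrl_map_type type);	/* enum's scope is only this declaration */

Including <linux/pinctrl/machine.h>, which defines enum
pinctrl_map_type, puts the type at file scope before pinctrl-utils.h
references it, so the warning goes away.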

Link: http://lkml.kernel.org/r/0024542e-cba9-8f13-6c18-32d0050a6007@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Jarzmik <robert.jarzmik@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/pinctrl/pxa/pinctrl-pxa2xx.c |    1 +
 1 file changed, 1 insertion(+)

--- a/drivers/pinctrl/pxa/pinctrl-pxa2xx.c~pinctrl-fix-pxa2xxc-build-warnings
+++ a/drivers/pinctrl/pxa/pinctrl-pxa2xx.c
@@ -10,6 +10,7 @@
 #include <linux/of.h>
 #include <linux/of_address.h>
 #include <linux/module.h>
+#include <linux/pinctrl/machine.h>
 #include <linux/pinctrl/pinconf.h>
 #include <linux/pinctrl/pinconf-generic.h>
 #include <linux/pinctrl/pinmux.h>
_

* [patch 23/67] mm: remove __krealloc
  2020-02-04  1:33 incoming Andrew Morton
                   ` (21 preceding siblings ...)
  2020-02-04  1:34 ` [patch 22/67] pinctrl: fix pxa2xx.c build warnings Andrew Morton
@ 2020-02-04  1:34 ` Andrew Morton
  2020-02-04  1:35 ` [patch 24/67] mm: add generic p?d_leaf() macros Andrew Morton
                   ` (215 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:34 UTC (permalink / raw)
  To: akpm, cl, david, fw, iamjoonsoo.kim, linux-mm, mm-commits,
	penberg, rientjes, torvalds

From: Florian Westphal <fw@strlen.de>
Subject: mm: remove __krealloc

Since 5.5-rc1 the last user of this function is gone, so remove the
functionality.

See commit
2ad9d7747c10 ("netfilter: conntrack: free extension area immediately")
for details.

Link: http://lkml.kernel.org/r/20191212223442.22141-1-fw@strlen.de
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/slab.h                    |    1 -
 mm/slab_common.c                        |   22 ----------------------
 scripts/coccinelle/free/devm_free.cocci |    4 ----
 3 files changed, 27 deletions(-)

--- a/include/linux/slab.h~mm-remove-__krealloc
+++ a/include/linux/slab.h
@@ -184,7 +184,6 @@ void memcg_deactivate_kmem_caches(struct
 /*
  * Common kmalloc functions provided by all allocators
  */
-void * __must_check __krealloc(const void *, size_t, gfp_t);
 void * __must_check krealloc(const void *, size_t, gfp_t);
 void kfree(const void *);
 void kzfree(const void *);
--- a/mm/slab_common.c~mm-remove-__krealloc
+++ a/mm/slab_common.c
@@ -1677,28 +1677,6 @@ static __always_inline void *__do_kreall
 }
 
 /**
- * __krealloc - like krealloc() but don't free @p.
- * @p: object to reallocate memory for.
- * @new_size: how many bytes of memory are required.
- * @flags: the type of memory to allocate.
- *
- * This function is like krealloc() except it never frees the originally
- * allocated buffer. Use this if you don't want to free the buffer immediately
- * like, for example, with RCU.
- *
- * Return: pointer to the allocated memory or %NULL in case of error
- */
-void *__krealloc(const void *p, size_t new_size, gfp_t flags)
-{
-	if (unlikely(!new_size))
-		return ZERO_SIZE_PTR;
-
-	return __do_krealloc(p, new_size, flags);
-
-}
-EXPORT_SYMBOL(__krealloc);

* [patch 24/67] mm: add generic p?d_leaf() macros
  2020-02-04  1:33 incoming Andrew Morton
                   ` (22 preceding siblings ...)
  2020-02-04  1:34 ` [patch 23/67] mm: remove __krealloc Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 25/67] arc: mm: add p?d_leaf() definitions Andrew Morton
                   ` (214 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mm: add generic p?d_leaf() macros

Patch series "Generic page walk and ptdump", v17.

Many architectures currently have a debugfs file for dumping the kernel page
tables.  Currently each architecture has to implement custom functions for
this because the details of walking the page tables used by the kernel are
different between architectures.

This series extends the capabilities of walk_page_range() so that it can
deal with the page tables of the kernel (which have no VMAs and can
contain larger huge pages than exist for user space).  A generic PTDUMP
implementation is then built on the new functionality of
walk_page_range(), and finally arm64 and x86 are switched to using it,
removing the custom table walkers.

To enable a generic page table walker to walk the unusual mappings of the
kernel we need to implement a set of functions which let us know when the
walker has reached the leaf entry.  After a suggestion from Will Deacon
I've chosen the name p?d_leaf() as this (hopefully) describes the purpose
(and is a new name so has no historic baggage).  Some architectures have
p?d_large macros but this is easily confused with "large pages".

This series ends with a generic PTDUMP implementation for arm64 and x86.

Mostly this is a clean up and there should be very little functional
change.  The exceptions are:

* arm64 PTDUMP debugfs now displays pages which aren't present (patch 22).

* arm64 has the ability to efficiently process KASAN pages (which
  previously only x86 implemented).  This means that the combination of
  KASAN and DEBUG_WX is now useable.


This patch (of 23):

Exposing the pud/pgd levels of the page tables to walk_page_range() means
we may come across the exotic large mappings that come with large areas of
contiguous memory (such as the kernel's linear map).

For architectures that don't provide all p?d_leaf() macros, provide
generic do-nothing defaults that are suitable where there cannot be leaf
pages at that level.  Further patches will add implementations for
individual architectures.

The name p?d_leaf() is chosen to minimize the confusion with existing uses
of "large" pages and "huge" pages which do not necessary mean that the
entry is a leaf (for example it may be a set of contiguous entries that
only take 1 TLB slot).  For the purpose of walking the page tables we
don't need to know how it will be represented in the TLB, but we do need
to know for sure if it is a leaf of the tree.
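
As a sketch of how a page table walker is expected to use these
predicates (note_mapping() and walk_ptes() are hypothetical helpers,
not part of this series):

	static void walk_pmds(pud_t *pud, unsigned long addr)
	{
		pmd_t *pmd = pmd_offset(pud, addr);

		if (pmd_leaf(*pmd))
			note_mapping(addr, pmd);	/* final translation */
		else
			walk_ptes(pmd, addr);		/* descend one level */
	}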

Link: http://lkml.kernel.org/r/20191218162402.45610-2-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-generic/pgtable.h |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

--- a/include/asm-generic/pgtable.h~mm-add-generic-pd_leaf-macros
+++ a/include/asm-generic/pgtable.h
@@ -1238,4 +1238,24 @@ static inline bool arch_has_pfn_modify_c
 #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
 #endif
 
+/*
+ * p?d_leaf() - true if this entry is a final mapping to a physical address.
+ * This differs from p?d_huge() by the fact that they are always available (if
+ * the architecture supports large pages at the appropriate level) even
+ * if CONFIG_HUGETLB_PAGE is not defined.
+ * Only meaningful when called on a valid entry.
+ */
+#ifndef pgd_leaf
+#define pgd_leaf(x)	0
+#endif
+#ifndef p4d_leaf
+#define p4d_leaf(x)	0
+#endif
+#ifndef pud_leaf
+#define pud_leaf(x)	0
+#endif
+#ifndef pmd_leaf
+#define pmd_leaf(x)	0
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
_

* [patch 25/67] arc: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (23 preceding siblings ...)
  2020-02-04  1:35 ` [patch 24/67] mm: add generic p?d_leaf() macros Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 26/67] arm: " Andrew Morton
                   ` (213 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: arc: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information will be provided by the
p?d_leaf() functions/macros.

For arc, we only have two levels, so only pmd_leaf() is needed.

Link: http://lkml.kernel.org/r/20191218162402.45610-3-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arc/include/asm/pgtable.h |    1 +
 1 file changed, 1 insertion(+)

--- a/arch/arc/include/asm/pgtable.h~arc-mm-add-pd_leaf-definitions
+++ a/arch/arc/include/asm/pgtable.h
@@ -273,6 +273,7 @@ static inline void pmd_set(pmd_t *pmdp,
 #define pmd_none(x)			(!pmd_val(x))
 #define	pmd_bad(x)			((pmd_val(x) & ~PAGE_MASK))
 #define pmd_present(x)			(pmd_val(x))
+#define pmd_leaf(x)			(pmd_val(x) & _PAGE_HW_SZ)
 #define pmd_clear(xp)			do { pmd_val(*(xp)) = 0; } while (0)
 
 #define pte_page(pte)		pfn_to_page(pte_pfn(pte))
_

* [patch 26/67] arm: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (24 preceding siblings ...)
  2020-02-04  1:35 ` [patch 25/67] arc: mm: add p?d_leaf() definitions Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 27/67] arm64: " Andrew Morton
                   ` (212 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: arm: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information is provided by the
p?d_leaf() functions/macros.

For arm pmd_large() already exists and does what we want.  So simply
provide the generic pmd_leaf() name.

Link: http://lkml.kernel.org/r/20191218162402.45610-4-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm/include/asm/pgtable-2level.h |    1 +
 arch/arm/include/asm/pgtable-3level.h |    1 +
 2 files changed, 2 insertions(+)

--- a/arch/arm/include/asm/pgtable-2level.h~arm-mm-add-pd_leaf-definitions
+++ a/arch/arm/include/asm/pgtable-2level.h
@@ -189,6 +189,7 @@ static inline pmd_t *pmd_offset(pud_t *p
 }
 
 #define pmd_large(pmd)		(pmd_val(pmd) & 2)
+#define pmd_leaf(pmd)		(pmd_val(pmd) & 2)
 #define pmd_bad(pmd)		(pmd_val(pmd) & 2)
 #define pmd_present(pmd)	(pmd_val(pmd))
 
--- a/arch/arm/include/asm/pgtable-3level.h~arm-mm-add-pd_leaf-definitions
+++ a/arch/arm/include/asm/pgtable-3level.h
@@ -134,6 +134,7 @@
 #define pmd_sect(pmd)		((pmd_val(pmd) & PMD_TYPE_MASK) == \
 						 PMD_TYPE_SECT)
 #define pmd_large(pmd)		pmd_sect(pmd)
+#define pmd_leaf(pmd)		pmd_sect(pmd)
 
 #define pud_clear(pudp)			\
 	do {				\
_

* [patch 27/67] arm64: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (25 preceding siblings ...)
  2020-02-04  1:35 ` [patch 26/67] arm: " Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 28/67] mips: " Andrew Morton
                   ` (211 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: arm64: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information will be provided by the
p?d_leaf() functions/macros.

For arm64, we already have p?d_sect() macros which we can reuse for
p?d_leaf().

pud_sect() is defined as a dummy function when CONFIG_PGTABLE_LEVELS < 3
or CONFIG_ARM64_64K_PAGES is defined.  However, when the kernel is
configured this way the architecture doesn't allow a large page at this
level, and any code using these page walking macros is implicitly
relying on the page size/number of levels matching the kernel's.  So it
is safe to reuse pud_sect() for pud_leaf() as this is an architectural
restriction.

Link: http://lkml.kernel.org/r/20191218162402.45610-5-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/include/asm/pgtable.h |    2 ++
 1 file changed, 2 insertions(+)

--- a/arch/arm64/include/asm/pgtable.h~arm64-mm-add-pd_leaf-definitions
+++ a/arch/arm64/include/asm/pgtable.h
@@ -441,6 +441,7 @@ extern pgprot_t phys_mem_access_prot(str
 				 PMD_TYPE_TABLE)
 #define pmd_sect(pmd)		((pmd_val(pmd) & PMD_TYPE_MASK) == \
 				 PMD_TYPE_SECT)
+#define pmd_leaf(pmd)		pmd_sect(pmd)
 
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 static inline bool pud_sect(pud_t pud) { return false; }
@@ -525,6 +526,7 @@ static inline void pte_unmap(pte_t *pte)
 #define pud_none(pud)		(!pud_val(pud))
 #define pud_bad(pud)		(!(pud_val(pud) & PUD_TABLE_BIT))
 #define pud_present(pud)	pte_present(pud_pte(pud))
+#define pud_leaf(pud)		pud_sect(pud)
 #define pud_valid(pud)		pte_valid(pud_pte(pud))
 
 static inline void set_pud(pud_t *pudp, pud_t pud)
_

* [patch 28/67] mips: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (26 preceding siblings ...)
  2020-02-04  1:35 ` [patch 27/67] arm64: " Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 29/67] powerpc: " Andrew Morton
                   ` (210 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mips: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information is provided by the
p?d_leaf() functions/macros.

If _PAGE_HUGE is defined we can simply look for it.  When not defined we
can be confident that there are no leaf pages in existence and fall back
on the generic implementation (added in a later patch) which returns 0.

Link: http://lkml.kernel.org/r/20191218162402.45610-6-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/mips/include/asm/pgtable.h |    5 +++++
 1 file changed, 5 insertions(+)

--- a/arch/mips/include/asm/pgtable.h~mips-mm-add-pd_leaf-definitions
+++ a/arch/mips/include/asm/pgtable.h
@@ -639,6 +639,11 @@ static inline pmd_t pmdp_huge_get_and_cl
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef _PAGE_HUGE
+#define pmd_leaf(pmd)	((pmd_val(pmd) & _PAGE_HUGE) != 0)
+#define pud_leaf(pud)	((pud_val(pud) & _PAGE_HUGE) != 0)
+#endif
+
 #define gup_fast_permitted(start, end)	(!cpu_has_dc_aliases)
 
 #include <asm-generic/pgtable.h>
_

* [patch 29/67] powerpc: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (27 preceding siblings ...)
  2020-02-04  1:35 ` [patch 28/67] mips: " Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 30/67] riscv: " Andrew Morton
                   ` (209 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: powerpc: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information is provided by the
p?d_leaf() functions/macros.

For powerpc p?d_is_leaf() functions already exist.  Export them using the
new p?d_leaf() name.

Link: http://lkml.kernel.org/r/20191218162402.45610-7-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/include/asm/book3s/64/pgtable.h |    3 +++
 1 file changed, 3 insertions(+)

--- a/arch/powerpc/include/asm/book3s/64/pgtable.h~powerpc-mm-add-pd_leaf-definitions
+++ a/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1355,18 +1355,21 @@ static inline bool is_pte_rw_upgrade(uns
  * Like pmd_huge() and pmd_large(), but works regardless of config options
  */
 #define pmd_is_leaf pmd_is_leaf
+#define pmd_leaf pmd_is_leaf
 static inline bool pmd_is_leaf(pmd_t pmd)
 {
 	return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
 }
 
 #define pud_is_leaf pud_is_leaf
+#define pud_leaf pud_is_leaf
 static inline bool pud_is_leaf(pud_t pud)
 {
 	return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
 }
 
 #define pgd_is_leaf pgd_is_leaf
+#define pgd_leaf pgd_is_leaf
 static inline bool pgd_is_leaf(pgd_t pgd)
 {
 	return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
_

* [patch 30/67] riscv: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (28 preceding siblings ...)
  2020-02-04  1:35 ` [patch 29/67] powerpc: " Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 31/67] s390: " Andrew Morton
                   ` (208 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: riscv: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information is provided by the
p?d_leaf() functions/macros.

For riscv a page is a leaf page when it has a read, write or execute bit
set on it.

Link: http://lkml.kernel.org/r/20191218162402.45610-8-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Alexandre Ghiti <alex@ghiti.fr>
Reviewed-by: Zong Li <zong.li@sifive.com>
Acked-by: Paul Walmsley <paul.walmsley@sifive.com>	[arch/riscv]
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/riscv/include/asm/pgtable-64.h |    7 +++++++
 arch/riscv/include/asm/pgtable.h    |    7 +++++++
 2 files changed, 14 insertions(+)

--- a/arch/riscv/include/asm/pgtable-64.h~riscv-mm-add-pd_leaf-definitions
+++ a/arch/riscv/include/asm/pgtable-64.h
@@ -43,6 +43,13 @@ static inline int pud_bad(pud_t pud)
 	return !pud_present(pud);
 }
 
+#define pud_leaf	pud_leaf
+static inline int pud_leaf(pud_t pud)
+{
+	return pud_present(pud) &&
+	       (pud_val(pud) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
+}
+
 static inline void set_pud(pud_t *pudp, pud_t pud)
 {
 	*pudp = pud;
--- a/arch/riscv/include/asm/pgtable.h~riscv-mm-add-pd_leaf-definitions
+++ a/arch/riscv/include/asm/pgtable.h
@@ -130,6 +130,13 @@ static inline int pmd_bad(pmd_t pmd)
 	return !pmd_present(pmd);
 }
 
+#define pmd_leaf	pmd_leaf
+static inline int pmd_leaf(pmd_t pmd)
+{
+	return pmd_present(pmd) &&
+	       (pmd_val(pmd) & (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC));
+}
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	*pmdp = pmd;
_

* [patch 31/67] s390: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (29 preceding siblings ...)
  2020-02-04  1:35 ` [patch 30/67] riscv: " Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 32/67] sparc: " Andrew Morton
                   ` (207 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: s390: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information is provided by the
p?d_leaf() functions/macros.

For s390, pud_large() and pmd_large() are already implemented as static
inline functions.  Add a macro to provide the p?d_leaf names for the
generic code to use.

Link: http://lkml.kernel.org/r/20191218162402.45610-9-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/include/asm/pgtable.h |    2 ++
 1 file changed, 2 insertions(+)

--- a/arch/s390/include/asm/pgtable.h~s390-mm-add-pd_leaf-definitions
+++ a/arch/s390/include/asm/pgtable.h
@@ -673,6 +673,7 @@ static inline int pud_none(pud_t pud)
 	return pud_val(pud) == _REGION3_ENTRY_EMPTY;
 }
 
+#define pud_leaf	pud_large
 static inline int pud_large(pud_t pud)
 {
 	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) != _REGION_ENTRY_TYPE_R3)
@@ -690,6 +691,7 @@ static inline unsigned long pud_pfn(pud_
 	return (pud_val(pud) & origin_mask) >> PAGE_SHIFT;
 }
 
+#define pmd_leaf	pmd_large
 static inline int pmd_large(pmd_t pmd)
 {
 	return (pmd_val(pmd) & _SEGMENT_ENTRY_LARGE) != 0;
_

* [patch 32/67] sparc: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (30 preceding siblings ...)
  2020-02-04  1:35 ` [patch 31/67] s390: " Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 33/67] x86: " Andrew Morton
                   ` (206 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: sparc: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information is provided by the
p?d_leaf() functions/macros.

For sparc 64 bit, pmd_large() and pud_large() are already provided, so add
macros to provide the p?d_leaf names required by the generic code.

Link: http://lkml.kernel.org/r/20191218162402.45610-10-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/sparc/include/asm/pgtable_64.h |    2 ++
 1 file changed, 2 insertions(+)

--- a/arch/sparc/include/asm/pgtable_64.h~sparc-mm-add-pd_leaf-definitions
+++ a/arch/sparc/include/asm/pgtable_64.h
@@ -683,6 +683,7 @@ static inline unsigned long pte_special(
 	return pte_val(pte) & _PAGE_SPECIAL;
 }
 
+#define pmd_leaf	pmd_large
 static inline unsigned long pmd_large(pmd_t pmd)
 {
 	pte_t pte = __pte(pmd_val(pmd));
@@ -867,6 +868,7 @@ static inline unsigned long pud_page_vad
 /* only used by the stubbed out hugetlb gup code, should never be called */
 #define p4d_page(p4d)			NULL
 
+#define pud_leaf	pud_large
 static inline unsigned long pud_large(pud_t pud)
 {
 	pte_t pte = __pte(pud_val(pud));
_

* [patch 33/67] x86: mm: add p?d_leaf() definitions
  2020-02-04  1:33 incoming Andrew Morton
                   ` (31 preceding siblings ...)
  2020-02-04  1:35 ` [patch 32/67] sparc: " Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 34/67] mm: pagewalk: add p4d_entry() and pgd_entry() Andrew Morton
                   ` (205 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: x86: mm: add p?d_leaf() definitions

walk_page_range() is going to be allowed to walk page tables other than
those of user space.  For this it needs to know when it has reached a
'leaf' entry in the page tables.  This information is provided by the
p?d_leaf() functions/macros.

For x86 we already have p?d_large() functions, so simply add macros to
provide the generic p?d_leaf() names for the generic code.

Link: http://lkml.kernel.org/r/20191218162402.45610-11-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/include/asm/pgtable.h |    5 +++++
 1 file changed, 5 insertions(+)

--- a/arch/x86/include/asm/pgtable.h~x86-mm-add-pd_leaf-definitions
+++ a/arch/x86/include/asm/pgtable.h
@@ -239,6 +239,7 @@ static inline unsigned long pgd_pfn(pgd_
 	return (pgd_val(pgd) & PTE_PFN_MASK) >> PAGE_SHIFT;
 }
 
+#define p4d_leaf	p4d_large
 static inline int p4d_large(p4d_t p4d)
 {
 	/* No 512 GiB pages yet */
@@ -247,6 +248,7 @@ static inline int p4d_large(p4d_t p4d)
 
 #define pte_page(pte)	pfn_to_page(pte_pfn(pte))
 
+#define pmd_leaf	pmd_large
 static inline int pmd_large(pmd_t pte)
 {
 	return pmd_flags(pte) & _PAGE_PSE;
@@ -874,6 +876,7 @@ static inline pmd_t *pmd_offset(pud_t *p
 	return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
 }
 
+#define pud_leaf	pud_large
 static inline int pud_large(pud_t pud)
 {
 	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
@@ -885,6 +888,7 @@ static inline int pud_bad(pud_t pud)
 	return (pud_flags(pud) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
 }
 #else
+#define pud_leaf	pud_large
 static inline int pud_large(pud_t pud)
 {
 	return 0;
@@ -1233,6 +1237,7 @@ static inline bool pgdp_maps_userspace(v
 	return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
 }
 
+#define pgd_leaf	pgd_large
 static inline int pgd_large(pgd_t pgd) { return 0; }
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
_

* [patch 34/67] mm: pagewalk: add p4d_entry() and pgd_entry()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (32 preceding siblings ...)
  2020-02-04  1:35 ` [patch 33/67] x86: " Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 35/67] mm: pagewalk: allow walking without vma Andrew Morton
                   ` (204 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mm: pagewalk: add p4d_entry() and pgd_entry()

pgd_entry() and pud_entry() were removed by commit 0b1fbfe50006c410
("mm/pagewalk: remove pgd_entry() and pud_entry()") because there were no
users.  We're about to add users so reintroduce them, along with
p4d_entry() as we now have 5 levels of tables.

Note that commit a00cc7d9dd93d66a ("mm, x86: add support for PUD-sized
transparent hugepages") already re-added pud_entry() but with different
semantics to the other callbacks.  This commit reverts the semantics back
to match the other callbacks.

To support hmm.c, which now uses the new semantics of pud_entry(), a
new member ('action') of struct mm_walk is added which allows the
callbacks to either descend (ACTION_SUBTREE, the default), skip
(ACTION_CONTINUE) or repeat the callback (ACTION_AGAIN).  hmm.c is then
updated to call pud_trans_huge_lock() itself and to make use of the
splitting/retry logic of the core code.

After this change pud_entry() is called for all entries, not just
transparent huge pages.
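
As an illustration of the new control flow (a sketch only;
handle_huge_pud() is a hypothetical helper, not part of this series), a
pud_entry() callback can now look like:

	static int my_pud_entry(pud_t *pud, unsigned long addr,
				unsigned long next, struct mm_walk *walk)
	{
		if (pud_trans_huge(*pud)) {
			handle_huge_pud(pud, addr);
			/* handled here: don't descend into the subtree */
			walk->action = ACTION_CONTINUE;
		}
		/* otherwise keep ACTION_SUBTREE (the default), which
		 * descends, splitting the huge page if needed */
		return 0;
	}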

[arnd@arndb.de: fix unused variable warning]
 Link: http://lkml.kernel.org/r/20200107204607.1533842-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20191218162402.45610-12-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/pagewalk.h |   34 +++++++++++++++++----
 mm/hmm.c                 |   58 ++++++++++++++++++++-----------------
 mm/pagewalk.c            |   50 ++++++++++++++++++++++---------
 3 files changed, 95 insertions(+), 47 deletions(-)

--- a/include/linux/pagewalk.h~mm-pagewalk-add-p4d_entry-and-pgd_entry
+++ a/include/linux/pagewalk.h
@@ -8,15 +8,15 @@ struct mm_walk;
 
 /**
  * mm_walk_ops - callbacks for walk_page_range
- * @pud_entry:		if set, called for each non-empty PUD (2nd-level) entry
- *			this handler should only handle pud_trans_huge() puds.
- *			the pmd_entry or pte_entry callbacks will be used for
- *			regular PUDs.
- * @pmd_entry:		if set, called for each non-empty PMD (3rd-level) entry
+ * @pgd_entry:		if set, called for each non-empty PGD (top-level) entry
+ * @p4d_entry:		if set, called for each non-empty P4D entry
+ * @pud_entry:		if set, called for each non-empty PUD entry
+ * @pmd_entry:		if set, called for each non-empty PMD entry
  *			this handler is required to be able to handle
  *			pmd_trans_huge() pmds.  They may simply choose to
  *			split_huge_page() instead of handling it explicitly.
- * @pte_entry:		if set, called for each non-empty PTE (4th-level) entry
+ * @pte_entry:		if set, called for each non-empty PTE (lowest-level)
+ *			entry
  * @pte_hole:		if set, called for each hole at all levels
  * @hugetlb_entry:	if set, called for each hugetlb entry
  * @test_walk:		caller specific callback function to determine whether
@@ -27,8 +27,15 @@ struct mm_walk;
  * @pre_vma:            if set, called before starting walk on a non-null vma.
  * @post_vma:           if set, called after a walk on a non-null vma, provided
  *                      that @pre_vma and the vma walk succeeded.
+ *
+ * p?d_entry callbacks are called even if those levels are folded on a
+ * particular architecture/configuration.
  */
 struct mm_walk_ops {
+	int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
+			 unsigned long next, struct mm_walk *walk);
+	int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
+			 unsigned long next, struct mm_walk *walk);
 	int (*pud_entry)(pud_t *pud, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
@@ -47,11 +54,25 @@ struct mm_walk_ops {
 	void (*post_vma)(struct mm_walk *walk);
 };
 
+/*
+ * Action for pud_entry / pmd_entry callbacks.
+ * ACTION_SUBTREE is the default
+ */
+enum page_walk_action {
+	/* Descend to next level, splitting huge pages if needed and possible */
+	ACTION_SUBTREE = 0,
+	/* Continue to next entry at this level (ignoring any subtree) */
+	ACTION_CONTINUE = 1,
+	/* Call again for this entry */
+	ACTION_AGAIN = 2
+};
+
 /**
  * mm_walk - walk_page_range data
  * @ops:	operation to call during the walk
  * @mm:		mm_struct representing the target process of page table walk
  * @vma:	vma currently walked (NULL if walking outside vmas)
+ * @action:	next action to perform (see enum page_walk_action)
  * @private:	private data for callbacks' usage
  *
  * (see the comment on walk_page_range() for more details)
@@ -60,6 +81,7 @@ struct mm_walk {
 	const struct mm_walk_ops *ops;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
+	enum page_walk_action action;
 	void *private;
 };
 
--- a/mm/hmm.c~mm-pagewalk-add-p4d_entry-and-pgd_entry
+++ a/mm/hmm.c
@@ -474,23 +474,32 @@ static int hmm_vma_walk_pud(pud_t *pudp,
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
-	unsigned long addr = start, next;
-	pmd_t *pmdp;
+	unsigned long addr = start;
 	pud_t pud;
-	int ret;
+	int ret = 0;
+	spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);
+
+	if (!ptl)
+		return 0;
+
+	/* Normally we don't want to split the huge page */
+	walk->action = ACTION_CONTINUE;
 
-again:
 	pud = READ_ONCE(*pudp);
-	if (pud_none(pud))
-		return hmm_vma_walk_hole(start, end, walk);
+	if (pud_none(pud)) {
+		ret = hmm_vma_walk_hole(start, end, walk);
+		goto out_unlock;
+	}
 
 	if (pud_huge(pud) && pud_devmap(pud)) {
 		unsigned long i, npages, pfn;
 		uint64_t *pfns, cpu_flags;
 		bool fault, write_fault;
 
-		if (!pud_present(pud))
-			return hmm_vma_walk_hole(start, end, walk);
+		if (!pud_present(pud)) {
+			ret = hmm_vma_walk_hole(start, end, walk);
+			goto out_unlock;
+		}
 
 		i = (addr - range->start) >> PAGE_SHIFT;
 		npages = (end - addr) >> PAGE_SHIFT;
@@ -499,16 +508,20 @@ again:
 		cpu_flags = pud_to_hmm_pfn_flags(range, pud);
 		hmm_range_need_fault(hmm_vma_walk, pfns, npages,
 				     cpu_flags, &fault, &write_fault);
-		if (fault || write_fault)
-			return hmm_vma_walk_hole_(addr, end, fault,
-						write_fault, walk);
+		if (fault || write_fault) {
+			ret = hmm_vma_walk_hole_(addr, end, fault,
+						 write_fault, walk);
+			goto out_unlock;
+		}
 
 		pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 		for (i = 0; i < npages; ++i, ++pfn) {
 			hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
 					      hmm_vma_walk->pgmap);
-			if (unlikely(!hmm_vma_walk->pgmap))
-				return -EBUSY;
+			if (unlikely(!hmm_vma_walk->pgmap)) {
+				ret = -EBUSY;
+				goto out_unlock;
+			}
 			pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
 				  cpu_flags;
 		}
@@ -517,22 +530,15 @@ again:
 			hmm_vma_walk->pgmap = NULL;
 		}
 		hmm_vma_walk->last = end;
-		return 0;
+		goto out_unlock;
 	}
 
-	split_huge_pud(walk->vma, pudp, addr);
-	if (pud_none(*pudp))
-		goto again;
-
-	pmdp = pmd_offset(pudp, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-		ret = hmm_vma_walk_pmd(pmdp, addr, next, walk);
-		if (ret)
-			return ret;
-	} while (pmdp++, addr = next, addr != end);
+	/* Ask for the PUD to be split */
+	walk->action = ACTION_SUBTREE;
 
-	return 0;
+out_unlock:
+	spin_unlock(ptl);
+	return ret;
 }
 #else
 #define hmm_vma_walk_pud	NULL
--- a/mm/pagewalk.c~mm-pagewalk-add-p4d_entry-and-pgd_entry
+++ a/mm/pagewalk.c
@@ -46,6 +46,9 @@ again:
 				break;
 			continue;
 		}
+
+		walk->action = ACTION_SUBTREE;
+
 		/*
 		 * This implies that each ->pmd_entry() handler
 		 * needs to know about pmd_trans_huge() pmds
@@ -55,16 +58,21 @@ again:
 		if (err)
 			break;
 
+		if (walk->action == ACTION_AGAIN)
+			goto again;
+
 		/*
 		 * Check this here so we only break down trans_huge
 		 * pages when we _need_ to
 		 */
-		if (!ops->pte_entry)
+		if (walk->action == ACTION_CONTINUE ||
+		    !(ops->pte_entry))
 			continue;
 
 		split_huge_pmd(walk->vma, pmd, addr);
 		if (pmd_trans_unstable(pmd))
 			goto again;
+
 		err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
 			break;
@@ -93,24 +101,25 @@ static int walk_pud_range(p4d_t *p4d, un
 			continue;
 		}
 
-		if (ops->pud_entry) {
-			spinlock_t *ptl = pud_trans_huge_lock(pud, walk->vma);
+		walk->action = ACTION_SUBTREE;
 
-			if (ptl) {
-				err = ops->pud_entry(pud, addr, next, walk);
-				spin_unlock(ptl);
-				if (err)
-					break;
-				continue;
-			}
-		}
+		if (ops->pud_entry)
+			err = ops->pud_entry(pud, addr, next, walk);
+		if (err)
+			break;
+
+		if (walk->action == ACTION_AGAIN)
+			goto again;
+
+		if (walk->action == ACTION_CONTINUE ||
+		    !(ops->pmd_entry || ops->pte_entry))
+			continue;
 
 		split_huge_pud(walk->vma, pud, addr);
 		if (pud_none(*pud))
 			goto again;
 
-		if (ops->pmd_entry || ops->pte_entry)
-			err = walk_pmd_range(pud, addr, next, walk);
+		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
 	} while (pud++, addr = next, addr != end);
@@ -136,7 +145,12 @@ static int walk_p4d_range(pgd_t *pgd, un
 				break;
 			continue;
 		}
-		if (ops->pmd_entry || ops->pte_entry)
+		if (ops->p4d_entry) {
+			err = ops->p4d_entry(p4d, addr, next, walk);
+			if (err)
+				break;
+		}
+		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
 			err = walk_pud_range(p4d, addr, next, walk);
 		if (err)
 			break;
@@ -163,7 +177,13 @@ static int walk_pgd_range(unsigned long
 				break;
 			continue;
 		}
-		if (ops->pmd_entry || ops->pte_entry)
+		if (ops->pgd_entry) {
+			err = ops->pgd_entry(pgd, addr, next, walk);
+			if (err)
+				break;
+		}
+		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
+		    ops->pte_entry)
 			err = walk_p4d_range(pgd, addr, next, walk);
 		if (err)
 			break;
_

* [patch 35/67] mm: pagewalk: allow walking without vma
  2020-02-04  1:33 incoming Andrew Morton
                   ` (33 preceding siblings ...)
  2020-02-04  1:35 ` [patch 34/67] mm: pagewalk: add p4d_entry() and pgd_entry() Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 36/67] mm: pagewalk: don't lock PTEs for walk_page_range_novma() Andrew Morton
                   ` (203 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mm: pagewalk: allow walking without vma

Since commit 48684a65b4e3 ("mm: pagewalk: fix misbehavior of
walk_page_range for vma(VM_PFNMAP)"), walk_page_range() will report any
kernel area as a hole, because it lacks a vma.

This means each arch has re-implemented page table walking when needed,
for example in the per-arch ptdump walker.

Remove the requirement to have a vma in the generic code and add a new
function walk_page_range_novma() which ignores the VMAs and simply walks
the page tables.
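
A sketch of the intended use, a hypothetical dumper of a kernel virtual
address range [start, end) under init_mm (dump_pte() is a made-up
callback; the generic PTDUMP added later in this series is the real
user):

	static int dump_pte(pte_t *pte, unsigned long addr,
			    unsigned long next, struct mm_walk *walk)
	{
		pr_info("%lx: pte %llx\n", addr,
			(unsigned long long)pte_val(*pte));
		return 0;
	}

	static const struct mm_walk_ops dump_ops = {
		.pte_entry	= dump_pte,
	};

	down_read(&init_mm.mmap_sem);	/* the walker asserts this is held */
	walk_page_range_novma(&init_mm, start, end, &dump_ops, NULL);
	up_read(&init_mm.mmap_sem);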

Link: http://lkml.kernel.org/r/20191218162402.45610-13-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/pagewalk.h |    5 ++++
 mm/pagewalk.c            |   40 +++++++++++++++++++++++++++++--------
 2 files changed, 37 insertions(+), 8 deletions(-)

--- a/include/linux/pagewalk.h~mm-pagewalk-allow-walking-without-vma
+++ a/include/linux/pagewalk.h
@@ -73,6 +73,7 @@ enum page_walk_action {
  * @mm:		mm_struct representing the target process of page table walk
  * @vma:	vma currently walked (NULL if walking outside vmas)
  * @action:	next action to perform (see enum page_walk_action)
+ * @no_vma:	walk ignoring vmas (vma will always be NULL)
  * @private:	private data for callbacks' usage
  *
  * (see the comment on walk_page_range() for more details)
@@ -82,12 +83,16 @@ struct mm_walk {
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	enum page_walk_action action;
+	bool no_vma;
 	void *private;
 };
 
 int walk_page_range(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
+int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
+			  unsigned long end, const struct mm_walk_ops *ops,
+			  void *private);
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 		void *private);
 int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
--- a/mm/pagewalk.c~mm-pagewalk-allow-walking-without-vma
+++ a/mm/pagewalk.c
@@ -39,7 +39,7 @@ static int walk_pmd_range(pud_t *pud, un
 	do {
 again:
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(*pmd) || !walk->vma) {
+		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, walk);
 			if (err)
@@ -65,13 +65,16 @@ again:
 		 * Check this here so we only break down trans_huge
 		 * pages when we _need_ to
 		 */
-		if (walk->action == ACTION_CONTINUE ||
+		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
+		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pte_entry))
 			continue;
 
-		split_huge_pmd(walk->vma, pmd, addr);
-		if (pmd_trans_unstable(pmd))
-			goto again;
+		if (walk->vma) {
+			split_huge_pmd(walk->vma, pmd, addr);
+			if (pmd_trans_unstable(pmd))
+				goto again;
+		}
 
 		err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
@@ -93,7 +96,7 @@ static int walk_pud_range(p4d_t *p4d, un
 	do {
  again:
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud) || !walk->vma) {
+		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, walk);
 			if (err)
@@ -111,11 +114,13 @@ static int walk_pud_range(p4d_t *p4d, un
 		if (walk->action == ACTION_AGAIN)
 			goto again;
 
-		if (walk->action == ACTION_CONTINUE ||
+		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
+		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pmd_entry || ops->pte_entry))
 			continue;
 
-		split_huge_pud(walk->vma, pud, addr);
+		if (walk->vma)
+			split_huge_pud(walk->vma, pud, addr);
 		if (pud_none(*pud))
 			goto again;
 
@@ -389,6 +394,25 @@ int walk_page_range(struct mm_struct *mm
 	return err;
 }
 
+int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
+			  unsigned long end, const struct mm_walk_ops *ops,
+			  void *private)
+{
+	struct mm_walk walk = {
+		.ops		= ops,
+		.mm		= mm,
+		.private	= private,
+		.no_vma		= true
+	};
+
+	if (start >= end || !walk.mm)
+		return -EINVAL;
+
+	lockdep_assert_held(&walk.mm->mmap_sem);
+
+	return __walk_page_range(start, end, &walk);
+}
+
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 		void *private)
 {
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 36/67] mm: pagewalk: don't lock PTEs for walk_page_range_novma()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (34 preceding siblings ...)
  2020-02-04  1:35 ` [patch 35/67] mm: pagewalk: allow walking without vma Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:35 ` [patch 37/67] mm: pagewalk: fix termination condition in walk_pte_range() Andrew Morton
                   ` (202 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mm: pagewalk: don't lock PTEs for walk_page_range_novma()

walk_page_range_novma() can be used to walk page tables of the kernel or
of firmware.  These page tables may contain entries that are not backed
by a struct page, so it isn't (in general) possible to take the PTE lock
for the pte_entry() callback.  So update walk_pte_range() to only take
the lock when no_vma==false, by splitting the inner loop out into a
separate function, and add a comment explaining the difference to
walk_page_range_novma().
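
Condensed, the locking decision in walk_pte_range() (mirroring the hunk
below) becomes:

	if (walk->no_vma) {
		/* kernel/firmware tables: entries may lack a struct page,
		 * so the PTE lock cannot (in general) be taken */
		pte = pte_offset_map(pmd, addr);
		err = walk_pte_range_inner(pte, addr, end, walk);
		pte_unmap(pte);
	} else {
		/* vma-backed tables: take the PTE lock as before */
		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
		err = walk_pte_range_inner(pte, addr, end, walk);
		pte_unmap_unlock(pte, ptl);
	}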

Link: http://lkml.kernel.org/r/20191218162402.45610-14-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/pagewalk.c |   35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

--- a/mm/pagewalk.c~mm-pagewalk-dont-lock-ptes-for-walk_page_range_novma
+++ a/mm/pagewalk.c
@@ -4,15 +4,12 @@
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
 
-static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			  struct mm_walk *walk)
+static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
 {
-	pte_t *pte;
-	int err = 0;
 	const struct mm_walk_ops *ops = walk->ops;
-	spinlock_t *ptl;
+	int err = 0;
 
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (;;) {
 		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
@@ -22,8 +19,26 @@ static int walk_pte_range(pmd_t *pmd, un
 			break;
 		pte++;
 	}
+	return err;
+}
+
+static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+			  struct mm_walk *walk)
+{
+	pte_t *pte;
+	int err = 0;
+	spinlock_t *ptl;
+
+	if (walk->no_vma) {
+		pte = pte_offset_map(pmd, addr);
+		err = walk_pte_range_inner(pte, addr, end, walk);
+		pte_unmap(pte);
+	} else {
+		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+		err = walk_pte_range_inner(pte, addr, end, walk);
+		pte_unmap_unlock(pte, ptl);
+	}
 
-	pte_unmap_unlock(pte, ptl);
 	return err;
 }
 
@@ -394,6 +409,12 @@ int walk_page_range(struct mm_struct *mm
 	return err;
 }
 
+/*
+ * Similar to walk_page_range() but can walk any page tables even if they are
+ * not backed by VMAs. Because 'unusual' entries may be walked this function
+ * will also not lock the PTEs for the pte_entry() callback. This is useful for
+ * walking the kernel page tables or page tables for firmware.
+ */
 int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
 			  void *private)
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 37/67] mm: pagewalk: fix termination condition in walk_pte_range()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (35 preceding siblings ...)
  2020-02-04  1:35 ` [patch 36/67] mm: pagewalk: don't lock PTEs for walk_page_range_novma() Andrew Morton
@ 2020-02-04  1:35 ` Andrew Morton
  2020-02-04  1:36 ` [patch 38/67] mm: pagewalk: add 'depth' parameter to pte_hole Andrew Morton
                   ` (201 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:35 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mm: pagewalk: fix termination condition in walk_pte_range()

If walk_pte_range() is called with an 'end' argument that is beyond the
last page of memory (e.g.  ~0UL) then the comparison between 'addr' and
'end' will always fail and the loop will be infinite.  Instead change the
comparison to >= while accounting for overflow.
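
To see the overflow (illustrative values, assuming 4K pages):

	unsigned long end  = ~0UL;		/* one byte past the last page */
	unsigned long addr = end & PAGE_MASK;	/* 0xfffffffffffff000 */

	/* old test: addr += PAGE_SIZE wraps to 0, so addr == end never holds */
	/* new test: addr >= end - PAGE_SIZE is already true, so the loop stops */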

Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/pagewalk.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/pagewalk.c~mm-pagewalk-fix-termination-condition-in-walk_pte_range
+++ a/mm/pagewalk.c
@@ -14,9 +14,9 @@ static int walk_pte_range_inner(pte_t *p
 		err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
 		       break;
-		addr += PAGE_SIZE;
-		if (addr == end)
+		if (addr >= end - PAGE_SIZE)
 			break;
+		addr += PAGE_SIZE;
 		pte++;
 	}
 	return err;
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 38/67] mm: pagewalk: add 'depth' parameter to pte_hole
  2020-02-04  1:33 incoming Andrew Morton
                   ` (36 preceding siblings ...)
  2020-02-04  1:35 ` [patch 37/67] mm: pagewalk: fix termination condition in walk_pte_range() Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:36 ` [patch 39/67] x86: mm: point to struct seq_file from struct pg_state Andrew Morton
                   ` (200 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mm: pagewalk: add 'depth' parameter to pte_hole

The pte_hole() callback is called at multiple levels of the page tables.
Code dumping the kernel page tables needs to know at what depth the
missing entry is.  Add this as an extra parameter to pte_hole().  When the
depth isn't known (e.g.  processing a vma) then -1 is passed.

The depth that is reported is the actual level where the entry is missing
(ignoring any folding that is in place), i.e.  any levels where
PTRS_PER_P?D is set to 1 are ignored.

Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their
natural numbers as levels 2/3/4.
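
A minimal sketch of a pte_hole() consumer using the new parameter (the
callback name is hypothetical), following the 0:PGD .. 4:PTE numbering
above:

static int example_pte_hole(unsigned long addr, unsigned long next,
			    int depth, struct mm_walk *walk)
{
	static const char * const name[] =
		{ "pgd", "p4d", "pud", "pmd", "pte" };

	if (depth >= 0)
		pr_info("hole %#lx-%#lx at %s level\n", addr, next, name[depth]);
	else
		pr_info("hole %#lx-%#lx (depth unknown)\n", addr, next);

	return 0;
}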

Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Tested-by: Zong Li <zong.li@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/task_mmu.c       |    4 ++--
 include/linux/pagewalk.h |    7 +++++--
 mm/hmm.c                 |    8 ++++----
 mm/migrate.c             |    5 +++--
 mm/mincore.c             |    1 +
 mm/pagewalk.c            |   31 +++++++++++++++++++++++++------
 6 files changed, 40 insertions(+), 16 deletions(-)

--- a/fs/proc/task_mmu.c~mm-pagewalk-add-depth-parameter-to-pte_hole
+++ a/fs/proc/task_mmu.c
@@ -505,7 +505,7 @@ static void smaps_account(struct mem_siz
 
 #ifdef CONFIG_SHMEM
 static int smaps_pte_hole(unsigned long addr, unsigned long end,
-		struct mm_walk *walk)
+			  __always_unused int depth, struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 
@@ -1282,7 +1282,7 @@ static int add_to_pagemap(unsigned long
 }
 
 static int pagemap_pte_hole(unsigned long start, unsigned long end,
-				struct mm_walk *walk)
+			    __always_unused int depth, struct mm_walk *walk)
 {
 	struct pagemapread *pm = walk->private;
 	unsigned long addr = start;
--- a/include/linux/pagewalk.h~mm-pagewalk-add-depth-parameter-to-pte_hole
+++ a/include/linux/pagewalk.h
@@ -17,7 +17,10 @@ struct mm_walk;
  *			split_huge_page() instead of handling it explicitly.
  * @pte_entry:		if set, called for each non-empty PTE (lowest-level)
  *			entry
- * @pte_hole:		if set, called for each hole at all levels
+ * @pte_hole:		if set, called for each hole at all levels,
+ *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD
+ *			4:PTE. Any folded depths (where PTRS_PER_P?D is equal
+ *			to 1) are skipped.
  * @hugetlb_entry:	if set, called for each hugetlb entry
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
@@ -43,7 +46,7 @@ struct mm_walk_ops {
 	int (*pte_entry)(pte_t *pte, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_hole)(unsigned long addr, unsigned long next,
-			struct mm_walk *walk);
+			int depth, struct mm_walk *walk);
 	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
 			     unsigned long addr, unsigned long next,
 			     struct mm_walk *walk);
--- a/mm/hmm.c~mm-pagewalk-add-depth-parameter-to-pte_hole
+++ a/mm/hmm.c
@@ -186,7 +186,7 @@ static void hmm_range_need_fault(const s
 }
 
 static int hmm_vma_walk_hole(unsigned long addr, unsigned long end,
-			     struct mm_walk *walk)
+			     __always_unused int depth, struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
@@ -380,7 +380,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 again:
 	pmd = READ_ONCE(*pmdp);
 	if (pmd_none(pmd))
-		return hmm_vma_walk_hole(start, end, walk);
+		return hmm_vma_walk_hole(start, end, -1, walk);
 
 	if (thp_migration_supported() && is_pmd_migration_entry(pmd)) {
 		bool fault, write_fault;
@@ -487,7 +487,7 @@ static int hmm_vma_walk_pud(pud_t *pudp,
 
 	pud = READ_ONCE(*pudp);
 	if (pud_none(pud)) {
-		ret = hmm_vma_walk_hole(start, end, walk);
+		ret = hmm_vma_walk_hole(start, end, -1, walk);
 		goto out_unlock;
 	}
 
@@ -497,7 +497,7 @@ static int hmm_vma_walk_pud(pud_t *pudp,
 		bool fault, write_fault;
 
 		if (!pud_present(pud)) {
-			ret = hmm_vma_walk_hole(start, end, walk);
+			ret = hmm_vma_walk_hole(start, end, -1, walk);
 			goto out_unlock;
 		}
 
--- a/mm/migrate.c~mm-pagewalk-add-depth-parameter-to-pte_hole
+++ a/mm/migrate.c
@@ -2151,6 +2151,7 @@ out_unlock:
 #ifdef CONFIG_DEVICE_PRIVATE
 static int migrate_vma_collect_hole(unsigned long start,
 				    unsigned long end,
+				    __always_unused int depth,
 				    struct mm_walk *walk)
 {
 	struct migrate_vma *migrate = walk->private;
@@ -2195,7 +2196,7 @@ static int migrate_vma_collect_pmd(pmd_t
 
 again:
 	if (pmd_none(*pmdp))
-		return migrate_vma_collect_hole(start, end, walk);
+		return migrate_vma_collect_hole(start, end, -1, walk);
 
 	if (pmd_trans_huge(*pmdp)) {
 		struct page *page;
@@ -2228,7 +2229,7 @@ again:
 				return migrate_vma_collect_skip(start, end,
 								walk);
 			if (pmd_none(*pmdp))
-				return migrate_vma_collect_hole(start, end,
+				return migrate_vma_collect_hole(start, end, -1,
 								walk);
 		}
 	}
--- a/mm/mincore.c~mm-pagewalk-add-depth-parameter-to-pte_hole
+++ a/mm/mincore.c
@@ -112,6 +112,7 @@ static int __mincore_unmapped_range(unsi
 }
 
 static int mincore_unmapped_range(unsigned long addr, unsigned long end,
+				   __always_unused int depth,
 				   struct mm_walk *walk)
 {
 	walk->private += __mincore_unmapped_range(addr, end,
--- a/mm/pagewalk.c~mm-pagewalk-add-depth-parameter-to-pte_hole
+++ a/mm/pagewalk.c
@@ -4,6 +4,22 @@
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
 
+/*
+ * We want to know the real level where an entry is located, ignoring any
+ * folding of levels which may be happening. For example if p4d is folded then
+ * a missing entry found at level 1 (p4d) is actually at level 0 (pgd).
+ */
+static int real_depth(int depth)
+{
+	if (depth == 3 && PTRS_PER_PMD == 1)
+		depth = 2;
+	if (depth == 2 && PTRS_PER_PUD == 1)
+		depth = 1;
+	if (depth == 1 && PTRS_PER_P4D == 1)
+		depth = 0;
+	return depth;
+}
+
 static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 {
@@ -49,6 +65,7 @@ static int walk_pmd_range(pud_t *pud, un
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
+	int depth = real_depth(3);
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -56,7 +73,7 @@ again:
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
-				err = ops->pte_hole(addr, next, walk);
+				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
 			continue;
@@ -106,6 +123,7 @@ static int walk_pud_range(p4d_t *p4d, un
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
+	int depth = real_depth(2);
 
 	pud = pud_offset(p4d, addr);
 	do {
@@ -113,7 +131,7 @@ static int walk_pud_range(p4d_t *p4d, un
 		next = pud_addr_end(addr, end);
 		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
-				err = ops->pte_hole(addr, next, walk);
+				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
 			continue;
@@ -154,13 +172,14 @@ static int walk_p4d_range(pgd_t *pgd, un
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
+	int depth = real_depth(1);
 
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d)) {
 			if (ops->pte_hole)
-				err = ops->pte_hole(addr, next, walk);
+				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
 				break;
 			continue;
@@ -192,7 +211,7 @@ static int walk_pgd_range(unsigned long
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
 			if (ops->pte_hole)
-				err = ops->pte_hole(addr, next, walk);
+				err = ops->pte_hole(addr, next, 0, walk);
 			if (err)
 				break;
 			continue;
@@ -239,7 +258,7 @@ static int walk_hugetlb_range(unsigned l
 		if (pte)
 			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
 		else if (ops->pte_hole)
-			err = ops->pte_hole(addr, next, walk);
+			err = ops->pte_hole(addr, next, -1, walk);
 
 		if (err)
 			break;
@@ -283,7 +302,7 @@ static int walk_page_test(unsigned long
 	if (vma->vm_flags & VM_PFNMAP) {
 		int err = 1;
 		if (ops->pte_hole)
-			err = ops->pte_hole(start, end, walk);
+			err = ops->pte_hole(start, end, -1, walk);
 		return err ? err : 1;
 	}
 	return 0;
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 39/67] x86: mm: point to struct seq_file from struct pg_state
  2020-02-04  1:33 incoming Andrew Morton
                   ` (37 preceding siblings ...)
  2020-02-04  1:36 ` [patch 38/67] mm: pagewalk: add 'depth' parameter to pte_hole Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:36 ` [patch 40/67] x86: mm+efi: convert ptdump_walk_pgd_level() to take a mm_struct Andrew Morton
                   ` (199 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: x86: mm: point to struct seq_file from struct pg_state

mm/dump_pagetables.c passes both struct seq_file and struct pg_state down
the chain of walk_*_level() functions so that they can be passed to
note_page().  Instead, place the struct seq_file in struct pg_state (which
is private to this file) and have note_page() access it from there.
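
A condensed illustration of the pattern (a cut-down pg_state, not the full
struct): the output sink rides along in the walker's private state, so the
intermediate walk_*_level() functions no longer need a seq_file parameter.

#include <linux/seq_file.h>

struct pg_state {
	int level;
	struct seq_file *seq;	/* stashed once, before the walk starts */
};

static void note_page(struct pg_state *st, unsigned long val)
{
	struct seq_file *m = st->seq;	/* no seq_file argument needed */

	seq_printf(m, "level %d: %#lx\n", st->level, val);
}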

Link: http://lkml.kernel.org/r/20191218162402.45610-17-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/mm/dump_pagetables.c |   69 ++++++++++++++++----------------
 1 file changed, 35 insertions(+), 34 deletions(-)

--- a/arch/x86/mm/dump_pagetables.c~x86-mm-point-to-struct-seq_file-from-struct-pg_state
+++ a/arch/x86/mm/dump_pagetables.c
@@ -36,6 +36,7 @@ struct pg_state {
 	bool to_dmesg;
 	bool check_wx;
 	unsigned long wx_pages;
+	struct seq_file *seq;
 };
 
 struct addr_marker {
@@ -265,11 +266,12 @@ static void note_wx(struct pg_state *st)
  * of PTE entries; the next one is different so we need to
  * print what we collected so far.
  */
-static void note_page(struct seq_file *m, struct pg_state *st,
-		      pgprot_t new_prot, pgprotval_t new_eff, int level)
+static void note_page(struct pg_state *st, pgprot_t new_prot,
+		      pgprotval_t new_eff, int level)
 {
 	pgprotval_t prot, cur, eff;
 	static const char units[] = "BKMGTPE";
+	struct seq_file *m = st->seq;
 
 	/*
 	 * If we have a "break" in the series, we need to flush the state that
@@ -354,8 +356,8 @@ static inline pgprotval_t effective_prot
 	       ((prot1 | prot2) & _PAGE_NX);
 }
 
-static void walk_pte_level(struct seq_file *m, struct pg_state *st, pmd_t addr,
-			   pgprotval_t eff_in, unsigned long P)
+static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
+			   unsigned long P)
 {
 	int i;
 	pte_t *pte;
@@ -366,7 +368,7 @@ static void walk_pte_level(struct seq_fi
 		pte = pte_offset_map(&addr, st->current_address);
 		prot = pte_flags(*pte);
 		eff = effective_prot(eff_in, prot);
-		note_page(m, st, __pgprot(prot), eff, 5);
+		note_page(st, __pgprot(prot), eff, 5);
 		pte_unmap(pte);
 	}
 }
@@ -379,22 +381,20 @@ static void walk_pte_level(struct seq_fi
  * us dozens of seconds (minutes for 5-level config) while checking for
  * W+X mapping or reading kernel_page_tables debugfs file.
  */
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
-				void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
 {
 	if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
 	    (pgtable_l5_enabled() &&
 			__pa(pt) == __pa(kasan_early_shadow_p4d)) ||
 	    __pa(pt) == __pa(kasan_early_shadow_pud)) {
 		pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]);
-		note_page(m, st, __pgprot(prot), 0, 5);
+		note_page(st, __pgprot(prot), 0, 5);
 		return true;
 	}
 	return false;
 }
 #else
-static inline bool kasan_page_table(struct seq_file *m, struct pg_state *st,
-				void *pt)
+static inline bool kasan_page_table(struct pg_state *st, void *pt)
 {
 	return false;
 }
@@ -402,7 +402,7 @@ static inline bool kasan_page_table(stru
 
 #if PTRS_PER_PMD > 1
 
-static void walk_pmd_level(struct seq_file *m, struct pg_state *st, pud_t addr,
+static void walk_pmd_level(struct pg_state *st, pud_t addr,
 			   pgprotval_t eff_in, unsigned long P)
 {
 	int i;
@@ -416,27 +416,27 @@ static void walk_pmd_level(struct seq_fi
 			prot = pmd_flags(*start);
 			eff = effective_prot(eff_in, prot);
 			if (pmd_large(*start) || !pmd_present(*start)) {
-				note_page(m, st, __pgprot(prot), eff, 4);
-			} else if (!kasan_page_table(m, st, pmd_start)) {
-				walk_pte_level(m, st, *start, eff,
+				note_page(st, __pgprot(prot), eff, 4);
+			} else if (!kasan_page_table(st, pmd_start)) {
+				walk_pte_level(st, *start, eff,
 					       P + i * PMD_LEVEL_MULT);
 			}
 		} else
-			note_page(m, st, __pgprot(0), 0, 4);
+			note_page(st, __pgprot(0), 0, 4);
 		start++;
 	}
 }
 
 #else
-#define walk_pmd_level(m,s,a,e,p) walk_pte_level(m,s,__pmd(pud_val(a)),e,p)
+#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
 #define pud_large(a) pmd_large(__pmd(pud_val(a)))
 #define pud_none(a)  pmd_none(__pmd(pud_val(a)))
 #endif
 
 #if PTRS_PER_PUD > 1
 
-static void walk_pud_level(struct seq_file *m, struct pg_state *st, p4d_t addr,
-			   pgprotval_t eff_in, unsigned long P)
+static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
+			   unsigned long P)
 {
 	int i;
 	pud_t *start, *pud_start;
@@ -450,33 +450,33 @@ static void walk_pud_level(struct seq_fi
 			prot = pud_flags(*start);
 			eff = effective_prot(eff_in, prot);
 			if (pud_large(*start) || !pud_present(*start)) {
-				note_page(m, st, __pgprot(prot), eff, 3);
-			} else if (!kasan_page_table(m, st, pud_start)) {
-				walk_pmd_level(m, st, *start, eff,
+				note_page(st, __pgprot(prot), eff, 3);
+			} else if (!kasan_page_table(st, pud_start)) {
+				walk_pmd_level(st, *start, eff,
 					       P + i * PUD_LEVEL_MULT);
 			}
 		} else
-			note_page(m, st, __pgprot(0), 0, 3);
+			note_page(st, __pgprot(0), 0, 3);
 
 		start++;
 	}
 }
 
 #else
-#define walk_pud_level(m,s,a,e,p) walk_pmd_level(m,s,__pud(p4d_val(a)),e,p)
+#define walk_pud_level(s,a,e,p) walk_pmd_level(s,__pud(p4d_val(a)),e,p)
 #define p4d_large(a) pud_large(__pud(p4d_val(a)))
 #define p4d_none(a)  pud_none(__pud(p4d_val(a)))
 #endif
 
-static void walk_p4d_level(struct seq_file *m, struct pg_state *st, pgd_t addr,
-			   pgprotval_t eff_in, unsigned long P)
+static void walk_p4d_level(struct pg_state *st, pgd_t addr, pgprotval_t eff_in,
+			   unsigned long P)
 {
 	int i;
 	p4d_t *start, *p4d_start;
 	pgprotval_t prot, eff;
 
 	if (PTRS_PER_P4D == 1)
-		return walk_pud_level(m, st, __p4d(pgd_val(addr)), eff_in, P);
+		return walk_pud_level(st, __p4d(pgd_val(addr)), eff_in, P);
 
 	p4d_start = start = (p4d_t *)pgd_page_vaddr(addr);
 
@@ -486,13 +486,13 @@ static void walk_p4d_level(struct seq_fi
 			prot = p4d_flags(*start);
 			eff = effective_prot(eff_in, prot);
 			if (p4d_large(*start) || !p4d_present(*start)) {
-				note_page(m, st, __pgprot(prot), eff, 2);
-			} else if (!kasan_page_table(m, st, p4d_start)) {
-				walk_pud_level(m, st, *start, eff,
+				note_page(st, __pgprot(prot), eff, 2);
+			} else if (!kasan_page_table(st, p4d_start)) {
+				walk_pud_level(st, *start, eff,
 					       P + i * P4D_LEVEL_MULT);
 			}
 		} else
-			note_page(m, st, __pgprot(0), 0, 2);
+			note_page(st, __pgprot(0), 0, 2);
 
 		start++;
 	}
@@ -529,6 +529,7 @@ static void ptdump_walk_pgd_level_core(s
 	}
 
 	st.check_wx = checkwx;
+	st.seq = m;
 	if (checkwx)
 		st.wx_pages = 0;
 
@@ -542,13 +543,13 @@ static void ptdump_walk_pgd_level_core(s
 			eff = prot;
 #endif
 			if (pgd_large(*start) || !pgd_present(*start)) {
-				note_page(m, &st, __pgprot(prot), eff, 1);
+				note_page(&st, __pgprot(prot), eff, 1);
 			} else {
-				walk_p4d_level(m, &st, *start, eff,
+				walk_p4d_level(&st, *start, eff,
 					       i * PGD_LEVEL_MULT);
 			}
 		} else
-			note_page(m, &st, __pgprot(0), 0, 1);
+			note_page(&st, __pgprot(0), 0, 1);
 
 		cond_resched();
 		start++;
@@ -556,7 +557,7 @@ static void ptdump_walk_pgd_level_core(s
 
 	/* Flush out the last page */
 	st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
-	note_page(m, &st, __pgprot(0), 0, 0);
+	note_page(&st, __pgprot(0), 0, 0);
 	if (!checkwx)
 		return;
 	if (st.wx_pages)
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 40/67] x86: mm+efi: convert ptdump_walk_pgd_level() to take a mm_struct
  2020-02-04  1:33 incoming Andrew Morton
                   ` (38 preceding siblings ...)
  2020-02-04  1:36 ` [patch 39/67] x86: mm: point to struct seq_file from struct pg_state Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:36 ` [patch 41/67] x86: mm: convert ptdump_walk_pgd_level_debugfs() to take an mm_struct Andrew Morton
                   ` (198 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: x86: mm+efi: convert ptdump_walk_pgd_level() to take a mm_struct

To enable x86 to use the generic walk_page_range() function, the callers
of ptdump_walk_pgd_level() need to pass an mm_struct rather than the raw
pgd_t pointer.  Luckily since commit 7e904a91bf60 ("efi: Use efi_mm in x86
as well as ARM") we now have an mm_struct for EFI on x86.

Link: http://lkml.kernel.org/r/20191218162402.45610-18-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/include/asm/pgtable.h |    2 +-
 arch/x86/mm/dump_pagetables.c  |    4 ++--
 arch/x86/platform/efi/efi_32.c |    2 +-
 arch/x86/platform/efi/efi_64.c |    4 ++--
 4 files changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/pgtable.h~x86-mmefi-convert-ptdump_walk_pgd_level-to-take-a-mm_struct
+++ a/arch/x86/include/asm/pgtable.h
@@ -29,7 +29,7 @@
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
 void ptdump_walk_pgd_level_checkwx(void);
 void ptdump_walk_user_pgd_level_checkwx(void);
--- a/arch/x86/mm/dump_pagetables.c~x86-mmefi-convert-ptdump_walk_pgd_level-to-take-a-mm_struct
+++ a/arch/x86/mm/dump_pagetables.c
@@ -567,9 +567,9 @@ static void ptdump_walk_pgd_level_core(s
 		pr_info("x86/mm: Checked W+X mappings: passed, no W+X pages found.\n");
 }
 
-void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
+void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
 {
-	ptdump_walk_pgd_level_core(m, pgd, false, true);
+	ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
 }
 
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
--- a/arch/x86/platform/efi/efi_32.c~x86-mmefi-convert-ptdump_walk_pgd_level-to-take-a-mm_struct
+++ a/arch/x86/platform/efi/efi_32.c
@@ -49,7 +49,7 @@ void efi_sync_low_kernel_mappings(void)
 void __init efi_dump_pagetable(void)
 {
 #ifdef CONFIG_EFI_PGT_DUMP
-	ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+	ptdump_walk_pgd_level(NULL, &init_mm);
 #endif
 }
 
--- a/arch/x86/platform/efi/efi_64.c~x86-mmefi-convert-ptdump_walk_pgd_level-to-take-a-mm_struct
+++ a/arch/x86/platform/efi/efi_64.c
@@ -471,9 +471,9 @@ void __init efi_dump_pagetable(void)
 {
 #ifdef CONFIG_EFI_PGT_DUMP
 	if (efi_have_uv1_memmap())
-		ptdump_walk_pgd_level(NULL, swapper_pg_dir);
+		ptdump_walk_pgd_level(NULL, &init_mm);
 	else
-		ptdump_walk_pgd_level(NULL, efi_mm.pgd);
+		ptdump_walk_pgd_level(NULL, &efi_mm);
 #endif
 }
 
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 41/67] x86: mm: convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
  2020-02-04  1:33 incoming Andrew Morton
                   ` (39 preceding siblings ...)
  2020-02-04  1:36 ` [patch 40/67] x86: mm+efi: convert ptdump_walk_pgd_level() to take a mm_struct Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:36 ` [patch 42/67] mm: add generic ptdump Andrew Morton
                   ` (197 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: x86: mm: convert ptdump_walk_pgd_level_debugfs() to take an mm_struct

To enable x86 to use the generic walk_page_range() function, the callers
of ptdump_walk_pgd_level_debugfs() need to pass in the mm_struct.

This means that ptdump_walk_pgd_level_core() is now always passed a valid
pgd, so drop the support for pgd==NULL.

Link: http://lkml.kernel.org/r/20191218162402.45610-19-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/include/asm/pgtable.h |    3 ++-
 arch/x86/mm/debug_pagetables.c |    8 ++++----
 arch/x86/mm/dump_pagetables.c  |   14 ++++++--------
 3 files changed, 12 insertions(+), 13 deletions(-)

--- a/arch/x86/include/asm/pgtable.h~x86-mm-convert-ptdump_walk_pgd_level_debugfs-to-take-an-mm_struct
+++ a/arch/x86/include/asm/pgtable.h
@@ -30,7 +30,8 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD]
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
 void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm);
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
+				   bool user);
 void ptdump_walk_pgd_level_checkwx(void);
 void ptdump_walk_user_pgd_level_checkwx(void);
 
--- a/arch/x86/mm/debug_pagetables.c~x86-mm-convert-ptdump_walk_pgd_level_debugfs-to-take-an-mm_struct
+++ a/arch/x86/mm/debug_pagetables.c
@@ -7,7 +7,7 @@
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
-	ptdump_walk_pgd_level_debugfs(m, NULL, false);
+	ptdump_walk_pgd_level_debugfs(m, &init_mm, false);
 	return 0;
 }
 
@@ -17,7 +17,7 @@ static int ptdump_curknl_show(struct seq
 {
 	if (current->mm->pgd) {
 		down_read(&current->mm->mmap_sem);
-		ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, false);
+		ptdump_walk_pgd_level_debugfs(m, current->mm, false);
 		up_read(&current->mm->mmap_sem);
 	}
 	return 0;
@@ -30,7 +30,7 @@ static int ptdump_curusr_show(struct seq
 {
 	if (current->mm->pgd) {
 		down_read(&current->mm->mmap_sem);
-		ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, true);
+		ptdump_walk_pgd_level_debugfs(m, current->mm, true);
 		up_read(&current->mm->mmap_sem);
 	}
 	return 0;
@@ -43,7 +43,7 @@ DEFINE_SHOW_ATTRIBUTE(ptdump_curusr);
 static int ptdump_efi_show(struct seq_file *m, void *v)
 {
 	if (efi_mm.pgd)
-		ptdump_walk_pgd_level_debugfs(m, efi_mm.pgd, false);
+		ptdump_walk_pgd_level_debugfs(m, &efi_mm, false);
 	return 0;
 }
 
--- a/arch/x86/mm/dump_pagetables.c~x86-mm-convert-ptdump_walk_pgd_level_debugfs-to-take-an-mm_struct
+++ a/arch/x86/mm/dump_pagetables.c
@@ -518,16 +518,12 @@ static inline bool is_hypervisor_range(i
 static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
 				       bool checkwx, bool dmesg)
 {
-	pgd_t *start = INIT_PGD;
+	pgd_t *start = pgd;
 	pgprotval_t prot, eff;
 	int i;
 	struct pg_state st = {};
 
-	if (pgd) {
-		start = pgd;
-		st.to_dmesg = dmesg;
-	}

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 42/67] mm: add generic ptdump
  2020-02-04  1:33 incoming Andrew Morton
                   ` (40 preceding siblings ...)
  2020-02-04  1:36 ` [patch 41/67] x86: mm: convert ptdump_walk_pgd_level_debugfs() to take an mm_struct Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:36 ` [patch 43/67] x86: mm: convert dump_pagetables to use walk_page_range Andrew Morton
                   ` (196 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mm: add generic ptdump

Add a generic version of page table dumping that architectures can opt in
to.
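
A minimal sketch of how an architecture might opt in (all names here are
hypothetical; the level numbering matches mm/ptdump.c below, 1:PGD ..
5:PTE, with 0 used for the final flush).  The architecture would also
select GENERIC_PTDUMP in its Kconfig, as x86 does in the next patch.

#include <linux/mm.h>
#include <linux/ptdump.h>

static void example_note_page(struct ptdump_state *st, unsigned long addr,
			      int level, unsigned long val)
{
	/* decode 'val' for this level; val == 0 marks a hole */
	pr_info("%#lx: level %d entry %#lx\n", addr, level, val);
}

static const struct ptdump_range example_ranges[] = {
	{PAGE_OFFSET,	~0UL},
	{0,		0},	/* start == end terminates the walk */
};

void example_ptdump_show(void)
{
	struct ptdump_state st = {
		.note_page	= example_note_page,
		.range		= example_ranges,
	};

	ptdump_walk_pgd(&st, &init_mm);	/* takes mmap_sem itself */
}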

Link: http://lkml.kernel.org/r/20191218162402.45610-20-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/ptdump.h |   21 +++++
 mm/Kconfig.debug       |   21 +++++
 mm/Makefile            |    1 
 mm/ptdump.c            |  139 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 182 insertions(+)

--- /dev/null
+++ a/include/linux/ptdump.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_PTDUMP_H
+#define _LINUX_PTDUMP_H
+
+#include <linux/mm_types.h>
+
+struct ptdump_range {
+	unsigned long start;
+	unsigned long end;
+};
+
+struct ptdump_state {
+	void (*note_page)(struct ptdump_state *st, unsigned long addr,
+			  int level, unsigned long val);
+	const struct ptdump_range *range;
+};
+
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm);
+
+#endif /* _LINUX_PTDUMP_H */
--- a/mm/Kconfig.debug~mm-add-generic-ptdump
+++ a/mm/Kconfig.debug
@@ -117,3 +117,24 @@ config DEBUG_RODATA_TEST
     depends on STRICT_KERNEL_RWX
     ---help---
       This option enables a testcase for the setting rodata read-only.
+
+config GENERIC_PTDUMP
+	bool
+
+config PTDUMP_CORE
+	bool
+
+config PTDUMP_DEBUGFS
+	bool "Export kernel pagetable layout to userspace via debugfs"
+	depends on DEBUG_KERNEL
+	depends on DEBUG_FS
+	depends on GENERIC_PTDUMP
+	select PTDUMP_CORE
+	help
+	  Say Y here if you want to show the kernel pagetable layout in a
+	  debugfs file. This information is only useful for kernel developers
+	  who are working in architecture specific areas of the kernel.
+	  It is probably not a good idea to enable this feature in a production
+	  kernel.
+
+	  If in doubt, say N.
--- a/mm/Makefile~mm-add-generic-ptdump
+++ a/mm/Makefile
@@ -109,3 +109,4 @@ obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
+obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
--- /dev/null
+++ a/mm/ptdump.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/pagewalk.h>
+#include <linux/ptdump.h>
+#include <linux/kasan.h>
+
+#ifdef CONFIG_KASAN
+/*
+ * This is an optimization for KASAN=y case. Since all kasan page tables
+ * eventually point to the kasan_early_shadow_page we could call note_page()
+ * right away without walking through lower level page tables. This saves
+ * us dozens of seconds (minutes for 5-level config) while checking for
+ * W+X mapping or reading kernel_page_tables debugfs file.
+ */
+static inline int note_kasan_page_table(struct mm_walk *walk,
+					unsigned long addr)
+{
+	struct ptdump_state *st = walk->private;
+
+	st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
+
+	walk->action = ACTION_CONTINUE;
+
+	return 0;
+}
+#endif
+
+static int ptdump_pgd_entry(pgd_t *pgd, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+	pgd_t val = READ_ONCE(*pgd);
+
+#if CONFIG_PGTABLE_LEVELS > 4 && defined(CONFIG_KASAN)
+	if (pgd_page(val) == virt_to_page(lm_alias(kasan_early_shadow_p4d)))
+		return note_kasan_page_table(walk, addr);
+#endif
+
+	if (pgd_leaf(val))
+		st->note_page(st, addr, 1, pgd_val(val));
+
+	return 0;
+}
+
+static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+	p4d_t val = READ_ONCE(*p4d);
+
+#if CONFIG_PGTABLE_LEVELS > 3 && defined(CONFIG_KASAN)
+	if (p4d_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pud)))
+		return note_kasan_page_table(walk, addr);
+#endif
+
+	if (p4d_leaf(val))
+		st->note_page(st, addr, 2, p4d_val(val));
+
+	return 0;
+}
+
+static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+	pud_t val = READ_ONCE(*pud);
+
+#if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_KASAN)
+	if (pud_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pmd)))
+		return note_kasan_page_table(walk, addr);
+#endif
+
+	if (pud_leaf(val))
+		st->note_page(st, addr, 3, pud_val(val));
+
+	return 0;
+}
+
+static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+	pmd_t val = READ_ONCE(*pmd);
+
+#if defined(CONFIG_KASAN)
+	if (pmd_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pte)))
+		return note_kasan_page_table(walk, addr);
+#endif
+
+	if (pmd_leaf(val))
+		st->note_page(st, addr, 4, pmd_val(val));
+
+	return 0;
+}
+
+static int ptdump_pte_entry(pte_t *pte, unsigned long addr,
+			    unsigned long next, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+
+	st->note_page(st, addr, 5, pte_val(READ_ONCE(*pte)));
+
+	return 0;
+}
+
+static int ptdump_hole(unsigned long addr, unsigned long next,
+		       int depth, struct mm_walk *walk)
+{
+	struct ptdump_state *st = walk->private;
+
+	st->note_page(st, addr, depth + 1, 0);
+
+	return 0;
+}
+
+static const struct mm_walk_ops ptdump_ops = {
+	.pgd_entry	= ptdump_pgd_entry,
+	.p4d_entry	= ptdump_p4d_entry,
+	.pud_entry	= ptdump_pud_entry,
+	.pmd_entry	= ptdump_pmd_entry,
+	.pte_entry	= ptdump_pte_entry,
+	.pte_hole	= ptdump_hole,
+};
+
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
+{
+	const struct ptdump_range *range = st->range;
+
+	down_read(&mm->mmap_sem);
+	while (range->start != range->end) {
+		walk_page_range_novma(mm, range->start, range->end,
+				      &ptdump_ops, st);
+		range++;
+	}
+	up_read(&mm->mmap_sem);
+
+	/* Flush out the last page */
+	st->note_page(st, 0, 0, 0);
+}
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 43/67] x86: mm: convert dump_pagetables to use walk_page_range
  2020-02-04  1:33 incoming Andrew Morton
                   ` (41 preceding siblings ...)
  2020-02-04  1:36 ` [patch 42/67] mm: add generic ptdump Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:36 ` [patch 44/67] arm64: mm: convert mm/dump.c to use walk_page_range() Andrew Morton
                   ` (195 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: x86: mm: convert dump_pagetables to use walk_page_range

Make use of the new functionality in walk_page_range to remove the arch
page walking code and use the generic code to walk the page tables.

The effective permissions are passed down the chain using new fields in
struct pg_state.

The KASAN optimisation is implemented by setting walk->action to
ACTION_CONTINUE in the callbacks to skip an entire tree of entries.
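
The skip mechanism, condensed from the generic callbacks in mm/ptdump.c
(previous patch): when a callback recognises a shared KASAN early-shadow
table it notes the whole range once and tells the walker not to descend.

static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
			    unsigned long next, struct mm_walk *walk)
{
	struct ptdump_state *st = walk->private;
	pmd_t val = READ_ONCE(*pmd);

	if (pmd_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pte))) {
		/* note the range once at PTE level... */
		st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
		/* ...and skip the PTEs underneath */
		walk->action = ACTION_CONTINUE;
		return 0;
	}

	if (pmd_leaf(val))
		st->note_page(st, addr, 4, pmd_val(val));

	return 0;
}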

Link: http://lkml.kernel.org/r/20191218162402.45610-21-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/Kconfig              |    1 
 arch/x86/Kconfig.debug        |   20 --
 arch/x86/mm/Makefile          |    4 
 arch/x86/mm/dump_pagetables.c |  294 +++++++-------------------------
 4 files changed, 71 insertions(+), 248 deletions(-)

--- a/arch/x86/Kconfig~x86-mm-convert-dump_pagetables-to-use-walk_page_range
+++ a/arch/x86/Kconfig
@@ -120,6 +120,7 @@ config X86
 	select GENERIC_IRQ_RESERVATION_MODE
 	select GENERIC_IRQ_SHOW
 	select GENERIC_PENDING_IRQ		if SMP
+	select GENERIC_PTDUMP
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
--- a/arch/x86/Kconfig.debug~x86-mm-convert-dump_pagetables-to-use-walk_page_range
+++ a/arch/x86/Kconfig.debug
@@ -62,26 +62,10 @@ config EARLY_PRINTK_USB_XDBC
 config MCSAFE_TEST
 	def_bool n
 
-config X86_PTDUMP_CORE
-	def_bool n
-
-config X86_PTDUMP
-	tristate "Export kernel pagetable layout to userspace via debugfs"
-	depends on DEBUG_KERNEL
-	select DEBUG_FS
-	select X86_PTDUMP_CORE
-	---help---
-	  Say Y here if you want to show the kernel pagetable layout in a
-	  debugfs file. This information is only useful for kernel developers
-	  who are working in architecture specific areas of the kernel.
-	  It is probably not a good idea to enable this feature in a production
-	  kernel.
-	  If in doubt, say "N"
-
 config EFI_PGT_DUMP
 	bool "Dump the EFI pagetable"
 	depends on EFI
-	select X86_PTDUMP_CORE
+	select PTDUMP_CORE
 	---help---
 	  Enable this if you want to dump the EFI page table before
 	  enabling virtual mode. This can be used to debug miscellaneous
@@ -90,7 +74,7 @@ config EFI_PGT_DUMP
 
 config DEBUG_WX
 	bool "Warn on W+X mappings at boot"
-	select X86_PTDUMP_CORE
+	select PTDUMP_CORE
 	---help---
 	  Generate a warning if any W+X mappings are found at boot.
 
--- a/arch/x86/mm/dump_pagetables.c~x86-mm-convert-dump_pagetables-to-use-walk_page_range
+++ a/arch/x86/mm/dump_pagetables.c
@@ -16,6 +16,7 @@
 #include <linux/seq_file.h>
 #include <linux/highmem.h>
 #include <linux/pci.h>
+#include <linux/ptdump.h>
 
 #include <asm/e820/types.h>
 #include <asm/pgtable.h>
@@ -26,11 +27,12 @@
  * when a "break" in the continuity is found.
  */
 struct pg_state {
+	struct ptdump_state ptdump;
 	int level;
-	pgprot_t current_prot;
+	pgprotval_t current_prot;
 	pgprotval_t effective_prot;
+	pgprotval_t prot_levels[5];
 	unsigned long start_address;
-	unsigned long current_address;
 	const struct addr_marker *marker;
 	unsigned long lines;
 	bool to_dmesg;
@@ -175,9 +177,8 @@ static struct addr_marker address_marker
 /*
  * Print a readable form of a pgprot_t to the seq_file
  */
-static void printk_prot(struct seq_file *m, pgprot_t prot, int level, bool dmsg)
+static void printk_prot(struct seq_file *m, pgprotval_t pr, int level, bool dmsg)
 {
-	pgprotval_t pr = pgprot_val(prot);
 	static const char * const level_name[] =
 		{ "cr3", "pgd", "p4d", "pud", "pmd", "pte" };
 
@@ -224,24 +225,11 @@ static void printk_prot(struct seq_file
 	pt_dump_cont_printf(m, dmsg, "%s\n", level_name[level]);
 }
 
-/*
- * On 64 bits, sign-extend the 48 bit address to 64 bit
- */
-static unsigned long normalize_addr(unsigned long u)
-{
-	int shift;
-	if (!IS_ENABLED(CONFIG_X86_64))
-		return u;
-
-	shift = 64 - (__VIRTUAL_MASK_SHIFT + 1);
-	return (signed long)(u << shift) >> shift;
-}
-
-static void note_wx(struct pg_state *st)
+static void note_wx(struct pg_state *st, unsigned long addr)
 {
 	unsigned long npages;
 
-	npages = (st->current_address - st->start_address) / PAGE_SIZE;
+	npages = (addr - st->start_address) / PAGE_SIZE;
 
 #ifdef CONFIG_PCI_BIOS
 	/*
@@ -249,7 +237,7 @@ static void note_wx(struct pg_state *st)
 	 * Inform about it, but avoid the warning.
 	 */
 	if (pcibios_enabled && st->start_address >= PAGE_OFFSET + BIOS_BEGIN &&
-	    st->current_address <= PAGE_OFFSET + BIOS_END) {
+	    addr <= PAGE_OFFSET + BIOS_END) {
 		pr_warn_once("x86/mm: PCI BIOS W+X mapping %lu pages\n", npages);
 		return;
 	}
@@ -261,25 +249,44 @@ static void note_wx(struct pg_state *st)
 		  (void *)st->start_address);
 }
 
+static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
+{
+	return (prot1 & prot2 & (_PAGE_USER | _PAGE_RW)) |
+	       ((prot1 | prot2) & _PAGE_NX);
+}
+
 /*
  * This function gets called on a break in a continuous series
  * of PTE entries; the next one is different so we need to
  * print what we collected so far.
  */
-static void note_page(struct pg_state *st, pgprot_t new_prot,
-		      pgprotval_t new_eff, int level)
+static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+		      unsigned long val)
 {
-	pgprotval_t prot, cur, eff;
+	struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
+	pgprotval_t new_prot, new_eff;
+	pgprotval_t cur, eff;
 	static const char units[] = "BKMGTPE";
 	struct seq_file *m = st->seq;
 
+	new_prot = val & PTE_FLAGS_MASK;
+
+	if (level > 1) {
+		new_eff = effective_prot(st->prot_levels[level - 2],
+					 new_prot);
+	} else {
+		new_eff = new_prot;
+	}
+
+	if (level > 0)
+		st->prot_levels[level - 1] = new_eff;
+
 	/*
 	 * If we have a "break" in the series, we need to flush the state that
 	 * we have now. "break" is either changing perms, levels or
 	 * address space marker.
 	 */
-	prot = pgprot_val(new_prot);
-	cur = pgprot_val(st->current_prot);
+	cur = st->current_prot;
 	eff = st->effective_prot;
 
 	if (!st->level) {
@@ -291,14 +298,14 @@ static void note_page(struct pg_state *s
 		st->lines = 0;
 		pt_dump_seq_printf(m, st->to_dmesg, "---[ %s ]---\n",
 				   st->marker->name);
-	} else if (prot != cur || new_eff != eff || level != st->level ||
-		   st->current_address >= st->marker[1].start_address) {
+	} else if (new_prot != cur || new_eff != eff || level != st->level ||
+		   addr >= st->marker[1].start_address) {
 		const char *unit = units;
 		unsigned long delta;
 		int width = sizeof(unsigned long) * 2;
 
 		if (st->check_wx && (eff & _PAGE_RW) && !(eff & _PAGE_NX))
-			note_wx(st);
+			note_wx(st, addr);
 
 		/*
 		 * Now print the actual finished series
@@ -308,9 +315,9 @@ static void note_page(struct pg_state *s
 			pt_dump_seq_printf(m, st->to_dmesg,
 					   "0x%0*lx-0x%0*lx   ",
 					   width, st->start_address,
-					   width, st->current_address);
+					   width, addr);
 
-			delta = st->current_address - st->start_address;
+			delta = addr - st->start_address;
 			while (!(delta & 1023) && unit[1]) {
 				delta >>= 10;
 				unit++;
@@ -327,7 +334,7 @@ static void note_page(struct pg_state *s
 		 * such as the start of vmalloc space etc.
 		 * This helps in the interpretation.
 		 */
-		if (st->current_address >= st->marker[1].start_address) {
+		if (addr >= st->marker[1].start_address) {
 			if (st->marker->max_lines &&
 			    st->lines > st->marker->max_lines) {
 				unsigned long nskip =
@@ -343,217 +350,48 @@ static void note_page(struct pg_state *s
 					   st->marker->name);
 		}
 
-		st->start_address = st->current_address;
+		st->start_address = addr;
 		st->current_prot = new_prot;
 		st->effective_prot = new_eff;
 		st->level = level;
 	}
 }
 
-static inline pgprotval_t effective_prot(pgprotval_t prot1, pgprotval_t prot2)
-{
-	return (prot1 & prot2 & (_PAGE_USER | _PAGE_RW)) |
-	       ((prot1 | prot2) & _PAGE_NX);
-}
-
-static void walk_pte_level(struct pg_state *st, pmd_t addr, pgprotval_t eff_in,
-			   unsigned long P)
-{
-	int i;
-	pte_t *pte;
-	pgprotval_t prot, eff;
-
-	for (i = 0; i < PTRS_PER_PTE; i++) {
-		st->current_address = normalize_addr(P + i * PTE_LEVEL_MULT);
-		pte = pte_offset_map(&addr, st->current_address);
-		prot = pte_flags(*pte);
-		eff = effective_prot(eff_in, prot);
-		note_page(st, __pgprot(prot), eff, 5);
-		pte_unmap(pte);
-	}
-}
-#ifdef CONFIG_KASAN
-
-/*
- * This is an optimization for KASAN=y case. Since all kasan page tables
- * eventually point to the kasan_early_shadow_page we could call note_page()
- * right away without walking through lower level page tables. This saves
- * us dozens of seconds (minutes for 5-level config) while checking for
- * W+X mapping or reading kernel_page_tables debugfs file.
- */
-static inline bool kasan_page_table(struct pg_state *st, void *pt)
-{
-	if (__pa(pt) == __pa(kasan_early_shadow_pmd) ||
-	    (pgtable_l5_enabled() &&
-			__pa(pt) == __pa(kasan_early_shadow_p4d)) ||
-	    __pa(pt) == __pa(kasan_early_shadow_pud)) {
-		pgprotval_t prot = pte_flags(kasan_early_shadow_pte[0]);
-		note_page(st, __pgprot(prot), 0, 5);
-		return true;
-	}
-	return false;
-}
-#else
-static inline bool kasan_page_table(struct pg_state *st, void *pt)
-{
-	return false;
-}
-#endif
-
-#if PTRS_PER_PMD > 1
-
-static void walk_pmd_level(struct pg_state *st, pud_t addr,
-			   pgprotval_t eff_in, unsigned long P)
-{
-	int i;
-	pmd_t *start, *pmd_start;
-	pgprotval_t prot, eff;
-
-	pmd_start = start = (pmd_t *)pud_page_vaddr(addr);
-	for (i = 0; i < PTRS_PER_PMD; i++) {
-		st->current_address = normalize_addr(P + i * PMD_LEVEL_MULT);
-		if (!pmd_none(*start)) {
-			prot = pmd_flags(*start);
-			eff = effective_prot(eff_in, prot);
-			if (pmd_large(*start) || !pmd_present(*start)) {
-				note_page(st, __pgprot(prot), eff, 4);
-			} else if (!kasan_page_table(st, pmd_start)) {
-				walk_pte_level(st, *start, eff,
-					       P + i * PMD_LEVEL_MULT);
-			}
-		} else
-			note_page(st, __pgprot(0), 0, 4);
-		start++;
-	}
-}
-
-#else
-#define walk_pmd_level(s,a,e,p) walk_pte_level(s,__pmd(pud_val(a)),e,p)
-#define pud_large(a) pmd_large(__pmd(pud_val(a)))
-#define pud_none(a)  pmd_none(__pmd(pud_val(a)))
-#endif
-
-#if PTRS_PER_PUD > 1
-
-static void walk_pud_level(struct pg_state *st, p4d_t addr, pgprotval_t eff_in,
-			   unsigned long P)
-{
-	int i;
-	pud_t *start, *pud_start;
-	pgprotval_t prot, eff;
-
-	pud_start = start = (pud_t *)p4d_page_vaddr(addr);
-
-	for (i = 0; i < PTRS_PER_PUD; i++) {
-		st->current_address = normalize_addr(P + i * PUD_LEVEL_MULT);
-		if (!pud_none(*start)) {
-			prot = pud_flags(*start);
-			eff = effective_prot(eff_in, prot);
-			if (pud_large(*start) || !pud_present(*start)) {
-				note_page(st, __pgprot(prot), eff, 3);
-			} else if (!kasan_page_table(st, pud_start)) {
-				walk_pmd_level(st, *start, eff,
-					       P + i * PUD_LEVEL_MULT);
-			}
-		} else
-			note_page(st, __pgprot(0), 0, 3);
-
-		start++;
-	}
-}
-
-#else
-#define walk_pud_level(s,a,e,p) walk_pmd_level(s,__pud(p4d_val(a)),e,p)
-#define p4d_large(a) pud_large(__pud(p4d_val(a)))
-#define p4d_none(a)  pud_none(__pud(p4d_val(a)))
-#endif
-
-static void walk_p4d_level(struct pg_state *st, pgd_t addr, pgprotval_t eff_in,
-			   unsigned long P)
+static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
+				       bool checkwx, bool dmesg)
 {
-	int i;
-	p4d_t *start, *p4d_start;
-	pgprotval_t prot, eff;
-
-	if (PTRS_PER_P4D == 1)
-		return walk_pud_level(st, __p4d(pgd_val(addr)), eff_in, P);
-
-	p4d_start = start = (p4d_t *)pgd_page_vaddr(addr);
-
-	for (i = 0; i < PTRS_PER_P4D; i++) {
-		st->current_address = normalize_addr(P + i * P4D_LEVEL_MULT);
-		if (!p4d_none(*start)) {
-			prot = p4d_flags(*start);
-			eff = effective_prot(eff_in, prot);
-			if (p4d_large(*start) || !p4d_present(*start)) {
-				note_page(st, __pgprot(prot), eff, 2);
-			} else if (!kasan_page_table(st, p4d_start)) {
-				walk_pud_level(st, *start, eff,
-					       P + i * P4D_LEVEL_MULT);
-			}
-		} else
-			note_page(st, __pgprot(0), 0, 2);
-
-		start++;
-	}
-}
+	const struct ptdump_range ptdump_ranges[] = {
+#ifdef CONFIG_X86_64
 
-#define pgd_large(a) (pgtable_l5_enabled() ? pgd_large(a) : p4d_large(__p4d(pgd_val(a))))
-#define pgd_none(a)  (pgtable_l5_enabled() ? pgd_none(a) : p4d_none(__p4d(pgd_val(a))))
+#define normalize_addr_shift (64 - (__VIRTUAL_MASK_SHIFT + 1))
+#define normalize_addr(u) ((signed long)((u) << normalize_addr_shift) >> \
+			   normalize_addr_shift)
 
-static inline bool is_hypervisor_range(int idx)
-{
-#ifdef CONFIG_X86_64
-	/*
-	 * A hole in the beginning of kernel address space reserved
-	 * for a hypervisor.
-	 */
-	return	(idx >= pgd_index(GUARD_HOLE_BASE_ADDR)) &&
-		(idx <  pgd_index(GUARD_HOLE_END_ADDR));
+	{0, PTRS_PER_PGD * PGD_LEVEL_MULT / 2},
+	{normalize_addr(PTRS_PER_PGD * PGD_LEVEL_MULT / 2), ~0UL},
 #else
-	return false;
+	{0, ~0UL},
 #endif
-}
+	{0, 0}
+};
 
-static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
-				       bool checkwx, bool dmesg)
-{
-	pgd_t *start = pgd;
-	pgprotval_t prot, eff;
-	int i;
-	struct pg_state st = {};
-
-	st.to_dmesg = dmesg;
-	st.check_wx = checkwx;
-	st.seq = m;
-	if (checkwx)
-		st.wx_pages = 0;
-
-	for (i = 0; i < PTRS_PER_PGD; i++) {
-		st.current_address = normalize_addr(i * PGD_LEVEL_MULT);
-		if (!pgd_none(*start) && !is_hypervisor_range(i)) {
-			prot = pgd_flags(*start);
-#ifdef CONFIG_X86_PAE
-			eff = _PAGE_USER | _PAGE_RW;
-#else
-			eff = prot;
-#endif
-			if (pgd_large(*start) || !pgd_present(*start)) {
-				note_page(&st, __pgprot(prot), eff, 1);
-			} else {
-				walk_p4d_level(&st, *start, eff,
-					       i * PGD_LEVEL_MULT);
-			}
-		} else
-			note_page(&st, __pgprot(0), 0, 1);
+	struct pg_state st = {
+		.ptdump = {
+			.note_page	= note_page,
+			.range		= ptdump_ranges
+		},
+		.to_dmesg	= dmesg,
+		.check_wx	= checkwx,
+		.seq		= m
+	};
+
+	struct mm_struct fake_mm = {
+		.pgd = pgd
+	};
+	init_rwsem(&fake_mm.mmap_sem);
 
-		cond_resched();
-		start++;
-	}
+	ptdump_walk_pgd(&st.ptdump, &fake_mm);
 
-	/* Flush out the last page */
-	st.current_address = normalize_addr(PTRS_PER_PGD*PGD_LEVEL_MULT);
-	note_page(&st, __pgprot(0), 0, 0);
 	if (!checkwx)
 		return;
 	if (st.wx_pages)
--- a/arch/x86/mm/Makefile~x86-mm-convert-dump_pagetables-to-use-walk_page_range
+++ a/arch/x86/mm/Makefile
@@ -28,8 +28,8 @@ CFLAGS_fault.o := -I $(srctree)/$(src)/.
 obj-$(CONFIG_X86_32)		+= pgtable_32.o iomap_32.o
 
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
-obj-$(CONFIG_X86_PTDUMP_CORE)	+= dump_pagetables.o
-obj-$(CONFIG_X86_PTDUMP)	+= debug_pagetables.o
+obj-$(CONFIG_PTDUMP_CORE)	+= dump_pagetables.o
+obj-$(CONFIG_PTDUMP_DEBUGFS)	+= debug_pagetables.o
 
 obj-$(CONFIG_HIGHMEM)		+= highmem_32.o
 
_


* [patch 44/67] arm64: mm: convert mm/dump.c to use walk_page_range()
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: arm64: mm: convert mm/dump.c to use walk_page_range()

Now that walk_page_range() can walk kernel page tables, we can switch the arm64
ptdump code over to using it, simplifying the code.
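
As a rough sketch of what the conversion boils down to (using only names
visible in the hunks below; simplified, not a drop-in snippet): the arch
embeds a ptdump_state in its own dump state and hands a note_page()
callback plus the address range to the generic walker.

	struct pg_state st = {
		.seq	= s,
		.marker	= info->markers,
		.ptdump	= {
			/* called for every leaf entry and hole */
			.note_page = note_page,
			.range = (struct ptdump_range[]) {
				{ info->base_addr, end },	/* region to dump */
				{ 0, 0 }			/* terminator */
			}
		}
	};

	/* replaces the hand-rolled walk_pgd()/walk_pud()/... loops */
	ptdump_walk_pgd(&st.ptdump, info->mm);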

Link: http://lkml.kernel.org/r/20191218162402.45610-22-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/Kconfig                 |    1 
 arch/arm64/Kconfig.debug           |   19 ----
 arch/arm64/include/asm/ptdump.h    |    8 -
 arch/arm64/mm/Makefile             |    4 
 arch/arm64/mm/dump.c               |  117 ++++++++-------------------
 arch/arm64/mm/mmu.c                |    4 
 arch/arm64/mm/ptdump_debugfs.c     |    2 
 drivers/firmware/efi/arm-runtime.c |    2 
 8 files changed, 50 insertions(+), 107 deletions(-)

--- a/arch/arm64/include/asm/ptdump.h~arm64-mm-convert-mm-dumpc-to-use-walk_page_range
+++ a/arch/arm64/include/asm/ptdump.h
@@ -5,7 +5,7 @@
 #ifndef __ASM_PTDUMP_H
 #define __ASM_PTDUMP_H
 
-#ifdef CONFIG_ARM64_PTDUMP_CORE
+#ifdef CONFIG_PTDUMP_CORE
 
 #include <linux/mm_types.h>
 #include <linux/seq_file.h>
@@ -21,15 +21,15 @@ struct ptdump_info {
 	unsigned long			base_addr;
 };
 
-void ptdump_walk_pgd(struct seq_file *s, struct ptdump_info *info);
-#ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
+void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
+#ifdef CONFIG_PTDUMP_DEBUGFS
 void ptdump_debugfs_register(struct ptdump_info *info, const char *name);
 #else
 static inline void ptdump_debugfs_register(struct ptdump_info *info,
 					   const char *name) { }
 #endif
 void ptdump_check_wx(void);
-#endif /* CONFIG_ARM64_PTDUMP_CORE */
+#endif /* CONFIG_PTDUMP_CORE */
 
 #ifdef CONFIG_DEBUG_WX
 #define debug_checkwx()	ptdump_check_wx()
--- a/arch/arm64/Kconfig~arm64-mm-convert-mm-dumpc-to-use-walk_page_range
+++ a/arch/arm64/Kconfig
@@ -104,6 +104,7 @@ config ARM64
 	select GENERIC_IRQ_SHOW
 	select GENERIC_IRQ_SHOW_LEVEL
 	select GENERIC_PCI_IOMAP
+	select GENERIC_PTDUMP
 	select GENERIC_SCHED_CLOCK
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_STRNCPY_FROM_USER
--- a/arch/arm64/Kconfig.debug~arm64-mm-convert-mm-dumpc-to-use-walk_page_range
+++ a/arch/arm64/Kconfig.debug
@@ -1,22 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-config ARM64_PTDUMP_CORE
-	def_bool n
-
-config ARM64_PTDUMP_DEBUGFS
-	bool "Export kernel pagetable layout to userspace via debugfs"
-	depends on DEBUG_KERNEL
-	select ARM64_PTDUMP_CORE
-	select DEBUG_FS
-        help
-	  Say Y here if you want to show the kernel pagetable layout in a
-	  debugfs file. This information is only useful for kernel developers
-	  who are working in architecture specific areas of the kernel.
-	  It is probably not a good idea to enable this feature in a production
-	  kernel.
-
-	  If in doubt, say N.
-
 config PID_IN_CONTEXTIDR
 	bool "Write the current PID to the CONTEXTIDR register"
 	help
@@ -42,7 +25,7 @@ config ARM64_RANDOMIZE_TEXT_OFFSET
 
 config DEBUG_WX
 	bool "Warn on W+X mappings at boot"
-	select ARM64_PTDUMP_CORE
+	select PTDUMP_CORE
 	---help---
 	  Generate a warning if any W+X mappings are found at boot.
 
--- a/arch/arm64/mm/dump.c~arm64-mm-convert-mm-dumpc-to-use-walk_page_range
+++ a/arch/arm64/mm/dump.c
@@ -15,6 +15,7 @@
 #include <linux/io.h>
 #include <linux/init.h>
 #include <linux/mm.h>
+#include <linux/ptdump.h>
 #include <linux/sched.h>
 #include <linux/seq_file.h>
 
@@ -75,10 +76,11 @@ static struct addr_marker address_marker
  * dumps out a description of the range.
  */
 struct pg_state {
+	struct ptdump_state ptdump;
 	struct seq_file *seq;
 	const struct addr_marker *marker;
 	unsigned long start_address;
-	unsigned level;
+	int level;
 	u64 current_prot;
 	bool check_wx;
 	unsigned long wx_pages;
@@ -179,6 +181,10 @@ static struct pg_level pg_level[] = {
 		.name	= "PGD",
 		.bits	= pte_bits,
 		.num	= ARRAY_SIZE(pte_bits),
+	}, { /* p4d */
+		.name	= "P4D",
+		.bits	= pte_bits,
+		.num	= ARRAY_SIZE(pte_bits),
 	}, { /* pud */
 		.name	= (CONFIG_PGTABLE_LEVELS > 3) ? "PUD" : "PGD",
 		.bits	= pte_bits,
@@ -241,11 +247,15 @@ static void note_prot_wx(struct pg_state
 	st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
 }
 
-static void note_page(struct pg_state *st, unsigned long addr, unsigned level,
-				u64 val)
+static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+		      unsigned long val)
 {
+	struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
 	static const char units[] = "KMGTPE";
-	u64 prot = val & pg_level[level].mask;
+	u64 prot = 0;
+
+	if (level >= 0)
+		prot = val & pg_level[level].mask;
 
 	if (!st->level) {
 		st->level = level;
@@ -293,85 +303,27 @@ static void note_page(struct pg_state *s
 
 }
 
-static void walk_pte(struct pg_state *st, pmd_t *pmdp, unsigned long start,
-		     unsigned long end)
-{
-	unsigned long addr = start;
-	pte_t *ptep = pte_offset_kernel(pmdp, start);
-
-	do {
-		note_page(st, addr, 4, READ_ONCE(pte_val(*ptep)));
-	} while (ptep++, addr += PAGE_SIZE, addr != end);
-}
-
-static void walk_pmd(struct pg_state *st, pud_t *pudp, unsigned long start,
-		     unsigned long end)
+void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
 {
-	unsigned long next, addr = start;
-	pmd_t *pmdp = pmd_offset(pudp, start);
-
-	do {
-		pmd_t pmd = READ_ONCE(*pmdp);
-		next = pmd_addr_end(addr, end);
-
-		if (pmd_none(pmd) || pmd_sect(pmd)) {
-			note_page(st, addr, 3, pmd_val(pmd));
-		} else {
-			BUG_ON(pmd_bad(pmd));
-			walk_pte(st, pmdp, addr, next);
-		}
-	} while (pmdp++, addr = next, addr != end);
-}
-
-static void walk_pud(struct pg_state *st, pgd_t *pgdp, unsigned long start,
-		     unsigned long end)
-{
-	unsigned long next, addr = start;
-	pud_t *pudp = pud_offset(pgdp, start);
-
-	do {
-		pud_t pud = READ_ONCE(*pudp);
-		next = pud_addr_end(addr, end);
-
-		if (pud_none(pud) || pud_sect(pud)) {
-			note_page(st, addr, 2, pud_val(pud));
-		} else {
-			BUG_ON(pud_bad(pud));
-			walk_pmd(st, pudp, addr, next);
-		}
-	} while (pudp++, addr = next, addr != end);
-}
+	unsigned long end = ~0UL;
+	struct pg_state st;
 
-static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
-		     unsigned long start)
-{
-	unsigned long end = (start < TASK_SIZE_64) ? TASK_SIZE_64 : 0;
-	unsigned long next, addr = start;
-	pgd_t *pgdp = pgd_offset(mm, start);
-
-	do {
-		pgd_t pgd = READ_ONCE(*pgdp);
-		next = pgd_addr_end(addr, end);
-
-		if (pgd_none(pgd)) {
-			note_page(st, addr, 1, pgd_val(pgd));
-		} else {
-			BUG_ON(pgd_bad(pgd));
-			walk_pud(st, pgdp, addr, next);
-		}
-	} while (pgdp++, addr = next, addr != end);
-}
+	if (info->base_addr < TASK_SIZE_64)
+		end = TASK_SIZE_64;
 
-void ptdump_walk_pgd(struct seq_file *m, struct ptdump_info *info)
-{
-	struct pg_state st = {
-		.seq = m,
+	st = (struct pg_state){
+		.seq = s,
 		.marker = info->markers,
+		.ptdump = {
+			.note_page = note_page,
+			.range = (struct ptdump_range[]){
+				{info->base_addr, end},
+				{0, 0}
+			}
+		}
 	};
 
-	walk_pgd(&st, info->mm, info->base_addr);
-
-	note_page(&st, 0, 0, 0);
+	ptdump_walk_pgd(&st.ptdump, info->mm);
 }
 
 static void ptdump_initialize(void)
@@ -399,10 +351,17 @@ void ptdump_check_wx(void)
 			{ -1, NULL},
 		},
 		.check_wx = true,
+		.ptdump = {
+			.note_page = note_page,
+			.range = (struct ptdump_range[]) {
+				{PAGE_OFFSET, ~0UL},
+				{0, 0}
+			}
+		}
 	};
 
-	walk_pgd(&st, &init_mm, PAGE_OFFSET);
-	note_page(&st, 0, 0, 0);
+	ptdump_walk_pgd(&st.ptdump, &init_mm);
+
 	if (st.wx_pages || st.uxn_pages)
 		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
 			st.wx_pages, st.uxn_pages);
--- a/arch/arm64/mm/Makefile~arm64-mm-convert-mm-dumpc-to-use-walk_page_range
+++ a/arch/arm64/mm/Makefile
@@ -4,8 +4,8 @@ obj-y				:= dma-mapping.o extable.o faul
 				   ioremap.o mmap.o pgd.o mmu.o \
 				   context.o proc.o pageattr.o
 obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
-obj-$(CONFIG_ARM64_PTDUMP_CORE)	+= dump.o
-obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
+obj-$(CONFIG_PTDUMP_CORE)	+= dump.o
+obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
 obj-$(CONFIG_NUMA)		+= numa.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
 KASAN_SANITIZE_physaddr.o	+= n
--- a/arch/arm64/mm/mmu.c~arm64-mm-convert-mm-dumpc-to-use-walk_page_range
+++ a/arch/arm64/mm/mmu.c
@@ -943,13 +943,13 @@ int __init arch_ioremap_pud_supported(vo
 	 * SW table walks can't handle removal of intermediate entries.
 	 */
 	return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
-	       !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
+	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
 int __init arch_ioremap_pmd_supported(void)
 {
 	/* See arch_ioremap_pud_supported() */
-	return !IS_ENABLED(CONFIG_ARM64_PTDUMP_DEBUGFS);
+	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
--- a/arch/arm64/mm/ptdump_debugfs.c~arm64-mm-convert-mm-dumpc-to-use-walk_page_range
+++ a/arch/arm64/mm/ptdump_debugfs.c
@@ -7,7 +7,7 @@
 static int ptdump_show(struct seq_file *m, void *v)
 {
 	struct ptdump_info *info = m->private;
-	ptdump_walk_pgd(m, info);
+	ptdump_walk(m, info);
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
--- a/drivers/firmware/efi/arm-runtime.c~arm64-mm-convert-mm-dumpc-to-use-walk_page_range
+++ a/drivers/firmware/efi/arm-runtime.c
@@ -27,7 +27,7 @@
 
 extern u64 efi_system_table;
 
-#ifdef CONFIG_ARM64_PTDUMP_DEBUGFS
+#if defined(CONFIG_PTDUMP_DEBUGFS) && defined(CONFIG_ARM64)
 #include <asm/ptdump.h>
 
 static struct ptdump_info efi_ptdump_info = {
_


* [patch 45/67] arm64: mm: display non-present entries in ptdump
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: arm64: mm: display non-present entries in ptdump

Previously the /sys/kernel/debug/kernel_page_tables file would only show
lines for entries present in the page tables.  However, it is useful to
also show non-present entries, as this makes the size and level of the
holes more visible.  This aligns the behaviour with x86, which also shows
holes.
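
For illustration, a hypothetical excerpt of the new output (addresses,
sizes and attribute strings here are invented): a non-present range now
gets a line with its extent and level name, just without attribute flags.

	0xffff000010000000-0xffff000010200000         2M PMD       RW NX SHD AF    BLK UXN MEM/NORMAL
	0xffff000010200000-0xffff000040000000       766M PMD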

Link: http://lkml.kernel.org/r/20191218162402.45610-23-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/mm/dump.c |   25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

--- a/arch/arm64/mm/dump.c~arm64-mm-display-non-present-entries-in-ptdump
+++ a/arch/arm64/mm/dump.c
@@ -270,21 +270,22 @@ static void note_page(struct ptdump_stat
 		if (st->current_prot) {
 			note_prot_uxn(st, addr);
 			note_prot_wx(st, addr);
-			pt_dump_seq_printf(st->seq, "0x%016lx-0x%016lx   ",
+		}
+
+		pt_dump_seq_printf(st->seq, "0x%016lx-0x%016lx   ",
 				   st->start_address, addr);
 
-			delta = (addr - st->start_address) >> 10;
-			while (!(delta & 1023) && unit[1]) {
-				delta >>= 10;
-				unit++;
-			}
-			pt_dump_seq_printf(st->seq, "%9lu%c %s", delta, *unit,
-				   pg_level[st->level].name);
-			if (pg_level[st->level].bits)
-				dump_prot(st, pg_level[st->level].bits,
-					  pg_level[st->level].num);
-			pt_dump_seq_puts(st->seq, "\n");
+		delta = (addr - st->start_address) >> 10;
+		while (!(delta & 1023) && unit[1]) {
+			delta >>= 10;
+			unit++;
 		}
+		pt_dump_seq_printf(st->seq, "%9lu%c %s", delta, *unit,
+				   pg_level[st->level].name);
+		if (st->current_prot && pg_level[st->level].bits)
+			dump_prot(st, pg_level[st->level].bits,
+				  pg_level[st->level].num);
+		pt_dump_seq_puts(st->seq, "\n");
 
 		if (addr >= st->marker[1].start_address) {
 			st->marker++;
_


* [patch 46/67] mm: ptdump: reduce level numbers by 1 in note_page()
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, steven.price, tglx, torvalds, vgupta, will,
	zong.li

From: Steven Price <steven.price@arm.com>
Subject: mm: ptdump: reduce level numbers by 1 in note_page()

Rather than having to increment the 'depth' number by 1 in ptdump_hole(),
let's change the meaning of 'level' in note_page(), since that makes the
code simpler.

Note that for x86, the level numbers were previously increased by 1 in
commit 45dcd2091363 ("x86/mm/dump_pagetables: Fix printout of p4d level")
and the comment "Bit 7 has a different meaning" was not updated, so this
change also makes the code match the comment again.
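
For reference, the resulting numbering (taken from the mm/ptdump.c hunks
below, with the old numbers in comments):

	st->note_page(st, addr, 0, pgd_val(val));	/* pgd, was 1 */
	st->note_page(st, addr, 1, p4d_val(val));	/* p4d, was 2 */
	st->note_page(st, addr, 2, pud_val(val));	/* pud, was 3 */
	st->note_page(st, addr, 3, pmd_val(val));	/* pmd, was 4 */
	st->note_page(st, addr, 4, pte_val(READ_ONCE(*pte)));	/* pte, was 5 */
	st->note_page(st, 0, -1, 0);	/* final flush, was level 0 */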

Link: http://lkml.kernel.org/r/20191218162402.45610-24-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/mm/dump.c          |    6 +++---
 arch/x86/mm/dump_pagetables.c |   19 ++++++++++---------
 include/linux/ptdump.h        |    1 +
 mm/ptdump.c                   |   16 ++++++++--------
 4 files changed, 22 insertions(+), 20 deletions(-)

--- a/arch/arm64/mm/dump.c~mm-ptdump-reduce-level-numbers-by-1-in-note_page
+++ a/arch/arm64/mm/dump.c
@@ -176,8 +176,7 @@ struct pg_level {
 };
 
 static struct pg_level pg_level[] = {
-	{
-	}, { /* pgd */
+	{ /* pgd */
 		.name	= "PGD",
 		.bits	= pte_bits,
 		.num	= ARRAY_SIZE(pte_bits),
@@ -257,7 +256,7 @@ static void note_page(struct ptdump_stat
 	if (level >= 0)
 		prot = val & pg_level[level].mask;
 
-	if (!st->level) {
+	if (st->level == -1) {
 		st->level = level;
 		st->current_prot = prot;
 		st->start_address = addr;
@@ -351,6 +350,7 @@ void ptdump_check_wx(void)
 			{ 0, NULL},
 			{ -1, NULL},
 		},
+		.level = -1,
 		.check_wx = true,
 		.ptdump = {
 			.note_page = note_page,
--- a/arch/x86/mm/dump_pagetables.c~mm-ptdump-reduce-level-numbers-by-1-in-note_page
+++ a/arch/x86/mm/dump_pagetables.c
@@ -180,7 +180,7 @@ static struct addr_marker address_marker
 static void printk_prot(struct seq_file *m, pgprotval_t pr, int level, bool dmsg)
 {
 	static const char * const level_name[] =
-		{ "cr3", "pgd", "p4d", "pud", "pmd", "pte" };
+		{ "pgd", "p4d", "pud", "pmd", "pte" };
 
 	if (!(pr & _PAGE_PRESENT)) {
 		/* Not present */
@@ -204,12 +204,12 @@ static void printk_prot(struct seq_file
 			pt_dump_cont_printf(m, dmsg, "    ");
 
 		/* Bit 7 has a different meaning on level 3 vs 4 */
-		if (level <= 4 && pr & _PAGE_PSE)
+		if (level <= 3 && pr & _PAGE_PSE)
 			pt_dump_cont_printf(m, dmsg, "PSE ");
 		else
 			pt_dump_cont_printf(m, dmsg, "    ");
-		if ((level == 5 && pr & _PAGE_PAT) ||
-		    ((level == 4 || level == 3) && pr & _PAGE_PAT_LARGE))
+		if ((level == 4 && pr & _PAGE_PAT) ||
+		    ((level == 3 || level == 2) && pr & _PAGE_PAT_LARGE))
 			pt_dump_cont_printf(m, dmsg, "PAT ");
 		else
 			pt_dump_cont_printf(m, dmsg, "    ");
@@ -271,15 +271,15 @@ static void note_page(struct ptdump_stat
 
 	new_prot = val & PTE_FLAGS_MASK;
 
-	if (level > 1) {
-		new_eff = effective_prot(st->prot_levels[level - 2],
+	if (level > 0) {
+		new_eff = effective_prot(st->prot_levels[level - 1],
 					 new_prot);
 	} else {
 		new_eff = new_prot;
 	}
 
-	if (level > 0)
-		st->prot_levels[level - 1] = new_eff;
+	if (level >= 0)
+		st->prot_levels[level] = new_eff;
 
 	/*
 	 * If we have a "break" in the series, we need to flush the state that
@@ -289,7 +289,7 @@ static void note_page(struct ptdump_stat
 	cur = st->current_prot;
 	eff = st->effective_prot;
 
-	if (!st->level) {
+	if (st->level == -1) {
 		/* First entry */
 		st->current_prot = new_prot;
 		st->effective_prot = new_eff;
@@ -380,6 +380,7 @@ static void ptdump_walk_pgd_level_core(s
 			.note_page	= note_page,
 			.range		= ptdump_ranges
 		},
+		.level = -1,
 		.to_dmesg	= dmesg,
 		.check_wx	= checkwx,
 		.seq		= m
--- a/include/linux/ptdump.h~mm-ptdump-reduce-level-numbers-by-1-in-note_page
+++ a/include/linux/ptdump.h
@@ -11,6 +11,7 @@ struct ptdump_range {
 };
 
 struct ptdump_state {
+	/* level is 0:PGD to 4:PTE, or -1 if unknown */
 	void (*note_page)(struct ptdump_state *st, unsigned long addr,
 			  int level, unsigned long val);
 	const struct ptdump_range *range;
--- a/mm/ptdump.c~mm-ptdump-reduce-level-numbers-by-1-in-note_page
+++ a/mm/ptdump.c
@@ -17,7 +17,7 @@ static inline int note_kasan_page_table(
 {
 	struct ptdump_state *st = walk->private;
 
-	st->note_page(st, addr, 5, pte_val(kasan_early_shadow_pte[0]));
+	st->note_page(st, addr, 4, pte_val(kasan_early_shadow_pte[0]));
 
 	walk->action = ACTION_CONTINUE;
 
@@ -37,7 +37,7 @@ static int ptdump_pgd_entry(pgd_t *pgd,
 #endif
 
 	if (pgd_leaf(val))
-		st->note_page(st, addr, 1, pgd_val(val));
+		st->note_page(st, addr, 0, pgd_val(val));
 
 	return 0;
 }
@@ -54,7 +54,7 @@ static int ptdump_p4d_entry(p4d_t *p4d,
 #endif
 
 	if (p4d_leaf(val))
-		st->note_page(st, addr, 2, p4d_val(val));
+		st->note_page(st, addr, 1, p4d_val(val));
 
 	return 0;
 }
@@ -71,7 +71,7 @@ static int ptdump_pud_entry(pud_t *pud,
 #endif
 
 	if (pud_leaf(val))
-		st->note_page(st, addr, 3, pud_val(val));
+		st->note_page(st, addr, 2, pud_val(val));
 
 	return 0;
 }
@@ -88,7 +88,7 @@ static int ptdump_pmd_entry(pmd_t *pmd,
 #endif
 
 	if (pmd_leaf(val))
-		st->note_page(st, addr, 4, pmd_val(val));
+		st->note_page(st, addr, 3, pmd_val(val));
 
 	return 0;
 }
@@ -98,7 +98,7 @@ static int ptdump_pte_entry(pte_t *pte,
 {
 	struct ptdump_state *st = walk->private;
 
-	st->note_page(st, addr, 5, pte_val(READ_ONCE(*pte)));
+	st->note_page(st, addr, 4, pte_val(READ_ONCE(*pte)));
 
 	return 0;
 }
@@ -108,7 +108,7 @@ static int ptdump_hole(unsigned long add
 {
 	struct ptdump_state *st = walk->private;
 
-	st->note_page(st, addr, depth + 1, 0);
+	st->note_page(st, addr, depth, 0);
 
 	return 0;
 }
@@ -135,5 +135,5 @@ void ptdump_walk_pgd(struct ptdump_state
 	up_read(&mm->mmap_sem);
 
 	/* Flush out the last page */
-	st->note_page(st, 0, 0, 0);
+	st->note_page(st, 0, -1, 0);
 }
_


* [patch 47/67] x86: mm: avoid allocating struct mm_struct on the stack
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, alex, aou, ard.biesheuvel, arnd, benh, borntraeger, bp,
	catalin.marinas, dave.hansen, davem, gor, heiko.carstens, hpa,
	james.morse, jglisse, jhogan, kan.liang, linux-mm, linux, luto,
	mark.rutland, mingo, mm-commits, mpe, paul.burton, paul.walmsley,
	paulus, peterz, ralf, sfr, steven.price, tglx, torvalds, vgupta,
	will, zong.li

From: Steven Price <steven.price@arm.com>
Subject: x86: mm: avoid allocating struct mm_struct on the stack

struct mm_struct is quite large (~1664 bytes), so allocating one on the
stack may cause problems because the kernel stack is small.

Since ptdump_walk_pgd_level_core() was only allocating the structure so
that it could modify the pgd argument, we can instead introduce a pgd
override in struct mm_walk and pass it down the call stack to where it
is needed.

Since the correct mm_struct is now being passed down, it is also
unnecessary to take the mmap_sem semaphore, because ptdump_walk_pgd()
now takes the semaphore on the real mm.
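
A minimal sketch of the resulting calling convention (names from the
hunks below): the new pgd argument overrides mm->pgd when non-NULL, so an
alternate page table such as the PTI user copy can be walked without
faking up an mm_struct.

	/* pgd may be NULL, in which case the walk uses pgd_offset(mm, addr) */
	walk_page_range_novma(mm, range->start, range->end,
			      &ptdump_ops, pgd, st);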

[steven.price@arm.com: restore missed arm64 changes]
  Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
Link: http://lkml.kernel.org/r/20200108145710.34314-1-steven.price@arm.com
Signed-off-by: Steven Price <steven.price@arm.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/mm/dump.c           |    4 ++--
 arch/x86/mm/debug_pagetables.c |   10 ++--------
 arch/x86/mm/dump_pagetables.c  |   18 +++++++-----------
 include/linux/pagewalk.h       |    3 +++
 include/linux/ptdump.h         |    2 +-
 mm/pagewalk.c                  |    7 ++++++-
 mm/ptdump.c                    |    4 ++--
 7 files changed, 23 insertions(+), 25 deletions(-)

--- a/arch/arm64/mm/dump.c~x86-mm-avoid-allocating-struct-mm_struct-on-the-stack
+++ a/arch/arm64/mm/dump.c
@@ -323,7 +323,7 @@ void ptdump_walk(struct seq_file *s, str
 		}
 	};
 
-	ptdump_walk_pgd(&st.ptdump, info->mm);
+	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
 }
 
 static void ptdump_initialize(void)
@@ -361,7 +361,7 @@ void ptdump_check_wx(void)
 		}
 	};
 
-	ptdump_walk_pgd(&st.ptdump, &init_mm);
+	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
 
 	if (st.wx_pages || st.uxn_pages)
 		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",
--- a/arch/x86/mm/debug_pagetables.c~x86-mm-avoid-allocating-struct-mm_struct-on-the-stack
+++ a/arch/x86/mm/debug_pagetables.c
@@ -15,11 +15,8 @@ DEFINE_SHOW_ATTRIBUTE(ptdump);
 
 static int ptdump_curknl_show(struct seq_file *m, void *v)
 {
-	if (current->mm->pgd) {
-		down_read(&current->mm->mmap_sem);
+	if (current->mm->pgd)
 		ptdump_walk_pgd_level_debugfs(m, current->mm, false);
-		up_read(&current->mm->mmap_sem);
-	}
 	return 0;
 }
 
@@ -28,11 +25,8 @@ DEFINE_SHOW_ATTRIBUTE(ptdump_curknl);
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 static int ptdump_curusr_show(struct seq_file *m, void *v)
 {
-	if (current->mm->pgd) {
-		down_read(&current->mm->mmap_sem);
+	if (current->mm->pgd)
 		ptdump_walk_pgd_level_debugfs(m, current->mm, true);
-		up_read(&current->mm->mmap_sem);
-	}
 	return 0;
 }
 
--- a/arch/x86/mm/dump_pagetables.c~x86-mm-avoid-allocating-struct-mm_struct-on-the-stack
+++ a/arch/x86/mm/dump_pagetables.c
@@ -357,7 +357,8 @@ static void note_page(struct ptdump_stat
 	}
 }
 
-static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
+static void ptdump_walk_pgd_level_core(struct seq_file *m,
+				       struct mm_struct *mm, pgd_t *pgd,
 				       bool checkwx, bool dmesg)
 {
 	const struct ptdump_range ptdump_ranges[] = {
@@ -386,12 +387,7 @@ static void ptdump_walk_pgd_level_core(s
 		.seq		= m
 	};
 
-	struct mm_struct fake_mm = {
-		.pgd = pgd
-	};
-	init_rwsem(&fake_mm.mmap_sem);
-
-	ptdump_walk_pgd(&st.ptdump, &fake_mm);
+	ptdump_walk_pgd(&st.ptdump, mm, pgd);
 
 	if (!checkwx)
 		return;
@@ -404,7 +400,7 @@ static void ptdump_walk_pgd_level_core(s
 
 void ptdump_walk_pgd_level(struct seq_file *m, struct mm_struct *mm)
 {
-	ptdump_walk_pgd_level_core(m, mm->pgd, false, true);
+	ptdump_walk_pgd_level_core(m, mm, mm->pgd, false, true);
 }
 
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, struct mm_struct *mm,
@@ -415,7 +411,7 @@ void ptdump_walk_pgd_level_debugfs(struc
 	if (user && boot_cpu_has(X86_FEATURE_PTI))
 		pgd = kernel_to_user_pgdp(pgd);
 #endif
-	ptdump_walk_pgd_level_core(m, pgd, false, false);
+	ptdump_walk_pgd_level_core(m, mm, pgd, false, false);
 }
 EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
 
@@ -430,13 +426,13 @@ void ptdump_walk_user_pgd_level_checkwx(
 
 	pr_info("x86/mm: Checking user space page tables\n");
 	pgd = kernel_to_user_pgdp(pgd);
-	ptdump_walk_pgd_level_core(NULL, pgd, true, false);
+	ptdump_walk_pgd_level_core(NULL, &init_mm, pgd, true, false);
 #endif
 }
 
 void ptdump_walk_pgd_level_checkwx(void)
 {
-	ptdump_walk_pgd_level_core(NULL, INIT_PGD, true, false);
+	ptdump_walk_pgd_level_core(NULL, &init_mm, INIT_PGD, true, false);
 }
 
 static int __init pt_dump_init(void)
--- a/include/linux/pagewalk.h~x86-mm-avoid-allocating-struct-mm_struct-on-the-stack
+++ a/include/linux/pagewalk.h
@@ -74,6 +74,7 @@ enum page_walk_action {
  * mm_walk - walk_page_range data
  * @ops:	operation to call during the walk
  * @mm:		mm_struct representing the target process of page table walk
+ * @pgd:	pointer to PGD; only valid with no_vma (otherwise set to NULL)
  * @vma:	vma currently walked (NULL if walking outside vmas)
  * @action:	next action to perform (see enum page_walk_action)
  * @no_vma:	walk ignoring vmas (vma will always be NULL)
@@ -84,6 +85,7 @@ enum page_walk_action {
 struct mm_walk {
 	const struct mm_walk_ops *ops;
 	struct mm_struct *mm;
+	pgd_t *pgd;
 	struct vm_area_struct *vma;
 	enum page_walk_action action;
 	bool no_vma;
@@ -95,6 +97,7 @@ int walk_page_range(struct mm_struct *mm
 		void *private);
 int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
+			  pgd_t *pgd,
 			  void *private);
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 		void *private);
--- a/include/linux/ptdump.h~x86-mm-avoid-allocating-struct-mm_struct-on-the-stack
+++ a/include/linux/ptdump.h
@@ -17,6 +17,6 @@ struct ptdump_state {
 	const struct ptdump_range *range;
 };
 
-void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm);
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd);
 
 #endif /* _LINUX_PTDUMP_H */
--- a/mm/pagewalk.c~x86-mm-avoid-allocating-struct-mm_struct-on-the-stack
+++ a/mm/pagewalk.c
@@ -206,7 +206,10 @@ static int walk_pgd_range(unsigned long
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 
-	pgd = pgd_offset(walk->mm, addr);
+	if (walk->pgd)
+		pgd = walk->pgd + pgd_index(addr);
+	else
+		pgd = pgd_offset(walk->mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd)) {
@@ -436,11 +439,13 @@ int walk_page_range(struct mm_struct *mm
  */
 int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
+			  pgd_t *pgd,
 			  void *private)
 {
 	struct mm_walk walk = {
 		.ops		= ops,
 		.mm		= mm,
+		.pgd		= pgd,
 		.private	= private,
 		.no_vma		= true
 	};
--- a/mm/ptdump.c~x86-mm-avoid-allocating-struct-mm_struct-on-the-stack
+++ a/mm/ptdump.c
@@ -122,14 +122,14 @@ static const struct mm_walk_ops ptdump_o
 	.pte_hole	= ptdump_hole,
 };
 
-void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
+void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
 {
 	const struct ptdump_range *range = st->range;
 
 	down_read(&mm->mmap_sem);
 	while (range->start != range->end) {
 		walk_page_range_novma(mm, range->start, range->end,
-				      &ptdump_ops, st);
+				      &ptdump_ops, pgd, st);
 		range++;
 	}
 	up_read(&mm->mmap_sem);
_


* [patch 48/67] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, stable, torvalds

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Subject: powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Patch series "Fixup page directory freeing", v4.

This is a repost of the patch series from Peter, with the arch-specific
changes other than ppc64 dropped.  The ppc64 changes are added here because
we are redoing the patch series on top of the ppc64 changes, which makes
them easy to backport.  Only the first two patches need to be backported to
stable.


The thing is, on anything SMP, freeing page directories should observe the
exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
use-after-free.  This is especially easy to hit for anything that has
software page-table walkers (HAVE_FAST_GUP / software TLB fill) or
hardware that caches partial page walks (i.e. caches page directories).

Even on UP this can cause problems, since mmu_gather is preemptible these
days.  An interrupt or a preempted task accessing user pages might stumble
into the freed page if the hardware caches page directories.
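
Sketched conceptually (simplified, not a literal call chain), the
RCU_TABLE_FREE path enforces that ordering like this:

	/*
	 * 1) unhook the entry from the parent page directory
	 * 2) tlb_remove_table()     - queue the table page in the batch
	 * 3) TLB invalidate         - flush hardware walk caches
	 * 4) IPI / RCU-sched wait   - let in-flight software walkers drain
	 * 5) __tlb_remove_table()   - only now actually free the page
	 */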

This patch series fixes ppc64 and adds generic MMU_GATHER changes to
support the conversion of other architectures.  I haven't added patches
for the other architectures because they are yet to be acked.


This patch (of 9):

A followup patch is going to make sure we correctly invalidate the page
walk cache before we free page table pages.  In order to keep things
simple, enable RCU_TABLE_FREE even for !SMP so that we don't have to fix
up the !SMP case differently in the followup patch.

The !SMP case is right now broken for radix translation w.r.t. the page
walk cache flush.  We can get interrupted in between the page table free
and the flush, which would leave page walk cache entries pointing to
tables that have already been freed.  Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."

Link: http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/Kconfig                         |    2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |    8 --------
 arch/powerpc/include/asm/book3s/64/pgalloc.h |    2 --
 arch/powerpc/include/asm/nohash/pgalloc.h    |    8 --------
 arch/powerpc/mm/book3s64/pgtable.c           |    7 -------
 5 files changed, 1 insertion(+), 26 deletions(-)

--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h~powerpc-mmu_gather-enable-rcu_table_free-even-for-smp-case
+++ a/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -49,7 +49,6 @@ static inline void pgtable_free(void *ta
 
 #define get_hugepd_cache_index(x)  (x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 				    void *table, int shift)
 {
@@ -66,13 +65,6 @@ static inline void __tlb_remove_table(vo
 
 	pgtable_free(table, shift);
 }
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-				    void *table, int shift)
-{
-	pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h~powerpc-mmu_gather-enable-rcu_table_free-even-for-smp-case
+++ a/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -19,9 +19,7 @@ extern struct vmemmap_backing *vmemmap_l
 extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned long);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 void pte_frag_destroy(void *pte_frag);
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
--- a/arch/powerpc/include/asm/nohash/pgalloc.h~powerpc-mmu_gather-enable-rcu_table_free-even-for-smp-case
+++ a/arch/powerpc/include/asm/nohash/pgalloc.h
@@ -46,7 +46,6 @@ static inline void pgtable_free(void *ta
 
 #define get_hugepd_cache_index(x)	(x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
 {
 	unsigned long pgf = (unsigned long)table;
@@ -64,13 +63,6 @@ static inline void __tlb_remove_table(vo
 	pgtable_free(table, shift);
 }
 
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
-{
-	pgtable_free(table, shift);
-}
-#endif
-
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
--- a/arch/powerpc/Kconfig~powerpc-mmu_gather-enable-rcu_table_free-even-for-smp-case
+++ a/arch/powerpc/Kconfig
@@ -222,7 +222,7 @@ config PPC
 	select HAVE_HARDLOCKUP_DETECTOR_PERF	if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select HAVE_RCU_TABLE_FREE		if SMP
+	select HAVE_RCU_TABLE_FREE
 	select HAVE_RCU_TABLE_NO_INVALIDATE	if HAVE_RCU_TABLE_FREE
 	select HAVE_MMU_GATHER_PAGE_SIZE
 	select HAVE_REGS_AND_STACK_ACCESS_API
--- a/arch/powerpc/mm/book3s64/pgtable.c~powerpc-mmu_gather-enable-rcu_table_free-even-for-smp-case
+++ a/arch/powerpc/mm/book3s64/pgtable.c
@@ -378,7 +378,6 @@ static inline void pgtable_free(void *ta
 	}
 }
 
-#ifdef CONFIG_SMP
 void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int index)
 {
 	unsigned long pgf = (unsigned long)table;
@@ -395,12 +394,6 @@ void __tlb_remove_table(void *_table)
 
 	return pgtable_free(table, index);
 }
-#else
-void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int index)
-{
-	return pgtable_free(table, index);
-}
-#endif
 
 #ifdef CONFIG_PROC_FS
 atomic_long_t direct_pages_count[MMU_PAGE_COUNT];
_


* [patch 49/67] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, stable, torvalds

From: Peter Zijlstra <peterz@infradead.org>
Subject: mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush

Architectures for which we have hardware walkers of the Linux page
tables should flush the TLB on mmu_gather batch allocation failures and
batch flush.  Some architectures, like POWER, support multiple
translation modes (hash and radix), and in the case of POWER only the
radix translation mode needs the above TLBI.  This is because for the
hash translation mode the kernel wants to avoid this extra flush, since
there are no hardware walkers of the Linux page tables.  With radix
translation the hardware also walks the Linux page tables, and with that
the kernel needs to make sure to invalidate the TLB page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support invalidating
TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior, since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default for
tlb_needs_table_invalidate() is to always force an invalidate, but sparc
can avoid the table invalidate.  Hence we define
tlb_needs_table_invalidate() as false for the sparc architecture.
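
Condensed from the hunks below, the resulting override pattern is: the
asm-generic header supplies a safe default, and an architecture may
define its own tlb_needs_table_invalidate() before including it.

	/* include/asm-generic/tlb.h: default is to always invalidate */
	#ifndef tlb_needs_table_invalidate
	#define tlb_needs_table_invalidate() (true)
	#endif

	/* arch/sparc/include/asm/tlb_64.h: hardware TLB fill does not use
	 * the Linux page tables, so the TLBI can be skipped */
	#define tlb_needs_table_invalidate()	(false)

	/* arch/powerpc/include/asm/tlb.h: only radix walks the Linux
	 * page tables */
	#define tlb_needs_table_invalidate()	radix_enabled()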

Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig                    |    3 ---
 arch/powerpc/Kconfig            |    1 -
 arch/powerpc/include/asm/tlb.h  |   11 +++++++++++
 arch/sparc/Kconfig              |    1 -
 arch/sparc/include/asm/tlb_64.h |    9 +++++++++
 include/asm-generic/tlb.h       |   22 +++++++++++++++-------
 mm/mmu_gather.c                 |   16 ++++++++--------
 7 files changed, 43 insertions(+), 20 deletions(-)

--- a/arch/Kconfig~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/Kconfig
@@ -396,9 +396,6 @@ config HAVE_ARCH_JUMP_LABEL_RELATIVE
 config HAVE_RCU_TABLE_FREE
 	bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-	bool
-
 config HAVE_MMU_GATHER_PAGE_SIZE
 	bool
 
--- a/arch/powerpc/include/asm/tlb.h~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/powerpc/include/asm/tlb.h
@@ -26,6 +26,17 @@
 
 #define tlb_flush tlb_flush
 extern void tlb_flush(struct mmu_gather *tlb);
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()	radix_enabled()
 
 /* Get the generic bits... */
 #include <asm-generic/tlb.h>
--- a/arch/powerpc/Kconfig~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/powerpc/Kconfig
@@ -223,7 +223,6 @@ config PPC
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE
-	select HAVE_RCU_TABLE_NO_INVALIDATE	if HAVE_RCU_TABLE_FREE
 	select HAVE_MMU_GATHER_PAGE_SIZE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
--- a/arch/sparc/include/asm/tlb_64.h~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb)	flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()	(false)
+#endif
+
 #include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */
--- a/arch/sparc/Kconfig~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/arch/sparc/Kconfig
@@ -65,7 +65,6 @@ config SPARC64
 	select HAVE_KRETPROBES
 	select HAVE_KPROBES
 	select HAVE_RCU_TABLE_FREE if SMP
-	select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_DYNAMIC_FTRACE
--- a/include/asm-generic/tlb.h~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/include/asm-generic/tlb.h
@@ -137,13 +137,6 @@
  *  When used, an architecture is expected to provide __tlb_remove_table()
  *  which does the actual freeing of these pages.
  *
- *  HAVE_RCU_TABLE_NO_INVALIDATE
- *
- *  This makes HAVE_RCU_TABLE_FREE avoid calling tlb_flush_mmu_tlbonly() before
- *  freeing the page-table pages. This can be avoided if you use
- *  HAVE_RCU_TABLE_FREE and your architecture does _NOT_ use the Linux
- *  page-tables natively.
- *
  *  MMU_GATHER_NO_RANGE
  *
  *  Use this if your architecture lacks an efficient flush_tlb_range().
@@ -189,8 +182,23 @@ struct mmu_table_batch {
 
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define tlb_needs_table_invalidate() (true)
 #endif
 
+#else
+
+#ifdef tlb_needs_table_invalidate
+#error tlb_needs_table_invalidate() requires HAVE_RCU_TABLE_FREE
+#endif
+
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
+
 #ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
 /*
  * If we can't allocate a page to make a big batch of page pointers
--- a/mm/mmu_gather.c~mm-mmu_gather-invalidate-tlb-correctly-on-batch-allocation-failure-and-flush
+++ a/mm/mmu_gather.c
@@ -102,14 +102,14 @@ bool __tlb_remove_page_size(struct mmu_g
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
-	/*
-	 * Invalidate page-table caches used by hardware walkers. Then we still
-	 * need to RCU-sched wait while freeing the pages because software
-	 * walkers can still be in-flight.
-	 */
-	tlb_flush_mmu_tlbonly(tlb);
-#endif
+	if (tlb_needs_table_invalidate()) {
+		/*
+		 * Invalidate page-table caches used by hardware walkers. Then
+		 * we still need to RCU-sched wait while freeing the pages
+		 * because software walkers can still be in-flight.
+		 */
+		tlb_flush_mmu_tlbonly(tlb);
+	}
 }
 
 static void tlb_remove_table_smp_sync(void *arg)
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 50/67] asm-generic/tlb: avoid potential double flush
  2020-02-04  1:33 incoming Andrew Morton
                   ` (48 preceding siblings ...)
  2020-02-04  1:36 ` [patch 49/67] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:36 ` [patch 51/67] asm-gemeric/tlb: remove stray function declarations Andrew Morton
                   ` (188 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, torvalds

From: Peter Zijlstra <peterz@infradead.org>
Subject: asm-generic/tlb: avoid potential double flush

Aneesh reported that:

	tlb_flush_mmu()
	  tlb_flush_mmu_tlbonly()
	    tlb_flush()			<-- #1
	  tlb_flush_mmu_free()
	    tlb_table_flush()
	      tlb_table_invalidate()
		tlb_flush_mmu_tlbonly()
		  tlb_flush()		<-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller of __tlb_adjust_range() also sets at least one of
the tlb->freed_tables || tlb->cleared_p* bits, and that those bits are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those bits
set, as opposed to having tlb->end != 0.
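
For illustration only, a minimal userspace model of the gating change (not
kernel code; the field names mirror mmu_gather, everything else is made up):

	#include <stdbool.h>
	#include <stdio.h>

	struct mmu_gather_model {
		unsigned long end;	/* stays non-zero when fullmm */
		bool freed_tables;
		bool cleared_ptes, cleared_pmds, cleared_puds, cleared_p4ds;
		int flushes;		/* counts actual invalidates */
	};

	static void reset_range(struct mmu_gather_model *tlb, bool fullmm)
	{
		if (!fullmm)
			tlb->end = 0;
		/* these bits are cleared unconditionally */
		tlb->freed_tables = false;
		tlb->cleared_ptes = tlb->cleared_pmds = false;
		tlb->cleared_puds = tlb->cleared_p4ds = false;
	}

	static void flush_tlbonly(struct mmu_gather_model *tlb, bool fullmm)
	{
		if (!(tlb->freed_tables || tlb->cleared_ptes ||
		      tlb->cleared_pmds || tlb->cleared_puds ||
		      tlb->cleared_p4ds))
			return;		/* new condition: nothing to do */
		tlb->flushes++;		/* stands in for tlb_flush() */
		reset_range(tlb, fullmm);
	}

	int main(void)
	{
		struct mmu_gather_model tlb = {
			.end = ~0UL, .freed_tables = true,
		};

		flush_tlbonly(&tlb, true);	/* flush #1 does the work */
		flush_tlbonly(&tlb, true);	/* flush #2 is now a no-op */
		printf("flushes = %d\n", tlb.flushes);	/* prints 1 */
		return 0;
	}

With the old "!tlb->end" test the second call would have flushed again,
because fullmm keeps tlb->end set across __tlb_reset_range().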

Link: http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-generic/tlb.h |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/include/asm-generic/tlb.h~asm-generic-tlb-avoid-potential-double-flush
+++ a/include/asm-generic/tlb.h
@@ -402,7 +402,12 @@ tlb_update_vma_flags(struct mmu_gather *
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-	if (!tlb->end)
+	/*
+	 * Anything calling __tlb_adjust_range() also sets at least one of
+	 * these bits.
+	 */
+	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+	      tlb->cleared_puds || tlb->cleared_p4ds))
 		return;
 
 	tlb_flush(tlb);
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 51/67] asm-gemeric/tlb: remove stray function declarations
  2020-02-04  1:33 incoming Andrew Morton
                   ` (49 preceding siblings ...)
  2020-02-04  1:36 ` [patch 50/67] asm-generic/tlb: avoid potential double flush Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:36 ` [patch 52/67] asm-generic/tlb: add missing CONFIG symbol Andrew Morton
                   ` (187 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, torvalds

From: Peter Zijlstra <peterz@infradead.org>
Subject: asm-gemeric/tlb: remove stray function declarations

We removed the actual functions a while ago.

Link: http://lkml.kernel.org/r/20200116064531.483522-5-aneesh.kumar@linux.ibm.com
Fixes: 1808d65b55e4 ("asm-generic/tlb: Remove arch_tlb*_mmu()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-generic/tlb.h |    4 ----
 1 file changed, 4 deletions(-)

--- a/include/asm-generic/tlb.h~asm-gemeric-tlb-remove-stray-function-declarations
+++ a/include/asm-generic/tlb.h
@@ -285,11 +285,7 @@ struct mmu_gather {
 #endif
 };
 
-void arch_tlb_gather_mmu(struct mmu_gather *tlb,
-	struct mm_struct *mm, unsigned long start, unsigned long end);
 void tlb_flush_mmu(struct mmu_gather *tlb);
-void arch_tlb_finish_mmu(struct mmu_gather *tlb,
-			 unsigned long start, unsigned long end, bool force);
 
 static inline void __tlb_adjust_range(struct mmu_gather *tlb,
 				      unsigned long address,
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 52/67] asm-generic/tlb: add missing CONFIG symbol
  2020-02-04  1:33 incoming Andrew Morton
                   ` (50 preceding siblings ...)
  2020-02-04  1:36 ` [patch 51/67] asm-gemeric/tlb: remove stray function declarations Andrew Morton
@ 2020-02-04  1:36 ` Andrew Morton
  2020-02-04  1:37 ` [patch 53/67] asm-generic/tlb: rename HAVE_RCU_TABLE_FREE Andrew Morton
                   ` (186 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:36 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, torvalds

From: Peter Zijlstra <peterz@infradead.org>
Subject: asm-generic/tlb: add missing CONFIG symbol

Without this, the symbol will not actually end up in .config files.
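
For context (not from the tree apart from the symbol this patch declares):
"select" only has an effect on symbols that are declared with a "config"
entry somewhere; selecting an undeclared symbol is silently ignored, so the
symbol never reaches .config.  MY_ARCH below is made up:

	config MMU_GATHER_NO_RANGE		# the declaration this patch adds
		bool

	config MY_ARCH				# hypothetical architecture symbol
		bool
		select MMU_GATHER_NO_RANGE	# silently ignored without the
						# "config" declaration above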

Link: http://lkml.kernel.org/r/20200116064531.483522-6-aneesh.kumar@linux.ibm.com
Fixes: a30e32bd79e9 ("asm-generic/tlb: Provide generic tlb_flush() based on flush_tlb_mm()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig |    3 +++
 1 file changed, 3 insertions(+)

--- a/arch/Kconfig~asm-generic-tlb-add-missing-config-symbol
+++ a/arch/Kconfig
@@ -399,6 +399,9 @@ config HAVE_RCU_TABLE_FREE
 config HAVE_MMU_GATHER_PAGE_SIZE
 	bool
 
+config MMU_GATHER_NO_RANGE
+	bool
+
 config HAVE_MMU_GATHER_NO_GATHER
 	bool
 
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 53/67] asm-generic/tlb: rename HAVE_RCU_TABLE_FREE
  2020-02-04  1:33 incoming Andrew Morton
                   ` (51 preceding siblings ...)
  2020-02-04  1:36 ` [patch 52/67] asm-generic/tlb: add missing CONFIG symbol Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 54/67] asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE Andrew Morton
                   ` (185 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, torvalds

From: Peter Zijlstra <peterz@infradead.org>
Subject: asm-generic/tlb: rename HAVE_RCU_TABLE_FREE

Towards a more consistent naming scheme.

[akpm@linux-foundation.org: fix sparc64 Kconfig]
Link: http://lkml.kernel.org/r/20200116064531.483522-7-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig                    |    2 +-
 arch/arm/Kconfig                |    2 +-
 arch/arm/include/asm/tlb.h      |    2 +-
 arch/arm64/Kconfig              |    2 +-
 arch/powerpc/Kconfig            |    2 +-
 arch/s390/Kconfig               |    2 +-
 arch/sparc/Kconfig              |    2 +-
 arch/sparc/include/asm/tlb_64.h |    2 +-
 arch/x86/Kconfig                |    2 +-
 arch/x86/include/asm/tlb.h      |    4 ++--
 include/asm-generic/tlb.h       |   10 +++++-----
 mm/gup.c                        |    2 +-
 mm/mmu_gather.c                 |    8 ++++----
 13 files changed, 21 insertions(+), 21 deletions(-)

--- a/arch/arm64/Kconfig~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/arm64/Kconfig
@@ -165,7 +165,7 @@ config ARM64
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_FUNCTION_ARG_ACCESS_API
 	select HAVE_FUTEX_CMPXCHG if FUTEX
-	select HAVE_RCU_TABLE_FREE
+	select MMU_GATHER_RCU_TABLE_FREE
 	select HAVE_RSEQ
 	select HAVE_STACKPROTECTOR
 	select HAVE_SYSCALL_TRACEPOINTS
--- a/arch/arm/include/asm/tlb.h~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/arm/include/asm/tlb.h
@@ -37,7 +37,7 @@ static inline void __tlb_remove_table(vo
 
 #include <asm-generic/tlb.h>
 
-#ifndef CONFIG_HAVE_RCU_TABLE_FREE
+#ifndef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 #define tlb_remove_table(tlb, entry) tlb_remove_page(tlb, entry)
 #endif
 
--- a/arch/arm/Kconfig~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/arm/Kconfig
@@ -102,7 +102,7 @@ config ARM
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select HAVE_RCU_TABLE_FREE if SMP && ARM_LPAE
+	select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RSEQ
 	select HAVE_STACKPROTECTOR
--- a/arch/Kconfig~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/Kconfig
@@ -393,7 +393,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_ARCH_JUMP_LABEL_RELATIVE
 	bool
 
-config HAVE_RCU_TABLE_FREE
+config MMU_GATHER_RCU_TABLE_FREE
 	bool
 
 config HAVE_MMU_GATHER_PAGE_SIZE
--- a/arch/powerpc/Kconfig~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/powerpc/Kconfig
@@ -222,7 +222,7 @@ config PPC
 	select HAVE_HARDLOCKUP_DETECTOR_PERF	if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select HAVE_RCU_TABLE_FREE
+	select MMU_GATHER_RCU_TABLE_FREE
 	select HAVE_MMU_GATHER_PAGE_SIZE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
--- a/arch/s390/Kconfig~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/s390/Kconfig
@@ -169,7 +169,7 @@ config S390
 	select HAVE_OPROFILE
 	select HAVE_PCI
 	select HAVE_PERF_EVENTS
-	select HAVE_RCU_TABLE_FREE
+	select MMU_GATHER_RCU_TABLE_FREE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE
 	select HAVE_RSEQ
--- a/arch/sparc/include/asm/tlb_64.h~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/sparc/include/asm/tlb_64.h
@@ -33,7 +33,7 @@ void flush_tlb_pending(void);
  * and therefore we don't need a TLBI when freeing page-table pages.
  */
 
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 #define tlb_needs_table_invalidate()	(false)
 #endif
 
--- a/arch/sparc/Kconfig~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/sparc/Kconfig
@@ -64,7 +64,7 @@ config SPARC64
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_KRETPROBES
 	select HAVE_KPROBES
-	select HAVE_RCU_TABLE_FREE if SMP
+	select MMU_GATHER_RCU_TABLE_FREE if SMP
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_DYNAMIC_FTRACE
--- a/arch/x86/include/asm/tlb.h~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/x86/include/asm/tlb.h
@@ -29,8 +29,8 @@ static inline void tlb_flush(struct mmu_
  * shootdown, enablement code for several hypervisors overrides
  * .flush_tlb_others hook in pv_mmu_ops and implements it by issuing
  * a hypercall. To keep software pagetable walkers safe in this case we
- * switch to RCU based table free (HAVE_RCU_TABLE_FREE). See the comment
- * below 'ifdef CONFIG_HAVE_RCU_TABLE_FREE' in include/asm-generic/tlb.h
+ * switch to RCU based table free (MMU_GATHER_RCU_TABLE_FREE). See the comment
+ * below 'ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE' in include/asm-generic/tlb.h
  * for more details.
  */
 static inline void __tlb_remove_table(void *table)
--- a/arch/x86/Kconfig~asm-generic-tlb-rename-have_rcu_table_free
+++ a/arch/x86/Kconfig
@@ -203,7 +203,7 @@ config X86
 	select HAVE_PCI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select HAVE_RCU_TABLE_FREE		if PARAVIRT
+	select MMU_GATHER_RCU_TABLE_FREE		if PARAVIRT
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
 	select HAVE_FUNCTION_ARG_ACCESS_API
--- a/include/asm-generic/tlb.h~asm-generic-tlb-rename-have_rcu_table_free
+++ a/include/asm-generic/tlb.h
@@ -126,7 +126,7 @@
  *  This ensures we call tlb_flush() every time tlb_change_page_size() actually
  *  changes the size and provides mmu_gather::page_size to tlb_flush().
  *
- *  HAVE_RCU_TABLE_FREE
+ *  MMU_GATHER_RCU_TABLE_FREE
  *
  *  This provides tlb_remove_table(), to be used instead of tlb_remove_page()
  *  for page directores (__p*_free_tlb()). This provides separate freeing of
@@ -142,7 +142,7 @@
  *  Use this if your architecture lacks an efficient flush_tlb_range().
  */
 
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
  *
@@ -193,10 +193,10 @@ extern void tlb_remove_table(struct mmu_
 #else
 
 #ifdef tlb_needs_table_invalidate
-#error tlb_needs_table_invalidate() requires HAVE_RCU_TABLE_FREE
+#error tlb_needs_table_invalidate() requires MMU_GATHER_RCU_TABLE_FREE
 #endif
 
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
 
 #ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
@@ -235,7 +235,7 @@ extern bool __tlb_remove_page_size(struc
 struct mmu_gather {
 	struct mm_struct	*mm;
 
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 	struct mmu_table_batch	*batch;
 #endif
 
--- a/mm/gup.c~asm-generic-tlb-rename-have_rcu_table_free
+++ a/mm/gup.c
@@ -1792,7 +1792,7 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * Before activating this code, please be aware that the following assumptions
  * are currently made:
  *
- *  *) Either HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
+ *  *) Either MMU_GATHER_RCU_TABLE_FREE is enabled, and tlb_remove_table() is used to
  *  free pages containing page tables or TLB flushing requires IPI broadcast.
  *
  *  *) ptes can be read atomically by the architecture.
--- a/mm/mmu_gather.c~asm-generic-tlb-rename-have_rcu_table_free
+++ a/mm/mmu_gather.c
@@ -91,7 +91,7 @@ bool __tlb_remove_page_size(struct mmu_g
 
 #endif /* HAVE_MMU_GATHER_NO_GATHER */
 
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 
 /*
  * See the comment near struct mmu_table_batch.
@@ -173,11 +173,11 @@ void tlb_remove_table(struct mmu_gather
 		tlb_table_flush(tlb);
 }
 
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+#endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
 #ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
@@ -220,7 +220,7 @@ void tlb_gather_mmu(struct mmu_gather *t
 	tlb->batch_count = 0;
 #endif
 
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
 #ifdef CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 54/67] asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE
  2020-02-04  1:33 incoming Andrew Morton
                   ` (52 preceding siblings ...)
  2020-02-04  1:37 ` [patch 53/67] asm-generic/tlb: rename HAVE_RCU_TABLE_FREE Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 55/67] asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER Andrew Morton
                   ` (184 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, torvalds

From: Peter Zijlstra <peterz@infradead.org>
Subject: asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE

Towards a more consistent naming scheme.

Link: http://lkml.kernel.org/r/20200116064531.483522-8-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig              |    2 +-
 arch/powerpc/Kconfig      |    2 +-
 include/asm-generic/tlb.h |    9 ++++++---
 mm/mmu_gather.c           |    4 ++--
 4 files changed, 10 insertions(+), 7 deletions(-)

--- a/arch/Kconfig~asm-generic-tlb-rename-have_mmu_gather_page_size
+++ a/arch/Kconfig
@@ -396,7 +396,7 @@ config HAVE_ARCH_JUMP_LABEL_RELATIVE
 config MMU_GATHER_RCU_TABLE_FREE
 	bool
 
-config HAVE_MMU_GATHER_PAGE_SIZE
+config MMU_GATHER_PAGE_SIZE
 	bool
 
 config MMU_GATHER_NO_RANGE
--- a/arch/powerpc/Kconfig~asm-generic-tlb-rename-have_mmu_gather_page_size
+++ a/arch/powerpc/Kconfig
@@ -223,7 +223,7 @@ config PPC
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select MMU_GATHER_RCU_TABLE_FREE
-	select HAVE_MMU_GATHER_PAGE_SIZE
+	select MMU_GATHER_PAGE_SIZE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
 	select HAVE_SYSCALL_TRACEPOINTS
--- a/include/asm-generic/tlb.h~asm-generic-tlb-rename-have_mmu_gather_page_size
+++ a/include/asm-generic/tlb.h
@@ -121,11 +121,14 @@
  *
  * Additionally there are a few opt-in features:
  *
- *  HAVE_MMU_GATHER_PAGE_SIZE
+ *  MMU_GATHER_PAGE_SIZE
  *
  *  This ensures we call tlb_flush() every time tlb_change_page_size() actually
  *  changes the size and provides mmu_gather::page_size to tlb_flush().
  *
+ *  This might be useful if your architecture has size specific TLB
+ *  invalidation instructions.
+ *
  *  MMU_GATHER_RCU_TABLE_FREE
  *
  *  This provides tlb_remove_table(), to be used instead of tlb_remove_page()
@@ -279,7 +282,7 @@ struct mmu_gather {
 	struct mmu_gather_batch	local;
 	struct page		*__pages[MMU_GATHER_BUNDLE];
 
-#ifdef CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
+#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	unsigned int page_size;
 #endif
 #endif
@@ -435,7 +438,7 @@ static inline void tlb_remove_page(struc
 static inline void tlb_change_page_size(struct mmu_gather *tlb,
 						     unsigned int page_size)
 {
-#ifdef CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
+#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	if (tlb->page_size && tlb->page_size != page_size) {
 		if (!tlb->fullmm && !tlb->need_flush_all)
 			tlb_flush_mmu(tlb);
--- a/mm/mmu_gather.c~asm-generic-tlb-rename-have_mmu_gather_page_size
+++ a/mm/mmu_gather.c
@@ -69,7 +69,7 @@ bool __tlb_remove_page_size(struct mmu_g
 
 	VM_BUG_ON(!tlb->end);
 
-#ifdef CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
+#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	VM_WARN_ON(tlb->page_size != page_size);
 #endif
 
@@ -223,7 +223,7 @@ void tlb_gather_mmu(struct mmu_gather *t
 #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
-#ifdef CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
+#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	tlb->page_size = 0;
 #endif
 
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 55/67] asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER
  2020-02-04  1:33 incoming Andrew Morton
                   ` (53 preceding siblings ...)
  2020-02-04  1:37 ` [patch 54/67] asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 56/67] asm-generic/tlb: provide MMU_GATHER_TABLE_FREE Andrew Morton
                   ` (183 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, torvalds

From: Peter Zijlstra <peterz@infradead.org>
Subject: asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER

Towards a more consistent naming scheme.

Link: http://lkml.kernel.org/r/20200116064531.483522-9-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig              |    2 +-
 arch/s390/Kconfig         |    2 +-
 include/asm-generic/tlb.h |   14 ++++++++++++--
 mm/mmu_gather.c           |   10 +++++-----
 4 files changed, 19 insertions(+), 9 deletions(-)

--- a/arch/Kconfig~asm-generic-tlb-rename-have_mmu_gather_no_gather
+++ a/arch/Kconfig
@@ -402,7 +402,7 @@ config MMU_GATHER_PAGE_SIZE
 config MMU_GATHER_NO_RANGE
 	bool
 
-config HAVE_MMU_GATHER_NO_GATHER
+config MMU_GATHER_NO_GATHER
 	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
--- a/arch/s390/Kconfig~asm-generic-tlb-rename-have_mmu_gather_no_gather
+++ a/arch/s390/Kconfig
@@ -163,7 +163,7 @@ config S390
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_MEMBLOCK_PHYS_MAP
-	select HAVE_MMU_GATHER_NO_GATHER
+	select MMU_GATHER_NO_GATHER
 	select HAVE_MOD_ARCH_SPECIFIC
 	select HAVE_NOP_MCOUNT
 	select HAVE_OPROFILE
--- a/include/asm-generic/tlb.h~asm-generic-tlb-rename-have_mmu_gather_no_gather
+++ a/include/asm-generic/tlb.h
@@ -143,6 +143,16 @@
  *  MMU_GATHER_NO_RANGE
  *
  *  Use this if your architecture lacks an efficient flush_tlb_range().
+ *
+ *  MMU_GATHER_NO_GATHER
+ *
+ *  If the option is set the mmu_gather will not track individual pages for
+ *  delayed page free anymore. A platform that enables the option needs to
+ *  provide its own implementation of the __tlb_remove_page_size() function to
+ *  free pages.
+ *
+ *  This is useful if your architecture already flushes TLB entries in the
+ *  various ptep_get_and_clear() functions.
  */
 
 #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
@@ -202,7 +212,7 @@ extern void tlb_remove_table(struct mmu_
 #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
 
-#ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
 /*
  * If we can't allocate a page to make a big batch of page pointers
  * to work on, then just handle a few from the on-stack structure.
@@ -277,7 +287,7 @@ struct mmu_gather {
 
 	unsigned int		batch_count;
 
-#ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
 	struct page		*__pages[MMU_GATHER_BUNDLE];
--- a/mm/mmu_gather.c~asm-generic-tlb-rename-have_mmu_gather_no_gather
+++ a/mm/mmu_gather.c
@@ -11,7 +11,7 @@
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 
-#ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
 
 static bool tlb_next_batch(struct mmu_gather *tlb)
 {
@@ -89,7 +89,7 @@ bool __tlb_remove_page_size(struct mmu_g
 	return false;
 }
 
-#endif /* HAVE_MMU_GATHER_NO_GATHER */
+#endif /* MMU_GATHER_NO_GATHER */
 
 #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 
@@ -180,7 +180,7 @@ static void tlb_flush_mmu_free(struct mm
 #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
-#ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
 	tlb_batch_pages_flush(tlb);
 #endif
 }
@@ -211,7 +211,7 @@ void tlb_gather_mmu(struct mmu_gather *t
 	/* Is it from 0 to ~0? */
 	tlb->fullmm     = !(start | (end+1));
 
-#ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
 	tlb->need_flush_all = 0;
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
@@ -271,7 +271,7 @@ void tlb_finish_mmu(struct mmu_gather *t
 
 	tlb_flush_mmu(tlb);
 
-#ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
+#ifndef CONFIG_MMU_GATHER_NO_GATHER
 	tlb_batch_list_free(tlb);
 #endif
 	dec_tlb_flush_pending(tlb->mm);
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 56/67] asm-generic/tlb: provide MMU_GATHER_TABLE_FREE
  2020-02-04  1:33 incoming Andrew Morton
                   ` (54 preceding siblings ...)
  2020-02-04  1:37 ` [patch 55/67] asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 57/67] proc: decouple proc from VFS with "struct proc_ops" Andrew Morton
                   ` (182 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: akpm, aneesh.kumar, linux-mm, mm-commits, mpe, peterz, torvalds

From: Peter Zijlstra <peterz@infradead.org>
Subject: asm-generic/tlb: provide MMU_GATHER_TABLE_FREE

As described in the comment, the correct order for freeing pages is:

 1) unhook page
 2) TLB invalidate page
 3) free page

This order equally applies to page directories.

Currently there are two correct options:

 - use tlb_remove_page(), when all page directories are full pages and
   there are no further constraints placed by things like software
   walkers (HAVE_FAST_GUP).

 - use MMU_GATHER_RCU_TABLE_FREE and tlb_remove_table() when the
   architecture does not do IPI based TLB invalidate and has
   HAVE_FAST_GUP (or software TLB fill).

This, however, leaves architectures that don't have page-based directories
but don't need RCU in a bind.  For those, provide MMU_GATHER_TABLE_FREE,
which provides the independent batching for directories without the
additional RCU freeing.
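
As a rough, self-contained userspace model of the batching scheme (this is
not the kernel implementation; BATCH_MAX and the helper names are made up):
tables are queued after being unhooked, and the invalidate is issued once
per batch, strictly before any queued table is freed.

	#include <stdio.h>

	#define BATCH_MAX 4

	struct table_batch {
		int nr;
		void *tables[BATCH_MAX];
	};

	static void tlb_invalidate(void) { puts("2) TLB invalidate"); }
	static void free_table(void *t)  { printf("3) free %p\n", t); }

	/* step 1: caller has already unhooked 'table' from the page tables */
	static void remove_table(struct table_batch *b, void *table)
	{
		b->tables[b->nr++] = table;
		if (b->nr == BATCH_MAX) {	/* batch full: flush it */
			tlb_invalidate();	/* invalidate before any free */
			for (int i = 0; i < b->nr; i++)
				free_table(b->tables[i]);
			b->nr = 0;
		}
	}

	int main(void)
	{
		struct table_batch batch = { 0 };
		int dummy[5];

		for (int i = 0; i < 5; i++)
			remove_table(&batch, &dummy[i]);
		/* a final flush would drain the remaining entry */
		return 0;
	}

The RCU variant keeps the same shape but hands the full batch to call_rcu()
instead of freeing it directly, as the mm/mmu_gather.c hunk below shows.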

Link: http://lkml.kernel.org/r/20200116064531.483522-10-aneesh.kumar@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig               |    5 +
 arch/arm/include/asm/tlb.h |    4 -
 include/asm-generic/tlb.h  |   72 ++++++++++-----------
 mm/mmu_gather.c            |  120 +++++++++++++++++++++++++----------
 4 files changed, 130 insertions(+), 71 deletions(-)

--- a/arch/arm/include/asm/tlb.h~asm-generic-tlb-provide-mmu_gather_table_free
+++ a/arch/arm/include/asm/tlb.h
@@ -37,10 +37,6 @@ static inline void __tlb_remove_table(vo
 
 #include <asm-generic/tlb.h>
 
-#ifndef CONFIG_MMU_GATHER_RCU_TABLE_FREE
-#define tlb_remove_table(tlb, entry) tlb_remove_page(tlb, entry)
-#endif
-
 static inline void
 __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr)
 {
--- a/arch/Kconfig~asm-generic-tlb-provide-mmu_gather_table_free
+++ a/arch/Kconfig
@@ -393,8 +393,12 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_ARCH_JUMP_LABEL_RELATIVE
 	bool
 
+config MMU_GATHER_TABLE_FREE
+	bool
+
 config MMU_GATHER_RCU_TABLE_FREE
 	bool
+	select MMU_GATHER_TABLE_FREE
 
 config MMU_GATHER_PAGE_SIZE
 	bool
@@ -404,6 +408,7 @@ config MMU_GATHER_NO_RANGE
 
 config MMU_GATHER_NO_GATHER
 	bool
+	depends on MMU_GATHER_TABLE_FREE
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
--- a/include/asm-generic/tlb.h~asm-generic-tlb-provide-mmu_gather_table_free
+++ a/include/asm-generic/tlb.h
@@ -56,6 +56,15 @@
  *    Defaults to flushing at tlb_end_vma() to reset the range; helps when
  *    there's large holes between the VMAs.
  *
+ *  - tlb_remove_table()
+ *
+ *    tlb_remove_table() is the basic primitive to free page-table directories
+ *    (__p*_free_tlb()).  In it's most primitive form it is an alias for
+ *    tlb_remove_page() below, for when page directories are pages and have no
+ *    additional constraints.
+ *
+ *    See also MMU_GATHER_TABLE_FREE and MMU_GATHER_RCU_TABLE_FREE.
+ *
  *  - tlb_remove_page() / __tlb_remove_page()
  *  - tlb_remove_page_size() / __tlb_remove_page_size()
  *
@@ -129,17 +138,24 @@
  *  This might be useful if your architecture has size specific TLB
  *  invalidation instructions.
  *
- *  MMU_GATHER_RCU_TABLE_FREE
+ *  MMU_GATHER_TABLE_FREE
  *
  *  This provides tlb_remove_table(), to be used instead of tlb_remove_page()
- *  for page directores (__p*_free_tlb()). This provides separate freeing of
- *  the page-table pages themselves in a semi-RCU fashion (see comment below).
- *  Useful if your architecture doesn't use IPIs for remote TLB invalidates
- *  and therefore doesn't naturally serialize with software page-table walkers.
+ *  for page directores (__p*_free_tlb()).
+ *
+ *  Useful if your architecture has non-page page directories.
  *
  *  When used, an architecture is expected to provide __tlb_remove_table()
  *  which does the actual freeing of these pages.
  *
+ *  MMU_GATHER_RCU_TABLE_FREE
+ *
+ *  Like MMU_GATHER_TABLE_FREE, and adds semi-RCU semantics to the free (see
+ *  comment below).
+ *
+ *  Useful if your architecture doesn't use IPIs for remote TLB invalidates
+ *  and therefore doesn't naturally serialize with software page-table walkers.
+ *
  *  MMU_GATHER_NO_RANGE
  *
  *  Use this if your architecture lacks an efficient flush_tlb_range().
@@ -155,37 +171,12 @@
  *  various ptep_get_and_clear() functions.
  */
 
-#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
-/*
- * Semi RCU freeing of the page directories.
- *
- * This is needed by some architectures to implement software pagetable walkers.
- *
- * gup_fast() and other software pagetable walkers do a lockless page-table
- * walk and therefore needs some synchronization with the freeing of the page
- * directories. The chosen means to accomplish that is by disabling IRQs over
- * the walk.
- *
- * Architectures that use IPIs to flush TLBs will then automagically DTRT,
- * since we unlink the page, flush TLBs, free the page. Since the disabling of
- * IRQs delays the completion of the TLB flush we can never observe an already
- * freed page.
- *
- * Architectures that do not have this (PPC) need to delay the freeing by some
- * other means, this is that means.
- *
- * What we do is batch the freed directory pages (tables) and RCU free them.
- * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling
- * holds off grace periods.
- *
- * However, in order to batch these pages we need to allocate storage, this
- * allocation is deep inside the MM code and can thus easily fail on memory
- * pressure. To guarantee progress we fall back to single table freeing, see
- * the implementation of tlb_remove_table_one().
- *
- */
+#ifdef CONFIG_MMU_GATHER_TABLE_FREE
+
 struct mmu_table_batch {
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 	struct rcu_head		rcu;
+#endif
 	unsigned int		nr;
 	void			*tables[0];
 };
@@ -195,6 +186,17 @@ struct mmu_table_batch {
 
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+#else /* !CONFIG_MMU_GATHER_HAVE_TABLE_FREE */
+
+/*
+ * Without MMU_GATHER_TABLE_FREE the architecture is assumed to have page based
+ * page directories and we can use the normal page batching to free them.
+ */
+#define tlb_remove_table(tlb, page) tlb_remove_page((tlb), (page))
+
+#endif /* CONFIG_MMU_GATHER_TABLE_FREE */
+
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 /*
  * This allows an architecture that does not use the linux page-tables for
  * hardware to skip the TLBI when freeing page tables.
@@ -248,7 +250,7 @@ extern bool __tlb_remove_page_size(struc
 struct mmu_gather {
 	struct mm_struct	*mm;
 
-#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
+#ifdef CONFIG_MMU_GATHER_TABLE_FREE
 	struct mmu_table_batch	*batch;
 #endif
 
--- a/mm/mmu_gather.c~asm-generic-tlb-provide-mmu_gather_table_free
+++ a/mm/mmu_gather.c
@@ -91,56 +91,106 @@ bool __tlb_remove_page_size(struct mmu_g
 
 #endif /* MMU_GATHER_NO_GATHER */
 
-#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
+#ifdef CONFIG_MMU_GATHER_TABLE_FREE
 
-/*
- * See the comment near struct mmu_table_batch.
- */
+static void __tlb_remove_table_free(struct mmu_table_batch *batch)
+{
+	int i;
+
+	for (i = 0; i < batch->nr; i++)
+		__tlb_remove_table(batch->tables[i]);
+
+	free_page((unsigned long)batch);
+}
+
+#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 
 /*
- * If we want tlb_remove_table() to imply TLB invalidates.
+ * Semi RCU freeing of the page directories.
+ *
+ * This is needed by some architectures to implement software pagetable walkers.
+ *
+ * gup_fast() and other software pagetable walkers do a lockless page-table
+ * walk and therefore needs some synchronization with the freeing of the page
+ * directories. The chosen means to accomplish that is by disabling IRQs over
+ * the walk.
+ *
+ * Architectures that use IPIs to flush TLBs will then automagically DTRT,
+ * since we unlink the page, flush TLBs, free the page. Since the disabling of
+ * IRQs delays the completion of the TLB flush we can never observe an already
+ * freed page.
+ *
+ * Architectures that do not have this (PPC) need to delay the freeing by some
+ * other means, this is that means.
+ *
+ * What we do is batch the freed directory pages (tables) and RCU free them.
+ * We use the sched RCU variant, as that guarantees that IRQ/preempt disabling
+ * holds off grace periods.
+ *
+ * However, in order to batch these pages we need to allocate storage, this
+ * allocation is deep inside the MM code and can thus easily fail on memory
+ * pressure. To guarantee progress we fall back to single table freeing, see
+ * the implementation of tlb_remove_table_one().
+ *
  */
-static inline void tlb_table_invalidate(struct mmu_gather *tlb)
-{
-	if (tlb_needs_table_invalidate()) {
-		/*
-		 * Invalidate page-table caches used by hardware walkers. Then
-		 * we still need to RCU-sched wait while freeing the pages
-		 * because software walkers can still be in-flight.
-		 */
-		tlb_flush_mmu_tlbonly(tlb);
-	}
-}
 
 static void tlb_remove_table_smp_sync(void *arg)
 {
 	/* Simply deliver the interrupt */
 }
 
-static void tlb_remove_table_one(void *table)
+static void tlb_remove_table_sync_one(void)
 {
 	/*
 	 * This isn't an RCU grace period and hence the page-tables cannot be
 	 * assumed to be actually RCU-freed.
 	 *
 	 * It is however sufficient for software page-table walkers that rely on
-	 * IRQ disabling. See the comment near struct mmu_table_batch.
+	 * IRQ disabling.
 	 */
 	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
-	__tlb_remove_table(table);
 }
 
 static void tlb_remove_table_rcu(struct rcu_head *head)
 {
-	struct mmu_table_batch *batch;
-	int i;
+	__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
+}
 
-	batch = container_of(head, struct mmu_table_batch, rcu);
+static void tlb_remove_table_free(struct mmu_table_batch *batch)
+{
+	call_rcu(&batch->rcu, tlb_remove_table_rcu);
+}
 
-	for (i = 0; i < batch->nr; i++)
-		__tlb_remove_table(batch->tables[i]);
+#else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
-	free_page((unsigned long)batch);
+static void tlb_remove_table_sync_one(void) { }
+
+static void tlb_remove_table_free(struct mmu_table_batch *batch)
+{
+	__tlb_remove_table_free(batch);
+}
+
+#endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+
+/*
+ * If we want tlb_remove_table() to imply TLB invalidates.
+ */
+static inline void tlb_table_invalidate(struct mmu_gather *tlb)
+{
+	if (tlb_needs_table_invalidate()) {
+		/*
+		 * Invalidate page-table caches used by hardware walkers. Then
+		 * we still need to RCU-sched wait while freeing the pages
+		 * because software walkers can still be in-flight.
+		 */
+		tlb_flush_mmu_tlbonly(tlb);
+	}
+}
+
+static void tlb_remove_table_one(void *table)
+{
+	tlb_remove_table_sync_one();
+	__tlb_remove_table(table);
 }
 
 static void tlb_table_flush(struct mmu_gather *tlb)
@@ -149,7 +199,7 @@ static void tlb_table_flush(struct mmu_g
 
 	if (*batch) {
 		tlb_table_invalidate(tlb);
-		call_rcu(&(*batch)->rcu, tlb_remove_table_rcu);
+		tlb_remove_table_free(*batch);
 		*batch = NULL;
 	}
 }
@@ -173,13 +223,21 @@ void tlb_remove_table(struct mmu_gather
 		tlb_table_flush(tlb);
 }
 
-#endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
+static inline void tlb_table_init(struct mmu_gather *tlb)
+{
+	tlb->batch = NULL;
+}
+
+#else /* !CONFIG_MMU_GATHER_TABLE_FREE */
+
+static inline void tlb_table_flush(struct mmu_gather *tlb) { }
+static inline void tlb_table_init(struct mmu_gather *tlb) { }
+
+#endif /* CONFIG_MMU_GATHER_TABLE_FREE */
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
-#endif
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
 	tlb_batch_pages_flush(tlb);
 #endif
@@ -220,9 +278,7 @@ void tlb_gather_mmu(struct mmu_gather *t
 	tlb->batch_count = 0;
 #endif
 
-#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
-	tlb->batch = NULL;
-#endif
+	tlb_table_init(tlb);
 #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
 	tlb->page_size = 0;
 #endif
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 57/67] proc: decouple proc from VFS with "struct proc_ops"
  2020-02-04  1:33 incoming Andrew Morton
                   ` (55 preceding siblings ...)
  2020-02-04  1:37 ` [patch 56/67] asm-generic/tlb: provide MMU_GATHER_TABLE_FREE Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 58/67] proc: convert everything to " Andrew Morton
                   ` (181 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: proc: decouple proc from VFS with "struct proc_ops"

Currently core /proc code uses "struct file_operations" for custom hooks;
however, VFS doesn't call them directly.  Every time VFS expands the
file_operations hook set, /proc code bloats for no reason.

Introduce "struct proc_ops", which contains only those hooks that /proc
actually calls into (open, release, read, write, ioctl, mmap, poll).  It
doesn't contain a module pointer either.
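
As a hedged usage sketch (the "example" names are made up; the proc_* hooks
are the ones introduced by this patch), a simple read-only entry would now
be registered like this:

	#include <linux/proc_fs.h>
	#include <linux/seq_file.h>

	static int example_show(struct seq_file *m, void *v)
	{
		seq_puts(m, "hello\n");
		return 0;
	}

	static int example_open(struct inode *inode, struct file *file)
	{
		return single_open(file, example_show, NULL);
	}

	static const struct proc_ops example_proc_ops = {
		.proc_open	= example_open,
		.proc_read	= seq_read,
		.proc_lseek	= seq_lseek,
		.proc_release	= single_release,
		/* no .owner: proc_ops carries no module pointer */
	};

	/* registration keeps its shape, only the ops type changes:
	 *	proc_create("example", 0444, NULL, &example_proc_ops);
	 */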

Save ~184 bytes per usage:

	add/remove: 26/26 grow/shrink: 1/4 up/down: 1922/-6674 (-4752)
	Function                                     old     new   delta
	sysvipc_proc_ops                               -      72     +72
				...
	config_gz_proc_ops                             -      72     +72
	proc_get_inode                               289     339     +50
	proc_reg_get_unmapped_area                   110     107      -3
	close_pdeo                                   227     224      -3
	proc_reg_open                                289     284      -5
	proc_create_data                              60      53      -7
	rt_cpu_seq_fops                              256       -    -256
				...
	default_affinity_proc_fops                   256       -    -256
	Total: Before=5430095, After=5425343, chg -0.09%

Link: http://lkml.kernel.org/r/20191225172228.GA13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/generic.c       |   38 +++++++++----------
 fs/proc/inode.c         |   76 +++++++++++++++++++-------------------
 fs/proc/internal.h      |    5 ++
 fs/proc/proc_net.c      |   32 ++++++++--------
 fs/proc/proc_sysctl.c   |    2 -
 fs/proc/root.c          |    2 -
 include/linux/proc_fs.h |   23 +++++++++--
 7 files changed, 98 insertions(+), 80 deletions(-)

--- a/fs/proc/generic.c~proc-decouple-proc-from-vfs-with-struct-proc_ops
+++ a/fs/proc/generic.c
@@ -473,7 +473,7 @@ struct proc_dir_entry *proc_mkdir_data(c
 	ent = __proc_create(&parent, name, S_IFDIR | mode, 2);
 	if (ent) {
 		ent->data = data;
-		ent->proc_fops = &proc_dir_operations;
+		ent->proc_dir_ops = &proc_dir_operations;
 		ent->proc_iops = &proc_dir_inode_operations;
 		ent = proc_register(parent, ent);
 	}
@@ -503,7 +503,7 @@ struct proc_dir_entry *proc_create_mount
 	ent = __proc_create(&parent, name, mode, 2);
 	if (ent) {
 		ent->data = NULL;
-		ent->proc_fops = NULL;
+		ent->proc_dir_ops = NULL;
 		ent->proc_iops = NULL;
 		ent = proc_register(parent, ent);
 	}
@@ -533,25 +533,23 @@ struct proc_dir_entry *proc_create_reg(c
 
 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
 		struct proc_dir_entry *parent,
-		const struct file_operations *proc_fops, void *data)
+		const struct proc_ops *proc_ops, void *data)
 {
 	struct proc_dir_entry *p;
 
-	BUG_ON(proc_fops == NULL);
-
 	p = proc_create_reg(name, mode, &parent, data);
 	if (!p)
 		return NULL;
-	p->proc_fops = proc_fops;
+	p->proc_ops = proc_ops;
 	return proc_register(parent, p);
 }
 EXPORT_SYMBOL(proc_create_data);
  
 struct proc_dir_entry *proc_create(const char *name, umode_t mode,
 				   struct proc_dir_entry *parent,
-				   const struct file_operations *proc_fops)
+				   const struct proc_ops *proc_ops)
 {
-	return proc_create_data(name, mode, parent, proc_fops, NULL);
+	return proc_create_data(name, mode, parent, proc_ops, NULL);
 }
 EXPORT_SYMBOL(proc_create);
 
@@ -573,11 +571,11 @@ static int proc_seq_release(struct inode
 	return seq_release(inode, file);
 }
 
-static const struct file_operations proc_seq_fops = {
-	.open		= proc_seq_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= proc_seq_release,
+static const struct proc_ops proc_seq_ops = {
+	.proc_open	= proc_seq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= proc_seq_release,
 };
 
 struct proc_dir_entry *proc_create_seq_private(const char *name, umode_t mode,
@@ -589,7 +587,7 @@ struct proc_dir_entry *proc_create_seq_p
 	p = proc_create_reg(name, mode, &parent, data);
 	if (!p)
 		return NULL;
-	p->proc_fops = &proc_seq_fops;
+	p->proc_ops = &proc_seq_ops;
 	p->seq_ops = ops;
 	p->state_size = state_size;
 	return proc_register(parent, p);
@@ -603,11 +601,11 @@ static int proc_single_open(struct inode
 	return single_open(file, de->single_show, de->data);
 }
 
-static const struct file_operations proc_single_fops = {
-	.open		= proc_single_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
+static const struct proc_ops proc_single_ops = {
+	.proc_open	= proc_single_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 struct proc_dir_entry *proc_create_single_data(const char *name, umode_t mode,
@@ -619,7 +617,7 @@ struct proc_dir_entry *proc_create_singl
 	p = proc_create_reg(name, mode, &parent, data);
 	if (!p)
 		return NULL;
-	p->proc_fops = &proc_single_fops;
+	p->proc_ops = &proc_single_ops;
 	p->single_show = show;
 	return proc_register(parent, p);
 }
--- a/fs/proc/inode.c~proc-decouple-proc-from-vfs-with-struct-proc_ops
+++ a/fs/proc/inode.c
@@ -163,7 +163,7 @@ static void close_pdeo(struct proc_dir_e
 		pdeo->closing = true;
 		spin_unlock(&pde->pde_unload_lock);
 		file = pdeo->file;
-		pde->proc_fops->release(file_inode(file), file);
+		pde->proc_ops->proc_release(file_inode(file), file);
 		spin_lock(&pde->pde_unload_lock);
 		/* After ->release. */
 		list_del(&pdeo->lh);
@@ -200,12 +200,12 @@ static loff_t proc_reg_llseek(struct fil
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	loff_t rv = -EINVAL;
 	if (use_pde(pde)) {
-		typeof_member(struct file_operations, llseek) llseek;
+		typeof_member(struct proc_ops, proc_lseek) lseek;
 
-		llseek = pde->proc_fops->llseek;
-		if (!llseek)
-			llseek = default_llseek;
-		rv = llseek(file, offset, whence);
+		lseek = pde->proc_ops->proc_lseek;
+		if (!lseek)
+			lseek = default_llseek;
+		rv = lseek(file, offset, whence);
 		unuse_pde(pde);
 	}
 	return rv;
@@ -216,9 +216,9 @@ static ssize_t proc_reg_read(struct file
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	ssize_t rv = -EIO;
 	if (use_pde(pde)) {
-		typeof_member(struct file_operations, read) read;
+		typeof_member(struct proc_ops, proc_read) read;
 
-		read = pde->proc_fops->read;
+		read = pde->proc_ops->proc_read;
 		if (read)
 			rv = read(file, buf, count, ppos);
 		unuse_pde(pde);
@@ -231,9 +231,9 @@ static ssize_t proc_reg_write(struct fil
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	ssize_t rv = -EIO;
 	if (use_pde(pde)) {
-		typeof_member(struct file_operations, write) write;
+		typeof_member(struct proc_ops, proc_write) write;
 
-		write = pde->proc_fops->write;
+		write = pde->proc_ops->proc_write;
 		if (write)
 			rv = write(file, buf, count, ppos);
 		unuse_pde(pde);
@@ -246,9 +246,9 @@ static __poll_t proc_reg_poll(struct fil
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	__poll_t rv = DEFAULT_POLLMASK;
 	if (use_pde(pde)) {
-		typeof_member(struct file_operations, poll) poll;
+		typeof_member(struct proc_ops, proc_poll) poll;
 
-		poll = pde->proc_fops->poll;
+		poll = pde->proc_ops->proc_poll;
 		if (poll)
 			rv = poll(file, pts);
 		unuse_pde(pde);
@@ -261,9 +261,9 @@ static long proc_reg_unlocked_ioctl(stru
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	long rv = -ENOTTY;
 	if (use_pde(pde)) {
-		typeof_member(struct file_operations, unlocked_ioctl) ioctl;
+		typeof_member(struct proc_ops, proc_ioctl) ioctl;
 
-		ioctl = pde->proc_fops->unlocked_ioctl;
+		ioctl = pde->proc_ops->proc_ioctl;
 		if (ioctl)
 			rv = ioctl(file, cmd, arg);
 		unuse_pde(pde);
@@ -277,9 +277,9 @@ static long proc_reg_compat_ioctl(struct
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	long rv = -ENOTTY;
 	if (use_pde(pde)) {
-		typeof_member(struct file_operations, compat_ioctl) compat_ioctl;
+		typeof_member(struct proc_ops, proc_compat_ioctl) compat_ioctl;
 
-		compat_ioctl = pde->proc_fops->compat_ioctl;
+		compat_ioctl = pde->proc_ops->proc_compat_ioctl;
 		if (compat_ioctl)
 			rv = compat_ioctl(file, cmd, arg);
 		unuse_pde(pde);
@@ -293,9 +293,9 @@ static int proc_reg_mmap(struct file *fi
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	int rv = -EIO;
 	if (use_pde(pde)) {
-		typeof_member(struct file_operations, mmap) mmap;
+		typeof_member(struct proc_ops, proc_mmap) mmap;
 
-		mmap = pde->proc_fops->mmap;
+		mmap = pde->proc_ops->proc_mmap;
 		if (mmap)
 			rv = mmap(file, vma);
 		unuse_pde(pde);
@@ -312,9 +312,9 @@ proc_reg_get_unmapped_area(struct file *
 	unsigned long rv = -EIO;
 
 	if (use_pde(pde)) {
-		typeof_member(struct file_operations, get_unmapped_area) get_area;
+		typeof_member(struct proc_ops, proc_get_unmapped_area) get_area;
 
-		get_area = pde->proc_fops->get_unmapped_area;
+		get_area = pde->proc_ops->proc_get_unmapped_area;
 #ifdef CONFIG_MMU
 		if (!get_area)
 			get_area = current->mm->get_unmapped_area;
@@ -333,8 +333,8 @@ static int proc_reg_open(struct inode *i
 {
 	struct proc_dir_entry *pde = PDE(inode);
 	int rv = 0;
-	typeof_member(struct file_operations, open) open;
-	typeof_member(struct file_operations, release) release;
+	typeof_member(struct proc_ops, proc_open) open;
+	typeof_member(struct proc_ops, proc_release) release;
 	struct pde_opener *pdeo;
 
 	/*
@@ -351,7 +351,7 @@ static int proc_reg_open(struct inode *i
 	if (!use_pde(pde))
 		return -ENOENT;
 
-	release = pde->proc_fops->release;
+	release = pde->proc_ops->proc_release;
 	if (release) {
 		pdeo = kmem_cache_alloc(pde_opener_cache, GFP_KERNEL);
 		if (!pdeo) {
@@ -360,7 +360,7 @@ static int proc_reg_open(struct inode *i
 		}
 	}
 
-	open = pde->proc_fops->open;
+	open = pde->proc_ops->proc_open;
 	if (open)
 		rv = open(inode, file);
 
@@ -468,21 +468,23 @@ struct inode *proc_get_inode(struct supe
 			inode->i_size = de->size;
 		if (de->nlink)
 			set_nlink(inode, de->nlink);
-		WARN_ON(!de->proc_iops);
-		inode->i_op = de->proc_iops;
-		if (de->proc_fops) {
-			if (S_ISREG(inode->i_mode)) {
+
+		if (S_ISREG(inode->i_mode)) {
+			inode->i_op = de->proc_iops;
+			inode->i_fop = &proc_reg_file_ops;
 #ifdef CONFIG_COMPAT
-				if (!de->proc_fops->compat_ioctl)
-					inode->i_fop =
-						&proc_reg_file_ops_no_compat;
-				else
-#endif
-					inode->i_fop = &proc_reg_file_ops;
-			} else {
-				inode->i_fop = de->proc_fops;
+			if (!de->proc_ops->proc_compat_ioctl) {
+				inode->i_fop = &proc_reg_file_ops_no_compat;
 			}
-		}
+#endif
+		} else if (S_ISDIR(inode->i_mode)) {
+			inode->i_op = de->proc_iops;
+			inode->i_fop = de->proc_dir_ops;
+		} else if (S_ISLNK(inode->i_mode)) {
+			inode->i_op = de->proc_iops;
+			inode->i_fop = NULL;
+		} else
+			BUG();
 	} else
 	       pde_put(de);
 	return inode;
--- a/fs/proc/internal.h~proc-decouple-proc-from-vfs-with-struct-proc_ops
+++ a/fs/proc/internal.h
@@ -39,7 +39,10 @@ struct proc_dir_entry {
 	spinlock_t pde_unload_lock;
 	struct completion *pde_unload_completion;
 	const struct inode_operations *proc_iops;
-	const struct file_operations *proc_fops;
+	union {
+		const struct proc_ops *proc_ops;
+		const struct file_operations *proc_dir_ops;
+	};
 	const struct dentry_operations *proc_dops;
 	union {
 		const struct seq_operations *seq_ops;
--- a/fs/proc/proc_net.c~proc-decouple-proc-from-vfs-with-struct-proc_ops
+++ a/fs/proc/proc_net.c
@@ -90,12 +90,12 @@ static int seq_release_net(struct inode
 	return 0;
 }
 
-static const struct file_operations proc_net_seq_fops = {
-	.open		= seq_open_net,
-	.read		= seq_read,
-	.write		= proc_simple_write,
-	.llseek		= seq_lseek,
-	.release	= seq_release_net,
+static const struct proc_ops proc_net_seq_ops = {
+	.proc_open	= seq_open_net,
+	.proc_read	= seq_read,
+	.proc_write	= proc_simple_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release_net,
 };
 
 struct proc_dir_entry *proc_create_net_data(const char *name, umode_t mode,
@@ -108,7 +108,7 @@ struct proc_dir_entry *proc_create_net_d
 	if (!p)
 		return NULL;
 	pde_force_lookup(p);
-	p->proc_fops = &proc_net_seq_fops;
+	p->proc_ops = &proc_net_seq_ops;
 	p->seq_ops = ops;
 	p->state_size = state_size;
 	return proc_register(parent, p);
@@ -152,7 +152,7 @@ struct proc_dir_entry *proc_create_net_d
 	if (!p)
 		return NULL;
 	pde_force_lookup(p);
-	p->proc_fops = &proc_net_seq_fops;
+	p->proc_ops = &proc_net_seq_ops;
 	p->seq_ops = ops;
 	p->state_size = state_size;
 	p->write = write;
@@ -183,12 +183,12 @@ static int single_release_net(struct ino
 	return single_release(ino, f);
 }
 
-static const struct file_operations proc_net_single_fops = {
-	.open		= single_open_net,
-	.read		= seq_read,
-	.write		= proc_simple_write,
-	.llseek		= seq_lseek,
-	.release	= single_release_net,
+static const struct proc_ops proc_net_single_ops = {
+	.proc_open	= single_open_net,
+	.proc_read	= seq_read,
+	.proc_write	= proc_simple_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release_net,
 };
 
 struct proc_dir_entry *proc_create_net_single(const char *name, umode_t mode,
@@ -201,7 +201,7 @@ struct proc_dir_entry *proc_create_net_s
 	if (!p)
 		return NULL;
 	pde_force_lookup(p);
-	p->proc_fops = &proc_net_single_fops;
+	p->proc_ops = &proc_net_single_ops;
 	p->single_show = show;
 	return proc_register(parent, p);
 }
@@ -244,7 +244,7 @@ struct proc_dir_entry *proc_create_net_s
 	if (!p)
 		return NULL;
 	pde_force_lookup(p);
-	p->proc_fops = &proc_net_single_fops;
+	p->proc_ops = &proc_net_single_ops;
 	p->single_show = show;
 	p->write = write;
 	return proc_register(parent, p);
--- a/fs/proc/proc_sysctl.c~proc-decouple-proc-from-vfs-with-struct-proc_ops
+++ a/fs/proc/proc_sysctl.c
@@ -1720,7 +1720,7 @@ int __init proc_sys_init(void)
 
 	proc_sys_root = proc_mkdir("sys", NULL);
 	proc_sys_root->proc_iops = &proc_sys_dir_operations;
-	proc_sys_root->proc_fops = &proc_sys_dir_file_operations;
+	proc_sys_root->proc_dir_ops = &proc_sys_dir_file_operations;
 	proc_sys_root->nlink = 0;
 
 	return sysctl_init();
--- a/fs/proc/root.c~proc-decouple-proc-from-vfs-with-struct-proc_ops
+++ a/fs/proc/root.c
@@ -292,7 +292,7 @@ struct proc_dir_entry proc_root = {
 	.nlink		= 2, 
 	.refcnt		= REFCOUNT_INIT(1),
 	.proc_iops	= &proc_root_inode_operations, 
-	.proc_fops	= &proc_root_operations,
+	.proc_dir_ops	= &proc_root_operations,
 	.parent		= &proc_root,
 	.subdir		= RB_ROOT,
 	.name		= "/proc",
--- a/include/linux/proc_fs.h~proc-decouple-proc-from-vfs-with-struct-proc_ops
+++ a/include/linux/proc_fs.h
@@ -12,6 +12,21 @@ struct proc_dir_entry;
 struct seq_file;
 struct seq_operations;
 
+struct proc_ops {
+	int	(*proc_open)(struct inode *, struct file *);
+	ssize_t	(*proc_read)(struct file *, char __user *, size_t, loff_t *);
+	ssize_t	(*proc_write)(struct file *, const char __user *, size_t, loff_t *);
+	loff_t	(*proc_lseek)(struct file *, loff_t, int);
+	int	(*proc_release)(struct inode *, struct file *);
+	__poll_t (*proc_poll)(struct file *, struct poll_table_struct *);
+	long	(*proc_ioctl)(struct file *, unsigned int, unsigned long);
+#ifdef CONFIG_COMPAT
+	long	(*proc_compat_ioctl)(struct file *, unsigned int, unsigned long);
+#endif
+	int	(*proc_mmap)(struct file *, struct vm_area_struct *);
+	unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
+};
+
 #ifdef CONFIG_PROC_FS
 
 typedef int (*proc_write_t)(struct file *, char *, size_t);
@@ -43,10 +58,10 @@ struct proc_dir_entry *proc_create_singl
  
 extern struct proc_dir_entry *proc_create_data(const char *, umode_t,
 					       struct proc_dir_entry *,
-					       const struct file_operations *,
+					       const struct proc_ops *,
 					       void *);
 
-struct proc_dir_entry *proc_create(const char *name, umode_t mode, struct proc_dir_entry *parent, const struct file_operations *proc_fops);
+struct proc_dir_entry *proc_create(const char *name, umode_t mode, struct proc_dir_entry *parent, const struct proc_ops *proc_ops);
 extern void proc_set_size(struct proc_dir_entry *, loff_t);
 extern void proc_set_user(struct proc_dir_entry *, kuid_t, kgid_t);
 extern void *PDE_DATA(const struct inode *);
@@ -108,8 +123,8 @@ static inline struct proc_dir_entry *pro
 #define proc_create_seq(name, mode, parent, ops) ({NULL;})
 #define proc_create_single(name, mode, parent, show) ({NULL;})
 #define proc_create_single_data(name, mode, parent, show, data) ({NULL;})
-#define proc_create(name, mode, parent, proc_fops) ({NULL;})
-#define proc_create_data(name, mode, parent, proc_fops, data) ({NULL;})
+#define proc_create(name, mode, parent, proc_ops) ({NULL;})
+#define proc_create_data(name, mode, parent, proc_ops, data) ({NULL;})
 
 static inline void proc_set_size(struct proc_dir_entry *de, loff_t size) {}
 static inline void proc_set_user(struct proc_dir_entry *de, kuid_t uid, kgid_t gid) {}
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 58/67] proc: convert everything to "struct proc_ops"
  2020-02-04  1:33 incoming Andrew Morton
                   ` (56 preceding siblings ...)
  2020-02-04  1:37 ` [patch 57/67] proc: decouple proc from VFS with "struct proc_ops" Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 59/67] lib/string: add strnchrnul() Andrew Morton
                   ` (180 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, sfr, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: proc: convert everything to "struct proc_ops"

The most notable change is the DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

The conversion rule, illustrated below, is:

	llseek		=> proc_lseek
	unlocked_ioctl	=> proc_ioctl

	xxx		=> proc_xxx

	delete ".owner = THIS_MODULE" line
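
A hedged before/after illustration of the rule (the foo_* names are made
up):

	/* before */
	static const struct file_operations foo_fops = {
		.owner		= THIS_MODULE,	/* line deleted */
		.open		= foo_open,
		.read		= seq_read,
		.llseek		= seq_lseek,
		.unlocked_ioctl	= foo_ioctl,
		.release	= single_release,
	};

	/* after */
	static const struct proc_ops foo_proc_ops = {
		.proc_open	= foo_open,
		.proc_read	= seq_read,
		.proc_lseek	= seq_lseek,	/* llseek => proc_lseek */
		.proc_ioctl	= foo_ioctl,	/* unlocked_ioctl => proc_ioctl */
		.proc_release	= single_release,
	};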

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/kernel/srm_env.c                           |   17 -
 arch/arm/kernel/atags_proc.c                          |    8 
 arch/arm/mm/alignment.c                               |   14 -
 arch/ia64/kernel/salinfo.c                            |   24 -
 arch/m68k/kernel/bootinfo_proc.c                      |    8 
 arch/mips/lasat/picvue_proc.c                         |   31 +-
 arch/powerpc/kernel/proc_powerpc.c                    |   10 
 arch/powerpc/kernel/rtas-proc.c                       |   70 ++---
 arch/powerpc/kernel/rtas_flash.c                      |   34 +-
 arch/powerpc/kernel/rtasd.c                           |   14 -
 arch/powerpc/mm/numa.c                                |   12 
 arch/powerpc/platforms/pseries/lpar.c                 |   24 -
 arch/powerpc/platforms/pseries/lparcfg.c              |   14 -
 arch/powerpc/platforms/pseries/reconfig.c             |    8 
 arch/powerpc/platforms/pseries/scanlog.c              |   15 -
 arch/sh/mm/alignment.c                                |   17 -
 arch/sparc/kernel/led.c                               |   15 -
 arch/um/drivers/mconsole_kern.c                       |    9 
 arch/um/kernel/exitcode.c                             |   15 -
 arch/um/kernel/process.c                              |   15 -
 arch/x86/kernel/cpu/mtrr/if.c                         |   21 -
 arch/x86/platform/uv/tlb_uv.c                         |   14 -
 arch/xtensa/platforms/iss/simdisk.c                   |   10 
 drivers/acpi/battery.c                                |   15 -
 drivers/acpi/proc.c                                   |   15 -
 drivers/hwmon/dell-smm-hwmon.c                        |   15 -
 drivers/ide/ide-proc.c                                |   19 -
 drivers/input/input.c                                 |   28 +-
 drivers/isdn/capi/kcapi_proc.c                        |    6 
 drivers/macintosh/via-pmu.c                           |   17 -
 drivers/md/md.c                                       |   15 -
 drivers/misc/sgi-gru/gruprocfs.c                      |   42 +--
 drivers/net/wireless/cisco/airo.c                     |  126 ++++------
 drivers/net/wireless/intel/ipw2x00/libipw_module.c    |   15 -
 drivers/net/wireless/intersil/hostap/hostap_hw.c      |    4 
 drivers/net/wireless/intersil/hostap/hostap_proc.c    |   14 -
 drivers/net/wireless/intersil/hostap/hostap_wlan.h    |    2 
 drivers/net/wireless/ray_cs.c                         |   20 -
 drivers/parisc/led.c                                  |   17 -
 drivers/pci/proc.c                                    |   25 +
 drivers/platform/x86/thinkpad_acpi.c                  |   15 -
 drivers/platform/x86/toshiba_acpi.c                   |   60 ++--
 drivers/pnp/isapnp/proc.c                             |    9 
 drivers/pnp/pnpbios/proc.c                            |   17 -
 drivers/s390/block/dasd_proc.c                        |   15 -
 drivers/s390/cio/blacklist.c                          |   14 -
 drivers/s390/cio/css.c                                |   11 
 drivers/scsi/esas2r/esas2r_main.c                     |    9 
 drivers/scsi/scsi_devinfo.c                           |   15 -
 drivers/scsi/scsi_proc.c                              |   29 +-
 drivers/scsi/sg.c                                     |   30 +-
 drivers/staging/rtl8192u/ieee80211/ieee80211_module.c |   14 -
 drivers/tty/sysrq.c                                   |    8 
 drivers/usb/gadget/function/rndis.c                   |   17 -
 drivers/video/fbdev/via/viafbdev.c                    |  105 +++-----
 drivers/zorro/proc.c                                  |    9 
 fs/cifs/cifs_debug.c                                  |  108 ++++----
 fs/cifs/dfs_cache.c                                   |   13 -
 fs/cifs/dfs_cache.h                                   |    2 
 fs/fscache/internal.h                                 |    2 
 fs/fscache/object-list.c                              |   11 
 fs/fscache/proc.c                                     |    2 
 fs/jbd2/journal.c                                     |   13 -
 fs/jfs/jfs_debug.c                                    |   14 -
 fs/lockd/procfs.c                                     |   12 
 fs/nfsd/nfsctl.c                                      |   13 -
 fs/nfsd/stats.c                                       |   12 
 fs/proc/cpuinfo.c                                     |   12 
 fs/proc/kcore.c                                       |   13 -
 fs/proc/kmsg.c                                        |   14 -
 fs/proc/page.c                                        |   24 -
 fs/proc/stat.c                                        |   12 
 fs/proc/vmcore.c                                      |   10 
 include/linux/seq_file.h                              |   13 +
 include/linux/sunrpc/stats.h                          |    4 
 ipc/util.c                                            |   14 -
 kernel/configs.c                                      |    9 
 kernel/irq/proc.c                                     |   42 +--
 kernel/kallsyms.c                                     |   12 
 kernel/latencytop.c                                   |   14 -
 kernel/locking/lockdep_proc.c                         |   15 -
 kernel/module.c                                       |   12 
 kernel/profile.c                                      |   24 -
 kernel/sched/psi.c                                    |   48 +--
 mm/slab_common.c                                      |   15 -
 mm/swapfile.c                                         |   14 -
 net/atm/mpoa_proc.c                                   |   17 -
 net/atm/proc.c                                        |    8 
 net/core/pktgen.c                                     |   44 +--
 net/ipv4/ipconfig.c                                   |   10 
 net/ipv4/netfilter/ipt_CLUSTERIP.c                    |   16 -
 net/ipv4/route.c                                      |   24 -
 net/netfilter/xt_recent.c                             |   17 -
 net/sunrpc/auth_gss/svcauth_gss.c                     |   10 
 net/sunrpc/cache.c                                    |   45 +--
 net/sunrpc/stats.c                                    |   21 -
 samples/kfifo/bytestream-example.c                    |   11 
 samples/kfifo/inttype-example.c                       |   11 
 samples/kfifo/record-example.c                        |   11 
 sound/core/info.c                                     |   34 +-
 100 files changed, 974 insertions(+), 1019 deletions(-)

--- a/arch/alpha/kernel/srm_env.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/alpha/kernel/srm_env.c
@@ -119,13 +119,12 @@ static ssize_t srm_env_proc_write(struct
 	return res;
 }
 
-static const struct file_operations srm_env_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= srm_env_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= srm_env_proc_write,
+static const struct proc_ops srm_env_proc_ops = {
+	.proc_open	= srm_env_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= srm_env_proc_write,
 };
 
 static int __init
@@ -182,7 +181,7 @@ srm_env_init(void)
 	entry = srm_named_entries;
 	while (entry->name && entry->id) {
 		if (!proc_create_data(entry->name, 0644, named_dir,
-			     &srm_env_proc_fops, (void *)entry->id))
+			     &srm_env_proc_ops, (void *)entry->id))
 			goto cleanup;
 		entry++;
 	}
@@ -194,7 +193,7 @@ srm_env_init(void)
 		char name[4];
 		sprintf(name, "%ld", var_num);
 		if (!proc_create_data(name, 0644, numbered_dir,
-			     &srm_env_proc_fops, (void *)var_num))
+			     &srm_env_proc_ops, (void *)var_num))
 			goto cleanup;
 	}
 
--- a/arch/arm/kernel/atags_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/arm/kernel/atags_proc.c
@@ -17,9 +17,9 @@ static ssize_t atags_read(struct file *f
 	return simple_read_from_buffer(buf, count, ppos, b->data, b->size);
 }
 
-static const struct file_operations atags_fops = {
-	.read = atags_read,
-	.llseek = default_llseek,
+static const struct proc_ops atags_proc_ops = {
+	.proc_read	= atags_read,
+	.proc_lseek	= default_llseek,
 };
 
 #define BOOT_PARAMS_SIZE 1536
@@ -61,7 +61,7 @@ static int __init init_atags_procfs(void
 	b->size = size;
 	memcpy(b->data, atags_copy, size);
 
-	tags_entry = proc_create_data("atags", 0400, NULL, &atags_fops, b);
+	tags_entry = proc_create_data("atags", 0400, NULL, &atags_proc_ops, b);
 	if (!tags_entry)
 		goto nomem;
 
--- a/arch/arm/mm/alignment.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/arm/mm/alignment.c
@@ -162,12 +162,12 @@ static ssize_t alignment_proc_write(stru
 	return count;
 }
 
-static const struct file_operations alignment_proc_fops = {
-	.open		= alignment_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= alignment_proc_write,
+static const struct proc_ops alignment_proc_ops = {
+	.proc_open	= alignment_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= alignment_proc_write,
 };
 #endif /* CONFIG_PROC_FS */
 
@@ -1016,7 +1016,7 @@ static int __init alignment_init(void)
 	struct proc_dir_entry *res;
 
 	res = proc_create("cpu/alignment", S_IWUSR | S_IRUGO, NULL,
-			  &alignment_proc_fops);
+			  &alignment_proc_ops);
 	if (!res)
 		return -ENOMEM;
 #endif
--- a/arch/ia64/kernel/salinfo.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/ia64/kernel/salinfo.c
@@ -331,10 +331,10 @@ retry:
 	return size;
 }
 
-static const struct file_operations salinfo_event_fops = {
-	.open  = salinfo_event_open,
-	.read  = salinfo_event_read,
-	.llseek = noop_llseek,
+static const struct proc_ops salinfo_event_proc_ops = {
+	.proc_open	= salinfo_event_open,
+	.proc_read	= salinfo_event_read,
+	.proc_lseek	= noop_llseek,
 };
 
 static int
@@ -534,12 +534,12 @@ salinfo_log_write(struct file *file, con
 	return count;
 }
 
-static const struct file_operations salinfo_data_fops = {
-	.open    = salinfo_log_open,
-	.release = salinfo_log_release,
-	.read    = salinfo_log_read,
-	.write   = salinfo_log_write,
-	.llseek  = default_llseek,
+static const struct proc_ops salinfo_data_proc_ops = {
+	.proc_open	= salinfo_log_open,
+	.proc_release	= salinfo_log_release,
+	.proc_read	= salinfo_log_read,
+	.proc_write	= salinfo_log_write,
+	.proc_lseek	= default_llseek,
 };
 
 static int salinfo_cpu_online(unsigned int cpu)
@@ -617,13 +617,13 @@ salinfo_init(void)
 			continue;
 
 		entry = proc_create_data("event", S_IRUSR, dir,
-					 &salinfo_event_fops, data);
+					 &salinfo_event_proc_ops, data);
 		if (!entry)
 			continue;
 		*sdir++ = entry;
 
 		entry = proc_create_data("data", S_IRUSR | S_IWUSR, dir,
-					 &salinfo_data_fops, data);
+					 &salinfo_data_proc_ops, data);
 		if (!entry)
 			continue;
 		*sdir++ = entry;
--- a/arch/m68k/kernel/bootinfo_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/m68k/kernel/bootinfo_proc.c
@@ -26,9 +26,9 @@ static ssize_t bootinfo_read(struct file
 				       bootinfo_size);
 }
 
-static const struct file_operations bootinfo_fops = {
-	.read = bootinfo_read,
-	.llseek = default_llseek,
+static const struct proc_ops bootinfo_proc_ops = {
+	.proc_read	= bootinfo_read,
+	.proc_lseek	= default_llseek,
 };
 
 void __init save_bootinfo(const struct bi_record *bi)
@@ -67,7 +67,7 @@ static int __init init_bootinfo_procfs(v
 	if (!bootinfo_copy)
 		return -ENOMEM;
 
-	pde = proc_create_data("bootinfo", 0400, NULL, &bootinfo_fops, NULL);
+	pde = proc_create_data("bootinfo", 0400, NULL, &bootinfo_proc_ops, NULL);
 	if (!pde) {
 		kfree(bootinfo_copy);
 		return -ENOMEM;
--- a/arch/mips/lasat/picvue_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/mips/lasat/picvue_proc.c
@@ -89,13 +89,12 @@ static ssize_t pvc_line_proc_write(struc
 	return count;
 }
 
-static const struct file_operations pvc_line_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= pvc_line_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= pvc_line_proc_write,
+static const struct proc_ops pvc_line_proc_ops = {
+	.proc_open	= pvc_line_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= pvc_line_proc_write,
 };
 
 static ssize_t pvc_scroll_proc_write(struct file *file, const char __user *buf,
@@ -148,13 +147,12 @@ static int pvc_scroll_proc_open(struct i
 	return single_open(file, pvc_scroll_proc_show, NULL);
 }
 
-static const struct file_operations pvc_scroll_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= pvc_scroll_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= pvc_scroll_proc_write,
+static const struct proc_ops pvc_scroll_proc_ops = {
+	.proc_open	= pvc_scroll_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= pvc_scroll_proc_write,
 };
 
 void pvc_proc_timerfunc(struct timer_list *unused)
@@ -189,12 +187,11 @@ static int __init pvc_proc_init(void)
 	}
 	for (i = 0; i < PVC_NLINES; i++) {
 		proc_entry = proc_create_data(pvc_linename[i], 0644, dir,
-					&pvc_line_proc_fops, &pvc_linedata[i]);
+					&pvc_line_proc_ops, &pvc_linedata[i]);
 		if (proc_entry == NULL)
 			goto error;
 	}
-	proc_entry = proc_create("scroll", 0644, dir,
-				 &pvc_scroll_proc_fops);
+	proc_entry = proc_create("scroll", 0644, dir, &pvc_scroll_proc_ops);
 	if (proc_entry == NULL)
 		goto error;
 
--- a/arch/powerpc/kernel/proc_powerpc.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/kernel/proc_powerpc.c
@@ -39,10 +39,10 @@ static int page_map_mmap( struct file *f
 	return 0;
 }
 
-static const struct file_operations page_map_fops = {
-	.llseek	= page_map_seek,
-	.read	= page_map_read,
-	.mmap	= page_map_mmap
+static const struct proc_ops page_map_proc_ops = {
+	.proc_lseek	= page_map_seek,
+	.proc_read	= page_map_read,
+	.proc_mmap	= page_map_mmap,
 };
 
 
@@ -51,7 +51,7 @@ static int __init proc_ppc64_init(void)
 	struct proc_dir_entry *pde;
 
 	pde = proc_create_data("powerpc/systemcfg", S_IFREG | 0444, NULL,
-			       &page_map_fops, vdso_data);
+			       &page_map_proc_ops, vdso_data);
 	if (!pde)
 		return 1;
 	proc_set_size(pde, PAGE_SIZE);
--- a/arch/powerpc/kernel/rtasd.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/kernel/rtasd.c
@@ -385,12 +385,12 @@ static __poll_t rtas_log_poll(struct fil
 	return 0;
 }
 
-static const struct file_operations proc_rtas_log_operations = {
-	.read =		rtas_log_read,
-	.poll =		rtas_log_poll,
-	.open =		rtas_log_open,
-	.release =	rtas_log_release,
-	.llseek =	noop_llseek,
+static const struct proc_ops rtas_log_proc_ops = {
+	.proc_read	= rtas_log_read,
+	.proc_poll	= rtas_log_poll,
+	.proc_open	= rtas_log_open,
+	.proc_release	= rtas_log_release,
+	.proc_lseek	= noop_llseek,
 };
 
 static int enable_surveillance(int timeout)
@@ -572,7 +572,7 @@ static int __init rtas_init(void)
 		return -ENODEV;
 
 	entry = proc_create("powerpc/rtas/error_log", 0400, NULL,
-			    &proc_rtas_log_operations);
+			    &rtas_log_proc_ops);
 	if (!entry)
 		printk(KERN_ERR "Failed to create error_log proc entry\n");
 
--- a/arch/powerpc/kernel/rtas_flash.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/kernel/rtas_flash.c
@@ -655,7 +655,7 @@ struct rtas_flash_file {
 	const char *filename;
 	const char *rtas_call_name;
 	int *status;
-	const struct file_operations fops;
+	const struct proc_ops ops;
 };
 
 static const struct rtas_flash_file rtas_flash_files[] = {
@@ -663,36 +663,36 @@ static const struct rtas_flash_file rtas
 		.filename	= "powerpc/rtas/" FIRMWARE_FLASH_NAME,
 		.rtas_call_name	= "ibm,update-flash-64-and-reboot",
 		.status		= &rtas_update_flash_data.status,
-		.fops.read	= rtas_flash_read_msg,
-		.fops.write	= rtas_flash_write,
-		.fops.release	= rtas_flash_release,
-		.fops.llseek	= default_llseek,
+		.ops.proc_read	= rtas_flash_read_msg,
+		.ops.proc_write	= rtas_flash_write,
+		.ops.proc_release = rtas_flash_release,
+		.ops.proc_lseek	= default_llseek,
 	},
 	{
 		.filename	= "powerpc/rtas/" FIRMWARE_UPDATE_NAME,
 		.rtas_call_name	= "ibm,update-flash-64-and-reboot",
 		.status		= &rtas_update_flash_data.status,
-		.fops.read	= rtas_flash_read_num,
-		.fops.write	= rtas_flash_write,
-		.fops.release	= rtas_flash_release,
-		.fops.llseek	= default_llseek,
+		.ops.proc_read	= rtas_flash_read_num,
+		.ops.proc_write	= rtas_flash_write,
+		.ops.proc_release = rtas_flash_release,
+		.ops.proc_lseek	= default_llseek,
 	},
 	{
 		.filename	= "powerpc/rtas/" VALIDATE_FLASH_NAME,
 		.rtas_call_name	= "ibm,validate-flash-image",
 		.status		= &rtas_validate_flash_data.status,
-		.fops.read	= validate_flash_read,
-		.fops.write	= validate_flash_write,
-		.fops.release	= validate_flash_release,
-		.fops.llseek	= default_llseek,
+		.ops.proc_read	= validate_flash_read,
+		.ops.proc_write	= validate_flash_write,
+		.ops.proc_release = validate_flash_release,
+		.ops.proc_lseek	= default_llseek,
 	},
 	{
 		.filename	= "powerpc/rtas/" MANAGE_FLASH_NAME,
 		.rtas_call_name	= "ibm,manage-flash-image",
 		.status		= &rtas_manage_flash_data.status,
-		.fops.read	= manage_flash_read,
-		.fops.write	= manage_flash_write,
-		.fops.llseek	= default_llseek,
+		.ops.proc_read	= manage_flash_read,
+		.ops.proc_write	= manage_flash_write,
+		.ops.proc_lseek	= default_llseek,
 	}
 };
 
@@ -723,7 +723,7 @@ static int __init rtas_flash_init(void)
 		const struct rtas_flash_file *f = &rtas_flash_files[i];
 		int token;
 
-		if (!proc_create(f->filename, 0600, NULL, &f->fops))
+		if (!proc_create(f->filename, 0600, NULL, &f->ops))
 			goto enomem;
 
 		/*
--- a/arch/powerpc/kernel/rtas-proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/kernel/rtas-proc.c
@@ -159,12 +159,12 @@ static int poweron_open(struct inode *in
 	return single_open(file, ppc_rtas_poweron_show, NULL);
 }
 
-static const struct file_operations ppc_rtas_poweron_operations = {
-	.open		= poweron_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.write		= ppc_rtas_poweron_write,
-	.release	= single_release,
+static const struct proc_ops ppc_rtas_poweron_proc_ops = {
+	.proc_open	= poweron_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= ppc_rtas_poweron_write,
+	.proc_release	= single_release,
 };
 
 static int progress_open(struct inode *inode, struct file *file)
@@ -172,12 +172,12 @@ static int progress_open(struct inode *i
 	return single_open(file, ppc_rtas_progress_show, NULL);
 }
 
-static const struct file_operations ppc_rtas_progress_operations = {
-	.open		= progress_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.write		= ppc_rtas_progress_write,
-	.release	= single_release,
+static const struct proc_ops ppc_rtas_progress_proc_ops = {
+	.proc_open	= progress_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= ppc_rtas_progress_write,
+	.proc_release	= single_release,
 };
 
 static int clock_open(struct inode *inode, struct file *file)
@@ -185,12 +185,12 @@ static int clock_open(struct inode *inod
 	return single_open(file, ppc_rtas_clock_show, NULL);
 }
 
-static const struct file_operations ppc_rtas_clock_operations = {
-	.open		= clock_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.write		= ppc_rtas_clock_write,
-	.release	= single_release,
+static const struct proc_ops ppc_rtas_clock_proc_ops = {
+	.proc_open	= clock_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= ppc_rtas_clock_write,
+	.proc_release	= single_release,
 };
 
 static int tone_freq_open(struct inode *inode, struct file *file)
@@ -198,12 +198,12 @@ static int tone_freq_open(struct inode *
 	return single_open(file, ppc_rtas_tone_freq_show, NULL);
 }
 
-static const struct file_operations ppc_rtas_tone_freq_operations = {
-	.open		= tone_freq_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.write		= ppc_rtas_tone_freq_write,
-	.release	= single_release,
+static const struct proc_ops ppc_rtas_tone_freq_proc_ops = {
+	.proc_open	= tone_freq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= ppc_rtas_tone_freq_write,
+	.proc_release	= single_release,
 };
 
 static int tone_volume_open(struct inode *inode, struct file *file)
@@ -211,12 +211,12 @@ static int tone_volume_open(struct inode
 	return single_open(file, ppc_rtas_tone_volume_show, NULL);
 }
 
-static const struct file_operations ppc_rtas_tone_volume_operations = {
-	.open		= tone_volume_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.write		= ppc_rtas_tone_volume_write,
-	.release	= single_release,
+static const struct proc_ops ppc_rtas_tone_volume_proc_ops = {
+	.proc_open	= tone_volume_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= ppc_rtas_tone_volume_write,
+	.proc_release	= single_release,
 };
 
 static int ppc_rtas_find_all_sensors(void);
@@ -238,17 +238,17 @@ static int __init proc_rtas_init(void)
 		return -ENODEV;
 
 	proc_create("powerpc/rtas/progress", 0644, NULL,
-		    &ppc_rtas_progress_operations);
+		    &ppc_rtas_progress_proc_ops);
 	proc_create("powerpc/rtas/clock", 0644, NULL,
-		    &ppc_rtas_clock_operations);
+		    &ppc_rtas_clock_proc_ops);
 	proc_create("powerpc/rtas/poweron", 0644, NULL,
-		    &ppc_rtas_poweron_operations);
+		    &ppc_rtas_poweron_proc_ops);
 	proc_create_single("powerpc/rtas/sensors", 0444, NULL,
 			ppc_rtas_sensors_show);
 	proc_create("powerpc/rtas/frequency", 0644, NULL,
-		    &ppc_rtas_tone_freq_operations);
+		    &ppc_rtas_tone_freq_proc_ops);
 	proc_create("powerpc/rtas/volume", 0644, NULL,
-		    &ppc_rtas_tone_volume_operations);
+		    &ppc_rtas_tone_volume_proc_ops);
 	proc_create_single("powerpc/rtas/rmo_buffer", 0400, NULL,
 			ppc_rtas_rmo_buf_show);
 	return 0;
--- a/arch/powerpc/mm/numa.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/mm/numa.c
@@ -1616,11 +1616,11 @@ static ssize_t topology_write(struct fil
 	return count;
 }
 
-static const struct file_operations topology_ops = {
-	.read = seq_read,
-	.write = topology_write,
-	.open = topology_open,
-	.release = single_release
+static const struct proc_ops topology_proc_ops = {
+	.proc_read	= seq_read,
+	.proc_write	= topology_write,
+	.proc_open	= topology_open,
+	.proc_release	= single_release,
 };
 
 static int topology_update_init(void)
@@ -1630,7 +1630,7 @@ static int topology_update_init(void)
 	if (vphn_enabled)
 		topology_schedule_update();
 
-	if (!proc_create("powerpc/topology_updates", 0644, NULL, &topology_ops))
+	if (!proc_create("powerpc/topology_updates", 0644, NULL, &topology_proc_ops))
 		return -ENOMEM;
 
 	topology_inited = 1;
--- a/arch/powerpc/platforms/pseries/lpar.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/platforms/pseries/lpar.c
@@ -582,12 +582,12 @@ static int vcpudispatch_stats_open(struc
 	return single_open(file, vcpudispatch_stats_display, NULL);
 }
 
-static const struct file_operations vcpudispatch_stats_proc_ops = {
-	.open		= vcpudispatch_stats_open,
-	.read		= seq_read,
-	.write		= vcpudispatch_stats_write,
-	.llseek		= seq_lseek,
-	.release	= single_release,
+static const struct proc_ops vcpudispatch_stats_proc_ops = {
+	.proc_open	= vcpudispatch_stats_open,
+	.proc_read	= seq_read,
+	.proc_write	= vcpudispatch_stats_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 static ssize_t vcpudispatch_stats_freq_write(struct file *file,
@@ -626,12 +626,12 @@ static int vcpudispatch_stats_freq_open(
 	return single_open(file, vcpudispatch_stats_freq_display, NULL);
 }
 
-static const struct file_operations vcpudispatch_stats_freq_proc_ops = {
-	.open		= vcpudispatch_stats_freq_open,
-	.read		= seq_read,
-	.write		= vcpudispatch_stats_freq_write,
-	.llseek		= seq_lseek,
-	.release	= single_release,
+static const struct proc_ops vcpudispatch_stats_freq_proc_ops = {
+	.proc_open	= vcpudispatch_stats_freq_open,
+	.proc_read	= seq_read,
+	.proc_write	= vcpudispatch_stats_freq_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 static int __init vcpudispatch_stats_procfs_init(void)
--- a/arch/powerpc/platforms/pseries/lparcfg.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/platforms/pseries/lparcfg.c
@@ -698,12 +698,12 @@ static int lparcfg_open(struct inode *in
 	return single_open(file, lparcfg_data, NULL);
 }
 
-static const struct file_operations lparcfg_fops = {
-	.read		= seq_read,
-	.write		= lparcfg_write,
-	.open		= lparcfg_open,
-	.release	= single_release,
-	.llseek		= seq_lseek,
+static const struct proc_ops lparcfg_proc_ops = {
+	.proc_read	= seq_read,
+	.proc_write	= lparcfg_write,
+	.proc_open	= lparcfg_open,
+	.proc_release	= single_release,
+	.proc_lseek	= seq_lseek,
 };
 
 static int __init lparcfg_init(void)
@@ -714,7 +714,7 @@ static int __init lparcfg_init(void)
 	if (firmware_has_feature(FW_FEATURE_SPLPAR))
 		mode |= 0200;
 
-	if (!proc_create("powerpc/lparcfg", mode, NULL, &lparcfg_fops)) {
+	if (!proc_create("powerpc/lparcfg", mode, NULL, &lparcfg_proc_ops)) {
 		printk(KERN_ERR "Failed to create powerpc/lparcfg\n");
 		return -EIO;
 	}
--- a/arch/powerpc/platforms/pseries/reconfig.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/platforms/pseries/reconfig.c
@@ -391,9 +391,9 @@ out:
 	return rv ? rv : count;
 }
 
-static const struct file_operations ofdt_fops = {
-	.write = ofdt_write,
-	.llseek = noop_llseek,
+static const struct proc_ops ofdt_proc_ops = {
+	.proc_write	= ofdt_write,
+	.proc_lseek	= noop_llseek,
 };
 
 /* create /proc/powerpc/ofdt write-only by root */
@@ -401,7 +401,7 @@ static int proc_ppc64_create_ofdt(void)
 {
 	struct proc_dir_entry *ent;
 
-	ent = proc_create("powerpc/ofdt", 0200, NULL, &ofdt_fops);
+	ent = proc_create("powerpc/ofdt", 0200, NULL, &ofdt_proc_ops);
 	if (ent)
 		proc_set_size(ent, 0);
 
--- a/arch/powerpc/platforms/pseries/scanlog.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/powerpc/platforms/pseries/scanlog.c
@@ -152,13 +152,12 @@ static int scanlog_release(struct inode
 	return 0;
 }
 
-static const struct file_operations scanlog_fops = {
-	.owner		= THIS_MODULE,
-	.read		= scanlog_read,
-	.write		= scanlog_write,
-	.open		= scanlog_open,
-	.release	= scanlog_release,
-	.llseek		= noop_llseek,
+static const struct proc_ops scanlog_proc_ops = {
+	.proc_read	= scanlog_read,
+	.proc_write	= scanlog_write,
+	.proc_open	= scanlog_open,
+	.proc_release	= scanlog_release,
+	.proc_lseek	= noop_llseek,
 };
 
 static int __init scanlog_init(void)
@@ -176,7 +175,7 @@ static int __init scanlog_init(void)
 		goto err;
 
 	ent = proc_create("powerpc/rtas/scan-log-dump", 0400, NULL,
-			  &scanlog_fops);
+			  &scanlog_proc_ops);
 	if (!ent)
 		goto err;
 	return 0;
--- a/arch/sh/mm/alignment.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/sh/mm/alignment.c
@@ -152,13 +152,12 @@ static ssize_t alignment_proc_write(stru
 	return count;
 }
 
-static const struct file_operations alignment_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= alignment_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= alignment_proc_write,
+static const struct proc_ops alignment_proc_ops = {
+	.proc_open	= alignment_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= alignment_proc_write,
 };
 
 /*
@@ -176,12 +175,12 @@ static int __init alignment_init(void)
 		return -ENOMEM;
 
 	res = proc_create_data("alignment", S_IWUSR | S_IRUGO, dir,
-			       &alignment_proc_fops, &se_usermode);
+			       &alignment_proc_ops, &se_usermode);
 	if (!res)
 		return -ENOMEM;
 
         res = proc_create_data("kernel_alignment", S_IWUSR | S_IRUGO, dir,
-			       &alignment_proc_fops, &se_kernmode_warn);
+			       &alignment_proc_ops, &se_kernmode_warn);
         if (!res)
                 return -ENOMEM;
 
--- a/arch/sparc/kernel/led.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/sparc/kernel/led.c
@@ -104,13 +104,12 @@ static ssize_t led_proc_write(struct fil
 	return count;
 }
 
-static const struct file_operations led_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= led_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= led_proc_write,
+static const struct proc_ops led_proc_ops = {
+	.proc_open	= led_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= led_proc_write,
 };
 
 static struct proc_dir_entry *led;
@@ -121,7 +120,7 @@ static int __init led_init(void)
 {
 	timer_setup(&led_blink_timer, led_blink, 0);
 
-	led = proc_create("led", 0, NULL, &led_proc_fops);
+	led = proc_create("led", 0, NULL, &led_proc_ops);
 	if (!led)
 		return -ENOMEM;
 
--- a/arch/um/drivers/mconsole_kern.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/um/drivers/mconsole_kern.c
@@ -752,10 +752,9 @@ static ssize_t mconsole_proc_write(struc
 	return count;
 }
 
-static const struct file_operations mconsole_proc_fops = {
-	.owner		= THIS_MODULE,
-	.write		= mconsole_proc_write,
-	.llseek		= noop_llseek,
+static const struct proc_ops mconsole_proc_ops = {
+	.proc_write	= mconsole_proc_write,
+	.proc_lseek	= noop_llseek,
 };
 
 static int create_proc_mconsole(void)
@@ -765,7 +764,7 @@ static int create_proc_mconsole(void)
 	if (notify_socket == NULL)
 		return 0;
 
-	ent = proc_create("mconsole", 0200, NULL, &mconsole_proc_fops);
+	ent = proc_create("mconsole", 0200, NULL, &mconsole_proc_ops);
 	if (ent == NULL) {
 		printk(KERN_INFO "create_proc_mconsole : proc_create failed\n");
 		return 0;
--- a/arch/um/kernel/exitcode.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/um/kernel/exitcode.c
@@ -55,20 +55,19 @@ static ssize_t exitcode_proc_write(struc
 	return count;
 }
 
-static const struct file_operations exitcode_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= exitcode_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= exitcode_proc_write,
+static const struct proc_ops exitcode_proc_ops = {
+	.proc_open	= exitcode_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= exitcode_proc_write,
 };
 
 static int make_proc_exitcode(void)
 {
 	struct proc_dir_entry *ent;
 
-	ent = proc_create("exitcode", 0600, NULL, &exitcode_proc_fops);
+	ent = proc_create("exitcode", 0600, NULL, &exitcode_proc_ops);
 	if (ent == NULL) {
 		printk(KERN_WARNING "make_proc_exitcode : Failed to register "
 		       "/proc/exitcode\n");
--- a/arch/um/kernel/process.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/um/kernel/process.c
@@ -348,13 +348,12 @@ static ssize_t sysemu_proc_write(struct
 	return count;
 }
 
-static const struct file_operations sysemu_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= sysemu_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= sysemu_proc_write,
+static const struct proc_ops sysemu_proc_ops = {
+	.proc_open	= sysemu_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= sysemu_proc_write,
 };
 
 int __init make_proc_sysemu(void)
@@ -363,7 +362,7 @@ int __init make_proc_sysemu(void)
 	if (!sysemu_supported)
 		return 0;
 
-	ent = proc_create("sysemu", 0600, NULL, &sysemu_proc_fops);
+	ent = proc_create("sysemu", 0600, NULL, &sysemu_proc_ops);
 
 	if (ent == NULL)
 	{
--- a/arch/x86/kernel/cpu/mtrr/if.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/x86/kernel/cpu/mtrr/if.c
@@ -396,15 +396,16 @@ static int mtrr_open(struct inode *inode
 	return single_open(file, mtrr_seq_show, NULL);
 }
 
-static const struct file_operations mtrr_fops = {
-	.owner			= THIS_MODULE,
-	.open			= mtrr_open,
-	.read			= seq_read,
-	.llseek			= seq_lseek,
-	.write			= mtrr_write,
-	.unlocked_ioctl		= mtrr_ioctl,
-	.compat_ioctl		= mtrr_ioctl,
-	.release		= mtrr_close,
+static const struct proc_ops mtrr_proc_ops = {
+	.proc_open		= mtrr_open,
+	.proc_read		= seq_read,
+	.proc_lseek		= seq_lseek,
+	.proc_write		= mtrr_write,
+	.proc_ioctl		= mtrr_ioctl,
+#ifdef CONFIG_COMPAT
+	.proc_compat_ioctl	= mtrr_ioctl,
+#endif
+	.proc_release		= mtrr_close,
 };
 
 static int __init mtrr_if_init(void)
@@ -417,7 +418,7 @@ static int __init mtrr_if_init(void)
 	    (!cpu_has(c, X86_FEATURE_CENTAUR_MCR)))
 		return -ENODEV;
 
-	proc_create("mtrr", S_IWUSR | S_IRUGO, NULL, &mtrr_fops);
+	proc_create("mtrr", S_IWUSR | S_IRUGO, NULL, &mtrr_proc_ops);
 	return 0;
 }
 arch_initcall(mtrr_if_init);
--- a/arch/x86/platform/uv/tlb_uv.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/x86/platform/uv/tlb_uv.c
@@ -1668,12 +1668,12 @@ static int tunables_open(struct inode *i
 	return 0;
 }
 
-static const struct file_operations proc_uv_ptc_operations = {
-	.open		= ptc_proc_open,
-	.read		= seq_read,
-	.write		= ptc_proc_write,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops uv_ptc_proc_ops = {
+	.proc_open	= ptc_proc_open,
+	.proc_read	= seq_read,
+	.proc_write	= ptc_proc_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 static const struct file_operations tunables_fops = {
@@ -1691,7 +1691,7 @@ static int __init uv_ptc_init(void)
 		return 0;
 
 	proc_uv_ptc = proc_create(UV_PTC_BASENAME, 0444, NULL,
-				  &proc_uv_ptc_operations);
+				  &uv_ptc_proc_ops);
 	if (!proc_uv_ptc) {
 		pr_err("unable to create %s proc entry\n",
 		       UV_PTC_BASENAME);
--- a/arch/xtensa/platforms/iss/simdisk.c~proc-convert-everything-to-struct-proc_ops
+++ a/arch/xtensa/platforms/iss/simdisk.c
@@ -251,10 +251,10 @@ out_free:
 	return err;
 }
 
-static const struct file_operations fops = {
-	.read = proc_read_simdisk,
-	.write = proc_write_simdisk,
-	.llseek = default_llseek,
+static const struct proc_ops simdisk_proc_ops = {
+	.proc_read	= proc_read_simdisk,
+	.proc_write	= proc_write_simdisk,
+	.proc_lseek	= default_llseek,
 };
 
 static int __init simdisk_setup(struct simdisk *dev, int which,
@@ -290,7 +290,7 @@ static int __init simdisk_setup(struct s
 	set_capacity(dev->gd, 0);
 	add_disk(dev->gd);
 
-	dev->procfile = proc_create_data(tmp, 0644, procdir, &fops, dev);
+	dev->procfile = proc_create_data(tmp, 0644, procdir, &simdisk_proc_ops, dev);
 	return 0;
 
 out_alloc_disk:
--- a/drivers/acpi/battery.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/acpi/battery.c
@@ -1202,13 +1202,12 @@ static int acpi_battery_alarm_proc_open(
 	return single_open(file, acpi_battery_alarm_proc_show, PDE_DATA(inode));
 }
 
-static const struct file_operations acpi_battery_alarm_fops = {
-	.owner		= THIS_MODULE,
-	.open		= acpi_battery_alarm_proc_open,
-	.read		= seq_read,
-	.write		= acpi_battery_write_alarm,
-	.llseek		= seq_lseek,
-	.release	= single_release,
+static const struct proc_ops acpi_battery_alarm_proc_ops = {
+	.proc_open	= acpi_battery_alarm_proc_open,
+	.proc_read	= seq_read,
+	.proc_write	= acpi_battery_write_alarm,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 static int acpi_battery_add_fs(struct acpi_device *device)
@@ -1228,7 +1227,7 @@ static int acpi_battery_add_fs(struct ac
 			acpi_battery_state_proc_show, acpi_driver_data(device)))
 		return -ENODEV;
 	if (!proc_create_data("alarm", S_IFREG | S_IRUGO | S_IWUSR,
-			acpi_device_dir(device), &acpi_battery_alarm_fops,
+			acpi_device_dir(device), &acpi_battery_alarm_proc_ops,
 			acpi_driver_data(device)))
 		return -ENODEV;
 	return 0;
--- a/drivers/acpi/proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/acpi/proc.c
@@ -136,18 +136,17 @@ acpi_system_wakeup_device_open_fs(struct
 			   PDE_DATA(inode));
 }
 
-static const struct file_operations acpi_system_wakeup_device_fops = {
-	.owner = THIS_MODULE,
-	.open = acpi_system_wakeup_device_open_fs,
-	.read = seq_read,
-	.write = acpi_system_write_wakeup_device,
-	.llseek = seq_lseek,
-	.release = single_release,
+static const struct proc_ops acpi_system_wakeup_device_proc_ops = {
+	.proc_open	= acpi_system_wakeup_device_open_fs,
+	.proc_read	= seq_read,
+	.proc_write	= acpi_system_write_wakeup_device,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 void __init acpi_sleep_proc_init(void)
 {
 	/* 'wakeup device' [R/W] */
 	proc_create("wakeup", S_IFREG | S_IRUGO | S_IWUSR,
-		    acpi_root_dir, &acpi_system_wakeup_device_fops);
+		    acpi_root_dir, &acpi_system_wakeup_device_proc_ops);
 }
--- a/drivers/hwmon/dell-smm-hwmon.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/hwmon/dell-smm-hwmon.c
@@ -595,19 +595,18 @@ static int i8k_open_fs(struct inode *ino
 	return single_open(file, i8k_proc_show, NULL);
 }
 
-static const struct file_operations i8k_fops = {
-	.owner		= THIS_MODULE,
-	.open		= i8k_open_fs,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.unlocked_ioctl	= i8k_ioctl,
+static const struct proc_ops i8k_proc_ops = {
+	.proc_open	= i8k_open_fs,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_ioctl	= i8k_ioctl,
 };
 
 static void __init i8k_init_procfs(void)
 {
 	/* Register the proc entry */
-	proc_create("i8k", 0, NULL, &i8k_fops);
+	proc_create("i8k", 0, NULL, &i8k_proc_ops);
 }
 
 static void __exit i8k_exit_procfs(void)
--- a/drivers/ide/ide-proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/ide/ide-proc.c
@@ -381,13 +381,12 @@ parse_error:
 	return -EINVAL;
 }
 
-static const struct file_operations ide_settings_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= ide_settings_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= ide_settings_proc_write,
+static const struct proc_ops ide_settings_proc_ops = {
+	.proc_open	= ide_settings_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= ide_settings_proc_write,
 };
 
 int ide_capacity_proc_show(struct seq_file *m, void *v)
@@ -546,7 +545,7 @@ void ide_proc_port_register_devices(ide_
 		if (drive->proc) {
 			ide_add_proc_entries(drive->proc, generic_drive_entries, drive);
 			proc_create_data("settings", S_IFREG|S_IRUSR|S_IWUSR,
-					drive->proc, &ide_settings_proc_fops,
+					drive->proc, &ide_settings_proc_ops,
 					drive);
 		}
 		sprintf(name, "ide%d/%s", (drive->name[2]-'a')/2, drive->name);
@@ -615,7 +614,7 @@ static int ide_drivers_show(struct seq_f
 	return 0;
 }
 
-DEFINE_SHOW_ATTRIBUTE(ide_drivers);
+DEFINE_PROC_SHOW_ATTRIBUTE(ide_drivers);
 
 void proc_ide_create(void)
 {
@@ -624,7 +623,7 @@ void proc_ide_create(void)
 	if (!proc_ide_root)
 		return;
 
-	proc_create("drivers", 0, proc_ide_root, &ide_drivers_fops);
+	proc_create("drivers", 0, proc_ide_root, &ide_drivers_proc_ops);
 }
 
 void proc_ide_destroy(void)
--- a/drivers/input/input.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/input/input.c
@@ -1216,13 +1216,12 @@ static int input_proc_devices_open(struc
 	return seq_open(file, &input_devices_seq_ops);
 }
 
-static const struct file_operations input_devices_fileops = {
-	.owner		= THIS_MODULE,
-	.open		= input_proc_devices_open,
-	.poll		= input_proc_devices_poll,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops input_devices_proc_ops = {
+	.proc_open	= input_proc_devices_open,
+	.proc_poll	= input_proc_devices_poll,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 static void *input_handlers_seq_start(struct seq_file *seq, loff_t *pos)
@@ -1280,12 +1279,11 @@ static int input_proc_handlers_open(stru
 	return seq_open(file, &input_handlers_seq_ops);
 }
 
-static const struct file_operations input_handlers_fileops = {
-	.owner		= THIS_MODULE,
-	.open		= input_proc_handlers_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops input_handlers_proc_ops = {
+	.proc_open	= input_proc_handlers_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 static int __init input_proc_init(void)
@@ -1297,12 +1295,12 @@ static int __init input_proc_init(void)
 		return -ENOMEM;
 
 	entry = proc_create("devices", 0, proc_bus_input_dir,
-			    &input_devices_fileops);
+			    &input_devices_proc_ops);
 	if (!entry)
 		goto fail1;
 
 	entry = proc_create("handlers", 0, proc_bus_input_dir,
-			    &input_handlers_fileops);
+			    &input_handlers_proc_ops);
 	if (!entry)
 		goto fail2;
 
--- a/drivers/isdn/capi/kcapi_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/isdn/capi/kcapi_proc.c
@@ -199,8 +199,8 @@ static ssize_t empty_read(struct file *f
 	return 0;
 }
 
-static const struct file_operations empty_fops = {
-	.read	= empty_read,
+static const struct proc_ops empty_proc_ops = {
+	.proc_read	= empty_read,
 };
 
 // ---------------------------------------------------------------------------
@@ -214,7 +214,7 @@ kcapi_proc_init(void)
 	proc_create_seq("capi/contrstats",   0, NULL, &seq_contrstats_ops);
 	proc_create_seq("capi/applications", 0, NULL, &seq_applications_ops);
 	proc_create_seq("capi/applstats",    0, NULL, &seq_applstats_ops);
-	proc_create("capi/driver",           0, NULL, &empty_fops);
+	proc_create("capi/driver",           0, NULL, &empty_proc_ops);
 }
 
 void
--- a/drivers/macintosh/via-pmu.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/macintosh/via-pmu.c
@@ -212,7 +212,7 @@ static int pmu_info_proc_show(struct seq
 static int pmu_irqstats_proc_show(struct seq_file *m, void *v);
 static int pmu_battery_proc_show(struct seq_file *m, void *v);
 static void pmu_pass_intr(unsigned char *data, int len);
-static const struct file_operations pmu_options_proc_fops;
+static const struct proc_ops pmu_options_proc_ops;
 
 #ifdef CONFIG_ADB
 const struct adb_driver via_pmu_driver = {
@@ -573,7 +573,7 @@ static int __init via_pmu_dev_init(void)
 		proc_pmu_irqstats = proc_create_single("interrupts", 0,
 				proc_pmu_root, pmu_irqstats_proc_show);
 		proc_pmu_options = proc_create("options", 0600, proc_pmu_root,
-						&pmu_options_proc_fops);
+						&pmu_options_proc_ops);
 	}
 	return 0;
 }
@@ -974,13 +974,12 @@ static ssize_t pmu_options_proc_write(st
 	return fcount;
 }
 
-static const struct file_operations pmu_options_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= pmu_options_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= pmu_options_proc_write,
+static const struct proc_ops pmu_options_proc_ops = {
+	.proc_open	= pmu_options_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= pmu_options_proc_write,
 };
 
 #ifdef CONFIG_ADB
--- a/drivers/md/md.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/md/md.c
@@ -8279,13 +8279,12 @@ static __poll_t mdstat_poll(struct file
 	return mask;
 }
 
-static const struct file_operations md_seq_fops = {
-	.owner		= THIS_MODULE,
-	.open           = md_seq_open,
-	.read           = seq_read,
-	.llseek         = seq_lseek,
-	.release	= seq_release,
-	.poll		= mdstat_poll,
+static const struct proc_ops mdstat_proc_ops = {
+	.proc_open	= md_seq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
+	.proc_poll	= mdstat_poll,
 };
 
 int register_md_personality(struct md_personality *p)
@@ -9454,7 +9453,7 @@ static void md_geninit(void)
 {
 	pr_debug("md: sizeof(mdp_super_t) = %d\n", (int)sizeof(mdp_super_t));
 
-	proc_create("mdstat", S_IRUGO, NULL, &md_seq_fops);
+	proc_create("mdstat", S_IRUGO, NULL, &mdstat_proc_ops);
 }
 
 static int __init md_init(void)
--- a/drivers/misc/sgi-gru/gruprocfs.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/misc/sgi-gru/gruprocfs.c
@@ -255,28 +255,28 @@ static int options_open(struct inode *in
 }
 
 /* *INDENT-OFF* */
-static const struct file_operations statistics_fops = {
-	.open 		= statistics_open,
-	.read 		= seq_read,
-	.write 		= statistics_write,
-	.llseek 	= seq_lseek,
-	.release 	= single_release,
+static const struct proc_ops statistics_proc_ops = {
+	.proc_open	= statistics_open,
+	.proc_read	= seq_read,
+	.proc_write	= statistics_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
-static const struct file_operations mcs_statistics_fops = {
-	.open 		= mcs_statistics_open,
-	.read 		= seq_read,
-	.write 		= mcs_statistics_write,
-	.llseek 	= seq_lseek,
-	.release 	= single_release,
+static const struct proc_ops mcs_statistics_proc_ops = {
+	.proc_open	= mcs_statistics_open,
+	.proc_read	= seq_read,
+	.proc_write	= mcs_statistics_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
-static const struct file_operations options_fops = {
-	.open 		= options_open,
-	.read 		= seq_read,
-	.write 		= options_write,
-	.llseek 	= seq_lseek,
-	.release 	= single_release,
+static const struct proc_ops options_proc_ops = {
+	.proc_open	= options_open,
+	.proc_read	= seq_read,
+	.proc_write	= options_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 static struct proc_dir_entry *proc_gru __read_mostly;
@@ -286,11 +286,11 @@ int gru_proc_init(void)
 	proc_gru = proc_mkdir("sgi_uv/gru", NULL);
 	if (!proc_gru)
 		return -1;
-	if (!proc_create("statistics", 0644, proc_gru, &statistics_fops))
+	if (!proc_create("statistics", 0644, proc_gru, &statistics_proc_ops))
 		goto err;
-	if (!proc_create("mcs_statistics", 0644, proc_gru, &mcs_statistics_fops))
+	if (!proc_create("mcs_statistics", 0644, proc_gru, &mcs_statistics_proc_ops))
 		goto err;
-	if (!proc_create("debug_options", 0644, proc_gru, &options_fops))
+	if (!proc_create("debug_options", 0644, proc_gru, &options_proc_ops))
 		goto err;
 	if (!proc_create_seq("cch_status", 0444, proc_gru, &cch_seq_ops))
 		goto err;
--- a/drivers/net/wireless/cisco/airo.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/net/wireless/cisco/airo.c
@@ -4420,73 +4420,65 @@ static int proc_BSSList_open( struct ino
 static int proc_config_open( struct inode *inode, struct file *file );
 static int proc_wepkey_open( struct inode *inode, struct file *file );
 
-static const struct file_operations proc_statsdelta_ops = {
-	.owner		= THIS_MODULE,
-	.read		= proc_read,
-	.open		= proc_statsdelta_open,
-	.release	= proc_close,
-	.llseek		= default_llseek,
-};
-
-static const struct file_operations proc_stats_ops = {
-	.owner		= THIS_MODULE,
-	.read		= proc_read,
-	.open		= proc_stats_open,
-	.release	= proc_close,
-	.llseek		= default_llseek,
-};
-
-static const struct file_operations proc_status_ops = {
-	.owner		= THIS_MODULE,
-	.read		= proc_read,
-	.open		= proc_status_open,
-	.release	= proc_close,
-	.llseek		= default_llseek,
-};
-
-static const struct file_operations proc_SSID_ops = {
-	.owner		= THIS_MODULE,
-	.read		= proc_read,
-	.write		= proc_write,
-	.open		= proc_SSID_open,
-	.release	= proc_close,
-	.llseek		= default_llseek,
-};
-
-static const struct file_operations proc_BSSList_ops = {
-	.owner		= THIS_MODULE,
-	.read		= proc_read,
-	.write		= proc_write,
-	.open		= proc_BSSList_open,
-	.release	= proc_close,
-	.llseek		= default_llseek,
-};
-
-static const struct file_operations proc_APList_ops = {
-	.owner		= THIS_MODULE,
-	.read		= proc_read,
-	.write		= proc_write,
-	.open		= proc_APList_open,
-	.release	= proc_close,
-	.llseek		= default_llseek,
-};
-
-static const struct file_operations proc_config_ops = {
-	.owner		= THIS_MODULE,
-	.read		= proc_read,
-	.write		= proc_write,
-	.open		= proc_config_open,
-	.release	= proc_close,
-	.llseek		= default_llseek,
-};
-
-static const struct file_operations proc_wepkey_ops = {
-	.owner		= THIS_MODULE,
-	.read		= proc_read,
-	.write		= proc_write,
-	.open		= proc_wepkey_open,
-	.release	= proc_close,
-	.llseek		= default_llseek,
+static const struct proc_ops proc_statsdelta_ops = {
+	.proc_read	= proc_read,
+	.proc_open	= proc_statsdelta_open,
+	.proc_release	= proc_close,
+	.proc_lseek	= default_llseek,
+};
+
+static const struct proc_ops proc_stats_ops = {
+	.proc_read	= proc_read,
+	.proc_open	= proc_stats_open,
+	.proc_release	= proc_close,
+	.proc_lseek	= default_llseek,
+};
+
+static const struct proc_ops proc_status_ops = {
+	.proc_read	= proc_read,
+	.proc_open	= proc_status_open,
+	.proc_release	= proc_close,
+	.proc_lseek	= default_llseek,
+};
+
+static const struct proc_ops proc_SSID_ops = {
+	.proc_read	= proc_read,
+	.proc_write	= proc_write,
+	.proc_open	= proc_SSID_open,
+	.proc_release	= proc_close,
+	.proc_lseek	= default_llseek,
+};
+
+static const struct proc_ops proc_BSSList_ops = {
+	.proc_read	= proc_read,
+	.proc_write	= proc_write,
+	.proc_open	= proc_BSSList_open,
+	.proc_release	= proc_close,
+	.proc_lseek	= default_llseek,
+};
+
+static const struct proc_ops proc_APList_ops = {
+	.proc_read	= proc_read,
+	.proc_write	= proc_write,
+	.proc_open	= proc_APList_open,
+	.proc_release	= proc_close,
+	.proc_lseek	= default_llseek,
+};
+
+static const struct proc_ops proc_config_ops = {
+	.proc_read	= proc_read,
+	.proc_write	= proc_write,
+	.proc_open	= proc_config_open,
+	.proc_release	= proc_close,
+	.proc_lseek	= default_llseek,
+};
+
+static const struct proc_ops proc_wepkey_ops = {
+	.proc_read	= proc_read,
+	.proc_write	= proc_write,
+	.proc_open	= proc_wepkey_open,
+	.proc_release	= proc_close,
+	.proc_lseek	= default_llseek,
 };
 
 static struct proc_dir_entry *airo_entry;
--- a/drivers/net/wireless/intel/ipw2x00/libipw_module.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/net/wireless/intel/ipw2x00/libipw_module.c
@@ -240,13 +240,12 @@ static ssize_t debug_level_proc_write(st
 	return strnlen(buf, len);
 }
 
-static const struct file_operations debug_level_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= debug_level_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= debug_level_proc_write,
+static const struct proc_ops debug_level_proc_ops = {
+	.proc_open	= debug_level_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= debug_level_proc_write,
 };
 #endif				/* CONFIG_LIBIPW_DEBUG */
 
@@ -263,7 +262,7 @@ static int __init libipw_init(void)
 		return -EIO;
 	}
 	e = proc_create("debug_level", 0644, libipw_proc,
-			&debug_level_proc_fops);
+			&debug_level_proc_ops);
 	if (!e) {
 		remove_proc_entry(DRV_PROCNAME, init_net.proc_net);
 		libipw_proc = NULL;
--- a/drivers/net/wireless/intersil/hostap/hostap_hw.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/net/wireless/intersil/hostap/hostap_hw.c
@@ -126,7 +126,7 @@ static void prism2_check_sta_fw_version(
 
 #ifdef PRISM2_DOWNLOAD_SUPPORT
 /* hostap_download.c */
-static const struct file_operations prism2_download_aux_dump_proc_fops;
+static const struct proc_ops prism2_download_aux_dump_proc_ops;
 static u8 * prism2_read_pda(struct net_device *dev);
 static int prism2_download(local_info_t *local,
 			   struct prism2_download_param *param);
@@ -3094,7 +3094,7 @@ prism2_init_local_data(struct prism2_hel
 	local->func->reset_port = prism2_reset_port;
 	local->func->schedule_reset = prism2_schedule_reset;
 #ifdef PRISM2_DOWNLOAD_SUPPORT
-	local->func->read_aux_fops = &prism2_download_aux_dump_proc_fops;
+	local->func->read_aux_proc_ops = &prism2_download_aux_dump_proc_ops;
 	local->func->download = prism2_download;
 #endif /* PRISM2_DOWNLOAD_SUPPORT */
 	local->func->tx = prism2_tx_80211;
--- a/drivers/net/wireless/intersil/hostap/hostap_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/net/wireless/intersil/hostap/hostap_proc.c
@@ -211,9 +211,9 @@ static ssize_t prism2_pda_proc_read(stru
 	return count;
 }
 
-static const struct file_operations prism2_pda_proc_fops = {
-	.read		= prism2_pda_proc_read,
-	.llseek		= generic_file_llseek,
+static const struct proc_ops prism2_pda_proc_ops = {
+	.proc_read	= prism2_pda_proc_read,
+	.proc_lseek	= generic_file_llseek,
 };
 
 
@@ -223,8 +223,8 @@ static ssize_t prism2_aux_dump_proc_no_r
 	return 0;
 }
 
-static const struct file_operations prism2_aux_dump_proc_fops = {
-	.read		= prism2_aux_dump_proc_no_read,
+static const struct proc_ops prism2_aux_dump_proc_ops = {
+	.proc_read	= prism2_aux_dump_proc_no_read,
 };
 
 
@@ -379,9 +379,9 @@ void hostap_init_proc(local_info_t *loca
 	proc_create_seq_data("wds", 0, local->proc,
 			&prism2_wds_proc_seqops, local);
 	proc_create_data("pda", 0, local->proc,
-			 &prism2_pda_proc_fops, local);
+			 &prism2_pda_proc_ops, local);
 	proc_create_data("aux_dump", 0, local->proc,
-			 local->func->read_aux_fops ?: &prism2_aux_dump_proc_fops,
+			 local->func->read_aux_proc_ops ?: &prism2_aux_dump_proc_ops,
 			 local);
 	proc_create_seq_data("bss_list", 0, local->proc,
 			&prism2_bss_list_proc_seqops, local);
--- a/drivers/net/wireless/intersil/hostap/hostap_wlan.h~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/net/wireless/intersil/hostap/hostap_wlan.h
@@ -599,7 +599,7 @@ struct prism2_helper_functions {
 			struct prism2_download_param *param);
 	int (*tx)(struct sk_buff *skb, struct net_device *dev);
 	int (*set_tim)(struct net_device *dev, int aid, int set);
-	const struct file_operations *read_aux_fops;
+	const struct proc_ops *read_aux_proc_ops;
 
 	int need_tx_headroom; /* number of bytes of headroom needed before
 			       * IEEE 802.11 header */
--- a/drivers/net/wireless/ray_cs.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/net/wireless/ray_cs.c
@@ -2717,10 +2717,9 @@ static ssize_t ray_cs_essid_proc_write(s
 	return count;
 }
 
-static const struct file_operations ray_cs_essid_proc_fops = {
-	.owner		= THIS_MODULE,
-	.write		= ray_cs_essid_proc_write,
-	.llseek		= noop_llseek,
+static const struct proc_ops ray_cs_essid_proc_ops = {
+	.proc_write	= ray_cs_essid_proc_write,
+	.proc_lseek	= noop_llseek,
 };
 
 static ssize_t int_proc_write(struct file *file, const char __user *buffer,
@@ -2751,10 +2750,9 @@ static ssize_t int_proc_write(struct fil
 	return count;
 }
 
-static const struct file_operations int_proc_fops = {
-	.owner		= THIS_MODULE,
-	.write		= int_proc_write,
-	.llseek		= noop_llseek,
+static const struct proc_ops int_proc_ops = {
+	.proc_write	= int_proc_write,
+	.proc_lseek	= noop_llseek,
 };
 #endif
 
@@ -2790,10 +2788,10 @@ static int __init init_ray_cs(void)
 	proc_mkdir("driver/ray_cs", NULL);
 
 	proc_create_single("driver/ray_cs/ray_cs", 0, NULL, ray_cs_proc_show);
-	proc_create("driver/ray_cs/essid", 0200, NULL, &ray_cs_essid_proc_fops);
-	proc_create_data("driver/ray_cs/net_type", 0200, NULL, &int_proc_fops,
+	proc_create("driver/ray_cs/essid", 0200, NULL, &ray_cs_essid_proc_ops);
+	proc_create_data("driver/ray_cs/net_type", 0200, NULL, &int_proc_ops,
 			 &net_type);
-	proc_create_data("driver/ray_cs/translate", 0200, NULL, &int_proc_fops,
+	proc_create_data("driver/ray_cs/translate", 0200, NULL, &int_proc_ops,
 			 &translate);
 #endif
 	if (translate != 0)
--- a/drivers/parisc/led.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/parisc/led.c
@@ -230,13 +230,12 @@ parse_error:
 	return -EINVAL;
 }
 
-static const struct file_operations led_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= led_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= led_proc_write,
+static const struct proc_ops led_proc_ops = {
+	.proc_open	= led_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= led_proc_write,
 };
 
 static int __init led_create_procfs(void)
@@ -252,14 +251,14 @@ static int __init led_create_procfs(void
 	if (!lcd_no_led_support)
 	{
 		ent = proc_create_data("led", S_IRUGO|S_IWUSR, proc_pdc_root,
-					&led_proc_fops, (void *)LED_NOLCD); /* LED */
+					&led_proc_ops, (void *)LED_NOLCD); /* LED */
 		if (!ent) return -1;
 	}
 
 	if (led_type == LED_HASLCD)
 	{
 		ent = proc_create_data("lcd", S_IRUGO|S_IWUSR, proc_pdc_root,
-					&led_proc_fops, (void *)LED_HASLCD); /* LCD */
+					&led_proc_ops, (void *)LED_HASLCD); /* LCD */
 		if (!ent) return -1;
 	}
 
--- a/drivers/pci/proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/pci/proc.c
@@ -306,19 +306,20 @@ static int proc_bus_pci_release(struct i
 }
 #endif /* HAVE_PCI_MMAP */
 
-static const struct file_operations proc_bus_pci_operations = {
-	.owner		= THIS_MODULE,
-	.llseek		= proc_bus_pci_lseek,
-	.read		= proc_bus_pci_read,
-	.write		= proc_bus_pci_write,
-	.unlocked_ioctl	= proc_bus_pci_ioctl,
-	.compat_ioctl	= proc_bus_pci_ioctl,
+static const struct proc_ops proc_bus_pci_ops = {
+	.proc_lseek	= proc_bus_pci_lseek,
+	.proc_read	= proc_bus_pci_read,
+	.proc_write	= proc_bus_pci_write,
+	.proc_ioctl	= proc_bus_pci_ioctl,
+#ifdef CONFIG_COMPAT
+	.proc_compat_ioctl = proc_bus_pci_ioctl,
+#endif
 #ifdef HAVE_PCI_MMAP
-	.open		= proc_bus_pci_open,
-	.release	= proc_bus_pci_release,
-	.mmap		= proc_bus_pci_mmap,
+	.proc_open	= proc_bus_pci_open,
+	.proc_release	= proc_bus_pci_release,
+	.proc_mmap	= proc_bus_pci_mmap,
 #ifdef HAVE_ARCH_PCI_GET_UNMAPPED_AREA
-	.get_unmapped_area = get_pci_unmapped_area,
+	.proc_get_unmapped_area = get_pci_unmapped_area,
 #endif /* HAVE_ARCH_PCI_GET_UNMAPPED_AREA */
 #endif /* HAVE_PCI_MMAP */
 };
@@ -424,7 +425,7 @@ int pci_proc_attach_device(struct pci_de
 
 	sprintf(name, "%02x.%x", PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn));
 	e = proc_create_data(name, S_IFREG | S_IRUGO | S_IWUSR, bus->procdir,
-			     &proc_bus_pci_operations, dev);
+			     &proc_bus_pci_ops, dev);
 	if (!e)
 		return -ENOMEM;
 	proc_set_size(e, dev->cfg_size);
--- a/drivers/platform/x86/thinkpad_acpi.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/platform/x86/thinkpad_acpi.c
@@ -907,13 +907,12 @@ static ssize_t dispatch_proc_write(struc
 	return ret;
 }
 
-static const struct file_operations dispatch_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= dispatch_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= dispatch_proc_write,
+static const struct proc_ops dispatch_proc_ops = {
+	.proc_open	= dispatch_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= dispatch_proc_write,
 };
 
 static char *next_cmd(char **cmds)
@@ -9984,7 +9983,7 @@ static int __init ibm_init(struct ibm_in
 		if (ibm->write)
 			mode |= S_IWUSR;
 		entry = proc_create_data(ibm->name, mode, proc_dir,
-					 &dispatch_proc_fops, ibm);
+					 &dispatch_proc_ops, ibm);
 		if (!entry) {
 			pr_err("unable to create proc entry %s\n", ibm->name);
 			ret = -ENODEV;
--- a/drivers/platform/x86/toshiba_acpi.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/platform/x86/toshiba_acpi.c
@@ -1432,13 +1432,12 @@ static ssize_t lcd_proc_write(struct fil
 	return count;
 }
 
-static const struct file_operations lcd_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= lcd_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= lcd_proc_write,
+static const struct proc_ops lcd_proc_ops = {
+	.proc_open	= lcd_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= lcd_proc_write,
 };
 
 /* Video-Out */
@@ -1539,13 +1538,12 @@ static ssize_t video_proc_write(struct f
 	return ret ? -EIO : count;
 }
 
-static const struct file_operations video_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= video_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= video_proc_write,
+static const struct proc_ops video_proc_ops = {
+	.proc_open	= video_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= video_proc_write,
 };
 
 /* Fan status */
@@ -1617,13 +1615,12 @@ static ssize_t fan_proc_write(struct fil
 	return count;
 }
 
-static const struct file_operations fan_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= fan_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= fan_proc_write,
+static const struct proc_ops fan_proc_ops = {
+	.proc_open	= fan_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= fan_proc_write,
 };
 
 static int keys_proc_show(struct seq_file *m, void *v)
@@ -1662,13 +1659,12 @@ static ssize_t keys_proc_write(struct fi
 	return count;
 }
 
-static const struct file_operations keys_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= keys_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= keys_proc_write,
+static const struct proc_ops keys_proc_ops = {
+	.proc_open	= keys_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= keys_proc_write,
 };
 
 static int __maybe_unused version_proc_show(struct seq_file *m, void *v)
@@ -1688,16 +1684,16 @@ static void create_toshiba_proc_entries(
 {
 	if (dev->backlight_dev)
 		proc_create_data("lcd", S_IRUGO | S_IWUSR, toshiba_proc_dir,
-				 &lcd_proc_fops, dev);
+				 &lcd_proc_ops, dev);
 	if (dev->video_supported)
 		proc_create_data("video", S_IRUGO | S_IWUSR, toshiba_proc_dir,
-				 &video_proc_fops, dev);
+				 &video_proc_ops, dev);
 	if (dev->fan_supported)
 		proc_create_data("fan", S_IRUGO | S_IWUSR, toshiba_proc_dir,
-				 &fan_proc_fops, dev);
+				 &fan_proc_ops, dev);
 	if (dev->hotkey_dev)
 		proc_create_data("keys", S_IRUGO | S_IWUSR, toshiba_proc_dir,
-				 &keys_proc_fops, dev);
+				 &keys_proc_ops, dev);
 	proc_create_single_data("version", S_IRUGO, toshiba_proc_dir,
 			version_proc_show, dev);
 }
--- a/drivers/pnp/isapnp/proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/pnp/isapnp/proc.c
@@ -49,10 +49,9 @@ static ssize_t isapnp_proc_bus_read(stru
 	return nbytes;
 }
 
-static const struct file_operations isapnp_proc_bus_file_operations = {
-	.owner	= THIS_MODULE,
-	.llseek = isapnp_proc_bus_lseek,
-	.read = isapnp_proc_bus_read,
+static const struct proc_ops isapnp_proc_bus_proc_ops = {
+	.proc_lseek	= isapnp_proc_bus_lseek,
+	.proc_read	= isapnp_proc_bus_read,
 };
 
 static int isapnp_proc_attach_device(struct pnp_dev *dev)
@@ -69,7 +68,7 @@ static int isapnp_proc_attach_device(str
 	}
 	sprintf(name, "%02x", dev->number);
 	e = dev->procent = proc_create_data(name, S_IFREG | S_IRUGO, de,
-			&isapnp_proc_bus_file_operations, dev);
+					    &isapnp_proc_bus_proc_ops, dev);
 	if (!e)
 		return -ENOMEM;
 	proc_set_size(e, 256);
--- a/drivers/pnp/pnpbios/proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/pnp/pnpbios/proc.c
@@ -210,13 +210,12 @@ out:
 	return ret;
 }
 
-static const struct file_operations pnpbios_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= pnpbios_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= pnpbios_proc_write,
+static const struct proc_ops pnpbios_proc_ops = {
+	.proc_open	= pnpbios_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= pnpbios_proc_write,
 };
 
 int pnpbios_interface_attach_device(struct pnp_bios_node *node)
@@ -228,13 +227,13 @@ int pnpbios_interface_attach_device(stru
 	if (!proc_pnp)
 		return -EIO;
 	if (!pnpbios_dont_use_current_config) {
-		proc_create_data(name, 0644, proc_pnp, &pnpbios_proc_fops,
+		proc_create_data(name, 0644, proc_pnp, &pnpbios_proc_ops,
 				 (void *)(long)(node->handle));
 	}
 
 	if (!proc_pnp_boot)
 		return -EIO;
-	if (proc_create_data(name, 0644, proc_pnp_boot, &pnpbios_proc_fops,
+	if (proc_create_data(name, 0644, proc_pnp_boot, &pnpbios_proc_ops,
 			     (void *)(long)(node->handle + 0x100)))
 		return 0;
 	return -EIO;
--- a/drivers/s390/block/dasd_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/s390/block/dasd_proc.c
@@ -320,13 +320,12 @@ out_error:
 #endif				/* CONFIG_DASD_PROFILE */
 }
 
-static const struct file_operations dasd_stats_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= dasd_stats_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= dasd_stats_proc_write,
+static const struct proc_ops dasd_stats_proc_ops = {
+	.proc_open	= dasd_stats_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= dasd_stats_proc_write,
 };
 
 /*
@@ -347,7 +346,7 @@ dasd_proc_init(void)
 	dasd_statistics_entry = proc_create("statistics",
 					    S_IFREG | S_IRUGO | S_IWUSR,
 					    dasd_proc_root_entry,
-					    &dasd_stats_proc_fops);
+					    &dasd_stats_proc_ops);
 	if (!dasd_statistics_entry)
 		goto out_nostatistics;
 	return 0;
--- a/drivers/s390/cio/blacklist.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/s390/cio/blacklist.c
@@ -398,12 +398,12 @@ cio_ignore_proc_open(struct inode *inode
 				sizeof(struct ccwdev_iter));
 }
 
-static const struct file_operations cio_ignore_proc_fops = {
-	.open    = cio_ignore_proc_open,
-	.read    = seq_read,
-	.llseek  = seq_lseek,
-	.release = seq_release_private,
-	.write   = cio_ignore_write,
+static const struct proc_ops cio_ignore_proc_ops = {
+	.proc_open	= cio_ignore_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release_private,
+	.proc_write	= cio_ignore_write,
 };
 
 static int
@@ -412,7 +412,7 @@ cio_ignore_proc_init (void)
 	struct proc_dir_entry *entry;
 
 	entry = proc_create("cio_ignore", S_IFREG | S_IRUGO | S_IWUSR, NULL,
-			    &cio_ignore_proc_fops);
+			    &cio_ignore_proc_ops);
 	if (!entry)
 		return -ENOENT;
 	return 0;
--- a/drivers/s390/cio/css.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/s390/cio/css.c
@@ -1372,18 +1372,17 @@ static ssize_t cio_settle_write(struct f
 	return ret ? ret : count;
 }
 
-static const struct file_operations cio_settle_proc_fops = {
-	.open = nonseekable_open,
-	.write = cio_settle_write,
-	.llseek = no_llseek,
+static const struct proc_ops cio_settle_proc_ops = {
+	.proc_open	= nonseekable_open,
+	.proc_write	= cio_settle_write,
+	.proc_lseek	= no_llseek,
 };
 
 static int __init cio_settle_init(void)
 {
 	struct proc_dir_entry *entry;
 
-	entry = proc_create("cio_settle", S_IWUSR, NULL,
-			    &cio_settle_proc_fops);
+	entry = proc_create("cio_settle", S_IWUSR, NULL, &cio_settle_proc_ops);
 	if (!entry)
 		return -ENOMEM;
 	return 0;
--- a/drivers/scsi/esas2r/esas2r_main.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/scsi/esas2r/esas2r_main.c
@@ -617,6 +617,13 @@ static const struct file_operations esas
 	.unlocked_ioctl = esas2r_proc_ioctl,
 };
 
+static const struct proc_ops esas2r_proc_ops = {
+	.proc_ioctl		= esas2r_proc_ioctl,
+#ifdef CONFIG_COMPAT
+	.proc_compat_ioctl	= compat_ptr_ioctl,
+#endif
+};
+
 static struct Scsi_Host *esas2r_proc_host;
 static int esas2r_proc_major;
 
@@ -728,7 +735,7 @@ const char *esas2r_info(struct Scsi_Host
 
 			pde = proc_create(ATTONODE_NAME, 0,
 					  sh->hostt->proc_dir,
-					  &esas2r_proc_fops);
+					  &esas2r_proc_ops);
 
 			if (!pde) {
 				esas2r_log_dev(ESAS2R_LOG_WARN,
--- a/drivers/scsi/scsi_devinfo.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/scsi/scsi_devinfo.c
@@ -736,13 +736,12 @@ out:
 	return err;
 }
 
-static const struct file_operations scsi_devinfo_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= proc_scsi_devinfo_open,
-	.read		= seq_read,
-	.write		= proc_scsi_devinfo_write,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops scsi_devinfo_proc_ops = {
+	.proc_open	= proc_scsi_devinfo_open,
+	.proc_read	= seq_read,
+	.proc_write	= proc_scsi_devinfo_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 #endif /* CONFIG_SCSI_PROC_FS */
 
@@ -867,7 +866,7 @@ int __init scsi_init_devinfo(void)
 	}
 
 #ifdef CONFIG_SCSI_PROC_FS
-	p = proc_create("scsi/device_info", 0, NULL, &scsi_devinfo_proc_fops);
+	p = proc_create("scsi/device_info", 0, NULL, &scsi_devinfo_proc_ops);
 	if (!p) {
 		error = -ENOMEM;
 		goto out;
--- a/drivers/scsi/scsi_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/scsi/scsi_proc.c
@@ -83,12 +83,12 @@ static int proc_scsi_host_open(struct in
 				4 * PAGE_SIZE);
 }
 
-static const struct file_operations proc_scsi_fops = {
-	.open = proc_scsi_host_open,
-	.release = single_release,
-	.read = seq_read,
-	.llseek = seq_lseek,
-	.write = proc_scsi_host_write
+static const struct proc_ops proc_scsi_ops = {
+	.proc_open	= proc_scsi_host_open,
+	.proc_release	= single_release,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= proc_scsi_host_write
 };
 
 /**
@@ -146,7 +146,7 @@ void scsi_proc_host_add(struct Scsi_Host
 
 	sprintf(name,"%d", shost->host_no);
 	p = proc_create_data(name, S_IRUGO | S_IWUSR,
-		sht->proc_dir, &proc_scsi_fops, shost);
+		sht->proc_dir, &proc_scsi_ops, shost);
 	if (!p)
 		printk(KERN_ERR "%s: Failed to register host %d in"
 		       "%s\n", __func__, shost->host_no,
@@ -436,13 +436,12 @@ static int proc_scsi_open(struct inode *
 	return seq_open(file, &scsi_seq_ops);
 }
 
-static const struct file_operations proc_scsi_operations = {
-	.owner		= THIS_MODULE,
-	.open		= proc_scsi_open,
-	.read		= seq_read,
-	.write		= proc_scsi_write,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops scsi_scsi_proc_ops = {
+	.proc_open	= proc_scsi_open,
+	.proc_read	= seq_read,
+	.proc_write	= proc_scsi_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 /**
@@ -456,7 +455,7 @@ int __init scsi_init_procfs(void)
 	if (!proc_scsi)
 		goto err1;
 
-	pde = proc_create("scsi/scsi", 0, NULL, &proc_scsi_operations);
+	pde = proc_create("scsi/scsi", 0, NULL, &scsi_scsi_proc_ops);
 	if (!pde)
 		goto err2;
 
--- a/drivers/scsi/sg.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/scsi/sg.c
@@ -2322,25 +2322,23 @@ static int sg_proc_seq_show_int(struct s
 static int sg_proc_single_open_adio(struct inode *inode, struct file *file);
 static ssize_t sg_proc_write_adio(struct file *filp, const char __user *buffer,
 			          size_t count, loff_t *off);
-static const struct file_operations adio_fops = {
-	.owner = THIS_MODULE,
-	.open = sg_proc_single_open_adio,
-	.read = seq_read,
-	.llseek = seq_lseek,
-	.write = sg_proc_write_adio,
-	.release = single_release,
+static const struct proc_ops adio_proc_ops = {
+	.proc_open	= sg_proc_single_open_adio,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= sg_proc_write_adio,
+	.proc_release	= single_release,
 };
 
 static int sg_proc_single_open_dressz(struct inode *inode, struct file *file);
 static ssize_t sg_proc_write_dressz(struct file *filp, 
 		const char __user *buffer, size_t count, loff_t *off);
-static const struct file_operations dressz_fops = {
-	.owner = THIS_MODULE,
-	.open = sg_proc_single_open_dressz,
-	.read = seq_read,
-	.llseek = seq_lseek,
-	.write = sg_proc_write_dressz,
-	.release = single_release,
+static const struct proc_ops dressz_proc_ops = {
+	.proc_open	= sg_proc_single_open_dressz,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= sg_proc_write_dressz,
+	.proc_release	= single_release,
 };
 
 static int sg_proc_seq_show_version(struct seq_file *s, void *v);
@@ -2381,9 +2379,9 @@ sg_proc_init(void)
 	if (!p)
 		return 1;
 
-	proc_create("allow_dio", S_IRUGO | S_IWUSR, p, &adio_fops);
+	proc_create("allow_dio", S_IRUGO | S_IWUSR, p, &adio_proc_ops);
 	proc_create_seq("debug", S_IRUGO, p, &debug_seq_ops);
-	proc_create("def_reserved_size", S_IRUGO | S_IWUSR, p, &dressz_fops);
+	proc_create("def_reserved_size", S_IRUGO | S_IWUSR, p, &dressz_proc_ops);
 	proc_create_single("device_hdr", S_IRUGO, p, sg_proc_seq_show_devhdr);
 	proc_create_seq("devices", S_IRUGO, p, &dev_seq_ops);
 	proc_create_seq("device_strs", S_IRUGO, p, &devstrs_seq_ops);
--- a/drivers/staging/rtl8192u/ieee80211/ieee80211_module.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/staging/rtl8192u/ieee80211/ieee80211_module.c
@@ -264,12 +264,12 @@ static int open_debug_level(struct inode
 	return single_open(file, show_debug_level, NULL);
 }
 
-static const struct file_operations fops = {
-	.open = open_debug_level,
-	.read = seq_read,
-	.llseek = seq_lseek,
-	.write = write_debug_level,
-	.release = single_release,
+static const struct proc_ops debug_level_proc_ops = {
+	.proc_open	= open_debug_level,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= write_debug_level,
+	.proc_release	= single_release,
 };
 
 int __init ieee80211_debug_init(void)
@@ -284,7 +284,7 @@ int __init ieee80211_debug_init(void)
 				" proc directory\n");
 		return -EIO;
 	}
-	e = proc_create("debug_level", 0644, ieee80211_proc, &fops);
+	e = proc_create("debug_level", 0644, ieee80211_proc, &debug_level_proc_ops);
 	if (!e) {
 		remove_proc_entry(DRV_NAME, init_net.proc_net);
 		ieee80211_proc = NULL;
--- a/drivers/tty/sysrq.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/tty/sysrq.c
@@ -1101,15 +1101,15 @@ static ssize_t write_sysrq_trigger(struc
 	return count;
 }
 
-static const struct file_operations proc_sysrq_trigger_operations = {
-	.write		= write_sysrq_trigger,
-	.llseek		= noop_llseek,
+static const struct proc_ops sysrq_trigger_proc_ops = {
+	.proc_write	= write_sysrq_trigger,
+	.proc_lseek	= noop_llseek,
 };
 
 static void sysrq_init_procfs(void)
 {
 	if (!proc_create("sysrq-trigger", S_IWUSR, NULL,
-			 &proc_sysrq_trigger_operations))
+			 &sysrq_trigger_proc_ops))
 		pr_err("Failed to register proc interface\n");
 }
 
--- a/drivers/usb/gadget/function/rndis.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/usb/gadget/function/rndis.c
@@ -72,7 +72,7 @@ static rndis_resp_t *rndis_add_response(
 
 #ifdef CONFIG_USB_GADGET_DEBUG_FILES
 
-static const struct file_operations rndis_proc_fops;
+static const struct proc_ops rndis_proc_ops;
 
 #endif /* CONFIG_USB_GADGET_DEBUG_FILES */
 
@@ -902,7 +902,7 @@ struct rndis_params *rndis_register(void
 
 		sprintf(name, NAME_TEMPLATE, i);
 		proc_entry = proc_create_data(name, 0660, NULL,
-					      &rndis_proc_fops, params);
+					      &rndis_proc_ops, params);
 		if (!proc_entry) {
 			kfree(params);
 			rndis_put_nr(i);
@@ -1164,13 +1164,12 @@ static int rndis_proc_open(struct inode
 	return single_open(file, rndis_proc_show, PDE_DATA(inode));
 }
 
-static const struct file_operations rndis_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= rndis_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= rndis_proc_write,
+static const struct proc_ops rndis_proc_ops = {
+	.proc_open	= rndis_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= rndis_proc_write,
 };
 
 #define	NAME_TEMPLATE "driver/rndis-%03d"
--- a/drivers/video/fbdev/via/viafbdev.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/video/fbdev/via/viafbdev.c
@@ -1173,13 +1173,12 @@ static ssize_t viafb_dvp0_proc_write(str
 	return count;
 }
 
-static const struct file_operations viafb_dvp0_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= viafb_dvp0_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= viafb_dvp0_proc_write,
+static const struct proc_ops viafb_dvp0_proc_ops = {
+	.proc_open	= viafb_dvp0_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= viafb_dvp0_proc_write,
 };
 
 static int viafb_dvp1_proc_show(struct seq_file *m, void *v)
@@ -1238,13 +1237,12 @@ static ssize_t viafb_dvp1_proc_write(str
 	return count;
 }
 
-static const struct file_operations viafb_dvp1_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= viafb_dvp1_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= viafb_dvp1_proc_write,
+static const struct proc_ops viafb_dvp1_proc_ops = {
+	.proc_open	= viafb_dvp1_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= viafb_dvp1_proc_write,
 };
 
 static int viafb_dfph_proc_show(struct seq_file *m, void *v)
@@ -1273,13 +1271,12 @@ static ssize_t viafb_dfph_proc_write(str
 	return count;
 }
 
-static const struct file_operations viafb_dfph_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= viafb_dfph_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= viafb_dfph_proc_write,
+static const struct proc_ops viafb_dfph_proc_ops = {
+	.proc_open	= viafb_dfph_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= viafb_dfph_proc_write,
 };
 
 static int viafb_dfpl_proc_show(struct seq_file *m, void *v)
@@ -1308,13 +1305,12 @@ static ssize_t viafb_dfpl_proc_write(str
 	return count;
 }
 
-static const struct file_operations viafb_dfpl_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= viafb_dfpl_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= viafb_dfpl_proc_write,
+static const struct proc_ops viafb_dfpl_proc_ops = {
+	.proc_open	= viafb_dfpl_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= viafb_dfpl_proc_write,
 };
 
 static int viafb_vt1636_proc_show(struct seq_file *m, void *v)
@@ -1444,13 +1440,12 @@ static ssize_t viafb_vt1636_proc_write(s
 	return count;
 }
 
-static const struct file_operations viafb_vt1636_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= viafb_vt1636_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= viafb_vt1636_proc_write,
+static const struct proc_ops viafb_vt1636_proc_ops = {
+	.proc_open	= viafb_vt1636_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= viafb_vt1636_proc_write,
 };
 
 #endif /* CONFIG_FB_VIA_DIRECT_PROCFS */
@@ -1522,13 +1517,12 @@ static ssize_t viafb_iga1_odev_proc_writ
 	return res;
 }
 
-static const struct file_operations viafb_iga1_odev_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= viafb_iga1_odev_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= viafb_iga1_odev_proc_write,
+static const struct proc_ops viafb_iga1_odev_proc_ops = {
+	.proc_open	= viafb_iga1_odev_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= viafb_iga1_odev_proc_write,
 };
 
 static int viafb_iga2_odev_proc_show(struct seq_file *m, void *v)
@@ -1562,13 +1556,12 @@ static ssize_t viafb_iga2_odev_proc_writ
 	return res;
 }
 
-static const struct file_operations viafb_iga2_odev_proc_fops = {
-	.owner		= THIS_MODULE,
-	.open		= viafb_iga2_odev_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= viafb_iga2_odev_proc_write,
+static const struct proc_ops viafb_iga2_odev_proc_ops = {
+	.proc_open	= viafb_iga2_odev_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= viafb_iga2_odev_proc_write,
 };
 
 #define IS_VT1636(lvds_chip)	((lvds_chip).lvds_chip_name == VT1636_LVDS)
@@ -1580,14 +1573,14 @@ static void viafb_init_proc(struct viafb
 	shared->proc_entry = viafb_entry;
 	if (viafb_entry) {
 #ifdef CONFIG_FB_VIA_DIRECT_PROCFS
-		proc_create("dvp0", 0, viafb_entry, &viafb_dvp0_proc_fops);
-		proc_create("dvp1", 0, viafb_entry, &viafb_dvp1_proc_fops);
-		proc_create("dfph", 0, viafb_entry, &viafb_dfph_proc_fops);
-		proc_create("dfpl", 0, viafb_entry, &viafb_dfpl_proc_fops);
+		proc_create("dvp0", 0, viafb_entry, &viafb_dvp0_proc_ops);
+		proc_create("dvp1", 0, viafb_entry, &viafb_dvp1_proc_ops);
+		proc_create("dfph", 0, viafb_entry, &viafb_dfph_proc_ops);
+		proc_create("dfpl", 0, viafb_entry, &viafb_dfpl_proc_ops);
 		if (IS_VT1636(shared->chip_info.lvds_chip_info)
 			|| IS_VT1636(shared->chip_info.lvds_chip_info2))
 			proc_create("vt1636", 0, viafb_entry,
-				&viafb_vt1636_proc_fops);
+				    &viafb_vt1636_proc_ops);
 #endif /* CONFIG_FB_VIA_DIRECT_PROCFS */
 
 		proc_create_single("supported_output_devices", 0, viafb_entry,
@@ -1595,11 +1588,11 @@ static void viafb_init_proc(struct viafb
 		iga1_entry = proc_mkdir("iga1", viafb_entry);
 		shared->iga1_proc_entry = iga1_entry;
 		proc_create("output_devices", 0, iga1_entry,
-			&viafb_iga1_odev_proc_fops);
+			    &viafb_iga1_odev_proc_ops);
 		iga2_entry = proc_mkdir("iga2", viafb_entry);
 		shared->iga2_proc_entry = iga2_entry;
 		proc_create("output_devices", 0, iga2_entry,
-			&viafb_iga2_odev_proc_fops);
+			    &viafb_iga2_odev_proc_ops);
 	}
 }
 static void viafb_remove_proc(struct viafb_shared *shared)
--- a/drivers/zorro/proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/drivers/zorro/proc.c
@@ -56,10 +56,9 @@ proc_bus_zorro_read(struct file *file, c
 	return nbytes;
 }
 
-static const struct file_operations proc_bus_zorro_operations = {
-	.owner		= THIS_MODULE,
-	.llseek		= proc_bus_zorro_lseek,
-	.read		= proc_bus_zorro_read,
+static const struct proc_ops bus_zorro_proc_ops = {
+	.proc_lseek	= proc_bus_zorro_lseek,
+	.proc_read	= proc_bus_zorro_read,
 };
 
 static void * zorro_seq_start(struct seq_file *m, loff_t *pos)
@@ -105,7 +104,7 @@ static int __init zorro_proc_attach_devi
 
 	sprintf(name, "%02x", slot);
 	entry = proc_create_data(name, 0, proc_bus_zorro_dir,
-				 &proc_bus_zorro_operations,
+				 &bus_zorro_proc_ops,
 				 &zorro_autocon[slot]);
 	if (!entry)
 		return -ENOMEM;
--- a/fs/cifs/cifs_debug.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/cifs/cifs_debug.c
@@ -611,12 +611,12 @@ static int cifs_stats_proc_open(struct i
 	return single_open(file, cifs_stats_proc_show, NULL);
 }
 
-static const struct file_operations cifs_stats_proc_fops = {
-	.open		= cifs_stats_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= cifs_stats_proc_write,
+static const struct proc_ops cifs_stats_proc_ops = {
+	.proc_open	= cifs_stats_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= cifs_stats_proc_write,
 };
 
 #ifdef CONFIG_CIFS_SMB_DIRECT
@@ -640,12 +640,12 @@ static int name##_open(struct inode *ino
 	return single_open(file, name##_proc_show, NULL); \
 } \
 \
-static const struct file_operations cifs_##name##_proc_fops = { \
-	.open		= name##_open, \
-	.read		= seq_read, \
-	.llseek		= seq_lseek, \
-	.release	= single_release, \
-	.write		= name##_write, \
+static const struct proc_ops cifs_##name##_proc_fops = { \
+	.proc_open	= name##_open, \
+	.proc_read	= seq_read, \
+	.proc_lseek	= seq_lseek, \
+	.proc_release	= single_release, \
+	.proc_write	= name##_write, \
 }
 
 PROC_FILE_DEFINE(rdma_readwrite_threshold);
@@ -659,11 +659,11 @@ PROC_FILE_DEFINE(smbd_receive_credit_max
 #endif
 
 static struct proc_dir_entry *proc_fs_cifs;
-static const struct file_operations cifsFYI_proc_fops;
-static const struct file_operations cifs_lookup_cache_proc_fops;
-static const struct file_operations traceSMB_proc_fops;
-static const struct file_operations cifs_security_flags_proc_fops;
-static const struct file_operations cifs_linux_ext_proc_fops;
+static const struct proc_ops cifsFYI_proc_ops;
+static const struct proc_ops cifs_lookup_cache_proc_ops;
+static const struct proc_ops traceSMB_proc_ops;
+static const struct proc_ops cifs_security_flags_proc_ops;
+static const struct proc_ops cifs_linux_ext_proc_ops;
 
 void
 cifs_proc_init(void)
@@ -678,18 +678,18 @@ cifs_proc_init(void)
 	proc_create_single("open_files", 0400, proc_fs_cifs,
 			cifs_debug_files_proc_show);
 
-	proc_create("Stats", 0644, proc_fs_cifs, &cifs_stats_proc_fops);
-	proc_create("cifsFYI", 0644, proc_fs_cifs, &cifsFYI_proc_fops);
-	proc_create("traceSMB", 0644, proc_fs_cifs, &traceSMB_proc_fops);
+	proc_create("Stats", 0644, proc_fs_cifs, &cifs_stats_proc_ops);
+	proc_create("cifsFYI", 0644, proc_fs_cifs, &cifsFYI_proc_ops);
+	proc_create("traceSMB", 0644, proc_fs_cifs, &traceSMB_proc_ops);
 	proc_create("LinuxExtensionsEnabled", 0644, proc_fs_cifs,
-		    &cifs_linux_ext_proc_fops);
+		    &cifs_linux_ext_proc_ops);
 	proc_create("SecurityFlags", 0644, proc_fs_cifs,
-		    &cifs_security_flags_proc_fops);
+		    &cifs_security_flags_proc_ops);
 	proc_create("LookupCacheEnabled", 0644, proc_fs_cifs,
-		    &cifs_lookup_cache_proc_fops);
+		    &cifs_lookup_cache_proc_ops);
 
 #ifdef CONFIG_CIFS_DFS_UPCALL
-	proc_create("dfscache", 0644, proc_fs_cifs, &dfscache_proc_fops);
+	proc_create("dfscache", 0644, proc_fs_cifs, &dfscache_proc_ops);
 #endif
 
 #ifdef CONFIG_CIFS_SMB_DIRECT
@@ -774,12 +774,12 @@ static ssize_t cifsFYI_proc_write(struct
 	return count;
 }
 
-static const struct file_operations cifsFYI_proc_fops = {
-	.open		= cifsFYI_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= cifsFYI_proc_write,
+static const struct proc_ops cifsFYI_proc_ops = {
+	.proc_open	= cifsFYI_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= cifsFYI_proc_write,
 };
 
 static int cifs_linux_ext_proc_show(struct seq_file *m, void *v)
@@ -805,12 +805,12 @@ static ssize_t cifs_linux_ext_proc_write
 	return count;
 }
 
-static const struct file_operations cifs_linux_ext_proc_fops = {
-	.open		= cifs_linux_ext_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= cifs_linux_ext_proc_write,
+static const struct proc_ops cifs_linux_ext_proc_ops = {
+	.proc_open	= cifs_linux_ext_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= cifs_linux_ext_proc_write,
 };
 
 static int cifs_lookup_cache_proc_show(struct seq_file *m, void *v)
@@ -836,12 +836,12 @@ static ssize_t cifs_lookup_cache_proc_wr
 	return count;
 }
 
-static const struct file_operations cifs_lookup_cache_proc_fops = {
-	.open		= cifs_lookup_cache_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= cifs_lookup_cache_proc_write,
+static const struct proc_ops cifs_lookup_cache_proc_ops = {
+	.proc_open	= cifs_lookup_cache_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= cifs_lookup_cache_proc_write,
 };
 
 static int traceSMB_proc_show(struct seq_file *m, void *v)
@@ -867,12 +867,12 @@ static ssize_t traceSMB_proc_write(struc
 	return count;
 }
 
-static const struct file_operations traceSMB_proc_fops = {
-	.open		= traceSMB_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= traceSMB_proc_write,
+static const struct proc_ops traceSMB_proc_ops = {
+	.proc_open	= traceSMB_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= traceSMB_proc_write,
 };
 
 static int cifs_security_flags_proc_show(struct seq_file *m, void *v)
@@ -978,12 +978,12 @@ static ssize_t cifs_security_flags_proc_
 	return count;
 }
 
-static const struct file_operations cifs_security_flags_proc_fops = {
-	.open		= cifs_security_flags_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= cifs_security_flags_proc_write,
+static const struct proc_ops cifs_security_flags_proc_ops = {
+	.proc_open	= cifs_security_flags_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= cifs_security_flags_proc_write,
 };
 #else
 inline void cifs_proc_init(void)
--- a/fs/cifs/dfs_cache.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/cifs/dfs_cache.c
@@ -8,6 +8,7 @@
 #include <linux/jhash.h>
 #include <linux/ktime.h>
 #include <linux/slab.h>
+#include <linux/proc_fs.h>
 #include <linux/nls.h>
 #include <linux/workqueue.h>
 #include "cifsglob.h"
@@ -211,12 +212,12 @@ static int dfscache_proc_open(struct ino
 	return single_open(file, dfscache_proc_show, NULL);
 }
 
-const struct file_operations dfscache_proc_fops = {
-	.open		= dfscache_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= dfscache_proc_write,
+const struct proc_ops dfscache_proc_ops = {
+	.proc_open	= dfscache_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= dfscache_proc_write,
 };
 
 #ifdef CONFIG_CIFS_DEBUG2
--- a/fs/cifs/dfs_cache.h~proc-convert-everything-to-struct-proc_ops
+++ a/fs/cifs/dfs_cache.h
@@ -24,7 +24,7 @@ struct dfs_cache_tgt_iterator {
 
 extern int dfs_cache_init(void);
 extern void dfs_cache_destroy(void);
-extern const struct file_operations dfscache_proc_fops;
+extern const struct proc_ops dfscache_proc_ops;
 
 extern int dfs_cache_find(const unsigned int xid, struct cifs_ses *ses,
 			  const struct nls_table *nls_codepage, int remap,
--- a/fs/fscache/internal.h~proc-convert-everything-to-struct-proc_ops
+++ a/fs/fscache/internal.h
@@ -111,7 +111,7 @@ extern void fscache_enqueue_object(struc
  * object-list.c
  */
 #ifdef CONFIG_FSCACHE_OBJECT_LIST
-extern const struct file_operations fscache_objlist_fops;
+extern const struct proc_ops fscache_objlist_proc_ops;
 
 extern void fscache_objlist_add(struct fscache_object *);
 extern void fscache_objlist_remove(struct fscache_object *);
--- a/fs/fscache/object-list.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/fscache/object-list.c
@@ -7,6 +7,7 @@
 
 #define FSCACHE_DEBUG_LEVEL COOKIE
 #include <linux/module.h>
+#include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 #include <linux/key.h>
@@ -405,9 +406,9 @@ static int fscache_objlist_release(struc
 	return seq_release(inode, file);
 }
 
-const struct file_operations fscache_objlist_fops = {
-	.open		= fscache_objlist_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= fscache_objlist_release,
+const struct proc_ops fscache_objlist_proc_ops = {
+	.proc_open	= fscache_objlist_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= fscache_objlist_release,
 };
--- a/fs/fscache/proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/fscache/proc.c
@@ -35,7 +35,7 @@ int __init fscache_proc_init(void)
 
 #ifdef CONFIG_FSCACHE_OBJECT_LIST
 	if (!proc_create("fs/fscache/objects", S_IFREG | 0444, NULL,
-			 &fscache_objlist_fops))
+			 &fscache_objlist_proc_ops))
 		goto error_objects;
 #endif
 
--- a/fs/jbd2/journal.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/jbd2/journal.c
@@ -1074,12 +1074,11 @@ static int jbd2_seq_info_release(struct
 	return seq_release(inode, file);
 }
 
-static const struct file_operations jbd2_seq_info_fops = {
-	.owner		= THIS_MODULE,
-	.open           = jbd2_seq_info_open,
-	.read           = seq_read,
-	.llseek         = seq_lseek,
-	.release        = jbd2_seq_info_release,
+static const struct proc_ops jbd2_info_proc_ops = {
+	.proc_open	= jbd2_seq_info_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= jbd2_seq_info_release,
 };
 
 static struct proc_dir_entry *proc_jbd2_stats;
@@ -1089,7 +1088,7 @@ static void jbd2_stats_proc_init(journal
 	journal->j_proc_entry = proc_mkdir(journal->j_devname, proc_jbd2_stats);
 	if (journal->j_proc_entry) {
 		proc_create_data("info", S_IRUGO, journal->j_proc_entry,
-				 &jbd2_seq_info_fops, journal);
+				 &jbd2_info_proc_ops, journal);
 	}
 }
 
--- a/fs/jfs/jfs_debug.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/jfs/jfs_debug.c
@@ -43,12 +43,12 @@ static ssize_t jfs_loglevel_proc_write(s
 	return count;
 }
 
-static const struct file_operations jfs_loglevel_proc_fops = {
-	.open		= jfs_loglevel_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= jfs_loglevel_proc_write,
+static const struct proc_ops jfs_loglevel_proc_ops = {
+	.proc_open	= jfs_loglevel_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= jfs_loglevel_proc_write,
 };
 #endif
 
@@ -68,7 +68,7 @@ void jfs_proc_init(void)
 #endif
 #ifdef CONFIG_JFS_DEBUG
 	proc_create_single("TxAnchor", 0, base, jfs_txanchor_proc_show);
-	proc_create("loglevel", 0, base, &jfs_loglevel_proc_fops);
+	proc_create("loglevel", 0, base, &jfs_loglevel_proc_ops);
 #endif
 }
 
--- a/fs/lockd/procfs.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/lockd/procfs.c
@@ -60,11 +60,11 @@ nlm_end_grace_read(struct file *file, ch
 	return simple_read_from_buffer(buf, size, pos, resp, sizeof(resp));
 }
 
-static const struct file_operations lockd_end_grace_operations = {
-	.write		= nlm_end_grace_write,
-	.read		= nlm_end_grace_read,
-	.llseek		= default_llseek,
-	.release	= simple_transaction_release,
+static const struct proc_ops lockd_end_grace_proc_ops = {
+	.proc_write	= nlm_end_grace_write,
+	.proc_read	= nlm_end_grace_read,
+	.proc_lseek	= default_llseek,
+	.proc_release	= simple_transaction_release,
 };
 
 int __init
@@ -76,7 +76,7 @@ lockd_create_procfs(void)
 	if (!entry)
 		return -ENOMEM;
 	entry = proc_create("nlm_end_grace", S_IRUGO|S_IWUSR, entry,
-				 &lockd_end_grace_operations);
+			    &lockd_end_grace_proc_ops);
 	if (!entry) {
 		remove_proc_entry("fs/lockd", NULL);
 		return -ENOMEM;
--- a/fs/nfsd/nfsctl.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/nfsd/nfsctl.c
@@ -157,11 +157,11 @@ static int exports_proc_open(struct inod
 	return exports_net_open(current->nsproxy->net_ns, file);
 }
 
-static const struct file_operations exports_proc_operations = {
-	.open		= exports_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops exports_proc_ops = {
+	.proc_open	= exports_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 static int exports_nfsd_open(struct inode *inode, struct file *file)
@@ -1431,8 +1431,7 @@ static int create_proc_exports_entry(voi
 	entry = proc_mkdir("fs/nfs", NULL);
 	if (!entry)
 		return -ENOMEM;
-	entry = proc_create("exports", 0, entry,
-				 &exports_proc_operations);
+	entry = proc_create("exports", 0, entry, &exports_proc_ops);
 	if (!entry) {
 		remove_proc_entry("fs/nfs", NULL);
 		return -ENOMEM;
--- a/fs/nfsd/stats.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/nfsd/stats.c
@@ -84,17 +84,17 @@ static int nfsd_proc_open(struct inode *
 	return single_open(file, nfsd_proc_show, NULL);
 }
 
-static const struct file_operations nfsd_proc_fops = {
-	.open = nfsd_proc_open,
-	.read  = seq_read,
-	.llseek = seq_lseek,
-	.release = single_release,
+static const struct proc_ops nfsd_proc_ops = {
+	.proc_open	= nfsd_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 void
 nfsd_stat_init(void)
 {
-	svc_proc_register(&init_net, &nfsd_svcstats, &nfsd_proc_fops);
+	svc_proc_register(&init_net, &nfsd_svcstats, &nfsd_proc_ops);
 }
 
 void
--- a/fs/proc/cpuinfo.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/proc/cpuinfo.c
@@ -16,16 +16,16 @@ static int cpuinfo_open(struct inode *in
 	return seq_open(file, &cpuinfo_op);
 }
 
-static const struct file_operations proc_cpuinfo_operations = {
-	.open		= cpuinfo_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops cpuinfo_proc_ops = {
+	.proc_open	= cpuinfo_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 static int __init proc_cpuinfo_init(void)
 {
-	proc_create("cpuinfo", 0, NULL, &proc_cpuinfo_operations);
+	proc_create("cpuinfo", 0, NULL, &cpuinfo_proc_ops);
 	return 0;
 }
 fs_initcall(proc_cpuinfo_init);
--- a/fs/proc/kcore.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/proc/kcore.c
@@ -574,11 +574,11 @@ static int release_kcore(struct inode *i
 	return 0;
 }
 
-static const struct file_operations proc_kcore_operations = {
-	.read		= read_kcore,
-	.open		= open_kcore,
-	.release	= release_kcore,
-	.llseek		= default_llseek,
+static const struct proc_ops kcore_proc_ops = {
+	.proc_read	= read_kcore,
+	.proc_open	= open_kcore,
+	.proc_release	= release_kcore,
+	.proc_lseek	= default_llseek,
 };
 
 /* just remember that we have to update kcore */
@@ -637,8 +637,7 @@ static void __init add_modules_range(voi
 
 static int __init proc_kcore_init(void)
 {
-	proc_root_kcore = proc_create("kcore", S_IRUSR, NULL,
-				      &proc_kcore_operations);
+	proc_root_kcore = proc_create("kcore", S_IRUSR, NULL, &kcore_proc_ops);
 	if (!proc_root_kcore) {
 		pr_err("couldn't create /proc/kcore\n");
 		return 0; /* Always returns 0. */
--- a/fs/proc/kmsg.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/proc/kmsg.c
@@ -49,17 +49,17 @@ static __poll_t kmsg_poll(struct file *f
 }
 
 
-static const struct file_operations proc_kmsg_operations = {
-	.read		= kmsg_read,
-	.poll		= kmsg_poll,
-	.open		= kmsg_open,
-	.release	= kmsg_release,
-	.llseek		= generic_file_llseek,
+static const struct proc_ops kmsg_proc_ops = {
+	.proc_read	= kmsg_read,
+	.proc_poll	= kmsg_poll,
+	.proc_open	= kmsg_open,
+	.proc_release	= kmsg_release,
+	.proc_lseek	= generic_file_llseek,
 };
 
 static int __init proc_kmsg_init(void)
 {
-	proc_create("kmsg", S_IRUSR, NULL, &proc_kmsg_operations);
+	proc_create("kmsg", S_IRUSR, NULL, &kmsg_proc_ops);
 	return 0;
 }
 fs_initcall(proc_kmsg_init);
--- a/fs/proc/page.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/proc/page.c
@@ -89,9 +89,9 @@ static ssize_t kpagecount_read(struct fi
 	return ret;
 }
 
-static const struct file_operations proc_kpagecount_operations = {
-	.llseek = mem_lseek,
-	.read = kpagecount_read,
+static const struct proc_ops kpagecount_proc_ops = {
+	.proc_lseek	= mem_lseek,
+	.proc_read	= kpagecount_read,
 };
 
 /* /proc/kpageflags - an array exposing page flags
@@ -263,9 +263,9 @@ static ssize_t kpageflags_read(struct fi
 	return ret;
 }
 
-static const struct file_operations proc_kpageflags_operations = {
-	.llseek = mem_lseek,
-	.read = kpageflags_read,
+static const struct proc_ops kpageflags_proc_ops = {
+	.proc_lseek	= mem_lseek,
+	.proc_read	= kpageflags_read,
 };
 
 #ifdef CONFIG_MEMCG
@@ -317,18 +317,18 @@ static ssize_t kpagecgroup_read(struct f
 	return ret;
 }
 
-static const struct file_operations proc_kpagecgroup_operations = {
-	.llseek = mem_lseek,
-	.read = kpagecgroup_read,
+static const struct proc_ops kpagecgroup_proc_ops = {
+	.proc_lseek	= mem_lseek,
+	.proc_read	= kpagecgroup_read,
 };
 #endif /* CONFIG_MEMCG */
 
 static int __init proc_page_init(void)
 {
-	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
-	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+	proc_create("kpagecount", S_IRUSR, NULL, &kpagecount_proc_ops);
+	proc_create("kpageflags", S_IRUSR, NULL, &kpageflags_proc_ops);
 #ifdef CONFIG_MEMCG
-	proc_create("kpagecgroup", S_IRUSR, NULL, &proc_kpagecgroup_operations);
+	proc_create("kpagecgroup", S_IRUSR, NULL, &kpagecgroup_proc_ops);
 #endif
 	return 0;
 }
--- a/fs/proc/stat.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/proc/stat.c
@@ -223,16 +223,16 @@ static int stat_open(struct inode *inode
 	return single_open_size(file, show_stat, NULL, size);
 }
 
-static const struct file_operations proc_stat_operations = {
-	.open		= stat_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
+static const struct proc_ops stat_proc_ops = {
+	.proc_open	= stat_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 static int __init proc_stat_init(void)
 {
-	proc_create("stat", 0, NULL, &proc_stat_operations);
+	proc_create("stat", 0, NULL, &stat_proc_ops);
 	return 0;
 }
 fs_initcall(proc_stat_init);
--- a/fs/proc/vmcore.c~proc-convert-everything-to-struct-proc_ops
+++ a/fs/proc/vmcore.c
@@ -667,10 +667,10 @@ static int mmap_vmcore(struct file *file
 }
 #endif
 
-static const struct file_operations proc_vmcore_operations = {
-	.read		= read_vmcore,
-	.llseek		= default_llseek,
-	.mmap		= mmap_vmcore,
+static const struct proc_ops vmcore_proc_ops = {
+	.proc_read	= read_vmcore,
+	.proc_lseek	= default_llseek,
+	.proc_mmap	= mmap_vmcore,
 };
 
 static struct vmcore* __init get_new_element(void)
@@ -1555,7 +1555,7 @@ static int __init vmcore_init(void)
 	elfcorehdr_free(elfcorehdr_addr);
 	elfcorehdr_addr = ELFCORE_ADDR_ERR;
 
-	proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &proc_vmcore_operations);
+	proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &vmcore_proc_ops);
 	if (proc_vmcore)
 		proc_vmcore->size = vmcore_size;
 	return 0;
--- a/include/linux/seq_file.h~proc-convert-everything-to-struct-proc_ops
+++ a/include/linux/seq_file.h
@@ -160,6 +160,19 @@ static const struct file_operations __na
 	.release	= single_release,				\
 }
 
+#define DEFINE_PROC_SHOW_ATTRIBUTE(__name)				\
+static int __name ## _open(struct inode *inode, struct file *file)	\
+{									\
+	return single_open(file, __name ## _show, inode->i_private);	\
+}									\
+									\
+static const struct proc_ops __name ## _proc_ops = {			\
+	.proc_open	= __name ## _open,				\
+	.proc_read	= seq_read,					\
+	.proc_lseek	= seq_lseek,					\
+	.proc_release	= single_release,				\
+}
+
 static inline struct user_namespace *seq_user_ns(struct seq_file *seq)
 {
 #ifdef CONFIG_USER_NS
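
The DEFINE_PROC_SHOW_ATTRIBUTE() added above mirrors the existing
DEFINE_SHOW_ATTRIBUTE() a few lines up, but expands to a struct proc_ops
named <name>_proc_ops rather than a file_operations.  A sketch of the
intended use, with a hypothetical frob_show():

	static int frob_show(struct seq_file *m, void *v)
	{
		seq_puts(m, "example\n");
		return 0;
	}
	DEFINE_PROC_SHOW_ATTRIBUTE(frob);

	/* then, at init time: */
	proc_create("frob", 0444, NULL, &frob_proc_ops);
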
--- a/include/linux/sunrpc/stats.h~proc-convert-everything-to-struct-proc_ops
+++ a/include/linux/sunrpc/stats.h
@@ -63,7 +63,7 @@ struct proc_dir_entry *	rpc_proc_registe
 void			rpc_proc_unregister(struct net *,const char *);
 void			rpc_proc_zero(const struct rpc_program *);
 struct proc_dir_entry *	svc_proc_register(struct net *, struct svc_stat *,
-					  const struct file_operations *);
+					  const struct proc_ops *);
 void			svc_proc_unregister(struct net *, const char *);
 
 void			svc_seq_show(struct seq_file *,
@@ -75,7 +75,7 @@ static inline void rpc_proc_unregister(s
 static inline void rpc_proc_zero(const struct rpc_program *p) {}
 
 static inline struct proc_dir_entry *svc_proc_register(struct net *net, struct svc_stat *s,
-						       const struct file_operations *f) { return NULL; }
+						       const struct proc_ops *proc_ops) { return NULL; }
 static inline void svc_proc_unregister(struct net *net, const char *p) {}
 
 static inline void svc_seq_show(struct seq_file *seq,
--- a/ipc/util.c~proc-convert-everything-to-struct-proc_ops
+++ a/ipc/util.c
@@ -126,7 +126,7 @@ void ipc_init_ids(struct ipc_ids *ids)
 }
 
 #ifdef CONFIG_PROC_FS
-static const struct file_operations sysvipc_proc_fops;
+static const struct proc_ops sysvipc_proc_ops;
 /**
  * ipc_init_proc_interface -  create a proc interface for sysipc types using a seq_file interface.
  * @path: Path in procfs
@@ -151,7 +151,7 @@ void __init ipc_init_proc_interface(cons
 	pde = proc_create_data(path,
 			       S_IRUGO,        /* world readable */
 			       NULL,           /* parent dir */
-			       &sysvipc_proc_fops,
+			       &sysvipc_proc_ops,
 			       iface);
 	if (!pde)
 		kfree(iface);
@@ -884,10 +884,10 @@ static int sysvipc_proc_release(struct i
 	return seq_release_private(inode, file);
 }
 
-static const struct file_operations sysvipc_proc_fops = {
-	.open    = sysvipc_proc_open,
-	.read    = seq_read,
-	.llseek  = seq_lseek,
-	.release = sysvipc_proc_release,
+static const struct proc_ops sysvipc_proc_ops = {
+	.proc_open	= sysvipc_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= sysvipc_proc_release,
 };
 #endif /* CONFIG_PROC_FS */
--- a/kernel/configs.c~proc-convert-everything-to-struct-proc_ops
+++ a/kernel/configs.c
@@ -47,10 +47,9 @@ ikconfig_read_current(struct file *file,
 				       &kernel_config_data);
 }
 
-static const struct file_operations ikconfig_file_ops = {
-	.owner = THIS_MODULE,
-	.read = ikconfig_read_current,
-	.llseek = default_llseek,
+static const struct proc_ops config_gz_proc_ops = {
+	.proc_read	= ikconfig_read_current,
+	.proc_lseek	= default_llseek,
 };
 
 static int __init ikconfig_init(void)
@@ -59,7 +58,7 @@ static int __init ikconfig_init(void)
 
 	/* create the current config file */
 	entry = proc_create("config.gz", S_IFREG | S_IRUGO, NULL,
-			    &ikconfig_file_ops);
+			    &config_gz_proc_ops);
 	if (!entry)
 		return -ENOMEM;
 
--- a/kernel/irq/proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/kernel/irq/proc.c
@@ -176,20 +176,20 @@ static int irq_affinity_list_proc_open(s
 	return single_open(file, irq_affinity_list_proc_show, PDE_DATA(inode));
 }
 
-static const struct file_operations irq_affinity_proc_fops = {
-	.open		= irq_affinity_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= irq_affinity_proc_write,
+static const struct proc_ops irq_affinity_proc_ops = {
+	.proc_open	= irq_affinity_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= irq_affinity_proc_write,
 };
 
-static const struct file_operations irq_affinity_list_proc_fops = {
-	.open		= irq_affinity_list_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= irq_affinity_list_proc_write,
+static const struct proc_ops irq_affinity_list_proc_ops = {
+	.proc_open	= irq_affinity_list_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= irq_affinity_list_proc_write,
 };
 
 #ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
@@ -246,12 +246,12 @@ static int default_affinity_open(struct
 	return single_open(file, default_affinity_show, PDE_DATA(inode));
 }
 
-static const struct file_operations default_affinity_proc_fops = {
-	.open		= default_affinity_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= default_affinity_write,
+static const struct proc_ops default_affinity_proc_ops = {
+	.proc_open	= default_affinity_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= default_affinity_write,
 };
 
 static int irq_node_proc_show(struct seq_file *m, void *v)
@@ -342,7 +342,7 @@ void register_irq_proc(unsigned int irq,
 #ifdef CONFIG_SMP
 	/* create /proc/irq/<irq>/smp_affinity */
 	proc_create_data("smp_affinity", 0644, desc->dir,
-			 &irq_affinity_proc_fops, irqp);
+			 &irq_affinity_proc_ops, irqp);
 
 	/* create /proc/irq/<irq>/affinity_hint */
 	proc_create_single_data("affinity_hint", 0444, desc->dir,
@@ -350,7 +350,7 @@ void register_irq_proc(unsigned int irq,
 
 	/* create /proc/irq/<irq>/smp_affinity_list */
 	proc_create_data("smp_affinity_list", 0644, desc->dir,
-			 &irq_affinity_list_proc_fops, irqp);
+			 &irq_affinity_list_proc_ops, irqp);
 
 	proc_create_single_data("node", 0444, desc->dir, irq_node_proc_show,
 			irqp);
@@ -401,7 +401,7 @@ static void register_default_affinity_pr
 {
 #ifdef CONFIG_SMP
 	proc_create("irq/default_smp_affinity", 0644, NULL,
-		    &default_affinity_proc_fops);
+		    &default_affinity_proc_ops);
 #endif
 }
 
--- a/kernel/kallsyms.c~proc-convert-everything-to-struct-proc_ops
+++ a/kernel/kallsyms.c
@@ -698,16 +698,16 @@ const char *kdb_walk_kallsyms(loff_t *po
 }
 #endif	/* CONFIG_KGDB_KDB */
 
-static const struct file_operations kallsyms_operations = {
-	.open = kallsyms_open,
-	.read = seq_read,
-	.llseek = seq_lseek,
-	.release = seq_release_private,
+static const struct proc_ops kallsyms_proc_ops = {
+	.proc_open	= kallsyms_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release_private,
 };
 
 static int __init kallsyms_init(void)
 {
-	proc_create("kallsyms", 0444, NULL, &kallsyms_operations);
+	proc_create("kallsyms", 0444, NULL, &kallsyms_proc_ops);
 	return 0;
 }
 device_initcall(kallsyms_init);
--- a/kernel/latencytop.c~proc-convert-everything-to-struct-proc_ops
+++ a/kernel/latencytop.c
@@ -255,17 +255,17 @@ static int lstats_open(struct inode *ino
 	return single_open(filp, lstats_show, NULL);
 }
 
-static const struct file_operations lstats_fops = {
-	.open		= lstats_open,
-	.read		= seq_read,
-	.write		= lstats_write,
-	.llseek		= seq_lseek,
-	.release	= single_release,
+static const struct proc_ops lstats_proc_ops = {
+	.proc_open	= lstats_open,
+	.proc_read	= seq_read,
+	.proc_write	= lstats_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 static int __init init_lstats_procfs(void)
 {
-	proc_create("latency_stats", 0644, NULL, &lstats_fops);
+	proc_create("latency_stats", 0644, NULL, &lstats_proc_ops);
 	return 0;
 }
 
--- a/kernel/locking/lockdep_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/kernel/locking/lockdep_proc.c
@@ -643,12 +643,12 @@ static int lock_stat_release(struct inod
 	return seq_release(inode, file);
 }
 
-static const struct file_operations proc_lock_stat_operations = {
-	.open		= lock_stat_open,
-	.write		= lock_stat_write,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= lock_stat_release,
+static const struct proc_ops lock_stat_proc_ops = {
+	.proc_open	= lock_stat_open,
+	.proc_write	= lock_stat_write,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= lock_stat_release,
 };
 #endif /* CONFIG_LOCK_STAT */
 
@@ -660,8 +660,7 @@ static int __init lockdep_proc_init(void
 #endif
 	proc_create_single("lockdep_stats", S_IRUSR, NULL, lockdep_stats_show);
 #ifdef CONFIG_LOCK_STAT
-	proc_create("lock_stat", S_IRUSR | S_IWUSR, NULL,
-		    &proc_lock_stat_operations);
+	proc_create("lock_stat", S_IRUSR | S_IWUSR, NULL, &lock_stat_proc_ops);
 #endif
 
 	return 0;
--- a/kernel/module.c~proc-convert-everything-to-struct-proc_ops
+++ a/kernel/module.c
@@ -4354,16 +4354,16 @@ static int modules_open(struct inode *in
 	return err;
 }
 
-static const struct file_operations proc_modules_operations = {
-	.open		= modules_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops modules_proc_ops = {
+	.proc_open	= modules_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 static int __init proc_modules_init(void)
 {
-	proc_create("modules", 0, NULL, &proc_modules_operations);
+	proc_create("modules", 0, NULL, &modules_proc_ops);
 	return 0;
 }
 module_init(proc_modules_init);
--- a/kernel/profile.c~proc-convert-everything-to-struct-proc_ops
+++ a/kernel/profile.c
@@ -442,18 +442,18 @@ static ssize_t prof_cpu_mask_proc_write(
 	return err;
 }
 
-static const struct file_operations prof_cpu_mask_proc_fops = {
-	.open		= prof_cpu_mask_proc_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-	.write		= prof_cpu_mask_proc_write,
+static const struct proc_ops prof_cpu_mask_proc_ops = {
+	.proc_open	= prof_cpu_mask_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
+	.proc_write	= prof_cpu_mask_proc_write,
 };
 
 void create_prof_cpu_mask(void)
 {
 	/* create /proc/irq/prof_cpu_mask */
-	proc_create("irq/prof_cpu_mask", 0600, NULL, &prof_cpu_mask_proc_fops);
+	proc_create("irq/prof_cpu_mask", 0600, NULL, &prof_cpu_mask_proc_ops);
 }
 
 /*
@@ -517,10 +517,10 @@ static ssize_t write_profile(struct file
 	return count;
 }
 
-static const struct file_operations proc_profile_operations = {
-	.read		= read_profile,
-	.write		= write_profile,
-	.llseek		= default_llseek,
+static const struct proc_ops profile_proc_ops = {
+	.proc_read	= read_profile,
+	.proc_write	= write_profile,
+	.proc_lseek	= default_llseek,
 };
 
 int __ref create_proc_profile(void)
@@ -548,7 +548,7 @@ int __ref create_proc_profile(void)
 	err = 0;
 #endif
 	entry = proc_create("profile", S_IWUSR | S_IRUGO,
-			    NULL, &proc_profile_operations);
+			    NULL, &profile_proc_ops);
 	if (!entry)
 		goto err_state_onl;
 	proc_set_size(entry, (1 + prof_len) * sizeof(atomic_t));
--- a/kernel/sched/psi.c~proc-convert-everything-to-struct-proc_ops
+++ a/kernel/sched/psi.c
@@ -1251,40 +1251,40 @@ static int psi_fop_release(struct inode
 	return single_release(inode, file);
 }
 
-static const struct file_operations psi_io_fops = {
-	.open           = psi_io_open,
-	.read           = seq_read,
-	.llseek         = seq_lseek,
-	.write          = psi_io_write,
-	.poll           = psi_fop_poll,
-	.release        = psi_fop_release,
+static const struct proc_ops psi_io_proc_ops = {
+	.proc_open	= psi_io_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= psi_io_write,
+	.proc_poll	= psi_fop_poll,
+	.proc_release	= psi_fop_release,
 };
 
-static const struct file_operations psi_memory_fops = {
-	.open           = psi_memory_open,
-	.read           = seq_read,
-	.llseek         = seq_lseek,
-	.write          = psi_memory_write,
-	.poll           = psi_fop_poll,
-	.release        = psi_fop_release,
+static const struct proc_ops psi_memory_proc_ops = {
+	.proc_open	= psi_memory_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= psi_memory_write,
+	.proc_poll	= psi_fop_poll,
+	.proc_release	= psi_fop_release,
 };
 
-static const struct file_operations psi_cpu_fops = {
-	.open           = psi_cpu_open,
-	.read           = seq_read,
-	.llseek         = seq_lseek,
-	.write          = psi_cpu_write,
-	.poll           = psi_fop_poll,
-	.release        = psi_fop_release,
+static const struct proc_ops psi_cpu_proc_ops = {
+	.proc_open	= psi_cpu_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= psi_cpu_write,
+	.proc_poll	= psi_fop_poll,
+	.proc_release	= psi_fop_release,
 };
 
 static int __init psi_proc_init(void)
 {
 	if (psi_enable) {
 		proc_mkdir("pressure", NULL);
-		proc_create("pressure/io", 0, NULL, &psi_io_fops);
-		proc_create("pressure/memory", 0, NULL, &psi_memory_fops);
-		proc_create("pressure/cpu", 0, NULL, &psi_cpu_fops);
+		proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
+		proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
+		proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
 	}
 	return 0;
 }
--- a/mm/slab_common.c~proc-convert-everything-to-struct-proc_ops
+++ a/mm/slab_common.c
@@ -1580,18 +1580,17 @@ static int slabinfo_open(struct inode *i
 	return seq_open(file, &slabinfo_op);
 }
 
-static const struct file_operations proc_slabinfo_operations = {
-	.open		= slabinfo_open,
-	.read		= seq_read,
-	.write          = slabinfo_write,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
+static const struct proc_ops slabinfo_proc_ops = {
+	.proc_open	= slabinfo_open,
+	.proc_read	= seq_read,
+	.proc_write	= slabinfo_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 static int __init slab_proc_init(void)
 {
-	proc_create("slabinfo", SLABINFO_RIGHTS, NULL,
-						&proc_slabinfo_operations);
+	proc_create("slabinfo", SLABINFO_RIGHTS, NULL, &slabinfo_proc_ops);
 	return 0;
 }
 module_init(slab_proc_init);
--- a/mm/swapfile.c~proc-convert-everything-to-struct-proc_ops
+++ a/mm/swapfile.c
@@ -2796,17 +2796,17 @@ static int swaps_open(struct inode *inod
 	return 0;
 }
 
-static const struct file_operations proc_swaps_operations = {
-	.open		= swaps_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
-	.poll		= swaps_poll,
+static const struct proc_ops swaps_proc_ops = {
+	.proc_open	= swaps_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
+	.proc_poll	= swaps_poll,
 };
 
 static int __init procswaps_init(void)
 {
-	proc_create("swaps", 0, NULL, &proc_swaps_operations);
+	proc_create("swaps", 0, NULL, &swaps_proc_ops);
 	return 0;
 }
 __initcall(procswaps_init);
--- a/net/atm/mpoa_proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/atm/mpoa_proc.c
@@ -53,15 +53,12 @@ static ssize_t proc_mpc_write(struct fil
 
 static int parse_qos(const char *buff);
 
-/*
- *   Define allowed FILE OPERATIONS
- */
-static const struct file_operations mpc_file_operations = {
-	.open =		proc_mpc_open,
-	.read =		seq_read,
-	.llseek =	seq_lseek,
-	.write =	proc_mpc_write,
-	.release =	seq_release,
+static const struct proc_ops mpc_proc_ops = {
+	.proc_open	= proc_mpc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= proc_mpc_write,
+	.proc_release	= seq_release,
 };
 
 /*
@@ -290,7 +287,7 @@ int mpc_proc_init(void)
 {
 	struct proc_dir_entry *p;
 
-	p = proc_create(STAT_FILE_NAME, 0, atm_proc_root, &mpc_file_operations);
+	p = proc_create(STAT_FILE_NAME, 0, atm_proc_root, &mpc_proc_ops);
 	if (!p) {
 		pr_err("Unable to initialize /proc/atm/%s\n", STAT_FILE_NAME);
 		return -ENOMEM;
--- a/net/atm/proc.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/atm/proc.c
@@ -36,9 +36,9 @@
 static ssize_t proc_dev_atm_read(struct file *file, char __user *buf,
 				 size_t count, loff_t *pos);
 
-static const struct file_operations proc_atm_dev_ops = {
-	.read =		proc_dev_atm_read,
-	.llseek =	noop_llseek,
+static const struct proc_ops atm_dev_proc_ops = {
+	.proc_read	= proc_dev_atm_read,
+	.proc_lseek	= noop_llseek,
 };
 
 static void add_stats(struct seq_file *seq, const char *aal,
@@ -359,7 +359,7 @@ int atm_proc_dev_register(struct atm_dev
 		goto err_out;
 
 	dev->proc_entry = proc_create_data(dev->proc_name, 0, atm_proc_root,
-					   &proc_atm_dev_ops, dev);
+					   &atm_dev_proc_ops, dev);
 	if (!dev->proc_entry)
 		goto err_free_name;
 	return 0;
--- a/net/core/pktgen.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/core/pktgen.c
@@ -535,12 +535,12 @@ static int pgctrl_open(struct inode *ino
 	return single_open(file, pgctrl_show, PDE_DATA(inode));
 }
 
-static const struct file_operations pktgen_fops = {
-	.open    = pgctrl_open,
-	.read    = seq_read,
-	.llseek  = seq_lseek,
-	.write   = pgctrl_write,
-	.release = single_release,
+static const struct proc_ops pktgen_proc_ops = {
+	.proc_open	= pgctrl_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= pgctrl_write,
+	.proc_release	= single_release,
 };
 
 static int pktgen_if_show(struct seq_file *seq, void *v)
@@ -1707,12 +1707,12 @@ static int pktgen_if_open(struct inode *
 	return single_open(file, pktgen_if_show, PDE_DATA(inode));
 }
 
-static const struct file_operations pktgen_if_fops = {
-	.open    = pktgen_if_open,
-	.read    = seq_read,
-	.llseek  = seq_lseek,
-	.write   = pktgen_if_write,
-	.release = single_release,
+static const struct proc_ops pktgen_if_proc_ops = {
+	.proc_open	= pktgen_if_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= pktgen_if_write,
+	.proc_release	= single_release,
 };
 
 static int pktgen_thread_show(struct seq_file *seq, void *v)
@@ -1844,12 +1844,12 @@ static int pktgen_thread_open(struct ino
 	return single_open(file, pktgen_thread_show, PDE_DATA(inode));
 }
 
-static const struct file_operations pktgen_thread_fops = {
-	.open    = pktgen_thread_open,
-	.read    = seq_read,
-	.llseek  = seq_lseek,
-	.write   = pktgen_thread_write,
-	.release = single_release,
+static const struct proc_ops pktgen_thread_proc_ops = {
+	.proc_open	= pktgen_thread_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= pktgen_thread_write,
+	.proc_release	= single_release,
 };
 
 /* Think find or remove for NN */
@@ -1926,7 +1926,7 @@ static void pktgen_change_name(const str
 
 			pkt_dev->entry = proc_create_data(dev->name, 0600,
 							  pn->proc_dir,
-							  &pktgen_if_fops,
+							  &pktgen_if_proc_ops,
 							  pkt_dev);
 			if (!pkt_dev->entry)
 				pr_err("can't move proc entry for '%s'\n",
@@ -3638,7 +3638,7 @@ static int pktgen_add_device(struct pktg
 		pkt_dev->clone_skb = pg_clone_skb_d;
 
 	pkt_dev->entry = proc_create_data(ifname, 0600, t->net->proc_dir,
-					  &pktgen_if_fops, pkt_dev);
+					  &pktgen_if_proc_ops, pkt_dev);
 	if (!pkt_dev->entry) {
 		pr_err("cannot create %s/%s procfs entry\n",
 		       PG_PROC_DIR, ifname);
@@ -3708,7 +3708,7 @@ static int __net_init pktgen_create_thre
 	t->tsk = p;
 
 	pe = proc_create_data(t->tsk->comm, 0600, pn->proc_dir,
-			      &pktgen_thread_fops, t);
+			      &pktgen_thread_proc_ops, t);
 	if (!pe) {
 		pr_err("cannot create %s/%s procfs entry\n",
 		       PG_PROC_DIR, t->tsk->comm);
@@ -3793,7 +3793,7 @@ static int __net_init pg_net_init(struct
 		pr_warn("cannot create /proc/net/%s\n", PG_PROC_DIR);
 		return -ENODEV;
 	}
-	pe = proc_create(PGCTRL, 0600, pn->proc_dir, &pktgen_fops);
+	pe = proc_create(PGCTRL, 0600, pn->proc_dir, &pktgen_proc_ops);
 	if (pe == NULL) {
 		pr_err("cannot create %s procfs entry\n", PGCTRL);
 		ret = -EINVAL;
--- a/net/ipv4/ipconfig.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/ipv4/ipconfig.c
@@ -1334,7 +1334,7 @@ static int __init ipconfig_proc_net_init
 
 /* Create a new file under /proc/net/ipconfig */
 static int ipconfig_proc_net_create(const char *name,
-				    const struct file_operations *fops)
+				    const struct proc_ops *proc_ops)
 {
 	char *pname;
 	struct proc_dir_entry *p;
@@ -1346,7 +1346,7 @@ static int ipconfig_proc_net_create(cons
 	if (!pname)
 		return -ENOMEM;
 
-	p = proc_create(pname, 0444, init_net.proc_net, fops);
+	p = proc_create(pname, 0444, init_net.proc_net, proc_ops);
 	kfree(pname);
 	if (!p)
 		return -ENOMEM;
@@ -1355,7 +1355,7 @@ static int ipconfig_proc_net_create(cons
 }
 
 /* Write NTP server IP addresses to /proc/net/ipconfig/ntp_servers */
-static int ntp_servers_seq_show(struct seq_file *seq, void *v)
+static int ntp_servers_show(struct seq_file *seq, void *v)
 {
 	int i;
 
@@ -1365,7 +1365,7 @@ static int ntp_servers_seq_show(struct s
 	}
 	return 0;
 }
-DEFINE_SHOW_ATTRIBUTE(ntp_servers_seq);
+DEFINE_PROC_SHOW_ATTRIBUTE(ntp_servers);
 #endif /* CONFIG_PROC_FS */
 
 /*
@@ -1456,7 +1456,7 @@ static int __init ip_auto_config(void)
 	proc_create_single("pnp", 0444, init_net.proc_net, pnp_seq_show);
 
 	if (ipconfig_proc_net_init() == 0)
-		ipconfig_proc_net_create("ntp_servers", &ntp_servers_seq_fops);
+		ipconfig_proc_net_create("ntp_servers", &ntp_servers_proc_ops);
 #endif /* CONFIG_PROC_FS */
 
 	if (!ic_enable)
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -58,7 +58,7 @@ struct clusterip_config {
 };
 
 #ifdef CONFIG_PROC_FS
-static const struct file_operations clusterip_proc_fops;
+static const struct proc_ops clusterip_proc_ops;
 #endif
 
 struct clusterip_net {
@@ -280,7 +280,7 @@ clusterip_config_init(struct net *net, c
 		mutex_lock(&cn->mutex);
 		c->pde = proc_create_data(buffer, 0600,
 					  cn->procdir,
-					  &clusterip_proc_fops, c);
+					  &clusterip_proc_ops, c);
 		mutex_unlock(&cn->mutex);
 		if (!c->pde) {
 			err = -ENOMEM;
@@ -804,12 +804,12 @@ static ssize_t clusterip_proc_write(stru
 	return size;
 }
 
-static const struct file_operations clusterip_proc_fops = {
-	.open	 = clusterip_proc_open,
-	.read	 = seq_read,
-	.write	 = clusterip_proc_write,
-	.llseek	 = seq_lseek,
-	.release = clusterip_proc_release,
+static const struct proc_ops clusterip_proc_ops = {
+	.proc_open	= clusterip_proc_open,
+	.proc_read	= seq_read,
+	.proc_write	= clusterip_proc_write,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= clusterip_proc_release,
 };
 
 #endif /* CONFIG_PROC_FS */
--- a/net/ipv4/route.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/ipv4/route.c
@@ -237,11 +237,11 @@ static int rt_cache_seq_open(struct inod
 	return seq_open(file, &rt_cache_seq_ops);
 }
 
-static const struct file_operations rt_cache_seq_fops = {
-	.open	 = rt_cache_seq_open,
-	.read	 = seq_read,
-	.llseek	 = seq_lseek,
-	.release = seq_release,
+static const struct proc_ops rt_cache_proc_ops = {
+	.proc_open	= rt_cache_seq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 
@@ -328,11 +328,11 @@ static int rt_cpu_seq_open(struct inode
 	return seq_open(file, &rt_cpu_seq_ops);
 }
 
-static const struct file_operations rt_cpu_seq_fops = {
-	.open	 = rt_cpu_seq_open,
-	.read	 = seq_read,
-	.llseek	 = seq_lseek,
-	.release = seq_release,
+static const struct proc_ops rt_cpu_proc_ops = {
+	.proc_open	= rt_cpu_seq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
 };
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
@@ -366,12 +366,12 @@ static int __net_init ip_rt_do_proc_init
 	struct proc_dir_entry *pde;
 
 	pde = proc_create("rt_cache", 0444, net->proc_net,
-			  &rt_cache_seq_fops);
+			  &rt_cache_proc_ops);
 	if (!pde)
 		goto err1;
 
 	pde = proc_create("rt_cache", 0444,
-			  net->proc_net_stat, &rt_cpu_seq_fops);
+			  net->proc_net_stat, &rt_cpu_proc_ops);
 	if (!pde)
 		goto err2;
 
--- a/net/netfilter/xt_recent.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/netfilter/xt_recent.c
@@ -103,7 +103,7 @@ static DEFINE_SPINLOCK(recent_lock);
 static DEFINE_MUTEX(recent_mutex);
 
 #ifdef CONFIG_PROC_FS
-static const struct file_operations recent_mt_fops;
+static const struct proc_ops recent_mt_proc_ops;
 #endif
 
 static u_int32_t hash_rnd __read_mostly;
@@ -405,7 +405,7 @@ static int recent_mt_check(const struct
 		goto out;
 	}
 	pde = proc_create_data(t->name, ip_list_perms, recent_net->xt_recent,
-		  &recent_mt_fops, t);
+			       &recent_mt_proc_ops, t);
 	if (pde == NULL) {
 		recent_table_free(t);
 		ret = -ENOMEM;
@@ -616,13 +616,12 @@ recent_mt_proc_write(struct file *file,
 	return size + 1;
 }
 
-static const struct file_operations recent_mt_fops = {
-	.open    = recent_seq_open,
-	.read    = seq_read,
-	.write   = recent_mt_proc_write,
-	.release = seq_release_private,
-	.owner   = THIS_MODULE,
-	.llseek = seq_lseek,
+static const struct proc_ops recent_mt_proc_ops = {
+	.proc_open	= recent_seq_open,
+	.proc_read	= seq_read,
+	.proc_write	= recent_mt_proc_write,
+	.proc_release	= seq_release_private,
+	.proc_lseek	= seq_lseek,
 };
 
 static int __net_init recent_proc_net_init(struct net *net)
--- a/net/sunrpc/auth_gss/svcauth_gss.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/sunrpc/auth_gss/svcauth_gss.c
@@ -1428,10 +1428,10 @@ static ssize_t read_gssp(struct file *fi
 	return len;
 }
 
-static const struct file_operations use_gss_proxy_ops = {
-	.open = nonseekable_open,
-	.write = write_gssp,
-	.read = read_gssp,
+static const struct proc_ops use_gss_proxy_proc_ops = {
+	.proc_open	= nonseekable_open,
+	.proc_write	= write_gssp,
+	.proc_read	= read_gssp,
 };
 
 static int create_use_gss_proxy_proc_entry(struct net *net)
@@ -1442,7 +1442,7 @@ static int create_use_gss_proxy_proc_ent
 	sn->use_gss_proxy = -1;
 	*p = proc_create_data("use-gss-proxy", S_IFREG | 0600,
 			      sn->proc_net_rpc,
-			      &use_gss_proxy_ops, net);
+			      &use_gss_proxy_proc_ops, net);
 	if (!*p)
 		return -ENOMEM;
 	init_gssp_clnt(sn);
--- a/net/sunrpc/cache.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/sunrpc/cache.c
@@ -1571,15 +1571,14 @@ static int cache_release_procfs(struct i
 	return cache_release(inode, filp, cd);
 }
 
-static const struct file_operations cache_file_operations_procfs = {
-	.owner		= THIS_MODULE,
-	.llseek		= no_llseek,
-	.read		= cache_read_procfs,
-	.write		= cache_write_procfs,
-	.poll		= cache_poll_procfs,
-	.unlocked_ioctl	= cache_ioctl_procfs, /* for FIONREAD */
-	.open		= cache_open_procfs,
-	.release	= cache_release_procfs,
+static const struct proc_ops cache_channel_proc_ops = {
+	.proc_lseek	= no_llseek,
+	.proc_read	= cache_read_procfs,
+	.proc_write	= cache_write_procfs,
+	.proc_poll	= cache_poll_procfs,
+	.proc_ioctl	= cache_ioctl_procfs, /* for FIONREAD */
+	.proc_open	= cache_open_procfs,
+	.proc_release	= cache_release_procfs,
 };
 
 static int content_open_procfs(struct inode *inode, struct file *filp)
@@ -1596,11 +1595,11 @@ static int content_release_procfs(struct
 	return content_release(inode, filp, cd);
 }
 
-static const struct file_operations content_file_operations_procfs = {
-	.open		= content_open_procfs,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= content_release_procfs,
+static const struct proc_ops content_proc_ops = {
+	.proc_open	= content_open_procfs,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= content_release_procfs,
 };
 
 static int open_flush_procfs(struct inode *inode, struct file *filp)
@@ -1634,12 +1633,12 @@ static ssize_t write_flush_procfs(struct
 	return write_flush(filp, buf, count, ppos, cd);
 }
 
-static const struct file_operations cache_flush_operations_procfs = {
-	.open		= open_flush_procfs,
-	.read		= read_flush_procfs,
-	.write		= write_flush_procfs,
-	.release	= release_flush_procfs,
-	.llseek		= no_llseek,
+static const struct proc_ops cache_flush_proc_ops = {
+	.proc_open	= open_flush_procfs,
+	.proc_read	= read_flush_procfs,
+	.proc_write	= write_flush_procfs,
+	.proc_release	= release_flush_procfs,
+	.proc_lseek	= no_llseek,
 };
 
 static void remove_cache_proc_entries(struct cache_detail *cd)
@@ -1662,19 +1661,19 @@ static int create_cache_proc_entries(str
 		goto out_nomem;
 
 	p = proc_create_data("flush", S_IFREG | 0600,
-			     cd->procfs, &cache_flush_operations_procfs, cd);
+			     cd->procfs, &cache_flush_proc_ops, cd);
 	if (p == NULL)
 		goto out_nomem;
 
 	if (cd->cache_request || cd->cache_parse) {
 		p = proc_create_data("channel", S_IFREG | 0600, cd->procfs,
-				     &cache_file_operations_procfs, cd);
+				     &cache_channel_proc_ops, cd);
 		if (p == NULL)
 			goto out_nomem;
 	}
 	if (cd->cache_show) {
 		p = proc_create_data("content", S_IFREG | 0400, cd->procfs,
-				     &content_file_operations_procfs, cd);
+				     &content_proc_ops, cd);
 		if (p == NULL)
 			goto out_nomem;
 	}
--- a/net/sunrpc/stats.c~proc-convert-everything-to-struct-proc_ops
+++ a/net/sunrpc/stats.c
@@ -69,12 +69,11 @@ static int rpc_proc_open(struct inode *i
 	return single_open(file, rpc_proc_show, PDE_DATA(inode));
 }
 
-static const struct file_operations rpc_proc_fops = {
-	.owner = THIS_MODULE,
-	.open = rpc_proc_open,
-	.read  = seq_read,
-	.llseek = seq_lseek,
-	.release = single_release,
+static const struct proc_ops rpc_proc_ops = {
+	.proc_open	= rpc_proc_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= single_release,
 };
 
 /*
@@ -281,19 +280,19 @@ EXPORT_SYMBOL_GPL(rpc_clnt_show_stats);
  */
 static inline struct proc_dir_entry *
 do_register(struct net *net, const char *name, void *data,
-	    const struct file_operations *fops)
+	    const struct proc_ops *proc_ops)
 {
 	struct sunrpc_net *sn;
 
 	dprintk("RPC:       registering /proc/net/rpc/%s\n", name);
 	sn = net_generic(net, sunrpc_net_id);
-	return proc_create_data(name, 0, sn->proc_net_rpc, fops, data);
+	return proc_create_data(name, 0, sn->proc_net_rpc, proc_ops, data);
 }
 
 struct proc_dir_entry *
 rpc_proc_register(struct net *net, struct rpc_stat *statp)
 {
-	return do_register(net, statp->program->name, statp, &rpc_proc_fops);
+	return do_register(net, statp->program->name, statp, &rpc_proc_ops);
 }
 EXPORT_SYMBOL_GPL(rpc_proc_register);
 
@@ -308,9 +307,9 @@ rpc_proc_unregister(struct net *net, con
 EXPORT_SYMBOL_GPL(rpc_proc_unregister);
 
 struct proc_dir_entry *
-svc_proc_register(struct net *net, struct svc_stat *statp, const struct file_operations *fops)
+svc_proc_register(struct net *net, struct svc_stat *statp, const struct proc_ops *proc_ops)
 {
-	return do_register(net, statp->program->pg_name, statp, fops);
+	return do_register(net, statp->program->pg_name, statp, proc_ops);
 }
 EXPORT_SYMBOL_GPL(svc_proc_register);
 
--- a/samples/kfifo/bytestream-example.c~proc-convert-everything-to-struct-proc_ops
+++ a/samples/kfifo/bytestream-example.c
@@ -142,11 +142,10 @@ static ssize_t fifo_read(struct file *fi
 	return ret ? ret : copied;
 }
 
-static const struct file_operations fifo_fops = {
-	.owner		= THIS_MODULE,
-	.read		= fifo_read,
-	.write		= fifo_write,
-	.llseek		= noop_llseek,
+static const struct proc_ops fifo_proc_ops = {
+	.proc_read	= fifo_read,
+	.proc_write	= fifo_write,
+	.proc_lseek	= noop_llseek,
 };
 
 static int __init example_init(void)
@@ -169,7 +168,7 @@ static int __init example_init(void)
 		return -EIO;
 	}
 
-	if (proc_create(PROC_FIFO, 0, NULL, &fifo_fops) == NULL) {
+	if (proc_create(PROC_FIFO, 0, NULL, &fifo_proc_ops) == NULL) {
 #ifdef DYNAMIC
 		kfifo_free(&test);
 #endif
--- a/samples/kfifo/inttype-example.c~proc-convert-everything-to-struct-proc_ops
+++ a/samples/kfifo/inttype-example.c
@@ -135,11 +135,10 @@ static ssize_t fifo_read(struct file *fi
 	return ret ? ret : copied;
 }
 
-static const struct file_operations fifo_fops = {
-	.owner		= THIS_MODULE,
-	.read		= fifo_read,
-	.write		= fifo_write,
-	.llseek		= noop_llseek,
+static const struct proc_ops fifo_proc_ops = {
+	.proc_read	= fifo_read,
+	.proc_write	= fifo_write,
+	.proc_lseek	= noop_llseek,
 };
 
 static int __init example_init(void)
@@ -160,7 +159,7 @@ static int __init example_init(void)
 		return -EIO;
 	}
 
-	if (proc_create(PROC_FIFO, 0, NULL, &fifo_fops) == NULL) {
+	if (proc_create(PROC_FIFO, 0, NULL, &fifo_proc_ops) == NULL) {
 #ifdef DYNAMIC
 		kfifo_free(&test);
 #endif
--- a/samples/kfifo/record-example.c~proc-convert-everything-to-struct-proc_ops
+++ a/samples/kfifo/record-example.c
@@ -149,11 +149,10 @@ static ssize_t fifo_read(struct file *fi
 	return ret ? ret : copied;
 }
 
-static const struct file_operations fifo_fops = {
-	.owner		= THIS_MODULE,
-	.read		= fifo_read,
-	.write		= fifo_write,
-	.llseek		= noop_llseek,
+static const struct proc_ops fifo_proc_ops = {
+	.proc_read	= fifo_read,
+	.proc_write	= fifo_write,
+	.proc_lseek	= noop_llseek,
 };
 
 static int __init example_init(void)
@@ -176,7 +175,7 @@ static int __init example_init(void)
 		return -EIO;
 	}
 
-	if (proc_create(PROC_FIFO, 0, NULL, &fifo_fops) == NULL) {
+	if (proc_create(PROC_FIFO, 0, NULL, &fifo_proc_ops) == NULL) {
 #ifdef DYNAMIC
 		kfifo_free(&test);
 #endif
--- a/sound/core/info.c~proc-convert-everything-to-struct-proc_ops
+++ a/sound/core/info.c
@@ -282,17 +282,16 @@ static int snd_info_entry_release(struct
 	return 0;
 }
 
-static const struct file_operations snd_info_entry_operations =
+static const struct proc_ops snd_info_entry_operations =
 {
-	.owner =		THIS_MODULE,
-	.llseek =		snd_info_entry_llseek,
-	.read =			snd_info_entry_read,
-	.write =		snd_info_entry_write,
-	.poll =			snd_info_entry_poll,
-	.unlocked_ioctl =	snd_info_entry_ioctl,
-	.mmap =			snd_info_entry_mmap,
-	.open =			snd_info_entry_open,
-	.release =		snd_info_entry_release,
+	.proc_lseek	= snd_info_entry_llseek,
+	.proc_read	= snd_info_entry_read,
+	.proc_write	= snd_info_entry_write,
+	.proc_poll	= snd_info_entry_poll,
+	.proc_ioctl	= snd_info_entry_ioctl,
+	.proc_mmap	= snd_info_entry_mmap,
+	.proc_open	= snd_info_entry_open,
+	.proc_release	= snd_info_entry_release,
 };
 
 /*
@@ -421,14 +420,13 @@ static int snd_info_text_entry_release(s
 	return 0;
 }
 
-static const struct file_operations snd_info_text_entry_ops =
+static const struct proc_ops snd_info_text_entry_ops =
 {
-	.owner =		THIS_MODULE,
-	.open =			snd_info_text_entry_open,
-	.release =		snd_info_text_entry_release,
-	.write =		snd_info_text_entry_write,
-	.llseek =		seq_lseek,
-	.read =			seq_read,
+	.proc_open	= snd_info_text_entry_open,
+	.proc_release	= snd_info_text_entry_release,
+	.proc_write	= snd_info_text_entry_write,
+	.proc_lseek	= seq_lseek,
+	.proc_read	= seq_read,
 };
 
 static struct snd_info_entry *create_subdir(struct module *mod,
@@ -810,7 +808,7 @@ static int __snd_info_register(struct sn
 			return -ENOMEM;
 		}
 	} else {
-		const struct file_operations *ops;
+		const struct proc_ops *ops;
 		if (entry->content == SNDRV_INFO_CONTENT_DATA)
 			ops = &snd_info_entry_operations;
 		else
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 59/67] lib/string: add strnchrnul()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (57 preceding siblings ...)
  2020-02-04  1:37 ` [patch 58/67] proc: convert everything to " Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 60/67] bitops: more BITS_TO_* macros Andrew Morton
                   ` (179 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: acme, akpm, amritha.nambiar, andriy.shevchenko, chris, keescook,
	linux-mm, linux, mm-commits, mszeredi, steffen.klassert, tobin,
	torvalds, vineet.gupta1, will.deacon, willemb, willy, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib/string: add strnchrnul()

Patch series "lib: rework bitmap_parse", v5.

Similarly to the recently revisited bitmap_parselist(), bitmap_parse() is
inefficient and overcomplicated.  This series reworks it, aligns its
interface with bitmap_parselist() and makes it simpler to use.

The series also adds a test for the function and fixes its usage in
cpumask_parse() according to the new design - dropping the calculation of
the input string's length.

bitmap_parse() takes the array of numbers to be put into the map in BE
order, which is the reverse of the natural LE order for bitmaps.  For
example, to construct a bitmap with bit 42 set, we have to pass the line
'400,0'.  The current implementation reads the chunks one by one from the
beginning ('400' before '0') and shifts the bitmap after each successfully
parsed chunk, which makes the whole process O(n^2).  We can parse in the
reverse direction ('0' before '400') and avoid the shifting, but that
requires reverse parsing helpers.
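
As an illustration (a user-space sketch, not part of the series): the
chunks are most-significant first, 32 bits each, so '400,0' corresponds
to (0x400ULL << 32) | 0x0, i.e. only bit 42 is set:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* "400,0" parsed as two 32-bit chunks, MSB chunk first */
		uint64_t map = ((uint64_t)0x400 << 32) | 0x0;

		/* prints 1: 0x400 == 1 << 10, shifted up by 32 gives bit 42 */
		printf("bit 42: %lu\n", (unsigned long)((map >> 42) & 1));
		return 0;
	}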


This patch (of 7):

The new function works like strchrnul(), but on a length-limited string.
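
For example (behavior per the implementation below, illustrative only):

	const char *s = "abc";

	strnchrnul(s, 3, 'b');	/* returns s + 1, pointing at 'b' */
	strnchrnul(s, 2, 'x');	/* 'x' not in the first 2 chars: returns s + 2 */
	strnchrnul(s, 8, 'x');	/* hits the '\0' first: returns s + 3 */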

Link: http://lkml.kernel.org/r/20200102043031.30357-2-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/string.h |    1 +
 lib/string.c           |   17 +++++++++++++++++
 2 files changed, 18 insertions(+)

--- a/include/linux/string.h~lib-string-add-strnchrnul
+++ a/include/linux/string.h
@@ -62,6 +62,7 @@ extern char * strchr(const char *,int);
 #ifndef __HAVE_ARCH_STRCHRNUL
 extern char * strchrnul(const char *,int);
 #endif
+extern char * strnchrnul(const char *, size_t, int);
 #ifndef __HAVE_ARCH_STRNCHR
 extern char * strnchr(const char *, size_t, int);
 #endif
--- a/lib/string.c~lib-string-add-strnchrnul
+++ a/lib/string.c
@@ -434,6 +434,23 @@ char *strchrnul(const char *s, int c)
 EXPORT_SYMBOL(strchrnul);
 #endif
 
+/**
+ * strnchrnul - Find and return a character in a length limited string,
+ * or end of string
+ * @s: The string to be searched
+ * @count: The number of characters to be searched
+ * @c: The character to search for
+ *
+ * Returns pointer to the first occurrence of 'c' in s. If c is not found,
+ * then return a pointer to the '\0' or to the first character past @count.
+ */
+char *strnchrnul(const char *s, size_t count, int c)
+{
+	while (count-- && *s && *s != (char)c)
+		s++;
+	return (char *)s;
+}
+
 #ifndef __HAVE_ARCH_STRRCHR
 /**
  * strrchr - Find the last occurrence of a character in a string
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 60/67] bitops: more BITS_TO_* macros
  2020-02-04  1:33 incoming Andrew Morton
                   ` (58 preceding siblings ...)
  2020-02-04  1:37 ` [patch 59/67] lib/string: add strnchrnul() Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 61/67] lib: add test for bitmap_parse() Andrew Morton
                   ` (178 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: acme, akpm, amritha.nambiar, andriy.shevchenko, chris, keescook,
	linux-mm, linux, mm-commits, mszeredi, steffen.klassert, tobin,
	torvalds, vineet.gupta1, will.deacon, willemb, willy, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: bitops: more BITS_TO_* macros

Introduce BITS_TO_U64, BITS_TO_U32 and BITS_TO_BYTES, as they are handy in
the following patches (BITS_TO_U32 specifically).  Reimplement the tools/
version of the macros to match the kernel implementation.

Also fix indentation for BITS_PER_TYPE definition.
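
For example, for nr = 42 the macros expand as follows (a quick sanity
check of the arithmetic):

	BITS_TO_U64(42)   == DIV_ROUND_UP(42, 64) == 1
	BITS_TO_U32(42)   == DIV_ROUND_UP(42, 32) == 2
	BITS_TO_BYTES(42) == DIV_ROUND_UP(42, 8)  == 6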

Link: http://lkml.kernel.org/r/20200102043031.30357-3-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/bitops.h       |    4 +++-
 tools/include/linux/bitops.h |    9 +++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

--- a/include/linux/bitops.h~bitops-more-bits_to_-macros
+++ a/include/linux/bitops.h
@@ -11,8 +11,10 @@
 #  define aligned_byte_mask(n) (~0xffUL << (BITS_PER_LONG - 8 - 8*(n)))
 #endif
 
-#define BITS_PER_TYPE(type) (sizeof(type) * BITS_PER_BYTE)
+#define BITS_PER_TYPE(type)	(sizeof(type) * BITS_PER_BYTE)
 #define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
+#define BITS_TO_U64(nr)		DIV_ROUND_UP(nr, BITS_PER_TYPE(u64))
+#define BITS_TO_U32(nr)		DIV_ROUND_UP(nr, BITS_PER_TYPE(u32))
 #define BITS_TO_BYTES(nr)	DIV_ROUND_UP(nr, BITS_PER_TYPE(char))
 
 extern unsigned int __sw_hweight8(unsigned int w);
--- a/tools/include/linux/bitops.h~bitops-more-bits_to_-macros
+++ a/tools/include/linux/bitops.h
@@ -14,10 +14,11 @@
 #include <linux/bits.h>
 #include <linux/compiler.h>
 
-#define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
-#define BITS_TO_U64(nr)		DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u64))
-#define BITS_TO_U32(nr)		DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(u32))
-#define BITS_TO_BYTES(nr)	DIV_ROUND_UP(nr, BITS_PER_BYTE)
+#define BITS_PER_TYPE(type)	(sizeof(type) * BITS_PER_BYTE)
+#define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, BITS_PER_TYPE(long))
+#define BITS_TO_U64(nr)		DIV_ROUND_UP(nr, BITS_PER_TYPE(u64))
+#define BITS_TO_U32(nr)		DIV_ROUND_UP(nr, BITS_PER_TYPE(u32))
+#define BITS_TO_BYTES(nr)	DIV_ROUND_UP(nr, BITS_PER_TYPE(char))
 
 extern unsigned int __sw_hweight8(unsigned int w);
 extern unsigned int __sw_hweight16(unsigned int w);
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 61/67] lib: add test for bitmap_parse()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (59 preceding siblings ...)
  2020-02-04  1:37 ` [patch 60/67] bitops: more BITS_TO_* macros Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 62/67] lib: make bitmap_parse_user a wrapper on bitmap_parse Andrew Morton
                   ` (177 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: acme, akpm, amritha.nambiar, andriy.shevchenko, chris, keescook,
	linux-mm, linux, mm-commits, mszeredi, steffen.klassert, tobin,
	torvalds, vineet.gupta1, will.deacon, willemb, willy, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib: add test for bitmap_parse()

The test is derived from the bitmap_parselist() one.  NO_LEN is reserved
for use in the following patches.

[yury.norov@gmail.com: fix rebase issue]
  Link: http://lkml.kernel.org/r/20200102182659.6685-1-yury.norov@gmail.com
[andriy.shevchenko@linux.intel.com: fix address space when test user buffer]
  Link: http://lkml.kernel.org/r/20200109103601.45929-2-andriy.shevchenko@linux.intel.com
Link: http://lkml.kernel.org/r/20200102043031.30357-4-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_bitmap.c |   97 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 2 deletions(-)

--- a/lib/test_bitmap.c~lib-add-test-for-bitmap_parse
+++ a/lib/test_bitmap.c
@@ -295,7 +295,8 @@ static void __init test_replace(void)
 	expect_eq_bitmap(bmap, exp3_1_0, nbits);
 }
 
-#define PARSE_TIME 0x1
+#define PARSE_TIME	0x1
+#define NO_LEN		0x2
 
 struct test_bitmap_parselist{
 	const int errno;
@@ -349,7 +350,6 @@ static const struct test_bitmap_parselis
 	{-EINVAL, "0-31:a/1", NULL, 8, 0},
 	{-EINVAL, "0-\n", NULL, 8, 0},
 
-#undef step
 };
 
 static void __init __test_bitmap_parselist(int is_user)
@@ -401,6 +401,87 @@ static void __init __test_bitmap_parseli
 	}
 }
 
+static const unsigned long parse_test[] __initconst = {
+	BITMAP_FROM_U64(0),
+	BITMAP_FROM_U64(1),
+	BITMAP_FROM_U64(0xdeadbeef),
+	BITMAP_FROM_U64(0x100000000ULL),
+};
+
+static const unsigned long parse_test2[] __initconst = {
+	BITMAP_FROM_U64(0x100000000ULL), BITMAP_FROM_U64(0xdeadbeef),
+	BITMAP_FROM_U64(0x100000000ULL), BITMAP_FROM_U64(0xbaadf00ddeadbeef),
+	BITMAP_FROM_U64(0x100000000ULL), BITMAP_FROM_U64(0x0badf00ddeadbeef),
+};
+
+static const struct test_bitmap_parselist parse_tests[] __initconst = {
+	{0, "0",			&parse_test[0 * step], 32, 0},
+	{0, "1",			&parse_test[1 * step], 32, 0},
+	{0, "deadbeef",			&parse_test[2 * step], 32, 0},
+	{0, "1,0",			&parse_test[3 * step], 33, 0},
+
+	{0, "deadbeef,1,0",		&parse_test2[0 * 2 * step], 96, 0},
+	{0, "baadf00d,deadbeef,1,0",	&parse_test2[1 * 2 * step], 128, 0},
+	{0, "badf00d,deadbeef,1,0",	&parse_test2[2 * 2 * step], 124, 0},
+
+	{-EINVAL,    "goodfood,deadbeef,1,0",	NULL, 128, 0},
+	{-EOVERFLOW, "3,0",			NULL, 33, 0},
+	{-EOVERFLOW, "123badf00d,deadbeef,1,0",	NULL, 128, 0},
+	{-EOVERFLOW, "badf00d,deadbeef,1,0",	NULL, 90, 0},
+	{-EOVERFLOW, "fbadf00d,deadbeef,1,0",	NULL, 95, 0},
+	{-EOVERFLOW, "badf00d,deadbeef,1,0",	NULL, 100, 0},
+#undef step
+};
+
+static void __init __test_bitmap_parse(int is_user)
+{
+	int i;
+	int err;
+	ktime_t time;
+	DECLARE_BITMAP(bmap, 2048);
+	char *mode = is_user ? "_user"  : "";
+
+	for (i = 0; i < ARRAY_SIZE(parse_tests); i++) {
+		struct test_bitmap_parselist test = parse_tests[i];
+
+		if (is_user) {
+			size_t len = strlen(test.in);
+			mm_segment_t orig_fs = get_fs();
+
+			set_fs(KERNEL_DS);
+			time = ktime_get();
+			err = bitmap_parse_user((__force const char __user *)test.in, len,
+						bmap, test.nbits);
+			time = ktime_get() - time;
+			set_fs(orig_fs);
+		} else {
+			size_t len = test.flags & NO_LEN ?
+				UINT_MAX : strlen(test.in);
+			time = ktime_get();
+			err = bitmap_parse(test.in, len, bmap, test.nbits);
+			time = ktime_get() - time;
+		}
+
+		if (err != test.errno) {
+			pr_err("parse%s: %d: input is %s, errno is %d, expected %d\n",
+					mode, i, test.in, err, test.errno);
+			continue;
+		}
+
+		if (!err && test.expected
+			 && !__bitmap_equal(bmap, test.expected, test.nbits)) {
+			pr_err("parse%s: %d: input is %s, result is 0x%lx, expected 0x%lx\n",
+					mode, i, test.in, bmap[0],
+					*test.expected);
+			continue;
+		}
+
+		if (test.flags & PARSE_TIME)
+			pr_err("parse%s: %d: input is '%s' OK, Time: %llu\n",
+					mode, i, test.in, time);
+	}
+}
+
 static void __init test_bitmap_parselist(void)
 {
 	__test_bitmap_parselist(0);
@@ -411,6 +492,16 @@ static void __init test_bitmap_parselist
 	__test_bitmap_parselist(1);
 }
 
+static void __init test_bitmap_parse(void)
+{
+	__test_bitmap_parse(0);
+}
+
+static void __init test_bitmap_parse_user(void)
+{
+	__test_bitmap_parse(1);
+}
+
 #define EXP1_IN_BITS	(sizeof(exp1) * 8)
 
 static void __init test_bitmap_arr32(void)
@@ -516,6 +607,8 @@ static void __init selftest(void)
 	test_copy();
 	test_replace();
 	test_bitmap_arr32();
+	test_bitmap_parse();
+	test_bitmap_parse_user();
 	test_bitmap_parselist();
 	test_bitmap_parselist_user();
 	test_mem_optimisations();
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 62/67] lib: make bitmap_parse_user a wrapper on bitmap_parse
  2020-02-04  1:33 incoming Andrew Morton
                   ` (60 preceding siblings ...)
  2020-02-04  1:37 ` [patch 61/67] lib: add test for bitmap_parse() Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 63/67] lib: rework bitmap_parse() Andrew Morton
                   ` (176 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: acme, akpm, amritha.nambiar, andriy.shevchenko, chris, keescook,
	linux-mm, linux, mm-commits, mszeredi, steffen.klassert, tobin,
	torvalds, vineet.gupta1, will.deacon, willemb, willy, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib: make bitmap_parse_user a wrapper on bitmap_parse

Currently we parse user data byte after byte, which overcomplicates the
parsing algorithm.  There are no performance-critical users of
bitmap_parse_user(), so we can duplicate the user data into a kernel
buffer and simply call bitmap_parse().  This rework lets us unify and
simplify bitmap_parse() and bitmap_parse_user(), which is done in the
following patch.
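
A sketch of a typical caller after this change (the handler and its file
are hypothetical):

	static ssize_t mask_write(struct file *file, const char __user *ubuf,
				  size_t count, loff_t *ppos)
	{
		DECLARE_BITMAP(mask, 128);
		int err;

		/* memdup_user_nul() inside bitmap_parse_user() copies and
		 * NUL-terminates the user buffer for us */
		err = bitmap_parse_user(ubuf, count, mask, 128);
		return err ? err : count;
	}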

Link: http://lkml.kernel.org/r/20200102043031.30357-5-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/bitmap.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

--- a/lib/bitmap.c~lib-make-bitmap_parse_user-a-wrapper-on-bitmap_parse
+++ a/lib/bitmap.c
@@ -530,22 +530,22 @@ EXPORT_SYMBOL(__bitmap_parse);
  *    then it must be terminated with a \0.
  * @maskp: pointer to bitmap array that will contain result.
  * @nmaskbits: size of bitmap, in bits.
- *
- * Wrapper for __bitmap_parse(), providing it with user buffer.
- *
- * We cannot have this as an inline function in bitmap.h because it needs
- * linux/uaccess.h to get the access_ok() declaration and this causes
- * cyclic dependencies.
  */
 int bitmap_parse_user(const char __user *ubuf,
 			unsigned int ulen, unsigned long *maskp,
 			int nmaskbits)
 {
-	if (!access_ok(ubuf, ulen))
-		return -EFAULT;
-	return __bitmap_parse((const char __force *)ubuf,
-				ulen, 1, maskp, nmaskbits);
+	char *buf;
+	int ret;
 
+	buf = memdup_user_nul(ubuf, ulen);
+	if (IS_ERR(buf))
+		return PTR_ERR(buf);
+
+	ret = bitmap_parse(buf, ulen, maskp, nmaskbits);
+
+	kfree(buf);
+	return ret;
 }
 EXPORT_SYMBOL(bitmap_parse_user);
 
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 63/67] lib: rework bitmap_parse()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (61 preceding siblings ...)
  2020-02-04  1:37 ` [patch 62/67] lib: make bitmap_parse_user a wrapper on bitmap_parse Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 64/67] lib: new testcases for bitmap_parse{_user} Andrew Morton
                   ` (175 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: acme, akpm, amritha.nambiar, andriy.shevchenko, chris, keescook,
	linux-mm, linux, mm-commits, mszeredi, steffen.klassert, tobin,
	torvalds, vineet.gupta1, will.deacon, willemb, willy, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib: rework bitmap_parse()

bitmap_parse() is inefficient and full of opaque variables and open-coded
parts, which makes it hard to understand and use.  This rework
includes:

- remove the bitmap_shift_left() call from the loop.  That call is what
  made the complexity of the whole algorithm O(nbits^2); in the suggested
  approach the input string is parsed in the reverse direction, so no
  shifts are needed;

- relax the requirement of a single comma and no whitespace between
  chunks.  This is considered useful in scripting, and it aligns with
  bitmap_parselist();

- split bitmap_parse() into small, readable helpers;

- calculate the end of the input line explicitly at the beginning, so
  callers of bitmap_parse() need not do it themselves.
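
With the reworked function, for example (a usage sketch consistent with
the new semantics), parsing starts from the tail of the string, so the
last chunk lands in the low 32 bits with no shifting:

	DECLARE_BITMAP(mask, 64);
	int err = bitmap_parse("400,0", UINT_MAX, mask, 64);

	/* on success err == 0 and only bit 42 is set in mask */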

Link: http://lkml.kernel.org/r/20200102043031.30357-6-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/bitmap.h |    8 -
 lib/bitmap.c           |  175 ++++++++++++++++++---------------------
 2 files changed, 84 insertions(+), 99 deletions(-)

--- a/include/linux/bitmap.h~lib-rework-bitmap_parse
+++ a/include/linux/bitmap.h
@@ -186,7 +186,7 @@ bitmap_find_next_zero_area(unsigned long
 					      align_mask, 0);
 }
 
-extern int __bitmap_parse(const char *buf, unsigned int buflen, int is_user,
+extern int bitmap_parse(const char *buf, unsigned int buflen,
 			unsigned long *dst, int nbits);
 extern int bitmap_parse_user(const char __user *ubuf, unsigned int ulen,
 			unsigned long *dst, int nbits);
@@ -454,12 +454,6 @@ static inline void bitmap_replace(unsign
 		__bitmap_replace(dst, old, new, mask, nbits);
 }
 
-static inline int bitmap_parse(const char *buf, unsigned int buflen,
-			unsigned long *maskp, int nmaskbits)
-{
-	return __bitmap_parse(buf, buflen, 0, maskp, nmaskbits);
-}
-
 static inline void bitmap_next_clear_region(unsigned long *bitmap,
 					    unsigned int *rs, unsigned int *re,
 					    unsigned int end)
--- a/lib/bitmap.c~lib-rework-bitmap_parse
+++ a/lib/bitmap.c
@@ -431,97 +431,6 @@ EXPORT_SYMBOL(bitmap_find_next_zero_area
  * second version by Paul Jackson, third by Joe Korty.
  */
 
-#define CHUNKSZ				32
-#define nbits_to_hold_value(val)	fls(val)
-#define BASEDEC 10		/* fancier cpuset lists input in decimal */
-
-/**
- * __bitmap_parse - convert an ASCII hex string into a bitmap.
- * @buf: pointer to buffer containing string.
- * @buflen: buffer size in bytes.  If string is smaller than this
- *    then it must be terminated with a \0.
- * @is_user: location of buffer, 0 indicates kernel space
- * @maskp: pointer to bitmap array that will contain result.
- * @nmaskbits: size of bitmap, in bits.
- *
- * Commas group hex digits into chunks.  Each chunk defines exactly 32
- * bits of the resultant bitmask.  No chunk may specify a value larger
- * than 32 bits (%-EOVERFLOW), and if a chunk specifies a smaller value
- * then leading 0-bits are prepended.  %-EINVAL is returned for illegal
- * characters and for grouping errors such as "1,,5", ",44", "," and "".
- * Leading and trailing whitespace accepted, but not embedded whitespace.
- */
-int __bitmap_parse(const char *buf, unsigned int buflen,
-		int is_user, unsigned long *maskp,
-		int nmaskbits)
-{
-	int c, old_c, totaldigits, ndigits, nchunks, nbits;
-	u32 chunk;
-	const char __user __force *ubuf = (const char __user __force *)buf;
-
-	bitmap_zero(maskp, nmaskbits);
-
-	nchunks = nbits = totaldigits = c = 0;
-	do {
-		chunk = 0;
-		ndigits = totaldigits;
-
-		/* Get the next chunk of the bitmap */
-		while (buflen) {
-			old_c = c;
-			if (is_user) {
-				if (__get_user(c, ubuf++))
-					return -EFAULT;
-			}
-			else
-				c = *buf++;
-			buflen--;
-			if (isspace(c))
-				continue;
-
-			/*
-			 * If the last character was a space and the current
-			 * character isn't '\0', we've got embedded whitespace.
-			 * This is a no-no, so throw an error.
-			 */
-			if (totaldigits && c && isspace(old_c))
-				return -EINVAL;
-
-			/* A '\0' or a ',' signal the end of the chunk */
-			if (c == '\0' || c == ',')
-				break;
-
-			if (!isxdigit(c))
-				return -EINVAL;
-
-			/*
-			 * Make sure there are at least 4 free bits in 'chunk'.
-			 * If not, this hexdigit will overflow 'chunk', so
-			 * throw an error.
-			 */
-			if (chunk & ~((1UL << (CHUNKSZ - 4)) - 1))
-				return -EOVERFLOW;
-
-			chunk = (chunk << 4) | hex_to_bin(c);
-			totaldigits++;
-		}
-		if (ndigits == totaldigits)
-			return -EINVAL;
-		if (nchunks == 0 && chunk == 0)
-			continue;
-
-		__bitmap_shift_left(maskp, maskp, CHUNKSZ, nmaskbits);
-		*maskp |= chunk;
-		nchunks++;
-		nbits += (nchunks == 1) ? nbits_to_hold_value(chunk) : CHUNKSZ;
-		if (nbits > nmaskbits)
-			return -EOVERFLOW;
-	} while (buflen && c == ',');
-
-	return 0;
-}
-EXPORT_SYMBOL(__bitmap_parse);
-
 /**
  * bitmap_parse_user - convert an ASCII hex string in a user buffer into a bitmap
  *
@@ -542,7 +451,7 @@ int bitmap_parse_user(const char __user
 	if (IS_ERR(buf))
 		return PTR_ERR(buf);
 
-	ret = bitmap_parse(buf, ulen, maskp, nmaskbits);
+	ret = bitmap_parse(buf, UINT_MAX, maskp, nmaskbits);
 
 	kfree(buf);
 	return ret;
@@ -653,6 +562,14 @@ static const char *bitmap_find_region(co
 	return end_of_str(*str) ? NULL : str;
 }
 
+static const char *bitmap_find_region_reverse(const char *start, const char *end)
+{
+	while (start <= end && __end_of_region(*end))
+		end--;
+
+	return end;
+}
+
 static const char *bitmap_parse_region(const char *str, struct region *r)
 {
 	str = bitmap_getnum(str, &r->start);
@@ -776,6 +693,80 @@ int bitmap_parselist_user(const char __u
 }
 EXPORT_SYMBOL(bitmap_parselist_user);
 
+static const char *bitmap_get_x32_reverse(const char *start,
+					const char *end, u32 *num)
+{
+	u32 ret = 0;
+	int c, i;
+
+	for (i = 0; i < 32; i += 4) {
+		c = hex_to_bin(*end--);
+		if (c < 0)
+			return ERR_PTR(-EINVAL);
+
+		ret |= c << i;
+
+		if (start > end || __end_of_region(*end))
+			goto out;
+	}
+
+	if (hex_to_bin(*end--) >= 0)
+		return ERR_PTR(-EOVERFLOW);
+out:
+	*num = ret;
+	return end;
+}
+
+/**
+ * bitmap_parse - convert an ASCII hex string into a bitmap.
+ * @start: pointer to buffer containing string.
+ * @buflen: buffer size in bytes.  If string is smaller than this
+ *    then it must be terminated with a \0 or \n. In that case,
+ *    UINT_MAX may be provided instead of string length.
+ * @maskp: pointer to bitmap array that will contain result.
+ * @nmaskbits: size of bitmap, in bits.
+ *
+ * Commas group hex digits into chunks.  Each chunk defines exactly 32
+ * bits of the resultant bitmask.  No chunk may specify a value larger
+ * than 32 bits (%-EOVERFLOW), and if a chunk specifies a smaller value
+ * then leading 0-bits are prepended.  %-EINVAL is returned for illegal
+ * characters. Grouping such as "1,,5", ",44", "," or "" is allowed.
+ * Leading, embedded and trailing whitespace accepted.
+ */
+int bitmap_parse(const char *start, unsigned int buflen,
+		unsigned long *maskp, int nmaskbits)
+{
+	const char *end = strnchrnul(start, buflen, '\n') - 1;
+	int chunks = BITS_TO_U32(nmaskbits);
+	u32 *bitmap = (u32 *)maskp;
+	int unset_bit;
+
+	while (1) {
+		end = bitmap_find_region_reverse(start, end);
+		if (start > end)
+			break;
+
+		if (!chunks--)
+			return -EOVERFLOW;
+
+		end = bitmap_get_x32_reverse(start, end, bitmap++);
+		if (IS_ERR(end))
+			return PTR_ERR(end);
+	}
+
+	unset_bit = (BITS_TO_U32(nmaskbits) - chunks) * 32;
+	if (unset_bit < nmaskbits) {
+		bitmap_clear(maskp, unset_bit, nmaskbits - unset_bit);
+		return 0;
+	}
+
+	if (find_next_bit(maskp, unset_bit, nmaskbits) != unset_bit)
+		return -EOVERFLOW;
+
+	return 0;
+}
+EXPORT_SYMBOL(bitmap_parse);
+
 
 #ifdef CONFIG_NUMA
 /**
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 64/67] lib: new testcases for bitmap_parse{_user}
  2020-02-04  1:33 incoming Andrew Morton
                   ` (62 preceding siblings ...)
  2020-02-04  1:37 ` [patch 63/67] lib: rework bitmap_parse() Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 65/67] include/linux/cpumask.h: don't calculate length of the input string Andrew Morton
                   ` (174 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: acme, akpm, amritha.nambiar, andriy.shevchenko, chris, keescook,
	linux-mm, linux, mm-commits, mszeredi, steffen.klassert, tobin,
	torvalds, vineet.gupta1, will.deacon, willemb, willy, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib: new testcases for bitmap_parse{_user}

The new version of bitmap_parse() is unified with bitmap_parselist(),
and therefore:

- weakens the rules on whitespace and commas between hex chunks;

- in addition to \0, allows \n at the end of the string;

- allows passing UINT_MAX or any other big number as the length of the
  input string instead of the actual string length.

The patch covers these cases.
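
For instance, inputs that the old parser rejected but the new one accepts
(mirroring the testcases added below, sketched as a direct call):

	DECLARE_BITMAP(mask, 124);

	/* embedded whitespace and empty chunks are now tolerated */
	bitmap_parse(" , badf00d, ,, ,,deadbeef,1,0 , ", UINT_MAX, mask, 124);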

Link: http://lkml.kernel.org/r/20200102043031.30357-7-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_bitmap.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/lib/test_bitmap.c~lib-new-testcases-for-bitmap_parse_user
+++ a/lib/test_bitmap.c
@@ -415,14 +415,22 @@ static const unsigned long parse_test2[]
 };
 
 static const struct test_bitmap_parselist parse_tests[] __initconst = {
+	{0, "",				&parse_test[0 * step], 32, 0},
+	{0, " ",			&parse_test[0 * step], 32, 0},
 	{0, "0",			&parse_test[0 * step], 32, 0},
+	{0, "0\n",			&parse_test[0 * step], 32, 0},
 	{0, "1",			&parse_test[1 * step], 32, 0},
 	{0, "deadbeef",			&parse_test[2 * step], 32, 0},
 	{0, "1,0",			&parse_test[3 * step], 33, 0},
+	{0, "deadbeef,\n,0,1",		&parse_test[2 * step], 96, 0},
 
 	{0, "deadbeef,1,0",		&parse_test2[0 * 2 * step], 96, 0},
 	{0, "baadf00d,deadbeef,1,0",	&parse_test2[1 * 2 * step], 128, 0},
 	{0, "badf00d,deadbeef,1,0",	&parse_test2[2 * 2 * step], 124, 0},
+	{0, "badf00d,deadbeef,1,0",	&parse_test2[2 * 2 * step], 124, NO_LEN},
+	{0, "  badf00d,deadbeef,1,0  ",	&parse_test2[2 * 2 * step], 124, 0},
+	{0, " , badf00d,deadbeef,1,0 , ",	&parse_test2[2 * 2 * step], 124, 0},
+	{0, " , badf00d, ,, ,,deadbeef,1,0 , ",	&parse_test2[2 * 2 * step], 124, 0},
 
 	{-EINVAL,    "goodfood,deadbeef,1,0",	NULL, 128, 0},
 	{-EOVERFLOW, "3,0",			NULL, 33, 0},
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 65/67] include/linux/cpumask.h: don't calculate length of the input string
  2020-02-04  1:33 incoming Andrew Morton
                   ` (63 preceding siblings ...)
  2020-02-04  1:37 ` [patch 64/67] lib: new testcases for bitmap_parse{_user} Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 66/67] treewide: remove redundant IS_ERR() before error code check Andrew Morton
                   ` (173 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: acme, akpm, amritha.nambiar, andriy.shevchenko, chris, keescook,
	linux-mm, linux, mm-commits, mszeredi, steffen.klassert, tobin,
	torvalds, vineet.gupta1, will.deacon, willemb, willy, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: include/linux/cpumask.h: don't calculate length of the input string

The new design of the inner bitmap_parse() makes it possible to avoid
calculating the length of a null-terminated string.
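
A sketch of the resulting behavior: a trailing newline, as typically
written by userspace, is now handled by bitmap_parse() itself:

	struct cpumask mask;

	/* sets CPUs 0-3, err == 0 (assuming at least 4 possible CPUs) */
	int err = cpumask_parse("f\n", &mask);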

Link: http://lkml.kernel.org/r/20200102043031.30357-8-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Cc: Vineet Gupta <vineet.gupta1@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/cpumask.h |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/include/linux/cpumask.h~cpumask-dont-calculate-length-of-the-input-string
+++ a/include/linux/cpumask.h
@@ -663,9 +663,7 @@ static inline int cpumask_parselist_user
  */
 static inline int cpumask_parse(const char *buf, struct cpumask *dstp)
 {
-	unsigned int len = strchrnul(buf, '\n') - buf;
-
-	return bitmap_parse(buf, len, cpumask_bits(dstp), nr_cpumask_bits);
+	return bitmap_parse(buf, UINT_MAX, cpumask_bits(dstp), nr_cpumask_bits);
 }
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 66/67] treewide: remove redundant IS_ERR() before error code check
  2020-02-04  1:33 incoming Andrew Morton
                   ` (64 preceding siblings ...)
  2020-02-04  1:37 ` [patch 65/67] include/linux/cpumask.h: don't calculate length of the input string Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:37 ` [patch 67/67] ARM: dma-api: fix max_pfn off-by-one error in __dma_supported() Andrew Morton
                   ` (172 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: akpm, bgolaszewski, ebiggers, julia.lawall, linux-mm, masahiroy,
	mm-commits, rafael.j.wysocki, robh, sboyd, torvalds, wsa

From: Masahiro Yamada <masahiroy@kernel.org>
Subject: treewide: remove redundant IS_ERR() before error code check

'PTR_ERR(p) == -E*' is a stronger condition than IS_ERR(p).
Hence, IS_ERR(p) is unneeded.
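
A sketch of why the check is redundant, with the relevant definitions
paraphrased from include/linux/err.h:

	#define MAX_ERRNO	4095
	#define IS_ERR_VALUE(x)	((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)

	/* If PTR_ERR(p) == -ENOENT, then (unsigned long)p == (unsigned long)-2,
	 * which lies in [(unsigned long)-MAX_ERRNO, ULONG_MAX], so IS_ERR(p)
	 * is necessarily true and can be dropped. */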

The semantic patch that generates this commit is as follows:

// <smpl>
@@
expression ptr;
constant error_code;
@@
-IS_ERR(ptr) && (PTR_ERR(ptr) == - error_code)
+PTR_ERR(ptr) == - error_code
// </smpl>

Link: http://lkml.kernel.org/r/20200106045833.1725-1-masahiroy@kernel.org
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Acked-by: Stephen Boyd <sboyd@kernel.org> [drivers/clk/clk.c]
Acked-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> [GPIO]
Acked-by: Wolfram Sang <wsa@the-dreams.de> [drivers/i2c]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi/scan.c]
Acked-by: Rob Herring <robh@kernel.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 crypto/af_alg.c                      |    2 +-
 drivers/acpi/scan.c                  |    2 +-
 drivers/char/hw_random/bcm2835-rng.c |    2 +-
 drivers/char/hw_random/omap-rng.c    |    4 ++--
 drivers/clk/clk.c                    |    2 +-
 drivers/dma/mv_xor_v2.c              |    2 +-
 drivers/gpio/gpiolib-devres.c        |    2 +-
 drivers/gpio/gpiolib-of.c            |    8 ++++----
 drivers/gpio/gpiolib.c               |    2 +-
 drivers/i2c/busses/i2c-mv64xxx.c     |    5 ++---
 drivers/i2c/busses/i2c-synquacer.c   |    2 +-
 drivers/mtd/ubi/build.c              |    2 +-
 drivers/of/device.c                  |    2 +-
 drivers/pci/controller/pci-tegra.c   |    2 +-
 drivers/phy/phy-core.c               |    4 ++--
 drivers/spi/spi-orion.c              |    3 +--
 drivers/video/fbdev/imxfb.c          |    2 +-
 fs/ext4/super.c                      |    2 +-
 fs/f2fs/node.c                       |    2 +-
 fs/ocfs2/suballoc.c                  |    2 +-
 fs/sysfs/group.c                     |    2 +-
 net/core/dev.c                       |    2 +-
 net/core/filter.c                    |    2 +-
 net/xfrm/xfrm_policy.c               |    2 +-
 sound/soc/codecs/ak4104.c            |    3 +--
 sound/soc/codecs/cs4270.c            |    3 +--
 sound/soc/codecs/tlv320aic32x4.c     |    6 ++----
 sound/soc/sunxi/sun4i-spdif.c        |    2 +-
 28 files changed, 35 insertions(+), 41 deletions(-)

--- a/crypto/af_alg.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/crypto/af_alg.c
@@ -171,7 +171,7 @@ static int alg_bind(struct socket *sock,
 	sa->salg_name[sizeof(sa->salg_name) + addr_len - sizeof(*sa) - 1] = 0;
 
 	type = alg_get_type(sa->salg_type);
-	if (IS_ERR(type) && PTR_ERR(type) == -ENOENT) {
+	if (PTR_ERR(type) == -ENOENT) {
 		request_module("algif-%s", sa->salg_type);
 		type = alg_get_type(sa->salg_type);
 	}
--- a/drivers/acpi/scan.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/acpi/scan.c
@@ -1462,7 +1462,7 @@ int acpi_dma_configure(struct device *de
 	iort_dma_setup(dev, &dma_addr, &size);
 
 	iommu = iort_iommu_configure(dev);
-	if (IS_ERR(iommu) && PTR_ERR(iommu) == -EPROBE_DEFER)
+	if (PTR_ERR(iommu) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 
 	arch_setup_dma_ops(dev, dma_addr, size,
--- a/drivers/char/hw_random/bcm2835-rng.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/char/hw_random/bcm2835-rng.c
@@ -157,7 +157,7 @@ static int bcm2835_rng_probe(struct plat
 
 	/* Clock is optional on most platforms */
 	priv->clk = devm_clk_get(dev, NULL);
-	if (IS_ERR(priv->clk) && PTR_ERR(priv->clk) == -EPROBE_DEFER)
+	if (PTR_ERR(priv->clk) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 
 	priv->rng.name = pdev->name;
--- a/drivers/char/hw_random/omap-rng.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/char/hw_random/omap-rng.c
@@ -476,7 +476,7 @@ static int omap_rng_probe(struct platfor
 	}
 
 	priv->clk = devm_clk_get(&pdev->dev, NULL);
-	if (IS_ERR(priv->clk) && PTR_ERR(priv->clk) == -EPROBE_DEFER)
+	if (PTR_ERR(priv->clk) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 	if (!IS_ERR(priv->clk)) {
 		ret = clk_prepare_enable(priv->clk);
@@ -488,7 +488,7 @@ static int omap_rng_probe(struct platfor
 	}
 
 	priv->clk_reg = devm_clk_get(&pdev->dev, "reg");
-	if (IS_ERR(priv->clk_reg) && PTR_ERR(priv->clk_reg) == -EPROBE_DEFER)
+	if (PTR_ERR(priv->clk_reg) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 	if (!IS_ERR(priv->clk_reg)) {
 		ret = clk_prepare_enable(priv->clk_reg);
--- a/drivers/clk/clk.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/clk/clk.c
@@ -429,7 +429,7 @@ static void clk_core_fill_parent_index(s
 			parent = ERR_PTR(-EPROBE_DEFER);
 	} else {
 		parent = clk_core_get(core, index);
-		if (IS_ERR(parent) && PTR_ERR(parent) == -ENOENT && entry->name)
+		if (PTR_ERR(parent) == -ENOENT && entry->name)
 			parent = clk_core_lookup(entry->name);
 	}
 
--- a/drivers/dma/mv_xor_v2.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/dma/mv_xor_v2.c
@@ -750,7 +750,7 @@ static int mv_xor_v2_probe(struct platfo
 	}
 
 	xor_dev->clk = devm_clk_get(&pdev->dev, NULL);
-	if (IS_ERR(xor_dev->clk) && PTR_ERR(xor_dev->clk) == -EPROBE_DEFER) {
+	if (PTR_ERR(xor_dev->clk) == -EPROBE_DEFER) {
 		ret = EPROBE_DEFER;
 		goto disable_reg_clk;
 	}
--- a/drivers/gpio/gpiolib.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/gpio/gpiolib.c
@@ -5039,7 +5039,7 @@ struct gpio_descs *__must_check gpiod_ge
 	struct gpio_descs *descs;
 
 	descs = gpiod_get_array(dev, con_id, flags);
-	if (IS_ERR(descs) && (PTR_ERR(descs) == -ENOENT))
+	if (PTR_ERR(descs) == -ENOENT)
 		return NULL;
 
 	return descs;
--- a/drivers/gpio/gpiolib-devres.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/gpio/gpiolib-devres.c
@@ -308,7 +308,7 @@ devm_gpiod_get_array_optional(struct dev
 	struct gpio_descs *descs;
 
 	descs = devm_gpiod_get_array(dev, con_id, flags);
-	if (IS_ERR(descs) && (PTR_ERR(descs) == -ENOENT))
+	if (PTR_ERR(descs) == -ENOENT)
 		return NULL;
 
 	return descs;
--- a/drivers/gpio/gpiolib-of.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/gpio/gpiolib-of.c
@@ -484,24 +484,24 @@ struct gpio_desc *of_find_gpio(struct de
 			break;
 	}
 
-	if (IS_ERR(desc) && PTR_ERR(desc) == -ENOENT) {
+	if (PTR_ERR(desc) == -ENOENT) {
 		/* Special handling for SPI GPIOs if used */
 		desc = of_find_spi_gpio(dev, con_id, &of_flags);
 	}
 
-	if (IS_ERR(desc) && PTR_ERR(desc) == -ENOENT) {
+	if (PTR_ERR(desc) == -ENOENT) {
 		/* This quirk looks up flags and all */
 		desc = of_find_spi_cs_gpio(dev, con_id, idx, flags);
 		if (!IS_ERR(desc))
 			return desc;
 	}
 
-	if (IS_ERR(desc) && PTR_ERR(desc) == -ENOENT) {
+	if (PTR_ERR(desc) == -ENOENT) {
 		/* Special handling for regulator GPIOs if used */
 		desc = of_find_regulator_gpio(dev, con_id, &of_flags);
 	}
 
-	if (IS_ERR(desc) && PTR_ERR(desc) == -ENOENT)
+	if (PTR_ERR(desc) == -ENOENT)
 		desc = of_find_arizona_gpio(dev, con_id, &of_flags);
 
 	if (IS_ERR(desc))
--- a/drivers/i2c/busses/i2c-mv64xxx.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/i2c/busses/i2c-mv64xxx.c
@@ -901,14 +901,13 @@ mv64xxx_i2c_probe(struct platform_device
 
 	/* Not all platforms have clocks */
 	drv_data->clk = devm_clk_get(&pd->dev, NULL);
-	if (IS_ERR(drv_data->clk) && PTR_ERR(drv_data->clk) == -EPROBE_DEFER)
+	if (PTR_ERR(drv_data->clk) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 	if (!IS_ERR(drv_data->clk))
 		clk_prepare_enable(drv_data->clk);
 
 	drv_data->reg_clk = devm_clk_get(&pd->dev, "reg");
-	if (IS_ERR(drv_data->reg_clk) &&
-	    PTR_ERR(drv_data->reg_clk) == -EPROBE_DEFER)
+	if (PTR_ERR(drv_data->reg_clk) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 	if (!IS_ERR(drv_data->reg_clk))
 		clk_prepare_enable(drv_data->reg_clk);
--- a/drivers/i2c/busses/i2c-synquacer.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/i2c/busses/i2c-synquacer.c
@@ -553,7 +553,7 @@ static int synquacer_i2c_probe(struct pl
 				 &i2c->pclkrate);
 
 	i2c->pclk = devm_clk_get(&pdev->dev, "pclk");
-	if (IS_ERR(i2c->pclk) && PTR_ERR(i2c->pclk) == -EPROBE_DEFER)
+	if (PTR_ERR(i2c->pclk) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 	if (!IS_ERR_OR_NULL(i2c->pclk)) {
 		dev_dbg(&pdev->dev, "clock source %p\n", i2c->pclk);
--- a/drivers/mtd/ubi/build.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/mtd/ubi/build.c
@@ -1180,7 +1180,7 @@ static struct mtd_info * __init open_mtd
 		 * MTD device name.
 		 */
 		mtd = get_mtd_device_nm(mtd_dev);
-		if (IS_ERR(mtd) && PTR_ERR(mtd) == -ENODEV)
+		if (PTR_ERR(mtd) == -ENODEV)
 			/* Probably this is an MTD character device node path */
 			mtd = open_mtd_by_chdev(mtd_dev);
 	} else
--- a/drivers/of/device.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/of/device.c
@@ -161,7 +161,7 @@ int of_dma_configure(struct device *dev,
 		coherent ? " " : " not ");
 
 	iommu = of_iommu_configure(dev, np);
-	if (IS_ERR(iommu) && PTR_ERR(iommu) == -EPROBE_DEFER)
+	if (PTR_ERR(iommu) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 
 	dev_dbg(dev, "device is%sbehind an iommu\n",
--- a/drivers/pci/controller/pci-tegra.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/pci/controller/pci-tegra.c
@@ -1406,7 +1406,7 @@ static struct phy *devm_of_phy_optional_
 	phy = devm_of_phy_get(dev, np, name);
 	kfree(name);
 
-	if (IS_ERR(phy) && PTR_ERR(phy) == -ENODEV)
+	if (PTR_ERR(phy) == -ENODEV)
 		phy = NULL;
 
 	return phy;
--- a/drivers/phy/phy-core.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/phy/phy-core.c
@@ -712,7 +712,7 @@ struct phy *phy_optional_get(struct devi
 {
 	struct phy *phy = phy_get(dev, string);
 
-	if (IS_ERR(phy) && (PTR_ERR(phy) == -ENODEV))
+	if (PTR_ERR(phy) == -ENODEV)
 		phy = NULL;
 
 	return phy;
@@ -766,7 +766,7 @@ struct phy *devm_phy_optional_get(struct
 {
 	struct phy *phy = devm_phy_get(dev, string);
 
-	if (IS_ERR(phy) && (PTR_ERR(phy) == -ENODEV))
+	if (PTR_ERR(phy) == -ENODEV)
 		phy = NULL;
 
 	return phy;
--- a/drivers/spi/spi-orion.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/spi/spi-orion.c
@@ -646,8 +646,7 @@ static int orion_spi_probe(struct platfo
 
 	/* The following clock is only used by some SoCs */
 	spi->axi_clk = devm_clk_get(&pdev->dev, "axi");
-	if (IS_ERR(spi->axi_clk) &&
-	    PTR_ERR(spi->axi_clk) == -EPROBE_DEFER) {
+	if (PTR_ERR(spi->axi_clk) == -EPROBE_DEFER) {
 		status = -EPROBE_DEFER;
 		goto out_rel_clk;
 	}
--- a/drivers/video/fbdev/imxfb.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/drivers/video/fbdev/imxfb.c
@@ -1017,7 +1017,7 @@ static int imxfb_probe(struct platform_d
 	}
 
 	fbi->lcd_pwr = devm_regulator_get(&pdev->dev, "lcd");
-	if (IS_ERR(fbi->lcd_pwr) && (PTR_ERR(fbi->lcd_pwr) == -EPROBE_DEFER)) {
+	if (PTR_ERR(fbi->lcd_pwr) == -EPROBE_DEFER) {
 		ret = -EPROBE_DEFER;
 		goto failed_lcd;
 	}
--- a/fs/ext4/super.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/fs/ext4/super.c
@@ -6043,7 +6043,7 @@ static ssize_t ext4_quota_write(struct s
 		bh = ext4_bread(handle, inode, blk,
 				EXT4_GET_BLOCKS_CREATE |
 				EXT4_GET_BLOCKS_METADATA_NOFAIL);
-	} while (IS_ERR(bh) && (PTR_ERR(bh) == -ENOSPC) &&
+	} while (PTR_ERR(bh) == -ENOSPC &&
 		 ext4_should_retry_alloc(inode->i_sb, &retries));
 	if (IS_ERR(bh))
 		return PTR_ERR(bh);
--- a/fs/f2fs/node.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/fs/f2fs/node.c
@@ -875,7 +875,7 @@ static int truncate_dnode(struct dnode_o
 
 	/* get direct node */
 	page = f2fs_get_node_page(F2FS_I_SB(dn->inode), dn->nid);
-	if (IS_ERR(page) && PTR_ERR(page) == -ENOENT)
+	if (PTR_ERR(page) == -ENOENT)
 		return 1;
 	else if (IS_ERR(page))
 		return PTR_ERR(page);
--- a/fs/ocfs2/suballoc.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/fs/ocfs2/suballoc.c
@@ -696,7 +696,7 @@ static int ocfs2_block_group_alloc(struc
 
 	bg_bh = ocfs2_block_group_alloc_contig(osb, handle, alloc_inode,
 					       ac, cl);
-	if (IS_ERR(bg_bh) && (PTR_ERR(bg_bh) == -ENOSPC))
+	if (PTR_ERR(bg_bh) == -ENOSPC)
 		bg_bh = ocfs2_block_group_alloc_discontig(handle,
 							  alloc_inode,
 							  ac, cl);
--- a/fs/sysfs/group.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/fs/sysfs/group.c
@@ -449,7 +449,7 @@ int __compat_only_sysfs_link_entry_to_ko
 	}
 
 	link = kernfs_create_link(kobj->sd, target_name, entry);
-	if (IS_ERR(link) && PTR_ERR(link) == -EEXIST)
+	if (PTR_ERR(link) == -EEXIST)
 		sysfs_warn_dup(kobj->sd, target_name);
 
 	kernfs_put(entry);
--- a/net/core/dev.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/net/core/dev.c
@@ -5792,7 +5792,7 @@ static enum gro_result dev_gro_receive(s
 	if (&ptype->list == head)
 		goto normal;
 
-	if (IS_ERR(pp) && PTR_ERR(pp) == -EINPROGRESS) {
+	if (PTR_ERR(pp) == -EINPROGRESS) {
 		ret = GRO_CONSUMED;
 		goto ok;
 	}
--- a/net/core/filter.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/net/core/filter.c
@@ -1573,7 +1573,7 @@ int sk_reuseport_attach_bpf(u32 ufd, str
 		return -EPERM;
 
 	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
-	if (IS_ERR(prog) && PTR_ERR(prog) == -EINVAL)
+	if (PTR_ERR(prog) == -EINVAL)
 		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SK_REUSEPORT);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
--- a/net/xfrm/xfrm_policy.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/net/xfrm/xfrm_policy.c
@@ -3189,7 +3189,7 @@ struct dst_entry *xfrm_lookup_route(stru
 					    flags | XFRM_LOOKUP_QUEUE |
 					    XFRM_LOOKUP_KEEP_DST_REF);
 
-	if (IS_ERR(dst) && PTR_ERR(dst) == -EREMOTE)
+	if (PTR_ERR(dst) == -EREMOTE)
 		return make_blackhole(net, dst_orig->ops->family, dst_orig);
 
 	if (IS_ERR(dst))
--- a/sound/soc/codecs/ak4104.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/sound/soc/codecs/ak4104.c
@@ -295,8 +295,7 @@ static int ak4104_spi_probe(struct spi_d
 
 	reset_gpiod = devm_gpiod_get_optional(&spi->dev, "reset",
 					      GPIOD_OUT_HIGH);
-	if (IS_ERR(reset_gpiod) &&
-	    PTR_ERR(reset_gpiod) == -EPROBE_DEFER)
+	if (PTR_ERR(reset_gpiod) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 
 	/* read the 'reserved' register - according to the datasheet, it
--- a/sound/soc/codecs/cs4270.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/sound/soc/codecs/cs4270.c
@@ -681,8 +681,7 @@ static int cs4270_i2c_probe(struct i2c_c
 
 	reset_gpiod = devm_gpiod_get_optional(&i2c_client->dev, "reset",
 					      GPIOD_OUT_HIGH);
-	if (IS_ERR(reset_gpiod) &&
-	    PTR_ERR(reset_gpiod) == -EPROBE_DEFER)
+	if (PTR_ERR(reset_gpiod) == -EPROBE_DEFER)
 		return -EPROBE_DEFER;
 
 	cs4270->regmap = devm_regmap_init_i2c(i2c_client, &cs4270_regmap);
--- a/sound/soc/codecs/tlv320aic32x4.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/sound/soc/codecs/tlv320aic32x4.c
@@ -1098,11 +1098,9 @@ static int aic32x4_setup_regulators(stru
 			return PTR_ERR(aic32x4->supply_av);
 		}
 	} else {
-		if (IS_ERR(aic32x4->supply_dv) &&
-				PTR_ERR(aic32x4->supply_dv) == -EPROBE_DEFER)
+		if (PTR_ERR(aic32x4->supply_dv) == -EPROBE_DEFER)
 			return -EPROBE_DEFER;
-		if (IS_ERR(aic32x4->supply_av) &&
-				PTR_ERR(aic32x4->supply_av) == -EPROBE_DEFER)
+		if (PTR_ERR(aic32x4->supply_av) == -EPROBE_DEFER)
 			return -EPROBE_DEFER;
 	}
 
--- a/sound/soc/sunxi/sun4i-spdif.c~treewide-remove-redundent-is_err-before-error-code-check
+++ a/sound/soc/sunxi/sun4i-spdif.c
@@ -555,7 +555,7 @@ static int sun4i_spdif_probe(struct plat
 	if (quirks->has_reset) {
 		host->rst = devm_reset_control_get_optional_exclusive(&pdev->dev,
 								      NULL);
-		if (IS_ERR(host->rst) && PTR_ERR(host->rst) == -EPROBE_DEFER) {
+		if (PTR_ERR(host->rst) == -EPROBE_DEFER) {
 			ret = -EPROBE_DEFER;
 			dev_err(&pdev->dev, "Failed to get reset: %d\n", ret);
 			return ret;
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [patch 67/67] ARM: dma-api: fix max_pfn off-by-one error in __dma_supported()
  2020-02-04  1:33 incoming Andrew Morton
                   ` (65 preceding siblings ...)
  2020-02-04  1:37 ` [patch 66/67] treewide: remove redundant IS_ERR() before error code check Andrew Morton
@ 2020-02-04  1:37 ` Andrew Morton
  2020-02-04  1:48 ` [merged] mips-kdb-remove-old-workaround-for-backtracing-on-other-cpus.patch removed from -mm tree Andrew Morton
                   ` (171 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:37 UTC (permalink / raw)
  To: akpm, hch, linux-mm, linux, mm-commits, robin.murphy, stable,
	torvalds, wens

From: Chen-Yu Tsai <wens@csie.org>
Subject: ARM: dma-api: fix max_pfn off-by-one error in __dma_supported()

max_pfn, as set in arch/arm/mm/init.c:

    static void __init find_limits(unsigned long *min,
				   unsigned long *max_low,
				   unsigned long *max_high)
    {
	    *max_low = PFN_DOWN(memblock_get_current_limit());
	    *min = PFN_UP(memblock_start_of_DRAM());
	    *max_high = PFN_DOWN(memblock_end_of_DRAM());
    }

with memblock_end_of_DRAM() pointing to the next byte after DRAM.  As
such, max_pfn points to the PFN after the end of DRAM.

Thus, when using max_pfn to check DMA masks, we should subtract one
before comparing DMA ranges against it.

Commit 8bf1268f48ad ("ARM: dma-api: fix off-by-one error in
__dma_supported()") fixed the same issue, but missed this spot.

This issue was found while working on the sun4i-csi v4l2 driver on the
Allwinner R40 SoC.  On Allwinner SoCs, DRAM is offset at 0x40000000, and
we are starting to use of_dma_configure() with the "dma-ranges" property
in the device tree to have the DMA API handle the offset.

In this particular instance, dma-ranges was set to the same range as the
actual available (2 GiB) DRAM.  The following error appeared when the
driver attempted to allocate a buffer:

    sun4i-csi 1c09000.csi: Coherent DMA mask 0x7fffffff (pfn 0x40000-0xc0000)
    covers a smaller range of system memory than the DMA zone pfn 0x0-0xc0001
    sun4i-csi 1c09000.csi: dma_alloc_coherent of size 307200 failed

Fixing the off-by-one error makes things work.
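
To make the arithmetic concrete (an illustrative sketch using the R40
numbers above, not code from the patch):

    /* 2 GiB of DRAM at 0x40000000, PAGE_SHIFT == 12 */
    unsigned long start_pfn = 0x40000000UL >> 12;	/* 0x40000 */
    unsigned long max_pfn   = 0xc0000000UL >> 12;	/* 0xc0000, one past DRAM */
    unsigned long last_pfn  = max_pfn - 1;		/* 0xbffff, last valid PFN */

A DMA limit compared against max_pfn instead of max_pfn - 1 therefore
demands one page more than actually exists.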

Link: http://lkml.kernel.org/r/20191224030239.5656-1-wens@kernel.org
Fixes: 11a5aa32562e ("ARM: dma-mapping: check DMA mask against available memory")
Fixes: 9f28cde0bc64 ("ARM: another fix for the DMA mapping checks")
Fixes: ab746573c405 ("ARM: dma-mapping: allow larger DMA mask than supported")
Signed-off-by: Chen-Yu Tsai <wens@csie.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm/mm/dma-mapping.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/arm/mm/dma-mapping.c~arm-dma-api-fix-max_pfn-off-by-one-error-in-__dma_supported
+++ a/arch/arm/mm/dma-mapping.c
@@ -221,7 +221,7 @@ EXPORT_SYMBOL(arm_coherent_dma_ops);
 
 static int __dma_supported(struct device *dev, u64 mask, bool warn)
 {
-	unsigned long max_dma_pfn = min(max_pfn, arm_dma_pfn_limit);
+	unsigned long max_dma_pfn = min(max_pfn - 1, arm_dma_pfn_limit);
 
 	/*
 	 * Translate the device's DMA mask to a PFN limit.  This
_

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [merged] mips-kdb-remove-old-workaround-for-backtracing-on-other-cpus.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (66 preceding siblings ...)
  2020-02-04  1:37 ` [patch 67/67] ARM: dma-api: fix max_pfn off-by-one error in __dma_supported() Andrew Morton
@ 2020-02-04  1:48 ` Andrew Morton
  2020-02-04  1:48 ` [merged] kdb-kdb_current_regs-should-be-private.patch " Andrew Morton
                   ` (170 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:48 UTC (permalink / raw)
  To: bigeasy, daniel.thompson, dianders, ebiederm, f4bug,
	fancer.lancer, jason.wessel, jhogan, mm-commits, paul.burton,
	qiaochong, ralf, rppt


The patch titled
     Subject: MIPS: kdb: remove old workaround for backtracing on other CPUs
has been removed from the -mm tree.  Its filename was
     mips-kdb-remove-old-workaround-for-backtracing-on-other-cpus.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Douglas Anderson <dianders@chromium.org>
Subject: MIPS: kdb: remove old workaround for backtracing on other CPUs

Patch series "kdb: Don't implicitly change tasks; plus misc fixups".


This patch (of 5):

As of commit 2277b492582d ("kdb: Fix stack crawling on 'running' CPUs that
aren't the master") we no longer need any special case for doing stack
dumps on CPUs that are not the kdb master.  Let's remove it.

Link: http://lkml.kernel.org/r/20191109111623.1.I30a0cac4d9880040c8d41495bd9a567fe3e24989@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Chong Qiao <qiaochong@loongson.cn>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/mips/kernel/traps.c |    5 -----
 1 file changed, 5 deletions(-)

--- a/arch/mips/kernel/traps.c~mips-kdb-remove-old-workaround-for-backtracing-on-other-cpus
+++ a/arch/mips/kernel/traps.c
@@ -210,11 +210,6 @@ void show_stack(struct task_struct *task
 			regs.regs[29] = task->thread.reg29;
 			regs.regs[31] = 0;
 			regs.cp0_epc = task->thread.reg31;
-#ifdef CONFIG_KGDB_KDB
-		} else if (atomic_read(&kgdb_active) != -1 &&
-			   kdb_current_regs) {
-			memcpy(&regs, kdb_current_regs, sizeof(regs));
-#endif /* CONFIG_KGDB_KDB */
 		} else {
 			prepare_frametrace(&regs);
 		}
_

Patches currently in -mm which might be from dianders@chromium.org are

kdb-kdb_current_regs-should-be-private.patch
kdb-kdb_current_task-shouldnt-be-exported.patch
kdb-gid-rid-of-implicit-setting-of-the-current-task-regs.patch
kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [merged] kdb-kdb_current_regs-should-be-private.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (67 preceding siblings ...)
  2020-02-04  1:48 ` [merged] mips-kdb-remove-old-workaround-for-backtracing-on-other-cpus.patch removed from -mm tree Andrew Morton
@ 2020-02-04  1:48 ` Andrew Morton
  2020-02-04  1:48 ` [merged] kdb-kdb_current_task-shouldnt-be-exported.patch " Andrew Morton
                   ` (169 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:48 UTC (permalink / raw)
  To: bigeasy, daniel.thompson, dianders, ebiederm, f4bug,
	fancer.lancer, jason.wessel, jhogan, mm-commits, paul.burton,
	qiaochong, ralf, rppt


The patch titled
     Subject: kdb: kdb_current_regs should be private
has been removed from the -mm tree.  Its filename was
     kdb-kdb_current_regs-should-be-private.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Douglas Anderson <dianders@chromium.org>
Subject: kdb: kdb_current_regs should be private

As of the patch ("MIPS: kdb: Remove old workaround for backtracing on
other CPUs") there is no reason for kdb_current_regs to be in the public
"kdb.h".  Let's move it next to kdb_current_task.

Link: http://lkml.kernel.org/r/20191109111623.2.Iadbfb484e90b557cc4b5ac9890bfca732cd99d77@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Chong Qiao <qiaochong@loongson.cn>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/kdb.h            |    2 --
 kernel/debug/kdb/kdb_private.h |    1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

--- a/include/linux/kdb.h~kdb-kdb_current_regs-should-be-private
+++ a/include/linux/kdb.h
@@ -183,8 +183,6 @@ int kdb_process_cpu(const struct task_st
 	return cpu;
 }
 
-/* kdb access to register set for stack dumping */
-extern struct pt_regs *kdb_current_regs;
 #ifdef CONFIG_KALLSYMS
 extern const char *kdb_walk_kallsyms(loff_t *pos);
 #else /* ! CONFIG_KALLSYMS */
--- a/kernel/debug/kdb/kdb_private.h~kdb-kdb_current_regs-should-be-private
+++ a/kernel/debug/kdb/kdb_private.h
@@ -242,6 +242,7 @@ extern void debug_kusage(void);
 
 extern void kdb_set_current_task(struct task_struct *);
 extern struct task_struct *kdb_current_task;
+extern struct pt_regs *kdb_current_regs;
 
 #ifdef CONFIG_KDB_KEYBOARD
 extern void kdb_kbd_cleanup_state(void);
_

Patches currently in -mm which might be from dianders@chromium.org are

kdb-kdb_current_task-shouldnt-be-exported.patch
kdb-gid-rid-of-implicit-setting-of-the-current-task-regs.patch
kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [merged] kdb-kdb_current_task-shouldnt-be-exported.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (68 preceding siblings ...)
  2020-02-04  1:48 ` [merged] kdb-kdb_current_regs-should-be-private.patch " Andrew Morton
@ 2020-02-04  1:48 ` Andrew Morton
  2020-02-04  1:48 ` [merged] kdb-gid-rid-of-implicit-setting-of-the-current-task-regs.patch " Andrew Morton
                   ` (168 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:48 UTC (permalink / raw)
  To: bigeasy, daniel.thompson, dianders, ebiederm, f4bug,
	fancer.lancer, jason.wessel, jhogan, mm-commits, paul.burton,
	qiaochong, ralf, rppt


The patch titled
     Subject: kdb: kdb_current_task shouldn't be exported
has been removed from the -mm tree.  Its filename was
     kdb-kdb_current_task-shouldnt-be-exported.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Douglas Anderson <dianders@chromium.org>
Subject: kdb: kdb_current_task shouldn't be exported

The kdb_current_task variable has been declared in
"kernel/debug/kdb/kdb_private.h" since 2010 when kdb was added to the
mainline kernel.  This is not a public header.  There is no reason for
kdb_current_task to be exported, and there are no in-kernel users that
need it.  Remove the export.

Link: http://lkml.kernel.org/r/20191109111623.3.I14b22b5eb15ca8f3812ab33e96621231304dc1f7@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Chong Qiao <qiaochong@loongson.cn>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/debug/kdb/kdb_main.c |    1 -
 1 file changed, 1 deletion(-)

--- a/kernel/debug/kdb/kdb_main.c~kdb-kdb_current_task-shouldnt-be-exported
+++ a/kernel/debug/kdb/kdb_main.c
@@ -73,7 +73,6 @@ int kdb_nextline = 1;
 int kdb_state;			/* General KDB state */
 
 struct task_struct *kdb_current_task;
-EXPORT_SYMBOL(kdb_current_task);
 struct pt_regs *kdb_current_regs;
 
 const char *kdb_diemsg;
_

Patches currently in -mm which might be from dianders@chromium.org are

kdb-gid-rid-of-implicit-setting-of-the-current-task-regs.patch
kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [merged] kdb-gid-rid-of-implicit-setting-of-the-current-task-regs.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (69 preceding siblings ...)
  2020-02-04  1:48 ` [merged] kdb-kdb_current_task-shouldnt-be-exported.patch " Andrew Morton
@ 2020-02-04  1:48 ` Andrew Morton
  2020-02-04  1:48 ` [merged] kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch " Andrew Morton
                   ` (167 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:48 UTC (permalink / raw)
  To: bigeasy, daniel.thompson, dianders, ebiederm, f4bug,
	fancer.lancer, jason.wessel, jhogan, mm-commits, paul.burton,
	qiaochong, ralf, rppt


The patch titled
     Subject: kdb: get rid of implicit setting of the current task/regs
has been removed from the -mm tree.  Its filename was
     kdb-gid-rid-of-implicit-setting-of-the-current-task-regs.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Douglas Anderson <dianders@chromium.org>
Subject: kdb: get rid of implicit setting of the current task/regs

Some (but not all?) of the kdb backtrace paths would cause the
kdb_current_task and kdb_current_regs to remain changed.  As discussed in
a review of a previous patch [1], this doesn't seem intuitive, so let's
fix that.

...but it turns out that there's actually no longer any reason to set the
current task / current regs while backtracing anyway.  As of
commit 2277b492582d ("kdb: Fix stack crawling on 'running' CPUs that
aren't the master") if we're backtracing on a task running on a CPU we ask
that CPU to do the backtrace itself.  Linux can do that without anything
fancy.  If we're doing backtrace on a sleeping task we can also do that
fine without updating globals.  So this patch mostly just turns into
deleting a bunch of code.

[1] https://lore.kernel.org/r/20191010150735.dhrj3pbjgmjrdpwr@holly.lan

Link: http://lkml.kernel.org/r/20191109111624.4.Ibc3d982bbeb9e46872d43973ba808cd4c79537c7@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Chong Qiao <qiaochong@loongson.cn>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/debug/kdb/kdb_bt.c      |    8 +-------
 kernel/debug/kdb/kdb_main.c    |    2 +-
 kernel/debug/kdb/kdb_private.h |    1 -
 3 files changed, 2 insertions(+), 9 deletions(-)

--- a/kernel/debug/kdb/kdb_bt.c~kdb-gid-rid-of-implicit-setting-of-the-current-task-regs
+++ a/kernel/debug/kdb/kdb_bt.c
@@ -119,7 +119,6 @@ kdb_bt_cpu(unsigned long cpu)
 		return;
 	}
 
-	kdb_set_current_task(kdb_tsk);
 	kdb_bt1(kdb_tsk, ~0UL, false);
 }
 
@@ -166,10 +165,8 @@ kdb_bt(int argc, const char **argv)
 		if (diag)
 			return diag;
 		p = find_task_by_pid_ns(pid, &init_pid_ns);
-		if (p) {
-			kdb_set_current_task(p);
+		if (p)
 			return kdb_bt1(p, ~0UL, false);
-		}
 		kdb_printf("No process with pid == %ld found\n", pid);
 		return 0;
 	} else if (strcmp(argv[0], "btt") == 0) {
@@ -178,11 +175,9 @@ kdb_bt(int argc, const char **argv)
 		diag = kdbgetularg((char *)argv[1], &addr);
 		if (diag)
 			return diag;
-		kdb_set_current_task((struct task_struct *)addr);
 		return kdb_bt1((struct task_struct *)addr, ~0UL, false);
 	} else if (strcmp(argv[0], "btc") == 0) {
 		unsigned long cpu = ~0;
-		struct task_struct *save_current_task = kdb_current_task;
 		if (argc > 1)
 			return KDB_ARGCOUNT;
 		if (argc == 1) {
@@ -204,7 +199,6 @@ kdb_bt(int argc, const char **argv)
 				kdb_bt_cpu(cpu);
 				touch_nmi_watchdog();
 			}
-			kdb_set_current_task(save_current_task);
 		}
 		return 0;
 	} else {
--- a/kernel/debug/kdb/kdb_main.c~kdb-gid-rid-of-implicit-setting-of-the-current-task-regs
+++ a/kernel/debug/kdb/kdb_main.c
@@ -1138,7 +1138,7 @@ static void kdb_dumpregs(struct pt_regs
 	console_loglevel = old_lvl;
 }
 
-void kdb_set_current_task(struct task_struct *p)
+static void kdb_set_current_task(struct task_struct *p)
 {
 	kdb_current_task = p;
 
--- a/kernel/debug/kdb/kdb_private.h~kdb-gid-rid-of-implicit-setting-of-the-current-task-regs
+++ a/kernel/debug/kdb/kdb_private.h
@@ -240,7 +240,6 @@ extern void *debug_kmalloc(size_t size,
 extern void debug_kfree(void *);
 extern void debug_kusage(void);
 
-extern void kdb_set_current_task(struct task_struct *);
 extern struct task_struct *kdb_current_task;
 extern struct pt_regs *kdb_current_regs;
 
_

Patches currently in -mm which might be from dianders@chromium.org are

kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [merged] kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (70 preceding siblings ...)
  2020-02-04  1:48 ` [merged] kdb-gid-rid-of-implicit-setting-of-the-current-task-regs.patch " Andrew Morton
@ 2020-02-04  1:48 ` Andrew Morton
  2020-02-04  1:49 ` [obsolete] linux-next-git-rejects.patch " Andrew Morton
                   ` (166 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:48 UTC (permalink / raw)
  To: bigeasy, daniel.thompson, dianders, ebiederm, f4bug,
	fancer.lancer, jason.wessel, jhogan, mm-commits, paul.burton,
	qiaochong, ralf, rppt


The patch titled
     Subject: kdb: get rid of confusing diag msg from "rd" if current task has no regs
has been removed from the -mm tree.  Its filename was
     kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch

This patch was dropped because it was merged into mainline or a subsystem tree

------------------------------------------------------
From: Douglas Anderson <dianders@chromium.org>
Subject: kdb: get rid of confusing diag msg from "rd" if current task has no regs

If you switch to a sleeping task with the "pid" command and then type
"rd", kdb tells you this:

  No current kdb registers.  You may need to select another task
  diag: -17: Invalid register name

The first message makes sense, but not the second.  Fix it by having
commands that access the current registers return 0 once we've already
printed the "No current kdb registers" error.

While fixing kdb_rd(), change the function to use "if" rather than
"#ifdef".  It cleans the function up a bit, and any modern compiler will
have no trouble with it while still producing good code.
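
(A generic sketch of the pattern, not the kdb diff itself: with a
compile-time constant condition, the compiler folds the branch and
discards the dead code just as #ifdef would.  SOME_CONSTANT and
fallback_dump() are hypothetical stand-ins.)

    if (SOME_CONSTANT <= 0) {	/* compile-time constant: branch folded */
    	fallback_dump();	/* dead code, eliminated when unreachable */
    	return 0;
    }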

Link: http://lkml.kernel.org/r/20191109111624.5.I121f4c6f0c19266200bf6ef003de78841e5bfc3d@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Chong Qiao <qiaochong@loongson.cn>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: James Hogan <jhogan@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
Cc: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/debug/kdb/kdb_main.c |   28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

--- a/kernel/debug/kdb/kdb_main.c~kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs
+++ a/kernel/debug/kdb/kdb_main.c
@@ -543,9 +543,8 @@ int kdbgetaddrarg(int argc, const char *
 		if (diag)
 			return diag;
 	} else if (symname[0] == '%') {
-		diag = kdb_check_regs();
-		if (diag)
-			return diag;
+		if (kdb_check_regs())
+			return 0;
 		/* Implement register values with % at a later time as it is
 		 * arch optional.
 		 */
@@ -1836,8 +1835,7 @@ static int kdb_go(int argc, const char *
  */
 static int kdb_rd(int argc, const char **argv)
 {
-	int len = kdb_check_regs();
-#if DBG_MAX_REG_NUM > 0
+	int len = 0;
 	int i;
 	char *rname;
 	int rsize;
@@ -1846,8 +1844,14 @@ static int kdb_rd(int argc, const char *
 	u16 reg16;
 	u8 reg8;
 
-	if (len)
-		return len;
+	if (kdb_check_regs())
+		return 0;
+
+	/* Fallback to Linux showregs() if we don't have DBG_MAX_REG_NUM */
+	if (DBG_MAX_REG_NUM <= 0) {
+		kdb_dumpregs(kdb_current_regs);
+		return 0;
+	}
 
 	for (i = 0; i < DBG_MAX_REG_NUM; i++) {
 		rsize = dbg_reg_def[i].size * 2;
@@ -1889,12 +1893,7 @@ static int kdb_rd(int argc, const char *
 		}
 	}
 	kdb_printf("\n");
-#else
-	if (len)
-		return len;
 
-	kdb_dumpregs(kdb_current_regs);
-#endif
 	return 0;
 }
 
@@ -1928,9 +1927,8 @@ static int kdb_rm(int argc, const char *
 	if (diag)
 		return diag;
 
-	diag = kdb_check_regs();
-	if (diag)
-		return diag;
+	if (kdb_check_regs())
+		return 0;
 
 	diag = KDB_BADREG;
 	for (i = 0; i < DBG_MAX_REG_NUM; i++) {
_

Patches currently in -mm which might be from dianders@chromium.org are

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [obsolete] linux-next-git-rejects.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (71 preceding siblings ...)
  2020-02-04  1:48 ` [merged] kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch " Andrew Morton
@ 2020-02-04  1:49 ` Andrew Morton
       [not found] ` <CAHk-=whog86e4fRY_sxHqAos6spwAi_4aFF49S7h5C4XAZM2qw@mail.gmail.com>
                   ` (165 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  1:49 UTC (permalink / raw)
  To: akpm, mm-commits


The patch titled
     Subject: linux-next-git-rejects
has been removed from the -mm tree.  Its filename was
     linux-next-git-rejects.patch

This patch was dropped because it is obsolete

------------------------------------------------------
From: Andrew Morton <akpm@linux-foundation.org>
Subject: linux-next-git-rejects

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/f2fs/data.c                  |   13 -------------
 fs/jbd2/journal.c               |    3 ---
 fs/xfs/libxfs/xfs_attr_remote.c |   15 ---------------
 init/main.c                     |   10 ----------
 4 files changed, 41 deletions(-)

--- a/fs/f2fs/data.c~linux-next-git-rejects
+++ a/fs/f2fs/data.c
@@ -2815,7 +2815,6 @@ readd:
 					done = 1;
 					break;
 				}
-<<<<<<< HEAD
 
 				if (!f2fs_cluster_can_merge_page(&cc,
 								page->index)) {
@@ -2826,18 +2825,6 @@ readd:
 					goto result;
 				}
 
-=======
-
-				if (!f2fs_cluster_can_merge_page(&cc,
-								page->index)) {
-					ret = f2fs_write_multi_pages(&cc,
-						&submitted, wbc, io_type);
-					if (!ret)
-						need_readd = true;
-					goto result;
-				}
-
->>>>>>> linux-next/akpm-base
 				if (unlikely(f2fs_cp_error(sbi)))
 					goto lock_page;
 
--- a/fs/jbd2/journal.c~linux-next-git-rejects
+++ a/fs/jbd2/journal.c
@@ -806,12 +806,9 @@ int jbd2_journal_bmap(journal_t *journal
 			       __func__, blocknr, journal->j_devname);
 			err = -EIO;
 			jbd2_journal_abort(journal, err);
-<<<<<<< HEAD
-=======
 
 		} else {
 			*retp = block;
->>>>>>> linux-next/akpm-base
 		}
 
 	} else {
--- a/fs/xfs/libxfs/xfs_attr_remote.c~linux-next-git-rejects
+++ a/fs/xfs/libxfs/xfs_attr_remote.c
@@ -418,24 +418,9 @@ xfs_attr_rmtval_get(
 			       (map[i].br_startblock != HOLESTARTBLOCK));
 			dblkno = XFS_FSB_TO_DADDR(mp, map[i].br_startblock);
 			dblkcnt = XFS_FSB_TO_BB(mp, map[i].br_blockcount);
-<<<<<<< HEAD
-			bp = xfs_buf_read(mp->m_ddev_targp, dblkno, dblkcnt, 0,
-					&xfs_attr3_rmt_buf_ops);
-			if (!bp)
-				return -ENOMEM;
-			error = bp->b_error;
-			if (error) {
-				xfs_buf_ioerror_alert(bp, __func__);
-				xfs_buf_relse(bp);
-
-				/* bad CRC means corrupted metadata */
-				if (error == -EFSBADCRC)
-					error = -EFSCORRUPTED;
-=======
 			error = xfs_buf_read(mp->m_ddev_targp, dblkno, dblkcnt,
 					0, &bp, &xfs_attr3_rmt_buf_ops);
 			if (error)
->>>>>>> linux-next/akpm-base
 				return error;
 			}
 
--- a/init/main.c~linux-next-git-rejects
+++ a/init/main.c
@@ -1161,17 +1161,7 @@ static const char *initcall_level_names[
 	"late",
 };
 
-<<<<<<< HEAD
-static int __init ignore_unknown_bootoption(char *param, char *val,
-			       const char *unused, void *arg)
-{
-	return 0;
-}

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: incoming
       [not found] ` <CAHk-=whog86e4fRY_sxHqAos6spwAi_4aFF49S7h5C4XAZM2qw@mail.gmail.com>
@ 2020-02-04  2:46   ` Andrew Morton
  0 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  2:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mm-commits, Linux-MM

On Tue, 4 Feb 2020 02:27:48 +0000 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Feb 4, 2020 at 1:33 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > The rest of MM and the rest of everything else.
> 
> What's the base? You've changed your scripts or something, and that
> information is no longer in your cover letter..
> 

Crap, sorry, geriatric.

d4e9056daedca3891414fe3c91de3449a5dad0f2

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: [patch 06/67] mm: factor out next_present_section_nr()
       [not found]   ` <CAHk-=widODg0oV6uNrpTvA3tFVN706o-nrWC5XBL9RZ4y3RFcg@mail.gmail.com>
@ 2020-02-04  4:29     ` Andrew Morton
  0 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-04  4:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Baoquan He, Dan Williams, David Hildenbrand, Kirill A . Shutemov,
	Kirill A . Shutemov, Linux-MM, Mel Gorman, Michal Hocko,
	Michal Hocko, mm-commits, osalvador, Pavel Tatashin,
	Vlastimil Babka, zhi.jin

On Tue, 4 Feb 2020 03:04:40 +0000 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, Feb 4, 2020 at 1:34 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: David Hildenbrand <david@redhat.com>
> > Subject: mm: factor out next_present_section_nr()
> >
> > Let's move it to the header and use the shorter variant from
> > mm/page_alloc.c (the original one will also check
> > "__highest_present_section_nr + 1", which is not necessary).  While at it,
> > make the section_nr in next_pfn() const.
> >
> > In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
> > exceed __highest_present_section_nr, which doesn't make a difference i= n
> > the caller as it is big enough (>=3D all sane end_pfn).
> 
> Here, look at that "i= n". It looks like it was a MIME line-break (so
> "in" was MIME-encoded and turned into "i=\nn") followed by you or
> David re-flowing the text without MIME-decoding it.

Yes, my MUA is not good about converting all input.  I have tools to fix
up the patches post-facto but changelogs have been a challenge.

I have updated my nightly check-patches-for-crap-and-email-it-to-andrew
script to check for this.

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (73 preceding siblings ...)
       [not found] ` <CAHk-=whog86e4fRY_sxHqAos6spwAi_4aFF49S7h5C4XAZM2qw@mail.gmail.com>
@ 2020-02-10  0:55 ` Andrew Morton
  2020-02-10  0:55 ` + revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch " Andrew Morton
                   ` (163 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  0:55 UTC (permalink / raw)
  To: hannes, kirill.shutemov, lixinhai.lxh, mm-commits, riel, willy


The patch titled
     Subject: mm: don't prepare anon_vma if vma has VM_WIPEONFORK
has been added to the -mm tree.  Its filename is
     mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Li Xinhai <lixinhai.lxh@gmail.com>
Subject: mm: don't prepare anon_vma if vma has VM_WIPEONFORK

Patch series "mm: Fix misuse of parent anon_vma in dup_mmap path".

This patchset fixes the misuse of the parent's anon_vma, which is mainly
caused by the child vma's vm_next and vm_prev being left the same as the
parent's after a vma is duplicated.  Code could then reach the parent
vma's neighbors through pointers in the child vma and execute the wrong
logic.

The first two patches fix the relevant issues, and the third patch sets
vm_next and vm_prev to NULL when duplicating a vma, to prevent potential
misuse in the future.


This patch (of 3):

In dup_mmap(), anon_vma_prepare() is called for a vma that has
VM_WIPEONFORK, and the parameter 'tmp' (i.e., the new vma of the child)
has the same ->vm_next and ->vm_prev as its parent vma.  That allows the
anon_vma used by the parent to be mistakenly shared with the child
(find_mergeable_anon_vma() will do this reuse work).

Besides this issue, calling anon_vma_prepare() should be avoided because
we don't copy pages for this vma; preparing the anon_vma will be handled
at fault time.

Link: http://lkml.kernel.org/r/1581150928-3214-2-git-send-email-lixinhai.lxh@gmail.com
Fixes: d2cd9ede6e19 ("mm,fork: introduce MADV_WIPEONFORK")
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/fork.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/kernel/fork.c~mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork
+++ a/kernel/fork.c
@@ -552,10 +552,12 @@ static __latent_entropy int dup_mmap(str
 		if (retval)
 			goto fail_nomem_anon_vma_fork;
 		if (tmp->vm_flags & VM_WIPEONFORK) {
-			/* VM_WIPEONFORK gets a clean slate in the child. */
+			/*
+			 * VM_WIPEONFORK gets a clean slate in the child.
+			 * Don't prepare anon_vma until fault since we don't
+			 * copy page for current vma.
+			 */
 			tmp->anon_vma = NULL;
-			if (anon_vma_prepare(tmp))
-				goto fail_nomem_anon_vma_fork;
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
_

Patches currently in -mm which might be from lixinhai.lxh@gmail.com are

mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch
revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch
mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (74 preceding siblings ...)
  2020-02-10  0:55 ` + mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch added to -mm tree Andrew Morton
@ 2020-02-10  0:55 ` Andrew Morton
  2020-02-10  0:55 ` + mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch " Andrew Morton
                   ` (162 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  0:55 UTC (permalink / raw)
  To: hannes, kirill.shutemov, lixinhai.lxh, mm-commits, riel, willy


The patch titled
     Subject: Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork"
has been added to the -mm tree.  Its filename is
     revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Li Xinhai <lixinhai.lxh@gmail.com>
Subject: Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork"

This reverts commit 4e4a9eb921332b9d1 ("mm/rmap.c: reuse mergeable
anon_vma as parent when fork").

In dup_mmap(), anon_vma_fork() is called to attach the anon_vma, and the
parameter 'tmp' (i.e., the new vma of the child) has the same ->vm_next
and ->vm_prev as its parent vma.  That causes the anon_vma used by the
parent to be mistakenly shared with the child (in anon_vma_clone(), the
code added by that commit does this reuse work).

Besides this issue, the design of reusing an anon_vma from a vma which
has gone through fork should be avoided ([1]).  So, this patch reverts
that commit and keeps the logic of reusing anon_vma consistent across
fork/split/merge of a vma.

[1] commit d0e9fe1758f2 ("Simplify and comment on anon_vma re-use for
    anon_vma_prepare()") explains the test of "list_is_singular()".

Link: http://lkml.kernel.org/r/1581150928-3214-3-git-send-email-lixinhai.lxh@gmail.com
Fixes: 4e4a9eb92133 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork")
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/rmap.c |   13 -------------
 1 file changed, 13 deletions(-)

--- a/mm/rmap.c~revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork
+++ a/mm/rmap.c
@@ -269,19 +269,6 @@ int anon_vma_clone(struct vm_area_struct
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
-	struct vm_area_struct *prev = dst->vm_prev, *pprev = src->vm_prev;
-
-	/*
-	 * If parent share anon_vma with its vm_prev, keep this sharing in in
-	 * child.
-	 *
-	 * 1. Parent has vm_prev, which implies we have vm_prev.
-	 * 2. Parent and its vm_prev have the same anon_vma.
-	 */
-	if (!dst->anon_vma && src->anon_vma &&
-	    pprev && pprev->anon_vma == src->anon_vma)
-		dst->anon_vma = prev->anon_vma;

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (75 preceding siblings ...)
  2020-02-10  0:55 ` + revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch " Andrew Morton
@ 2020-02-10  0:55 ` Andrew Morton
  2020-02-10  1:03 ` + mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch " Andrew Morton
                   ` (161 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  0:55 UTC (permalink / raw)
  To: hannes, kirill.shutemov, lixinhai.lxh, mm-commits, riel, willy


The patch titled
     Subject: mm: set vm_next and vm_prev to NULL in vm_area_dup()
has been added to the -mm tree.  Its filename is
     mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Li Xinhai <lixinhai.lxh@gmail.com>
Subject: mm: set vm_next and vm_prev to NULL in vm_area_dup()

Set ->vm_next and ->vm_prev to NULL to prevent potential misuse of the
newly duplicated vma.

Currently, only the anon_vma handling in the fork path misuses these
stale pointers.  No other bugs have been revealed with this patch
applied.

Link: http://lkml.kernel.org/r/1581150928-3214-4-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/fork.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/fork.c~mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup
+++ a/kernel/fork.c
@@ -361,6 +361,7 @@ struct vm_area_struct *vm_area_dup(struc
 	if (new) {
 		*new = *orig;
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		new->vm_next = new->vm_prev = NULL;
 	}
 	return new;
 }
@@ -561,7 +562,6 @@ static __latent_entropy int dup_mmap(str
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
-		tmp->vm_next = tmp->vm_prev = NULL;
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file_inode(file);
_

Patches currently in -mm which might be from lixinhai.lxh@gmail.com are

mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch
revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch
mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (76 preceding siblings ...)
  2020-02-10  0:55 ` + mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch " Andrew Morton
@ 2020-02-10  1:03 ` Andrew Morton
  2020-02-10  1:05 ` + mm-frontswap-mark-various-intentional-data-races.patch " Andrew Morton
                   ` (160 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  1:03 UTC (permalink / raw)
  To: cai, elver, konrad.wilk, mm-commits


The patch titled
     Subject: mm/list_lru: fix a data race in list_lru_count_one
has been added to the -mm tree.  Its filename is
     mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/list_lru: fix a data race in list_lru_count_one

struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move

 write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
  list_lru_isolate_move+0xf9/0x130
  list_lru_isolate_move at mm/list_lru.c:180
  inode_lru_isolate+0x12b/0x2a0
  __list_lru_walk_one+0x122/0x3d0
  list_lru_walk_one+0x75/0xa0
  prune_icache_sb+0x8b/0xc0
  super_cache_scan+0x1b8/0x250
  do_shrink_slab+0x256/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
  list_lru_count_one+0x116/0x2f0
  list_lru_count_one at mm/list_lru.c:193
  super_cache_count+0xe8/0x170
  do_shrink_slab+0x95/0x6d0
  shrink_slab+0x41b/0x4a0
  shrink_node+0x35c/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 56 PID: 6345 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #4
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

A torn load of l.nr_items could affect the shrinker behaviour due to the
data race.  Fix it by using READ_ONCE() for the read.  Since the writes
are aligned and at most word-sized, assume they are safe from data races,
which avoids the readability cost of writing WRITE_ONCE(var, var + val).
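
(A sketch of the resulting convention, not a quote of the diff: the
lockless read is marked, while the lock-protected, word-sized write
stays plain.)

    /* writer, under nlru->lock: plain, aligned, word-sized update */
    l->nr_items++;

    /*
     * lockless reader: marked load that cannot tear; KCSAN treats it
     * as an intentional concurrent access
     */
    count = READ_ONCE(l->nr_items);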

Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/list_lru.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/list_lru.c~mm-list_lru-fix-a-data-race-in-list_lru_count_one
+++ a/mm/list_lru.c
@@ -190,7 +190,7 @@ unsigned long list_lru_count_one(struct
 
 	rcu_read_lock();
 	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
-	count = l->nr_items;
+	count = READ_ONCE(l->nr_items);
 	rcu_read_unlock();
 
 	return count;
_

Patches currently in -mm which might be from cai@lca.pw are

mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-frontswap-mark-various-intentional-data-races.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (77 preceding siblings ...)
  2020-02-10  1:03 ` + mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch " Andrew Morton
@ 2020-02-10  1:05 ` Andrew Morton
  2020-02-10  1:22 ` + mm-add-mremap_dontunmap-to-mremap.patch " Andrew Morton
                   ` (159 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  1:05 UTC (permalink / raw)
  To: cai, elver, konrad.wilk, mm-commits


The patch titled
     Subject: mm/frontswap: mark various intentional data races
has been added to the -mm tree.  Its filename is
     mm-frontswap-mark-various-intentional-data-races.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-frontswap-mark-various-intentional-data-races.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-frontswap-mark-various-intentional-data-races.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/frontswap: mark various intentional data races

There are a few information counters that are intentionally not protected
against increment races, so just annotate them using the data_race()
macro.

 BUG: KCSAN: data-race in __frontswap_store / __frontswap_store

 write to 0xffffffff8b7174d8 of 8 bytes by task 6396 on cpu 103:
  __frontswap_store+0x2d0/0x344
  inc_frontswap_failed_stores at mm/frontswap.c:70
  (inlined by) __frontswap_store at mm/frontswap.c:280
  swap_writepage+0x83/0xf0
  pageout+0x33e/0xae0
  shrink_page_list+0x1f57/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffffffff8b7174d8 of 8 bytes by task 6405 on cpu 47:
  __frontswap_store+0x2b9/0x344
  inc_frontswap_failed_stores at mm/frontswap.c:70
  (inlined by) __frontswap_store at mm/frontswap.c:280
  swap_writepage+0x83/0xf0
  pageout+0x33e/0xae0
  shrink_page_list+0x1f57/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40
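
The annotation itself is mechanical.  A minimal sketch, using an
illustrative counter name rather than the frontswap one:

#include <linux/compiler.h>     /* data_race() */
#include <linux/types.h>

static u64 failed_stores;       /* informational statistic only */

/* data_race() tells KCSAN the racy increment is intentional; it adds
 * no ordering or atomicity, so occasional lost increments remain
 * possible and are acceptable for a debug counter.
 */
static inline void inc_failed_stores(void)
{
        data_race(failed_stores++);
}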

Link: http://lkml.kernel.org/r/1581114499-5042-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/frontswap.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/frontswap.c~mm-frontswap-mark-various-intentional-data-races
+++ a/mm/frontswap.c
@@ -61,16 +61,16 @@ static u64 frontswap_failed_stores;
 static u64 frontswap_invalidates;
 
 static inline void inc_frontswap_loads(void) {
-	frontswap_loads++;
+	data_race(frontswap_loads++);
 }
 static inline void inc_frontswap_succ_stores(void) {
-	frontswap_succ_stores++;
+	data_race(frontswap_succ_stores++);
 }
 static inline void inc_frontswap_failed_stores(void) {
-	frontswap_failed_stores++;
+	data_race(frontswap_failed_stores++);
 }
 static inline void inc_frontswap_invalidates(void) {
-	frontswap_invalidates++;
+	data_race(frontswap_invalidates++);
 }
 #else
 static inline void inc_frontswap_loads(void) { }
_

Patches currently in -mm which might be from cai@lca.pw are

mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-frontswap-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-add-mremap_dontunmap-to-mremap.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (78 preceding siblings ...)
  2020-02-10  1:05 ` + mm-frontswap-mark-various-intentional-data-races.patch " Andrew Morton
@ 2020-02-10  1:22 ` Andrew Morton
  2020-02-10  1:35 ` + mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch " Andrew Morton
                   ` (158 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  1:22 UTC (permalink / raw)
  To: aarcange, arnd, bgeffon, fweimer, joel, jsbarnes, kirill, luto,
	minchan, mm-commits, mst, natechancellor, sonnyrao, will, yuzhao


The patch titled
     Subject: mm/mremap: add MREMAP_DONTUNMAP to mremap()
has been added to the -mm tree.  Its filename is
     mm-add-mremap_dontunmap-to-mremap.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-add-mremap_dontunmap-to-mremap.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-add-mremap_dontunmap-to-mremap.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Brian Geffon <bgeffon@google.com>
Subject: mm/mremap: add MREMAP_DONTUNMAP to mremap()

When remapping an anonymous, private mapping, if MREMAP_DONTUNMAP is
set, the source mapping will not be removed.  Instead it will be
cleared as if a brand new anonymous, private mapping had been created
atomically as part of the mremap() call.  If a userfaultfd was
watching the source, it will continue to watch the new mapping.  For a
mapping that is shared or not anonymous, MREMAP_DONTUNMAP will cause
the mremap() call to fail.  Because MREMAP_DONTUNMAP always results in
moving a VMA, you MUST use the MREMAP_MAYMOVE flag.  The final result
is two equally sized VMAs where the destination contains the PTEs of
the source.

We hope to use this in Chrome OS where with userfaultfd we could write an
anonymous mapping to disk without having to STOP the process or worry
about VMA permission changes.

This feature also has a use case in Android; Lokesh Gidra has said that
"As part of using userfaultfd for GC, We'll have to move the physical
pages of the java heap to a separate location.  For this purpose mremap
will be used.  Without the MREMAP_DONTUNMAP flag, when I mremap the java
heap, its virtual mapping will be removed as well.  Therefore, we'll
require performing mmap immediately after.  This is not only time
consuming but also opens a time window where a native thread may call mmap
and reserve the java heap's address range for its own usage.  This flag
solves the problem."
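
For illustration, a minimal userspace sketch of the intended usage;
the fallback #define assumes the flag value from the uapi hunk below,
and on kernels without this patch the call simply fails with EINVAL:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4      /* from include/uapi/linux/mman.h below */
#endif

int main(void)
{
        size_t len = 16 * 4096;
        /* Anonymous, private mapping: the only kind the flag accepts. */
        void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (src == MAP_FAILED)
                return 1;

        /* MREMAP_MAYMOVE is mandatory because MREMAP_DONTUNMAP always
         * moves the VMA.  On success, src stays mapped but behaves
         * like a freshly created anonymous mapping.
         */
        void *dst = mremap(src, len, len,
                           MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
        if (dst == MAP_FAILED) {
                perror("mremap");
                return 1;
        }
        printf("moved %p -> %p; source still mapped\n", src, dst);
        return 0;
}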

Link: http://lkml.kernel.org/r/20200207201856.46070-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/uapi/linux/mman.h |    5 +
 mm/mremap.c               |   98 ++++++++++++++++++++++++++++--------
 2 files changed, 80 insertions(+), 23 deletions(-)

--- a/include/uapi/linux/mman.h~mm-add-mremap_dontunmap-to-mremap
+++ a/include/uapi/linux/mman.h
@@ -5,8 +5,9 @@
 #include <asm/mman.h>
 #include <asm-generic/hugetlb_encode.h>
 
-#define MREMAP_MAYMOVE	1
-#define MREMAP_FIXED	2
+#define MREMAP_MAYMOVE		1
+#define MREMAP_FIXED		2
+#define MREMAP_DONTUNMAP	4
 
 #define OVERCOMMIT_GUESS		0
 #define OVERCOMMIT_ALWAYS		1
--- a/mm/mremap.c~mm-add-mremap_dontunmap-to-mremap
+++ a/mm/mremap.c
@@ -318,8 +318,8 @@ unsigned long move_page_tables(struct vm
 static unsigned long move_vma(struct vm_area_struct *vma,
 		unsigned long old_addr, unsigned long old_len,
 		unsigned long new_len, unsigned long new_addr,
-		bool *locked, struct vm_userfaultfd_ctx *uf,
-		struct list_head *uf_unmap)
+		bool *locked, unsigned long flags,
+		struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
@@ -408,11 +408,41 @@ static unsigned long move_vma(struct vm_
 	if (unlikely(vma->vm_flags & VM_PFNMAP))
 		untrack_pfn_moved(vma);
 
+	if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
+		if (vm_flags & VM_ACCOUNT) {
+			/* Always put back VM_ACCOUNT since we won't unmap */
+			vma->vm_flags |= VM_ACCOUNT;
+
+			vm_acct_memory(vma_pages(new_vma));
+		}
+
+		/*
+		 * locked_vm accounting: if the mapping remained the same size
+		 * it will have just moved and we don't need to touch locked_vm
+		 * because we skip the do_unmap. If the mapping shrunk before
+		 * being moved then the do_unmap on that portion will have
+		 * adjusted vm_locked. Only if the mapping grows do we need to
+		 * do something special; the reason is locked_vm only accounts
+		 * for old_len, but we're now adding new_len - old_len locked
+		 * bytes to the new mapping.
+		 */
+		if (new_len > old_len)
+			mm->locked_vm += (new_len - old_len) >> PAGE_SHIFT;
+
+		goto out;
+	}
+
 	if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
 		/* OOM: unable to split vma, just get accounts right */
 		vm_unacct_memory(excess >> PAGE_SHIFT);
 		excess = 0;
 	}
+
+	if (vm_flags & VM_LOCKED) {
+		mm->locked_vm += new_len >> PAGE_SHIFT;
+		*locked = true;
+	}
+out:
 	mm->hiwater_vm = hiwater_vm;
 
 	/* Restore VM_ACCOUNT if one or two pieces of vma left */
@@ -422,16 +452,12 @@ static unsigned long move_vma(struct vm_
 			vma->vm_next->vm_flags |= VM_ACCOUNT;
 	}
 
-	if (vm_flags & VM_LOCKED) {
-		mm->locked_vm += new_len >> PAGE_SHIFT;
-		*locked = true;
-	}
-
 	return new_addr;
 }
 
 static struct vm_area_struct *vma_to_resize(unsigned long addr,
-	unsigned long old_len, unsigned long new_len, unsigned long *p)
+	unsigned long old_len, unsigned long new_len, unsigned long flags,
+	unsigned long *p)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = find_vma(mm, addr);
@@ -453,6 +479,10 @@ static struct vm_area_struct *vma_to_res
 		return ERR_PTR(-EINVAL);
 	}
 
+	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
+			vma->vm_flags & VM_SHARED))
+		return ERR_PTR(-EINVAL);
+
 	if (is_vm_hugetlb_page(vma))
 		return ERR_PTR(-EINVAL);
 
@@ -497,7 +527,7 @@ static struct vm_area_struct *vma_to_res
 
 static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 		unsigned long new_addr, unsigned long new_len, bool *locked,
-		struct vm_userfaultfd_ctx *uf,
+		unsigned long flags, struct vm_userfaultfd_ctx *uf,
 		struct list_head *uf_unmap_early,
 		struct list_head *uf_unmap)
 {
@@ -505,7 +535,7 @@ static unsigned long mremap_to(unsigned
 	struct vm_area_struct *vma;
 	unsigned long ret = -EINVAL;
 	unsigned long charged = 0;
-	unsigned long map_flags;
+	unsigned long map_flags = 0;
 
 	if (offset_in_page(new_addr))
 		goto out;
@@ -534,9 +564,11 @@ static unsigned long mremap_to(unsigned
 	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
 		return -ENOMEM;
 
-	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
-	if (ret)
-		goto out;
+	if (flags & MREMAP_FIXED) {
+		ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
+		if (ret)
+			goto out;
+	}
 
 	if (old_len >= new_len) {
 		ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);
@@ -545,13 +577,26 @@ static unsigned long mremap_to(unsigned
 		old_len = new_len;
 	}
 
-	vma = vma_to_resize(addr, old_len, new_len, &charged);
+	vma = vma_to_resize(addr, old_len, new_len, flags, &charged);
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
 		goto out;
 	}
 
-	map_flags = MAP_FIXED;
+	/*
+	 * MREMAP_DONTUNMAP expands by new_len - (new_len - old_len), we will
+	 * check that we can expand by new_len and vma_to_resize will handle
+	 * the vma growing which is (new_len - old_len).
+	 */
+	if (flags & MREMAP_DONTUNMAP &&
+		!may_expand_vm(mm, vma->vm_flags, new_len >> PAGE_SHIFT)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (flags & MREMAP_FIXED)
+		map_flags |= MAP_FIXED;
+
 	if (vma->vm_flags & VM_MAYSHARE)
 		map_flags |= MAP_SHARED;
 
@@ -561,10 +606,16 @@ static unsigned long mremap_to(unsigned
 	if (IS_ERR_VALUE(ret))
 		goto out1;
 
-	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, uf,
+	/* We got a new mapping */
+	if (!(flags & MREMAP_FIXED))
+		new_addr = ret;
+
+	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, flags, uf,
 		       uf_unmap);
+
 	if (!(offset_in_page(ret)))
 		goto out;
+
 out1:
 	vm_unacct_memory(charged);
 
@@ -609,12 +660,16 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 	addr = untagged_addr(addr);
 	new_addr = untagged_addr(new_addr);
 
-	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
+	if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
 		return ret;
 
 	if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
 		return ret;
 
+	/* MREMAP_DONTUNMAP is always a move */
+	if (flags & MREMAP_DONTUNMAP && !(flags & MREMAP_MAYMOVE))
+		return ret;
+
 	if (offset_in_page(addr))
 		return ret;
 
@@ -632,9 +687,10 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 	if (down_write_killable(&current->mm->mmap_sem))
 		return -EINTR;
 
-	if (flags & MREMAP_FIXED) {
+	if (flags & MREMAP_FIXED || flags & MREMAP_DONTUNMAP) {
 		ret = mremap_to(addr, old_len, new_addr, new_len,
-				&locked, &uf, &uf_unmap_early, &uf_unmap);
+				&locked, flags, &uf, &uf_unmap_early,
+				&uf_unmap);
 		goto out;
 	}
 
@@ -662,7 +718,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 	/*
 	 * Ok, we need to grow..
 	 */
-	vma = vma_to_resize(addr, old_len, new_len, &charged);
+	vma = vma_to_resize(addr, old_len, new_len, flags, &charged);
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
 		goto out;
@@ -712,7 +768,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 		}
 
 		ret = move_vma(vma, addr, old_len, new_len, new_addr,
-			       &locked, &uf, &uf_unmap);
+			       &locked, flags, &uf, &uf_unmap);
 	}
 out:
 	if (offset_in_page(ret)) {
_

Patches currently in -mm which might be from bgeffon@google.com are

mm-add-mremap_dontunmap-to-mremap.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (79 preceding siblings ...)
  2020-02-10  1:22 ` + mm-add-mremap_dontunmap-to-mremap.patch " Andrew Morton
@ 2020-02-10  1:35 ` Andrew Morton
  2020-02-10  1:48 ` + mm-swapfile-fix-and-annotate-various-data-races.patch " Andrew Morton
                   ` (157 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  1:35 UTC (permalink / raw)
  To: bhe, dan.j.williams, david, mm-commits, richardw.yang


The patch titled
     Subject: mm/sparsemem: get address to page struct instead of address to pfn
has been added to the -mm tree.  Its filename is
     mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/sparsemem: get address to page struct instead of address to pfn

memmap should be the address of the page struct, not the address
corresponding to the pfn.

As mentioned by David, if system memory and devmem sit within a
section, the mismatched address would lead kdump to dump unexpected
memory.

Since sub-section hotplug only works with SPARSEMEM_VMEMMAP,
pfn_to_page() is valid for getting the page struct address at this
point.
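
The distinction the fix relies on can be sketched as follows (a
fragment, not a complete function; pfn is assumed in scope):

        /* Virtual address of the memory the pfn refers to: */
        void *kaddr = pfn_to_kaddr(pfn);   /* ~ __va(pfn << PAGE_SHIFT) */

        /* Address of the pfn's struct page in the memmap: */
        struct page *page = pfn_to_page(pfn);

sparse_init_one_section() stores a memmap pointer, i.e. a struct page
address, so the pfn_to_kaddr() value was the wrong kind of address
entirely.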

Link: http://lkml.kernel.org/r/20200210005048.10437-1-richardw.yang@linux.intel.com
Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/sparse.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/sparse.c~mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn
+++ a/mm/sparse.c
@@ -884,7 +884,7 @@ int __meminit sparse_add_section(int nid
 
 	/* Align memmap to section boundary in the subsection case */
 	if (section_nr_to_pfn(section_nr) != start_pfn)
-		memmap = pfn_to_kaddr(section_nr_to_pfn(section_nr));
+		memmap = pfn_to_page(section_nr_to_pfn(section_nr));
 	sparse_init_one_section(ms, section_nr, memmap, ms->usage, 0);
 
 	return 0;
_

Patches currently in -mm which might be from richardw.yang@linux.intel.com are

mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-swapfile-fix-and-annotate-various-data-races.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (80 preceding siblings ...)
  2020-02-10  1:35 ` + mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch " Andrew Morton
@ 2020-02-10  1:48 ` Andrew Morton
  2020-02-10  2:01 ` + mm-page_io-mark-various-intentional-data-races.patch " Andrew Morton
                   ` (156 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  1:48 UTC (permalink / raw)
  To: cai, elver, hughd, mm-commits


The patch titled
     Subject: mm/swapfile: fix and annotate various data races
has been added to the -mm tree.  Its filename is
     mm-swapfile-fix-and-annotate-various-data-races.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-swapfile-fix-and-annotate-various-data-races.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-swapfile-fix-and-annotate-various-data-races.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/swapfile: fix and annotate various data races

swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags
could each be accessed concurrently, as noticed by KCSAN,

=== si.highest_bit ===

 write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
  swap_range_alloc+0x81/0x130
  swap_range_alloc at mm/swapfile.c:681
  scan_swap_map_slots+0x371/0xb90
  get_swap_pages+0x39d/0x5c0
  get_swap_page+0xf2/0x524
  add_to_swap+0xe4/0x1c0
  shrink_page_list+0x1795/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290

 read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
  scan_swap_map_slots+0x4a6/0xb90
  scan_swap_map_slots at mm/swapfile.c:892
  get_swap_pages+0x39d/0x5c0
  get_swap_page+0xf2/0x524
  add_to_swap+0xe4/0x1c0
  shrink_page_list+0x1795/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 70 PID: 6672 Comm: oom01 Tainted: G        W    L 5.5.0-next-20200205+ #3
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

=== si.swap_map[offset] ===

 write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
  __swap_entry_free_locked+0x8c/0x100
  __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
  __swap_entry_free.constprop.20+0x69/0xb0
  free_swap_and_cache+0x53/0xa0
  unmap_page_range+0x7f8/0x1d70
  unmap_single_vma+0xcd/0x170
  unmap_vmas+0x18b/0x220
  exit_mmap+0xee/0x220
  mmput+0x10e/0x270
  do_exit+0x59b/0xf40
  do_group_exit+0x8b/0x180

 read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
  _swap_info_get+0x81/0xa0
  _swap_info_get at mm/swapfile.c:1140
  free_swap_and_cache+0x40/0xa0
  unmap_page_range+0x7f8/0x1d70
  unmap_single_vma+0xcd/0x170
  unmap_vmas+0x18b/0x220
  exit_mmap+0xee/0x220
  mmput+0x10e/0x270
  do_exit+0x59b/0xf40
  do_group_exit+0x8b/0x180

=== si.flags ===

 write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
  scan_swap_map_slots+0x6fe/0xb50
  scan_swap_map_slots at mm/swapfile.c:887
  get_swap_pages+0x39d/0x5c0
  get_swap_page+0x377/0x524
  add_to_swap+0xe4/0x1c0
  shrink_page_list+0x1795/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290

 read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
  _swap_info_get+0x41/0xa0
  __swap_info_get at mm/swapfile.c:1114
  put_swap_page+0x84/0x490
  __remove_mapping+0x384/0x5f0
  shrink_page_list+0xff1/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290

The writes are under si->lock but the reads are not.  For
si.highest_bit and si.swap_map[offset], a data race could trigger
logic bugs, so fix them by using WRITE_ONCE() for the writes and
READ_ONCE() for the reads, except for the isolated reads that merely
compare against zero, where a data race would cause no harm; annotate
those as intentional data races using the data_race() macro.

For si.flags, the readers are only interested in a single bit, so a
data race there causes no issue either.
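
A condensed sketch of that annotation policy; the types and names are
illustrative, not the swapfile code:

#include <linux/compiler.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct swap_map {
        spinlock_t lock;
        unsigned char *map;
};

/* Writers hold the lock; WRITE_ONCE() pairs with the lockless
 * READ_ONCE() readers so neither side can be torn.
 */
static void slot_set(struct swap_map *s, unsigned long off,
                     unsigned char usage)
{
        lockdep_assert_held(&s->lock);
        WRITE_ONCE(s->map[off], usage);
}

static unsigned char slot_get(struct swap_map *s, unsigned long off)
{
        return READ_ONCE(s->map[off]);
}

/* A read that only compares against zero tolerates tearing, so it is
 * merely annotated as an intentional race.
 */
static bool slot_free(struct swap_map *s, unsigned long off)
{
        return data_race(!s->map[off]);
}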

Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |   31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

--- a/mm/swapfile.c~mm-swapfile-fix-and-annotate-various-data-races
+++ a/mm/swapfile.c
@@ -678,7 +678,7 @@ static void swap_range_alloc(struct swap
 	if (offset == si->lowest_bit)
 		si->lowest_bit += nr_entries;
 	if (end == si->highest_bit)
-		si->highest_bit -= nr_entries;
+		WRITE_ONCE(si->highest_bit, si->highest_bit - nr_entries);
 	si->inuse_pages += nr_entries;
 	if (si->inuse_pages == si->pages) {
 		si->lowest_bit = si->max;
@@ -710,7 +710,7 @@ static void swap_range_free(struct swap_
 	if (end > si->highest_bit) {
 		bool was_full = !si->highest_bit;
 
-		si->highest_bit = end;
+		WRITE_ONCE(si->highest_bit, end);
 		if (was_full && (si->flags & SWP_WRITEOK))
 			add_to_avail_list(si);
 	}
@@ -843,7 +843,7 @@ checks:
 		else
 			goto done;
 	}
-	si->swap_map[offset] = usage;
+	WRITE_ONCE(si->swap_map[offset], usage);
 	inc_cluster_info_page(si, si->cluster_info, offset);
 	unlock_cluster(ci);
 
@@ -889,12 +889,13 @@ done:
 
 scan:
 	spin_unlock(&si->lock);
-	while (++offset <= si->highest_bit) {
-		if (!si->swap_map[offset]) {
+	while (++offset <= READ_ONCE(si->highest_bit)) {
+		if (data_race(!si->swap_map[offset])) {
 			spin_lock(&si->lock);
 			goto checks;
 		}
-		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+		if (vm_swap_full() &&
+		    READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
 			spin_lock(&si->lock);
 			goto checks;
 		}
@@ -905,11 +906,12 @@ scan:
 	}
 	offset = si->lowest_bit;
 	while (offset < scan_base) {
-		if (!si->swap_map[offset]) {
+		if (data_race(!si->swap_map[offset])) {
 			spin_lock(&si->lock);
 			goto checks;
 		}
-		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+		if (vm_swap_full() &&
+		    READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
 			spin_lock(&si->lock);
 			goto checks;
 		}
@@ -1111,7 +1113,7 @@ static struct swap_info_struct *__swap_i
 	p = swp_swap_info(entry);
 	if (!p)
 		goto bad_nofile;
-	if (!(p->flags & SWP_USED))
+	if (data_race(!(p->flags & SWP_USED)))
 		goto bad_device;
 	offset = swp_offset(entry);
 	if (offset >= p->max)
@@ -1137,7 +1139,7 @@ static struct swap_info_struct *_swap_in
 	p = __swap_info_get(entry);
 	if (!p)
 		goto out;
-	if (!p->swap_map[swp_offset(entry)])
+	if (data_race(!p->swap_map[swp_offset(entry)]))
 		goto bad_free;
 	return p;
 
@@ -1206,7 +1208,10 @@ static unsigned char __swap_entry_free_l
 	}
 
 	usage = count | has_cache;
-	p->swap_map[offset] = usage ? : SWAP_HAS_CACHE;
+	if (usage)
+		WRITE_ONCE(p->swap_map[offset], usage);
+	else
+		WRITE_ONCE(p->swap_map[offset], SWAP_HAS_CACHE);
 
 	return usage;
 }
@@ -1258,7 +1263,7 @@ struct swap_info_struct *get_swap_device
 		goto bad_nofile;
 
 	rcu_read_lock();
-	if (!(si->flags & SWP_VALID))
+	if (data_race(!(si->flags & SWP_VALID)))
 		goto unlock_out;
 	offset = swp_offset(entry);
 	if (offset >= si->max)
@@ -3436,7 +3441,7 @@ static int __swap_duplicate(swp_entry_t
 	} else
 		err = -ENOENT;			/* unused swap entry */
 
-	p->swap_map[offset] = count | has_cache;
+	WRITE_ONCE(p->swap_map[offset], count | has_cache);
 
 unlock_out:
 	unlock_cluster_or_swap_info(p, ci);
_

Patches currently in -mm which might be from cai@lca.pw are

mm-swapfile-fix-and-annotate-various-data-races.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-frontswap-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-page_io-mark-various-intentional-data-races.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (81 preceding siblings ...)
  2020-02-10  1:48 ` + mm-swapfile-fix-and-annotate-various-data-races.patch " Andrew Morton
@ 2020-02-10  2:01 ` Andrew Morton
  2020-02-10  2:01 ` + mm-swap_state-mark-various-intentional-data-races.patch " Andrew Morton
                   ` (155 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  2:01 UTC (permalink / raw)
  To: cai, elver, mm-commits


The patch titled
     Subject: mm/page_io: mark various intentional data races
has been added to the -mm tree.  Its filename is
     mm-page_io-mark-various-intentional-data-races.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_io-mark-various-intentional-data-races.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_io-mark-various-intentional-data-races.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/page_io: mark various intentional data races

struct swap_info_struct si.flags could be accessed concurrently as noticed
by KCSAN,

 BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage

 write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
  scan_swap_map_slots+0x6fe/0xb50
  scan_swap_map_slots at mm/swapfile.c:887
  get_swap_pages+0x39d/0x5c0
  get_swap_page+0x377/0x524
  add_to_swap+0xe4/0x1c0
  shrink_page_list+0x1740/0x2820
  shrink_inactive_list+0x316/0x8b0
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
  swap_readpage+0x204/0x6a0
  swap_readpage at mm/page_io.c:380
  read_swap_cache_async+0xa2/0xb0
  swapin_readahead+0x6a0/0x890
  do_swap_page+0x465/0xeb0
  __handle_mm_fault+0xc7a/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 7 PID: 5422 Comm: gmain Tainted: G        W  O L 5.5.0-next-20200204+ #6
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

Other reads,

 read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
  __swap_writepage+0x140/0xc20
  __swap_writepage at mm/page_io.c:289

 read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
  swap_set_page_dirty+0x44/0x1f4
  swap_set_page_dirty at mm/page_io.c:442

The write is under &si->lock, but the reads are lockless.  Since the
reads only check for a specific bit in the flags, they are harmless
even if load tearing happens.  Thus, just mark them as intentional
data races using the data_race() macro.

Link: http://lkml.kernel.org/r/20200207003601.1526-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_io.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/page_io.c~mm-page_io-mark-various-intentional-data-races
+++ a/mm/page_io.c
@@ -286,7 +286,7 @@ int __swap_writepage(struct page *page,
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	if (sis->flags & SWP_FS) {
+	if (data_race(sis->flags & SWP_FS)) {
 		struct kiocb kiocb;
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
@@ -377,7 +377,7 @@ int swap_readpage(struct page *page, boo
 		goto out;
 	}
 
-	if (sis->flags & SWP_FS) {
+	if (data_race(sis->flags & SWP_FS)) {
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 
@@ -439,7 +439,7 @@ int swap_set_page_dirty(struct page *pag
 {
 	struct swap_info_struct *sis = page_swap_info(page);
 
-	if (sis->flags & SWP_FS) {
+	if (data_race(sis->flags & SWP_FS)) {
 		struct address_space *mapping = sis->swap_file->f_mapping;
 
 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
_

Patches currently in -mm which might be from cai@lca.pw are

mm-swapfile-fix-and-annotate-various-data-races.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-swap_state-mark-various-intentional-data-races.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (82 preceding siblings ...)
  2020-02-10  2:01 ` + mm-page_io-mark-various-intentional-data-races.patch " Andrew Morton
@ 2020-02-10  2:01 ` Andrew Morton
  2020-02-10  4:00 ` + linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split.patch " Andrew Morton
                   ` (154 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  2:01 UTC (permalink / raw)
  To: cai, elver, mm-commits


The patch titled
     Subject: mm/swap_state: mark various intentional data races
has been added to the -mm tree.  Its filename is
     mm-swap_state-mark-various-intentional-data-races.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-swap_state-mark-various-intentional-data-races.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-swap_state-mark-various-intentional-data-races.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/swap_state: mark various intentional data races

swap_cache_info.* could be accessed concurrently as noticed by
KCSAN,

 BUG: KCSAN: data-race in lookup_swap_cache / lookup_swap_cache

 write to 0xffffffff85517318 of 8 bytes by task 94138 on cpu 101:
  lookup_swap_cache+0x12e/0x460
  lookup_swap_cache at mm/swap_state.c:322
  do_swap_page+0x112/0xeb0
  __handle_mm_fault+0xc7a/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffffffff85517318 of 8 bytes by task 91655 on cpu 100:
  lookup_swap_cache+0x117/0x460
  lookup_swap_cache at mm/swap_state.c:322
  shmem_swapin_page+0xc7/0x9e0
  shmem_getpage_gfp+0x2ca/0x16c0
  shmem_fault+0xef/0x3c0
  __do_fault+0x9e/0x220
  do_fault+0x4a0/0x920
  __handle_mm_fault+0xc69/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 100 PID: 91655 Comm: systemd-journal Tainted: G        W  O L 5.5.0-next-20200204+ #6
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

 write to 0xffffffff8d717308 of 8 bytes by task 11365 on cpu 87:
   __delete_from_swap_cache+0x681/0x8b0
   __delete_from_swap_cache at mm/swap_state.c:178

 read to 0xffffffff8d717308 of 8 bytes by task 11275 on cpu 53:
   __delete_from_swap_cache+0x66e/0x8b0
   __delete_from_swap_cache at mm/swap_state.c:178

Both the read and the write are lockless.  Since swap_cache_info.*
are only used to print out counter information, it is harmless even if
any of them miss a few increments due to data races, so just mark them
as intentional data races using the data_race() macro.

While at it, fix a checkpatch.pl warning,

WARNING: Single statement macros should not use a do {} while (0) loop
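
The reasoning behind that warning, as a sketch with illustrative
macro names:

#include <linux/compiler.h>

#define BUMP_TWO(a, b)  do { (a)++; (b)++; } while (0)  /* guard needed */
#define BUMP_ONE(a)     data_race((a)++)                /* guard is noise */

static void bump(int cond, long *x, long *y)
{
        /* The do { } while (0) wrapper makes a multi-statement macro
         * behave as one statement, so this if/else still parses.  For
         * a macro that is already a single expression the wrapper adds
         * nothing, which is what checkpatch complains about.
         */
        if (cond)
                BUMP_TWO(*x, *y);
        else
                BUMP_ONE(*x);
}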

Link: http://lkml.kernel.org/r/20200207003715.1578-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap_state.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/swap_state.c~mm-swap_state-mark-various-intentional-data-races
+++ a/mm/swap_state.c
@@ -58,8 +58,8 @@ static bool enable_vma_readahead __read_
 #define GET_SWAP_RA_VAL(vma)					\
 	(atomic_long_read(&(vma)->swap_readahead_info) ? : 4)
 
-#define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
-#define ADD_CACHE_INFO(x, nr)	do { swap_cache_info.x += (nr); } while (0)
+#define INC_CACHE_INFO(x)	data_race(swap_cache_info.x++)
+#define ADD_CACHE_INFO(x, nr)	data_race(swap_cache_info.x += (nr))
 
 static struct {
 	unsigned long add_total;
_

Patches currently in -mm which might be from cai@lca.pw are

mm-swapfile-fix-and-annotate-various-data-races.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (83 preceding siblings ...)
  2020-02-10  2:01 ` + mm-swap_state-mark-various-intentional-data-races.patch " Andrew Morton
@ 2020-02-10  4:00 ` Andrew Morton
  2020-02-10  4:16 ` + mm-swap-move-inode_lock-out-of-claim_swapfile.patch " Andrew Morton
                   ` (153 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  4:00 UTC (permalink / raw)
  To: mm-commits, rdunlap


The patch titled
     Subject: linux/pipe_fs_i.h: fix kernel-doc warnings after @wait was split
has been added to the -mm tree.  Its filename is
     linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Randy Dunlap <rdunlap@infradead.org>
Subject: linux/pipe_fs_i.h: fix kernel-doc warnings after @wait was split

Fix kernel-doc warnings in struct pipe_inode_info after @wait was split
into @rd_wait and @wr_wait.

../include/linux/pipe_fs_i.h:66: warning: Function parameter or member 'rd_wait' not described in 'pipe_inode_info'
../include/linux/pipe_fs_i.h:66: warning: Function parameter or member 'wr_wait' not described in 'pipe_inode_info'
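
The rule the warning enforces is that every struct member must have
its own @name line in the kernel-doc block.  An illustrative
reduction, not the real pipe code:

#include <linux/wait.h>

/**
 * struct example_pipe - an example documented struct
 * @rd_wait: reader wait point
 * @wr_wait: writer wait point
 *
 * Splitting or renaming a member without updating the kernel-doc
 * block yields exactly the "Function parameter or member 'x' not
 * described" warnings quoted above.
 */
struct example_pipe {
        wait_queue_head_t rd_wait;
        wait_queue_head_t wr_wait;
};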

Link: http://lkml.kernel.org/r/0956ab21-9b9a-4d1e-fe43-b853d1602781@infradead.org
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/pipe_fs_i.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/linux/pipe_fs_i.h~linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split
+++ a/include/linux/pipe_fs_i.h
@@ -29,7 +29,8 @@ struct pipe_buffer {
 /**
  *	struct pipe_inode_info - a linux kernel pipe
  *	@mutex: mutex protecting the whole thing
- *	@wait: reader/writer wait point in case of empty/full pipe
+ *	@rd_wait: reader wait point in case of empty pipe
+ *	@wr_wait: writer wait point in case of full pipe
  *	@head: The point of buffer production
  *	@tail: The point of buffer consumption
  *	@max_usage: The maximum number of slots that may be used in the ring
_

Patches currently in -mm which might be from rdunlap@infradead.org are

linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-swap-move-inode_lock-out-of-claim_swapfile.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (84 preceding siblings ...)
  2020-02-10  4:00 ` + linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split.patch " Andrew Morton
@ 2020-02-10  4:16 ` Andrew Morton
  2020-02-10  4:20 ` + selftests-vm-add-missed-tests-in-run_vmtests.patch " Andrew Morton
                   ` (152 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  4:16 UTC (permalink / raw)
  To: akpm, darrick.wong, hch, mm-commits, naohiro.aota


The patch titled
     Subject: mm/swapfile.c: move inode_lock out of claim_swapfile
has been added to the -mm tree.  Its filename is
     mm-swap-move-inode_lock-out-of-claim_swapfile.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-swap-move-inode_lock-out-of-claim_swapfile.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-swap-move-inode_lock-out-of-claim_swapfile.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Naohiro Aota <naohiro.aota@wdc.com>
Subject: mm/swapfile.c: move inode_lock out of claim_swapfile

claim_swapfile() currently keeps the inode locked when it succeeds,
or when the file is already a swapfile (failing with -EBUSY).  On all
other error paths it does not lock the inode.

This inconsistency between lock state and return value is quite
confusing and actually causes the bad unlock balance shown below in
the "bad_swap" section of __do_sys_swapon().

This commit fixes the issue by moving the inode_lock() and the
IS_SWAPFILE check out of claim_swapfile().  The inode is unlocked in
the "bad_swap_unlock_inode" section, so that it is guaranteed to be
unlocked by the time we reach "bad_swap".  Thus, error handling code
after the locking now jumps to "bad_swap_unlock_inode" instead of
"bad_swap".
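
The shape of the fix, reduced to a sketch; claim(), validate() and
undo_claim() are hypothetical stand-ins, not the swapon code:

#include <linux/fs.h>

static int setup_thing(struct inode *inode)
{
        int error;

        error = claim(inode);           /* hypothetical; never returns locked */
        if (error)
                goto bad;

        inode_lock(inode);
        error = validate(inode);        /* hypothetical */
        if (error)
                goto bad_unlock_inode;  /* every post-lock failure lands here */

        /* ... further setup, all failures using bad_unlock_inode ... */
        inode_unlock(inode);
        return 0;

bad_unlock_inode:
        inode_unlock(inode);            /* unlocked exactly once */
bad:
        undo_claim(inode);              /* hypothetical common cleanup */
        return error;
}

The point is that the lock is taken at a single, visible place in the
caller and every label has an unambiguous lock state, which is what
the unlock imbalance below was missing.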

    =====================================
    WARNING: bad unlock balance detected!
    5.5.0-rc7+ #176 Not tainted
    -------------------------------------
    swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at:
    [<ffffffff8173a6eb>] __do_sys_swapon+0x94b/0x3550
    but there are no more locks to release!

    other info that might help us debug this:
    no locks held by swapon/4294.

    stack backtrace:
    CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
    Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
    Call Trace:
     dump_stack+0xa1/0xea
     ? __do_sys_swapon+0x94b/0x3550
     print_unlock_imbalance_bug.cold+0x114/0x123
     ? __do_sys_swapon+0x94b/0x3550
     lock_release+0x562/0xed0
     ? kvfree+0x31/0x40
     ? lock_downgrade+0x770/0x770
     ? kvfree+0x31/0x40
     ? rcu_read_lock_sched_held+0xa1/0xd0
     ? rcu_read_lock_bh_held+0xb0/0xb0
     up_write+0x2d/0x490
     ? kfree+0x293/0x2f0
     __do_sys_swapon+0x94b/0x3550
     ? putname+0xb0/0xf0
     ? kmem_cache_free+0x2e7/0x370
     ? do_sys_open+0x184/0x3e0
     ? generic_max_swapfile_size+0x40/0x40
     ? do_syscall_64+0x27/0x4b0
     ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
     ? lockdep_hardirqs_on+0x38c/0x590
     __x64_sys_swapon+0x54/0x80
     do_syscall_64+0xa4/0x4b0
     entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f15da0a0dc7

Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |   41 ++++++++++++++++++++---------------------
 1 file changed, 20 insertions(+), 21 deletions(-)

--- a/mm/swapfile.c~mm-swap-move-inode_lock-out-of-claim_swapfile
+++ a/mm/swapfile.c
@@ -2904,10 +2904,6 @@ static int claim_swapfile(struct swap_in
 		p->bdev = inode->i_sb->s_bdev;
 	}
 
-	inode_lock(inode);
-	if (IS_SWAPFILE(inode))
-		return -EBUSY;
-
 	return 0;
 }
 
@@ -3162,36 +3158,41 @@ SYSCALL_DEFINE2(swapon, const char __use
 	mapping = swap_file->f_mapping;
 	inode = mapping->host;
 
-	/* If S_ISREG(inode->i_mode) will do inode_lock(inode); */
 	error = claim_swapfile(p, inode);
 	if (unlikely(error))
 		goto bad_swap;
 
+	inode_lock(inode);
+	if (IS_SWAPFILE(inode)) {
+		error = -EBUSY;
+		goto bad_swap_unlock_inode;
+	}
+
 	/*
 	 * Read the swap header.
 	 */
 	if (!mapping->a_ops->readpage) {
 		error = -EINVAL;
-		goto bad_swap;
+		goto bad_swap_unlock_inode;
 	}
 	page = read_mapping_page(mapping, 0, swap_file);
 	if (IS_ERR(page)) {
 		error = PTR_ERR(page);
-		goto bad_swap;
+		goto bad_swap_unlock_inode;
 	}
 	swap_header = kmap(page);
 
 	maxpages = read_swap_header(p, swap_header, inode);
 	if (unlikely(!maxpages)) {
 		error = -EINVAL;
-		goto bad_swap;
+		goto bad_swap_unlock_inode;
 	}
 
 	/* OK, set up the swap map and apply the bad block list */
 	swap_map = vzalloc(maxpages);
 	if (!swap_map) {
 		error = -ENOMEM;
-		goto bad_swap;
+		goto bad_swap_unlock_inode;
 	}
 
 	if (bdi_cap_stable_pages_required(inode_to_bdi(inode)))
@@ -3216,7 +3217,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 					GFP_KERNEL);
 		if (!cluster_info) {
 			error = -ENOMEM;
-			goto bad_swap;
+			goto bad_swap_unlock_inode;
 		}
 
 		for (ci = 0; ci < nr_cluster; ci++)
@@ -3225,7 +3226,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 		p->percpu_cluster = alloc_percpu(struct percpu_cluster);
 		if (!p->percpu_cluster) {
 			error = -ENOMEM;
-			goto bad_swap;
+			goto bad_swap_unlock_inode;
 		}
 		for_each_possible_cpu(cpu) {
 			struct percpu_cluster *cluster;
@@ -3239,13 +3240,13 @@ SYSCALL_DEFINE2(swapon, const char __use
 
 	error = swap_cgroup_swapon(p->type, maxpages);
 	if (error)
-		goto bad_swap;
+		goto bad_swap_unlock_inode;
 
 	nr_extents = setup_swap_map_and_extents(p, swap_header, swap_map,
 		cluster_info, maxpages, &span);
 	if (unlikely(nr_extents < 0)) {
 		error = nr_extents;
-		goto bad_swap;
+		goto bad_swap_unlock_inode;
 	}
 	/* frontswap enabled? set up bit-per-page map for frontswap */
 	if (IS_ENABLED(CONFIG_FRONTSWAP))
@@ -3285,7 +3286,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 
 	error = init_swap_address_space(p->type, maxpages);
 	if (error)
-		goto bad_swap;
+		goto bad_swap_unlock_inode;
 
 	/*
 	 * Flush any pending IO and dirty mappings before we start using this
@@ -3295,7 +3296,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 	error = inode_drain_writes(inode);
 	if (error) {
 		inode->i_flags &= ~S_SWAPFILE;
-		goto bad_swap;
+		goto bad_swap_unlock_inode;
 	}
 
 	mutex_lock(&swapon_mutex);
@@ -3320,6 +3321,8 @@ SYSCALL_DEFINE2(swapon, const char __use
 
 	error = 0;
 	goto out;
+bad_swap_unlock_inode:
+	inode_unlock(inode);
 bad_swap:
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
@@ -3327,6 +3330,7 @@ bad_swap:
 		set_blocksize(p->bdev, p->old_block_size);
 		blkdev_put(p->bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 	}
+	inode = NULL;
 	destroy_swap_extents(p);
 	swap_cgroup_swapoff(p->type);
 	spin_lock(&swap_lock);
@@ -3338,13 +3342,8 @@ bad_swap:
 	kvfree(frontswap_map);
 	if (inced_nr_rotate_swap)
 		atomic_dec(&nr_rotate_swap);
-	if (swap_file) {
-		if (inode) {
-			inode_unlock(inode);
-			inode = NULL;
-		}
+	if (swap_file)
 		filp_close(swap_file, NULL);
-	}
 out:
 	if (page && !IS_ERR(page)) {
 		kunmap(page);
_

Patches currently in -mm which might be from naohiro.aota@wdc.com are

mm-swap-move-inode_lock-out-of-claim_swapfile.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + selftests-vm-add-missed-tests-in-run_vmtests.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (85 preceding siblings ...)
  2020-02-10  4:16 ` + mm-swap-move-inode_lock-out-of-claim_swapfile.patch " Andrew Morton
@ 2020-02-10  4:20 ` Andrew Morton
  2020-02-10  4:21 ` + mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch " Andrew Morton
                   ` (151 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  4:20 UTC (permalink / raw)
  To: mhiramat, mm-commits, shuah, sjpark, urezki


The patch titled
     Subject: selftests/vm: add missed tests in run_vmtests
has been added to the -mm tree.  Its filename is
     selftests-vm-add-missed-tests-in-run_vmtests.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/selftests-vm-add-missed-tests-in-run_vmtests.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/selftests-vm-add-missed-tests-in-run_vmtests.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: SeongJae Park <sjpark@amazon.de>
Subject: selftests/vm: add missed tests in run_vmtests

The commits introducing 'mlock-random-test'[1], 'map_fixed_noreplace'[2],
and 'thuge-gen'[3] did not add them to the 'run_vmtests' script, so the
'run_tests' command of kselftests doesn't run them.  This commit adds
them to the script.

'gup_benchmark' and 'transhuge-stress' are also not included in
'run_vmtests', but this commit does not add them because they are
performance measurements rather than pass/fail tests.

[1] commit 26b4224d9961 ("selftests: expanding more mlock selftest")
[2] commit 91cbacc34512 ("tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE")
[3] commit fcc1f2d5dd34 ("selftests: add a test program for variable huge page sizes in mmap/shmget")

Link: http://lkml.kernel.org/r/20200206085144.29126-1-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/run_vmtests |   33 +++++++++++++++++++++++
 1 file changed, 33 insertions(+)

--- a/tools/testing/selftests/vm/run_vmtests~selftests-vm-add-missed-tests-in-run_vmtests
+++ a/tools/testing/selftests/vm/run_vmtests
@@ -112,6 +112,17 @@ echo "NOTE: The above hugetlb tests prov
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
 
+echo "---------------------------"
+echo "running map_fixed_noreplace"
+echo "---------------------------"
+./map_fixed_noreplace
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "-------------------"
 echo "running userfaultfd"
 echo "-------------------"
@@ -186,6 +197,17 @@ else
 	echo "[PASS]"
 fi
 
+echo "-------------------------"
+echo "running mlock-random-test"
+echo "-------------------------"
+./mlock-random-test
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "--------------------"
 echo "running mlock2-tests"
 echo "--------------------"
@@ -193,6 +215,17 @@ echo "--------------------"
 if [ $? -ne 0 ]; then
 	echo "[FAIL]"
 	exitcode=1
+else
+	echo "[PASS]"
+fi
+
+echo "-----------------"
+echo "running thuge-gen"
+echo "-----------------"
+./thuge-gen
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
 else
 	echo "[PASS]"
 fi
_

Patches currently in -mm which might be from sjpark@amazon.de are

selftests-vm-add-missed-tests-in-run_vmtests.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (86 preceding siblings ...)
  2020-02-10  4:20 ` + selftests-vm-add-missed-tests-in-run_vmtests.patch " Andrew Morton
@ 2020-02-10  4:21 ` Andrew Morton
  2020-02-10  4:23 ` + mm-memcontrol-fix-a-data-race-in-scan-count.patch " Andrew Morton
                   ` (150 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  4:21 UTC (permalink / raw)
  To: hannes, laoar.shao, mhocko, mm-commits, tj, vdavydov.dev


The patch titled
     Subject: mm, memcg: fix build error around the usage of kmem_caches
has been added to the -mm tree.  Its filename is
     mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yafang Shao <laoar.shao@gmail.com>
Subject: mm, memcg: fix build error around the usage of kmem_caches

When I manually set MEMCG_KMEM to default n in init/Kconfig, the
errors below occur:

mm/slab_common.c: In function 'memcg_slab_start':
mm/slab_common.c:1530:30: error: 'struct mem_cgroup' has no member named
'kmem_caches'
  return seq_list_start(&memcg->kmem_caches, *pos);
                              ^
mm/slab_common.c: In function 'memcg_slab_next':
mm/slab_common.c:1537:32: error: 'struct mem_cgroup' has no member named
'kmem_caches'
  return seq_list_next(p, &memcg->kmem_caches, pos);
                                ^
mm/slab_common.c: In function 'memcg_slab_show':
mm/slab_common.c:1551:16: error: 'struct mem_cgroup' has no member named
'kmem_caches'
  if (p == memcg->kmem_caches.next)
                ^
  CC      arch/x86/xen/smp.o
mm/slab_common.c: In function 'memcg_slab_start':
mm/slab_common.c:1531:1: warning: control reaches end of non-void function
[-Wreturn-type]
 }
 ^
mm/slab_common.c: In function 'memcg_slab_next':
mm/slab_common.c:1538:1: warning: control reaches end of non-void function
[-Wreturn-type]
 }
 ^

That's because kmem_caches is defined only when CONFIG_MEMCG_KMEM is
set, while memcg_slab_start() uses it whether or not CONFIG_MEMCG_KMEM
is defined.

By the way, the reason I manually disabled CONFIG_MEMCG_KMEM was to
verify whether some other code change of mine is still stable when
CONFIG_MEMCG_KMEM is not set.  Unfortunately, the existing code has
already been broken since v4.11.
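
Reduced to a sketch, the bug and the fix look like this; the struct
and function names are illustrative:

#include <linux/list.h>

struct mem_cgroup_like {
#ifdef CONFIG_MEMCG_KMEM
        struct list_head kmem_caches;   /* only exists with MEMCG_KMEM */
#endif
};

/* Users of the conditional member must be guarded by the same config
 * symbol, otherwise builds without CONFIG_MEMCG_KMEM break exactly as
 * shown above.
 */
#ifdef CONFIG_MEMCG_KMEM
static struct list_head *first_cache(struct mem_cgroup_like *memcg)
{
        return memcg->kmem_caches.next;
}
#endif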

Link: http://lkml.kernel.org/r/1580970260-2045-1-git-send-email-laoar.shao@gmail.com
Fixes: bc2791f857e1 ("slab: link memcg kmem_caches on their associated memory cgroup")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c  |    3 ++-
 mm/slab_common.c |    2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c~mm-memcg-fix-build-error-around-the-usage-of-kmem_caches
+++ a/mm/memcontrol.c
@@ -4723,7 +4723,8 @@ static struct cftype mem_cgroup_legacy_f
 		.write = mem_cgroup_reset,
 		.read_u64 = mem_cgroup_read_u64,
 	},
-#if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
+#if defined(CONFIG_MEMCG_KMEM) && \
+	(defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG))
 	{
 		.name = "kmem.slabinfo",
 		.seq_start = memcg_slab_start,
--- a/mm/slab_common.c~mm-memcg-fix-build-error-around-the-usage-of-kmem_caches
+++ a/mm/slab_common.c
@@ -1521,7 +1521,7 @@ void dump_unreclaimable_slab(void)
 	mutex_unlock(&slab_mutex);
 }
 
-#if defined(CONFIG_MEMCG)
+#if defined(CONFIG_MEMCG_KMEM)
 void *memcg_slab_start(struct seq_file *m, loff_t *pos)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
_

Patches currently in -mm which might be from laoar.shao@gmail.com are

mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-memcontrol-fix-a-data-race-in-scan-count.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (87 preceding siblings ...)
  2020-02-10  4:21 ` + mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch " Andrew Morton
@ 2020-02-10  4:23 ` Andrew Morton
  2020-02-10  4:29 ` + get_maintainer-remove-uses-of-p-for-maintainer-name.patch " Andrew Morton
                   ` (149 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  4:23 UTC (permalink / raw)
  To: cai, hannes, mhocko, mm-commits, vdavydov.dev


The patch titled
     Subject: mm/memcontrol: fix a data race in scan count
has been added to the -mm tree.  Its filename is
     mm-memcontrol-fix-a-data-race-in-scan-count.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-memcontrol-fix-a-data-race-in-scan-count.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcontrol-fix-a-data-race-in-scan-count.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/memcontrol: fix a data race in scan count

struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size

 write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
  mem_cgroup_update_lru_size+0x11c/0x1d0
  mem_cgroup_update_lru_size at mm/memcontrol.c:1266
  isolate_lru_pages+0x6a9/0xf30
  shrink_active_list+0x123/0xcc0
  shrink_lruvec+0x8fd/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
  lruvec_lru_size+0xbb/0x270
  mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
  (inlined by) lruvec_lru_size at mm/vmscan.c:326
  shrink_lruvec+0x1d0/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_current+0xa6/0x120
  alloc_slab_page+0x3b1/0x540
  allocate_slab+0x70/0x660
  new_slab+0x46/0x70
  ___slab_alloc+0x4ad/0x7d0
  __slab_alloc+0x43/0x70
  kmem_cache_alloc+0x2c3/0x420
  getname_flags+0x4c/0x230
  getname+0x22/0x30
  do_sys_openat2+0x205/0x3b0
  do_sys_open+0x9a/0xf0
  __x64_sys_openat+0x62/0x80
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 95 PID: 50964 Comm: cc1 Tainted: G        W  O L    5.5.0-next-20200204+ #6
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

The write is under lru_lock, but the read is done locklessly.  The scan
count is used to determine how aggressively the anon and file LRU lists
should be scanned.  Load tearing could produce an inefficient heuristic,
so fix it by adding READ_ONCE() to the read.
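
As an illustration only (not part of the patch), the difference between
the racy load and the annotated one: "load tearing" means the compiler is
free to split or refetch a plain load, so a lockless reader can observe
an inconsistent value.

	/* racy: a plain load may be torn or refetched by the compiler */
	unsigned long size = mz->lru_zone_size[zone_idx][lru];

	/* fixed: READ_ONCE() forces a single, whole load */
	unsigned long size = READ_ONCE(mz->lru_zone_size[zone_idx][lru]);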

Link: http://lkml.kernel.org/r/20200206034945.2481-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/memcontrol.h~mm-memcontrol-fix-a-data-race-in-scan-count
+++ a/include/linux/memcontrol.h
@@ -533,7 +533,7 @@ unsigned long mem_cgroup_get_zone_lru_si
 	struct mem_cgroup_per_node *mz;
 
 	mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	return mz->lru_zone_size[zone_idx][lru];
+	return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
 }
 
 void mem_cgroup_handle_over_high(void);
_

Patches currently in -mm which might be from cai@lca.pw are

mm-swapfile-fix-and-annotate-various-data-races.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

* + get_maintainer-remove-uses-of-p-for-maintainer-name.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (88 preceding siblings ...)
  2020-02-10  4:23 ` + mm-memcontrol-fix-a-data-race-in-scan-count.patch " Andrew Morton
@ 2020-02-10  4:29 ` Andrew Morton
  2020-02-10  4:37 ` + scripts-get_maintainerpl-deprioritize-old-fixes-addresses.patch " Andrew Morton
                   ` (148 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  4:29 UTC (permalink / raw)
  To: corbet, dan.j.william, joe, mm-commits


The patch titled
     Subject: get_maintainer: Remove uses of P: for maintainer name
has been added to the -mm tree.  Its filename is
     get_maintainer-remove-uses-of-p-for-maintainer-name.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/get_maintainer-remove-uses-of-p-for-maintainer-name.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/get_maintainer-remove-uses-of-p-for-maintainer-name.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Joe Perches <joe@perches.com>
Subject: get_maintainer: Remove uses of P: for maintainer name

commit 1ca84ed6425f ("MAINTAINERS: Reclaim the P: tag for Maintainer Entry
Profile") changed the meaning of the "P:" tag from "Person" to "Profile
(i.e. special subsystem coding styles and characteristics)".

Change how get_maintainer.pl parses the "P:" tag to match.

Link: http://lkml.kernel.org/r/ca53823fc5d25c0be32ad937d0207a0589c08643.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Dan Williams <dan.j.william@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/get_maintainer.pl |   24 ------------------------
 1 file changed, 24 deletions(-)

--- a/scripts/get_maintainer.pl~get_maintainer-remove-uses-of-p-for-maintainer-name
+++ a/scripts/get_maintainer.pl
@@ -1341,35 +1341,11 @@ sub add_categories {
 		    }
 		}
 	    } elsif ($ptype eq "M") {
-		my ($name, $address) = parse_email($pvalue);
-		if ($name eq "") {
-		    if ($i > 0) {
-			my $tv = $typevalue[$i - 1];
-			if ($tv =~ m/^([A-Z]):\s*(.*)/) {
-			    if ($1 eq "P") {
-				$name = $2;
-				$pvalue = format_email($name, $address, $email_usename);
-			    }
-			}
-		    }
-		}
 		if ($email_maintainer) {
 		    my $role = get_maintainer_role($i);
 		    push_email_addresses($pvalue, $role);
 		}
 	    } elsif ($ptype eq "R") {
-		my ($name, $address) = parse_email($pvalue);
-		if ($name eq "") {
-		    if ($i > 0) {
-			my $tv = $typevalue[$i - 1];
-			if ($tv =~ m/^([A-Z]):\s*(.*)/) {
-			    if ($1 eq "P") {
-				$name = $2;
-				$pvalue = format_email($name, $address, $email_usename);
-			    }
-			}
-		    }
-		}
 		if ($email_reviewer) {
 		    my $subsystem = get_subsystem_name($i);
 		    push_email_addresses($pvalue, "reviewer:$subsystem");
_

Patches currently in -mm which might be from joe@perches.com are

get_maintainer-remove-uses-of-p-for-maintainer-name.patch
string-add-stracpy-and-stracpy_pad-mechanisms.patch

* + scripts-get_maintainerpl-deprioritize-old-fixes-addresses.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (89 preceding siblings ...)
  2020-02-10  4:29 ` + get_maintainer-remove-uses-of-p-for-maintainer-name.patch " Andrew Morton
@ 2020-02-10  4:37 ` Andrew Morton
  2020-02-11  5:33 ` + mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch " Andrew Morton
                   ` (147 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-10  4:37 UTC (permalink / raw)
  To: agross, bjorn.andersson, dan.carpenter, dianders, gregkh, joe,
	keescook, mm-commits, sboyd


The patch titled
     Subject: scripts/get_maintainer.pl: deprioritize old Fixes: addresses
has been added to the -mm tree.  Its filename is
     scripts-get_maintainerpl-deprioritize-old-fixes-addresses.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/scripts-get_maintainerpl-deprioritize-old-fixes-addresses.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/scripts-get_maintainerpl-deprioritize-old-fixes-addresses.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Douglas Anderson <dianders@chromium.org>
Subject: scripts/get_maintainer.pl: deprioritize old Fixes: addresses

Recently, I found that get_maintainer was causing me to send emails to the
old addresses for maintainers.  Since I usually just trust the output of
get_maintainer to know the right email address, I didn't even look
carefully and fired off two patch series that went to the wrong place. 
Oops.

The problem was introduced recently when get_maintainer started adding
signatures from Fixes: tags.  Those email addresses were added too early
in the process of compiling the list of recipients.  Addresses added to
the list earlier are considered more canonical, so when the maintainer
entries were added later we ended up deduplicating to the old address.

Here are two examples using mainline commits (to make it easier to
replicate) for the two maintainers that I messed up recently:

$ git format-patch d8549bcd0529~..d8549bcd0529
$ ./scripts/get_maintainer.pl 0001-clk-Add-clk_hw*.patch | grep Boyd
Stephen Boyd <sboyd@codeaurora.org>...

$ git format-patch 6d1238aa3395~..6d1238aa3395
$ ./scripts/get_maintainer.pl 0001-arm64-dts-qcom-qcs404*.patch | grep Andy
Andy Gross <andy.gross@linaro.org>

Let's move the adding of addresses from Fixes: tags to the end, since
those email addresses are much more likely to be out of date.

After this patch, the above examples produce the correct addresses.

Link: http://lkml.kernel.org/r/20200127095001.1.I41fba9f33590bfd92cd01960161d8384268c6569@changeid
Fixes: 2f5bd343694e ("scripts/get_maintainer.pl: add signatures from Fixes: <badcommit> lines in commit message")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Joe Perches <joe@perches.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Andy Gross <agross@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/get_maintainer.pl |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/scripts/get_maintainer.pl~scripts-get_maintainerpl-deprioritize-old-fixes-addresses
+++ a/scripts/get_maintainer.pl
@@ -932,10 +932,6 @@ sub get_maintainers {
 	}
     }
 
-    foreach my $fix (@fixes) {
-	vcs_add_commit_signers($fix, "blamed_fixes");
-    }

* + mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (90 preceding siblings ...)
  2020-02-10  4:37 ` + scripts-get_maintainerpl-deprioritize-old-fixes-addresses.patch " Andrew Morton
@ 2020-02-11  5:33 ` Andrew Morton
  2020-02-11  5:34 ` + mm-vmpressure-use-mem_cgroup_is_root-api.patch " Andrew Morton
                   ` (146 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:33 UTC (permalink / raw)
  To: akpm, mm-commits, yang.shi


The patch titled
     Subject: mm: vmpressure: don't need call kfree if kstrndup fails
has been added to the -mm tree.  Its filename is
     mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yang Shi <yang.shi@linux.alibaba.com>
Subject: mm: vmpressure: don't need call kfree if kstrndup fails

When kstrndup() fails (returns NULL), no memory has been allocated by
kmalloc(), so there is no need to call kfree().

Link: http://lkml.kernel.org/r/1581398649-125989-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmpressure.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/mm/vmpressure.c~mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails
+++ a/mm/vmpressure.c
@@ -371,10 +371,8 @@ int vmpressure_register_event(struct mem
 	int ret = 0;
 
 	spec_orig = spec = kstrndup(args, MAX_VMPRESSURE_ARGS_LEN, GFP_KERNEL);
-	if (!spec) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!spec)
+		return -ENOMEM;
 
 	/* Find required level */
 	token = strsep(&spec, ",");
_

Patches currently in -mm which might be from yang.shi@linux.alibaba.com are

mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch

* + mm-vmpressure-use-mem_cgroup_is_root-api.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (91 preceding siblings ...)
  2020-02-11  5:33 ` + mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch " Andrew Morton
@ 2020-02-11  5:34 ` Andrew Morton
  2020-02-11  5:39 ` + mm-filemap-fix-a-data-race-in-filemap_fault.patch " Andrew Morton
                   ` (145 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:34 UTC (permalink / raw)
  To: akpm, mm-commits, yang.shi


The patch titled
     Subject: mm: vmpressure: use mem_cgroup_is_root API
has been added to the -mm tree.  Its filename is
     mm-vmpressure-use-mem_cgroup_is_root-api.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vmpressure-use-mem_cgroup_is_root-api.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmpressure-use-mem_cgroup_is_root-api.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yang Shi <yang.shi@linux.alibaba.com>
Subject: mm: vmpressure: use mem_cgroup_is_root API

Use the mem_cgroup_is_root() API to check whether memcg is the root memcg,
instead of open-coding the comparison.

Link: http://lkml.kernel.org/r/1581398649-125989-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmpressure.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmpressure.c~mm-vmpressure-use-mem_cgroup_is_root-api
+++ a/mm/vmpressure.c
@@ -280,7 +280,7 @@ void vmpressure(gfp_t gfp, struct mem_cg
 		enum vmpressure_levels level;
 
 		/* For now, no users for root-level efficiency */
-		if (!memcg || memcg == root_mem_cgroup)
+		if (!memcg || mem_cgroup_is_root(memcg))
 			return;
 
 		spin_lock(&vmpr->sr_lock);
_

Patches currently in -mm which might be from yang.shi@linux.alibaba.com are

mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch
mm-vmpressure-use-mem_cgroup_is_root-api.patch

* + mm-filemap-fix-a-data-race-in-filemap_fault.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (92 preceding siblings ...)
  2020-02-11  5:34 ` + mm-vmpressure-use-mem_cgroup_is_root-api.patch " Andrew Morton
@ 2020-02-11  5:39 ` Andrew Morton
  2020-02-11  5:50 ` + mm-gup-split-get_user_pages_remote-into-two-routines.patch " Andrew Morton
                   ` (144 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:39 UTC (permalink / raw)
  To: cai, elver, kirill, mm-commits, willy


The patch titled
     Subject: mm/filemap.c: fix a data race in filemap_fault()
has been added to the -mm tree.  Its filename is
     mm-filemap-fix-a-data-race-in-filemap_fault.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-filemap-fix-a-data-race-in-filemap_fault.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-filemap-fix-a-data-race-in-filemap_fault.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Kirill A. Shutemov <kirill@shutemov.name>
Subject: mm/filemap.c: fix a data race in filemap_fault()

struct file_ra_state ra.mmap_miss could be accessed concurrently during
page faults as noticed by KCSAN,

 BUG: KCSAN: data-race in filemap_fault / filemap_map_pages

 write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
  filemap_fault+0x920/0xfc0
  do_sync_mmap_readahead at mm/filemap.c:2384
  (inlined by) filemap_fault at mm/filemap.c:2486
  __xfs_filemap_fault+0x112/0x3e0 [xfs]
  xfs_filemap_fault+0x74/0x90 [xfs]
  __do_fault+0x9e/0x220
  do_fault+0x4a0/0x920
  __handle_mm_fault+0xc69/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
  filemap_map_pages+0xc2e/0xd80
  filemap_map_pages at mm/filemap.c:2625
  do_fault+0x3da/0x920
  __handle_mm_fault+0xc69/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G        W    L 5.5.0-next-20200210+ #1
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

ra.mmap_miss feeds into the readahead decisions, so a data race could be
undesirable.  Both the read and the write happen only under the
non-exclusive mmap_sem, so two concurrent writers could even underflow the
counter.  Fix the underflow by operating on a local variable and
committing a single final store to ra.mmap_miss; a small inaccuracy in the
counter is acceptable.
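
A minimal sketch of the resulting pattern (taken from the diff below):
each fault path works on its own snapshot, so a racing writer can at worst
publish a slightly stale count, never an underflowed one.

	unsigned int mmap_miss = READ_ONCE(ra->mmap_miss);

	if (mmap_miss)
		WRITE_ONCE(ra->mmap_miss, --mmap_miss);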

Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Qian Cai <cai@lca.pw>
Tested-by: Qian Cai <cai@lca.pw>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |   20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

--- a/mm/filemap.c~mm-filemap-fix-a-data-race-in-filemap_fault
+++ a/mm/filemap.c
@@ -2365,6 +2365,7 @@ static struct file *do_sync_mmap_readahe
 	struct address_space *mapping = file->f_mapping;
 	struct file *fpin = NULL;
 	pgoff_t offset = vmf->pgoff;
+	unsigned int mmap_miss;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
@@ -2380,14 +2381,15 @@ static struct file *do_sync_mmap_readahe
 	}
 
 	/* Avoid banging the cache line if not needed */
-	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
-		ra->mmap_miss++;
+	mmap_miss = READ_ONCE(ra->mmap_miss);
+	if (mmap_miss < MMAP_LOTSAMISS * 10)
+		WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
 
 	/*
 	 * Do we miss much more than hit in this file? If so,
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
-	if (ra->mmap_miss > MMAP_LOTSAMISS)
+	if (mmap_miss > MMAP_LOTSAMISS)
 		return fpin;
 
 	/*
@@ -2413,13 +2415,15 @@ static struct file *do_async_mmap_readah
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
 	struct file *fpin = NULL;
+	unsigned int mmap_miss;
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (vmf->vma->vm_flags & VM_RAND_READ)
 		return fpin;
-	if (ra->mmap_miss > 0)
-		ra->mmap_miss--;
+	mmap_miss = READ_ONCE(ra->mmap_miss);
+	if (mmap_miss)
+		WRITE_ONCE(ra->mmap_miss, --mmap_miss);
 	if (PageReadahead(page)) {
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_readahead(mapping, ra, file,
@@ -2586,6 +2590,7 @@ void filemap_map_pages(struct vm_fault *
 	unsigned long max_idx;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct page *page;
+	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 
 	rcu_read_lock();
 	xas_for_each(&xas, page, end_pgoff) {
@@ -2622,8 +2627,8 @@ void filemap_map_pages(struct vm_fault *
 		if (page->index >= max_idx)
 			goto unlock;
 
-		if (file->f_ra.mmap_miss > 0)
-			file->f_ra.mmap_miss--;
+		if (mmap_miss > 0)
+			mmap_miss--;
 
 		vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		if (vmf->pte)
@@ -2643,6 +2648,7 @@ next:
 			break;
 	}
 	rcu_read_unlock();
+	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 }
 EXPORT_SYMBOL(filemap_map_pages);
 
_

Patches currently in -mm which might be from kirill@shutemov.name are

mm-filemap-fix-a-data-race-in-filemap_fault.patch

* + mm-gup-split-get_user_pages_remote-into-two-routines.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (93 preceding siblings ...)
  2020-02-11  5:39 ` + mm-filemap-fix-a-data-race-in-filemap_fault.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch " Andrew Morton
                   ` (143 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm/gup: split get_user_pages_remote() into two routines
has been added to the -mm tree.  Its filename is
     mm-gup-split-get_user_pages_remote-into-two-routines.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-gup-split-get_user_pages_remote-into-two-routines.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-split-get_user_pages_remote-into-two-routines.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: split get_user_pages_remote() into two routines

Patch series "mm/gup: track FOLL_PIN pages", v6.

This activates tracking of FOLL_PIN pages.  This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].

FOLL_PIN support is now in the main linux tree.  However, the patch to use
FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test
suite failure that involved (I think) page refcount overflows when huge
pages were used.

This patch series definitively solves that kind of overflow problem by
adding an exact pincount for compound pages (of order > 1), stored in the
3rd struct page of the compound page.  When available, that form of
pincounting is used instead of the GUP_PIN_COUNTING_BIAS approach.  Thanks
again to Jan Kara for that idea.
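
Roughly, and only as a sketch of the idea (the helper names below match
later patches in this series, but details may differ), pin accounting
selects between the two schemes like this:

	static void sketch_add_pin(struct page *head)
	{
		if (PageHead(head) && compound_order(head) > 1)
			/* exact count, stored in the 3rd struct page */
			atomic_inc(compound_pincount_ptr(head));
		else
			/* fuzzy count, overloaded onto _refcount */
			page_ref_add(head, GUP_PIN_COUNTING_BIAS);
	}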

Other interesting changes:

* dump_page(): added one or two new things to report for compound
  pages: the head refcount (for all compound pages), and the map_pincount
  (for compound pages of order > 1).

* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
  huge page refcount upper limit problems, and added notes about how it
  works now.  Also added a note about the dump_page() enhancements.

* Added some comments in gup.c and mm.h, to explain that there are two
  ways to count pinned pages: exact (for compound pages of order > 1) and
  fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).

============================================================
General notes about the tracking patch:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017.  :)

In contrast to earlier approaches, the page tracking can be incrementally
applied to the kernel call sites that, until now, have been simply calling
get_user_pages() ("gup").  In other words, opt-in by changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:
    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()

============================================================
Future steps:

* Convert more subsystems from get_user_pages() to pin_user_pages().
  The first probably needs to be bio/biovecs, because any filesystem
  testing is too difficult without those in place.

* Change VFS and filesystems to respond appropriately when encountering
  dma-pinned pages.

* Work with Ira and others to connect this all up with file system
  leases.

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/

[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/

[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/

[4] LWN kernel index: get_user_pages()
    https://lwn.net/Kernel/Index/#Memory_management-get_user_pages


This patch (of 12):

An upcoming patch requires reusing the implementation of
get_user_pages_remote().  Split up get_user_pages_remote() into an outer
routine that checks flags, and an implementation routine that will be
reused.  This makes subsequent changes much easier to understand.

There should be no change in behavior due to this patch.

Link: http://lkml.kernel.org/r/20200211001536.1027652-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   56 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 23 deletions(-)

--- a/mm/gup.c~mm-gup-split-get_user_pages_remote-into-two-routines
+++ a/mm/gup.c
@@ -1557,6 +1557,37 @@ static __always_inline long __gup_longte
 }
 #endif /* CONFIG_FS_DAX || CONFIG_CMA */
 
+#ifdef CONFIG_MMU
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * Parts of FOLL_LONGTERM behavior are incompatible with
+	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
+	 * vmas. However, this only comes up if locked is set, and there are
+	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
+	 * allow what we can.
+	 */
+	if (gup_flags & FOLL_LONGTERM) {
+		if (WARN_ON_ONCE(locked))
+			return -EINVAL;
+		/*
+		 * This will check the vmas (even if our vmas arg is NULL)
+		 * and return -ENOTSUPP if DAX isn't allowed in this case:
+		 */
+		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+					     vmas, gup_flags | FOLL_TOUCH |
+					     FOLL_REMOTE);
+	}
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked,
+				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+}
+
 /*
  * get_user_pages_remote() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -1619,7 +1650,6 @@ static __always_inline long __gup_longte
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
-#ifdef CONFIG_MMU
 long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
@@ -1632,28 +1662,8 @@ long get_user_pages_remote(struct task_s
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
-	/*
-	 * Parts of FOLL_LONGTERM behavior are incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas. However, this only comes up if locked is set, and there are
-	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
-	 * allow what we can.
-	 */
-	if (gup_flags & FOLL_LONGTERM) {
-		if (WARN_ON_ONCE(locked))
-			return -EINVAL;
-		/*
-		 * This will check the vmas (even if our vmas arg is NULL)
-		 * and return -ENOTSUPP if DAX isn't allowed in this case:
-		 */
-		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
-					     vmas, gup_flags | FOLL_TOUCH |
-					     FOLL_REMOTE);
-	}

* + mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (94 preceding siblings ...)
  2020-02-11  5:50 ` + mm-gup-split-get_user_pages_remote-into-two-routines.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-introduce-page_ref_sub_return.patch " Andrew Morton
                   ` (142 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm/gup: pass a flags arg to __gup_device_* functions
has been added to the -mm tree.  Its filename is
     mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: pass a flags arg to __gup_device_* functions

A subsequent patch requires access to gup flags, so pass the flags
argument through to the __gup_device_* functions.

Also placate checkpatch.pl by shortening a nearby line.

Link: http://lkml.kernel.org/r/20200211001536.1027652-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

--- a/mm/gup.c~mm-gup-pass-a-flags-arg-to-__gup_device_-functions
+++ a/mm/gup.c
@@ -1963,7 +1963,8 @@ static int gup_pte_range(pmd_t pmd, unsi
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+			     unsigned long end, unsigned int flags,
+			     struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
@@ -1989,13 +1990,14 @@ static int __gup_device_huge(unsigned lo
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -2006,13 +2008,14 @@ static int __gup_device_huge_pmd(pmd_t o
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -2023,14 +2026,16 @@ static int __gup_device_huge_pud(pud_t o
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
@@ -2146,7 +2151,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
+					     pages, nr);
 	}
 
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
@@ -2167,7 +2173,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages, int *nr)
+			unsigned long end, unsigned int flags,
+			struct page **pages, int *nr)
 {
 	struct page *head, *page;
 	int refs;
@@ -2178,7 +2185,8 @@ static int gup_huge_pud(pud_t orig, pud_
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
+					     pages, nr);
 	}
 
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

* + mm-introduce-page_ref_sub_return.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (95 preceding siblings ...)
  2020-02-11  5:50 ` + mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-gup-pass-gup-flags-to-two-more-routines.patch " Andrew Morton
                   ` (141 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm: introduce page_ref_sub_return()
has been added to the -mm tree.  Its filename is
     mm-introduce-page_ref_sub_return.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-introduce-page_ref_sub_return.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-introduce-page_ref_sub_return.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm: introduce page_ref_sub_return()

An upcoming patch requires subtracting a large chunk of refcounts from a
page and checking what the resulting refcount is.  This is a little
different from the usual "check for zero refcount" that many of the page
ref functions already do.  However, it is similar to a few other routines
that (like this one) are generally useful for things such as 1-based
refcounting.

Add page_ref_sub_return(), which subtracts a chunk of refcounts atomically
and returns an atomic snapshot of the result.
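
A hypothetical caller, just to show the intended shape (the wrapper below
is illustrative, not part of the patch):

	static void sketch_unpin(struct page *page)
	{
		int refs = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);

		/* the returned snapshot enables sanity checks like this: */
		VM_BUG_ON_PAGE(refs < 0, page);
	}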

Link: http://lkml.kernel.org/r/20200211001536.1027652-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page_ref.h |    9 +++++++++
 1 file changed, 9 insertions(+)

--- a/include/linux/page_ref.h~mm-introduce-page_ref_sub_return
+++ a/include/linux/page_ref.h
@@ -102,6 +102,15 @@ static inline void page_ref_sub(struct p
 		__page_ref_mod(page, -nr);
 }
 
+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+	int ret = atomic_sub_return(nr, &page->_refcount);
+
+	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return))
+		__page_ref_mod_and_return(page, -nr, ret);
+	return ret;
+}
+
 static inline void page_ref_inc(struct page *page)
 {
 	atomic_inc(&page->_refcount);
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

* + mm-gup-pass-gup-flags-to-two-more-routines.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (96 preceding siblings ...)
  2020-02-11  5:50 ` + mm-introduce-page_ref_sub_return.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-gup-require-foll_get-for-get_user_pages_fast.patch " Andrew Morton
                   ` (140 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm/gup: pass gup flags to two more routines
has been added to the -mm tree.  Its filename is
     mm-gup-pass-gup-flags-to-two-more-routines.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-gup-pass-gup-flags-to-two-more-routines.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-pass-gup-flags-to-two-more-routines.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: pass gup flags to two more routines

In preparation for an upcoming patch, pass the gup flags argument to two
more routines: put_compound_head() and undo_dev_pagemap().

Link: http://lkml.kernel.org/r/20200211001536.1027652-5-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

--- a/mm/gup.c~mm-gup-pass-gup-flags-to-two-more-routines
+++ a/mm/gup.c
@@ -1870,6 +1870,7 @@ static inline pte_t gup_get_pte(pte_t *p
 #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
 
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
+					    unsigned int flags,
 					    struct page **pages)
 {
 	while ((*nr) - nr_start) {
@@ -1909,7 +1910,7 @@ static int gup_pte_range(pmd_t pmd, unsi
 
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, pages);
+				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
 		} else if (pte_special(pte))
@@ -1974,7 +1975,7 @@ static int __gup_device_huge(unsigned lo
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, pages);
+			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
 		SetPageReferenced(page);
@@ -2001,7 +2002,7 @@ static int __gup_device_huge_pmd(pmd_t o
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
@@ -2019,7 +2020,7 @@ static int __gup_device_huge_pud(pud_t o
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
@@ -2053,7 +2054,7 @@ static int record_subpages(struct page *
 	return nr;
 }
 
-static void put_compound_head(struct page *page, int refs)
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
 	/*
@@ -2103,7 +2104,7 @@ static int gup_hugepte(pte_t *ptep, unsi
 		return 0;
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2163,7 +2164,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2197,7 +2198,7 @@ static int gup_huge_pud(pud_t orig, pud_
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2226,7 +2227,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_
 		return 0;
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

* + mm-gup-require-foll_get-for-get_user_pages_fast.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (97 preceding siblings ...)
  2020-02-11  5:50 ` + mm-gup-pass-gup-flags-to-two-more-routines.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-gup-track-foll_pin-pages.patch " Andrew Morton
                   ` (139 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm/gup: require FOLL_GET for get_user_pages_fast()
has been added to the -mm tree.  Its filename is
     mm-gup-require-foll_get-for-get_user_pages_fast.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-gup-require-foll_get-for-get_user_pages_fast.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-require-foll_get-for-get_user_pages_fast.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: require FOLL_GET for get_user_pages_fast()

Internal to mm/gup.c, require that get_user_pages_fast() and
__get_user_pages_fast() identify themselves by setting FOLL_GET.  This is
required in order to be able to make decisions based on "FOLL_PIN, or
FOLL_GET, or both, or neither is set" in upcoming patches.
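
Schematically (illustrative only, not code from this patch), the kind of
decision this enables is:

	static void sketch_release(struct page *page, unsigned int flags)
	{
		if (flags & FOLL_PIN)
			unpin_user_page(page);	/* pin_user_pages*() caller */
		else if (flags & FOLL_GET)
			put_page(page);		/* classic gup caller */
		/* neither flag set: no elevated reference was taken */
	}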

Link: http://lkml.kernel.org/r/20200211001536.1027652-6-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

--- a/mm/gup.c~mm-gup-require-foll_get-for-get_user_pages_fast
+++ a/mm/gup.c
@@ -2390,6 +2390,14 @@ int __get_user_pages_fast(unsigned long
 	unsigned long len, end;
 	unsigned long flags;
 	int nr = 0;
+	/*
+	 * Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
+	 * because gup fast is always a "pin with a +1 page refcount" request.
+	 */
+	unsigned int gup_flags = FOLL_GET;
+
+	if (write)
+		gup_flags |= FOLL_WRITE;
 
 	start = untagged_addr(start) & PAGE_MASK;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
@@ -2415,7 +2423,7 @@ int __get_user_pages_fast(unsigned long
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
 		local_irq_save(flags);
-		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
+		gup_pgd_range(start, end, gup_flags, pages, &nr);
 		local_irq_restore(flags);
 	}
 
@@ -2454,7 +2462,7 @@ static int internal_get_user_pages_fast(
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-				       FOLL_FORCE | FOLL_PIN)))
+				       FOLL_FORCE | FOLL_PIN | FOLL_GET)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2521,6 +2529,13 @@ int get_user_pages_fast(unsigned long st
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
+	/*
+	 * The caller may or may not have explicitly set FOLL_GET; either way is
+	 * OK. However, internally (within mm/gup.c), gup fast variants must set
+	 * FOLL_GET, because gup fast is always a "pin with a +1 page refcount"
+	 * request.
+	 */
+	gup_flags |= FOLL_GET;
 	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

* + mm-gup-track-foll_pin-pages.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (98 preceding siblings ...)
  2020-02-11  5:50 ` + mm-gup-require-foll_get-for-get_user_pages_fast.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch " Andrew Morton
                   ` (138 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm/gup: track FOLL_PIN pages
has been added to the -mm tree.  Its filename is
     mm-gup-track-foll_pin-pages.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-gup-track-foll_pin-pages.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-track-foll_pin-pages.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: track FOLL_PIN pages

Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
implemented by overloading page->_refcount: each pin adds
GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK). 
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.

As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().

Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section.  (That limitation will be
removed in a following patch.)

The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:

   bool page_maybe_dma_pinned(struct page *page);

What to do in response to encountering such a page is left to later
patchsets.  There is discussion about this in [1], [2], [3], and [4].
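
For flavor, a simplified sketch of the check (the real version in the
patch below is more careful, e.g. about refcount overflow):

	static inline bool sketch_page_maybe_dma_pinned(struct page *page)
	{
		/* false positives are possible and acceptable: a high
		 * enough ordinary refcount also crosses the threshold */
		return page_ref_count(compound_head(page)) >=
			GUP_PIN_COUNTING_BIAS;
	}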

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
    https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/pin_user_pages.rst |    6 
 include/linux/mm.h                        |   84 +++++-
 mm/gup.c                                  |  254 ++++++++++++++++----
 mm/huge_memory.c                          |   29 +-
 mm/hugetlb.c                              |   54 ++--
 5 files changed, 335 insertions(+), 92 deletions(-)

--- a/Documentation/core-api/pin_user_pages.rst~mm-gup-track-foll_pin-pages
+++ a/Documentation/core-api/pin_user_pages.rst
@@ -173,8 +173,8 @@ CASE 4: Pinning for struct page manipula
 -------------------------------------------------
 Here, normal GUP calls are sufficient, so neither flag needs to be set.
 
-page_dma_pinned(): the whole point of pinning
-=============================================
+page_maybe_dma_pinned(): the whole point of pinning
+===================================================
 
 The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
 to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
@@ -186,7 +186,7 @@ and debates (see the References at the e
 here: fill in the details once that's worked out. Meanwhile, it's safe to say
 that having this available: ::
 
-        static inline bool page_dma_pinned(struct page *page)
+        static inline bool page_maybe_dma_pinned(struct page *page)
 
 ...is a prerequisite to solving the long-running gup+DMA problem.
 
--- a/include/linux/mm.h~mm-gup-track-foll_pin-pages
+++ a/include/linux/mm.h
@@ -1001,6 +1001,8 @@ static inline void get_page(struct page
 	page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
 	page = compound_head(page);
@@ -1029,29 +1031,79 @@ static inline void put_page(struct page
 		__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:            pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many pin_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
+ *
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
+ *
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special handling.
- *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that race to set up page
+ * table entries.
  */
-static inline void unpin_user_page(struct page *page)
-{
-	put_page(page);
-}
+#define GUP_PIN_COUNTING_BIAS (1U << 10)
 
+void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty);
-
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
+/**
+ * page_maybe_dma_pinned() - report if a page is pinned for DMA.
+ *
+ * This function checks if a page has been pinned via a call to
+ * pin_user_pages*().
+ *
+ * For non-huge pages, the return value is partially fuzzy: false is not fuzzy,
+ * because it means "definitely not pinned for DMA", but true means "probably
+ * pinned for DMA, but possibly a false positive due to having at least
+ * GUP_PIN_COUNTING_BIAS worth of normal page references".
+ *
+ * False positives are OK, because: a) it's unlikely for a page to get that many
+ * refcounts, and b) all the callers of this routine are expected to be able to
+ * deal gracefully with a false positive.
+ *
+ * For more information, please see Documentation/vm/pin_user_pages.rst.
+ *
+ * @page:	pointer to page to be queried.
+ * @Return:	True, if it is likely that the page has been "dma-pinned".
+ *		False, if the page is definitely not dma-pinned.
+ */
+static inline bool page_maybe_dma_pinned(struct page *page)
+{
+	/*
+	 * page_ref_count() is signed. If that refcount overflows, then
+	 * page_ref_count() returns a negative value, and callers will avoid
+	 * further incrementing the refcount.
+	 *
+	 * Here, for that overflow case, use the signed bit to count a little
+	 * bit higher via unsigned math, and thus still get an accurate result.
+	 */
+	return ((unsigned int)page_ref_count(compound_head(page))) >=
+		GUP_PIN_COUNTING_BIAS;
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
--- a/mm/gup.c~mm-gup-track-foll_pin-pages
+++ a/mm/gup.c
@@ -44,6 +44,135 @@ static inline struct page *try_get_compo
 	return head;
 }
 
+/*
+ * try_grab_compound_head() - attempt to elevate a page's refcount, by a
+ * flags-dependent amount.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount."
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
+ * same time. (That's true throughout the get_user_pages*() and
+ * pin_user_pages*() APIs.) Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: head page (with refcount appropriately incremented) for success, or
+ * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's
+ * considered failure, and furthermore, a likely bug in the caller, so a warning
+ * is also emitted.
+ */
+static __maybe_unused struct page *try_grab_compound_head(struct page *page,
+							  int refs,
+							  unsigned int flags)
+{
+	if (flags & FOLL_GET)
+		return try_get_compound_head(page, refs);
+	else if (flags & FOLL_PIN) {
+		refs *= GUP_PIN_COUNTING_BIAS;
+		return try_get_compound_head(page, refs);
+	}
+
+	WARN_ON_ONCE(1);
+	return NULL;
+}
+
+/**
+ * try_grab_page() - elevate a page's refcount by a flag-dependent amount
+ *
+ * This might not do anything at all, depending on the flags argument.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount."
+ *
+ * @page:    pointer to page to be grabbed
+ * @flags:   gup flags: these are the FOLL_* flag values.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
+ * time. Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: true for success, or if no action was required (if neither FOLL_PIN
+ * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
+ * FOLL_PIN was set, but the page could not be grabbed.
+ */
+bool __must_check try_grab_page(struct page *page, unsigned int flags)
+{
+	WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN));
+
+	if (flags & FOLL_GET)
+		return try_get_page(page);
+	else if (flags & FOLL_PIN) {
+		page = compound_head(page);
+
+		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+			return false;
+
+		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+	}
+
+	return true;
+}
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	int count;
+
+	if (!page_is_devmap_managed(page))
+		return false;
+
+	count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+	/*
+	 * devmap page refcounts are 1-based, rather than 0-based: if
+	 * refcount is 1, then the page is free and the refcount is
+	 * stable because nobody holds a reference on the page.
+	 */
+	if (count == 1)
+		free_devmap_managed_page(page);
+	else if (!count)
+		__put_page(page);
+
+	return true;
+}
+#else
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * unpin_user_page() - release a dma-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
+ */
+void unpin_user_page(struct page *page)
+{
+	page = compound_head(page);
+
+	/*
+	 * For devmap managed pages we need to catch refcount transition from
+	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
+	 * page is free and we need to inform the device driver through
+	 * callback. See include/linux/memremap.h and HMM for details.
+	 */
+	if (__unpin_devmap_managed_user_page(page))
+		return;
+
+	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+		__put_page(page);
+}
+EXPORT_SYMBOL(unpin_user_page);
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -230,10 +359,11 @@ retry:
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
-		 * Only return device mapping pages in the FOLL_GET case since
-		 * they are only valid while holding the pgmap reference.
+		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
+		 * case since they are only valid while holding the pgmap
+		 * reference.
 		 */
 		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
 		if (*pgmap)
@@ -271,11 +401,10 @@ retry:
 		goto retry;
 	}
 
-	if (flags & FOLL_GET) {
-		if (unlikely(!try_get_page(page))) {
-			page = ERR_PTR(-ENOMEM);
-			goto out;
-		}
+	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
+	if (unlikely(!try_grab_page(page, flags))) {
+		page = ERR_PTR(-ENOMEM);
+		goto out;
 	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
@@ -537,7 +666,7 @@ static struct page *follow_page_mask(str
 	/* make this handle hugepd */
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
-		BUG_ON(flags & FOLL_GET);
+		WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
 		return page;
 	}
 
@@ -1675,6 +1804,15 @@ long get_user_pages_remote(struct task_s
 {
 	return 0;
 }
+
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
+{
+	return 0;
+}
 #endif /* !CONFIG_MMU */
 
 /*
@@ -1877,7 +2015,10 @@ static void __maybe_unused undo_dev_page
 		struct page *page = pages[--(*nr)];
 
 		ClearPageReferenced(page);
-		put_page(page);
+		if (flags & FOLL_PIN)
+			unpin_user_page(page);
+		else
+			put_page(page);
 	}
 }
 
@@ -1919,7 +2060,7 @@ static int gup_pte_range(pmd_t pmd, unsi
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		head = try_get_compound_head(page, 1);
+		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
 
@@ -1980,7 +2121,10 @@ static int __gup_device_huge(unsigned lo
 		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		get_page(page);
+		if (unlikely(!try_grab_page(page, flags))) {
+			undo_dev_pagemap(nr, nr_start, flags, pages);
+			return 0;
+		}
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);
@@ -2056,6 +2200,9 @@ static int record_subpages(struct page *
 
 static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
+	if (flags & FOLL_PIN)
+		refs *= GUP_PIN_COUNTING_BIAS;
+
 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
 	/*
 	 * Calling put_page() for each ref is unnecessarily slow. Only the last
@@ -2099,7 +2246,7 @@ static int gup_hugepte(pte_t *ptep, unsi
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(head, refs);
+	head = try_grab_compound_head(head, refs, flags);
 	if (!head)
 		return 0;
 
@@ -2159,7 +2306,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pmd_page(orig), refs);
+	head = try_grab_compound_head(pmd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2193,7 +2340,7 @@ static int gup_huge_pud(pud_t orig, pud_
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pud_page(orig), refs);
+	head = try_grab_compound_head(pud_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2222,7 +2369,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pgd_page(orig), refs);
+	head = try_grab_compound_head(pgd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2505,11 +2652,11 @@ static int internal_get_user_pages_fast(
 
 /**
  * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
  *
  * Attempt to pin user pages in memory without taking mm->mmap_sem.
  * If not successful, it will fall back to taking the lock and
@@ -2543,9 +2690,12 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 /**
  * pin_user_pages_fast() - pin user pages in memory without taking locks
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_fast().
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
+ * get_user_pages_fast() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2553,21 +2703,24 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
 /**
  * pin_user_pages_remote() - pin pages of a remote process (task != current)
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_remote().
+ * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
+ * get_user_pages_remote() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2577,22 +2730,24 @@ long pin_user_pages_remote(struct task_s
 			   unsigned int gup_flags, struct page **pages,
 			   struct vm_area_struct **vmas, int *locked)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
-				     vmas, locked);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(pin_user_pages_remote);
 
 /**
  * pin_user_pages() - pin user pages in memory for use by other devices
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages().
+ * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
+ * FOLL_PIN is set.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2601,11 +2756,12 @@ long pin_user_pages(unsigned long start,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages);
--- a/mm/huge_memory.c~mm-gup-track-foll_pin-pages
+++ a/mm/huge_memory.c
@@ -958,6 +958,11 @@ struct page *follow_devmap_pmd(struct vm
 	 */
 	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		return NULL;
 
@@ -973,7 +978,7 @@ struct page *follow_devmap_pmd(struct vm
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
@@ -981,7 +986,8 @@ struct page *follow_devmap_pmd(struct vm
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1101,6 +1107,11 @@ struct page *follow_devmap_pud(struct vm
 	if (flags & FOLL_WRITE && !pud_write(*pud))
 		return NULL;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (pud_present(*pud) && pud_devmap(*pud))
 		/* pass */;
 	else
@@ -1112,8 +1123,10 @@ struct page *follow_devmap_pud(struct vm
 	/*
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
+	 *
+	 * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
@@ -1121,7 +1134,8 @@ struct page *follow_devmap_pud(struct vm
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1497,8 +1511,13 @@ struct page *follow_trans_huge_pmd(struc
 
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if (!try_grab_page(page, flags))
+		return ERR_PTR(-ENOMEM);
+
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd, flags);
+
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * We don't mlock() pte-mapped THPs. This way we can avoid
@@ -1535,8 +1554,6 @@ struct page *follow_trans_huge_pmd(struc
 skip_mlock:
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-	if (flags & FOLL_GET)
-		get_page(page);
 
 out:
 	return page;
--- a/mm/hugetlb.c~mm-gup-track-foll_pin-pages
+++ a/mm/hugetlb.c
@@ -4376,19 +4376,6 @@ long follow_hugetlb_page(struct mm_struc
 		page = pte_page(huge_ptep_get(pte));
 
 		/*
-		 * Instead of doing 'try_get_page()' below in the same_page
-		 * loop, just check the count once here.
-		 */
-		if (unlikely(page_count(page) <= 0)) {
-			if (pages) {
-				spin_unlock(ptl);
-				remainder = 0;
-				err = -ENOMEM;
-				break;
-			}
-		}
-
-		/*
 		 * If subpage information not requested, update counters
 		 * and skip the same_page loop below.
 		 */
@@ -4405,7 +4392,22 @@ long follow_hugetlb_page(struct mm_struc
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page(pages[i]);
+			/*
+			 * try_grab_page() should always succeed here, because:
+			 * a) we hold the ptl lock, and b) we've just checked
+			 * that the huge page is present in the page tables. If
+			 * the huge page is present, then the tail pages must
+			 * also be present. The ptl prevents the head page and
+			 * tail pages from being rearranged in any way. So this
+			 * page must be available at this point, unless the page
+			 * refcount overflowed:
+			 */
+			if (WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
 		}
 
 		if (vmas)
@@ -4965,6 +4967,12 @@ follow_huge_pmd(struct mm_struct *mm, un
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	pte_t pte;
+
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 retry:
 	ptl = pmd_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -4977,8 +4985,18 @@ retry:
 	pte = huge_ptep_get((pte_t *)pmd);
 	if (pte_present(pte)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
-		if (flags & FOLL_GET)
-			get_page(page);
+		/*
+		 * try_grab_page() should always succeed here, because: a) we
+		 * hold the pmd (ptl) lock, and b) we've just checked that the
+		 * huge pmd (head) page is present in the page tables. The ptl
+		 * prevents the head page and tail pages from being rearranged
+		 * in any way. So this page must be available at this point,
+		 * unless the page refcount overflowed:
+		 */
+		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+			page = NULL;
+			goto out;
+		}
 	} else {
 		if (is_hugetlb_entry_migration(pte)) {
 			spin_unlock(ptl);
@@ -4999,7 +5017,7 @@ struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
@@ -5008,7 +5026,7 @@ follow_huge_pud(struct mm_struct *mm, un
 struct page * __weak
 follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (99 preceding siblings ...)
  2020-02-11  5:50 ` + mm-gup-track-foll_pin-pages.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch " Andrew Morton
                   ` (137 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
has been added to the -mm tree.  Its filename is
     mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages

For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily: each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024), which limits the number
of huge pages that can be pinned.

This patch removes that limitation by using an exact form of pin counting
for compound pages of order > 1.  The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order-1
compound pages have only two pages, so the scheme cannot work there.

A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).
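
A sketch of where the exact count lives (simplified from the patch
below); the counter occupies the 3rd struct page, which is why at least
an order-2 (4-page) compound page is required:

    static inline atomic_t *compound_pincount_ptr(struct page *page)
    {
            return &page[2].hpage_pinned_refcount;  /* 3rd struct page */
    }

    /* Pinning/unpinning the head page of an order > 1 compound page is
     * then exact, with no GUP_PIN_COUNTING_BIAS multiplication: */
    atomic_add(refs, compound_pincount_ptr(head));
    atomic_sub(refs, compound_pincount_ptr(head));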

This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_maybe_dma_pinned() comment block.  That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.

Documentation/core-api/pin_user_pages.rst is updated accordingly.

Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/pin_user_pages.rst |   40 ++++------
 include/linux/mm.h                        |   26 ++++++
 include/linux/mm_types.h                  |    7 +
 mm/gup.c                                  |   78 +++++++++++++++++---
 mm/hugetlb.c                              |    6 +
 mm/page_alloc.c                           |    2 
 mm/rmap.c                                 |    6 +
 7 files changed, 133 insertions(+), 32 deletions(-)

--- a/Documentation/core-api/pin_user_pages.rst~mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages
+++ a/Documentation/core-api/pin_user_pages.rst
@@ -52,8 +52,22 @@ Which flags are set by each wrapper
 
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
-pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+pages* array, and the function then pins pages by incrementing each by a special
+value: GUP_PIN_COUNTING_BIAS.
+
+For huge pages (and in fact, any compound page of more than 2 pages), the
+GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
+is achieved, by using the 3rd struct page in the compound page. A new struct
+page field, hpage_pinned_refcount, has been added in order to support this.
+
+This approach for compound pages avoids the counting upper limit problems that
+are discussed below. Those limitations would have been aggravated severely by
+huge pages, because each tail page adds a refcount to the head page. And in
+fact, testing revealed that, without a separate hpage_pinned_refcount field,
+page overflows were seen in some huge page stress tests.
+
+This also means that huge pages and compound pages (of order > 1) do not suffer
+from the false positives problem that is mentioned below.::
 
  Function
  --------
@@ -99,27 +113,6 @@ pages:
 This also leads to limitations: there are only 31-10==21 bits available for a
 counter that increments 10 bits at a time.
 
-TODO: for 1GB and larger huge pages, this is cutting it close. That's because
-when pin_user_pages() follows such pages, it increments the head page by "1"
-(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
-pin_user_pages()) for each tail page. So if you have a 1GB huge page:
-
-* There are 256K (18 bits) worth of 4 KB tail pages.
-* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
-  10 bits at a time)
-* There are 21 - 18 == 3 bits available to count. Except that there aren't,
-  because you need to allow for a few normal get_page() calls on the head page,
-  as well. Fortunately, the approach of using addition, rather than "hard"
-  bitfields, within page->_refcount, allows for sharing these bits gracefully.
-  But we're still looking at about 8 references.
-
-This, however, is a missing feature more than anything else, because it's easily
-solved by addressing an obvious inefficiency in the original get_user_pages()
-approach of retrieving pages: stop treating all the pages as if they were
-PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
-this, so some work is required. Once that's in place, this limitation mostly
-disappears from view, because there will be ample refcounting range available.
-
 * Callers must specifically request "dma-pinned tracking of pages". In other
   words, just calling get_user_pages() will not suffice; a new set of functions,
   pin_user_page() and related, must be used.
@@ -228,5 +221,6 @@ References
 * `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
 * `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
 * `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_
 
 John Hubbard, October, 2019
--- a/include/linux/mm.h~mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages
+++ a/include/linux/mm.h
@@ -770,6 +770,24 @@ static inline unsigned int compound_orde
 	return page[1].compound_order;
 }
 
+static inline bool hpage_pincount_available(struct page *page)
+{
+	/*
+	 * Can the page->hpage_pinned_refcount field be used? That field is in
+	 * the 3rd page of the compound page, so the smallest (2-page) compound
+	 * pages cannot support it.
+	 */
+	page = compound_head(page);
+	return PageCompound(page) && compound_order(page) > 1;
+}
+
+static inline int compound_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	page = compound_head(page);
+	return atomic_read(compound_pincount_ptr(page));
+}
+
 static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
@@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **page
  * refcounts, and b) all the callers of this routine are expected to be able to
  * deal gracefully with a false positive.
  *
+ * For huge pages, the result will be exactly correct. That's because we have
+ * more tracking data available: the 3rd struct page in the compound page is
+ * used to track the pincount (instead of using the GUP_PIN_COUNTING_BIAS
+ * scheme).
+ *
  * For more information, please see Documentation/vm/pin_user_pages.rst.
  *
  * @page:	pointer to page to be queried.
@@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **page
  */
 static inline bool page_maybe_dma_pinned(struct page *page)
 {
+	if (hpage_pincount_available(page))
+		return compound_pincount(page) > 0;
+
 	/*
 	 * page_ref_count() is signed. If that refcount overflows, then
 	 * page_ref_count() returns a negative value, and callers will avoid
--- a/include/linux/mm_types.h~mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages
+++ a/include/linux/mm_types.h
@@ -137,7 +137,7 @@ struct page {
 		};
 		struct {	/* Second tail page of compound page */
 			unsigned long _compound_pad_1;	/* compound_head */
-			unsigned long _compound_pad_2;
+			atomic_t hpage_pinned_refcount;
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
@@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcoun
 	return &page[1].compound_mapcount;
 }
 
+static inline atomic_t *compound_pincount_ptr(struct page *page)
+{
+	return &page[2].hpage_pinned_refcount;
+}
+
 /*
  * Used for sizing the vmemmap region on some architectures
  */
--- a/mm/gup.c~mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages
+++ a/mm/gup.c
@@ -29,6 +29,22 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+static void hpage_pincount_add(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+	atomic_add(refs, compound_pincount_ptr(page));
+}
+
+static void hpage_pincount_sub(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+	atomic_sub(refs, compound_pincount_ptr(page));
+}
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -70,8 +86,25 @@ static __maybe_unused struct page *try_g
 	if (flags & FOLL_GET)
 		return try_get_compound_head(page, refs);
 	else if (flags & FOLL_PIN) {
-		refs *= GUP_PIN_COUNTING_BIAS;
-		return try_get_compound_head(page, refs);
+		/*
+		 * When pinning a compound page of order > 1 (which is what
+		 * hpage_pincount_available() checks for), use an exact count to
+		 * track it, via hpage_pincount_add/_sub().
+		 *
+		 * However, be sure to *also* increment the normal page refcount
+		 * field at least once, so that the page really is pinned.
+		 */
+		if (!hpage_pincount_available(page))
+			refs *= GUP_PIN_COUNTING_BIAS;
+
+		page = try_get_compound_head(page, refs);
+		if (!page)
+			return NULL;
+
+		if (hpage_pincount_available(page))
+			hpage_pincount_add(page, refs);
+
+		return page;
 	}
 
 	WARN_ON_ONCE(1);
@@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct p
 	if (flags & FOLL_GET)
 		return try_get_page(page);
 	else if (flags & FOLL_PIN) {
+		int refs = 1;
+
 		page = compound_head(page);
 
 		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
 			return false;
 
-		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+		if (hpage_pincount_available(page))
+			hpage_pincount_add(page, 1);
+		else
+			refs = GUP_PIN_COUNTING_BIAS;
+
+		/*
+		 * Similar to try_grab_compound_head(): even if using the
+		 * hpage_pincount_add/_sub() routines, be sure to
+		 * *also* increment the normal page refcount field at least
+		 * once, so that the page really is pinned.
+		 */
+		page_ref_add(page, refs);
 	}
 
 	return true;
@@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct p
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 static bool __unpin_devmap_managed_user_page(struct page *page)
 {
-	int count;
+	int count, refs = 1;
 
 	if (!page_is_devmap_managed(page))
 		return false;
 
-	count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+	if (hpage_pincount_available(page))
+		hpage_pincount_sub(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	count = page_ref_sub_return(page, refs);
 
 	/*
 	 * devmap page refcounts are 1-based, rather than 0-based: if
@@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_
  */
 void unpin_user_page(struct page *page)
 {
+	int refs = 1;
+
 	page = compound_head(page);
 
 	/*
@@ -168,7 +221,12 @@ void unpin_user_page(struct page *page)
 	if (__unpin_devmap_managed_user_page(page))
 		return;
 
-	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+	if (hpage_pincount_available(page))
+		hpage_pincount_sub(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
 }
 EXPORT_SYMBOL(unpin_user_page);
@@ -2200,8 +2258,12 @@ static int record_subpages(struct page *
 
 static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
-	if (flags & FOLL_PIN)
-		refs *= GUP_PIN_COUNTING_BIAS;
+	if (flags & FOLL_PIN) {
+		if (hpage_pincount_available(page))
+			hpage_pincount_sub(page, refs);
+		else
+			refs *= GUP_PIN_COUNTING_BIAS;
+	}
 
 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
 	/*
--- a/mm/hugetlb.c~mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages
+++ a/mm/hugetlb.c
@@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_pa
 	struct page *p = page + 1;
 
 	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		clear_compound_head(p);
 		set_page_refcounted(p);
@@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
 }
 
 /*
--- a/mm/page_alloc.c~mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages
+++ a/mm/page_alloc.c
@@ -689,6 +689,8 @@ void prep_compound_page(struct page *pag
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
--- a/mm/rmap.c~mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages
+++ a/mm/rmap.c
@@ -1165,6 +1165,9 @@ void page_add_new_anon_rmap(struct page
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
+		if (hpage_pincount_available(page))
+			atomic_set(compound_pincount_ptr(page), 0);
+
 		__inc_node_page_state(page, NR_ANON_THPS);
 	} else {
 		/* Anon THP always mapped first with PMD */
@@ -1961,6 +1964,9 @@ void hugepage_add_new_anon_rmap(struct p
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (100 preceding siblings ...)
  2020-02-11  5:50 ` + mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch " Andrew Morton
                   ` (136 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
has been added to the -mm tree.  Its filename is
     mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting

Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.

Add two new fields to /proc/vmstat:

    nr_foll_pin_acquired
    nr_foll_pin_released

These are documented in Documentation/core-api/pin_user_pages.rst.  They
represent the number of pages (since boot time) that have been pinned
("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
pin_user_pages*() and unpin_user_pages*().

In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.
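
For example, on a quiescent system the two counters should match; the
values below are illustrative only, not taken from a real run:

    $ grep foll_pin /proc/vmstat
    nr_foll_pin_acquired 3201
    nr_foll_pin_released 3201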

Also update Documentation/core-api/pin_user_pages.rst to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat, and to rename the new /proc/vmstat entries to the names
listed here.

Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/pin_user_pages.rst |   33 ++++++++++++++++----
 include/linux/mmzone.h                    |    2 +
 mm/gup.c                                  |   13 +++++++
 mm/vmstat.c                               |    2 +
 4 files changed, 45 insertions(+), 5 deletions(-)

--- a/Documentation/core-api/pin_user_pages.rst~mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting
+++ a/Documentation/core-api/pin_user_pages.rst
@@ -208,12 +208,35 @@ has the following new calls to exercise
 You can monitor how many total dma-pinned pages have been acquired and released
 since the system was booted, via two new /proc/vmstat entries: ::
 
-    /proc/vmstat/nr_foll_pin_requested
-    /proc/vmstat/nr_foll_pin_requested
+    /proc/vmstat/nr_foll_pin_acquired
+    /proc/vmstat/nr_foll_pin_released
 
-Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in unpin_user_page(), when they
-are activated.
+Under normal conditions, these two values will be equal unless there are any
+long-term [R]DMA pins in place, or during pin/unpin transitions.
+
+* nr_foll_pin_acquired: This is the number of logical pins that have been
+  acquired since the system was powered on. For huge pages, the head page is
+  pinned once for each page (head page and each tail page) within the huge page.
+  This follows the same sort of behavior that get_user_pages() uses for huge
+  pages: the head page is refcounted once for each tail or head page in the huge
+  page, when get_user_pages() is applied to a huge page.
+
+* nr_foll_pin_released: The number of logical pins that have been released since
+  the system was powered on. Note that pages are released (unpinned) on a
+  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
+  Because of the pin count behavior described above in "nr_foll_pin_acquired",
+  the accounting balances out, so that after doing this::
+
+    pin_user_pages(huge_page);
+    for (each page in huge_page)
+        unpin_user_page(page);
+
+...the following is expected::
+
+    nr_foll_pin_released == nr_foll_pin_acquired
+
+(...unless it was already out of balance due to a long-term RDMA pin being in
+place.)
 
 References
 ==========
--- a/include/linux/mmzone.h~mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting
+++ a/include/linux/mmzone.h
@@ -243,6 +243,8 @@ enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
+	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
 	NR_VM_NODE_STAT_ITEMS
 };
 
--- a/mm/gup.c~mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting
+++ a/mm/gup.c
@@ -86,6 +86,8 @@ static __maybe_unused struct page *try_g
 	if (flags & FOLL_GET)
 		return try_get_compound_head(page, refs);
 	else if (flags & FOLL_PIN) {
+		int orig_refs = refs;
+
 		/*
 		 * When pinning a compound page of order > 1 (which is what
 		 * hpage_pincount_available() checks for), use an exact count to
@@ -104,6 +106,9 @@ static __maybe_unused struct page *try_g
 		if (hpage_pincount_available(page))
 			hpage_pincount_add(page, refs);
 
+		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
+				    orig_refs);
+
 		return page;
 	}
 
@@ -158,6 +163,8 @@ bool __must_check try_grab_page(struct p
 		 * once, so that the page really is pinned.
 		 */
 		page_ref_add(page, refs);
+
+		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1);
 	}
 
 	return true;
@@ -178,6 +185,7 @@ static bool __unpin_devmap_managed_user_
 
 	count = page_ref_sub_return(page, refs);
 
+	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
 	/*
 	 * devmap page refcounts are 1-based, rather than 0-based: if
 	 * refcount is 1, then the page is free and the refcount is
@@ -228,6 +236,8 @@ void unpin_user_page(struct page *page)
 
 	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
+
+	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
 }
 EXPORT_SYMBOL(unpin_user_page);
 
@@ -2259,6 +2269,9 @@ static int record_subpages(struct page *
 static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
+		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
+				    refs);
+
 		if (hpage_pincount_available(page))
 			hpage_pincount_sub(page, refs);
 		else
--- a/mm/vmstat.c~mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting
+++ a/mm/vmstat.c
@@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+	"nr_foll_pin_acquired",
+	"nr_foll_pin_released",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (101 preceding siblings ...)
  2020-02-11  5:50 ` + mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch " Andrew Morton
                   ` (135 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm/gup_benchmark: support pin_user_pages() and related calls
has been added to the -mm tree.  Its filename is
     mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup_benchmark: support pin_user_pages() and related calls

Up until now, gup_benchmark supported testing of the following kernel
functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages():      via the '-b' command line option

Also, for clarity, add a '-u' option that explicitly selects what is
still the default choice: get_user_pages_fast(). Example invocations of
the new options are sketched below.
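
Assuming the selftest binary has been built (from
tools/testing/selftests/vm/), these invocations exercise the new paths
(sketched from the option letters above):

    ./gup_benchmark -a    # pin_user_pages_fast()
    ./gup_benchmark -b    # pin_user_pages()
    ./gup_benchmark -u    # get_user_pages_fast(), the explicit default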

Also, for the commands that set FOLL_PIN, verify that the pages really are
dma-pinned, via the new verify_dma_pinned() routine.  Those commands are:

    PIN_FAST_BENCHMARK     : calls pin_user_pages_fast()
    PIN_BENCHMARK          : calls pin_user_pages()

In between the calls to pin_*() and unpin_user_pages(), check each page:
if page_maybe_dma_pinned() returns false, then WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't affect
reported times.

Link: http://lkml.kernel.org/r/20200211001536.1027652-10-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup_benchmark.c                         |   71 +++++++++++++++++--
 tools/testing/selftests/vm/gup_benchmark.c |   15 +++-
 2 files changed, 80 insertions(+), 6 deletions(-)

--- a/mm/gup_benchmark.c~mm-gup_benchmark-support-pin_user_pages-and-related-calls
+++ a/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -19,6 +21,48 @@ struct gup_benchmark {
 	__u64 expansion[10];	/* For future use */
 };
 
+static void put_back_pages(unsigned int cmd, struct page **pages,
+			   unsigned long nr_pages)
+{
+	unsigned long i;
+
+	switch (cmd) {
+	case GUP_FAST_BENCHMARK:
+	case GUP_LONGTERM_BENCHMARK:
+	case GUP_BENCHMARK:
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
+		break;
+
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		unpin_user_pages(pages, nr_pages);
+		break;
+	}
+}
+
+static void verify_dma_pinned(unsigned int cmd, struct page **pages,
+			      unsigned long nr_pages)
+{
+	unsigned long i;
+	struct page *page;
+
+	switch (cmd) {
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		for (i = 0; i < nr_pages; i++) {
+			page = pages[i];
+			if (WARN(!page_maybe_dma_pinned(page),
+				 "pages[%lu] is NOT dma-pinned\n", i)) {
+
+				dump_page(page, "gup_benchmark failure");
+				break;
+			}
+		}
+		break;
+	}
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
 		struct gup_benchmark *gup)
 {
@@ -66,6 +110,14 @@ static int __gup_benchmark_ioctl(unsigne
 			nr = get_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
 			break;
+		case PIN_FAST_BENCHMARK:
+			nr = pin_user_pages_fast(addr, nr, gup->flags,
+						 pages + i);
+			break;
+		case PIN_BENCHMARK:
+			nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+					    NULL);
+			break;
 		default:
 			kvfree(pages);
 			ret = -EINVAL;
@@ -78,15 +130,22 @@ static int __gup_benchmark_ioctl(unsigne
 	}
 	end_time = ktime_get();
 
+	/* Shifting the meaning of nr_pages: now it is actual number pinned: */
+	nr_pages = i;
+
 	gup->get_delta_usec = ktime_us_delta(end_time, start_time);
 	gup->size = addr - gup->addr;
 
+	/*
+	 * Take an un-benchmark-timed moment to verify DMA pinned
+	 * state: print a warning if any non-dma-pinned pages are found:
+	 */
+	verify_dma_pinned(cmd, pages, nr_pages);
+
 	start_time = ktime_get();
-	for (i = 0; i < nr_pages; i++) {
-		if (!pages[i])
-			break;
-		put_page(pages[i]);
-	}
+
+	put_back_pages(cmd, pages, nr_pages);
+
 	end_time = ktime_get();
 	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
 
@@ -105,6 +164,8 @@ static long gup_benchmark_ioctl(struct f
 	case GUP_FAST_BENCHMARK:
 	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
 		break;
 	default:
 		return -EINVAL;
--- a/tools/testing/selftests/vm/gup_benchmark.c~mm-gup_benchmark-support-pin_user_pages-and-related-calls
+++ a/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,10 @@
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
 
+/* Similar to above, but use FOLL_PIN instead of FOLL_GET. */
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 5, struct gup_benchmark)
+
 /* Just the flags we need, copied from mm.h: */
 #define FOLL_WRITE	0x01	/* check pte is writable */
 
@@ -40,8 +44,14 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUwSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:abtTLUuwSH")) != -1) {
 		switch (opt) {
+		case 'a':
+			cmd = PIN_FAST_BENCHMARK;
+			break;
+		case 'b':
+			cmd = PIN_BENCHMARK;
+			break;
 		case 'm':
 			size = atoi(optarg) * MB;
 			break;
@@ -63,6 +73,9 @@ int main(int argc, char **argv)
 		case 'U':
 			cmd = GUP_BENCHMARK;
 			break;
+		case 'u':
+			cmd = GUP_FAST_BENCHMARK;
+			break;
 		case 'w':
 			write = 1;
 			break;
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (102 preceding siblings ...)
  2020-02-11  5:50 ` + mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:50 ` + mm-improve-dump_page-for-compound-pages.patch " Andrew Morton
                   ` (134 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage
has been added to the -mm tree.  Its filename is
     selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

It's good to have basic unit test coverage of the new FOLL_PIN behavior. 
Fortunately, the gup_benchmark unit test is extremely fast (a few
milliseconds), so adding it to the run_vmtests suite is going to cause no
noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages().  This is much like the
   first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual comparison of
the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script, because
with test suites, it's usually preferable to put the shorter, faster tests
first, all other things being equal.
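
For example, the whole suite (including the two new invocations) can be
run as follows; this is an illustrative sketch, assuming the usual
kselftest build flow and that run_vmtests has the privileges it needs:

    make -C tools/testing/selftests TARGETS=vm
    cd tools/testing/selftests/vm
    sudo ./run_vmtests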

Link: http://lkml.kernel.org/r/20200211001536.1027652-11-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/run_vmtests |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

--- a/tools/testing/selftests/vm/run_vmtests~selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage
+++ a/tools/testing/selftests/vm/run_vmtests
@@ -123,6 +123,28 @@ else
 	echo "[PASS]"
 fi
 
+echo "--------------------------------------------"
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo "--------------------------------------------"
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
+echo "------------------------------------------"
+echo "running gup_benchmark -b (pin_user_pages)"
+echo "------------------------------------------"
+./gup_benchmark -b
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "-------------------"
 echo "running userfaultfd"
 echo "-------------------"
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-improve-dump_page-for-compound-pages.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (103 preceding siblings ...)
  2020-02-11  5:50 ` + selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch " Andrew Morton
@ 2020-02-11  5:50 ` Andrew Morton
  2020-02-11  5:51 ` + mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch " Andrew Morton
                   ` (133 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:50 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm: improve dump_page() for compound pages
has been added to the -mm tree.  Its filename is
     mm-improve-dump_page-for-compound-pages.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-improve-dump_page-for-compound-pages.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-improve-dump_page-for-compound-pages.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: improve dump_page() for compound pages

There was no protection against a corrupted struct page having an
implausible compound_head().  Sanity check that a compound page has a head
within reach of the maximum allocatable page (this will need to be
adjusted if one of the plans to allocate 1GB pages comes to fruition).  In
addition,

 - Print the mapping pointer using %p instead of %px.  The actual value of
   the pointer can be read out of the raw page dump and using %p gives a
   chance to correlate it with an earlier printk of the mapping pointer
 - Print the mapping pointer from the head page, not the tail page
   (the tail ->mapping pointer may be in use for other purposes, eg part
   of a list_head)
 - Print the order of the page for compound pages
 - Dump the raw head page as well as the raw page
 - Print the refcount from the head page, not the tail page

Link: http://lkml.kernel.org/r/20200211001536.1027652-12-jhubbard@nvidia.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Co-developed-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug.c |   33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

--- a/mm/debug.c~mm-improve-dump_page-for-compound-pages
+++ a/mm/debug.c
@@ -44,8 +44,10 @@ const struct trace_print_flags vmaflag_n
 
 void __dump_page(struct page *page, const char *reason)
 {
+	struct page *head = compound_head(page);
 	struct address_space *mapping;
 	bool page_poisoned = PagePoisoned(page);
+	bool compound = PageCompound(page);
 	/*
 	 * Accessing the pageblock without the zone lock. It could change to
 	 * "isolate" again in the meantime, but since we are just dumping the
@@ -66,25 +68,32 @@ void __dump_page(struct page *page, cons
 		goto hex_only;
 	}
 
-	mapping = page_mapping(page);
+	if (page < head || (page >= head + MAX_ORDER_NR_PAGES)) {
+		/* Corrupt page, cannot call page_mapping */
+		mapping = page->mapping;
+		head = page;
+		compound = false;
+	} else {
+		mapping = page_mapping(page);
+	}
 
 	/*
 	 * Avoid VM_BUG_ON() in page_mapcount().
 	 * page->_mapcount space in struct page is used by sl[aou]b pages to
 	 * encode own info.
 	 */
-	mapcount = PageSlab(page) ? 0 : page_mapcount(page);
+	mapcount = PageSlab(head) ? 0 : page_mapcount(page);
 
-	if (PageCompound(page))
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
-			"index:%#lx compound_mapcount: %d\n",
-			page, page_ref_count(page), mapcount,
-			page->mapping, page_to_pgoff(page),
-			compound_mapcount(page));
+	if (compound)
+		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
+			"index:%#lx head:%px order:%u compound_mapcount:%d\n",
+			page, page_ref_count(head), mapcount,
+			mapping, page_to_pgoff(page), head,
+			compound_order(head), compound_mapcount(page));
 	else
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
+		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
 			page, page_ref_count(page), mapcount,
-			page->mapping, page_to_pgoff(page));
+			mapping, page_to_pgoff(page));
 	if (PageKsm(page))
 		type = "ksm ";
 	else if (PageAnon(page))
@@ -106,6 +115,10 @@ hex_only:
 	print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
 			sizeof(unsigned long), page,
 			sizeof(struct page), false);
+	if (head != page)
+		print_hex_dump(KERN_WARNING, "head: ", DUMP_PREFIX_NONE, 32,
+			sizeof(unsigned long), head,
+			sizeof(struct page), false);
 
 	if (reason)
 		pr_warn("page dumped because: %s\n", reason);
_

Patches currently in -mm which might be from willy@infradead.org are

mm-improve-dump_page-for-compound-pages.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (104 preceding siblings ...)
  2020-02-11  5:50 ` + mm-improve-dump_page-for-compound-pages.patch " Andrew Morton
@ 2020-02-11  5:51 ` Andrew Morton
  2020-02-11  6:05 ` + mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch " Andrew Morton
                   ` (132 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  5:51 UTC (permalink / raw)
  To: corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, kirill.shutemov, mhocko, mike.kravetz,
	mm-commits, shuah, vbabka, viro, willy


The patch titled
     Subject: mm: dump_page(): additional diagnostics for huge pinned pages
has been added to the -mm tree.  Its filename is
     mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm: dump_page(): additional diagnostics for huge pinned pages

As part of pin_user_pages() and related API calls, pages are "dma-pinned".
For the case of compound pages of order > 1, the per-page accounting of
dma pins is accomplished via the 3rd struct page in the compound page.  In
order to support debugging of any pin_user_pages()-related problems,
enhance dump_page() so as to report the pin count in that case.

Documentation/core-api/pin_user_pages.rst is also updated accordingly.

Link: http://lkml.kernel.org/r/20200211001536.1027652-13-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/pin_user_pages.rst |    7 ++++++
 mm/debug.c                                |   21 +++++++++++++++-----
 2 files changed, 23 insertions(+), 5 deletions(-)

--- a/Documentation/core-api/pin_user_pages.rst~mm-dump_page-additional-diagnostics-for-huge-pinned-pages
+++ a/Documentation/core-api/pin_user_pages.rst
@@ -238,6 +238,13 @@ long-term [R]DMA pins in place, or durin
 (...unless it was already out of balance due to a long-term RDMA pin being in
 place.)
 
+Other diagnostics
+=================
+
+dump_page() has been enhanced slightly, to handle these new counting fields, and
+to better report on compound pages in general. Specifically, for compound pages
+with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
+
 References
 ==========
 
--- a/mm/debug.c~mm-dump_page-additional-diagnostics-for-huge-pinned-pages
+++ a/mm/debug.c
@@ -85,11 +85,22 @@ void __dump_page(struct page *page, cons
 	mapcount = PageSlab(head) ? 0 : page_mapcount(page);
 
 	if (compound)
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
-			"index:%#lx head:%px order:%u compound_mapcount:%d\n",
-			page, page_ref_count(head), mapcount,
-			mapping, page_to_pgoff(page), head,
-			compound_order(head), compound_mapcount(page));
+		if (hpage_pincount_available(page)) {
+			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
+				"index:%#lx head:%px order:%u "
+				"compound_mapcount:%d compound_pincount:%d\n",
+				page, page_ref_count(head), mapcount,
+				mapping, page_to_pgoff(page), head,
+				compound_order(head), compound_mapcount(page),
+				compound_pincount(page));
+		} else {
+			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
+				"index:%#lx head:%px order:%u "
+				"compound_mapcount:%d\n",
+				page, page_ref_count(head), mapcount,
+				mapping, page_to_pgoff(page), head,
+				compound_order(head), compound_mapcount(page));
+		}
 	else
 		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
 			page, page_ref_count(page), mapcount,
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (105 preceding siblings ...)
  2020-02-11  5:51 ` + mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch " Andrew Morton
@ 2020-02-11  6:05 ` Andrew Morton
  2020-02-11  6:06 ` + zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch " Andrew Morton
                   ` (131 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  6:05 UTC (permalink / raw)
  To: akpm, mm-commits, steven.price, thellstrom


The patch titled
     Subject: mm/mapping_dirty_helpers: Update huge page-table entry callbacks
has been added to the -mm tree.  Its filename is
     mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Thomas Hellstrom <thellstrom@vmware.com>
Subject: mm/mapping_dirty_helpers: Update huge page-table entry callbacks

Following the pagewalk code update in commit a07984d48146 ("mm: pagewalk:
add p4d_entry() and pgd_entry()"), we can modify the mapping_dirty_helpers'
huge page-table entry callbacks to avoid splitting when a huge pud or pmd
is encountered.

Link: http://lkml.kernel.org/r/20200203154305.15045-1-thomas_os@shipmail.org
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mapping_dirty_helpers.c |   42 +++++++++++++++++++++++++++++++----
 1 file changed, 38 insertions(+), 4 deletions(-)

--- a/mm/mapping_dirty_helpers.c~mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks
+++ a/mm/mapping_dirty_helpers.c
@@ -111,26 +111,60 @@ static int clean_record_pte(pte_t *pte,
 	return 0;
 }
 
-/* wp_clean_pmd_entry - The pagewalk pmd callback. */
+/*
+ * wp_clean_pmd_entry - The pagewalk pmd callback.
+ *
+ * Dirty-tracking should take place on the PTE level, so
+ * WARN() if encountering a dirty huge pmd.
+ * Furthermore, never split huge pmds, since that currently
+ * causes dirty info loss. The pagefault handler should do
+ * that if needed.
+ */
 static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 			      struct mm_walk *walk)
 {
-	/* Dirty-tracking should be handled on the pte level */
 	pmd_t pmdval = pmd_read_atomic(pmd);
 
+	if (!pmd_trans_unstable(&pmdval))
+		return 0;
+
+	if (pmd_none(pmdval)) {
+		walk->action = ACTION_AGAIN;
+		return 0;
+	}
+
+	/* Huge pmd, present or migrated */
+	walk->action = ACTION_CONTINUE;
 	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
 		WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
 
 	return 0;
 }
 
-/* wp_clean_pud_entry - The pagewalk pud callback. */
+/*
+ * wp_clean_pud_entry - The pagewalk pud callback.
+ *
+ * Dirty-tracking should take place on the PTE level, so
+ * WARN() if encountering a dirty huge puds.
+ * Furthermore, never split huge puds, since that currently
+ * causes dirty info loss. The pagefault handler should do
+ * that if needed.
+ */
 static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
 			      struct mm_walk *walk)
 {
-	/* Dirty-tracking should be handled on the pte level */
 	pud_t pudval = READ_ONCE(*pud);
 
+	if (!pud_trans_unstable(&pudval))
+		return 0;
+
+	if (pud_none(pudval)) {
+		walk->action = ACTION_AGAIN;
+		return 0;
+	}
+
+	/* Huge pud */
+	walk->action = ACTION_CONTINUE;
 	if (pud_trans_huge(pudval) || pud_devmap(pudval))
 		WARN_ON(pud_write(pudval) || pud_dirty(pudval));
 
_

Patches currently in -mm which might be from thellstrom@vmware.com are

mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (106 preceding siblings ...)
  2020-02-11  6:05 ` + mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch " Andrew Morton
@ 2020-02-11  6:06 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch " Andrew Morton
                   ` (130 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11  6:06 UTC (permalink / raw)
  To: mail, mm-commits, vitaly.wool


The patch titled
     Subject: mm/zswap: allow setting default status, compressor and allocator in Kconfig
has been added to the -mm tree.  Its filename is
     zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
Subject: mm/zswap: allow setting default status, compressor and allocator in Kconfig

The compressed cache for swap pages (zswap) currently needs from 1 to 3
extra kernel command line parameters in order to make it work: it has to
be enabled by adding a "zswap.enabled=1" command line parameter, and if
one wants a different compressor or pool allocator than the default
lzo / zbud combination, those choices also need to be specified on the
kernel command line in additional parameters.

Using a different compressor and allocator for zswap is actually pretty
common as guides often recommend using the lz4 / z3fold pair instead of
the default one.  In that case it is also necessary to remember to enable
the appropriate compression algorithm and pool allocator in the kernel
config manually.

Let's avoid the need for adding these kernel command line parameters and
automatically pull in the dependencies for the selected compressor
algorithm and pool allocator by adding appropriate default switches to
Kconfig.

The default values for these options match what the code was using
previously as its defaults.
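
As an illustration, a .config fragment along these lines (option names are
the ones introduced in the Kconfig hunk below) would make the commonly
recommended lz4 / z3fold pair the default and enable zswap at boot, with
no kernel command line parameters needed:

    CONFIG_ZSWAP=y
    CONFIG_ZSWAP_DEFAULT_ON=y
    CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4=y
    CONFIG_ZSWAP_ZPOOL_DEFAULT_Z3FOLD=y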

Link: http://lkml.kernel.org/r/20200202000112.456103-1-mail@maciej.szmigiero.name
Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/zswap.rst |   20 +++--
 mm/Kconfig                 |  118 ++++++++++++++++++++++++++++++++++-
 mm/zswap.c                 |   24 +++----
 3 files changed, 141 insertions(+), 21 deletions(-)

--- a/Documentation/vm/zswap.rst~zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig
+++ a/Documentation/vm/zswap.rst
@@ -35,9 +35,11 @@ Zswap evicts pages from compressed cache
 device when the compressed pool reaches its size limit.  This requirement had
 been identified in prior community discussions.
 
-Zswap is disabled by default but can be enabled at boot time by setting
-the ``enabled`` attribute to 1 at boot time. ie: ``zswap.enabled=1``.  Zswap
-can also be enabled and disabled at runtime using the sysfs interface.
+Whether Zswap is enabled at the boot time depends on whether
+the ``CONFIG_ZSWAP_DEFAULT_ON`` Kconfig option is enabled or not.
+This setting can then be overridden by providing the kernel command line
+``zswap.enabled=`` option, for example ``zswap.enabled=0``.
+Zswap can also be enabled and disabled at runtime using the sysfs interface.
 An example command to enable zswap at runtime, assuming sysfs is mounted
 at ``/sys``, is::
 
@@ -64,9 +66,10 @@ allocation in zpool is not directly acce
 returned by the allocation routine and that handle must be mapped before being
 accessed.  The compressed memory pool grows on demand and shrinks as compressed
 pages are freed.  The pool is not preallocated.  By default, a zpool
-of type zbud is created, but it can be selected at boot time by
-setting the ``zpool`` attribute, e.g. ``zswap.zpool=zbud``. It can
-also be changed at runtime using the sysfs ``zpool`` attribute, e.g.::
+of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
+but it can be overridden at boot time by setting the ``zpool`` attribute,
+e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
+``zpool`` attribute, e.g.::
 
 	echo zbud > /sys/module/zswap/parameters/zpool
 
@@ -97,8 +100,9 @@ controlled policy:
 * max_pool_percent - The maximum percentage of memory that the compressed
   pool can occupy.
 
-The default compressor is lzo, but it can be selected at boot time by
-setting the ``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
+The default compressor is selected in ``CONFIG_ZSWAP_COMPRESSOR_DEFAULT``
+Kconfig option, but it can be overridden at boot time by setting the
+``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
 It can also be changed at runtime using the sysfs "compressor"
 attribute, e.g.::
 
--- a/mm/Kconfig~zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig
+++ a/mm/Kconfig
@@ -526,7 +526,6 @@ config MEM_SOFT_DIRTY
 config ZSWAP
 	bool "Compressed cache for swap pages (EXPERIMENTAL)"
 	depends on FRONTSWAP && CRYPTO=y
-	select CRYPTO_LZO
 	select ZPOOL
 	help
 	  A lightweight compressed cache for swap pages.  It takes
@@ -542,6 +541,123 @@ config ZSWAP
 	  they have not be fully explored on the large set of potential
 	  configurations and workloads that exist.
 
+choice
+	prompt "Compressed cache for swap pages default compressor"
+	depends on ZSWAP
+	default ZSWAP_COMPRESSOR_DEFAULT_LZO
+	help
+	  Selects the default compression algorithm for the compressed cache
+	  for swap pages.
+
+	  For an overview what kind of performance can be expected from
+	  a particular compression algorithm please refer to the benchmarks
+	  available at the following LWN page:
+	  https://lwn.net/Articles/751795/
+
+	  If in doubt, select 'LZO'.
+
+	  The selection made here can be overridden by using the kernel
+	  command line 'zswap.compressor=' option.
+
+config ZSWAP_COMPRESSOR_DEFAULT_DEFLATE
+	bool "Deflate"
+	select CRYPTO_DEFLATE
+	help
+	  Use the Deflate algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_LZO
+	bool "LZO"
+	select CRYPTO_LZO
+	help
+	  Use the LZO algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_842
+	bool "842"
+	select CRYPTO_842
+	help
+	  Use the 842 algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_LZ4
+	bool "LZ4"
+	select CRYPTO_LZ4
+	help
+	  Use the LZ4 algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_LZ4HC
+	bool "LZ4HC"
+	select CRYPTO_LZ4HC
+	help
+	  Use the LZ4HC algorithm as the default compression algorithm.
+
+config ZSWAP_COMPRESSOR_DEFAULT_ZSTD
+	bool "zstd"
+	select CRYPTO_ZSTD
+	help
+	  Use the zstd algorithm as the default compression algorithm.
+endchoice
+
+config ZSWAP_COMPRESSOR_DEFAULT
+       string
+       depends on ZSWAP
+       default "deflate" if ZSWAP_COMPRESSOR_DEFAULT_DEFLATE
+       default "lzo" if ZSWAP_COMPRESSOR_DEFAULT_LZO
+       default "842" if ZSWAP_COMPRESSOR_DEFAULT_842
+       default "lz4" if ZSWAP_COMPRESSOR_DEFAULT_LZ4
+       default "lz4hc" if ZSWAP_COMPRESSOR_DEFAULT_LZ4HC
+       default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD
+       default ""
+
+choice
+	prompt "Compressed cache for swap pages default allocator"
+	depends on ZSWAP
+	default ZSWAP_ZPOOL_DEFAULT_ZBUD
+	help
+	  Selects the default allocator for the compressed cache for
+	  swap pages.
+	  The default is 'zbud' for compatibility, however please do
+	  read the description of each of the allocators below before
+	  making a right choice.
+
+	  The selection made here can be overridden by using the kernel
+	  command line 'zswap.zpool=' option.
+
+config ZSWAP_ZPOOL_DEFAULT_ZBUD
+	bool "zbud"
+	select ZBUD
+	help
+	  Use the zbud allocator as the default allocator.
+
+config ZSWAP_ZPOOL_DEFAULT_Z3FOLD
+	bool "z3fold"
+	select Z3FOLD
+	help
+	  Use the z3fold allocator as the default allocator.
+
+config ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
+	bool "zsmalloc"
+	select ZSMALLOC
+	help
+	  Use the zsmalloc allocator as the default allocator.
+endchoice
+
+config ZSWAP_ZPOOL_DEFAULT
+       string
+       depends on ZSWAP
+       default "zbud" if ZSWAP_ZPOOL_DEFAULT_ZBUD
+       default "z3fold" if ZSWAP_ZPOOL_DEFAULT_Z3FOLD
+       default "zsmalloc" if ZSWAP_ZPOOL_DEFAULT_ZSMALLOC
+       default ""
+
+config ZSWAP_DEFAULT_ON
+	bool "Enable the compressed cache for swap pages by default"
+	depends on ZSWAP
+	help
+	  If selected, the compressed cache for swap pages will be enabled
+	  at boot, otherwise it will be disabled.
+
+	  The selection made here can be overridden by using the kernel
+	  command line 'zswap.enabled=' option.
+
 config ZPOOL
 	tristate "Common API for compressed memory storage"
 	help
--- a/mm/zswap.c~zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig
+++ a/mm/zswap.c
@@ -77,8 +77,8 @@ static bool zswap_pool_reached_full;
 
 #define ZSWAP_PARAM_UNSET ""
 
-/* Enable/disable zswap (disabled by default) */
-static bool zswap_enabled;
+/* Enable/disable zswap */
+static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON);
 static int zswap_enabled_param_set(const char *,
 				   const struct kernel_param *);
 static struct kernel_param_ops zswap_enabled_param_ops = {
@@ -88,8 +88,7 @@ static struct kernel_param_ops zswap_ena
 module_param_cb(enabled, &zswap_enabled_param_ops, &zswap_enabled, 0644);
 
 /* Crypto compressor to use */
-#define ZSWAP_COMPRESSOR_DEFAULT "lzo"
-static char *zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT;
+static char *zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT;
 static int zswap_compressor_param_set(const char *,
 				      const struct kernel_param *);
 static struct kernel_param_ops zswap_compressor_param_ops = {
@@ -101,8 +100,7 @@ module_param_cb(compressor, &zswap_compr
 		&zswap_compressor, 0644);
 
 /* Compressed storage zpool to use */
-#define ZSWAP_ZPOOL_DEFAULT "zbud"
-static char *zswap_zpool_type = ZSWAP_ZPOOL_DEFAULT;
+static char *zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT;
 static int zswap_zpool_param_set(const char *, const struct kernel_param *);
 static struct kernel_param_ops zswap_zpool_param_ops = {
 	.set =		zswap_zpool_param_set,
@@ -599,11 +597,12 @@ static __init struct zswap_pool *__zswap
 	bool has_comp, has_zpool;
 
 	has_comp = crypto_has_comp(zswap_compressor, 0, 0);
-	if (!has_comp && strcmp(zswap_compressor, ZSWAP_COMPRESSOR_DEFAULT)) {
+	if (!has_comp && strcmp(zswap_compressor,
+				CONFIG_ZSWAP_COMPRESSOR_DEFAULT)) {
 		pr_err("compressor %s not available, using default %s\n",
-		       zswap_compressor, ZSWAP_COMPRESSOR_DEFAULT);
+		       zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT);
 		param_free_charp(&zswap_compressor);
-		zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT;
+		zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT;
 		has_comp = crypto_has_comp(zswap_compressor, 0, 0);
 	}
 	if (!has_comp) {
@@ -614,11 +613,12 @@ static __init struct zswap_pool *__zswap
 	}
 
 	has_zpool = zpool_has_pool(zswap_zpool_type);
-	if (!has_zpool && strcmp(zswap_zpool_type, ZSWAP_ZPOOL_DEFAULT)) {
+	if (!has_zpool && strcmp(zswap_zpool_type,
+				 CONFIG_ZSWAP_ZPOOL_DEFAULT)) {
 		pr_err("zpool %s not available, using default %s\n",
-		       zswap_zpool_type, ZSWAP_ZPOOL_DEFAULT);
+		       zswap_zpool_type, CONFIG_ZSWAP_ZPOOL_DEFAULT);
 		param_free_charp(&zswap_zpool_type);
-		zswap_zpool_type = ZSWAP_ZPOOL_DEFAULT;
+		zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT;
 		has_zpool = zpool_has_pool(zswap_zpool_type);
 	}
 	if (!has_zpool) {
_

Patches currently in -mm which might be from mail@maciej.szmigiero.name are

zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (107 preceding siblings ...)
  2020-02-11  6:06 ` + zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch " Andrew Morton
                   ` (129 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb_cgroup: add hugetlb_cgroup reservation counter
has been added to the -mm tree.  Its filename is
     hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb_cgroup: add hugetlb_cgroup reservation counter

These counters will track hugetlb reservations rather than hugetlb memory
faulted in.  This patch only adds the counter; following patches add the
charging and uncharging of the counter.

This is patch 1 of a 9-patch series.

Problem:

Currently tasks attempting to reserve more hugetlb memory than is
available get a failure at mmap/shmget time.  This is thanks to Hugetlbfs
Reservations [1].  However, if a task attempts to reserve more hugetlb
memory than its hugetlb_cgroup limit allows, the kernel will allow the
mmap/shmget call, but will SIGBUS the task when it attempts to fault in
the excess memory.

We have users hitting their hugetlb_cgroup limits and thus we've been
looking at this failure mode.  We'd like to improve this behavior such
that users violating the hugetlb_cgroup limits get an error at mmap/shmget
time, rather than getting SIGBUS'd when they try to fault the excess
memory in.  This gives the user an opportunity to fallback more gracefully
to non-hugetlbfs memory for example.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.  Thus,
enforcing the hugetlb_cgroup limit only happens at fault time, and the
offending task gets SIGBUS'd.

Proposed Solution:

A new page counter named
'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
slightly different semantics than
'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':

- While usage_in_bytes tracks all *faulted* hugetlb memory,
  rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
  memory faulted in without a prior reservation.

- If a task attempts to reserve more memory than limit_in_bytes allows,
  the kernel will allow it to do so.  But if a task attempts to reserve
  more memory than rsvd.limit_in_bytes, the kernel will fail this
  reservation.

This proposal is implemented in this patch series, with tests to verify
functionality and show the usage.
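
A hypothetical usage sketch (assuming the cgroup v1 hugetlb controller is
mounted at /sys/fs/cgroup/hugetlb, a child cgroup "app" exists, and 2MB is
the hugepage size; the file names follow the patch below):

    # Cap "app" at 512 reserved 2MB hugepages:
    echo $((512 * 2 * 1024 * 1024)) > \
        /sys/fs/cgroup/hugetlb/app/hugetlb.2MB.rsvd.limit_in_bytes
    # Faulted-in usage is still tracked by the existing counter:
    cat /sys/fs/cgroup/hugetlb/app/hugetlb.2MB.usage_in_bytes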

Alternatives considered:

1. A new cgroup, instead of only a new page_counter attached to the
   existing hugetlb_cgroup.  Adding a new cgroup seemed like a lot of code
   duplication with hugetlb_cgroup.  Keeping hugetlb related page counters
   under hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that
   modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
   accounting at reservation time rather than fault time.  Adding a new
   page_counter seems better as userspace could, if it wants, choose to
   enforce different cgroups differently: one via limit_in_bytes, and
   another via rsvd.limit_in_bytes.  This could be very useful if you're
   transitioning how hugetlb memory is partitioned on your system one
   cgroup at a time, for example.  Also, someone may find usage for both
   limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
   gives them the option to do so.

Testing:
- The newly added tests pass.
- Used libhugetlbfs for regression testing.

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h |    4 -
 mm/hugetlb_cgroup.c     |  115 +++++++++++++++++++++++++++++++++-----
 2 files changed, 104 insertions(+), 15 deletions(-)

--- a/include/linux/hugetlb.h~hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter
+++ a/include/linux/hugetlb.h
@@ -432,8 +432,8 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files_dfl[5];
-	struct cftype cgroup_files_legacy[5];
+	struct cftype cgroup_files_dfl[7];
+	struct cftype cgroup_files_legacy[9];
 #endif
 	char name[HSTATE_NAME_LEN];
 };
--- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter
+++ a/mm/hugetlb_cgroup.c
@@ -36,6 +36,11 @@ struct hugetlb_cgroup {
 	 */
 	struct page_counter hugepage[HUGE_MAX_HSTATE];
 
+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter rsvd_hugepage[HUGE_MAX_HSTATE];
+
 	atomic_long_t events[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
 	atomic_long_t events_local[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
 
@@ -55,6 +60,15 @@ struct hugetlb_cgroup {
 
 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
 
+static inline struct page_counter *
+hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx,
+				   bool rsvd)
+{
+	if (rsvd)
+		return &h_cg->rsvd_hugepage[idx];
+	return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -295,28 +309,42 @@ void hugetlb_cgroup_uncharge_cgroup(int
 
 enum {
 	RES_USAGE,
+	RES_RSVD_USAGE,
 	RES_LIMIT,
+	RES_RSVD_LIMIT,
 	RES_MAX_USAGE,
+	RES_RSVD_MAX_USAGE,
 	RES_FAILCNT,
+	RES_RSVD_FAILCNT,
 };
 
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
 	struct page_counter *counter;
+	struct page_counter *rsvd_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 
 	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+	rsvd_counter = &h_cg->rsvd_hugepage[MEMFILE_IDX(cft->private)];
 
 	switch (MEMFILE_ATTR(cft->private)) {
 	case RES_USAGE:
 		return (u64)page_counter_read(counter) * PAGE_SIZE;
+	case RES_RSVD_USAGE:
+		return (u64)page_counter_read(rsvd_counter) * PAGE_SIZE;
 	case RES_LIMIT:
 		return (u64)counter->max * PAGE_SIZE;
+	case RES_RSVD_LIMIT:
+		return (u64)rsvd_counter->max * PAGE_SIZE;
 	case RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_RSVD_MAX_USAGE:
+		return (u64)rsvd_counter->watermark * PAGE_SIZE;
 	case RES_FAILCNT:
 		return counter->failcnt;
+	case RES_RSVD_FAILCNT:
+		return rsvd_counter->failcnt;
 	default:
 		BUG();
 	}
@@ -338,10 +366,16 @@ static int hugetlb_cgroup_read_u64_max(s
 			   1 << huge_page_order(&hstates[idx]));
 
 	switch (MEMFILE_ATTR(cft->private)) {
+	case RES_RSVD_USAGE:
+		counter = &h_cg->rsvd_hugepage[idx];
+		/* Fall through. */
 	case RES_USAGE:
 		val = (u64)page_counter_read(counter);
 		seq_printf(seq, "%llu\n", val * PAGE_SIZE);
 		break;
+	case RES_RSVD_LIMIT:
+		counter = &h_cg->rsvd_hugepage[idx];
+		/* Fall through. */
 	case RES_LIMIT:
 		val = (u64)counter->max;
 		if (val == limit)
@@ -365,6 +399,7 @@ static ssize_t hugetlb_cgroup_write(stru
 	int ret, idx;
 	unsigned long nr_pages;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+	bool rsvd = false;
 
 	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
 		return -EINVAL;
@@ -378,9 +413,14 @@ static ssize_t hugetlb_cgroup_write(stru
 	nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
+	case RES_RSVD_LIMIT:
+		rsvd = true;
+		/* Fall through. */
 	case RES_LIMIT:
 		mutex_lock(&hugetlb_limit_mutex);
-		ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+		ret = page_counter_set_max(
+			hugetlb_cgroup_counter_from_cgroup(h_cg, idx, rsvd),
+			nr_pages);
 		mutex_unlock(&hugetlb_limit_mutex);
 		break;
 	default:
@@ -406,18 +446,25 @@ static ssize_t hugetlb_cgroup_reset(stru
 				    char *buf, size_t nbytes, loff_t off)
 {
 	int ret = 0;
-	struct page_counter *counter;
+	struct page_counter *counter, *rsvd_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
 
 	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+	rsvd_counter = &h_cg->rsvd_hugepage[MEMFILE_IDX(of_cft(of)->private)];
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
 		page_counter_reset_watermark(counter);
 		break;
+	case RES_RSVD_MAX_USAGE:
+		page_counter_reset_watermark(rsvd_counter);
+		break;
 	case RES_FAILCNT:
 		counter->failcnt = 0;
 		break;
+	case RES_RSVD_FAILCNT:
+		rsvd_counter->failcnt = 0;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -472,7 +519,7 @@ static void __init __hugetlb_cgroup_file
 	struct hstate *h = &hstates[idx];
 
 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));
 
 	/* Add the limit file */
 	cft = &h->cgroup_files_dfl[0];
@@ -482,15 +529,30 @@ static void __init __hugetlb_cgroup_file
 	cft->write = hugetlb_cgroup_write_dfl;
 	cft->flags = CFTYPE_NOT_ON_ROOT;
 
-	/* Add the current usage file */
+	/* Add the reservation limit file */
 	cft = &h->cgroup_files_dfl[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.max", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_LIMIT);
+	cft->seq_show = hugetlb_cgroup_read_u64_max;
+	cft->write = hugetlb_cgroup_write_dfl;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
+	/* Add the current usage file */
+	cft = &h->cgroup_files_dfl[2];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.current", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
 	cft->seq_show = hugetlb_cgroup_read_u64_max;
 	cft->flags = CFTYPE_NOT_ON_ROOT;
 
+	/* Add the current reservation usage file */
+	cft = &h->cgroup_files_dfl[3];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.current", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_USAGE);
+	cft->seq_show = hugetlb_cgroup_read_u64_max;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
 	/* Add the events file */
-	cft = &h->cgroup_files_dfl[2];
+	cft = &h->cgroup_files_dfl[4];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
 	cft->private = MEMFILE_PRIVATE(idx, 0);
 	cft->seq_show = hugetlb_events_show;
@@ -498,7 +560,7 @@ static void __init __hugetlb_cgroup_file
 	cft->flags = CFTYPE_NOT_ON_ROOT;
 
 	/* Add the events.local file */
-	cft = &h->cgroup_files_dfl[3];
+	cft = &h->cgroup_files_dfl[5];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
 	cft->private = MEMFILE_PRIVATE(idx, 0);
 	cft->seq_show = hugetlb_events_local_show;
@@ -507,7 +569,7 @@ static void __init __hugetlb_cgroup_file
 	cft->flags = CFTYPE_NOT_ON_ROOT;
 
 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_dfl[4];
+	cft = &h->cgroup_files_dfl[6];
 	memset(cft, 0, sizeof(*cft));
 
 	WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -521,7 +583,7 @@ static void __init __hugetlb_cgroup_file
 	struct hstate *h = &hstates[idx];
 
 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));
 
 	/* Add the limit file */
 	cft = &h->cgroup_files_legacy[0];
@@ -530,28 +592,55 @@ static void __init __hugetlb_cgroup_file
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 	cft->write = hugetlb_cgroup_write_legacy;
 
-	/* Add the usage file */
+	/* Add the reservation limit file */
 	cft = &h->cgroup_files_legacy[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.limit_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_LIMIT);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+	cft->write = hugetlb_cgroup_write_legacy;
+
+	/* Add the usage file */
+	cft = &h->cgroup_files_legacy[2];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 
+	/* Add the reservation usage file */
+	cft = &h->cgroup_files_legacy[3];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_USAGE);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* Add the MAX usage file */
-	cft = &h->cgroup_files_legacy[2];
+	cft = &h->cgroup_files_legacy[4];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 
+	/* Add the MAX reservation usage file */
+	cft = &h->cgroup_files_legacy[5];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.max_usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_MAX_USAGE);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* Add the failcntfile */
-	cft = &h->cgroup_files_legacy[3];
+	cft = &h->cgroup_files_legacy[6];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
-	cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the reservation failcntfile */
+	cft = &h->cgroup_files_legacy[7];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.failcnt", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_FAILCNT);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 
 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_legacy[4];
+	cft = &h->cgroup_files_legacy[8];
 	memset(cft, 0, sizeof(*cft));
 
 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (108 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch " Andrew Morton
                   ` (128 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
has been added to the -mm tree.  Its filename is
     hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations

Augments hugetlb_cgroup_charge_cgroup to be able to charge either the
hugetlb usage counter or the hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.

Link: http://lkml.kernel.org/r/20200211213128.73302-2-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb_cgroup.h |  123 ++++++++++++++++++---
 mm/hugetlb.c                   |    2 
 mm/hugetlb_cgroup.c            |  174 +++++++++++++++++++++++++------
 3 files changed, 251 insertions(+), 48 deletions(-)

--- a/include/linux/hugetlb_cgroup.h~hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations
+++ a/include/linux/hugetlb_cgroup.h
@@ -20,32 +20,64 @@
 struct hugetlb_cgroup;
 /*
  * Minimum page order trackable by hugetlb cgroup.
- * At least 3 pages are necessary for all the tracking information.
+ * At least 4 pages are necessary for all the tracking information.
+ * The second tail page (hpage[2]) is the fault usage cgroup.
+ * The third tail page (hpage[3]) is the reservation usage cgroup.
  */
 #define HUGETLB_CGROUP_MIN_ORDER	2
 
 #ifdef CONFIG_CGROUP_HUGETLB
 
-static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
+static inline struct hugetlb_cgroup *
+__hugetlb_cgroup_from_page(struct page *page, bool rsvd)
 {
 	VM_BUG_ON_PAGE(!PageHuge(page), page);
 
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return NULL;
-	return (struct hugetlb_cgroup *)page[2].private;
+	if (rsvd)
+		return (struct hugetlb_cgroup *)page[3].private;
+	else
+		return (struct hugetlb_cgroup *)page[2].private;
+}
+
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
+{
+	return __hugetlb_cgroup_from_page(page, false);
 }
 
-static inline
-int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+static inline struct hugetlb_cgroup *
+hugetlb_cgroup_from_page_rsvd(struct page *page)
+{
+	return __hugetlb_cgroup_from_page(page, true);
+}
+
+static inline int __set_hugetlb_cgroup(struct page *page,
+				       struct hugetlb_cgroup *h_cg, bool rsvd)
 {
 	VM_BUG_ON_PAGE(!PageHuge(page), page);
 
 	if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
 		return -1;
-	page[2].private	= (unsigned long)h_cg;
+	if (rsvd)
+		page[3].private = (unsigned long)h_cg;
+	else
+		page[2].private = (unsigned long)h_cg;
 	return 0;
 }
 
+static inline int set_hugetlb_cgroup(struct page *page,
+				     struct hugetlb_cgroup *h_cg)
+{
+	return __set_hugetlb_cgroup(page, h_cg, false);
+}
+
+static inline int set_hugetlb_cgroup_rsvd(struct page *page,
+					  struct hugetlb_cgroup *h_cg)
+{
+	return __set_hugetlb_cgroup(page, h_cg, true);
+}
+
 static inline bool hugetlb_cgroup_disabled(void)
 {
 	return !cgroup_subsys_enabled(hugetlb_cgrp_subsys);
@@ -53,13 +85,27 @@ static inline bool hugetlb_cgroup_disabl
 
 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
 					struct hugetlb_cgroup **ptr);
+extern int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages,
+					     struct hugetlb_cgroup **ptr);
 extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 					 struct hugetlb_cgroup *h_cg,
 					 struct page *page);
+extern void hugetlb_cgroup_commit_charge_rsvd(int idx, unsigned long nr_pages,
+					      struct hugetlb_cgroup *h_cg,
+					      struct page *page);
 extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 					 struct page *page);
+extern void hugetlb_cgroup_uncharge_page_rsvd(int idx, unsigned long nr_pages,
+					      struct page *page);
+
 extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
 					   struct hugetlb_cgroup *h_cg);
+extern void hugetlb_cgroup_uncharge_cgroup_rsvd(int idx, unsigned long nr_pages,
+						struct hugetlb_cgroup *h_cg);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+					    unsigned long nr_pages,
+					    struct cgroup_subsys_state *css);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
 				   struct page *newhpage);
@@ -70,8 +116,26 @@ static inline struct hugetlb_cgroup *hug
 	return NULL;
 }
 
-static inline
-int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+static inline struct hugetlb_cgroup *
+hugetlb_cgroup_from_page_resv(struct page *page)
+{
+	return NULL;
+}
+
+static inline struct hugetlb_cgroup *
+hugetlb_cgroup_from_page_rsvd(struct page *page)
+{
+	return NULL;
+}
+
+static inline int set_hugetlb_cgroup(struct page *page,
+				     struct hugetlb_cgroup *h_cg)
+{
+	return 0;
+}
+
+static inline int set_hugetlb_cgroup_rsvd(struct page *page,
+					  struct hugetlb_cgroup *h_cg)
 {
 	return 0;
 }
@@ -81,28 +145,51 @@ static inline bool hugetlb_cgroup_disabl
 	return true;
 }
 
-static inline int
-hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-			     struct hugetlb_cgroup **ptr)
+static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
+					       struct hugetlb_cgroup **ptr)
 {
 	return 0;
 }
 
-static inline void
-hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
-			     struct hugetlb_cgroup *h_cg,
-			     struct page *page)
+static inline int hugetlb_cgroup_charge_cgroup_rsvd(int idx,
+						    unsigned long nr_pages,
+						    struct hugetlb_cgroup **ptr)
+{
+	return 0;
+}
+
+static inline void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+						struct hugetlb_cgroup *h_cg,
+						struct page *page)
 {
 }
 
 static inline void
-hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages, struct page *page)
+hugetlb_cgroup_commit_charge_rsvd(int idx, unsigned long nr_pages,
+				  struct hugetlb_cgroup *h_cg,
+				  struct page *page)
+{
+}
+
+static inline void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+						struct page *page)
+{
+}
+
+static inline void hugetlb_cgroup_uncharge_page_rsvd(int idx,
+						     unsigned long nr_pages,
+						     struct page *page)
+{
+}
+static inline void hugetlb_cgroup_uncharge_cgroup(int idx,
+						  unsigned long nr_pages,
+						  struct hugetlb_cgroup *h_cg)
 {
 }
 
 static inline void
-hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-			       struct hugetlb_cgroup *h_cg)
+hugetlb_cgroup_uncharge_cgroup_rsvd(int idx, unsigned long nr_pages,
+				    struct hugetlb_cgroup *h_cg)
 {
 }
 
--- a/mm/hugetlb.c~hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations
+++ a/mm/hugetlb.c
@@ -1072,6 +1072,7 @@ static void update_and_free_page(struct
 				1 << PG_writeback);
 	}
 	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page), page);
+	VM_BUG_ON_PAGE(hugetlb_cgroup_from_page_rsvd(page), page);
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
@@ -1257,6 +1258,7 @@ static void prep_new_huge_page(struct hs
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
 	spin_lock(&hugetlb_lock);
 	set_hugetlb_cgroup(page, NULL);
+	set_hugetlb_cgroup_rsvd(page, NULL);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
 	spin_unlock(&hugetlb_lock);
--- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations
+++ a/mm/hugetlb_cgroup.c
@@ -61,14 +61,26 @@ struct hugetlb_cgroup {
 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
 
 static inline struct page_counter *
-hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx,
-				   bool rsvd)
+__hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx,
+				     bool rsvd)
 {
 	if (rsvd)
 		return &h_cg->rsvd_hugepage[idx];
 	return &h_cg->hugepage[idx];
 }
 
+static inline struct page_counter *
+hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx)
+{
+	return __hugetlb_cgroup_counter_from_cgroup(h_cg, idx, false);
+}
+
+static inline struct page_counter *
+hugetlb_cgroup_counter_from_cgroup_rsvd(struct hugetlb_cgroup *h_cg, int idx)
+{
+	return __hugetlb_cgroup_counter_from_cgroup(h_cg, idx, true);
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -97,8 +109,12 @@ static inline bool hugetlb_cgroup_have_u
 	int idx;
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-		if (page_counter_read(&h_cg->hugepage[idx]))
+		if (page_counter_read(
+			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) ||
+		    page_counter_read(hugetlb_cgroup_counter_from_cgroup_rsvd(
+			    h_cg, idx))) {
 			return true;
+		}
 	}
 	return false;
 }
@@ -109,18 +125,34 @@ static void hugetlb_cgroup_init(struct h
 	int idx;
 
 	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
-		struct page_counter *counter = &h_cgroup->hugepage[idx];
-		struct page_counter *parent = NULL;
+		struct page_counter *fault_parent = NULL;
+		struct page_counter *rsvd_parent = NULL;
 		unsigned long limit;
 		int ret;
 
-		if (parent_h_cgroup)
-			parent = &parent_h_cgroup->hugepage[idx];
-		page_counter_init(counter, parent);
+		if (parent_h_cgroup) {
+			fault_parent = hugetlb_cgroup_counter_from_cgroup(
+				parent_h_cgroup, idx);
+			rsvd_parent = hugetlb_cgroup_counter_from_cgroup_rsvd(
+				parent_h_cgroup, idx);
+		}
+		page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup,
+								     idx),
+				  fault_parent);
+		page_counter_init(
+			hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
+			rsvd_parent);
 
 		limit = round_down(PAGE_COUNTER_MAX,
 				   1 << huge_page_order(&hstates[idx]));
-		ret = page_counter_set_max(counter, limit);
+
+		ret = page_counter_set_max(
+			hugetlb_cgroup_counter_from_cgroup(h_cgroup, idx),
+			limit);
+		VM_BUG_ON(ret);
+		ret = page_counter_set_max(
+			hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
+			limit);
 		VM_BUG_ON(ret);
 	}
 }
@@ -150,7 +182,6 @@ static void hugetlb_cgroup_css_free(stru
 	kfree(h_cgroup);
 }
 
-
 /*
  * Should be called with hugetlb_lock held.
  * Since we are holding hugetlb_lock, pages cannot get moved from
@@ -227,8 +258,9 @@ static inline void hugetlb_event(struct
 		 !hugetlb_cgroup_is_root(hugetlb));
 }
 
-int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-				 struct hugetlb_cgroup **ptr)
+static int __hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
+					  struct hugetlb_cgroup **ptr,
+					  bool rsvd)
 {
 	int ret = 0;
 	struct page_counter *counter;
@@ -251,51 +283,104 @@ again:
 	}
 	rcu_read_unlock();
 
-	if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages,
-				     &counter)) {
+	if (!page_counter_try_charge(
+		    __hugetlb_cgroup_counter_from_cgroup(h_cg, idx, rsvd),
+		    nr_pages, &counter)) {
 		ret = -ENOMEM;
 		hugetlb_event(hugetlb_cgroup_from_counter(counter, idx), idx,
 			      HUGETLB_MAX);
+		css_put(&h_cg->css);
+		goto done;
 	}
-	css_put(&h_cg->css);
+	/* Reservations take a reference to the css because they do not get
+	 * reparented.
+	 */
+	if (!rsvd)
+		css_put(&h_cg->css);
 done:
 	*ptr = h_cg;
 	return ret;
 }
 
+int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
+				 struct hugetlb_cgroup **ptr)
+{
+	return __hugetlb_cgroup_charge_cgroup(idx, nr_pages, ptr, false);
+}
+
+int hugetlb_cgroup_charge_cgroup_rsvd(int idx, unsigned long nr_pages,
+				      struct hugetlb_cgroup **ptr)
+{
+	return __hugetlb_cgroup_charge_cgroup(idx, nr_pages, ptr, true);
+}
+
 /* Should be called with hugetlb_lock held */
-void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
-				  struct hugetlb_cgroup *h_cg,
-				  struct page *page)
+static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+					   struct hugetlb_cgroup *h_cg,
+					   struct page *page, bool rsvd)
 {
 	if (hugetlb_cgroup_disabled() || !h_cg)
 		return;
 
-	set_hugetlb_cgroup(page, h_cg);
+	__set_hugetlb_cgroup(page, h_cg, rsvd);
 	return;
 }
 
+void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
+				  struct hugetlb_cgroup *h_cg,
+				  struct page *page)
+{
+	__hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, page, false);
+}
+
+void hugetlb_cgroup_commit_charge_rsvd(int idx, unsigned long nr_pages,
+				       struct hugetlb_cgroup *h_cg,
+				       struct page *page)
+{
+	__hugetlb_cgroup_commit_charge(idx, nr_pages, h_cg, page, true);
+}
+
 /*
  * Should be called with hugetlb_lock held
  */
-void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
-				  struct page *page)
+static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+					   struct page *page, bool rsvd)
 {
 	struct hugetlb_cgroup *h_cg;
 
 	if (hugetlb_cgroup_disabled())
 		return;
 	lockdep_assert_held(&hugetlb_lock);
-	h_cg = hugetlb_cgroup_from_page(page);
+	h_cg = __hugetlb_cgroup_from_page(page, rsvd);
 	if (unlikely(!h_cg))
 		return;
-	set_hugetlb_cgroup(page, NULL);
-	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
+	__set_hugetlb_cgroup(page, NULL, rsvd);
+
+	page_counter_uncharge(__hugetlb_cgroup_counter_from_cgroup(h_cg, idx,
+								   rsvd),
+			      nr_pages);
+
+	if (rsvd)
+		css_put(&h_cg->css);
+
 	return;
 }
 
-void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-				    struct hugetlb_cgroup *h_cg)
+void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
+				  struct page *page)
+{
+	__hugetlb_cgroup_uncharge_page(idx, nr_pages, page, false);
+}
+
+void hugetlb_cgroup_uncharge_page_rsvd(int idx, unsigned long nr_pages,
+				       struct page *page)
+{
+	__hugetlb_cgroup_uncharge_page(idx, nr_pages, page, true);
+}
+
+static void __hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+					     struct hugetlb_cgroup *h_cg,
+					     bool rsvd)
 {
 	if (hugetlb_cgroup_disabled() || !h_cg)
 		return;
@@ -303,8 +388,35 @@ void hugetlb_cgroup_uncharge_cgroup(int
 	if (huge_page_order(&hstates[idx]) < HUGETLB_CGROUP_MIN_ORDER)
 		return;
 
-	page_counter_uncharge(&h_cg->hugepage[idx], nr_pages);
-	return;
+	page_counter_uncharge(__hugetlb_cgroup_counter_from_cgroup(h_cg, idx,
+								   rsvd),
+			      nr_pages);
+
+	if (rsvd)
+		css_put(&h_cg->css);
+}
+
+void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
+				    struct hugetlb_cgroup *h_cg)
+{
+	__hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg, false);
+}
+
+void hugetlb_cgroup_uncharge_cgroup_rsvd(int idx, unsigned long nr_pages,
+					 struct hugetlb_cgroup *h_cg)
+{
+	__hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg, true);
+}
+
+void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+				     unsigned long nr_pages,
+				     struct cgroup_subsys_state *css)
+{
+	if (hugetlb_cgroup_disabled() || !p || !css)
+		return;
+
+	page_counter_uncharge(p, nr_pages);
+	css_put(css);
 }
 
 enum {
@@ -419,7 +531,7 @@ static ssize_t hugetlb_cgroup_write(stru
 	case RES_LIMIT:
 		mutex_lock(&hugetlb_limit_mutex);
 		ret = page_counter_set_max(
-			hugetlb_cgroup_counter_from_cgroup(h_cg, idx, rsvd),
+			__hugetlb_cgroup_counter_from_cgroup(h_cg, idx, rsvd),
 			nr_pages);
 		mutex_unlock(&hugetlb_limit_mutex);
 		break;
@@ -675,6 +787,7 @@ void __init hugetlb_cgroup_file_init(voi
 void hugetlb_cgroup_migrate(struct page *oldhpage, struct page *newhpage)
 {
 	struct hugetlb_cgroup *h_cg;
+	struct hugetlb_cgroup *h_cg_rsvd;
 	struct hstate *h = page_hstate(oldhpage);
 
 	if (hugetlb_cgroup_disabled())
@@ -683,10 +796,11 @@ void hugetlb_cgroup_migrate(struct page
 	VM_BUG_ON_PAGE(!PageHuge(oldhpage), oldhpage);
 	spin_lock(&hugetlb_lock);
 	h_cg = hugetlb_cgroup_from_page(oldhpage);
+	h_cg_rsvd = hugetlb_cgroup_from_page_rsvd(oldhpage);
 	set_hugetlb_cgroup(oldhpage, NULL);
 
 	/* move the h_cg details to new cgroup */
-	set_hugetlb_cgroup(newhpage, h_cg);
+	set_hugetlb_cgroup_rsvd(newhpage, h_cg_rsvd);
 	list_move(&newhpage->lru, &h->hugepage_activelist);
 	spin_unlock(&hugetlb_lock);
 	return;
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (109 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb-disable-region_add-file_region-coalescing.patch " Andrew Morton
                   ` (127 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb_cgroup: add reservation accounting for private mappings
has been added to the -mm tree.  Its filename is
     hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb_cgroup: add reservation accounting for private mappings

Normally the pointer to the cgroup to uncharge hangs off the struct page,
and gets queried when it's time to free the page.  With hugetlb_cgroup
reservations this is not possible, because a page can be reserved by one
task and actually faulted in by another task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map.  But, because the resv_map has different
semantics for private and shared mappings, the code path to
charge/uncharge shared and private mappings is different.  This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in
resv_map->reservation_counter.  On initializing the resv_map this is set
to NULL.  On reservation of a region in a private mapping, the task's
hugetlb_cgroup is charged and a pointer to its reservation counter is
placed in resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.
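
A condensed sketch of the two paths, stitched together from the hunks
below (not verbatim; the kref_put on the charge-failure path and other
error handling are elided):

    /* Reserve path, in hugetlb_reserve_pages(), private mapping branch: */
    if (hugetlb_cgroup_charge_cgroup_rsvd(hstate_index(h),
                chg * pages_per_huge_page(h), &h_cg))
            return -ENOMEM;
    resv_map_set_hugetlb_cgroup_uncharge_info(resv_map, h_cg, h);

    /* Teardown path, in hugetlb_vm_op_close(): */
    hugetlb_cgroup_uncharge_counter(resv, start, end);
    /* ...which uncharges (end - start) * resv->pages_per_hpage pages from
     * resv->reservation_counter and drops the stored css reference. */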

Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h        |   10 ++++++
 include/linux/hugetlb_cgroup.h |   39 +++++++++++++++++++++++--
 mm/hugetlb.c                   |   47 +++++++++++++++++++++++++++++--
 mm/hugetlb_cgroup.c            |   41 ++++-----------------------
 4 files changed, 97 insertions(+), 40 deletions(-)

--- a/include/linux/hugetlb_cgroup.h~hugetlb_cgroup-add-reservation-accounting-for-private-mappings
+++ a/include/linux/hugetlb_cgroup.h
@@ -27,6 +27,33 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER	2
 
 #ifdef CONFIG_CGROUP_HUGETLB
+enum hugetlb_memory_event {
+	HUGETLB_MAX,
+	HUGETLB_NR_MEMORY_EVENTS,
+};
+
+struct hugetlb_cgroup {
+	struct cgroup_subsys_state css;
+
+	/*
+	 * the counter to account for hugepages from hugetlb.
+	 */
+	struct page_counter hugepage[HUGE_MAX_HSTATE];
+
+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter rsvd_hugepage[HUGE_MAX_HSTATE];
+
+	atomic_long_t events[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
+	atomic_long_t events_local[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
+
+	/* Handle for "hugetlb.events" */
+	struct cgroup_file events_file[HUGE_MAX_HSTATE];
+
+	/* Handle for "hugetlb.events.local" */
+	struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
+};
 
 static inline struct hugetlb_cgroup *
 __hugetlb_cgroup_from_page(struct page *page, bool rsvd)
@@ -102,9 +129,9 @@ extern void hugetlb_cgroup_uncharge_cgro
 					   struct hugetlb_cgroup *h_cg);
 extern void hugetlb_cgroup_uncharge_cgroup_rsvd(int idx, unsigned long nr_pages,
 						struct hugetlb_cgroup *h_cg);
-extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
-					    unsigned long nr_pages,
-					    struct cgroup_subsys_state *css);
+extern void hugetlb_cgroup_uncharge_counter(struct resv_map *resv,
+					    unsigned long start,
+					    unsigned long end);
 
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
@@ -193,6 +220,12 @@ hugetlb_cgroup_uncharge_cgroup_rsvd(int
 {
 }
 
+static inline void hugetlb_cgroup_uncharge_counter(struct resv_map *resv,
+						   unsigned long start,
+						   unsigned long end)
+{
+}
+
 static inline void hugetlb_cgroup_file_init(void)
 {
 }
--- a/include/linux/hugetlb.h~hugetlb_cgroup-add-reservation-accounting-for-private-mappings
+++ a/include/linux/hugetlb.h
@@ -46,6 +46,16 @@ struct resv_map {
 	long adds_in_progress;
 	struct list_head region_cache;
 	long region_cache_count;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * On private mappings, the counter to uncharge reservations is stored
+	 * here. If these fields are 0, then either the mapping is shared, or
+	 * cgroup accounting is disabled for this resv_map.
+	 */
+	struct page_counter *reservation_counter;
+	unsigned long pages_per_hpage;
+	struct cgroup_subsys_state *css;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
--- a/mm/hugetlb.c~hugetlb_cgroup-add-reservation-accounting-for-private-mappings
+++ a/mm/hugetlb.c
@@ -650,6 +650,25 @@ static void set_vma_private_data(struct
 	vma->vm_private_data = (void *)value;
 }
 
+static void
+resv_map_set_hugetlb_cgroup_uncharge_info(struct resv_map *resv_map,
+					  struct hugetlb_cgroup *h_cg,
+					  struct hstate *h)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	if (!h_cg || !h) {
+		resv_map->reservation_counter = NULL;
+		resv_map->pages_per_hpage = 0;
+		resv_map->css = NULL;
+	} else {
+		resv_map->reservation_counter =
+			&h_cg->rsvd_hugepage[hstate_index(h)];
+		resv_map->pages_per_hpage = pages_per_huge_page(h);
+		resv_map->css = &h_cg->css;
+	}
+#endif
+}
+
 struct resv_map *resv_map_alloc(void)
 {
 	struct resv_map *resv_map = kmalloc(sizeof(*resv_map), GFP_KERNEL);
@@ -666,6 +685,13 @@ struct resv_map *resv_map_alloc(void)
 	INIT_LIST_HEAD(&resv_map->regions);
 
 	resv_map->adds_in_progress = 0;
+	/*
+	 * Initialize these to 0. On shared mappings, 0's here indicate these
+	 * fields don't do cgroup accounting. On private mappings, these will be
+	 * re-initialized to the proper values, to indicate that hugetlb cgroup
+	 * reservations are to be un-charged from here.
+	 */
+	resv_map_set_hugetlb_cgroup_uncharge_info(resv_map, NULL, NULL);
 
 	INIT_LIST_HEAD(&resv_map->region_cache);
 	list_add(&rg->link, &resv_map->region_cache);
@@ -3196,9 +3222,7 @@ static void hugetlb_vm_op_close(struct v
 	end = vma_hugecache_offset(h, vma, vma->vm_end);
 
 	reserve = (end - start) - region_count(resv, start, end);
-
-	kref_put(&resv->refs, resv_map_release);
-
+	hugetlb_cgroup_uncharge_counter(resv, start, end);
 	if (reserve) {
 		/*
 		 * Decrement reserve counts.  The global reserve count may be
@@ -3207,6 +3231,8 @@ static void hugetlb_vm_op_close(struct v
 		gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
 		hugetlb_acct_memory(h, -gbl_reserve);
 	}
+
+	kref_put(&resv->refs, resv_map_release);
 }
 
 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4555,6 +4581,7 @@ int hugetlb_reserve_pages(struct inode *
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
+	struct hugetlb_cgroup *h_cg;
 	long gbl_reserve;
 
 	/* This should never happen */
@@ -4588,12 +4615,26 @@ int hugetlb_reserve_pages(struct inode *
 		chg = region_chg(resv_map, from, to);
 
 	} else {
+		/* Private mapping. */
 		resv_map = resv_map_alloc();
 		if (!resv_map)
 			return -ENOMEM;
 
 		chg = to - from;
 
+		if (hugetlb_cgroup_charge_cgroup_rsvd(
+			    hstate_index(h), chg * pages_per_huge_page(h),
+			    &h_cg)) {
+			kref_put(&resv_map->refs, resv_map_release);
+			return -ENOMEM;
+		}
+
+		/*
+		 * Since this branch handles private mappings, we attach the
+		 * counter to uncharge for this reservation off resv_map.
+		 */
+		resv_map_set_hugetlb_cgroup_uncharge_info(resv_map, h_cg, h);
+
 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
 	}
--- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-add-reservation-accounting-for-private-mappings
+++ a/mm/hugetlb_cgroup.c
@@ -23,34 +23,6 @@
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
 
-enum hugetlb_memory_event {
-	HUGETLB_MAX,
-	HUGETLB_NR_MEMORY_EVENTS,
-};
-
-struct hugetlb_cgroup {
-	struct cgroup_subsys_state css;
-
-	/*
-	 * the counter to account for hugepages from hugetlb.
-	 */
-	struct page_counter hugepage[HUGE_MAX_HSTATE];
-
-	/*
-	 * the counter to account for hugepage reservations from hugetlb.
-	 */
-	struct page_counter rsvd_hugepage[HUGE_MAX_HSTATE];
-
-	atomic_long_t events[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
-	atomic_long_t events_local[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
-
-	/* Handle for "hugetlb.events" */
-	struct cgroup_file events_file[HUGE_MAX_HSTATE];
-
-	/* Handle for "hugetlb.events.local" */
-	struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
-};
-
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
@@ -408,15 +380,16 @@ void hugetlb_cgroup_uncharge_cgroup_rsvd
 	__hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg, true);
 }
 
-void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
-				     unsigned long nr_pages,
-				     struct cgroup_subsys_state *css)
+void hugetlb_cgroup_uncharge_counter(struct resv_map *resv, unsigned long start,
+				     unsigned long end)
 {
-	if (hugetlb_cgroup_disabled() || !p || !css)
+	if (hugetlb_cgroup_disabled() || !resv || !resv->reservation_counter ||
+	    !resv->css)
 		return;
 
-	page_counter_uncharge(p, nr_pages);
-	css_put(css);
+	page_counter_uncharge(resv->reservation_counter,
+			      (end - start) * resv->pages_per_hpage);
+	css_put(resv->css);
 }
 
 enum {
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb-disable-region_add-file_region-coalescing.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (110 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb_cgroup-add-accounting-for-shared-mappings.patch " Andrew Morton
                   ` (126 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb: disable region_add file_region coalescing
has been added to the -mm tree.  Its filename is
     hugetlb-disable-region_add-file_region-coalescing.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb-disable-region_add-file_region-coalescing.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb-disable-region_add-file_region-coalescing.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb: disable region_add file_region coalescing

A follow-up patch in this series adds hugetlb cgroup uncharge info to the
file_region entries in resv->regions.  The cgroup uncharge info may differ
between regions, so they can no longer be coalesced at region_add time.
This patch therefore disables region coalescing in region_add.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6].
New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
[5->6].
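
The gap-filling rule is easy to model outside the kernel.  This
standalone program (an illustration only, not kernel code) reproduces
the new-code result for the example above:

    #include <stdio.h>

    struct region { long from, to; };

    int main(void)
    {
            /* Existing reservations: [0->1], [2->3], [5->6]. */
            struct region rg[] = { {0, 1}, {2, 3}, {5, 6} };
            long f = 0, t = 5, last = f;    /* request [0->5) */
            size_t i;

            /* Emit one new region per uncovered gap; never merge. */
            for (i = 0; i < sizeof(rg) / sizeof(rg[0]); i++) {
                    if (rg[i].from < f) {
                            if (rg[i].to > last)
                                    last = rg[i].to;
                            continue;
                    }
                    if (rg[i].from > t)
                            break;
                    if (rg[i].from > last)
                            printf("add [%ld->%ld]\n", last, rg[i].from);
                    last = rg[i].to;
            }
            if (last < t)
                    printf("add [%ld->%ld]\n", last, t);
            /* Prints "add [1->2]" and "add [3->5]", yielding the five
             * regions listed above once merged with the existing list. */
            return 0;
    }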

Special care needs to be taken to handle the resv->adds_in_progress
variable correctly.  In the past, only 1 region would be added for every
region_chg and region_add call.  But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in
region_chg, or decrement adds_in_progress by 1 after region_add or
region_abort.  Instead, region_chg calls add_reservation_in_range() to
count the number of regions needed, allocates them, and passes that count
to region_add and region_abort so that they can decrement
adds_in_progress correctly.

We've also dropped the assumption that a region_add call after region_chg
never fails.  region_chg now pre-allocates at least 1 region for
region_add.  If region_add needs more regions than region_chg has
allocated for it, then it may fail.
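
In short, the regions_needed value threads through the whole sequence; a
sketch of the new calling convention, taken from the hunks below:

    long chg, add, regions_needed;

    chg = region_chg(resv, f, t, &regions_needed);
    /* adds_in_progress += regions_needed (always at least 1) */

    add = region_add(resv, f, t, regions_needed);
    /* adds_in_progress -= regions_needed; returns -ENOMEM only if extra
     * file_region entries were needed and could not be allocated */

    /* or, if the operation is abandoned after region_chg: */
    region_abort(resv, f, t, regions_needed);
    /* adds_in_progress -= regions_needed */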

Link: http://lkml.kernel.org/r/20200211213128.73302-4-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |  340 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 230 insertions(+), 110 deletions(-)

--- a/mm/hugetlb.c~hugetlb-disable-region_add-file_region-coalescing
+++ a/mm/hugetlb.c
@@ -245,110 +245,180 @@ struct file_region {
 	long to;
 };
 
+/* Helper that removes a struct file_region from the resv_map cache and returns
+ * it for use.
+ */
+static struct file_region *
+get_file_region_entry_from_cache(struct resv_map *resv, long from, long to)
+{
+	struct file_region *nrg = NULL;
+
+	VM_BUG_ON(resv->region_cache_count <= 0);
+
+	resv->region_cache_count--;
+	nrg = list_first_entry(&resv->region_cache, struct file_region, link);
+	VM_BUG_ON(!nrg);
+	list_del(&nrg->link);
+
+	nrg->from = from;
+	nrg->to = to;
+
+	return nrg;
+}
+
 /* Must be called with resv->lock held. Calling this with count_only == true
  * will count the number of pages to be added but will not modify the linked
- * list.
+ * list. If regions_needed != NULL and count_only == true, then regions_needed
+ * will indicate the number of file_regions needed in the cache to carry out
+ * the addition of regions for this range.
  */
 static long add_reservation_in_range(struct resv_map *resv, long f, long t,
-				     bool count_only)
+				     long *regions_needed, bool count_only)
 {
-	long chg = 0;
+	long add = 0;
 	struct list_head *head = &resv->regions;
+	long last_accounted_offset = f;
 	struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
 
-	/* Locate the region we are before or in. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
+	if (regions_needed)
+		*regions_needed = 0;
 
-	/* Round our left edge to the current segment if it encloses us. */
-	if (f > rg->from)
-		f = rg->from;
-
-	chg = t - f;
-
-	/* Check for and consume any regions we now overlap with. */
-	nrg = rg;
-	list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-		if (&rg->link == head)
-			break;
+	/* In this loop, we essentially handle an entry for the range
+	 * [last_accounted_offset, rg->from), at every iteration, with some
+	 * bounds checking.
+	 */
+	list_for_each_entry_safe(rg, trg, head, link) {
+		/* Skip irrelevant regions that start before our range. */
+		if (rg->from < f) {
+			/* If this region ends after the last accounted offset,
+			 * then we need to update last_accounted_offset.
+			 */
+			if (rg->to > last_accounted_offset)
+				last_accounted_offset = rg->to;
+			continue;
+		}
+
+		/* When we find a region that starts beyond our range, we've
+		 * finished.
+		 */
 		if (rg->from > t)
 			break;
 
-		/* We overlap with this area, if it extends further than
-		 * us then we must extend ourselves.  Account for its
-		 * existing reservation.
+		/* Add an entry for last_accounted_offset -> rg->from, and
+		 * update last_accounted_offset.
 		 */
-		if (rg->to > t) {
-			chg += rg->to - t;
-			t = rg->to;
-		}
-		chg -= rg->to - rg->from;
-
-		if (!count_only && rg != nrg) {
-			list_del(&rg->link);
-			kfree(rg);
-		}
+		if (rg->from > last_accounted_offset) {
+			add += rg->from - last_accounted_offset;
+			if (!count_only) {
+				nrg = get_file_region_entry_from_cache(
+					resv, last_accounted_offset, rg->from);
+				list_add(&nrg->link, rg->link.prev);
+			} else if (regions_needed)
+				*regions_needed += 1;
+		}
+
+		last_accounted_offset = rg->to;
+	}
+
+	/* Handle the case where our range extends beyond
+	 * last_accounted_offset.
+	 */
+	if (last_accounted_offset < t) {
+		add += t - last_accounted_offset;
+		if (!count_only) {
+			nrg = get_file_region_entry_from_cache(
+				resv, last_accounted_offset, t);
+			list_add(&nrg->link, rg->link.prev);
+		} else if (regions_needed)
+			*regions_needed += 1;
 	}
 
-	if (!count_only) {
-		nrg->from = f;
-		nrg->to = t;
-	}
-
-	return chg;
+	VM_BUG_ON(add < 0);
+	return add;
 }
 
 /*
  * Add the huge page range represented by [f, t) to the reserve
- * map.  Existing regions will be expanded to accommodate the specified
- * range, or a region will be taken from the cache.  Sufficient regions
- * must exist in the cache due to the previous call to region_chg with
- * the same range.
+ * map.  Regions will be taken from the cache to fill in this range.
+ * Sufficient regions should exist in the cache due to the previous
+ * call to region_chg with the same range, but in some cases the cache will not
+ * have sufficient entries due to races with other code doing region_add or
+ * region_del.  The extra needed entries will be allocated.
  *
- * Return the number of new huge pages added to the map.  This
- * number is greater than or equal to zero.
- */
-static long region_add(struct resv_map *resv, long f, long t)
-{
-	struct list_head *head = &resv->regions;
-	struct file_region *rg, *nrg;
-	long add = 0;
+ * regions_needed is the out value provided by a previous call to region_chg.
+ *
+ * Return the number of new huge pages added to the map.  This number is greater
+ * than or equal to zero.  If file_region entries needed to be allocated for
+ * this operation and we were not able to allocate, it returns -ENOMEM.
+ * region_add of regions of length 1 never allocate file_regions and cannot
+ * fail; region_chg will always allocate at least 1 entry and a region_add for
+ * 1 page will only require at most 1 entry.
+ */
+static long region_add(struct resv_map *resv, long f, long t,
+		       long in_regions_needed)
+{
+	long add = 0, actual_regions_needed = 0, i = 0;
+	struct file_region *trg = NULL, *rg = NULL;
+	struct list_head allocated_regions;
+
+	INIT_LIST_HEAD(&allocated_regions);
 
 	spin_lock(&resv->lock);
-	/* Locate the region we are either in or before. */
-	list_for_each_entry(rg, head, link)
-		if (f <= rg->to)
-			break;
+retry:
+
+	/* Count how many regions are actually needed to execute this add. */
+	add_reservation_in_range(resv, f, t, &actual_regions_needed, true);
 
 	/*
-	 * If no region exists which can be expanded to include the
-	 * specified range, pull a region descriptor from the cache
-	 * and use it for this range.
-	 */
-	if (&rg->link == head || t < rg->from) {
-		VM_BUG_ON(resv->region_cache_count <= 0);
+	 * Check for sufficient descriptors in the cache to accommodate
+	 * this add operation. Note that actual_regions_needed may be greater
+	 * than in_regions_needed. In this case, we need to make sure that we
+	 * allocate extra entries, such that we have enough for all the
+	 * existing adds_in_progress, plus the excess needed for this
+	 * operation.
+	 */
+	if (resv->region_cache_count <
+	    resv->adds_in_progress +
+		    (actual_regions_needed - in_regions_needed)) {
+		/* region_add operation of range 1 should never need to
+		 * allocate file_region entries.
+		 */
+		VM_BUG_ON(t - f <= 1);
 
-		resv->region_cache_count--;
-		nrg = list_first_entry(&resv->region_cache, struct file_region,
-					link);
-		list_del(&nrg->link);
+		/* Must drop lock to allocate a new descriptor. */
+		spin_unlock(&resv->lock);
+		for (i = 0; i < (actual_regions_needed - in_regions_needed);
+		     i++) {
+			trg = kmalloc(sizeof(*trg), GFP_KERNEL);
+			if (!trg)
+				goto out_of_memory;
+			list_add(&trg->link, &allocated_regions);
+		}
+		spin_lock(&resv->lock);
 
-		nrg->from = f;
-		nrg->to = t;
-		list_add(&nrg->link, rg->link.prev);
+		list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+			list_del(&rg->link);
+			list_add(&rg->link, &resv->region_cache);
+			resv->region_cache_count++;
+		}
 
-		add += t - f;
-		goto out_locked;
+		goto retry;
 	}
 
-	add = add_reservation_in_range(resv, f, t, false);
+	add = add_reservation_in_range(resv, f, t, NULL, false);
+
+	resv->adds_in_progress -= in_regions_needed;
 
-out_locked:
-	resv->adds_in_progress--;
 	spin_unlock(&resv->lock);
 	VM_BUG_ON(add < 0);
 	return add;
+
+out_of_memory:
+	list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return -ENOMEM;
 }
 
 /*
@@ -358,49 +428,79 @@ out_locked:
  * call to region_add that will actually modify the reserve
  * map to add the specified range [f, t).  region_chg does
  * not change the number of huge pages represented by the
- * map.  A new file_region structure is added to the cache
- * as a placeholder, so that the subsequent region_add
- * call will have all the regions it needs and will not fail.
+ * map.  A number of new file_region structures is added to the cache as a
+ * placeholder, for the subsequent region_add call to use. At least 1
+ * file_region structure is added.
+ *
+ * out_regions_needed is the number of regions added to the
+ * resv->adds_in_progress.  This value needs to be provided to a follow up call
+ * to region_add or region_abort for proper accounting.
  *
  * Returns the number of huge pages that need to be added to the existing
  * reservation map for the range [f, t).  This number is greater or equal to
  * zero.  -ENOMEM is returned if a new file_region structure or cache entry
  * is needed and can not be allocated.
  */
-static long region_chg(struct resv_map *resv, long f, long t)
+static long region_chg(struct resv_map *resv, long f, long t,
+		       long *out_regions_needed)
 {
-	long chg = 0;
+	struct file_region *trg = NULL, *rg = NULL;
+	long chg = 0, i = 0, to_allocate = 0;
+	struct list_head allocated_regions;
+
+	INIT_LIST_HEAD(&allocated_regions);
 
 	spin_lock(&resv->lock);
-retry_locked:
-	resv->adds_in_progress++;
+
+	/* Count how many hugepages in this range are NOT represented. */
+	chg = add_reservation_in_range(resv, f, t, out_regions_needed, true);
+
+	if (*out_regions_needed == 0)
+		*out_regions_needed = 1;
+
+	resv->adds_in_progress += *out_regions_needed;
 
 	/*
 	 * Check for sufficient descriptors in the cache to accommodate
 	 * the number of in progress add operations.
 	 */
-	if (resv->adds_in_progress > resv->region_cache_count) {
-		struct file_region *trg;
+	while (resv->region_cache_count < resv->adds_in_progress) {
+		to_allocate = resv->adds_in_progress - resv->region_cache_count;
 
-		VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1);
-		/* Must drop lock to allocate a new descriptor. */
-		resv->adds_in_progress--;
+		/* Must drop lock to allocate a new descriptor. Note that even
+		 * though we drop the lock here, we do not make another call to
+		 * add_reservation_in_range after re-acquiring the lock.
+		 * Essentially this branch makes sure that we have enough
+		 * descriptors in the cache as suggested by the first call to
+		 * add_reservation_in_range. If more regions turn out to be
+		 * required, region_add will deal with it.
+		 */
 		spin_unlock(&resv->lock);
-
-		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
-		if (!trg)
-			return -ENOMEM;
+		for (i = 0; i < to_allocate; i++) {
+			trg = kmalloc(sizeof(*trg), GFP_KERNEL);
+			if (!trg)
+				goto out_of_memory;
+			list_add(&trg->link, &allocated_regions);
+		}
 
 		spin_lock(&resv->lock);
-		list_add(&trg->link, &resv->region_cache);
-		resv->region_cache_count++;
-		goto retry_locked;
-	}
 
-	chg = add_reservation_in_range(resv, f, t, true);
+		list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+			list_del(&rg->link);
+			list_add(&rg->link, &resv->region_cache);
+			resv->region_cache_count++;
+		}
+	}
 
 	spin_unlock(&resv->lock);
 	return chg;
+
+out_of_memory:
+	list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
+		list_del(&rg->link);
+		kfree(rg);
+	}
+	return -ENOMEM;
 }
 
 /*
@@ -408,17 +508,20 @@ retry_locked:
  * of the resv_map keeps track of the operations in progress between
  * calls to region_chg and region_add.  Operations are sometimes
  * aborted after the call to region_chg.  In such cases, region_abort
- * is called to decrement the adds_in_progress counter.
+ * is called to decrement the adds_in_progress counter. regions_needed
+ * is the value returned by the region_chg call, it is used to decrement
+ * the adds_in_progress counter.
  *
  * NOTE: The range arguments [f, t) are not needed or used in this
  * routine.  They are kept to make reading the calling code easier as
  * arguments will match the associated region_chg call.
  */
-static void region_abort(struct resv_map *resv, long f, long t)
+static void region_abort(struct resv_map *resv, long f, long t,
+			 long regions_needed)
 {
 	spin_lock(&resv->lock);
 	VM_BUG_ON(!resv->region_cache_count);
-	resv->adds_in_progress--;
+	resv->adds_in_progress -= regions_needed;
 	spin_unlock(&resv->lock);
 }
 
@@ -1904,6 +2007,7 @@ static long __vma_reservation_common(str
 	struct resv_map *resv;
 	pgoff_t idx;
 	long ret;
+	long dummy_out_regions_needed;
 
 	resv = vma_resv_map(vma);
 	if (!resv)
@@ -1912,20 +2016,29 @@ static long __vma_reservation_common(str
 	idx = vma_hugecache_offset(h, vma, addr);
 	switch (mode) {
 	case VMA_NEEDS_RESV:
-		ret = region_chg(resv, idx, idx + 1);
+		ret = region_chg(resv, idx, idx + 1, &dummy_out_regions_needed);
+		/* We assume that vma_reservation_* routines always operate on
+		 * 1 page, and that adding to resv map a 1 page entry can only
+		 * ever require 1 region.
+		 */
+		VM_BUG_ON(dummy_out_regions_needed != 1);
 		break;
 	case VMA_COMMIT_RESV:
-		ret = region_add(resv, idx, idx + 1);
+		ret = region_add(resv, idx, idx + 1, 1);
+		/* region_add calls of range 1 should never fail. */
+		VM_BUG_ON(ret < 0);
 		break;
 	case VMA_END_RESV:
-		region_abort(resv, idx, idx + 1);
+		region_abort(resv, idx, idx + 1, 1);
 		ret = 0;
 		break;
 	case VMA_ADD_RESV:
-		if (vma->vm_flags & VM_MAYSHARE)
-			ret = region_add(resv, idx, idx + 1);
-		else {
-			region_abort(resv, idx, idx + 1);
+		if (vma->vm_flags & VM_MAYSHARE) {
+			ret = region_add(resv, idx, idx + 1, 1);
+			/* region_add calls of range 1 should never fail. */
+			VM_BUG_ON(ret < 0);
+		} else {
+			region_abort(resv, idx, idx + 1, 1);
 			ret = region_del(resv, idx, idx + 1);
 		}
 		break;
@@ -4577,12 +4690,12 @@ int hugetlb_reserve_pages(struct inode *
 					struct vm_area_struct *vma,
 					vm_flags_t vm_flags)
 {
-	long ret, chg;
+	long ret, chg, add = -1;
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
 	struct hugetlb_cgroup *h_cg;
-	long gbl_reserve;
+	long gbl_reserve, regions_needed = 0;
 
 	/* This should never happen */
 	if (from > to) {
@@ -4612,7 +4725,7 @@ int hugetlb_reserve_pages(struct inode *
 		 */
 		resv_map = inode_resv_map(inode);
 
-		chg = region_chg(resv_map, from, to);
+		chg = region_chg(resv_map, from, to, &regions_needed);
 
 	} else {
 		/* Private mapping. */
@@ -4678,9 +4791,14 @@ int hugetlb_reserve_pages(struct inode *
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		long add = region_add(resv_map, from, to);
+		add = region_add(resv_map, from, to, regions_needed);
 
-		if (unlikely(chg > add)) {
+		if (unlikely(add < 0)) {
+			hugetlb_acct_memory(h, -gbl_reserve);
+			/* put back original number of pages, chg */
+			(void)hugepage_subpool_put_pages(spool, chg);
+			goto out_err;
+		} else if (unlikely(chg > add)) {
 			/*
 			 * pages in this range were added to the reserve
 			 * map between region_chg and region_add.  This
@@ -4698,9 +4816,11 @@ int hugetlb_reserve_pages(struct inode *
 	return 0;
 out_err:
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
-		/* Don't call region_abort if region_chg failed */
-		if (chg >= 0)
-			region_abort(resv_map, from, to);
+		/* Only call region_abort if the region_chg succeeded but the
+		 * region_add failed or didn't run.
+		 */
+		if (chg >= 0 && add < 0)
+			region_abort(resv_map, from, to, regions_needed);
 	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		kref_put(&resv_map->refs, resv_map_release);
 	return ret;
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb_cgroup-add-accounting-for-shared-mappings.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (111 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb-disable-region_add-file_region-coalescing.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb_cgroup-support-noreserve-mappings.patch " Andrew Morton
                   ` (125 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb_cgroup: add accounting for shared mappings
has been added to the -mm tree.  Its filename is
     hugetlb_cgroup-add-accounting-for-shared-mappings.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb_cgroup-add-accounting-for-shared-mappings.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb_cgroup-add-accounting-for-shared-mappings.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb_cgroup: add accounting for shared mappings

For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

After a call to region_chg, we charge the appropriate hugetlb_cgroup, and
if successful, we pass the hugetlb_cgroup info on to a follow-up
region_add call.  When a file_region entry is added to the resv_map via
region_add, we put the pointer to that cgroup in
file_region->reservation_counter.  If charging doesn't succeed, we report
the error to the caller, so that the kernel fails the reservation.

On region_del, which is when the hugetlb memory is unreserved, we also
uncharge the file_region->reservation_counter.
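
A condensed sketch of the flow, assembled from the hunks below (not
verbatim; error handling elided, and the exact call sites are in the
diff):

    chg = region_chg(resv_map, from, to, &regions_needed);

    if (hugetlb_cgroup_charge_cgroup_rsvd(hstate_index(h),
                chg * pages_per_huge_page(h), &h_cg))
            goto out_err;           /* kernel fails the reservation */

    add = region_add(resv_map, from, to, regions_needed, h, h_cg);
    /* each new file_region records reservation_counter =
     * &h_cg->rsvd_hugepage[hstate_index(h)] and css = &h_cg->css;
     * region_del() later uncharges via
     * hugetlb_cgroup_uncharge_file_region(). */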

Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h        |   35 +++++++
 include/linux/hugetlb_cgroup.h |   10 ++
 mm/hugetlb.c                   |  148 +++++++++++++++++++------------
 mm/hugetlb_cgroup.c            |   15 +++
 4 files changed, 154 insertions(+), 54 deletions(-)

--- a/include/linux/hugetlb_cgroup.h~hugetlb_cgroup-add-accounting-for-shared-mappings
+++ a/include/linux/hugetlb_cgroup.h
@@ -133,11 +133,21 @@ extern void hugetlb_cgroup_uncharge_coun
 					    unsigned long start,
 					    unsigned long end);
 
+extern void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
+						struct file_region *rg,
+						unsigned long nr_pages);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
 				   struct page *newhpage);
 
 #else
+static inline void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
+						       struct file_region *rg,
+						       unsigned long nr_pages)
+{
+}
+
 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
 {
 	return NULL;
--- a/include/linux/hugetlb.h~hugetlb_cgroup-add-accounting-for-shared-mappings
+++ a/include/linux/hugetlb.h
@@ -57,6 +57,41 @@ struct resv_map {
 	struct cgroup_subsys_state *css;
 #endif
 };
+
+/*
+ * Region tracking -- allows tracking of reservations and instantiated pages
+ *                    across the pages in a mapping.
+ *
+ * The region data structures are embedded into a resv_map and protected
+ * by a resv_map's lock.  The set of regions within the resv_map represent
+ * reservations for huge pages, or huge pages that have already been
+ * instantiated within the map.  The from and to elements are huge page
+ * indices into the associated mapping.  from indicates the starting index
+ * of the region.  to represents the first index past the end of  the region.
+ *
+ * For example, a file region structure with from == 0 and to == 4 represents
+ * four huge pages in a mapping.  It is important to note that the to element
+ * represents the first element past the end of the region. This is used in
+ * arithmetic as 4(to) - 0(from) = 4 huge pages in the region.
+ *
+ * Interval notation of the form [from, to) will be used to indicate that
+ * the endpoint from is inclusive and to is exclusive.
+ */
+struct file_region {
+	struct list_head link;
+	long from;
+	long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+	/*
+	 * On shared mappings, each reserved region appears as a struct
+	 * file_region in resv_map. These fields hold the info needed to
+	 * uncharge each reservation.
+	 */
+	struct page_counter *reservation_counter;
+	struct cgroup_subsys_state *css;
+#endif
+};
+
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
 
--- a/mm/hugetlb.c~hugetlb_cgroup-add-accounting-for-shared-mappings
+++ a/mm/hugetlb.c
@@ -220,31 +220,6 @@ static inline struct hugepage_subpool *s
 	return subpool_inode(file_inode(vma->vm_file));
 }
 
-/*
- * Region tracking -- allows tracking of reservations and instantiated pages
- *                    across the pages in a mapping.
- *
- * The region data structures are embedded into a resv_map and protected
- * by a resv_map's lock.  The set of regions within the resv_map represent
- * reservations for huge pages, or huge pages that have already been
- * instantiated within the map.  The from and to elements are huge page
- * indicies into the associated mapping.  from indicates the starting index
- * of the region.  to represents the first index past the end of  the region.
- *
- * For example, a file region structure with from == 0 and to == 4 represents
- * four huge pages in a mapping.  It is important to note that the to element
- * represents the first element past the end of the region. This is used in
- * arithmetic as 4(to) - 0(from) = 4 huge pages in the region.
- *
- * Interval notation of the form [from, to) will be used to indicate that
- * the endpoint from is inclusive and to is exclusive.
- */
-struct file_region {
-	struct list_head link;
-	long from;
-	long to;
-};
-
 /* Helper that removes a struct file_region from the resv_map cache and returns
  * it for use.
  */
@@ -266,6 +241,41 @@ get_file_region_entry_from_cache(struct
 	return nrg;
 }
 
+static void copy_hugetlb_cgroup_uncharge_info(struct file_region *nrg,
+					      struct file_region *rg)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	nrg->reservation_counter = rg->reservation_counter;
+	nrg->css = rg->css;
+	if (rg->css)
+		css_get(rg->css);
+#endif
+}
+
+/* Helper that records hugetlb_cgroup uncharge info. */
+static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
+						struct hstate *h,
+						struct resv_map *resv,
+						struct file_region *nrg)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	if (h_cg) {
+		nrg->reservation_counter =
+			&h_cg->rsvd_hugepage[hstate_index(h)];
+		nrg->css = &h_cg->css;
+		if (!resv->pages_per_hpage)
+			resv->pages_per_hpage = pages_per_huge_page(h);
+		/* pages_per_hpage should be the same for all entries in
+		 * a resv_map.
+		 */
+		VM_BUG_ON(resv->pages_per_hpage != pages_per_huge_page(h));
+	} else {
+		nrg->reservation_counter = NULL;
+		nrg->css = NULL;
+	}
+#endif
+}
+
 /* Must be called with resv->lock held. Calling this with count_only == true
  * will count the number of pages to be added but will not modify the linked
  * list. If regions_needed != NULL and count_only == true, then regions_needed
@@ -273,7 +283,9 @@ get_file_region_entry_from_cache(struct
  * add the regions for this range.
  */
 static long add_reservation_in_range(struct resv_map *resv, long f, long t,
-				     long *regions_needed, bool count_only)
+				     struct hugetlb_cgroup *h_cg,
+				     struct hstate *h, long *regions_needed,
+				     bool count_only)
 {
 	long add = 0;
 	struct list_head *head = &resv->regions;
@@ -312,6 +324,8 @@ static long add_reservation_in_range(str
 			if (!count_only) {
 				nrg = get_file_region_entry_from_cache(
 					resv, last_accounted_offset, rg->from);
+				record_hugetlb_cgroup_uncharge_info(h_cg, h,
+								    resv, nrg);
 				list_add(&nrg->link, rg->link.prev);
 			} else if (regions_needed)
 				*regions_needed += 1;
@@ -328,6 +342,7 @@ static long add_reservation_in_range(str
 		if (!count_only) {
 			nrg = get_file_region_entry_from_cache(
 				resv, last_accounted_offset, t);
+			record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg);
 			list_add(&nrg->link, rg->link.prev);
 		} else if (regions_needed)
 			*regions_needed += 1;
@@ -355,7 +370,8 @@ static long add_reservation_in_range(str
  * 1 page will only require at most 1 entry.
  */
 static long region_add(struct resv_map *resv, long f, long t,
-		       long in_regions_needed)
+		       long in_regions_needed, struct hstate *h,
+		       struct hugetlb_cgroup *h_cg)
 {
 	long add = 0, actual_regions_needed = 0, i = 0;
 	struct file_region *trg = NULL, *rg = NULL;
@@ -367,7 +383,8 @@ static long region_add(struct resv_map *
 retry:
 
 	/* Count how many regions are actually needed to execute this add. */
-	add_reservation_in_range(resv, f, t, &actual_regions_needed, true);
+	add_reservation_in_range(resv, f, t, NULL, NULL, &actual_regions_needed,
+				 true);
 
 	/*
 	 * Check for sufficient descriptors in the cache to accommodate
@@ -405,7 +422,7 @@ retry:
 		goto retry;
 	}
 
-	add = add_reservation_in_range(resv, f, t, NULL, false);
+	add = add_reservation_in_range(resv, f, t, h_cg, h, NULL, false);
 
 	resv->adds_in_progress -= in_regions_needed;
 
@@ -453,7 +470,8 @@ static long region_chg(struct resv_map *
 	spin_lock(&resv->lock);
 
 	/* Count how many hugepages in this range are NOT represented. */
-	chg = add_reservation_in_range(resv, f, t, out_regions_needed, true);
+	chg = add_reservation_in_range(resv, f, t, NULL, NULL,
+				       out_regions_needed, true);
 
 	if (*out_regions_needed == 0)
 		*out_regions_needed = 1;
@@ -589,11 +607,17 @@ retry:
 			/* New entry for end of split region */
 			nrg->from = t;
 			nrg->to = rg->to;
+
+			copy_hugetlb_cgroup_uncharge_info(nrg, rg);
+
 			INIT_LIST_HEAD(&nrg->link);
 
 			/* Original entry is trimmed */
 			rg->to = f;
 
+			hugetlb_cgroup_uncharge_file_region(
+				resv, rg, t - f);
+
 			list_add(&nrg->link, &rg->link);
 			nrg = NULL;
 			break;
@@ -601,6 +625,8 @@ retry:
 
 		if (f <= rg->from && t >= rg->to) { /* Remove entire region */
 			del += rg->to - rg->from;
+			hugetlb_cgroup_uncharge_file_region(resv, rg,
+							    rg->to - rg->from);
 			list_del(&rg->link);
 			kfree(rg);
 			continue;
@@ -609,9 +635,15 @@ retry:
 		if (f <= rg->from) {	/* Trim beginning of region */
 			del += t - rg->from;
+			hugetlb_cgroup_uncharge_file_region(resv, rg,
+							    t - rg->from);
+
 			rg->from = t;
 		} else {		/* Trim end of region */
 			del += rg->to - f;
+			hugetlb_cgroup_uncharge_file_region(resv, rg,
+							    rg->to - f);
+
 			rg->to = f;
 		}
 	}
 
@@ -2024,7 +2056,7 @@ static long __vma_reservation_common(str
 		VM_BUG_ON(dummy_out_regions_needed != 1);
 		break;
 	case VMA_COMMIT_RESV:
-		ret = region_add(resv, idx, idx + 1, 1);
+		ret = region_add(resv, idx, idx + 1, 1, NULL, NULL);
 		/* region_add calls of range 1 should never fail. */
 		VM_BUG_ON(ret < 0);
 		break;
@@ -2034,7 +2066,7 @@ static long __vma_reservation_common(str
 		break;
 	case VMA_ADD_RESV:
 		if (vma->vm_flags & VM_MAYSHARE) {
-			ret = region_add(resv, idx, idx + 1, 1);
+			ret = region_add(resv, idx, idx + 1, 1, NULL, NULL);
 			/* region_add calls of range 1 should never fail. */
 			VM_BUG_ON(ret < 0);
 		} else {
@@ -4694,7 +4726,7 @@ int hugetlb_reserve_pages(struct inode *
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
-	struct hugetlb_cgroup *h_cg;
+	struct hugetlb_cgroup *h_cg = NULL;
 	long gbl_reserve, regions_needed = 0;
 
 	/* This should never happen */
@@ -4735,19 +4767,6 @@ int hugetlb_reserve_pages(struct inode *
 
 		chg = to - from;
 
-		if (hugetlb_cgroup_charge_cgroup_rsvd(
-			    hstate_index(h), chg * pages_per_huge_page(h),
-			    &h_cg)) {
-			kref_put(&resv_map->refs, resv_map_release);
-			return -ENOMEM;
-		}
-
-		/*
-		 * Since this branch handles private mappings, we attach the
-		 * counter to uncharge for this reservation off resv_map.
-		 */
-		resv_map_set_hugetlb_cgroup_uncharge_info(resv_map, h_cg, h);
-
 		set_vma_resv_map(vma, resv_map);
 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
 	}
@@ -4757,6 +4776,21 @@ int hugetlb_reserve_pages(struct inode *
 		goto out_err;
 	}
 
+	ret = hugetlb_cgroup_charge_cgroup_rsvd(
+		hstate_index(h), chg * pages_per_huge_page(h), &h_cg);
+
+	if (ret < 0) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
+		/* For private mappings, the hugetlb_cgroup uncharge info hangs
+		 * off the resv_map.
+		 */
+		resv_map_set_hugetlb_cgroup_uncharge_info(resv_map, h_cg, h);
+	}
+
 	/*
 	 * There must be enough pages in the subpool for the mapping. If
 	 * the subpool has a minimum size, there may be some global
@@ -4765,7 +4799,7 @@ int hugetlb_reserve_pages(struct inode *
 	gbl_reserve = hugepage_subpool_get_pages(spool, chg);
 	if (gbl_reserve < 0) {
 		ret = -ENOSPC;
-		goto out_err;
+		goto out_uncharge_cgroup;
 	}
 
 	/*
@@ -4774,9 +4808,7 @@ int hugetlb_reserve_pages(struct inode *
 	 */
 	ret = hugetlb_acct_memory(h, gbl_reserve);
 	if (ret < 0) {
-		/* put back original number of pages, chg */
-		(void)hugepage_subpool_put_pages(spool, chg);
-		goto out_err;
+		goto out_put_pages;
 	}
 
 	/*
@@ -4791,13 +4823,11 @@ int hugetlb_reserve_pages(struct inode *
 	 * else has to be done for private mappings here
 	 */
 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
-		add = region_add(resv_map, from, to, regions_needed);
+		add = region_add(resv_map, from, to, regions_needed, h, h_cg);
 
 		if (unlikely(add < 0)) {
 			hugetlb_acct_memory(h, -gbl_reserve);
-			/* put back original number of pages, chg */
-			(void)hugepage_subpool_put_pages(spool, chg);
-			goto out_err;
+			goto out_put_pages;
 		} else if (unlikely(chg > add)) {
 			/*
 			 * pages in this range were added to the reserve
@@ -4808,12 +4838,22 @@ int hugetlb_reserve_pages(struct inode *
 			 */
 			long rsv_adjust;
 
+			hugetlb_cgroup_uncharge_cgroup_rsvd(
+				hstate_index(h),
+				(chg - add) * pages_per_huge_page(h), h_cg);
+
 			rsv_adjust = hugepage_subpool_put_pages(spool,
 								chg - add);
 			hugetlb_acct_memory(h, -rsv_adjust);
 		}
 	}
 	return 0;
+out_put_pages:
+	/* put back original number of pages, chg */
+	(void)hugepage_subpool_put_pages(spool, chg);
+out_uncharge_cgroup:
+	hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
+					    chg * pages_per_huge_page(h), h_cg);
 out_err:
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		/* Only call region_abort if the region_chg succeeded but the
--- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-add-accounting-for-shared-mappings
+++ a/mm/hugetlb_cgroup.c
@@ -392,6 +392,21 @@ void hugetlb_cgroup_uncharge_counter(str
 	css_put(resv->css);
 }
 
+void hugetlb_cgroup_uncharge_file_region(struct resv_map *resv,
+					 struct file_region *rg,
+					 unsigned long nr_pages)
+{
+	if (hugetlb_cgroup_disabled() || !resv || !rg || !nr_pages)
+		return;
+
+	if (rg->reservation_counter && resv->pages_per_hpage && nr_pages > 0 &&
+	    !resv->reservation_counter) {
+		page_counter_uncharge(rg->reservation_counter,
+				      nr_pages * resv->pages_per_hpage);
+		css_put(rg->css);
+	}
+}
+
 enum {
 	RES_USAGE,
 	RES_RSVD_USAGE,
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb_cgroup-support-noreserve-mappings.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (112 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb_cgroup-add-accounting-for-shared-mappings.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb-support-file_region-coalescing-again.patch " Andrew Morton
                   ` (124 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb_cgroup: support noreserve mappings
has been added to the -mm tree.  Its filename is
     hugetlb_cgroup-support-noreserve-mappings.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb_cgroup-support-noreserve-mappings.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb_cgroup-support-noreserve-mappings.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb_cgroup: support noreserve mappings

Support MAP_NORESERVE accounting as part of the new counter.

For each hugepage allocation, at allocation time we check if there is a
reservation for this allocation or not.  If there is a reservation for
this allocation, then this allocation was charged at reservation time, and
we don't re-account it.  If there is no reservation for this allocation,
we charge the appropriate hugetlb_cgroup.

The hugetlb_cgroup to uncharge for this allocation is stored in
page[3].private.  We use new APIs added in an earlier patch to set this
pointer.

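As an illustration, here is a minimal sketch of the scheme (the helper
names below are hypothetical; the real setter/getter APIs come from the
earlier "add interface for charge/uncharge hugetlb reservations" patch):

/* Hypothetical sketch only: the reservation cgroup pointer rides in
 * page[3].private of the hugetlb compound page.
 */
static inline struct hugetlb_cgroup *
sketch_hugetlb_cgroup_rsvd(struct page *page)
{
	return (struct hugetlb_cgroup *)page[3].private;
}

static inline void sketch_set_hugetlb_cgroup_rsvd(struct page *page,
						  struct hugetlb_cgroup *h_cg)
{
	page[3].private = (unsigned long)h_cg;
}
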
Link: http://lkml.kernel.org/r/20200211213128.73302-6-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

--- a/mm/hugetlb.c~hugetlb_cgroup-support-noreserve-mappings
+++ a/mm/hugetlb.c
@@ -1345,6 +1345,8 @@ static void __free_huge_page(struct page
 	clear_page_huge_active(page);
 	hugetlb_cgroup_uncharge_page(hstate_index(h),
 				     pages_per_huge_page(h), page);
+	hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
+					  pages_per_huge_page(h), page);
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
@@ -2181,6 +2183,7 @@ struct page *alloc_huge_page(struct vm_a
 	long gbl_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg;
+	bool deferred_reserve;
 
 	idx = hstate_index(h);
 	/*
@@ -2218,9 +2221,19 @@ struct page *alloc_huge_page(struct vm_a
 			gbl_chg = 1;
 	}
 
+	/* If this allocation is not consuming a reservation, charge it now.
+	 */
+	deferred_reserve = map_chg || avoid_reserve || !vma_resv_map(vma);
+	if (deferred_reserve) {
+		ret = hugetlb_cgroup_charge_cgroup_rsvd(
+			idx, pages_per_huge_page(h), &h_cg);
+		if (ret)
+			goto out_subpool_put;
+	}
+
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
 	if (ret)
-		goto out_subpool_put;
+		goto out_uncharge_cgroup_reservation;
 
 	spin_lock(&hugetlb_lock);
 	/*
@@ -2243,6 +2256,14 @@ struct page *alloc_huge_page(struct vm_a
 		/* Fall through */
 	}
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
+	/* If allocation is not consuming a reservation, also store the
+	 * hugetlb_cgroup pointer on the page.
+	 */
+	if (deferred_reserve) {
+		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
+						  h_cg, page);
+	}
+
 	spin_unlock(&hugetlb_lock);
 
 	set_page_private(page, (unsigned long)spool);
@@ -2267,6 +2288,10 @@ struct page *alloc_huge_page(struct vm_a
 
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
+out_uncharge_cgroup_reservation:
+	if (deferred_reserve)
+		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
+						    h_cg);
 out_subpool_put:
 	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb-support-file_region-coalescing-again.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (113 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb_cgroup-support-noreserve-mappings.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch " Andrew Morton
                   ` (123 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb: support file_region coalescing again
has been added to the -mm tree.  Its filename is
     hugetlb-support-file_region-coalescing-again.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb-support-file_region-coalescing-again.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb-support-file_region-coalescing-again.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb: support file_region coalescing again

An earlier patch in this series disabled file_region coalescing in order
to hang the hugetlb_cgroup uncharge info on the file_region entries.

This patch re-adds support for coalescing of file_region entries.
Essentially, every time we add an entry, we call a recursive function that
tries to coalesce the added region with the regions next to it.  The worst
case call depth for this function is 3: one call to coalesce with the next
region, one to coalesce with the previous region, and one to reach the
base case.

This is an important performance optimization as private mappings add
their entries page by page, and we could incur big performance costs for
large mappings with lots of file_region entries in their resv_map.

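To make the merge rule concrete, here is a small userspace sketch (assumed
semantics, not the kernel implementation): adjacent [from, to) entries are
merged only when they touch and carry the same uncharge info, reduced here
to an integer tag.

#include <stdio.h>

struct region { long from, to; int tag; };

/* Merge r[i] into r[i - 1] when adjacent and equally tagged. */
static int merge_adjacent(struct region *r, int n)
{
	int i, m = 0;

	for (i = 0; i < n; i++) {
		if (m && r[m - 1].to == r[i].from && r[m - 1].tag == r[i].tag)
			r[m - 1].to = r[i].to;	/* coalesce */
		else
			r[m++] = r[i];
	}
	return m;
}

int main(void)
{
	/* Private mappings add entries page by page: [0,1) [1,2) [2,3). */
	struct region r[] = { {0, 1, 7}, {1, 2, 7}, {2, 3, 7}, {5, 6, 9} };
	int i, n = merge_adjacent(r, 4);

	for (i = 0; i < n; i++)	/* prints [0, 3) tag=7 and [5, 6) tag=9 */
		printf("[%ld, %ld) tag=%d\n", r[i].from, r[i].to, r[i].tag);
	return 0;
}
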
Link: http://lkml.kernel.org/r/20200211213128.73302-7-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   85 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

--- a/mm/hugetlb.c~hugetlb-support-file_region-coalescing-again
+++ a/mm/hugetlb.c
@@ -276,6 +276,86 @@ static void record_hugetlb_cgroup_unchar
 #endif
 }
 
+static bool has_same_uncharge_info(struct file_region *rg,
+				   struct file_region *org)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+	return rg && org &&
+	       rg->reservation_counter == org->reservation_counter &&
+	       rg->css == org->css;
+
+#else
+	return true;
+#endif
+}
+
+#ifdef CONFIG_DEBUG_VM
+static void dump_resv_map(struct resv_map *resv)
+{
+	struct list_head *head = &resv->regions;
+	struct file_region *rg = NULL;
+
+	pr_err("--------- start print resv_map ---------\n");
+	list_for_each_entry(rg, head, link) {
+		pr_err("rg->from=%ld, rg->to=%ld, rg->reservation_counter=%px, rg->css=%px\n",
+		       rg->from, rg->to, rg->reservation_counter, rg->css);
+	}
+	pr_err("--------- end print resv_map ---------\n");
+}
+
+/* Debug function to loop over the resv_map and make sure that coalescing is
+ * working.
+ */
+static void check_coalesce_bug(struct resv_map *resv)
+{
+	struct list_head *head = &resv->regions;
+	struct file_region *rg = NULL, *nrg = NULL;
+
+	list_for_each_entry(rg, head, link) {
+		nrg = list_next_entry(rg, link);
+
+		if (&nrg->link == head)
+			break;
+
+		if (nrg->reservation_counter && nrg->from == rg->to &&
+		    nrg->reservation_counter == rg->reservation_counter &&
+		    nrg->css == rg->css) {
+			dump_resv_map(resv);
+			VM_BUG_ON(true);
+		}
+	}
+}
+#endif
+
+static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
+{
+	struct file_region *nrg = NULL, *prg = NULL;
+
+	prg = list_prev_entry(rg, link);
+	if (&prg->link != &resv->regions && prg->to == rg->from &&
+	    has_same_uncharge_info(prg, rg)) {
+		prg->to = rg->to;
+
+		list_del(&rg->link);
+		kfree(rg);
+
+		coalesce_file_region(resv, prg);
+		return;
+	}
+
+	nrg = list_next_entry(rg, link);
+	if (&nrg->link != &resv->regions && nrg->from == rg->to &&
+	    has_same_uncharge_info(nrg, rg)) {
+		nrg->from = rg->from;
+
+		list_del(&rg->link);
+		kfree(rg);
+
+		coalesce_file_region(resv, nrg);
+		return;
+	}
+}
+
 /* Must be called with resv->lock held. Calling this with count_only == true
  * will count the number of pages to be added but will not modify the linked
  * list. If regions_needed != NULL and count_only == true, then regions_needed
@@ -327,6 +407,7 @@ static long add_reservation_in_range(str
 				record_hugetlb_cgroup_uncharge_info(h_cg, h,
 								    resv, nrg);
 				list_add(&nrg->link, rg->link.prev);
+				coalesce_file_region(resv, nrg);
 			} else if (regions_needed)
 				*regions_needed += 1;
 		}
@@ -344,11 +425,15 @@ static long add_reservation_in_range(str
 				resv, last_accounted_offset, t);
 			record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg);
 			list_add(&nrg->link, rg->link.prev);
+			coalesce_file_region(resv, nrg);
 		} else if (regions_needed)
 			*regions_needed += 1;
 	}
 
 	VM_BUG_ON(add < 0);
+#ifdef CONFIG_DEBUG_VM
+	check_coalesce_bug(resv);
+#endif
 	return add;
 }
 
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (114 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb-support-file_region-coalescing-again.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch " Andrew Morton
                   ` (122 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb_cgroup: add hugetlb_cgroup reservation tests
has been added to the -mm tree.  Its filename is
     hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb_cgroup: add hugetlb_cgroup reservation tests

The tests use both shared and private mapped hugetlb memory, and monitor
the hugetlb usage counter as well as the hugetlb reservation counter.
They test different configurations such as hugetlb memory usage via
hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.

Also add a test for hugetlb reservation reparenting, since this is a subtle
issue.

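The shell helpers below poll the cgroup counter files until they reach an
expected value; the same wait loop in C, for illustration only (the mount
point, cgroup name and 2MB hugepage size are assumptions matching the test
setup):

#include <stdio.h>
#include <unistd.h>

/* Read a single long from a cgroup counter file; -1 on error. */
static long read_counter(const char *path)
{
	long val = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	/* Assumed path: the tests mount cgroups at /dev/cgroup/memory. */
	const char *path = "/dev/cgroup/memory/hugetlb_cgroup_test/"
			   "hugetlb.2MB.rsvd.usage_in_bytes";
	long want = 5L * 2 * 1024 * 1024;	/* 5 reserved 2MB pages */

	while (read_counter(path) != want)
		sleep(1);
	printf("reservation usage reached %ld bytes\n", want);
	return 0;
}
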
Link: http://lkml.kernel.org/r/20200211213128.73302-8-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/.gitignore                  |    1 
 tools/testing/selftests/vm/Makefile                    |    1 
 tools/testing/selftests/vm/charge_reserved_hugetlb.sh  |  575 ++++++++++
 tools/testing/selftests/vm/hugetlb_reparenting_test.sh |  244 ++++
 tools/testing/selftests/vm/write_hugetlb_memory.sh     |   23 
 tools/testing/selftests/vm/write_to_hugetlbfs.c        |  242 ++++
 6 files changed, 1086 insertions(+)

--- /dev/null
+++ a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,575 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+if [[ $(id -u) -ne 0 ]]; then
+  echo "This test must be run as root. Skipping..."
+  exit 0
+fi
+
+fault_limit_file=limit_in_bytes
+reservation_limit_file=rsvd.limit_in_bytes
+fault_usage_file=usage_in_bytes
+reservation_usage_file=rsvd.usage_in_bytes
+
+if [[ "$1" == "-cgroup-v2" ]]; then
+  cgroup2=1
+  fault_limit_file=max
+  reservation_limit_file=rsvd.max
+  fault_usage_file=current
+  reservation_usage_file=rsvd.current
+fi
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+  mkdir -p $cgroup_path
+  if [[ $cgroup2 ]]; then
+    mount -t cgroup2 none $cgroup_path
+  else
+    mount -t cgroup memory,hugetlb $cgroup_path
+  fi
+fi
+
+if [[ $cgroup2 ]]; then
+  echo "+hugetlb" >/dev/cgroup/memory/cgroup.subtree_control
+fi
+
+function cleanup() {
+  if [[ $cgroup2 ]]; then
+    echo $$ >$cgroup_path/cgroup.procs
+  else
+    echo $$ >$cgroup_path/tasks
+  fi
+
+  if [[ -e /mnt/huge ]]; then
+    rm -rf /mnt/huge/*
+    umount /mnt/huge || echo error
+    rmdir /mnt/huge
+  fi
+  if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+    rmdir $cgroup_path/hugetlb_cgroup_test
+  fi
+  if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+    rmdir $cgroup_path/hugetlb_cgroup_test1
+  fi
+  if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+    rmdir $cgroup_path/hugetlb_cgroup_test2
+  fi
+  echo 0 >/proc/sys/vm/nr_hugepages
+  echo CLEANUP DONE
+}
+
+function expect_equal() {
+  local expected="$1"
+  local actual="$2"
+  local error="$3"
+
+  if [[ "$expected" != "$actual" ]]; then
+    echo "expected ($expected) != actual ($actual): $error"
+    cleanup
+    exit 1
+  fi
+}
+
+function get_machine_hugepage_size() {
+  hpz=$(grep -i hugepagesize /proc/meminfo)
+  kb=${hpz:14:-3}
+  mb=$(($kb / 1024))
+  echo $mb
+}
+
+MB=$(get_machine_hugepage_size)
+
+function setup_cgroup() {
+  local name="$1"
+  local cgroup_limit="$2"
+  local reservation_limit="$3"
+
+  mkdir $cgroup_path/$name
+
+  echo writing cgroup limit: "$cgroup_limit"
+  echo "$cgroup_limit" >$cgroup_path/$name/hugetlb.${MB}MB.$fault_limit_file
+
+  echo writing reservation limit: "$reservation_limit"
+  echo "$reservation_limit" > \
+    $cgroup_path/$name/hugetlb.${MB}MB.$reservation_limit_file
+
+  if [ -e "$cgroup_path/$name/cpuset.cpus" ]; then
+    echo 0 >$cgroup_path/$name/cpuset.cpus
+  fi
+  if [ -e "$cgroup_path/$name/cpuset.mems" ]; then
+    echo 0 >$cgroup_path/$name/cpuset.mems
+  fi
+}
+
+function wait_for_hugetlb_memory_to_get_depleted() {
+  local cgroup="$1"
+  local path="/dev/cgroup/memory/$cgroup/hugetlb.${MB}MB.$reservation_usage_file"
+  # Wait for hugetlbfs memory to get depleted.
+  while [ $(cat $path) != 0 ]; do
+    echo Waiting for hugetlb memory to get depleted.
+    cat $path
+    sleep 0.5
+  done
+}
+
+function wait_for_hugetlb_memory_to_get_reserved() {
+  local cgroup="$1"
+  local size="$2"
+
+  local path="/dev/cgroup/memory/$cgroup/hugetlb.${MB}MB.$reservation_usage_file"
+  # Wait for hugetlbfs memory to get written.
+  while [ $(cat $path) != $size ]; do
+    echo Waiting for hugetlb memory reservation to reach size $size.
+    cat $path
+    sleep 0.5
+  done
+}
+
+function wait_for_hugetlb_memory_to_get_written() {
+  local cgroup="$1"
+  local size="$2"
+
+  local path="/dev/cgroup/memory/$cgroup/hugetlb.${MB}MB.$fault_usage_file"
+  # Wait for hugetlbfs memory to get written.
+  while [ $(cat $path) != $size ]; do
+    echo Waiting for hugetlb memory to reach size $size.
+    cat $path
+    sleep 0.5
+  done
+}
+
+function write_hugetlbfs_and_get_usage() {
+  local cgroup="$1"
+  local size="$2"
+  local populate="$3"
+  local write="$4"
+  local path="$5"
+  local method="$6"
+  local private="$7"
+  local expect_failure="$8"
+  local reserve="$9"
+
+  # Function return values.
+  reservation_failed=0
+  oom_killed=0
+  hugetlb_difference=0
+  reserved_difference=0
+
+  local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.${MB}MB.$fault_usage_file
+  local reserved_usage=$cgroup_path/$cgroup/hugetlb.${MB}MB.$reservation_usage_file
+
+  local hugetlb_before=$(cat $hugetlb_usage)
+  local reserved_before=$(cat $reserved_usage)
+
+  echo
+  echo Starting:
+  echo hugetlb_usage="$hugetlb_before"
+  echo reserved_usage="$reserved_before"
+  echo expect_failure is "$expect_failure"
+
+  output=$(mktemp)
+  set +e
+  if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] ||
+    [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
+
+    bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+      "$cgroup" "$path" "$method" "$private" "-l" "$reserve" 2>&1 | tee $output &
+
+    local write_result=$?
+    local write_pid=$!
+
+    until grep -q -i "DONE" $output; do
+      echo waiting for DONE signal.
+      if ! ps $write_pid > /dev/null
+      then
+        echo "FAIL: The write died"
+        cleanup
+        exit 1
+      fi
+      sleep 0.5
+    done
+
+    echo ================= write_hugetlb_memory.sh output is:
+    cat $output
+    echo ================= end output.
+
+    if [[ "$populate" == "-o" ]] || [[ "$write" == "-w" ]]; then
+      wait_for_hugetlb_memory_to_get_written "$cgroup" "$size"
+    elif [[ "$reserve" != "-n" ]]; then
+      wait_for_hugetlb_memory_to_get_reserved "$cgroup" "$size"
+    else
+      # This case doesn't produce visible effects, but we still have
+      # to wait for the async process to start and execute...
+      sleep 0.5
+    fi
+
+    echo write_result is $write_result
+  else
+    bash write_hugetlb_memory.sh "$size" "$populate" "$write" \
+      "$cgroup" "$path" "$method" "$private" "$reserve"
+    local write_result=$?
+
+    if [[ "$reserve" != "-n" ]]; then
+      wait_for_hugetlb_memory_to_get_reserved "$cgroup" "$size"
+    fi
+  fi
+  set -e
+
+  if [[ "$write_result" == 1 ]]; then
+    reservation_failed=1
+  fi
+
+  # On linus/master, the above process gets SIGBUS'd on oomkill, with
+  # return code 135. On earlier kernels, it gets actual oomkill, with return
+  # code 137, so just check for both conditions in case we're testing
+  # against an earlier kernel.
+  if [[ "$write_result" == 135 ]] || [[ "$write_result" == 137 ]]; then
+    oom_killed=1
+  fi
+
+  local hugetlb_after=$(cat $hugetlb_usage)
+  local reserved_after=$(cat $reserved_usage)
+
+  echo After write:
+  echo hugetlb_usage="$hugetlb_after"
+  echo reserved_usage="$reserved_after"
+
+  hugetlb_difference=$(($hugetlb_after - $hugetlb_before))
+  reserved_difference=$(($reserved_after - $reserved_before))
+}
+
+function cleanup_hugetlb_memory() {
+  set +e
+  local cgroup="$1"
+  if [[ "$(pgrep -f write_to_hugetlbfs)" != "" ]]; then
+    echo killing write_to_hugetlbfs
+    killall -2 write_to_hugetlbfs
+    wait_for_hugetlb_memory_to_get_depleted $cgroup
+  fi
+  set -e
+
+  if [[ -e /mnt/huge ]]; then
+    rm -rf /mnt/huge/*
+    umount /mnt/huge
+    rmdir /mnt/huge
+  fi
+}
+
+function run_test() {
+  local size=$(($1 * ${MB} * 1024 * 1024))
+  local populate="$2"
+  local write="$3"
+  local cgroup_limit=$(($4 * ${MB} * 1024 * 1024))
+  local reservation_limit=$(($5 * ${MB} * 1024 * 1024))
+  local nr_hugepages="$6"
+  local method="$7"
+  local private="$8"
+  local expect_failure="$9"
+  local reserve="${10}"
+
+  # Function return values.
+  hugetlb_difference=0
+  reserved_difference=0
+  reservation_failed=0
+  oom_killed=0
+
+  echo nr hugepages = "$nr_hugepages"
+  echo "$nr_hugepages" >/proc/sys/vm/nr_hugepages
+
+  setup_cgroup "hugetlb_cgroup_test" "$cgroup_limit" "$reservation_limit"
+
+  mkdir -p /mnt/huge
+  mount -t hugetlbfs -o pagesize=${MB}M,size=256M none /mnt/huge
+
+  write_hugetlbfs_and_get_usage "hugetlb_cgroup_test" "$size" "$populate" \
+    "$write" "/mnt/huge/test" "$method" "$private" "$expect_failure" \
+    "$reserve"
+
+  cleanup_hugetlb_memory "hugetlb_cgroup_test"
+
+  local final_hugetlb=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.${MB}MB.$fault_usage_file)
+  local final_reservation=$(cat $cgroup_path/hugetlb_cgroup_test/hugetlb.${MB}MB.$reservation_usage_file)
+
+  echo $hugetlb_difference
+  echo $reserved_difference
+  expect_equal "0" "$final_hugetlb" "final hugetlb is not zero"
+  expect_equal "0" "$final_reservation" "final reservation is not zero"
+}
+
+function run_multiple_cgroup_test() {
+  local size1="$1"
+  local populate1="$2"
+  local write1="$3"
+  local cgroup_limit1="$4"
+  local reservation_limit1="$5"
+
+  local size2="$6"
+  local populate2="$7"
+  local write2="$8"
+  local cgroup_limit2="$9"
+  local reservation_limit2="${10}"
+
+  local nr_hugepages="${11}"
+  local method="${12}"
+  local private="${13}"
+  local expect_failure="${14}"
+  local reserve="${15}"
+
+  # Function return values.
+  hugetlb_difference1=0
+  reserved_difference1=0
+  reservation_failed1=0
+  oom_killed1=0
+
+  hugetlb_difference2=0
+  reserved_difference2=0
+  reservation_failed2=0
+  oom_killed2=0
+
+  echo nr hugepages = "$nr_hugepages"
+  echo "$nr_hugepages" >/proc/sys/vm/nr_hugepages
+
+  setup_cgroup "hugetlb_cgroup_test1" "$cgroup_limit1" "$reservation_limit1"
+  setup_cgroup "hugetlb_cgroup_test2" "$cgroup_limit2" "$reservation_limit2"
+
+  mkdir -p /mnt/huge
+  mount -t hugetlbfs -o pagesize=${MB}M,size=256M none /mnt/huge
+
+  write_hugetlbfs_and_get_usage "hugetlb_cgroup_test1" "$size1" \
+    "$populate1" "$write1" "/mnt/huge/test1" "$method" "$private" \
+    "$expect_failure" "$reserve"
+
+  hugetlb_difference1=$hugetlb_difference
+  reserved_difference1=$reserved_difference
+  reservation_failed1=$reservation_failed
+  oom_killed1=$oom_killed
+
+  local cgroup1_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.${MB}MB.$fault_usage_file
+  local cgroup1_reservation_usage=$cgroup_path/hugetlb_cgroup_test1/hugetlb.${MB}MB.$reservation_usage_file
+  local cgroup2_hugetlb_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.${MB}MB.$fault_usage_file
+  local cgroup2_reservation_usage=$cgroup_path/hugetlb_cgroup_test2/hugetlb.${MB}MB.$reservation_usage_file
+
+  local usage_before_second_write=$(cat $cgroup1_hugetlb_usage)
+  local reservation_usage_before_second_write=$(cat $cgroup1_reservation_usage)
+
+  write_hugetlbfs_and_get_usage "hugetlb_cgroup_test2" "$size2" \
+    "$populate2" "$write2" "/mnt/huge/test2" "$method" "$private" \
+    "$expect_failure" "$reserve"
+
+  hugetlb_difference2=$hugetlb_difference
+  reserved_difference2=$reserved_difference
+  reservation_failed2=$reservation_failed
+  oom_killed2=$oom_killed
+
+  expect_equal "$usage_before_second_write" \
+    "$(cat $cgroup1_hugetlb_usage)" "Usage changed."
+  expect_equal "$reservation_usage_before_second_write" \
+    "$(cat $cgroup1_reservation_usage)" "Reservation usage changed."
+
+  cleanup_hugetlb_memory
+
+  local final_hugetlb=$(cat $cgroup1_hugetlb_usage)
+  local final_reservation=$(cat $cgroup1_reservation_usage)
+
+  expect_equal "0" "$final_hugetlb" \
+    "hugetlb_cgroup_test1 final hugetlb is not zero"
+  expect_equal "0" "$final_reservation" \
+    "hugetlb_cgroup_test1 final reservation is not zero"
+
+  local final_hugetlb=$(cat $cgroup2_hugetlb_usage)
+  local final_reservation=$(cat $cgroup2_reservation_usage)
+
+  expect_equal "0" "$final_hugetlb" \
+    "hugetlb_cgroup_test2 final hugetlb is not zero"
+  expect_equal "0" "$final_reservation" \
+    "hugetlb_cgroup_test2 final reservation is not zero"
+}
+
+cleanup
+
+for populate in "" "-o"; do
+  for method in 0 1 2; do
+    for private in "" "-r"; do
+      for reserve in "" "-n"; do
+
+        # Skip mmap(MAP_HUGETLB | MAP_SHARED). Doesn't seem to be supported.
+        if [[ "$method" == 1 ]] && [[ "$private" == "" ]]; then
+          continue
+        fi
+
+        # Skip populated shmem tests. Doesn't seem to be supported.
+        if [[ "$method" == 2 ]] && [[ "$populate" == "-o" ]]; then
+          continue
+        fi
+
+        if [[ "$method" == 2 ]] && [[ "$reserve" == "-n" ]]; then
+          continue
+        fi
+
+        cleanup
+        echo
+        echo
+        echo
+        echo Test normal case.
+        echo private=$private, populate=$populate, method=$method, reserve=$reserve
+        run_test 5 "$populate" "" 10 10 10 "$method" "$private" "0" "$reserve"
+
+        echo Memory charged to hugetlb=$hugetlb_difference
+        echo Memory charged to reservation=$reserved_difference
+
+        if [[ "$populate" == "-o" ]]; then
+          expect_equal "$((5 * $MB * 1024 * 1024))" "$hugetlb_difference" \
+            "Reserved memory charged to hugetlb cgroup."
+        else
+          expect_equal "0" "$hugetlb_difference" \
+            "Reserved memory charged to hugetlb cgroup."
+        fi
+
+        if [[ "$reserve" != "-n" ]] || [[ "$populate" == "-o" ]]; then
+          expect_equal "$((5 * $MB * 1024 * 1024))" "$reserved_difference" \
+            "Reserved memory not charged to reservation usage."
+        else
+          expect_equal "0" "$reserved_difference" \
+            "Reserved memory not charged to reservation usage."
+        fi
+
+        echo 'PASS'
+
+        cleanup
+        echo
+        echo
+        echo
+        echo Test normal case with write.
+        echo private=$private, populate=$populate, method=$method, reserve=$reserve
+        run_test 5 "$populate" '-w' 5 5 10 "$method" "$private" "0" "$reserve"
+
+        echo Memory charged to hugetlb=$hugetlb_difference
+        echo Memory charged to reservation=$reserved_difference
+
+        expect_equal "$((5 * $MB * 1024 * 1024))" "$hugetlb_difference" \
+          "Reserved memory charged to hugetlb cgroup."
+
+        expect_equal "$((5 * $MB * 1024 * 1024))" "$reserved_difference" \
+          "Reserved memory not charged to reservation usage."
+
+        echo 'PASS'
+
+        cleanup
+
+        echo
+        echo
+        echo
+        echo Test more than reservation case.
+        echo private=$private, populate=$populate, method=$method, reserve=$reserve
+
+        if [ "$reserve" != "-n" ]; then
+          run_test "5" "$populate" '' "10" "2" "10" "$method" "$private" "1" \
+            "$reserve"
+
+          expect_equal "1" "$reservation_failed" "Reservation succeeded."
+        fi
+
+        echo 'PASS'
+
+        cleanup
+
+        echo
+        echo
+        echo
+        echo Test more than cgroup limit case.
+        echo private=$private, populate=$populate, method=$method, reserve=$reserve
+
+        # Not sure if shm memory can be cleaned up when the process gets sigbus'd.
+        if [[ "$method" != 2 ]]; then
+          run_test 5 "$populate" "-w" 2 10 10 "$method" "$private" "1" "$reserve"
+
+          expect_equal "1" "$oom_killed" "Not oom killed."
+        fi
+        echo 'PASS'
+
+        cleanup
+
+        echo
+        echo
+        echo
+        echo Test normal case, multiple cgroups.
+        echo private=$private, populate=$populate, method=$method, reserve=$reserve
+        run_multiple_cgroup_test "3" "$populate" "" "10" "10" "5" \
+          "$populate" "" "10" "10" "10" \
+          "$method" "$private" "0" "$reserve"
+
+        echo Memory charged to hugetlb1=$hugetlb_difference1
+        echo Memory charged to reservation1=$reserved_difference1
+        echo Memory charged to hugetlb2=$hugetlb_difference2
+        echo Memory charged to reservation2=$reserved_difference2
+
+        if [[ "$reserve" != "-n" ]] || [[ "$populate" == "-o" ]]; then
+          expect_equal "3" "$reserved_difference1" \
+            "Incorrect reservations charged to cgroup 1."
+
+          expect_equal "5" "$reserved_difference2" \
+            "Incorrect reservation charged to cgroup 2."
+
+        else
+          expect_equal "0" "$reserved_difference1" \
+            "Incorrect reservations charged to cgroup 1."
+
+          expect_equal "0" "$reserved_difference2" \
+            "Incorrect reservation charged to cgroup 2."
+        fi
+
+        if [[ "$populate" == "-o" ]]; then
+          expect_equal "3" "$hugetlb_difference1" \
+            "Incorrect hugetlb charged to cgroup 1."
+
+          expect_equal "5" "$hugetlb_difference2" \
+            "Incorrect hugetlb charged to cgroup 2."
+
+        else
+          expect_equal "0" "$hugetlb_difference1" \
+            "Incorrect hugetlb charged to cgroup 1."
+
+          expect_equal "0" "$hugetlb_difference2" \
+            "Incorrect hugetlb charged to cgroup 2."
+        fi
+        echo 'PASS'
+
+        cleanup
+        echo
+        echo
+        echo
+        echo Test normal case with write, multiple cgroups.
+        echo private=$private, populate=$populate, method=$method, reserve=$reserve
+        run_multiple_cgroup_test "3" "$populate" "-w" "10" "10" "5" \
+          "$populate" "-w" "10" "10" "10" \
+          "$method" "$private" "0" "$reserve"
+
+        echo Memory charged to hugetlb1=$hugetlb_difference1
+        echo Memory charged to reservation1=$reserved_difference1
+        echo Memory charged to hugetlb2=$hugetlb_difference2
+        echo Memory charged to reservation2=$reserved_difference2
+
+        expect_equal "3" "$hugetlb_difference1" \
+          "Incorrect hugetlb charged to cgroup 1."
+
+        expect_equal "3" "$reserved_difference1" \
+          "Incorrect reservation charged to cgroup 1."
+
+        expect_equal "5" "$hugetlb_difference2" \
+          "Incorrect hugetlb charged to cgroup 2."
+
+        expect_equal "5" "$reserved_difference2" \
+          "Incorrect reservation charged to cgroup 2."
+        echo 'PASS'
+
+        cleanup
+
+      done # reserve
+    done   # private
+  done     # populate
+done       # method
+
+umount $cgroup_path
+rmdir $cgroup_path
--- a/tools/testing/selftests/vm/.gitignore~hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests
+++ a/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
--- /dev/null
+++ a/tools/testing/selftests/vm/hugetlb_reparenting_test.sh
@@ -0,0 +1,244 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+if [[ $(id -u) -ne 0 ]]; then
+  echo "This test must be run as root. Skipping..."
+  exit 0
+fi
+
+usage_file=usage_in_bytes
+
+if [[ "$1" == "-cgroup-v2" ]]; then
+  cgroup2=1
+  usage_file=current
+fi
+
+CGROUP_ROOT='/dev/cgroup/memory'
+MNT='/mnt/huge/'
+
+if [[ ! -e $CGROUP_ROOT ]]; then
+  mkdir -p $CGROUP_ROOT
+  if [[ $cgroup2 ]]; then
+    mount -t cgroup2 none $CGROUP_ROOT
+    sleep 1
+    echo "+hugetlb +memory" >$CGROUP_ROOT/cgroup.subtree_control
+  else
+    mount -t cgroup memory,hugetlb $CGROUP_ROOT
+  fi
+fi
+
+function get_machine_hugepage_size() {
+  hpz=$(grep -i hugepagesize /proc/meminfo)
+  kb=${hpz:14:-3}
+  mb=$(($kb / 1024))
+  echo $mb
+}
+
+MB=$(get_machine_hugepage_size)
+
+function cleanup() {
+  echo cleanup
+  set +e
+  rm -rf "$MNT"/* 2>/dev/null
+  umount "$MNT" 2>/dev/null
+  rmdir "$MNT" 2>/dev/null
+  rmdir "$CGROUP_ROOT"/a/b 2>/dev/null
+  rmdir "$CGROUP_ROOT"/a 2>/dev/null
+  rmdir "$CGROUP_ROOT"/test1 2>/dev/null
+  echo 0 >/proc/sys/vm/nr_hugepages
+  set -e
+}
+
+function assert_state() {
+  local expected_a="$1"
+  local expected_a_hugetlb="$2"
+  local expected_b=""
+  local expected_b_hugetlb=""
+
+  if [ ! -z ${3:-} ] && [ ! -z ${4:-} ]; then
+    expected_b="$3"
+    expected_b_hugetlb="$4"
+  fi
+  local tolerance=$((5 * 1024 * 1024))
+
+  local actual_a
+  actual_a="$(cat "$CGROUP_ROOT"/a/memory.$usage_file)"
+  if [[ $actual_a -lt $(($expected_a - $tolerance)) ]] ||
+    [[ $actual_a -gt $(($expected_a + $tolerance)) ]]; then
+    echo actual a = $((${actual_a%% *} / 1024 / 1024)) MB
+    echo expected a = $((${expected_a%% *} / 1024 / 1024)) MB
+    echo fail
+
+    cleanup
+    exit 1
+  fi
+
+  local actual_a_hugetlb
+  actual_a_hugetlb="$(cat "$CGROUP_ROOT"/a/hugetlb.${MB}MB.$usage_file)"
+  if [[ $actual_a_hugetlb -lt $(($expected_a_hugetlb - $tolerance)) ]] ||
+    [[ $actual_a_hugetlb -gt $(($expected_a_hugetlb + $tolerance)) ]]; then
+    echo actual a hugetlb = $((${actual_a_hugetlb%% *} / 1024 / 1024)) MB
+    echo expected a hugetlb = $((${expected_a_hugetlb%% *} / 1024 / 1024)) MB
+    echo fail
+
+    cleanup
+    exit 1
+  fi
+
+  if [[ -z "$expected_b" || -z "$expected_b_hugetlb" ]]; then
+    return
+  fi
+
+  local actual_b
+  actual_b="$(cat "$CGROUP_ROOT"/a/b/memory.$usage_file)"
+  if [[ $actual_b -lt $(($expected_b - $tolerance)) ]] ||
+    [[ $actual_b -gt $(($expected_b + $tolerance)) ]]; then
+    echo actual b = $((${actual_b%% *} / 1024 / 1024)) MB
+    echo expected b = $((${expected_b%% *} / 1024 / 1024)) MB
+    echo fail
+
+    cleanup
+    exit 1
+  fi
+
+  local actual_b_hugetlb
+  actual_b_hugetlb="$(cat "$CGROUP_ROOT"/a/b/hugetlb.${MB}MB.$usage_file)"
+  if [[ $actual_b_hugetlb -lt $(($expected_b_hugetlb - $tolerance)) ]] ||
+    [[ $actual_b_hugetlb -gt $(($expected_b_hugetlb + $tolerance)) ]]; then
+    echo actual b hugetlb = $((${actual_b_hugetlb%% *} / 1024 / 1024)) MB
+    echo expected b hugetlb = $((${expected_b_hugetlb%% *} / 1024 / 1024)) MB
+    echo fail
+
+    cleanup
+    exit 1
+  fi
+}
+
+function setup() {
+  echo 100 >/proc/sys/vm/nr_hugepages
+  mkdir "$CGROUP_ROOT"/a
+  sleep 1
+  if [[ $cgroup2 ]]; then
+    echo "+hugetlb +memory" >$CGROUP_ROOT/a/cgroup.subtree_control
+  else
+    echo 0 >$CGROUP_ROOT/a/cpuset.mems
+    echo 0 >$CGROUP_ROOT/a/cpuset.cpus
+  fi
+
+  mkdir "$CGROUP_ROOT"/a/b
+
+  if [[ ! $cgroup2 ]]; then
+    echo 0 >$CGROUP_ROOT/a/b/cpuset.mems
+    echo 0 >$CGROUP_ROOT/a/b/cpuset.cpus
+  fi
+
+  mkdir -p "$MNT"
+  mount -t hugetlbfs none "$MNT"
+}
+
+write_hugetlbfs() {
+  local cgroup="$1"
+  local path="$2"
+  local size="$3"
+
+  if [[ $cgroup2 ]]; then
+    echo $$ >$CGROUP_ROOT/$cgroup/cgroup.procs
+  else
+    echo 0 >$CGROUP_ROOT/$cgroup/cpuset.mems
+    echo 0 >$CGROUP_ROOT/$cgroup/cpuset.cpus
+    echo $$ >"$CGROUP_ROOT/$cgroup/tasks"
+  fi
+  ./write_to_hugetlbfs -p "$path" -s "$size" -m 0 -o
+  if [[ $cgroup2 ]]; then
+    echo $$ >$CGROUP_ROOT/cgroup.procs
+  else
+    echo $$ >"$CGROUP_ROOT/tasks"
+  fi
+  echo
+}
+
+set -e
+
+size=$((${MB} * 1024 * 1024 * 25)) # 25 hugepages, e.g. 50MB with 2MB pages.
+
+cleanup
+
+echo
+echo
+echo Test charge, rmdir, uncharge
+setup
+echo mkdir
+mkdir $CGROUP_ROOT/test1
+
+echo write
+write_hugetlbfs test1 "$MNT"/test $size
+
+echo rmdir
+rmdir $CGROUP_ROOT/test1
+mkdir $CGROUP_ROOT/test1
+
+echo uncharge
+rm -rf /mnt/huge/*
+
+cleanup
+
+echo done
+echo
+echo
+if [[ ! $cgroup2 ]]; then
+  echo "Test parent and child hugetlb usage"
+  setup
+
+  echo write
+  write_hugetlbfs a "$MNT"/test $size
+
+  echo Assert memory charged correctly for parent use.
+  assert_state 0 $size 0 0
+
+  write_hugetlbfs a/b "$MNT"/test2 $size
+
+  echo Assert memory charged correctly for child use.
+  assert_state 0 $(($size * 2)) 0 $size
+
+  rmdir "$CGROUP_ROOT"/a/b
+  sleep 5
+  echo Assert memory reparent correctly.
+  assert_state 0 $(($size * 2))
+
+  rm -rf "$MNT"/*
+  umount "$MNT"
+  echo Assert memory uncharged correctly.
+  assert_state 0 0
+
+  cleanup
+fi
+
+echo
+echo
+echo "Test child only hugetlb usage"
+echo setup
+setup
+
+echo write
+write_hugetlbfs a/b "$MNT"/test2 $size
+
+echo Assert memory charged correctly for child only use.
+assert_state 0 $(($size)) 0 $size
+
+rmdir "$CGROUP_ROOT"/a/b
+echo Assert memory reparent correctly.
+assert_state 0 $size
+
+rm -rf "$MNT"/*
+umount "$MNT"
+echo Assert memory uncharged correctly.
+assert_state 0 0
+
+cleanup
+
+echo ALL PASS
+
+umount $CGROUP_ROOT
+rm -rf $CGROUP_ROOT
--- a/tools/testing/selftests/vm/Makefile~hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests
+++ a/tools/testing/selftests/vm/Makefile
@@ -22,6 +22,7 @@ TEST_GEN_FILES += userfaultfd
 ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64))
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs
 endif
 
 TEST_PROGS := run_vmtests
--- /dev/null
+++ a/tools/testing/selftests/vm/write_hugetlb_memory.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+size=$1
+populate=$2
+write=$3
+cgroup=$4
+path=$5
+method=$6
+private=$7
+want_sleep=$8
+reserve=$9
+
+echo "Putting task in cgroup '$cgroup'"
+echo $$ > /dev/cgroup/memory/"$cgroup"/cgroup.procs
+
+echo "Method is $method"
+
+set +e
+./write_to_hugetlbfs -p "$path" -s "$size" "$write" "$populate" -m "$method" \
+      "$private" "$want_sleep" "$reserve"
--- /dev/null
+++ a/tools/testing/selftests/vm/write_to_hugetlbfs.c
@@ -0,0 +1,242 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This program reserves and uses hugetlb memory, supporting a bunch of
+ * scenarios needed by the charge_reserved_hugetlb.sh test.
+ */
+
+#include <err.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/shm.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+
+/* Global definitions. */
+enum method {
+	HUGETLBFS,
+	MMAP_MAP_HUGETLB,
+	SHM,
+	MAX_METHOD
+};
+
+
+/* Global variables. */
+static const char *self;
+static char *shmaddr;
+static int shmid;
+
+/*
+ * Show usage and exit.
+ */
+static void exit_usage(void)
+{
+	printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
+	       "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB) | 2=shm>] [-l] [-r] "
+	       "[-o] [-w] [-n]\n",
+	       self);
+	exit(EXIT_FAILURE);
+}
+
+void sig_handler(int signo)
+{
+	printf("Received %d.\n", signo);
+	if (signo == SIGINT) {
+		printf("Deleting the memory\n");
+		if (shmdt((const void *)shmaddr) != 0) {
+			perror("Detach failure");
+			shmctl(shmid, IPC_RMID, NULL);
+			exit(4);
+		}
+
+		shmctl(shmid, IPC_RMID, NULL);
+		printf("Done deleting the memory\n");
+	}
+	exit(2);
+}
+
+int main(int argc, char **argv)
+{
+	int fd = 0;
+	int key = 0;
+	int *ptr = NULL;
+	int c = 0;
+	int size = 0;
+	char path[256] = "";
+	enum method method = MAX_METHOD;
+	int want_sleep = 0, private = 0;
+	int populate = 0;
+	int write = 0;
+	int reserve = 1;
+
+	unsigned long i;
+
+	if (signal(SIGINT, sig_handler) == SIG_ERR)
+		err(1, "\ncan't catch SIGINT\n");
+
+	/* Parse command-line arguments. */
+	setvbuf(stdout, NULL, _IONBF, 0);
+	self = argv[0];
+
+	while ((c = getopt(argc, argv, "s:p:m:owlrn")) != -1) {
+		switch (c) {
+		case 's':
+			size = atoi(optarg);
+			break;
+		case 'p':
+			strncpy(path, optarg, sizeof(path));
+			break;
+		case 'm':
+			if (atoi(optarg) >= MAX_METHOD) {
+				errno = EINVAL;
+				perror("Invalid -m.");
+				exit_usage();
+			}
+			method = atoi(optarg);
+			break;
+		case 'o':
+			populate = 1;
+			break;
+		case 'w':
+			write = 1;
+			break;
+		case 'l':
+			want_sleep = 1;
+			break;
+		case 'r':
+			private = 1;
+			break;
+		case 'n':
+			reserve = 0;
+			break;
+		default:
+			errno = EINVAL;
+			perror("Invalid arg");
+			exit_usage();
+		}
+	}
+
+	if (strncmp(path, "", sizeof(path)) != 0) {
+		printf("Writing to this path: %s\n", path);
+	} else {
+		errno = EINVAL;
+		perror("path not found");
+		exit_usage();
+	}
+
+	if (size != 0) {
+		printf("Writing this size: %d\n", size);
+	} else {
+		errno = EINVAL;
+		perror("size not found");
+		exit_usage();
+	}
+
+	if (!populate)
+		printf("Not populating.\n");
+	else
+		printf("Populating.\n");
+
+	if (!write)
+		printf("Not writing to memory.\n");
+
+	if (method == MAX_METHOD) {
+		errno = EINVAL;
+		perror("-m Invalid");
+		exit_usage();
+	} else
+		printf("Using method=%d\n", method);
+
+	if (!private)
+		printf("Shared mapping.\n");
+	else
+		printf("Private mapping.\n");
+
+	if (!reserve)
+		printf("NO_RESERVE mapping.\n");
+	else
+		printf("RESERVE mapping.\n");
+
+	switch (method) {
+	case HUGETLBFS:
+		printf("Allocating using HUGETLBFS.\n");
+		fd = open(path, O_CREAT | O_RDWR, 0777);
+		if (fd == -1)
+			err(1, "Failed to open file.");
+
+		ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			   (private ? MAP_PRIVATE : MAP_SHARED) |
+				   (populate ? MAP_POPULATE : 0) |
+				   (reserve ? 0 : MAP_NORESERVE),
+			   fd, 0);
+
+		if (ptr == MAP_FAILED) {
+			close(fd);
+			err(1, "Error mapping the file");
+		}
+		break;
+	case MMAP_MAP_HUGETLB:
+		printf("Allocating using MAP_HUGETLB.\n");
+		ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			   (private ? (MAP_PRIVATE | MAP_ANONYMOUS) :
+				      MAP_SHARED) |
+				   MAP_HUGETLB | (populate ? MAP_POPULATE : 0) |
+				   (reserve ? 0 : MAP_NORESERVE),
+			   -1, 0);
+
+		if (ptr == MAP_FAILED)
+			err(1, "mmap");
+
+		printf("Returned address is %p\n", ptr);
+		break;
+	case SHM:
+		printf("Allocating using SHM.\n");
+		shmid = shmget(key, size,
+			       SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
+		if (shmid < 0) {
+			shmid = shmget(++key, size,
+				       SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
+			if (shmid < 0)
+				err(1, "shmget");
+		}
+		printf("shmid: 0x%x, shmget key:%d\n", shmid, key);
+
+		ptr = shmat(shmid, NULL, 0);
+		if (ptr == (int *)-1) {
+			perror("Shared memory attach failure");
+			shmctl(shmid, IPC_RMID, NULL);
+			exit(2);
+		}
+		printf("shmaddr: %p\n", ptr);
+
+		break;
+	default:
+		errno = EINVAL;
+		err(1, "Invalid method.");
+	}
+
+	if (write) {
+		printf("Writing to memory.\n");
+		memset(ptr, 1, size);
+	}
+
+	if (want_sleep) {
+		/* Signal to caller that we're done. */
+		printf("DONE\n");
+
+		/* Hold memory until external kill signal is delivered. */
+		while (1)
+			sleep(100);
+	}
+
+	if (method == HUGETLBFS)
+		close(fd);
+
+	return 0;
+}
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (115 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch " Andrew Morton
@ 2020-02-11 23:19 ` Andrew Morton
  2020-02-11 23:21 ` + lib-bch-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
                   ` (121 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:19 UTC (permalink / raw)
  To: almasrymina, gthelen, mike.kravetz, mm-commits, rientjes,
	sandipan, shakeelb, shuah


The patch titled
     Subject: hugetlb_cgroup: add hugetlb_cgroup reservation docs
has been added to the -mm tree.  Its filename is
     hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb_cgroup: add hugetlb_cgroup reservation docs

Add docs for how to use hugetlb_cgroup reservations, and their behavior.

Link: http://lkml.kernel.org/r/20200211213128.73302-9-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/cgroup-v1/hugetlb.rst |  103 ++++++++++++--
 1 file changed, 92 insertions(+), 11 deletions(-)

--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst~hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs
+++ a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -2,13 +2,6 @@
 HugeTLB Controller
 ==================
 
-The HugeTLB controller allows to limit the HugeTLB usage per control group and
-enforces the controller limit during page fault. Since HugeTLB doesn't
-support page reclaim, enforcing the limit at page fault time implies that,
-the application will get SIGBUS signal if it tries to access HugeTLB pages
-beyond its limit. This requires the application to know beforehand how much
-HugeTLB pages it would require for its use.
-
 HugeTLB controller can be created by first mounting the cgroup filesystem.
 
 # mount -t cgroup -o hugetlb none /sys/fs/cgroup
@@ -28,10 +21,14 @@ process (bash) into it.
 
 Brief summary of control files::
 
- hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
- hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb  usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
- hugetlb.<hugepagesize>.failcnt		   # show the number of allocation failure due to HugeTLB limit
+ hugetlb.<hugepagesize>.rsvd.limit_in_bytes            # set/show limit of "hugepagesize" hugetlb reservations
+ hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes        # show max "hugepagesize" hugetlb reservations and no-reserve faults
+ hugetlb.<hugepagesize>.rsvd.usage_in_bytes            # show current reservations and no-reserve faults for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.rsvd.failcnt                   # show the number of allocation failure due to HugeTLB reservation limit
+ hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb faults
+ hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
 
 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
@@ -40,11 +37,95 @@ files include::
   hugetlb.1GB.max_usage_in_bytes
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
+  hugetlb.1GB.rsvd.limit_in_bytes
+  hugetlb.1GB.rsvd.max_usage_in_bytes
+  hugetlb.1GB.rsvd.usage_in_bytes
+  hugetlb.1GB.rsvd.failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
+  hugetlb.64KB.rsvd.limit_in_bytes
+  hugetlb.64KB.rsvd.max_usage_in_bytes
+  hugetlb.64KB.rsvd.usage_in_bytes
+  hugetlb.64KB.rsvd.failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
+  hugetlb.32MB.rsvd.limit_in_bytes
+  hugetlb.32MB.rsvd.max_usage_in_bytes
+  hugetlb.32MB.rsvd.usage_in_bytes
+  hugetlb.32MB.rsvd.failcnt
+
+
+1. Page fault accounting
+
+hugetlb.<hugepagesize>.limit_in_bytes
+hugetlb.<hugepagesize>.max_usage_in_bytes
+hugetlb.<hugepagesize>.usage_in_bytes
+hugetlb.<hugepagesize>.failcnt
+
+The HugeTLB controller allows users to limit HugeTLB usage (page faults) per
+control group and enforces the limit at page fault time. Since HugeTLB
+doesn't support page reclaim, enforcing the limit at page fault time implies
+that the application will get a SIGBUS signal if it tries to fault in HugeTLB
+pages beyond its limit. Therefore the application needs to know beforehand
+exactly how many HugeTLB pages it uses, and the sysadmin needs to make sure
+that enough are available on the machine for all users, to avoid processes
+getting SIGBUS.
+
+
+2. Reservation accounting
+
+hugetlb.<hugepagesize>.rsvd.limit_in_bytes
+hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
+hugetlb.<hugepagesize>.rsvd.usage_in_bytes
+hugetlb.<hugepagesize>.rsvd.failcnt
+
+The HugeTLB controller allows users to limit HugeTLB reservations per control
+group and enforces the limit at reservation time, as well as at fault time
+for HugeTLB memory for which no reservation exists. Since reservation limits
+are enforced at reservation time (on mmap or shmget), reservation limits
+never cause the application to get a SIGBUS signal if the memory was reserved
+beforehand. For MAP_NORESERVE allocations, the reservation limit behaves the
+same as the fault limit, enforcing memory usage at fault time and causing the
+application to receive a SIGBUS if it crosses its limit.
+
+Reservation limits are superior to the page fault limits described above,
+since reservation limits are enforced at reservation time (on mmap or
+shmget) and never cause the application to get a SIGBUS signal if the memory
+was reserved beforehand. This allows for easier fallback to alternatives
+such as non-HugeTLB memory. In the case of page fault accounting, it is very
+hard to avoid processes getting SIGBUS, since the sysadmin would need to know
+precisely the HugeTLB usage of all the tasks in the system and make sure
+there are enough pages to satisfy all requests. Avoiding tasks getting SIGBUS
+on overcommitted systems is practically impossible with page fault accounting.
+
+
+3. Caveats with shared memory
+
+For shared HugeTLB memory, both HugeTLB reservations and page faults are
+charged to the first task that causes the memory to be reserved or faulted,
+and all subsequent uses of this reserved or faulted memory are not charged.
+
+Shared HugeTLB memory is only uncharged when it is unreserved or deallocated.
+This is usually when the HugeTLB file is deleted, and not when the task that
+caused the reservation or fault has exited.
+
+
+4. Caveats with HugeTLB cgroup offline
+
+When a HugeTLB cgroup goes offline with some reservations or faults still
+charged to it, the behavior is as follows:
+
+- The fault charges are charged to the parent HugeTLB cgroup (reparented),
+- the reservation charges remain on the offline HugeTLB cgroup.
+
+This means that if a HugeTLB cgroup gets offlined while there are still
+HugeTLB reservations charged to it, that cgroup persists as a zombie until
+all HugeTLB reservations are uncharged. HugeTLB reservations behave in this
+manner to match the memory controller, whose cgroups also persist as zombies
+until all charged memory is uncharged. Also, the tracking of HugeTLB
+reservations is a bit more complex than the tracking of HugeTLB faults, so it
+is significantly harder to reparent reservations at offline time.
_
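
As a quick illustration of the reservation interface documented above, here
is a minimal userspace sketch. The cgroup path, group name, 2MB huge page
size and limit value are assumptions for this example, not part of the
patch, and the calling task is assumed to already be a member of the cgroup:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical cgroup; allow two 2MB huge pages to be reserved. */
	const char *limit =
		"/sys/fs/cgroup/hugetlb/app/hugetlb.2MB.rsvd.limit_in_bytes";
	int fd = open(limit, O_WRONLY);

	if (fd < 0 || write(fd, "4194304", 7) < 0) {
		perror("set rsvd limit");
		return 1;
	}
	close(fd);

	/*
	 * Without MAP_NORESERVE the reservation is charged at mmap()
	 * time, so exceeding the limit shows up here as ENOMEM instead
	 * of as a SIGBUS at fault time later.
	 */
	void *p = mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)
		perror("mmap");

	return 0;
}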

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-bch-replace-zero-length-array-with-flexible-array-member.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (116 preceding siblings ...)
  2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch " Andrew Morton
@ 2020-02-11 23:21 ` Andrew Morton
  2020-02-11 23:21 ` + lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch " Andrew Morton
                   ` (120 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:21 UTC (permalink / raw)
  To: gustavo, mm-commits


The patch titled
     Subject: lib/bch.c: replace zero-length array with flexible-array member
has been added to the -mm tree.  Its filename is
     lib-bch-replace-zero-length-array-with-flexible-array-member.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-bch-replace-zero-length-array-with-flexible-array-member.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-bch-replace-zero-length-array-with-flexible-array-member.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Subject: lib/bch.c: replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kinds of undefined behavior bugs from being
inadvertently introduced[3] into the codebase from now on.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
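
For context, the size of a flexible array member is chosen at allocation
time. A minimal userspace sketch of the allocation pattern follows; the
struct and helper names are illustrative, not from this patch, and in the
kernel the size computation can be done with struct_size() instead:

#include <stdlib.h>

struct poly {
	unsigned int deg;	/* polynomial degree */
	unsigned int c[];	/* flexible array member, must be last */
};

static struct poly *poly_alloc(unsigned int nterms)
{
	/* sizeof(*p) covers only 'deg'; the terms are added explicitly. */
	struct poly *p = malloc(sizeof(*p) + nterms * sizeof(p->c[0]));

	if (p)
		p->deg = nterms ? nterms - 1 : 0;
	return p;
}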

Link: http://lkml.kernel.org/r/20200211205119.GA21234@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/bch.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/bch.c~lib-bch-replace-zero-length-array-with-flexible-array-member
+++ a/lib/bch.c
@@ -102,7 +102,7 @@
  */
 struct gf_poly {
 	unsigned int deg;    /* polynomial degree */
-	unsigned int c[0];   /* polynomial terms */
+	unsigned int c[];   /* polynomial terms */
 };
 
 /* given its degree, compute a polynomial size in bytes */
_

Patches currently in -mm which might be from gustavo@embeddedor.com are

lib-bch-replace-zero-length-array-with-flexible-array-member.patch
lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch
lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (117 preceding siblings ...)
  2020-02-11 23:21 ` + lib-bch-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
@ 2020-02-11 23:21 ` Andrew Morton
  2020-02-11 23:21 ` + lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
                   ` (119 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:21 UTC (permalink / raw)
  To: gustavo, mm-commits


The patch titled
     Subject: lib/objagg.c: replace zero-length arrays with flexible-array member
has been added to the -mm tree.  Its filename is
     lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Subject: lib/objagg.c: replace zero-length arrays with flexible-array member

The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kinds of undefined behavior bugs from being
inadvertently introduced[3] into the codebase from now on.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Link: http://lkml.kernel.org/r/20200211205356.GA23101@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/objagg.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/lib/objagg.c~lib-objagg-replace-zero-length-arrays-with-flexible-array-member
+++ a/lib/objagg.c
@@ -28,7 +28,7 @@ struct objagg_hints_node {
 	struct objagg_hints_node *parent;
 	unsigned int root_id;
 	struct objagg_obj_stats_info stats_info;
-	unsigned long obj[0];
+	unsigned long obj[];
 };
 
 static struct objagg_hints_node *
@@ -66,7 +66,7 @@ struct objagg_obj {
 				* including nested objects
 				*/
 	struct objagg_obj_stats stats;
-	unsigned long obj[0];
+	unsigned long obj[];
 };
 
 static unsigned int objagg_obj_ref_inc(struct objagg_obj *objagg_obj)
_

Patches currently in -mm which might be from gustavo@embeddedor.com are

lib-bch-replace-zero-length-array-with-flexible-array-member.patch
lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch
lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (118 preceding siblings ...)
  2020-02-11 23:21 ` + lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch " Andrew Morton
@ 2020-02-11 23:21 ` Andrew Morton
  2020-02-11 23:21 ` + lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
                   ` (118 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:21 UTC (permalink / raw)
  To: gustavo, mm-commits


The patch titled
     Subject: lib/ts_bm.c: replace zero-length array with flexible-array member
has been added to the -mm tree.  Its filename is
     lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Subject: lib/ts_bm.c: replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kinds of undefined behavior bugs from being
inadvertently introduced[3] into the codebase from now on.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Link: http://lkml.kernel.org/r/20200211205620.GA24694@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/ts_bm.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/ts_bm.c~lib-ts_bm-replace-zero-length-array-with-flexible-array-member
+++ a/lib/ts_bm.c
@@ -52,7 +52,7 @@ struct ts_bm
 	u8 *		pattern;
 	unsigned int	patlen;
 	unsigned int 	bad_shift[ASIZE];
-	unsigned int	good_shift[0];
+	unsigned int	good_shift[];
 };
 
 static unsigned int bm_find(struct ts_config *conf, struct ts_state *state)
_

Patches currently in -mm which might be from gustavo@embeddedor.com are

lib-bch-replace-zero-length-array-with-flexible-array-member.patch
lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch
lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (119 preceding siblings ...)
  2020-02-11 23:21 ` + lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
@ 2020-02-11 23:21 ` Andrew Morton
  2020-02-11 23:21 ` + lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
                   ` (117 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:21 UTC (permalink / raw)
  To: gustavo, mm-commits


The patch titled
     Subject: lib/ts_fsm.c: replace zero-length array with flexible-array member
has been added to the -mm tree.  Its filename is
     lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Subject: lib/ts_fsm.c: replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kinds of undefined behavior bugs from being
inadvertently introduced[3] into the codebase from now on.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Link: http://lkml.kernel.org/r/20200211205813.GA25602@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/ts_fsm.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/ts_fsm.c~lib-ts_fsm-replace-zero-length-array-with-flexible-array-member
+++ a/lib/ts_fsm.c
@@ -32,7 +32,7 @@
 struct ts_fsm
 {
 	unsigned int		ntokens;
-	struct ts_fsm_token	tokens[0];
+	struct ts_fsm_token	tokens[];
 };
 
 /* other values derived from ctype.h */
_

Patches currently in -mm which might be from gustavo@embeddedor.com are

lib-bch-replace-zero-length-array-with-flexible-array-member.patch
lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch
lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (120 preceding siblings ...)
  2020-02-11 23:21 ` + lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
@ 2020-02-11 23:21 ` Andrew Morton
  2020-02-11 23:23 ` + mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch " Andrew Morton
                   ` (116 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:21 UTC (permalink / raw)
  To: gustavo, mm-commits


The patch titled
     Subject: lib/ts_kmp.c: replace zero-length array with flexible-array member
has been added to the -mm tree.  Its filename is
     lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Subject: lib/ts_kmp.c: replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kinds of undefined behavior bugs from being
inadvertently introduced[3] into the codebase from now on.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Link: http://lkml.kernel.org/r/20200211205948.GA26459@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/ts_kmp.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/ts_kmp.c~lib-ts_kmp-replace-zero-length-array-with-flexible-array-member
+++ a/lib/ts_kmp.c
@@ -36,7 +36,7 @@ struct ts_kmp
 {
 	u8 *		pattern;
 	unsigned int	pattern_len;
-	unsigned int 	prefix_tbl[0];
+	unsigned int	prefix_tbl[];
 };
 
 static unsigned int kmp_find(struct ts_config *conf, struct ts_state *state)
_

Patches currently in -mm which might be from gustavo@embeddedor.com are

lib-bch-replace-zero-length-array-with-flexible-array-member.patch
lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch
lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (121 preceding siblings ...)
  2020-02-11 23:21 ` + lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
@ 2020-02-11 23:23 ` Andrew Morton
  2020-02-11 23:26 ` + mm-mempool-fix-a-data-race-in-mempool_free.patch " Andrew Morton
                   ` (115 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:23 UTC (permalink / raw)
  To: cai, elver, mm-commits


The patch titled
     Subject: mm/rmap: annotate a data race at tlb_flush_batched
has been added to the -mm tree.  Its filename is
     mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/rmap: annotate a data race at tlb_flush_batched

mm->tlb_flush_batched could be accessed concurrently as noticed by
KCSAN,

 BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one

 write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
  try_to_unmap_one+0x59a/0x1ab0
  set_tlb_ubc_flush_pending at mm/rmap.c:635
  (inlined by) try_to_unmap_one at mm/rmap.c:1538
  rmap_walk_anon+0x296/0x650
  rmap_walk+0xdf/0x100
  try_to_unmap+0x18a/0x2f0
  shrink_page_list+0xef6/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
  flush_tlb_batched_pending+0x29/0x90
  flush_tlb_batched_pending at mm/rmap.c:682
  change_p4d_range+0x5dd/0x1030
  change_pte_range at mm/mprotect.c:44
  (inlined by) change_pmd_range at mm/mprotect.c:212
  (inlined by) change_pud_range at mm/mprotect.c:240
  (inlined by) change_p4d_range at mm/mprotect.c:260
  change_protection+0x222/0x310
  change_prot_numa+0x3e/0x60
  task_numa_work+0x219/0x350
  task_work_run+0xed/0x140
  prepare_exit_to_usermode+0x2cc/0x2e0
  ret_from_intr+0x32/0x42

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 4 PID: 6364 Comm: mtest01 Tainted: G        W    L 5.5.0-next-20200210+ #5
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

The read in flush_tlb_batched_pending() is done under the PTL, but the
write is not.  However, mm->tlb_flush_batched is only a bool, so the value
is unlikely to be torn.  Thus, mark it as an intentional data race by
using the data_race() macro.
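
As an aside for readers unfamiliar with the annotation: data_race() marks
an access that is known to race but whose effects are tolerated, so KCSAN
stops reporting it.  A minimal sketch of the pattern follows; it is
illustrative rather than the exact kernel code, and the helper names are
made up:

/* One path sets the flag without any lock shared with the reader. */
static void mark_pending(struct mm_struct *mm)
{
	mm->tlb_flush_batched = true;		/* racy by design */
}

/* Another path tests it; a single bool cannot be read torn. */
static void maybe_flush(struct mm_struct *mm)
{
	if (data_race(mm->tlb_flush_batched))	/* silence KCSAN */
		flush_tlb_mm(mm);
}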

Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/rmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/rmap.c~mm-rmap-annotate-a-data-race-at-tlb_flush_batched
+++ a/mm/rmap.c
@@ -666,7 +666,7 @@ static bool should_defer_flush(struct mm
  */
 void flush_tlb_batched_pending(struct mm_struct *mm)
 {
-	if (mm->tlb_flush_batched) {
+	if (data_race(mm->tlb_flush_batched)) {
 		flush_tlb_mm(mm);
 
 		/*
_

Patches currently in -mm which might be from cai@lca.pw are

mm-swapfile-fix-and-annotate-various-data-races.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-mempool-fix-a-data-race-in-mempool_free.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (122 preceding siblings ...)
  2020-02-11 23:23 ` + mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch " Andrew Morton
@ 2020-02-11 23:26 ` Andrew Morton
  2020-02-11 23:53 ` + mm-kmemleak-annotate-a-data-race-in-checksum.patch " Andrew Morton
                   ` (114 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:26 UTC (permalink / raw)
  To: cai, elver, mm-commits, oleg, tj


The patch titled
     Subject: mm/mempool: fix a data race in mempool_free()
has been added to the -mm tree.  Its filename is
     mm-mempool-fix-a-data-race-in-mempool_free.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-mempool-fix-a-data-race-in-mempool_free.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-mempool-fix-a-data-race-in-mempool_free.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/mempool: fix a data race in mempool_free()

mempool_t pool.curr_nr could be accessed concurrently as noticed by
KCSAN,

 BUG: KCSAN: data-race in mempool_free / remove_element

 write to 0xffffffffa937638c of 4 bytes by task 6359 on cpu 113:
  remove_element+0x4a/0x1c0
  remove_element at mm/mempool.c:132
  mempool_alloc+0x102/0x210
  (inlined by) mempool_alloc at mm/mempool.c:399
  bio_alloc_bioset+0x106/0x2c0
  get_swap_bio+0x49/0x230
  __swap_writepage+0x680/0xc30
  swap_writepage+0x9c/0xf0
  pageout+0x33e/0xae0
  shrink_page_list+0x1f57/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  do_try_to_free_pages+0x1f7/0xa10
  try_to_free_pages+0x26c/0x5e0
  __alloc_pages_slowpath+0x458/0x1290
  <snip>

 read to 0xffffffffa937638c of 4 bytes by interrupt on cpu 64:
  mempool_free+0x3e/0x150
  mempool_free at mm/mempool.c:492
  bio_free+0x192/0x280
  bio_put+0x91/0xd0
  end_swap_bio_write+0x1d8/0x280
  bio_endio+0x2c2/0x5b0
  dec_pending+0x22b/0x440 [dm_mod]
  clone_endio+0xe4/0x2c0 [dm_mod]
  bio_endio+0x2c2/0x5b0
  blk_update_request+0x217/0x940
  scsi_end_request+0x6b/0x4d0
  scsi_io_completion+0xb7/0x7e0
  scsi_finish_command+0x223/0x310
  scsi_softirq_done+0x1d5/0x210
  blk_mq_complete_request+0x224/0x250
  scsi_mq_done+0xc2/0x250
  pqi_raid_io_complete+0x5a/0x70 [smartpqi]
  pqi_irq_handler+0x150/0x1410 [smartpqi]
  __handle_irq_event_percpu+0x90/0x540
  handle_irq_event_percpu+0x49/0xd0
  handle_irq_event+0x85/0xca
  handle_edge_irq+0x13f/0x3e0
  do_IRQ+0x86/0x190
  <snip>

The write is under pool->lock, but the read is done locklessly.  Even
though commit 5b990546e334 ("mempool: fix and document synchronization
and memory barrier usage") introduced an smp_wmb() and smp_rmb() pair to
improve the situation, the barriers alone do not protect the lockless
read from data races which could lead to a logic bug, so fix it by adding
READ_ONCE() for the read.
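
The resulting code follows the common lockless-check-then-lock shape; a
rough sketch of that pattern as it applies here (abridged from the function
touched below, so treat it as illustrative rather than complete):

/*
 * READ_ONCE() forces a single, non-torn load for the lockless test
 * and documents that the race is intentional; the condition is then
 * re-checked under the lock before being acted upon.
 */
if (unlikely(READ_ONCE(pool->curr_nr) < pool->min_nr)) {
	spin_lock_irqsave(&pool->lock, flags);
	if (likely(pool->curr_nr < pool->min_nr)) {
		add_element(pool, element);
		spin_unlock_irqrestore(&pool->lock, flags);
		wake_up(&pool->wait);
		return;
	}
	spin_unlock_irqrestore(&pool->lock, flags);
}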

Link: http://lkml.kernel.org/r/1581446384-2131-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempool.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/mempool.c~mm-mempool-fix-a-data-race-in-mempool_free
+++ a/mm/mempool.c
@@ -489,7 +489,7 @@ void mempool_free(void *element, mempool
 	 * ensures that there will be frees which return elements to the
 	 * pool waking up the waiters.
 	 */
-	if (unlikely(pool->curr_nr < pool->min_nr)) {
+	if (unlikely(READ_ONCE(pool->curr_nr) < pool->min_nr)) {
 		spin_lock_irqsave(&pool->lock, flags);
 		if (likely(pool->curr_nr < pool->min_nr)) {
 			add_element(pool, element);
_

Patches currently in -mm which might be from cai@lca.pw are

mm-swapfile-fix-and-annotate-various-data-races.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-kmemleak-annotate-a-data-race-in-checksum.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (123 preceding siblings ...)
  2020-02-11 23:26 ` + mm-mempool-fix-a-data-race-in-mempool_free.patch " Andrew Morton
@ 2020-02-11 23:53 ` Andrew Morton
  2020-02-12  0:19 ` + mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch " Andrew Morton
                   ` (113 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-11 23:53 UTC (permalink / raw)
  To: cai, catalin.marinas, elver, mm-commits


The patch titled
     Subject: mm/kmemleak.c: annotate a data race in checksum
has been added to the -mm tree.  Its filename is
     mm-kmemleak-annotate-a-data-race-in-checksum.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-kmemleak-annotate-a-data-race-in-checksum.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-kmemleak-annotate-a-data-race-in-checksum.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/kmemleak.c: annotate a data race in checksum

The value of object->pointer could be accessed concurrently as noticed by
KCSAN,

 BUG: KCSAN: data-race in crc32_le_base / do_raw_spin_lock

 write to 0xffffb0ea683a7d50 of 4 bytes by task 23575 on cpu 12:
  do_raw_spin_lock+0x114/0x200
  debug_spin_lock_after at kernel/locking/spinlock_debug.c:91
  (inlined by) do_raw_spin_lock at kernel/locking/spinlock_debug.c:115
  _raw_spin_lock+0x40/0x50
  __handle_mm_fault+0xa9e/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffffb0ea683a7d50 of 4 bytes by task 839 on cpu 60:
  crc32_le_base+0x67/0x350
  crc32_le_base+0x67/0x350:
  crc32_body at lib/crc32.c:106
  (inlined by) crc32_le_generic at lib/crc32.c:179
  (inlined by) crc32_le at lib/crc32.c:197
  kmemleak_scan+0x528/0xd90
  update_checksum at mm/kmemleak.c:1172
  (inlined by) kmemleak_scan at mm/kmemleak.c:1497
  kmemleak_scan_thread+0xcc/0xfa
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 60 PID: 839 Comm: kmemleak Tainted: G        W    L 5.5.0-next-20200210+ #3
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

crc32() will dereference object->pointer.  If an unstable value is read
due to a data race, the checksum will simply be corrected in the next
scan.  Thus, annotate it as an intentional data race using the
data_race() macro.

Link: http://lkml.kernel.org/r/1581438245-24391-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kmemleak.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/mm/kmemleak.c~mm-kmemleak-annotate-a-data-race-in-checksum
+++ a/mm/kmemleak.c
@@ -1169,7 +1169,12 @@ static bool update_checksum(struct kmeml
 	u32 old_csum = object->checksum;
 
 	kasan_disable_current();
-	object->checksum = crc32(0, (void *)object->pointer, object->size);
+	/*
+	 * crc32() will dereference object->pointer. If an unstable value was
+	 * returned due to a data race, it will be corrected in the next scan.
+	 */
+	object->checksum = data_race(crc32(0, (void *)object->pointer,
+					   object->size));
 	kasan_enable_current();
 
 	return object->checksum != old_csum;
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-a-data-race-in-checksum.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (124 preceding siblings ...)
  2020-02-11 23:53 ` + mm-kmemleak-annotate-a-data-race-in-checksum.patch " Andrew Morton
@ 2020-02-12  0:19 ` Andrew Morton
  2020-02-12  0:19 ` + mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch " Andrew Morton
                   ` (112 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:19 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: mm: adjust shuffle code to allow for future coalescing
has been added to the -mm tree.  Its filename is
     mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: mm: adjust shuffle code to allow for future coalescing

Patch series "mm / virtio: Provide support for free page reporting", v17.

This series provides an asynchronous means of reporting free guest pages
to a hypervisor so that the memory associated with those pages can be
dropped and reused by other processes and/or guests on the host.  Using
this it is possible to avoid unnecessary I/O to disk and greatly improve
performance in the case of memory overcommit on the host.

When enabled we will be performing a scan of free memory every 2 seconds
while pages of sufficiently high order are being freed.  In each pass at
least one sixteenth of each free list will be reported.  By doing this we
avoid racing against other threads that may be causing a high amount of
memory churn.

The lowest page order currently scanned when reporting pages is
pageblock_order so that this feature will not interfere with the use of
Transparent Huge Pages in the case of virtualization.

Currently this is only in use by virtio-balloon; however, there is hope
that at some point in the future other hypervisors might be able to make
use of it.  In the virtio-balloon/QEMU implementation the hypervisor is
currently using MADV_DONTNEED to indicate to the host kernel that the page
is currently free.  It will be zeroed and faulted back into the guest the
next time the page is accessed.

To track whether a page has been reported, the Uptodate flag was
repurposed and used as a Reported flag for Buddy pages.  We walk through
the free list isolating pages and adding them to the scatterlist until we
either encounter the end of the list or have processed at least one
sixteenth of the pages that were listed in nr_free prior to us starting.
If we fill the scatterlist before we reach the end of the list, we rotate
the list so that the first unreported page we encounter is moved to the
head of the list, as that is where we will resume after we have freed the
reported pages back into the tail of the list.

Below are the results from various benchmarks.  I primarily focused on two
tests.  The first is the will-it-scale/page_fault2 test, and the other is
a modified version of will-it-scale/page_fault1 that was enabled to use
THP.  I did this as it allows for better visibility into different parts
of the memory subsystem.  The guest is running with 32G for RAM on one
node of a E5-2630 v3.  The host has had some features such as CPU turbo
disabled in the BIOS.

Test                   page_fault1 (THP)    page_fault2
Name            tasks  Process Iter  STDEV  Process Iter  STDEV
Baseline            1    1012402.50  0.14%     361855.25  0.81%
                   16    8827457.25  0.09%    3282347.00  0.34%

Patches Applied     1    1007897.00  0.23%     361887.00  0.26%
                   16    8784741.75  0.39%    3240669.25  0.48%

Patches Enabled     1    1010227.50  0.39%     359749.25  0.56%
                   16    8756219.00  0.24%    3226608.75  0.97%

Patches Enabled     1    1050982.00  4.26%     357966.25  0.14%
 page shuffle      16    8672601.25  0.49%    3223177.75  0.40%

Patches enabled     1    1003238.00  0.22%     360211.00  0.22%
 shuffle w/ RFC    16    8767010.50  0.32%    3199874.00  0.71%

The results above are for a baseline with a linux-next-20191219 kernel,
that kernel with this patch set applied but page reporting disabled in
virtio-balloon, the patches applied and page reporting fully enabled, the
patches enabled with page shuffling enabled, and the patches applied with
page shuffling enabled and an RFC patch that makes used of MADV_FREE in
QEMU.  These results include the deviation seen between the average value
reported here versus the high and/or low value.  I observed that during
the test memory usage for the first three tests never dropped whereas with
the patches fully enabled the VM would drop to using only a few GB of the
host's memory when switching from memhog to page fault tests.

The overhead visible with this patch set enabled seems to be due to page
faults caused by accessing the reported pages and the host zeroing the
page before giving it back to the guest.  This overhead is much more
visible when using THP than with standard 4K pages.  In addition, page
shuffling seemed to increase the amount of faults generated due to an
increase in memory churn.  The overhead is reduced when using MADV_FREE,
as we can avoid the extra zeroing of the pages when they are reintroduced
to the host, as can be seen when the RFC is applied with shuffling
enabled.

The overall guest size is kept fairly small, at only a few GB, while the
test is running.  If host memory were oversubscribed, this patch set
should result in a performance improvement, as swapping of memory in the
host can be avoided.

A brief history on the background of free page reporting can be found at:
https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/


This patch (of 9):

Move the head/tail adding logic out of the shuffle code and into the
__free_one_page function since ultimately that is where it is really
needed anyway.  By doing this we should be able to reduce the overhead and
can consolidate all of the list addition bits in one spot.

Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   12 ------
 mm/page_alloc.c        |   71 +++++++++++++++++++++++----------------
 mm/shuffle.c           |   12 +++---
 mm/shuffle.h           |    6 +++
 4 files changed, 54 insertions(+), 47 deletions(-)

--- a/include/linux/mmzone.h~mm-adjust-shuffle-code-to-allow-for-future-coalescing
+++ a/include/linux/mmzone.h
@@ -116,18 +116,6 @@ static inline void add_to_free_area_tail
 	area->nr_free++;
 }
 
-#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
-/* Used to preserve page allocation order entropy */
-void add_to_free_area_random(struct page *page, struct free_area *area,
-		int migratetype);
-#else
-static inline void add_to_free_area_random(struct page *page,
-		struct free_area *area, int migratetype)
-{
-	add_to_free_area(page, area, migratetype);
-}
-#endif
-
 /* Used for pages which are on another list */
 static inline void move_to_free_area(struct page *page, struct free_area *area,
 			     int migratetype)
--- a/mm/page_alloc.c~mm-adjust-shuffle-code-to-allow-for-future-coalescing
+++ a/mm/page_alloc.c
@@ -873,6 +873,36 @@ compaction_capture(struct capture_contro
 #endif /* CONFIG_COMPACTION */
 
 /*
+ * If this is not the largest possible page, check if the buddy
+ * of the next-highest order is free. If it is, it's possible
+ * that pages are being freed that will coalesce soon. In case
+ * that is happening, add the free page to the tail of the list
+ * so it's less likely to be used soon and more likely to be merged
+ * as a higher order page
+ */
+static inline bool
+buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
+		   struct page *page, unsigned int order)
+{
+	struct page *higher_page, *higher_buddy;
+	unsigned long combined_pfn;
+
+	if (order >= MAX_ORDER - 2)
+		return false;
+
+	if (!pfn_valid_within(buddy_pfn))
+		return false;
+
+	combined_pfn = buddy_pfn & pfn;
+	higher_page = page + (combined_pfn - pfn);
+	buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
+	higher_buddy = higher_page + (buddy_pfn - combined_pfn);
+
+	return pfn_valid_within(buddy_pfn) &&
+	       page_is_buddy(higher_page, higher_buddy, order + 1);
+}
+
+/*
  * Freeing function for a buddy system allocator.
  *
  * The concept of a buddy system is to maintain direct-mapped table
@@ -901,11 +931,13 @@ static inline void __free_one_page(struc
 		struct zone *zone, unsigned int order,
 		int migratetype)
 {
-	unsigned long combined_pfn;
+	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
-	struct page *buddy;
+	unsigned long combined_pfn;
+	struct free_area *area;
 	unsigned int max_order;
-	struct capture_control *capc = task_capc(zone);
+	struct page *buddy;
+	bool to_tail;
 
 	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
 
@@ -974,35 +1006,16 @@ continue_merging:
 done_merging:
 	set_page_order(page, order);
 
-	/*
-	 * If this is not the largest possible page, check if the buddy
-	 * of the next-highest order is free. If it is, it's possible
-	 * that pages are being freed that will coalesce soon. In case,
-	 * that is happening, add the free page to the tail of the list
-	 * so it's less likely to be used soon and more likely to be merged
-	 * as a higher order page
-	 */
-	if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
-			&& !is_shuffle_order(order)) {
-		struct page *higher_page, *higher_buddy;
-		combined_pfn = buddy_pfn & pfn;
-		higher_page = page + (combined_pfn - pfn);
-		buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
-		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
-		if (pfn_valid_within(buddy_pfn) &&
-		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
-			add_to_free_area_tail(page, &zone->free_area[order],
-					      migratetype);
-			return;
-		}
-	}
-
+	area = &zone->free_area[order];
 	if (is_shuffle_order(order))
-		add_to_free_area_random(page, &zone->free_area[order],
-				migratetype);
+		to_tail = shuffle_pick_tail();
 	else
-		add_to_free_area(page, &zone->free_area[order], migratetype);
+		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
+	if (to_tail)
+		add_to_free_area_tail(page, area, migratetype);
+	else
+		add_to_free_area(page, area, migratetype);
 }
 
 /*
--- a/mm/shuffle.c~mm-adjust-shuffle-code-to-allow-for-future-coalescing
+++ a/mm/shuffle.c
@@ -183,11 +183,11 @@ void __meminit __shuffle_free_memory(pg_
 		shuffle_zone(z);
 }
 
-void add_to_free_area_random(struct page *page, struct free_area *area,
-		int migratetype)
+bool shuffle_pick_tail(void)
 {
 	static u64 rand;
 	static u8 rand_bits;
+	bool ret;
 
 	/*
 	 * The lack of locking is deliberate. If 2 threads race to
@@ -198,10 +198,10 @@ void add_to_free_area_random(struct page
 		rand = get_random_u64();
 	}
 
-	if (rand & 1)
-		add_to_free_area(page, area, migratetype);
-	else
-		add_to_free_area_tail(page, area, migratetype);
+	ret = rand & 1;
+
 	rand_bits--;
 	rand >>= 1;
+
+	return ret;
 }
--- a/mm/shuffle.h~mm-adjust-shuffle-code-to-allow-for-future-coalescing
+++ a/mm/shuffle.h
@@ -22,6 +22,7 @@ enum mm_shuffle_ctl {
 DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
 extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
 extern void __shuffle_free_memory(pg_data_t *pgdat);
+extern bool shuffle_pick_tail(void);
 static inline void shuffle_free_memory(pg_data_t *pgdat)
 {
 	if (!static_branch_unlikely(&page_alloc_shuffle_key))
@@ -44,6 +45,11 @@ static inline bool is_shuffle_order(int
 	return order >= SHUFFLE_ORDER;
 }
 #else
+static inline bool shuffle_pick_tail(void)
+{
+	return false;
+}
+
 static inline void shuffle_free_memory(pg_data_t *pgdat)
 {
 }
_

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (125 preceding siblings ...)
  2020-02-12  0:19 ` + mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch " Andrew Morton
@ 2020-02-12  0:19 ` Andrew Morton
  2020-02-12  0:19 ` + mm-add-function-__putback_isolated_page.patch " Andrew Morton
                   ` (111 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:19 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: mm: use zone and order instead of free area in free_list manipulators
has been added to the -mm tree.  Its filename is
     mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: mm: use zone and order instead of free area in free_list manipulators

In order to enable the use of the zone from the list manipulator functions,
I will need access to the zone pointer.  As it turns out, most of the
accessors were always just being passed &zone->free_area[order] directly
anyway, so it makes sense to fold that into the functions themselves and
pass the zone and order as arguments instead of the free area.

In order to be able to reference the zone we need to move the declaration
of the functions down so that we have the zone defined before we define
the list manipulation functions.  Since the functions are only used in the
file mm/page_alloc.c we can just move them there to reduce noise in the
header.

Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   32 ------------------
 mm/page_alloc.c        |   67 ++++++++++++++++++++++++++++-----------
 2 files changed, 49 insertions(+), 50 deletions(-)

--- a/include/linux/mmzone.h~mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators
+++ a/include/linux/mmzone.h
@@ -100,29 +100,6 @@ struct free_area {
 	unsigned long		nr_free;
 };
 
-/* Used for pages not on another list */
-static inline void add_to_free_area(struct page *page, struct free_area *area,
-			     int migratetype)
-{
-	list_add(&page->lru, &area->free_list[migratetype]);
-	area->nr_free++;
-}
-
-/* Used for pages not on another list */
-static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
-				  int migratetype)
-{
-	list_add_tail(&page->lru, &area->free_list[migratetype]);
-	area->nr_free++;
-}
-
-/* Used for pages which are on another list */
-static inline void move_to_free_area(struct page *page, struct free_area *area,
-			     int migratetype)
-{
-	list_move(&page->lru, &area->free_list[migratetype]);
-}
-
 static inline struct page *get_page_from_free_area(struct free_area *area,
 					    int migratetype)
 {
@@ -130,15 +107,6 @@ static inline struct page *get_page_from
 					struct page, lru);
 }
 
-static inline void del_page_from_free_area(struct page *page,
-		struct free_area *area)
-{
-	list_del(&page->lru);
-	__ClearPageBuddy(page);
-	set_page_private(page, 0);
-	area->nr_free--;
-}
-
 static inline bool free_area_empty(struct free_area *area, int migratetype)
 {
 	return list_empty(&area->free_list[migratetype]);
--- a/mm/page_alloc.c~mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators
+++ a/mm/page_alloc.c
@@ -872,6 +872,44 @@ compaction_capture(struct capture_contro
 }
 #endif /* CONFIG_COMPACTION */
 
+/* Used for pages not on another list */
+static inline void add_to_free_list(struct page *page, struct zone *zone,
+				    unsigned int order, int migratetype)
+{
+	struct free_area *area = &zone->free_area[order];
+
+	list_add(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages not on another list */
+static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
+					 unsigned int order, int migratetype)
+{
+	struct free_area *area = &zone->free_area[order];
+
+	list_add_tail(&page->lru, &area->free_list[migratetype]);
+	area->nr_free++;
+}
+
+/* Used for pages which are on another list */
+static inline void move_to_free_list(struct page *page, struct zone *zone,
+				     unsigned int order, int migratetype)
+{
+	struct free_area *area = &zone->free_area[order];
+
+	list_move(&page->lru, &area->free_list[migratetype]);
+}
+
+static inline void del_page_from_free_list(struct page *page, struct zone *zone,
+					   unsigned int order)
+{
+	list_del(&page->lru);
+	__ClearPageBuddy(page);
+	set_page_private(page, 0);
+	zone->free_area[order].nr_free--;
+}
+
 /*
  * If this is not the largest possible page, check if the buddy
  * of the next-highest order is free. If it is, it's possible
@@ -934,7 +972,6 @@ static inline void __free_one_page(struc
 	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
 	unsigned long combined_pfn;
-	struct free_area *area;
 	unsigned int max_order;
 	struct page *buddy;
 	bool to_tail;
@@ -972,7 +1009,7 @@ continue_merging:
 		if (page_is_guard(buddy))
 			clear_page_guard(zone, buddy, order, migratetype);
 		else
-			del_page_from_free_area(buddy, &zone->free_area[order]);
+			del_page_from_free_list(buddy, zone, order);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
@@ -1006,16 +1043,15 @@ continue_merging:
 done_merging:
 	set_page_order(page, order);
 
-	area = &zone->free_area[order];
 	if (is_shuffle_order(order))
 		to_tail = shuffle_pick_tail();
 	else
 		to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
 	if (to_tail)
-		add_to_free_area_tail(page, area, migratetype);
+		add_to_free_list_tail(page, zone, order, migratetype);
 	else
-		add_to_free_area(page, area, migratetype);
+		add_to_free_list(page, zone, order, migratetype);
 }
 
 /*
@@ -2029,13 +2065,11 @@ void __init init_cma_reserved_pageblock(
  * -- nyc
  */
 static inline void expand(struct zone *zone, struct page *page,
-	int low, int high, struct free_area *area,
-	int migratetype)
+	int low, int high, int migratetype)
 {
 	unsigned long size = 1 << high;
 
 	while (high > low) {
-		area--;
 		high--;
 		size >>= 1;
 		VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
@@ -2049,7 +2083,7 @@ static inline void expand(struct zone *z
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
 
-		add_to_free_area(&page[size], area, migratetype);
+		add_to_free_list(&page[size], zone, high, migratetype);
 		set_page_order(&page[size], high);
 	}
 }
@@ -2207,8 +2241,8 @@ struct page *__rmqueue_smallest(struct z
 		page = get_page_from_free_area(area, migratetype);
 		if (!page)
 			continue;
-		del_page_from_free_area(page, area);
-		expand(zone, page, order, current_order, area, migratetype);
+		del_page_from_free_list(page, zone, current_order);
+		expand(zone, page, order, current_order, migratetype);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
 	}
@@ -2282,7 +2316,7 @@ static int move_freepages(struct zone *z
 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		order = page_order(page);
-		move_to_free_area(page, &zone->free_area[order], migratetype);
+		move_to_free_list(page, zone, order, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2398,7 +2432,6 @@ static void steal_suitable_fallback(stru
 		unsigned int alloc_flags, int start_type, bool whole_block)
 {
 	unsigned int current_order = page_order(page);
-	struct free_area *area;
 	int free_pages, movable_pages, alike_pages;
 	int old_block_type;
 
@@ -2469,8 +2502,7 @@ static void steal_suitable_fallback(stru
 	return;
 
 single_page:
-	area = &zone->free_area[current_order];
-	move_to_free_area(page, area, start_type);
+	move_to_free_list(page, zone, current_order, start_type);
 }
 
 /*
@@ -3141,7 +3173,6 @@ EXPORT_SYMBOL_GPL(split_page);
 
 int __isolate_free_page(struct page *page, unsigned int order)
 {
-	struct free_area *area = &page_zone(page)->free_area[order];
 	unsigned long watermark;
 	struct zone *zone;
 	int mt;
@@ -3167,7 +3198,7 @@ int __isolate_free_page(struct page *pag
 
 	/* Remove page from free list */
 
-	del_page_from_free_area(page, area);
+	del_page_from_free_list(page, zone, order);
 
 	/*
 	 * Set the pageblock if the isolated page is at least half of a
@@ -8724,7 +8755,7 @@ __offline_isolated_pages(unsigned long s
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
 		offlined_pages += 1 << order;
-		del_page_from_free_area(page, &zone->free_area[order]);
+		del_page_from_free_list(page, zone, order);
 		pfn += (1 << order);
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
_

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-add-function-__putback_isolated_page.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (126 preceding siblings ...)
  2020-02-12  0:19 ` + mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch " Andrew Morton
@ 2020-02-12  0:19 ` Andrew Morton
  2020-02-12  0:19 ` + mm-introduce-reported-pages.patch " Andrew Morton
                   ` (110 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:19 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: mm: add function __putback_isolated_page
has been added to the -mm tree.  Its filename is
     mm-add-function-__putback_isolated_page.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-add-function-__putback_isolated_page.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-add-function-__putback_isolated_page.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: mm: add function __putback_isolated_page

There are cases where we would benefit from avoiding the full allocation
and free cycle just to return an isolated page to the free list.

One example is page poisoning, in which we isolate a page and then put it
back on the free list without ever having actually allocated it.

This will also let the upcoming free page reporting work avoid its
notifiers when returning pages that have already been reported, so that
reporting is not retriggered for them.
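
A minimal usage sketch, condensed from the page_isolation.c hunk below;
note that the zone lock must be held, as the lockdep assertion enforces:

	spin_lock_irqsave(&zone->lock, flags);
	/* ... migratetype bookkeeping elided ... */
	if (isolated_page)
		__putback_isolated_page(page, order, migratetype);
	spin_unlock_irqrestore(&zone->lock, flags);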

Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/internal.h       |    2 ++
 mm/page_alloc.c     |   19 +++++++++++++++++++
 mm/page_isolation.c |    6 ++----
 3 files changed, 23 insertions(+), 4 deletions(-)

--- a/mm/internal.h~mm-add-function-__putback_isolated_page
+++ a/mm/internal.h
@@ -157,6 +157,8 @@ static inline struct page *pageblock_pfn
 }
 
 extern int __isolate_free_page(struct page *page, unsigned int order);
+extern void __putback_isolated_page(struct page *page, unsigned int order,
+				    int mt);
 extern void memblock_free_pages(struct page *page, unsigned long pfn,
 					unsigned int order);
 extern void __free_pages_core(struct page *page, unsigned int order);
--- a/mm/page_alloc.c~mm-add-function-__putback_isolated_page
+++ a/mm/page_alloc.c
@@ -3219,6 +3219,25 @@ int __isolate_free_page(struct page *pag
 	return 1UL << order;
 }
 
+/**
+ * __putback_isolated_page - Return a now-isolated page back where we got it
+ * @page: Page that was isolated
+ * @order: Order of the isolated page
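+ * @mt: The page's pageblock's migratetype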
+ *
+ * This function is meant to return a page pulled from the free lists via
+ * __isolate_free_page back to the free list it was pulled from.
+ */
+void __putback_isolated_page(struct page *page, unsigned int order, int mt)
+{
+	struct zone *zone = page_zone(page);
+
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
+
+	/* Return isolated page to tail of freelist. */
+	__free_one_page(page, page_to_pfn(page), zone, order, mt);
+}
+
 /*
  * Update NUMA hit/miss statistics
  *
--- a/mm/page_isolation.c~mm-add-function-__putback_isolated_page
+++ a/mm/page_isolation.c
@@ -117,13 +117,11 @@ static void unset_migratetype_isolate(st
 		__mod_zone_freepage_state(zone, nr_pages, migratetype);
 	}
 	set_pageblock_migratetype(page, migratetype);
+	if (isolated_page)
+		__putback_isolated_page(page, order, migratetype);
 	zone->nr_isolate_pageblock--;
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
-	if (isolated_page) {
-		post_alloc_hook(page, order, __GFP_MOVABLE);
-		__free_pages(page, order);
-	}
 }
 
 static inline struct page *
_

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-introduce-reported-pages.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (127 preceding siblings ...)
  2020-02-12  0:19 ` + mm-add-function-__putback_isolated_page.patch " Andrew Morton
@ 2020-02-12  0:19 ` Andrew Morton
  2020-02-12  0:19 ` + virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch " Andrew Morton
                   ` (109 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:19 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: mm: introduce Reported pages
has been added to the -mm tree.  Its filename is
     mm-introduce-reported-pages.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-introduce-reported-pages.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-introduce-reported-pages.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: mm: introduce Reported pages

In order to pave the way for free page reporting in virtualized
environments, we will need a way to get pages out of the free lists and
identify those pages after they have been returned.  To accomplish this,
this patch adds the concept of a Reported Buddy, which is essentially just
the Uptodate flag used in conjunction with the Buddy page type.

To prevent the reported pages from leaking outside of the buddy lists I
added a check to clear the PageReported bit in the del_page_from_free_list
function.  As a result any reported page that is split, merged, or
allocated will have the flag cleared prior to the PageBuddy value being
cleared.

The process for reporting pages is fairly simple.  Once we free a page
that meets the minimum order for page reporting, we will schedule a worker
thread to start 2s or more in the future.  That worker thread will work
from the lowest supported page reporting order up to MAX_ORDER - 1,
pulling unreported pages from the free list and storing them in the
scatterlist.

When processing each individual free list it is necessary for the worker
thread to release the zone lock when it needs to stop and report the full
scatterlist of pages.  To reduce the work of the next iteration the worker
thread will rotate the free list so that the first unreported page in the
free list becomes the first entry in the list.

It will then call a reporting function, providing information on how many
entries are in the scatterlist.  Once the function completes, the pages
are returned to the free area from which they were allocated.  The worker
then starts over, pulling more pages from the free areas, until there are
no longer enough pages to keep it busy or it has processed as many pages
as were contained in the free area when it started processing the list.

The worker thread will work in a round-robin fashion, making its way
through each zone requesting reporting, and through each reportable free
list within that zone.  Once all free areas within the zone have been
processed it will check to see if there have been any requests for
reporting while it was processing.  If so, it will reschedule itself to
start up again in roughly 2s and exit.
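
As an illustration, the idle/reschedule decision at the end of a pass
boils down to a single compare-and-exchange (mirroring the tail of
page_reporting_process() in the diff below):

	/* if anyone requested reporting while we ran, go again in ~2s */
	state = atomic_cmpxchg(&prdev->state, PAGE_REPORTING_ACTIVE,
			       PAGE_REPORTING_IDLE);
	if (state == PAGE_REPORTING_REQUESTED)
		schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);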

Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-flags.h     |   11 +
 include/linux/page_reporting.h |   25 ++
 mm/Kconfig                     |   11 +
 mm/Makefile                    |    1 
 mm/page_alloc.c                |   17 +
 mm/page_reporting.c            |  319 +++++++++++++++++++++++++++++++
 mm/page_reporting.h            |   54 +++++
 7 files changed, 434 insertions(+), 4 deletions(-)

--- a/include/linux/page-flags.h~mm-introduce-reported-pages
+++ a/include/linux/page-flags.h
@@ -163,6 +163,9 @@ enum pageflags {
 
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
+
+	/* Only valid for buddy pages. Used to track pages that are reported */
+	PG_reported = PG_uptodate,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -432,6 +435,14 @@ PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
 /*
+ * PageReported() is used to track reported free pages within the Buddy
+ * allocator. We can use the non-atomic version of the test and set
+ * operations as both should be shielded with the zone lock to prevent
+ * any possible races on the setting or clearing of the bit.
+ */
+__PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
+
+/*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
  * with the PAGE_MAPPING_ANON bit set to distinguish it.  See rmap.h.
--- /dev/null
+++ a/include/linux/page_reporting.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PAGE_REPORTING_H
+#define _LINUX_PAGE_REPORTING_H
+
+#include <linux/mmzone.h>
+#include <linux/scatterlist.h>
+
+#define PAGE_REPORTING_CAPACITY		32
+
+struct page_reporting_dev_info {
+	/* function that alters pages to make them "reported" */
+	int (*report)(struct page_reporting_dev_info *prdev,
+		      struct scatterlist *sg, unsigned int nents);
+
+	/* work struct for processing reports */
+	struct delayed_work work;
+
+	/* Current state of page reporting */
+	atomic_t state;
+};
+
+/* Tear-down and bring-up for page reporting devices */
+void page_reporting_unregister(struct page_reporting_dev_info *prdev);
+int page_reporting_register(struct page_reporting_dev_info *prdev);
+#endif /*_LINUX_PAGE_REPORTING_H */
--- a/mm/Kconfig~mm-introduce-reported-pages
+++ a/mm/Kconfig
@@ -237,6 +237,17 @@ config COMPACTION
 	  linux-mm@kvack.org.
 
 #
+# support for free page reporting
+config PAGE_REPORTING
+	bool "Free page reporting"
+	def_bool n
+	help
+	  Free page reporting allows for the incremental acquisition of
+	  free pages from the buddy allocator for the purpose of reporting
+	  those pages to another entity, such as a hypervisor, so that the
+	  memory can be freed within the host for other uses.
+
+#
 # support for page migration
 #
 config MIGRATION
--- a/mm/Makefile~mm-introduce-reported-pages
+++ a/mm/Makefile
@@ -110,3 +110,4 @@ obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
+obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
--- a/mm/page_alloc.c~mm-introduce-reported-pages
+++ a/mm/page_alloc.c
@@ -74,6 +74,7 @@
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
+#include "page_reporting.h"
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
@@ -904,6 +905,10 @@ static inline void move_to_free_list(str
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 					   unsigned int order)
 {
+	/* clear reported state and update reported page count */
+	if (page_reported(page))
+		__ClearPageReported(page);
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
@@ -967,7 +972,7 @@ buddy_merge_likely(unsigned long pfn, un
 static inline void __free_one_page(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
-		int migratetype)
+		int migratetype, bool report)
 {
 	struct capture_control *capc = task_capc(zone);
 	unsigned long uninitialized_var(buddy_pfn);
@@ -1052,6 +1057,10 @@ done_merging:
 		add_to_free_list_tail(page, zone, order, migratetype);
 	else
 		add_to_free_list(page, zone, order, migratetype);
+
+	/* Notify page reporting subsystem of freed page */
+	if (report)
+		page_reporting_notify_free(order);
 }
 
 /*
@@ -1368,7 +1377,7 @@ static void free_pcppages_bulk(struct zo
 		if (unlikely(isolated_pageblocks))
 			mt = get_pageblock_migratetype(page);
 
-		__free_one_page(page, page_to_pfn(page), zone, 0, mt);
+		__free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
 		trace_mm_page_pcpu_drain(page, 0, mt);
 	}
 	spin_unlock(&zone->lock);
@@ -1384,7 +1393,7 @@ static void free_one_page(struct zone *z
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
 	}
-	__free_one_page(page, pfn, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype, true);
 	spin_unlock(&zone->lock);
 }
 
@@ -3235,7 +3244,7 @@ void __putback_isolated_page(struct page
 	lockdep_assert_held(&zone->lock);
 
 	/* Return isolated page to tail of freelist. */
-	__free_one_page(page, page_to_pfn(page), zone, order, mt);
+	__free_one_page(page, page_to_pfn(page), zone, order, mt, false);
 }
 
 /*
--- /dev/null
+++ a/mm/page_reporting.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/page_reporting.h>
+#include <linux/gfp.h>
+#include <linux/export.h>
+#include <linux/delay.h>
+#include <linux/scatterlist.h>
+
+#include "page_reporting.h"
+#include "internal.h"
+
+#define PAGE_REPORTING_DELAY	(2 * HZ)
+static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
+
+enum {
+	PAGE_REPORTING_IDLE = 0,
+	PAGE_REPORTING_REQUESTED,
+	PAGE_REPORTING_ACTIVE
+};
+
+/* request page reporting */
+static void
+__page_reporting_request(struct page_reporting_dev_info *prdev)
+{
+	unsigned int state;
+
+	/* Check to see if we are in desired state */
+	state = atomic_read(&prdev->state);
+	if (state == PAGE_REPORTING_REQUESTED)
+		return;
+
+	/*
+	 *  If reporting is already active there is nothing we need to do.
+	 *  Test against 0 as that represents PAGE_REPORTING_IDLE.
+	 */
+	state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED);
+	if (state != PAGE_REPORTING_IDLE)
+		return;
+
+	/*
+	 * Delay the start of work to allow a sizable queue to build. For
+	 * now we are limiting this to running no more than once every
+	 * couple of seconds.
+	 */
+	schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+}
+
+/* notify prdev of free page reporting request */
+void __page_reporting_notify(void)
+{
+	struct page_reporting_dev_info *prdev;
+
+	/*
+	 * We use RCU to protect the pr_dev_info pointer. In almost all
+	 * cases this should be present, however in the unlikely case of
+	 * a shutdown this will be NULL and we should exit.
+	 */
+	rcu_read_lock();
+	prdev = rcu_dereference(pr_dev_info);
+	if (likely(prdev))
+		__page_reporting_request(prdev);
+
+	rcu_read_unlock();
+}
+
+static void
+page_reporting_drain(struct page_reporting_dev_info *prdev,
+		     struct scatterlist *sgl, unsigned int nents, bool reported)
+{
+	struct scatterlist *sg = sgl;
+
+	/*
+	 * Drain the now reported pages back into their respective
+	 * free lists/areas. We assume at least one page is populated.
+	 */
+	do {
+		struct page *page = sg_page(sg);
+		int mt = get_pageblock_migratetype(page);
+		unsigned int order = get_order(sg->length);
+
+		__putback_isolated_page(page, order, mt);
+
+		/* If the pages were not reported due to error skip flagging */
+		if (!reported)
+			continue;
+
+		/*
+		 * If page was not comingled with another page we can
+		 * consider the result to be "reported" since the page
+		 * hasn't been modified, otherwise we will need to
+		 * report on the new larger page when we make our way
+		 * up to that higher order.
+		 */
+		if (PageBuddy(page) && page_order(page) == order)
+			__SetPageReported(page);
+	} while ((sg = sg_next(sg)));
+
+	/* reinitialize scatterlist now that it is empty */
+	sg_init_table(sgl, nents);
+}
+
+/*
+ * The page reporting cycle consists of 4 stages, fill, report, drain, and
+ * idle. We will cycle through the first 3 stages until we cannot obtain a
+ * full scatterlist of pages, in that case we will switch to idle.
+ */
+static int
+page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
+		     unsigned int order, unsigned int mt,
+		     struct scatterlist *sgl, unsigned int *offset)
+{
+	struct free_area *area = &zone->free_area[order];
+	struct list_head *list = &area->free_list[mt];
+	unsigned int page_len = PAGE_SIZE << order;
+	struct page *page, *next;
+	int err = 0;
+
+	/*
+	 * Perform early check, if free area is empty there is
+	 * nothing to process so we can skip this free_list.
+	 */
+	if (list_empty(list))
+		return err;
+
+	spin_lock_irq(&zone->lock);
+
+	/* loop through free list adding unreported pages to sg list */
+	list_for_each_entry_safe(page, next, list, lru) {
+		/* We are going to skip over the reported pages. */
+		if (PageReported(page))
+			continue;
+
+		/* Attempt to pull page from list */
+		if (!__isolate_free_page(page, order))
+			break;
+
+		/* Add page to scatter list */
+		--(*offset);
+		sg_set_page(&sgl[*offset], page, page_len, 0);
+
+		/* If scatterlist isn't full grab more pages */
+		if (*offset)
+			continue;
+
+		/* release lock before waiting on report processing */
+		spin_unlock_irq(&zone->lock);
+
+		/* begin processing pages in local list */
+		err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
+
+		/* reset offset since the full list was reported */
+		*offset = PAGE_REPORTING_CAPACITY;
+
+		/* reacquire zone lock and resume processing */
+		spin_lock_irq(&zone->lock);
+
+		/* flush reported pages from the sg list */
+		page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
+
+		/*
+		 * Reset next to first entry, the old next isn't valid
+		 * since we dropped the lock to report the pages
+		 */
+		next = list_first_entry(list, struct page, lru);
+
+		/* exit on error */
+		if (err)
+			break;
+	}
+
+	spin_unlock_irq(&zone->lock);
+
+	return err;
+}
+
+static int
+page_reporting_process_zone(struct page_reporting_dev_info *prdev,
+			    struct scatterlist *sgl, struct zone *zone)
+{
+	unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY;
+	unsigned long watermark;
+	int err = 0;
+
+	/* Generate minimum watermark to be able to guarantee progress */
+	watermark = low_wmark_pages(zone) +
+		    (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
+
+	/*
+	 * Cancel request if insufficient free memory or if we failed
+	 * to allocate page reporting statistics for the zone.
+	 */
+	if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		return err;
+
+	/* Process each free list starting from lowest order/mt */
+	for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
+		for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+			/* We do not pull pages from the isolate free list */
+			if (is_migrate_isolate(mt))
+				continue;
+
+			err = page_reporting_cycle(prdev, zone, order, mt,
+						   sgl, &offset);
+			if (err)
+				return err;
+		}
+	}
+
+	/* report the leftover pages before going idle */
+	leftover = PAGE_REPORTING_CAPACITY - offset;
+	if (leftover) {
+		sgl = &sgl[offset];
+		err = prdev->report(prdev, sgl, leftover);
+
+		/* flush any remaining pages out from the last report */
+		spin_lock_irq(&zone->lock);
+		page_reporting_drain(prdev, sgl, leftover, !err);
+		spin_unlock_irq(&zone->lock);
+	}
+
+	return err;
+}
+
+static void page_reporting_process(struct work_struct *work)
+{
+	struct delayed_work *d_work = to_delayed_work(work);
+	struct page_reporting_dev_info *prdev =
+		container_of(d_work, struct page_reporting_dev_info, work);
+	int err = 0, state = PAGE_REPORTING_ACTIVE;
+	struct scatterlist *sgl;
+	struct zone *zone;
+
+	/*
+	 * Change the state to "Active" so that we can track whether anyone
+	 * requests page reporting after we complete our pass. If
+	 * the state is not altered by the end of the pass we will switch
+	 * to idle and quit scheduling reporting runs.
+	 */
+	atomic_set(&prdev->state, state);
+
+	/* allocate scatterlist to store pages being reported on */
+	sgl = kmalloc_array(PAGE_REPORTING_CAPACITY, sizeof(*sgl), GFP_KERNEL);
+	if (!sgl)
+		goto err_out;
+
+	sg_init_table(sgl, PAGE_REPORTING_CAPACITY);
+
+	for_each_zone(zone) {
+		err = page_reporting_process_zone(prdev, sgl, zone);
+		if (err)
+			break;
+	}
+
+	kfree(sgl);
+err_out:
+	/*
+	 * If the state has reverted back to requested then there may be
+	 * additional pages to be processed. We will defer for 2s to allow
+	 * more pages to accumulate.
+	 */
+	state = atomic_cmpxchg(&prdev->state, state, PAGE_REPORTING_IDLE);
+	if (state == PAGE_REPORTING_REQUESTED)
+		schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
+}
+
+static DEFINE_MUTEX(page_reporting_mutex);
+DEFINE_STATIC_KEY_FALSE(page_reporting_enabled);
+
+int page_reporting_register(struct page_reporting_dev_info *prdev)
+{
+	int err = 0;
+
+	mutex_lock(&page_reporting_mutex);
+
+	/* nothing to do if already in use */
+	if (rcu_access_pointer(pr_dev_info)) {
+		err = -EBUSY;
+		goto err_out;
+	}
+
+	/* initialize state and work structures */
+	atomic_set(&prdev->state, PAGE_REPORTING_IDLE);
+	INIT_DELAYED_WORK(&prdev->work, &page_reporting_process);
+
+	/* Begin initial flush of zones */
+	__page_reporting_request(prdev);
+
+	/* Assign device to allow notifications */
+	rcu_assign_pointer(pr_dev_info, prdev);
+
+	/* enable page reporting notification */
+	if (!static_key_enabled(&page_reporting_enabled)) {
+		static_branch_enable(&page_reporting_enabled);
+		pr_info("Free page reporting enabled\n");
+	}
+err_out:
+	mutex_unlock(&page_reporting_mutex);
+
+	return err;
+}
+EXPORT_SYMBOL_GPL(page_reporting_register);
+
+void page_reporting_unregister(struct page_reporting_dev_info *prdev)
+{
+	mutex_lock(&page_reporting_mutex);
+
+	if (rcu_access_pointer(pr_dev_info) == prdev) {
+		/* Disable page reporting notification */
+		RCU_INIT_POINTER(pr_dev_info, NULL);
+		synchronize_rcu();
+
+		/* Flush any existing work, and lock it out */
+		cancel_delayed_work_sync(&prdev->work);
+	}
+
+	mutex_unlock(&page_reporting_mutex);
+}
+EXPORT_SYMBOL_GPL(page_reporting_unregister);
--- /dev/null
+++ a/mm/page_reporting.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _MM_PAGE_REPORTING_H
+#define _MM_PAGE_REPORTING_H
+
+#include <linux/mmzone.h>
+#include <linux/pageblock-flags.h>
+#include <linux/page-isolation.h>
+#include <linux/jump_label.h>
+#include <linux/slab.h>
+#include <asm/pgtable.h>
+#include <linux/scatterlist.h>
+
+#define PAGE_REPORTING_MIN_ORDER	pageblock_order
+
+#ifdef CONFIG_PAGE_REPORTING
+DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
+void __page_reporting_notify(void);
+
+static inline bool page_reported(struct page *page)
+{
+	return static_branch_unlikely(&page_reporting_enabled) &&
+	       PageReported(page);
+}
+
+/**
+ * page_reporting_notify_free - Free page notification to start page processing
+ *
+ * This function is meant to act as a screener for __page_reporting_notify
+ * which will determine if a given zone has crossed over the high-water mark
+ * that will justify us beginning page treatment. If we have crossed that
+ * threshold then it will start the process of pulling some pages and
+ * placing them in the batch list for treatment.
+ */
+static inline void page_reporting_notify_free(unsigned int order)
+{
+	/* Called from hot path in __free_one_page() */
+	if (!static_branch_unlikely(&page_reporting_enabled))
+		return;
+
+	/* Determine if we have crossed reporting threshold */
+	if (order < PAGE_REPORTING_MIN_ORDER)
+		return;
+
+	/* This will add a few cycles, but should be called infrequently */
+	__page_reporting_notify();
+}
+#else /* CONFIG_PAGE_REPORTING */
+#define page_reported(_page)	false
+
+static inline void page_reporting_notify_free(unsigned int order)
+{
+}
+#endif /* CONFIG_PAGE_REPORTING */
+#endif /*_MM_PAGE_REPORTING_H */
_

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (128 preceding siblings ...)
  2020-02-12  0:19 ` + mm-introduce-reported-pages.patch " Andrew Morton
@ 2020-02-12  0:19 ` Andrew Morton
  2020-02-12  0:19 ` + virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch " Andrew Morton
                   ` (108 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:19 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: virtio-balloon: pull page poisoning config out of free page hinting
has been added to the -mm tree.  Its filename is
     virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: virtio-balloon: pull page poisoning config out of free page hinting

Currently the page poisoning setting isn't enabled unless free page
hinting is enabled.  However, we will need the page poisoning tracking
logic for free page reporting as well.  As such, pull it out and make it a
separate bit of config in the probe function.

In addition we need to add support for the more recent init_on_free
feature which expects a behavior similar to page poisoning in that we
expect the page to be pre-zeroed.
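
For illustration, the negotiated poison value boils down to the following
(condensed from the probe-function hunk below):

	__u32 poison_val = 0;	/* init_on_free: pages come back zeroed */

	if (!want_init_on_free())	/* classic page poisoning instead */
		memset(&poison_val, PAGE_POISON, sizeof(poison_val));

	virtio_cwrite(vb->vdev, struct virtio_balloon_config,
		      poison_val, &poison_val);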

Link: http://lkml.kernel.org/r/20200211224646.29318.695.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/virtio/virtio_balloon.c |   23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

--- a/drivers/virtio/virtio_balloon.c~virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting
+++ a/drivers/virtio/virtio_balloon.c
@@ -864,7 +864,6 @@ static int virtio_balloon_register_shrin
 static int virtballoon_probe(struct virtio_device *vdev)
 {
 	struct virtio_balloon *vb;
-	__u32 poison_val;
 	int err;
 
 	if (!vdev->config->get) {
@@ -930,11 +929,20 @@ static int virtballoon_probe(struct virt
 						  VIRTIO_BALLOON_CMD_ID_STOP);
 		spin_lock_init(&vb->free_page_list_lock);
 		INIT_LIST_HEAD(&vb->free_page_list);
-		if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+	}
+	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+		/* Start with poison val of 0 representing general init */
+		__u32 poison_val = 0;
+
+		/*
+		 * Let the hypervisor know that we are expecting a
+		 * specific value to be written back in balloon pages.
+		 */
+		if (!want_init_on_free())
 			memset(&poison_val, PAGE_POISON, sizeof(poison_val));
-			virtio_cwrite(vb->vdev, struct virtio_balloon_config,
-				      poison_val, &poison_val);
-		}
+
+		virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+			      poison_val, &poison_val);
 	}
 	/*
 	 * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a
@@ -1045,7 +1053,10 @@ static int virtballoon_restore(struct vi
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
-	if (!page_poisoning_enabled())
+	/* Tell the host whether we care about poisoned pages. */
+	if (!want_init_on_free() &&
+	    (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
+	     !page_poisoning_enabled()))
 		__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
 
 	__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
_

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (129 preceding siblings ...)
  2020-02-12  0:19 ` + virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch " Andrew Morton
@ 2020-02-12  0:19 ` Andrew Morton
  2020-02-12  0:20 ` + mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch " Andrew Morton
                   ` (107 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:19 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: virtio-balloon: add support for providing free page reports to host
has been added to the -mm tree.  Its filename is
     virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: virtio-balloon: add support for providing free page reports to host

Add support for the page reporting feature provided by virtio-balloon.
Reporting differs from the regular balloon functionality in that it is
much less durable than a standard memory balloon.  Instead of creating a
list of pages that cannot be accessed, the pages are only inaccessible
while they are being indicated to the virtio interface.  Once the
interface has acknowledged them they are placed back into their respective
free lists and are once again accessible by the guest system.

Unlike a standard balloon we don't inflate and deflate the pages.  Instead
we perform the reporting, and once the reporting is completed it is
assumed that the page has been dropped from the guest and will be faulted
back in the next time the page is accessed.
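
A condensed sketch of the driver-side hookup, drawn from the probe-path
hunk below (error unwinding omitted for brevity):

	vb->pr_dev_info.report = virtballoon_free_page_report;
	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
		/* the ring must hold at least one full batch of pages */
		if (virtqueue_get_vring_size(vb->reporting_vq) <
		    PAGE_REPORTING_CAPACITY)
			return -ENOSPC;

		err = page_reporting_register(&vb->pr_dev_info);
	}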

Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/virtio/Kconfig              |    1 
 drivers/virtio/virtio_balloon.c     |   64 ++++++++++++++++++++++++++
 include/uapi/linux/virtio_balloon.h |    1 
 3 files changed, 66 insertions(+)

--- a/drivers/virtio/Kconfig~virtio-balloon-add-support-for-providing-free-page-reports-to-host
+++ a/drivers/virtio/Kconfig
@@ -58,6 +58,7 @@ config VIRTIO_BALLOON
 	tristate "Virtio balloon driver"
 	depends on VIRTIO
 	select MEMORY_BALLOON
+	select PAGE_REPORTING
 	---help---
 	 This driver supports increasing and decreasing the amount
 	 of memory within a KVM guest.
--- a/drivers/virtio/virtio_balloon.c~virtio-balloon-add-support-for-providing-free-page-reports-to-host
+++ a/drivers/virtio/virtio_balloon.c
@@ -19,6 +19,7 @@
 #include <linux/mount.h>
 #include <linux/magic.h>
 #include <linux/pseudo_fs.h>
+#include <linux/page_reporting.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -47,6 +48,7 @@ enum virtio_balloon_vq {
 	VIRTIO_BALLOON_VQ_DEFLATE,
 	VIRTIO_BALLOON_VQ_STATS,
 	VIRTIO_BALLOON_VQ_FREE_PAGE,
+	VIRTIO_BALLOON_VQ_REPORTING,
 	VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -114,6 +116,10 @@ struct virtio_balloon {
 
 	/* To register a shrinker to shrink memory upon memory pressure */
 	struct shrinker shrinker;
+
+	/* Free page reporting device */
+	struct virtqueue *reporting_vq;
+	struct page_reporting_dev_info pr_dev_info;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -153,6 +159,33 @@ static void tell_host(struct virtio_ball
 
 }
 
+int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_info,
+				   struct scatterlist *sg, unsigned int nents)
+{
+	struct virtio_balloon *vb =
+		container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
+	struct virtqueue *vq = vb->reporting_vq;
+	unsigned int unused, err;
+
+	/* We should always be able to add these buffers to an empty queue. */
+	err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN);
+
+	/*
+	 * In the extremely unlikely case that an error is returned we will
+	 * simply display a warning and exit without actually processing
+	 * the pages.
+	 */
+	if (WARN_ON_ONCE(err))
+		return err;
+
+	virtqueue_kick(vq);
+
+	/* When host has read buffer, this completes via balloon_ack */
+	wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+
+	return 0;
+}
+
 static void set_page_pfns(struct virtio_balloon *vb,
 			  __virtio32 pfns[], struct page *page)
 {
@@ -481,6 +514,7 @@ static int init_vqs(struct virtio_balloo
 	names[VIRTIO_BALLOON_VQ_STATS] = NULL;
 	callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+	names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
 
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
 		names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -492,6 +526,11 @@ static int init_vqs(struct virtio_balloo
 		callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 	}
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+		names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
+		callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
+	}
+
 	err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
 					 vqs, callbacks, names, NULL, NULL);
 	if (err)
@@ -524,6 +563,9 @@ static int init_vqs(struct virtio_balloo
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+		vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
+
 	return 0;
 }
 
@@ -953,12 +995,31 @@ static int virtballoon_probe(struct virt
 		if (err)
 			goto out_del_balloon_wq;
 	}
+
+	vb->pr_dev_info.report = virtballoon_free_page_report;
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+		unsigned int capacity;
+
+		capacity = virtqueue_get_vring_size(vb->reporting_vq);
+		if (capacity < PAGE_REPORTING_CAPACITY) {
+			err = -ENOSPC;
+			goto out_unregister_shrinker;
+		}
+
+		err = page_reporting_register(&vb->pr_dev_info);
+		if (err)
+			goto out_unregister_shrinker;
+	}
+
 	virtio_device_ready(vdev);
 
 	if (towards_target(vb))
 		virtballoon_changed(vdev);
 	return 0;
 
+out_unregister_shrinker:
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+		virtio_balloon_unregister_shrinker(vb);
 out_del_balloon_wq:
 	if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
 		destroy_workqueue(vb->balloon_wq);
@@ -997,6 +1058,8 @@ static void virtballoon_remove(struct vi
 {
 	struct virtio_balloon *vb = vdev->priv;
 
+	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+		page_reporting_unregister(&vb->pr_dev_info);
 	if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
 		virtio_balloon_unregister_shrinker(vb);
 	spin_lock_irq(&vb->stop_update_lock);
@@ -1069,6 +1132,7 @@ static unsigned int features[] = {
 	VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
 	VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 	VIRTIO_BALLOON_F_PAGE_POISON,
+	VIRTIO_BALLOON_F_REPORTING,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
--- a/include/uapi/linux/virtio_balloon.h~virtio-balloon-add-support-for-providing-free-page-reports-to-host
+++ a/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM	2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT	3 /* VQ to report free pages */
 #define VIRTIO_BALLOON_F_PAGE_POISON	4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_REPORTING	5 /* Page reporting virtqueue */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
_

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (130 preceding siblings ...)
  2020-02-12  0:19 ` + virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch " Andrew Morton
@ 2020-02-12  0:20 ` Andrew Morton
  2020-02-12  0:20 ` + mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch " Andrew Morton
                   ` (106 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:20 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: mm/page_reporting: rotate reported pages to the tail of the list
has been added to the -mm tree.  Its filename is
     mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: mm/page_reporting: rotate reported pages to the tail of the list

Rather than walking over the same pages again and again to get to the
pages that have yet to be reported, we can save ourselves a significant
amount of time by simply rotating the list so that, when we have a full
list of reported pages, the head of the list points to the next
non-reported page.  Doing this should save significant time when
processing each free list.

This doesn't gain us much in the standard case as all of the non-reported
pages should be near the top of the list already.  However in the case of
page shuffling this results in a noticeable improvement.  Below are the
will-it-scale page_fault1 w/ THP numbers for 16 tasks with and without
this patch.

Without:
tasks   processes       processes_idle  threads         threads_idle
16      8093776.25      0.17            5393242.00      38.20

With:
tasks   processes       processes_idle  threads         threads_idle
16      8283274.75      0.17            5594261.00      38.15
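
The mechanism itself is a single list primitive; this fragment is taken
from the diff below:

	/* make the first unreported page the new head of the free list */
	if (&next->lru != list && !list_is_first(&next->lru, list))
		list_rotate_to_front(&next->lru, list);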

Link: http://lkml.kernel.org/r/20200211224708.29318.16862.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_reporting.c |   32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

--- a/mm/page_reporting.c~mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list
+++ a/mm/page_reporting.c
@@ -131,17 +131,27 @@ page_reporting_cycle(struct page_reporti
 		if (PageReported(page))
 			continue;
 
-		/* Attempt to pull page from list */
-		if (!__isolate_free_page(page, order))
-			break;
-
-		/* Add page to scatter list */
-		--(*offset);
-		sg_set_page(&sgl[*offset], page, page_len, 0);
+		/* Attempt to pull page from list and place in scatterlist */
+		if (*offset) {
+			if (!__isolate_free_page(page, order)) {
+				next = page;
+				break;
+			}
+
+			/* Add page to scatter list */
+			--(*offset);
+			sg_set_page(&sgl[*offset], page, page_len, 0);
 
-		/* If scatterlist isn't full grab more pages */
-		if (*offset)
 			continue;
+		}
+
+		/*
+		 * Make the first non-processed page in the free list
+		 * the new head of the free list before we release the
+		 * zone lock.
+		 */
+		if (&page->lru != list && !list_is_first(&page->lru, list))
+			list_rotate_to_front(&page->lru, list);
 
 		/* release lock before waiting on report processing */
 		spin_unlock_irq(&zone->lock);
@@ -169,6 +179,10 @@ page_reporting_cycle(struct page_reporti
 			break;
 	}
 
+	/* Rotate any leftover pages to the head of the freelist */
+	if (&next->lru != list && !list_is_first(&next->lru, list))
+		list_rotate_to_front(&next->lru, list);
+
 	spin_unlock_irq(&zone->lock);
 
 	return err;
_

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (131 preceding siblings ...)
  2020-02-12  0:20 ` + mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch " Andrew Morton
@ 2020-02-12  0:20 ` Andrew Morton
  2020-02-12  0:20 ` + mm-page_reporting-add-free-page-reporting-documentation.patch " Andrew Morton
                   ` (105 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:20 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: mm/page_reporting: add budget limit on how many pages can be reported per pass
has been added to the -mm tree.  Its filename is
     mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: mm/page_reporting: add budget limit on how many pages can be reported per pass

In order to keep ourselves from reporting pages that are just going to be
reused again in the case of heavy churn, we can put a limit on how many
total pages we will process per pass.  Doing this will allow the worker
thread to go idle much more quickly so that we avoid competing with
other threads that might be allocating or freeing pages.

The logic added here will limit the worker thread to no more than one
sixteenth of the total free pages in a given area per list.  Once that
limit is reached it will update the state so that at the end of the pass
we will reschedule the worker to try again in 2 seconds when the memory
churn has hopefully settled down.
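
A rough worked example (our numbers, not from the patch): with
PAGE_REPORTING_CAPACITY at 32 and 4096 free pages of a given order and
migratetype on a list, the hunk below computes

	budget = DIV_ROUND_UP(4096, 32 * 16) = 8

so the worker makes at most 8 calls to the report function, covering at
most 8 * 32 = 256 pages, one sixteenth of the list, before the pass is
rescheduled.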

Again, this optimization doesn't show much of a benefit in the standard
case as the memory churn is minimal.  However, with page allocator
shuffling enabled the gain is quite noticeable.  Below are the results
with a THP-enabled version of the will-it-scale page_fault1 test showing
the improvement in iterations for 16 processes or threads.

Without:
tasks   processes       processes_idle  threads         threads_idle
16      8283274.75      0.17            5594261.00      38.15

With:
tasks   processes       processes_idle  threads         threads_idle
16      8767010.50      0.21            5791312.75      36.98

Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page_reporting.h |    1 
 mm/page_reporting.c            |   33 ++++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

--- a/include/linux/page_reporting.h~mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass
+++ a/include/linux/page_reporting.h
@@ -5,6 +5,7 @@
 #include <linux/mmzone.h>
 #include <linux/scatterlist.h>
 
+/* This value should always be a power of 2, see page_reporting_cycle() */
 #define PAGE_REPORTING_CAPACITY		32
 
 struct page_reporting_dev_info {
--- a/mm/page_reporting.c~mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass
+++ a/mm/page_reporting.c
@@ -114,6 +114,7 @@ page_reporting_cycle(struct page_reporti
 	struct list_head *list = &area->free_list[mt];
 	unsigned int page_len = PAGE_SIZE << order;
 	struct page *page, *next;
+	long budget;
 	int err = 0;
 
 	/*
@@ -125,12 +126,39 @@ page_reporting_cycle(struct page_reporti
 
 	spin_lock_irq(&zone->lock);
 
+	/*
+	 * Limit how many calls we will be making to the page reporting
+	 * device for this list. By doing this we avoid processing any
+	 * given list for too long.
+	 *
+	 * The current value used allows us enough calls to process over a
+	 * sixteenth of the current list plus one additional call to handle
+	 * any pages that may have already been present from the previous
+	 * list processed. This should result in us reporting all pages on
+	 * an idle system in about 30 seconds.
+	 *
+	 * The division here should be cheap since PAGE_REPORTING_CAPACITY
+	 * should always be a power of 2.
+	 */
+	budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
+
 	/* loop through free list adding unreported pages to sg list */
 	list_for_each_entry_safe(page, next, list, lru) {
 		/* We are going to skip over the reported pages. */
 		if (PageReported(page))
 			continue;
 
+		/*
+		 * If we fully consumed our budget then update our
+		 * state to indicate that we are requesting additional
+		 * processing and exit this list.
+		 */
+		if (budget < 0) {
+			atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
+			next = page;
+			break;
+		}
+
 		/* Attempt to pull page from list and place in scatterlist */
 		if (*offset) {
 			if (!__isolate_free_page(page, order)) {
@@ -146,7 +174,7 @@ page_reporting_cycle(struct page_reporti
 		}
 
 		/*
-		 * Make the first non-processed page in the free list
+		 * Make the first non-reported page in the free list
 		 * the new head of the free list before we release the
 		 * zone lock.
 		 */
@@ -162,6 +190,9 @@ page_reporting_cycle(struct page_reporti
 		/* reset offset since the full list was reported */
 		*offset = PAGE_REPORTING_CAPACITY;
 
+		/* update budget to reflect call to report function */
+		budget--;
+
 		/* reacquire zone lock and resume processing */
 		spin_lock_irq(&zone->lock);
 
_

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-page_reporting-add-free-page-reporting-documentation.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (132 preceding siblings ...)
  2020-02-12  0:20 ` + mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch " Andrew Morton
@ 2020-02-12  0:20 ` Andrew Morton
  2020-02-12  0:30 ` + mm-page_counter-fix-various-data-races.patch " Andrew Morton
                   ` (104 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:20 UTC (permalink / raw)
  To: aarcange, alexander.h.duyck, dan.j.williams, dave.hansen, david,
	konrad.wilk, lcapitulino, mgorman, mhocko, mm-commits, mst,
	nitesh, osalvador, pagupta, pbonzini, riel, vbabka, wei.w.wang,
	willy, yang.zhang.wz


The patch titled
     Subject: mm/page_reporting: add free page reporting documentation
has been added to the -mm tree.  Its filename is
     mm-page_reporting-add-free-page-reporting-documentation.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_reporting-add-free-page-reporting-documentation.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_reporting-add-free-page-reporting-documentation.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Subject: mm/page_reporting: add free page reporting documentation

Add documentation for free page reporting.  Currently the only consumer is
virtio-balloon; however, it is possible that other drivers might make use
of this, so it is best to add a bit of documentation explaining at a high
level how to use the API.

Link: http://lkml.kernel.org/r/20200211224730.29318.43815.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/free_page_reporting.rst |   41 +++++++++++++++++++++
 1 file changed, 41 insertions(+)

--- /dev/null
+++ a/Documentation/vm/free_page_reporting.rst
@@ -0,0 +1,41 @@
+.. _free_page_reporting:
+
+=====================
+Free Page Reporting
+=====================
+
+Free page reporting is an API by which a device can register to receive
+lists of pages that are currently unused by the system. This is useful in
+the case of virtualization where a guest is then able to use this data to
+notify the hypervisor that it is no longer using certain pages in memory.
+
+For the driver, typically a balloon driver, to make use of this
+functionality it will allocate and initialize a page_reporting_dev_info
+structure. The field it must populate is the "report" function pointer
+used to process the scatterlist. It must also guarantee that it can
+handle at least PAGE_REPORTING_CAPACITY worth of scatterlist entries per
+call to the function. A call to page_reporting_register will register the
+page reporting interface with the reporting framework assuming no other
+page reporting devices are already registered.
+
+Once registered, the page reporting API will begin reporting batches of
+pages to the driver. The API will start reporting pages 2 seconds after
+the interface is registered and will continue to do so 2 seconds after any
+page of a sufficiently high order is freed.
+
+Pages reported will be stored in the scatterlist passed to the reporting
+function, with the end bit set on the final entry, entry nent - 1.
+While pages are being processed by the report function they will not be
+accessible to the allocator. Once the report function has completed,
+the pages will be returned to the free area from which they were obtained.
+
+Prior to removing a driver that is making use of free page reporting it
+is necessary to call page_reporting_unregister to have the
+page_reporting_dev_info structure that is currently in use by free page
+reporting removed. Doing this will prevent further reports from being
+issued via the interface. If another driver or the same driver is
+registered it is possible for it to resume where the previous driver had
+left off in terms of reporting free pages.
+
+Alexander Duyck, Dec 04, 2019
+
_
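
For illustration only, a minimal sketch of a consumer of this API (ours,
not part of the patch; it assumes the page_reporting_dev_info layout and
the register/unregister calls described in the documentation above, and
all "my_*" names are hypothetical):

#include <linux/page_reporting.h>
#include <linux/scatterlist.h>

/* Called with up to PAGE_REPORTING_CAPACITY scatterlist entries,
 * each describing a run of free pages. */
static int my_report(struct page_reporting_dev_info *prdev,
		     struct scatterlist *sgl, unsigned int nents)
{
	/* Hand the ranges in sgl to the hypervisor here.  Returning 0
	 * lets the pages go back to the free area they came from. */
	return 0;
}

static struct page_reporting_dev_info my_prdev = {
	.report = my_report,
};

static int my_driver_init(void)
{
	/* Fails if another page reporting device is already registered. */
	return page_reporting_register(&my_prdev);
}

static void my_driver_exit(void)
{
	page_reporting_unregister(&my_prdev);
}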

Patches currently in -mm which might be from alexander.h.duyck@linux.intel.com are

mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
mm-add-function-__putback_isolated_page.patch
mm-introduce-reported-pages.patch
virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
mm-page_reporting-add-free-page-reporting-documentation.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-page_counter-fix-various-data-races.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (133 preceding siblings ...)
  2020-02-12  0:20 ` + mm-page_reporting-add-free-page-reporting-documentation.patch " Andrew Morton
@ 2020-02-12  0:30 ` Andrew Morton
  2020-02-12  0:33 ` + memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch " Andrew Morton
                   ` (103 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:30 UTC (permalink / raw)
  To: cai, david, dvyukov, elver, hannes, mhocko, mm-commits, penguin-kernel


The patch titled
     Subject: mm/page_counter.c: fix various data races
has been added to the -mm tree.  Its filename is
     mm-page_counter-fix-various-data-races.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_counter-fix-various-data-races.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_counter-fix-various-data-races.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/page_counter.c: fix various data races

Since commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"),
memcg->memsw->watermark can be accessed concurrently, as reported by
KCSAN:

 Reported by Kernel Concurrency Sanitizer on:
 BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

 read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
  page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
  try_charge+0x131/0xd50 mm/memcontrol.c:2405
  __memcg_kmem_charge_memcg+0x58/0x140
  __memcg_kmem_charge+0xcc/0x280
  __alloc_pages_nodemask+0x1e1/0x450
  alloc_pages_current+0xa6/0x120
  pte_alloc_one+0x17/0xd0
  __pte_alloc+0x3a/0x1f0
  copy_p4d_range+0xc36/0x1990
  copy_page_range+0x21d/0x360
  dup_mmap+0x5f5/0x7a0
  dup_mm+0xa2/0x240
  copy_process+0x1b3f/0x3460
  _do_fork+0xaa/0xa20
  __x64_sys_clone+0x13b/0x170
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
  page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
  try_charge+0x131/0xd50 mm/memcontrol.c:2405
  mem_cgroup_try_charge+0x159/0x460
  mem_cgroup_try_charge_delay+0x3d/0xa0
  wp_page_copy+0x14d/0x930
  do_wp_page+0x107/0x7b0
  __handle_mm_fault+0xce6/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

Since the watermark could be compared against or set to garbage due to
load or store tearing, which would change the code logic, fix it by
adding a pair of READ_ONCE() and WRITE_ONCE() in those places.
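
To make the failure mode concrete (a general illustration of tearing,
not taken from the patch): the compiler is free to split a plain 8-byte
store into two 4-byte stores on a 32-bit target, or to re-issue a plain
load, e.g.

	if (new > c->watermark)		/* may be re-loaded later */
		c->watermark = new;	/* may be written in two halves */

READ_ONCE() and WRITE_ONCE() constrain each annotated access to a
single, untorn load or store of the full value.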

Link: http://lkml.kernel.org/r/20200129105224.4016-1-cai@lca.pw
Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_counter.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/page_counter.c~mm-page_counter-fix-various-data-races
+++ a/mm/page_counter.c
@@ -82,8 +82,8 @@ void page_counter_charge(struct page_cou
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (new > READ_ONCE(c->watermark))
+			WRITE_ONCE(c->watermark, new);
 	}
 }
 
@@ -135,8 +135,8 @@ bool page_counter_try_charge(struct page
 		 * Just like with failcnt, we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (new > READ_ONCE(c->watermark))
+			WRITE_ONCE(c->watermark, new);
 	}
 	return true;
 
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-a-data-race-in-checksum.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-page_counter-fix-various-data-races.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (134 preceding siblings ...)
  2020-02-12  0:30 ` + mm-page_counter-fix-various-data-races.patch " Andrew Morton
@ 2020-02-12  0:33 ` Andrew Morton
  2020-02-12 21:16 ` + mm-page_counter-fix-various-data-races-at-memsw.patch " Andrew Morton
                   ` (102 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12  0:33 UTC (permalink / raw)
  To: guro, hannes, ktkhai, mhocko, mm-commits, stable, vdavydov.dev, vvs


The patch titled
     Subject: mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()
has been added to the -mm tree.  Its filename is
     memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Vasily Averin <vvs@virtuozzo.com>
Subject: mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()

for_each_mem_cgroup() takes a css reference on each memory cgroup it
visits, so mem_cgroup_iter_break() must be used to drop that reference
if the walk is cancelled early.
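
For reference, the iterator expands to roughly the following (a sketch
of its definition in mm/memcontrol.c):

	for (iter = mem_cgroup_iter(NULL, NULL, NULL);
	     iter != NULL;
	     iter = mem_cgroup_iter(NULL, iter, NULL))

Breaking out of the loop body therefore leaves the current iterator's
css reference held; mem_cgroup_iter_break() drops it, which is what the
hunk below adds on the error path.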

Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/memcontrol.c~memcg-lost-css_put-in-memcg_expand_shrinker_maps
+++ a/mm/memcontrol.c
@@ -409,8 +409,10 @@ int memcg_expand_shrinker_maps(int new_i
 		if (mem_cgroup_is_root(memcg))
 			continue;
 		ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
-		if (ret)
+		if (ret) {
+			mem_cgroup_iter_break(NULL, memcg);
 			goto unlock;
+		}
 	}
 unlock:
 	if (!ret)
_

Patches currently in -mm which might be from vvs@virtuozzo.com are

memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-page_counter-fix-various-data-races-at-memsw.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (135 preceding siblings ...)
  2020-02-12  0:33 ` + memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch " Andrew Morton
@ 2020-02-12 21:16 ` Andrew Morton
  2020-02-12 21:16 ` [to-be-updated] mm-page_counter-fix-various-data-races.patch removed from " Andrew Morton
                   ` (101 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 21:16 UTC (permalink / raw)
  To: cai, david, dvyukov, elver, hannes, mhocko, mm-commits, penguin-kernel


The patch titled
     Subject: mm/page_counter: fix various data races at memsw
has been added to the -mm tree.  Its filename is
     mm-page_counter-fix-various-data-races-at-memsw.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_counter-fix-various-data-races-at-memsw.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_counter-fix-various-data-races-at-memsw.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/page_counter: fix various data races at memsw

Since commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"),
memcg->memsw->watermark and memcg->memsw->failcnt can be accessed
concurrently, as reported by KCSAN:

 BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

 read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
  page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
  try_charge+0x131/0xd50 mm/memcontrol.c:2405
  __memcg_kmem_charge_memcg+0x58/0x140
  __memcg_kmem_charge+0xcc/0x280
  __alloc_pages_nodemask+0x1e1/0x450
  alloc_pages_current+0xa6/0x120
  pte_alloc_one+0x17/0xd0
  __pte_alloc+0x3a/0x1f0
  copy_p4d_range+0xc36/0x1990
  copy_page_range+0x21d/0x360
  dup_mmap+0x5f5/0x7a0
  dup_mm+0xa2/0x240
  copy_process+0x1b3f/0x3460
  _do_fork+0xaa/0xa20
  __x64_sys_clone+0x13b/0x170
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
  page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
  try_charge+0x131/0xd50 mm/memcontrol.c:2405
  mem_cgroup_try_charge+0x159/0x460
  mem_cgroup_try_charge_delay+0x3d/0xa0
  wp_page_copy+0x14d/0x930
  do_wp_page+0x107/0x7b0
  __handle_mm_fault+0xce6/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

 write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
  page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
  try_charge+0x185/0xbf0 mm/memcontrol.c:2405
  __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
  __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
  __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780

 read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
  page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
  try_charge+0x185/0xbf0 mm/memcontrol.c:2405
  __memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
  __memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
  __alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780

Since the watermark could be compared against or set to garbage due to a
data race, which would change the code logic, fix it by adding a pair of
READ_ONCE() and WRITE_ONCE() in those places.

The "failcnt" counter is tolerant of some degree of inaccuracy and is only
used to report stats, so a data race will not be harmful; thus mark it as
an intentional data race using the data_race() macro.  (data_race() marks
the access as intentional for KCSAN only; it adds no atomicity or
ordering guarantees.)

Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw
Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
Signed-off-by: Qian Cai <cai@lca.pw>
Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_counter.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- a/mm/page_counter.c~mm-page_counter-fix-various-data-races-at-memsw
+++ a/mm/page_counter.c
@@ -82,8 +82,8 @@ void page_counter_charge(struct page_cou
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (new > READ_ONCE(c->watermark))
+			WRITE_ONCE(c->watermark, new);
 	}
 }
 
@@ -124,9 +124,10 @@ bool page_counter_try_charge(struct page
 			propagate_protected_usage(counter, new);
 			/*
 			 * This is racy, but we can live with some
-			 * inaccuracy in the failcnt.
+			 * inaccuracy in the failcnt which is only used
+			 * to report stats.
 			 */
-			c->failcnt++;
+			data_race(c->failcnt++);
 			*fail = c;
 			goto failed;
 		}
@@ -135,8 +136,8 @@ bool page_counter_try_charge(struct page
 		 * Just like with failcnt, we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (new > READ_ONCE(c->watermark))
+			WRITE_ONCE(c->watermark, new);
 	}
 	return true;
 
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-a-data-race-in-checksum.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-page_counter-fix-various-data-races.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [to-be-updated] mm-page_counter-fix-various-data-races.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (136 preceding siblings ...)
  2020-02-12 21:16 ` + mm-page_counter-fix-various-data-races-at-memsw.patch " Andrew Morton
@ 2020-02-12 21:16 ` Andrew Morton
  2020-02-12 21:18 ` + mm-swapfilec-fix-comments-for-swapcache_prepare.patch added to " Andrew Morton
                   ` (100 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 21:16 UTC (permalink / raw)
  To: cai, david, dvyukov, elver, hannes, mhocko, mm-commits, penguin-kernel


The patch titled
     Subject: mm/page_counter.c: fix various data races
has been removed from the -mm tree.  Its filename was
     mm-page_counter-fix-various-data-races.patch

This patch was dropped because an updated version will be merged

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/page_counter.c: fix various data races

Since commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"),
memcg->memsw->watermark can be accessed concurrently, as reported by
KCSAN:

 Reported by Kernel Concurrency Sanitizer on:
 BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge

 read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
  page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
  try_charge+0x131/0xd50 mm/memcontrol.c:2405
  __memcg_kmem_charge_memcg+0x58/0x140
  __memcg_kmem_charge+0xcc/0x280
  __alloc_pages_nodemask+0x1e1/0x450
  alloc_pages_current+0xa6/0x120
  pte_alloc_one+0x17/0xd0
  __pte_alloc+0x3a/0x1f0
  copy_p4d_range+0xc36/0x1990
  copy_page_range+0x21d/0x360
  dup_mmap+0x5f5/0x7a0
  dup_mm+0xa2/0x240
  copy_process+0x1b3f/0x3460
  _do_fork+0xaa/0xa20
  __x64_sys_clone+0x13b/0x170
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
  page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
  try_charge+0x131/0xd50 mm/memcontrol.c:2405
  mem_cgroup_try_charge+0x159/0x460
  mem_cgroup_try_charge_delay+0x3d/0xa0
  wp_page_copy+0x14d/0x930
  do_wp_page+0x107/0x7b0
  __handle_mm_fault+0xce6/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

Since the watermark could be compared against or set to garbage due to
load or store tearing, which would change the code logic, fix it by
adding a pair of READ_ONCE() and WRITE_ONCE() in those places.

Link: http://lkml.kernel.org/r/20200129105224.4016-1-cai@lca.pw
Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_counter.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/page_counter.c~mm-page_counter-fix-various-data-races
+++ a/mm/page_counter.c
@@ -82,8 +82,8 @@ void page_counter_charge(struct page_cou
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (new > READ_ONCE(c->watermark))
+			WRITE_ONCE(c->watermark, new);
 	}
 }
 
@@ -135,8 +135,8 @@ bool page_counter_try_charge(struct page
 		 * Just like with failcnt, we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
-			c->watermark = new;
+		if (new > READ_ONCE(c->watermark))
+			WRITE_ONCE(c->watermark, new);
 	}
 	return true;
 
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-a-data-race-in-checksum.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-swapfilec-fix-comments-for-swapcache_prepare.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (137 preceding siblings ...)
  2020-02-12 21:16 ` [to-be-updated] mm-page_counter-fix-various-data-races.patch removed from " Andrew Morton
@ 2020-02-12 21:18 ` Andrew Morton
  2020-02-12 21:22 ` + mm-util-annotate-an-data-race-at-vm_committed_as.patch " Andrew Morton
                   ` (99 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 21:18 UTC (permalink / raw)
  To: akpm, chenwandun, mm-commits


The patch titled
     Subject: mm/swapfile.c: fix comments for swapcache_prepare
has been added to the -mm tree.  Its filename is
     mm-swapfilec-fix-comments-for-swapcache_prepare.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-swapfilec-fix-comments-for-swapcache_prepare.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-swapfilec-fix-comments-for-swapcache_prepare.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Chen Wandun <chenwandun@huawei.com>
Subject: mm/swapfile.c: fix comments for swapcache_prepare

The -EEXIST returned by __swap_duplicate() means there is a swap cache,
not -EBUSY.

Link: http://lkml.kernel.org/r/20200212145754.27123-1-chenwandun@huawei.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/swapfile.c~mm-swapfilec-fix-comments-for-swapcache_prepare
+++ a/mm/swapfile.c
@@ -3480,7 +3480,7 @@ int swap_duplicate(swp_entry_t entry)
  *
  * Called when allocating swap cache for existing swap entry,
  * This can return error codes. Returns 0 at success.
- * -EBUSY means there is a swap cache.
+ * -EEXIST means there is a swap cache.
  * Note: return code is different from swap_duplicate().
  */
 int swapcache_prepare(swp_entry_t entry)
_

Patches currently in -mm which might be from chenwandun@huawei.com are

mm-swapfilec-fix-comments-for-swapcache_prepare.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-util-annotate-an-data-race-at-vm_committed_as.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (138 preceding siblings ...)
  2020-02-12 21:18 ` + mm-swapfilec-fix-comments-for-swapcache_prepare.patch added to " Andrew Morton
@ 2020-02-12 21:22 ` Andrew Morton
  2020-02-12 22:08 ` + asm-generic-fix-unistd_32h-generation-format.patch " Andrew Morton
                   ` (98 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 21:22 UTC (permalink / raw)
  To: cai, cl, dennis, elver, mm-commits, tj


The patch titled
     Subject: mm/util.c: annotate an data race at vm_committed_as
has been added to the -mm tree.  Its filename is
     mm-util-annotate-an-data-race-at-vm_committed_as.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-util-annotate-an-data-race-at-vm_committed_as.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-util-annotate-an-data-race-at-vm_committed_as.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/util.c: annotate an data race at vm_committed_as

"vm_committed_as.count" could be accessed concurrently as reported by
KCSAN,

 read to 0xffffffff923164f8 of 8 bytes by task 1268 on cpu 38:
  __vm_enough_memory+0x43/0x280 mm/util.c:801
  mmap_region+0x1b2/0xb90 mm/mmap.c:1726
  do_mmap+0x45c/0x700
  vm_mmap_pgoff+0xc0/0x130
  vm_mmap+0x71/0x90
  elf_map+0xa1/0x1b0
  load_elf_binary+0x9de/0x2180
  search_binary_handler+0xd8/0x2b0
  __do_execve_file+0xb61/0x1080
  __x64_sys_execve+0x5f/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffffffff923164f8 of 8 bytes by task 1265 on cpu 41:
  percpu_counter_add_batch+0x83/0xd0 lib/percpu_counter.c:91
  exit_mmap+0x178/0x220 include/linux/mman.h:68
  mmput+0x10e/0x270
  flush_old_exec+0x572/0xfe0
  load_elf_binary+0x467/0x2180
  search_binary_handler+0xd8/0x2b0
  __do_execve_file+0xb61/0x1080
  __x64_sys_execve+0x5f/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

According to commit 82f71ae4a2b8 ("mm: catch memory commitment
underflow"), the warning is almost impossible to trigger, but leave it
for now to catch any possible unbalanced vm_unacct_memory() in the
future.  Since only the read side is lockless, mark it as an intentional
data race using the data_race() macro; this avoids modifying
percpu_counter_read() while still catching unintended races elsewhere.
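
As background (a sketch of the helper, per its definition in
include/linux/percpu_counter.h):

	static inline s64 percpu_counter_read(struct percpu_counter *fbc)
	{
		return fbc->count;	/* central count only */
	}

Per-CPU deltas that have not yet been folded into fbc->count are
ignored, so the read can transiently under-report by up to
batch * num_online_cpus(), which is exactly the slack the warning
threshold above allows for.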

Link: http://lkml.kernel.org/r/1581518109-21180-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/util.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- a/mm/util.c~mm-util-annotate-an-data-race-at-vm_committed_as
+++ a/mm/util.c
@@ -798,8 +798,12 @@ int __vm_enough_memory(struct mm_struct
 {
 	long allowed;
 
-	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
-			-(s64)vm_committed_as_batch * num_online_cpus(),
+	/*
+	 * A transient decrease in the value is unlikely, so no need
+	 * READ_ONCE() for vm_committed_as.count.
+	 */
+	VM_WARN_ONCE(data_race(percpu_counter_read(&vm_committed_as) <
+			-(s64)vm_committed_as_batch * num_online_cpus()),
 			"memory commitment underflow");
 
 	vm_acct_memory(pages);
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-a-data-race-in-checksum.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-util-annotate-an-data-race-at-vm_committed_as.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + asm-generic-fix-unistd_32h-generation-format.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (139 preceding siblings ...)
  2020-02-12 21:22 ` + mm-util-annotate-an-data-race-at-vm_committed_as.patch " Andrew Morton
@ 2020-02-12 22:08 ` Andrew Morton
  2020-02-12 22:26 ` + mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch " Andrew Morton
                   ` (97 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 22:08 UTC (permalink / raw)
  To: arnd, benh, chris, dalias, davem, deller, fenghua.yu, ink,
	James.Bottomley, jcmvbkbc, mattst88, michal.simek, mm-commits,
	mpe, paulburton, paulus, ralf, rth, stefan.asserhall, tony.luck,
	ysato


The patch titled
     Subject: asm-generic: fix unistd_32.h generation format
has been added to the -mm tree.  Its filename is
     asm-generic-fix-unistd_32h-generation-format.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/asm-generic-fix-unistd_32h-generation-format.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/asm-generic-fix-unistd_32h-generation-format.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Michal Simek <michal.simek@xilinx.com>
Subject: asm-generic: fix unistd_32.h generation format

Generated files are also checked by sparse, so add a trailing newline to
silence the sparse (C=1) warning.

The issue was found on Microblaze and reported like this:
./arch/microblaze/include/generated/uapi/asm/unistd_32.h:438:45: warning:
no newline at end of file

MIPS and PowerPC already have it, but let's align with the style used by
m68k.

Link: http://lkml.kernel.org/r/4d32ab4e1fb2edb691d2e1687e8fb303c09fd023.1581504803.git.michal.simek@xilinx.com
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Reviewed-by: Stefan Asserhall <stefan.asserhall@xilinx.com>
Acked-by: Max Filippov <jcmvbkbc@gmail.com> (xtensa)
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/kernel/syscalls/syscallhdr.sh      |    2 +-
 arch/ia64/kernel/syscalls/syscallhdr.sh       |    2 +-
 arch/microblaze/kernel/syscalls/syscallhdr.sh |    2 +-
 arch/mips/kernel/syscalls/syscallhdr.sh       |    3 +--
 arch/parisc/kernel/syscalls/syscallhdr.sh     |    2 +-
 arch/powerpc/kernel/syscalls/syscallhdr.sh    |    3 +--
 arch/sh/kernel/syscalls/syscallhdr.sh         |    2 +-
 arch/sparc/kernel/syscalls/syscallhdr.sh      |    2 +-
 arch/xtensa/kernel/syscalls/syscallhdr.sh     |    2 +-
 9 files changed, 9 insertions(+), 11 deletions(-)

--- a/arch/alpha/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/alpha/kernel/syscalls/syscallhdr.sh
@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
--- a/arch/ia64/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/ia64/kernel/syscalls/syscallhdr.sh
@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
--- a/arch/microblaze/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/microblaze/kernel/syscalls/syscallhdr.sh
@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
--- a/arch/mips/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/mips/kernel/syscalls/syscallhdr.sh
@@ -32,6 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
-	printf "\n"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
--- a/arch/parisc/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/parisc/kernel/syscalls/syscallhdr.sh
@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
--- a/arch/powerpc/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/powerpc/kernel/syscalls/syscallhdr.sh
@@ -32,6 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
-	printf "\n"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
--- a/arch/sh/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/sh/kernel/syscalls/syscallhdr.sh
@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
--- a/arch/sparc/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/sparc/kernel/syscalls/syscallhdr.sh
@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
--- a/arch/xtensa/kernel/syscalls/syscallhdr.sh~asm-generic-fix-unistd_32h-generation-format
+++ a/arch/xtensa/kernel/syscalls/syscallhdr.sh
@@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my
 	printf "#define __NR_syscalls\t%s\n" "${nxt}"
 	printf "#endif\n"
 	printf "\n"
-	printf "#endif /* %s */" "${fileguard}"
+	printf "#endif /* %s */\n" "${fileguard}"
 ) > "$out"
_

Patches currently in -mm which might be from michal.simek@xilinx.com are

asm-generic-fix-unistd_32h-generation-format.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (140 preceding siblings ...)
  2020-02-12 22:08 ` + asm-generic-fix-unistd_32h-generation-format.patch " Andrew Morton
@ 2020-02-12 22:26 ` Andrew Morton
  2020-02-12 22:34 ` + lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch " Andrew Morton
                   ` (96 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 22:26 UTC (permalink / raw)
  To: jack, josef, minchan, mm-commits, snazy


The patch titled
     Subject: mm/filemap.c: don't bother dropping mmap_sem for zero size readahead
has been added to the -mm tree.  Its filename is
     mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Jan Kara <jack@suse.cz>
Subject: mm/filemap.c: don't bother dropping mmap_sem for zero size readahead

When handling a page fault, we drop mmap_sem to start async readahead so
that we don't block on IO submission with mmap_sem held.  However, there
is no point in dropping mmap_sem when readahead is disabled, i.e. when
the per-file readahead window ra->ra_pages is 0.  Handle that case to
avoid pointlessly dropping mmap_sem and retrying the fault.  This was
actually reported to block mlockall(MCL_CURRENT) indefinitely.

Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.cz
Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Minchan Kim <minchan@kernel.org>
Reported-by: Robert Stupp <snazy@gmx.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/filemap.c~mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead
+++ a/mm/filemap.c
@@ -2419,7 +2419,7 @@ static struct file *do_async_mmap_readah
 	pgoff_t offset = vmf->pgoff;
 
 	/* If we don't want any read-ahead, don't bother */
-	if (vmf->vma->vm_flags & VM_RAND_READ)
+	if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
 		return fpin;
 	mmap_miss = READ_ONCE(ra->mmap_miss);
 	if (mmap_miss)
_

Patches currently in -mm which might be from jack@suse.cz are

mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (141 preceding siblings ...)
  2020-02-12 22:26 ` + mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch " Andrew Morton
@ 2020-02-12 22:34 ` Andrew Morton
  2020-02-12 22:59 ` + mm-allocate-shrinker_map-on-appropriate-numa-node.patch " Andrew Morton
                   ` (95 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 22:34 UTC (permalink / raw)
  To: akinobu.mita, geert+renesas, mm-commits


The patch titled
     Subject: lib/scatterlist: fix sg_copy_buffer() kerneldoc
has been added to the -mm tree.  Its filename is
     lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Geert Uytterhoeven <geert+renesas@glider.be>
Subject: lib/scatterlist: fix sg_copy_buffer() kerneldoc

Add the missing closing parenthesis to the description for the to_buffer
parameter of sg_copy_buffer().

Link: http://lkml.kernel.org/r/20200212084241.8778-1-geert+renesas@glider.be
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/scatterlist.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/scatterlist.c~lib-scatterlist-fix-sg_copy_buffer-kerneldoc
+++ a/lib/scatterlist.c
@@ -832,7 +832,7 @@ EXPORT_SYMBOL(sg_miter_stop);
  * @buflen:		 The number of bytes to copy
  * @skip:		 Number of bytes to skip before copying
  * @to_buffer:		 transfer direction (true == from an sg list to a
- *			 buffer, false == from a buffer to an sg list
+ *			 buffer, false == from a buffer to an sg list)
  *
  * Returns the number of copied bytes.
  *
_

Patches currently in -mm which might be from geert+renesas@glider.be are

lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-allocate-shrinker_map-on-appropriate-numa-node.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (142 preceding siblings ...)
  2020-02-12 22:34 ` + lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch " Andrew Morton
@ 2020-02-12 22:59 ` Andrew Morton
  2020-02-12 23:05 ` + init-cleanup-anon_inodes-and-old-io-schedulers-options.patch " Andrew Morton
                   ` (94 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 22:59 UTC (permalink / raw)
  To: david, guro, hannes, ktkhai, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node
has been added to the -mm tree.  Its filename is
     mm-allocate-shrinker_map-on-appropriate-numa-node.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-allocate-shrinker_map-on-appropriate-numa-node.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-allocate-shrinker_map-on-appropriate-numa-node.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Kirill Tkhai <ktkhai@virtuozzo.com>
Subject: mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node

The shrinker_map may be touched from any CPU (e.g., a bit there may be set
by a task running anywhere), but kswapd is always bound to a specific
node.  So allocate the shrinker_map from the related NUMA node to respect
its NUMA locality.  This also follows the generic way we use for
allocating memcg's per-node data.

Link: http://lkml.kernel.org/r/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c~mm-allocate-shrinker_map-on-appropriate-numa-node
+++ a/mm/memcontrol.c
@@ -334,7 +334,7 @@ static int memcg_expand_one_shrinker_map
 		if (!old)
 			return 0;
 
-		new = kvmalloc(sizeof(*new) + size, GFP_KERNEL);
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
 		if (!new)
 			return -ENOMEM;
 
@@ -378,7 +378,7 @@ static int memcg_alloc_shrinker_maps(str
 	mutex_lock(&memcg_shrinker_map_mutex);
 	size = memcg_shrinker_map_size;
 	for_each_node(nid) {
-		map = kvzalloc(sizeof(*map) + size, GFP_KERNEL);
+		map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
 		if (!map) {
 			memcg_free_shrinker_maps(memcg);
 			ret = -ENOMEM;
_

Patches currently in -mm which might be from ktkhai@virtuozzo.com are

mm-allocate-shrinker_map-on-appropriate-numa-node.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + init-cleanup-anon_inodes-and-old-io-schedulers-options.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (143 preceding siblings ...)
  2020-02-12 22:59 ` + mm-allocate-shrinker_map-on-appropriate-numa-node.patch " Andrew Morton
@ 2020-02-12 23:05 ` Andrew Morton
  2020-02-12 23:10 ` + checkpatch-check-spdx-tags-in-yaml-files.patch " Andrew Morton
                   ` (93 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 23:05 UTC (permalink / raw)
  To: gregkh, krzk, mm-commits, tglx, yamada.masahiro


The patch titled
     Subject: init/Kconfig: clean up ANON_INODES and old IO schedulers options
has been added to the -mm tree.  Its filename is
     init-cleanup-anon_inodes-and-old-io-schedulers-options.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/init-cleanup-anon_inodes-and-old-io-schedulers-options.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/init-cleanup-anon_inodes-and-old-io-schedulers-options.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Krzysztof Kozlowski <krzk@kernel.org>
Subject: init/Kconfig: clean up ANON_INODES and old IO schedulers options

CONFIG_ANON_INODES is gone since commit 5dd50aaeb185 ("Make anon_inodes
unconditional").

CONFIG_CFQ_GROUP_IOSCHED was replaced with CONFIG_BFQ_GROUP_IOSCHED in
commit f382fb0bcef4 ("block: remove legacy IO schedulers").

Link: http://lkml.kernel.org/r/20200130192419.3026-1-krzk@kernel.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/Kconfig |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/init/Kconfig~init-cleanup-anon_inodes-and-old-io-schedulers-options
+++ a/init/Kconfig
@@ -870,7 +870,7 @@ config BLK_CGROUP
 	This option only enables generic Block IO controller infrastructure.
 	One needs to also enable actual IO controlling logic/policy. For
 	enabling proportional weight division of disk bandwidth in CFQ, set
-	CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
+	CONFIG_BFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
 	CONFIG_BLK_DEV_THROTTLING=y.
 
 	See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
@@ -1536,7 +1536,6 @@ config AIO
 
 config IO_URING
 	bool "Enable IO uring support" if EXPERT
-	select ANON_INODES
 	select IO_WQ
 	default y
 	help
_

Patches currently in -mm which might be from krzk@kernel.org are

init-cleanup-anon_inodes-and-old-io-schedulers-options.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + checkpatch-check-spdx-tags-in-yaml-files.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (144 preceding siblings ...)
  2020-02-12 23:05 ` + init-cleanup-anon_inodes-and-old-io-schedulers-options.patch " Andrew Morton
@ 2020-02-12 23:10 ` Andrew Morton
  2020-02-13  2:16 ` + drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch " Andrew Morton
                   ` (92 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-12 23:10 UTC (permalink / raw)
  To: joe, lkundrak, mm-commits, robh


The patch titled
     Subject: checkpatch: check SPDX tags in YAML files
has been added to the -mm tree.  Its filename is
     checkpatch-check-spdx-tags-in-yaml-files.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/checkpatch-check-spdx-tags-in-yaml-files.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/checkpatch-check-spdx-tags-in-yaml-files.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Lubomir Rintel <lkundrak@v3.sk>
Subject: checkpatch: check SPDX tags in YAML files

This adds a warning when a YAML file lacks an SPDX header on its first
line, or uses an incorrect commenting style.

Currently the only YAML files in the tree are Devicetree binding
documents.
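
For reference, a conforming YAML file starts with a '#'-commented SPDX
tag on its very first line, along these lines (the license ID here is
just an example):

	# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
	%YAML 1.2
	---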

Link: http://lkml.kernel.org/r/20200129123356.388669-1-lkundrak@v3.sk
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Joe Perches <joe@perches.com>
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/scripts/checkpatch.pl~checkpatch-check-spdx-tags-in-yaml-files
+++ a/scripts/checkpatch.pl
@@ -3120,7 +3120,7 @@ sub process {
 					$comment = '/*';
 				} elsif ($realfile =~ /\.(c|dts|dtsi)$/) {
 					$comment = '//';
-				} elsif (($checklicenseline == 2) || $realfile =~ /\.(sh|pl|py|awk|tc)$/) {
+				} elsif (($checklicenseline == 2) || $realfile =~ /\.(sh|pl|py|awk|tc|yaml)$/) {
 					$comment = '#';
 				} elsif ($realfile =~ /\.rst$/) {
 					$comment = '..';
_

Patches currently in -mm which might be from lkundrak@v3.sk are

checkpatch-check-spdx-tags-in-yaml-files.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (145 preceding siblings ...)
  2020-02-12 23:10 ` + checkpatch-check-spdx-tags-in-yaml-files.patch " Andrew Morton
@ 2020-02-13  2:16 ` Andrew Morton
  2020-02-13  2:56 ` + mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch " Andrew Morton
                   ` (91 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-13  2:16 UTC (permalink / raw)
  To: dan.j.williams, david, gregkh, heiko.carstens, kzak, mhocko,
	mhocko, mm-commits, ndfont, pbadari, rafael, rcj


The patch titled
     Subject: drivers/base/memory.c: indicate all memory blocks as removable
has been added to the -mm tree.  Its filename is
     drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: David Hildenbrand <david@redhat.com>
Subject: drivers/base/memory.c: indicate all memory blocks as removable

We see multiple issues with the implementation/interface to compute
whether a memory block can be offlined (exposed via
/sys/devices/system/memory/memoryX/removable) and would like to simplify
it (remove the implementation).

1. It runs basically lockless. While this might be good for performance,
   we see possible races with memory offlining that will require at least
   some sort of locking to fix.

2. Nowadays, more false positives are possible. No arch-specific checks
   are performed to validate whether memory offlining would be denied
   right away (and such checks would require locking). For example, arm64
   won't allow offlining any memory block that was added during boot -
   which implies a very high error rate. Other archs have other
   constraints.

3. The interface is inherently racy. E.g., if a memory block is
   detected to be removable (and was not a false positive at that time),
   there is still no guarantee that offlining will actually succeed. So
   any caller already has to deal with false positives.

4. It is unclear what performance benefit this interface actually
   provides. The introducing commit 5c755e9fd813 ("memory-hotplug: add
   sysfs removable attribute for hotplug memory remove") mentioned
	"A user-level agent must be able to identify which sections of
	 memory are likely to be removable before attempting the
	 potentially expensive operation."
   However, no actual performance comparison was included.

Known users:
- lsmem: Will group memory blocks based on the "removable" property. [1]
- chmem: Indirect user. It has a RANGE mode where one can specify
	 removable ranges identified via lsmem to be offlined. However, it
	 also has a "SIZE" mode, which allows a sysadmin to skip the manual
	 "identify removable blocks" step. [2]
- powerpc-utils: Uses the "removable" attribute to skip some memory
		 blocks right away when trying to find some to
		 offline+remove. However, with ballooning enabled, it
		 already skips this information completely (because it
		 once resulted in many false negatives). Therefore, the
		 implementation can deal with false positives properly
		 already. [3]

According to Nathan Fontenot, DLPAR on powerpc is no longer driven from
userspace via the drmgr command (powerpc-utils).  Nowadays it's managed
in the kernel - including onlining/offlining of memory blocks -
triggered by drmgr writing to /sys/kernel/dlpar.  So the affected legacy
userspace handling is only active on old kernels.  Only very old
versions of drmgr on a new kernel (unlikely) might execute slower -
totally acceptable.

With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
break any user space tool.  We implement a very bad heuristic now.
Without CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report
"not removable" as before.

Original discussion can be found in [4] ("[PATCH RFC v1] mm:
is_mem_section_removable() overhaul").

Other users of is_mem_section_removable() will be removed next, so that
we can remove is_mem_section_removable() completely.

[1] http://man7.org/linux/man-pages/man1/lsmem.1.html
[2] http://man7.org/linux/man-pages/man8/chmem.8.html
[3] https://github.com/ibm-power-utilities/powerpc-utils
[4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com

Link: http://lkml.kernel.org/r/20200128093542.6908-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Nathan Fontenot <ndfont@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Karel Zak <kzak@redhat.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c |   23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

--- a/drivers/base/memory.c~drivers-base-memoryc-indicate-all-memory-blocks-as-removable
+++ a/drivers/base/memory.c
@@ -105,30 +105,13 @@ static ssize_t phys_index_show(struct de
 }
 
 /*
- * Show whether the memory block is likely to be offlineable (or is already
- * offline). Once offline, the memory block could be removed. The return
- * value does, however, not indicate that there is a way to remove the
- * memory block.
+ * Legacy interface that we cannot remove. Always indicate "removable"
+ * with CONFIG_MEMORY_HOTREMOVE - bad heuristic.
  */
 static ssize_t removable_show(struct device *dev, struct device_attribute *attr,
 			      char *buf)
 {
-	struct memory_block *mem = to_memory_block(dev);
-	unsigned long pfn;
-	int ret = 1, i;
-
-	if (mem->state != MEM_ONLINE)
-		goto out;
-
-	for (i = 0; i < sections_per_block; i++) {
-		if (!present_section_nr(mem->start_section_nr + i))
-			continue;
-		pfn = section_nr_to_pfn(mem->start_section_nr + i);
-		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
-	}

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (146 preceding siblings ...)
  2020-02-13  2:16 ` + drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch " Andrew Morton
@ 2020-02-13  2:56 ` Andrew Morton
  2020-02-13  2:56 ` + mm-add-vm_insert_pages.patch " Andrew Morton
                   ` (90 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-13  2:56 UTC (permalink / raw)
  To: arjunroy, davem, edumazet, mm-commits, soheil


The patch titled
     Subject: mm/memory.c: refactor insert_page to prepare for batched-lock insert
has been added to the -mm tree.  Its filename is
     mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Arjun Roy <arjunroy@google.com>
Subject: mm/memory.c: refactor insert_page to prepare for batched-lock insert

Add helper methods for vm_insert_page()/insert_page() to prepare for
vm_insert_pages(), which batch-inserts pages to reduce spinlock operations
when inserting multiple consecutive pages into the user page table.

The intention of this patch-set is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.
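
In other words, the series moves from lock-per-page to lock-per-batch.
A generic sketch of the idea (made-up helper names, not the kernel code):

	/* before: one PTE-lock round trip per page */
	for (i = 0; i < n; i++) {
		spin_lock(ptl);
		insert_one_locked(pages[i]);
		spin_unlock(ptl);
	}

	/* after: one round trip for the whole batch */
	spin_lock(ptl);
	for (i = 0; i < n; i++)
		insert_one_locked(pages[i]);
	spin_unlock(ptl);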

Link: http://lkml.kernel.org/r/20200128025958.43490-1-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |   39 ++++++++++++++++++++++++---------------
 1 file changed, 24 insertions(+), 15 deletions(-)

--- a/mm/memory.c~mm-refactor-insert_page-to-prepare-for-batched-lock-insert
+++ a/mm/memory.c
@@ -1430,6 +1430,27 @@ pte_t *__get_locked_pte(struct mm_struct
 	return pte_alloc_map_lock(mm, pmd, addr, ptl);
 }
 
+static int validate_page_before_insert(struct page *page)
+{
+	if (PageAnon(page) || PageSlab(page) || page_has_type(page))
+		return -EINVAL;
+	flush_dcache_page(page);
+	return 0;
+}
+
+static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
+			unsigned long addr, struct page *page, pgprot_t prot)
+{
+	if (!pte_none(*pte))
+		return -EBUSY;
+	/* Ok, finally just insert the thing.. */
+	get_page(page);
+	inc_mm_counter_fast(mm, mm_counter_file(page));
+	page_add_file_rmap(page, false);
+	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	return 0;
+}
+
 /*
  * This is the old fallback for page remapping.
  *
@@ -1445,26 +1466,14 @@ static int insert_page(struct vm_area_st
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	retval = -EINVAL;
-	if (PageAnon(page) || PageSlab(page) || page_has_type(page))
+	retval = validate_page_before_insert(page);
+	if (retval)
 		goto out;
 	retval = -ENOMEM;
-	flush_dcache_page(page);
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
 		goto out;
-	retval = -EBUSY;
-	if (!pte_none(*pte))
-		goto out_unlock;
-
-	/* Ok, finally just insert the thing.. */
-	get_page(page);
-	inc_mm_counter_fast(mm, mm_counter_file(page));
-	page_add_file_rmap(page, false);
-	set_pte_at(mm, addr, pte, mk_pte(page, prot));

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-add-vm_insert_pages.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (147 preceding siblings ...)
  2020-02-13  2:56 ` + mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch " Andrew Morton
@ 2020-02-13  2:56 ` Andrew Morton
  2020-02-13  2:57 ` + mm-add-vm_insert_pages-fix.patch " Andrew Morton
                   ` (89 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-13  2:56 UTC (permalink / raw)
  To: arjunroy, davem, edumazet, mm-commits, soheil


The patch titled
     Subject: mm/memory.c: add vm_insert_pages()
has been added to the -mm tree.  Its filename is
     mm-add-vm_insert_pages.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-add-vm_insert_pages.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-add-vm_insert_pages.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Arjun Roy <arjunroy@google.com>
Subject: mm/memory.c: add vm_insert_pages()

Add the ability to insert multiple pages at once to a user VM with lower
PTE spinlock operations.

The intention of this patch-set is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.
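
A hedged usage sketch from a hypothetical caller; the in/out semantics
of 'num' follow the kernel-doc in the diff below, everything else
(names, counts) is illustrative:

	unsigned long remaining = total;	/* in: pages to map */
	int err;

	err = vm_insert_pages(vma, vaddr, pages, &remaining);
	/* out: remaining = pages NOT mapped; 0 means full success */
	if (err)
		pr_warn("mapped %lu of %lu pages, err %d\n",
			total - remaining, total, err);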

Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    2 
 mm/memory.c        |  111 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 111 insertions(+), 2 deletions(-)

--- a/include/linux/mm.h~mm-add-vm_insert_pages
+++ a/include/linux/mm.h
@@ -2597,6 +2597,8 @@ struct vm_area_struct *find_extend_vma(s
 int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
 			unsigned long pfn, unsigned long size, pgprot_t);
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
+int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num);
 int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
 				unsigned long num);
 int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
--- a/mm/memory.c~mm-add-vm_insert_pages
+++ a/mm/memory.c
@@ -1407,8 +1407,7 @@ void zap_vma_ptes(struct vm_area_struct
 }
 EXPORT_SYMBOL_GPL(zap_vma_ptes);
 
-pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
-			spinlock_t **ptl)
+static pmd_t *walk_to_pmd(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -1427,6 +1426,16 @@ pte_t *__get_locked_pte(struct mm_struct
 		return NULL;
 
 	VM_BUG_ON(pmd_trans_huge(*pmd));
+	return pmd;
+}
+
+pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
+			spinlock_t **ptl)
+{
+	pmd_t *pmd = walk_to_pmd(mm, addr);
+
+	if (!pmd)
+		return NULL;
 	return pte_alloc_map_lock(mm, pmd, addr, ptl);
 }
 
@@ -1451,6 +1460,15 @@ static int insert_page_into_pte_locked(s
 	return 0;
 }
 
+static int insert_page_in_batch_locked(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, struct page *page, pgprot_t prot)
+{
+	const int err = validate_page_before_insert(page);
+
+	return err ? err : insert_page_into_pte_locked(
+		mm, pte_offset_map(pmd, addr), addr, page, prot);
+}
+
 /*
  * This is the old fallback for page remapping.
  *
@@ -1479,6 +1497,95 @@ out:
 	return retval;
 }
 
+/* insert_pages() amortizes the cost of spinlock operations
+ * when inserting pages in a loop.
+ */
+static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num, pgprot_t prot)
+{
+	pmd_t *pmd = NULL;
+	spinlock_t *pte_lock = NULL;
+	struct mm_struct *const mm = vma->vm_mm;
+	unsigned long curr_page_idx = 0;
+	unsigned long remaining_pages_total = *num;
+	unsigned long pages_to_write_in_pmd;
+	int ret;
+more:
+	ret = -EFAULT;
+	pmd = walk_to_pmd(mm, addr);
+	if (!pmd)
+		goto out;
+
+	pages_to_write_in_pmd = min_t(unsigned long,
+		remaining_pages_total, PTRS_PER_PTE - pte_index(addr));
+
+	/* Allocate the PTE if necessary; takes PMD lock once only. */
+	ret = -ENOMEM;
+	if (pte_alloc(mm, pmd, addr))
+		goto out;
+	pte_lock = pte_lockptr(mm, pmd);
+
+	while (pages_to_write_in_pmd) {
+		int pte_idx = 0;
+		const int batch_size = min_t(int, pages_to_write_in_pmd, 8);
+
+		spin_lock(pte_lock);
+		for (; pte_idx < batch_size; ++pte_idx) {
+			int err = insert_page_in_batch_locked(mm, pmd,
+				addr, pages[curr_page_idx], prot);
+			if (unlikely(err)) {
+				spin_unlock(pte_lock);
+				ret = err;
+				remaining_pages_total -= pte_idx;
+				goto out;
+			}
+			addr += PAGE_SIZE;
+			++curr_page_idx;
+		}
+		spin_unlock(pte_lock);
+		pages_to_write_in_pmd -= batch_size;
+		remaining_pages_total -= batch_size;
+	}
+	if (remaining_pages_total)
+		goto more;
+	ret = 0;
+out:
+	*num = remaining_pages_total;
+	return ret;
+}
+
+/**
+ * vm_insert_pages - insert multiple pages into user vma, batching the pmd lock.
+ * @vma: user vma to map to
+ * @addr: target start user address of these pages
+ * @pages: source kernel pages
+ * @num: in: number of pages to map. out: number of pages that were *not*
+ * mapped. (0 means all pages were successfully mapped).
+ *
+ * Preferred over vm_insert_page() when inserting multiple pages.
+ *
+ * In case of error, we may have mapped a subset of the provided
+ * pages. It is the caller's responsibility to account for this case.
+ *
+ * The same restrictions apply as in vm_insert_page().
+ */
+int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
+			struct page **pages, unsigned long *num)
+{
+	const unsigned long end_addr = addr + (*num * PAGE_SIZE) - 1;
+
+	if (addr < vma->vm_start || end_addr >= vma->vm_end)
+		return -EFAULT;
+	if (!(vma->vm_flags & VM_MIXEDMAP)) {
+		BUG_ON(down_read_trylock(&vma->vm_mm->mmap_sem));
+		BUG_ON(vma->vm_flags & VM_PFNMAP);
+		vma->vm_flags |= VM_MIXEDMAP;
+	}
+	/* Defer page refcount checking till we're about to map that page. */
+	return insert_pages(vma, addr, pages, num, vma->vm_page_prot);
+}
+EXPORT_SYMBOL(vm_insert_pages);
+
 /**
  * vm_insert_page - insert single page into user vma
  * @vma: user vma to map to
_

Patches currently in -mm which might be from arjunroy@google.com are

mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch
mm-add-vm_insert_pages.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-add-vm_insert_pages-fix.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (148 preceding siblings ...)
  2020-02-13  2:56 ` + mm-add-vm_insert_pages.patch " Andrew Morton
@ 2020-02-13  2:57 ` Andrew Morton
  2020-02-13  2:57 ` + net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch " Andrew Morton
                   ` (88 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-13  2:57 UTC (permalink / raw)
  To: akpm, arjunroy, davem, edumazet, mm-commits, soheil


The patch titled
     Subject: mm-add-vm_insert_pages-fix
has been added to the -mm tree.  Its filename is
     mm-add-vm_insert_pages-fix.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-add-vm_insert_pages-fix.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-add-vm_insert_pages-fix.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-add-vm_insert_pages-fix

pte_alloc() no longer takes the `addr' argument
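
That is (illustrative prototypes only - the real pte_alloc() is a macro,
and the address argument was dropped by an earlier mainline cleanup):

	/* before: */ pte_alloc(mm, pmd, addr);
	/* now:    */ pte_alloc(mm, pmd);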

Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memory.c~mm-add-vm_insert_pages-fix
+++ a/mm/memory.c
@@ -1521,7 +1521,7 @@ more:
 
 	/* Allocate the PTE if necessary; takes PMD lock once only. */
 	ret = -ENOMEM;
-	if (pte_alloc(mm, pmd, addr))
+	if (pte_alloc(mm, pmd))
 		goto out;
 	pte_lock = pte_lockptr(mm, pmd);
 
_

Patches currently in -mm which might be from akpm@linux-foundation.org are

mm.patch
mm-add-vm_insert_pages-fix.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
linux-next-rejects.patch
linux-next-fix.patch
drivers-tty-serial-sh-scic-suppress-warning.patch
kernel-forkc-export-kernel_thread-to-modules.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (149 preceding siblings ...)
  2020-02-13  2:57 ` + mm-add-vm_insert_pages-fix.patch " Andrew Morton
@ 2020-02-13  2:57 ` Andrew Morton
  2020-02-13  2:57 ` + net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch " Andrew Morton
                   ` (87 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-13  2:57 UTC (permalink / raw)
  To: arjunroy, davem, edumazet, mm-commits, soheil


The patch titled
     Subject: net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy
has been added to the -mm tree.  Its filename is
     net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Arjun Roy <arjunroy@google.com>
Subject: net-zerocopy: use vm_insert_pages() for tcp rcv zerocopy

Use vm_insert_pages() for tcp receive zerocopy.  Spin lock cycles (as
reported by perf) drop from a couple of percentage points to a fraction of
a percent.  This results in a roughly 6% increase in efficiency, measured
roughly as zerocopy receive count divided by CPU utilization.

The intention of this patchset is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.
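
The shape of the change is accumulate-then-flush.  A standalone sketch
with made-up helper names (the real flush is tcp_zerocopy_vm_insert_batch()
in the diff below):

	#define BATCH 8
	struct page *batch[BATCH];
	int n = 0, err = 0;

	while (have_more_frags() && !err) {	/* hypothetical predicate */
		batch[n++] = next_frag_page();	/* hypothetical accessor */
		if (n == BATCH) {
			err = flush_batch(batch, n);	/* one locked trip */
			n = 0;
		}
	}
	if (!err && n)
		err = flush_batch(batch, n);	/* don't forget the tail */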

Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 net/ipv4/tcp.c |   67 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 61 insertions(+), 6 deletions(-)

--- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy
+++ a/net/ipv4/tcp.c
@@ -1734,14 +1734,48 @@ int tcp_mmap(struct file *file, struct s
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
+					struct page **pages,
+					unsigned long pages_to_map,
+					unsigned long *insert_addr,
+					u32 *length_with_pending,
+					u32 *seq,
+					struct tcp_zerocopy_receive *zc)
+{
+	unsigned long pages_remaining = pages_to_map;
+	int bytes_mapped;
+	int ret;
+
+	ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining);
+	bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining);
+	/* Even if vm_insert_pages fails, it may have partially succeeded in
+	 * mapping (some but not all of the pages).
+	 */
+	*seq += bytes_mapped;
+	*insert_addr += bytes_mapped;
+	if (ret) {
+		/* But if vm_insert_pages did fail, we have to unroll some state
+		 * we speculatively touched before.
+		 */
+		const int bytes_not_mapped = PAGE_SIZE * pages_remaining;
+		*length_with_pending -= bytes_not_mapped;
+		zc->recv_skip_hint += bytes_not_mapped;
+	}
+	return ret;
+}
+
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc)
 {
 	unsigned long address = (unsigned long)zc->address;
 	u32 length = 0, seq, offset, zap_len;
+	#define PAGE_BATCH_SIZE 8
+	struct page *pages[PAGE_BATCH_SIZE];
 	const skb_frag_t *frags = NULL;
 	struct vm_area_struct *vma;
 	struct sk_buff *skb = NULL;
+	unsigned long pg_idx = 0;
+	unsigned long curr_addr;
 	struct tcp_sock *tp;
 	int inq;
 	int ret;
@@ -1774,8 +1808,20 @@ static int tcp_zerocopy_receive(struct s
 		zc->recv_skip_hint = zc->length;
 	}
 	ret = 0;
+	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
 		if (zc->recv_skip_hint < PAGE_SIZE) {
+			/* If we're here, finish the current batch. */
+			if (pg_idx) {
+				ret = tcp_zerocopy_vm_insert_batch(vma, pages,
+								   pg_idx,
+								   &curr_addr,
+								   &length,
+								   &seq, zc);
+				if (ret)
+					goto out;
+				pg_idx = 0;
+			}
 			if (skb) {
 				if (zc->recv_skip_hint > 0)
 					break;
@@ -1784,7 +1830,6 @@ static int tcp_zerocopy_receive(struct s
 			} else {
 				skb = tcp_recv_skb(sk, seq, &offset);
 			}
-
 			zc->recv_skip_hint = skb->len - offset;
 			offset -= skb_headlen(skb);
 			if ((int)offset < 0 || skb_has_frag_list(skb))
@@ -1808,14 +1853,24 @@ static int tcp_zerocopy_receive(struct s
 			zc->recv_skip_hint -= remaining;
 			break;
 		}
-		ret = vm_insert_page(vma, address + length,
-				     skb_frag_page(frags));
-		if (ret)
-			break;
+		pages[pg_idx] = skb_frag_page(frags);
+		pg_idx++;
 		length += PAGE_SIZE;
-		seq += PAGE_SIZE;
 		zc->recv_skip_hint -= PAGE_SIZE;
 		frags++;
+		if (pg_idx == PAGE_BATCH_SIZE) {
+			ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
+							   &curr_addr, &length,
+							   &seq, zc);
+			if (ret)
+				goto out;
+			pg_idx = 0;
+		}
+	}
+	if (pg_idx) {
+		ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
+						   &curr_addr, &length, &seq,
+						   zc);
 	}
 out:
 	up_read(&current->mm->mmap_sem);
_

Patches currently in -mm which might be from arjunroy@google.com are

mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch
mm-add-vm_insert_pages.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (150 preceding siblings ...)
  2020-02-13  2:57 ` + net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch " Andrew Morton
@ 2020-02-13  2:57 ` Andrew Morton
  2020-02-14  2:50 ` + mm-add-vm_insert_pages-2.patch " Andrew Morton
                   ` (86 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-13  2:57 UTC (permalink / raw)
  To: akpm, arjunroy, davem, edumazet, mm-commits, soheil


The patch titled
     Subject: net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix
has been added to the -mm tree.  Its filename is
     net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Andrew Morton <akpm@linux-foundation.org>
Subject: net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix

Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 net/ipv4/tcp.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/net/ipv4/tcp.c~net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix
+++ a/net/ipv4/tcp.c
@@ -1788,6 +1788,8 @@ static int tcp_zerocopy_receive(struct s
 
 	sock_rps_record_flow(sk);
 
+	tp = tcp_sk(sk);
+
 	down_read(&current->mm->mmap_sem);
 
 	ret = -EINVAL;
@@ -1796,7 +1798,6 @@ static int tcp_zerocopy_receive(struct s
 		goto out;
 	zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
 
-	tp = tcp_sk(sk);
 	seq = tp->copied_seq;
 	inq = tcp_inq(sk);
 	zc->length = min_t(u32, zc->length, inq);
_

Patches currently in -mm which might be from akpm@linux-foundation.org are

mm.patch
mm-add-vm_insert_pages-fix.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
linux-next-rejects.patch
linux-next-fix.patch
drivers-tty-serial-sh-scic-suppress-warning.patch
kernel-forkc-export-kernel_thread-to-modules.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-add-vm_insert_pages-2.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (151 preceding siblings ...)
  2020-02-13  2:57 ` + net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch " Andrew Morton
@ 2020-02-14  2:50 ` Andrew Morton
  2020-02-14  2:52 ` + mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch " Andrew Morton
                   ` (85 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  2:50 UTC (permalink / raw)
  To: arjunroy, davem, edumazet, mm-commits, soheil, willy


The patch titled
     Subject: add missing page_count() check to vm_insert_pages().
has been added to the -mm tree.  Its filename is
     mm-add-vm_insert_pages-2.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-add-vm_insert_pages-2.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-add-vm_insert_pages-2.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Arjun Roy <arjunroy@google.com>
Subject: add missing page_count() check to vm_insert_pages().

Add missing page_count() check to vm_insert_pages(), specifically inside
insert_page_in_batch_locked().  This was accidentally forgotten in the
original patchset.

See: https://marc.info/?l=linux-mm&m=158156166403807&w=2

The intention of this patch-set is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.
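
The reasoning, roughly: a page with page_count() == 0 is free (or being
freed), so mapping it into userspace would alias memory the allocator
may hand out again.  vm_insert_page() already refuses such pages; the
batched path now gains the same guard (sketch):

	if (!page_count(page))
		return -EINVAL;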

Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy <arjunroy@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/mm/memory.c~mm-add-vm_insert_pages-2
+++ a/mm/memory.c
@@ -1463,8 +1463,11 @@ static int insert_page_into_pte_locked(s
 static int insert_page_in_batch_locked(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, struct page *page, pgprot_t prot)
 {
-	const int err = validate_page_before_insert(page);
+	int err;
 
+	if (!page_count(page))
+		return -EINVAL;
+	err = validate_page_before_insert(page);
 	return err ? err : insert_page_into_pte_locked(
 		mm, pte_offset_map(pmd, addr), addr, page, prot);
 }
_

Patches currently in -mm which might be from arjunroy@google.com are

mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch
mm-add-vm_insert_pages.patch
mm-add-vm_insert_pages-2.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (152 preceding siblings ...)
  2020-02-14  2:50 ` + mm-add-vm_insert_pages-2.patch " Andrew Morton
@ 2020-02-14  2:52 ` Andrew Morton
  2020-02-14  2:52 ` + mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch " Andrew Morton
                   ` (84 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  2:52 UTC (permalink / raw)
  To: david, mhocko, mm-commits, richardw.yang


The patch titled
     Subject: mm/migrate.c: no need to check for i > start in do_pages_move()
has been added to the -mm tree.  Its filename is
     mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/migrate.c: no need to check for i > start in do_pages_move()

Patch series "cleanup on do_pages_move()", v5.

The logic in do_pages_move() is a little messy for the audience to read,
and it has some potential errors in handling the return value.  In
particular, there are three calls to do_move_pages_to_node() and
store_status() with almost the same form.

This patch set tries to make the code a little friendlier for the
audience by consolidating these calls.


This patch (of 4):

At this point, we always have i >= start.  If i == start, store_status()
will return 0.  So we can drop the check for i > start.

[david@redhat.com rephrase changelog]
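
To see why the guard was redundant, a sketch of store_status()
(simplified from mm/migrate.c; the exact body may differ slightly) -
with nr == 0 the loop never runs and 0 is returned:

	static int store_status(int __user *status, int start, int value,
				int nr)
	{
		while (nr-- > 0) {
			if (put_user(value, status + start))
				return -EFAULT;
			start++;
		}
		return 0;
	}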

Link: http://lkml.kernel.org/r/20200214003017.25558-2-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/mm/migrate.c~mm-migratec-no-need-to-check-for-i-start-in-do_pages_move
+++ a/mm/migrate.c
@@ -1675,11 +1675,9 @@ static int do_pages_move(struct mm_struc
 				err += nr_pages - i - 1;
 			goto out;
 		}
-		if (i > start) {
-			err = store_status(status, start, current_node, i - start);
-			if (err)
-				goto out;
-		}
+		err = store_status(status, start, current_node, i - start);
+		if (err)
+			goto out;
 		current_node = NUMA_NO_NODE;
 	}
 out_flush:
_

Patches currently in -mm which might be from richardw.yang@linux.intel.com are

mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch
mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch
mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch
mm-migratec-check-pagelist-in-move_pages_and_store_status.patch
mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (153 preceding siblings ...)
  2020-02-14  2:52 ` + mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch " Andrew Morton
@ 2020-02-14  2:52 ` Andrew Morton
  2020-02-14  2:52 ` + mm-migratec-check-pagelist-in-move_pages_and_store_status.patch " Andrew Morton
                   ` (83 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  2:52 UTC (permalink / raw)
  To: david, mhocko, mm-commits, richardw.yang


The patch titled
     Subject: mm/migrate.c: wrap do_move_pages_to_node() and store_status()
has been added to the -mm tree.  Its filename is
     mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/migrate.c: wrap do_move_pages_to_node() and store_status()

Usually, do_move_pages_to_node() and store_status() are used in
combination.  We have three similar call sites.

Let's provide a wrapper for both function calls -
move_pages_and_store_status - to make the calling code easier to maintain
and fix (as noted by Yang Shi, the return value handling of
do_move_pages_to_node() has a flaw).

[david@redhat.com rephrase changelog]
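
Each of the three call sites then collapses to the same two-liner
(sketch; the arguments are as in the hunks below):

	err = move_pages_and_store_status(mm, current_node, &pagelist,
					  status, start, i, nr_pages);
	if (err)
		goto out;
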
Link: http://lkml.kernel.org/r/20200214003017.25558-3-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |   61 +++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 32 deletions(-)

--- a/mm/migrate.c~mm-migratec-wrap-do_move_pages_to_node-and-store_status
+++ a/mm/migrate.c
@@ -1583,6 +1583,29 @@ out:
 	return err;
 }
 
+static int move_pages_and_store_status(struct mm_struct *mm, int node,
+		struct list_head *pagelist, int __user *status,
+		int start, int i, unsigned long nr_pages)
+{
+	int err;
+
+	err = do_move_pages_to_node(mm, pagelist, node);
+	if (err) {
+		/*
+		 * Positive err means the number of failed
+		 * pages to migrate.  Since we are going to
+		 * abort and return the number of non-migrated
+		 * pages, so need to incude the rest of the
+		 * nr_pages that have not been attempted as
+		 * well.
+		 */
+		if (err > 0)
+			err += nr_pages - i - 1;
+		return err;
+	}
+	return store_status(status, start, node, i - start);
+}
+
 /*
  * Migrate an array of page address onto an array of nodes and fill
  * the corresponding array of status.
@@ -1626,21 +1649,8 @@ static int do_pages_move(struct mm_struc
 			current_node = node;
 			start = i;
 		} else if (node != current_node) {
-			err = do_move_pages_to_node(mm, &pagelist, current_node);
-			if (err) {
-				/*
-				 * Positive err means the number of failed
-				 * pages to migrate.  Since we are going to
-				 * abort and return the number of non-migrated
-				 * pages, so need to incude the rest of the
-				 * nr_pages that have not been attempted as
-				 * well.
-				 */
-				if (err > 0)
-					err += nr_pages - i - 1;
-				goto out;
-			}
-			err = store_status(status, start, current_node, i - start);
+			err = move_pages_and_store_status(mm, current_node,
+					&pagelist, status, start, i, nr_pages);
 			if (err)
 				goto out;
 			start = i;
@@ -1669,13 +1679,8 @@ static int do_pages_move(struct mm_struc
 		if (err)
 			goto out_flush;
 
-		err = do_move_pages_to_node(mm, &pagelist, current_node);
-		if (err) {
-			if (err > 0)
-				err += nr_pages - i - 1;
-			goto out;
-		}
-		err = store_status(status, start, current_node, i - start);
+		err = move_pages_and_store_status(mm, current_node, &pagelist,
+				status, start, i, nr_pages);
 		if (err)
 			goto out;
 		current_node = NUMA_NO_NODE;
@@ -1685,16 +1690,8 @@ out_flush:
 		return err;
 
 	/* Make sure we do not overwrite the existing error */
-	err1 = do_move_pages_to_node(mm, &pagelist, current_node);
-	/*
-	 * Don't have to report non-attempted pages here since:
-	 *     - If the above loop is done gracefully all pages have been
-	 *       attempted.
-	 *     - If the above loop is aborted it means a fatal error
-	 *       happened, should return ret.
-	 */
-	if (!err1)
-		err1 = store_status(status, start, current_node, i - start);
+	err1 = move_pages_and_store_status(mm, current_node, &pagelist,
+				status, start, i, nr_pages);
 	if (err >= 0)
 		err = err1;
 out:
_

Patches currently in -mm which might be from richardw.yang@linux.intel.com are

mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch
mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch
mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch
mm-migratec-check-pagelist-in-move_pages_and_store_status.patch
mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-migratec-check-pagelist-in-move_pages_and_store_status.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (154 preceding siblings ...)
  2020-02-14  2:52 ` + mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch " Andrew Morton
@ 2020-02-14  2:52 ` Andrew Morton
  2020-02-14  2:52 ` + mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch " Andrew Morton
                   ` (82 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  2:52 UTC (permalink / raw)
  To: david, mhocko, mm-commits, richardw.yang


The patch titled
     Subject: mm/migrate.c: check pagelist in move_pages_and_store_status()
has been added to the -mm tree.  Its filename is
     mm-migratec-check-pagelist-in-move_pages_and_store_status.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-migratec-check-pagelist-in-move_pages_and_store_status.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-migratec-check-pagelist-in-move_pages_and_store_status.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/migrate.c: check pagelist in move_pages_and_store_status()

When the pagelist is empty, it is not necessary to do the move and store.
This also consolidates the empty-list check in one place.

Link: http://lkml.kernel.org/r/20200214003017.25558-4-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/mm/migrate.c~mm-migratec-check-pagelist-in-move_pages_and_store_status
+++ a/mm/migrate.c
@@ -1499,9 +1499,6 @@ static int do_move_pages_to_node(struct
 {
 	int err;
 
-	if (list_empty(pagelist))
-		return 0;
-
 	err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
 			MIGRATE_SYNC, MR_SYSCALL);
 	if (err)
@@ -1589,6 +1586,9 @@ static int move_pages_and_store_status(s
 {
 	int err;
 
+	if (list_empty(pagelist))
+		return 0;
+
 	err = do_move_pages_to_node(mm, pagelist, node);
 	if (err) {
 		/*
@@ -1686,9 +1686,6 @@ static int do_pages_move(struct mm_struc
 		current_node = NUMA_NO_NODE;
 	}
 out_flush:
-	if (list_empty(&pagelist))
-		return err;

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (155 preceding siblings ...)
  2020-02-14  2:52 ` + mm-migratec-check-pagelist-in-move_pages_and_store_status.patch " Andrew Morton
@ 2020-02-14  2:52 ` Andrew Morton
  2020-02-14  2:57 ` + mm-migratec-migrate-pg_readahead-flag.patch " Andrew Morton
                   ` (81 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  2:52 UTC (permalink / raw)
  To: david, mhocko, mm-commits, richardw.yang


The patch titled
     Subject: mm/migrate.c: unify "not queued for migration" handling in do_pages_move()
has been added to the -mm tree.  Its filename is
     mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/migrate.c: unify "not queued for migration" handling in do_pages_move()

It can currently happen that we store the status of a page twice:
* once when we detect that it is already on the target node, and
* once after we moved a bunch of pages, and a page that's already on the
  target node is contained in the current interval.

Let's simplify the code and always call do_move_pages_to_node() in case we
did not queue a page for migration.  Note that pages that are already on
the target node are not added to the pagelist and are, therefore, ignored
by do_move_pages_to_node() - there is no functional change.

The status of such a page is now only stored once.
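
One note on the replacement hunk: "err ? : current_node" uses the GNU C
conditional-with-omitted-operand extension, where "a ? : b" means
"a ? a : b".  Spelled out:

	/* store err when non-zero, otherwise the target node id */
	err = store_status(status, i, err ? err : current_node, 1);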

[david@redhat.com rephrase changelog]
Link: http://lkml.kernel.org/r/20200214003017.25558-5-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

--- a/mm/migrate.c~mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move
+++ a/mm/migrate.c
@@ -1664,18 +1664,16 @@ static int do_pages_move(struct mm_struc
 		err = add_page_for_migration(mm, addr, current_node,
 				&pagelist, flags & MPOL_MF_MOVE_ALL);
 
-		if (!err) {
-			/* The page is already on the target node */
-			err = store_status(status, i, current_node, 1);
-			if (err)
-				goto out_flush;
-			continue;
-		} else if (err > 0) {
+		if (err > 0) {
 			/* The page is successfully queued for migration */
 			continue;
 		}
 
-		err = store_status(status, i, err, 1);
+		/*
+		 * If the page is already on the target node (!err), store the
+		 * node, otherwise, store the err.
+		 */
+		err = store_status(status, i, err ? : current_node, 1);
 		if (err)
 			goto out_flush;
 
_

Patches currently in -mm which might be from richardw.yang@linux.intel.com are

mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch
mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch
mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch
mm-migratec-check-pagelist-in-move_pages_and_store_status.patch
mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-migratec-migrate-pg_readahead-flag.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (156 preceding siblings ...)
  2020-02-14  2:52 ` + mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch " Andrew Morton
@ 2020-02-14  2:57 ` Andrew Morton
  2020-02-14  2:57 ` + mm-migratec-migrate-pg_readahead-flag-fix.patch " Andrew Morton
                   ` (80 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  2:57 UTC (permalink / raw)
  To: mgorman, mhocko, mm-commits, willy, yang.shi


The patch titled
     Subject: mm/migrate.c: migrate PG_readahead flag
has been added to the -mm tree.  Its filename is
     mm-migratec-migrate-pg_readahead-flag.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-migratec-migrate-pg_readahead-flag.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-migratec-migrate-pg_readahead-flag.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yang Shi <yang.shi@linux.alibaba.com>
Subject: mm/migrate.c: migrate PG_readahead flag

Currently the migration code doesn't migrate the PG_readahead flag.
Theoretically this incurs a slight performance loss, as the application
might have to ramp its readahead back up again.  Even when this problem
occurs, it might be hidden by something else, since migration is typically
triggered by compaction or NUMA balancing, either of which should be more
noticeable.

Migrate the flag after end_page_writeback() since it may clear PG_reclaim
flag, which is the same bit as PG_readahead, for the new page.
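
The aliasing that forces this ordering, as a simplified excerpt in the
spirit of include/linux/page-flags.h (consult the header for the
authoritative layout):

	enum pageflags {
		/* ... */
		PG_reclaim,			/* to be reclaimed asap */
		/* ... */
		/* aliases: one storage bit, two meanings */
		PG_readahead = PG_reclaim,
		/* ... */
	};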

Link: http://lkml.kernel.org/r/1581640185-95731-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/mm/migrate.c~mm-migratec-migrate-pg_readahead-flag
+++ a/mm/migrate.c
@@ -647,6 +647,14 @@ void migrate_page_states(struct page *ne
 	if (PageWriteback(newpage))
 		end_page_writeback(newpage);
 
+	/*
+	 * PG_readahead share the same bit with PG_reclaim, the above
+	 * end_page_writeback() may clear PG_readahead mistakenly, so set
+	 * the bit after that.
+	 */
+	if (PageReadahead(page))
+		SetPageReadahead(newpage);
+
 	copy_page_owner(page, newpage);
 
 	mem_cgroup_migrate(page, newpage);
_

Patches currently in -mm which might be from yang.shi@linux.alibaba.com are

mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch
mm-vmpressure-use-mem_cgroup_is_root-api.patch
mm-migratec-migrate-pg_readahead-flag.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-migratec-migrate-pg_readahead-flag-fix.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (157 preceding siblings ...)
  2020-02-14  2:57 ` + mm-migratec-migrate-pg_readahead-flag.patch " Andrew Morton
@ 2020-02-14  2:57 ` Andrew Morton
  2020-02-14  2:58 ` + include-remove-highmemh-from-pagemaph.patch " Andrew Morton
                   ` (79 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  2:57 UTC (permalink / raw)
  To: akpm, mm-commits, yang.shi


The patch titled
     Subject: mm-migratec-migrate-pg_readahead-flag-fix
has been added to the -mm tree.  Its filename is
     mm-migratec-migrate-pg_readahead-flag-fix.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-migratec-migrate-pg_readahead-flag-fix.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-migratec-migrate-pg_readahead-flag-fix.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-migratec-migrate-pg_readahead-flag-fix

tweak comment

Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/migrate.c~mm-migratec-migrate-pg_readahead-flag-fix
+++ a/mm/migrate.c
@@ -648,9 +648,9 @@ void migrate_page_states(struct page *ne
 		end_page_writeback(newpage);
 
 	/*
-	 * PG_readahead share the same bit with PG_reclaim, the above
-	 * end_page_writeback() may clear PG_readahead mistakenly, so set
-	 * the bit after that.
+	 * PG_readahead shares the same bit with PG_reclaim.  The above
+	 * end_page_writeback() may clear PG_readahead mistakenly, so set the
+	 * bit after that.
 	 */
 	if (PageReadahead(page))
 		SetPageReadahead(newpage);
_

Patches currently in -mm which might be from akpm@linux-foundation.org are

mm.patch
mm-add-vm_insert_pages-fix.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
mm-migratec-migrate-pg_readahead-flag-fix.patch
linux-next-rejects.patch
linux-next-fix.patch
drivers-tty-serial-sh-scic-suppress-warning.patch
kernel-forkc-export-kernel_thread-to-modules.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + include-remove-highmemh-from-pagemaph.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (158 preceding siblings ...)
  2020-02-14  2:57 ` + mm-migratec-migrate-pg_readahead-flag-fix.patch " Andrew Morton
@ 2020-02-14  2:58 ` Andrew Morton
  2020-02-14  3:06 ` + mm-kmemleak-annotate-various-data-races-obj-ptr.patch " Andrew Morton
                   ` (78 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  2:58 UTC (permalink / raw)
  To: mm-commits, willy


The patch titled
     Subject: include: Remove highmem.h from pagemap.h
has been added to the -mm tree.  Its filename is
     include-remove-highmemh-from-pagemaph.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/include-remove-highmemh-from-pagemaph.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/include-remove-highmemh-from-pagemaph.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: include: Remove highmem.h from pagemap.h

pagemap.h doesn't need highmem.h itself.  Only two dozen users were
relying on pagemap to pull in highmem (usually for kmap), so fix them all
up and we can remove this header file dependency.
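
The typical pattern in the fixed-up files looks like this sketch (buffer
and page are stand-ins; such code previously compiled only because
pagemap.h dragged highmem.h in):

	#include <linux/highmem.h>	/* now required explicitly for kmap() */
	#include <linux/pagemap.h>

	void *addr = kmap(page);	/* map a (possibly highmem) page */
	memcpy(buffer, addr, PAGE_SIZE);
	kunmap(page);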

Link: http://lkml.kernel.org/r/20200213195643.31587-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/char/virtio_console.c                                      |    1 +
 drivers/media/pci/ivtv/ivtv-driver.h                               |    1 +
 drivers/media/v4l2-core/videobuf-dma-sg.c                          |    1 +
 drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c |    1 +
 fs/affs/affs.h                                                     |    1 +
 fs/ext2/namei.c                                                    |    1 +
 fs/freevxfs/vxfs_immed.c                                           |    1 +
 fs/freevxfs/vxfs_subr.c                                            |    1 +
 fs/hfs/btree.c                                                     |    1 +
 fs/jffs2/gc.c                                                      |    1 +
 fs/minix/minix.h                                                   |    1 +
 fs/ocfs2/symlink.c                                                 |    1 +
 fs/qnx6/qnx6.h                                                     |    1 +
 fs/read_write.c                                                    |    1 +
 fs/reiserfs/ioctl.c                                                |    1 +
 fs/reiserfs/tail_conversion.c                                      |    1 +
 fs/reiserfs/xattr.c                                                |    3 ++-
 fs/squashfs/file.c                                                 |    1 +
 fs/squashfs/symlink.c                                              |    1 +
 fs/sysv/namei.c                                                    |    1 +
 fs/udf/file.c                                                      |    1 +
 fs/ufs/ufs_fs.h                                                    |    1 +
 include/linux/pagemap.h                                            |    1 -
 kernel/bpf/stackmap.c                                              |    1 +
 lib/iov_iter.c                                                     |    1 +
 net/sunrpc/auth_gss/gss_krb5_wrap.c                                |    1 +
 net/sunrpc/cache.c                                                 |    1 +
 27 files changed, 27 insertions(+), 2 deletions(-)

--- a/drivers/char/virtio_console.c~include-remove-highmemh-from-pagemaph
+++ a/drivers/char/virtio_console.c
@@ -13,6 +13,7 @@
 #include <linux/fs.h>
 #include <linux/splice.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/poll.h>
--- a/drivers/media/pci/ivtv/ivtv-driver.h~include-remove-highmemh-from-pagemaph
+++ a/drivers/media/pci/ivtv/ivtv-driver.h
@@ -50,6 +50,7 @@
 #include <linux/list.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/scatterlist.h>
 #include <linux/kthread.h>
 #include <linux/mutex.h>
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~include-remove-highmemh-from-pagemaph
+++ a/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -25,6 +25,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/scatterlist.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
--- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c~include-remove-highmemh-from-pagemaph
+++ a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c
@@ -4,6 +4,7 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/errno.h>
+#include <linux/highmem.h>
 #include <linux/interrupt.h>
 #include <linux/pagemap.h>
 #include <linux/dma-mapping.h>
--- a/fs/affs/affs.h~include-remove-highmemh-from-pagemaph
+++ a/fs/affs/affs.h
@@ -8,6 +8,7 @@
 #include <linux/types.h>
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
+#include <linux/highmem.h>
 #include "amigaffs.h"
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
--- a/fs/ext2/namei.c~include-remove-highmemh-from-pagemaph
+++ a/fs/ext2/namei.c
@@ -31,6 +31,7 @@
  *        David S. Miller (davem@caip.rutgers.edu), 1995
  */
 
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/quotaops.h>
 #include "ext2.h"
--- a/fs/freevxfs/vxfs_immed.c~include-remove-highmemh-from-pagemaph
+++ a/fs/freevxfs/vxfs_immed.c
@@ -32,6 +32,7 @@
  */
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 
 #include "vxfs.h"
 #include "vxfs_extern.h"
--- a/fs/freevxfs/vxfs_subr.c~include-remove-highmemh-from-pagemaph
+++ a/fs/freevxfs/vxfs_subr.c
@@ -34,6 +34,7 @@
 #include <linux/buffer_head.h>
 #include <linux/kernel.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 
 #include "vxfs_extern.h"
 
--- a/fs/hfs/btree.c~include-remove-highmemh-from-pagemaph
+++ a/fs/hfs/btree.c
@@ -10,6 +10,7 @@
  */
 
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/log2.h>
 
--- a/fs/jffs2/gc.c~include-remove-highmemh-from-pagemaph
+++ a/fs/jffs2/gc.c
@@ -16,6 +16,7 @@
 #include <linux/mtd/mtd.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/crc32.h>
 #include <linux/compiler.h>
 #include <linux/stat.h>
--- a/fs/minix/minix.h~include-remove-highmemh-from-pagemaph
+++ a/fs/minix/minix.h
@@ -4,6 +4,7 @@
 
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/minix_fs.h>
 
 #define INODE_VERSION(inode)	minix_sb(inode->i_sb)->s_version
--- a/fs/ocfs2/symlink.c~include-remove-highmemh-from-pagemaph
+++ a/fs/ocfs2/symlink.c
@@ -38,6 +38,7 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/namei.h>
 
 #include <cluster/masklog.h>
--- a/fs/qnx6/qnx6.h~include-remove-highmemh-from-pagemaph
+++ a/fs/qnx6/qnx6.h
@@ -19,6 +19,7 @@
 
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 
 typedef __u16 __bitwise __fs16;
 typedef __u32 __bitwise __fs32;
--- a/fs/read_write.c~include-remove-highmemh-from-pagemaph
+++ a/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/export.h>
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
 #include <linux/mount.h>
--- a/fs/reiserfs/ioctl.c~include-remove-highmemh-from-pagemaph
+++ a/fs/reiserfs/ioctl.c
@@ -9,6 +9,7 @@
 #include <linux/time.h>
 #include <linux/uaccess.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/compat.h>
 
 /*
--- a/fs/reiserfs/tail_conversion.c~include-remove-highmemh-from-pagemaph
+++ a/fs/reiserfs/tail_conversion.c
@@ -6,6 +6,7 @@
 
 #include <linux/time.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/buffer_head.h>
 #include "reiserfs.h"
 
--- a/fs/reiserfs/xattr.c~include-remove-highmemh-from-pagemaph
+++ a/fs/reiserfs/xattr.c
@@ -37,11 +37,12 @@
 #include "reiserfs.h"
 #include <linux/capability.h>
 #include <linux/dcache.h>
-#include <linux/namei.h>
 #include <linux/errno.h>
 #include <linux/gfp.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/highmem.h>
+#include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/xattr.h>
 #include "xattr.h"
--- a/fs/squashfs/file.c~include-remove-highmemh-from-pagemaph
+++ a/fs/squashfs/file.c
@@ -33,6 +33,7 @@
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/mutex.h>
 
 #include "squashfs_fs.h"
--- a/fs/squashfs/symlink.c~include-remove-highmemh-from-pagemaph
+++ a/fs/squashfs/symlink.c
@@ -22,6 +22,7 @@
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/xattr.h>
 
 #include "squashfs_fs.h"
--- a/fs/sysv/namei.c~include-remove-highmemh-from-pagemaph
+++ a/fs/sysv/namei.c
@@ -13,6 +13,7 @@
  *  Copyright (C) 1997, 1998  Krzysztof G. Baranowski
  */
 
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include "sysv.h"
 
--- a/fs/udf/file.c~include-remove-highmemh-from-pagemaph
+++ a/fs/udf/file.c
@@ -33,6 +33,7 @@
 #include <linux/capability.h>
 #include <linux/errno.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/uio.h>
 
 #include "udf_i.h"
--- a/fs/ufs/ufs_fs.h~include-remove-highmemh-from-pagemaph
+++ a/fs/ufs/ufs_fs.h
@@ -35,6 +35,7 @@
 #include <linux/kernel.h>
 #include <linux/stat.h>
 #include <linux/fs.h>
+#include <linux/highmem.h>
 #include <linux/workqueue.h>
 
 #include <asm/div64.h>
--- a/include/linux/pagemap.h~include-remove-highmemh-from-pagemaph
+++ a/include/linux/pagemap.h
@@ -8,7 +8,6 @@
 #include <linux/mm.h>
 #include <linux/fs.h>
 #include <linux/list.h>
-#include <linux/highmem.h>
 #include <linux/compiler.h>
 #include <linux/uaccess.h>
 #include <linux/gfp.h>
--- a/kernel/bpf/stackmap.c~include-remove-highmemh-from-pagemaph
+++ a/kernel/bpf/stackmap.c
@@ -8,6 +8,7 @@
 #include <linux/perf_event.h>
 #include <linux/elf.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/irq_work.h>
 #include "percpu_freelist.h"
 
--- a/lib/iov_iter.c~include-remove-highmemh-from-pagemaph
+++ a/lib/iov_iter.c
@@ -3,6 +3,7 @@
 #include <linux/bvec.h>
 #include <linux/uio.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/splice.h>
--- a/net/sunrpc/auth_gss/gss_krb5_wrap.c~include-remove-highmemh-from-pagemaph
+++ a/net/sunrpc/auth_gss/gss_krb5_wrap.c
@@ -34,6 +34,7 @@
 #include <linux/sunrpc/gss_krb5.h>
 #include <linux/random.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 # define RPCDBG_FACILITY	RPCDBG_AUTH
--- a/net/sunrpc/cache.c~include-remove-highmemh-from-pagemaph
+++ a/net/sunrpc/cache.c
@@ -27,6 +27,7 @@
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <asm/ioctls.h>
 #include <linux/sunrpc/types.h>
 #include <linux/sunrpc/cache.h>
_

Patches currently in -mm which might be from willy@infradead.org are

mm-improve-dump_page-for-compound-pages.patch
include-remove-highmemh-from-pagemaph.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-kmemleak-annotate-various-data-races-obj-ptr.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (159 preceding siblings ...)
  2020-02-14  2:58 ` + include-remove-highmemh-from-pagemaph.patch " Andrew Morton
@ 2020-02-14  3:06 ` Andrew Morton
  2020-02-14  3:06 ` [to-be-updated] mm-kmemleak-annotate-a-data-race-in-checksum.patch removed from " Andrew Morton
                   ` (77 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  3:06 UTC (permalink / raw)
  To: cai, catalin.marinas, elver, mm-commits


The patch titled
     Subject: mm/kmemleak: annotate various data races obj->ptr
has been added to the -mm tree.  Its filename is
     mm-kmemleak-annotate-various-data-races-obj-ptr.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-kmemleak-annotate-various-data-races-obj-ptr.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-kmemleak-annotate-various-data-races-obj-ptr.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/kmemleak: annotate various data races obj->ptr

The value of object->pointer could be accessed concurrently as noticed
by KCSAN,

 write to 0xffffb0ea683a7d50 of 4 bytes by task 23575 on cpu 12:
  do_raw_spin_lock+0x114/0x200
  debug_spin_lock_after at kernel/locking/spinlock_debug.c:91
  (inlined by) do_raw_spin_lock at kernel/locking/spinlock_debug.c:115
  _raw_spin_lock+0x40/0x50
  __handle_mm_fault+0xa9e/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffffb0ea683a7d50 of 4 bytes by task 839 on cpu 60:
  crc32_le_base+0x67/0x350
  crc32_le_base+0x67/0x350:
  crc32_body at lib/crc32.c:106
  (inlined by) crc32_le_generic at lib/crc32.c:179
  (inlined by) crc32_le at lib/crc32.c:197
  kmemleak_scan+0x528/0xd90
  update_checksum at mm/kmemleak.c:1172
  (inlined by) kmemleak_scan at mm/kmemleak.c:1497
  kmemleak_scan_thread+0xcc/0xfa
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 write to 0xffff939bf07b95b8 of 4 bytes by interrupt on cpu 119:
  __free_object+0x884/0xcb0
  __free_object at lib/debugobjects.c:359
  __debug_check_no_obj_freed+0x19d/0x370
  debug_check_no_obj_freed+0x41/0x4b
  slab_free_freelist_hook+0xfb/0x1c0
  kmem_cache_free+0x10c/0x3a0
  free_object_rcu+0x1ca/0x260
  rcu_core+0x677/0xcc0
  rcu_core_si+0x17/0x20
  __do_softirq+0xd9/0x57c
  run_ksoftirqd+0x29/0x50
  smpboot_thread_fn+0x222/0x3f0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffff939bf07b95b8 of 8 bytes by task 838 on cpu 109:
  scan_block+0x69/0x190
  scan_block at mm/kmemleak.c:1250
  kmemleak_scan+0x249/0xd90
  scan_large_block at mm/kmemleak.c:1309
  (inlined by) kmemleak_scan at mm/kmemleak.c:1434
  kmemleak_scan_thread+0xcc/0xfa
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

crc32() will dereference object->pointer.  If a torn value is returned
due to a data race, it will be corrected in the next scan.  scan_block()
will dereference a range of addresses (e.g., percpu sections) to search
for valid pointers.  Even if a data race happens there, it causes no
issue because the code does not care about the exact value of a
non-pointer.  Thus, mark them as intentional data races using the
data_race() macro.
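
For context, a data_race()-style annotation boils down to something like
the following sketch; the real macro lives in include/linux/compiler.h
and its exact definition differs:

	#define data_race(expr)					\
	({							\
		typeof(expr) __val;				\
		kcsan_disable_current();	/* suppress KCSAN */	\
		__val = (expr);					\
		kcsan_enable_current();				\
		__val;						\
	})

The expression is evaluated with KCSAN instrumentation suppressed, so
the tool stops reporting the intentional racy access while the access
itself is unchanged.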

Link: http://lkml.kernel.org/r/1581615390-9720-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kmemleak.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/mm/kmemleak.c~mm-kmemleak-annotate-various-data-races-obj-ptr
+++ a/mm/kmemleak.c
@@ -1169,7 +1169,12 @@ static bool update_checksum(struct kmeml
 	u32 old_csum = object->checksum;
 
 	kasan_disable_current();
-	object->checksum = crc32(0, (void *)object->pointer, object->size);
+	/*
+	 * crc32() will dereference object->pointer. If an unstable value was
+	 * returned due to a data race, it will be corrected in the next scan.
+	 */
+	object->checksum = data_race(crc32(0, (void *)object->pointer,
+					   object->size));
 	kasan_enable_current();
 
 	return object->checksum != old_csum;
@@ -1243,7 +1248,7 @@ static void scan_block(void *_start, voi
 			break;
 
 		kasan_disable_current();
-		pointer = *ptr;
+		pointer = data_race(*ptr);
 		kasan_enable_current();
 
 		untagged_ptr = (unsigned long)kasan_reset_tag((void *)pointer);
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-various-data-races-obj-ptr.patch
mm-kmemleak-annotate-a-data-race-in-checksum.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-util-annotate-an-data-race-at-vm_committed_as.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [to-be-updated] mm-kmemleak-annotate-a-data-race-in-checksum.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (160 preceding siblings ...)
  2020-02-14  3:06 ` + mm-kmemleak-annotate-various-data-races-obj-ptr.patch " Andrew Morton
@ 2020-02-14  3:06 ` Andrew Morton
  2020-02-14  3:26 ` + mm-swapfile-fix-and-annotate-various-data-races-v2.patch added to " Andrew Morton
                   ` (76 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  3:06 UTC (permalink / raw)
  To: cai, catalin.marinas, elver, mm-commits


The patch titled
     Subject: mm/kmemleak.c: annotate a data race in checksum
has been removed from the -mm tree.  Its filename was
     mm-kmemleak-annotate-a-data-race-in-checksum.patch

This patch was dropped because an updated version will be merged

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/kmemleak.c: annotate a data race in checksum

The value of object->pointer could be accessed concurrently as noticed by
KCSAN,

 BUG: KCSAN: data-race in crc32_le_base / do_raw_spin_lock

 write to 0xffffb0ea683a7d50 of 4 bytes by task 23575 on cpu 12:
  do_raw_spin_lock+0x114/0x200
  debug_spin_lock_after at kernel/locking/spinlock_debug.c:91
  (inlined by) do_raw_spin_lock at kernel/locking/spinlock_debug.c:115
  _raw_spin_lock+0x40/0x50
  __handle_mm_fault+0xa9e/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 read to 0xffffb0ea683a7d50 of 4 bytes by task 839 on cpu 60:
  crc32_le_base+0x67/0x350
  crc32_le_base+0x67/0x350:
  crc32_body at lib/crc32.c:106
  (inlined by) crc32_le_generic at lib/crc32.c:179
  (inlined by) crc32_le at lib/crc32.c:197
  kmemleak_scan+0x528/0xd90
  update_checksum at mm/kmemleak.c:1172
  (inlined by) kmemleak_scan at mm/kmemleak.c:1497
  kmemleak_scan_thread+0xcc/0xfa
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 60 PID: 839 Comm: kmemleak Tainted: G        W    L 5.5.0-next-20200210+ #3
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

crc32() will dereference object->pointer.  If a torn value is returned
due to a data race, it will be corrected in the next scan.  Thus,
annotate it as an intentional data race using the data_race() macro.

Link: http://lkml.kernel.org/r/1581438245-24391-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kmemleak.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

--- a/mm/kmemleak.c~mm-kmemleak-annotate-a-data-race-in-checksum
+++ a/mm/kmemleak.c
@@ -1169,7 +1169,12 @@ static bool update_checksum(struct kmeml
 	u32 old_csum = object->checksum;
 
 	kasan_disable_current();
-	object->checksum = crc32(0, (void *)object->pointer, object->size);
+	/*
+	 * crc32() will dereference object->pointer. If an unstable value was
+	 * returned due to a data race, it will be corrected in the next scan.
+	 */
+	object->checksum = data_race(crc32(0, (void *)object->pointer,
+					   object->size));
 	kasan_enable_current();
 
 	return object->checksum != old_csum;
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-various-data-races-obj-ptr.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-util-annotate-an-data-race-at-vm_committed_as.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-swapfile-fix-and-annotate-various-data-races-v2.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (161 preceding siblings ...)
  2020-02-14  3:06 ` [to-be-updated] mm-kmemleak-annotate-a-data-race-in-checksum.patch removed from " Andrew Morton
@ 2020-02-14  3:26 ` Andrew Morton
  2020-02-14  3:27 ` + mm-page_io-mark-various-intentional-data-races-v2.patch " Andrew Morton
                   ` (75 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  3:26 UTC (permalink / raw)
  To: cai, elver, hughd, mm-commits


The patch titled
     Subject: mm-swapfile-fix-and-annotate-various-data-races-v2
has been added to the -mm tree.  Its filename is
     mm-swapfile-fix-and-annotate-various-data-races-v2.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-swapfile-fix-and-annotate-various-data-races-v2.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-swapfile-fix-and-annotate-various-data-races-v2.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm-swapfile-fix-and-annotate-various-data-races-v2

add a missing annotation for si->flags in memory.c

Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/memory.c~mm-swapfile-fix-and-annotate-various-data-races-v2
+++ a/mm/memory.c
@@ -2955,8 +2955,8 @@ vm_fault_t do_swap_page(struct vm_fault
 	if (!page) {
 		struct swap_info_struct *si = swp_swap_info(entry);
 
-		if (si->flags & SWP_SYNCHRONOUS_IO &&
-				__swap_count(entry) == 1) {
+		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
+		    __swap_count(entry) == 1) {
 			/* skip swapcache */
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 							vmf->address);
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-various-data-races-obj-ptr.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-swapfile-fix-and-annotate-various-data-races-v2.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-util-annotate-an-data-race-at-vm_committed_as.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-page_io-mark-various-intentional-data-races-v2.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (162 preceding siblings ...)
  2020-02-14  3:26 ` + mm-swapfile-fix-and-annotate-various-data-races-v2.patch added to " Andrew Morton
@ 2020-02-14  3:27 ` Andrew Morton
  2020-02-14  5:10 ` + checkpatch-support-base-commit-format.patch " Andrew Morton
                   ` (74 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  3:27 UTC (permalink / raw)
  To: cai, elver, mm-commits


The patch titled
     Subject: mm-page_io-mark-various-intentional-data-races-v2
has been added to the -mm tree.  Its filename is
     mm-page_io-mark-various-intentional-data-races-v2.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_io-mark-various-intentional-data-races-v2.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_io-mark-various-intentional-data-races-v2.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm-page_io-mark-various-intentional-data-races-v2

add a missing annotation

Link: http://lkml.kernel.org/r/1581612585-5812-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_io.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_io.c~mm-page_io-mark-various-intentional-data-races-v2
+++ a/mm/page_io.c
@@ -86,7 +86,7 @@ static void swap_slot_free_notify(struct
 		return;
 
 	sis = page_swap_info(page);
-	if (!(sis->flags & SWP_BLKDEV))
+	if (data_race(!(sis->flags & SWP_BLKDEV)))
 		return;
 
 	/*
_

Patches currently in -mm which might be from cai@lca.pw are

mm-kmemleak-annotate-various-data-races-obj-ptr.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-swapfile-fix-and-annotate-various-data-races-v2.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-util-annotate-an-data-race-at-vm_committed_as.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races-v2.patch
mm-swap_state-mark-various-intentional-data-races.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + checkpatch-support-base-commit-format.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (163 preceding siblings ...)
  2020-02-14  3:27 ` + mm-page_io-mark-various-intentional-data-races-v2.patch " Andrew Morton
@ 2020-02-14  5:10 ` Andrew Morton
  2020-02-14  5:11 ` + checkpatch-prefer-fallthrough-over-fallthrough-comments.patch " Andrew Morton
                   ` (73 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:10 UTC (permalink / raw)
  To: apw, corbet, jhubbard, joe, konstantin, mm-commits


The patch titled
     Subject: checkpatch: support "base-commit:" format
has been added to the -mm tree.  Its filename is
     checkpatch-support-base-commit-format.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/checkpatch-support-base-commit-format.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/checkpatch-support-base-commit-format.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: John Hubbard <jhubbard@nvidia.com>
Subject: checkpatch: support "base-commit:" format

In order to support the get-lore-mbox.py tool described in [1], I ran:

    git format-patch --base=<commit> --cover-letter <revrange>

...  which generated a "base-commit: <commit-hash>" tag at the end of the
cover letter.  However, checkpatch.pl generated an error upon encountering
"base-commit:" in the cover letter:

    "ERROR: Please use git commit description style..."

...  because it found the "commit" keyword and failed to recognize that
it was part of the "base-commit" phrase, which should not be subject to
the same commit description style rules.

Update checkpatch.pl to include a special case for "base-commit:" (at the
start of the line, possibly with some leading whitespace) so that that tag
no longer generates a checkpatch error.

[1] https://lwn.net/Articles/811528/ "Better tools for kernel
    developers"
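
As an illustration, the trailer that was tripping checkpatch sits at the
end of a generated cover letter and looks like this (the hash is a
made-up placeholder):

	base-commit: 0123456789abcdef0123456789abcdef01234567

With this change, checkpatch accepts the line instead of flagging it as
a malformed commit description.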

Link: http://lkml.kernel.org/r/20200213055004.69235-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Joe Perches <joe@perches.com>
Acked-by: Joe Perches <joe@perches.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/scripts/checkpatch.pl~checkpatch-support-base-commit-format
+++ a/scripts/checkpatch.pl
@@ -2794,7 +2794,7 @@ sub process {
 
 # Check for git id commit length and improperly formed commit descriptions
 		if ($in_commit_log && !$commit_log_possible_stack_dump &&
-		    $line !~ /^\s*(?:Link|Patchwork|http|https|BugLink):/i &&
+		    $line !~ /^\s*(?:Link|Patchwork|http|https|BugLink|base-commit):/i &&
 		    $line !~ /^This reverts commit [0-9a-f]{7,40}/ &&
 		    ($line =~ /\bcommit\s+[0-9a-f]{5,}\b/i ||
 		     ($line =~ /(?:\s|^)[0-9a-f]{12,40}(?:[\s"'\(\[]|$)/i &&
_

Patches currently in -mm which might be from jhubbard@nvidia.com are

mm-gup-split-get_user_pages_remote-into-two-routines.patch
mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
mm-introduce-page_ref_sub_return.patch
mm-gup-pass-gup-flags-to-two-more-routines.patch
mm-gup-require-foll_get-for-get_user_pages_fast.patch
mm-gup-track-foll_pin-pages.patch
mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch
checkpatch-support-base-commit-format.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + checkpatch-prefer-fallthrough-over-fallthrough-comments.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (164 preceding siblings ...)
  2020-02-14  5:10 ` + checkpatch-support-base-commit-format.patch " Andrew Morton
@ 2020-02-14  5:11 ` Andrew Morton
  2020-02-14  5:12 ` + uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch " Andrew Morton
                   ` (72 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:11 UTC (permalink / raw)
  To: joe, mm-commits


The patch titled
     Subject: checkpatch: prefer fallthrough; over fallthrough comments
has been added to the -mm tree.  Its filename is
     checkpatch-prefer-fallthrough-over-fallthrough-comments.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/checkpatch-prefer-fallthrough-over-fallthrough-comments.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/checkpatch-prefer-fallthrough-over-fallthrough-comments.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Joe Perches <joe@perches.com>
Subject: checkpatch: prefer fallthrough; over fallthrough comments

commit 294f69e662d1 ("compiler_attributes.h: Add 'fallthrough' pseudo
keyword for switch/case use") added the pseudo keyword, so add a test
for it to checkpatch.

The message is emitted as a warning when checking a patch, and only at
the --strict (CHK) level when checking whole files.
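
A sketch of the kind of code this affects (the case labels and helpers
are made up):

	switch (mode) {
	case MODE_A:
		setup_a();
		/* fall through */	/* old style: now triggers the test */
	case MODE_B:
		setup_b();
		fallthrough;		/* preferred pseudo keyword */
	default:
		break;
	}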

Link: http://lkml.kernel.org/r/8b6c1b9031ab9f3cdebada06b8d46467f1492d68.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

--- a/scripts/checkpatch.pl~checkpatch-prefer-fallthrough-over-fallthrough-comments
+++ a/scripts/checkpatch.pl
@@ -2308,6 +2308,19 @@ sub pos_last_openparen {
 	return length(expand_tabs(substr($line, 0, $last_openparen))) + 1;
 }
 
+sub get_raw_comment {
+	my ($line, $rawline) = @_;
+	my $comment = '';
+
+	for my $i (0 .. (length($line) - 1)) {
+		if (substr($line, $i, 1) eq "$;") {
+			$comment .= substr($rawline, $i, 1);
+		}
+	}
+
+	return $comment;
+}
+
 sub process {
 	my $filename = shift;
 
@@ -2469,6 +2482,7 @@ sub process {
 		$sline =~ s/$;/ /g;	#with comments as spaces
 
 		my $rawline = $rawlines[$linenr - 1];
+		my $raw_comment = get_raw_comment($line, $rawline);
 
 # check if it's a mode change, rename or start of a patch
 		if (!$in_commit_log &&
@@ -6422,6 +6436,28 @@ sub process {
 			}
 		}
 
+# check for /* fallthrough */ like comment, prefer fallthrough;
+		my @fallthroughs = (
+			'fallthrough',
+			'@fallthrough@',
+			'lint -fallthrough[ \t]*',
+			'intentional(?:ly)?[ \t]*fall(?:(?:s | |-)[Tt]|t)hr(?:ough|u|ew)',
+			'(?:else,?\s*)?FALL(?:S | |-)?THR(?:OUGH|U|EW)[ \t.!]*(?:-[^\n\r]*)?',
+			'Fall(?:(?:s | |-)[Tt]|t)hr(?:ough|u|ew)[ \t.!]*(?:-[^\n\r]*)?',
+			'fall(?:s | |-)?thr(?:ough|u|ew)[ \t.!]*(?:-[^\n\r]*)?',
+		    );
+		if ($raw_comment ne '') {
+			foreach my $ft (@fallthroughs) {
+				if ($raw_comment =~ /$ft/) {
+					my $msg_level = \&WARN;
+					$msg_level = \&CHK if ($file);
+					&{$msg_level}("PREFER_FALLTHROUGH",
+						      "Prefer 'fallthrough;' over fallthrough comment\n" . $herecurr);
+					last;
+				}
+			}
+		}
+
 # check for switch/default statements without a break;
 		if ($perl_version_ok &&
 		    defined $stat &&
_

Patches currently in -mm which might be from joe@perches.com are

get_maintainer-remove-uses-of-p-for-maintainer-name.patch
string-add-stracpy-and-stracpy_pad-mechanisms.patch
checkpatch-remove-email-address-comment-from-email-address-comparisons.patch
checkpatch-prefer-fallthrough-over-fallthrough-comments.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (165 preceding siblings ...)
  2020-02-14  5:11 ` + checkpatch-prefer-fallthrough-over-fallthrough-comments.patch " Andrew Morton
@ 2020-02-14  5:12 ` Andrew Morton
  2020-02-14  5:15 ` + lib-string-update-match_string-doc-strings-with-correct-behavior.patch " Andrew Morton
                   ` (71 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:12 UTC (permalink / raw)
  To: allison, borntraeger, joe, mm-commits, tglx, torsten.hilbrich,
	vilhelm.gray, yury.norov


The patch titled
     Subject: include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for swap
has been added to the -mm tree.  Its filename is
     uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Christian Borntraeger <borntraeger@de.ibm.com>
Subject: include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for swap

QEMU has a funny new build error message when I use the upstream kernel
headers:

  CC      block/file-posix.o
In file included from /home/cborntra/REPOS/qemu/include/qemu/timer.h:4,
                 from /home/cborntra/REPOS/qemu/include/qemu/timed-average.h:29,
                 from /home/cborntra/REPOS/qemu/include/block/accounting.h:28,
                 from /home/cborntra/REPOS/qemu/include/block/block_int.h:27,
                 from /home/cborntra/REPOS/qemu/block/file-posix.c:30:
/usr/include/linux/swab.h: In function `__swab':
/home/cborntra/REPOS/qemu/include/qemu/bitops.h:20:34: error: "sizeof" is not defined, evaluates to 0 [-Werror=undef]
   20 | #define BITS_PER_LONG           (sizeof (unsigned long) * BITS_PER_BYTE)
      |                                  ^~~~~~
/home/cborntra/REPOS/qemu/include/qemu/bitops.h:20:41: error: missing binary operator before token "("
   20 | #define BITS_PER_LONG           (sizeof (unsigned long) * BITS_PER_BYTE)
      |                                         ^
cc1: all warnings being treated as errors
make: *** [/home/cborntra/REPOS/qemu/rules.mak:69: block/file-posix.o] Error 1
rm tests/qemu-iotests/socket_scm_helper.o

This was triggered by commit d5767057c9a ("uapi: rename ext2_swab() to
swab() and share globally in swab.h").  That patch added

	#include <asm/bitsperlong.h>

but the code uses BITS_PER_LONG.  The kernel header asm/bitsperlong.h
provides only __BITS_PER_LONG.

Let us use the __ variant in swab.h.
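
A minimal userspace sketch of the breakage (built outside the kernel,
where only __BITS_PER_LONG is defined by the uapi headers):

	#include <stdio.h>
	#include <linux/swab.h>	/* pulls in asm/bitsperlong.h */

	int main(void)
	{
		unsigned long x = 0x12345678UL;

		/*
		 * Before the fix, "#if BITS_PER_LONG == 64" sees an
		 * undefined macro: an error with -Werror=undef, or a
		 * silently wrong __swab32() path on 64-bit otherwise.
		 */
		printf("%lx\n", (unsigned long)__swab(x));
		return 0;
	}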

Link: http://lkml.kernel.org/r/20200213142147.17604-1-borntraeger@de.ibm.com
Fixes: d5767057c9a ("uapi: rename ext2_swab() to swab() and share globally in swab.h")
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Joe Perches <joe@perches.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: William Breathitt Gray <vilhelm.gray@gmail.com>
Cc: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/uapi/linux/swab.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/uapi/linux/swab.h~uapi-fix-userspace-breakage-use-__bits_per_long-for-swap
+++ a/include/uapi/linux/swab.h
@@ -135,9 +135,9 @@ static inline __attribute_const__ __u32
 
 static __always_inline unsigned long __swab(const unsigned long y)
 {
-#if BITS_PER_LONG == 64
+#if __BITS_PER_LONG == 64
 	return __swab64(y);
-#else /* BITS_PER_LONG == 32 */
+#else /* __BITS_PER_LONG == 32 */
 	return __swab32(y);
 #endif
 }
_

Patches currently in -mm which might be from borntraeger@de.ibm.com are

uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-string-update-match_string-doc-strings-with-correct-behavior.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (166 preceding siblings ...)
  2020-02-14  5:12 ` + uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch " Andrew Morton
@ 2020-02-14  5:15 ` Andrew Morton
  2020-02-14  5:16 ` + mm-vmscan-replace-open-codings-to-numa_no_node.patch " Andrew Morton
                   ` (70 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:15 UTC (permalink / raw)
  To: alexandru.ardelean, andriy.shevchenko, keescook, mm-commits, tobin


The patch titled
     Subject: lib/string.c: update match_string() doc-strings with correct behavior
has been added to the -mm tree.  Its filename is
     lib-string-update-match_string-doc-strings-with-correct-behavior.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-string-update-match_string-doc-strings-with-correct-behavior.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-string-update-match_string-doc-strings-with-correct-behavior.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexandru Ardelean <alexandru.ardelean@analog.com>
Subject: lib/string.c: update match_string() doc-strings with correct behavior

There have been a few attempts at changing the behavior of the
match_string() helpers (i.e.  'match_string()' & 'sysfs_match_string()')
to match and extend the behavior described in their doc-strings.

But the simplest approach is to just fix the doc-strings.  The current
behavior is fine as-is, and some bugs were introduced by trying to fix it.

As for extending the behavior, new helpers can always be introduced if
needed.

The match_string() helpers behave more like 'strncmp()' in the sense that
they go up to n elements or until the first NULL element in the array of
strings.

This change updates the doc-strings with this info.
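
A short usage sketch of the documented behavior (the array and strings
are made up):

	static const char * const fruits[] = { "apple", "pear", NULL };

	match_string(fruits, ARRAY_SIZE(fruits), "pear");	/* returns 1 */
	match_string(fruits, -1, "pear");			/* returns 1 */
	match_string(fruits, -1, "plum");	/* stops at NULL, -EINVAL */

Either way the search ends at the first NULL element or after @n
entries, whichever comes first.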

Link: http://lkml.kernel.org/r/20200213072722.8249-1-alexandru.ardelean@analog.com
Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Tobin C . Harding" <tobin@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/string.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

--- a/lib/string.c~lib-string-update-match_string-doc-strings-with-correct-behavior
+++ a/lib/string.c
@@ -699,6 +699,14 @@ EXPORT_SYMBOL(sysfs_streq);
  * @n:		number of strings in the array or -1 for NULL terminated arrays
  * @string:	string to match with
  *
+ * This routine will look for a string in an array of strings up to the
+ * n-th element in the array or until the first NULL element.
+ *
+ * Historically the value of -1 for @n, was used to search in arrays that
+ * are NULL terminated. However, the function does not make a distinction
+ * when finishing the search: either @n elements have been compared OR
+ * the first NULL element was found.
+ *
  * Return:
  * index of a @string in the @array if matches, or %-EINVAL otherwise.
  */
@@ -727,6 +735,14 @@ EXPORT_SYMBOL(match_string);
  *
  * Returns index of @str in the @array or -EINVAL, just like match_string().
  * Uses sysfs_streq instead of strcmp for matching.
+ *
+ * This routine will look for a string in an array of strings up to the
+ * n-th element in the array or until the first NULL element.
+ *
+ * Historically the value of -1 for @n, was used to search in arrays that
+ * are NULL terminated. However, the function does not make a distinction
+ * when finishing the search: either @n elements have been compared OR
+ * the first NULL element was found.
  */
 int __sysfs_match_string(const char * const *array, size_t n, const char *str)
 {
_

Patches currently in -mm which might be from alexandru.ardelean@analog.com are

lib-string-update-match_string-doc-strings-with-correct-behavior.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-vmscan-replace-open-codings-to-numa_no_node.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (167 preceding siblings ...)
  2020-02-14  5:15 ` + lib-string-update-match_string-doc-strings-with-correct-behavior.patch " Andrew Morton
@ 2020-02-14  5:16 ` Andrew Morton
  2020-02-14  5:19 ` + mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch " Andrew Morton
                   ` (69 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:16 UTC (permalink / raw)
  To: anshuman.khandual, minchan, mm-commits, yang.shi


The patch titled
     Subject: mm: vmscan: replace open codings to NUMA_NO_NODE
has been added to the -mm tree.  Its filename is
     mm-vmscan-replace-open-codings-to-numa_no_node.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vmscan-replace-open-codings-to-numa_no_node.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-replace-open-codings-to-numa_no_node.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yang Shi <yang.shi@linux.alibaba.com>
Subject: mm: vmscan: replace open codings to NUMA_NO_NODE

Commit 98fa15f34cb3 ("mm: replace all open encodings for NUMA_NO_NODE")
did the replacement across the kernel tree, but a few more open codings
have crept into vmscan.c since then.

Link: http://lkml.kernel.org/r/1581568298-45317-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-replace-open-codings-to-numa_no_node
+++ a/mm/vmscan.c
@@ -2096,7 +2096,7 @@ static void shrink_active_list(unsigned
 
 unsigned long reclaim_pages(struct list_head *page_list)
 {
-	int nid = -1;
+	int nid = NUMA_NO_NODE;
 	unsigned long nr_reclaimed = 0;
 	LIST_HEAD(node_page_list);
 	struct reclaim_stat dummy_stat;
@@ -2111,7 +2111,7 @@ unsigned long reclaim_pages(struct list_
 
 	while (!list_empty(page_list)) {
 		page = lru_to_page(page_list);
-		if (nid == -1) {
+		if (nid == NUMA_NO_NODE) {
 			nid = page_to_nid(page);
 			INIT_LIST_HEAD(&node_page_list);
 		}
@@ -2132,7 +2132,7 @@ unsigned long reclaim_pages(struct list_
 			putback_lru_page(page);
 		}
 
-		nid = -1;
+		nid = NUMA_NO_NODE;
 	}
 
 	if (!list_empty(&node_page_list)) {
_

Patches currently in -mm which might be from yang.shi@linux.alibaba.com are

mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch
mm-vmpressure-use-mem_cgroup_is_root-api.patch
mm-vmscan-replace-open-codings-to-numa_no_node.patch
mm-migratec-migrate-pg_readahead-flag.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (168 preceding siblings ...)
  2020-02-14  5:16 ` + mm-vmscan-replace-open-codings-to-numa_no_node.patch " Andrew Morton
@ 2020-02-14  5:19 ` Andrew Morton
  2020-02-14  5:22 ` + mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch " Andrew Morton
                   ` (68 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:19 UTC (permalink / raw)
  To: linux-man, lixinhai.lxh, mhocko, mike.kravetz, mm-commits,
	naoya.horiguchi


The patch titled
     Subject: mm/mempolicy: support MPOL_MF_STRICT for huge page mapping
has been added to the -mm tree.  Its filename is
     mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Li Xinhai <lixinhai.lxh@gmail.com>
Subject: mm/mempolicy: support MPOL_MF_STRICT for huge page mapping

MPOL_MF_STRICT is used in mbind() for two purposes:

(1) MPOL_MF_STRICT is set alone, without MPOL_MF_MOVE or
    MPOL_MF_MOVE_ALL, to check whether there is a misplaced page and
    return -EIO;

(2) MPOL_MF_STRICT is set together with MPOL_MF_MOVE or MPOL_MF_MOVE_ALL,
    to check whether there is a misplaced page which either failed to be
    isolated or was isolated but failed to be moved, and return -EIO.

For non-hugepage mappings, (1) and (2) are implemented as expected.  For
hugepage mappings, (1) is not implemented, and in (2) the part about
failing to isolate and reporting -EIO is not implemented.

This patch implements the missing parts for hugepage mappings.  Benefits
with it applied:

- User space can apply the same code logic to handle mbind() on hugepage
  and non-hugepage mappings;

- MPOL_MF_STRICT alone can be used reliably to check whether there is a
  misplaced page when binding a policy on an address range, especially a
  range which contains both hugepage and non-hugepage mappings.

Analysis of the potential impact on existing users:

- If MPOL_MF_STRICT alone was previously used, hugetlb pages not
  following the memory policy would not cause an EIO error.  After this
  change, hugetlb pages are treated like all other pages: if
  MPOL_MF_STRICT alone is used and hugetlb pages do not follow the memory
  policy, an EIO error will be returned.

- For users combining MPOL_MF_STRICT with MPOL_MF_MOVE or
  MPOL_MF_MOVE_ALL, the semantics for pages that could not be moved are
  not changed by this patch, because failing to isolate and failing to
  move have the same effect for users, so their existing code will not be
  impacted.

In the mbind man page, the note that 'MPOL_MF_STRICT is ignored on huge
page mappings' can be removed once this patch is applied.
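
A userspace sketch of the behavior this enables (error handling trimmed;
addr/len are assumed to cover a range that includes hugetlb pages):

	#include <errno.h>
	#include <numaif.h>		/* mbind(), libnuma headers */

	unsigned long nodemask = 1UL << 0;	/* node 0 only */

	if (mbind(addr, len, MPOL_BIND, &nodemask,
		  8 * sizeof(nodemask), MPOL_MF_STRICT) == -1 &&
	    errno == EIO)
		/* a misplaced page was found, now including hugetlb */;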

Mike:

: The current behavior with MPOL_MF_STRICT and hugetlb pages is inconsistent
: and does not match documentation (as described above).  The special
: behavior for hugetlb pages ideally should have been removed when hugetlb
: page migration was introduced.  It is unlikely that anyone relies on
: today's inconsistent behavior, and removing one more case of special
: handling for hugetlb pages is a good thing.

Link: http://lkml.kernel.org/r/1581559627-6206-1-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-man <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |   37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

--- a/mm/mempolicy.c~mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping
+++ a/mm/mempolicy.c
@@ -557,9 +557,10 @@ static int queue_pages_hugetlb(pte_t *pt
 			       unsigned long addr, unsigned long end,
 			       struct mm_walk *walk)
 {
+	int ret = 0;
 #ifdef CONFIG_HUGETLB_PAGE
 	struct queue_pages *qp = walk->private;
-	unsigned long flags = qp->flags;
+	unsigned long flags = (qp->flags & MPOL_MF_VALID);
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t entry;
@@ -571,16 +572,44 @@ static int queue_pages_hugetlb(pte_t *pt
 	page = pte_page(entry);
 	if (!queue_pages_required(page, qp))
 		goto unlock;
+
+	if (flags == MPOL_MF_STRICT) {
+		/*
+		 * STRICT alone means only detecting misplaced page and no
+		 * need to further check other vma.
+		 */
+		ret = -EIO;
+		goto unlock;
+	}
+
+	if (!vma_migratable(walk->vma)) {
+		/*
+		 * Must be STRICT with MOVE*, otherwise .test_walk() have
+		 * stopped walking current vma.
+		 * Detecting misplaced page but allow migrating pages which
+		 * have been queued.
+		 */
+		ret = 1;
+		goto unlock;
+	}
+
 	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
 	if (flags & (MPOL_MF_MOVE_ALL) ||
-	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
-		isolate_huge_page(page, qp->pagelist);
+	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
+		if (!isolate_huge_page(page, qp->pagelist) &&
+			(flags & MPOL_MF_STRICT))
+			/*
+			 * Failed to isolate page but allow migrating pages
+			 * which have been queued.
+			 */
+			ret = 1;
+	}
 unlock:
 	spin_unlock(ptl);
 #else
 	BUG();
 #endif
-	return 0;
+	return ret;
 }
 
 #ifdef CONFIG_NUMA_BALANCING
_

Patches currently in -mm which might be from lixinhai.lxh@gmail.com are

mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch
revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch
mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch
mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch


* + mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (169 preceding siblings ...)
  2020-02-14  5:19 ` + mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch " Andrew Morton
@ 2020-02-14  5:22 ` Andrew Morton
  2020-02-14  5:26 ` + mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch " Andrew Morton
                   ` (67 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:22 UTC (permalink / raw)
  To: gshan, guro, mm-commits, stable


The patch titled
     Subject: mm/vmscan.c: don't round up scan size for online memory cgroup
has been added to the -mm tree.  Its filename is
     mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Gavin Shan <gshan@redhat.com>
Subject: mm/vmscan.c: don't round up scan size for online memory cgroup

commit 68600f623d69 ("mm: don't miss the last page because of round-off
error") makes the scan size round up to @denominator regardless of the
memory cgroup's state, online or offline.  This affects the overall
reclaiming behavior: under the former formula, the corresponding LRU
list is eligible for reclaiming only when its size logically right
shifted by @sc->priority is bigger than zero.  For example, the inactive
anonymous LRU list should have at least 0x4000 pages to be eligible for
reclaiming when we have 60/12 for swappiness/priority, without taking
the scan/rotation ratio into account.  After the roundup is applied, the
inactive anonymous LRU list becomes eligible for reclaiming when its
size is bigger than or equal to 0x1000 under the same conditions.

    (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
    ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1

aarch64 has a 512MB huge page size when the base page size is 64KB.  The
memory cgroup that has a huge page is always eligible for reclaiming in
that case.  The reclaiming is likely to stop after the huge page is
reclaimed, meaning the further iterations on @sc->priority and the
sibling and child memory cgroups will be skipped.  The overall behaviour
has been changed.  This patch fixes the issue by applying the roundup to
offlined memory cgroups only, to give more preference to reclaiming
memory from offlined memory cgroups.  That sounds reasonable, as their
memory is unlikely to be used by anyone.
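
To make the arithmetic concrete, here is a small userspace sketch
(illustrative only; div_round_up() below mirrors the kernel's
DIV64_U64_ROUND_UP() helper) contrasting the two divisions for the
numbers above:

    #include <stdio.h>

    /* Same arithmetic as the kernel's DIV64_U64_ROUND_UP() helper. */
    static unsigned long long div_round_up(unsigned long long n,
                                           unsigned long long d)
    {
            return (n + d - 1) / d;
    }

    int main(void)
    {
            unsigned long long denominator = 60 + 140 + 1; /* 201 */
            unsigned long long scan = 0x1000 >> 12;  /* one page after the shift */

            /* Plain division: the tiny list is not eligible for scanning. */
            printf("plain:    %llu\n", scan * 60 / denominator);
            /* Round-up division: even a single page stays eligible. */
            printf("round-up: %llu\n", div_round_up(scan * 60, denominator));
            return 0;
    }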

The issue was found by starting up 8 VMs on an Ampere Mustang machine,
which has 8 CPUs and 16GB of memory.  Each VM is given 2 vCPUs and 2GB
of memory.  It took 264 seconds for all VMs to be completely up, and
784MB of swap was consumed after that.  With this patch applied, it took
236 seconds and 60MB of swap to do the same thing.  So there is a 10%
performance improvement in my case.  Note that KSM is disabled while THP
is enabled in the testing.

   Before:
         total     used    free   shared  buff/cache   available
   Mem:  16196    10065    2049       16        4081        3749
   Swap:  8175      784    7391

   After:
         total     used    free   shared  buff/cache   available
   Mem:  16196    11324    3656       24        1215        2936
   Swap:  8175       60    8115

Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>	[4.20+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup
+++ a/mm/vmscan.c
@@ -2415,10 +2415,13 @@ out:
 			/*
 			 * Scan types proportional to swappiness and
 			 * their relative recent reclaim efficiency.
-			 * Make sure we don't miss the last page
-			 * because of a round-off error.
+			 * Make sure we don't miss the last page on
+			 * the offlined memory cgroups because of a
+			 * round-off error.
 			 */
-			scan = DIV64_U64_ROUND_UP(scan * fraction[file],
+			scan = mem_cgroup_online(memcg) ?
+			       div64_u64(scan * fraction[file], denominator) :
+			       DIV64_U64_ROUND_UP(scan * fraction[file],
 						  denominator);
 			break;
 		case SCAN_FILE:
_

Patches currently in -mm which might be from gshan@redhat.com are

mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch


* + mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (170 preceding siblings ...)
  2020-02-14  5:22 ` + mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch " Andrew Morton
@ 2020-02-14  5:26 ` Andrew Morton
  2020-02-14  5:30 ` + drivers-base-memoryc-drop-section_count.patch " Andrew Morton
                   ` (66 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:26 UTC (permalink / raw)
  To: akpm, cai, lixinhai.lxh, mm-commits, yang.shi


The patch titled
     Subject: mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
has been added to the -mm tree.  Its filename is
     mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yang Shi <yang.shi@linux.alibaba.com>
Subject: mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()

queue_pages_test_walk() already uses VM_BUG_ON(); switching it to
VM_BUG_ON_VMA() dumps more debug information (the vma's state) and so
helps debugging.
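
For reference, a simplified sketch of what the two macros expand to when
CONFIG_DEBUG_VM is enabled (paraphrased from include/linux/mmdebug.h, not
part of this patch):

    #define VM_BUG_ON(cond)        BUG_ON(cond)

    /*
     * The _VMA variant dumps the vma before dying, so the crash log
     * carries the vma state along with the assertion failure.
     */
    #define VM_BUG_ON_VMA(cond, vma)                        \
            do {                                            \
                    if (unlikely(cond)) {                   \
                            dump_vma(vma);                  \
                            BUG();                          \
                    }                                       \
            } while (0)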

Link: http://lkml.kernel.org/r/1579068565-110432-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Li Xinhai" <lixinhai.lxh@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/mempolicy.c~mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk
+++ a/mm/mempolicy.c
@@ -650,7 +650,7 @@ static int queue_pages_test_walk(unsigne
 	unsigned long flags = qp->flags;
 
 	/* range check first */
-	VM_BUG_ON((vma->vm_start > start) || (vma->vm_end < end));
+	VM_BUG_ON_VMA((vma->vm_start > start) || (vma->vm_end < end), vma);
 
 	if (!qp->first) {
 		qp->first = vma;
_

Patches currently in -mm which might be from yang.shi@linux.alibaba.com are

mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch
mm-vmpressure-use-mem_cgroup_is_root-api.patch
mm-vmscan-replace-open-codings-to-numa_no_node.patch
mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch
mm-migratec-migrate-pg_readahead-flag.patch


* + drivers-base-memoryc-drop-section_count.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (171 preceding siblings ...)
  2020-02-14  5:26 ` + mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch " Andrew Morton
@ 2020-02-14  5:30 ` Andrew Morton
  2020-02-14  5:31 ` + drivers-base-memoryc-drop-pages_correctly_probed.patch " Andrew Morton
                   ` (65 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:30 UTC (permalink / raw)
  To: anshuman.khandual, dan.j.williams, david, gregkh, mhocko,
	mm-commits, pasha.tatashin, rafael


The patch titled
     Subject: drivers/base/memory.c: drop section_count
has been added to the -mm tree.  Its filename is
     drivers-base-memoryc-drop-section_count.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/drivers-base-memoryc-drop-section_count.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/drivers-base-memoryc-drop-section_count.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: David Hildenbrand <david@redhat.com>
Subject: drivers/base/memory.c: drop section_count

Patch series "mm: drop superfluous section checks when onlining/offlining".

Let's drop some superfluous section checks on the onlining/offlining path.

This patch (of 3):

Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to
online/offline memory blocks with holes") we have a generic check in
offline_pages() that disallows offlining memory blocks with holes.

Memory blocks with missing sections are just another variant of this
type of block.  We can stop checking for (and especially storing)
present sections.  A proper error message explaining why offlining
failed is now printed.

section_count was initially introduced in commit 07681215975e ("Driver
core: Add section count to memory_block struct") in order to detect when
it is okay to remove a memory block.  It was used in commit 26bbe7ef6d5c
("drivers/base/memory.c: prohibit offlining of memory blocks with missing
sections") to disallow offlining memory blocks with missing sections.  As
we refactored creation/removal of memory devices and have a proper check
for holes in place, we can drop the section_count.

This also removes a leftover comment regarding the mem_sysfs_mutex, which
was removed in commit 848e19ad3c33 ("drivers/base/memory.c: drop the
mem_sysfs_mutex").

Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c  |   17 +++--------------
 include/linux/memory.h |    1 -
 2 files changed, 3 insertions(+), 15 deletions(-)

--- a/drivers/base/memory.c~drivers-base-memoryc-drop-section_count
+++ a/drivers/base/memory.c
@@ -275,10 +275,6 @@ static int memory_subsys_offline(struct
 	if (mem->state == MEM_OFFLINE)
 		return 0;
 
-	/* Can't offline block with non-present sections */
-	if (mem->section_count != sections_per_block)
-		return -EINVAL;
-
 	return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
 }
 
@@ -643,7 +639,7 @@ static int init_memory_block(struct memo
 
 static int add_memory_block(unsigned long base_section_nr)
 {
-	int ret, section_count = 0;
+	int section_count = 0;
 	struct memory_block *mem;
 	unsigned long nr;
 
@@ -654,12 +650,8 @@ static int add_memory_block(unsigned lon
 
 	if (section_count == 0)
 		return 0;
-	ret = init_memory_block(&mem, base_memory_block_id(base_section_nr),
-				MEM_ONLINE);
-	if (ret)
-		return ret;
-	mem->section_count = section_count;
-	return 0;
+	return init_memory_block(&mem, base_memory_block_id(base_section_nr),
+				 MEM_ONLINE);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -697,7 +689,6 @@ int create_memory_block_devices(unsigned
 		ret = init_memory_block(&mem, block_id, MEM_OFFLINE);
 		if (ret)
 			break;
-		mem->section_count = sections_per_block;
 	}
 	if (ret) {
 		end_block_id = block_id;
@@ -706,7 +697,6 @@ int create_memory_block_devices(unsigned
 			mem = find_memory_block_by_id(block_id);
 			if (WARN_ON_ONCE(!mem))
 				continue;
-			mem->section_count = 0;
 			unregister_memory(mem);
 		}
 	}
@@ -735,7 +725,6 @@ void remove_memory_block_devices(unsigne
 		mem = find_memory_block_by_id(block_id);
 		if (WARN_ON_ONCE(!mem))
 			continue;
-		mem->section_count = 0;
 		unregister_memory_block_under_nodes(mem);
 		unregister_memory(mem);
 	}
--- a/include/linux/memory.h~drivers-base-memoryc-drop-section_count
+++ a/include/linux/memory.h
@@ -26,7 +26,6 @@
 struct memory_block {
 	unsigned long start_section_nr;
 	unsigned long state;		/* serialized by the dev->lock */
-	int section_count;		/* serialized by mem_sysfs_mutex */
 	int online_type;		/* for passing data to online routine */
 	int phys_device;		/* to which fru does this belong? */
 	struct device dev;
_

Patches currently in -mm which might be from david@redhat.com are

drivers-base-memoryc-cache-memory-blocks-in-xarray-to-accelerate-lookup-fix.patch
drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch
drivers-base-memoryc-drop-section_count.patch
drivers-base-memoryc-drop-pages_correctly_probed.patch
mm-page_extc-drop-pfn_present-check-when-onlining.patch


* + drivers-base-memoryc-drop-pages_correctly_probed.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (172 preceding siblings ...)
  2020-02-14  5:30 ` + drivers-base-memoryc-drop-section_count.patch " Andrew Morton
@ 2020-02-14  5:31 ` Andrew Morton
  2020-02-14  5:31 ` + mm-page_extc-drop-pfn_present-check-when-onlining.patch " Andrew Morton
                   ` (64 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:31 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, david, gregkh, mhocko,
	mm-commits, pasha.tatashin, rafael


The patch titled
     Subject: drivers/base/memory.c: drop pages_correctly_probed()
has been added to the -mm tree.  Its filename is
     drivers-base-memoryc-drop-pages_correctly_probed.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/drivers-base-memoryc-drop-pages_correctly_probed.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/drivers-base-memoryc-drop-pages_correctly_probed.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: David Hildenbrand <david@redhat.com>
Subject: drivers/base/memory.c: drop pages_correctly_probed()

pages_correctly_probed() is a leftover from ancient times.  It dates back
to commit 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove
functions"), where PG_reserved checks were added as a safety net:

	/*
	 * The probe routines leave the pages reserved, just
	 * as the bootmem code does.  Make sure they're still
	 * that way.
	 */

The checks were refactored quite a bit over the years, especially in
commit b77eab7079d9 ("mm/memory_hotplug: optimize probe routine"), where
checks for present, valid, and online sections were added.

Hotplugged memory is added via add_memory(), which will create the full
memmap for the hotplugged memory, and mark all sections valid and present.

Only full memory blocks are onlined/offlined, so we also cannot have an
inconsistency in that regard (especially, memory blocks with some sections
being online and some being offline).

1. Boot memory always starts online.  Since commit c5e79ef561b0
   ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with
   holes") we disallow offlining any memory with holes.  Therefore, we
   never online memory with holes.  Present and validity checks are
   superfluous.

2. Only complete memory blocks are onlined/offlined (and especially,
   the state - online or offline - is stored for whole memory blocks). 
   Besides the core, only arch/powerpc/platforms/powernv/memtrace.c
   manually calls offline_pages() and fiddles with memory block states.
   But it also only offlines complete memory blocks.

3. To make any of these conditions trigger, something would have to be
   terribly messed up in the core.  (e.g., online/offline only some
   sections of a memory block).

4. Memory unplug properly makes sure that all sysfs attributes were
   removed (and therefore, that all threads left the sysfs handlers).  We
   don't have to worry about zombie devices at this point.

5. The valid_section_nr(section_nr) check is actually dead code, as it
   would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)).

No wonder we haven't seen any of these errors in a long time (or even
ever, according to my search).  Let's just get rid of them.  Now, all
checks that could hinder onlining and offlining are completely contained
in online_pages()/offline_pages().

Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c |   42 ----------------------------------------
 1 file changed, 42 deletions(-)

--- a/drivers/base/memory.c~drivers-base-memoryc-drop-pages_correctly_probed
+++ a/drivers/base/memory.c
@@ -153,45 +153,6 @@ int memory_notify(unsigned long val, voi
 }
 
 /*
- * The probe routines leave the pages uninitialized, just as the bootmem code
- * does. Make sure we do not access them, but instead use only information from
- * within sections.
- */
-static bool pages_correctly_probed(unsigned long start_pfn)
-{
-	unsigned long section_nr = pfn_to_section_nr(start_pfn);
-	unsigned long section_nr_end = section_nr + sections_per_block;
-	unsigned long pfn = start_pfn;
-
-	/*
-	 * memmap between sections is not contiguous except with
-	 * SPARSEMEM_VMEMMAP. We lookup the page once per section
-	 * and assume memmap is contiguous within each section
-	 */
-	for (; section_nr < section_nr_end; section_nr++) {
-		if (WARN_ON_ONCE(!pfn_valid(pfn)))
-			return false;
-
-		if (!present_section_nr(section_nr)) {
-			pr_warn("section %ld pfn[%lx, %lx) not present\n",
-				section_nr, pfn, pfn + PAGES_PER_SECTION);
-			return false;
-		} else if (!valid_section_nr(section_nr)) {
-			pr_warn("section %ld pfn[%lx, %lx) no valid memmap\n",
-				section_nr, pfn, pfn + PAGES_PER_SECTION);
-			return false;
-		} else if (online_section_nr(section_nr)) {
-			pr_warn("section %ld pfn[%lx, %lx) is already online\n",
-				section_nr, pfn, pfn + PAGES_PER_SECTION);
-			return false;
-		}
-		pfn += PAGES_PER_SECTION;
-	}
-
-	return true;
-}
-
-/*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
  */
@@ -207,9 +168,6 @@ memory_block_action(unsigned long start_
 
 	switch (action) {
 	case MEM_ONLINE:
-		if (!pages_correctly_probed(start_pfn))
-			return -EBUSY;


* + mm-page_extc-drop-pfn_present-check-when-onlining.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (173 preceding siblings ...)
  2020-02-14  5:31 ` + drivers-base-memoryc-drop-pages_correctly_probed.patch " Andrew Morton
@ 2020-02-14  5:31 ` Andrew Morton
  2020-02-14  5:41 ` [failures] include-remove-highmemh-from-pagemaph.patch removed from " Andrew Morton
                   ` (63 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:31 UTC (permalink / raw)
  To: anshuman.khandual, dan.j.williams, david, gregkh, mhocko,
	mm-commits, pasha.tatashin, rafael


The patch titled
     Subject: mm/page_ext.c: drop pfn_present() check when onlining
has been added to the -mm tree.  Its filename is
     mm-page_extc-drop-pfn_present-check-when-onlining.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-page_extc-drop-pfn_present-check-when-onlining.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_extc-drop-pfn_present-check-when-onlining.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: David Hildenbrand <david@redhat.com>
Subject: mm/page_ext.c: drop pfn_present() check when onlining

Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to
online/offline memory blocks with holes") we disallow offlining any
memory with holes.  As all boot memory is online and hotplugged memory
cannot contain holes, we never online memory with holes.

The pfn_present() check can therefore be dropped.

Link: http://lkml.kernel.org/r/20200127110424.5757-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_ext.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/mm/page_ext.c~mm-page_extc-drop-pfn_present-check-when-onlining
+++ a/mm/page_ext.c
@@ -303,11 +303,8 @@ static int __meminit online_page_ext(uns
 		VM_BUG_ON(!node_state(nid, N_ONLINE));
 	}
 
-	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
-		if (!pfn_present(pfn))
-			continue;
+	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION)
 		fail = init_section_page_ext(pfn, nid);
-	}
 	if (!fail)
 		return 0;
 
_

Patches currently in -mm which might be from david@redhat.com are

drivers-base-memoryc-cache-memory-blocks-in-xarray-to-accelerate-lookup-fix.patch
drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch
drivers-base-memoryc-drop-section_count.patch
drivers-base-memoryc-drop-pages_correctly_probed.patch
mm-page_extc-drop-pfn_present-check-when-onlining.patch


* [failures] include-remove-highmemh-from-pagemaph.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (174 preceding siblings ...)
  2020-02-14  5:31 ` + mm-page_extc-drop-pfn_present-check-when-onlining.patch " Andrew Morton
@ 2020-02-14  5:41 ` Andrew Morton
  2020-02-14  6:26 ` mmotm 2020-02-13-22-26 uploaded Andrew Morton
                   ` (62 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  5:41 UTC (permalink / raw)
  To: mm-commits, willy


The patch titled
     Subject: include: Remove highmem.h from pagemap.h
has been removed from the -mm tree.  Its filename was
     include-remove-highmemh-from-pagemaph.patch

This patch was dropped because it had testing failures

------------------------------------------------------
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: include: Remove highmem.h from pagemap.h

pagemap.h doesn't need highmem.h itself.  Only two dozen users were
relying on pagemap to pull in highmem (usually for kmap), so fix them all
up and we can remove this header file dependency.
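
As an illustration (a hypothetical file, not part of this patch), a user
of kmap() now includes highmem.h directly instead of getting it for free
via pagemap.h:

    #include <linux/pagemap.h>
    #include <linux/highmem.h>      /* now required explicitly for kmap() */

    static void *map_page_for_io(struct page *page)
    {
            return kmap(page);      /* no longer pulled in via pagemap.h */
    }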

Link: http://lkml.kernel.org/r/20200213195643.31587-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/char/virtio_console.c                                      |    1 +
 drivers/media/pci/ivtv/ivtv-driver.h                               |    1 +
 drivers/media/v4l2-core/videobuf-dma-sg.c                          |    1 +
 drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c |    1 +
 fs/affs/affs.h                                                     |    1 +
 fs/ext2/namei.c                                                    |    1 +
 fs/freevxfs/vxfs_immed.c                                           |    1 +
 fs/freevxfs/vxfs_subr.c                                            |    1 +
 fs/hfs/btree.c                                                     |    1 +
 fs/jffs2/gc.c                                                      |    1 +
 fs/minix/minix.h                                                   |    1 +
 fs/ocfs2/symlink.c                                                 |    1 +
 fs/qnx6/qnx6.h                                                     |    1 +
 fs/read_write.c                                                    |    1 +
 fs/reiserfs/ioctl.c                                                |    1 +
 fs/reiserfs/tail_conversion.c                                      |    1 +
 fs/reiserfs/xattr.c                                                |    3 ++-
 fs/squashfs/file.c                                                 |    1 +
 fs/squashfs/symlink.c                                              |    1 +
 fs/sysv/namei.c                                                    |    1 +
 fs/udf/file.c                                                      |    1 +
 fs/ufs/ufs_fs.h                                                    |    1 +
 include/linux/pagemap.h                                            |    1 -
 kernel/bpf/stackmap.c                                              |    1 +
 lib/iov_iter.c                                                     |    1 +
 net/sunrpc/auth_gss/gss_krb5_wrap.c                                |    1 +
 net/sunrpc/cache.c                                                 |    1 +
 27 files changed, 27 insertions(+), 2 deletions(-)

--- a/drivers/char/virtio_console.c~include-remove-highmemh-from-pagemaph
+++ a/drivers/char/virtio_console.c
@@ -13,6 +13,7 @@
 #include <linux/fs.h>
 #include <linux/splice.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/init.h>
 #include <linux/list.h>
 #include <linux/poll.h>
--- a/drivers/media/pci/ivtv/ivtv-driver.h~include-remove-highmemh-from-pagemaph
+++ a/drivers/media/pci/ivtv/ivtv-driver.h
@@ -50,6 +50,7 @@
 #include <linux/list.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/scatterlist.h>
 #include <linux/kthread.h>
 #include <linux/mutex.h>
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~include-remove-highmemh-from-pagemaph
+++ a/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -25,6 +25,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/scatterlist.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
--- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c~include-remove-highmemh-from-pagemaph
+++ a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c
@@ -4,6 +4,7 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/errno.h>
+#include <linux/highmem.h>
 #include <linux/interrupt.h>
 #include <linux/pagemap.h>
 #include <linux/dma-mapping.h>
--- a/fs/affs/affs.h~include-remove-highmemh-from-pagemaph
+++ a/fs/affs/affs.h
@@ -8,6 +8,7 @@
 #include <linux/types.h>
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
+#include <linux/highmem.h>
 #include "amigaffs.h"
 #include <linux/mutex.h>
 #include <linux/workqueue.h>
--- a/fs/ext2/namei.c~include-remove-highmemh-from-pagemaph
+++ a/fs/ext2/namei.c
@@ -31,6 +31,7 @@
  *        David S. Miller (davem@caip.rutgers.edu), 1995
  */
 
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include <linux/quotaops.h>
 #include "ext2.h"
--- a/fs/freevxfs/vxfs_immed.c~include-remove-highmemh-from-pagemaph
+++ a/fs/freevxfs/vxfs_immed.c
@@ -32,6 +32,7 @@
  */
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 
 #include "vxfs.h"
 #include "vxfs_extern.h"
--- a/fs/freevxfs/vxfs_subr.c~include-remove-highmemh-from-pagemaph
+++ a/fs/freevxfs/vxfs_subr.c
@@ -34,6 +34,7 @@
 #include <linux/buffer_head.h>
 #include <linux/kernel.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 
 #include "vxfs_extern.h"
 
--- a/fs/hfs/btree.c~include-remove-highmemh-from-pagemaph
+++ a/fs/hfs/btree.c
@@ -10,6 +10,7 @@
  */
 
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/log2.h>
 
--- a/fs/jffs2/gc.c~include-remove-highmemh-from-pagemaph
+++ a/fs/jffs2/gc.c
@@ -16,6 +16,7 @@
 #include <linux/mtd/mtd.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/crc32.h>
 #include <linux/compiler.h>
 #include <linux/stat.h>
--- a/fs/minix/minix.h~include-remove-highmemh-from-pagemaph
+++ a/fs/minix/minix.h
@@ -4,6 +4,7 @@
 
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/minix_fs.h>
 
 #define INODE_VERSION(inode)	minix_sb(inode->i_sb)->s_version
--- a/fs/ocfs2/symlink.c~include-remove-highmemh-from-pagemaph
+++ a/fs/ocfs2/symlink.c
@@ -38,6 +38,7 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/namei.h>
 
 #include <cluster/masklog.h>
--- a/fs/qnx6/qnx6.h~include-remove-highmemh-from-pagemaph
+++ a/fs/qnx6/qnx6.h
@@ -19,6 +19,7 @@
 
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 
 typedef __u16 __bitwise __fs16;
 typedef __u32 __bitwise __fs32;
--- a/fs/read_write.c~include-remove-highmemh-from-pagemaph
+++ a/fs/read_write.c
@@ -16,6 +16,7 @@
 #include <linux/export.h>
 #include <linux/syscalls.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
 #include <linux/mount.h>
--- a/fs/reiserfs/ioctl.c~include-remove-highmemh-from-pagemaph
+++ a/fs/reiserfs/ioctl.c
@@ -9,6 +9,7 @@
 #include <linux/time.h>
 #include <linux/uaccess.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/compat.h>
 
 /*
--- a/fs/reiserfs/tail_conversion.c~include-remove-highmemh-from-pagemaph
+++ a/fs/reiserfs/tail_conversion.c
@@ -6,6 +6,7 @@
 
 #include <linux/time.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/buffer_head.h>
 #include "reiserfs.h"
 
--- a/fs/reiserfs/xattr.c~include-remove-highmemh-from-pagemaph
+++ a/fs/reiserfs/xattr.c
@@ -37,11 +37,12 @@
 #include "reiserfs.h"
 #include <linux/capability.h>
 #include <linux/dcache.h>
-#include <linux/namei.h>
 #include <linux/errno.h>
 #include <linux/gfp.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/highmem.h>
+#include <linux/namei.h>
 #include <linux/pagemap.h>
 #include <linux/xattr.h>
 #include "xattr.h"
--- a/fs/squashfs/file.c~include-remove-highmemh-from-pagemaph
+++ a/fs/squashfs/file.c
@@ -33,6 +33,7 @@
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/mutex.h>
 
 #include "squashfs_fs.h"
--- a/fs/squashfs/symlink.c~include-remove-highmemh-from-pagemaph
+++ a/fs/squashfs/symlink.c
@@ -22,6 +22,7 @@
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/xattr.h>
 
 #include "squashfs_fs.h"
--- a/fs/sysv/namei.c~include-remove-highmemh-from-pagemaph
+++ a/fs/sysv/namei.c
@@ -13,6 +13,7 @@
  *  Copyright (C) 1997, 1998  Krzysztof G. Baranowski
  */
 
+#include <linux/highmem.h>
 #include <linux/pagemap.h>
 #include "sysv.h"
 
--- a/fs/udf/file.c~include-remove-highmemh-from-pagemaph
+++ a/fs/udf/file.c
@@ -33,6 +33,7 @@
 #include <linux/capability.h>
 #include <linux/errno.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/uio.h>
 
 #include "udf_i.h"
--- a/fs/ufs/ufs_fs.h~include-remove-highmemh-from-pagemaph
+++ a/fs/ufs/ufs_fs.h
@@ -35,6 +35,7 @@
 #include <linux/kernel.h>
 #include <linux/stat.h>
 #include <linux/fs.h>
+#include <linux/highmem.h>
 #include <linux/workqueue.h>
 
 #include <asm/div64.h>
--- a/include/linux/pagemap.h~include-remove-highmemh-from-pagemaph
+++ a/include/linux/pagemap.h
@@ -8,7 +8,6 @@
 #include <linux/mm.h>
 #include <linux/fs.h>
 #include <linux/list.h>
-#include <linux/highmem.h>
 #include <linux/compiler.h>
 #include <linux/uaccess.h>
 #include <linux/gfp.h>
--- a/kernel/bpf/stackmap.c~include-remove-highmemh-from-pagemaph
+++ a/kernel/bpf/stackmap.c
@@ -8,6 +8,7 @@
 #include <linux/perf_event.h>
 #include <linux/elf.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/irq_work.h>
 #include "percpu_freelist.h"
 
--- a/lib/iov_iter.c~include-remove-highmemh-from-pagemaph
+++ a/lib/iov_iter.c
@@ -3,6 +3,7 @@
 #include <linux/bvec.h>
 #include <linux/uio.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/splice.h>
--- a/net/sunrpc/auth_gss/gss_krb5_wrap.c~include-remove-highmemh-from-pagemaph
+++ a/net/sunrpc/auth_gss/gss_krb5_wrap.c
@@ -34,6 +34,7 @@
 #include <linux/sunrpc/gss_krb5.h>
 #include <linux/random.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 # define RPCDBG_FACILITY	RPCDBG_AUTH
--- a/net/sunrpc/cache.c~include-remove-highmemh-from-pagemaph
+++ a/net/sunrpc/cache.c
@@ -27,6 +27,7 @@
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
 #include <linux/pagemap.h>
+#include <linux/highmem.h>
 #include <asm/ioctls.h>
 #include <linux/sunrpc/types.h>
 #include <linux/sunrpc/cache.h>
_

Patches currently in -mm which might be from willy@infradead.org are

mm-improve-dump_page-for-compound-pages.patch


* mmotm 2020-02-13-22-26 uploaded
  2020-02-04  1:33 incoming Andrew Morton
                   ` (175 preceding siblings ...)
  2020-02-14  5:41 ` [failures] include-remove-highmemh-from-pagemaph.patch removed from " Andrew Morton
@ 2020-02-14  6:26 ` Andrew Morton
  2020-02-24  2:45 ` + mm-debug-add-tests-validating-architecture-page-table-helpers.patch added to -mm tree Andrew Morton
                   ` (61 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-14  6:26 UTC (permalink / raw)
  To: broonie, linux-fsdevel, linux-kernel, linux-mm, linux-next,
	mhocko, mm-commits, sfr

The mm-of-the-moment snapshot 2020-02-13-22-26 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
more than once a week.

You will need quilt to apply these patches to the latest Linus release (5.x
or 5.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE-yyyy-mm-dd-hh-mm-ss.  Both contain the string yyyy-mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

	https://github.com/hnaz/linux-mm

The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is also available at

	https://github.com/hnaz/linux-mm



This mmotm tree contains the following patches against 5.6-rc1:
(patches marked "*" will be included in linux-next)

  origin.patch
* y2038-remove-ktime-to-from-timespec-timeval-conversion.patch
* y2038-remove-unused-time32-interfaces.patch
* y2038-hide-timeval-timespec-itimerval-itimerspec-types.patch
* revert-ipcsem-remove-uneeded-sem_undo_list-lock-usage-in-exit_sem.patch
* uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch
* mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch
* revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch
* mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch
* selftests-vm-add-missed-tests-in-run_vmtests.patch
* get_maintainer-remove-uses-of-p-for-maintainer-name.patch
* scripts-get_maintainerpl-deprioritize-old-fixes-addresses.patch
* mm-fix-a-comment-in-sys_swapon.patch
* memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch
* lib-string-update-match_string-doc-strings-with-correct-behavior.patch
* mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch
* proc-kpageflags-prevent-an-integer-overflow-in-stable_page_flags.patch
* proc-kpageflags-do-not-use-uninitialized-struct-pages.patch
* x86-mm-split-vmalloc_sync_all.patch
* asm-generic-make-more-kernel-space-headers-mandatory.patch
* ramfs-support-o_tmpfile.patch
* kernel-watchdog-flush-all-printk-nmi-buffers-when-hardlockup-detected.patch
  mm.patch
* mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch
* mm-gup-split-get_user_pages_remote-into-two-routines.patch
* mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
* mm-introduce-page_ref_sub_return.patch
* mm-gup-pass-gup-flags-to-two-more-routines.patch
* mm-gup-require-foll_get-for-get_user_pages_fast.patch
* mm-gup-track-foll_pin-pages.patch
* mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
* mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
* mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
* selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
* mm-improve-dump_page-for-compound-pages.patch
* mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch
* mm-swap-move-inode_lock-out-of-claim_swapfile.patch
* mm-swapfilec-fix-comments-for-swapcache_prepare.patch
* mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch
* mm-allocate-shrinker_map-on-appropriate-numa-node.patch
* mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch
* mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch
* mm-add-vm_insert_pages.patch
* mm-add-vm_insert_pages-fix.patch
* mm-add-vm_insert_pages-2.patch
* net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch
* net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
* mm-mmap-fix-the-adjusted-length-error.patch
* mm-add-mremap_dontunmap-to-mremap.patch
* mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch
* mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch
* mm-vmpressure-use-mem_cgroup_is_root-api.patch
* mm-vmscan-replace-open-codings-to-numa_no_node.patch
* mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch
* mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch
* hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
* hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
* hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
* hugetlb-disable-region_add-file_region-coalescing.patch
* hugetlb_cgroup-add-accounting-for-shared-mappings.patch
* hugetlb_cgroup-support-noreserve-mappings.patch
* hugetlb-support-file_region-coalescing-again.patch
* hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
* hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch
* mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch
* mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch
* mm-migratec-check-pagelist-in-move_pages_and_store_status.patch
* mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch
* mm-migratec-migrate-pg_readahead-flag.patch
* mm-migratec-migrate-pg_readahead-flag-fix.patch
* drivers-base-memoryc-cache-memory-blocks-in-xarray-to-accelerate-lookup.patch
* drivers-base-memoryc-cache-memory-blocks-in-xarray-to-accelerate-lookup-fix.patch
* mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
* mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
* mm-add-function-__putback_isolated_page.patch
* mm-introduce-reported-pages.patch
* virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
* virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
* mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
* mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
* mm-page_reporting-add-free-page-reporting-documentation.patch
* drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch
* drivers-base-memoryc-drop-section_count.patch
* drivers-base-memoryc-drop-pages_correctly_probed.patch
* mm-page_extc-drop-pfn_present-check-when-onlining.patch
* zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch
* info-task-hung-in-generic_file_write_iter.patch
* info-task-hung-in-generic_file_write-fix.patch
* kernel-hung_taskc-monitor-killed-tasks.patch
* asm-generic-fix-unistd_32h-generation-format.patch
* maintainers-add-an-entry-for-kfifo.patch
* lib-test_lockup-test-module-to-generate-lockups.patch
* lib-bch-replace-zero-length-array-with-flexible-array-member.patch
* lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch
* lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
* lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
* lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch
* lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch
* string-add-stracpy-and-stracpy_pad-mechanisms.patch
* documentation-checkpatch-prefer-stracpy-strscpy-over-strcpy-strlcpy-strncpy.patch
* checkpatch-remove-email-address-comment-from-email-address-comparisons.patch
* checkpatch-check-spdx-tags-in-yaml-files.patch
* checkpatch-support-base-commit-format.patch
* checkpatch-prefer-fallthrough-over-fallthrough-comments.patch
* kernel-relayc-fix-read_pos-error-when-multiple-readers.patch
* aio-simplify-read_events.patch
* init-cleanup-anon_inodes-and-old-io-schedulers-options.patch
  linux-next.patch
  linux-next-rejects.patch
  linux-next-fix.patch
* mm-frontswap-mark-various-intentional-data-races.patch
* mm-page_io-mark-various-intentional-data-races.patch
* mm-page_io-mark-various-intentional-data-races-v2.patch
* mm-swap_state-mark-various-intentional-data-races.patch
* mm-kmemleak-annotate-various-data-races-obj-ptr.patch
* mm-filemap-fix-a-data-race-in-filemap_fault.patch
* mm-swapfile-fix-and-annotate-various-data-races.patch
* mm-swapfile-fix-and-annotate-various-data-races-v2.patch
* mm-page_counter-fix-various-data-races-at-memsw.patch
* mm-memcontrol-fix-a-data-race-in-scan-count.patch
* mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
* mm-mempool-fix-a-data-race-in-mempool_free.patch
* mm-util-annotate-an-data-race-at-vm_committed_as.patch
* mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
* drivers-tty-serial-sh-scic-suppress-warning.patch
* fix-read-buffer-overflow-in-delta-ipc.patch
  make-sure-nobodys-leaking-resources.patch
  releasing-resources-with-children.patch
  mutex-subsystem-synchro-test-module.patch
  kernel-forkc-export-kernel_thread-to-modules.patch
  workaround-for-a-pci-restoring-bug.patch


* + mm-debug-add-tests-validating-architecture-page-table-helpers.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (176 preceding siblings ...)
  2020-02-14  6:26 ` mmotm 2020-02-13-22-26 uploaded Andrew Morton
@ 2020-02-24  2:45 ` Andrew Morton
  2020-02-24  3:20 ` + proc-faster-open-read-close-with-permanent-files.patch " Andrew Morton
                   ` (60 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  2:45 UTC (permalink / raw)
  To: anshuman.khandual, benh, borntraeger, bp, catalin.marinas,
	christophe.leroy, gerald.schaefer, gor, heiko.carstens, hpa,
	kirill, mingo, mingo, mm-commits, mpe, palmer, paul.walmsley,
	paulus, rppt, tglx, vgupta, will


The patch titled
     Subject: mm/debug: add tests validating architecture page table helpers
has been added to the -mm tree.  Its filename is
     mm-debug-add-tests-validating-architecture-page-table-helpers.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-debug-add-tests-validating-architecture-page-table-helpers.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-debug-add-tests-validating-architecture-page-table-helpers.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/debug: add tests validating architecture page table helpers

This adds tests which will validate architecture page table helpers and
other accessors for their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or the addition of new ones.

This test covers basic page table entry transformations, including but
not limited to old, young, dirty, clean, write, and write-protect, at
various levels, along with populating intermediate entries with the next
page table page and validating them.
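
For instance, a check of this kind verifies that the helpers compose as
documented (a simplified sketch; the complete version is in
mm/debug_vm_pgtable.c in the patch below):

    static void __init pte_sketch(unsigned long pfn, pgprot_t prot)
    {
            pte_t pte = pfn_pte(pfn, prot);

            /* mkyoung(mkold(pte)) must yield a young pte, and so on. */
            WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte))));
            WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte))));
            WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));
    }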

Test page table pages are allocated from system memory with the required
size and alignment.  The pfns mapped at the page table levels are derived
from a real pfn representing a valid kernel text symbol.  This test gets
called inside kernel_init() right after async_synchronize_full().

This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected.
Any architecture which is willing to subscribe to this test will need to
select ARCH_HAS_DEBUG_VM_PGTABLE.  For now this is limited to arc, arm64,
x86, s390 and ppc32 platforms, where the test is known to build and run
successfully.  Going forward, other architectures too can subscribe to
the test after fixing any build or runtime problems with their page table
helpers.  Meanwhile, for better platform coverage, the test can also be
enabled with CONFIG_EXPERT even without ARCH_HAS_DEBUG_VM_PGTABLE.

Folks interested in making sure that a given platform's page table
helpers conform to expected generic MM semantics should enable the above
config, which will just trigger this test during boot.  Any
non-conformity here will be reported as a warning which would need to be
fixed.  This test will help catch any changes to the agreed-upon
semantics expected from generic MM and enable platforms to accommodate
them thereafter.

Link: http://lkml.kernel.org/r/1581909460-19148-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	# s390
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>	# ppc32
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/features/debug/debug-vm-pgtable/arch-support.txt |   35 
 arch/arc/Kconfig                                               |    1 
 arch/arm64/Kconfig                                             |    1 
 arch/powerpc/Kconfig                                           |    1 
 arch/s390/Kconfig                                              |    1 
 arch/x86/Kconfig                                               |    1 
 arch/x86/include/asm/pgtable_64.h                              |    6 
 include/linux/mmdebug.h                                        |    5 
 init/main.c                                                    |    2 
 lib/Kconfig.debug                                              |   26 
 mm/Makefile                                                    |    1 
 mm/debug_vm_pgtable.c                                          |  389 ++++++++++
 12 files changed, 469 insertions(+)

--- a/arch/arc/Kconfig~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
 	def_bool y
 	select ARC_TIMERS
+	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DMA_PREP_COHERENT
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SETUP_DMA_OPS
--- a/arch/arm64/Kconfig~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/arch/arm64/Kconfig
@@ -11,6 +11,7 @@ config ARM64
 	select ACPI_PPTT if ACPI
 	select ARCH_CLOCKSOURCE_DATA
 	select ARCH_HAS_DEBUG_VIRTUAL
+	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_HAS_DMA_PREP_COHERENT
 	select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI
--- a/arch/powerpc/Kconfig~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/arch/powerpc/Kconfig
@@ -116,6 +116,7 @@ config PPC
 	#
 	select ARCH_32BIT_OFF_T if PPC32
 	select ARCH_HAS_DEBUG_VIRTUAL
+	select ARCH_HAS_DEBUG_VM_PGTABLE if PPC32
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FORTIFY_SOURCE
--- a/arch/s390/Kconfig~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/arch/s390/Kconfig
@@ -59,6 +59,7 @@ config KASAN_SHADOW_OFFSET
 config S390
 	def_bool y
 	select ARCH_BINFMT_ELF_STATE
+	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FORTIFY_SOURCE
--- a/arch/x86/include/asm/pgtable_64.h~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/arch/x86/include/asm/pgtable_64.h
@@ -53,6 +53,12 @@ static inline void sync_initial_page_tab
 
 struct mm_struct;
 
+#define mm_p4d_folded mm_p4d_folded
+static inline bool mm_p4d_folded(struct mm_struct *mm)
+{
+	return !pgtable_l5_enabled();
+}
+
 void set_pte_vaddr_p4d(p4d_t *p4d_page, unsigned long vaddr, pte_t new_pte);
 void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte);
 
--- a/arch/x86/Kconfig~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/arch/x86/Kconfig
@@ -61,6 +61,7 @@ config X86
 	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
 	select ARCH_HAS_DEBUG_VIRTUAL
+	select ARCH_HAS_DEBUG_VM_PGTABLE	if !X86_PAE
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FAST_MULTIPLIER
--- /dev/null
+++ a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -0,0 +1,35 @@
+#
+# Feature name:          debug-vm-pgtable
+#         Kconfig:       ARCH_HAS_DEBUG_VM_PGTABLE
+#         description:   arch supports pgtable tests for semantics compliance
+#
+    -----------------------
+    |         arch |status|
+    -----------------------
+    |       alpha: | TODO |
+    |         arc: |  ok  |
+    |         arm: | TODO |
+    |       arm64: |  ok  |
+    |         c6x: | TODO |
+    |        csky: | TODO |
+    |       h8300: | TODO |
+    |     hexagon: | TODO |
+    |        ia64: | TODO |
+    |        m68k: | TODO |
+    |  microblaze: | TODO |
+    |        mips: | TODO |
+    |       nds32: | TODO |
+    |       nios2: | TODO |
+    |    openrisc: | TODO |
+    |      parisc: | TODO |
+    |  powerpc/32: |  ok  |
+    |  powerpc/64: | TODO |
+    |       riscv: | TODO |
+    |        s390: |  ok  |
+    |          sh: | TODO |
+    |       sparc: | TODO |
+    |          um: | TODO |
+    |   unicore32: | TODO |
+    |         x86: |  ok  |
+    |      xtensa: | TODO |
+    -----------------------
--- a/include/linux/mmdebug.h~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/include/linux/mmdebug.h
@@ -64,4 +64,9 @@ void dump_mm(const struct mm_struct *mm)
 #define VM_BUG_ON_PGFLAGS(cond, page) BUILD_BUG_ON_INVALID(cond)
 #endif
 
+#ifdef CONFIG_DEBUG_VM_PGTABLE
+void debug_vm_pgtable(void);
+#else
+static inline void debug_vm_pgtable(void) { }
+#endif
 #endif
--- a/init/main.c~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/init/main.c
@@ -94,6 +94,7 @@
 #include <linux/rodata_test.h>
 #include <linux/jump_label.h>
 #include <linux/mem_encrypt.h>
+#include <linux/mmdebug.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -1346,6 +1347,7 @@ static int __ref kernel_init(void *unuse
 	kernel_init_freeable();
 	/* need to finish all async __init code before freeing the memory */
 	async_synchronize_full();
+	debug_vm_pgtable();
 	ftrace_free_init_mem();
 	free_initmem();
 	mark_readonly();
--- a/lib/Kconfig.debug~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/lib/Kconfig.debug
@@ -653,6 +653,12 @@ config SCHED_STACK_END_CHECK
 	  data corruption or a sporadic crash at a later stage once the region
 	  is examined. The runtime overhead introduced is minimal.
 
+config ARCH_HAS_DEBUG_VM_PGTABLE
+	bool
+	help
+	  An architecture should select this when it can successfully
+	  build and run DEBUG_VM_PGTABLE.
+
 config DEBUG_VM
 	bool "Debug VM"
 	depends on DEBUG_KERNEL
@@ -688,6 +694,26 @@ config DEBUG_VM_PGFLAGS
 
 	  If unsure, say N.
 
+config DEBUG_VM_PGTABLE
+	bool "Debug arch page table for semantics compliance"
+	depends on MMU
+	depends on !IA64 && !ARM
+	depends on ARCH_HAS_DEBUG_VM_PGTABLE || EXPERT
+	default n if !ARCH_HAS_DEBUG_VM_PGTABLE
+	default y if DEBUG_VM
+	help
+	  This option provides a debug method which can be used to test
+	  architecture page table helper functions on various platforms to
+	  verify that they comply with expected generic MM semantics. This
+	  will help architecture code ensure that any changes or new
+	  additions of these helpers still conform to the expected
+	  semantics of the generic MM. Platforms have to opt in for this
+	  through ARCH_HAS_DEBUG_VM_PGTABLE, although it can also be
+	  enabled through EXPERT without requiring a code change. This test
+	  is disabled on IA64 and ARM platforms where it fails to build.
+
+	  If unsure, say N.
+
 config ARCH_HAS_DEBUG_VIRTUAL
 	bool
 
--- /dev/null
+++ a/mm/debug_vm_pgtable.c
@@ -0,0 +1,389 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This kernel test validates architecture page table helpers and
+ * accessors and helps in verifying their continued compliance with
+ * expected generic MM semantics.
+ *
+ * Copyright (C) 2019 ARM Ltd.
+ *
+ * Author: Anshuman Khandual <anshuman.khandual@arm.com>
+ */
+#define pr_fmt(fmt) "debug_vm_pgtable: %s: " fmt, __func__
+
+#include <linux/gfp.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/kernel.h>
+#include <linux/kconfig.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mm_types.h>
+#include <linux/module.h>
+#include <linux/pfn_t.h>
+#include <linux/printk.h>
+#include <linux/random.h>
+#include <linux/spinlock.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/start_kernel.h>
+#include <linux/sched/mm.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+
+/*
+ * Basic operations
+ *
+ * mkold(entry)			= An old and not a young entry
+ * mkyoung(entry)		= A young and not an old entry
+ * mkdirty(entry)		= A dirty and not a clean entry
+ * mkclean(entry)		= A clean and not a dirty entry
+ * mkwrite(entry)		= A write and not a write protected entry
+ * wrprotect(entry)		= A write protected and not a write entry
+ * pxx_bad(entry)		= A mapped and non-table entry
+ * pxx_same(entry1, entry2)	= Both entries hold the exact same value
+ */
+#define VMFLAGS	(VM_READ|VM_WRITE|VM_EXEC)
+
+/*
+ * On the s390 platform, the lower 4 bits are used to identify a given page
+ * table entry's type. But these bits might affect the ability to clear
+ * entries with pxx_clear() because of how dynamic page table folding works
+ * on s390. So while loading up the entries, do not change the lower 4 bits.
+ * This does not affect any other platform.
+ */
+#define S390_MASK_BITS	4
+#define RANDOM_ORVALUE	GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS)
+#define RANDOM_NZVALUE	GENMASK(7, 0)
+
+static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
+{
+	pte_t pte = pfn_pte(pfn, prot);
+
+	WARN_ON(!pte_same(pte, pte));
+	WARN_ON(!pte_young(pte_mkyoung(pte_mkold(pte))));
+	WARN_ON(!pte_dirty(pte_mkdirty(pte_mkclean(pte))));
+	WARN_ON(!pte_write(pte_mkwrite(pte_wrprotect(pte))));
+	WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte))));
+	WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte))));
+	WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte))));
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot)
+{
+	pmd_t pmd = pfn_pmd(pfn, prot);
+
+	WARN_ON(!pmd_same(pmd, pmd));
+	WARN_ON(!pmd_young(pmd_mkyoung(pmd_mkold(pmd))));
+	WARN_ON(!pmd_dirty(pmd_mkdirty(pmd_mkclean(pmd))));
+	WARN_ON(!pmd_write(pmd_mkwrite(pmd_wrprotect(pmd))));
+	WARN_ON(pmd_young(pmd_mkold(pmd_mkyoung(pmd))));
+	WARN_ON(pmd_dirty(pmd_mkclean(pmd_mkdirty(pmd))));
+	WARN_ON(pmd_write(pmd_wrprotect(pmd_mkwrite(pmd))));
+	/*
+	 * A huge page does not point to next level page table
+	 * entry. Hence this must qualify as pmd_bad().
+	 */
+	WARN_ON(!pmd_bad(pmd_mkhuge(pmd)));
+}
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, pgprot_t prot)
+{
+	pud_t pud = pfn_pud(pfn, prot);
+
+	WARN_ON(!pud_same(pud, pud));
+	WARN_ON(!pud_young(pud_mkyoung(pud_mkold(pud))));
+	WARN_ON(!pud_write(pud_mkwrite(pud_wrprotect(pud))));
+	WARN_ON(pud_write(pud_wrprotect(pud_mkwrite(pud))));
+	WARN_ON(pud_young(pud_mkold(pud_mkyoung(pud))));
+
+	if (mm_pmd_folded(mm))
+		return;
+
+	/*
+	 * A huge page does not point to next level page table
+	 * entry. Hence this must qualify as pud_bad().
+	 */
+	WARN_ON(!pud_bad(pud_mkhuge(pud)));
+}
+#else
+static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, pgprot_t prot) { }
+#endif
+#else
+static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot) { }
+static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, pgprot_t prot) { }
+#endif
+
+static void __init p4d_basic_tests(unsigned long pfn, pgprot_t prot)
+{
+	p4d_t p4d;
+
+	memset(&p4d, RANDOM_NZVALUE, sizeof(p4d_t));
+	WARN_ON(!p4d_same(p4d, p4d));
+}
+
+static void __init pgd_basic_tests(unsigned long pfn, pgprot_t prot)
+{
+	pgd_t pgd;
+
+	memset(&pgd, RANDOM_NZVALUE, sizeof(pgd_t));
+	WARN_ON(!pgd_same(pgd, pgd));
+}
+
+#ifndef __PAGETABLE_PUD_FOLDED
+static void __init pud_clear_tests(struct mm_struct *mm, pud_t *pudp)
+{
+	pud_t pud = READ_ONCE(*pudp);
+
+	if (mm_pmd_folded(mm))
+		return;
+
+	pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
+	WRITE_ONCE(*pudp, pud);
+	pud_clear(pudp);
+	pud = READ_ONCE(*pudp);
+	WARN_ON(!pud_none(pud));
+}
+
+static void __init pud_populate_tests(struct mm_struct *mm, pud_t *pudp,
+				      pmd_t *pmdp)
+{
+	pud_t pud;
+
+	if (mm_pmd_folded(mm))
+		return;
+	/*
+	 * This entry points to next level page table page.
+	 * Hence this must not qualify as pud_bad().
+	 */
+	pmd_clear(pmdp);
+	pud_clear(pudp);
+	pud_populate(mm, pudp, pmdp);
+	pud = READ_ONCE(*pudp);
+	WARN_ON(pud_bad(pud));
+}
+#else
+static void __init pud_clear_tests(struct mm_struct *mm, pud_t *pudp) { }
+static void __init pud_populate_tests(struct mm_struct *mm, pud_t *pudp,
+				      pmd_t *pmdp)
+{
+}
+#endif
+
+#ifndef __PAGETABLE_P4D_FOLDED
+static void __init p4d_clear_tests(struct mm_struct *mm, p4d_t *p4dp)
+{
+	p4d_t p4d = READ_ONCE(*p4dp);
+
+	if (mm_pud_folded(mm))
+		return;
+
+	p4d = __p4d(p4d_val(p4d) | RANDOM_ORVALUE);
+	WRITE_ONCE(*p4dp, p4d);
+	p4d_clear(p4dp);
+	p4d = READ_ONCE(*p4dp);
+	WARN_ON(!p4d_none(p4d));
+}
+
+static void __init p4d_populate_tests(struct mm_struct *mm, p4d_t *p4dp,
+				      pud_t *pudp)
+{
+	p4d_t p4d;
+
+	if (mm_pud_folded(mm))
+		return;
+
+	/*
+	 * This entry points to next level page table page.
+	 * Hence this must not qualify as p4d_bad().
+	 */
+	pud_clear(pudp);
+	p4d_clear(p4dp);
+	p4d_populate(mm, p4dp, pudp);
+	p4d = READ_ONCE(*p4dp);
+	WARN_ON(p4d_bad(p4d));
+}
+
+static void __init pgd_clear_tests(struct mm_struct *mm, pgd_t *pgdp)
+{
+	pgd_t pgd = READ_ONCE(*pgdp);
+
+	if (mm_p4d_folded(mm))
+		return;
+
+	pgd = __pgd(pgd_val(pgd) | RANDOM_ORVALUE);
+	WRITE_ONCE(*pgdp, pgd);
+	pgd_clear(pgdp);
+	pgd = READ_ONCE(*pgdp);
+	WARN_ON(!pgd_none(pgd));
+}
+
+static void __init pgd_populate_tests(struct mm_struct *mm, pgd_t *pgdp,
+				      p4d_t *p4dp)
+{
+	pgd_t pgd;
+
+	if (mm_p4d_folded(mm))
+		return;
+
+	/*
+	 * This entry points to next level page table page.
+	 * Hence this must not qualify as pgd_bad().
+	 */
+	p4d_clear(p4dp);
+	pgd_clear(pgdp);
+	pgd_populate(mm, pgdp, p4dp);
+	pgd = READ_ONCE(*pgdp);
+	WARN_ON(pgd_bad(pgd));
+}
+#else
+static void __init p4d_clear_tests(struct mm_struct *mm, p4d_t *p4dp) { }
+static void __init pgd_clear_tests(struct mm_struct *mm, pgd_t *pgdp) { }
+static void __init p4d_populate_tests(struct mm_struct *mm, p4d_t *p4dp,
+				      pud_t *pudp)
+{
+}
+static void __init pgd_populate_tests(struct mm_struct *mm, pgd_t *pgdp,
+				      p4d_t *p4dp)
+{
+}
+#endif
+
+static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep)
+{
+	pte_t pte = READ_ONCE(*ptep);
+
+	pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
+	WRITE_ONCE(*ptep, pte);
+	pte_clear(mm, 0, ptep);
+	pte = READ_ONCE(*ptep);
+	WARN_ON(!pte_none(pte));
+}
+
+static void __init pmd_clear_tests(struct mm_struct *mm, pmd_t *pmdp)
+{
+	pmd_t pmd = READ_ONCE(*pmdp);
+
+	pmd = __pmd(pmd_val(pmd) | RANDOM_ORVALUE);
+	WRITE_ONCE(*pmdp, pmd);
+	pmd_clear(pmdp);
+	pmd = READ_ONCE(*pmdp);
+	WARN_ON(!pmd_none(pmd));
+}
+
+static void __init pmd_populate_tests(struct mm_struct *mm, pmd_t *pmdp,
+				      pgtable_t pgtable)
+{
+	pmd_t pmd;
+
+	/*
+	 * This entry points to next level page table page.
+	 * Hence this must not qualify as pmd_bad().
+	 */
+	pmd_clear(pmdp);
+	pmd_populate(mm, pmdp, pgtable);
+	pmd = READ_ONCE(*pmdp);
+	WARN_ON(pmd_bad(pmd));
+}
+
+static unsigned long __init get_random_vaddr(void)
+{
+	unsigned long random_vaddr, random_pages, total_user_pages;
+
+	total_user_pages = (TASK_SIZE - FIRST_USER_ADDRESS) / PAGE_SIZE;
+
+	random_pages = get_random_long() % total_user_pages;
+	random_vaddr = FIRST_USER_ADDRESS + random_pages * PAGE_SIZE;
+
+	return random_vaddr;
+}
+
+void __init debug_vm_pgtable(void)
+{
+	struct mm_struct *mm;
+	pgd_t *pgdp;
+	p4d_t *p4dp, *saved_p4dp;
+	pud_t *pudp, *saved_pudp;
+	pmd_t *pmdp, *saved_pmdp, pmd;
+	pte_t *ptep;
+	pgtable_t saved_ptep;
+	pgprot_t prot;
+	phys_addr_t paddr;
+	unsigned long vaddr, pte_aligned, pmd_aligned;
+	unsigned long pud_aligned, p4d_aligned, pgd_aligned;
+
+	pr_info("Validating architecture page table helpers\n");
+	prot = vm_get_page_prot(VMFLAGS);
+	vaddr = get_random_vaddr();
+	mm = mm_alloc();
+	if (!mm) {
+		pr_err("mm_struct allocation failed\n");
+		return;
+	}
+
+	/*
+	 * PFN for mapping at PTE level is determined from a standard kernel
+	 * text symbol. But pfns for higher page table levels are derived by
+	 * masking lower bits of this real pfn. These derived pfns might not
+	 * exist on the platform but that does not really matter as pfn_pxx()
+	 * helpers will still create appropriate entries for the test. This
+	 * avoids having to allocate large memory blocks for mappings at
+	 * the higher page table levels.
+	 */
+	paddr = __pa(&start_kernel);
+
+	pte_aligned = (paddr & PAGE_MASK) >> PAGE_SHIFT;
+	pmd_aligned = (paddr & PMD_MASK) >> PAGE_SHIFT;
+	pud_aligned = (paddr & PUD_MASK) >> PAGE_SHIFT;
+	p4d_aligned = (paddr & P4D_MASK) >> PAGE_SHIFT;
+	pgd_aligned = (paddr & PGDIR_MASK) >> PAGE_SHIFT;
+	WARN_ON(!pfn_valid(pte_aligned));
+
+	pgdp = pgd_offset(mm, vaddr);
+	p4dp = p4d_alloc(mm, pgdp, vaddr);
+	pudp = pud_alloc(mm, p4dp, vaddr);
+	pmdp = pmd_alloc(mm, pudp, vaddr);
+	ptep = pte_alloc_map(mm, pmdp, vaddr);
+
+	/*
+	 * Save all the page table page addresses as the page table
+	 * entries will be used for testing with random or garbage
+	 * values. These saved addresses will be used for freeing
+	 * page table pages.
+	 */
+	pmd = READ_ONCE(*pmdp);
+	saved_p4dp = p4d_offset(pgdp, 0UL);
+	saved_pudp = pud_offset(p4dp, 0UL);
+	saved_pmdp = pmd_offset(pudp, 0UL);
+	saved_ptep = pmd_pgtable(pmd);
+
+	pte_basic_tests(pte_aligned, prot);
+	pmd_basic_tests(pmd_aligned, prot);
+	pud_basic_tests(mm, pud_aligned, prot);
+	p4d_basic_tests(p4d_aligned, prot);
+	pgd_basic_tests(pgd_aligned, prot);
+
+	pte_clear_tests(mm, ptep);
+	pmd_clear_tests(mm, pmdp);
+	pud_clear_tests(mm, pudp);
+	p4d_clear_tests(mm, p4dp);
+	pgd_clear_tests(mm, pgdp);
+
+	pte_unmap(ptep);
+
+	pmd_populate_tests(mm, pmdp, saved_ptep);
+	pud_populate_tests(mm, pudp, saved_pmdp);
+	p4d_populate_tests(mm, p4dp, saved_pudp);
+	pgd_populate_tests(mm, pgdp, saved_p4dp);
+
+	p4d_free(mm, saved_p4dp);
+	pud_free(mm, saved_pudp);
+	pmd_free(mm, saved_pmdp);
+	pte_free(mm, saved_ptep);
+
+	mm_dec_nr_puds(mm);
+	mm_dec_nr_pmds(mm);
+	mm_dec_nr_ptes(mm);
+	mmdrop(mm);
+}
--- a/mm/Makefile~mm-debug-add-tests-validating-architecture-page-table-helpers
+++ a/mm/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoiso
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
+obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
_

Patches currently in -mm which might be from anshuman.khandual@arm.com are

mm-debug-add-tests-validating-architecture-page-table-helpers.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + proc-faster-open-read-close-with-permanent-files.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (177 preceding siblings ...)
  2020-02-24  2:45 ` + mm-debug-add-tests-validating-architecture-page-table-helpers.patch added to -mm tree Andrew Morton
@ 2020-02-24  3:20 ` Andrew Morton
  2020-02-24  3:24 ` + proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch " Andrew Morton
                   ` (59 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  3:20 UTC (permalink / raw)
  To: adobriyan, dan.carpenter, joe, lkp, mm-commits, viro


The patch titled
     Subject: proc: faster open/read/close with "permanent" files
has been added to the -mm tree.  Its filename is
     proc-faster-open-read-close-with-permanent-files.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/proc-faster-open-read-close-with-permanent-files.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/proc-faster-open-read-close-with-permanent-files.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: proc: faster open/read/close with "permanent" files

Now that "struct proc_ops" exists, we can start putting things there which
could not fly with the VFS "struct file_operations"...

Most of the fs/proc/inode.c file is dedicated to making open/read/.../close
reliable in the event of disappearing /proc entries, which usually happens
when a module is removed.  Files like /proc/cpuinfo, which never disappear,
simply do not need such protection.

Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.

Enable "permanent" flag for

	/proc/cpuinfo
	/proc/kmsg
	/proc/modules
	/proc/slabinfo
	/proc/stat
	/proc/sysvipc/*
	/proc/swaps
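
As a sketch of how such an entry opts in (this mirrors the hunks below;
the "foo" file and its show function are made-up placeholders, not part
of this patch):

	static int foo_show(struct seq_file *m, void *v)
	{
		seq_puts(m, "hello\n");	/* placeholder output */
		return 0;
	}

	static int foo_open(struct inode *inode, struct file *file)
	{
		return single_open(file, foo_show, NULL);
	}

	static const struct proc_ops foo_proc_ops = {
		.proc_flags	= PROC_ENTRY_PERMANENT,	/* entry is never removed */
		.proc_open	= foo_open,
		.proc_read	= seq_read,
		.proc_lseek	= seq_lseek,
		.proc_release	= single_release,
	};

	/* registered once at boot: proc_create("foo", 0444, NULL, &foo_proc_ops); */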

More will come once I figure out a foolproof way to prevent module
authors from marking their stuff "permanent" for performance reasons
when it is not.

This should help with scalability: the benchmark is "read /proc/cpuinfo R
times from N threads scattered over the system".

	N	R	t, s (before)	t, s (after)
	-----------------------------------------------------
	64	4096	1.582458	1.530502	-3.2%
	256	4096	6.371926	6.125168	-3.9%
	1024	4096	25.64888	24.47528	-4.6%

Benchmark source:

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <stdlib.h>

const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;

int xxx = 0;

int glue(int n)
{
	cpu_set_t m;
	CPU_ZERO(&m);
	CPU_SET(n, &m);
	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}

void f(int n)
{
	glue(n % NR_CPUS);

	while (*(volatile int *)&xxx == 0) {
	}

	for (int i = 0; i < R; i++) {
		int fd = open(filename, O_RDONLY);
		char buf[4096];
		ssize_t rv = read(fd, buf, sizeof(buf));
		asm volatile ("" :: "g" (rv));
		close(fd);
	}
}

int main(int argc, char *argv[])
{
	if (argc < 4) {
		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
		return 1;
	}

	N = atoi(argv[1]);
	filename = argv[2];
	R = atoi(argv[3]);

	for (int i = 0; i < NR_CPUS; i++) {
		if (glue(i) == 0)
			break;
	}

	std::vector<std::thread> T;
	T.reserve(N);
	for (int i = 0; i < N; i++) {
		T.emplace_back(f, i);
	}

	auto t0 = std::chrono::system_clock::now();
	{
		*(volatile int *)&xxx = 1;
		for (auto& t: T) {
			t.join();
		}
	}
	auto t1 = std::chrono::system_clock::now();
	std::chrono::duration<double> dt = t1 - t0;
	std::cout << dt.count() << '\n';

	return 0;
}

P.S.:
An explicit randomization marker is added because adding a member that is
not a function pointer would otherwise silently disable structure layout
randomization.
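
A minimal illustration of that rule (assumed behavior of the randstruct
plugin, not code from this patch):

	struct ops_all_fptrs {		/* randomized automatically */
		int (*open)(void);
		int (*release)(void);
	};

	struct ops_mixed {		/* needs the explicit marker */
		unsigned int flags;
		int (*open)(void);
	} __randomize_layout;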

Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/cpuinfo.c       |    1 
 fs/proc/generic.c       |   33 ++++++
 fs/proc/inode.c         |  187 +++++++++++++++++++++++++++-----------
 fs/proc/internal.h      |    6 +
 fs/proc/kmsg.c          |    1 
 fs/proc/stat.c          |    1 
 include/linux/proc_fs.h |   17 +++
 ipc/util.c              |    1 
 kernel/module.c         |    1 
 mm/slab_common.c        |    1 
 mm/swapfile.c           |    1 
 11 files changed, 196 insertions(+), 54 deletions(-)

--- a/fs/proc/cpuinfo.c~proc-faster-open-read-close-with-permanent-files
+++ a/fs/proc/cpuinfo.c
@@ -17,6 +17,7 @@ static int cpuinfo_open(struct inode *in
 }
 
 static const struct proc_ops cpuinfo_proc_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
 	.proc_open	= cpuinfo_open,
 	.proc_read	= seq_read,
 	.proc_lseek	= seq_lseek,
--- a/fs/proc/generic.c~proc-faster-open-read-close-with-permanent-files
+++ a/fs/proc/generic.c
@@ -531,6 +531,13 @@ struct proc_dir_entry *proc_create_reg(c
 	return p;
 }
 
+static inline void pde_set_flags(struct proc_dir_entry *pde)
+{
+	if (pde->proc_ops->proc_flags & PROC_ENTRY_PERMANENT) {
+		pde->flags |= PROC_ENTRY_PERMANENT;
+	}
+}
+
 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
 		struct proc_dir_entry *parent,
 		const struct proc_ops *proc_ops, void *data)
@@ -541,6 +548,7 @@ struct proc_dir_entry *proc_create_data(
 	if (!p)
 		return NULL;
 	p->proc_ops = proc_ops;
+	pde_set_flags(p);
 	return proc_register(parent, p);
 }
 EXPORT_SYMBOL(proc_create_data);
@@ -572,6 +580,7 @@ static int proc_seq_release(struct inode
 }
 
 static const struct proc_ops proc_seq_ops = {
+	/* not permanent -- can call into arbitrary seq_operations */
 	.proc_open	= proc_seq_open,
 	.proc_read	= seq_read,
 	.proc_lseek	= seq_lseek,
@@ -602,6 +611,7 @@ static int proc_single_open(struct inode
 }
 
 static const struct proc_ops proc_single_ops = {
+	/* not permanent -- can call into arbitrary ->single_show */
 	.proc_open	= proc_single_open,
 	.proc_read	= seq_read,
 	.proc_lseek	= seq_lseek,
@@ -662,9 +672,14 @@ void remove_proc_entry(const char *name,
 
 	de = pde_subdir_find(parent, fn, len);
 	if (de) {
-		rb_erase(&de->subdir_node, &parent->subdir);
-		if (S_ISDIR(de->mode)) {
-			parent->nlink--;
+		if (unlikely(pde_is_permanent(de))) {
+			WARN(1, "removing permanent /proc entry '%s'", de->name);
+			de = NULL;
+		} else {
+			rb_erase(&de->subdir_node, &parent->subdir);
+			if (S_ISDIR(de->mode)) {
+				parent->nlink--;
+			}
 		}
 	}
 	write_unlock(&proc_subdir_lock);
@@ -700,12 +715,24 @@ int remove_proc_subtree(const char *name
 		write_unlock(&proc_subdir_lock);
 		return -ENOENT;
 	}
+	if (unlikely(pde_is_permanent(root))) {
+		write_unlock(&proc_subdir_lock);
+		WARN(1, "removing permanent /proc entry '%s/%s'",
+			root->parent->name, root->name);
+		return -EINVAL;
+	}
 	rb_erase(&root->subdir_node, &parent->subdir);
 
 	de = root;
 	while (1) {
 		next = pde_subdir_first(de);
 		if (next) {
+			if (unlikely(pde_is_permanent(root))) {
+				write_unlock(&proc_subdir_lock);
+				WARN(1, "removing permanent /proc entry '%s/%s'",
+					next->parent->name, next->name);
+				return -EINVAL;
+			}
 			rb_erase(&next->subdir_node, &de->subdir);
 			de = next;
 			continue;
--- a/fs/proc/inode.c~proc-faster-open-read-close-with-permanent-files
+++ a/fs/proc/inode.c
@@ -195,135 +195,204 @@ void proc_entry_rundown(struct proc_dir_
 	spin_unlock(&de->pde_unload_lock);
 }
 
+static loff_t pde_lseek(struct proc_dir_entry *pde, struct file *file, loff_t offset, int whence)
+{
+	typeof_member(struct proc_ops, proc_lseek) lseek;
+
+	lseek = pde->proc_ops->proc_lseek;
+	if (!lseek)
+		lseek = default_llseek;
+	return lseek(file, offset, whence);
+}
+
 static loff_t proc_reg_llseek(struct file *file, loff_t offset, int whence)
 {
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	loff_t rv = -EINVAL;
-	if (use_pde(pde)) {
-		typeof_member(struct proc_ops, proc_lseek) lseek;
 
-		lseek = pde->proc_ops->proc_lseek;
-		if (!lseek)
-			lseek = default_llseek;
-		rv = lseek(file, offset, whence);
+	if (pde_is_permanent(pde)) {
+		return pde_lseek(pde, file, offset, whence);
+	} else if (use_pde(pde)) {
+		rv = pde_lseek(pde, file, offset, whence);
 		unuse_pde(pde);
 	}
 	return rv;
 }
 
+static ssize_t pde_read(struct proc_dir_entry *pde, struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	typeof_member(struct proc_ops, proc_read) read;
+
+	read = pde->proc_ops->proc_read;
+	if (read)
+		return read(file, buf, count, ppos);
+	return -EIO;
+}
+
 static ssize_t proc_reg_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 {
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	ssize_t rv = -EIO;
-	if (use_pde(pde)) {
-		typeof_member(struct proc_ops, proc_read) read;
 
-		read = pde->proc_ops->proc_read;
-		if (read)
-			rv = read(file, buf, count, ppos);
+	if (pde_is_permanent(pde)) {
+		return pde_read(pde, file, buf, count, ppos);
+	} else if (use_pde(pde)) {
+		rv = pde_read(pde, file, buf, count, ppos);
 		unuse_pde(pde);
 	}
 	return rv;
 }
 
+static ssize_t pde_write(struct proc_dir_entry *pde, struct file *file, const char __user *buf, size_t count, loff_t *ppos)
+{
+	typeof_member(struct proc_ops, proc_write) write;
+
+	write = pde->proc_ops->proc_write;
+	if (write)
+		return write(file, buf, count, ppos);
+	return -EIO;
+}
+
 static ssize_t proc_reg_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
 {
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	ssize_t rv = -EIO;
-	if (use_pde(pde)) {
-		typeof_member(struct proc_ops, proc_write) write;
 
-		write = pde->proc_ops->proc_write;
-		if (write)
-			rv = write(file, buf, count, ppos);
+	if (pde_is_permanent(pde)) {
+		return pde_write(pde, file, buf, count, ppos);
+	} else if (use_pde(pde)) {
+		rv = pde_write(pde, file, buf, count, ppos);
 		unuse_pde(pde);
 	}
 	return rv;
 }
 
+static __poll_t pde_poll(struct proc_dir_entry *pde, struct file *file, struct poll_table_struct *pts)
+{
+	typeof_member(struct proc_ops, proc_poll) poll;
+
+	poll = pde->proc_ops->proc_poll;
+	if (poll)
+		return poll(file, pts);
+	return DEFAULT_POLLMASK;
+}
+
 static __poll_t proc_reg_poll(struct file *file, struct poll_table_struct *pts)
 {
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	__poll_t rv = DEFAULT_POLLMASK;
-	if (use_pde(pde)) {
-		typeof_member(struct proc_ops, proc_poll) poll;
 
-		poll = pde->proc_ops->proc_poll;
-		if (poll)
-			rv = poll(file, pts);
+	if (pde_is_permanent(pde)) {
+		return pde_poll(pde, file, pts);
+	} else if (use_pde(pde)) {
+		rv = pde_poll(pde, file, pts);
 		unuse_pde(pde);
 	}
 	return rv;
 }
 
+static long pde_ioctl(struct proc_dir_entry *pde, struct file *file, unsigned int cmd, unsigned long arg)
+{
+	typeof_member(struct proc_ops, proc_ioctl) ioctl;
+
+	ioctl = pde->proc_ops->proc_ioctl;
+	if (ioctl)
+		return ioctl(file, cmd, arg);
+	return -ENOTTY;
+}
+
 static long proc_reg_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	long rv = -ENOTTY;
-	if (use_pde(pde)) {
-		typeof_member(struct proc_ops, proc_ioctl) ioctl;
 
-		ioctl = pde->proc_ops->proc_ioctl;
-		if (ioctl)
-			rv = ioctl(file, cmd, arg);
+	if (pde_is_permanent(pde)) {
+		return pde_ioctl(pde, file, cmd, arg);
+	} else if (use_pde(pde)) {
+		rv = pde_ioctl(pde, file, cmd, arg);
 		unuse_pde(pde);
 	}
 	return rv;
 }
 
 #ifdef CONFIG_COMPAT
+static long pde_compat_ioctl(struct proc_dir_entry *pde, struct file *file, unsigned int cmd, unsigned long arg)
+{
+	typeof_member(struct proc_ops, proc_compat_ioctl) compat_ioctl;
+
+	compat_ioctl = pde->proc_ops->proc_compat_ioctl;
+	if (compat_ioctl)
+		return compat_ioctl(file, cmd, arg);
+	return -ENOTTY;
+}
+
 static long proc_reg_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	long rv = -ENOTTY;
-	if (use_pde(pde)) {
-		typeof_member(struct proc_ops, proc_compat_ioctl) compat_ioctl;
-
-		compat_ioctl = pde->proc_ops->proc_compat_ioctl;
-		if (compat_ioctl)
-			rv = compat_ioctl(file, cmd, arg);
+	if (pde_is_permanent(pde)) {
+		return pde_compat_ioctl(pde, file, cmd, arg);
+	} else if (use_pde(pde)) {
+		rv = pde_compat_ioctl(pde, file, cmd, arg);
 		unuse_pde(pde);
 	}
 	return rv;
 }
 #endif
 
+static int pde_mmap(struct proc_dir_entry *pde, struct file *file, struct vm_area_struct *vma)
+{
+	typeof_member(struct proc_ops, proc_mmap) mmap;
+
+	mmap = pde->proc_ops->proc_mmap;
+	if (mmap)
+		return mmap(file, vma);
+	return -EIO;
+}
+
 static int proc_reg_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct proc_dir_entry *pde = PDE(file_inode(file));
 	int rv = -EIO;
-	if (use_pde(pde)) {
-		typeof_member(struct proc_ops, proc_mmap) mmap;
 
-		mmap = pde->proc_ops->proc_mmap;
-		if (mmap)
-			rv = mmap(file, vma);
+	if (pde_is_permanent(pde)) {
+		return pde_mmap(pde, file, vma);
+	} else if (use_pde(pde)) {
+		rv = pde_mmap(pde, file, vma);
 		unuse_pde(pde);
 	}
 	return rv;
 }
 
 static unsigned long
-proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
+pde_get_unmapped_area(struct proc_dir_entry *pde, struct file *file, unsigned long orig_addr,
 			   unsigned long len, unsigned long pgoff,
 			   unsigned long flags)
 {
-	struct proc_dir_entry *pde = PDE(file_inode(file));
-	unsigned long rv = -EIO;
-
-	if (use_pde(pde)) {
-		typeof_member(struct proc_ops, proc_get_unmapped_area) get_area;
+	typeof_member(struct proc_ops, proc_get_unmapped_area) get_area;
 
-		get_area = pde->proc_ops->proc_get_unmapped_area;
+	get_area = pde->proc_ops->proc_get_unmapped_area;
 #ifdef CONFIG_MMU
-		if (!get_area)
-			get_area = current->mm->get_unmapped_area;
+	if (!get_area)
+		get_area = current->mm->get_unmapped_area;
 #endif
+	if (get_area)
+		return get_area(file, orig_addr, len, pgoff, flags);
+	return orig_addr;
+}
+
+static unsigned long
+proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
+			   unsigned long len, unsigned long pgoff,
+			   unsigned long flags)
+{
+	struct proc_dir_entry *pde = PDE(file_inode(file));
+	unsigned long rv = -EIO;
 
-		if (get_area)
-			rv = get_area(file, orig_addr, len, pgoff, flags);
-		else
-			rv = orig_addr;
+	if (pde_is_permanent(pde)) {
+		return pde_get_unmapped_area(pde, file, orig_addr, len, pgoff, flags);
+	} else if (use_pde(pde)) {
+		rv = pde_get_unmapped_area(pde, file, orig_addr, len, pgoff, flags);
 		unuse_pde(pde);
 	}
 	return rv;
@@ -337,6 +406,13 @@ static int proc_reg_open(struct inode *i
 	typeof_member(struct proc_ops, proc_release) release;
 	struct pde_opener *pdeo;
 
+	if (pde_is_permanent(pde)) {
+		open = pde->proc_ops->proc_open;
+		if (open)
+			rv = open(inode, file);
+		return rv;
+	}
+
 	/*
 	 * Ensure that
 	 * 1) PDE's ->release hook will be called no matter what
@@ -386,6 +462,17 @@ static int proc_reg_release(struct inode
 {
 	struct proc_dir_entry *pde = PDE(inode);
 	struct pde_opener *pdeo;
+
+	if (pde_is_permanent(pde)) {
+		typeof_member(struct proc_ops, proc_release) release;
+
+		release = pde->proc_ops->proc_release;
+		if (release) {
+			return release(inode, file);
+		}
+		return 0;
+	}
+
 	spin_lock(&pde->pde_unload_lock);
 	list_for_each_entry(pdeo, &pde->pde_openers, lh) {
 		if (pdeo->file == file) {
--- a/fs/proc/internal.h~proc-faster-open-read-close-with-permanent-files
+++ a/fs/proc/internal.h
@@ -61,6 +61,7 @@ struct proc_dir_entry {
 	struct rb_node subdir_node;
 	char *name;
 	umode_t mode;
+	u8 flags;
 	u8 namelen;
 	char inline_name[];
 } __randomize_layout;
@@ -73,6 +74,11 @@ struct proc_dir_entry {
 	0)
 #define SIZEOF_PDE_INLINE_NAME (SIZEOF_PDE - sizeof(struct proc_dir_entry))
 
+static inline bool pde_is_permanent(const struct proc_dir_entry *pde)
+{
+	return pde->flags & PROC_ENTRY_PERMANENT;
+}
+
 extern struct kmem_cache *proc_dir_entry_cache;
 void pde_free(struct proc_dir_entry *pde);
 
--- a/fs/proc/kmsg.c~proc-faster-open-read-close-with-permanent-files
+++ a/fs/proc/kmsg.c
@@ -50,6 +50,7 @@ static __poll_t kmsg_poll(struct file *f
 
 
 static const struct proc_ops kmsg_proc_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
 	.proc_read	= kmsg_read,
 	.proc_poll	= kmsg_poll,
 	.proc_open	= kmsg_open,
--- a/fs/proc/stat.c~proc-faster-open-read-close-with-permanent-files
+++ a/fs/proc/stat.c
@@ -224,6 +224,7 @@ static int stat_open(struct inode *inode
 }
 
 static const struct proc_ops stat_proc_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
 	.proc_open	= stat_open,
 	.proc_read	= seq_read,
 	.proc_lseek	= seq_lseek,
--- a/include/linux/proc_fs.h~proc-faster-open-read-close-with-permanent-files
+++ a/include/linux/proc_fs.h
@@ -5,6 +5,7 @@
 #ifndef _LINUX_PROC_FS_H
 #define _LINUX_PROC_FS_H
 
+#include <linux/compiler.h>
 #include <linux/types.h>
 #include <linux/fs.h>
 
@@ -12,7 +13,21 @@ struct proc_dir_entry;
 struct seq_file;
 struct seq_operations;
 
+enum {
+	/*
+	 * All /proc entries using this ->proc_ops instance are never removed.
+	 *
+	 * If in doubt, ignore this flag.
+	 */
+#ifdef MODULE
+	PROC_ENTRY_PERMANENT = 0U,
+#else
+	PROC_ENTRY_PERMANENT = 1U << 0,
+#endif
+};
+
 struct proc_ops {
+	unsigned int proc_flags;
 	int	(*proc_open)(struct inode *, struct file *);
 	ssize_t	(*proc_read)(struct file *, char __user *, size_t, loff_t *);
 	ssize_t	(*proc_write)(struct file *, const char __user *, size_t, loff_t *);
@@ -25,7 +40,7 @@ struct proc_ops {
 #endif
 	int	(*proc_mmap)(struct file *, struct vm_area_struct *);
 	unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
-};
+} __randomize_layout;
 
 #ifdef CONFIG_PROC_FS
 
--- a/ipc/util.c~proc-faster-open-read-close-with-permanent-files
+++ a/ipc/util.c
@@ -885,6 +885,7 @@ static int sysvipc_proc_release(struct i
 }
 
 static const struct proc_ops sysvipc_proc_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
 	.proc_open	= sysvipc_proc_open,
 	.proc_read	= seq_read,
 	.proc_lseek	= seq_lseek,
--- a/kernel/module.c~proc-faster-open-read-close-with-permanent-files
+++ a/kernel/module.c
@@ -4355,6 +4355,7 @@ static int modules_open(struct inode *in
 }
 
 static const struct proc_ops modules_proc_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
 	.proc_open	= modules_open,
 	.proc_read	= seq_read,
 	.proc_lseek	= seq_lseek,
--- a/mm/slab_common.c~proc-faster-open-read-close-with-permanent-files
+++ a/mm/slab_common.c
@@ -1581,6 +1581,7 @@ static int slabinfo_open(struct inode *i
 }
 
 static const struct proc_ops slabinfo_proc_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
 	.proc_open	= slabinfo_open,
 	.proc_read	= seq_read,
 	.proc_write	= slabinfo_write,
--- a/mm/swapfile.c~proc-faster-open-read-close-with-permanent-files
+++ a/mm/swapfile.c
@@ -2797,6 +2797,7 @@ static int swaps_open(struct inode *inod
 }
 
 static const struct proc_ops swaps_proc_ops = {
+	.proc_flags	= PROC_ENTRY_PERMANENT,
 	.proc_open	= swaps_open,
 	.proc_read	= seq_read,
 	.proc_lseek	= seq_lseek,
_

Patches currently in -mm which might be from adobriyan@gmail.com are

ramfs-support-o_tmpfile.patch
proc-faster-open-read-close-with-permanent-files.patch
elf-delete-loc-variable.patch
elf-allocate-less-for-static-executable.patch
elf-dont-free-interpreters-elf-pheaders-on-common-path.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (178 preceding siblings ...)
  2020-02-24  3:20 ` + proc-faster-open-read-close-with-permanent-files.patch " Andrew Morton
@ 2020-02-24  3:24 ` Andrew Morton
  2020-02-24  3:25 ` + psi-move-pf_memstall-into-psi-specific-psi_flags.patch " Andrew Morton
                   ` (58 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  3:24 UTC (permalink / raw)
  To: adobriyan, akpm, mm-commits


The patch titled
     Subject: proc-faster-open-read-close-with-permanent-files-checkpatch-fixes
has been added to the -mm tree.  Its filename is
     proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Andrew Morton <akpm@linux-foundation.org>
Subject: proc-faster-open-read-close-with-permanent-files-checkpatch-fixes

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#12: 
Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such

WARNING: braces {} are not necessary for single statement blocks
#163: FILE: fs/proc/generic.c:536:
+	if (pde->proc_ops->proc_flags & PROC_ENTRY_PERMANENT) {
+		pde->flags |= PROC_ENTRY_PERMANENT;
+	}

WARNING: line over 80 characters
#203: FILE: fs/proc/generic.c:676:
+			WARN(1, "removing permanent /proc entry '%s'", de->name);

WARNING: braces {} are not necessary for single statement blocks
#207: FILE: fs/proc/generic.c:680:
+			if (S_ISDIR(de->mode)) {
+				parent->nlink--;
+			}

WARNING: line over 80 characters
#244: FILE: fs/proc/inode.c:198:
+static loff_t pde_lseek(struct proc_dir_entry *pde, struct file *file, loff_t offset, int whence)

WARNING: line over 80 characters
#274: FILE: fs/proc/inode.c:222:
+static ssize_t pde_read(struct proc_dir_entry *pde, struct file *file, char __user *buf, size_t count, loff_t *ppos)

WARNING: line over 80 characters
#303: FILE: fs/proc/inode.c:246:
+static ssize_t pde_write(struct proc_dir_entry *pde, struct file *file, const char __user *buf, size_t count, loff_t *ppos)

WARNING: line over 80 characters
#332: FILE: fs/proc/inode.c:270:
+static __poll_t pde_poll(struct proc_dir_entry *pde, struct file *file, struct poll_table_struct *pts)

WARNING: line over 80 characters
#361: FILE: fs/proc/inode.c:294:
+static long pde_ioctl(struct proc_dir_entry *pde, struct file *file, unsigned int cmd, unsigned long arg)

WARNING: line over 80 characters
#391: FILE: fs/proc/inode.c:319:
+static long pde_compat_ioctl(struct proc_dir_entry *pde, struct file *file, unsigned int cmd, unsigned long arg)

WARNING: line over 80 characters
#421: FILE: fs/proc/inode.c:343:
+static int pde_mmap(struct proc_dir_entry *pde, struct file *file, struct vm_area_struct *vma)

WARNING: line over 80 characters
#452: FILE: fs/proc/inode.c:368:
+pde_get_unmapped_area(struct proc_dir_entry *pde, struct file *file, unsigned long orig_addr,

WARNING: line over 80 characters
#489: FILE: fs/proc/inode.c:393:
+		return pde_get_unmapped_area(pde, file, orig_addr, len, pgoff, flags);

WARNING: line over 80 characters
#491: FILE: fs/proc/inode.c:395:
+		rv = pde_get_unmapped_area(pde, file, orig_addr, len, pgoff, flags);

WARNING: braces {} are not necessary for single statement blocks
#518: FILE: fs/proc/inode.c:470:
+		if (release) {
+			return release(inode, file);
+		}

total: 0 errors, 15 warnings, 462 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/proc-faster-open-read-close-with-permanent-files.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/generic.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/fs/proc/generic.c~proc-faster-open-read-close-with-permanent-files-checkpatch-fixes
+++ a/fs/proc/generic.c
@@ -533,9 +533,8 @@ struct proc_dir_entry *proc_create_reg(c
 
 static inline void pde_set_flags(struct proc_dir_entry *pde)
 {
-	if (pde->proc_ops->proc_flags & PROC_ENTRY_PERMANENT) {
+	if (pde->proc_ops->proc_flags & PROC_ENTRY_PERMANENT)
 		pde->flags |= PROC_ENTRY_PERMANENT;
-	}
 }
 
 struct proc_dir_entry *proc_create_data(const char *name, umode_t mode,
@@ -677,9 +676,8 @@ void remove_proc_entry(const char *name,
 			de = NULL;
 		} else {
 			rb_erase(&de->subdir_node, &parent->subdir);
-			if (S_ISDIR(de->mode)) {
+			if (S_ISDIR(de->mode))
 				parent->nlink--;
-			}
 		}
 	}
 	write_unlock(&proc_subdir_lock);
_

Patches currently in -mm which might be from akpm@linux-foundation.org are

mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa-fix.patch
mm.patch
mm-add-vm_insert_pages-fix.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
selftest-add-mremap_dontunmap-selftest-fix.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings-fix.patch
hugetlb_cgroup-add-accounting-for-shared-mappings-fix.patch
mm-migratec-migrate-pg_readahead-flag-fix.patch
proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch
linux-next-rejects.patch
linux-next-fix.patch
drivers-tty-serial-sh-scic-suppress-warning.patch
kernel-forkc-export-kernel_thread-to-modules.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + psi-move-pf_memstall-into-psi-specific-psi_flags.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (179 preceding siblings ...)
  2020-02-24  3:24 ` + proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch " Andrew Morton
@ 2020-02-24  3:25 ` Andrew Morton
  2020-02-24  3:29 ` + mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch " Andrew Morton
                   ` (57 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  3:25 UTC (permalink / raw)
  To: bsegall, dietmar.eggemann, hannes, juri.lelli, laoar.shao,
	mgorman, mingo, mm-commits, peterz, rostedt, vincent.guittot


The patch titled
     Subject: psi: move PF_MEMSTALL into psi specific psi_flags
has been added to the -mm tree.  Its filename is
     psi-move-pf_memstall-into-psi-specific-psi_flags.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/psi-move-pf_memstall-into-psi-specific-psi_flags.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/psi-move-pf_memstall-into-psi-specific-psi_flags.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Yafang Shao <laoar.shao@gmail.com>
Subject: psi: move PF_MEMSTALL into psi specific psi_flags

task->flags is a 32-bit field, of which 31 bits have already been
consumed, so it is hard to introduce new per-process flags.  As there is
already a psi-specific field, psi_flags, move the psi-specific per-process
flag PF_MEMSTALL into it.
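
A small sketch of how the split field is meant to be read after this
change (the helper names here are illustrative, not part of the patch):

	/* low PSI_TSK_STATE_BITS bits: PSI task states; high bits:
	 * per-process flags such as PSI_PF_MEMSTALL */
	static inline bool task_memstall(const struct task_struct *p)
	{
		return p->psi_flags & PSI_PF_MEMSTALL;
	}

	static inline unsigned int task_psi_states(const struct task_struct *p)
	{
		return p->psi_flags & PSI_TSK_STATE_MASK;
	}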

Link: http://lkml.kernel.org/r/20200222144647.10120-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/psi_types.h |   12 +++++++++++-
 include/linux/sched.h     |    7 +++++--
 kernel/sched/psi.c        |   15 ++++++++-------
 kernel/sched/stats.h      |   10 +++++-----
 4 files changed, 29 insertions(+), 15 deletions(-)

--- a/include/linux/psi_types.h~psi-move-pf_memstall-into-psi-specific-psi_flags
+++ a/include/linux/psi_types.h
@@ -17,11 +17,21 @@ enum psi_task_count {
 	NR_PSI_TASK_COUNTS = 3,
 };
 
-/* Task state bitmasks */
+/*
+ * Task state bitmasks:
+ * These flags are stored in the lower PSI_TSK_STATE_BITS bits of
+ * task->psi_flags; the higher bits hold per-process flags, which
+ * persist across sleeps.
+ */
+#define PSI_TSK_STATE_BITS 16
+#define PSI_TSK_STATE_MASK ((1 << PSI_TSK_STATE_BITS) - 1)
 #define TSK_IOWAIT	(1 << NR_IOWAIT)
 #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
 #define TSK_RUNNING	(1 << NR_RUNNING)
 
+/* Stalled due to lack of memory; this is a per-process flag. */
+#define PSI_PF_MEMSTALL (1 << PSI_TSK_STATE_BITS)
+
 /* Resources that workloads could be stalled on */
 enum psi_res {
 	PSI_IO,
--- a/include/linux/sched.h~psi-move-pf_memstall-into-psi-specific-psi_flags
+++ a/include/linux/sched.h
@@ -1023,7 +1023,11 @@ struct task_struct {
 
 	struct task_io_accounting	ioac;
 #ifdef CONFIG_PSI
-	/* Pressure stall state */
+	/*
+	 * Pressure stall state:
+	 * Bits 0 ~ PSI_TSK_STATE_BITS-1: PSI task states
+	 * Bits PSI_TSK_STATE_BITS ~ 31: Per process flags
+	 */
 	unsigned int			psi_flags;
 #endif
 #ifdef CONFIG_TASK_XACCT
@@ -1477,7 +1481,6 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
-#define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
 #define PF_UMH			0x02000000	/* I'm an Usermodehelper process */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
--- a/kernel/sched/psi.c~psi-move-pf_memstall-into-psi-specific-psi_flags
+++ a/kernel/sched/psi.c
@@ -759,7 +759,8 @@ void psi_task_change(struct task_struct
 	    !psi_bug) {
 		printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
 				task->pid, task->comm, cpu,
-				task->psi_flags, clear, set);
+				task->psi_flags & PSI_TSK_STATE_MASK,
+				clear, set);
 		psi_bug = 1;
 	}
 
@@ -818,17 +819,17 @@ void psi_memstall_enter(unsigned long *f
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	*flags = current->flags & PF_MEMSTALL;
+	*flags = current->psi_flags & PSI_PF_MEMSTALL;
 	if (*flags)
 		return;
 	/*
-	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
+	 * PSI_PF_MEMSTALL setting & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we can
 	 * race with CPU migration.
 	 */
 	rq = this_rq_lock_irq(&rf);
 
-	current->flags |= PF_MEMSTALL;
+	current->psi_flags |= PSI_PF_MEMSTALL;
 	psi_task_change(current, 0, TSK_MEMSTALL);
 
 	rq_unlock_irq(rq, &rf);
@@ -851,13 +852,13 @@ void psi_memstall_leave(unsigned long *f
 	if (*flags)
 		return;
 	/*
-	 * PF_MEMSTALL clearing & accounting needs to be atomic wrt
+	 * PSI_PF_MEMSTALL clearing & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we could
 	 * race with CPU migration.
 	 */
 	rq = this_rq_lock_irq(&rf);
 
-	current->flags &= ~PF_MEMSTALL;
+	current->psi_flags &= ~PSI_PF_MEMSTALL;
 	psi_task_change(current, TSK_MEMSTALL, 0);
 
 	rq_unlock_irq(rq, &rf);
@@ -921,7 +922,7 @@ void cgroup_move_task(struct task_struct
 	else if (task->in_iowait)
 		task_flags = TSK_IOWAIT;
 
-	if (task->flags & PF_MEMSTALL)
+	if (task->psi_flags & PSI_PF_MEMSTALL)
 		task_flags |= TSK_MEMSTALL;
 
 	if (task_flags)
--- a/kernel/sched/stats.h~psi-move-pf_memstall-into-psi-specific-psi_flags
+++ a/kernel/sched/stats.h
@@ -70,7 +70,7 @@ static inline void psi_enqueue(struct ta
 		return;
 
 	if (!wakeup || p->sched_psi_wake_requeue) {
-		if (p->flags & PF_MEMSTALL)
+		if (p->psi_flags & PSI_PF_MEMSTALL)
 			set |= TSK_MEMSTALL;
 		if (p->sched_psi_wake_requeue)
 			p->sched_psi_wake_requeue = 0;
@@ -90,7 +90,7 @@ static inline void psi_dequeue(struct ta
 		return;
 
 	if (!sleep) {
-		if (p->flags & PF_MEMSTALL)
+		if (p->psi_flags & PSI_PF_MEMSTALL)
 			clear |= TSK_MEMSTALL;
 	} else {
 		if (p->in_iowait)
@@ -109,14 +109,14 @@ static inline void psi_ttwu_dequeue(stru
 	 * deregister its sleep-persistent psi states from the old
 	 * queue, and let psi_enqueue() know it has to requeue.
 	 */
-	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
+	if (unlikely(p->in_iowait || (p->psi_flags & PSI_PF_MEMSTALL))) {
 		struct rq_flags rf;
 		struct rq *rq;
 		int clear = 0;
 
 		if (p->in_iowait)
 			clear |= TSK_IOWAIT;
-		if (p->flags & PF_MEMSTALL)
+		if (p->psi_flags & PSI_PF_MEMSTALL)
 			clear |= TSK_MEMSTALL;
 
 		rq = __task_rq_lock(p, &rf);
@@ -131,7 +131,7 @@ static inline void psi_task_tick(struct
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (unlikely(rq->curr->flags & PF_MEMSTALL))
+	if (unlikely(rq->curr->psi_flags & PSI_PF_MEMSTALL))
 		psi_memstall_tick(rq->curr, cpu_of(rq));
 }
 #else /* CONFIG_PSI */
_

Patches currently in -mm which might be from laoar.shao@gmail.com are

mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch
psi-move-pf_memstall-into-psi-specific-psi_flags.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (180 preceding siblings ...)
  2020-02-24  3:25 ` + psi-move-pf_memstall-into-psi-specific-psi_flags.patch " Andrew Morton
@ 2020-02-24  3:29 ` Andrew Morton
  2020-02-24  3:31 ` + ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch " Andrew Morton
                   ` (56 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  3:29 UTC (permalink / raw)
  To: longpeng2, mike.kravetz, mm-commits, sean.j.christopherson,
	stable, willy


The patch titled
     Subject: mm/hugetlb.c: fix a addressing exception caused by huge_pte_offset()
has been added to the -mm tree.  Its filename is
     mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Longpeng <longpeng2@huawei.com>
Subject: mm/hugetlb.c: fix a addressing exception caused by huge_pte_offset()

Our machine encountered a panic (addressing exception) after running for a
long time.  The call trace is:

RIP: 0010:[<ffffffff9dff0587>]  [<ffffffff9dff0587>] hugetlb_fault+0x307/0xbe0
RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff9df9b71b>] ? unlock_page+0x2b/0x30
 [<ffffffff9dff04a2>] ? hugetlb_fault+0x222/0xbe0
 [<ffffffff9dff1405>] follow_hugetlb_page+0x175/0x540
 [<ffffffff9e15b825>] ? cpumask_next_and+0x35/0x50
 [<ffffffff9dfc7230>] __get_user_pages+0x2a0/0x7e0
 [<ffffffff9dfc648d>] __get_user_pages_unlocked+0x15d/0x210
 [<ffffffffc068cfc5>] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
 [<ffffffffc06b28be>] try_async_pf+0x6e/0x2a0 [kvm]
 [<ffffffffc06b4b41>] tdp_page_fault+0x151/0x2d0 [kvm]
 [<ffffffffc075731c>] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel]
 [<ffffffffc0757328>] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel]
 [<ffffffffc06abc11>] kvm_mmu_page_fault+0x31/0x140 [kvm]
 [<ffffffffc074d1ae>] handle_ept_violation+0x9e/0x170 [kvm_intel]
 [<ffffffffc075579c>] vmx_handle_exit+0x2bc/0xc70 [kvm_intel]
 [<ffffffffc074f1a0>] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel]
 [<ffffffffc07574c0>] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel]
 [<ffffffffc069f3be>] vcpu_enter_guest+0x7be/0x13a0 [kvm]
 [<ffffffffc06cf53e>] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm]
 [<ffffffffc06a6f90>] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
 [<ffffffffc068d919>] kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
 [<ffffffff9deaa8c2>] ? dequeue_signal+0x32/0x180
 [<ffffffff9deae34d>] ? do_sigtimedwait+0xcd/0x230
 [<ffffffff9e03aed0>] do_vfs_ioctl+0x3f0/0x540
 [<ffffffff9e03b0c1>] SyS_ioctl+0xa1/0xc0
 [<ffffffff9e53879b>] system_call_fastpath+0x22/0x27

The kernel we used is older, but after digging into this problem we
believe the latest kernel also has this bug.

For 1G hugepages, huge_pte_offset() wants to return NULL or a pudp, but
it may return a wrong pmdp if there is a race.  Please look at the
following code snippet:

    ...
    pud = pud_offset(p4d, addr);
    if (sz != PUD_SIZE && pud_none(*pud))
        return NULL;
    /* hugepage or swap? */
    if (pud_huge(*pud) || !pud_present(*pud))
        return (pte_t *)pud;

    pmd = pmd_offset(pud, addr);
    if (sz != PMD_SIZE && pmd_none(*pmd))
        return NULL;
    /* hugepage or swap? */
    if (pmd_huge(*pmd) || !pmd_present(*pmd))
        return (pte_t *)pmd;
    ...

The following sequence would trigger this bug:
1. CPU0: sz == PUD_SIZE and *pud == 0, so it continues
2. CPU0: "pud_huge(*pud)" is false
3. CPU1: calls hugetlb_no_page() and sets *pud to xxxx8e7 (PRESENT)
4. CPU0: "!pud_present(*pud)" is false, so it continues
5. CPU0: pmd = pmd_offset(pud, addr) and may return a wrong pmdp
However, we want CPU0 to return NULL or the pudp.

We can avoid this race by reading the pud only once.  What's more, we also
use READ_ONCE to access the entries for safety (i.e. to avoid compiler
mischief).
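
In other words, the fix is the usual snapshot pattern (condensed here from
the diff below): take one READ_ONCE() copy of the entry and test only that
copy, so a concurrent hugetlb_no_page() cannot change the value between
the two checks.

	pud_t pud_entry = READ_ONCE(*pud);

	if (sz != PUD_SIZE && pud_none(pud_entry))
		return NULL;
	/* both checks look at the same snapshot */
	if (pud_huge(pud_entry) || !pud_present(pud_entry))
		return (pte_t *)pud;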

Link: http://lkml.kernel.org/r/1582342427-230392-1-git-send-email-longpeng2@huawei.com
Signed-off-by: Longpeng <longpeng2@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset
+++ a/mm/hugetlb.c
@@ -4910,28 +4910,30 @@ pte_t *huge_pte_offset(struct mm_struct
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
+	pud_t *pud, pud_entry;
+	pmd_t *pmd, pmd_entry;
 
 	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
+	if (!pgd_present(READ_ONCE(*pgd)))
 		return NULL;
 	p4d = p4d_offset(pgd, addr);
-	if (!p4d_present(*p4d))
+	if (!p4d_present(READ_ONCE(*p4d)))
 		return NULL;
 
 	pud = pud_offset(p4d, addr);
-	if (sz != PUD_SIZE && pud_none(*pud))
+	pud_entry = READ_ONCE(*pud);
+	if (sz != PUD_SIZE && pud_none(pud_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pud_huge(*pud) || !pud_present(*pud))
+	if (pud_huge(pud_entry) || !pud_present(pud_entry))
 		return (pte_t *)pud;
 
 	pmd = pmd_offset(pud, addr);
-	if (sz != PMD_SIZE && pmd_none(*pmd))
+	pmd_entry = READ_ONCE(*pmd);
+	if (sz != PMD_SIZE && pmd_none(pmd_entry))
 		return NULL;
 	/* hugepage or swap? */
-	if (pmd_huge(*pmd) || !pmd_present(*pmd))
+	if (pmd_huge(pmd_entry) || !pmd_present(pmd_entry))
 		return (pte_t *)pmd;
 
 	return NULL;
_

Patches currently in -mm which might be from longpeng2@huawei.com are

mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (181 preceding siblings ...)
  2020-02-24  3:29 ` + mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch " Andrew Morton
@ 2020-02-24  3:31 ` Andrew Morton
  2020-02-24  3:31 ` + ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch " Andrew Morton
                   ` (55 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  3:31 UTC (permalink / raw)
  To: gechangwei, ghe, jiangqi903, jlbec, junxiao.bi, mark, mm-commits,
	piaojun, wangyan122


The patch titled
     Subject: ocfs2: there is no need to log twice in several functions
has been added to the -mm tree.  Its filename is
     ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: wangyan <wangyan122@huawei.com>
Subject: ocfs2: there is no need to log twice in several functions

There is no need to log twice in several functions.
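
The pattern being removed looks roughly like this (a sketch with a
hypothetical callee, not the exact ocfs2 code): the error was already
logged with mlog_errno() where it first occurred, so logging it again at
the cleanup label only duplicates the message.

	status = ocfs2_do_something(handle, bh);	/* hypothetical callee */
	if (status < 0) {
		mlog_errno(status);	/* logged where it first occurred */
		goto bail;
	}
bail:
	if (status < 0)
		mlog_errno(status);	/* duplicate log -- this is what gets removed */
	return status;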

Link: http://lkml.kernel.org/r/77eec86a-f634-5b98-4f7d-0cd15185a37b@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/alloc.c    |    1 -
 fs/ocfs2/suballoc.c |    5 -----
 2 files changed, 6 deletions(-)

--- a/fs/ocfs2/alloc.c~ocfs2-there-is-no-need-to-log-twice-in-several-functions
+++ a/fs/ocfs2/alloc.c
@@ -1060,7 +1060,6 @@ bail:
 			brelse(bhs[i]);
 			bhs[i] = NULL;
 		}
-		mlog_errno(status);
 	}
 	return status;
 }
--- a/fs/ocfs2/suballoc.c~ocfs2-there-is-no-need-to-log-twice-in-several-functions
+++ a/fs/ocfs2/suballoc.c
@@ -2509,9 +2509,6 @@ static int _ocfs2_free_suballoc_bits(han
 
 bail:
 	brelse(group_bh);

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (182 preceding siblings ...)
  2020-02-24  3:31 ` + ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch " Andrew Morton
@ 2020-02-24  3:31 ` Andrew Morton
  2020-02-24  3:46 ` + hugetlb-support-file_region-coalescing-again-fix-2.patch " Andrew Morton
                   ` (54 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  3:31 UTC (permalink / raw)
  To: gechangwei, ghe, jiangqi903, jlbec, junxiao.bi, mark, mm-commits,
	piaojun, wangyan122


The patch titled
     Subject: ocfs2: correct annotation from "l_next_rec" to "l_next_free_rec"
has been added to the -mm tree.  Its filename is
     ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: wangyan <wangyan122@huawei.com>
Subject: ocfs2: correct annotation from "l_next_rec" to "l_next_free_rec"

Correct the annotation from "l_next_rec" to "l_next_free_rec".

Link: http://lkml.kernel.org/r/5e76c953-3479-1280-023c-ad05e4c75608@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/ocfs2/alloc.c~ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec
+++ a/fs/ocfs2/alloc.c
@@ -3941,7 +3941,7 @@ rotate:
 	 * above.
 	 *
 	 * This leaf needs to have space, either by the empty 1st
-	 * extent record, or by virtue of an l_next_rec < l_count.
+	 * extent record, or by virtue of an l_next_free_rec < l_count.
 	 */
 	ocfs2_rotate_leaf(el, insert_rec);
 }
_

Patches currently in -mm which might be from wangyan122@huawei.com are

ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch
ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + hugetlb-support-file_region-coalescing-again-fix-2.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (183 preceding siblings ...)
  2020-02-24  3:31 ` + ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch " Andrew Morton
@ 2020-02-24  3:46 ` Andrew Morton
  2020-02-24  3:53 ` + fat-fix-uninit-memory-access-for-partial-initialized-inode.patch " Andrew Morton
                   ` (53 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  3:46 UTC (permalink / raw)
  To: almasrymina, mike.kravetz, mm-commits, rientjes, shakeelb


The patch titled
     Subject: hugetlb: remove check_coalesce_bug debug code
has been added to the -mm tree.  Its filename is
     hugetlb-support-file_region-coalescing-again-fix-2.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/hugetlb-support-file_region-coalescing-again-fix-2.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/hugetlb-support-file_region-coalescing-again-fix-2.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Mina Almasry <almasrymina@google.com>
Subject: hugetlb: remove check_coalesce_bug debug code

Commit b5f16a533ce8a ("hugetlb: support file_region coalescing again")
made changes to the resv_map code which are hard to test, so it added
debug code, guarded by CONFIG_DEBUG_VM, which performs an expensive
operation: it loops over the resv_map and checks it for errors.

Unfortunately, some distros have CONFIG_DEBUG_VM on in their default
kernels, and we don't want this debug code behind CONFIG_DEBUG_VM,
called each time a file region is added.  This patch removes the debug
code.  I may look into making it a test or keeping it for my local
testing.
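
For reference, a simplified sketch of the invariant the removed code
checked (stand-in types, not the real resv_map structures): adjacent
file_regions with the same uncharge info should already have been
merged, so finding such a pair in the list indicates a coalescing bug.

struct file_region {
	long from, to;
	void *reservation_counter;
	void *css;
};

/* true if rg and its successor nrg should have been coalesced */
static int should_have_coalesced(const struct file_region *rg,
				 const struct file_region *nrg)
{
	return nrg->reservation_counter &&
	       nrg->from == rg->to &&
	       nrg->reservation_counter == rg->reservation_counter &&
	       nrg->css == rg->css;
}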

Link: http://lkml.kernel.org/r/20200219233610.13808-1-almasrymina@google.com
Fixes: b5f16a533ce8a ("hugetlb: support file_region coalescing again")
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   43 -------------------------------------------
 1 file changed, 43 deletions(-)

--- a/mm/hugetlb.c~hugetlb-support-file_region-coalescing-again-fix-2
+++ a/mm/hugetlb.c
@@ -289,48 +289,6 @@ static bool has_same_uncharge_info(struc
 #endif
 }
 
-#if defined(CONFIG_DEBUG_VM) && defined(CONFIG_CGROUP_HUGETLB)
-static void dump_resv_map(struct resv_map *resv)
-{
-	struct list_head *head = &resv->regions;
-	struct file_region *rg = NULL;
-
-	pr_err("--------- start print resv_map ---------\n");
-	list_for_each_entry(rg, head, link) {
-		pr_err("rg->from=%ld, rg->to=%ld, rg->reservation_counter=%px, rg->css=%px\n",
-		       rg->from, rg->to, rg->reservation_counter, rg->css);
-	}
-	pr_err("--------- end print resv_map ---------\n");
-}
-
-/* Debug function to loop over the resv_map and make sure that coalescing is
- * working.
- */
-static void check_coalesce_bug(struct resv_map *resv)
-{
-	struct list_head *head = &resv->regions;
-	struct file_region *rg = NULL, *nrg = NULL;
-
-	list_for_each_entry(rg, head, link) {
-		nrg = list_next_entry(rg, link);
-
-		if (&nrg->link == head)
-			break;
-
-		if (nrg->reservation_counter && nrg->from == rg->to &&
-		    nrg->reservation_counter == rg->reservation_counter &&
-		    nrg->css == rg->css) {
-			dump_resv_map(resv);
-			VM_BUG_ON(true);
-		}
-	}
-}
-#else
-static void check_coalesce_bug(struct resv_map *resv)
-{
-}
-#endif
-
 static void coalesce_file_region(struct resv_map *resv, struct file_region *rg)
 {
 	struct file_region *nrg = NULL, *prg = NULL;
@@ -435,7 +393,6 @@ static long add_reservation_in_range(str
 	}
 
 	VM_BUG_ON(add < 0);
-	check_coalesce_bug(resv);
 	return add;
 }
 
_

Patches currently in -mm which might be from almasrymina@google.com are

hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
mm-hugetlb_cgroup-fix-hugetlb_cgroup-migration.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
hugetlb-disable-region_add-file_region-coalescing.patch
hugetlb-disable-region_add-file_region-coalescing-fix.patch
hugetlb_cgroup-add-accounting-for-shared-mappings.patch
hugetlb_cgroup-support-noreserve-mappings.patch
hugetlb-support-file_region-coalescing-again.patch
hugetlb-support-file_region-coalescing-again-fix.patch
hugetlb-support-file_region-coalescing-again-fix-2.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + fat-fix-uninit-memory-access-for-partial-initialized-inode.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (184 preceding siblings ...)
  2020-02-24  3:46 ` + hugetlb-support-file_region-coalescing-again-fix-2.patch " Andrew Morton
@ 2020-02-24  3:53 ` Andrew Morton
  2020-02-24  4:08 ` + mm-add-mremap_dontunmap-to-mremap-v7.patch " Andrew Morton
                   ` (52 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  3:53 UTC (permalink / raw)
  To: hirofumi, mm-commits, stable


The patch titled
     Subject: fat: fix uninit-memory access for partial initialized inode
has been added to the -mm tree.  Its filename is
     fat-fix-uninit-memory-access-for-partial-initialized-inode.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/fat-fix-uninit-memory-access-for-partial-initialized-inode.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/fat-fix-uninit-memory-access-for-partial-initialized-inode.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Subject: fat: fix uninit-memory access for partial initialized inode

When an error occurs in the middle of reading an inode, some fields in
the inode might still be uninitialized, and then the evict_inode path
may access those fields via iput().

To fix this, make sure the inode fields are initialized at allocation
time.
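
A stubbed userspace sketch of the hazard (simplified types, not the
real fat code): if filling the inode fails midway, iput() ends up in
the evict path reading fields that were never set; zeroing them at
allocation makes that path benign.

#include <stdio.h>
#include <stdlib.h>

struct fat_inode {	/* simplified stand-in for msdos_inode_info */
	unsigned long mmu_private;
	int i_start;
	int i_logstart;
	unsigned char i_attrs;
	long long i_pos;
};

static struct fat_inode *fat_alloc_inode(void)
{
	struct fat_inode *ei = malloc(sizeof(*ei));

	if (!ei)
		return NULL;
	/* Zeroing to allow iput() even if partially initialized inode */
	ei->mmu_private = 0;
	ei->i_start = 0;
	ei->i_logstart = 0;
	ei->i_attrs = 0;
	ei->i_pos = 0;
	return ei;
}

static void evict_inode(struct fat_inode *ei)
{
	/* the evict path reads these fields; without the zeroing above
	 * they would be uninitialized when filling the inode failed */
	printf("i_start=%d i_pos=%lld\n", ei->i_start, ei->i_pos);
	free(ei);
}

int main(void)
{
	struct fat_inode *ei = fat_alloc_inode();

	if (!ei)
		return 1;
	/* imagine the inode read failing here, before any field is set */
	evict_inode(ei);	/* safe thanks to the zeroing */
	return 0;
}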

Link: http://lkml.kernel.org/r/871rqnreqx.fsf@mail.parknet.co.jp
Reported-by: syzbot+9d82b8de2992579da5d0@syzkaller.appspotmail.com
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/fat/inode.c |   19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

--- a/fs/fat/inode.c~fat-fix-uninit-memory-access-for-partial-initialized-inode
+++ a/fs/fat/inode.c
@@ -750,6 +750,13 @@ static struct inode *fat_alloc_inode(str
 		return NULL;
 
 	init_rwsem(&ei->truncate_lock);
+	/* Zeroing to allow iput() even if partial initialized inode. */
+	ei->mmu_private = 0;
+	ei->i_start = 0;
+	ei->i_logstart = 0;
+	ei->i_attrs = 0;
+	ei->i_pos = 0;
+
 	return &ei->vfs_inode;
 }
 
@@ -1374,16 +1381,6 @@ out:
 	return 0;
 }
 
-static void fat_dummy_inode_init(struct inode *inode)
-{
-	/* Initialize this dummy inode to work as no-op. */
-	MSDOS_I(inode)->mmu_private = 0;
-	MSDOS_I(inode)->i_start = 0;
-	MSDOS_I(inode)->i_logstart = 0;
-	MSDOS_I(inode)->i_attrs = 0;
-	MSDOS_I(inode)->i_pos = 0;
-}

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-add-mremap_dontunmap-to-mremap-v7.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (185 preceding siblings ...)
  2020-02-24  3:53 ` + fat-fix-uninit-memory-access-for-partial-initialized-inode.patch " Andrew Morton
@ 2020-02-24  4:08 ` Andrew Morton
  2020-02-24  4:08 ` + selftest-add-mremap_dontunmap-selftest-v7.patch " Andrew Morton
                   ` (51 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  4:08 UTC (permalink / raw)
  To: aarcange, arnd, bgeffon, fweimer, joel, jsbarnes, kirill,
	lokeshgidra, luto, minchan, mm-commits, mst, natechancellor,
	sonnyrao, will, yuzhao


The patch titled
     Subject: mm-add-mremap_dontunmap-to-mremap-v7
has been added to the -mm tree.  Its filename is
     mm-add-mremap_dontunmap-to-mremap-v7.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-add-mremap_dontunmap-to-mremap-v7.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-add-mremap_dontunmap-to-mremap-v7.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Brian Geffon <bgeffon@google.com>
Subject: mm-add-mremap_dontunmap-to-mremap-v7

Don't allow resizing the VMA as part of MREMAP_DONTUNMAP.  There is no
clear use case for it at the moment; disallowing it simplifies the
implementation for now, and support can be added later.
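
A minimal usage sketch from userspace, assuming the semantics described
above (the MREMAP_DONTUNMAP value is taken from this patch series and
may not be in your libc headers yet):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4	/* assumed value from the patch series */
#endif

int main(void)
{
	size_t len = 5 * (size_t)sysconf(_SC_PAGESIZE);
	void *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *dst;

	if (src == MAP_FAILED)
		return 1;
	/* old_len must equal new_len; a resizing request now fails */
	dst = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
	if (dst == MAP_FAILED)
		perror("mremap");
	else
		printf("moved %p -> %p; source stays mapped but empty\n",
		       src, dst);
	return 0;
}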

Link: http://lkml.kernel.org/r/20200221174248.244748-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Tested-by: Lokesh Gidra <lokeshgidra@google.com>
Cc: "Michael S . Tsirkin" <mst@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Deacon <will@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mremap.c |   33 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 23 deletions(-)

--- a/mm/mremap.c~mm-add-mremap_dontunmap-to-mremap-v7
+++ a/mm/mremap.c
@@ -416,24 +416,10 @@ static unsigned long move_vma(struct vm_
 			vm_acct_memory(vma_pages(new_vma));
 		}
 
-		/*
-		 * locked_vm accounting: if the mapping remained the same size
-		 * it will have just moved and we don't need to touch locked_vm
-		 * because we skip the do_unmap. If the mapping shrunk before
-		 * being moved then the do_unmap on that portion will have
-		 * adjusted vm_locked. Only if the mapping grows do we need to
-		 * do something special; the reason is locked_vm only accounts
-		 * for old_len, but we're now adding new_len - old_len locked
-		 * bytes to the new mapping.
-		 */
-		if (vm_flags & VM_LOCKED && new_len > old_len) {
-			mm->locked_vm += (new_len - old_len) >> PAGE_SHIFT;
-			*locked = true;
-		}
-
 		/* We always clear VM_LOCKED[ONFAULT] on the old vma */
 		vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 
+		/* Because we won't unmap we don't need to touch locked_vm */
 		goto out;
 	}
 
@@ -588,13 +574,9 @@ static unsigned long mremap_to(unsigned
 		goto out;
 	}
 
-	/*
-	 * MREMAP_DONTUNMAP expands by new_len - (new_len - old_len), we will
-	 * check that we can expand by new_len and vma_to_resize will handle
-	 * the vma growing which is (new_len - old_len).
-	 */
+	/* MREMAP_DONTUNMAP expands by old_len since old_len == new_len */
 	if (flags & MREMAP_DONTUNMAP &&
-		!may_expand_vm(mm, vma->vm_flags, new_len >> PAGE_SHIFT)) {
+		!may_expand_vm(mm, vma->vm_flags, old_len >> PAGE_SHIFT)) {
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -670,10 +652,15 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 	if (flags & MREMAP_FIXED && !(flags & MREMAP_MAYMOVE))
 		return ret;
 
-	/* MREMAP_DONTUNMAP is always a move */
-	if (flags & MREMAP_DONTUNMAP && !(flags & MREMAP_MAYMOVE))
+	/*
+	 * MREMAP_DONTUNMAP is always a move and it does not allow resizing
+	 * in the process.
+	 */
+	if (flags & MREMAP_DONTUNMAP &&
+			(!(flags & MREMAP_MAYMOVE) || old_len != new_len))
 		return ret;
 
+
 	if (offset_in_page(addr))
 		return ret;
 
_

Patches currently in -mm which might be from bgeffon@google.com are

mm-add-mremap_dontunmap-to-mremap.patch
mm-add-mremap_dontunmap-to-mremap-v6.patch
mm-add-mremap_dontunmap-to-mremap-v7.patch
selftest-add-mremap_dontunmap-selftest.patch
selftest-add-mremap_dontunmap-selftest-v7.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + selftest-add-mremap_dontunmap-selftest-v7.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (186 preceding siblings ...)
  2020-02-24  4:08 ` + mm-add-mremap_dontunmap-to-mremap-v7.patch " Andrew Morton
@ 2020-02-24  4:08 ` Andrew Morton
  2020-02-24  4:08 ` + selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch " Andrew Morton
                   ` (50 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  4:08 UTC (permalink / raw)
  To: bgeffon, mm-commits


The patch titled
     Subject: selftest: Add MREMAP_DONTUNMAP selftest.
has been added to the -mm tree.  Its filename is
     selftest-add-mremap_dontunmap-selftest-v7.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/selftest-add-mremap_dontunmap-selftest-v7.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/selftest-add-mremap_dontunmap-selftest-v7.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Brian Geffon <bgeffon@google.com>
Subject: selftest: Add MREMAP_DONTUNMAP selftest.

Add a few simple selftests for the new MREMAP_DONTUNMAP flag.  They are
simple smoke tests which also demonstrate the behavior.

Link: http://lkml.kernel.org/r/20200221174248.244748-2-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/mremap_dontunmap.c |   91 ++++++----------
 1 file changed, 39 insertions(+), 52 deletions(-)

--- a/tools/testing/selftests/vm/mremap_dontunmap.c~selftest-add-mremap_dontunmap-selftest-v7
+++ a/tools/testing/selftests/vm/mremap_dontunmap.c
@@ -36,7 +36,7 @@ static void dump_maps(void)
 		if (condition) {					      \
 			fprintf(stderr, "[FAIL]\t%s():%d\t%s:%s\n", __func__, \
 				__LINE__, (description), strerror(errno));    \
-			dump_maps();					      \
+			dump_maps();					  \
 			exit(1);					      \
 		} 							      \
 	} while (0)
@@ -224,73 +224,60 @@ static void mremap_dontunmap_partial_map
 	       "unable to unmap source mapping");
 }
 
-// This test validates that we can shrink an existing mapping via the normal
-// mremap behavior along with the MREMAP_DONTUNMAP flag.
-static void mremap_dontunmap_shrink_mapping()
+// This test validates that we can remap over only a portion of a mapping.
+static void mremap_dontunmap_partial_mapping_overwrite()
 {
 	/*
-	 * We shrink the source by 5 pages while remapping.
 	 *  source mapping:
-	 *  --------------
-	 *  | aaaaaaaaaa |
-	 *  --------------
-	 *  to become:
 	 *  ---------
-	 *  | 00000 |
+	 *  |aaaaa|
 	 *  ---------
-	 *  With the destination mapping containing 5 pages of As followed by
-	 *  the original pages of Xs.
-	 *  --------------
-	 *  | aaaaaXXXXX |
-	 *  --------------
+	 *  dest mapping initially:
+	 *  -----------
+	 *  |XXXXXXXXXX|
+	 *  ------------
+	 *  Source to become:
+	 *  ---------
+	 *  |00000|
+	 *  ---------
+	 *  With the destination mapping containing 5 pages of As.
+	 *  ------------
+	 *  |aaaaaXXXXX|
+	 *  ------------
 	 */
+	void *source_mapping =
+	    mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	BUG_ON(source_mapping == MAP_FAILED, "mmap");
+	memset(source_mapping, 'a', 5 * page_size);
 
-	unsigned long num_pages = 10;
-
-	// We use MREMAP_FIXED because we don't want the mremap to place the
-	// remapped mapping behind the source, if it did
-	// we wouldn't be able to validate that the mapping was in fact
-	// adjusted.
 	void *dest_mapping =
-	    mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
+	    mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
 		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 	BUG_ON(dest_mapping == MAP_FAILED, "mmap");
-	memset(dest_mapping, 'X', num_pages * page_size);
+	memset(dest_mapping, 'X', 10 * page_size);
 
-	void *source_mapping =
-	    mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
-		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
-	BUG_ON(source_mapping == MAP_FAILED, "mmap");
-	memset(source_mapping, 'a', num_pages * page_size);
-
-	// We are shrinking the mapping while also using MREMAP_DONTUNMAP
+	// We will grab the last 5 pages of the source and move them.
 	void *remapped_mapping =
-	    mremap(source_mapping, num_pages * page_size, 5 * page_size,
-		   MREMAP_FIXED | MREMAP_DONTUNMAP | MREMAP_MAYMOVE,
-		   dest_mapping);
-	BUG_ON(remapped_mapping == MAP_FAILED, "mremap");
-	BUG_ON(remapped_mapping != dest_mapping,
-	       "expected mremap to place mapping at dest");
-
-	// The last 5 pages of source should have become unmapped while the
-	// first 5 remain.
-	unsigned char buf[5];
-	int ret = mincore(source_mapping + (5 * page_size), 5 * page_size, buf);
-	BUG_ON((ret != -1 || (ret == -1 && errno != ENOMEM)),
-	       "we expect -ENOMEM from mincore.");
+	    mremap(source_mapping, 5 * page_size,
+		   5 * page_size,
+		   MREMAP_DONTUNMAP | MREMAP_MAYMOVE | MREMAP_FIXED, dest_mapping);
+	BUG_ON(dest_mapping == MAP_FAILED, "mremap");
+	BUG_ON(dest_mapping != remapped_mapping, "expected to remap to dest_mapping");
 
 	BUG_ON(check_region_contains_byte(source_mapping, 5 * page_size, 0) !=
-	       0, "source should have no ptes");
-	BUG_ON(check_region_contains_byte(dest_mapping, 5 * page_size, 'a') !=
-	       0, "dest mapping should contain ptes from the source");
+	       0, "first 5 pages of source should have no ptes");
+
+	// Finally we expect the destination to have 5 pages worth of a's.
+	BUG_ON(check_region_contains_byte(dest_mapping, 5 * page_size, 'a') != 0,
+			"dest mapping should contain ptes from the source");
 
-	// And the second half of the destination should be unchanged.
+	// Finally the last 5 pages shouldn't have been touched.
 	BUG_ON(check_region_contains_byte(dest_mapping + (5 * page_size),
-					  5 * page_size, 'X') != 0,
-	       "second half of dest shouldn't be touched");
+				5 * page_size, 'X') != 0,
+			"dest mapping should have retained the last 5 pages");
 
-	// Cleanup
-	BUG_ON(munmap(dest_mapping, num_pages * page_size) == -1,
+	BUG_ON(munmap(dest_mapping, 10 * page_size) == -1,
 	       "unable to unmap destination mapping");
 	BUG_ON(munmap(source_mapping, 5 * page_size) == -1,
 	       "unable to unmap source mapping");
@@ -316,7 +303,7 @@ int main(void)
 	mremap_dontunmap_simple();
 	mremap_dontunmap_simple_fixed();
 	mremap_dontunmap_partial_mapping();
-	mremap_dontunmap_shrink_mapping();
+	mremap_dontunmap_partial_mapping_overwrite();
 
 	BUG_ON(munmap(page_buffer, page_size) == -1,
 	       "unable to unmap page buffer");
_

Patches currently in -mm which might be from bgeffon@google.com are

mm-add-mremap_dontunmap-to-mremap.patch
mm-add-mremap_dontunmap-to-mremap-v6.patch
mm-add-mremap_dontunmap-to-mremap-v7.patch
selftest-add-mremap_dontunmap-selftest.patch
selftest-add-mremap_dontunmap-selftest-v7.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (187 preceding siblings ...)
  2020-02-24  4:08 ` + selftest-add-mremap_dontunmap-selftest-v7.patch " Andrew Morton
@ 2020-02-24  4:08 ` Andrew Morton
  2020-02-24  4:10 ` + percpu_counter-fix-a-data-race-at-vm_committed_as.patch " Andrew Morton
                   ` (49 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  4:08 UTC (permalink / raw)
  To: akpm, bgeffon, mm-commits


The patch titled
     Subject: selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes
has been added to the -mm tree.  Its filename is
     selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Andrew Morton <akpm@linux-foundation.org>
Subject: selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes

ERROR: Bad function definition - void mremap_dontunmap_partial_mapping_overwrite() should probably be void mremap_dontunmap_partial_mapping_overwrite(void)
#34: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:228:
+static void mremap_dontunmap_partial_mapping_overwrite()

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#68: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:251:
+	BUG_ON(source_mapping == MAP_FAILED, "mmap");

WARNING: line over 80 characters
#109: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:264:
+		   MREMAP_DONTUNMAP | MREMAP_MAYMOVE | MREMAP_FIXED, dest_mapping);

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#110: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:265:
+	BUG_ON(dest_mapping == MAP_FAILED, "mremap");

WARNING: line over 80 characters
#111: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:266:
+	BUG_ON(dest_mapping != remapped_mapping, "expected to remap to dest_mapping");

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#111: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:266:
+	BUG_ON(dest_mapping != remapped_mapping, "expected to remap to dest_mapping");

WARNING: line over 80 characters
#120: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:272:
+	BUG_ON(check_region_contains_byte(dest_mapping, 5 * page_size, 'a') != 0,

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#120: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:272:
+	BUG_ON(check_region_contains_byte(dest_mapping, 5 * page_size, 'a') != 0,

WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
#133: FILE: tools/testing/selftests/vm/mremap_dontunmap.c:280:
+	BUG_ON(munmap(dest_mapping, 10 * page_size) == -1,

total: 1 errors, 8 warnings, 126 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/selftest-add-mremap_dontunmap-selftest-v7.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Brian Geffon <bgeffon@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/mremap_dontunmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/mremap_dontunmap.c~selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes
+++ a/tools/testing/selftests/vm/mremap_dontunmap.c
@@ -225,7 +225,7 @@ static void mremap_dontunmap_partial_map
 }
 
 // This test validates that we can remap over only a portion of a mapping.
-static void mremap_dontunmap_partial_mapping_overwrite()
+static void mremap_dontunmap_partial_mapping_overwrite(void)
 {
 	/*
 	 *  source mapping:
_

Patches currently in -mm which might be from akpm@linux-foundation.org are

mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa-fix.patch
mm.patch
selftest-add-mremap_dontunmap-selftest-fix.patch
selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings-fix.patch
hugetlb_cgroup-add-accounting-for-shared-mappings-fix.patch
mm-migratec-migrate-pg_readahead-flag-fix.patch
proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch
linux-next-rejects.patch
linux-next-fix.patch
mm-add-vm_insert_pages-fix.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
drivers-tty-serial-sh-scic-suppress-warning.patch
kernel-forkc-export-kernel_thread-to-modules.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + percpu_counter-fix-a-data-race-at-vm_committed_as.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (188 preceding siblings ...)
  2020-02-24  4:08 ` + selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch " Andrew Morton
@ 2020-02-24  4:10 ` Andrew Morton
  2020-02-24  4:10 ` + lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch " Andrew Morton
                   ` (48 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  4:10 UTC (permalink / raw)
  To: cai, elver, mm-commits


The patch titled
     Subject: percpu_counter: fix a data race at vm_committed_as
has been added to the -mm tree.  Its filename is
     percpu_counter-fix-a-data-race-at-vm_committed_as.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/percpu_counter-fix-a-data-race-at-vm_committed_as.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/percpu_counter-fix-a-data-race-at-vm_committed_as.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: percpu_counter: fix a data race at vm_committed_as

"vm_committed_as.count" could be accessed concurrently as reported by
KCSAN,

 BUG: KCSAN: data-race in __vm_enough_memory / percpu_counter_add_batch

 write to 0xffffffff9451c538 of 8 bytes by task 65879 on cpu 35:
  percpu_counter_add_batch+0x83/0xd0
  percpu_counter_add_batch at lib/percpu_counter.c:91
  __vm_enough_memory+0xb9/0x260
  dup_mm+0x3a4/0x8f0
  copy_process+0x2458/0x3240
  _do_fork+0xaa/0x9f0
  __do_sys_clone+0x125/0x160
  __x64_sys_clone+0x70/0x90
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 read to 0xffffffff9451c538 of 8 bytes by task 66773 on cpu 19:
  __vm_enough_memory+0x199/0x260
  percpu_counter_read_positive at include/linux/percpu_counter.h:81
  (inlined by) __vm_enough_memory at mm/util.c:839
  mmap_region+0x1b2/0xa10
  do_mmap+0x45c/0x700
  vm_mmap_pgoff+0xc0/0x130
  ksys_mmap_pgoff+0x6e/0x300
  __x64_sys_mmap+0x33/0x40
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The read is outside the percpu_counter::lock critical section, which
results in a data race.  Fix it by adding a READ_ONCE() in
percpu_counter_read_positive(), which also serves the purpose of the
existing compiler memory barrier.
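
A userspace analogue of what the annotation buys (illustrative only;
the kernel's READ_ONCE() is more elaborate): without the
volatile-qualified load, the compiler may re-read the counter between
the sign test and the return, so a concurrent writer could make the
function return a negative value despite the check.

#include <stdint.h>

/* simplified stand-in for the kernel's READ_ONCE() */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static int64_t count;	/* updated concurrently by other threads */

int64_t read_positive(void)
{
	int64_t ret = READ_ONCE(count);	/* one stable load */

	if (ret >= 0)
		return ret;
	return 0;
}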

Link: http://lkml.kernel.org/r/1582302724-2804-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/percpu_counter.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/percpu_counter.h~percpu_counter-fix-a-data-race-at-vm_committed_as
+++ a/include/linux/percpu_counter.h
@@ -78,9 +78,9 @@ static inline s64 percpu_counter_read(st
  */
 static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
 {
-	s64 ret = fbc->count;
+	/* Prevent reloads of fbc->count */
+	s64 ret = READ_ONCE(fbc->count);
 
-	barrier();		/* Prevent reloads of fbc->count */
 	if (ret >= 0)
 		return ret;
 	return 0;
_

Patches currently in -mm which might be from cai@lca.pw are

percpu_counter-fix-a-data-race-at-vm_committed_as.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races-v2.patch
mm-swap_state-mark-various-intentional-data-races.patch
mm-kmemleak-annotate-various-data-races-obj-ptr.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-swapfile-fix-and-annotate-various-data-races-v2.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-util-annotate-an-data-race-at-vm_committed_as.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-annotate-a-data-race-in-page_zonenum.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (189 preceding siblings ...)
  2020-02-24  4:10 ` + percpu_counter-fix-a-data-race-at-vm_committed_as.patch " Andrew Morton
@ 2020-02-24  4:10 ` Andrew Morton
  2020-02-24 21:40 ` + mm-swapfile-fix-data-races-in-try_to_unuse.patch " Andrew Morton
                   ` (47 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24  4:10 UTC (permalink / raw)
  To: colin.king, mm-commits


The patch titled
     Subject: lib/test_lockup.c: fix spelling mistake "iteraions" -> "iterations"
has been added to the -mm tree.  Its filename is
     lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Colin Ian King <colin.king@canonical.com>
Subject: lib/test_lockup.c: fix spelling mistake "iteraions" -> "iterations"

There is a spelling mistake in a pr_notice message.  Fix it.

Link: http://lkml.kernel.org/r/20200221155145.79522-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_lockup.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/test_lockup.c~lib-test_lockup-fix-spelling-mistake-iteraions-iterations
+++ a/lib/test_lockup.c
@@ -490,7 +490,7 @@ static int __init test_lockup_init(void)
 		return -EINVAL;
 	}
 
-	pr_notice("START pid=%d time=%u +%u ns cooldown=%u +%u ns iteraions=%u state=%s %s%s%s%s%s%s%s%s%s%s%s\n",
+	pr_notice("START pid=%d time=%u +%u ns cooldown=%u +%u ns iterations=%u state=%s %s%s%s%s%s%s%s%s%s%s%s\n",
 		  main_task->pid, time_secs, time_nsecs,
 		  cooldown_secs, cooldown_nsecs, iterations, state,
 		  all_cpus ? "all_cpus " : "",
_

Patches currently in -mm which might be from colin.king@canonical.com are

lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-swapfile-fix-data-races-in-try_to_unuse.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (190 preceding siblings ...)
  2020-02-24  4:10 ` + lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch " Andrew Morton
@ 2020-02-24 21:40 ` Andrew Morton
  2020-02-24 21:45 ` [nacked] psi-move-pf_memstall-into-psi-specific-psi_flags.patch removed from " Andrew Morton
                   ` (46 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 21:40 UTC (permalink / raw)
  To: cai, elver, hughd, mm-commits


The patch titled
     Subject: mm/swapfile: fix data races in try_to_unuse()
has been added to the -mm tree.  Its filename is
     mm-swapfile-fix-data-races-in-try_to_unuse.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-swapfile-fix-data-races-in-try_to_unuse.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-swapfile-fix-data-races-in-try_to_unuse.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/swapfile: fix data races in try_to_unuse()

si->inuse_pages could be accessed concurrently as noticed by KCSAN,

 write to 0xffff98b00ebd04dc of 4 bytes by task 82262 on cpu 92:
  swap_range_free+0xbe/0x230
  swap_range_free at mm/swapfile.c:719
  swapcache_free_entries+0x1be/0x250
  free_swap_slot+0x1c8/0x220
  __swap_entry_free.constprop.19+0xa3/0xb0
  free_swap_and_cache+0x53/0xa0
  unmap_page_range+0x7e0/0x1ce0
  unmap_single_vma+0xcd/0x170
  unmap_vmas+0x18b/0x220
  exit_mmap+0xee/0x220
  mmput+0xe7/0x240
  do_exit+0x598/0xfd0
  do_group_exit+0x8b/0x180
  get_signal+0x293/0x13d0
  do_signal+0x37/0x5d0
  prepare_exit_to_usermode+0x1b7/0x2c0
  ret_from_intr+0x32/0x42

 read to 0xffff98b00ebd04dc of 4 bytes by task 82499 on cpu 46:
  try_to_unuse+0x86b/0xc80
  try_to_unuse at mm/swapfile.c:2185
  __x64_sys_swapoff+0x372/0xd40
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The plain reads in try_to_unuse() are outside the si->lock critical
section, which results in data races that are dangerous in loop
conditions.  Fix them by adding READ_ONCE().
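
A userspace analogue of the loop hazard (illustrative only): a plain
read in a loop condition may legally be hoisted out of the loop, so the
loop could spin forever on a stale value; the marked read forces a
fresh load on every iteration.

/* simplified stand-in for the kernel's READ_ONCE() */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static unsigned int inuse_pages;	/* decremented by other threads */

void wait_until_unused(void)
{
	/* "while (inuse_pages)" could be compiled into a single load
	 * followed by an endless loop; this form re-reads memory */
	while (READ_ONCE(inuse_pages))
		;
}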

Link: http://lkml.kernel.org/r/1582578903-29294-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/swapfile.c~mm-swapfile-fix-data-races-in-try_to_unuse
+++ a/mm/swapfile.c
@@ -2132,7 +2132,7 @@ int try_to_unuse(unsigned int type, bool
 	swp_entry_t entry;
 	unsigned int i;
 
-	if (!si->inuse_pages)
+	if (!READ_ONCE(si->inuse_pages))
 		return 0;
 
 	if (!frontswap)
@@ -2148,7 +2148,7 @@ retry:
 
 	spin_lock(&mmlist_lock);
 	p = &init_mm.mmlist;
-	while (si->inuse_pages &&
+	while (READ_ONCE(si->inuse_pages) &&
 	       !signal_pending(current) &&
 	       (p = p->next) != &init_mm.mmlist) {
 
@@ -2177,7 +2177,7 @@ retry:
 	mmput(prev_mm);
 
 	i = 0;
-	while (si->inuse_pages &&
+	while (READ_ONCE(si->inuse_pages) &&
 	       !signal_pending(current) &&
 	       (i = find_next_to_unuse(si, i, frontswap)) != 0) {
 
@@ -2219,7 +2219,7 @@ retry:
 	 * been preempted after get_swap_page(), temporarily hiding that swap.
 	 * It's easy and robust (though cpu-intensive) just to keep retrying.
 	 */
-	if (si->inuse_pages) {
+	if (READ_ONCE(si->inuse_pages)) {
 		if (!signal_pending(current))
 			goto retry;
 		retval = -EINTR;
_

Patches currently in -mm which might be from cai@lca.pw are

mm-swapfile-fix-data-races-in-try_to_unuse.patch
percpu_counter-fix-a-data-race-at-vm_committed_as.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races-v2.patch
mm-swap_state-mark-various-intentional-data-races.patch
mm-kmemleak-annotate-various-data-races-obj-ptr.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-swapfile-fix-and-annotate-various-data-races-v2.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-util-annotate-an-data-race-at-vm_committed_as.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-annotate-a-data-race-in-page_zonenum.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [nacked] psi-move-pf_memstall-into-psi-specific-psi_flags.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (191 preceding siblings ...)
  2020-02-24 21:40 ` + mm-swapfile-fix-data-races-in-try_to_unuse.patch " Andrew Morton
@ 2020-02-24 21:45 ` Andrew Morton
  2020-02-24 21:57 ` + mm-z3fold-do-not-include-rwlockh-directly.patch added to " Andrew Morton
                   ` (45 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 21:45 UTC (permalink / raw)
  To: bsegall, dietmar.eggemann, hannes, juri.lelli, laoar.shao,
	mgorman, mingo, mm-commits, peterz, rostedt, vincent.guittot


The patch titled
     Subject: psi: move PF_MEMSTALL into psi specific psi_flags
has been removed from the -mm tree.  Its filename was
     psi-move-pf_memstall-into-psi-specific-psi_flags.patch

This patch was dropped because it was nacked

------------------------------------------------------
From: Yafang Shao <laoar.shao@gmail.com>
Subject: psi: move PF_MEMSTALL into psi specific psi_flags

task->flags is a 32-bit field, of which 31 bits have already been
consumed, so it is hard to introduce new per-process flags.  As there is
already a psi-specific field, psi_flags, move the psi-specific
per-process flag PF_MEMSTALL into it.
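
A sketch of the proposed layout, using the constants from the patch
below: the low 16 bits of psi_flags carry the scheduler-visible task
states, while the bits above them carry per-process flags that persist
across sleeps.

#define PSI_TSK_STATE_BITS	16
#define PSI_TSK_STATE_MASK	((1 << PSI_TSK_STATE_BITS) - 1)
#define PSI_PF_MEMSTALL		(1 << PSI_TSK_STATE_BITS)

static unsigned int psi_flags;

static void memstall_enter(void)
{
	psi_flags |= PSI_PF_MEMSTALL;		/* per-process flag, high bits */
}

static unsigned int task_states(void)
{
	return psi_flags & PSI_TSK_STATE_MASK;	/* task-state bits only */
}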

Link: http://lkml.kernel.org/r/20200222144647.10120-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/psi_types.h |   12 +++++++++++-
 include/linux/sched.h     |    7 +++++--
 kernel/sched/psi.c        |   15 ++++++++-------
 kernel/sched/stats.h      |   10 +++++-----
 4 files changed, 29 insertions(+), 15 deletions(-)

--- a/include/linux/psi_types.h~psi-move-pf_memstall-into-psi-specific-psi_flags
+++ a/include/linux/psi_types.h
@@ -17,11 +17,21 @@ enum psi_task_count {
 	NR_PSI_TASK_COUNTS = 3,
 };
 
-/* Task state bitmasks */
+/*
+ * Task state bitmasks:
+ * These flags are stored in the lower PSI_TSK_BITS bits of
+ * task->psi_flags, and the higher bits are set with per process flag which
+ * persists across sleeps.
+ */
+#define PSI_TSK_STATE_BITS 16
+#define PSI_TSK_STATE_MASK ((1 << PSI_TSK_STATE_BITS) - 1)
 #define TSK_IOWAIT	(1 << NR_IOWAIT)
 #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
 #define TSK_RUNNING	(1 << NR_RUNNING)
 
+/* Stalled due to lack of memory, that's per process flag. */
+#define PSI_PF_MEMSTALL (1 << PSI_TSK_STATE_BITS)
+
 /* Resources that workloads could be stalled on */
 enum psi_res {
 	PSI_IO,
--- a/include/linux/sched.h~psi-move-pf_memstall-into-psi-specific-psi_flags
+++ a/include/linux/sched.h
@@ -1023,7 +1023,11 @@ struct task_struct {
 
 	struct task_io_accounting	ioac;
 #ifdef CONFIG_PSI
-	/* Pressure stall state */
+	/*
+	 * Pressure stall state:
+	 * Bits 0 ~ PSI_TSK_STATE_BITS-1: PSI task states
+	 * Bits PSI_TSK_STATE_BITS ~ 31: Per process flags
+	 */
 	unsigned int			psi_flags;
 #endif
 #ifdef CONFIG_TASK_XACCT
@@ -1477,7 +1481,6 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
-#define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
 #define PF_UMH			0x02000000	/* I'm an Usermodehelper process */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
--- a/kernel/sched/psi.c~psi-move-pf_memstall-into-psi-specific-psi_flags
+++ a/kernel/sched/psi.c
@@ -759,7 +759,8 @@ void psi_task_change(struct task_struct
 	    !psi_bug) {
 		printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
 				task->pid, task->comm, cpu,
-				task->psi_flags, clear, set);
+				task->psi_flags & PSI_TSK_STATE_MASK,
+				clear, set);
 		psi_bug = 1;
 	}
 
@@ -818,17 +819,17 @@ void psi_memstall_enter(unsigned long *f
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	*flags = current->flags & PF_MEMSTALL;
+	*flags = current->psi_flags & PSI_PF_MEMSTALL;
 	if (*flags)
 		return;
 	/*
-	 * PF_MEMSTALL setting & accounting needs to be atomic wrt
+	 * PSI_PF_MEMSTALL setting & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we can
 	 * race with CPU migration.
 	 */
 	rq = this_rq_lock_irq(&rf);
 
-	current->flags |= PF_MEMSTALL;
+	current->psi_flags |= PSI_PF_MEMSTALL;
 	psi_task_change(current, 0, TSK_MEMSTALL);
 
 	rq_unlock_irq(rq, &rf);
@@ -851,13 +852,13 @@ void psi_memstall_leave(unsigned long *f
 	if (*flags)
 		return;
 	/*
-	 * PF_MEMSTALL clearing & accounting needs to be atomic wrt
+	 * PSI_PF_MEMSTALL clearing & accounting needs to be atomic wrt
 	 * changes to the task's scheduling state, otherwise we could
 	 * race with CPU migration.
 	 */
 	rq = this_rq_lock_irq(&rf);
 
-	current->flags &= ~PF_MEMSTALL;
+	current->psi_flags &= ~PSI_PF_MEMSTALL;
 	psi_task_change(current, TSK_MEMSTALL, 0);
 
 	rq_unlock_irq(rq, &rf);
@@ -921,7 +922,7 @@ void cgroup_move_task(struct task_struct
 	else if (task->in_iowait)
 		task_flags = TSK_IOWAIT;
 
-	if (task->flags & PF_MEMSTALL)
+	if (task->psi_flags & PSI_PF_MEMSTALL)
 		task_flags |= TSK_MEMSTALL;
 
 	if (task_flags)
--- a/kernel/sched/stats.h~psi-move-pf_memstall-into-psi-specific-psi_flags
+++ a/kernel/sched/stats.h
@@ -70,7 +70,7 @@ static inline void psi_enqueue(struct ta
 		return;
 
 	if (!wakeup || p->sched_psi_wake_requeue) {
-		if (p->flags & PF_MEMSTALL)
+		if (p->psi_flags & PSI_PF_MEMSTALL)
 			set |= TSK_MEMSTALL;
 		if (p->sched_psi_wake_requeue)
 			p->sched_psi_wake_requeue = 0;
@@ -90,7 +90,7 @@ static inline void psi_dequeue(struct ta
 		return;
 
 	if (!sleep) {
-		if (p->flags & PF_MEMSTALL)
+		if (p->psi_flags & PSI_PF_MEMSTALL)
 			clear |= TSK_MEMSTALL;
 	} else {
 		if (p->in_iowait)
@@ -109,14 +109,14 @@ static inline void psi_ttwu_dequeue(stru
 	 * deregister its sleep-persistent psi states from the old
 	 * queue, and let psi_enqueue() know it has to requeue.
 	 */
-	if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) {
+	if (unlikely(p->in_iowait || (p->psi_flags & PSI_PF_MEMSTALL))) {
 		struct rq_flags rf;
 		struct rq *rq;
 		int clear = 0;
 
 		if (p->in_iowait)
 			clear |= TSK_IOWAIT;
-		if (p->flags & PF_MEMSTALL)
+		if (p->psi_flags & PSI_PF_MEMSTALL)
 			clear |= TSK_MEMSTALL;
 
 		rq = __task_rq_lock(p, &rf);
@@ -131,7 +131,7 @@ static inline void psi_task_tick(struct
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (unlikely(rq->curr->flags & PF_MEMSTALL))
+	if (unlikely(rq->curr->psi_flags & PSI_PF_MEMSTALL))
 		psi_memstall_tick(rq->curr, cpu_of(rq));
 }
 #else /* CONFIG_PSI */
_

Patches currently in -mm which might be from laoar.shao@gmail.com are

mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-z3fold-do-not-include-rwlockh-directly.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (192 preceding siblings ...)
  2020-02-24 21:45 ` [nacked] psi-move-pf_memstall-into-psi-specific-psi_flags.patch removed from " Andrew Morton
@ 2020-02-24 21:57 ` Andrew Morton
  2020-02-24 22:04 ` + mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch " Andrew Morton
                   ` (44 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 21:57 UTC (permalink / raw)
  To: bigeasy, mm-commits, peterz, tglx, vitaly.wool


The patch titled
     Subject: mm/z3fold.c: do not include rwlock.h directly
has been added to the -mm tree.  Its filename is
     mm-z3fold-do-not-include-rwlockh-directly.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-z3fold-do-not-include-rwlockh-directly.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-z3fold-do-not-include-rwlockh-directly.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: mm/z3fold.c: do not include rwlock.h directly

rwlock.h should not be included directly; linux/spinlock.h should be
included instead.  Including rwlock.h directly breaks the RT build.

Link: http://lkml.kernel.org/r/20200224133631.1510569-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/z3fold.c~mm-z3fold-do-not-include-rwlockh-directly
+++ a/mm/z3fold.c
@@ -41,7 +41,6 @@
 #include <linux/workqueue.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
-#include <linux/rwlock.h>
 #include <linux/zpool.h>
 #include <linux/magic.h>
 
_

Patches currently in -mm which might be from bigeasy@linutronix.de are

mm-z3fold-do-not-include-rwlockh-directly.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (193 preceding siblings ...)
  2020-02-24 21:57 ` + mm-z3fold-do-not-include-rwlockh-directly.patch added to " Andrew Morton
@ 2020-02-24 22:04 ` Andrew Morton
  2020-02-24 22:13 ` + mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch " Andrew Morton
                   ` (43 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:04 UTC (permalink / raw)
  To: cai, david, gerald.schaefer, iamjoonsoo.kim, mm-commits, stable, vbabka


The patch titled
     Subject: mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled
has been added to the -mm tree.  Its filename is
     mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled

Commit cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
fixed memory hotplug with debug_pagealloc enabled, where onlining a page
goes through page freeing, which removes the direct mapping.  Some
arches don't like it when the page is not mapped in the first place, so
generic_online_page() maps it first.  This is somewhat wasteful, but
better than special-casing the page freeing fast paths.

The commit, however, missed that DEBUG_PAGEALLOC being configured
doesn't mean it's actually enabled.  One has to test
debug_pagealloc_enabled() since 031bc5743f15 ("mm/debug-pagealloc: make
debug-pagealloc boottime configurable"), or alternatively
debug_pagealloc_enabled_static() since 8e57f8acbbd1 ("mm,
debug_pagealloc: don't rely on static keys too early"), but this is not
done.

As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
will crash:

Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
Oops: 0004 ilc:2 [#1] SMP
CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
0000000000000001 0000000000000000 0000000000000002 0000000000000100
0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
>0000001ecd281b9e: 94fb5006 ni 6(%r5),251
0000001ecd281ba2: 41505008 la %r5,8(%r5)
0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
0000001ecd281bac: 1a07 ar %r0,%r7
0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
Call Trace:
[<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188
[<0000001ecd4d9516>] online_pages_range+0xf6/0x128
[<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8
[<0000001ecda28aae>] online_pages+0x2fe/0x3f0
[<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0
[<0000001ecd7add42>] device_online+0x5a/0xc8
[<0000001ecd7d0430>] state_store+0x88/0x118
[<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200
[<0000001ecd5064b6>] vfs_write+0x176/0x1e0
[<0000001ecd50676a>] ksys_write+0xa2/0x100
[<0000001ecda315d4>] system_call+0xd8/0x2c8

Fix this by checking debug_pagealloc_enabled_static() before calling
kernel_map_pages().  Backports for kernels before 5.5 should use
debug_pagealloc_enabled() instead.  Also add comments.

Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz
Fixes: cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h  |    4 ++++
 mm/memory_hotplug.c |    8 +++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

--- a/include/linux/mm.h~mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled
+++ a/include/linux/mm.h
@@ -2715,6 +2715,10 @@ static inline bool debug_pagealloc_enabl
 #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
 extern void __kernel_map_pages(struct page *page, int numpages, int enable);
 
+/*
+ * When called in DEBUG_PAGEALLOC context, the call should most likely be
+ * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
+ */
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable)
 {
--- a/mm/memory_hotplug.c~mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled
+++ a/mm/memory_hotplug.c
@@ -574,7 +574,13 @@ EXPORT_SYMBOL_GPL(restore_online_page_ca
 
 void generic_online_page(struct page *page, unsigned int order)
 {
-	kernel_map_pages(page, 1 << order, 1);
+	/*
+	 * Freeing the page with debug_pagealloc enabled will try to unmap it,
+	 * so we should map it first. This is better than introducing a special
+	 * case in page freeing fast path.
+	 */
+	if (debug_pagealloc_enabled_static())
+		kernel_map_pages(page, 1 << order, 1);
 	__free_pages_core(page, order);
 	totalram_pages_add(1UL << order);
 #ifdef CONFIG_HIGHMEM
_

Patches currently in -mm which might be from vbabka@suse.cz are

mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch

* + mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (194 preceding siblings ...)
  2020-02-24 22:04 ` + mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch " Andrew Morton
@ 2020-02-24 22:13 ` Andrew Morton
  2020-02-24 22:13 ` + mm-vma-make-vma_is_accessible-available-for-general-use.patch " Andrew Morton
                   ` (42 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:13 UTC (permalink / raw)
  To: acme, aneesh.kumar, anshuman.khandual, arnd, benh, dalias,
	dave.hansen, geert, guoren, luto, mgorman, mingo, mm-commits,
	mpe, npiggin, paulburton, paulus, paulus, peterz, ralf, rostedt,
	tglx, viro, will, ysato


The patch titled
     Subject: mm/vma: add missing VMA flag readable name for VM_SYNC
has been added to the -mm tree.  Its filename is
     mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/vma: add missing VMA flag readable name for VM_SYNC

Patch series "mm/vma: Use all available wrappers when possible", v2.

Apart from adding a readable VMA flag name for trace purposes, this
series replaces some open encodings with the available VMA-specific
wrappers.  It skips the VM_HUGETLB check in vma_migratable() as that is
already being handled by another patch
(https://patchwork.kernel.org/patch/11347831/) which is yet to be merged.


This patch (of 4):

This just adds the missing readable name for VM_SYNC.

Link: http://lkml.kernel.org/r/1582520593-30704-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/trace/events/mmflags.h |    1 +
 1 file changed, 1 insertion(+)

--- a/include/trace/events/mmflags.h~mm-vma-add-missing-vma-flag-readable-name-for-vm_sync
+++ a/include/trace/events/mmflags.h
@@ -154,6 +154,7 @@ IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
 	{VM_ACCOUNT,			"account"	},		\
 	{VM_NORESERVE,			"noreserve"	},		\
 	{VM_HUGETLB,			"hugetlb"	},		\
+	{VM_SYNC,			"sync"		},		\
 	__VM_ARCH_SPECIFIC_1				,		\
 	{VM_WIPEONFORK,			"wipeonfork"	},		\
 	{VM_DONTDUMP,			"dontdump"	},		\
_

Patches currently in -mm which might be from anshuman.khandual@arm.com are

mm-debug-add-tests-validating-architecture-page-table-helpers.patch
mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch
mm-vma-make-vma_is_accessible-available-for-general-use.patch
mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch
mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch

* + mm-vma-make-vma_is_accessible-available-for-general-use.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (195 preceding siblings ...)
  2020-02-24 22:13 ` + mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch " Andrew Morton
@ 2020-02-24 22:13 ` Andrew Morton
  2020-02-24 22:13 ` + mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch " Andrew Morton
                   ` (41 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:13 UTC (permalink / raw)
  To: acme, aneesh.kumar, anshuman.khandual, arnd, benh, dalias,
	dave.hansen, geert, guoren, luto, mgorman, mingo, mm-commits,
	mpe, npiggin, paulburton, paulus, paulus, peterz, ralf, rostedt,
	tglx, viro, will, ysato


The patch titled
     Subject: mm/vma: make vma_is_accessible() available for general use
has been added to the -mm tree.  Its filename is
     mm-vma-make-vma_is_accessible-available-for-general-use.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vma-make-vma_is_accessible-available-for-general-use.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vma-make-vma_is_accessible-available-for-general-use.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/vma: make vma_is_accessible() available for general use

Let's move the vma_is_accessible() helper to include/linux/mm.h, which
makes it available for general use.  While here, replace all remaining
open encodings of the VMA access check with vma_is_accessible().

Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/csky/mm/fault.c    |    2 +-
 arch/m68k/mm/fault.c    |    2 +-
 arch/mips/mm/fault.c    |    2 +-
 arch/powerpc/mm/fault.c |    2 +-
 arch/sh/mm/fault.c      |    2 +-
 arch/x86/mm/fault.c     |    2 +-
 include/linux/mm.h      |    5 +++++
 kernel/sched/fair.c     |    2 +-
 mm/gup.c                |    2 +-
 mm/memory.c             |    5 -----
 mm/mempolicy.c          |    3 +--
 mm/mmap.c               |    5 ++---
 12 files changed, 16 insertions(+), 18 deletions(-)

--- a/arch/csky/mm/fault.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/arch/csky/mm/fault.c
@@ -137,7 +137,7 @@ good_area:
 		if (!(vma->vm_flags & VM_WRITE))
 			goto bad_area;
 	} else {
-		if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
+		if (!vma_is_accessible(vma))
 			goto bad_area;
 	}
 
--- a/arch/m68k/mm/fault.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/arch/m68k/mm/fault.c
@@ -125,7 +125,7 @@ good_area:
 		case 1:		/* read, present */
 			goto acc_err;
 		case 0:		/* read, not present */
-			if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
+			if (!vma_is_accessible(vma))
 				goto acc_err;
 	}
 
--- a/arch/mips/mm/fault.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/arch/mips/mm/fault.c
@@ -142,7 +142,7 @@ good_area:
 				goto bad_area;
 			}
 		} else {
-			if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
+			if (!vma_is_accessible(vma))
 				goto bad_area;
 		}
 	}
--- a/arch/powerpc/mm/fault.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/arch/powerpc/mm/fault.c
@@ -314,7 +314,7 @@ static bool access_error(bool is_write,
 		return false;
 	}
 
-	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+	if (unlikely(!vma_is_accessible(vma)))
 		return true;
 	/*
 	 * We should ideally do the vma pkey access check here. But in the
--- a/arch/sh/mm/fault.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/arch/sh/mm/fault.c
@@ -355,7 +355,7 @@ static inline int access_error(int error
 		return 1;
 
 	/* read, not present: */
-	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+	if (unlikely(!vma_is_accessible(vma)))
 		return 1;
 
 	return 0;
--- a/arch/x86/mm/fault.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/arch/x86/mm/fault.c
@@ -1222,7 +1222,7 @@ access_error(unsigned long error_code, s
 		return 1;
 
 	/* read, not present: */
-	if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+	if (unlikely(!vma_is_accessible(vma)))
 		return 1;
 
 	return 0;
--- a/include/linux/mm.h~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/include/linux/mm.h
@@ -541,6 +541,11 @@ static inline bool vma_is_anonymous(stru
 	return !vma->vm_ops;
 }
 
+static inline bool vma_is_accessible(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
+}
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
--- a/kernel/sched/fair.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/kernel/sched/fair.c
@@ -2573,7 +2573,7 @@ static void task_numa_work(struct callba
 		 * Skip inaccessible VMAs to avoid any confusion between
 		 * PROT_NONE and NUMA hinting ptes
 		 */
-		if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
+		if (!vma_is_accessible(vma))
 			continue;
 
 		do {
--- a/mm/gup.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/mm/gup.c
@@ -1368,7 +1368,7 @@ long populate_vma_page_range(struct vm_a
 	 * We want mlock to succeed for regions that have any permissions
 	 * other than PROT_NONE.
 	 */
-	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
+	if (vma_is_accessible(vma))
 		gup_flags |= FOLL_FORCE;
 
 	/*
--- a/mm/memory.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/mm/memory.c
@@ -3961,11 +3961,6 @@ static inline vm_fault_t wp_huge_pmd(str
 	return VM_FAULT_FALLBACK;
 }
 
-static inline bool vma_is_accessible(struct vm_area_struct *vma)
-{
-	return vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE);
-}
-
 static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/mempolicy.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/mm/mempolicy.c
@@ -649,8 +649,7 @@ static int queue_pages_test_walk(unsigne
 
 	if (flags & MPOL_MF_LAZY) {
 		/* Similar to task_numa_work, skip inaccessible VMAs */
-		if (!is_vm_hugetlb_page(vma) &&
-			(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)) &&
+		if (!is_vm_hugetlb_page(vma) && vma_is_accessible(vma) &&
 			!(vma->vm_flags & VM_MIXEDMAP))
 			change_prot_numa(vma, start, endvma);
 		return 1;
--- a/mm/mmap.c~mm-vma-make-vma_is_accessible-available-for-general-use
+++ a/mm/mmap.c
@@ -2334,8 +2334,7 @@ int expand_upwards(struct vm_area_struct
 		gap_addr = TASK_SIZE;
 
 	next = vma->vm_next;
-	if (next && next->vm_start < gap_addr &&
-			(next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
+	if (next && next->vm_start < gap_addr && vma_is_accessible(next)) {
 		if (!(next->vm_flags & VM_GROWSUP))
 			return -ENOMEM;
 		/* Check that both stack segments have the same anon_vma? */
@@ -2416,7 +2415,7 @@ int expand_downwards(struct vm_area_stru
 	prev = vma->vm_prev;
 	/* Check that both stack segments have the same anon_vma? */
 	if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
-			(prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
+			vma_is_accessible(prev)) {
 		if (address - prev->vm_end < stack_guard_gap)
 			return -ENOMEM;
 	}
_

Patches currently in -mm which might be from anshuman.khandual@arm.com are

mm-debug-add-tests-validating-architecture-page-table-helpers.patch
mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch
mm-vma-make-vma_is_accessible-available-for-general-use.patch
mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch
mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch

* + mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (196 preceding siblings ...)
  2020-02-24 22:13 ` + mm-vma-make-vma_is_accessible-available-for-general-use.patch " Andrew Morton
@ 2020-02-24 22:13 ` Andrew Morton
  2020-02-24 22:13 ` + mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch " Andrew Morton
                   ` (40 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:13 UTC (permalink / raw)
  To: acme, aneesh.kumar, anshuman.khandual, arnd, benh, dalias,
	dave.hansen, geert, guoren, luto, mgorman, mingo, mm-commits,
	mpe, npiggin, paulburton, paulus, paulus, peterz, ralf, rostedt,
	tglx, viro, will, ysato


The patch titled
     Subject: mm/vma: replace all remaining open encodings with is_vm_hugetlb_page()
has been added to the -mm tree.  Its filename is
     mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/vma: replace all remaining open encodings with is_vm_hugetlb_page()

This replaces all remaining open encodings with is_vm_hugetlb_page().

Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/kvm/e500_mmu_host.c |    2 +-
 fs/binfmt_elf.c                  |    3 ++-
 include/asm-generic/tlb.h        |    3 ++-
 kernel/events/core.c             |    3 ++-
 4 files changed, 7 insertions(+), 4 deletions(-)

--- a/arch/powerpc/kvm/e500_mmu_host.c~mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page
+++ a/arch/powerpc/kvm/e500_mmu_host.c
@@ -422,7 +422,7 @@ static inline int kvmppc_e500_shadow_map
 				break;
 			}
 		} else if (vma && hva >= vma->vm_start &&
-			   (vma->vm_flags & VM_HUGETLB)) {
+			   is_vm_hugetlb_page(vma)) {
 			unsigned long psize = vma_kernel_pagesize(vma);
 
 			tsize = (gtlbe->mas1 & MAS1_TSIZE_MASK) >>
--- a/fs/binfmt_elf.c~mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page
+++ a/fs/binfmt_elf.c
@@ -27,6 +27,7 @@
 #include <linux/highuid.h>
 #include <linux/compiler.h>
 #include <linux/highmem.h>
+#include <linux/hugetlb.h>
 #include <linux/pagemap.h>
 #include <linux/vmalloc.h>
 #include <linux/security.h>
@@ -1317,7 +1318,7 @@ static unsigned long vma_dump_size(struc
 	}
 
 	/* Hugetlb memory check */
-	if (vma->vm_flags & VM_HUGETLB) {
+	if (is_vm_hugetlb_page(vma)) {
 		if ((vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_SHARED))
 			goto whole;
 		if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE))
--- a/include/asm-generic/tlb.h~mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page
+++ a/include/asm-generic/tlb.h
@@ -13,6 +13,7 @@
 
 #include <linux/mmu_notifier.h>
 #include <linux/swap.h>
+#include <linux/hugetlb_inline.h>
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
@@ -398,7 +399,7 @@ tlb_update_vma_flags(struct mmu_gather *
 	 * We rely on tlb_end_vma() to issue a flush, such that when we reset
 	 * these values the batch is empty.
 	 */
-	tlb->vma_huge = !!(vma->vm_flags & VM_HUGETLB);
+	tlb->vma_huge = is_vm_hugetlb_page(vma);
 	tlb->vma_exec = !!(vma->vm_flags & VM_EXEC);
 }
 
--- a/kernel/events/core.c~mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page
+++ a/kernel/events/core.c
@@ -28,6 +28,7 @@
 #include <linux/export.h>
 #include <linux/vmalloc.h>
 #include <linux/hardirq.h>
+#include <linux/hugetlb.h>
 #include <linux/rculist.h>
 #include <linux/uaccess.h>
 #include <linux/syscalls.h>
@@ -7693,7 +7694,7 @@ static void perf_event_mmap_event(struct
 		flags |= MAP_EXECUTABLE;
 	if (vma->vm_flags & VM_LOCKED)
 		flags |= MAP_LOCKED;
-	if (vma->vm_flags & VM_HUGETLB)
+	if (is_vm_hugetlb_page(vma))
 		flags |= MAP_HUGETLB;
 
 	if (file) {
_

Patches currently in -mm which might be from anshuman.khandual@arm.com are

mm-debug-add-tests-validating-architecture-page-table-helpers.patch
mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch
mm-vma-make-vma_is_accessible-available-for-general-use.patch
mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch
mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch

* + mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (197 preceding siblings ...)
  2020-02-24 22:13 ` + mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch " Andrew Morton
@ 2020-02-24 22:13 ` Andrew Morton
  2020-02-24 22:14 ` + mm-vma-append-unlikely-while-testing-vma-access-permissions.patch " Andrew Morton
                   ` (39 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:13 UTC (permalink / raw)
  To: acme, aneesh.kumar, anshuman.khandual, arnd, benh, dalias,
	dave.hansen, geert, guoren, luto, mgorman, mingo, mm-commits,
	mpe, npiggin, paulburton, paulus, paulus, peterz, ralf, rostedt,
	tglx, viro, will, ysato


The patch titled
     Subject: mm/vma: replace all remaining open encodings with vma_is_anonymous()
has been added to the -mm tree.  Its filename is
     mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/vma: replace all remaining open encodings with vma_is_anonymous()

This replaces all remaining open encodings with vma_is_anonymous().

Link: http://lkml.kernel.org/r/1582520593-30704-5-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/gup.c~mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous
+++ a/mm/gup.c
@@ -343,7 +343,8 @@ static struct page *no_page_table(struct
 	 * But we can only make this optimization where a hole would surely
 	 * be zero-filled if handle_mm_fault() actually did handle it.
 	 */
-	if ((flags & FOLL_DUMP) && (!vma->vm_ops || !vma->vm_ops->fault))
+	if ((flags & FOLL_DUMP) &&
+			(vma_is_anonymous(vma) || !vma->vm_ops->fault))
 		return ERR_PTR(-EFAULT);
 	return NULL;
 }
_

Patches currently in -mm which might be from anshuman.khandual@arm.com are

mm-debug-add-tests-validating-architecture-page-table-helpers.patch
mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch
mm-vma-make-vma_is_accessible-available-for-general-use.patch
mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch
mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch

* + mm-vma-append-unlikely-while-testing-vma-access-permissions.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (198 preceding siblings ...)
  2020-02-24 22:13 ` + mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch " Andrew Morton
@ 2020-02-24 22:14 ` Andrew Morton
  2020-02-24 22:30 ` + samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch " Andrew Morton
                   ` (38 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:14 UTC (permalink / raw)
  To: akpm, anshuman.khandual, geert, guoren, mm-commits, paulburton,
	ralf, rppt


The patch titled
     Subject: mm/vma: append unlikely() while testing VMA access permissions
has been added to the -mm tree.  Its filename is
     mm-vma-append-unlikely-while-testing-vma-access-permissions.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vma-append-unlikely-while-testing-vma-access-permissions.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vma-append-unlikely-while-testing-vma-access-permissions.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/vma: append unlikely() while testing VMA access permissions

It is unlikely that an inaccessible VMA without the required permission
flags will get a page fault.  Hence let's just append the unlikely()
directive to such checks in order to improve performance while also
standardizing them across various platforms.
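
For reference, unlikely() is the kernel's branch-prediction annotation
from include/linux/compiler.h, essentially:

	# define likely(x)	__builtin_expect(!!(x), 1)
	# define unlikely(x)	__builtin_expect(!!(x), 0)

so the compiler can lay out the bad_area/acc_err failure labels as the
cold, out-of-line path.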

Link: http://lkml.kernel.org/r/1582525304-32113-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/csky/mm/fault.c |    2 +-
 arch/m68k/mm/fault.c |    2 +-
 arch/mips/mm/fault.c |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

--- a/arch/csky/mm/fault.c~mm-vma-append-unlikely-while-testing-vma-access-permissions
+++ a/arch/csky/mm/fault.c
@@ -137,7 +137,7 @@ good_area:
 		if (!(vma->vm_flags & VM_WRITE))
 			goto bad_area;
 	} else {
-		if (!vma_is_accessible(vma))
+		if (unlikely(!vma_is_accessible(vma)))
 			goto bad_area;
 	}
 
--- a/arch/m68k/mm/fault.c~mm-vma-append-unlikely-while-testing-vma-access-permissions
+++ a/arch/m68k/mm/fault.c
@@ -125,7 +125,7 @@ good_area:
 		case 1:		/* read, present */
 			goto acc_err;
 		case 0:		/* read, not present */
-			if (!vma_is_accessible(vma))
+			if (unlikely(!vma_is_accessible(vma)))
 				goto acc_err;
 	}
 
--- a/arch/mips/mm/fault.c~mm-vma-append-unlikely-while-testing-vma-access-permissions
+++ a/arch/mips/mm/fault.c
@@ -142,7 +142,7 @@ good_area:
 				goto bad_area;
 			}
 		} else {
-			if (!vma_is_accessible(vma))
+			if (unlikely(!vma_is_accessible(vma)))
 				goto bad_area;
 		}
 	}
_

Patches currently in -mm which might be from anshuman.khandual@arm.com are

mm-debug-add-tests-validating-architecture-page-table-helpers.patch
mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch
mm-vma-make-vma_is_accessible-available-for-general-use.patch
mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch
mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch
mm-vma-append-unlikely-while-testing-vma-access-permissions.patch

* + samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (199 preceding siblings ...)
  2020-02-24 22:14 ` + mm-vma-append-unlikely-while-testing-vma-access-permissions.patch " Andrew Morton
@ 2020-02-24 22:30 ` Andrew Morton
  2020-02-24 22:31 ` + samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch " Andrew Morton
                   ` (37 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:30 UTC (permalink / raw)
  To: ast, frederic, gregkh, hch, mhiramat, mm-commits, prasad,
	qperret, tglx, will


The patch titled
     Subject: samples/hw_breakpoint: drop HW_BREAKPOINT_R when reporting writes
has been added to the -mm tree.  Its filename is
     samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Will Deacon <will@kernel.org>
Subject: samples/hw_breakpoint: drop HW_BREAKPOINT_R when reporting writes

Patch series "Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()".

Despite having just a single modular in-tree user that I could spot,
kallsyms_lookup_name() is exported to modules and provides a mechanism
for out-of-tree modules to access and invoke arbitrary, non-exported
kernel symbols when kallsyms is enabled.

This patch series fixes up that one user and unexports the symbol along
with kallsyms_on_each_symbol(), since that could also be abused in a
similar manner.

I would like to avoid out-of-tree modules being easily able to call
functions that are not exported.  kallsyms_lookup_name() makes this
trivial to the point that there is very little incentive to rework these
modules to either use upstream interfaces correctly or propose
functionality which may be otherwise missing upstream.  Both of these
latter solutions would be pre-requisites to upstreaming these modules, and
the current state of things actively discourages that approach.

The background here is that we are aiming for Android devices to be able
to use a generic binary kernel image closely following upstream, with any
vendor extensions coming in as kernel modules.  In this case, we (Google)
end up maintaining the binary module ABI within the scope of a single LTS
kernel.  Monitoring and managing the ABI surface is not feasible if it
effectively includes all data and functions via kallsyms_lookup_name(). 
Of course, we could just carry this patch in the Android kernel tree, but
we're aiming to carry as little as possible (ideally nothing) and I think
it's a sensible change in its own right.  I'm surprised you object to it,
in all honesty.

Now, you could turn around and say "that's not upstream's problem", but it
still seems highly undesirable to me to have an upstream bypass for
exported symbols that isn't even used by upstream modules.  It's ripe for
abuse and encourages people to work outside of the upstream tree.  The
usual rule is that we don't export symbols without a user in the tree and
that seems especially relevant in this case.
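
To make the concern concrete, here is a sketch of the kind of bypass a
hypothetical out-of-tree module could use today; "some_private_function"
is an invented placeholder for any non-exported kernel function:

	#include <linux/module.h>
	#include <linux/kallsyms.h>

	static int __init bypass_init(void)
	{
		/* resolves *any* symbol, exported or not */
		unsigned long addr = kallsyms_lookup_name("some_private_function");
		void (*fn)(void) = (void (*)(void))addr;

		if (!fn)
			return -ENOENT;
		fn();	/* invoke a function that was never exported */
		return 0;
	}
	module_init(bypass_init);
	MODULE_LICENSE("GPL");

Unexporting kallsyms_lookup_name() closes off exactly this escape hatch.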

This patch (of 3):

Given the name of a kernel symbol, the 'data_breakpoint' test claims to
"report any write operations on the kernel symbol".  However, it creates
the breakpoint using both HW_BREAKPOINT_W and HW_BREAKPOINT_R, which means
it also fires on read accesses.

Drop HW_BREAKPOINT_R from the breakpoint attributes.

Link: http://lkml.kernel.org/r/20200221114404.14641-2-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Quentin Perret <qperret@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 samples/hw_breakpoint/data_breakpoint.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/samples/hw_breakpoint/data_breakpoint.c~samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes
+++ a/samples/hw_breakpoint/data_breakpoint.c
@@ -45,7 +45,7 @@ static int __init hw_break_module_init(v
 	hw_breakpoint_init(&attr);
 	attr.bp_addr = kallsyms_lookup_name(ksym_name);
 	attr.bp_len = HW_BREAKPOINT_LEN_4;
-	attr.bp_type = HW_BREAKPOINT_W | HW_BREAKPOINT_R;
+	attr.bp_type = HW_BREAKPOINT_W;
 
 	sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler, NULL);
 	if (IS_ERR((void __force *)sample_hbp)) {
_

Patches currently in -mm which might be from will@kernel.org are

samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch
samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch
kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch

* + samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (200 preceding siblings ...)
  2020-02-24 22:30 ` + samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch " Andrew Morton
@ 2020-02-24 22:31 ` Andrew Morton
  2020-02-24 22:31 ` + kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch " Andrew Morton
                   ` (36 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:31 UTC (permalink / raw)
  To: ast, frederic, gregkh, hch, mhiramat, mm-commits, prasad,
	qperret, tglx, will


The patch titled
     Subject: samples/hw_breakpoint: drop use of kallsyms_lookup_name()
has been added to the -mm tree.  Its filename is
     samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Will Deacon <will@kernel.org>
Subject: samples/hw_breakpoint: drop use of kallsyms_lookup_name()

The 'data_breakpoint' test code is the only modular user of
kallsyms_lookup_name(), which was exported as part of fixing the test in
f60d24d2ad04 ("hw-breakpoints: Fix broken hw-breakpoint sample module").

In preparation for un-exporting this symbol, switch the test over to using
__symbol_get(), which can be used to place breakpoints on exported
symbols.
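
Worth noting: unlike kallsyms_lookup_name(), __symbol_get() also takes a
reference on the module that owns the symbol (when there is one), which
is why the lookup must be paired with symbol_put() on the unload path.
Roughly, a sketch of the pairing the diff below introduces:

	static void *addr;

	static int __init hw_break_module_init(void)
	{
		addr = __symbol_get(ksym_name);	/* resolve and pin the owner */
		if (!addr)
			return -ENXIO;
		/* ... register the breakpoint on addr ... */
		return 0;
	}

	static void __exit hw_break_module_exit(void)
	{
		/* ... unregister the breakpoint ... */
		symbol_put(ksym_name);		/* drop the module reference */
	}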

Link: http://lkml.kernel.org/r/20200221114404.14641-3-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Quentin Perret <qperret@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 samples/hw_breakpoint/data_breakpoint.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/samples/hw_breakpoint/data_breakpoint.c~samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name
+++ a/samples/hw_breakpoint/data_breakpoint.c
@@ -23,7 +23,7 @@
 
 struct perf_event * __percpu *sample_hbp;
 
-static char ksym_name[KSYM_NAME_LEN] = "pid_max";
+static char ksym_name[KSYM_NAME_LEN] = "jiffies";
 module_param_string(ksym, ksym_name, KSYM_NAME_LEN, S_IRUGO);
 MODULE_PARM_DESC(ksym, "Kernel symbol to monitor; this module will report any"
 			" write operations on the kernel symbol");
@@ -41,9 +41,13 @@ static int __init hw_break_module_init(v
 {
 	int ret;
 	struct perf_event_attr attr;
+	void *addr = __symbol_get(ksym_name);
+
+	if (!addr)
+		return -ENXIO;
 
 	hw_breakpoint_init(&attr);
-	attr.bp_addr = kallsyms_lookup_name(ksym_name);
+	attr.bp_addr = (unsigned long)addr;
 	attr.bp_len = HW_BREAKPOINT_LEN_4;
 	attr.bp_type = HW_BREAKPOINT_W;
 
@@ -66,6 +70,7 @@ fail:
 static void __exit hw_break_module_exit(void)
 {
 	unregister_wide_hw_breakpoint(sample_hbp);
+	symbol_put(ksym_name);
 	printk(KERN_INFO "HW Breakpoint for %s write uninstalled\n", ksym_name);
 }
 
_

Patches currently in -mm which might be from will@kernel.org are

samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch
samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch
kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch

* + kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (201 preceding siblings ...)
  2020-02-24 22:31 ` + samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch " Andrew Morton
@ 2020-02-24 22:31 ` Andrew Morton
  2020-02-24 22:55 ` + loop-use-worker-per-cgroup-instead-of-kworker.patch " Andrew Morton
                   ` (35 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:31 UTC (permalink / raw)
  To: ast, frederic, gregkh, hch, mhiramat, mm-commits, prasad,
	qperret, tglx, will


The patch titled
     Subject: kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()
has been added to the -mm tree.  Its filename is
     kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Will Deacon <will@kernel.org>
Subject: kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to
modules despite having no in-tree users and being wide open to abuse by
out-of-tree modules that can use them as a method to invoke arbitrary
non-exported kernel functions.

Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol().

Link: http://lkml.kernel.org/r/20200221114404.14641-4-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Quentin Perret <qperret@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/kallsyms.c |    2 --
 1 file changed, 2 deletions(-)

--- a/kernel/kallsyms.c~kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol
+++ a/kernel/kallsyms.c
@@ -175,7 +175,6 @@ unsigned long kallsyms_lookup_name(const
 	}
 	return module_kallsyms_lookup_name(name);
 }
-EXPORT_SYMBOL_GPL(kallsyms_lookup_name);
 
 int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
 				      unsigned long),
@@ -194,7 +193,6 @@ int kallsyms_on_each_symbol(int (*fn)(vo
 	}
 	return module_kallsyms_on_each_symbol(fn, data);
 }
-EXPORT_SYMBOL_GPL(kallsyms_on_each_symbol);
 
 static unsigned long get_symbol_pos(unsigned long addr,
 				    unsigned long *symbolsize,
_

Patches currently in -mm which might be from will@kernel.org are

samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch
samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch
kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch

* + loop-use-worker-per-cgroup-instead-of-kworker.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (202 preceding siblings ...)
  2020-02-24 22:31 ` + kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch " Andrew Morton
@ 2020-02-24 22:55 ` Andrew Morton
  2020-02-24 22:55 ` + mm-charge-active-memcg-when-no-mm-is-set.patch " Andrew Morton
                   ` (34 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:55 UTC (permalink / raw)
  To: axboe, chris, guro, hannes, hughd, lizefan, mhocko, mm-commits,
	schatzberg.dan, shakeelb, tglx, tj, vdavydov.dev, yang.shi


The patch titled
     Subject: loop: use worker per cgroup instead of kworker
has been added to the -mm tree.  Its filename is
     loop-use-worker-per-cgroup-instead-of-kworker.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/loop-use-worker-per-cgroup-instead-of-kworker.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/loop-use-worker-per-cgroup-instead-of-kworker.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included in linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Dan Schatzberg <schatzberg.dan@gmail.com>
Subject: loop: use worker per cgroup instead of kworker

Patch series "Charge loop device i/o to issuing cgroup", v3.

The loop device runs all I/O to the backing file on a separate kworker
thread, which results in all I/O being charged to the root cgroup.  This
allows a loop device to be used to trivially bypass resource limits and
other policy.  This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on a cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
    sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through the
single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device.  This patch series makes some
minor modifications to the loop driver so that each cgroup can make
forward progress independently, avoiding this inversion.

With this patch series applied, the above script triggers OOM kills when
writing through the loop device as expected.


This patch (of 3):

Existing uses of the loop device may have multiple cgroups reading/writing
to the same device.  Simply charging resources for I/O to the backing file
could result in priority inversion where one cgroup gets synchronously
blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue where
each work item is a "struct loop_worker" which contains a queue of struct
loop_cmds to issue.  The loop device maintains a tree mapping blk css_id
-> loop_worker.  This allows each cgroup to independently make forward
progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which can
be used in cases of extreme memory shortage where we cannot allocate a
loop_worker.

The locking for the tree and queues is fairly heavy-handed - we acquire
the per-loop-device spinlock any time either is accessed.  The existing
implementation serializes all I/O through a single thread anyway, so I
don't believe this is any worse.

Link: http://lkml.kernel.org/r/eab018412a0c9feb573d757b1bbd5f58b33e2a8d.1582581887.git.schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/loop.c |  207 +++++++++++++++++++++++++++++++++++------
 drivers/block/loop.h |   11 +-
 2 files changed, 188 insertions(+), 30 deletions(-)

--- a/drivers/block/loop.c~loop-use-worker-per-cgroup-instead-of-kworker
+++ a/drivers/block/loop.c
@@ -70,7 +70,6 @@
 #include <linux/writeback.h>
 #include <linux/completion.h>
 #include <linux/highmem.h>
-#include <linux/kthread.h>
 #include <linux/splice.h>
 #include <linux/sysfs.h>
 #include <linux/miscdevice.h>
@@ -83,6 +82,8 @@
 
 #include <linux/uaccess.h>
 
+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
@@ -891,27 +892,100 @@ static void loop_config_discard(struct l
 
 static void loop_unprepare_queue(struct loop_device *lo)
 {
-	kthread_flush_worker(&lo->worker);
-	kthread_stop(lo->worker_task);
-}
-
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
-	current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
-	return kthread_worker_fn(worker_ptr);
+	destroy_workqueue(lo->workqueue);
 }
 
 static int loop_prepare_queue(struct loop_device *lo)
 {
-	kthread_init_worker(&lo->worker);
-	lo->worker_task = kthread_run(loop_kthread_worker_fn,
-			&lo->worker, "loop%d", lo->lo_number);
-	if (IS_ERR(lo->worker_task))
+	lo->workqueue = alloc_workqueue("loop%d",
+					WQ_UNBOUND | WQ_FREEZABLE |
+					WQ_MEM_RECLAIM,
+					lo->lo_number);
+	if (IS_ERR(lo->workqueue))
 		return -ENOMEM;
-	set_user_nice(lo->worker_task, MIN_NICE);
+
 	return 0;
 }
 
+struct loop_worker {
+	struct rb_node rb_node;
+	struct work_struct work;
+	struct list_head cmd_list;
+	struct list_head idle_list;
+	struct loop_device *lo;
+	struct cgroup_subsys_state *css;
+	unsigned long last_ran_at;
+};
+
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
+
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
+{
+	struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+	struct loop_worker *cur_worker, *worker = NULL;
+	struct work_struct *work;
+	struct list_head *cmd_list;
+
+	spin_lock_irq(&lo->lo_lock);
+
+	if (!cmd->css)
+		goto queue_work;
+
+	node = &lo->worker_tree.rb_node;
+
+	while (*node) {
+		parent = *node;
+		cur_worker = container_of(*node, struct loop_worker, rb_node);
+		if (cur_worker->css == cmd->css) {
+			worker = cur_worker;
+			break;
+		} else if ((long)cur_worker->css < (long)cmd->css) {
+			node = &(*node)->rb_left;
+		} else {
+			node = &(*node)->rb_right;
+		}
+	}
+	if (worker)
+		goto queue_work;
+
+	worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+	/*
+	 * In the event we cannot allocate a worker, just queue on the
+	 * rootcg worker
+	 */
+	if (!worker)
+		goto queue_work;
+
+	worker->css = cmd->css;
+	css_get(worker->css);
+	INIT_WORK(&worker->work, loop_workfn);
+	INIT_LIST_HEAD(&worker->cmd_list);
+	INIT_LIST_HEAD(&worker->idle_list);
+	worker->lo = lo;
+	rb_link_node(&worker->rb_node, parent, node);
+	rb_insert_color(&worker->rb_node, &lo->worker_tree);
+queue_work:
+	if (worker) {
+		/*
+		 * We need to remove from the idle list here while
+		 * holding the lock so that the idle timer doesn't
+		 * free the worker
+		 */
+		if (!list_empty(&worker->idle_list))
+			list_del_init(&worker->idle_list);
+		work = &worker->work;
+		cmd_list = &worker->cmd_list;
+	} else {
+		work = &lo->rootcg_work;
+		cmd_list = &lo->rootcg_cmd_list;
+	}
+	list_add_tail(&cmd->list_entry, cmd_list);
+	queue_work(lo->workqueue, work);
+	spin_unlock_irq(&lo->lo_lock);
+}
+
 static void loop_update_rotational(struct loop_device *lo)
 {
 	struct file *file = lo->lo_backing_file;
@@ -993,6 +1067,12 @@ static int loop_set_fd(struct loop_devic
 
 	set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);
 
+	INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
+	INIT_LIST_HEAD(&lo->rootcg_cmd_list);
+	INIT_LIST_HEAD(&lo->idle_worker_list);
+	lo->worker_tree = RB_ROOT;
+	timer_setup(&lo->timer, loop_free_idle_workers,
+		TIMER_DEFERRABLE);
 	lo->use_dio = false;
 	lo->lo_device = bdev;
 	lo->lo_flags = lo_flags;
@@ -1101,6 +1181,7 @@ static int __loop_clr_fd(struct loop_dev
 	int err = 0;
 	bool partscan = false;
 	int lo_number;
+	struct loop_worker *pos, *worker;
 
 	mutex_lock(&loop_ctl_mutex);
 	if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
@@ -1117,9 +1198,28 @@ static int __loop_clr_fd(struct loop_dev
 	/* freeze request queue during the transition */
 	blk_mq_freeze_queue(lo->lo_queue);
 
+	/*
+	 * Ordering here is a bit tricky:
+	 *
+	 * 1) flush/destroy workqueue without lock held so outstanding
+	 * I/O is issued and all active workers go idle
+	 *
+	 * 2) Grab lock, free all idle workers
+	 *
+	 * 3) unlock, del_timer_sync so if timer raced it will be a no-op
+	 */
+	loop_unprepare_queue(lo);
 	spin_lock_irq(&lo->lo_lock);
 	lo->lo_backing_file = NULL;
+	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
+				idle_list) {
+		list_del(&worker->idle_list);
+		rb_erase(&worker->rb_node, &lo->worker_tree);
+		css_put(worker->css);
+		kfree(worker);
+	}
 	spin_unlock_irq(&lo->lo_lock);
+	del_timer_sync(&lo->timer);
 
 	loop_release_xfer(lo);
 	lo->transfer = NULL;
@@ -1154,7 +1254,6 @@ static int __loop_clr_fd(struct loop_dev
 
 	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
 	lo_number = lo->lo_number;
-	loop_unprepare_queue(lo);
 out_unlock:
 	mutex_unlock(&loop_ctl_mutex);
 	if (partscan) {
@@ -1932,7 +2031,7 @@ static blk_status_t loop_queue_rq(struct
 	} else
 #endif
 		cmd->css = NULL;
-	kthread_queue_work(&lo->worker, &cmd->work);
+	loop_queue_work(lo, cmd);
 
 	return BLK_STS_OK;
 }
@@ -1958,26 +2057,82 @@ static void loop_handle_cmd(struct loop_
 	}
 }
 
-static void loop_queue_work(struct kthread_work *work)
+static void loop_set_timer(struct loop_device *lo)
+{
+	timer_reduce(&lo->timer, jiffies + LOOP_IDLE_WORKER_TIMEOUT);
+}
+
+static void loop_process_work(struct loop_worker *worker,
+			struct list_head *cmd_list, struct loop_device *lo)
 {
-	struct loop_cmd *cmd =
-		container_of(work, struct loop_cmd, work);
+	int orig_flags = current->flags;
+	struct loop_cmd *cmd;
 
-	loop_handle_cmd(cmd);
+	current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+	spin_lock_irq(&lo->lo_lock);
+	while (!list_empty(cmd_list)) {
+		cmd = container_of(
+			cmd_list->next, struct loop_cmd, list_entry);
+		list_del(cmd_list->next);
+		spin_unlock_irq(&lo->lo_lock);
+
+		loop_handle_cmd(cmd);
+		cond_resched();
+
+		spin_lock_irq(&lo->lo_lock);
+	}
+
+	/*
+	 * We only add to the idle list if there are no pending cmds
+	 * *and* the worker will not run again which ensures that it
+	 * is safe to free any worker on the idle list
+	 */
+	if (worker && !work_pending(&worker->work)) {
+		worker->last_ran_at = jiffies;
+		list_add_tail(&worker->idle_list, &lo->idle_worker_list);
+		loop_set_timer(lo);
+	}
+	spin_unlock_irq(&lo->lo_lock);
+	current->flags = orig_flags;
 }
 
-static int loop_init_request(struct blk_mq_tag_set *set, struct request *rq,
-		unsigned int hctx_idx, unsigned int numa_node)
+static void loop_workfn(struct work_struct *work)
 {
-	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+	struct loop_worker *worker =
+		container_of(work, struct loop_worker, work);
+	loop_process_work(worker, &worker->cmd_list, worker->lo);
+}
 
-	kthread_init_work(&cmd->work, loop_queue_work);
-	return 0;
+static void loop_rootcg_workfn(struct work_struct *work)
+{
+	struct loop_device *lo =
+		container_of(work, struct loop_device, rootcg_work);
+	loop_process_work(NULL, &lo->rootcg_cmd_list, lo);
+}
+
+static void loop_free_idle_workers(struct timer_list *timer)
+{
+	struct loop_device *lo = container_of(timer, struct loop_device, timer);
+	struct loop_worker *pos, *worker;
+
+	spin_lock_irq(&lo->lo_lock);
+	list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
+				idle_list) {
+		if (time_is_after_jiffies(worker->last_ran_at +
+						LOOP_IDLE_WORKER_TIMEOUT))
+			break;
+		list_del(&worker->idle_list);
+		rb_erase(&worker->rb_node, &lo->worker_tree);
+		css_put(worker->css);
+		kfree(worker);
+	}
+	if (!list_empty(&lo->idle_worker_list))
+		loop_set_timer(lo);
+	spin_unlock_irq(&lo->lo_lock);
 }
 
 static const struct blk_mq_ops loop_mq_ops = {
 	.queue_rq       = loop_queue_rq,
-	.init_request	= loop_init_request,
 	.complete	= lo_complete_rq,
 };
 
--- a/drivers/block/loop.h~loop-use-worker-per-cgroup-instead-of-kworker
+++ a/drivers/block/loop.h
@@ -14,7 +14,6 @@
 #include <linux/blk-mq.h>
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
-#include <linux/kthread.h>
 #include <uapi/linux/loop.h>
 
 /* Possible states of device */
@@ -54,8 +53,12 @@ struct loop_device {
 
 	spinlock_t		lo_lock;
 	int			lo_state;
-	struct kthread_worker	worker;
-	struct task_struct	*worker_task;
+	struct workqueue_struct *workqueue;
+	struct work_struct      rootcg_work;
+	struct list_head        rootcg_cmd_list;
+	struct list_head        idle_worker_list;
+	struct rb_root          worker_tree;
+	struct timer_list       timer;
 	bool			use_dio;
 	bool			sysfs_inited;
 
@@ -65,7 +68,7 @@ struct loop_device {
 };
 
 struct loop_cmd {
-	struct kthread_work work;
+	struct list_head list_entry;
 	bool use_aio; /* use AIO interface to handle I/O */
 	atomic_t ref; /* only for aio */
 	long ret;
_

Patches currently in -mm which might be from schatzberg.dan@gmail.com are

loop-use-worker-per-cgroup-instead-of-kworker.patch
mm-charge-active-memcg-when-no-mm-is-set.patch
loop-charge-i-o-to-mem-and-blk-cg.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-charge-active-memcg-when-no-mm-is-set.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (203 preceding siblings ...)
  2020-02-24 22:55 ` + loop-use-worker-per-cgroup-instead-of-kworker.patch " Andrew Morton
@ 2020-02-24 22:55 ` Andrew Morton
  2020-02-24 22:55 ` + loop-charge-i-o-to-mem-and-blk-cg.patch " Andrew Morton
                   ` (33 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:55 UTC (permalink / raw)
  To: axboe, chris, guro, hannes, hughd, lizefan, mhocko, mm-commits,
	schatzberg.dan, shakeelb, tglx, tj, vdavydov.dev, yang.shi


The patch titled
     Subject: mm: charge active memcg when no mm is set
has been added to the -mm tree.  Its filename is
     mm-charge-active-memcg-when-no-mm-is-set.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-charge-active-memcg-when-no-mm-is-set.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-charge-active-memcg-when-no-mm-is-set.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Dan Schatzberg <schatzberg.dan@gmail.com>
Subject: mm: charge active memcg when no mm is set

memalloc_use_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (e.g. gup).

3. Otherwise consult the current process context. If it has configured
   current->active_memcg, use that; otherwise, use current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_try_charge (case 3) it
would always charge the root cgroup.  Now it looks up the current
active_memcg first (falling back to charging the root cgroup if not set).
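
For illustration, a hedged sketch of how a caller could exploit case 3
after this change (the locals here are made-up names; memalloc_use_memcg()
and mem_cgroup_try_charge() are the interfaces this tree provides):

	struct mem_cgroup *out_memcg;
	int err;

	memalloc_use_memcg(memcg);	/* set current->active_memcg */
	/* NULL mm: case 3 above, so the active memcg is charged */
	err = mem_cgroup_try_charge(page, NULL, GFP_KERNEL, &out_memcg, false);
	memalloc_unuse_memcg();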

Link: http://lkml.kernel.org/r/4456276af198412b2d41cd09d246cd20e0c6d22a.1582581887.git.schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   11 ++++++++---
 mm/shmem.c      |    4 ++--
 2 files changed, 10 insertions(+), 5 deletions(-)

--- a/mm/memcontrol.c~mm-charge-active-memcg-when-no-mm-is-set
+++ a/mm/memcontrol.c
@@ -6320,7 +6320,8 @@ exit:
  * @compound: charge the page as compound or small page
  *
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ * pages according to @gfp_mask if necessary. If @mm is NULL, try to
+ * charge to the active memcg.
  *
  * Returns 0 on success, with *@memcgp pointing to the charged memcg.
  * Otherwise, an error code is returned.
@@ -6364,8 +6365,12 @@ int mem_cgroup_try_charge(struct page *p
 		}
 	}
 
-	if (!memcg)
-		memcg = get_mem_cgroup_from_mm(mm);
+	if (!memcg) {
+		if (!mm)
+			memcg = get_mem_cgroup_from_current();
+		else
+			memcg = get_mem_cgroup_from_mm(mm);
+	}
 
 	ret = try_charge(memcg, gfp_mask, nr_pages);
 
--- a/mm/shmem.c~mm-charge-active-memcg-when-no-mm-is-set
+++ a/mm/shmem.c
@@ -1631,7 +1631,7 @@ static int shmem_swapin_page(struct inod
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+	struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
 	struct mem_cgroup *memcg;
 	struct page *page;
 	swp_entry_t swap;
@@ -1766,7 +1766,7 @@ repeat:
 	}
 
 	sbinfo = SHMEM_SB(inode->i_sb);
-	charge_mm = vma ? vma->vm_mm : current->mm;
+	charge_mm = vma ? vma->vm_mm : NULL;
 
 	page = find_lock_entry(mapping, index);
 	if (xa_is_value(page)) {
_

Patches currently in -mm which might be from schatzberg.dan@gmail.com are

loop-use-worker-per-cgroup-instead-of-kworker.patch
mm-charge-active-memcg-when-no-mm-is-set.patch
loop-charge-i-o-to-mem-and-blk-cg.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + loop-charge-i-o-to-mem-and-blk-cg.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (204 preceding siblings ...)
  2020-02-24 22:55 ` + mm-charge-active-memcg-when-no-mm-is-set.patch " Andrew Morton
@ 2020-02-24 22:55 ` Andrew Morton
  2020-02-24 23:33 ` + mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch " Andrew Morton
                   ` (32 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 22:55 UTC (permalink / raw)
  To: axboe, chris, guro, hannes, hughd, lizefan, mhocko, mm-commits,
	schatzberg.dan, shakeelb, tglx, tj, vdavydov.dev, yang.shi


The patch titled
     Subject: loop: charge i/o to mem and blk cg
has been added to the -mm tree.  Its filename is
     loop-charge-i-o-to-mem-and-blk-cg.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/loop-charge-i-o-to-mem-and-blk-cg.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/loop-charge-i-o-to-mem-and-blk-cg.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Dan Schatzberg <schatzberg.dan@gmail.com>
Subject: loop: charge i/o to mem and blk cg

The current code associates with the existing blkcg only when aio is used
to access the backing file.  This patch covers all types of i/o to the
backing file, and also associates the memcg, so that if the backing file
is on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css so it can be used by the loop
module.
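
For reference, the core of the change is the bracketing this patch adds
to loop_handle_cmd(), condensed here from the diff below (the NULL checks
and the final css_put() are elided):

	kthread_associate_blkcg(cmd->blkcg_css);
	memalloc_use_memcg(mem_cgroup_from_css(cmd->memcg_css));
	ret = do_req_filebacked(lo, rq);
	kthread_associate_blkcg(NULL);
	memalloc_unuse_memcg();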

Link: http://lkml.kernel.org/r/206afee596c3f30c05b31ae5fc8b7f5d58863dc0.1582581887.git.schatzberg.dan@gmail.com
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Down <chris@chrisdown.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/loop.c       |   59 ++++++++++++++++++++++-------------
 drivers/block/loop.h       |    3 +
 include/linux/memcontrol.h |    6 +++
 kernel/cgroup/cgroup.c     |    1 
 4 files changed, 47 insertions(+), 22 deletions(-)

--- a/drivers/block/loop.c~loop-charge-i-o-to-mem-and-blk-cg
+++ a/drivers/block/loop.c
@@ -77,6 +77,7 @@
 #include <linux/uio.h>
 #include <linux/ioprio.h>
 #include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>
 
 #include "loop.h"
 
@@ -504,8 +505,6 @@ static void lo_rw_aio_complete(struct ki
 {
 	struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-	if (cmd->css)
-		css_put(cmd->css);
 	cmd->ret = ret;
 	lo_rw_aio_do_completion(cmd);
 }
@@ -566,8 +565,6 @@ static int lo_rw_aio(struct loop_device
 	cmd->iocb.ki_complete = lo_rw_aio_complete;
 	cmd->iocb.ki_flags = IOCB_DIRECT;
 	cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-	if (cmd->css)
-		kthread_associate_blkcg(cmd->css);
 
 	if (rw == WRITE)
 		ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -575,7 +572,6 @@ static int lo_rw_aio(struct loop_device
 		ret = call_read_iter(file, &cmd->iocb, &iter);
 
 	lo_rw_aio_do_completion(cmd);
-	kthread_associate_blkcg(NULL);
 
 	if (ret != -EIOCBQUEUED)
 		cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -913,7 +909,7 @@ struct loop_worker {
 	struct list_head cmd_list;
 	struct list_head idle_list;
 	struct loop_device *lo;
-	struct cgroup_subsys_state *css;
+	struct cgroup_subsys_state *blkcg_css;
 	unsigned long last_ran_at;
 };
 
@@ -930,7 +926,7 @@ static void loop_queue_work(struct loop_
 
 	spin_lock_irq(&lo->lo_lock);
 
-	if (!cmd->css)
+	if (!cmd->blkcg_css)
 		goto queue_work;
 
 	node = &lo->worker_tree.rb_node;
@@ -938,10 +934,10 @@ static void loop_queue_work(struct loop_
 	while (*node) {
 		parent = *node;
 		cur_worker = container_of(*node, struct loop_worker, rb_node);
-		if (cur_worker->css == cmd->css) {
+		if (cur_worker->blkcg_css == cmd->blkcg_css) {
 			worker = cur_worker;
 			break;
-		} else if ((long)cur_worker->css < (long)cmd->css) {
+		} else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
 			node = &(*node)->rb_left;
 		} else {
 			node = &(*node)->rb_right;
@@ -953,13 +949,16 @@ static void loop_queue_work(struct loop_
 	worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
 	/*
 	 * In the event we cannot allocate a worker, just queue on the
-	 * rootcg worker
+	 * rootcg worker and issue the I/O as the rootcg
 	 */
-	if (!worker)
+	if (!worker) {
+		cmd->blkcg_css = NULL;
+		cmd->memcg_css = NULL;
 		goto queue_work;
+	}
 
-	worker->css = cmd->css;
-	css_get(worker->css);
+	worker->blkcg_css = cmd->blkcg_css;
+	css_get(worker->blkcg_css);
 	INIT_WORK(&worker->work, loop_workfn);
 	INIT_LIST_HEAD(&worker->cmd_list);
 	INIT_LIST_HEAD(&worker->idle_list);
@@ -1215,7 +1214,7 @@ static int __loop_clr_fd(struct loop_dev
 				idle_list) {
 		list_del(&worker->idle_list);
 		rb_erase(&worker->rb_node, &lo->worker_tree);
-		css_put(worker->css);
+		css_put(worker->blkcg_css);
 		kfree(worker);
 	}
 	spin_unlock_irq(&lo->lo_lock);
@@ -2024,13 +2023,18 @@ static blk_status_t loop_queue_rq(struct
 	}
 
 	/* always use the first bio's css */
+	cmd->blkcg_css = NULL;
+	cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-	if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-		cmd->css = &bio_blkcg(rq->bio)->css;
-		css_get(cmd->css);
-	} else
+	if (rq->bio && rq->bio->bi_blkg) {
+		cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+		cmd->memcg_css =
+			cgroup_get_e_css(cmd->blkcg_css->cgroup,
+					&memory_cgrp_subsys);
+#endif
+	}
 #endif
-		cmd->css = NULL;
 	loop_queue_work(lo, cmd);
 
 	return BLK_STS_OK;
@@ -2048,8 +2052,21 @@ static void loop_handle_cmd(struct loop_
 		goto failed;
 	}
 
+	if (cmd->blkcg_css)
+		kthread_associate_blkcg(cmd->blkcg_css);
+	if (cmd->memcg_css)
+		memalloc_use_memcg(mem_cgroup_from_css(cmd->memcg_css));
+
 	ret = do_req_filebacked(lo, rq);
- failed:
+
+	if (cmd->blkcg_css)
+		kthread_associate_blkcg(NULL);
+
+	if (cmd->memcg_css) {
+		memalloc_unuse_memcg();
+		css_put(cmd->memcg_css);
+	}
+failed:
 	/* complete non-aio request */
 	if (!cmd->use_aio || ret) {
 		cmd->ret = ret ? -EIO : 0;
@@ -2123,7 +2140,7 @@ static void loop_free_idle_workers(struc
 			break;
 		list_del(&worker->idle_list);
 		rb_erase(&worker->rb_node, &lo->worker_tree);
-		css_put(worker->css);
+		css_put(worker->blkcg_css);
 		kfree(worker);
 	}
 	if (!list_empty(&lo->idle_worker_list))
--- a/drivers/block/loop.h~loop-charge-i-o-to-mem-and-blk-cg
+++ a/drivers/block/loop.h
@@ -74,7 +74,8 @@ struct loop_cmd {
 	long ret;
 	struct kiocb iocb;
 	struct bio_vec *bvec;
-	struct cgroup_subsys_state *css;
+	struct cgroup_subsys_state *blkcg_css;
+	struct cgroup_subsys_state *memcg_css;
 };
 
 /* Support for loadable transfer modules */
--- a/include/linux/memcontrol.h~loop-charge-i-o-to-mem-and-blk-cg
+++ a/include/linux/memcontrol.h
@@ -922,6 +922,12 @@ static inline struct mem_cgroup *get_mem
 	return NULL;
 }
 
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return NULL;
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
--- a/kernel/cgroup/cgroup.c~loop-charge-i-o-to-mem-and-blk-cg
+++ a/kernel/cgroup/cgroup.c
@@ -587,6 +587,7 @@ out_unlock:
 	rcu_read_unlock();
 	return css;
 }
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);
 
 static void cgroup_get_live(struct cgroup *cgrp)
 {
_

Patches currently in -mm which might be from schatzberg.dan@gmail.com are

loop-use-worker-per-cgroup-instead-of-kworker.patch
mm-charge-active-memcg-when-no-mm-is-set.patch
loop-charge-i-o-to-mem-and-blk-cg.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (205 preceding siblings ...)
  2020-02-24 22:55 ` + loop-charge-i-o-to-mem-and-blk-cg.patch " Andrew Morton
@ 2020-02-24 23:33 ` Andrew Morton
  2020-02-24 23:39 ` + checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch " Andrew Morton
                   ` (31 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 23:33 UTC (permalink / raw)
  To: anshuman.khandual, lixinhai.lxh, mhocko, mike.kravetz,
	mm-commits, n-horiguchi


The patch titled
     Subject: mm/mempolicy: check hugepage migration is supported by arch in vma_migratable()
has been added to the -mm tree.  Its filename is
     mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Li Xinhai <lixinhai.lxh@gmail.com>
Subject: mm/mempolicy: check hugepage migration is supported by arch in vma_migratable()

vma_migratable() is called to check whether the pages in a vma can be
migrated before going ahead with further actions.  Currently it is used
in the following code paths:

- task_numa_work
- mbind
- move_pages

For hugetlb mapping, whether vma is migratable or not is determined by:
- CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
- arch_hugetlb_migration_supported

Issue: the current code checks CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
alone, and no code should use that option directly.  (Note that the
current code in vma_migratable() doesn't cause a failure or bug, because
unmap_and_move_huge_page() will catch an unsupported hugepage and handle
it properly.)

This patch checks both factors via hugepage_migration_supported(),
improving the code's logic and robustness.  It enables an early bail-out
of the hugepage migration procedure, but because all architectures that
currently support hugepage migration do so for all page sizes, we would
not see a performance gain with this patch applied.
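
For reference, a sketch (not the exact tree source) of how
hugepage_migration_supported() combines the two factors above:

	static inline bool hugepage_migration_supported(struct hstate *h)
	{
		if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
			return false;
		return arch_hugetlb_migration_supported(h);
	}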

vma_migratable() is moved to mm/mempolicy.c, because the circular
reference between mempolicy.h and hugetlb.h makes defining it inline
infeasible.

Link: http://lkml.kernel.org/r/1579786179-30633-1-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mempolicy.h |   29 +----------------------------
 mm/mempolicy.c            |   28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 28 deletions(-)

--- a/include/linux/mempolicy.h~mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable
+++ a/include/linux/mempolicy.h
@@ -173,34 +173,7 @@ extern int mpol_parse_str(char *str, str
 extern void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol);
 
 /* Check if a vma is migratable */
-static inline bool vma_migratable(struct vm_area_struct *vma)
-{
-	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
-		return false;
-
-	/*
-	 * DAX device mappings require predictable access latency, so avoid
-	 * incurring periodic faults.
-	 */
-	if (vma_is_dax(vma))
-		return false;
-
-#ifndef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
-	if (vma->vm_flags & VM_HUGETLB)
-		return false;
-#endif
-
-	/*
-	 * Migration allocates pages in the highest zone. If we cannot
-	 * do so then migration (at least from node to node) is not
-	 * possible.
-	 */
-	if (vma->vm_file &&
-		gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
-								< policy_zone)
-			return false;
-	return true;
-}
+extern bool vma_migratable(struct vm_area_struct *vma);
 
 extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
 extern void mpol_put_task_policy(struct task_struct *);
--- a/mm/mempolicy.c~mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable
+++ a/mm/mempolicy.c
@@ -1742,6 +1742,34 @@ COMPAT_SYSCALL_DEFINE4(migrate_pages, co
 
 #endif /* CONFIG_COMPAT */
 
+bool vma_migratable(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+		return false;
+
+	/*
+	 * DAX device mappings require predictable access latency, so avoid
+	 * incurring periodic faults.
+	 */
+	if (vma_is_dax(vma))
+		return false;
+
+	if (is_vm_hugetlb_page(vma) &&
+		!hugepage_migration_supported(hstate_vma(vma)))
+		return false;
+
+	/*
+	 * Migration allocates pages in the highest zone. If we cannot
+	 * do so then migration (at least from node to node) is not
+	 * possible.
+	 */
+	if (vma->vm_file &&
+		gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
+			< policy_zone)
+		return false;
+	return true;
+}
+
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
_

Patches currently in -mm which might be from lixinhai.lxh@gmail.com are

mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch
revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch
mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch
mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch
mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (206 preceding siblings ...)
  2020-02-24 23:33 ` + mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch " Andrew Morton
@ 2020-02-24 23:39 ` Andrew Morton
  2020-02-24 23:40 ` + checkpatch-fix-multiple-const-types.patch " Andrew Morton
                   ` (30 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 23:39 UTC (permalink / raw)
  To: borneo.antonio, joe, mm-commits


The patch titled
     Subject: checkpatch: fix minor typo and mixed space+tab in indentation
has been added to the -mm tree.  Its filename is
     checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Antonio Borneo <borneo.antonio@gmail.com>
Subject: checkpatch: fix minor typo and mixed space+tab in indentation

Fix spelling of "concatenation".
Don't use tab after space in indentation.

Link: http://lkml.kernel.org/r/20200122163852.124417-2-borneo.antonio@gmail.com
Signed-off-by: Antonio Borneo <borneo.antonio@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation
+++ a/scripts/checkpatch.pl
@@ -4629,7 +4629,7 @@ sub process {
 					    ($op eq '>' &&
 					     $ca =~ /<\S+\@\S+$/))
 					{
-					    	$ok = 1;
+						$ok = 1;
 					}
 
 					# for asm volatile statements
@@ -4964,7 +4964,7 @@ sub process {
 			# conditional.
 			substr($s, 0, length($c), '');
 			$s =~ s/\n.*//g;
-			$s =~ s/$;//g; 	# Remove any comments
+			$s =~ s/$;//g;	# Remove any comments
 			if (length($c) && $s !~ /^\s*{?\s*\\*\s*$/ &&
 			    $c !~ /}\s*while\s*/)
 			{
@@ -5003,7 +5003,7 @@ sub process {
 # if and else should not have general statements after it
 		if ($line =~ /^.\s*(?:}\s*)?else\b(.*)/) {
 			my $s = $1;
-			$s =~ s/$;//g; 	# Remove any comments
+			$s =~ s/$;//g;	# Remove any comments
 			if ($s !~ /^\s*(?:\sif|(?:{|)\s*\\?\s*$)/) {
 				ERROR("TRAILING_STATEMENTS",
 				      "trailing statements should be on next line\n" . $herecurr);
@@ -5179,7 +5179,7 @@ sub process {
 			{
 			}
 
-			# Flatten any obvious string concatentation.
+			# Flatten any obvious string concatenation.
 			while ($dstat =~ s/($String)\s*$Ident/$1/ ||
 			       $dstat =~ s/$Ident\s*($String)/$1/)
 			{
_

Patches currently in -mm which might be from borneo.antonio@gmail.com are

checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch
checkpatch-fix-multiple-const-types.patch
checkpatch-add-command-line-option-for-tab-size.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + checkpatch-fix-multiple-const-types.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (207 preceding siblings ...)
  2020-02-24 23:39 ` + checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch " Andrew Morton
@ 2020-02-24 23:40 ` Andrew Morton
  2020-02-24 23:40 ` + checkpatch-add-command-line-option-for-tab-size.patch " Andrew Morton
                   ` (29 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 23:40 UTC (permalink / raw)
  To: borneo.antonio, joe, mm-commits


The patch titled
     Subject: checkpatch: fix multiple const * types
has been added to the -mm tree.  Its filename is
     checkpatch-fix-multiple-const-types.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/checkpatch-fix-multiple-const-types.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/checkpatch-fix-multiple-const-types.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Antonio Borneo <borneo.antonio@gmail.com>
Subject: checkpatch: fix multiple const * types

Commit 1574a29f8e76 ("checkpatch: allow multiple const * types") claims to
support repetition of pattern "const *", but it actually allows only one
extra instance.

Check the following lines
	int a(char const * const x[]);
	int b(char const * const *x);
	int c(char const * const * const x[]);
	int d(char const * const * const *x);

with command

	./scripts/checkpatch.pl --show-types -f filename

to find that only the first line passes the test, while a warning
is triggered by the other 3 lines:

	WARNING:FUNCTION_ARGUMENTS: function definition argument
	'char const * const' should also have an identifier name

The reason is that the pattern match halts at the second asterisk in the
line, so the remaining text, which starts with an asterisk, fails to
match a valid variable name.

Fixed by replacing "?" (Match 1 or 0 times) with "{0,4}" (Match no more
than 4 times) in the regular expression.  Also fix the similar test for
types in unusual order.

Link: http://lkml.kernel.org/r/20200122163852.124417-1-borneo.antonio@gmail.com
Fixes: 1574a29f8e76 ("checkpatch: allow multiple const * types")
Signed-off-by: Antonio Borneo <borneo.antonio@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-fix-multiple-const-types
+++ a/scripts/checkpatch.pl
@@ -818,12 +818,12 @@ sub build_types {
 		  }x;
 	$Type	= qr{
 			$NonptrType
-			(?:(?:\s|\*|\[\])+\s*const|(?:\s|\*\s*(?:const\s*)?|\[\])+|(?:\s*\[\s*\])+)?
+			(?:(?:\s|\*|\[\])+\s*const|(?:\s|\*\s*(?:const\s*)?|\[\])+|(?:\s*\[\s*\])+){0,4}
 			(?:\s+$Inline|\s+$Modifier)*
 		  }x;
 	$TypeMisordered	= qr{
 			$NonptrTypeMisordered
-			(?:(?:\s|\*|\[\])+\s*const|(?:\s|\*\s*(?:const\s*)?|\[\])+|(?:\s*\[\s*\])+)?
+			(?:(?:\s|\*|\[\])+\s*const|(?:\s|\*\s*(?:const\s*)?|\[\])+|(?:\s*\[\s*\])+){0,4}
 			(?:\s+$Inline|\s+$Modifier)*
 		  }x;
 	$Declare	= qr{(?:$Storage\s+(?:$Inline\s+)?)?$Type};
_

Patches currently in -mm which might be from borneo.antonio@gmail.com are

checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch
checkpatch-fix-multiple-const-types.patch
checkpatch-add-command-line-option-for-tab-size.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + checkpatch-add-command-line-option-for-tab-size.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (208 preceding siblings ...)
  2020-02-24 23:40 ` + checkpatch-fix-multiple-const-types.patch " Andrew Morton
@ 2020-02-24 23:40 ` Andrew Morton
  2020-02-24 23:43 ` + lib-test_bitmap-make-use-of-exp2_in_bits.patch " Andrew Morton
                   ` (28 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 23:40 UTC (permalink / raw)
  To: borneo.antonio, erik.ahlen, joe, mm-commits, spen


The patch titled
     Subject: checkpatch: add command-line option for TAB size
has been added to the -mm tree.  Its filename is
     checkpatch-add-command-line-option-for-tab-size.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/checkpatch-add-command-line-option-for-tab-size.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/checkpatch-add-command-line-option-for-tab-size.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Antonio Borneo <borneo.antonio@gmail.com>
Subject: checkpatch: add command-line option for TAB size

Linux kernel coding style requires a size of 8 characters for both TAB
and indentation, and this value is embedded as a magic value all over the
checkpatch script.

This makes it hard to reuse the script in other projects with different
coding-style requirements (e.g.  OpenOCD [1] requires a TAB size of 4
characters [2]).

Replace the magic value 8 with a variable.

Add a command-line option "--tab-size" to let the user select a
TAB size value other than 8.

[1] http://openocd.org/
[2] http://openocd.org/doc/doxygen/html/stylec.html#styleformat
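
For example, a project using a 4-character TAB could then run (the file
path is illustrative):

	./scripts/checkpatch.pl --tab-size=4 -f path/to/file.c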

Link: http://lkml.kernel.org/r/20200122163852.124417-3-borneo.antonio@gmail.com
Signed-off-by: Antonio Borneo <borneo.antonio@gmail.com>
Signed-off-by: Erik Ahlén <erik.ahlen@avalonenterprise.com>
Signed-off-by: Spencer Oliver <spen@spen-soft.co.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-add-command-line-option-for-tab-size
+++ a/scripts/checkpatch.pl
@@ -64,6 +64,7 @@ my $color = "auto";
 my $allow_c99_comments = 1; # Can be overridden by --ignore C99_COMMENT_TOLERANCE
 # git output parsing needs US English output, so first set backtick child process LANGUAGE
 my $git_command ='export LANGUAGE=en_US.UTF-8; git';
+my $tabsize = 8;
 
 sub help {
 	my ($exitcode) = @_;
@@ -98,6 +99,7 @@ Options:
   --show-types               show the specific message type in the output
   --max-line-length=n        set the maximum line length, if exceeded, warn
   --min-conf-desc-length=n   set the min description length, if shorter, warn
+  --tab-size=n               set the number of spaces for tab (default 8)
   --root=PATH                PATH to the kernel tree root
   --no-summary               suppress the per-file summary
   --mailback                 only produce a report in case of warnings/errors
@@ -215,6 +217,7 @@ GetOptions(
 	'list-types!'	=> \$list_types,
 	'max-line-length=i' => \$max_line_length,
 	'min-conf-desc-length=i' => \$min_conf_desc_length,
+	'tab-size=i'	=> \$tabsize,
 	'root=s'	=> \$root,
 	'summary!'	=> \$summary,
 	'mailback!'	=> \$mailback,
@@ -267,6 +270,9 @@ if ($color =~ /^[01]$/) {
 	die "Invalid color mode: $color\n";
 }
 
+# skip TAB size 1 to avoid additional checks on $tabsize - 1
+die "Invalid TAB size: $tabsize\n" if ($tabsize < 2);
+
 sub hash_save_array_words {
 	my ($hashRef, $arrayRef) = @_;
 
@@ -1253,7 +1259,7 @@ sub expand_tabs {
 		if ($c eq "\t") {
 			$res .= ' ';
 			$n++;
-			for (; ($n % 8) != 0; $n++) {
+			for (; ($n % $tabsize) != 0; $n++) {
 				$res .= ' ';
 			}
 			next;
@@ -2266,7 +2272,7 @@ sub string_find_replace {
 sub tabify {
 	my ($leading) = @_;
 
-	my $source_indent = 8;
+	my $source_indent = $tabsize;
 	my $max_spaces_before_tab = $source_indent - 1;
 	my $spaces_to_tab = " " x $source_indent;
 
@@ -3245,7 +3251,7 @@ sub process {
 		next if ($realfile !~ /\.(h|c|pl|dtsi|dts)$/);
 
 # at the beginning of a line any tabs must come first and anything
-# more than 8 must use tabs.
+# more than $tabsize must use tabs.
 		if ($rawline =~ /^\+\s* \t\s*\S/ ||
 		    $rawline =~ /^\+\s*        \s*/) {
 			my $herevet = "$here\n" . cat_vet($rawline) . "\n";
@@ -3264,7 +3270,7 @@ sub process {
 				"please, no space before tabs\n" . $herevet) &&
 			    $fix) {
 				while ($fixed[$fixlinenr] =~
-					   s/(^\+.*) {8,8}\t/$1\t\t/) {}
+					   s/(^\+.*) {$tabsize,$tabsize}\t/$1\t\t/) {}
 				while ($fixed[$fixlinenr] =~
 					   s/(^\+.*) +\t/$1\t/) {}
 			}
@@ -3286,11 +3292,11 @@ sub process {
 		if ($perl_version_ok &&
 		    $sline =~ /^\+\t+( +)(?:$c90_Keywords\b|\{\s*$|\}\s*(?:else\b|while\b|\s*$)|$Declare\s*$Ident\s*[;=])/) {
 			my $indent = length($1);
-			if ($indent % 8) {
+			if ($indent % $tabsize) {
 				if (WARN("TABSTOP",
 					 "Statements should start on a tabstop\n" . $herecurr) &&
 				    $fix) {
-					$fixed[$fixlinenr] =~ s@(^\+\t+) +@$1 . "\t" x ($indent/8)@e;
+					$fixed[$fixlinenr] =~ s@(^\+\t+) +@$1 . "\t" x ($indent/$tabsize)@e;
 				}
 			}
 		}
@@ -3308,8 +3314,8 @@ sub process {
 				my $newindent = $2;
 
 				my $goodtabindent = $oldindent .
-					"\t" x ($pos / 8) .
-					" "  x ($pos % 8);
+					"\t" x ($pos / $tabsize) .
+					" "  x ($pos % $tabsize);
 				my $goodspaceindent = $oldindent . " "  x $pos;
 
 				if ($newindent ne $goodtabindent &&
@@ -3780,11 +3786,11 @@ sub process {
 			#print "line<$line> prevline<$prevline> indent<$indent> sindent<$sindent> check<$check> continuation<$continuation> s<$s> cond_lines<$cond_lines> stat_real<$stat_real> stat<$stat>\n";
 
 			if ($check && $s ne '' &&
-			    (($sindent % 8) != 0 ||
+			    (($sindent % $tabsize) != 0 ||
 			     ($sindent < $indent) ||
 			     ($sindent == $indent &&
 			      ($s !~ /^\s*(?:\}|\{|else\b)/)) ||
-			     ($sindent > $indent + 8))) {
+			     ($sindent > $indent + $tabsize))) {
 				WARN("SUSPECT_CODE_INDENT",
 				     "suspect code indent for conditional statements ($indent, $sindent)\n" . $herecurr . "$stat_real\n");
 			}
_

Patches currently in -mm which might be from borneo.antonio@gmail.com are

checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch
checkpatch-fix-multiple-const-types.patch
checkpatch-add-command-line-option-for-tab-size.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-test_bitmap-make-use-of-exp2_in_bits.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (209 preceding siblings ...)
  2020-02-24 23:40 ` + checkpatch-add-command-line-option-for-tab-size.patch " Andrew Morton
@ 2020-02-24 23:43 ` Andrew Morton
  2020-02-24 23:44 ` + ocfs2-remove-useless-err.patch " Andrew Morton
                   ` (27 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 23:43 UTC (permalink / raw)
  To: alex.shi, andriy.shevchenko, mm-commits


The patch titled
     Subject: lib/test_bitmap.c: make use of EXP2_IN_BITS
has been added to the -mm tree.  Its filename is
     lib-test_bitmap-make-use-of-exp2_in_bits.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-test_bitmap-make-use-of-exp2_in_bits.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-test_bitmap-make-use-of-exp2_in_bits.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/test_bitmap.c: make use of EXP2_IN_BITS

Commit 30544ed5de43 ("lib/bitmap: introduce bitmap_replace() helper")
introduced some new test cases to the test_bitmap.c module.  Among them,
it also introduced an (unused) definition, EXP2_IN_BITS.  Let's make use
of it.

Link: http://lkml.kernel.org/r/20200121151847.75223-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_bitmap.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/lib/test_bitmap.c~lib-test_bitmap-make-use-of-exp2_in_bits
+++ a/lib/test_bitmap.c
@@ -278,6 +278,8 @@ static void __init test_replace(void)
 	unsigned int nlongs = DIV_ROUND_UP(nbits, BITS_PER_LONG);
 	DECLARE_BITMAP(bmap, 1024);
 
+	BUILD_BUG_ON(EXP2_IN_BITS < nbits * 2);
+
 	bitmap_zero(bmap, 1024);
 	bitmap_replace(bmap, &exp2[0 * nlongs], &exp2[1 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_0_1, nbits);
_

Patches currently in -mm which might be from andriy.shevchenko@linux.intel.com are

lib-test_bitmap-make-use-of-exp2_in_bits.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + ocfs2-remove-useless-err.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (210 preceding siblings ...)
  2020-02-24 23:43 ` + lib-test_bitmap-make-use-of-exp2_in_bits.patch " Andrew Morton
@ 2020-02-24 23:44 ` Andrew Morton
  2020-02-25  0:26 ` + arch-kconfig-update-have_reliable_stacktrace-description.patch " Andrew Morton
                   ` (26 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-24 23:44 UTC (permalink / raw)
  To: alex.shi, cg.chen, gregkh, jlbec, joseph.qi, kstewart, mark,
	mm-commits, rfontana, tglx


The patch titled
     Subject: ocfs2: remove useless err
has been added to the -mm tree.  Its filename is
     ocfs2-remove-useless-err.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/ocfs2-remove-useless-err.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/ocfs2-remove-useless-err.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Alex Shi <alex.shi@linux.alibaba.com>
Subject: ocfs2: remove useless err

We don't need 'err' in these two places; better to remove them.

Link: http://lkml.kernel.org/r/1579577836-251879-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: ChenGang <cg.chen@huawei.com>
Cc: Richard Fontana <rfontana@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/cluster/tcp.c |    3 +--
 fs/ocfs2/dir.c         |    4 ++--
 2 files changed, 3 insertions(+), 4 deletions(-)

--- a/fs/ocfs2/cluster/tcp.c~ocfs2-remove-useless-err
+++ a/fs/ocfs2/cluster/tcp.c
@@ -1948,7 +1948,6 @@ static void o2net_accept_many(struct wor
 {
 	struct socket *sock = o2net_listen_sock;
 	int	more;
-	int	err;
 
 	/*
 	 * It is critical to note that due to interrupt moderation
@@ -1963,7 +1962,7 @@ static void o2net_accept_many(struct wor
 	 */
 
 	for (;;) {
-		err = o2net_accept_one(sock, &more);
+		o2net_accept_one(sock, &more);
 		if (!more)
 			break;
 		cond_resched();
--- a/fs/ocfs2/dir.c~ocfs2-remove-useless-err
+++ a/fs/ocfs2/dir.c
@@ -676,7 +676,7 @@ static struct buffer_head *ocfs2_find_en
 	int ra_ptr = 0;		/* Current index into readahead
 				   buffer */
 	int num = 0;
-	int nblocks, i, err;
+	int nblocks, i;
 
 	sb = dir->i_sb;
 
@@ -708,7 +708,7 @@ restart:
 				num++;
 
 				bh = NULL;
-				err = ocfs2_read_dir_block(dir, b++, &bh,
+				ocfs2_read_dir_block(dir, b++, &bh,
 							   OCFS2_BH_READAHEAD);
 				bh_use[ra_max] = bh;
 			}
_

Patches currently in -mm which might be from alex.shi@linux.alibaba.com are

ocfs2-remove-fs_ocfs2_nm.patch
ocfs2-remove-unused-macros.patch
ocfs2-use-ocfs2_sec_bits-in-macro.patch
ocfs2-remove-dlm_lock_is_remote.patch
ocfs2-remove-useless-err.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + arch-kconfig-update-have_reliable_stacktrace-description.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (211 preceding siblings ...)
  2020-02-24 23:44 ` + ocfs2-remove-useless-err.patch " Andrew Morton
@ 2020-02-25  0:26 ` Andrew Morton
  2020-02-25  0:32 ` + mm-memcg-slab-introduce-mem_cgroup_from_obj.patch " Andrew Morton
                   ` (25 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  0:26 UTC (permalink / raw)
  To: jpoimboe, mbenes, mm-commits


The patch titled
     Subject: arch/Kconfig: update HAVE_RELIABLE_STACKTRACE description
has been added to the -mm tree.  Its filename is
     arch-kconfig-update-have_reliable_stacktrace-description.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/arch-kconfig-update-have_reliable_stacktrace-description.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/arch-kconfig-update-have_reliable_stacktrace-description.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Miroslav Benes <mbenes@suse.cz>
Subject: arch/Kconfig: update HAVE_RELIABLE_STACKTRACE description

save_stack_trace_tsk_reliable() is no longer the only function providing
reliable stack traces.  An architecture might define ARCH_STACKWALK,
which provides a newer stack-walking interface and has an
arch_stack_walk_reliable() function.  Update the description accordingly.

Link: http://lkml.kernel.org/r/20200120154042.9934-1-mbenes@suse.cz
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/arch/Kconfig~arch-kconfig-update-have_reliable_stacktrace-description
+++ a/arch/Kconfig
@@ -738,8 +738,9 @@ config HAVE_STACK_VALIDATION
 config HAVE_RELIABLE_STACKTRACE
 	bool
 	help
-	  Architecture has a save_stack_trace_tsk_reliable() function which
-	  only returns a stack trace if it can guarantee the trace is reliable.
+	  Architecture has either save_stack_trace_tsk_reliable() or
+	  arch_stack_walk_reliable() function which only returns a stack trace
+	  if it can guarantee the trace is reliable.
 
 config HAVE_ARCH_HASH
 	bool
_

Patches currently in -mm which might be from mbenes@suse.cz are

arch-kconfig-update-have_reliable_stacktrace-description.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-memcg-slab-introduce-mem_cgroup_from_obj.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (212 preceding siblings ...)
  2020-02-25  0:26 ` + arch-kconfig-update-have_reliable_stacktrace-description.patch " Andrew Morton
@ 2020-02-25  0:32 ` Andrew Morton
  2020-02-25  0:47 ` + mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch " Andrew Morton
                   ` (24 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  0:32 UTC (permalink / raw)
  To: guro, hannes, laoar.shao, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm: memcg/slab: introduce mem_cgroup_from_obj()
has been added to the -mm tree.  Its filename is
     mm-memcg-slab-introduce-mem_cgroup_from_obj.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcg-slab-introduce-mem_cgroup_from_obj.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@fb.com>
Subject: mm: memcg/slab: introduce mem_cgroup_from_obj()

Sometimes we need to get a memcg pointer from a charged kernel object. 
The right way to get it depends on whether it's a proper slab object or
it's backed by raw pages (e.g.  it's a vmalloc allocation).  In the first
case the kmem_cache->memcg_params.memcg indirection should be used; in
other cases it's just page->mem_cgroup.

To simplify this task and hide the implementation details, let's introduce
a mem_cgroup_from_obj() helper, which takes a pointer to any kernel object
and returns a valid memcg pointer or NULL.

Passing a kernel address rather than a pointer to a page will allow this
helper to be used for per-object (rather than per-page) tracked objects
in the future.

The caller is still responsible for ensuring that the returned memcg
isn't going away underneath it: take the RCU read lock, hold
cgroup_mutex, etc., depending on the context.

mem_cgroup_from_kmem() defined in mm/list_lru.c is now obsolete and can be
removed.
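
A hedged usage sketch (the pinning shown is illustrative of the rule
above):

	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = mem_cgroup_from_obj(p);	/* p: any charged kernel object */
	if (memcg && !css_tryget(&memcg->css))	/* pin past the RCU section */
		memcg = NULL;
	rcu_read_unlock();
	/* ... use memcg, then css_put(&memcg->css) if non-NULL ... */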

Link: http://lkml.kernel.org/r/20200117203609.3146239-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |    7 +++++++
 mm/list_lru.c              |   12 +-----------
 mm/memcontrol.c            |   32 +++++++++++++++++++++++++++++---
 3 files changed, 37 insertions(+), 14 deletions(-)

--- a/include/linux/memcontrol.h~mm-memcg-slab-introduce-mem_cgroup_from_obj
+++ a/include/linux/memcontrol.h
@@ -420,6 +420,8 @@ struct lruvec *mem_cgroup_page_lruvec(st
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
+struct mem_cgroup *mem_cgroup_from_obj(void *p);
+
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
@@ -912,6 +914,11 @@ static inline bool mm_match_cgroup(struc
 	return true;
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
 	return NULL;
--- a/mm/list_lru.c~mm-memcg-slab-introduce-mem_cgroup_from_obj
+++ a/mm/list_lru.c
@@ -57,16 +57,6 @@ list_lru_from_memcg_idx(struct list_lru_
 	return &nlru->lru;
 }
 
-static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
-{
-	struct page *page;
-
-	if (!memcg_kmem_enabled())
-		return NULL;
-	page = virt_to_head_page(ptr);
-	return memcg_from_slab_page(page);
-}
-
 static inline struct list_lru_one *
 list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 		   struct mem_cgroup **memcg_ptr)
@@ -77,7 +67,7 @@ list_lru_from_kmem(struct list_lru_node
 	if (!nlru->memcg_lrus)
 		goto out;
 
-	memcg = mem_cgroup_from_kmem(ptr);
+	memcg = mem_cgroup_from_obj(ptr);
 	if (!memcg)
 		goto out;
 
--- a/mm/memcontrol.c~mm-memcg-slab-introduce-mem_cgroup_from_obj
+++ a/mm/memcontrol.c
@@ -759,13 +759,12 @@ void __mod_lruvec_state(struct lruvec *l
 
 void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
 {
-	struct page *page = virt_to_head_page(p);
-	pg_data_t *pgdat = page_pgdat(page);
+	pg_data_t *pgdat = page_pgdat(virt_to_page(p));
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
 	rcu_read_lock();
-	memcg = memcg_from_slab_page(page);
+	memcg = mem_cgroup_from_obj(p);
 
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
 	if (!memcg || memcg == root_mem_cgroup) {
@@ -2637,6 +2636,33 @@ static void commit_charge(struct page *p
 		unlock_page_lru(page, isolated);
 }
 
+/*
+ * Returns a pointer to the memory cgroup to which the kernel object is charged.
+ *
+ * The caller must ensure the memcg lifetime, e.g. by taking rcu_read_lock(),
+ * cgroup_mutex, etc.
+ */
+struct mem_cgroup *mem_cgroup_from_obj(void *p)
+{
+	struct page *page;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	page = virt_to_head_page(p);
+
+	/*
+	 * Slab pages don't have page->mem_cgroup set because corresponding
+	 * kmem caches can be reparented during the lifetime. That's why
+	 * memcg_from_slab_page() should be used instead.
+	 */
+	if (PageSlab(page))
+		return memcg_from_slab_page(page);
+
+	/* All other pages use page->mem_cgroup */
+	return page->mem_cgroup;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_alloc_cache_id(void)
 {
_

Patches currently in -mm which might be from guro@fb.com are

mm-memcg-slab-introduce-mem_cgroup_from_obj.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (213 preceding siblings ...)
  2020-02-25  0:32 ` + mm-memcg-slab-introduce-mem_cgroup_from_obj.patch " Andrew Morton
@ 2020-02-25  0:47 ` Andrew Morton
  2020-02-25  0:48 ` + mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch " Andrew Morton
                   ` (23 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  0:47 UTC (permalink / raw)
  To: guro, hannes, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments
has been added to the -mm tree.  Its filename is
     mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@fb.com>
Subject: mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments

Patch series "mm: memcg: kmem API cleanup", v2.

This patchset aims to clean up the kernel memory charging API.  It doesn't
bring any functional changes, just removes unused arguments, renames some
functions and fixes some comments.

Currently it's not obvious which functions are most basic
(memcg_kmem_(un)charge_memcg()) and which are based on them
(memcg_kmem_(un)charge()).  The patchset renames these functions and
removes unused arguments:

TL;DR:
was:
  memcg_kmem_charge_memcg(page, gfp, order, memcg)
  memcg_kmem_uncharge_memcg(memcg, nr_pages)
  memcg_kmem_charge(page, gfp, order)
  memcg_kmem_uncharge(page, order)

now:
  memcg_kmem_charge(memcg, gfp, nr_pages)
  memcg_kmem_uncharge(memcg, nr_pages)
  memcg_kmem_charge_page(page, gfp, order)
  memcg_kmem_uncharge_page(page, order)
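
An illustrative (hypothetical) call-site update implied by the renames:

	-	ret = memcg_kmem_charge(page, gfp, order);
	+	ret = memcg_kmem_charge_page(page, gfp, order);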


This patch (of 6):

The first argument of memcg_kmem_charge_memcg() and
__memcg_kmem_charge_memcg() is the page pointer, and it's not used.  Let's
drop it.

The memcg pointer is passed as the last argument.  Move it to the first
position for consistency with other memcg functions, e.g.
__memcg_kmem_uncharge_memcg() or try_charge().

Link: http://lkml.kernel.org/r/20200109202659.752357-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |    9 ++++-----
 mm/memcontrol.c            |    8 +++-----
 mm/slab.h                  |    2 +-
 3 files changed, 8 insertions(+), 11 deletions(-)

--- a/include/linux/memcontrol.h~mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments
+++ a/include/linux/memcontrol.h
@@ -1371,8 +1371,7 @@ void memcg_kmem_put_cache(struct kmem_ca
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge(struct page *page, int order);
-int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
-			      struct mem_cgroup *memcg);
+int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 				 unsigned int nr_pages);
 
@@ -1409,11 +1408,11 @@ static inline void memcg_kmem_uncharge(s
 		__memcg_kmem_uncharge(page, order);
 }
 
-static inline int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp,
-					  int order, struct mem_cgroup *memcg)
+static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
+					  int order)
 {
 	if (memcg_kmem_enabled())
-		return __memcg_kmem_charge_memcg(page, gfp, order, memcg);
+		return __memcg_kmem_charge_memcg(memcg, gfp, order);
 	return 0;
 }
 
--- a/mm/memcontrol.c~mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments
+++ a/mm/memcontrol.c
@@ -2848,15 +2848,13 @@ void memcg_kmem_put_cache(struct kmem_ca
 
 /**
  * __memcg_kmem_charge_memcg: charge a kmem page
- * @page: page to charge
+ * @memcg: memory cgroup to charge
  * @gfp: reclaim mode
  * @order: allocation order
- * @memcg: memory cgroup to charge
  *
  * Returns 0 on success, an error code on failure.
  */
-int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
-			    struct mem_cgroup *memcg)
+int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order)
 {
 	unsigned int nr_pages = 1 << order;
 	struct page_counter *counter;
@@ -2902,7 +2900,7 @@ int __memcg_kmem_charge(struct page *pag
 
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
+		ret = __memcg_kmem_charge_memcg(memcg, gfp, order);
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
--- a/mm/slab.h~mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments
+++ a/mm/slab.h
@@ -365,7 +365,7 @@ static __always_inline int memcg_charge_
 		return 0;
 	}
 
-	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+	ret = memcg_kmem_charge_memcg(memcg, gfp, order);
 	if (ret)
 		goto out;
 
_

Patches currently in -mm which might be from guro@fb.com are

mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (214 preceding siblings ...)
  2020-02-25  0:47 ` + mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch " Andrew Morton
@ 2020-02-25  0:48 ` Andrew Morton
  2020-02-25  0:48 ` + mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch " Andrew Morton
                   ` (22 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  0:48 UTC (permalink / raw)
  To: guro, hannes, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments
has been added to the -mm tree.  Its filename is
     mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@fb.com>
Subject: mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments

Drop the unused page argument and put the memcg pointer in the first
position.  This makes the function consistent with its peers:
__memcg_kmem_uncharge_memcg(), memcg_kmem_charge_memcg(), etc.

Link: http://lkml.kernel.org/r/20200109202659.752357-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |    4 ++--
 mm/slab.h                  |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/memcontrol.h~mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments
+++ a/include/linux/memcontrol.h
@@ -1416,8 +1416,8 @@ static inline int memcg_kmem_charge_memc
 	return 0;
 }
 
-static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
-					     struct mem_cgroup *memcg)
+static inline void memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+					     int order)
 {
 	if (memcg_kmem_enabled())
 		__memcg_kmem_uncharge_memcg(memcg, 1 << order);
--- a/mm/slab.h~mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments
+++ a/mm/slab.h
@@ -395,7 +395,7 @@ static __always_inline void memcg_unchar
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
-		memcg_kmem_uncharge_memcg(page, order, memcg);
+		memcg_kmem_uncharge_memcg(memcg, order);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    -(1 << order));
_

Patches currently in -mm which might be from guro@fb.com are

mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (215 preceding siblings ...)
  2020-02-25  0:48 ` + mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch " Andrew Morton
@ 2020-02-25  0:48 ` Andrew Morton
  2020-02-25  0:48 ` + mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch " Andrew Morton
                   ` (21 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  0:48 UTC (permalink / raw)
  To: guro, hannes, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()
has been added to the -mm tree.  Its filename is
     mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@fb.com>
Subject: mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()

Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:

1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
   the current memcg

2) set or clear the PageKmemcg flag (sketched below)
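
A minimal sketch of that flow, condensed from the charge-path hunks
below (the bypass check and error paths are trimmed for brevity):

	int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
	{
		struct mem_cgroup *memcg;
		int ret = 0;

		memcg = get_mem_cgroup_from_current();
		if (!mem_cgroup_is_root(memcg)) {
			/* 1) charge the current memcg */
			ret = __memcg_kmem_charge_memcg(memcg, gfp, order);
			if (!ret) {
				/* 2) set the PageKmemcg flag */
				page->mem_cgroup = memcg;
				__SetPageKmemcg(page);
			}
		}
		css_put(&memcg->css);
		return ret;
	}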

Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/pipe.c                  |    2 +-
 include/linux/memcontrol.h |   23 +++++++++++++----------
 kernel/fork.c              |    9 +++++----
 mm/memcontrol.c            |    8 ++++----
 mm/page_alloc.c            |    4 ++--
 5 files changed, 25 insertions(+), 21 deletions(-)

--- a/fs/pipe.c~mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page
+++ a/fs/pipe.c
@@ -146,7 +146,7 @@ static int anon_pipe_buf_steal(struct pi
 	struct page *page = buf->page;
 
 	if (page_count(page) == 1) {
-		memcg_kmem_uncharge(page, 0);
+		memcg_kmem_uncharge_page(page, 0);
 		__SetPageLocked(page);
 		return 0;
 	}
--- a/include/linux/memcontrol.h~mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page
+++ a/include/linux/memcontrol.h
@@ -1369,8 +1369,8 @@ struct kmem_cache *memcg_kmem_get_cache(
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
-int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
-void __memcg_kmem_uncharge(struct page *page, int order);
+int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
+void __memcg_kmem_uncharge_page(struct page *page, int order);
 int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 				 unsigned int nr_pages);
@@ -1395,17 +1395,18 @@ static inline bool memcg_kmem_enabled(vo
 	return static_branch_unlikely(&memcg_kmem_enabled_key);
 }
 
-static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
+					 int order)
 {
 	if (memcg_kmem_enabled())
-		return __memcg_kmem_charge(page, gfp, order);
+		return __memcg_kmem_charge_page(page, gfp, order);
 	return 0;
 }
 
-static inline void memcg_kmem_uncharge(struct page *page, int order)
+static inline void memcg_kmem_uncharge_page(struct page *page, int order)
 {
 	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge(page, order);
+		__memcg_kmem_uncharge_page(page, order);
 }
 
 static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
@@ -1435,21 +1436,23 @@ static inline int memcg_cache_id(struct
 
 #else
 
-static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp,
+					 int order)
 {
 	return 0;
 }
 
-static inline void memcg_kmem_uncharge(struct page *page, int order)
+static inline void memcg_kmem_uncharge_page(struct page *page, int order)
 {
 }
 
-static inline int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+static inline int __memcg_kmem_charge_page(struct page *page, gfp_t gfp,
+					   int order)
 {
 	return 0;
 }
 
-static inline void __memcg_kmem_uncharge(struct page *page, int order)
+static inline void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 }
 
--- a/kernel/fork.c~mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page
+++ a/kernel/fork.c
@@ -281,7 +281,7 @@ static inline void free_thread_stack(str
 					     MEMCG_KERNEL_STACK_KB,
 					     -(int)(PAGE_SIZE / 1024));
 
-			memcg_kmem_uncharge(vm->pages[i], 0);
+			memcg_kmem_uncharge_page(vm->pages[i], 0);
 		}
 
 		for (i = 0; i < NR_CACHED_STACKS; i++) {
@@ -413,12 +413,13 @@ static int memcg_charge_kernel_stack(str
 
 		for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) {
 			/*
-			 * If memcg_kmem_charge() fails, page->mem_cgroup
-			 * pointer is NULL, and both memcg_kmem_uncharge()
+			 * If memcg_kmem_charge_page() fails, page->mem_cgroup
+			 * pointer is NULL, and both memcg_kmem_uncharge_page()
 			 * and mod_memcg_page_state() in free_thread_stack()
 			 * will ignore this page. So it's safe.
 			 */
-			ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0);
+			ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL,
+						     0);
 			if (ret)
 				return ret;
 
--- a/mm/memcontrol.c~mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page
+++ a/mm/memcontrol.c
@@ -2883,14 +2883,14 @@ int __memcg_kmem_charge_memcg(struct mem
 }
 
 /**
- * __memcg_kmem_charge: charge a kmem page to the current memory cgroup
+ * __memcg_kmem_charge_page: charge a kmem page to the current memory cgroup
  * @page: page to charge
  * @gfp: reclaim mode
  * @order: allocation order
  *
  * Returns 0 on success, an error code on failure.
  */
-int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
+int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 {
 	struct mem_cgroup *memcg;
 	int ret = 0;
@@ -2926,11 +2926,11 @@ void __memcg_kmem_uncharge_memcg(struct
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
 /**
- * __memcg_kmem_uncharge: uncharge a kmem page
+ * __memcg_kmem_uncharge_page: uncharge a kmem page
  * @page: page to uncharge
  * @order: allocation order
  */
-void __memcg_kmem_uncharge(struct page *page, int order)
+void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
 	struct mem_cgroup *memcg = page->mem_cgroup;
 	unsigned int nr_pages = 1 << order;
--- a/mm/page_alloc.c~mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page
+++ a/mm/page_alloc.c
@@ -1154,7 +1154,7 @@ static __always_inline bool free_pages_p
 	if (PageMappingFlags(page))
 		page->mapping = NULL;
 	if (memcg_kmem_enabled() && PageKmemcg(page))
-		__memcg_kmem_uncharge(page, order);
+		__memcg_kmem_uncharge_page(page, order);
 	if (check_free)
 		bad += free_pages_check(page);
 	if (bad)
@@ -4754,7 +4754,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
 
 out:
 	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
-	    unlikely(__memcg_kmem_charge(page, gfp_mask, order) != 0)) {
+	    unlikely(__memcg_kmem_charge_page(page, gfp_mask, order) != 0)) {
 		__free_pages(page, order);
 		page = NULL;
 	}
_

Patches currently in -mm which might be from guro@fb.com are

mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (216 preceding siblings ...)
  2020-02-25  0:48 ` + mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch " Andrew Morton
@ 2020-02-25  0:48 ` Andrew Morton
  2020-02-25  0:48 ` + mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch " Andrew Morton
                   ` (20 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  0:48 UTC (permalink / raw)
  To: guro, hannes, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()
has been added to the -mm tree.  Its filename is
     mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@fb.com>
Subject: mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()

These functions charge the given number of kernel pages to the given
memory cgroup.  The number doesn't have to be a power of two.  Let's make
them take an unsigned int nr_pages argument instead of the page order.

This makes them consistent with the corresponding uncharge functions and
with functions like mem_cgroup_charge_skmem(memcg, nr_pages).
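
Illustratively, an order-based call site now converts once and
explicitly (a hypothetical snippet mirroring the hunks below):

	unsigned int nr_pages = 1 << order;	/* order -> page count */

	ret = memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
	if (ret)
		return ret;
	/* ... and later, on the free path ... */
	memcg_kmem_uncharge_memcg(memcg, nr_pages);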

Link: http://lkml.kernel.org/r/20200109202659.752357-5-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |   11 ++++++-----
 mm/memcontrol.c            |    8 ++++----
 mm/slab.h                  |    2 +-
 3 files changed, 11 insertions(+), 10 deletions(-)

--- a/include/linux/memcontrol.h~mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg
+++ a/include/linux/memcontrol.h
@@ -1371,7 +1371,8 @@ void memcg_kmem_put_cache(struct kmem_ca
 #ifdef CONFIG_MEMCG_KMEM
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
-int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order);
+int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
+			      unsigned int nr_pages);
 void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
 				 unsigned int nr_pages);
 
@@ -1410,18 +1411,18 @@ static inline void memcg_kmem_uncharge_p
 }
 
 static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
-					  int order)
+					  unsigned int nr_pages)
 {
 	if (memcg_kmem_enabled())
-		return __memcg_kmem_charge_memcg(memcg, gfp, order);
+		return __memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
 	return 0;
 }
 
 static inline void memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
-					     int order)
+					     unsigned int nr_pages)
 {
 	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge_memcg(memcg, 1 << order);
+		__memcg_kmem_uncharge_memcg(memcg, nr_pages);
 }
 
 /*
--- a/mm/memcontrol.c~mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg
+++ a/mm/memcontrol.c
@@ -2850,13 +2850,13 @@ void memcg_kmem_put_cache(struct kmem_ca
  * __memcg_kmem_charge_memcg: charge a kmem page
  * @memcg: memory cgroup to charge
  * @gfp: reclaim mode
- * @order: allocation order
+ * @nr_pages: number of pages to charge
  *
  * Returns 0 on success, an error code on failure.
  */
-int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order)
+int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
+			      unsigned int nr_pages)
 {
-	unsigned int nr_pages = 1 << order;
 	struct page_counter *counter;
 	int ret;
 
@@ -2900,7 +2900,7 @@ int __memcg_kmem_charge_page(struct page
 
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge_memcg(memcg, gfp, order);
+		ret = __memcg_kmem_charge_memcg(memcg, gfp, 1 << order);
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
--- a/mm/slab.h~mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg
+++ a/mm/slab.h
@@ -365,7 +365,7 @@ static __always_inline int memcg_charge_
 		return 0;
 	}
 
-	ret = memcg_kmem_charge_memcg(memcg, gfp, order);
+	ret = memcg_kmem_charge_memcg(memcg, gfp, 1 << order);
 	if (ret)
 		goto out;
 
_

Patches currently in -mm which might be from guro@fb.com are

mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (217 preceding siblings ...)
  2020-02-25  0:48 ` + mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch " Andrew Morton
@ 2020-02-25  0:48 ` Andrew Morton
  2020-02-25  0:48 ` + mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch " Andrew Morton
                   ` (19 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  0:48 UTC (permalink / raw)
  To: guro, hannes, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm: memcg/slab: cache page number in memcg_(un)charge_slab()
has been added to the -mm tree.  Its filename is
     mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@fb.com>
Subject: mm: memcg/slab: cache page number in memcg_(un)charge_slab()

There are many places in memcg_charge_slab() and memcg_uncharge_slab()
which calculate the number of pages to charge, the number of css
references to grab, etc., depending on the order of the slab page.

Let's simplify the code by calculating this once and caching it in a
local variable.
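
In miniature, the pattern is (a hypothetical excerpt condensed from the
hunk below):

	unsigned int nr_pages = 1 << order;	/* computed once per call */

	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages);
	percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
	css_put_many(&memcg->css, nr_pages);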

Link: http://lkml.kernel.org/r/20200109202659.752357-6-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.h |   22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

--- a/mm/slab.h~mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab
+++ a/mm/slab.h
@@ -348,6 +348,7 @@ static __always_inline int memcg_charge_
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
 {
+	unsigned int nr_pages = 1 << order;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 	int ret;
@@ -360,21 +361,21 @@ static __always_inline int memcg_charge_
 
 	if (unlikely(!memcg || mem_cgroup_is_root(memcg))) {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    (1 << order));
-		percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+				    nr_pages);
+		percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
 		return 0;
 	}
 
-	ret = memcg_kmem_charge_memcg(memcg, gfp, 1 << order);
+	ret = memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
 	if (ret)
 		goto out;
 
 	lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), nr_pages);
 
 	/* transer try_charge() page references to kmem_cache */
-	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
-	css_put_many(&memcg->css, 1 << order);
+	percpu_ref_get_many(&s->memcg_params.refcnt, nr_pages);
+	css_put_many(&memcg->css, nr_pages);
 out:
 	css_put(&memcg->css);
 	return ret;
@@ -387,6 +388,7 @@ out:
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
+	unsigned int nr_pages = 1 << order;
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
@@ -394,15 +396,15 @@ static __always_inline void memcg_unchar
 	memcg = READ_ONCE(s->memcg_params.memcg);
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
-		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
-		memcg_kmem_uncharge_memcg(memcg, order);
+		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
+		memcg_kmem_uncharge_memcg(memcg, nr_pages);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
-				    -(1 << order));
+				    -nr_pages);
 	}
 	rcu_read_unlock();
 
-	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
+	percpu_ref_put_many(&s->memcg_params.refcnt, nr_pages);
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
_

Patches currently in -mm which might be from guro@fb.com are

mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (218 preceding siblings ...)
  2020-02-25  0:48 ` + mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch " Andrew Morton
@ 2020-02-25  0:48 ` Andrew Morton
  2020-02-25  2:29 ` + mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch " Andrew Morton
                   ` (18 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  0:48 UTC (permalink / raw)
  To: guro, hannes, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge()
has been added to the -mm tree.  Its filename is
     mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@fb.com>
Subject: mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to __memcg_kmem_(un)charge()

Drop the _memcg suffix from the (__)memcg_kmem_(un)charge functions.  The
shorter names are more obvious.

These are the most basic functions, which simply (un)charge the given
cgroup with the given number of pages.

Also fix up the corresponding comments.

Link: http://lkml.kernel.org/r/20200109202659.752357-7-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |   19 +++++++---------
 mm/memcontrol.c            |   40 +++++++++++++++++------------------
 mm/slab.h                  |    4 +--
 3 files changed, 31 insertions(+), 32 deletions(-)

--- a/include/linux/memcontrol.h~mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge
+++ a/include/linux/memcontrol.h
@@ -1369,12 +1369,11 @@ struct kmem_cache *memcg_kmem_get_cache(
 void memcg_kmem_put_cache(struct kmem_cache *cachep);
 
 #ifdef CONFIG_MEMCG_KMEM
+int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
+			unsigned int nr_pages);
+void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
-int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
-			      unsigned int nr_pages);
-void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
-				 unsigned int nr_pages);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
@@ -1410,19 +1409,19 @@ static inline void memcg_kmem_uncharge_p
 		__memcg_kmem_uncharge_page(page, order);
 }
 
-static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
-					  unsigned int nr_pages)
+static inline int memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
+				    unsigned int nr_pages)
 {
 	if (memcg_kmem_enabled())
-		return __memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
+		return __memcg_kmem_charge(memcg, gfp, nr_pages);
 	return 0;
 }
 
-static inline void memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
-					     unsigned int nr_pages)
+static inline void memcg_kmem_uncharge(struct mem_cgroup *memcg,
+				       unsigned int nr_pages)
 {
 	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge_memcg(memcg, nr_pages);
+		__memcg_kmem_uncharge(memcg, nr_pages);
 }
 
 /*
--- a/mm/memcontrol.c~mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge
+++ a/mm/memcontrol.c
@@ -2847,15 +2847,15 @@ void memcg_kmem_put_cache(struct kmem_ca
 }
 
 /**
- * __memcg_kmem_charge_memcg: charge a kmem page
+ * __memcg_kmem_charge: charge a number of kernel pages to a memcg
  * @memcg: memory cgroup to charge
  * @gfp: reclaim mode
  * @nr_pages: number of pages to charge
  *
  * Returns 0 on success, an error code on failure.
  */
-int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp,
-			      unsigned int nr_pages)
+int __memcg_kmem_charge(struct mem_cgroup *memcg, gfp_t gfp,
+			unsigned int nr_pages)
 {
 	struct page_counter *counter;
 	int ret;
@@ -2883,6 +2883,21 @@ int __memcg_kmem_charge_memcg(struct mem
 }
 
 /**
+ * __memcg_kmem_uncharge: uncharge a number of kernel pages from a memcg
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void __memcg_kmem_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		page_counter_uncharge(&memcg->kmem, nr_pages);
+
+	page_counter_uncharge(&memcg->memory, nr_pages);
+	if (do_memsw_account())
+		page_counter_uncharge(&memcg->memsw, nr_pages);
+}
+
+/**
  * __memcg_kmem_charge_page: charge a kmem page to the current memory cgroup
  * @page: page to charge
  * @gfp: reclaim mode
@@ -2900,7 +2915,7 @@ int __memcg_kmem_charge_page(struct page
 
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
-		ret = __memcg_kmem_charge_memcg(memcg, gfp, 1 << order);
+		ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
 		if (!ret) {
 			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
@@ -2911,21 +2926,6 @@ int __memcg_kmem_charge_page(struct page
 }
 
 /**
- * __memcg_kmem_uncharge_memcg: uncharge a kmem page
- * @memcg: memcg to uncharge
- * @nr_pages: number of pages to uncharge
- */
-void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
-				 unsigned int nr_pages)
-{
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		page_counter_uncharge(&memcg->kmem, nr_pages);
-
-	page_counter_uncharge(&memcg->memory, nr_pages);
-	if (do_memsw_account())
-		page_counter_uncharge(&memcg->memsw, nr_pages);
-}
-/**
  * __memcg_kmem_uncharge_page: uncharge a kmem page
  * @page: page to uncharge
  * @order: allocation order
@@ -2939,7 +2939,7 @@ void __memcg_kmem_uncharge_page(struct p
 		return;
 
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-	__memcg_kmem_uncharge_memcg(memcg, nr_pages);
+	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->mem_cgroup = NULL;
 
 	/* slab pages do not have PageKmemcg flag set */
--- a/mm/slab.h~mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge
+++ a/mm/slab.h
@@ -366,7 +366,7 @@ static __always_inline int memcg_charge_
 		return 0;
 	}
 
-	ret = memcg_kmem_charge_memcg(memcg, gfp, nr_pages);
+	ret = memcg_kmem_charge(memcg, gfp, nr_pages);
 	if (ret)
 		goto out;
 
@@ -397,7 +397,7 @@ static __always_inline void memcg_unchar
 	if (likely(!mem_cgroup_is_root(memcg))) {
 		lruvec = mem_cgroup_lruvec(memcg, page_pgdat(page));
 		mod_lruvec_state(lruvec, cache_vmstat_idx(s), -nr_pages);
-		memcg_kmem_uncharge_memcg(memcg, nr_pages);
+		memcg_kmem_uncharge(memcg, nr_pages);
 	} else {
 		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
 				    -nr_pages);
_

Patches currently in -mm which might be from guro@fb.com are

mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (219 preceding siblings ...)
  2020-02-25  0:48 ` + mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch " Andrew Morton
@ 2020-02-25  2:29 ` Andrew Morton
  2020-02-25  2:36 ` + ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch " Andrew Morton
                   ` (17 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  2:29 UTC (permalink / raw)
  To: guro, hannes, laoar.shao, mhocko, mm-commits, shakeelb, vdavydov.dev


The patch titled
     Subject: mm-memcg-slab-introduce-mem_cgroup_from_obj-v2
has been added to the -mm tree.  Its filename is
     mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Roman Gushchin <guro@fb.com>
Subject: mm-memcg-slab-introduce-mem_cgroup_from_obj-v2

fix build issues

Link: http://lkml.kernel.org/r/20200225022154.GA573375@carbon.DHCP.thefacebook.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |   14 +++++++-------
 mm/memcontrol.c            |    2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

--- a/include/linux/memcontrol.h~mm-memcg-slab-introduce-mem_cgroup_from_obj-v2
+++ a/include/linux/memcontrol.h
@@ -420,8 +420,6 @@ struct lruvec *mem_cgroup_page_lruvec(st
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
-struct mem_cgroup *mem_cgroup_from_obj(void *p);
-
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
@@ -914,11 +912,6 @@ static inline bool mm_match_cgroup(struc
 	return true;
 }
 
-static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
-{
-	return NULL;
-}
-
 static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
 	return NULL;
@@ -1434,6 +1427,8 @@ static inline int memcg_cache_id(struct
 	return memcg ? memcg->kmemcg_id : -1;
 }
 
+struct mem_cgroup *mem_cgroup_from_obj(void *p);
+
 #else
 
 static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
@@ -1475,6 +1470,11 @@ static inline void memcg_put_cache_ids(v
 {
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
+{
+       return NULL;
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #endif /* _LINUX_MEMCONTROL_H */
--- a/mm/memcontrol.c~mm-memcg-slab-introduce-mem_cgroup_from_obj-v2
+++ a/mm/memcontrol.c
@@ -2636,6 +2636,7 @@ static void commit_charge(struct page *p
 		unlock_page_lru(page, isolated);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
 /*
  * Returns a pointer to the memory cgroup to which the kernel object is charged.
  *
@@ -2663,7 +2664,6 @@ struct mem_cgroup *mem_cgroup_from_obj(v
 	return page->mem_cgroup;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
_

Patches currently in -mm which might be from guro@fb.com are

mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch
mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (220 preceding siblings ...)
  2020-02-25  2:29 ` + mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch " Andrew Morton
@ 2020-02-25  2:36 ` Andrew Morton
  2020-02-25  3:53 ` mmotm 2020-02-24-19-53 uploaded Andrew Morton
                   ` (16 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  2:36 UTC (permalink / raw)
  To: gechangwei, ghe, jbi.octave, jlbec, joseph.qi, junxiao.bi, mark,
	mm-commits, piaojun


The patch titled
     Subject: ocfs2: Add missing annotations for ocfs2_refcount_cache_lock() and ocfs2_refcount_cache_unlock()
has been added to the -mm tree.  Its filename is
     ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Jules Irenge <jbi.octave@gmail.com>
Subject: ocfs2: Add missing annotations for ocfs2_refcount_cache_lock() and ocfs2_refcount_cache_unlock()

Sparse reports warnings at ocfs2_refcount_cache_lock() and
ocfs2_refcount_cache_unlock():

warning: context imbalance in ocfs2_refcount_cache_lock()
	- wrong count at exit
warning: context imbalance in ocfs2_refcount_cache_unlock()
	- unexpected unlock

The root cause is the missing annotations at ocfs2_refcount_cache_lock()
and ocfs2_refcount_cache_unlock().

Add the missing __acquires(&rf->rf_lock) annotation to
ocfs2_refcount_cache_lock(), and the missing __releases(&rf->rf_lock)
annotation to ocfs2_refcount_cache_unlock().
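
For context, these sparse annotations pair a function that takes a lock
with an __acquires() hint and its unlocking counterpart with
__releases(), so sparse can balance the lock context across function
boundaries.  A minimal, self-contained sketch (the names here are
hypothetical, not taken from ocfs2):

	#include <linux/spinlock.h>

	struct foo {
		spinlock_t lock;
	};

	static void foo_lock(struct foo *f)
	__acquires(&f->lock)
	{
		spin_lock(&f->lock);	/* sparse: context count +1 */
	}

	static void foo_unlock(struct foo *f)
	__releases(&f->lock)
	{
		spin_unlock(&f->lock);	/* sparse: context count -1 */
	}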

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Link: http://lkml.kernel.org/r/20200224204130.18178-1-jbi.octave@gmail.com
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/refcounttree.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/fs/ocfs2/refcounttree.c~ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock
+++ a/fs/ocfs2/refcounttree.c
@@ -154,6 +154,7 @@ ocfs2_refcount_cache_get_super(struct oc
 }
 
 static void ocfs2_refcount_cache_lock(struct ocfs2_caching_info *ci)
+__acquires(&rf->rf_lock)
 {
 	struct ocfs2_refcount_tree *rf = cache_info_to_refcount(ci);
 
@@ -161,6 +162,7 @@ static void ocfs2_refcount_cache_lock(st
 }
 
 static void ocfs2_refcount_cache_unlock(struct ocfs2_caching_info *ci)
+__releases(&rf->rf_lock)
 {
 	struct ocfs2_refcount_tree *rf = cache_info_to_refcount(ci);
 
_

Patches currently in -mm which might be from jbi.octave@gmail.com are

ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* mmotm 2020-02-24-19-53 uploaded
  2020-02-04  1:33 incoming Andrew Morton
                   ` (221 preceding siblings ...)
  2020-02-25  2:36 ` + ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch " Andrew Morton
@ 2020-02-25  3:53 ` Andrew Morton
  2020-02-26  1:06 ` + checkpatch-improve-gerrit-change-id-test.patch added to -mm tree Andrew Morton
                   ` (15 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-25  3:53 UTC (permalink / raw)
  To: broonie, linux-fsdevel, linux-kernel, linux-mm, linux-next,
	mhocko, mm-commits, sfr

The mm-of-the-moment snapshot 2020-02-24-19-53 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  It is uploaded at random
intervals, hopefully more than once a week.

You will need quilt to apply these patches to the latest Linus release (5.x
or 5.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series
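
For example, one plausible way to apply the series with quilt
(illustrative commands; the unpacked directory layout is an assumption
based on the description above):

	$ cd linux-5.x				# the matching Linus release
	$ tar xzf broken-out.tar.gz		# patches plus the series file
	$ QUILT_PATCHES=broken-out quilt push -a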

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE-yyyy-mm-dd-hh-mm-ss.  Both contain the string yyyy-mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.


A full copy of the kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it is constantly rebased.

	https://github.com/hnaz/linux-mm

The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is also available at

	https://github.com/hnaz/linux-mm



This mmotm tree contains the following patches against 5.6-rc3:
(patches marked "*" will be included in linux-next)

* mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa.patch
* mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa-fix.patch
* mm-fix-possible-pmd-dirty-bit-lost-in-set_pmd_migration_entry.patch
* mm-avoid-data-corruption-on-cow-fault-into-pfn-mapped-vma.patch
* mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch
* proc-kpageflags-prevent-an-integer-overflow-in-stable_page_flags.patch
* proc-kpageflags-do-not-use-uninitialized-struct-pages.patch
* fat-fix-uninit-memory-access-for-partial-initialized-inode.patch
* mm-z3fold-do-not-include-rwlockh-directly.patch
* mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch
* arch-kconfig-update-have_reliable_stacktrace-description.patch
* x86-mm-split-vmalloc_sync_all.patch
* kthread-mark-timer-used-by-delayed-kthread-works-as-irq-safe.patch
* asm-generic-make-more-kernel-space-headers-mandatory.patch
* scripts-spellingtxt-add-syfs-sysfs-pattern.patch
* ocfs2-remove-fs_ocfs2_nm.patch
* ocfs2-remove-unused-macros.patch
* ocfs2-use-ocfs2_sec_bits-in-macro.patch
* ocfs2-remove-dlm_lock_is_remote.patch
* ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch
* ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch
* ocfs2-remove-useless-err.patch
* ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch
* ramfs-support-o_tmpfile.patch
* kernel-watchdog-flush-all-printk-nmi-buffers-when-hardlockup-detected.patch
  mm.patch
* mm-slubc-replace-cpu_slab-partial-with-wrapped-apis.patch
* mm-slubc-replace-kmem_cache-cpu_partial-with-wrapped-apis.patch
* mm-kmemleak-use-address-of-operator-on-section-symbols.patch
* mm-debug-add-tests-validating-architecture-page-table-helpers.patch
* mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch
* mm-page-writebackc-write_cache_pages-deduplicate-identical-checks.patch
* mm-gup-split-get_user_pages_remote-into-two-routines.patch
* mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch
* mm-introduce-page_ref_sub_return.patch
* mm-gup-pass-gup-flags-to-two-more-routines.patch
* mm-gup-require-foll_get-for-get_user_pages_fast.patch
* mm-gup-track-foll_pin-pages.patch
* mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch
* mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch
* mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch
* selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch
* mm-improve-dump_page-for-compound-pages.patch
* mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch
* mm-swap-move-inode_lock-out-of-claim_swapfile.patch
* mm-swapfilec-fix-comments-for-swapcache_prepare.patch
* mm-swapc-not-necessary-to-export-__pagevec_lru_add.patch
* mm-swapfile-fix-data-races-in-try_to_unuse.patch
* mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch
* mm-allocate-shrinker_map-on-appropriate-numa-node.patch
* mm-memcg-slab-introduce-mem_cgroup_from_obj.patch
* mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch
* mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch
* mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch
* mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch
* mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch
* mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch
* mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch
* mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch
* mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch
* revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch
* mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch
* mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch
* mm-vma-make-vma_is_accessible-available-for-general-use.patch
* mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch
* mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch
* mm-vma-append-unlikely-while-testing-vma-access-permissions.patch
* mm-mmap-fix-the-adjusted-length-error.patch
* mm-add-mremap_dontunmap-to-mremap.patch
* mm-add-mremap_dontunmap-to-mremap-v6.patch
* mm-add-mremap_dontunmap-to-mremap-v7.patch
* selftest-add-mremap_dontunmap-selftest.patch
* selftest-add-mremap_dontunmap-selftest-fix.patch
* selftest-add-mremap_dontunmap-selftest-v7.patch
* selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch
* mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch
* mm-sparse-rename-pfn_present-as-pfn_in_present_section.patch
* mm-page_alloc-increase-default-min_free_kbytes-bound.patch
* mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch
* mm-vmpressure-use-mem_cgroup_is_root-api.patch
* mm-vmscan-replace-open-codings-to-numa_no_node.patch
* mm-vmscanc-remove-cpu-online-notification-for-now.patch
* mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch
* mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch
* mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch
* hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch
* hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch
* mm-hugetlb_cgroup-fix-hugetlb_cgroup-migration.patch
* hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch
* hugetlb_cgroup-add-reservation-accounting-for-private-mappings-fix.patch
* hugetlb-disable-region_add-file_region-coalescing.patch
* hugetlb-disable-region_add-file_region-coalescing-fix.patch
* hugetlb_cgroup-add-accounting-for-shared-mappings.patch
* hugetlb_cgroup-add-accounting-for-shared-mappings-fix.patch
* hugetlb_cgroup-support-noreserve-mappings.patch
* hugetlb-support-file_region-coalescing-again.patch
* hugetlb-support-file_region-coalescing-again-fix.patch
* hugetlb-support-file_region-coalescing-again-fix-2.patch
* hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch
* hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch
* mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch
* mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch
* mm-migratec-check-pagelist-in-move_pages_and_store_status.patch
* mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch
* mm-migratec-migrate-pg_readahead-flag.patch
* mm-migratec-migrate-pg_readahead-flag-fix.patch
* drivers-base-memoryc-cache-memory-blocks-in-xarray-to-accelerate-lookup.patch
* drivers-base-memoryc-cache-memory-blocks-in-xarray-to-accelerate-lookup-fix.patch
* mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch
* mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch
* mm-add-function-__putback_isolated_page.patch
* mm-introduce-reported-pages.patch
* virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch
* virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch
* mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch
* mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch
* mm-page_reporting-add-free-page-reporting-documentation.patch
* drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch
* drivers-base-memoryc-drop-section_count.patch
* drivers-base-memoryc-drop-pages_correctly_probed.patch
* mm-page_extc-drop-pfn_present-check-when-onlining.patch
* mm-hotplug-only-respect-mem=-parameter-during-boot-stage.patch
* shmem-distribute-switch-variables-for-initialization.patch
* zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch
* info-task-hung-in-generic_file_write_iter.patch
* info-task-hung-in-generic_file_write-fix.patch
* kernel-hung_taskc-monitor-killed-tasks.patch
* proc-faster-open-read-close-with-permanent-files.patch
* proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch
* asm-generic-fix-unistd_32h-generation-format.patch
* kernel-extable-use-address-of-operator-on-section-symbols.patch
* maintainers-add-an-entry-for-kfifo.patch
* lib-test_lockup-test-module-to-generate-lockups.patch
* lib-bch-replace-zero-length-array-with-flexible-array-member.patch
* lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
* lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
* lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch
* lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch
* lib-test_stackinitc-xfail-switch-variable-init-tests.patch
* stackdepot-check-depot_index-before-accessing-the-stack-slab.patch
* stackdepot-build-with-fno-builtin.patch
* kasan-stackdepot-move-filter_irq_stacks-to-stackdepotc.patch
* percpu_counter-fix-a-data-race-at-vm_committed_as.patch
* lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch
* lib-test_bitmap-make-use-of-exp2_in_bits.patch
* string-add-stracpy-and-stracpy_pad-mechanisms.patch
* documentation-checkpatch-prefer-stracpy-strscpy-over-strcpy-strlcpy-strncpy.patch
* checkpatch-remove-email-address-comment-from-email-address-comparisons.patch
* checkpatch-check-spdx-tags-in-yaml-files.patch
* checkpatch-support-base-commit-format.patch
* checkpatch-prefer-fallthrough-over-fallthrough-comments.patch
* checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch
* checkpatch-fix-multiple-const-types.patch
* checkpatch-add-command-line-option-for-tab-size.patch
* epoll-fix-possible-lost-wakeup-on-epoll_ctl-path.patch
* kselftest-introduce-new-epoll-test-case.patch
* elf-delete-loc-variable.patch
* elf-allocate-less-for-static-executable.patch
* elf-dont-free-interpreters-elf-pheaders-on-common-path.patch
* samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch
* samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch
* kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch
* init-mainc-mark-boot_config_checksum-static.patch
* loop-use-worker-per-cgroup-instead-of-kworker.patch
* mm-charge-active-memcg-when-no-mm-is-set.patch
* loop-charge-i-o-to-mem-and-blk-cg.patch
* kernel-relayc-fix-read_pos-error-when-multiple-readers.patch
* aio-simplify-read_events.patch
* init-cleanup-anon_inodes-and-old-io-schedulers-options.patch
  linux-next.patch
  linux-next-rejects.patch
  linux-next-fix.patch
* mm-frontswap-mark-various-intentional-data-races.patch
* mm-page_io-mark-various-intentional-data-races.patch
* mm-page_io-mark-various-intentional-data-races-v2.patch
* mm-swap_state-mark-various-intentional-data-races.patch
* mm-kmemleak-annotate-various-data-races-obj-ptr.patch
* mm-filemap-fix-a-data-race-in-filemap_fault.patch
* mm-swapfile-fix-and-annotate-various-data-races.patch
* mm-swapfile-fix-and-annotate-various-data-races-v2.patch
* mm-page_counter-fix-various-data-races-at-memsw.patch
* mm-memcontrol-fix-a-data-race-in-scan-count.patch
* mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
* mm-mempool-fix-a-data-race-in-mempool_free.patch
* mm-util-annotate-an-data-race-at-vm_committed_as.patch
* mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
* mm-annotate-a-data-race-in-page_zonenum.patch
* mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch
* mm-add-vm_insert_pages.patch
* mm-add-vm_insert_pages-fix.patch
* mm-add-vm_insert_pages-2.patch
* net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch
* net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
* drivers-tty-serial-sh-scic-suppress-warning.patch
* fix-read-buffer-overflow-in-delta-ipc.patch
  make-sure-nobodys-leaking-resources.patch
  releasing-resources-with-children.patch
  mutex-subsystem-synchro-test-module.patch
  kernel-forkc-export-kernel_thread-to-modules.patch
  workaround-for-a-pci-restoring-bug.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + checkpatch-improve-gerrit-change-id-test.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (222 preceding siblings ...)
  2020-02-25  3:53 ` mmotm 2020-02-24-19-53 uploaded Andrew Morton
@ 2020-02-26  1:06 ` Andrew Morton
  2020-02-26  1:55 ` + dma-buf-free-dmabuf-name-in-dma_buf_release.patch " Andrew Morton
                   ` (14 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-26  1:06 UTC (permalink / raw)
  To: joe, john.stultz, mm-commits


The patch titled
     Subject: checkpatch: improve Gerrit Change-Id: test
has been added to the -mm tree.  Its filename is
     checkpatch-improve-gerrit-change-id-test.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/checkpatch-improve-gerrit-change-id-test.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/checkpatch-improve-gerrit-change-id-test.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Joe Perches <joe@perches.com>
Subject: checkpatch: improve Gerrit Change-Id: test

A Gerrit Change-Id: entry is sometimes placed after a Signed-off-by:
line.  When this occurs, the Gerrit warning is not currently emitted,
because the first Signed-off-by: signature sets a flag that stops the
check.

Change the test to also detect the '---' patch separator, and emit the
warning for any Change-Id: line that appears before the separator and
before any diff file name.
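
A hypothetical illustration (commit-message text, not code): with the
old test only the first placement warned; with this change both do,
because the check now runs until the '---' separator or the first diff
file name:

	Subject: example: do a thing

	Change-Id: Iaaaaaaaa		<- warned about, before and after

	Signed-off-by: A Developer <dev@example.com>
	Change-Id: Ibbbbbbbb		<- previously missed, now warned about
	---
	 file.c | 1 +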

Link: http://lkml.kernel.org/r/2f6d5f8766fe7439a116c77ea8cc721a3f2d77a2.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Tested-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |   13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-improve-gerrit-change-id-test
+++ a/scripts/checkpatch.pl
@@ -2349,6 +2349,7 @@ sub process {
 	my $is_binding_patch = -1;
 	my $in_header_lines = $file ? 0 : 1;
 	my $in_commit_log = 0;		#Scanning lines before patch
+	my $has_patch_separator = 0;	#Found a --- line
 	my $has_commit_log = 0;		#Encountered lines before patch
 	my $commit_log_lines = 0;	#Number of commit log lines
 	my $commit_log_possible_stack_dump = 0;
@@ -2674,6 +2675,12 @@ sub process {
 			}
 		}
 
+# Check for patch separator
+		if ($line =~ /^---$/) {
+			$has_patch_separator = 1;
+			$in_commit_log = 0;
+		}
+
 # Check if MAINTAINERS is being updated.  If so, there's probably no need to
 # emit the "does MAINTAINERS need updating?" message on file add/move/delete
 		if ($line =~ /^\s*MAINTAINERS\s*\|/) {
@@ -2773,10 +2780,10 @@ sub process {
 			     "A patch subject line should describe the change not the tool that found it\n" . $herecurr);
 		}
 
-# Check for unwanted Gerrit info
-		if ($in_commit_log && $line =~ /^\s*change-id:/i) {
+# Check for Gerrit Change-Ids not in any patch context
+		if ($realfile eq '' && !$has_patch_separator && $line =~ /^\s*change-id:/i) {
 			ERROR("GERRIT_CHANGE_ID",
-			      "Remove Gerrit Change-Id's before submitting upstream.\n" . $herecurr);
+			      "Remove Gerrit Change-Id's before submitting upstream\n" . $herecurr);
 		}
 
 # Check if the commit log is in a possible stack dump
_

Patches currently in -mm which might be from joe@perches.com are

string-add-stracpy-and-stracpy_pad-mechanisms.patch
checkpatch-remove-email-address-comment-from-email-address-comparisons.patch
checkpatch-prefer-fallthrough-over-fallthrough-comments.patch
checkpatch-improve-gerrit-change-id-test.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + dma-buf-free-dmabuf-name-in-dma_buf_release.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (223 preceding siblings ...)
  2020-02-26  1:06 ` + checkpatch-improve-gerrit-change-id-test.patch added to -mm tree Andrew Morton
@ 2020-02-26  1:55 ` Andrew Morton
       [not found]   ` <CAO_48GFr9-aY4=kRqWB=UkEzPj5fQDip+G1tNZMsT0XoQpBC7Q@mail.gmail.com>
  2020-02-26  3:42 ` + lib-rbtree-fix-coding-style-of-assignments.patch " Andrew Morton
                   ` (13 subsequent siblings)
  238 siblings, 1 reply; 244+ messages in thread
From: Andrew Morton @ 2020-02-26  1:55 UTC (permalink / raw)
  To: akpm, fengc, ghackmann, mm-commits, sumit.semwal, xiyou.wangcong


The patch titled
     Subject: dma-buf: free dmabuf->name in dma_buf_release()
has been added to the -mm tree.  Its filename is
     dma-buf-free-dmabuf-name-in-dma_buf_release.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/dma-buf-free-dmabuf-name-in-dma_buf_release.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/dma-buf-free-dmabuf-name-in-dma_buf_release.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Cong Wang <xiyou.wangcong@gmail.com>
Subject: dma-buf: free dmabuf->name in dma_buf_release()

A dma-buf name can be set via the DMA_BUF_SET_NAME ioctl, but once set it
never gets freed.

Free it in dma_buf_release().
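
For context, the name is allocated when userspace sets it.  A rough
sketch of the allocation side (recalled from the DMA_BUF_SET_NAME ioctl
path, not part of this patch, so treat the details as approximate):

	/* in the DMA_BUF_SET_NAME handler, roughly: */
	char *name = strndup_user(buf, DMA_BUF_NAME_LEN); /* kmalloc'd copy */
	...
	dmabuf->name = name;	/* never freed on release until this patch */

Since dma_buf_release() already frees the dmabuf itself, freeing the name
there is the natural place.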

Link: http://lkml.kernel.org/r/20200225204446.11378-1-xiyou.wangcong@gmail.com
Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")
Reported-by: syzbot+b2098bc44728a4efb3e9@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chenbo Feng <fengc@google.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Greg Hackmann <ghackmann@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dma-buf/dma-buf.c |    1 +
 1 file changed, 1 insertion(+)

--- a/drivers/dma-buf/dma-buf.c~dma-buf-free-dmabuf-name-in-dma_buf_release
+++ a/drivers/dma-buf/dma-buf.c
@@ -108,6 +108,7 @@ static int dma_buf_release(struct inode
 		dma_resv_fini(dmabuf->resv);
 
 	module_put(dmabuf->owner);
+	kfree(dmabuf->name);
 	kfree(dmabuf);
 	return 0;
 }
_

Patches currently in -mm which might be from xiyou.wangcong@gmail.com are

dma-buf-free-dmabuf-name-in-dma_buf_release.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-rbtree-fix-coding-style-of-assignments.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (224 preceding siblings ...)
  2020-02-26  1:55 ` + dma-buf-free-dmabuf-name-in-dma_buf_release.patch " Andrew Morton
@ 2020-02-26  3:42 ` Andrew Morton
  2020-02-26  3:56 ` + seq_read-info-message-about-buggy-next-functions.patch " Andrew Morton
                   ` (12 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-26  3:42 UTC (permalink / raw)
  To: chenqiwu, mm-commits, walken


The patch titled
     Subject: lib/rbtree: fix coding style of assignments
has been added to the -mm tree.  Its filename is
     lib-rbtree-fix-coding-style-of-assignments.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-rbtree-fix-coding-style-of-assignments.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-rbtree-fix-coding-style-of-assignments.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: chenqiwu <chenqiwu@xiaomi.com>
Subject: lib/rbtree: fix coding style of assignments

Leave a blank space between the left-hand and right-hand sides of each
assignment to better match the kernel coding style.

Link: http://lkml.kernel.org/r/1582621140-25850-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/rbtree.c       |    4 ++--
 tools/lib/rbtree.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

--- a/lib/rbtree.c~lib-rbtree-fix-coding-style-of-assignments
+++ a/lib/rbtree.c
@@ -503,7 +503,7 @@ struct rb_node *rb_next(const struct rb_
 	if (node->rb_right) {
 		node = node->rb_right;
 		while (node->rb_left)
-			node=node->rb_left;
+			node = node->rb_left;
 		return (struct rb_node *)node;
 	}
 
@@ -535,7 +535,7 @@ struct rb_node *rb_prev(const struct rb_
 	if (node->rb_left) {
 		node = node->rb_left;
 		while (node->rb_right)
-			node=node->rb_right;
+			node = node->rb_right;
 		return (struct rb_node *)node;
 	}
 
--- a/tools/lib/rbtree.c~lib-rbtree-fix-coding-style-of-assignments
+++ a/tools/lib/rbtree.c
@@ -497,7 +497,7 @@ struct rb_node *rb_next(const struct rb_
 	if (node->rb_right) {
 		node = node->rb_right;
 		while (node->rb_left)
-			node=node->rb_left;
+			node = node->rb_left;
 		return (struct rb_node *)node;
 	}
 
@@ -528,7 +528,7 @@ struct rb_node *rb_prev(const struct rb_
 	if (node->rb_left) {
 		node = node->rb_left;
 		while (node->rb_right)
-			node=node->rb_right;
+			node = node->rb_right;
 		return (struct rb_node *)node;
 	}
 
_

Patches currently in -mm which might be from chenqiwu@xiaomi.com are

mm-slubc-replace-cpu_slab-partial-with-wrapped-apis.patch
mm-slubc-replace-kmem_cache-cpu_partial-with-wrapped-apis.patch
lib-rbtree-fix-coding-style-of-assignments.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + seq_read-info-message-about-buggy-next-functions.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (225 preceding siblings ...)
  2020-02-26  3:42 ` + lib-rbtree-fix-coding-style-of-assignments.patch " Andrew Morton
@ 2020-02-26  3:56 ` Andrew Morton
       [not found]   ` <1583173259.7365.142.camel@lca.pw>
  2020-02-26  3:56 ` + pstore_ftrace_seq_next-should-increase-position-index.patch " Andrew Morton
                   ` (11 subsequent siblings)
  238 siblings, 1 reply; 244+ messages in thread
From: Andrew Morton @ 2020-02-26  3:56 UTC (permalink / raw)
  To: dave, longman, manfred, mingo, mm-commits, neilb, oberpar,
	rostedt, viro, vvs


The patch titled
     Subject: fs/seq_file.c: seq_read(): add info message about buggy .next functions
has been added to the -mm tree.  Its filename is
     seq_read-info-message-about-buggy-next-functions.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/seq_read-info-message-about-buggy-next-functions.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/seq_read-info-message-about-buggy-next-functions.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Vasily Averin <vvs@virtuozzo.com>
Subject: fs/seq_file.c: seq_read(): add info message about buggy .next functions

Patch series "seq_file .next functions should increase position index".

In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c:
simplify seq_file iteration code and interface")

"Some ->next functions do not increment *pos when they return NULL... 
Note that such ->next functions are buggy and should be fixed.  A simple
demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size
larger than the size of /proc/swaps.  This will always show the whole last
line of /proc/swaps"

The described problem is still present.  If you lseek into the middle of
the last output line, the following read will output the end of the last
line and then the whole last line once again.

$ dd if=/proc/swaps bs=1  # usual output
Filename				Type		Size	Used	Priority
/dev/dm-0                               partition	4194812	97536	-2
104+0 records in
104+0 records out
104 bytes copied

$ dd if=/proc/swaps bs=40 skip=1    # last line was generated twice
dd: /proc/swaps: cannot skip to specified offset
v/dm-0                               partition	4194812	97536	-2
/dev/dm-0                               partition	4194812	97536	-2 
3+1 records in
3+1 records out
131 bytes copied

There are a lot of other affected files; I've found 30+, including
/proc/net/ip_tables_matches and /proc/sysvipc/*.

I've already sent patches to the mailing lists of the affected subsystems;
this patch set fixes the problem in files related to pstore, tracing,
gcov, sysvipc and other subsystems handled via the linux-kernel@ mailing
list directly.
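
For reference, a well-behaved ->next always advances the position index,
even when it hits the end of the sequence.  A minimal sketch, with
hypothetical iterator names:

	static void *example_seq_next(struct seq_file *m, void *v, loff_t *pos)
	{
		struct example_iter *iter = v;

		(*pos)++;			/* advance unconditionally */
		if (!example_iter_advance(iter))
			return NULL;		/* EOF, but index still moved */
		return iter;
	}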

https://bugzilla.kernel.org/show_bug.cgi?id=206283


This patch (of 4):

Add debug code to seq_read() to detect missed in-tree or out-of-tree
incorrect seq_file .next functions.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Link: http://lkml.kernel.org/r/244674e5-760c-86bd-d08a-047042881748@virtuozzo.com
Link: http://lkml.kernel.org/r/7c24087c-e280-e580-5b0c-0cdaeb14cd18@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/seq_file.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

--- a/fs/seq_file.c~seq_read-info-message-about-buggy-next-functions
+++ a/fs/seq_file.c
@@ -256,9 +256,12 @@ Fill:
 		loff_t pos = m->index;
 
 		p = m->op->next(m, p, &m->index);
-		if (pos == m->index)
-			/* Buggy ->next function */
+		if (pos == m->index) {
+			pr_info("buggy seq_file .next function %ps "
+				"did not update position index\n",
+				m->op->next);
 			m->index++;
+		}
 		if (!p || IS_ERR(p)) {
 			err = PTR_ERR(p);
 			break;
_

Patches currently in -mm which might be from vvs@virtuozzo.com are

seq_read-info-message-about-buggy-next-functions.patch
pstore_ftrace_seq_next-should-increase-position-index.patch
gcov_seq_next-should-increase-position-index.patch
sysvipc_find_ipc-should-increase-position-index.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + pstore_ftrace_seq_next-should-increase-position-index.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (226 preceding siblings ...)
  2020-02-26  3:56 ` + seq_read-info-message-about-buggy-next-functions.patch " Andrew Morton
@ 2020-02-26  3:56 ` Andrew Morton
       [not found]   ` <07f968e6-02cd-de2a-e868-787e4bedd346@virtuozzo.com>
  2020-02-26  3:56 ` + gcov_seq_next-should-increase-position-index.patch " Andrew Morton
                   ` (10 subsequent siblings)
  238 siblings, 1 reply; 244+ messages in thread
From: Andrew Morton @ 2020-02-26  3:56 UTC (permalink / raw)
  To: dave, longman, manfred, mingo, mm-commits, neilb, oberpar,
	rostedt, viro, vvs


The patch titled
     Subject: fs/pstore: pstore_ftrace_seq_next() should increase position index
has been added to the -mm tree.  Its filename is
     pstore_ftrace_seq_next-should-increase-position-index.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/pstore_ftrace_seq_next-should-increase-position-index.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/pstore_ftrace_seq_next-should-increase-position-index.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Vasily Averin <vvs@virtuozzo.com>
Subject: fs/pstore: pstore_ftrace_seq_next() should increase position index

If a seq_file .next function does not change the position index, a read
after an lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Link: http://lkml.kernel.org/r/51376af5-e0f2-0ff2-d664-e932153b0665@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/pstore/inode.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/pstore/inode.c~pstore_ftrace_seq_next-should-increase-position-index
+++ a/fs/pstore/inode.c
@@ -87,11 +87,11 @@ static void *pstore_ftrace_seq_next(stru
 	struct pstore_private *ps = s->private;
 	struct pstore_ftrace_seq_data *data = v;
 
+	(*pos)++;
 	data->off += REC_SIZE;
 	if (data->off + REC_SIZE > ps->total_size)
 		return NULL;
 
-	(*pos)++;
 	return data;
 }
 
_

Patches currently in -mm which might be from vvs@virtuozzo.com are

seq_read-info-message-about-buggy-next-functions.patch
pstore_ftrace_seq_next-should-increase-position-index.patch
gcov_seq_next-should-increase-position-index.patch
sysvipc_find_ipc-should-increase-position-index.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + gcov_seq_next-should-increase-position-index.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (227 preceding siblings ...)
  2020-02-26  3:56 ` + pstore_ftrace_seq_next-should-increase-position-index.patch " Andrew Morton
@ 2020-02-26  3:56 ` Andrew Morton
  2020-02-26  3:56 ` + sysvipc_find_ipc-should-increase-position-index.patch " Andrew Morton
                   ` (9 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-26  3:56 UTC (permalink / raw)
  To: dave, longman, manfred, mingo, mm-commits, neilb, oberpar,
	rostedt, viro, vvs


The patch titled
     Subject:  kernel/gcov/fs.c: gcov_seq_next() should increase position index
has been added to the -mm tree.  Its filename is
     gcov_seq_next-should-increase-position-index.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/gcov_seq_next-should-increase-position-index.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/gcov_seq_next-should-increase-position-index.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Vasily Averin <vvs@virtuozzo.com>
Subject:  kernel/gcov/fs.c: gcov_seq_next() should increase position index

If a seq_file .next function does not change the position index, a read
after an lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Link: http://lkml.kernel.org/r/f65c6ee7-bd00-f910-2f8a-37cc67e4ff88@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/gcov/fs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/gcov/fs.c~gcov_seq_next-should-increase-position-index
+++ a/kernel/gcov/fs.c
@@ -108,9 +108,9 @@ static void *gcov_seq_next(struct seq_fi
 {
 	struct gcov_iterator *iter = data;
 
+	(*pos)++;
 	if (gcov_iter_next(iter))
 		return NULL;
-	(*pos)++;
 
 	return iter;
 }
_

Patches currently in -mm which might be from vvs@virtuozzo.com are

seq_read-info-message-about-buggy-next-functions.patch
pstore_ftrace_seq_next-should-increase-position-index.patch
gcov_seq_next-should-increase-position-index.patch
sysvipc_find_ipc-should-increase-position-index.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + sysvipc_find_ipc-should-increase-position-index.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (228 preceding siblings ...)
  2020-02-26  3:56 ` + gcov_seq_next-should-increase-position-index.patch " Andrew Morton
@ 2020-02-26  3:56 ` Andrew Morton
  2020-02-27  1:19 ` + mm-bring-sparc-pte_index-semantics-inline-with-other-platforms.patch " Andrew Morton
                   ` (8 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-26  3:56 UTC (permalink / raw)
  To: dave, longman, manfred, mingo, mm-commits, neilb, oberpar,
	rostedt, viro, vvs


The patch titled
     Subject: ipc/util.c: sysvipc_find_ipc() should increase position index
has been added to the -mm tree.  Its filename is
     sysvipc_find_ipc-should-increase-position-index.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/sysvipc_find_ipc-should-increase-position-index.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/sysvipc_find_ipc-should-increase-position-index.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Vasily Averin <vvs@virtuozzo.com>
Subject: ipc/util.c: sysvipc_find_ipc() should increase position index

If a seq_file .next function does not change the position index, a read
after an lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Link: http://lkml.kernel.org/r/b7a20945-e315-8bb0-21e6-3875c14a8494@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/util.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/ipc/util.c~sysvipc_find_ipc-should-increase-position-index
+++ a/ipc/util.c
@@ -764,13 +764,13 @@ static struct kern_ipc_perm *sysvipc_fin
 			total++;
 	}
 
+	*new_pos = pos + 1;
 	if (total >= ids->in_use)
 		return NULL;
 
 	for (; pos < ipc_mni; pos++) {
 		ipc = idr_find(&ids->ipcs_idr, pos);
 		if (ipc != NULL) {
-			*new_pos = pos + 1;
 			rcu_read_lock();
 			ipc_lock_object(ipc);
 			return ipc;
_

Patches currently in -mm which might be from vvs@virtuozzo.com are

seq_read-info-message-about-buggy-next-functions.patch
pstore_ftrace_seq_next-should-increase-position-index.patch
gcov_seq_next-should-increase-position-index.patch
sysvipc_find_ipc-should-increase-position-index.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-bring-sparc-pte_index-semantics-inline-with-other-platforms.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (229 preceding siblings ...)
  2020-02-26  3:56 ` + sysvipc_find_ipc-should-increase-position-index.patch " Andrew Morton
@ 2020-02-27  1:19 ` Andrew Morton
  2020-02-27  1:37 ` + mm-vmscan-fix-data-races-at-kswapd_classzone_idx.patch " Andrew Morton
                   ` (7 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  1:19 UTC (permalink / raw)
  To: arjunroy, davem, edumazet, mm-commits, sfr, soheil, willy


The patch titled
     Subject: mm: bring sparc pte_index() semantics inline with other platforms
has been added to the -mm tree.  Its filename is
     mm-bring-sparc-pte_index-semantics-inline-with-other-platforms.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-bring-sparc-pte_index-semantics-inline-with-other-platforms.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-bring-sparc-pte_index-semantics-inline-with-other-platforms.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Arjun Roy <arjunroy@google.com>
Subject: mm: bring sparc pte_index() semantics inline with other platforms

pte_index() on platforms other than sparc returns a numerical index.  On
sparc, it returns a pte_t*.  This presents an issue for vm_insert_pages(),
which relies on pte_index() to find the offset of a pte within a pmd for
batched inserts.

This patch:
1. Modifies pte_index() for sparc to return a numerical index, like
   other platforms,
2. Defines pte_entry() for sparc which returns a pte_t*
   (as pte_index() used to),
3. Converts existing sparc callers for pte_index() to use pte_entry().
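
To illustrate why a numerical index matters for batching (an illustrative
fragment only, not taken from the patch):

	/* with a numerical pte_index(), a batched insert can size its run */
	unsigned int idx = pte_index(addr);
	unsigned int remaining = PTRS_PER_PTE - idx; /* ptes left in this pmd */

With sparc returning a pte_t* instead, this arithmetic could not be shared
across platforms.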

[sfr@canb.auug.org.au: remove pte_entry and just directly modified pte_offset_kernel instead]
Link: http://lkml.kernel.org/r/20200227105045.6b421d9f@canb.auug.org.au
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/sparc/include/asm/pgtable_64.h |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/arch/sparc/include/asm/pgtable_64.h~mm-bring-sparc-pte_index-semantics-inline-with-other-platforms
+++ a/arch/sparc/include/asm/pgtable_64.h
@@ -907,11 +907,11 @@ static inline unsigned long pud_pfn(pud_
 	 (((address) >> PMD_SHIFT) & (PTRS_PER_PMD-1)))
 
 /* Find an entry in the third-level page table.. */
-#define pte_index(dir, address)	\
-	((pte_t *) __pmd_page(*(dir)) + \
-	 ((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)))
-#define pte_offset_kernel		pte_index
-#define pte_offset_map			pte_index
+#define pte_index(address)			\
+	 ((address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
+#define pte_offset_kernel(dir, address)	\
+	((pte_t *) __pmd_page(*(dir)) + pte_index(address))
+#define pte_offset_map(dir, address)	pte_offset_kernel((dir), (address))
 #define pte_unmap(pte)			do { } while (0)
 
 /* We cannot include <linux/mm_types.h> at this point yet: */
_

Patches currently in -mm which might be from arjunroy@google.com are

mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch
mm-bring-sparc-pte_index-semantics-inline-with-other-platforms.patch
mm-add-vm_insert_pages.patch
mm-add-vm_insert_pages-2.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-vmscan-fix-data-races-at-kswapd_classzone_idx.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (230 preceding siblings ...)
  2020-02-27  1:19 ` + mm-bring-sparc-pte_index-semantics-inline-with-other-platforms.patch " Andrew Morton
@ 2020-02-27  1:37 ` Andrew Morton
  2020-02-27  1:49 ` + fs-epoll-make-nesting-accounting-safe-for-rt-kernel.patch " Andrew Morton
                   ` (6 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  1:37 UTC (permalink / raw)
  To: cai, elver, mm-commits, willy


The patch titled
     Subject: mm/vmscan.c: fix data races using kswapd_classzone_idx
has been added to the -mm tree.  Its filename is
     mm-vmscan-fix-data-races-at-kswapd_classzone_idx.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vmscan-fix-data-races-at-kswapd_classzone_idx.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmscan-fix-data-races-at-kswapd_classzone_idx.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Qian Cai <cai@lca.pw>
Subject: mm/vmscan.c: fix data races using kswapd_classzone_idx

pgdat->kswapd_classzone_idx could be accessed concurrently in
wakeup_kswapd().  Plain writes and reads without any lock protection
result in data races.  Fix them by adding a pair of READ|WRITE_ONCE() as
well as saving a branch (compilers might well optimize the original code
in an unintentional way anyway).  While at it, also take care of
pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim().  The
data races were reported by KCSAN,

 BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd

 write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13:
  wakeup_kswapd+0xf1/0x400
  wakeup_kswapd at mm/vmscan.c:3967
  wake_all_kswapds+0x59/0xc0
  wake_all_kswapds at mm/page_alloc.c:4241
  __alloc_pages_slowpath+0xdcc/0x1290
  __alloc_pages_slowpath at mm/page_alloc.c:4512
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x16e/0x6f0
  __handle_mm_fault+0xcd5/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 1 lock held by mtest01/7454:
  #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at:
 do_page_fault+0x143/0x6f9
 do_user_addr_fault at arch/x86/mm/fault.c:1405
 (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
 irq event stamp: 6944085
 count_memcg_event_mm+0x1a6/0x270
 count_memcg_event_mm+0x119/0x270
 __do_softirq+0x34c/0x57c
 irq_exit+0xa2/0xc0

 read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38:
  wakeup_kswapd+0xc8/0x400
  wake_all_kswapds+0x59/0xc0
  __alloc_pages_slowpath+0xdcc/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x16e/0x6f0
  __handle_mm_fault+0xcd5/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 1 lock held by mtest01/7472:
  #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at:
 do_page_fault+0x143/0x6f9
 irq event stamp: 6793561
 count_memcg_event_mm+0x1a6/0x270
 count_memcg_event_mm+0x119/0x270
 __do_softirq+0x34c/0x57c
 irq_exit+0xa2/0xc0

 BUG: KCSAN: data-race in kswapd / wakeup_kswapd

 write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6:
  kswapd+0x27c/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0:
  wakeup_kswapd+0xf3/0x450
  wake_all_kswapds+0x59/0xc0
  __alloc_pages_slowpath+0xdcc/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x170/0x700
  __handle_mm_fault+0xc9f/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40
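
In outline, the fix turns plain concurrent accesses into marked ones,
e.g. (condensed from the hunks below):

	/* before: plain, racy read-modify-write */
	pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx,
					  classzone_idx);

	/* after: one marked load, one marked store */
	curr_idx = READ_ONCE(pgdat->kswapd_classzone_idx);
	if (curr_idx == MAX_NR_ZONES || curr_idx < classzone_idx)
		WRITE_ONCE(pgdat->kswapd_classzone_idx, classzone_idx);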

Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |   45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-fix-data-races-at-kswapd_classzone_idx
+++ a/mm/vmscan.c
@@ -3136,8 +3136,9 @@ static bool allow_direct_reclaim(pg_data
 
 	/* kswapd must be awake if processes are being throttled */
 	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-		pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
-						(enum zone_type)ZONE_NORMAL);
+		if (READ_ONCE(pgdat->kswapd_classzone_idx) > ZONE_NORMAL)
+			WRITE_ONCE(pgdat->kswapd_classzone_idx, ZONE_NORMAL);
+
 		wake_up_interruptible(&pgdat->kswapd_wait);
 	}
 
@@ -3769,9 +3770,9 @@ out:
 static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
 					   enum zone_type prev_classzone_idx)
 {
-	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
-		return prev_classzone_idx;
-	return pgdat->kswapd_classzone_idx;
+	enum zone_type curr_idx = READ_ONCE(pgdat->kswapd_classzone_idx);
+
+	return curr_idx == MAX_NR_ZONES ? prev_classzone_idx : curr_idx;
 }
 
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
@@ -3815,8 +3816,11 @@ static void kswapd_try_to_sleep(pg_data_
 		 * the previous request that slept prematurely.
 		 */
 		if (remaining) {
-			pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
-			pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
+			WRITE_ONCE(pgdat->kswapd_classzone_idx,
+				   kswapd_classzone_idx(pgdat, classzone_idx));
+
+			if (READ_ONCE(pgdat->kswapd_order) < reclaim_order)
+				WRITE_ONCE(pgdat->kswapd_order, reclaim_order);
 		}
 
 		finish_wait(&pgdat->kswapd_wait, &wait);
@@ -3893,12 +3897,12 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	pgdat->kswapd_order = 0;
-	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
+	WRITE_ONCE(pgdat->kswapd_order, 0);
+	WRITE_ONCE(pgdat->kswapd_classzone_idx, MAX_NR_ZONES);
 	for ( ; ; ) {
 		bool ret;
 
-		alloc_order = reclaim_order = pgdat->kswapd_order;
+		alloc_order = reclaim_order = READ_ONCE(pgdat->kswapd_order);
 		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
 
 kswapd_try_sleep:
@@ -3906,10 +3910,10 @@ kswapd_try_sleep:
 					classzone_idx);
 
 		/* Read the new order and classzone_idx */
-		alloc_order = reclaim_order = pgdat->kswapd_order;
+		alloc_order = reclaim_order = READ_ONCE(pgdat->kswapd_order);
 		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
-		pgdat->kswapd_order = 0;
-		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
+		WRITE_ONCE(pgdat->kswapd_order, 0);
+		WRITE_ONCE(pgdat->kswapd_classzone_idx, MAX_NR_ZONES);
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -3953,20 +3957,23 @@ void wakeup_kswapd(struct zone *zone, gf
 		   enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
+	enum zone_type curr_idx;
 
 	if (!managed_zone(zone))
 		return;
 
 	if (!cpuset_zone_allowed(zone, gfp_flags))
 		return;
+
 	pgdat = zone->zone_pgdat;
+	curr_idx = READ_ONCE(pgdat->kswapd_classzone_idx);
+
+	if (curr_idx == MAX_NR_ZONES || curr_idx < classzone_idx)
+		WRITE_ONCE(pgdat->kswapd_classzone_idx, classzone_idx);
+
+	if (READ_ONCE(pgdat->kswapd_order) < order)
+		WRITE_ONCE(pgdat->kswapd_order, order);
 
-	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
-		pgdat->kswapd_classzone_idx = classzone_idx;
-	else
-		pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx,
-						  classzone_idx);
-	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
_

Patches currently in -mm which might be from cai@lca.pw are

mm-swapfile-fix-data-races-in-try_to_unuse.patch
mm-vmscan-fix-data-races-at-kswapd_classzone_idx.patch
percpu_counter-fix-a-data-race-at-vm_committed_as.patch
mm-frontswap-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races.patch
mm-page_io-mark-various-intentional-data-races-v2.patch
mm-swap_state-mark-various-intentional-data-races.patch
mm-kmemleak-annotate-various-data-races-obj-ptr.patch
mm-swapfile-fix-and-annotate-various-data-races.patch
mm-swapfile-fix-and-annotate-various-data-races-v2.patch
mm-page_counter-fix-various-data-races-at-memsw.patch
mm-memcontrol-fix-a-data-race-in-scan-count.patch
mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch
mm-mempool-fix-a-data-race-in-mempool_free.patch
mm-util-annotate-an-data-race-at-vm_committed_as.patch
mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch
mm-annotate-a-data-race-in-page_zonenum.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + fs-epoll-make-nesting-accounting-safe-for-rt-kernel.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (231 preceding siblings ...)
  2020-02-27  1:37 ` + mm-vmscan-fix-data-races-at-kswapd_classzone_idx.patch " Andrew Morton
@ 2020-02-27  1:49 ` Andrew Morton
  2020-02-27  3:50 ` + lib-optimize-cpumask_local_spread.patch " Andrew Morton
                   ` (5 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  1:49 UTC (permalink / raw)
  To: dbueso, jbaron, mm-commits, normalperson, rpenyaev, viro


The patch titled
     Subject: fs/epoll: make nesting accounting safe for -rt kernel
has been added to the -mm tree.  Its filename is
     fs-epoll-make-nesting-accounting-safe-for-rt-kernel.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/fs-epoll-make-nesting-accounting-safe-for-rt-kernel.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/fs-epoll-make-nesting-accounting-safe-for-rt-kernel.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Jason Baron <jbaron@akamai.com>
Subject: fs/epoll: make nesting accounting safe for -rt kernel

Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set
ep_poll_safewake() can take several non-raw spinlocks after disabling
interrupts.  Since a spinlock can block in the -rt kernel, we can't take a
spinlock after disabling interrupts.  So let's re-work how we determine
the nesting level such that it plays nicely with the -rt kernel.

Let's introduce a 'nests' field in struct eventpoll that records the
current nesting level during ep_poll_callback().  Then, if we nest again
we can find the previous struct eventpoll that we were called from and
increase our count by 1.  The 'nests' field is protected by
ep->poll_wait.lock.

I've also moved the visited field to reduce the size of struct eventpoll
from 184 bytes to 176 bytes on x86_64 for !CONFIG_DEBUG_LOCK_ALLOC, which
is typical for a production config.
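
As a concrete (hypothetical) picture: if epoll fd1 watches epoll fd2,
which watches a socket, then a wakeup on the socket enters
ep_poll_callback() for fd2.  Since the wakeup source is not an epoll file,
fd2's wait queue lock is taken at subclass 1 and fd2->nests is recorded as
2.  The chained wakeup into fd1 then sees an epoll file as its source,
reads fd2->nests (stable, because fd2 is still holding its own
poll_wait.lock) and locks at subclass 2, so each level of the chain gets a
distinct lockdep subclass without any per-cpu state.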

Link: http://lkml.kernel.org/r/1582739816-13167-1-git-send-email-jbaron@akamai.com
Reported-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Roman Penyaev <rpenyaev@suse.de>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/eventpoll.c |   64 +++++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 21 deletions(-)

--- a/fs/eventpoll.c~fs-epoll-make-nesting-accounting-safe-for-rt-kernel
+++ a/fs/eventpoll.c
@@ -218,13 +218,18 @@ struct eventpoll {
 	struct file *file;
 
 	/* used to optimize loop detection check */
-	int visited;
 	struct list_head visited_list_link;
+	int visited;
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	/* used to track busy poll napi_id */
 	unsigned int napi_id;
 #endif
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	/* tracks wakeup nests for lockdep validation */
+	u8 nests;
+#endif
 };
 
 /* Wait structure used by the poll hooks */
@@ -545,30 +550,47 @@ out_unlock:
  */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 
-static DEFINE_PER_CPU(int, wakeup_nest);
-
-static void ep_poll_safewake(wait_queue_head_t *wq)
+static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi)
 {
+	struct eventpoll *ep_src;
 	unsigned long flags;
-	int subclass;
+	u8 nests = 0;
 
-	local_irq_save(flags);
-	preempt_disable();
-	subclass = __this_cpu_read(wakeup_nest);
-	spin_lock_nested(&wq->lock, subclass + 1);
-	__this_cpu_inc(wakeup_nest);
-	wake_up_locked_poll(wq, POLLIN);
-	__this_cpu_dec(wakeup_nest);
-	spin_unlock(&wq->lock);
-	local_irq_restore(flags);
-	preempt_enable();
+	/*
+	 * To set the subclass or nesting level for spin_lock_irqsave_nested()
+	 * it might be natural to create a per-cpu nest count. However, since
+	 * we can recurse on ep->poll_wait.lock, and a non-raw spinlock can
+	 * schedule() in the -rt kernel, the per-cpu variables are no longer
+	 * protected. Thus, we are introducing a per eventpoll nest field.
+	 * If we are not being called from ep_poll_callback(), epi is NULL and
+	 * we are at the first level of nesting, 0. Otherwise, we are being
+	 * called from ep_poll_callback() and if a previous wakeup source is
+	 * not an epoll file itself, we are at depth 1 since the wakeup source
+	 * is depth 0. If the wakeup source is a previous epoll file in the
+	 * wakeup chain then we use its nests value and record ours as
+	 * nests + 1. The previous epoll file nests value is stable since it is
+	 * already holding its own poll_wait.lock.
+	 */
+	if (epi) {
+		if ((is_file_epoll(epi->ffd.file))) {
+			ep_src = epi->ffd.file->private_data;
+			nests = ep_src->nests;
+		} else {
+			nests = 1;
+		}
+	}
+	spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests);
+	ep->nests = nests + 1;
+	wake_up_locked_poll(&ep->poll_wait, EPOLLIN);
+	ep->nests = 0;
+	spin_unlock_irqrestore(&ep->poll_wait.lock, flags);
 }
 
 #else
 
-static void ep_poll_safewake(wait_queue_head_t *wq)
+static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi)
 {
-	wake_up_poll(wq, EPOLLIN);
+	wake_up_poll(&ep->poll_wait, EPOLLIN);
 }
 
 #endif
@@ -789,7 +811,7 @@ static void ep_free(struct eventpoll *ep
 
 	/* We need to release all tasks waiting for these file */
 	if (waitqueue_active(&ep->poll_wait))
-		ep_poll_safewake(&ep->poll_wait);
+		ep_poll_safewake(ep, NULL);
 
 	/*
 	 * We need to lock this because we could be hit by
@@ -1258,7 +1280,7 @@ out_unlock:
 
 	/* We have to call this outside the lock */
 	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
+		ep_poll_safewake(ep, epi);
 
 	if (!(epi->event.events & EPOLLEXCLUSIVE))
 		ewake = 1;
@@ -1562,7 +1584,7 @@ static int ep_insert(struct eventpoll *e
 
 	/* We have to call this outside the lock */
 	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
+		ep_poll_safewake(ep, NULL);
 
 	return 0;
 
@@ -1666,7 +1688,7 @@ static int ep_modify(struct eventpoll *e
 
 	/* We have to call this outside the lock */
 	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
+		ep_poll_safewake(ep, NULL);
 
 	return 0;
 }
_

Patches currently in -mm which might be from jbaron@akamai.com are

fs-epoll-make-nesting-accounting-safe-for-rt-kernel.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + lib-optimize-cpumask_local_spread.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (232 preceding siblings ...)
  2020-02-27  1:49 ` + fs-epoll-make-nesting-accounting-safe-for-rt-kernel.patch " Andrew Morton
@ 2020-02-27  3:50 ` Andrew Morton
  2020-02-27  4:04 ` + mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch " Andrew Morton
                   ` (4 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  3:50 UTC (permalink / raw)
  To: anshuman.khandual, jgross, jinyuqi, mgorman, mhocko, mingo,
	mm-commits, mpe, paul.burton, peterz, rppt, rusty, zhangshaokun


The patch titled
     Subject: lib: optimize cpumask_local_spread()
has been added to the -mm tree.  Its filename is
     lib-optimize-cpumask_local_spread.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/lib-optimize-cpumask_local_spread.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/lib-optimize-cpumask_local_spread.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: yuqi jin <jinyuqi@huawei.com>
Subject: lib: optimize cpumask_local_spread()

In a multi-processor NUMA system, an I/O driver must find the CPU cores to
which its IRQs shall be bound.  When the CPU cores in the local NUMA node
have all been used, it is better for performance to find the node closest
to the local node, instead of choosing any online CPU immediately.

On a Huawei Kunpeng 920 server, there are 4 NUMA nodes (0-3) in the 2-CPU
system (0-1).  The topology of this server is as follows:
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 63379 MB
node 0 free: 61899 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64509 MB
node 1 free: 63942 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 64509 MB
node 2 free: 63056 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 63997 MB
node 3 free: 63420 MB
node distances:
node   0   1   2   3
  0:  10  16  32  33
  1:  16  10  25  32
  2:  32  25  10  16
  3:  33  32  16  10

We ran a PS (parameter server) business test.  The behavior of the
service is that the client initiates a request through the network card,
and the server responds to the request after calculation.  When two PS
processes run on node2 and node3 separately and the network card is
located on node2, which belongs to cpu1, the performance of node2 (26W,
i.e. 260,000 QPS) and node3 (22W, i.e. 220,000 QPS) differs.

It would be better for the NIC queues to be bound to the cpu1 cores in
turn, so that XPS is also properly initialized; but cpumask_local_spread()
only considers the local node, and when the number of NIC queues exceeds
the number of cores in the local node, it falls back to any online core
directly.  So when the PS process on node3 sends a calculated request, the
performance is not as good as node2's.

With this patch, IRQs 369-392 will be bound to NUMA node3 instead of NUMA
node0.  Before the patch:

Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
0
Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
1
...
Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
22
Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
23
After the patch:
Euler:/sys/bus/pci # cat /proc/irq/369/smp_affinity_list
72
Euler:/sys/bus/pci # cat /proc/irq/370/smp_affinity_list
73
...
Euler:/sys/bus/pci # cat /proc/irq/391/smp_affinity_list
94
Euler:/sys/bus/pci # cat /proc/irq/392/smp_affinity_list
95

So with the patch, the performance of node3 is the same as node2's, i.e.
26W (260,000) QPS, while the network card is still on node2.

Since the NIC and other I/O devices must initialize their interrupt
bindings, it is reasonable, once the cores of the local node are used up,
to fall back to the node closest to it.  Let's optimize
cpumask_local_spread() and find the nearest node through the NUMA distance
for the non-local NUMA nodes.
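
For callers nothing changes.  A typical driver spreading its queue IRQs
still does something like the following (an illustrative fragment; the
nqueues/irqs[] setup is hypothetical):

	for (i = 0; i < nqueues; i++) {
		unsigned int cpu = cpumask_local_spread(i, dev_to_node(dev));

		irq_set_affinity_hint(irqs[i], cpumask_of(cpu));
	}

With this patch, once the local node's cores are exhausted, the i'th CPU
comes from the nearest remaining node rather than from an arbitrary online
one.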

Link: http://lkml.kernel.org/r/1582768688-2314-1-git-send-email-zhangshaokun@hisilicon.com
Signed-off-by: yuqi jin <jinyuqi@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Juergen Gross <jgross@suse.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/cpumask.c |  102 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 90 insertions(+), 12 deletions(-)

--- a/lib/cpumask.c~lib-optimize-cpumask_local_spread
+++ a/lib/cpumask.c
@@ -6,6 +6,7 @@
 #include <linux/export.h>
 #include <linux/memblock.h>
 #include <linux/numa.h>
+#include <linux/spinlock.h>
 
 /**
  * cpumask_next - get the next cpu in a cpumask
@@ -192,18 +193,39 @@ void __init free_bootmem_cpumask_var(cpu
 }
 #endif
 
-/**
- * cpumask_local_spread - select the i'th cpu with local numa cpu's first
- * @i: index number
- * @node: local numa_node
- *
- * This function selects an online CPU according to a numa aware policy;
- * local cpus are returned first, followed by non-local ones, then it
- * wraps around.
- *
- * It's not very efficient, but useful for setup.
- */
-unsigned int cpumask_local_spread(unsigned int i, int node)
+static void calc_node_distance(int *node_dist, int node)
+{
+	int i;
+
+	for (i = 0; i < nr_node_ids; i++)
+		node_dist[i] = node_distance(node, i);
+}
+
+static int find_nearest_node(int *node_dist, bool *used)
+{
+	int i, min_dist = node_dist[0], node_id = -1;
+
+	/* Choose the first unused node to compare */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (used[i] == 0) {
+			min_dist = node_dist[i];
+			node_id = i;
+			break;
+		}
+	}
+
+	/* Compare and return the nearest node */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (node_dist[i] < min_dist && used[i] == 0) {
+			min_dist = node_dist[i];
+			node_id = i;
+		}
+	}
+
+	return node_id;
+}
+
+static unsigned int __cpumask_local_spread(unsigned int i, int node)
 {
 	int cpu;
 
@@ -231,4 +253,60 @@ unsigned int cpumask_local_spread(unsign
 	}
 	BUG();
 }
+
+/**
+ * cpumask_local_spread - select the i'th cpu with local numa cpu's first
+ * @i: index number
+ * @node: local numa_node
+ *
+ * This function selects an online CPU according to a numa aware policy;
+ * local cpus are returned first, followed by the nearest non-local ones,
+ * then it wraps around.
+ *
+ * It's not very efficient, but useful for setup.
+ */
+unsigned int cpumask_local_spread(unsigned int i, int node)
+{
+	static DEFINE_SPINLOCK(spread_lock);
+	static int node_dist[MAX_NUMNODES];
+	static bool used[MAX_NUMNODES];
+	unsigned long flags;
+	int cpu, j, id;
+
+	/* Wrap: we always want a cpu. */
+	i %= num_online_cpus();
+
+	if (node == NUMA_NO_NODE) {
+		for_each_cpu(cpu, cpu_online_mask)
+			if (i-- == 0)
+				return cpu;
+	} else {
+		if (nr_node_ids > MAX_NUMNODES)
+			return __cpumask_local_spread(i, node);
+
+		spin_lock_irqsave(&spread_lock, flags);
+		memset(used, 0, nr_node_ids * sizeof(bool));
+		calc_node_distance(node_dist, node);
+		for (j = 0; j < nr_node_ids; j++) {
+			id = find_nearest_node(node_dist, used);
+			if (id < 0)
+				break;
+
+			for_each_cpu_and(cpu, cpumask_of_node(id),
+					 cpu_online_mask)
+				if (i-- == 0) {
+					spin_unlock_irqrestore(&spread_lock,
+							       flags);
+					return cpu;
+				}
+			used[id] = 1;
+		}
+		spin_unlock_irqrestore(&spread_lock, flags);
+
+		for_each_cpu(cpu, cpu_online_mask)
+			if (i-- == 0)
+				return cpu;
+	}
+	BUG();
+}
 EXPORT_SYMBOL(cpumask_local_spread);
_

Patches currently in -mm which might be from jinyuqi@huawei.com are

lib-optimize-cpumask_local_spread.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (233 preceding siblings ...)
  2020-02-27  3:50 ` + lib-optimize-cpumask_local_spread.patch " Andrew Morton
@ 2020-02-27  4:04 ` Andrew Morton
  2020-02-27  4:11 ` + gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
                   ` (3 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  4:04 UTC (permalink / raw)
  To: akpm, anshuman.khandual, cai, christophe.leroy, james.morse, mm-commits


The patch titled
     Subject: mm-debug-add-tests-validating-architecture-page-table-helpers-fix
has been added to the -mm tree.  Its filename is
     mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-debug-add-tests-validating-architecture-page-table-helpers-fix

A warning gets exposed with DEBUG_VIRTUAL due to __pa() being used on a
kernel symbol, i.e. 'start_kernel', which might be outside the linear map.
This happens because of kernel mapping position randomization with KASLR.

__pa_symbol() should have been used instead for accessing the physical
address here.  On arm64, __pa() does check for a linear address with
__is_lm_address() and switches accordingly if it is a kernel text symbol.
Nevertheless, it is much better to use __pa_symbol() here rather than
__pa().

Reported-by: Qian Cai <cai@lca.pw>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug_vm_pgtable.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/debug_vm_pgtable.c~mm-debug-add-tests-validating-architecture-page-table-helpers-fix
+++ a/mm/debug_vm_pgtable.c
@@ -331,7 +331,7 @@ void __init debug_vm_pgtable(void)
 	 * helps avoid large memory block allocations to be used for mapping
 	 * at higher page table levels.
 	 */
-	paddr = __pa(&start_kernel);
+	paddr = __pa_symbol(&start_kernel);
 
 	pte_aligned = (paddr & PAGE_MASK) >> PAGE_SHIFT;
 	pmd_aligned = (paddr & PMD_MASK) >> PAGE_SHIFT;
_

Patches currently in -mm which might be from akpm@linux-foundation.org are

mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa-fix.patch
mm.patch
mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch
selftest-add-mremap_dontunmap-selftest-fix.patch
selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings-fix.patch
hugetlb_cgroup-add-accounting-for-shared-mappings-fix.patch
mm-migratec-migrate-pg_readahead-flag-fix.patch
proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch
linux-next-rejects.patch
linux-next-fix.patch
mm-add-vm_insert_pages-fix.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
drivers-tty-serial-sh-scic-suppress-warning.patch
kernel-forkc-export-kernel_thread-to-modules.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (234 preceding siblings ...)
  2020-02-27  4:04 ` + mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch " Andrew Morton
@ 2020-02-27  4:11 ` Andrew Morton
  2020-02-27  4:42 ` + mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2.patch " Andrew Morton
                   ` (2 subsequent siblings)
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  4:11 UTC (permalink / raw)
  To: gustavo, mm-commits, oberpar


The patch titled
     Subject: gcov: gcc_4_7: replace zero-length array with flexible-array member
has been added to the -mm tree.  Its filename is
     gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Subject: gcov: gcc_4_7: replace zero-length array with flexible-array member

The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kinds of undefined-behavior bugs from being
inadvertently introduced[3] into the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by this
change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied.  As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
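
As a hedged illustration, not part of this patch: allocating such a
structure is done the same way either way, since the space for the
flexible array is added explicitly, for example with the kernel's
struct_size() helper.  Here 'n_counters' is a hypothetical element
count:

struct gcov_fn_info *info;

info = kzalloc(struct_size(info, ctrs, n_counters), GFP_KERNEL);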

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Link: http://lkml.kernel.org/r/20200213152241.GA877@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/gcov/gcc_4_7.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/gcov/gcc_4_7.c~gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member
+++ a/kernel/gcov/gcc_4_7.c
@@ -68,7 +68,7 @@ struct gcov_fn_info {
 	unsigned int ident;
 	unsigned int lineno_checksum;
 	unsigned int cfg_checksum;
-	struct gcov_ctr_info ctrs[0];
+	struct gcov_ctr_info ctrs[];
 };
 
 /**
_

Patches currently in -mm which might be from gustavo@embeddedor.com are

lib-bch-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch
lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch
gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: + dma-buf-free-dmabuf-name-in-dma_buf_release.patch added to -mm tree
       [not found]     ` <CAKMK7uGvixQ2xoQMt3pvt0OpNXDjDGTvSWsaAppsKrmO_EP3Kg@mail.gmail.com>
@ 2020-02-27  4:20       ` Andrew Morton
  0 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  4:20 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Sumit Semwal, mm-commits, Chenbo Feng, DRI mailing list,
	Greg Hackmann, WANG Cong, open list:DMA BUFFER SHARING FRAMEWORK

On Wed, 26 Feb 2020 10:36:26 +0100 Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Feb 26, 2020 at 5:29 AM Sumit Semwal <sumit.semwal@linaro.org> wrote:
> >
> > Hello Andrew,
> >
> >
> > On Wed, 26 Feb 2020 at 07:25, Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > >
> > > The patch titled
> > >      Subject: dma-buf: free dmabuf->name in dma_buf_release()
> > > has been added to the -mm tree.  Its filename is
> > >      dma-buf-free-dmabuf-name-in-dma_buf_release.patch
> >
> > Thanks for taking this patch via -mm during my absence (I'm just
> > returning from a bit of an illness). If there are other dma-buf
> > patches on your radar that you'd like to take via the mm tree, please
> > let me know and I can provide the necessary Acks.
> > Else I will take them in via drm-misc as usual.
> 
> I thought at least that for cases like these -mm is the last resort
> tree, so proper thing to do here is apply this fix to drm-misc-fixes
> and get it out there. -mm rebases, so will fall out again.

Yup, go ahead.  If a patch pops up in linux-next I'll autodrop my copy.

And please do give some thought to whether this should be cc:stable. 
If it's an unprivileged operation then hellyeah.

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: + pstore_ftrace_seq_next-should-increase-position-index.patch added to -mm tree
       [not found]   ` <07f968e6-02cd-de2a-e868-787e4bedd346@virtuozzo.com>
@ 2020-02-27  4:26     ` Andrew Morton
  0 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  4:26 UTC (permalink / raw)
  To: Vasily Averin
  Cc: dave, longman, manfred, mingo, mm-commits, neilb, oberpar, rostedt, viro

On Wed, 26 Feb 2020 09:14:12 +0300 Vasily Averin <vvs@virtuozzo.com> wrote:

> Dear Andrew,
> could you please drop this one, because I've resent it to the
> pstore maintainers.

No problems - if/when this appears in linux-next I'll drop my copy. 
Until then, it remains not lost!

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (235 preceding siblings ...)
  2020-02-27  4:11 ` + gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
@ 2020-02-27  4:42 ` Andrew Morton
  2020-02-27  4:42 ` [to-be-updated] mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch removed from " Andrew Morton
  2020-02-27  4:44 ` + huge-tmpfs-try-to-split_huge_page-when-punching-hole.patch added to " Andrew Morton
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  4:42 UTC (permalink / raw)
  To: anshuman.khandual, cai, christophe.leroy, james.morse, mm-commits


The patch titled
     Subject: mm/debug: replace __pa() with __pa_symbol() in debug_vm_pgtable()
has been added to the -mm tree.  Its filename is
     mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/debug: replace __pa() with __pa_symbol() in debug_vm_pgtable()

Replace __pa() with __pa_symbol() in debug_vm_pgtable() when accessing
the physical address of 'start_kernel', which is a kernel text symbol;
otherwise it might trigger the following warning on some platforms when
DEBUG_VIRTUAL is enabled.

[   23.123852] ------------[ cut here ]------------
[   23.124486] virt_to_phys used for non-linear address: (____ptrval____) (start_kernel+0x0/0x424)
[   23.125663] WARNING: CPU: 11 PID: 1 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x60/0x98
[   23.126877] Modules linked in:
[   23.127390] CPU: 11 PID: 1 Comm: swapper/0 Tainted: G        W 5.6.0-rc3-next-20200226-00001-g306225cc6ffd #163
[   23.129139] Hardware name: linux,dummy-virt (DT)
[   23.129898] pstate: 60400005 (nZCv daif +PAN -UAO)
[   23.130693] pc : __virt_to_phys+0x60/0x98
[   23.131359] lr : __virt_to_phys+0x60/0x98
[   23.132022] sp : ffff800011e6be10
[   23.132575] x29: ffff800011e6be10 x28: 0000000000000000
[   23.133447] x27: 0000000000000000 x26: 0000000000000000
[   23.134319] x25: 0000000000000000 x24: 0020000000000fd3
[   23.135197] x23: fffffe000bd27700 x22: ffff0002fcae8000
[   23.136069] x21: ffff800011a47000 x20: 0000000000000001
[   23.136941] x19: ffff800011350a14 x18: 0000000000000010
[   23.137815] x17: 000000006bb8910e x16: 00000000a1fdc699
[   23.138693] x15: ffffffffffffffff x14: 6c656e72656b5f74
[   23.139567] x13: 726174732820295f x12: 5f5f5f6c61767274
[   23.140441] x11: 705f5f5f5f28203a x10: 7373657264646120
[   23.141314] x9 : 7261656e696c2d6e x8 : ffff8000106b4e70
[   23.142189] x7 : 00000000000002a1 x6 : ffff800011a58bba
[   23.143067] x5 : 001fffffffffffff x4 : 0000000000000000
[   23.143939] x3 : 00000000ffffffff x2 : ffff800011881bf8
[   23.144810] x1 : 16ee3f9cb03efc00 x0 : 0000000000000000
[   23.145682] Call trace:
[   23.146097]  __virt_to_phys+0x60/0x98
[   23.146710]  debug_vm_pgtable+0xd0/0x440
[   23.147362]  kernel_init+0x18/0x100
[   23.147944]  ret_from_fork+0x10/0x18
[   23.148539] ---[ end trace fc4ccb3cb35ff225 ]---

Link: http://lkml.kernel.org/r/1582776031-30344-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: Qian Cai <cai@lca.pw>
Cc: James Morse <james.morse@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug_vm_pgtable.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/debug_vm_pgtable.c~mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2
+++ a/mm/debug_vm_pgtable.c
@@ -331,7 +331,7 @@ void __init debug_vm_pgtable(void)
 	 * helps avoid large memory block allocations to be used for mapping
 	 * at higher page table levels.
 	 */
-	paddr = __pa(&start_kernel);
+	paddr = __pa_symbol(&start_kernel);
 
 	pte_aligned = (paddr & PAGE_MASK) >> PAGE_SHIFT;
 	pmd_aligned = (paddr & PMD_MASK) >> PAGE_SHIFT;
_

Patches currently in -mm which might be from anshuman.khandual@arm.com are

mm-debug-add-tests-validating-architecture-page-table-helpers.patch
mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2.patch
mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch
mm-vma-make-vma_is_accessible-available-for-general-use.patch
mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch
mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch
mm-vma-append-unlikely-while-testing-vma-access-permissions.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* [to-be-updated] mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch removed from -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (236 preceding siblings ...)
  2020-02-27  4:42 ` + mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2.patch " Andrew Morton
@ 2020-02-27  4:42 ` Andrew Morton
  2020-02-27  4:44 ` + huge-tmpfs-try-to-split_huge_page-when-punching-hole.patch added to " Andrew Morton
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  4:42 UTC (permalink / raw)
  To: akpm, anshuman.khandual, cai, christophe.leroy, james.morse, mm-commits


The patch titled
     Subject: mm-debug-add-tests-validating-architecture-page-table-helpers-fix
has been removed from the -mm tree.  Its filename was
     mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch

This patch was dropped because an updated version will be merged

------------------------------------------------------
From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-debug-add-tests-validating-architecture-page-table-helpers-fix

A warning gets exposed with DEBUG_VIRTUAL due to __pa() on a kernel
symbol, i.e. 'start_kernel', which might be outside the linear map.
This happens due to kernel mapping position randomization with KASLR.

__pa_symbol() should have been used instead to access the physical
address here.  On arm64, __pa() does check for a linear address with
__is_lm_address() and switches accordingly if the address is a kernel
text symbol.  Nevertheless, it's much better to use __pa_symbol() here
than __pa().

Reported-by: Qian Cai <cai@lca.pw>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug_vm_pgtable.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/debug_vm_pgtable.c~mm-debug-add-tests-validating-architecture-page-table-helpers-fix
+++ a/mm/debug_vm_pgtable.c
@@ -331,7 +331,7 @@ void __init debug_vm_pgtable(void)
 	 * helps avoid large memory block allocations to be used for mapping
 	 * at higher page table levels.
 	 */
-	paddr = __pa(&start_kernel);
+	paddr = __pa_symbol(&start_kernel);
 
 	pte_aligned = (paddr & PAGE_MASK) >> PAGE_SHIFT;
 	pmd_aligned = (paddr & PMD_MASK) >> PAGE_SHIFT;
_

Patches currently in -mm which might be from akpm@linux-foundation.org are

mm-numa-fix-bad-pmd-by-atomically-check-for-pmd_trans_huge-when-marking-page-tables-prot_numa-fix.patch
mm.patch
selftest-add-mremap_dontunmap-selftest-fix.patch
selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch
hugetlb_cgroup-add-reservation-accounting-for-private-mappings-fix.patch
hugetlb_cgroup-add-accounting-for-shared-mappings-fix.patch
mm-migratec-migrate-pg_readahead-flag-fix.patch
proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch
linux-next-rejects.patch
linux-next-fix.patch
mm-add-vm_insert_pages-fix.patch
net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch
drivers-tty-serial-sh-scic-suppress-warning.patch
kernel-forkc-export-kernel_thread-to-modules.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* + huge-tmpfs-try-to-split_huge_page-when-punching-hole.patch added to -mm tree
  2020-02-04  1:33 incoming Andrew Morton
                   ` (237 preceding siblings ...)
  2020-02-27  4:42 ` [to-be-updated] mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch removed from " Andrew Morton
@ 2020-02-27  4:44 ` Andrew Morton
  238 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-02-27  4:44 UTC (permalink / raw)
  To: aarcange, alexander.duyck, david, hughd, kirill.shutemov,
	mm-commits, mst, willy, yang.shi


The patch titled
     Subject: mm: huge tmpfs: try to split_huge_page() when punching hole
has been added to the -mm tree.  Its filename is
     huge-tmpfs-try-to-split_huge_page-when-punching-hole.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/huge-tmpfs-try-to-split_huge_page-when-punching-hole.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/huge-tmpfs-try-to-split_huge_page-when-punching-hole.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Hugh Dickins <hughd@google.com>
Subject: mm: huge tmpfs: try to split_huge_page() when punching hole

Yang Shi writes:

Currently, when truncating a shmem file, if the range is partly in a THP
(start or end is in the middle of a THP), the pages will just be cleared
rather than freed, unless the range covers the whole THP.  Even after all
the subpages have been truncated (randomly or sequentially), the THP may
still be kept in the page cache.

This might be fine for some usecases which prefer preserving THP, but
balloon inflation is handled at base page size.  So when using shmem THP
as the memory backend, QEMU inflation doesn't work as expected since it
doesn't free memory.  But the inflation usecase really needs to get the
memory freed.  (Anonymous THP will also not get freed right away, but
will be freed eventually when all subpages are unmapped: whereas shmem
THP still stays in the page cache.)

Split the THP right away when doing a partial hole punch, and if the
split fails just clear the page so that reads of the punched area will
return zeroes.

Hugh Dickins adds:

Our earlier "team of pages" huge tmpfs implementation worked in the way
that Yang Shi proposes; and we have been using this patch to continue to
split the huge page when hole-punched or truncated, since converting over
to the compound page implementation.  Although huge tmpfs gives out huge
pages when available, if the user specifically asks to truncate or punch a
hole (perhaps to free memory, perhaps to reduce the memcg charge), then
the filesystem should do so as best it can, splitting the huge page.

That is not always possible: any additional reference to the huge page
prevents split_huge_page() from succeeding, so the result can be flaky. 
But in practice it works successfully enough that we've not seen any
problem from that.

Add shmem_punch_compound() to encapsulate the decision of when a split is
needed, and doing the split if so.  Using this simplifies the flow in
shmem_undo_range(); and the first (trylock) pass does not need to do any
page clearing on failure, because the second pass will either succeed or
do that clearing.  Following the example of zero_user_segment() when
clearing a partial page, add flush_dcache_page() and set_page_dirty() when
clearing a hole - though I'm not certain that either is needed.

But: split_huge_page() would be sure to fail if shmem_undo_range()'s
pagevec holds further references to the huge page.  The easiest way to fix
that is for find_get_entries() to return early, as soon as it has put one
compound head or tail into the pagevec.  At first this felt like a hack;
but on examination, this convention better suits all its callers - or will
do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
speedup by checking for compound pages there.
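
A hedged sketch, not from the patch, of the caller-side pattern this
convention enables: on finding a compound page among the returned
entries, the caller handles the whole THP once and skips its extent
before the next lookup.  'handle_whole_thp' is a hypothetical helper:

pvec.nr = find_get_entries(mapping, index, PAGEVEC_SIZE,
                           pvec.pages, indices);
for (i = 0; i < pvec.nr; i++) {
        struct page *page = pvec.pages[i];

        if (PageTransCompound(page)) {
                handle_whole_thp(compound_head(page)); /* hypothetical */
                /* skip the remainder of this THP's extent */
                index = round_up(indices[i] + 1, HPAGE_PMD_NR);
                break;
        }
        index = indices[i] + 1;
}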

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |   14 ++++++-
 mm/shmem.c   |   98 +++++++++++++++++++++----------------------------
 mm/swap.c    |    4 ++
 3 files changed, 60 insertions(+), 56 deletions(-)

--- a/mm/filemap.c~huge-tmpfs-try-to-split_huge_page-when-punching-hole
+++ a/mm/filemap.c
@@ -1697,6 +1697,11 @@ EXPORT_SYMBOL(pagecache_get_page);
  * Any shadow entries of evicted pages, or swap entries from
  * shmem/tmpfs, are included in the returned array.
  *
+ * If it finds a Transparent Huge Page, head or tail, find_get_entries()
+ * stops at that page: the caller is likely to have a better way to handle
+ * the compound page as a whole, and then skip its extent, than repeatedly
+ * calling find_get_entries() to return all its tails.
+ *
  * Return: the number of pages and shadow entries which were found.
  */
 unsigned find_get_entries(struct address_space *mapping,
@@ -1728,8 +1733,15 @@ unsigned find_get_entries(struct address
 		/* Has the page moved or been split? */
 		if (unlikely(page != xas_reload(&xas)))
 			goto put_page;
-		page = find_subpage(page, xas.xa_index);
 
+		/*
+		 * Terminate early on finding a THP, to allow the caller to
+		 * handle it all at once; but continue if this is hugetlbfs.
+		 */
+		if (PageTransHuge(page) && !PageHuge(page)) {
+			page = find_subpage(page, xas.xa_index);
+			nr_entries = ret + 1;
+		}
 export:
 		indices[ret] = xas.xa_index;
 		entries[ret] = page;
--- a/mm/shmem.c~huge-tmpfs-try-to-split_huge_page-when-punching-hole
+++ a/mm/shmem.c
@@ -789,6 +789,32 @@ void shmem_unlock_mapping(struct address
 }
 
 /*
+ * Check whether a hole-punch or truncation needs to split a huge page,
+ * returning true if no split was required, or the split has been successful.
+ *
+ * Eviction (or truncation to 0 size) should never need to split a huge page;
+ * but in rare cases might do so, if shmem_undo_range() failed to trylock on
+ * head, and then succeeded to trylock on tail.
+ *
+ * A split can only succeed when there are no additional references on the
+ * huge page: so the split below relies upon find_get_entries() having stopped
+ * when it found a subpage of the huge page, without getting further references.
+ */
+static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
+{
+	if (!PageTransCompound(page))
+		return true;
+
+	/* Just proceed to delete a huge page wholly within the range punched */
+	if (PageHead(page) &&
+	    page->index >= start && page->index + HPAGE_PMD_NR <= end)
+		return true;
+
+	/* Try to split huge page, so we can truly punch the hole or truncate */
+	return split_huge_page(page) >= 0;
+}
+
+/*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
  */
@@ -838,31 +864,11 @@ static void shmem_undo_range(struct inod
 			if (!trylock_page(page))
 				continue;
 
-			if (PageTransTail(page)) {
-				/* Middle of THP: zero out the page */
-				clear_highpage(page);
-				unlock_page(page);
-				continue;
-			} else if (PageTransHuge(page)) {
-				if (index == round_down(end, HPAGE_PMD_NR)) {
-					/*
-					 * Range ends in the middle of THP:
-					 * zero out the page
-					 */
-					clear_highpage(page);
-					unlock_page(page);
-					continue;
-				}
-				index += HPAGE_PMD_NR - 1;
-				i += HPAGE_PMD_NR - 1;
-			}
-
-			if (!unfalloc || !PageUptodate(page)) {
-				VM_BUG_ON_PAGE(PageTail(page), page);
-				if (page_mapping(page) == mapping) {
-					VM_BUG_ON_PAGE(PageWriteback(page), page);
+			if ((!unfalloc || !PageUptodate(page)) &&
+			    page_mapping(page) == mapping) {
+				VM_BUG_ON_PAGE(PageWriteback(page), page);
+				if (shmem_punch_compound(page, start, end))
 					truncate_inode_page(mapping, page);
-				}
 			}
 			unlock_page(page);
 		}
@@ -936,43 +942,25 @@ static void shmem_undo_range(struct inod
 
 			lock_page(page);
 
-			if (PageTransTail(page)) {
-				/* Middle of THP: zero out the page */
-				clear_highpage(page);
-				unlock_page(page);
-				/*
-				 * Partial thp truncate due 'start' in middle
-				 * of THP: don't need to look on these pages
-				 * again on !pvec.nr restart.
-				 */
-				if (index != round_down(end, HPAGE_PMD_NR))
-					start++;
-				continue;
-			} else if (PageTransHuge(page)) {
-				if (index == round_down(end, HPAGE_PMD_NR)) {
-					/*
-					 * Range ends in the middle of THP:
-					 * zero out the page
-					 */
-					clear_highpage(page);
-					unlock_page(page);
-					continue;
-				}
-				index += HPAGE_PMD_NR - 1;
-				i += HPAGE_PMD_NR - 1;
-			}
-
 			if (!unfalloc || !PageUptodate(page)) {
-				VM_BUG_ON_PAGE(PageTail(page), page);
-				if (page_mapping(page) == mapping) {
-					VM_BUG_ON_PAGE(PageWriteback(page), page);
-					truncate_inode_page(mapping, page);
-				} else {
+				if (page_mapping(page) != mapping) {
 					/* Page was replaced by swap: retry */
 					unlock_page(page);
 					index--;
 					break;
 				}
+				VM_BUG_ON_PAGE(PageWriteback(page), page);
+				if (shmem_punch_compound(page, start, end))
+					truncate_inode_page(mapping, page);
+				else {
+					/* Wipe the page and don't get stuck */
+					clear_highpage(page);
+					flush_dcache_page(page);
+					set_page_dirty(page);
+					if (index <
+					    round_up(start, HPAGE_PMD_NR))
+						start = index + 1;
+				}
 			}
 			unlock_page(page);
 		}
--- a/mm/swap.c~huge-tmpfs-try-to-split_huge_page-when-punching-hole
+++ a/mm/swap.c
@@ -1004,6 +1004,10 @@ void __pagevec_lru_add(struct pagevec *p
  * ascending indexes.  There may be holes in the indices due to
  * not-present entries.
  *
+ * Only one subpage of a Transparent Huge Page is returned in one call:
+ * allowing truncate_inode_pages_range() to evict the whole THP without
+ * cycling through a pagevec of extra references.
+ *
  * pagevec_lookup_entries() returns the number of entries which were
  * found.
  */
_

Patches currently in -mm which might be from hughd@google.com are

huge-tmpfs-try-to-split_huge_page-when-punching-hole.patch

^ permalink raw reply	[flat|nested] 244+ messages in thread

* Re: + seq_read-info-message-about-buggy-next-functions.patch added to -mm tree
       [not found]     ` <1583177508.7365.144.camel@lca.pw>
@ 2020-03-02 20:42       ` Andrew Morton
  0 siblings, 0 replies; 244+ messages in thread
From: Andrew Morton @ 2020-03-02 20:42 UTC (permalink / raw)
  To: Qian Cai
  Cc: linux-kernel, dave, longman, manfred, mingo, mm-commits, neilb,
	oberpar, rostedt, viro, vvs

On Mon, 02 Mar 2020 14:31:48 -0500 Qian Cai <cai@lca.pw> wrote:

> On Mon, 2020-03-02 at 13:20 -0500, Qian Cai wrote:
> > This patch spams the console like crazy while reading sysfs,
> > 
> > # dmesg | grep 'buggy seq_file' | wc -l
> > 4204
> > 
> > [ 9505.321981] LTP: starting read_all_proc (read_all -d /proc -q -r 10)
> > [ 9508.222934] buggy seq_file .next function xt_match_seq_next [x_tables] did
> > not updated position index
> > [ 9508.223319] buggy seq_file .next function xt_match_seq_next [x_tables] did
> > not updated position index
> > [ 9508.223654] buggy seq_file .next function xt_match_seq_next [x_tables] did
> > not updated position index
> > [ 9508.223994] buggy seq_file .next function xt_match_seq_next [x_tables] did
> > not updated position index
> > [ 9508.224337] buggy seq_file .next function xt_match_seq_next [x_tables] did
> > not updated position index
> > ...
> > 
> > 
> > > --- a/fs/seq_file.c~seq_read-info-message-about-buggy-next-functions
> > > +++ a/fs/seq_file.c
> > > @@ -256,9 +256,12 @@ Fill:
> > >  		loff_t pos = m->index;
> > >  
> > >  		p = m->op->next(m, p, &m->index);
> > > -		if (pos == m->index)
> > > -			/* Buggy ->next function */
> > > +		if (pos == m->index) {
> > > +			pr_info("buggy seq_file .next function %ps "
> > > +				"did not updated position index\n",
> > > +				m->op->next);
> 
> This?
> 
> s/pr_info/pr_info_ratelimited/
> 

Fair enough - I made that change.
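
For reference, a minimal sketch of the quoted hunk with only that
substitution applied (and the "did not updated" typo in the message
fixed):

p = m->op->next(m, p, &m->index);
if (pos == m->index) {
        pr_info_ratelimited("buggy seq_file .next function %ps "
                "did not update position index\n",
                m->op->next);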

^ permalink raw reply	[flat|nested] 244+ messages in thread

end of thread, other threads:[~2020-03-02 20:42 UTC | newest]

Thread overview: 244+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-04  1:33 incoming Andrew Morton
2020-02-04  1:33 ` [patch 01/67] ocfs2: fix oops when writing cloned file Andrew Morton
2020-02-04  1:33 ` [patch 02/67] mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section Andrew Morton
2020-02-04  1:33 ` [patch 03/67] fs/proc/page.c: allow inspection of last section and fix end detection Andrew Morton
2020-02-04  1:33 ` [patch 04/67] mm/page_alloc.c: initialize memmap of unavailable memory directly Andrew Morton
2020-02-04  1:33 ` [patch 05/67] mm/page_alloc: fix and rework pfn handling in memmap_init_zone() Andrew Morton
2020-02-04  1:34 ` [patch 06/67] mm: factor out next_present_section_nr() Andrew Morton
     [not found]   ` <CAHk-=widODg0oV6uNrpTvA3tFVN706o-nrWC5XBL9RZ4y3RFcg@mail.gmail.com>
2020-02-04  4:29     ` Andrew Morton
2020-02-04  1:34 ` [patch 07/67] mm/memmap_init: update variable name in memmap_init_zone Andrew Morton
2020-02-04  1:34 ` [patch 08/67] mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone() Andrew Morton
2020-02-04  1:34 ` [patch 09/67] mm/memory_hotplug: we always have a zone in find_(smallest|biggest)_section_pfn Andrew Morton
2020-02-04  1:34 ` [patch 10/67] mm/memory_hotplug: don't check for "all holes" in shrink_zone_span() Andrew Morton
2020-02-04  1:34 ` [patch 11/67] mm/memory_hotplug: drop local variables " Andrew Morton
2020-02-04  1:34 ` [patch 12/67] mm/memory_hotplug: cleanup __remove_pages() Andrew Morton
2020-02-04  1:34 ` [patch 13/67] mm/memory_hotplug: drop valid_start/valid_end from test_pages_in_a_zone() Andrew Morton
2020-02-04  1:34 ` [patch 14/67] smp_mb__{before,after}_atomic(): update Documentation Andrew Morton
2020-02-04  1:34 ` [patch 15/67] ipc/mqueue.c: remove duplicated code Andrew Morton
2020-02-04  1:34 ` [patch 16/67] ipc/mqueue.c: update/document memory barriers Andrew Morton
2020-02-04  1:34 ` [patch 17/67] ipc/msg.c: update and document " Andrew Morton
2020-02-04  1:34 ` [patch 18/67] ipc/sem.c: document and update " Andrew Morton
2020-02-04  1:34 ` [patch 19/67] ipc/msg.c: consolidate all xxxctl_down() functions Andrew Morton
2020-02-04  1:34 ` [patch 20/67] drivers/block/null_blk_main.c: fix layout Andrew Morton
2020-02-04  1:34 ` [patch 21/67] drivers/block/null_blk_main.c: fix uninitialized var warnings Andrew Morton
2020-02-04  1:34 ` [patch 22/67] pinctrl: fix pxa2xx.c build warnings Andrew Morton
2020-02-04  1:34 ` [patch 23/67] mm: remove __krealloc Andrew Morton
2020-02-04  1:35 ` [patch 24/67] mm: add generic p?d_leaf() macros Andrew Morton
2020-02-04  1:35 ` [patch 25/67] arc: mm: add p?d_leaf() definitions Andrew Morton
2020-02-04  1:35 ` [patch 26/67] arm: " Andrew Morton
2020-02-04  1:35 ` [patch 27/67] arm64: " Andrew Morton
2020-02-04  1:35 ` [patch 28/67] mips: " Andrew Morton
2020-02-04  1:35 ` [patch 29/67] powerpc: " Andrew Morton
2020-02-04  1:35 ` [patch 30/67] riscv: " Andrew Morton
2020-02-04  1:35 ` [patch 31/67] s390: " Andrew Morton
2020-02-04  1:35 ` [patch 32/67] sparc: " Andrew Morton
2020-02-04  1:35 ` [patch 33/67] x86: " Andrew Morton
2020-02-04  1:35 ` [patch 34/67] mm: pagewalk: add p4d_entry() and pgd_entry() Andrew Morton
2020-02-04  1:35 ` [patch 35/67] mm: pagewalk: allow walking without vma Andrew Morton
2020-02-04  1:35 ` [patch 36/67] mm: pagewalk: don't lock PTEs for walk_page_range_novma() Andrew Morton
2020-02-04  1:35 ` [patch 37/67] mm: pagewalk: fix termination condition in walk_pte_range() Andrew Morton
2020-02-04  1:36 ` [patch 38/67] mm: pagewalk: add 'depth' parameter to pte_hole Andrew Morton
2020-02-04  1:36 ` [patch 39/67] x86: mm: point to struct seq_file from struct pg_state Andrew Morton
2020-02-04  1:36 ` [patch 40/67] x86: mm+efi: convert ptdump_walk_pgd_level() to take a mm_struct Andrew Morton
2020-02-04  1:36 ` [patch 41/67] x86: mm: convert ptdump_walk_pgd_level_debugfs() to take an mm_struct Andrew Morton
2020-02-04  1:36 ` [patch 42/67] mm: add generic ptdump Andrew Morton
2020-02-04  1:36 ` [patch 43/67] x86: mm: convert dump_pagetables to use walk_page_range Andrew Morton
2020-02-04  1:36 ` [patch 44/67] arm64: mm: convert mm/dump.c to use walk_page_range() Andrew Morton
2020-02-04  1:36 ` [patch 45/67] arm64: mm: display non-present entries in ptdump Andrew Morton
2020-02-04  1:36 ` [patch 46/67] mm: ptdump: reduce level numbers by 1 in note_page() Andrew Morton
2020-02-04  1:36 ` [patch 47/67] x86: mm: avoid allocating struct mm_struct on the stack Andrew Morton
2020-02-04  1:36 ` [patch 48/67] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case Andrew Morton
2020-02-04  1:36 ` [patch 49/67] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush Andrew Morton
2020-02-04  1:36 ` [patch 50/67] asm-generic/tlb: avoid potential double flush Andrew Morton
2020-02-04  1:36 ` [patch 51/67] asm-gemeric/tlb: remove stray function declarations Andrew Morton
2020-02-04  1:36 ` [patch 52/67] asm-generic/tlb: add missing CONFIG symbol Andrew Morton
2020-02-04  1:37 ` [patch 53/67] asm-generic/tlb: rename HAVE_RCU_TABLE_FREE Andrew Morton
2020-02-04  1:37 ` [patch 54/67] asm-generic/tlb: rename HAVE_MMU_GATHER_PAGE_SIZE Andrew Morton
2020-02-04  1:37 ` [patch 55/67] asm-generic/tlb: rename HAVE_MMU_GATHER_NO_GATHER Andrew Morton
2020-02-04  1:37 ` [patch 56/67] asm-generic/tlb: provide MMU_GATHER_TABLE_FREE Andrew Morton
2020-02-04  1:37 ` [patch 57/67] proc: decouple proc from VFS with "struct proc_ops" Andrew Morton
2020-02-04  1:37 ` [patch 58/67] proc: convert everything to " Andrew Morton
2020-02-04  1:37 ` [patch 59/67] lib/string: add strnchrnul() Andrew Morton
2020-02-04  1:37 ` [patch 60/67] bitops: more BITS_TO_* macros Andrew Morton
2020-02-04  1:37 ` [patch 61/67] lib: add test for bitmap_parse() Andrew Morton
2020-02-04  1:37 ` [patch 62/67] lib: make bitmap_parse_user a wrapper on bitmap_parse Andrew Morton
2020-02-04  1:37 ` [patch 63/67] lib: rework bitmap_parse() Andrew Morton
2020-02-04  1:37 ` [patch 64/67] lib: new testcases for bitmap_parse{_user} Andrew Morton
2020-02-04  1:37 ` [patch 65/67] include/linux/cpumask.h: don't calculate length of the input string Andrew Morton
2020-02-04  1:37 ` [patch 66/67] treewide: remove redundant IS_ERR() before error code check Andrew Morton
2020-02-04  1:37 ` [patch 67/67] ARM: dma-api: fix max_pfn off-by-one error in __dma_supported() Andrew Morton
2020-02-04  1:48 ` [merged] mips-kdb-remove-old-workaround-for-backtracing-on-other-cpus.patch removed from -mm tree Andrew Morton
2020-02-04  1:48 ` [merged] kdb-kdb_current_regs-should-be-private.patch " Andrew Morton
2020-02-04  1:48 ` [merged] kdb-kdb_current_task-shouldnt-be-exported.patch " Andrew Morton
2020-02-04  1:48 ` [merged] kdb-gid-rid-of-implicit-setting-of-the-current-task-regs.patch " Andrew Morton
2020-02-04  1:48 ` [merged] kdb-get-rid-of-confusing-diag-msg-from-rd-if-current-task-has-no-regs.patch " Andrew Morton
2020-02-04  1:49 ` [obsolete] linux-next-git-rejects.patch " Andrew Morton
     [not found] ` <CAHk-=whog86e4fRY_sxHqAos6spwAi_4aFF49S7h5C4XAZM2qw@mail.gmail.com>
2020-02-04  2:46   ` incoming Andrew Morton
2020-02-10  0:55 ` + mm-dont-prepare-anon_vma-if-vma-has-vm_wipeonfork.patch added to -mm tree Andrew Morton
2020-02-10  0:55 ` + revert-mm-rmapc-reuse-mergeable-anon_vma-as-parent-when-fork.patch " Andrew Morton
2020-02-10  0:55 ` + mm-set-vm_next-and-vm_prev-to-null-in-vm_area_dup.patch " Andrew Morton
2020-02-10  1:03 ` + mm-list_lru-fix-a-data-race-in-list_lru_count_one.patch " Andrew Morton
2020-02-10  1:05 ` + mm-frontswap-mark-various-intentional-data-races.patch " Andrew Morton
2020-02-10  1:22 ` + mm-add-mremap_dontunmap-to-mremap.patch " Andrew Morton
2020-02-10  1:35 ` + mm-sparsemem-get-address-to-page-struct-instead-of-address-to-pfn.patch " Andrew Morton
2020-02-10  1:48 ` + mm-swapfile-fix-and-annotate-various-data-races.patch " Andrew Morton
2020-02-10  2:01 ` + mm-page_io-mark-various-intentional-data-races.patch " Andrew Morton
2020-02-10  2:01 ` + mm-swap_state-mark-various-intentional-data-races.patch " Andrew Morton
2020-02-10  4:00 ` + linux-pipe_fs_ih-fix-kernel-doc-warnings-after-wait-was-split.patch " Andrew Morton
2020-02-10  4:16 ` + mm-swap-move-inode_lock-out-of-claim_swapfile.patch " Andrew Morton
2020-02-10  4:20 ` + selftests-vm-add-missed-tests-in-run_vmtests.patch " Andrew Morton
2020-02-10  4:21 ` + mm-memcg-fix-build-error-around-the-usage-of-kmem_caches.patch " Andrew Morton
2020-02-10  4:23 ` + mm-memcontrol-fix-a-data-race-in-scan-count.patch " Andrew Morton
2020-02-10  4:29 ` + get_maintainer-remove-uses-of-p-for-maintainer-name.patch " Andrew Morton
2020-02-10  4:37 ` + scripts-get_maintainerpl-deprioritize-old-fixes-addresses.patch " Andrew Morton
2020-02-11  5:33 ` + mm-vmpressure-dont-need-call-kfree-if-kstrndup-fails.patch " Andrew Morton
2020-02-11  5:34 ` + mm-vmpressure-use-mem_cgroup_is_root-api.patch " Andrew Morton
2020-02-11  5:39 ` + mm-filemap-fix-a-data-race-in-filemap_fault.patch " Andrew Morton
2020-02-11  5:50 ` + mm-gup-split-get_user_pages_remote-into-two-routines.patch " Andrew Morton
2020-02-11  5:50 ` + mm-gup-pass-a-flags-arg-to-__gup_device_-functions.patch " Andrew Morton
2020-02-11  5:50 ` + mm-introduce-page_ref_sub_return.patch " Andrew Morton
2020-02-11  5:50 ` + mm-gup-pass-gup-flags-to-two-more-routines.patch " Andrew Morton
2020-02-11  5:50 ` + mm-gup-require-foll_get-for-get_user_pages_fast.patch " Andrew Morton
2020-02-11  5:50 ` + mm-gup-track-foll_pin-pages.patch " Andrew Morton
2020-02-11  5:50 ` + mm-gup-page-hpage_pinned_refcount-exact-pin-counts-for-huge-pages.patch " Andrew Morton
2020-02-11  5:50 ` + mm-gup-proc-vmstat-pin_user_pages-foll_pin-reporting.patch " Andrew Morton
2020-02-11  5:50 ` + mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch " Andrew Morton
2020-02-11  5:50 ` + selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch " Andrew Morton
2020-02-11  5:50 ` + mm-improve-dump_page-for-compound-pages.patch " Andrew Morton
2020-02-11  5:51 ` + mm-dump_page-additional-diagnostics-for-huge-pinned-pages.patch " Andrew Morton
2020-02-11  6:05 ` + mm-mapping_dirty_helpers-update-huge-page-table-entry-callbacks.patch " Andrew Morton
2020-02-11  6:06 ` + zswap-allow-setting-default-status-compressor-and-allocator-in-kconfig.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-counter.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb_cgroup-add-interface-for-charge-uncharge-hugetlb-reservations.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb_cgroup-add-reservation-accounting-for-private-mappings.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb-disable-region_add-file_region-coalescing.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb_cgroup-add-accounting-for-shared-mappings.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb_cgroup-support-noreserve-mappings.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb-support-file_region-coalescing-again.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-tests.patch " Andrew Morton
2020-02-11 23:19 ` + hugetlb_cgroup-add-hugetlb_cgroup-reservation-docs.patch " Andrew Morton
2020-02-11 23:21 ` + lib-bch-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
2020-02-11 23:21 ` + lib-objagg-replace-zero-length-arrays-with-flexible-array-member.patch " Andrew Morton
2020-02-11 23:21 ` + lib-ts_bm-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
2020-02-11 23:21 ` + lib-ts_fsm-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
2020-02-11 23:21 ` + lib-ts_kmp-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
2020-02-11 23:23 ` + mm-rmap-annotate-a-data-race-at-tlb_flush_batched.patch " Andrew Morton
2020-02-11 23:26 ` + mm-mempool-fix-a-data-race-in-mempool_free.patch " Andrew Morton
2020-02-11 23:53 ` + mm-kmemleak-annotate-a-data-race-in-checksum.patch " Andrew Morton
2020-02-12  0:19 ` + mm-adjust-shuffle-code-to-allow-for-future-coalescing.patch " Andrew Morton
2020-02-12  0:19 ` + mm-use-zone-and-order-instead-of-free-area-in-free_list-manipulators.patch " Andrew Morton
2020-02-12  0:19 ` + mm-add-function-__putback_isolated_page.patch " Andrew Morton
2020-02-12  0:19 ` + mm-introduce-reported-pages.patch " Andrew Morton
2020-02-12  0:19 ` + virtio-balloon-pull-page-poisoning-config-out-of-free-page-hinting.patch " Andrew Morton
2020-02-12  0:19 ` + virtio-balloon-add-support-for-providing-free-page-reports-to-host.patch " Andrew Morton
2020-02-12  0:20 ` + mm-page_reporting-rotate-reported-pages-to-the-tail-of-the-list.patch " Andrew Morton
2020-02-12  0:20 ` + mm-page_reporting-add-budget-limit-on-how-many-pages-can-be-reported-per-pass.patch " Andrew Morton
2020-02-12  0:20 ` + mm-page_reporting-add-free-page-reporting-documentation.patch " Andrew Morton
2020-02-12  0:30 ` + mm-page_counter-fix-various-data-races.patch " Andrew Morton
2020-02-12  0:33 ` + memcg-lost-css_put-in-memcg_expand_shrinker_maps.patch " Andrew Morton
2020-02-12 21:16 ` + mm-page_counter-fix-various-data-races-at-memsw.patch " Andrew Morton
2020-02-12 21:16 ` [to-be-updated] mm-page_counter-fix-various-data-races.patch removed from " Andrew Morton
2020-02-12 21:18 ` + mm-swapfilec-fix-comments-for-swapcache_prepare.patch added to " Andrew Morton
2020-02-12 21:22 ` + mm-util-annotate-an-data-race-at-vm_committed_as.patch " Andrew Morton
2020-02-12 22:08 ` + asm-generic-fix-unistd_32h-generation-format.patch " Andrew Morton
2020-02-12 22:26 ` + mm-dont-bother-dropping-mmap_sem-for-zero-size-readahead.patch " Andrew Morton
2020-02-12 22:34 ` + lib-scatterlist-fix-sg_copy_buffer-kerneldoc.patch " Andrew Morton
2020-02-12 22:59 ` + mm-allocate-shrinker_map-on-appropriate-numa-node.patch " Andrew Morton
2020-02-12 23:05 ` + init-cleanup-anon_inodes-and-old-io-schedulers-options.patch " Andrew Morton
2020-02-12 23:10 ` + checkpatch-check-spdx-tags-in-yaml-files.patch " Andrew Morton
2020-02-13  2:16 ` + drivers-base-memoryc-indicate-all-memory-blocks-as-removable.patch " Andrew Morton
2020-02-13  2:56 ` + mm-refactor-insert_page-to-prepare-for-batched-lock-insert.patch " Andrew Morton
2020-02-13  2:56 ` + mm-add-vm_insert_pages.patch " Andrew Morton
2020-02-13  2:57 ` + mm-add-vm_insert_pages-fix.patch " Andrew Morton
2020-02-13  2:57 ` + net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy.patch " Andrew Morton
2020-02-13  2:57 ` + net-zerocopy-use-vm_insert_pages-for-tcp-rcv-zerocopy-fix.patch " Andrew Morton
2020-02-14  2:50 ` + mm-add-vm_insert_pages-2.patch " Andrew Morton
2020-02-14  2:52 ` + mm-migratec-no-need-to-check-for-i-start-in-do_pages_move.patch " Andrew Morton
2020-02-14  2:52 ` + mm-migratec-wrap-do_move_pages_to_node-and-store_status.patch " Andrew Morton
2020-02-14  2:52 ` + mm-migratec-check-pagelist-in-move_pages_and_store_status.patch " Andrew Morton
2020-02-14  2:52 ` + mm-migratec-unify-not-queued-for-migration-handling-in-do_pages_move.patch " Andrew Morton
2020-02-14  2:57 ` + mm-migratec-migrate-pg_readahead-flag.patch " Andrew Morton
2020-02-14  2:57 ` + mm-migratec-migrate-pg_readahead-flag-fix.patch " Andrew Morton
2020-02-14  2:58 ` + include-remove-highmemh-from-pagemaph.patch " Andrew Morton
2020-02-14  3:06 ` + mm-kmemleak-annotate-various-data-races-obj-ptr.patch " Andrew Morton
2020-02-14  3:06 ` [to-be-updated] mm-kmemleak-annotate-a-data-race-in-checksum.patch removed from " Andrew Morton
2020-02-14  3:26 ` + mm-swapfile-fix-and-annotate-various-data-races-v2.patch added to " Andrew Morton
2020-02-14  3:27 ` + mm-page_io-mark-various-intentional-data-races-v2.patch " Andrew Morton
2020-02-14  5:10 ` + checkpatch-support-base-commit-format.patch " Andrew Morton
2020-02-14  5:11 ` + checkpatch-prefer-fallthrough-over-fallthrough-comments.patch " Andrew Morton
2020-02-14  5:12 ` + uapi-fix-userspace-breakage-use-__bits_per_long-for-swap.patch " Andrew Morton
2020-02-14  5:15 ` + lib-string-update-match_string-doc-strings-with-correct-behavior.patch " Andrew Morton
2020-02-14  5:16 ` + mm-vmscan-replace-open-codings-to-numa_no_node.patch " Andrew Morton
2020-02-14  5:19 ` + mm-mempolicy-support-mpol_mf_strict-for-huge-page-mapping.patch " Andrew Morton
2020-02-14  5:22 ` + mm-vmscan-dont-round-up-scan-size-for-online-memory-cgroup.patch " Andrew Morton
2020-02-14  5:26 ` + mm-mempolicy-use-vm_bug_on_vma-in-queue_pages_test_walk.patch " Andrew Morton
2020-02-14  5:30 ` + drivers-base-memoryc-drop-section_count.patch " Andrew Morton
2020-02-14  5:31 ` + drivers-base-memoryc-drop-pages_correctly_probed.patch " Andrew Morton
2020-02-14  5:31 ` + mm-page_extc-drop-pfn_present-check-when-onlining.patch " Andrew Morton
2020-02-14  5:41 ` [failures] include-remove-highmemh-from-pagemaph.patch removed from " Andrew Morton
2020-02-14  6:26 ` mmotm 2020-02-13-22-26 uploaded Andrew Morton
2020-02-24  2:45 ` + mm-debug-add-tests-validating-architecture-page-table-helpers.patch added to -mm tree Andrew Morton
2020-02-24  3:20 ` + proc-faster-open-read-close-with-permanent-files.patch " Andrew Morton
2020-02-24  3:24 ` + proc-faster-open-read-close-with-permanent-files-checkpatch-fixes.patch " Andrew Morton
2020-02-24  3:25 ` + psi-move-pf_memstall-into-psi-specific-psi_flags.patch " Andrew Morton
2020-02-24  3:29 ` + mm-hugetlb-fix-a-addressing-exception-caused-by-huge_pte_offset.patch " Andrew Morton
2020-02-24  3:31 ` + ocfs2-there-is-no-need-to-log-twice-in-several-functions.patch " Andrew Morton
2020-02-24  3:31 ` + ocfs2-correct-annotation-from-l_next_rec-to-l_next_free_rec.patch " Andrew Morton
2020-02-24  3:46 ` + hugetlb-support-file_region-coalescing-again-fix-2.patch " Andrew Morton
2020-02-24  3:53 ` + fat-fix-uninit-memory-access-for-partial-initialized-inode.patch " Andrew Morton
2020-02-24  4:08 ` + mm-add-mremap_dontunmap-to-mremap-v7.patch " Andrew Morton
2020-02-24  4:08 ` + selftest-add-mremap_dontunmap-selftest-v7.patch " Andrew Morton
2020-02-24  4:08 ` + selftest-add-mremap_dontunmap-selftest-v7-checkpatch-fixes.patch " Andrew Morton
2020-02-24  4:10 ` + percpu_counter-fix-a-data-race-at-vm_committed_as.patch " Andrew Morton
2020-02-24  4:10 ` + lib-test_lockup-fix-spelling-mistake-iteraions-iterations.patch " Andrew Morton
2020-02-24 21:40 ` + mm-swapfile-fix-data-races-in-try_to_unuse.patch " Andrew Morton
2020-02-24 21:45 ` [nacked] psi-move-pf_memstall-into-psi-specific-psi_flags.patch removed from " Andrew Morton
2020-02-24 21:57 ` + mm-z3fold-do-not-include-rwlockh-directly.patch added to " Andrew Morton
2020-02-24 22:04 ` + mm-hotplug-fix-page-online-with-debug_pagealloc-compiled-but-not-enabled.patch " Andrew Morton
2020-02-24 22:13 ` + mm-vma-add-missing-vma-flag-readable-name-for-vm_sync.patch " Andrew Morton
2020-02-24 22:13 ` + mm-vma-make-vma_is_accessible-available-for-general-use.patch " Andrew Morton
2020-02-24 22:13 ` + mm-vma-replace-all-remaining-open-encodings-with-is_vm_hugetlb_page.patch " Andrew Morton
2020-02-24 22:13 ` + mm-vma-replace-all-remaining-open-encodings-with-vma_is_anonymous.patch " Andrew Morton
2020-02-24 22:14 ` + mm-vma-append-unlikely-while-testing-vma-access-permissions.patch " Andrew Morton
2020-02-24 22:30 ` + samples-hw_breakpoint-drop-hw_breakpoint_r-when-reporting-writes.patch " Andrew Morton
2020-02-24 22:31 ` + samples-hw_breakpoint-drop-use-of-kallsyms_lookup_name.patch " Andrew Morton
2020-02-24 22:31 ` + kallsyms-unexport-kallsyms_lookup_name-and-kallsyms_on_each_symbol.patch " Andrew Morton
2020-02-24 22:55 ` + loop-use-worker-per-cgroup-instead-of-kworker.patch " Andrew Morton
2020-02-24 22:55 ` + mm-charge-active-memcg-when-no-mm-is-set.patch " Andrew Morton
2020-02-24 22:55 ` + loop-charge-i-o-to-mem-and-blk-cg.patch " Andrew Morton
2020-02-24 23:33 ` + mm-mempolicy-checking-hugepage-migration-is-supported-by-arch-in-vma_migratable.patch " Andrew Morton
2020-02-24 23:39 ` + checkpatch-fix-minor-typo-and-mixed-spacetab-in-indentation.patch " Andrew Morton
2020-02-24 23:40 ` + checkpatch-fix-multiple-const-types.patch " Andrew Morton
2020-02-24 23:40 ` + checkpatch-add-command-line-option-for-tab-size.patch " Andrew Morton
2020-02-24 23:43 ` + lib-test_bitmap-make-use-of-exp2_in_bits.patch " Andrew Morton
2020-02-24 23:44 ` + ocfs2-remove-useless-err.patch " Andrew Morton
2020-02-25  0:26 ` + arch-kconfig-update-have_reliable_stacktrace-description.patch " Andrew Morton
2020-02-25  0:32 ` + mm-memcg-slab-introduce-mem_cgroup_from_obj.patch " Andrew Morton
2020-02-25  0:47 ` + mm-kmem-cleanup-__memcg_kmem_charge_memcg-arguments.patch " Andrew Morton
2020-02-25  0:48 ` + mm-kmem-cleanup-memcg_kmem_uncharge_memcg-arguments.patch " Andrew Morton
2020-02-25  0:48 ` + mm-kmem-rename-memcg_kmem_uncharge-into-memcg_kmem_uncharge_page.patch " Andrew Morton
2020-02-25  0:48 ` + mm-kmem-switch-to-nr_pages-in-__memcg_kmem_charge_memcg.patch " Andrew Morton
2020-02-25  0:48 ` + mm-memcg-slab-cache-page-number-in-memcg_uncharge_slab.patch " Andrew Morton
2020-02-25  0:48 ` + mm-kmem-rename-__memcg_kmem_uncharge_memcg-to-__memcg_kmem_uncharge.patch " Andrew Morton
2020-02-25  2:29 ` + mm-memcg-slab-introduce-mem_cgroup_from_obj-v2.patch " Andrew Morton
2020-02-25  2:36 ` + ocfs2-add-missing-annotations-for-ocfs2_refcount_cache_lock-and-ocfs2_refcount_cache_unlock.patch " Andrew Morton
2020-02-25  3:53 ` mmotm 2020-02-24-19-53 uploaded Andrew Morton
2020-02-26  1:06 ` + checkpatch-improve-gerrit-change-id-test.patch added to -mm tree Andrew Morton
2020-02-26  1:55 ` + dma-buf-free-dmabuf-name-in-dma_buf_release.patch " Andrew Morton
     [not found]   ` <CAO_48GFr9-aY4=kRqWB=UkEzPj5fQDip+G1tNZMsT0XoQpBC7Q@mail.gmail.com>
     [not found]     ` <CAKMK7uGvixQ2xoQMt3pvt0OpNXDjDGTvSWsaAppsKrmO_EP3Kg@mail.gmail.com>
2020-02-27  4:20       ` Andrew Morton
2020-02-26  3:42 ` + lib-rbtree-fix-coding-style-of-assignments.patch " Andrew Morton
2020-02-26  3:56 ` + seq_read-info-message-about-buggy-next-functions.patch " Andrew Morton
     [not found]   ` <1583173259.7365.142.camel@lca.pw>
     [not found]     ` <1583177508.7365.144.camel@lca.pw>
2020-03-02 20:42       ` Andrew Morton
2020-02-26  3:56 ` + pstore_ftrace_seq_next-should-increase-position-index.patch " Andrew Morton
     [not found]   ` <07f968e6-02cd-de2a-e868-787e4bedd346@virtuozzo.com>
2020-02-27  4:26     ` Andrew Morton
2020-02-26  3:56 ` + gcov_seq_next-should-increase-position-index.patch " Andrew Morton
2020-02-26  3:56 ` + sysvipc_find_ipc-should-increase-position-index.patch " Andrew Morton
2020-02-27  1:19 ` + mm-bring-sparc-pte_index-semantics-inline-with-other-platforms.patch " Andrew Morton
2020-02-27  1:37 ` + mm-vmscan-fix-data-races-at-kswapd_classzone_idx.patch " Andrew Morton
2020-02-27  1:49 ` + fs-epoll-make-nesting-accounting-safe-for-rt-kernel.patch " Andrew Morton
2020-02-27  3:50 ` + lib-optimize-cpumask_local_spread.patch " Andrew Morton
2020-02-27  4:04 ` + mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch " Andrew Morton
2020-02-27  4:11 ` + gcov-gcc_4_7-replace-zero-length-array-with-flexible-array-member.patch " Andrew Morton
2020-02-27  4:42 ` + mm-debug-add-tests-validating-architecture-page-table-helpers-fix-2.patch " Andrew Morton
2020-02-27  4:42 ` [to-be-updated] mm-debug-add-tests-validating-architecture-page-table-helpers-fix.patch removed from " Andrew Morton
2020-02-27  4:44 ` + huge-tmpfs-try-to-split_huge_page-when-punching-hole.patch added to " Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).