* incoming
@ 2020-01-31  6:10 Andrew Morton
From: Andrew Morton @ 2020-01-31  6:10 UTC
  To: Linus Torvalds; +Cc: linux-mm, mm-commits


Most of -mm and quite a number of other subsystems.

MM is fairly quiet this time.  Holidays, I assume.



118 patches, based on 39bed42de2e7d74686a2d5a45638d6a5d7e7d473:

Subsystems affected by this patch series:

  hotfixes
  scripts
  ocfs2
  mm/slub
  mm/kmemleak
  mm/debug
  mm/pagecache
  mm/gup
  mm/swap
  mm/memcg
  mm/pagemap
  mm/tracing
  mm/kasan
  mm/initialization
  mm/pagealloc
  mm/vmscan
  mm/tools
  mm/memblock
  mm/oom-kill
  mm/hugetlb
  mm/migration
  mm/mmap
  mm/memory-hotplug
  mm/zswap
  mm/cleanups
  mm/zram
  misc
  lib
  binfmt
  init
  reiserfs
  exec
  dma-mapping
  kcov

Subsystem: hotfixes

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      lib/test_bitmap: correct test data offsets for 32-bit

    "Theodore Ts'o" <tytso@mit.edu>:
      memcg: fix a crash in wb_workfn when a device disappears

    Dan Carpenter <dan.carpenter@oracle.com>:
      mm/mempolicy.c: fix out of bounds write in mpol_parse_str()

    Pingfan Liu <kernelfans@gmail.com>:
      mm/sparse.c: reset section's mem_map when fully deactivated

    Wei Yang <richardw.yang@linux.intel.com>:
      mm/migrate.c: also overwrite error when it is bigger than zero

    Dan Williams <dan.j.williams@intel.com>:
      mm/memory_hotplug: fix remove_memory() lockdep splat

    Wei Yang <richardw.yang@linux.intel.com>:
      mm: thp: don't need care deferred split queue in memcg charge move path

    Yang Shi <yang.shi@linux.alibaba.com>:
      mm: move_pages: report the number of non-attempted pages

Subsystem: scripts

    Xiong <xndchn@gmail.com>:
      scripts/spelling.txt: add more spellings to spelling.txt

    Luca Ceresoli <luca@lucaceresoli.net>:
      scripts/spelling.txt: add "issus" typo

Subsystem: ocfs2

    Aditya Pakki <pakki001@umn.edu>:
      fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres

    zhengbin <zhengbin13@huawei.com>:
      ocfs2: remove unneeded semicolons

    Masahiro Yamada <masahiroy@kernel.org>:
      ocfs2: make local header paths relative to C files

    Colin Ian King <colin.king@canonical.com>:
      ocfs2/dlm: remove redundant assignment to ret

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use

    wangyan <wangyan122@huawei.com>:
      ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans()
      ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction

Subsystem: mm/slub

    Yu Zhao <yuzhao@google.com>:
      mm/slub.c: avoid slub allocation while holding list_lock

Subsystem: mm/kmemleak

    He Zhe <zhe.he@windriver.com>:
      mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t

Subsystem: mm/debug

    Vlastimil Babka <vbabka@suse.cz>:
      mm/debug.c: always print flags in dump_page()

Subsystem: mm/pagecache

    Ira Weiny <ira.weiny@intel.com>:
      mm/filemap.c: clean up filemap_write_and_wait()

Subsystem: mm/gup

    Qiujun Huang <hqjagain@gmail.com>:
      mm: fix gup_pud_range

    Wei Yang <richardw.yang@linux.intel.com>:
      mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge

    John Hubbard <jhubbard@nvidia.com>:
    Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12:
      mm/gup: factor out duplicate code from four routines
      mm/gup: move try_get_compound_head() to top, fix minor issues

    Dan Williams <dan.j.williams@intel.com>:
      mm: Cleanup __put_devmap_managed_page() vs ->page_free()

    John Hubbard <jhubbard@nvidia.com>:
      mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
      goldish_pipe: rename local pin_user_pages() routine
      mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
      vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
      mm/gup: allow FOLL_FORCE for get_user_pages_fast()
      IB/umem: use get_user_pages_fast() to pin DMA pages
      media/v4l2-core: set pages dirty upon releasing DMA buffers
      mm/gup: introduce pin_user_pages*() and FOLL_PIN
      goldish_pipe: convert to pin_user_pages() and put_user_page()
      IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
      mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
      drm/via: set FOLL_PIN via pin_user_pages_fast()
      fs/io_uring: set FOLL_PIN via pin_user_pages()
      net/xdp: set FOLL_PIN via pin_user_pages()
      media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion
      vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
      powerpc: book3s64: convert to pin_user_pages() and put_user_page()
      mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"
      mm, tree-wide: rename put_user_page*() to unpin_user_page*()

Subsystem: mm/swap

    Vasily Averin <vvs@virtuozzo.com>:
      mm/swapfile.c: swap_next should increase position index

Subsystem: mm/memcg

    Kaitao Cheng <pilgrimtao@gmail.com>:
      mm/memcontrol.c: cleanup some useless code

Subsystem: mm/pagemap

    Li Xinhai <lixinhai.lxh@gmail.com>:
      mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page

Subsystem: mm/tracing

    Junyong Sun <sunjy516@gmail.com>:
      mm, tracing: print symbol name for kmem_alloc_node call_site events

Subsystem: mm/kasan

    "Gustavo A. R. Silva" <gustavo@embeddedor.com>:
      lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more()

Subsystem: mm/initialization

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      mm/early_ioremap.c: use %pa to print resource_size_t variables

Subsystem: mm/pagealloc

    "Kirill A. Shutemov" <kirill@shutemov.name>:
      mm/page_alloc: skip non present sections on zone initialization

    David Hildenbrand <david@redhat.com>:
      mm: remove the memory isolate notifier
      mm: remove "count" parameter from has_unmovable_pages()

Subsystem: mm/vmscan

    Liu Song <liu.song11@zte.com.cn>:
      mm/vmscan.c: remove unused return value of shrink_node

    Alex Shi <alex.shi@linux.alibaba.com>:
      mm/vmscan: remove prefetch_prev_lru_page
      mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE

Subsystem: mm/tools

    Daniel Wagner <dwagner@suse.de>:
      tools/vm/slabinfo: fix sanity checks enabling

Subsystem: mm/memblock

    Anshuman Khandual <anshuman.khandual@arm.com>:
      mm/memblock: define memblock_physmem_add()
      memblock: Use __func__ in remaining memblock_dbg() call sites

Subsystem: mm/oom-kill

    David Rientjes <rientjes@google.com>:
      mm, oom: dump stack of victim when reaping failed

Subsystem: mm/hugetlb

    Wei Yang <richardw.yang@linux.intel.com>:
      mm/huge_memory.c: use head to check huge zero page
      mm/huge_memory.c: use head to emphasize the purpose of page
      mm/huge_memory.c: reduce critical section protected by split_queue_lock

Subsystem: mm/migration

    Ralph Campbell <rcampbell@nvidia.com>:
      mm/migrate: remove useless mask of start address
      mm/migrate: clean up some minor coding style
      mm/migrate: add stable check in migrate_vma_insert_page()

    David Rientjes <rientjes@google.com>:
      mm, thp: fix defrag setting if newline is not used

Subsystem: mm/mmap

    Miaohe Lin <linmiaohe@huawei.com>:
      mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma()

Subsystem: mm/memory-hotplug

    David Hildenbrand <david@redhat.com>:
    Patch series "mm/memory_hotplug: pass in nid to online_pages()":
      mm/memory_hotplug: pass in nid to online_pages()

    Qian Cai <cai@lca.pw>:
      mm/hotplug: silence a lockdep splat with printk()
      mm/page_isolation: fix potential warning from user

Subsystem: mm/zswap

    Vitaly Wool <vitaly.wool@konsulko.com>:
      mm/zswap.c: add allocation hysteresis if pool limit is hit

    Dan Carpenter <dan.carpenter@oracle.com>:
      zswap: potential NULL dereference on error in init_zswap()

Subsystem: mm/cleanups

    Yu Zhao <yuzhao@google.com>:
      include/linux/mm.h: clean up obsolete check on space in page->flags

    Wei Yang <richardw.yang@linux.intel.com>:
      include/linux/mm.h: remove dead code totalram_pages_set()

    Anshuman Khandual <anshuman.khandual@arm.com>:
      include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block

    Hao Lee <haolee.swjtu@gmail.com>:
      mm: fix comments related to node reclaim

Subsystem: mm/zram

    Taejoon Song <taejoon.song@lge.com>:
      zram: try to avoid worst-case scenario on same element pages

    Colin Ian King <colin.king@canonical.com>:
      drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store

Subsystem: misc

    Akinobu Mita <akinobu.mita@gmail.com>:
    Patch series "add header file for kelvin to/from Celsius conversion":
      include/linux/units.h: add helpers for kelvin to/from Celsius conversion
      ACPI: thermal: switch to use <linux/units.h> helpers
      platform/x86: asus-wmi: switch to use <linux/units.h> helpers
      platform/x86: intel_menlow: switch to use <linux/units.h> helpers
      thermal: int340x: switch to use <linux/units.h> helpers
      thermal: intel_pch: switch to use <linux/units.h> helpers
      nvme: hwmon: switch to use <linux/units.h> helpers
      thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h>
      iwlegacy: use <linux/units.h> helpers
      iwlwifi: use <linux/units.h> helpers
      thermal: armada: remove unused TO_MCELSIUS macro
      iio: adc: qcom-vadc-common: use <linux/units.h> helpers

Subsystem: lib

    Mikhail Zaslonko <zaslonko@linux.ibm.com>:
    Patch series "S390 hardware support for kernel zlib", v3:
      lib/zlib: add s390 hardware support for kernel zlib_deflate
      s390/boot: rename HEAP_SIZE due to name collision
      lib/zlib: add s390 hardware support for kernel zlib_inflate
      s390/boot: add dfltcc= kernel command line parameter
      lib/zlib: add zlib_deflate_dfltcc_enabled() function
      btrfs: use larger zlib buffer for s390 hardware compression

    Nathan Chancellor <natechancellor@gmail.com>:
      lib/scatterlist.c: adjust indentation in __sg_alloc_table

    Yury Norov <yury.norov@gmail.com>:
      uapi: rename ext2_swab() to swab() and share globally in swab.h
      lib/find_bit.c: join _find_next_bit{_le}
      lib/find_bit.c: uninline helper _find_next_bit()

Subsystem: binfmt

    Alexey Dobriyan <adobriyan@gmail.com>:
      fs/binfmt_elf.c: smaller code generation around auxv vector fill
      fs/binfmt_elf.c: fix ->start_code calculation
      fs/binfmt_elf.c: don't copy ELF header around
      fs/binfmt_elf.c: better codegen around current->mm
      fs/binfmt_elf.c: make BAD_ADDR() unlikely
      fs/binfmt_elf.c: coredump: allocate core ELF header on stack
      fs/binfmt_elf.c: coredump: delete duplicated overflow check
      fs/binfmt_elf.c: coredump: allow process with empty address space to coredump

Subsystem: init

    Arvind Sankar <nivedita@alum.mit.edu>:
      init/main.c: log arguments and environment passed to init
      init/main.c: remove unnecessary repair_env_string in do_initcall_level
    Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2:
      init/main.c: fix quoted value handling in unknown_bootoption

    Christophe Leroy <christophe.leroy@c-s.fr>:
      init/main.c: fix misleading "This architecture does not have kernel memory protection" message

Subsystem: reiserfs

    Yunfeng Ye <yeyunfeng@huawei.com>:
      reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()

Subsystem: exec

    Alexey Dobriyan <adobriyan@gmail.com>:
      execve: warn if process starts with executable stack

Subsystem: dma-mapping

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()

Subsystem: kcov

    Dmitry Vyukov <dvyukov@google.com>:
      kcov: ignore fault-inject and stacktrace

 Documentation/admin-guide/kernel-parameters.txt              |   12 
 Documentation/core-api/index.rst                             |    1 
 Documentation/core-api/pin_user_pages.rst                    |  234 +++++
 Documentation/vm/zswap.rst                                   |   13 
 arch/powerpc/mm/book3s64/iommu_api.c                         |   14 
 arch/s390/boot/compressed/decompressor.c                     |    8 
 arch/s390/boot/ipl_parm.c                                    |   14 
 arch/s390/include/asm/setup.h                                |    7 
 arch/s390/kernel/setup.c                                     |   14 
 drivers/acpi/thermal.c                                       |   34 
 drivers/base/memory.c                                        |   25 
 drivers/block/zram/zram_drv.c                                |   10 
 drivers/gpu/drm/via/via_dmablit.c                            |    6 
 drivers/iio/adc/qcom-vadc-common.c                           |    6 
 drivers/iio/adc/qcom-vadc-common.h                           |    1 
 drivers/infiniband/core/umem.c                               |   21 
 drivers/infiniband/core/umem_odp.c                           |   13 
 drivers/infiniband/hw/hfi1/user_pages.c                      |    4 
 drivers/infiniband/hw/mthca/mthca_memfree.c                  |    8 
 drivers/infiniband/hw/qib/qib_user_pages.c                   |    4 
 drivers/infiniband/hw/qib/qib_user_sdma.c                    |    8 
 drivers/infiniband/hw/usnic/usnic_uiom.c                     |    4 
 drivers/infiniband/sw/siw/siw_mem.c                          |    4 
 drivers/media/v4l2-core/videobuf-dma-sg.c                    |   20 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h             |    1 
 drivers/net/wireless/intel/iwlegacy/4965-mac.c               |    3 
 drivers/net/wireless/intel/iwlegacy/4965.c                   |   17 
 drivers/net/wireless/intel/iwlegacy/common.h                 |    3 
 drivers/net/wireless/intel/iwlwifi/dvm/dev.h                 |    5 
 drivers/net/wireless/intel/iwlwifi/dvm/devices.c             |    6 
 drivers/nvdimm/pmem.c                                        |    6 
 drivers/nvme/host/hwmon.c                                    |   13 
 drivers/platform/goldfish/goldfish_pipe.c                    |   39 
 drivers/platform/x86/asus-wmi.c                              |    7 
 drivers/platform/x86/intel_menlow.c                          |    9 
 drivers/thermal/armada_thermal.c                             |    2 
 drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c |    7 
 drivers/thermal/intel/intel_pch_thermal.c                    |    3 
 drivers/vfio/vfio_iommu_type1.c                              |   39 
 fs/binfmt_elf.c                                              |  154 +--
 fs/btrfs/compression.c                                       |    2 
 fs/btrfs/zlib.c                                              |  135 ++
 fs/exec.c                                                    |    5 
 fs/fs-writeback.c                                            |    2 
 fs/io_uring.c                                                |    6 
 fs/ocfs2/cluster/quorum.c                                    |    2 
 fs/ocfs2/dlm/Makefile                                        |    2 
 fs/ocfs2/dlm/dlmast.c                                        |    8 
 fs/ocfs2/dlm/dlmcommon.h                                     |    4 
 fs/ocfs2/dlm/dlmconvert.c                                    |    8 
 fs/ocfs2/dlm/dlmdebug.c                                      |    8 
 fs/ocfs2/dlm/dlmdomain.c                                     |    8 
 fs/ocfs2/dlm/dlmlock.c                                       |    8 
 fs/ocfs2/dlm/dlmmaster.c                                     |   10 
 fs/ocfs2/dlm/dlmrecovery.c                                   |   10 
 fs/ocfs2/dlm/dlmthread.c                                     |    8 
 fs/ocfs2/dlm/dlmunlock.c                                     |    8 
 fs/ocfs2/dlmfs/Makefile                                      |    2 
 fs/ocfs2/dlmfs/dlmfs.c                                       |    4 
 fs/ocfs2/dlmfs/userdlm.c                                     |    6 
 fs/ocfs2/dlmglue.c                                           |    2 
 fs/ocfs2/journal.h                                           |    8 
 fs/ocfs2/namei.c                                             |    3 
 fs/reiserfs/stree.c                                          |    3 
 include/linux/backing-dev.h                                  |   10 
 include/linux/bitops.h                                       |    1 
 include/linux/fs.h                                           |    6 
 include/linux/io-mapping.h                                   |    5 
 include/linux/memblock.h                                     |    7 
 include/linux/memory.h                                       |   29 
 include/linux/memory_hotplug.h                               |    3 
 include/linux/mm.h                                           |  116 +-
 include/linux/mmzone.h                                       |    2 
 include/linux/page-isolation.h                               |    8 
 include/linux/swab.h                                         |    1 
 include/linux/thermal.h                                      |   11 
 include/linux/units.h                                        |   84 +
 include/linux/zlib.h                                         |    6 
 include/trace/events/kmem.h                                  |    4 
 include/trace/events/writeback.h                             |   37 
 include/uapi/linux/swab.h                                    |   10 
 include/uapi/linux/sysctl.h                                  |    2 
 init/main.c                                                  |   36 
 kernel/Makefile                                              |    1 
 lib/Kconfig                                                  |    7 
 lib/Makefile                                                 |    2 
 lib/decompress_inflate.c                                     |   13 
 lib/find_bit.c                                               |   82 -
 lib/scatterlist.c                                            |    2 
 lib/test_bitmap.c                                            |    9 
 lib/test_kasan.c                                             |    1 
 lib/zlib_deflate/deflate.c                                   |   85 +
 lib/zlib_deflate/deflate_syms.c                              |    1 
 lib/zlib_deflate/deftree.c                                   |   54 -
 lib/zlib_deflate/defutil.h                                   |  134 ++
 lib/zlib_dfltcc/Makefile                                     |   13 
 lib/zlib_dfltcc/dfltcc.c                                     |   57 +
 lib/zlib_dfltcc/dfltcc.h                                     |  155 +++
 lib/zlib_dfltcc/dfltcc_deflate.c                             |  280 ++++++
 lib/zlib_dfltcc/dfltcc_inflate.c                             |  149 +++
 lib/zlib_dfltcc/dfltcc_syms.c                                |   17 
 lib/zlib_dfltcc/dfltcc_util.h                                |  123 ++
 lib/zlib_inflate/inflate.c                                   |   32 
 lib/zlib_inflate/inflate.h                                   |    8 
 lib/zlib_inflate/infutil.h                                   |   18 
 mm/Makefile                                                  |    1 
 mm/backing-dev.c                                             |    1 
 mm/debug.c                                                   |   18 
 mm/early_ioremap.c                                           |    8 
 mm/filemap.c                                                 |   34 
 mm/gup.c                                                     |  503 ++++++-----
 mm/gup_benchmark.c                                           |    9 
 mm/huge_memory.c                                             |   44 
 mm/kmemleak.c                                                |  112 +-
 mm/memblock.c                                                |   22 
 mm/memcontrol.c                                              |   25 
 mm/memory_hotplug.c                                          |   24 
 mm/mempolicy.c                                               |    6 
 mm/memremap.c                                                |   95 --
 mm/migrate.c                                                 |   77 +
 mm/mmap.c                                                    |   30 
 mm/oom_kill.c                                                |    2 
 mm/page_alloc.c                                              |   83 +
 mm/page_isolation.c                                          |   69 -
 mm/page_vma_mapped.c                                         |   12 
 mm/process_vm_access.c                                       |   32 
 mm/slub.c                                                    |   88 +
 mm/sparse.c                                                  |    2 
 mm/swap.c                                                    |   27 
 mm/swapfile.c                                                |    2 
 mm/vmscan.c                                                  |   24 
 mm/zswap.c                                                   |   88 +
 net/xdp/xdp_umem.c                                           |    4 
 scripts/spelling.txt                                         |   14 
 tools/testing/selftests/vm/gup_benchmark.c                   |    6 
 tools/vm/slabinfo.c                                          |    4 
 136 files changed, 2790 insertions(+), 1358 deletions(-)


* [patch 001/118] lib/test_bitmap: correct test data offsets for 32-bit
From: Andrew Morton @ 2020-01-31  6:11 UTC
  To: akpm, andriy.shevchenko, linux-mm, linux, linux, mm-commits,
	stable, torvalds, yury.norov

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: lib/test_bitmap: correct test data offsets for 32-bit

On 32-bit platforms the size of long is only 32 bits, which makes the
offsets into the array of 64-bit test data wrong.

Calculate the offsets based on BITS_PER_LONG.
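
To make the arithmetic concrete, here is a minimal userspace sketch of
the offset calculation (the array layout is illustrative, not the actual
kernel test data):

    #include <stdio.h>

    #define BITS_PER_LONG       ((unsigned int)(sizeof(unsigned long) * 8))
    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

    int main(void)
    {
            unsigned int nbits = 64;
            /* longs needed to hold one 64-bit test pattern */
            unsigned int nlongs = DIV_ROUND_UP(nbits, BITS_PER_LONG);

            /*
             * The test data is an array of unsigned long holding 64-bit
             * patterns back to back, so the i-th pattern starts at index
             * i * nlongs: on 64-bit, nlongs == 1 and &exp2[i] works, but
             * on 32-bit, nlongs == 2 and &exp2[1] points into the middle
             * of the first pattern.
             */
            printf("nlongs = %u, pattern 1 starts at index %u\n",
                   nlongs, 1 * nlongs);
            return 0;
    }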

Link: http://lkml.kernel.org/r/20200109103601.45929-1-andriy.shevchenko@linux.intel.com
Fixes: 30544ed5de43 ("lib/bitmap: introduce bitmap_replace() helper")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_bitmap.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/lib/test_bitmap.c~lib-test_bitmap-correct-test-data-offsets-for-32-bit
+++ a/lib/test_bitmap.c
@@ -275,22 +275,23 @@ static void __init test_copy(void)
 static void __init test_replace(void)
 {
 	unsigned int nbits = 64;
+	unsigned int nlongs = DIV_ROUND_UP(nbits, BITS_PER_LONG);
 	DECLARE_BITMAP(bmap, 1024);
 
 	bitmap_zero(bmap, 1024);
-	bitmap_replace(bmap, &exp2[0], &exp2[1], exp2_to_exp3_mask, nbits);
+	bitmap_replace(bmap, &exp2[0 * nlongs], &exp2[1 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_0_1, nbits);
 
 	bitmap_zero(bmap, 1024);
-	bitmap_replace(bmap, &exp2[1], &exp2[0], exp2_to_exp3_mask, nbits);
+	bitmap_replace(bmap, &exp2[1 * nlongs], &exp2[0 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_1_0, nbits);
 
 	bitmap_fill(bmap, 1024);
-	bitmap_replace(bmap, &exp2[0], &exp2[1], exp2_to_exp3_mask, nbits);
+	bitmap_replace(bmap, &exp2[0 * nlongs], &exp2[1 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_0_1, nbits);
 
 	bitmap_fill(bmap, 1024);
-	bitmap_replace(bmap, &exp2[1], &exp2[0], exp2_to_exp3_mask, nbits);
+	bitmap_replace(bmap, &exp2[1 * nlongs], &exp2[0 * nlongs], exp2_to_exp3_mask, nbits);
 	expect_eq_bitmap(bmap, exp3_1_0, nbits);
 }
 
_


* [patch 002/118] memcg: fix a crash in wb_workfn when a device disappears
From: Andrew Morton @ 2020-01-31  6:11 UTC
  To: akpm, axboe, clm, linux-mm, mm-commits, stable, tj, torvalds, tytso

From: "Theodore Ts'o" <tytso@mit.edu>
Subject: memcg: fix a crash in wb_workfn when a device disappears

Without memcg, there is a one-to-one mapping between the bdi and
bdi_writeback structures.  In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shutdown the
bdi_writeback structure (or wb), and part of that shutdown ensures that
no other work is queued against the wb and that the wb is fully drained.

With memcg, however, there is a one-to-many relationship between the bdi
and bdi_writeback structures; that is, there are multiple wb objects which
can all point to a single bdi.  There is a refcount which prevents the bdi
object from being released (and hence, unregistered).  So in theory, the
bdi_unregister() *should* only get called once its refcount goes to zero
(bdi_put will drop the refcount, and when it is zero, release_bdi gets
called, which calls bdi_unregister).

Unfortunately, del_gendisk() in block/genhd.c never got the memo about
the Brave New memcg World, and calls bdi_unregister directly.  It does
this without informing the file system, or the memcg code, or anything
else.  This causes the root wb associated with the bdi to be unregistered,
but none of the memcg-specific wb's are shut down.  So when one of these
wb's is woken up to do delayed work, it tries to dereference its
wb->bdi->dev to fetch the device name, but unfortunately bdi->dev is now
NULL, thanks to the bdi_unregister() called by del_gendisk().  As a
result, *boom*.
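
For reference, dev_name() offers no protection against a NULL struct
device; the include/linux/device.h helper is roughly (a sketch from
memory, not a verbatim quote):

    static inline const char *dev_name(const struct device *dev)
    {
            /* Use the init name until the kobject becomes available */
            if (dev->init_name)
                    return dev->init_name;

            return kobject_name(&dev->kobj);  /* oops when dev is NULL */
    }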

Fortunately, it looks like the rest of the writeback path is perfectly
happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
create a bdi_dev_name() function which can handle bdi->dev being NULL. 
This also allows us to bulletproof the writeback tracepoints to prevent
them from dereferencing a NULL pointer and crashing the kernel if one is
tracing with memcg's enabled, and an iSCSI device dies or a USB storage
stick is pulled.

The most common way of triggering this will be hot-removal of a device
while writeback with memcg enabled is going on.  It was triggering several
times a day in a heavily loaded production environment.

Google Bug Id: 145475544

Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <clm@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/fs-writeback.c                |    2 -
 include/linux/backing-dev.h      |   10 +++++++
 include/trace/events/writeback.h |   37 +++++++++++++----------------
 mm/backing-dev.c                 |    1 
 4 files changed, 29 insertions(+), 21 deletions(-)

--- a/fs/fs-writeback.c~memcg-fix-a-crash-in-wb_workfn-when-a-device-disappears
+++ a/fs/fs-writeback.c
@@ -2063,7 +2063,7 @@ void wb_workfn(struct work_struct *work)
 						struct bdi_writeback, dwork);
 	long pages_written;
 
-	set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
+	set_worker_desc("flush-%s", bdi_dev_name(wb->bdi));
 	current->flags |= PF_SWAPWRITE;
 
 	if (likely(!current_is_workqueue_rescuer() ||
--- a/include/linux/backing-dev.h~memcg-fix-a-crash-in-wb_workfn-when-a-device-disappears
+++ a/include/linux/backing-dev.h
@@ -13,6 +13,7 @@
 #include <linux/fs.h>
 #include <linux/sched.h>
 #include <linux/blkdev.h>
+#include <linux/device.h>
 #include <linux/writeback.h>
 #include <linux/blk-cgroup.h>
 #include <linux/backing-dev-defs.h>
@@ -504,4 +505,13 @@ static inline int bdi_rw_congested(struc
 				  (1 << WB_async_congested));
 }
 
+extern const char *bdi_unknown_name;
+
+static inline const char *bdi_dev_name(struct backing_dev_info *bdi)
+{
+	if (!bdi || !bdi->dev)
+		return bdi_unknown_name;
+	return dev_name(bdi->dev);
+}
+
 #endif	/* _LINUX_BACKING_DEV_H */
--- a/include/trace/events/writeback.h~memcg-fix-a-crash-in-wb_workfn-when-a-device-disappears
+++ a/include/trace/events/writeback.h
@@ -67,8 +67,8 @@ DECLARE_EVENT_CLASS(writeback_page_templ
 
 	TP_fast_assign(
 		strscpy_pad(__entry->name,
-			    mapping ? dev_name(inode_to_bdi(mapping->host)->dev) : "(unknown)",
-			    32);
+			    bdi_dev_name(mapping ? inode_to_bdi(mapping->host) :
+					 NULL), 32);
 		__entry->ino = mapping ? mapping->host->i_ino : 0;
 		__entry->index = page->index;
 	),
@@ -111,8 +111,7 @@ DECLARE_EVENT_CLASS(writeback_dirty_inod
 		struct backing_dev_info *bdi = inode_to_bdi(inode);
 
 		/* may be called for files on pseudo FSes w/ unregistered bdi */
-		strscpy_pad(__entry->name,
-			    bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
+		strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
 		__entry->ino		= inode->i_ino;
 		__entry->state		= inode->i_state;
 		__entry->flags		= flags;
@@ -193,7 +192,7 @@ TRACE_EVENT(inode_foreign_history,
 	),
 
 	TP_fast_assign(
-		strncpy(__entry->name, dev_name(inode_to_bdi(inode)->dev), 32);
+		strncpy(__entry->name, bdi_dev_name(inode_to_bdi(inode)), 32);
 		__entry->ino		= inode->i_ino;
 		__entry->cgroup_ino	= __trace_wbc_assign_cgroup(wbc);
 		__entry->history	= history;
@@ -222,7 +221,7 @@ TRACE_EVENT(inode_switch_wbs,
 	),
 
 	TP_fast_assign(
-		strncpy(__entry->name,	dev_name(old_wb->bdi->dev), 32);
+		strncpy(__entry->name,	bdi_dev_name(old_wb->bdi), 32);
 		__entry->ino		= inode->i_ino;
 		__entry->old_cgroup_ino	= __trace_wb_assign_cgroup(old_wb);
 		__entry->new_cgroup_ino	= __trace_wb_assign_cgroup(new_wb);
@@ -255,7 +254,7 @@ TRACE_EVENT(track_foreign_dirty,
 		struct address_space *mapping = page_mapping(page);
 		struct inode *inode = mapping ? mapping->host : NULL;
 
-		strncpy(__entry->name,	dev_name(wb->bdi->dev), 32);
+		strncpy(__entry->name,	bdi_dev_name(wb->bdi), 32);
 		__entry->bdi_id		= wb->bdi->id;
 		__entry->ino		= inode ? inode->i_ino : 0;
 		__entry->memcg_id	= wb->memcg_css->id;
@@ -288,7 +287,7 @@ TRACE_EVENT(flush_foreign,
 	),
 
 	TP_fast_assign(
-		strncpy(__entry->name,	dev_name(wb->bdi->dev), 32);
+		strncpy(__entry->name,	bdi_dev_name(wb->bdi), 32);
 		__entry->cgroup_ino	= __trace_wb_assign_cgroup(wb);
 		__entry->frn_bdi_id	= frn_bdi_id;
 		__entry->frn_memcg_id	= frn_memcg_id;
@@ -318,7 +317,7 @@ DECLARE_EVENT_CLASS(writeback_write_inod
 
 	TP_fast_assign(
 		strscpy_pad(__entry->name,
-			    dev_name(inode_to_bdi(inode)->dev), 32);
+			    bdi_dev_name(inode_to_bdi(inode)), 32);
 		__entry->ino		= inode->i_ino;
 		__entry->sync_mode	= wbc->sync_mode;
 		__entry->cgroup_ino	= __trace_wbc_assign_cgroup(wbc);
@@ -361,9 +360,7 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__field(ino_t, cgroup_ino)
 	),
 	TP_fast_assign(
-		strscpy_pad(__entry->name,
-			    wb->bdi->dev ? dev_name(wb->bdi->dev) :
-			    "(unknown)", 32);
+		strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
 		__entry->nr_pages = work->nr_pages;
 		__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
 		__entry->sync_mode = work->sync_mode;
@@ -416,7 +413,7 @@ DECLARE_EVENT_CLASS(writeback_class,
 		__field(ino_t, cgroup_ino)
 	),
 	TP_fast_assign(
-		strscpy_pad(__entry->name, dev_name(wb->bdi->dev), 32);
+		strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
 		__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
 	),
 	TP_printk("bdi %s: cgroup_ino=%lu",
@@ -438,7 +435,7 @@ TRACE_EVENT(writeback_bdi_register,
 		__array(char, name, 32)
 	),
 	TP_fast_assign(
-		strscpy_pad(__entry->name, dev_name(bdi->dev), 32);
+		strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
 	),
 	TP_printk("bdi %s",
 		__entry->name
@@ -463,7 +460,7 @@ DECLARE_EVENT_CLASS(wbc_class,
 	),
 
 	TP_fast_assign(
-		strscpy_pad(__entry->name, dev_name(bdi->dev), 32);
+		strscpy_pad(__entry->name, bdi_dev_name(bdi), 32);
 		__entry->nr_to_write	= wbc->nr_to_write;
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->sync_mode	= wbc->sync_mode;
@@ -514,7 +511,7 @@ TRACE_EVENT(writeback_queue_io,
 	),
 	TP_fast_assign(
 		unsigned long *older_than_this = work->older_than_this;
-		strscpy_pad(__entry->name, dev_name(wb->bdi->dev), 32);
+		strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32);
 		__entry->older	= older_than_this ?  *older_than_this : 0;
 		__entry->age	= older_than_this ?
 				  (jiffies - *older_than_this) * 1000 / HZ : -1;
@@ -600,7 +597,7 @@ TRACE_EVENT(bdi_dirty_ratelimit,
 	),
 
 	TP_fast_assign(
-		strscpy_pad(__entry->bdi, dev_name(wb->bdi->dev), 32);
+		strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
 		__entry->write_bw	= KBps(wb->write_bandwidth);
 		__entry->avg_write_bw	= KBps(wb->avg_write_bandwidth);
 		__entry->dirty_rate	= KBps(dirty_rate);
@@ -665,7 +662,7 @@ TRACE_EVENT(balance_dirty_pages,
 
 	TP_fast_assign(
 		unsigned long freerun = (thresh + bg_thresh) / 2;
-		strscpy_pad(__entry->bdi, dev_name(wb->bdi->dev), 32);
+		strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
 
 		__entry->limit		= global_wb_domain.dirty_limit;
 		__entry->setpoint	= (global_wb_domain.dirty_limit +
@@ -726,7 +723,7 @@ TRACE_EVENT(writeback_sb_inodes_requeue,
 
 	TP_fast_assign(
 		strscpy_pad(__entry->name,
-			    dev_name(inode_to_bdi(inode)->dev), 32);
+			    bdi_dev_name(inode_to_bdi(inode)), 32);
 		__entry->ino		= inode->i_ino;
 		__entry->state		= inode->i_state;
 		__entry->dirtied_when	= inode->dirtied_when;
@@ -800,7 +797,7 @@ DECLARE_EVENT_CLASS(writeback_single_ino
 
 	TP_fast_assign(
 		strscpy_pad(__entry->name,
-			    dev_name(inode_to_bdi(inode)->dev), 32);
+			    bdi_dev_name(inode_to_bdi(inode)), 32);
 		__entry->ino		= inode->i_ino;
 		__entry->state		= inode->i_state;
 		__entry->dirtied_when	= inode->dirtied_when;
--- a/mm/backing-dev.c~memcg-fix-a-crash-in-wb_workfn-when-a-device-disappears
+++ a/mm/backing-dev.c
@@ -21,6 +21,7 @@ struct backing_dev_info noop_backing_dev
 EXPORT_SYMBOL_GPL(noop_backing_dev_info);
 
 static struct class *bdi_class;
+const char *bdi_unknown_name = "(unknown)";
 
 /*
  * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU
_


* [patch 003/118] mm/mempolicy.c: fix out of bounds write in mpol_parse_str()
From: Andrew Morton @ 2020-01-31  6:11 UTC
  To: aarcange, akpm, dan.carpenter, hughd, lee.schermerhorn, linux-mm,
	mhocko, mm-commits, stable, torvalds, vbabka

From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: mm/mempolicy.c: fix out of bounds write in mpol_parse_str()

What we are trying to do is change the '=' character to a NUL terminator
and then at the end of the function we restore it back to an '='.  The
problem is that there are two error paths where we jump to the end of the
function before we have replaced the '=' with NUL.  We end up putting the
'=' in the wrong place (possibly one element before the start of the
buffer).
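
To see the bug, here is a simplified sketch of the pre-fix control flow
(illustrative, not the kernel source verbatim):

    char *nodelist = strchr(str, ':');
    char *flags = strchr(str, '=');

    if (nodelist) {
            *nodelist++ = '\0';
            if (nodelist_parse(nodelist, nodes))
                    goto out;       /* flags has not been advanced yet */
            ...
    }

    if (flags)
            *flags++ = '\0';        /* terminate mode string */
    ...
    out:
    /* Restore string for error message */
    if (nodelist)
            *--nodelist = ':';
    if (flags)
            *--flags = '=';         /* if flags was never advanced, this
                                     * writes one byte before the '=',
                                     * possibly before the buffer */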

Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Dmitry Vyukov <dvyukov@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/mempolicy.c~mm-mempolicyc-fix-out-of-bounds-write-in-mpol_parse_str
+++ a/mm/mempolicy.c
@@ -2821,6 +2821,9 @@ int mpol_parse_str(char *str, struct mem
 	char *flags = strchr(str, '=');
 	int err = 1, mode;
 
+	if (flags)
+		*flags++ = '\0';	/* terminate mode string */
+
 	if (nodelist) {
 		/* NUL-terminate mode or flags string */
 		*nodelist++ = '\0';
@@ -2831,9 +2834,6 @@ int mpol_parse_str(char *str, struct mem
 	} else
 		nodes_clear(nodes);
 
-	if (flags)
-		*flags++ = '\0';	/* terminate mode string */


* [patch 004/118] mm/sparse.c: reset section's mem_map when fully deactivated
From: Andrew Morton @ 2020-01-31  6:11 UTC
  To: akpm, bhe, cai, dan.j.williams, david, k-hagio, kernelfans,
	linux-mm, mhocko, mm-commits, osalvador, stable, torvalds

From: Pingfan Liu <kernelfans@gmail.com>
Subject: mm/sparse.c: reset section's mem_map when fully deactivated

After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"),
when a mem section is fully deactivated, section_mem_map still records the
section's start pfn, which is not used any more and will be reassigned
during re-addition.

By analogy with the alloc/free pattern, it is better to clear all fields
of section_mem_map.

Besides this, the stale value breaks the user space tool "makedumpfile"
[1], which assumes that a hot-removed section has mem_map as NULL instead
of checking directly against the SECTION_MARKED_PRESENT bit.
(makedumpfile would do better to drop that assumption, which needs a
separate patch.)

The bug can be reproduced on IBM POWERVM by running "drmgr -c mem -r -q 5",
triggering a crash, and saving the vmcore with makedumpfile.

[1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")

Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/sparse.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/sparse.c~mm-sparse-reset-sections-mem_map-when-fully-deactivated
+++ a/mm/sparse.c
@@ -789,7 +789,7 @@ static void section_deactivate(unsigned
 			ms->usage = NULL;
 		}
 		memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
-		ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr);
+		ms->section_mem_map = (unsigned long)NULL;
 	}
 
 	if (section_is_early && memmap)
_


* [patch 005/118] mm/migrate.c: also overwrite error when it is bigger than zero
From: Andrew Morton @ 2020-01-31  6:11 UTC
  To: akpm, cl, jhubbard, linux-mm, mhocko, mm-commits, richardw.yang,
	stable, torvalds, vbabka, yang.shi

From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/migrate.c: also overwrite error when it is bigger than zero

If we get here after successfully adding a page to the list, err is 1,
indicating that the page is queued in the list.

Current code has two problems:

  * on success, 0 is not returned
  * on error, if add_page_for_migration() returns 1 and the following err1
    from do_move_pages_to_node() is set, err1 is not returned since err
    is 1

And these behaviors break the user interface.

Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/migrate.c~mm-migratec-also-overwrite-error-when-it-is-bigger-than-zero
+++ a/mm/migrate.c
@@ -1676,7 +1676,7 @@ out_flush:
 	err1 = do_move_pages_to_node(mm, &pagelist, current_node);
 	if (!err1)
 		err1 = store_status(status, start, current_node, i - start);
-	if (!err)
+	if (err >= 0)
 		err = err1;
 out:
 	return err;
_


* [patch 006/118] mm/memory_hotplug: fix remove_memory() lockdep splat
From: Andrew Morton @ 2020-01-31  6:11 UTC
  To: akpm, dan.j.williams, dave.hansen, david, linux-mm, mhocko,
	mm-commits, pasha.tatashin, stable, torvalds, vishal.l.verma

From: Dan Williams <dan.j.williams@intel.com>
Subject: mm/memory_hotplug: fix remove_memory() lockdep splat

The daxctl unit test for the dax_kmem driver currently triggers the (false
positive) lockdep splat below.  It results from the fact that
remove_memory_block_devices() is invoked under the mem_hotplug_lock()
causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
active state tracking).  It is a false positive because the sysfs
attribute path triggering the memory remove is not the same attribute path
associated with the memory-block device.

sysfs_break_active_protection() is not applicable since there is no real
deadlock conflict; instead, move memory-block device removal outside the
lock.  The mem_hotplug_lock() is not needed to synchronize the
memory-block device removal vs the page online state, that is already
handled by lock_device_hotplug().  Specifically, lock_device_hotplug() is
sufficient to allow try_remove_memory() to check the offline state of the
memblocks and be assured that any in progress online attempts are flushed
/ blocked by kernfs_drain() / attribute removal.

The add_memory() path safely creates memblock devices under the
mem_hotplug_lock().  There is no kernfs active state synchronization in
the memblock device_register() path, so nothing to fix there.
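
In sketch form, the reordered removal path looks like this (simplified
and abridged, not the exact resulting code):

    static int __ref try_remove_memory(int nid, u64 start, u64 size)
    {
            ...
            /*
             * Memory block device removal under the device_hotplug_lock
             * is a barrier against concurrent online attempts, so it can
             * happen before mem_hotplug_lock is taken.
             */
            remove_memory_block_devices(start, size);

            mem_hotplug_begin();
            arch_remove_memory(nid, start, size, NULL);
            ...
            mem_hotplug_done();
            return 0;
    }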

This change is only possible thanks to the recent change that refactored
memory block device removal out of arch_remove_memory() (commit
4c4b7f9ba948 ("mm/memory_hotplug: remove memory block devices before
arch_remove_memory()")), and David's due diligence tracking down the
guarantees afforded by kernfs_drain().  Not flagged for -stable since this
only impacts ongoing development and lockdep validation, not a runtime
issue.

    ======================================================
    WARNING: possible circular locking dependency detected
    5.5.0-rc3+ #230 Tainted: G           OE
    ------------------------------------------------------
    lt-daxctl/6459 is trying to acquire lock:
    ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80

    but task is already holding lock:
    ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (mem_hotplug_lock.rw_sem){++++}:
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           get_online_mems+0x3e/0xb0
           kmem_cache_create_usercopy+0x2e/0x260
           kmem_cache_create+0x12/0x20
           ptlock_cache_init+0x20/0x28
           start_kernel+0x243/0x547
           secondary_startup_64+0xb6/0xc0

    -> #1 (cpu_hotplug_lock.rw_sem){++++}:
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           cpus_read_lock+0x3e/0xb0
           online_pages+0x37/0x300
           memory_subsys_online+0x17d/0x1c0
           device_online+0x60/0x80
           state_store+0x65/0xd0
           kernfs_fop_write+0xcf/0x1c0
           vfs_write+0xdb/0x1d0
           ksys_write+0x65/0xe0
           do_syscall_64+0x5c/0xa0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#241){++++}:
           check_prev_add+0x98/0xa40
           validate_chain+0x576/0x860
           __lock_acquire+0x39c/0x790
           lock_acquire+0xa2/0x1b0
           __kernfs_remove+0x25f/0x2e0
           kernfs_remove_by_name_ns+0x41/0x80
           remove_files.isra.0+0x30/0x70
           sysfs_remove_group+0x3d/0x80
           sysfs_remove_groups+0x29/0x40
           device_remove_attrs+0x39/0x70
           device_del+0x16a/0x3f0
           device_unregister+0x16/0x60
           remove_memory_block_devices+0x82/0xb0
           try_remove_memory+0xb5/0x130
           remove_memory+0x26/0x40
           dev_dax_kmem_remove+0x44/0x6a [kmem]
           device_release_driver_internal+0xe4/0x1c0
           unbind_store+0xef/0x120
           kernfs_fop_write+0xcf/0x1c0
           vfs_write+0xdb/0x1d0
           ksys_write+0x65/0xe0
           do_syscall_64+0x5c/0xa0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Chain exists of:
      kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(mem_hotplug_lock.rw_sem);
                                   lock(cpu_hotplug_lock.rw_sem);
                                   lock(mem_hotplug_lock.rw_sem);
      lock(kn->count#241);

     *** DEADLOCK ***

No fixes tag as this has been a long standing issue that predated the
addition of kernfs lockdep annotations.

Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-fix-remove_memory-lockdep-splat
+++ a/mm/memory_hotplug.c
@@ -1764,8 +1764,6 @@ static int __ref try_remove_memory(int n
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
-	mem_hotplug_begin();


* [patch 007/118] mm: thp: don't need care deferred split queue in memcg charge move path
From: Andrew Morton @ 2020-01-31  6:11 UTC
  To: akpm, hannes, kirill.shutemov, linux-mm, mhocko, mm-commits,
	richardw.yang, rientjes, stable, torvalds, vdavydov.dev,
	yang.shi

From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm: thp: don't need care deferred split queue in memcg charge move path

If compound is true, the page is a PMD-mapped THP, which implies it is
not linked to any defer list, so the first code chunk will not be
executed.

For the same reason, it would not be proper to add this page to a defer
list, so the second code chunk is not correct either.

Based on this, we should remove the defer list related code.

[yang.shi@linux.alibaba.com: better patch title]
Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>    [5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   18 ------------------
 1 file changed, 18 deletions(-)

--- a/mm/memcontrol.c~mm-thp-remove-the-defer-list-related-code-since-this-will-not-happen
+++ a/mm/memcontrol.c
@@ -5340,14 +5340,6 @@ static int mem_cgroup_move_account(struc
 		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
 	}
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (compound && !list_empty(page_deferred_list(page))) {
-		spin_lock(&from->deferred_split_queue.split_queue_lock);
-		list_del_init(page_deferred_list(page));
-		from->deferred_split_queue.split_queue_len--;
-		spin_unlock(&from->deferred_split_queue.split_queue_lock);
-	}
-#endif
 	/*
 	 * It is safe to change page->mem_cgroup here because the page
 	 * is referenced, charged, and isolated - we can't race with
@@ -5357,16 +5349,6 @@ static int mem_cgroup_move_account(struc
 	/* caller should have done css_get */
 	page->mem_cgroup = to;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (compound && list_empty(page_deferred_list(page))) {
-		spin_lock(&to->deferred_split_queue.split_queue_lock);
-		list_add_tail(page_deferred_list(page),
-			      &to->deferred_split_queue.split_queue);
-		to->deferred_split_queue.split_queue_len++;
-		spin_unlock(&to->deferred_split_queue.split_queue_lock);
-	}
-#endif


* [patch 008/118] mm: move_pages: report the number of non-attempted pages
From: Andrew Morton @ 2020-01-31  6:11 UTC
  To: akpm, linux-mm, mhocko, mm-commits, richardw.yang, stable,
	torvalds, yang.shi

From: Yang Shi <yang.shi@linux.alibaba.com>
Subject: mm: move_pages: report the number of non-attempted pages

Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the
semantics of move_pages() have changed: it returns the number of
non-migrated pages if they failed for non-fatal reasons (usually a busy
page).  This was an unintentional change that went unnoticed except by
LTP tests, which checked for the documented behavior.

There are two ways to deal with this change.  We could get back to the
original behavior and return -EAGAIN whenever migrate_pages() is not able
to migrate pages due to non-fatal reasons.  The other option is to
continue with the changed semantics and extend the move_pages()
documentation to clarify that -errno is returned on an invalid input or
when migration simply cannot succeed (e.g.  -ENOMEM, -EBUSY), while a
positive return value is the number of pages that couldn't be migrated
due to ephemeral reasons (e.g.  the page is pinned or locked for other
reasons).

This patch implements the second option because this behavior has been in
place for some time without anybody complaining, and new users may
already depend on it.  It also allows slightly easier error handling, as
the caller knows that it is worth retrying when err > 0.

But since under the new semantics we abort immediately when migration
fails due to ephemeral reasons, we need to include the number of
non-attempted pages in the return value too.
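
From the caller's side, the resulting contract looks as follows (a
hedged userspace sketch using the libnuma move_pages() wrapper; the
helper name is made up; build with -lnuma):

    #include <numaif.h>
    #include <stdio.h>

    static long migrate_with_retry_hint(void **pages, int *nodes,
                                        int *status, unsigned long nr)
    {
            long ret = move_pages(0, nr, pages, nodes, status,
                                  MPOL_MF_MOVE);
            if (ret < 0) {
                    /* -errno: invalid input or migration cannot succeed */
                    perror("move_pages");
                    return ret;
            }
            /*
             * ret > 0: that many pages were not migrated; some failed for
             * ephemeral reasons (busy, pinned) and the rest were never
             * attempted.  Worth retrying; per-page results are in
             * status[].  ret == 0: everything migrated.
             */
            return ret;
    }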

Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: <stable@vger.kernel.org>    [4.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |   25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

--- a/mm/migrate.c~mm-move_pages-report-the-number-of-non-attempted-pages
+++ a/mm/migrate.c
@@ -1627,8 +1627,19 @@ static int do_pages_move(struct mm_struc
 			start = i;
 		} else if (node != current_node) {
 			err = do_move_pages_to_node(mm, &pagelist, current_node);
-			if (err)
+			if (err) {
+				/*
+				 * Positive err means the number of failed
+				 * pages to migrate.  Since we are going to
+				 * abort and return the number of non-migrated
+				 * pages, we need to include the rest of the
+				 * nr_pages that have not been attempted as
+				 * well.
+				 */
+				if (err > 0)
+					err += nr_pages - i - 1;
 				goto out;
+			}
 			err = store_status(status, start, current_node, i - start);
 			if (err)
 				goto out;
@@ -1659,8 +1670,11 @@ static int do_pages_move(struct mm_struc
 			goto out_flush;
 
 		err = do_move_pages_to_node(mm, &pagelist, current_node);
-		if (err)
+		if (err) {
+			if (err > 0)
+				err += nr_pages - i - 1;
 			goto out;
+		}
 		if (i > start) {
 			err = store_status(status, start, current_node, i - start);
 			if (err)
@@ -1674,6 +1688,13 @@ out_flush:
 
 	/* Make sure we do not overwrite the existing error */
 	err1 = do_move_pages_to_node(mm, &pagelist, current_node);
+	/*
+	 * Don't have to report non-attempted pages here since:
+	 *     - If the above loop is done gracefully all pages have been
+	 *       attempted.
+	 *     - If the above loop is aborted it means a fatal error
+	 *       happened, so we should return err.
+	 */
 	if (!err1)
 		err1 = store_status(status, start, current_node, i - start);
 	if (err >= 0)
_


* [patch 009/118] scripts/spelling.txt: add more spellings to spelling.txt
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, chris.paterson2, colin.king, linux-mm, mm-commits,
	paul.walmsley, sboyd, torvalds, xndchn

From: Xiong <xndchn@gmail.com>
Subject: scripts/spelling.txt: add more spellings to spelling.txt

Here are some of the common spelling mistakes and typos that I've found
while fixing up spelling mistakes in the kernel.  Most of them still exist
in more than two source files.

Link: http://lkml.kernel.org/r/20191229143626.51238-1-xndchn@gmail.com
Signed-off-by: Xiong <xndchn@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Chris Paterson <chris.paterson2@renesas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/spelling.txt |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- a/scripts/spelling.txt~scripts-spellingtxt-add-more-spellings-to-spellingtxt
+++ a/scripts/spelling.txt
@@ -39,6 +39,8 @@ accout||account
 accquire||acquire
 accquired||acquired
 accross||across
+accumalate||accumulate
+accumalator||accumulator
 acessable||accessible
 acess||access
 acessing||accessing
@@ -106,6 +108,7 @@ alogrithm||algorithm
 alot||a lot
 alow||allow
 alows||allows
+alreay||already
 alredy||already
 altough||although
 alue||value
@@ -241,6 +244,7 @@ calender||calendar
 calescing||coalescing
 calle||called
 callibration||calibration
+callled||called
 calucate||calculate
 calulate||calculate
 cancelation||cancellation
@@ -311,6 +315,7 @@ compaibility||compatibility
 comparsion||comparison
 compatability||compatibility
 compatable||compatible
+compatibililty||compatibility
 compatibiliy||compatibility
 compatibilty||compatibility
 compatiblity||compatibility
@@ -330,6 +335,7 @@ comunication||communication
 conbination||combination
 conditionaly||conditionally
 conditon||condition
+condtion||condition
 conected||connected
 conector||connector
 connecetd||connected
@@ -388,6 +394,8 @@ dafault||default
 deafult||default
 deamon||daemon
 debouce||debounce
+decendant||descendant
+decendants||descendants
 decompres||decompress
 decsribed||described
 decription||description
@@ -411,11 +419,13 @@ delare||declare
 delares||declares
 delaring||declaring
 delemiter||delimiter
+delievered||delivered
 demodualtor||demodulator
 demension||dimension
 dependancies||dependencies
 dependancy||dependency
 dependant||dependent
+dependend||dependent
 depreacted||deprecated
 depreacte||deprecate
 desactivate||deactivate
@@ -995,6 +1005,7 @@ peice||piece
 pendantic||pedantic
 peprocessor||preprocessor
 perfoming||performing
+perfomring||performing
 peripherial||peripheral
 permissons||permissions
 peroid||period
@@ -1166,6 +1177,8 @@ retreive||retrieve
 retreiving||retrieving
 retrive||retrieve
 retrived||retrieved
+retrun||return
+retun||return
 retuned||returned
 reudce||reduce
 reuest||request
_


* [patch 010/118] scripts/spelling.txt: add "issus" typo
  2020-01-31  6:10 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2020-01-31  6:11 ` [patch 009/118] scripts/spelling.txt: add more spellings to spelling.txt Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:11 ` [patch 011/118] fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres Andrew Morton
                   ` (107 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, linux-mm, luca, mm-commits, torvalds

From: Luca Ceresoli <luca@lucaceresoli.net>
Subject: scripts/spelling.txt: add "issus" typo

Add "issus" and correct it as "issues".

Link: http://lkml.kernel.org/r/20200105221950.8384-1-luca@lucaceresoli.net
Signed-off-by: Luca Ceresoli <luca@lucaceresoli.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/spelling.txt |    1 +
 1 file changed, 1 insertion(+)

--- a/scripts/spelling.txt~scripts-spellingtxt-add-issus-typo
+++ a/scripts/spelling.txt
@@ -801,6 +801,7 @@ ireelevant||irrelevant
 irrelevent||irrelevant
 isnt||isn't
 isssue||issue
+issus||issues
 iternations||iterations
 itertation||iteration
 itslef||itself
_


* [patch 011/118] fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres
  2020-01-31  6:10 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2020-01-31  6:11 ` [patch 010/118] scripts/spelling.txt: add "issus" typo Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:11 ` [patch 012/118] ocfs2: remove unneeded semicolons Andrew Morton
                   ` (106 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jiangqi903, jlbec, junxiao.bi, linux-mm,
	mark, mm-commits, pakki001, piaojun, torvalds

From: Aditya Pakki <pakki001@umn.edu>
Subject: fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres

In the only caller of dlm_migrate_lockres(), dlm_empty_lockres(), target
is already checked against O2NM_MAX_NODES.  Thus, the assertion in
dlm_migrate_lockres() is unnecessary; this patch removes it.

Link: http://lkml.kernel.org/r/20191218194111.26041-1-pakki001@umn.edu
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/dlm/dlmmaster.c |    2 --
 1 file changed, 2 deletions(-)

--- a/fs/ocfs2/dlm/dlmmaster.c~fs-ocfs-remove-unnecessary-assertion-in-dlm_migrate_lockres
+++ a/fs/ocfs2/dlm/dlmmaster.c
@@ -2554,8 +2554,6 @@ static int dlm_migrate_lockres(struct dl
 	if (!dlm_grab(dlm))
 		return -EINVAL;
 
-	BUG_ON(target == O2NM_MAX_NODES);


* [patch 012/118] ocfs2: remove unneeded semicolons
  2020-01-31  6:10 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2020-01-31  6:11 ` [patch 011/118] fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:11 ` [patch 013/118] ocfs2: make local header paths relative to C files Andrew Morton
                   ` (105 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, hulkci, jlbec, joseph.qi, junxiao.bi,
	linux-mm, mark, mm-commits, piaojun, torvalds, zhengbin13

From: zhengbin <zhengbin13@huawei.com>
Subject: ocfs2: remove unneeded semicolons

Fixes coccicheck warnings:

fs/ocfs2/cluster/quorum.c:76:2-3: Unneeded semicolon
fs/ocfs2/dlmglue.c:573:2-3: Unneeded semicolon

Link: http://lkml.kernel.org/r/6ee3aa16-9078-30b1-df3f-22064950bd98@linux.alibaba.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/cluster/quorum.c |    2 +-
 fs/ocfs2/dlmglue.c        |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/fs/ocfs2/cluster/quorum.c~ocfs2-remove-unneeded-semicolon
+++ a/fs/ocfs2/cluster/quorum.c
@@ -73,7 +73,7 @@ static void o2quo_fence_self(void)
 		       "system by restarting ***\n");
 		emergency_restart();
 		break;
-	};
+	}
 }
 
 /* Indicate that a timeout occurred on a heartbeat region write. The
--- a/fs/ocfs2/dlmglue.c~ocfs2-remove-unneeded-semicolon
+++ a/fs/ocfs2/dlmglue.c
@@ -570,7 +570,7 @@ void ocfs2_inode_lock_res_init(struct oc
 			mlog_bug_on_msg(1, "type: %d\n", type);
 			ops = NULL; /* thanks, gcc */
 			break;
-	};
+	}
 
 	ocfs2_build_lock_name(type, OCFS2_I(inode)->ip_blkno,
 			      generation, res->l_name);
_


* [patch 013/118] ocfs2: make local header paths relative to C files
  2020-01-31  6:10 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2020-01-31  6:11 ` [patch 012/118] ocfs2: remove unneeded semicolons Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:11 ` [patch 014/118] ocfs2/dlm: remove redundant assignment to ret Andrew Morton
                   ` (104 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jiangqi903, jlbec, junxiao.bi, linux-mm,
	mark, masahiroy, mm-commits, piaojun, torvalds

From: Masahiro Yamada <masahiroy@kernel.org>
Subject: ocfs2: make local header paths relative to C files

Gang He reports a failure when building fs/ocfs2/ as an external module
against the kernel installed on the system:

 $ cd fs/ocfs2
 $ make -C /lib/modules/`uname -r`/build M=`pwd` modules

If you want to make it work reliably, I'd recommend removing ccflags-y
from the Makefiles and making the header paths relative to the C files.
I think this is the correct usage of the #include "..." directive.

Link: http://lkml.kernel.org/r/20191227022950.14804-1-ghe@suse.com
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Gang He <ghe@suse.com>
Reported-by: Gang He <ghe@suse.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/dlm/Makefile      |    2 --
 fs/ocfs2/dlm/dlmast.c      |    8 ++++----
 fs/ocfs2/dlm/dlmconvert.c  |    8 ++++----
 fs/ocfs2/dlm/dlmdebug.c    |    8 ++++----
 fs/ocfs2/dlm/dlmdomain.c   |    8 ++++----
 fs/ocfs2/dlm/dlmlock.c     |    8 ++++----
 fs/ocfs2/dlm/dlmmaster.c   |    8 ++++----
 fs/ocfs2/dlm/dlmrecovery.c |    8 ++++----
 fs/ocfs2/dlm/dlmthread.c   |    8 ++++----
 fs/ocfs2/dlm/dlmunlock.c   |    8 ++++----
 fs/ocfs2/dlmfs/Makefile    |    2 --
 fs/ocfs2/dlmfs/dlmfs.c     |    4 ++--
 fs/ocfs2/dlmfs/userdlm.c   |    6 +++---
 13 files changed, 41 insertions(+), 45 deletions(-)

--- a/fs/ocfs2/dlm/dlmast.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmast.c
@@ -23,15 +23,15 @@
 #include <linux/spinlock.h>
 
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
 
 #define MLOG_MASK_PREFIX ML_DLM
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 static void dlm_update_lvb(struct dlm_ctxt *dlm, struct dlm_lock_resource *res,
 			   struct dlm_lock *lock);
--- a/fs/ocfs2/dlm/dlmconvert.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmconvert.c
@@ -23,9 +23,9 @@
 #include <linux/spinlock.h>
 
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
@@ -33,7 +33,7 @@
 #include "dlmconvert.h"
 
 #define MLOG_MASK_PREFIX ML_DLM
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 /* NOTE: __dlmconvert_master is the only function in here that
  * needs a spinlock held on entry (res->spinlock) and it is the
--- a/fs/ocfs2/dlm/dlmdebug.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmdebug.c
@@ -17,9 +17,9 @@
 #include <linux/debugfs.h>
 #include <linux/export.h>
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
@@ -27,7 +27,7 @@
 #include "dlmdebug.h"
 
 #define MLOG_MASK_PREFIX ML_DLM
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 static int stringify_lockname(const char *lockname, int locklen, char *buf,
 			      int len);
--- a/fs/ocfs2/dlm/dlmdomain.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmdomain.c
@@ -20,9 +20,9 @@
 #include <linux/debugfs.h>
 #include <linux/sched/signal.h>
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
@@ -30,7 +30,7 @@
 #include "dlmdebug.h"
 
 #define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_DOMAIN)
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 /*
  * ocfs2 node maps are array of long int, which limits to send them freely
--- a/fs/ocfs2/dlm/dlmlock.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmlock.c
@@ -25,9 +25,9 @@
 #include <linux/delay.h>
 
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
@@ -35,7 +35,7 @@
 #include "dlmconvert.h"
 
 #define MLOG_MASK_PREFIX ML_DLM
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 static struct kmem_cache *dlm_lock_cache;
 
--- a/fs/ocfs2/dlm/dlmmaster.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmmaster.c
@@ -25,9 +25,9 @@
 #include <linux/delay.h>
 
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
@@ -35,7 +35,7 @@
 #include "dlmdebug.h"
 
 #define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_MASTER)
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 static void dlm_mle_node_down(struct dlm_ctxt *dlm,
 			      struct dlm_master_list_entry *mle,
--- a/fs/ocfs2/dlm/dlmrecovery.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmrecovery.c
@@ -26,16 +26,16 @@
 #include <linux/delay.h>
 
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
 #include "dlmdomain.h"
 
 #define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_RECOVERY)
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node);
 
--- a/fs/ocfs2/dlm/dlmthread.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmthread.c
@@ -25,16 +25,16 @@
 #include <linux/delay.h>
 
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
 #include "dlmdomain.h"
 
 #define MLOG_MASK_PREFIX (ML_DLM|ML_DLM_THREAD)
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 static int dlm_thread(void *data);
 static void dlm_flush_asts(struct dlm_ctxt *dlm);
--- a/fs/ocfs2/dlm/dlmunlock.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/dlmunlock.c
@@ -23,15 +23,15 @@
 #include <linux/spinlock.h>
 #include <linux/delay.h>
 
-#include "cluster/heartbeat.h"
-#include "cluster/nodemanager.h"
-#include "cluster/tcp.h"
+#include "../cluster/heartbeat.h"
+#include "../cluster/nodemanager.h"
+#include "../cluster/tcp.h"
 
 #include "dlmapi.h"
 #include "dlmcommon.h"
 
 #define MLOG_MASK_PREFIX ML_DLM
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 #define DLM_UNLOCK_FREE_LOCK           0x00000001
 #define DLM_UNLOCK_CALL_AST            0x00000002
--- a/fs/ocfs2/dlmfs/dlmfs.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlmfs/dlmfs.c
@@ -33,11 +33,11 @@
 
 #include <linux/uaccess.h>
 
-#include "stackglue.h"
+#include "../stackglue.h"
 #include "userdlm.h"
 
 #define MLOG_MASK_PREFIX ML_DLMFS
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 
 static const struct super_operations dlmfs_ops;
--- a/fs/ocfs2/dlmfs/Makefile~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlmfs/Makefile
@@ -1,6 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
-ccflags-y := -I $(srctree)/$(src)/..
-
 obj-$(CONFIG_OCFS2_FS) += ocfs2_dlmfs.o
 
 ocfs2_dlmfs-objs := userdlm.o dlmfs.o
--- a/fs/ocfs2/dlmfs/userdlm.c~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlmfs/userdlm.c
@@ -21,12 +21,12 @@
 #include <linux/types.h>
 #include <linux/crc32.h>
 
-#include "ocfs2_lockingver.h"
-#include "stackglue.h"
+#include "../ocfs2_lockingver.h"
+#include "../stackglue.h"
 #include "userdlm.h"
 
 #define MLOG_MASK_PREFIX ML_DLMFS
-#include "cluster/masklog.h"
+#include "../cluster/masklog.h"
 
 
 static inline struct user_lock_res *user_lksb_to_lock_res(struct ocfs2_dlm_lksb *lksb)
--- a/fs/ocfs2/dlm/Makefile~ocfs2-make-local-header-paths-relative-to-c-files
+++ a/fs/ocfs2/dlm/Makefile
@@ -1,6 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
-ccflags-y := -I $(srctree)/$(src)/..


* [patch 014/118] ocfs2/dlm: remove redundant assignment to ret
  2020-01-31  6:10 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2020-01-31  6:11 ` [patch 013/118] ocfs2: make local header paths relative to C files Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:11 ` [patch 015/118] ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use Andrew Morton
                   ` (103 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, colin.king, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi,
	linux-mm, mark, mm-commits, piaojun, torvalds

From: Colin Ian King <colin.king@canonical.com>
Subject: ocfs2/dlm: remove redundant assignment to ret

The variable ret is initialized with a value that is never read, and it
is later updated with a new value.  The initialization is redundant and
can be removed.

Addresses Coverity ("Unused value")

Link: http://lkml.kernel.org/r/20191202164833.62865-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/dlm/dlmrecovery.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/ocfs2/dlm/dlmrecovery.c~ocfs2-dlm-remove-redundant-assignment-to-ret
+++ a/fs/ocfs2/dlm/dlmrecovery.c
@@ -1668,7 +1668,7 @@ static int dlm_lockres_master_requery(st
 int dlm_do_master_requery(struct dlm_ctxt *dlm, struct dlm_lock_resource *res,
 			  u8 nodenum, u8 *real_master)
 {
-	int ret = -EINVAL;
+	int ret;
 	struct dlm_master_requery req;
 	int status = DLM_LOCK_RES_OWNER_UNKNOWN;
 
_


* [patch 015/118] ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use
  2020-01-31  6:10 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2020-01-31  6:11 ` [patch 014/118] ocfs2/dlm: remove redundant assignment to ret Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:11 ` [patch 016/118] ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans() Andrew Morton
                   ` (102 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, gechangwei, ghe, jlbec, joseph.qi,
	junxiao.bi, linux-mm, mark, mm-commits, piaojun, skalluru,
	torvalds

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use

There are already users of the BITS_TO_BYTES() macro, and there will be
more.  Move it to bitops.h for wider use.

In the case of ocfs2 the replacement is identical.

As for bnx2x, there are two places where the floor version is used.  In
the first case it is used to calculate the number of structures that can
fit in one memory page.  In this case the ceiling variant is obviously
correct, and the original code might have had a potential bug if the
number of bits % 8 is not 0.  In the second case the macro is used to
calculate the bytes transmitted in one microsecond.  This will work
without any change for all speeds that are a multiple of 1Gbps; for the
rest, the new code will give the ceiling value.  For instance, 100Mbps
will give 13 bytes, while the old code gives 12 bytes and the
arithmetically correct value is 12.5 bytes.  Further, the value is used
to set up a timer threshold, which in any case has its own margins due
to limited resolution.  I don't see an issue here with slightly shifting
thresholds for low-speed connections; the card is supposed to utilize
the highest available rate, which is usually 10Gbps.
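
A hedged sketch of the floor-vs-ceiling arithmetic described above (the
macro names here are illustrative, not the kernel's exact definitions):

    #define BITS_TO_BYTES_FLOOR(x)  ((x) / 8)        /* old bnx2x macro */
    #define BITS_TO_BYTES_CEIL(x)   (((x) + 7) / 8)  /* rounds up */

    /*
     * 100Mbps transmits 100 bits per microsecond:
     *   BITS_TO_BYTES_FLOOR(100) == 12  (old code)
     *   BITS_TO_BYTES_CEIL(100)  == 13  (new code; exact value is 12.5)
     */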

Link: http://lkml.kernel.org/r/20200108121316.22411-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h |    1 -
 fs/ocfs2/dlm/dlmcommon.h                         |    4 ----
 include/linux/bitops.h                           |    1 +
 3 files changed, 1 insertion(+), 5 deletions(-)

--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h~ocfs2-dlm-move-bits_to_bytes-to-bitopsh-for-wider-use
+++ a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_init.h
@@ -296,7 +296,6 @@ static inline void bnx2x_dcb_config_qm(s
  *    possible, the driver should only write the valid vnics into the internal
  *    ram according to the appropriate port mode.
  */
-#define BITS_TO_BYTES(x) ((x)/8)
 
 /* CMNG constants, as derived from system spec calculations */
 
--- a/fs/ocfs2/dlm/dlmcommon.h~ocfs2-dlm-move-bits_to_bytes-to-bitopsh-for-wider-use
+++ a/fs/ocfs2/dlm/dlmcommon.h
@@ -688,10 +688,6 @@ struct dlm_begin_reco
 	__be32 pad2;
 };
 
-
-#define BITS_PER_BYTE 8
-#define BITS_TO_BYTES(bits) (((bits)+BITS_PER_BYTE-1)/BITS_PER_BYTE)


* [patch 016/118] ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2020-01-31  6:11 ` [patch 015/118] ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:11 ` [patch 017/118] ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction Andrew Morton
                   ` (101 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jiangqi903, jlbec, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, torvalds, wangyan122

From: wangyan <wangyan122@huawei.com>
Subject: ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans()

I found a NULL pointer dereference in ocfs2_update_inode_fsync_trans();
handle->h_transaction may be NULL in this situation:

ocfs2_file_write_iter
  ->__generic_file_write_iter
      ->generic_perform_write
        ->ocfs2_write_begin
          ->ocfs2_write_begin_nolock
            ->ocfs2_write_cluster_by_desc
              ->ocfs2_write_cluster
                ->ocfs2_mark_extent_written
                  ->ocfs2_change_extent_flag
                    ->ocfs2_split_extent
                      ->ocfs2_try_to_merge_extent
                        ->ocfs2_extend_rotate_transaction
                          ->ocfs2_extend_trans
                            ->jbd2_journal_restart
                              ->jbd2__journal_restart
                                // handle->h_transaction is NULL here
                                ->handle->h_transaction = NULL;
                                ->start_this_handle
                                  /* journal aborted due to storage
                                     network disconnection, return error */
                                  ->return -EROFS;
                         /* line 3806 in ocfs2_try_to_merge_extent (),
                            it will ignore ret error. */
                        ->ret = 0;
        ->...
        ->ocfs2_write_end
          ->ocfs2_write_end_nolock
            ->ocfs2_update_inode_fsync_trans
              // NULL pointer dereference
              ->oi->i_sync_tid = handle->h_transaction->t_tid;

The information of NULL pointer dereference as follows:
    JBD2: Detected IO errors while flushing file data on dm-11-45
    Aborting journal on device dm-11-45.
    JBD2: Error -5 detected when updating journal superblock for dm-11-45.
    (dd,22081,3):ocfs2_extend_trans:474 ERROR: status = -30
    (dd,22081,3):ocfs2_try_to_merge_extent:3877 ERROR: status = -30
    Unable to handle kernel NULL pointer dereference at
    virtual address 0000000000000008
    Mem abort info:
      ESR = 0x96000004
      Exception class = DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
    Data abort info:
      ISV = 0, ISS = 0x00000004
      CM = 0, WnR = 0
    user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000e74e1338
    [0000000000000008] pgd=0000000000000000
    Internal error: Oops: 96000004 [#1] SMP
    Process dd (pid: 22081, stack limit = 0x00000000584f35a9)
    CPU: 3 PID: 22081 Comm: dd Kdump: loaded
    Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
    pstate: 60400009 (nZCv daif +PAN -UAO)
    pc : ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2]
    lr : ocfs2_write_end_nolock+0x2a0/0x550 [ocfs2]
    sp : ffff0000459fba70
    x29: ffff0000459fba70 x28: 0000000000000000
    x27: ffff807ccf7f1000 x26: 0000000000000001
    x25: ffff807bdff57970 x24: ffff807caf1d4000
    x23: ffff807cc79e9000 x22: 0000000000001000
    x21: 000000006c6cd000 x20: ffff0000091d9000
    x19: ffff807ccb239db0 x18: ffffffffffffffff
    x17: 000000000000000e x16: 0000000000000007
    x15: ffff807c5e15bd78 x14: 0000000000000000
    x13: 0000000000000000 x12: 0000000000000000
    x11: 0000000000000000 x10: 0000000000000001
    x9 : 0000000000000228 x8 : 000000000000000c
    x7 : 0000000000000fff x6 : ffff807a308ed6b0
    x5 : ffff7e01f10967c0 x4 : 0000000000000018
    x3 : d0bc661572445600 x2 : 0000000000000000
    x1 : 000000001b2e0200 x0 : 0000000000000000
    Call trace:
     ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2]
     ocfs2_write_end+0x4c/0x80 [ocfs2]
     generic_perform_write+0x108/0x1a8
     __generic_file_write_iter+0x158/0x1c8
     ocfs2_file_write_iter+0x668/0x950 [ocfs2]
     __vfs_write+0x11c/0x190
     vfs_write+0xac/0x1c0
     ksys_write+0x6c/0xd8
     __arm64_sys_write+0x24/0x30
     el0_svc_common+0x78/0x130
     el0_svc_handler+0x38/0x78
     el0_svc+0x8/0xc

To prevent a NULL pointer dereference in this situation, we check
is_handle_aborted() before using handle->h_transaction->t_tid.

Link: http://lkml.kernel.org/r/03e750ab-9ade-83aa-b000-b9e81e34e539@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/journal.h |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/fs/ocfs2/journal.h~ocfs2-fix-a-null-pointer-dereference-when-call-ocfs2_update_inode_fsync_trans
+++ a/fs/ocfs2/journal.h
@@ -597,9 +597,11 @@ static inline void ocfs2_update_inode_fs
 {
 	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 
-	oi->i_sync_tid = handle->h_transaction->t_tid;
-	if (datasync)
-		oi->i_datasync_tid = handle->h_transaction->t_tid;
+	if (!is_handle_aborted(handle)) {
+		oi->i_sync_tid = handle->h_transaction->t_tid;
+		if (datasync)
+			oi->i_datasync_tid = handle->h_transaction->t_tid;
+	}
 }
 
 #endif /* OCFS2_JOURNAL_H */
_


* [patch 017/118] ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction
  2020-01-31  6:10 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2020-01-31  6:11 ` [patch 016/118] ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans() Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:11 ` [patch 018/118] mm/slub.c: avoid slub allocation while holding list_lock Andrew Morton
                   ` (100 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jiangqi903, jlbec, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, torvalds, wangyan122

From: wangyan <wangyan122@huawei.com>
Subject: ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction

For a uniform format, use ocfs2_update_inode_fsync_trans() to access
t_tid in handle->h_transaction.

Link: http://lkml.kernel.org/r/6ff9a312-5f7d-0e27-fb51-bc4e062fcd97@huawei.com
Signed-off-by: Yan Wang <wangyan122@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/namei.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/fs/ocfs2/namei.c~ocfs2-use-ocfs2_update_inode_fsync_trans-to-access-t_tid-in-handle-h_transaction
+++ a/fs/ocfs2/namei.c
@@ -586,8 +586,7 @@ static int __ocfs2_mknod_locked(struct i
 			mlog_errno(status);
 	}
 
-	oi->i_sync_tid = handle->h_transaction->t_tid;
-	oi->i_datasync_tid = handle->h_transaction->t_tid;
+	ocfs2_update_inode_fsync_trans(handle, inode, 1);
 
 leave:
 	if (status < 0) {
_


* [patch 018/118] mm/slub.c: avoid slub allocation while holding list_lock
  2020-01-31  6:10 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2020-01-31  6:11 ` [patch 017/118] ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction Andrew Morton
@ 2020-01-31  6:11 ` Andrew Morton
  2020-01-31  6:12 ` [patch 019/118] mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t Andrew Morton
                   ` (99 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:11 UTC (permalink / raw)
  To: akpm, cl, iamjoonsoo.kim, kirill.shutemov, linux-mm, mm-commits,
	penberg, penguin-kernel, rientjes, torvalds, yuzhao

From: Yu Zhao <yuzhao@google.com>
Subject: mm/slub.c: avoid slub allocation while holding list_lock

If we are already under list_lock, don't call kmalloc().  Otherwise we
will run into a deadlock because kmalloc() also tries to grab the same
lock.

Fix the problem by using a static bitmap instead.

  WARNING: possible recursive locking detected
  --------------------------------------------
  mount-encrypted/4921 is trying to acquire lock:
  (&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437

  but task is already holding lock:
  (&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&(&n->list_lock)->rlock);
    lock(&(&n->list_lock)->rlock);

   *** DEADLOCK ***
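
A minimal sketch of the pattern the fix applies (simplified from the
diff below): one preallocated bitmap guarded by its own spinlock, so no
allocation happens while list_lock is held:

    static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
    static DEFINE_SPINLOCK(object_map_lock);

    /* caller already holds n->list_lock with IRQs disabled */
    spin_lock(&object_map_lock);
    /* ... fill object_map from the page's freelist and consume it ... */
    spin_unlock(&object_map_lock);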

Link: http://lkml.kernel.org/r/20191108193958.205102-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   88 +++++++++++++++++++++++++++-------------------------
 1 file changed, 47 insertions(+), 41 deletions(-)

--- a/mm/slub.c~mm-avoid-slub-allocation-while-holding-list_lock
+++ a/mm/slub.c
@@ -439,19 +439,38 @@ static inline bool cmpxchg_double_slab(s
 }
 
 #ifdef CONFIG_SLUB_DEBUG
+static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
+static DEFINE_SPINLOCK(object_map_lock);
+
 /*
  * Determine a map of object in use on a page.
  *
  * Node listlock must be held to guarantee that the page does
  * not vanish from under us.
  */
-static void get_map(struct kmem_cache *s, struct page *page, unsigned long *map)
+static unsigned long *get_map(struct kmem_cache *s, struct page *page)
 {
 	void *p;
 	void *addr = page_address(page);
 
+	VM_BUG_ON(!irqs_disabled());
+
+	spin_lock(&object_map_lock);
+
+	bitmap_zero(object_map, page->objects);
+
 	for (p = page->freelist; p; p = get_freepointer(s, p))
-		set_bit(slab_index(p, s, addr), map);
+		set_bit(slab_index(p, s, addr), object_map);
+
+	return object_map;
+}
+
+static void put_map(unsigned long *map)
+{
+	VM_BUG_ON(map != object_map);
+	lockdep_assert_held(&object_map_lock);
+
+	spin_unlock(&object_map_lock);
 }
 
 static inline unsigned int size_from_object(struct kmem_cache *s)
@@ -3675,13 +3694,12 @@ static void list_slab_objects(struct kme
 #ifdef CONFIG_SLUB_DEBUG
 	void *addr = page_address(page);
 	void *p;
-	unsigned long *map = bitmap_zalloc(page->objects, GFP_ATOMIC);
-	if (!map)
-		return;
+	unsigned long *map;
+
 	slab_err(s, page, text, s->name);
 	slab_lock(page);
 
-	get_map(s, page, map);
+	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects) {
 
 		if (!test_bit(slab_index(p, s, addr), map)) {
@@ -3689,8 +3707,9 @@ static void list_slab_objects(struct kme
 			print_tracking(s, p);
 		}
 	}
+	put_map(map);
+
 	slab_unlock(page);
-	bitmap_free(map);
 #endif
 }
 
@@ -4384,19 +4403,19 @@ static int count_total(struct page *page
 #endif
 
 #ifdef CONFIG_SLUB_DEBUG
-static void validate_slab(struct kmem_cache *s, struct page *page,
-						unsigned long *map)
+static void validate_slab(struct kmem_cache *s, struct page *page)
 {
 	void *p;
 	void *addr = page_address(page);
+	unsigned long *map;
+
+	slab_lock(page);
 
 	if (!check_slab(s, page) || !on_freelist(s, page, NULL))
-		return;
+		goto unlock;
 
 	/* Now we know that a valid freelist exists */
-	bitmap_zero(map, page->objects);
-
-	get_map(s, page, map);
+	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects) {
 		u8 val = test_bit(slab_index(p, s, addr), map) ?
 			 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
@@ -4404,18 +4423,13 @@ static void validate_slab(struct kmem_ca
 		if (!check_object(s, page, p, val))
 			break;
 	}
-}
-
-static void validate_slab_slab(struct kmem_cache *s, struct page *page,
-						unsigned long *map)
-{
-	slab_lock(page);
-	validate_slab(s, page, map);
+	put_map(map);
+unlock:
 	slab_unlock(page);
 }
 
 static int validate_slab_node(struct kmem_cache *s,
-		struct kmem_cache_node *n, unsigned long *map)
+		struct kmem_cache_node *n)
 {
 	unsigned long count = 0;
 	struct page *page;
@@ -4424,7 +4438,7 @@ static int validate_slab_node(struct kme
 	spin_lock_irqsave(&n->list_lock, flags);
 
 	list_for_each_entry(page, &n->partial, slab_list) {
-		validate_slab_slab(s, page, map);
+		validate_slab(s, page);
 		count++;
 	}
 	if (count != n->nr_partial)
@@ -4435,7 +4449,7 @@ static int validate_slab_node(struct kme
 		goto out;
 
 	list_for_each_entry(page, &n->full, slab_list) {
-		validate_slab_slab(s, page, map);
+		validate_slab(s, page);
 		count++;
 	}
 	if (count != atomic_long_read(&n->nr_slabs))
@@ -4452,15 +4466,11 @@ static long validate_slab_cache(struct k
 	int node;
 	unsigned long count = 0;
 	struct kmem_cache_node *n;
-	unsigned long *map = bitmap_alloc(oo_objects(s->max), GFP_KERNEL);
-
-	if (!map)
-		return -ENOMEM;
 
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n)
-		count += validate_slab_node(s, n, map);
-	bitmap_free(map);
+		count += validate_slab_node(s, n);
+
 	return count;
 }
 /*
@@ -4590,18 +4600,17 @@ static int add_location(struct loc_track
 }
 
 static void process_slab(struct loc_track *t, struct kmem_cache *s,
-		struct page *page, enum track_item alloc,
-		unsigned long *map)
+		struct page *page, enum track_item alloc)
 {
 	void *addr = page_address(page);
 	void *p;
+	unsigned long *map;
 
-	bitmap_zero(map, page->objects);
-	get_map(s, page, map);
-
+	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects)
 		if (!test_bit(slab_index(p, s, addr), map))
 			add_location(t, s, get_track(s, p, alloc));
+	put_map(map);
 }
 
 static int list_locations(struct kmem_cache *s, char *buf,
@@ -4612,11 +4621,9 @@ static int list_locations(struct kmem_ca
 	struct loc_track t = { 0, 0, NULL };
 	int node;
 	struct kmem_cache_node *n;
-	unsigned long *map = bitmap_alloc(oo_objects(s->max), GFP_KERNEL);
 
-	if (!map || !alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
-				     GFP_KERNEL)) {
-		bitmap_free(map);
+	if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location),
+			     GFP_KERNEL)) {
 		return sprintf(buf, "Out of memory\n");
 	}
 	/* Push back cpu slabs */
@@ -4631,9 +4638,9 @@ static int list_locations(struct kmem_ca
 
 		spin_lock_irqsave(&n->list_lock, flags);
 		list_for_each_entry(page, &n->partial, slab_list)
-			process_slab(&t, s, page, alloc, map);
+			process_slab(&t, s, page, alloc);
 		list_for_each_entry(page, &n->full, slab_list)
-			process_slab(&t, s, page, alloc, map);
+			process_slab(&t, s, page, alloc);
 		spin_unlock_irqrestore(&n->list_lock, flags);
 	}
 
@@ -4682,7 +4689,6 @@ static int list_locations(struct kmem_ca
 	}
 
 	free_loc_track(&t);
-	bitmap_free(map);
 	if (!t.count)
 		len += sprintf(buf, "No data\n");
 	return len;
_


* [patch 019/118] mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t
  2020-01-31  6:10 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2020-01-31  6:11 ` [patch 018/118] mm/slub.c: avoid slub allocation while holding list_lock Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 020/118] mm/debug.c: always print flags in dump_page() Andrew Morton
                   ` (98 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, bigeasy, catalin.marinas, haitao.liu, linux-mm, mm-commits,
	torvalds, yongxin.liu, zhe.he

From: He Zhe <zhe.he@windriver.com>
Subject: mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t

On RT, kmemleak_lock as a rwlock can possibly be acquired in atomic
context, which does not work.

Since the kmemleak operation is performed in atomic context, make it a
raw_spinlock_t so it can also be acquired on RT.  Kmemleak is used for
debugging and is not enabled by default in a production-like environment
(where performance/latency matters), so it makes sense to make it a
raw_spinlock_t instead of trying to get rid of the atomic context.  Also
turn kmemleak_object->lock into a raw_spinlock_t, which is acquired
(nested) while kmemleak_lock is held.

The time spent in "echo scan > kmemleak" slightly improved on a 64-core
box with this patch applied after boot.
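
For illustration, a hedged sketch of the lock type this patch switches
to (demo_lock and demo_atomic_path are made-up names): on PREEMPT_RT a
spinlock_t may sleep, while a raw_spinlock_t always busy-waits and so
stays safe in atomic context:

    static DEFINE_RAW_SPINLOCK(demo_lock);

    static void demo_atomic_path(void)
    {
            unsigned long flags;

            raw_spin_lock_irqsave(&demo_lock, flags);
            /* critical section; must not sleep, even on RT */
            raw_spin_unlock_irqrestore(&demo_lock, flags);
    }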

[bigeasy@linutronix.de: redo the description, update comments. Merge the individual bits:  He Zhe did the kmemleak_lock, Liu Haitao the ->lock and Yongxin Liu forwarded Liu's patch.]
Link: http://lkml.kernel.org/r/20191219170834.4tah3prf2gdothz4@linutronix.de
Link: https://lkml.kernel.org/r/20181218150744.GB20197@arrakis.emea.arm.com
Link: https://lkml.kernel.org/r/1542877459-144382-1-git-send-email-zhe.he@windriver.com
Link: https://lkml.kernel.org/r/20190927082230.34152-1-yongxin.liu@windriver.com
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Liu Haitao <haitao.liu@windriver.com>
Signed-off-by: Yongxin Liu <yongxin.liu@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kmemleak.c |  112 ++++++++++++++++++++++++------------------------
 1 file changed, 56 insertions(+), 56 deletions(-)

--- a/mm/kmemleak.c~kmemleak-turn-kmemleak_lock-and-object-lock-to-raw_spinlock_t
+++ a/mm/kmemleak.c
@@ -13,7 +13,7 @@
  *
  * The following locks and mutexes are used by kmemleak:
  *
- * - kmemleak_lock (rwlock): protects the object_list modifications and
+ * - kmemleak_lock (raw_spinlock_t): protects the object_list modifications and
  *   accesses to the object_tree_root. The object_list is the main list
  *   holding the metadata (struct kmemleak_object) for the allocated memory
  *   blocks. The object_tree_root is a red black tree used to look-up
@@ -22,13 +22,13 @@
  *   object_tree_root in the create_object() function called from the
  *   kmemleak_alloc() callback and removed in delete_object() called from the
  *   kmemleak_free() callback
- * - kmemleak_object.lock (spinlock): protects a kmemleak_object. Accesses to
- *   the metadata (e.g. count) are protected by this lock. Note that some
- *   members of this structure may be protected by other means (atomic or
- *   kmemleak_lock). This lock is also held when scanning the corresponding
- *   memory block to avoid the kernel freeing it via the kmemleak_free()
- *   callback. This is less heavyweight than holding a global lock like
- *   kmemleak_lock during scanning
+ * - kmemleak_object.lock (raw_spinlock_t): protects a kmemleak_object.
+ *   Accesses to the metadata (e.g. count) are protected by this lock. Note
+ *   that some members of this structure may be protected by other means
+ *   (atomic or kmemleak_lock). This lock is also held when scanning the
+ *   corresponding memory block to avoid the kernel freeing it via the
+ *   kmemleak_free() callback. This is less heavyweight than holding a global
+ *   lock like kmemleak_lock during scanning.
  * - scan_mutex (mutex): ensures that only one thread may scan the memory for
  *   unreferenced objects at a time. The gray_list contains the objects which
  *   are already referenced or marked as false positives and need to be
@@ -135,7 +135,7 @@ struct kmemleak_scan_area {
  * (use_count) and freed using the RCU mechanism.
  */
 struct kmemleak_object {
-	spinlock_t lock;
+	raw_spinlock_t lock;
 	unsigned int flags;		/* object status flags */
 	struct list_head object_list;
 	struct list_head gray_list;
@@ -191,8 +191,8 @@ static int mem_pool_free_count = ARRAY_S
 static LIST_HEAD(mem_pool_free_list);
 /* search tree for object boundaries */
 static struct rb_root object_tree_root = RB_ROOT;
-/* rw_lock protecting the access to object_list and object_tree_root */
-static DEFINE_RWLOCK(kmemleak_lock);
+/* protecting the access to object_list and object_tree_root */
+static DEFINE_RAW_SPINLOCK(kmemleak_lock);
 
 /* allocation caches for kmemleak internal data */
 static struct kmem_cache *object_cache;
@@ -426,7 +426,7 @@ static struct kmemleak_object *mem_pool_
 	}
 
 	/* slab allocation failed, try the memory pool */
-	write_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	object = list_first_entry_or_null(&mem_pool_free_list,
 					  typeof(*object), object_list);
 	if (object)
@@ -435,7 +435,7 @@ static struct kmemleak_object *mem_pool_
 		object = &mem_pool[--mem_pool_free_count];
 	else
 		pr_warn_once("Memory pool empty, consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE\n");
-	write_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 
 	return object;
 }
@@ -453,9 +453,9 @@ static void mem_pool_free(struct kmemlea
 	}
 
 	/* add the object to the memory pool free list */
-	write_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	list_add(&object->object_list, &mem_pool_free_list);
-	write_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 }
 
 /*
@@ -514,9 +514,9 @@ static struct kmemleak_object *find_and_
 	struct kmemleak_object *object;
 
 	rcu_read_lock();
-	read_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	object = lookup_object(ptr, alias);
-	read_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 
 	/* check whether the object is still available */
 	if (object && !get_object(object))
@@ -546,11 +546,11 @@ static struct kmemleak_object *find_and_
 	unsigned long flags;
 	struct kmemleak_object *object;
 
-	write_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	object = lookup_object(ptr, alias);
 	if (object)
 		__remove_object(object);
-	write_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 
 	return object;
 }
@@ -585,7 +585,7 @@ static struct kmemleak_object *create_ob
 	INIT_LIST_HEAD(&object->object_list);
 	INIT_LIST_HEAD(&object->gray_list);
 	INIT_HLIST_HEAD(&object->area_list);
-	spin_lock_init(&object->lock);
+	raw_spin_lock_init(&object->lock);
 	atomic_set(&object->use_count, 1);
 	object->flags = OBJECT_ALLOCATED;
 	object->pointer = ptr;
@@ -617,7 +617,7 @@ static struct kmemleak_object *create_ob
 	/* kernel backtrace */
 	object->trace_len = __save_stack_trace(object->trace);
 
-	write_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 
 	untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr);
 	min_addr = min(min_addr, untagged_ptr);
@@ -649,7 +649,7 @@ static struct kmemleak_object *create_ob
 
 	list_add_tail_rcu(&object->object_list, &object_list);
 out:
-	write_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 	return object;
 }
 
@@ -667,9 +667,9 @@ static void __delete_object(struct kmeml
 	 * Locking here also ensures that the corresponding memory block
 	 * cannot be freed when it is being scanned.
 	 */
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	object->flags &= ~OBJECT_ALLOCATED;
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	put_object(object);
 }
 
@@ -739,9 +739,9 @@ static void paint_it(struct kmemleak_obj
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	__paint_it(object, color);
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 }
 
 static void paint_ptr(unsigned long ptr, int color)
@@ -798,7 +798,7 @@ static void add_scan_area(unsigned long
 	if (scan_area_cache)
 		area = kmem_cache_alloc(scan_area_cache, gfp_kmemleak_mask(gfp));
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	if (!area) {
 		pr_warn_once("Cannot allocate a scan area, scanning the full object\n");
 		/* mark the object for full scan to avoid false positives */
@@ -820,7 +820,7 @@ static void add_scan_area(unsigned long
 
 	hlist_add_head(&area->node, &object->area_list);
 out_unlock:
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	put_object(object);
 }
 
@@ -842,9 +842,9 @@ static void object_set_excess_ref(unsign
 		return;
 	}
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	object->excess_ref = excess_ref;
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	put_object(object);
 }
 
@@ -864,9 +864,9 @@ static void object_no_scan(unsigned long
 		return;
 	}
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	object->flags |= OBJECT_NO_SCAN;
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	put_object(object);
 }
 
@@ -1026,9 +1026,9 @@ void __ref kmemleak_update_trace(const v
 		return;
 	}
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	object->trace_len = __save_stack_trace(object->trace);
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 
 	put_object(object);
 }
@@ -1233,7 +1233,7 @@ static void scan_block(void *_start, voi
 	unsigned long flags;
 	unsigned long untagged_ptr;
 
-	read_lock_irqsave(&kmemleak_lock, flags);
+	raw_spin_lock_irqsave(&kmemleak_lock, flags);
 	for (ptr = start; ptr < end; ptr++) {
 		struct kmemleak_object *object;
 		unsigned long pointer;
@@ -1268,7 +1268,7 @@ static void scan_block(void *_start, voi
 		 * previously acquired in scan_object(). These locks are
 		 * enclosed by scan_mutex.
 		 */
-		spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING);
+		raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING);
 		/* only pass surplus references (object already gray) */
 		if (color_gray(object)) {
 			excess_ref = object->excess_ref;
@@ -1277,7 +1277,7 @@ static void scan_block(void *_start, voi
 			excess_ref = 0;
 			update_refs(object);
 		}
-		spin_unlock(&object->lock);
+		raw_spin_unlock(&object->lock);
 
 		if (excess_ref) {
 			object = lookup_object(excess_ref, 0);
@@ -1286,12 +1286,12 @@ static void scan_block(void *_start, voi
 			if (object == scanned)
 				/* circular reference, ignore */
 				continue;
-			spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING);
+			raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING);
 			update_refs(object);
-			spin_unlock(&object->lock);
+			raw_spin_unlock(&object->lock);
 		}
 	}
-	read_unlock_irqrestore(&kmemleak_lock, flags);
+	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
 }
 
 /*
@@ -1324,7 +1324,7 @@ static void scan_object(struct kmemleak_
 	 * Once the object->lock is acquired, the corresponding memory block
 	 * cannot be freed (the same lock is acquired in delete_object).
 	 */
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	if (object->flags & OBJECT_NO_SCAN)
 		goto out;
 	if (!(object->flags & OBJECT_ALLOCATED))
@@ -1344,9 +1344,9 @@ static void scan_object(struct kmemleak_
 			if (start >= end)
 				break;
 
-			spin_unlock_irqrestore(&object->lock, flags);
+			raw_spin_unlock_irqrestore(&object->lock, flags);
 			cond_resched();
-			spin_lock_irqsave(&object->lock, flags);
+			raw_spin_lock_irqsave(&object->lock, flags);
 		} while (object->flags & OBJECT_ALLOCATED);
 	} else
 		hlist_for_each_entry(area, &object->area_list, node)
@@ -1354,7 +1354,7 @@ static void scan_object(struct kmemleak_
 				   (void *)(area->start + area->size),
 				   object);
 out:
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 }
 
 /*
@@ -1407,7 +1407,7 @@ static void kmemleak_scan(void)
 	/* prepare the kmemleak_object's */
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
-		spin_lock_irqsave(&object->lock, flags);
+		raw_spin_lock_irqsave(&object->lock, flags);
 #ifdef DEBUG
 		/*
 		 * With a few exceptions there should be a maximum of
@@ -1424,7 +1424,7 @@ static void kmemleak_scan(void)
 		if (color_gray(object) && get_object(object))
 			list_add_tail(&object->gray_list, &gray_list);
 
-		spin_unlock_irqrestore(&object->lock, flags);
+		raw_spin_unlock_irqrestore(&object->lock, flags);
 	}
 	rcu_read_unlock();
 
@@ -1492,14 +1492,14 @@ static void kmemleak_scan(void)
 	 */
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
-		spin_lock_irqsave(&object->lock, flags);
+		raw_spin_lock_irqsave(&object->lock, flags);
 		if (color_white(object) && (object->flags & OBJECT_ALLOCATED)
 		    && update_checksum(object) && get_object(object)) {
 			/* color it gray temporarily */
 			object->count = object->min_count;
 			list_add_tail(&object->gray_list, &gray_list);
 		}
-		spin_unlock_irqrestore(&object->lock, flags);
+		raw_spin_unlock_irqrestore(&object->lock, flags);
 	}
 	rcu_read_unlock();
 
@@ -1519,7 +1519,7 @@ static void kmemleak_scan(void)
 	 */
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
-		spin_lock_irqsave(&object->lock, flags);
+		raw_spin_lock_irqsave(&object->lock, flags);
 		if (unreferenced_object(object) &&
 		    !(object->flags & OBJECT_REPORTED)) {
 			object->flags |= OBJECT_REPORTED;
@@ -1529,7 +1529,7 @@ static void kmemleak_scan(void)
 
 			new_leaks++;
 		}
-		spin_unlock_irqrestore(&object->lock, flags);
+		raw_spin_unlock_irqrestore(&object->lock, flags);
 	}
 	rcu_read_unlock();
 
@@ -1681,10 +1681,10 @@ static int kmemleak_seq_show(struct seq_
 	struct kmemleak_object *object = v;
 	unsigned long flags;
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object))
 		print_unreferenced(seq, object);
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 	return 0;
 }
 
@@ -1714,9 +1714,9 @@ static int dump_str_object_info(const ch
 		return -EINVAL;
 	}
 
-	spin_lock_irqsave(&object->lock, flags);
+	raw_spin_lock_irqsave(&object->lock, flags);
 	dump_object_info(object);
-	spin_unlock_irqrestore(&object->lock, flags);
+	raw_spin_unlock_irqrestore(&object->lock, flags);
 
 	put_object(object);
 	return 0;
@@ -1735,11 +1735,11 @@ static void kmemleak_clear(void)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(object, &object_list, object_list) {
-		spin_lock_irqsave(&object->lock, flags);
+		raw_spin_lock_irqsave(&object->lock, flags);
 		if ((object->flags & OBJECT_REPORTED) &&
 		    unreferenced_object(object))
 			__paint_it(object, KMEMLEAK_GREY);
-		spin_unlock_irqrestore(&object->lock, flags);
+		raw_spin_unlock_irqrestore(&object->lock, flags);
 	}
 	rcu_read_unlock();
 
_


* [patch 020/118] mm/debug.c: always print flags in dump_page()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (18 preceding siblings ...)
  2020-01-31  6:12 ` [patch 019/118] mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 021/118] mm/filemap.c: clean up filemap_write_and_wait() Andrew Morton
                   ` (97 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, anshuman.khandual, cai, dan.j.williams, david, linux-mm,
	mgorman, mhocko, mhocko, mm-commits, osalvador, pavel.tatashin,
	rcampbell, rppt, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm/debug.c: always print flags in dump_page()

Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
inadvertently removed the printing of page flags for pages that are
neither anon nor ksm and have no mapping.  Fix that.

Using pr_cont() again would be a solution, but the commit explicitly
removed its use.  Avoiding the danger of mixing up split lines from
multiple CPUs might be beneficial for near-panic dumps like this, so fix
this without reintroducing pr_cont().

Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
Fixes: 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug.c |    8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

--- a/mm/debug.c~mm-debug-always-print-flags-in-dump_page
+++ a/mm/debug.c
@@ -47,6 +47,7 @@ void __dump_page(struct page *page, cons
 	struct address_space *mapping;
 	bool page_poisoned = PagePoisoned(page);
 	int mapcount;
+	char *type = "";
 
 	/*
 	 * If struct page is poisoned don't access Page*() functions as that
@@ -78,9 +79,9 @@ void __dump_page(struct page *page, cons
 			page, page_ref_count(page), mapcount,
 			page->mapping, page_to_pgoff(page));
 	if (PageKsm(page))
-		pr_warn("ksm flags: %#lx(%pGp)\n", page->flags, &page->flags);
+		type = "ksm ";
 	else if (PageAnon(page))
-		pr_warn("anon flags: %#lx(%pGp)\n", page->flags, &page->flags);
+		type = "anon ";
 	else if (mapping) {
 		if (mapping->host && mapping->host->i_dentry.first) {
 			struct dentry *dentry;
@@ -88,10 +89,11 @@ void __dump_page(struct page *page, cons
 			pr_warn("%ps name:\"%pd\"\n", mapping->a_ops, dentry);
 		} else
 			pr_warn("%ps\n", mapping->a_ops);
-		pr_warn("flags: %#lx(%pGp)\n", page->flags, &page->flags);
 	}
 	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1);
 
+	pr_warn("%sflags: %#lx(%pGp)\n", type, page->flags, &page->flags);
+
 hex_only:
 	print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
 			sizeof(unsigned long), page,
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 021/118] mm/filemap.c: clean up filemap_write_and_wait()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (19 preceding siblings ...)
  2020-01-31  6:12 ` [patch 020/118] mm/debug.c: always print flags in dump_page() Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 022/118] mm: fix gup_pud_range Andrew Morton
                   ` (96 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, ira.weiny, linux-mm, mm-commits, nborisov, torvalds, willy

From: Ira Weiny <ira.weiny@intel.com>
Subject: mm/filemap.c: clean up filemap_write_and_wait()

At some point filemap_write_and_wait() and filemap_write_and_wait_range()
got the exact same implementation with the exception of the range being
specified in *_range()

Similar to other functions in fs.h which call *_range(..., 0, LLONG_MAX),
change filemap_write_and_wait() to be a static inline which calls
filemap_write_and_wait_range()

Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/fs.h |    6 +++++-
 mm/filemap.c       |   34 ++++++----------------------------
 2 files changed, 11 insertions(+), 29 deletions(-)

--- a/include/linux/fs.h~mm-clean-up-filemap_write_and_wait
+++ a/include/linux/fs.h
@@ -2737,7 +2737,6 @@ static inline int filemap_fdatawait(stru
 
 extern bool filemap_range_has_page(struct address_space *, loff_t lstart,
 				  loff_t lend);
-extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
 				        loff_t lstart, loff_t lend);
 extern int __filemap_fdatawrite_range(struct address_space *mapping,
@@ -2747,6 +2746,11 @@ extern int filemap_fdatawrite_range(stru
 extern int filemap_check_errors(struct address_space *mapping);
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
+static inline int filemap_write_and_wait(struct address_space *mapping)
+{
+	return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
+}
+
 extern int __must_check file_fdatawait_range(struct file *file, loff_t lstart,
 						loff_t lend);
 extern int __must_check file_check_and_advance_wb_err(struct file *file);
--- a/mm/filemap.c~mm-clean-up-filemap_write_and_wait
+++ a/mm/filemap.c
@@ -632,33 +632,6 @@ static bool mapping_needs_writeback(stru
 	return mapping->nrpages;
 }
 
-int filemap_write_and_wait(struct address_space *mapping)
-{
-	int err = 0;
-
-	if (mapping_needs_writeback(mapping)) {
-		err = filemap_fdatawrite(mapping);
-		/*
-		 * Even if the above returned error, the pages may be
-		 * written partially (e.g. -ENOSPC), so we wait for it.
-		 * But the -EIO is special case, it may indicate the worst
-		 * thing (e.g. bug) happened, so we avoid waiting for it.
-		 */
-		if (err != -EIO) {
-			int err2 = filemap_fdatawait(mapping);
-			if (!err)
-				err = err2;
-		} else {
-			/* Clear any previously stored errors */
-			filemap_check_errors(mapping);
-		}
-	} else {
-		err = filemap_check_errors(mapping);
-	}
-	return err;
-}
-EXPORT_SYMBOL(filemap_write_and_wait);

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 022/118] mm: fix gup_pud_range
  2020-01-31  6:10 incoming Andrew Morton
                   ` (20 preceding siblings ...)
  2020-01-31  6:12 ` [patch 021/118] mm/filemap.c: clean up filemap_write_and_wait() Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 023/118] mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge Andrew Morton
                   ` (95 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, aneesh.kumar, hqjagain, jhubbard, linux-mm, mm-commits,
	n-horiguchi, torvalds

From: Qiujun Huang <hqjagain@gmail.com>
Subject: mm: fix gup_pud_range

sorry for not processing for a long time. I met it again.

patch v1   https://lkml.org/lkml/2019/9/20/656

do_machine_check()
  do_memory_failure()
    memory_failure()
      hw_poison_user_mappings()
        try_to_unmap()
          pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));

...and now we have a swap entry that indicates that the page entry
refers to a bad (and poisoned) page of memory, but gup_fast() at this
level of the page table was ignoring swap entries, and incorrectly
assuming that "!pxd_none() == valid and present".

And this was not just a poisoned page problem, but a generaly swap entry
problem. So, any swap entry type (device memory migration, numa migration,
or just regular swapping) could lead to the same problem.

Fix this by checking for pxd_present(), instead of pxd_none().

Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/gup.c~mm-fix-gup_pud_range
+++ a/mm/gup.c
@@ -2237,7 +2237,7 @@ static int gup_pud_range(p4d_t p4d, unsi
 		pud_t pud = READ_ONCE(*pudp);
 
 		next = pud_addr_end(addr, end);
-		if (pud_none(pud))
+		if (unlikely(!pud_present(pud)))
 			return 0;
 		if (unlikely(pud_huge(pud))) {
 			if (!gup_huge_pud(pud, pudp, addr, next, flags,
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 023/118] mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge
  2020-01-31  6:10 incoming Andrew Morton
                   ` (21 preceding siblings ...)
  2020-01-31  6:12 ` [patch 022/118] mm: fix gup_pud_range Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 024/118] mm/gup: factor out duplicate code from four routines Andrew Morton
                   ` (94 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rcampbell, richardw.yang, rientjes,
	torvalds, vbabka

From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge

No functional change, just leverage the helper function to improve
readability as others.

Link: http://lkml.kernel.org/r/20200113070322.26627-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/gup.c~mm-gupc-use-is_vm_hugetlb_page-to-check-whether-to-follow-huge
+++ a/mm/gup.c
@@ -323,7 +323,7 @@ static struct page *follow_pmd_mask(stru
 	pmdval = READ_ONCE(*pmd);
 	if (pmd_none(pmdval))
 		return no_page_table(vma, flags);
-	if (pmd_huge(pmdval) && vma->vm_flags & VM_HUGETLB) {
+	if (pmd_huge(pmdval) && is_vm_hugetlb_page(vma)) {
 		page = follow_huge_pmd(mm, address, pmd, flags);
 		if (page)
 			return page;
@@ -433,7 +433,7 @@ static struct page *follow_pud_mask(stru
 	pud = pud_offset(p4dp, address);
 	if (pud_none(*pud))
 		return no_page_table(vma, flags);
-	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
+	if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) {
 		page = follow_huge_pud(mm, address, pud, flags);
 		if (page)
 			return page;
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 024/118] mm/gup: factor out duplicate code from four routines
  2020-01-31  6:10 incoming Andrew Morton
                   ` (22 preceding siblings ...)
  2020-01-31  6:12 ` [patch 023/118] mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 025/118] mm/gup: move try_get_compound_head() to top, fix minor issues Andrew Morton
                   ` (93 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: factor out duplicate code from four routines

Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.

Overview:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:
    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()


Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've
  been able to runtime test the core get_user_pages() and
  pin_user_pages() and related routines, but not so much on several of
  the call sites--but those are generally just a couple of lines
  changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Runtime testing for the call sites so far is pretty light:

    * io_uring: Some directed tests from liburing exercise this, and
                they pass.
    * process_vm_access.c: A small directed test passes.
    * gup_benchmark: the enhanced version hits the new gup.c code, and
                     passes.
    * infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
    * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                      not ready just yet)
    * powerpc: it compiles...
    * drm/via: compiles...
    * goldfish: compiles...
    * net/xdp: compiles...
    * media/v4l2: compiles...

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/


This patch (of 22):

There are four locations in gup.c that have a fair amount of code
duplication.  This means that changing one requires making the same
changes in four places, not to mention reading the same code four times,
and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the error
cases, by doing a mass subtraction of the refcount, surrounded by
get_page()/put_page().

Also, further simplify (slightly), by waiting until the the successful end
of each routine, to increment *nr.

Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   95 ++++++++++++++++++++++-------------------------------
 1 file changed, 40 insertions(+), 55 deletions(-)

--- a/mm/gup.c~mm-gup-factor-out-duplicate-code-from-four-routines
+++ a/mm/gup.c
@@ -1978,6 +1978,29 @@ static int __gup_device_huge_pud(pud_t p
 }
 #endif
 
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+	/*
+	 * Calling put_page() for each ref is unnecessarily slow. Only the last
+	 * ref needs a put_page().
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
@@ -2007,32 +2030,20 @@ static int gup_hugepte(pte_t *ptep, unsi
 	/* hugepages are never "special" */
 	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-	refs = 0;
 	head = pte_page(pte);
-
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(head, refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2079,28 +2090,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
 	}
 
-	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pmd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2120,28 +2122,19 @@ static int gup_huge_pud(pud_t orig, pud_
 		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
 	}
 
-	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pud_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2157,28 +2150,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_
 		return 0;
 
 	BUILD_BUG_ON(pgd_devmap(orig));
-	refs = 0;
+
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pgd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 025/118] mm/gup: move try_get_compound_head() to top, fix minor issues
  2020-01-31  6:10 incoming Andrew Morton
                   ` (23 preceding siblings ...)
  2020-01-31  6:12 ` [patch 024/118] mm/gup: factor out duplicate code from four routines Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 026/118] mm: Cleanup __put_devmap_managed_page() vs ->page_free() Andrew Morton
                   ` (92 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: move try_get_compound_head() to top, fix minor issues

An upcoming patch uses try_get_compound_head() more widely, so move it to
the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

--- a/mm/gup.c~mm-gup-move-try_get_compound_head-to-top-fix-minor-issues
+++ a/mm/gup.c
@@ -29,6 +29,21 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+	struct page *head = compound_head(page);
+
+	if (WARN_ON_ONCE(page_ref_count(head) < 0))
+		return NULL;
+	if (unlikely(!page_cache_add_speculative(head, refs)))
+		return NULL;
+	return head;
+}
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -1807,20 +1822,6 @@ static void __maybe_unused undo_dev_page
 	}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-	struct page *head = compound_head(page);
-	if (WARN_ON_ONCE(page_ref_count(head) < 0))
-		return NULL;
-	if (unlikely(!page_cache_add_speculative(head, refs)))
-		return NULL;
-	return head;
-}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 026/118] mm: Cleanup __put_devmap_managed_page() vs ->page_free()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (24 preceding siblings ...)
  2020-01-31  6:12 ` [patch 025/118] mm/gup: move try_get_compound_head() to top, fix minor issues Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 027/118] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages Andrew Morton
                   ` (91 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: Dan Williams <dan.j.williams@intel.com>
Subject: mm: Cleanup __put_devmap_managed_page() vs ->page_free()

After the removal of the device-public infrastructure there are only 2
->page_free() call backs in the kernel.  One of those is a device-private
callback in the nouveau driver, the other is a generic wakeup needed in
the DAX case.  In the hopes that all ->page_free() callbacks can be
migrated to common core kernel functionality, move the device-private
specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback.  For the other page types just open-code the generic wakeup.

Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
case.

Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/nvdimm/pmem.c |    6 ---
 mm/memremap.c         |   80 ++++++++++++++++++++++------------------
 2 files changed, 44 insertions(+), 42 deletions(-)

--- a/drivers/nvdimm/pmem.c~mm-cleanup-__put_devmap_managed_page-vs-page_free
+++ a/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pm
 	put_disk(pmem->disk);
 }
 
-static void pmem_pagemap_page_free(struct page *page)
-{
-	wake_up_var(&page->_refcount);
-}
-
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-	.page_free		= pmem_pagemap_page_free,
 	.kill			= pmem_pagemap_kill,
 	.cleanup		= pmem_pagemap_cleanup,
 };
--- a/mm/memremap.c~mm-cleanup-__put_devmap_managed_page-vs-page_free
+++ a/mm/memremap.c
@@ -27,7 +27,8 @@ static void devmap_managed_enable_put(vo
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-	if (!pgmap->ops || !pgmap->ops->page_free) {
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
+	    (!pgmap->ops || !pgmap->ops->page_free)) {
 		WARN(1, "Missing page_free method\n");
 		return -EINVAL;
 	}
@@ -414,44 +415,51 @@ void __put_devmap_managed_page(struct pa
 {
 	int count = page_ref_dec_return(page);
 
-	/*
-	 * If refcount is 1 then page is freed and refcount is stable as nobody
-	 * holds a reference on the page.
-	 */
-	if (count == 1) {
-		/* Clear Active bit in case of parallel mark_page_accessed */
-		__ClearPageActive(page);
-		__ClearPageWaiters(page);
+	/* still busy */
+	if (count > 1)
+		return;
 
-		mem_cgroup_uncharge(page);
+	/* only triggered by the dev_pagemap shutdown path */
+	if (count == 0) {
+		__put_page(page);
+		return;
+	}
 
-		/*
-		 * When a device_private page is freed, the page->mapping field
-		 * may still contain a (stale) mapping value. For example, the
-		 * lower bits of page->mapping may still identify the page as
-		 * an anonymous page. Ultimately, this entire field is just
-		 * stale and wrong, and it will cause errors if not cleared.
-		 * One example is:
-		 *
-		 *  migrate_vma_pages()
-		 *    migrate_vma_insert_page()
-		 *      page_add_new_anon_rmap()
-		 *        __page_set_anon_rmap()
-		 *          ...checks page->mapping, via PageAnon(page) call,
-		 *            and incorrectly concludes that the page is an
-		 *            anonymous page. Therefore, it incorrectly,
-		 *            silently fails to set up the new anon rmap.
-		 *
-		 * For other types of ZONE_DEVICE pages, migration is either
-		 * handled differently or not done at all, so there is no need
-		 * to clear page->mapping.
-		 */
-		if (is_device_private_page(page))
-			page->mapping = NULL;
+	/* notify page idle for dax */
+	if (!is_device_private_page(page)) {
+		wake_up_var(&page->_refcount);
+		return;
+	}
 
-		page->pgmap->ops->page_free(page);
-	} else if (!count)
-		__put_page(page);
+	/* Clear Active bit in case of parallel mark_page_accessed */
+	__ClearPageActive(page);
+	__ClearPageWaiters(page);
+
+	mem_cgroup_uncharge(page);
+
+	/*
+	 * When a device_private page is freed, the page->mapping field
+	 * may still contain a (stale) mapping value. For example, the
+	 * lower bits of page->mapping may still identify the page as an
+	 * anonymous page. Ultimately, this entire field is just stale
+	 * and wrong, and it will cause errors if not cleared.  One
+	 * example is:
+	 *
+	 *  migrate_vma_pages()
+	 *    migrate_vma_insert_page()
+	 *      page_add_new_anon_rmap()
+	 *        __page_set_anon_rmap()
+	 *          ...checks page->mapping, via PageAnon(page) call,
+	 *            and incorrectly concludes that the page is an
+	 *            anonymous page. Therefore, it incorrectly,
+	 *            silently fails to set up the new anon rmap.
+	 *
+	 * For other types of ZONE_DEVICE pages, migration is either
+	 * handled differently or not done at all, so there is no need
+	 * to clear page->mapping.
+	 */
+	page->mapping = NULL;
+	page->pgmap->ops->page_free(page);
 }
 EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 027/118] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2020-01-31  6:10 incoming Andrew Morton
                   ` (25 preceding siblings ...)
  2020-01-31  6:12 ` [patch 026/118] mm: Cleanup __put_devmap_managed_page() vs ->page_free() Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 028/118] goldish_pipe: rename local pin_user_pages() routine Andrew Morton
                   ` (90 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

An upcoming patch changes and complicates the refcounting and especially
the "put page" aspects of it.  In order to keep everything clean, refactor
the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(), and
  limit the functionality to "read only": return a bool, with no side
  effects.

* Add a new routine, put_devmap_managed_page(), to handle decrementing
  the refcount for ZONE_DEVICE pages.

* Change callers (just release_pages() and put_page()) to check
  page_is_devmap_managed() before calling the new
  put_devmap_managed_page() routine.  This is a performance point:
  put_page() is a hot path, so we need to avoid non- inline function calls
  where possible.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
  limit the functionality to unconditionally freeing a devmap page.

This is originally based on a separate patch by Ira Weiny, which applied
to an early version of the put_user_page() experiments.  Since then,
Jérôme Glisse suggested the refactoring described above.

Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |   18 +++++++++++++-----
 mm/memremap.c      |   15 +--------------
 mm/swap.c          |   27 ++++++++++++++++++++++++++-
 3 files changed, 40 insertions(+), 20 deletions(-)

--- a/include/linux/mm.h~mm-devmap-refactor-1-based-refcounting-for-zone_device-pages
+++ a/include/linux/mm.h
@@ -947,9 +947,10 @@ static inline bool is_zone_device_page(c
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	if (!static_branch_unlikely(&devmap_managed_key))
 		return false;
@@ -958,7 +959,6 @@ static inline bool put_devmap_managed_pa
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_FS_DAX:
-		__put_devmap_managed_page(page);
 		return true;
 	default:
 		break;
@@ -966,11 +966,17 @@ static inline bool put_devmap_managed_pa
 	return false;
 }
 
+void put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool put_devmap_managed_page(struct page *page)
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	return false;
 }
+
+static inline void put_devmap_managed_page(struct page *page)
+{
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1023,8 +1029,10 @@ static inline void put_page(struct page
 	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (put_devmap_managed_page(page))
+	if (page_is_devmap_managed(page)) {
+		put_devmap_managed_page(page);
 		return;
+	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
--- a/mm/memremap.c~mm-devmap-refactor-1-based-refcounting-for-zone_device-pages
+++ a/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsi
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-	int count = page_ref_dec_return(page);
-
-	/* still busy */
-	if (count > 1)
-		return;
-
-	/* only triggered by the dev_pagemap shutdown path */
-	if (count == 0) {
-		__put_page(page);
-		return;
-	}
-
 	/* notify page idle for dax */
 	if (!is_device_private_page(page)) {
 		wake_up_var(&page->_refcount);
@@ -461,5 +449,4 @@ void __put_devmap_managed_page(struct pa
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
--- a/mm/swap.c~mm-devmap-refactor-1-based-refcounting-for-zone_device-pages
+++ a/mm/swap.c
@@ -813,8 +813,10 @@ void release_pages(struct page **pages,
 			 * processing, and instead, expect a call to
 			 * put_page_testzero().
 			 */
-			if (put_devmap_managed_page(page))
+			if (page_is_devmap_managed(page)) {
+				put_devmap_managed_page(page);
 				continue;
+			}
 		}
 
 		page = compound_head(page);
@@ -1102,3 +1104,26 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void put_devmap_managed_page(struct page *page)
+{
+	int count;
+
+	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
+		return;
+
+	count = page_ref_dec_return(page);
+
+	/*
+	 * devmap page refcounts are 1-based, rather than 0-based: if
+	 * refcount is 1, then the page is free and the refcount is
+	 * stable because nobody holds a reference on the page.
+	 */
+	if (count == 1)
+		free_devmap_managed_page(page);
+	else if (!count)
+		__put_page(page);
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
+#endif
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 028/118] goldish_pipe: rename local pin_user_pages() routine
  2020-01-31  6:10 incoming Andrew Morton
                   ` (26 preceding siblings ...)
  2020-01-31  6:12 ` [patch 027/118] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 029/118] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM Andrew Morton
                   ` (89 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: goldish_pipe: rename local pin_user_pages() routine

Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "goldfish_pin_pages()".

An upcoming patch will introduce a global pin_user_pages() function.

Link: http://lkml.kernel.org/r/20200107224558.2362728-6-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/platform/goldfish/goldfish_pipe.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

--- a/drivers/platform/goldfish/goldfish_pipe.c~goldish_pipe-rename-local-pin_user_pages-routine
+++ a/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(i
 	}
 }
 
-static int pin_user_pages(unsigned long first_page,
-			  unsigned long last_page,
-			  unsigned int last_page_size,
-			  int is_write,
-			  struct page *pages[MAX_BUFFERS_PER_COMMAND],
-			  unsigned int *iter_last_page_size)
+static int goldfish_pin_pages(unsigned long first_page,
+			      unsigned long last_page,
+			      unsigned int last_page_size,
+			      int is_write,
+			      struct page *pages[MAX_BUFFERS_PER_COMMAND],
+			      unsigned int *iter_last_page_size)
 {
 	int ret;
 	int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct g
 	if (mutex_lock_interruptible(&pipe->lock))
 		return -ERESTARTSYS;
 
-	pages_count = pin_user_pages(first_page, last_page,
-				     last_page_size, is_write,
-				     pipe->pages, &iter_last_page_size);
+	pages_count = goldfish_pin_pages(first_page, last_page,
+					 last_page_size, is_write,
+					 pipe->pages, &iter_last_page_size);
 	if (pages_count < 0) {
 		mutex_unlock(&pipe->lock);
 		return pages_count;
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 029/118] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
  2020-01-31  6:10 incoming Andrew Morton
                   ` (27 preceding siblings ...)
  2020-01-31  6:12 ` [patch 028/118] goldish_pipe: rename local pin_user_pages() routine Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 030/118] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call Andrew Morton
                   ` (88 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

As it says in the updated comment in gup.c: current FOLL_LONGTERM behavior
is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS DAX check
requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all FOLL_LONGTERM
callers, but we can actually allow FOLL_LONGTERM callers that do not set
the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals of
get_user_pages_remote() and __gup_longterm_locked().  That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn calls
check_dax_vmas().  This check will then be removed from the VFIO call site
in a subsequent patch.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and to
Dan Williams for helping clarify the DAX refactoring.

Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |  174 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 92 insertions(+), 82 deletions(-)

--- a/mm/gup.c~mm-fix-get_user_pages_remotes-handling-of-foll_longterm
+++ a/mm/gup.c
@@ -1111,88 +1111,6 @@ static __always_inline long __get_user_p
 	return pages_done;
 }
 
-/*
- * get_user_pages_remote() - pin user pages in memory
- * @tsk:	the task_struct to use for page fault accounting, or
- *		NULL if faults are not to be recorded.
- * @mm:		mm_struct of target mm
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying lookup behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long. Or NULL, if caller
- *		only intends to ensure the pages are faulted in.
- * @vmas:	array of pointers to vmas corresponding to each page.
- *		Or NULL if the caller does not require them.
- * @locked:	pointer to lock flag indicating whether lock is held and
- *		subsequently whether VM_FAULT_RETRY functionality can be
- *		utilised. Lock must initially be held.
- *
- * Returns either number of pages pinned (which may be less than the
- * number requested), or an error. Details about the return value:
- *
- * -- If nr_pages is 0, returns 0.
- * -- If nr_pages is >0, but no pages were pinned, returns -errno.
- * -- If nr_pages is >0, and some pages were pinned, returns the number of
- *    pages pinned. Again, this may be less than nr_pages.
- *
- * The caller is responsible for releasing returned @pages, via put_page().
- *
- * @vmas are valid only as long as mmap_sem is held.
- *
- * Must be called with mmap_sem held for read or write.
- *
- * get_user_pages walks a process's page tables and takes a reference to
- * each struct page that each user address corresponds to at a given
- * instant. That is, it takes the page that would be accessed if a user
- * thread accesses the given user virtual address at that instant.
- *
- * This does not guarantee that the page exists in the user mappings when
- * get_user_pages returns, and there may even be a completely different
- * page there in some cases (eg. if mmapped pagecache has been invalidated
- * and subsequently re faulted). However it does guarantee that the page
- * won't be freed completely. And mostly callers simply care that the page
- * contains data that was valid *at some point in time*. Typically, an IO
- * or similar operation cannot guarantee anything stronger anyway because
- * locks can't be held over the syscall boundary.
- *
- * If gup_flags & FOLL_WRITE == 0, the page must not be written to. If the page
- * is written to, set_page_dirty (or set_page_dirty_lock, as appropriate) must
- * be called after the page is finished with, and before put_page is called.
- *
- * get_user_pages is typically used for fewer-copy IO operations, to get a
- * handle on the memory by some means other than accesses via the user virtual
- * addresses. The pages may be submitted for DMA to devices or accessed via
- * their kernel linear mapping (via the kmap APIs). Care should be taken to
- * use the correct cache flushing APIs.
- *
- * See also get_user_pages_fast, for performance critical applications.
- *
- * get_user_pages should be phased out in favor of
- * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
- * should use get_user_pages because it cannot pass
- * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
- */
-long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long start, unsigned long nr_pages,
-		unsigned int gup_flags, struct page **pages,
-		struct vm_area_struct **vmas, int *locked)
-{
-	/*
-	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas.  As there are no users of this flag in this call we simply
-	 * disallow this option for now.
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-		return -EINVAL;
-
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
-				       locked,
-				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
-}
-EXPORT_SYMBOL(get_user_pages_remote);
-
 /**
  * populate_vma_page_range() -  populate a range of pages in the vma.
  * @vma:   target vma
@@ -1627,6 +1545,98 @@ static __always_inline long __gup_longte
 #endif /* CONFIG_FS_DAX || CONFIG_CMA */
 
 /*
+ * get_user_pages_remote() - pin user pages in memory
+ * @tsk:	the task_struct to use for page fault accounting, or
+ *		NULL if faults are not to be recorded.
+ * @mm:		mm_struct of target mm
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying lookup behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long. Or NULL, if caller
+ *		only intends to ensure the pages are faulted in.
+ * @vmas:	array of pointers to vmas corresponding to each page.
+ *		Or NULL if the caller does not require them.
+ * @locked:	pointer to lock flag indicating whether lock is held and
+ *		subsequently whether VM_FAULT_RETRY functionality can be
+ *		utilised. Lock must initially be held.
+ *
+ * Returns either number of pages pinned (which may be less than the
+ * number requested), or an error. Details about the return value:
+ *
+ * -- If nr_pages is 0, returns 0.
+ * -- If nr_pages is >0, but no pages were pinned, returns -errno.
+ * -- If nr_pages is >0, and some pages were pinned, returns the number of
+ *    pages pinned. Again, this may be less than nr_pages.
+ *
+ * The caller is responsible for releasing returned @pages, via put_page().
+ *
+ * @vmas are valid only as long as mmap_sem is held.
+ *
+ * Must be called with mmap_sem held for read or write.
+ *
+ * get_user_pages walks a process's page tables and takes a reference to
+ * each struct page that each user address corresponds to at a given
+ * instant. That is, it takes the page that would be accessed if a user
+ * thread accesses the given user virtual address at that instant.
+ *
+ * This does not guarantee that the page exists in the user mappings when
+ * get_user_pages returns, and there may even be a completely different
+ * page there in some cases (eg. if mmapped pagecache has been invalidated
+ * and subsequently re faulted). However it does guarantee that the page
+ * won't be freed completely. And mostly callers simply care that the page
+ * contains data that was valid *at some point in time*. Typically, an IO
+ * or similar operation cannot guarantee anything stronger anyway because
+ * locks can't be held over the syscall boundary.
+ *
+ * If gup_flags & FOLL_WRITE == 0, the page must not be written to. If the page
+ * is written to, set_page_dirty (or set_page_dirty_lock, as appropriate) must
+ * be called after the page is finished with, and before put_page is called.
+ *
+ * get_user_pages is typically used for fewer-copy IO operations, to get a
+ * handle on the memory by some means other than accesses via the user virtual
+ * addresses. The pages may be submitted for DMA to devices or accessed via
+ * their kernel linear mapping (via the kmap APIs). Care should be taken to
+ * use the correct cache flushing APIs.
+ *
+ * See also get_user_pages_fast, for performance critical applications.
+ *
+ * get_user_pages should be phased out in favor of
+ * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
+ * should use get_user_pages because it cannot pass
+ * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
+ */
+long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long start, unsigned long nr_pages,
+		unsigned int gup_flags, struct page **pages,
+		struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * Parts of FOLL_LONGTERM behavior are incompatible with
+	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
+	 * vmas. However, this only comes up if locked is set, and there are
+	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
+	 * allow what we can.
+	 */
+	if (gup_flags & FOLL_LONGTERM) {
+		if (WARN_ON_ONCE(locked))
+			return -EINVAL;
+		/*
+		 * This will check the vmas (even if our vmas arg is NULL)
+		 * and return -ENOTSUPP if DAX isn't allowed in this case:
+		 */
+		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+					     vmas, gup_flags | FOLL_TOUCH |
+					     FOLL_REMOTE);
+	}
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked,
+				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+}
+EXPORT_SYMBOL(get_user_pages_remote);
+
+/*
  * This is the same as get_user_pages_remote(), just with a
  * less-flexible calling convention where we assume that the task
  * and mm being operated on are the current task's and don't allow
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 030/118] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
  2020-01-31  6:10 incoming Andrew Morton
                   ` (28 preceding siblings ...)
  2020-01-31  6:12 ` [patch 029/118] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 031/118] mm/gup: allow FOLL_FORCE for get_user_pages_fast() Andrew Morton
                   ` (87 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call

Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote().  Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnessary pair of calls that were releasing and
reacquiring the mmap_sem.  There is no need to avoid holding mmap_sem just
in order to call page_to_pfn().

Also, now that the the DAX check ("if a VMA is DAX, don't allow long term
pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site.  So
remove it.

Link: http://lkml.kernel.org/r/20200107224558.2362728-8-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/vfio/vfio_iommu_type1.c |   30 +++++-------------------------
 1 file changed, 5 insertions(+), 25 deletions(-)

--- a/drivers/vfio/vfio_iommu_type1.c~vfio-fix-foll_longterm-use-simplify-get_user_pages_remote-call
+++ a/drivers/vfio/vfio_iommu_type1.c
@@ -322,7 +322,6 @@ static int vaddr_get_pfn(struct mm_struc
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
-	struct vm_area_struct *vmas[1];
 	unsigned int flags = 0;
 	int ret;
 
@@ -330,33 +329,14 @@ static int vaddr_get_pfn(struct mm_struc
 		flags |= FOLL_WRITE;
 
 	down_read(&mm->mmap_sem);
-	if (mm == current->mm) {
-		ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-				     vmas);
-	} else {
-		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    vmas, NULL);
-		/*
-		 * The lifetime of a vaddr_get_pfn() page pin is
-		 * userspace-controlled. In the fs-dax case this could
-		 * lead to indefinite stalls in filesystem operations.
-		 * Disallow attempts to pin fs-dax pages via this
-		 * interface.
-		 */
-		if (ret > 0 && vma_is_fsdax(vmas[0])) {
-			ret = -EOPNOTSUPP;
-			put_page(page[0]);
-		}
-	}
-	up_read(&mm->mmap_sem);
-
+	ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+				    page, NULL, NULL);
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
-		return 0;
+		ret = 0;
+		goto done;
 	}
 
-	down_read(&mm->mmap_sem);
-
 	vaddr = untagged_addr(vaddr);
 
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
@@ -366,7 +346,7 @@ static int vaddr_get_pfn(struct mm_struc
 		if (is_invalid_reserved_pfn(*pfn))
 			ret = 0;
 	}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 031/118] mm/gup: allow FOLL_FORCE for get_user_pages_fast()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (29 preceding siblings ...)
  2020-01-31  6:12 ` [patch 030/118] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 032/118] IB/umem: use get_user_pages_fast() to pin DMA pages Andrew Morton
                   ` (86 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: allow FOLL_FORCE for get_user_pages_fast()

Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast(). 
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation: if
you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE. 
There is nothing in the _fast() implementation that requires that we avoid
writing to the pages.  So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().

Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com
Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/gup.c~mm-gup-allow-foll_force-for-get_user_pages_fast
+++ a/mm/gup.c
@@ -2411,7 +2411,8 @@ int get_user_pages_fast(unsigned long st
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
 
-	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
+				       FOLL_FORCE)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 032/118] IB/umem: use get_user_pages_fast() to pin DMA pages
  2020-01-31  6:10 incoming Andrew Morton
                   ` (30 preceding siblings ...)
  2020-01-31  6:12 ` [patch 031/118] mm/gup: allow FOLL_FORCE for get_user_pages_fast() Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 033/118] media/v4l2-core: set pages dirty upon releasing DMA buffers Andrew Morton
                   ` (85 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: IB/umem: use get_user_pages_fast() to pin DMA pages

And get rid of the mmap_sem calls as part of that.  Note that
get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.
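
A condensed sketch of the resulting call shape (hypothetical helper;
names mirror the diff below, and the gup flags are hard-coded here
purely for illustration):

	static long pin_chunk(unsigned long cur_base, unsigned long npages,
			      struct page **page_list)
	{
		/*
		 * No down_read(&mm->mmap_sem) around the call: the fast
		 * path is lockless, and the slow-path fallback takes the
		 * semaphore internally when it needs it.
		 */
		return get_user_pages_fast(cur_base,
					   min_t(unsigned long, npages,
						 PAGE_SIZE / sizeof(struct page *)),
					   FOLL_WRITE | FOLL_LONGTERM,
					   page_list);
	}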

Link: http://lkml.kernel.org/r/20200107224558.2362728-10-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/infiniband/core/umem.c |   17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

--- a/drivers/infiniband/core/umem.c~ib-umem-use-get_user_pages_fast-to-pin-dma-pages
+++ a/drivers/infiniband/core/umem.c
@@ -257,16 +257,13 @@ struct ib_umem *ib_umem_get(struct ib_de
 	sg = umem->sg_head.sgl;
 
 	while (npages) {
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(cur_base,
-				     min_t(unsigned long, npages,
-					   PAGE_SIZE / sizeof (struct page *)),
-				     gup_flags | FOLL_LONGTERM,
-				     page_list, NULL);
-		if (ret < 0) {
-			up_read(&mm->mmap_sem);
+		ret = get_user_pages_fast(cur_base,
+					  min_t(unsigned long, npages,
+						PAGE_SIZE /
+						sizeof(struct page *)),
+					  gup_flags | FOLL_LONGTERM, page_list);
+		if (ret < 0)
 			goto umem_release;
-		}
 
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
@@ -274,8 +271,6 @@ struct ib_umem *ib_umem_get(struct ib_de
 		sg = ib_umem_add_sg_table(sg, page_list, ret,
 			dma_get_max_seg_size(device->dma_device),
 			&umem->sg_nents);

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 033/118] media/v4l2-core: set pages dirty upon releasing DMA buffers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (31 preceding siblings ...)
  2020-01-31  6:12 ` [patch 032/118] IB/umem: use get_user_pages_fast() to pin DMA pages Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 034/118] mm/gup: introduce pin_user_pages*() and FOLL_PIN Andrew Morton
                   ` (84 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, stable, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: media/v4l2-core: set pages dirty upon releasing DMA buffers

After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was coming
from the device.  However, this driver was just issuing a bare put_page()
call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages were
potentially receiving data from the device.
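
In general form, the release pattern being adopted looks like this (a
sketch only; the helper name and the from_device flag are made-up
stand-ins for the driver's dma->direction check):

	static void put_dma_pages(struct page **pages, int nr,
				  bool from_device)
	{
		int i;

		for (i = 0; i < nr; i++) {
			if (from_device)
				/* the device wrote to the page: mark it dirty */
				set_page_dirty_lock(pages[i]);
			put_page(pages[i]);
		}
	}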

Link: http://lkml.kernel.org/r/20200107224558.2362728-11-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/media/v4l2-core/videobuf-dma-sg.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~media-v4l2-core-set-pages-dirty-upon-releasing-dma-buffers
+++ a/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dm
 	BUG_ON(dma->sglen);
 
 	if (dma->pages) {
-		for (i = 0; i < dma->nr_pages; i++)
+		for (i = 0; i < dma->nr_pages; i++) {
+			if (dma->direction == DMA_FROM_DEVICE)
+				set_page_dirty_lock(dma->pages[i]);
 			put_page(dma->pages[i]);
+		}
 		kfree(dma->pages);
 		dma->pages = NULL;
 	}
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 034/118] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2020-01-31  6:10 incoming Andrew Morton
                   ` (32 preceding siblings ...)
  2020-01-31  6:12 ` [patch 033/118] media/v4l2-core: set pages dirty upon releasing DMA buffers Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:12 ` [patch 035/118] goldish_pipe: convert to pin_user_pages() and put_user_page() Andrew Morton
                   ` (83 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: introduce pin_user_pages*() and FOLL_PIN

Introduce pin_user_pages*() variations of the get_user_pages*() calls.

For now, these are placeholder calls, until the various call sites are
converted to use the correct get_user_pages*() or pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also introduced,
and thoroughly documented.

    pin_user_pages()
    pin_user_pages_remote()
    pin_user_pages_fast()

All pages that are pinned via the above calls must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
  set it.  That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN.  These wrappers
  will:

    * Start with "pin_user_pages" instead of "get_user_pages".  That
      makes it easy to find and audit the call sites.

    * Set FOLL_PIN

* Pages that are received via FOLL_PIN must be returned via
  put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in this
documentation.  (I've reworded it and expanded upon it.)
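
As a concrete sketch of the acquire/release pairing that these rules
imply (hypothetical caller, not part of this patch):

	static int dio_pin_one_page(unsigned long uaddr)
	{
		struct page *page;
		int ret;

		ret = pin_user_pages_fast(uaddr, 1, FOLL_WRITE, &page);
		if (ret != 1)
			return ret < 0 ? ret : -EFAULT;
		/* ... touch the page data, e.g. as a Direct IO buffer ... */
		put_user_page(page);	/* pinned pages: not put_page() */
		return 0;
	}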

Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[Documentation]
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/index.rst          |    1 
 Documentation/core-api/pin_user_pages.rst |  232 ++++++++++++++++++++
 include/linux/mm.h                        |   63 ++++-
 mm/gup.c                                  |  164 ++++++++++++--
 4 files changed, 426 insertions(+), 34 deletions(-)

--- a/Documentation/core-api/index.rst~mm-gup-introduce-pin_user_pages-and-foll_pin
+++ a/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
    generic-radix-tree
    memory-allocation
    mm-api
+   pin_user_pages
    gfp_mask-from-fs-io
    timekeeping
    boot-time-mm
--- /dev/null
+++ a/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+pin_user_pages() and related calls
+====================================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()
+
+Basic description of FOLL_PIN
+=============================
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pins pages by incrementing each page's
+refcount by a special value: for now +1, just like get_user_pages*().::
+
+ Function
+ --------
+ pin_user_pages          FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast     FOLL_PIN is always set internally by this function.
+ pin_user_pages_remote   FOLL_PIN is always set internally by this function.
+
+For these get_user_pages*() functions, FOLL_GET might not even be specified.
+Behavior is a little more complex than above. If FOLL_GET was *not* specified,
+but the caller passed in a non-null struct pages* array, then the function
+sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
+of each page by +1.::
+
+ Function
+ --------
+ get_user_pages           FOLL_GET is sometimes set internally by this function.
+ get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
+ get_user_pages_remote    FOLL_GET is sometimes set internally by this function.
+
+Tracking dma-pinned pages
+=========================
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+  multiple processes may pin and unpin a page.
+
+* False positives (reporting that a page is dma-pinned, when in fact it is not)
+  are acceptable, but false negatives are not.
+
+* struct page may not be increased in size for this, and all fields are already
+  used.
+
+* Given the above, we can overload the page->_refcount field by using, sort of,
+  the upper bits in that field for a dma-pinned count. "Sort of" means that,
+  rather than dividing page->_refcount into bit fields, we simply add a medium-
+  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
+  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
+  on it 1024 times, then it will appear to have a single dma-pinned count.
+  And again, that's acceptable.
+
+This also leads to limitations: there are only 31 - 10 == 21 bits available for a
+counter that increments 10 bits at a time.
+
+TODO: for 1GB and larger huge pages, this is cutting it close. That's because
+when pin_user_pages() follows such pages, it increments the head page by "1"
+(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
+pin_user_pages()) for each tail page. So if you have a 1GB huge page:
+
+* There are 256K (18 bits) worth of 4 KB tail pages.
+* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
+  10 bits at a time)
+* There are 21 - 18 == 3 bits available to count. Except that there aren't,
+  because you need to allow for a few normal get_page() calls on the head page,
+  as well. Fortunately, the approach of using addition, rather than "hard"
+  bitfields, within page->_refcount, allows for sharing these bits gracefully.
+  But we're still looking at about 8 references.
+
+This, however, is a missing feature more than anything else, because it's easily
+solved by addressing an obvious inefficiency in the original get_user_pages()
+approach of retrieving pages: stop treating all the pages as if they were
+PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
+this, so some work is required. Once that's in place, this limitation mostly
+disappears from view, because there will be ample refcounting range available.
+
+* Callers must specifically request "dma-pinned tracking of pages". In other
+  words, just calling get_user_pages() will not suffice; a new set of functions,
+  pin_user_pages() and related, must be used.
+
+FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
+==========================================================
+
+Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
+these categories:
+
+CASE 1: Direct IO (DIO)
+-----------------------
+There are GUP references to pages that are serving
+as DIO buffers. These buffers are needed for a relatively short time (so they
+are not "long term"). No special synchronization with page_mkclean() or
+munmap() is provided. Therefore, flags to set at the call site are: ::
+
+    FOLL_PIN
+
+...but rather than setting FOLL_PIN directly, call sites should use one of
+the pin_user_pages*() routines that set FOLL_PIN.
+
+CASE 2: RDMA
+------------
+There are GUP references to pages that are serving as DMA
+buffers. These buffers are needed for a long time ("long term"). No special
+synchronization with page_mkclean() or munmap() is provided. Therefore, flags
+to set at the call site are: ::
+
+    FOLL_PIN | FOLL_LONGTERM
+
+NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
+because DAX pages do not have a separate page cache, and so "pinning" implies
+locking down file system blocks, which is not (yet) supported in that way.
+
+CASE 3: Hardware with page faulting support
+-------------------------------------------
+Here, a well-written driver doesn't normally need to pin pages at all. However,
+if the driver does choose to do so, it can register MMU notifiers for the range,
+and will be called back upon invalidation. Either way (avoiding page pinning, or
+using MMU notifiers to unpin upon request), there is proper synchronization with
+both filesystem and mm (page_mkclean(), munmap(), etc).
+
+Therefore, neither flag needs to be set.
+
+In this case, ideally, neither get_user_pages() nor pin_user_pages() should be
+called. Instead, the software should be written so that it does not pin pages.
+This allows mm and filesystems to operate more efficiently and reliably.
+
+CASE 4: Pinning for struct page manipulation only
+-------------------------------------------------
+Here, normal GUP calls are sufficient, so neither flag needs to be set.
+
+page_dma_pinned(): the whole point of pinning
+=============================================
+
+The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
+to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
+(and file system writeback code in general) to make informed decisions about
+what to do when a page cannot be unmapped due to such pins.
+
+What to do in those cases is the subject of a years-long series of discussions
+and debates (see the References at the end of this document). It's a TODO item
+here: fill in the details once that's worked out. Meanwhile, it's safe to say
+that having this available: ::
+
+        static inline bool page_dma_pinned(struct page *page)
+
+...is a prerequisite to solving the long-running gup+DMA problem.
+
+Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
+===================================================================
+
+Another way of thinking about these flags is as a progression of restrictions:
+FOLL_GET is for struct page manipulation, without affecting the data that the
+struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
+short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
+a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
+restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
+will be pinned longterm, and whose data will be accessed.
+
+Unit testing
+============
+This file::
+
+ tools/testing/selftests/vm/gup_benchmark.c
+
+has the following new calls to exercise the new pin*() wrapper functions:
+
+* PIN_FAST_BENCHMARK (./gup_benchmark -a)
+* PIN_BENCHMARK (./gup_benchmark -b)
+
+You can monitor how many total dma-pinned pages have been acquired and released
+since the system was booted, via two new /proc/vmstat entries: ::
+
+    /proc/vmstat/nr_foll_pin_requested
+    /proc/vmstat/nr_foll_pin_returned
+
+Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
+because there is a noticeable performance drop in put_user_page() when these
+counters are activated.
+
+References
+==========
+
+* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
+* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
+* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+
+John Hubbard, October, 2019
--- a/include/linux/mm.h~mm-gup-introduce-pin_user_pages-and-foll_pin
+++ a/include/linux/mm.h
@@ -1042,16 +1042,14 @@ static inline void put_page(struct page
  * put_user_page() - release a gup-pinned page
  * @page:            pointer to page to be released
  *
- * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * put_user_page(), or one of the put_user_pages*() routines. This is so that
+ * eventually such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
  *
  * put_user_page() and put_page() are not interchangeable, despite this early
  * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with get_user_page() calls.
+ * be perfectly matched up with pin*() calls.
  */
 static inline void put_user_page(struct page *page)
 {
@@ -1509,9 +1507,16 @@ long get_user_pages_remote(struct task_s
 			    unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas, int *locked);
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked);
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas);
 long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
@@ -1519,6 +1524,8 @@ long get_user_pages_unlocked(unsigned lo
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
@@ -2583,13 +2590,15 @@ struct page *follow_page(struct vm_area_
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
 
 /*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
  *
  * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control.  This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control.  This is in contrast to
+ * iov_iter_get_pages(), whose usages are transient.
  *
  * FIXME: For pages which are part of a filesystem, mappings are subject to the
  * lifetime enforced by the filesystem and we need guarantees that longterm
@@ -2604,11 +2613,39 @@ struct page *follow_page(struct vm_area_
  * Currently only get_user_pages() and get_user_pages_fast() support this flag
  * and calls to get_user_pages_[un]locked are specifically not allowed.  This
  * is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY
+ * FAULT_FLAG_ALLOW_RETRY.
  *
- * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
- * that region.  And so CMA attempts to migrate the page before pinning when
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region.  And so, CMA attempts to migrate the page before pinning, when
  * FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
+ * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
+ * a call to put_user_page().
+ *
+ * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
+ * and separate refcounting mechanisms, however, and that means that each has
+ * its own acquire and release mechanisms:
+ *
+ *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
+ *
+ *     FOLL_PIN: pin_user_pages*() to acquire, and put_user_page*() to release.
+ *
+ * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
+ * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
+ * calls applied to them, and that's perfectly OK. This is a constraint on the
+ * callers, not on the pages.)
+ *
+ * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
+ * directly by the caller. That's in order to help avoid mismatches when
+ * releasing pages: get_user_pages*() pages must be released via put_page(),
+ * while pin_user_pages*() pages must be released via put_user_page().
+ *
+ * Please see Documentation/core-api/pin_user_pages.rst for more information.
  */
 
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
--- a/mm/gup.c~mm-gup-introduce-pin_user_pages-and-foll_pin
+++ a/mm/gup.c
@@ -194,6 +194,10 @@ static struct page *follow_page_pte(stru
 	spinlock_t *ptl;
 	pte_t *ptep, pte;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return ERR_PTR(-EINVAL);
 retry:
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
@@ -811,7 +815,7 @@ static long __get_user_pages(struct task
 
 	start = untagged_addr(start);
 
-	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
+	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
 
 	/*
 	 * If FOLL_FORCE is set then do not force a full fault as the hinting
@@ -1035,7 +1039,16 @@ static __always_inline long __get_user_p
 		BUG_ON(*locked != 1);
 	}
 
-	if (pages)
+	/*
+	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
+	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
+	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
+	 * for FOLL_GET, not for the newer FOLL_PIN.
+	 *
+	 * FOLL_PIN always expects pages to be non-null, but no need to assert
+	 * that here, as any failures will be obvious enough.
+	 */
+	if (pages && !(flags & FOLL_PIN))
 		flags |= FOLL_GET;
 
 	pages_done = 0;
@@ -1606,12 +1619,20 @@ static __always_inline long __gup_longte
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
+#ifdef CONFIG_MMU
 long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *locked)
 {
 	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
+	/*
 	 * Parts of FOLL_LONGTERM behavior are incompatible with
 	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
 	 * vmas. However, this only comes up if locked is set, and there are
@@ -1636,6 +1657,16 @@ long get_user_pages_remote(struct task_s
 }
 EXPORT_SYMBOL(get_user_pages_remote);
 
+#else /* CONFIG_MMU */
+long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	return 0;
+}
+#endif /* !CONFIG_MMU */
+
 /*
  * This is the same as get_user_pages_remote(), just with a
  * less-flexible calling convention where we assume that the task
@@ -1647,6 +1678,13 @@ long get_user_pages(unsigned long start,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas)
 {
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
 	return __gup_longterm_locked(current, current->mm, start, nr_pages,
 				     pages, vmas, gup_flags | FOLL_TOUCH);
 }
@@ -2389,30 +2427,15 @@ static int __gup_longterm_unlocked(unsig
 	return ret;
 }
 
-/**
- * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
- *
- * Attempt to pin user pages in memory without taking mm->mmap_sem.
- * If not successful, it will fall back to taking the lock and
- * calling get_user_pages().
- *
- * Returns number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno.
- */
-int get_user_pages_fast(unsigned long start, int nr_pages,
-			unsigned int gup_flags, struct page **pages)
+static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+					unsigned int gup_flags,
+					struct page **pages)
 {
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-				       FOLL_FORCE)))
+				       FOLL_FORCE | FOLL_PIN)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2452,4 +2475,103 @@ int get_user_pages_fast(unsigned long st
 
 	return ret;
 }
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying pin behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number requested.
+ * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns
+ * -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
+
+/**
+ * pin_user_pages_fast() - pin user pages in memory without taking locks
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages_fast().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/core-api/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast);
+
+/**
+ * pin_user_pages_remote() - pin pages of a remote process (task != current)
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages_remote().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/core-api/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
+				     vmas, locked);
+}
+EXPORT_SYMBOL(pin_user_pages_remote);
+
+/**
+ * pin_user_pages() - pin user pages in memory for use by other devices
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/core-api/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+EXPORT_SYMBOL(pin_user_pages);
_
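
As a worked illustration of the GUP_PIN_COUNTING_BIAS arithmetic that
the new document describes (a sketch only: page_dma_pinned() is the
stated future direction, not code in this patch, and the threshold test
is an assumption):

	#define GUP_PIN_COUNTING_BIAS	(1 << 10)	/* 1024: 10 bits */

	static inline bool page_dma_pinned_sketch(struct page *page)
	{
		/*
		 * Fuzzy by design: 1024 plain get_page() references are
		 * indistinguishable from one FOLL_PIN pin, and that is
		 * an accepted false positive.
		 */
		return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS;
	}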

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 035/118] goldfish_pipe: convert to pin_user_pages() and put_user_page()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (33 preceding siblings ...)
  2020-01-31  6:12 ` [patch 034/118] mm/gup: introduce pin_user_pages*() and FOLL_PIN Andrew Morton
@ 2020-01-31  6:12 ` Andrew Morton
  2020-01-31  6:13 ` [patch 036/118] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP Andrew Morton
                   ` (82 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:12 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: goldfish_pipe: convert to pin_user_pages() and put_user_page()

1. Call the new global pin_user_pages_fast() from
   goldfish_pin_pages().

2. As required by pin_user_pages(), release these pages via
   put_user_page().  In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead of
set_page_dirty().  This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified: the page[]
loop now lives in gup.c instead of here, so the local
release_user_pages() is deleted entirely and put_user_pages_dirty_lock()
is called directly instead.

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de
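
Roughly, the new release call collapses the old per-page loop into a
single line (a sketch; variable names as in the driver, with the dirty
condition taken from the deleted loop):

	/* Dirty the pages only if data flowed from the device to the CPU. */
	put_user_pages_dirty_lock(pages, pages_count,
				  !is_write && consumed_size > 0);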

Link: http://lkml.kernel.org/r/20200107224558.2362728-13-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/platform/goldfish/goldfish_pipe.c |   17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

--- a/drivers/platform/goldfish/goldfish_pipe.c~goldish_pipe-convert-to-pin_user_pages-and-put_user_page
+++ a/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int goldfish_pin_pages(unsigned l
 		*iter_last_page_size = last_page_size;
 	}
 
-	ret = get_user_pages_fast(first_page, requested_pages,
+	ret = pin_user_pages_fast(first_page, requested_pages,
 				  !is_write ? FOLL_WRITE : 0,
 				  pages);
 	if (ret <= 0)
@@ -285,18 +285,6 @@ static int goldfish_pin_pages(unsigned l
 	return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-			       int is_write, s32 consumed_size)
-{
-	int i;
-
-	for (i = 0; i < pages_count; i++) {
-		if (!is_write && consumed_size > 0)
-			set_page_dirty(pages[i]);
-		put_page(pages[i]);
-	}
-}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 036/118] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
  2020-01-31  6:10 incoming Andrew Morton
                   ` (34 preceding siblings ...)
  2020-01-31  6:12 ` [patch 035/118] goldfish_pipe: convert to pin_user_pages() and put_user_page() Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 037/118] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() Andrew Morton
                   ` (81 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP

Convert infiniband to use the new pin_user_pages*() calls.

Also, revert earlier changes to Infiniband ODP that had it using
put_user_page().  ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() is the API to use there.

The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag.  The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.
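
A sketch of the pairing after this conversion (illustrative fragment;
"dirty" stands for whatever per-driver condition applies):

	ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
	if (ret > 0) {
		/* ... set up and run DMA on pages[0..ret-1] ... */
		put_user_pages_dirty_lock(pages, ret, dirty);
	}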

Link: http://lkml.kernel.org/r/20200107224558.2362728-14-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/infiniband/core/umem.c              |    2 +-
 drivers/infiniband/core/umem_odp.c          |   13 ++++++-------
 drivers/infiniband/hw/hfi1/user_pages.c     |    2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |    2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |    2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |    2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |    2 +-
 drivers/infiniband/sw/siw/siw_mem.c         |    2 +-
 8 files changed, 13 insertions(+), 14 deletions(-)

--- a/drivers/infiniband/core/umem.c~ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp
+++ a/drivers/infiniband/core/umem.c
@@ -257,7 +257,7 @@ struct ib_umem *ib_umem_get(struct ib_de
 	sg = umem->sg_head.sgl;
 
 	while (npages) {
-		ret = get_user_pages_fast(cur_base,
+		ret = pin_user_pages_fast(cur_base,
 					  min_t(unsigned long, npages,
 						PAGE_SIZE /
 						sizeof(struct page *)),
--- a/drivers/infiniband/core/umem_odp.c~ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp
+++ a/drivers/infiniband/core/umem_odp.c
@@ -293,9 +293,8 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  * The function returns -EFAULT if the DMA mapping operation fails. It returns
  * -EAGAIN if a concurrent invalidation prevents us from updating the page.
  *
- * The page is released via put_user_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
+ * The page is released via put_page even if the operation failed. For on-demand
+ * pinning, the page is released whenever it isn't stored in the umem.
  */
 static int ib_umem_odp_map_dma_single_page(
 		struct ib_umem_odp *umem_odp,
@@ -348,7 +347,7 @@ static int ib_umem_odp_map_dma_single_pa
 	}
 
 out:
-	put_user_page(page);
+	put_page(page);
 	return ret;
 }
 
@@ -458,7 +457,7 @@ int ib_umem_odp_map_dma_pages(struct ib_
 					ret = -EFAULT;
 					break;
 				}
-				put_user_page(local_page_list[j]);
+				put_page(local_page_list[j]);
 				continue;
 			}
 
@@ -485,8 +484,8 @@ int ib_umem_odp_map_dma_pages(struct ib_
 			 * ib_umem_odp_map_dma_single_page().
 			 */
 			if (npages - (j + 1) > 0)
-				put_user_pages(&local_page_list[j+1],
-					       npages - (j + 1));
+				release_pages(&local_page_list[j+1],
+					      npages - (j + 1));
 			break;
 		}
 	}
--- a/drivers/infiniband/hw/hfi1/user_pages.c~ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp
+++ a/drivers/infiniband/hw/hfi1/user_pages.c
@@ -106,7 +106,7 @@ int hfi1_acquire_user_pages(struct mm_st
 	int ret;
 	unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
 
-	ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+	ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
 	if (ret < 0)
 		return ret;
 
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c~ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp
+++ a/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,7 +472,7 @@ int mthca_map_user_db(struct mthca_dev *
 		goto out;
 	}
 
-	ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
+	ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1,
 				  FOLL_WRITE | FOLL_LONGTERM, pages);
 	if (ret < 0)
 		goto out;
--- a/drivers/infiniband/hw/qib/qib_user_pages.c~ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp
+++ a/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -108,7 +108,7 @@ int qib_get_user_pages(unsigned long sta
 
 	down_read(&current->mm->mmap_sem);
 	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages(start_page + got * PAGE_SIZE,
+		ret = pin_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got,
 				     FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
 				     p + got, NULL);
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c~ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp
+++ a/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -670,7 +670,7 @@ static int qib_user_sdma_pin_pages(const
 		else
 			j = npages;
 
-		ret = get_user_pages_fast(addr, j, FOLL_LONGTERM, pages);
+		ret = pin_user_pages_fast(addr, j, FOLL_LONGTERM, pages);
 		if (ret != j) {
 			i = 0;
 			j = ret;
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c~ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp
+++ a/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -141,7 +141,7 @@ static int usnic_uiom_get_pages(unsigned
 	ret = 0;
 
 	while (npages) {
-		ret = get_user_pages(cur_base,
+		ret = pin_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 				     PAGE_SIZE / sizeof(struct page *)),
 				     gup_flags | FOLL_LONGTERM,
--- a/drivers/infiniband/sw/siw/siw_mem.c~ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp
+++ a/drivers/infiniband/sw/siw/siw_mem.c
@@ -426,7 +426,7 @@ struct siw_umem *siw_umem_get(u64 start,
 		while (nents) {
 			struct page **plist = &umem->page_chunk[i].plist[got];
 
-			rv = get_user_pages(first_page_va, nents,
+			rv = pin_user_pages(first_page_va, nents,
 					    foll_flags | FOLL_LONGTERM,
 					    plist, NULL);
 			if (rv < 0)
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 037/118] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (35 preceding siblings ...)
  2020-01-31  6:13 ` [patch 036/118] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 038/118] drm/via: set FOLL_PIN via pin_user_pages_fast() Andrew Morton
                   ` (80 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

Convert process_vm_access to use the new pin_user_pages_remote() call,
which sets FOLL_PIN.  Setting FOLL_PIN is now required for code that
requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for easier reading
of process_vm_rw_single_vec().

Link: http://lkml.kernel.org/r/20200107224558.2362728-15-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/process_vm_access.c |   28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

--- a/mm/process_vm_access.c~mm-process_vm_access-set-foll_pin-via-pin_user_pages_remote
+++ a/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct pa
 		if (copy > len)
 			copy = len;
 
-		if (vm_write) {
+		if (vm_write)
 			copied = copy_page_from_iter(page, offset, copy, iter);
-			set_page_dirty_lock(page);
-		} else {
+		else
 			copied = copy_page_to_iter(page, offset, copy, iter);
-		}
+
 		len -= copied;
 		if (copied < copy && iov_iter_count(iter))
 			return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsi
 		flags |= FOLL_WRITE;
 
 	while (!rc && nr_pages && iov_iter_count(iter)) {
-		int pages = min(nr_pages, max_pages_per_loop);
+		int pinned_pages = min(nr_pages, max_pages_per_loop);
 		int locked = 1;
 		size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsi
 		 * current/current->mm
 		 */
 		down_read(&mm->mmap_sem);
-		pages = get_user_pages_remote(task, mm, pa, pages, flags,
-					      process_pages, NULL, &locked);
+		pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+						     flags, process_pages,
+						     NULL, &locked);
 		if (locked)
 			up_read(&mm->mmap_sem);
-		if (pages <= 0)
+		if (pinned_pages <= 0)
 			return -EFAULT;
 
-		bytes = pages * PAGE_SIZE - start_offset;
+		bytes = pinned_pages * PAGE_SIZE - start_offset;
 		if (bytes > len)
 			bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsi
 					 vm_write);
 		len -= bytes;
 		start_offset = 0;
-		nr_pages -= pages;
-		pa += pages * PAGE_SIZE;
-		while (pages)
-			put_page(process_pages[--pages]);
+		nr_pages -= pinned_pages;
+		pa += pinned_pages * PAGE_SIZE;
+
+		/* If vm_write is set, the pages need to be made dirty: */
+		put_user_pages_dirty_lock(process_pages, pinned_pages,
+					  vm_write);
 	}
 
 	return rc;
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 038/118] drm/via: set FOLL_PIN via pin_user_pages_fast()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (36 preceding siblings ...)
  2020-01-31  6:13 ` [patch 037/118] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 039/118] fs/io_uring: set FOLL_PIN via pin_user_pages() Andrew Morton
                   ` (79 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: drm/via: set FOLL_PIN via pin_user_pages_fast()

Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN.  Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page().  Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required is to
change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/20200107224558.2362728-16-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/gpu/drm/via/via_dmablit.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/gpu/drm/via/via_dmablit.c~drm-via-set-foll_pin-via-pin_user_pages_fast
+++ a/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t
 	vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
 	if (NULL == vsg->pages)
 		return -ENOMEM;
-	ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+	ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
 			vsg->num_pages,
 			vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
 			vsg->pages);
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 039/118] fs/io_uring: set FOLL_PIN via pin_user_pages()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (37 preceding siblings ...)
  2020-01-31  6:13 ` [patch 038/118] drm/via: set FOLL_PIN via pin_user_pages_fast() Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 040/118] net/xdp: " Andrew Morton
                   ` (78 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: fs/io_uring: set FOLL_PIN via pin_user_pages()

Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN.  Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page().  Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/20200107224558.2362728-17-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/io_uring.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/io_uring.c~fs-io_uring-set-foll_pin-via-pin_user_pages
+++ a/fs/io_uring.c
@@ -6126,7 +6126,7 @@ static int io_sqe_buffer_register(struct
 
 		ret = 0;
 		down_read(&current->mm->mmap_sem);
-		pret = get_user_pages(ubuf, nr_pages,
+		pret = pin_user_pages(ubuf, nr_pages,
 				      FOLL_WRITE | FOLL_LONGTERM,
 				      pages, vmas);
 		if (pret == nr_pages) {
_

* [patch 040/118] net/xdp: set FOLL_PIN via pin_user_pages()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (38 preceding siblings ...)
  2020-01-31  6:13 ` [patch 039/118] fs/io_uring: set FOLL_PIN via pin_user_pages() Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 041/118] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion Andrew Morton
                   ` (77 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: net/xdp: set FOLL_PIN via pin_user_pages()

Convert net/xdp to use the new pin_user_pages() call, which sets
FOLL_PIN.  Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already calling
put_user_page() instead of put_page().  Therefore, in order to convert
from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/20200107224558.2362728-18-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 net/xdp/xdp_umem.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/xdp/xdp_umem.c~net-xdp-set-foll_pin-via-pin_user_pages
+++ a/net/xdp/xdp_umem.c
@@ -291,7 +291,7 @@ static int xdp_umem_pin_pages(struct xdp
 		return -ENOMEM;
 
 	down_read(&current->mm->mmap_sem);
-	npgs = get_user_pages(umem->address, umem->npgs,
+	npgs = pin_user_pages(umem->address, umem->npgs,
 			      gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
 	up_read(&current->mm->mmap_sem);
 
_

* [patch 041/118] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  2020-01-31  6:10 incoming Andrew Morton
                   ` (39 preceding siblings ...)
  2020-01-31  6:13 ` [patch 040/118] net/xdp: " Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 042/118] vfio, mm: " Andrew Morton
                   ` (76 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion

1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().
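
For clarity, a sketch of the equivalence (illustrative helpers, not the
videobuf code):

  #include <linux/mm.h>

  /* Before: open-coded dirty-and-release loop for gup-pinned pages. */
  static void release_old_way(struct page **pages, int nr, bool dirty)
  {
          int i;

          for (i = 0; i < nr; i++) {
                  if (dirty)
                          set_page_dirty_lock(pages[i]);
                  put_page(pages[i]);
          }
  }

  /* After: one call that dirties and drops the FOLL_PIN references. */
  static void release_new_way(struct page **pages, int nr, bool dirty)
  {
          put_user_pages_dirty_lock(pages, nr, dirty);
  }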

Link: http://lkml.kernel.org/r/20200107224558.2362728-19-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/media/v4l2-core/videobuf-dma-sg.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~media-v4l2-core-pin_user_pages-foll_pin-and-put_user_page-conversion
+++ a/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+	err = pin_user_pages(data & PAGE_MASK, dma->nr_pages,
 			     flags | FOLL_LONGTERM, dma->pages, NULL);
 
 	if (err != dma->nr_pages) {
 		dma->nr_pages = (err >= 0) ? err : 0;
-		dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+		dprintk(1, "pin_user_pages: err=%d [%d]\n", err,
 			dma->nr_pages);
 		return err < 0 ? err : -EINVAL;
 	}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dm
 	BUG_ON(dma->sglen);
 
 	if (dma->pages) {
-		for (i = 0; i < dma->nr_pages; i++) {
-			if (dma->direction == DMA_FROM_DEVICE)
-				set_page_dirty_lock(dma->pages[i]);
-			put_page(dma->pages[i]);
-		}
+		put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+					  dma->direction == DMA_FROM_DEVICE);
 		kfree(dma->pages);
 		dma->pages = NULL;
 	}
_

* [patch 042/118] vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  2020-01-31  6:10 incoming Andrew Morton
                   ` (40 preceding siblings ...)
  2020-01-31  6:13 ` [patch 041/118] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 043/118] powerpc: book3s64: convert to pin_user_pages() and put_user_page() Andrew Morton
                   ` (75 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

1. Change vfio from get_user_pages_remote(), to
   pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
   put_user_page(), also convert the put_page() call over to
   put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty().  This is probably more
accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de
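
For reference, a minimal sketch of the remote variant used here (the
wrapper is illustrative, not the vfio code):

  #include <linux/mm.h>

  /* Pin one page from another process's mm, mmap_sem held for the call. */
  static int pin_one_remote_page(struct mm_struct *mm, unsigned long vaddr,
                                 bool write, struct page **page)
  {
          unsigned int flags = FOLL_LONGTERM | (write ? FOLL_WRITE : 0);
          long ret;

          down_read(&mm->mmap_sem);
          ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags,
                                      page, NULL, NULL);
          up_read(&mm->mmap_sem);

          return ret == 1 ? 0 : -EFAULT;
  }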

Link: http://lkml.kernel.org/r/20200107224558.2362728-20-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/vfio/vfio_iommu_type1.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/drivers/vfio/vfio_iommu_type1.c~vfio-mm-pin_user_pages-foll_pin-and-put_user_page-conversion
+++ a/drivers/vfio/vfio_iommu_type1.c
@@ -309,9 +309,8 @@ static int put_pfn(unsigned long pfn, in
 {
 	if (!is_invalid_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
-		if (prot & IOMMU_WRITE)
-			SetPageDirty(page);
-		put_page(page);
+
+		put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
 		return 1;
 	}
 	return 0;
@@ -329,7 +328,7 @@ static int vaddr_get_pfn(struct mm_struc
 		flags |= FOLL_WRITE;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+	ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
 				    page, NULL, NULL);
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
_

* [patch 043/118] powerpc: book3s64: convert to pin_user_pages() and put_user_page()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (41 preceding siblings ...)
  2020-01-31  6:13 ` [patch 042/118] vfio, mm: " Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 044/118] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" Andrew Morton
                   ` (74 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: powerpc: book3s64: convert to pin_user_pages() and put_user_page()

1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().
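
The error path is where the bulk helper pays off; a rough sketch of the
chunked-pin pattern (simplified, with an illustrative chunk size; the
caller is assumed to hold mmap_sem):

  #include <linux/mm.h>
  #include <linux/sizes.h>

  static long pin_in_chunks_sketch(unsigned long ua, unsigned long entries,
                                   struct page **pages)
  {
          unsigned long chunk = SZ_4M / sizeof(struct page *); /* illustrative */
          unsigned long entry, pinned = 0;
          long ret;

          for (entry = 0; entry < entries; entry += chunk) {
                  unsigned long n = min(entries - entry, chunk);

                  ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
                                       FOLL_WRITE | FOLL_LONGTERM,
                                       pages + entry, NULL);
                  if (ret != n) {
                          if (ret > 0)
                                  pinned += ret; /* partial chunk is pinned too */
                          /* release only the references actually taken */
                          put_user_pages(pages, pinned);
                          return ret < 0 ? ret : -EFAULT;
                  }
                  pinned += n;
          }
          return pinned;
  }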

Link: http://lkml.kernel.org/r/20200107224558.2362728-21-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/mm/book3s64/iommu_api.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/arch/powerpc/mm/book3s64/iommu_api.c~powerpc-book3s64-convert-to-pin_user_pages-and-put_user_page
+++ a/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_
 	for (entry = 0; entry < entries; entry += chunk) {
 		unsigned long n = min(entries - entry, chunk);
 
-		ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+		ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
 				FOLL_WRITE | FOLL_LONGTERM,
 				mem->hpages + entry, NULL);
 		if (ret == n) {
@@ -167,9 +167,8 @@ good_exit:
 	return 0;
 
 free_exit:
-	/* free the reference taken */
-	for (i = 0; i < pinned; i++)
-		put_page(mem->hpages[i]);
+	/* free the references taken */
+	put_user_pages(mem->hpages, pinned);
 
 	vfree(mem->hpas);
 	kfree(mem);
@@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct mm_iom
 		if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
 			SetPageDirty(page);
 
-		put_page(page);
+		put_user_page(page);
+
 		mem->hpas[i] = 0;
 	}
 }
_

* [patch 044/118] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"
  2020-01-31  6:10 incoming Andrew Morton
                   ` (42 preceding siblings ...)
  2020-01-31  6:13 ` [patch 043/118] powerpc: book3s64: convert to pin_user_pages() and put_user_page() Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 045/118] mm, tree-wide: rename put_user_page*() to unpin_user_page*() Andrew Morton
                   ` (73 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

Fix the gup benchmark flags to use the symbolic FOLL_WRITE, instead of a
hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing it once
before issuing any of the get_user_pages*() calls.  This makes it harder
to overlook, instead of having little "gup_flags & 1" phrases in the
function calls.
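
The filtering reduces to one mask applied up front; as a sketch
(illustrative helper):

  #include <linux/mm.h>

  /* Reduce user-supplied flags to the only bit allowed through. */
  static unsigned int sanitize_gup_flags(unsigned int user_flags)
  {
          return user_flags & FOLL_WRITE;
  }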

Link: http://lkml.kernel.org/r/20200107224558.2362728-22-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup_benchmark.c                         |    9 ++++++---
 tools/testing/selftests/vm/gup_benchmark.c |    6 +++++-
 2 files changed, 11 insertions(+), 4 deletions(-)

--- a/mm/gup_benchmark.c~mm-gup_benchmark-use-proper-foll_write-flags-instead-of-hard-coding-1
+++ a/mm/gup_benchmark.c
@@ -49,18 +49,21 @@ static int __gup_benchmark_ioctl(unsigne
 			nr = (next - addr) / PAGE_SIZE;
 		}
 
+		/* Filter out most gup flags: only allow a tiny subset here: */
+		gup->flags &= FOLL_WRITE;
+
 		switch (cmd) {
 		case GUP_FAST_BENCHMARK:
-			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+			nr = get_user_pages_fast(addr, nr, gup->flags,
 						 pages + i);
 			break;
 		case GUP_LONGTERM_BENCHMARK:
 			nr = get_user_pages(addr, nr,
-					    (gup->flags & 1) | FOLL_LONGTERM,
+					    gup->flags | FOLL_LONGTERM,
 					    pages + i, NULL);
 			break;
 		case GUP_BENCHMARK:
-			nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+			nr = get_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
 			break;
 		default:
--- a/tools/testing/selftests/vm/gup_benchmark.c~mm-gup_benchmark-use-proper-foll_write-flags-instead-of-hard-coding-1
+++ a/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,9 @@
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
 
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE	0x01	/* check pte is writable */
+
 struct gup_benchmark {
 	__u64 get_delta_usec;
 	__u64 put_delta_usec;
@@ -85,7 +88,8 @@ int main(int argc, char **argv)
 	}
 
 	gup.nr_pages_per_call = nr_pages;
-	gup.flags = write;
+	if (write)
+		gup.flags |= FOLL_WRITE;
 
 	fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
 	if (fd == -1)
_

* [patch 045/118] mm, tree-wide: rename put_user_page*() to unpin_user_page*()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (43 preceding siblings ...)
  2020-01-31  6:13 ` [patch 044/118] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 046/118] mm/swapfile.c: swap_next should increase position index Andrew Morton
                   ` (72 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, alex.williamson, aneesh.kumar, axboe, bjorn.topel, corbet,
	dan.j.williams, daniel.vetter, hch, hverkuil-cisco, ira.weiny,
	jack, jgg, jgg, jglisse, jhubbard, kirill, leonro, linux-mm,
	mchehab, mm-commits, rppt, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm, tree-wide: rename put_user_page*() to unpin_user_page*()

Rename these calls in order to provide a clearer, more symmetric API for
pinning and unpinning DMA pages.  This way, pin_user_pages*() calls match
up with unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.
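
With the rename applied, the pairing reads naturally; a minimal sketch
(the wrapper itself is illustrative):

  #include <linux/mm.h>

  /* The acquire and release names now mirror each other. */
  static int pin_unpin_sketch(unsigned long start, int nr,
                              struct page **pages)
  {
          int pinned = pin_user_pages_fast(start, nr, FOLL_WRITE, pages);

          if (pinned > 0)
                  unpin_user_pages(pages, pinned); /* not put_page() */
          return pinned;
  }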

Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/pin_user_pages.rst   |    2 -
 arch/powerpc/mm/book3s64/iommu_api.c        |    4 +-
 drivers/gpu/drm/via/via_dmablit.c           |    4 +-
 drivers/infiniband/core/umem.c              |    2 -
 drivers/infiniband/hw/hfi1/user_pages.c     |    2 -
 drivers/infiniband/hw/mthca/mthca_memfree.c |    6 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  |    2 -
 drivers/infiniband/hw/qib/qib_user_sdma.c   |    6 +--
 drivers/infiniband/hw/usnic/usnic_uiom.c    |    2 -
 drivers/infiniband/sw/siw/siw_mem.c         |    2 -
 drivers/media/v4l2-core/videobuf-dma-sg.c   |    4 +-
 drivers/platform/goldfish/goldfish_pipe.c   |    4 +-
 drivers/vfio/vfio_iommu_type1.c             |    2 -
 fs/io_uring.c                               |    4 +-
 include/linux/mm.h                          |   26 +++++++-------
 mm/gup.c                                    |   32 +++++++++---------
 mm/process_vm_access.c                      |    4 +-
 net/xdp/xdp_umem.c                          |    2 -
 18 files changed, 55 insertions(+), 55 deletions(-)

--- a/arch/powerpc/mm/book3s64/iommu_api.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/arch/powerpc/mm/book3s64/iommu_api.c
@@ -168,7 +168,7 @@ good_exit:
 
 free_exit:
 	/* free the references taken */
-	put_user_pages(mem->hpages, pinned);
+	unpin_user_pages(mem->hpages, pinned);
 
 	vfree(mem->hpas);
 	kfree(mem);
@@ -214,7 +214,7 @@ static void mm_iommu_unpin(struct mm_iom
 		if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
 			SetPageDirty(page);
 
-		put_user_page(page);
+		unpin_user_page(page);
 
 		mem->hpas[i] = 0;
 	}
--- a/Documentation/core-api/pin_user_pages.rst~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/Documentation/core-api/pin_user_pages.rst
@@ -219,7 +219,7 @@ since the system was booted, via two new
     /proc/vmstat/nr_foll_pin_requested
 
 Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in put_user_page(), when they
+because there is a noticeable performance drop in unpin_user_page(), when they
 are activated.
 
 References
--- a/drivers/gpu/drm/via/via_dmablit.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/gpu/drm/via/via_dmablit.c
@@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, d
 		kfree(vsg->desc_pages);
 		/* fall through */
 	case dr_via_pages_locked:
-		put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
-					  (vsg->direction == DMA_FROM_DEVICE));
+		unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+					   (vsg->direction == DMA_FROM_DEVICE));
 		/* fall through */
 	case dr_via_pages_alloc:
 		vfree(vsg->pages);
--- a/drivers/infiniband/core/umem.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/infiniband/core/umem.c
@@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_
 
 	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
 		page = sg_page_iter_page(&sg_iter);
-		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
+		unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
 	}
 
 	sg_free_table(&umem->sg_head);
--- a/drivers/infiniband/hw/hfi1/user_pages.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,7 +118,7 @@ int hfi1_acquire_user_pages(struct mm_st
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 			     size_t npages, bool dirty)
 {
-	put_user_pages_dirty_lock(p, npages, dirty);
+	unpin_user_pages_dirty_lock(p, npages, dirty);
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		atomic64_sub(npages, &mm->pinned_vm);
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -482,7 +482,7 @@ int mthca_map_user_db(struct mthca_dev *
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
 	if (ret < 0) {
-		put_user_page(pages[0]);
+		unpin_user_page(pages[0]);
 		goto out;
 	}
 
@@ -490,7 +490,7 @@ int mthca_map_user_db(struct mthca_dev *
 				 mthca_uarc_virt(dev, uar, i));
 	if (ret) {
 		pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-		put_user_page(sg_page(&db_tab->page[i].mem));
+		unpin_user_page(sg_page(&db_tab->page[i].mem));
 		goto out;
 	}
 
@@ -556,7 +556,7 @@ void mthca_cleanup_user_db_tab(struct mt
 		if (db_tab->page[i].uvirt) {
 			mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
 			pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-			put_user_page(sg_page(&db_tab->page[i].mem));
+			unpin_user_page(sg_page(&db_tab->page[i].mem));
 		}
 	}
 
--- a/drivers/infiniband/hw/qib/qib_user_pages.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,7 +40,7 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
 				     int dirty)
 {
-	put_user_pages_dirty_lock(p, num_pages, dirty);
+	unpin_user_pages_dirty_lock(p, num_pages, dirty);
 }
 
 /**
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -317,7 +317,7 @@ static int qib_user_sdma_page_to_frags(c
 		 * the caller can ignore this page.
 		 */
 		if (put) {
-			put_user_page(page);
+			unpin_user_page(page);
 		} else {
 			/* coalesce case */
 			kunmap(page);
@@ -631,7 +631,7 @@ static void qib_user_sdma_free_pkt_frag(
 			kunmap(pkt->addr[i].page);
 
 		if (pkt->addr[i].put_page)
-			put_user_page(pkt->addr[i].page);
+			unpin_user_page(pkt->addr[i].page);
 		else
 			__free_page(pkt->addr[i].page);
 	} else if (pkt->addr[i].kvaddr) {
@@ -706,7 +706,7 @@ static int qib_user_sdma_pin_pages(const
 	/* if error, return all pages not managed by pkt */
 free_pages:
 	while (i < j)
-		put_user_page(pages[i++]);
+		unpin_user_page(pages[i++]);
 
 done:
 	return ret;
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -75,7 +75,7 @@ static void usnic_uiom_put_pages(struct
 		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
 			page = sg_page(sg);
 			pa = sg_phys(sg);
-			put_user_pages_dirty_lock(&page, 1, dirty);
+			unpin_user_pages_dirty_lock(&page, 1, dirty);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
--- a/drivers/infiniband/sw/siw/siw_mem.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/infiniband/sw/siw/siw_mem.c
@@ -63,7 +63,7 @@ struct siw_mem *siw_mem_id2obj(struct si
 static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
 			   bool dirty)
 {
-	put_user_pages_dirty_lock(chunk->plist, num_pages, dirty);
+	unpin_user_pages_dirty_lock(chunk->plist, num_pages, dirty);
 }
 
 void siw_umem_release(struct siw_umem *umem, bool dirty)
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,8 @@ int videobuf_dma_free(struct videobuf_dm
 	BUG_ON(dma->sglen);
 
 	if (dma->pages) {
-		put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
-					  dma->direction == DMA_FROM_DEVICE);
+		unpin_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+					    dma->direction == DMA_FROM_DEVICE);
 		kfree(dma->pages);
 		dma->pages = NULL;
 	}
--- a/drivers/platform/goldfish/goldfish_pipe.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/platform/goldfish/goldfish_pipe.c
@@ -360,8 +360,8 @@ static int transfer_max_buffers(struct g
 
 	*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-	put_user_pages_dirty_lock(pipe->pages, pages_count,
-				  !is_write && *consumed_size > 0);
+	unpin_user_pages_dirty_lock(pipe->pages, pages_count,
+				    !is_write && *consumed_size > 0);
 
 	mutex_unlock(&pipe->lock);
 	return 0;
--- a/drivers/vfio/vfio_iommu_type1.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/drivers/vfio/vfio_iommu_type1.c
@@ -310,7 +310,7 @@ static int put_pfn(unsigned long pfn, in
 	if (!is_invalid_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
 
-		put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
+		unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
 		return 1;
 	}
 	return 0;
--- a/fs/io_uring.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/fs/io_uring.c
@@ -6005,7 +6005,7 @@ static int io_sqe_buffer_unregister(stru
 		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
 
 		for (j = 0; j < imu->nr_bvecs; j++)
-			put_user_page(imu->bvec[j].bv_page);
+			unpin_user_page(imu->bvec[j].bv_page);
 
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
@@ -6150,7 +6150,7 @@ static int io_sqe_buffer_register(struct
 			 * release any pages we did get
 			 */
 			if (pret > 0)
-				put_user_pages(pages, pret);
+				unpin_user_pages(pages, pret);
 			if (ctx->account_mem)
 				io_unaccount_mem(ctx->user, nr_pages);
 			kvfree(imu->bvec);
--- a/include/linux/mm.h~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/include/linux/mm.h
@@ -1039,27 +1039,27 @@ static inline void put_page(struct page
 }
 
 /**
- * put_user_page() - release a gup-pinned page
+ * unpin_user_page() - release a gup-pinned page
  * @page:            pointer to page to be released
  *
  * Pages that were pinned via pin_user_pages*() must be released via either
- * put_user_page(), or one of the put_user_pages*() routines. This is so that
- * eventually such pages can be separately tracked and uniquely handled. In
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that eventually such pages can be separately tracked and uniquely handled. In
  * particular, interactions with RDMA and filesystems need special handling.
  *
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
+ * unpin_user_page() and put_page() are not interchangeable, despite this early
+ * implementation that makes them look the same. unpin_user_page() calls must
  * be perfectly matched up with pin*() calls.
  */
-static inline void put_user_page(struct page *page)
+static inline void unpin_user_page(struct page *page)
 {
 	put_page(page);
 }
 
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
-			       bool make_dirty);
+void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+				 bool make_dirty);
 
-void put_user_pages(struct page **pages, unsigned long npages);
+void unpin_user_pages(struct page **pages, unsigned long npages);
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
@@ -2590,7 +2590,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
-#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
+#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
@@ -2625,7 +2625,7 @@ struct page *follow_page(struct vm_area_
  * Direct IO). This lets the filesystem know that some non-file-system entity is
  * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
  * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
- * a call to put_user_page().
+ * a call to unpin_user_page().
  *
  * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
  * and separate refcounting mechanisms, however, and that means that each has
@@ -2633,7 +2633,7 @@ struct page *follow_page(struct vm_area_
  *
  *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
  *
- *     FOLL_PIN: pin_user_pages*() to acquire, and put_user_pages to release.
+ *     FOLL_PIN: pin_user_pages*() to acquire, and unpin_user_pages to release.
  *
  * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
  * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
@@ -2643,7 +2643,7 @@ struct page *follow_page(struct vm_area_
  * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
  * directly by the caller. That's in order to help avoid mismatches when
  * releasing pages: get_user_pages*() pages must be released via put_page(),
- * while pin_user_pages*() pages must be released via put_user_page().
+ * while pin_user_pages*() pages must be released via unpin_user_page().
  *
  * Please see Documentation/vm/pin_user_pages.rst for more information.
  */
--- a/mm/gup.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/mm/gup.c
@@ -45,7 +45,7 @@ static inline struct page *try_get_compo
 }
 
 /**
- * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
+ * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
  * @npages: number of pages in the @pages array.
  * @make_dirty: whether to mark the pages dirty
@@ -55,19 +55,19 @@ static inline struct page *try_get_compo
  *
  * For each page in the @pages array, make that page (or its head page, if a
  * compound page) dirty, if @make_dirty is true, and if the page was previously
- * listed as clean. In any case, releases all pages using put_user_page(),
- * possibly via put_user_pages(), for the non-dirty case.
+ * listed as clean. In any case, releases all pages using unpin_user_page(),
+ * possibly via unpin_user_pages(), for the non-dirty case.
  *
- * Please see the put_user_page() documentation for details.
+ * Please see the unpin_user_page() documentation for details.
  *
  * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
  * required, then the caller should a) verify that this is really correct,
  * because _lock() is usually required, and b) hand code it:
- * set_page_dirty_lock(), put_user_page().
+ * set_page_dirty_lock(), unpin_user_page().
  *
  */
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
-			       bool make_dirty)
+void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+				 bool make_dirty)
 {
 	unsigned long index;
 
@@ -78,7 +78,7 @@ void put_user_pages_dirty_lock(struct pa
 	 */
 
 	if (!make_dirty) {
-		put_user_pages(pages, npages);
+		unpin_user_pages(pages, npages);
 		return;
 	}
 
@@ -106,21 +106,21 @@ void put_user_pages_dirty_lock(struct pa
 		 */
 		if (!PageDirty(page))
 			set_page_dirty_lock(page);
-		put_user_page(page);
+		unpin_user_page(page);
 	}
 }
-EXPORT_SYMBOL(put_user_pages_dirty_lock);
+EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 
 /**
- * put_user_pages() - release an array of gup-pinned pages.
+ * unpin_user_pages() - release an array of gup-pinned pages.
  * @pages:  array of pages to be marked dirty and released.
  * @npages: number of pages in the @pages array.
  *
- * For each page in the @pages array, release the page using put_user_page().
+ * For each page in the @pages array, release the page using unpin_user_page().
  *
- * Please see the put_user_page() documentation for details.
+ * Please see the unpin_user_page() documentation for details.
  */
-void put_user_pages(struct page **pages, unsigned long npages)
+void unpin_user_pages(struct page **pages, unsigned long npages)
 {
 	unsigned long index;
 
@@ -130,9 +130,9 @@ void put_user_pages(struct page **pages,
 	 * single operation to the head page should suffice.
 	 */
 	for (index = 0; index < npages; index++)
-		put_user_page(pages[index]);
+		unpin_user_page(pages[index]);
 }
-EXPORT_SYMBOL(put_user_pages);
+EXPORT_SYMBOL(unpin_user_pages);
 
 #ifdef CONFIG_MMU
 static struct page *no_page_table(struct vm_area_struct *vma,
--- a/mm/process_vm_access.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/mm/process_vm_access.c
@@ -126,8 +126,8 @@ static int process_vm_rw_single_vec(unsi
 		pa += pinned_pages * PAGE_SIZE;
 
 		/* If vm_write is set, the pages need to be made dirty: */
-		put_user_pages_dirty_lock(process_pages, pinned_pages,
-					  vm_write);
+		unpin_user_pages_dirty_lock(process_pages, pinned_pages,
+					    vm_write);
 	}
 
 	return rc;
--- a/net/xdp/xdp_umem.c~mm-tree-wide-rename-put_user_page-to-unpin_user_page
+++ a/net/xdp/xdp_umem.c
@@ -212,7 +212,7 @@ static int xdp_umem_map_pages(struct xdp
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
-	put_user_pages_dirty_lock(umem->pgs, umem->npgs, true);
+	unpin_user_pages_dirty_lock(umem->pgs, umem->npgs, true);
 
 	kfree(umem->pgs);
 	umem->pgs = NULL;
_

* [patch 046/118] mm/swapfile.c: swap_next should increase position index
  2020-01-31  6:10 incoming Andrew Morton
                   ` (44 preceding siblings ...)
  2020-01-31  6:13 ` [patch 045/118] mm, tree-wide: rename put_user_page*() to unpin_user_page*() Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 047/118] mm/memcontrol.c: cleanup some useless code Andrew Morton
                   ` (71 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, hughd, jannh, keescook, linux-mm, mm-commits, torvalds, viro, vvs

From: Vasily Averin <vvs@virtuozzo.com>
Subject: mm/swapfile.c: swap_next should increase position index

If a seq_file .next function does not change the position index, a read
after some lseek can generate unexpected output.

In Aug 2018 NeilBrown noticed commit 1f4aace60b0e ("fs/seq_file.c:
simplify seq_file iteration code and interface") "Some ->next functions do
not increment *pos when they return NULL...  Note that such ->next
functions are buggy and should be fixed.  A simple demonstration is

dd if=/proc/swaps bs=1000 skip=1

Choose any block size larger than the size of /proc/swaps.  This will
always show the whole last line of /proc/swaps"

The described problem is still present.  If you lseek into the middle of
the last output line, the following read will output the end of the last
line and then the whole last line once again.

$ dd if=/proc/swaps bs=1  # usual output
Filename				Type		Size	Used	Priority
/dev/dm-0                               partition	4194812	97536	-2
104+0 records in
104+0 records out
104 bytes copied

$ dd if=/proc/swaps bs=40 skip=1    # last line was generated twice
dd: /proc/swaps: cannot skip to specified offset
v/dm-0                               partition	4194812	97536	-2
/dev/dm-0                               partition	4194812	97536	-2
3+1 records in
3+1 records out
131 bytes copied

https://bugzilla.kernel.org/show_bug.cgi?id=206283
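
The rule the fix follows applies to any seq_file iterator; a minimal
sketch (example_find() is a hypothetical lookup that returns NULL past
the last item):

  #include <linux/seq_file.h>

  static void *example_find(loff_t pos); /* hypothetical lookup */

  static void *example_next(struct seq_file *m, void *v, loff_t *pos)
  {
          ++*pos; /* always advance the index, even when returning NULL */
          return example_find(*pos);
  }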

Link: http://lkml.kernel.org/r/bd8cfd7b-ac95-9b91-f9e7-e8438bd5047d@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/swapfile.c~swap_next-should-increase-position-index
+++ a/mm/swapfile.c
@@ -2737,10 +2737,10 @@ static void *swap_next(struct seq_file *
 	else
 		type = si->type + 1;
 
+	++(*pos);
 	for (; (si = swap_type_to_swap_info(type)); type++) {
 		if (!(si->flags & SWP_USED) || !si->swap_map)
 			continue;
-		++*pos;
 		return si;
 	}
 
_

* [patch 047/118] mm/memcontrol.c: cleanup some useless code
  2020-01-31  6:10 incoming Andrew Morton
                   ` (45 preceding siblings ...)
  2020-01-31  6:13 ` [patch 046/118] mm/swapfile.c: swap_next should increase position index Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 048/118] mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page Andrew Morton
                   ` (70 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, hannes, linux-mm, mhocko, mm-commits, pilgrimtao, torvalds,
	vdavydov.dev

From: Kaitao Cheng <pilgrimtao@gmail.com>
Subject: mm/memcontrol.c: cleanup some useless code

Compound page handling in mem_cgroup_migrate() is more convoluted than
necessary.  The state is duplicated in the 'compound' variable, yet the
same information is available from the trivial PageTransHuge() check, and
hpage_nr_pages() is already PageTransHuge()-aware.

It is much simpler to just use hpage_nr_pages() for nr_pages and replace
the local variable with a direct PageTransHuge() check.
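
For reference, hpage_nr_pages() at this point is effectively (a sketch
of the include/linux/huge_mm.h logic):

  #include <linux/huge_mm.h>
  #include <linux/page-flags.h>

  /* hpage_nr_pages() already folds in the PageTransHuge() test. */
  static inline int hpage_nr_pages_sketch(struct page *page)
  {
          if (unlikely(PageTransHuge(page)))
                  return HPAGE_PMD_NR;
          return 1;
  }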

Link: http://lkml.kernel.org/r/20191210160450.3395-1-pilgrimtao@gmail.com
Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/mm/memcontrol.c~mm-cleanup-some-useless-code
+++ a/mm/memcontrol.c
@@ -6633,7 +6633,6 @@ void mem_cgroup_migrate(struct page *old
 {
 	struct mem_cgroup *memcg;
 	unsigned int nr_pages;
-	bool compound;
 	unsigned long flags;
 
 	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
@@ -6655,8 +6654,7 @@ void mem_cgroup_migrate(struct page *old
 		return;
 
 	/* Force-charge the new page. The old one will be freed soon */
-	compound = PageTransHuge(newpage);
-	nr_pages = compound ? hpage_nr_pages(newpage) : 1;
+	nr_pages = hpage_nr_pages(newpage);
 
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
@@ -6666,7 +6664,8 @@ void mem_cgroup_migrate(struct page *old
 	commit_charge(newpage, memcg, false);
 
 	local_irq_save(flags);
-	mem_cgroup_charge_statistics(memcg, newpage, compound, nr_pages);
+	mem_cgroup_charge_statistics(memcg, newpage, PageTransHuge(newpage),
+			nr_pages);
 	memcg_check_events(memcg, newpage);
 	local_irq_restore(flags);
 }
_

* [patch 048/118] mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page
  2020-01-31  6:10 incoming Andrew Morton
                   ` (46 preceding siblings ...)
  2020-01-31  6:13 ` [patch 047/118] mm/memcontrol.c: cleanup some useless code Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 049/118] mm, tracing: print symbol name for kmem_alloc_node call_site events Andrew Morton
                   ` (69 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, kirill.shutemov, linux-mm, lixinhai.lxh, mhocko,
	mike.kravetz, mm-commits, torvalds

From: Li Xinhai <lixinhai.lxh@gmail.com>
Subject: mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page

In check_pte(), the pfn of a normal, hugetlbfs or THP page needs to be
compared.  The current implementation, pfn_in_hpage(), applies the
comparison as

- normal 4K page: page_pfn <= pfn < page_pfn + 1
- hugetlbfs page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
- THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR

For a hugetlbfs page, the comparison should simply be page_pfn == pfn.

Now, rename pfn_in_hpage() to pfn_is_match() to highlight that the
comparison is not only for THP, and compare each of these cases
explicitly.

There is no impact on current behavior; this just makes the code clear.
That clarity matters - comparing a hugetlbfs page against the range
page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusing.

Link: http://lkml.kernel.org/r/1578737885-8890-1-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_vma_mapped.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

--- a/mm/page_vma_mapped.c~mm-page_vma_mappedc-explicitly-compare-pfn-for-normal-hugetlbfs-and-thp-page
+++ a/mm/page_vma_mapped.c
@@ -52,12 +52,16 @@ static bool map_pte(struct page_vma_mapp
 	return true;
 }
 
-static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
+static inline bool pfn_is_match(struct page *page, unsigned long pfn)
 {
-	unsigned long hpage_pfn = page_to_pfn(hpage);
+	unsigned long page_pfn = page_to_pfn(page);
+
+	/* normal page and hugetlbfs page */
+	if (!PageTransCompound(page) || PageHuge(page))
+		return page_pfn == pfn;
 
 	/* THP can be referenced by any subpage */
-	return pfn >= hpage_pfn && pfn - hpage_pfn < hpage_nr_pages(hpage);
+	return pfn >= page_pfn && pfn - page_pfn < hpage_nr_pages(page);
 }
 
 /**
@@ -108,7 +112,7 @@ static bool check_pte(struct page_vma_ma
 		pfn = pte_pfn(*pvmw->pte);
 	}
 
-	return pfn_in_hpage(pvmw->page, pfn);
+	return pfn_is_match(pvmw->page, pfn);
 }
 
 /**
_

* [patch 049/118] mm, tracing: print symbol name for kmem_alloc_node call_site events
  2020-01-31  6:10 incoming Andrew Morton
                   ` (47 preceding siblings ...)
  2020-01-31  6:13 ` [patch 048/118] mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 050/118] lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more() Andrew Morton
                   ` (68 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, changbin.du, joel, linux-mm, mingo, mm-commits, rostedt,
	sunjunyong, sunjy516, timmurray, torvalds

From: Junyong Sun <sunjy516@gmail.com>
Subject: mm, tracing: print symbol name for kmem_alloc_node call_site events

Print the call_site IP of kmem_alloc_node events using '%pS', to improve
the readability of raw slab tracepoints.
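
The change is only in the format specifier; as a sketch (illustrative
function):

  #include <linux/printk.h>

  static void show_call_site(unsigned long call_site)
  {
          pr_info("call_site=%lx\n", call_site);         /* raw address */
          pr_info("call_site=%pS\n", (void *)call_site); /* symbol+offset */
  }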

Link: http://lkml.kernel.org/r/1577949568-4518-1-git-send-email-sunjunyong@xiaomi.com
Signed-off-by: Junyong Sun <sunjunyong@xiaomi.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Changbin Du <changbin.du@intel.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/trace/events/kmem.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/trace/events/kmem.h~mm-tracing-print-symbol-name-for-kmem_alloc_node-call_site-events
+++ a/include/trace/events/kmem.h
@@ -88,8 +88,8 @@ DECLARE_EVENT_CLASS(kmem_alloc_node,
 		__entry->node		= node;
 	),
 
-	TP_printk("call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d",
-		__entry->call_site,
+	TP_printk("call_site=%pS ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d",
+		(void *)__entry->call_site,
 		__entry->ptr,
 		__entry->bytes_req,
 		__entry->bytes_alloc,
_

* [patch 050/118] lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (48 preceding siblings ...)
  2020-01-31  6:13 ` [patch 049/118] mm, tracing: print symbol name for kmem_alloc_node call_site events Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 051/118] mm/early_ioremap.c: use %pa to print resource_size_t variables Andrew Morton
                   ` (67 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, dvyukov, gustavo, linux-mm, mm-commits, stable, torvalds

From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Subject: lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more()

In case memory resources for _ptr2_ were allocated, release them before
returning.

Notice that in case _ptr1_ happens to be NULL, krealloc() behaves exactly
like kmalloc().
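
A sketch of the pattern being fixed (illustrative function, mirroring
the test's error path):

  #include <linux/slab.h>

  static void krealloc_error_path_sketch(size_t size1, size_t size2)
  {
          char *ptr1 = kmalloc(size1, GFP_KERNEL);
          char *ptr2 = krealloc(ptr1, size2, GFP_KERNEL);

          if (!ptr1 || !ptr2) {
                  /*
                   * kfree(NULL) is a no-op, so both calls are safe: if
                   * ptr1 was NULL, krealloc() acted as kmalloc() and ptr2
                   * may still hold memory; if krealloc() failed, ptr1 is
                   * untouched and must be freed.
                   */
                  kfree(ptr1);
                  kfree(ptr2);
                  return;
          }
          /* ... use ptr2; ptr1 must not be freed separately now ... */
          kfree(ptr2);
  }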

Addresses-Coverity-ID: 1490594 ("Resource leak")
Link: http://lkml.kernel.org/r/20200123160115.GA4202@embeddedor
Fixes: 3f15801cdc23 ("lib: add kasan test module")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_kasan.c |    1 +
 1 file changed, 1 insertion(+)

--- a/lib/test_kasan.c~lib-test_kasanc-fix-memory-leak-in-kmalloc_oob_krealloc_more
+++ a/lib/test_kasan.c
@@ -158,6 +158,7 @@ static noinline void __init kmalloc_oob_
 	if (!ptr1 || !ptr2) {
 		pr_err("Allocation failed\n");
 		kfree(ptr1);
+		kfree(ptr2);
 		return;
 	}
 
_

* [patch 051/118] mm/early_ioremap.c: use %pa to print resource_size_t variables
  2020-01-31  6:10 incoming Andrew Morton
                   ` (49 preceding siblings ...)
  2020-01-31  6:13 ` [patch 050/118] lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more() Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:13 ` [patch 052/118] mm/page_alloc: skip non present sections on zone initialization Andrew Morton
                   ` (66 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, david, linux-mm, mm-commits, torvalds

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: mm/early_ioremap.c: use %pa to print resource_size_t variables

%pa takes into consideration special types such as resource_size_t.  Use
this specifier instead of an explicit cast.
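
As a sketch of the difference (illustrative function):

  #include <linux/printk.h>
  #include <linux/types.h>

  static void report_mapping(resource_size_t phys_addr, unsigned long size)
  {
          /* old: pr_info("%s(%08llx, ...)", __func__, (u64)phys_addr);  */
          /* new: %pa takes a pointer and picks the right width itself   */
          pr_info("%s(%pa, %08lx)\n", __func__, &phys_addr, size);
  }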

Link: http://lkml.kernel.org/r/20191209165413.56263-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/early_ioremap.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/early_ioremap.c~mm-early_remap-use-%pa-to-print-resource_size_t-variables
+++ a/mm/early_ioremap.c
@@ -121,8 +121,8 @@ __early_ioremap(resource_size_t phys_add
 		}
 	}
 
-	if (WARN(slot < 0, "%s(%08llx, %08lx) not found slot\n",
-		 __func__, (u64)phys_addr, size))
+	if (WARN(slot < 0, "%s(%pa, %08lx) not found slot\n",
+		 __func__, &phys_addr, size))
 		return NULL;
 
 	/* Don't allow wraparound or zero size */
@@ -158,8 +158,8 @@ __early_ioremap(resource_size_t phys_add
 		--idx;
 		--nrpages;
 	}
-	WARN(early_ioremap_debug, "%s(%08llx, %08lx) [%d] => %08lx + %08lx\n",
-	     __func__, (u64)phys_addr, size, slot, offset, slot_virt[slot]);
+	WARN(early_ioremap_debug, "%s(%pa, %08lx) [%d] => %08lx + %08lx\n",
+	     __func__, &phys_addr, size, slot, offset, slot_virt[slot]);
 
 	prev_map[slot] = (void __iomem *)(offset + slot_virt[slot]);
 	return prev_map[slot];
_

* [patch 052/118] mm/page_alloc: skip non present sections on zone initialization
  2020-01-31  6:10 incoming Andrew Morton
                   ` (50 preceding siblings ...)
  2020-01-31  6:13 ` [patch 051/118] mm/early_ioremap.c: use %pa to print resource_size_t variables Andrew Morton
@ 2020-01-31  6:13 ` Andrew Morton
  2020-01-31  6:14 ` [patch 053/118] mm: remove the memory isolate notifier Andrew Morton
                   ` (65 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:13 UTC (permalink / raw)
  To: akpm, bhe, dan.j.williams, david, kirill.shutemov, kirill,
	linux-mm, mgorman, mhocko, mhocko, mm-commits, osalvador,
	pasha.tatashin, torvalds, vbabka, zhi.jin

From: "Kirill A. Shutemov" <kirill@shutemov.name>
Subject: mm/page_alloc: skip non present sections on zone initialization

memmap_init_zone() can be called on ranges with holes during boot.  It
will skip any non-valid PFNs one-by-one.  That works fine as long as the
holes are not too big.

But huge holes in the memory map cause a problem.  It takes over 20
seconds to walk a 32TiB hole.  x86-64 with 5-level paging allows for much
larger holes in the memory map, which would practically hang the system.

Deferred struct page init doesn't help here.  It only works on the present
ranges.

Skipping non-present sections would fix the issue.

Link: http://lkml.kernel.org/r/20191230093828.24613-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Jin, Zhi" <zhi.jin@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-page_alloc-skip-non-present-sections-on-zone-initialization
+++ a/mm/page_alloc.c
@@ -5848,6 +5848,30 @@ overlap_memmap_init(unsigned long zone,
 	return false;
 }
 
+#ifdef CONFIG_SPARSEMEM
+/* Skip PFNs that belong to non-present sections */
+static inline __meminit unsigned long next_pfn(unsigned long pfn)
+{
+	unsigned long section_nr;
+
+	section_nr = pfn_to_section_nr(++pfn);
+	if (present_section_nr(section_nr))
+		return pfn;
+
+	while (++section_nr <= __highest_present_section_nr) {
+		if (present_section_nr(section_nr))
+			return section_nr_to_pfn(section_nr);
+	}
+
+	return -1;
+}
+#else
+static inline __meminit unsigned long next_pfn(unsigned long pfn)
+{
+	return pfn + 1;
+}
+#endif
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by memblock_free_all() once the early boot process is
@@ -5887,8 +5911,10 @@ void __meminit memmap_init_zone(unsigned
 		 * function.  They do not exist on hotplugged memory.
 		 */
 		if (context == MEMMAP_EARLY) {
-			if (!early_pfn_valid(pfn))
+			if (!early_pfn_valid(pfn)) {
+				pfn = next_pfn(pfn) - 1;
 				continue;
+			}
 			if (!early_pfn_in_nid(pfn, nid))
 				continue;
 			if (overlap_memmap_init(zone, &pfn))
_

* [patch 053/118] mm: remove the memory isolate notifier
  2020-01-31  6:10 incoming Andrew Morton
                   ` (51 preceding siblings ...)
  2020-01-31  6:13 ` [patch 052/118] mm/page_alloc: skip non present sections on zone initialization Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 054/118] mm: remove "count" parameter from has_unmovable_pages() Andrew Morton
                   ` (64 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, anshuman.khandual, cai, dan.j.williams, david, gregkh,
	kernelfans, linux-mm, mhocko, mm-commits, mpe, osalvador,
	pasha.tatashin, rafael, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm: remove the memory isolate notifier

Luckily, we have no users left, so we can get rid of it.  Clean up
set_migratetype_isolate() a little bit.

Link: http://lkml.kernel.org/r/20191114131911.11783-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c  |   19 -------------------
 include/linux/memory.h |   27 ---------------------------
 mm/page_isolation.c    |   38 ++++----------------------------------
 3 files changed, 4 insertions(+), 80 deletions(-)

--- a/drivers/base/memory.c~mm-remove-the-memory-isolate-notifier
+++ a/drivers/base/memory.c
@@ -70,20 +70,6 @@ void unregister_memory_notifier(struct n
 }
 EXPORT_SYMBOL(unregister_memory_notifier);
 
-static ATOMIC_NOTIFIER_HEAD(memory_isolate_chain);
-
-int register_memory_isolate_notifier(struct notifier_block *nb)
-{
-	return atomic_notifier_chain_register(&memory_isolate_chain, nb);
-}
-EXPORT_SYMBOL(register_memory_isolate_notifier);
-
-void unregister_memory_isolate_notifier(struct notifier_block *nb)
-{
-	atomic_notifier_chain_unregister(&memory_isolate_chain, nb);
-}
-EXPORT_SYMBOL(unregister_memory_isolate_notifier);
-
 static void memory_block_release(struct device *dev)
 {
 	struct memory_block *mem = to_memory_block(dev);
@@ -175,11 +161,6 @@ int memory_notify(unsigned long val, voi
 	return blocking_notifier_call_chain(&memory_chain, val, v);
 }
 
-int memory_isolate_notify(unsigned long val, void *v)
-{
-	return atomic_notifier_call_chain(&memory_isolate_chain, val, v);
-}
-
 /*
  * The probe routines leave the pages uninitialized, just as the bootmem code
  * does. Make sure we do not access them, but instead use only information from
--- a/include/linux/memory.h~mm-remove-the-memory-isolate-notifier
+++ a/include/linux/memory.h
@@ -55,19 +55,6 @@ struct memory_notify {
 	int status_change_nid;
 };
 
-/*
- * During pageblock isolation, count the number of pages within the
- * range [start_pfn, start_pfn + nr_pages) which are owned by code
- * in the notifier chain.
- */
-#define MEM_ISOLATE_COUNT	(1<<0)
-
-struct memory_isolate_notify {
-	unsigned long start_pfn;	/* Start of range to check */
-	unsigned int nr_pages;		/* # pages in range to check */
-	unsigned int pages_found;	/* # pages owned found by callbacks */
-};
-
 struct notifier_block;
 struct mem_section;
 
@@ -94,27 +81,13 @@ static inline int memory_notify(unsigned
 {
 	return 0;
 }
-static inline int register_memory_isolate_notifier(struct notifier_block *nb)
-{
-	return 0;
-}
-static inline void unregister_memory_isolate_notifier(struct notifier_block *nb)
-{
-}
-static inline int memory_isolate_notify(unsigned long val, void *v)
-{
-	return 0;
-}
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-extern int register_memory_isolate_notifier(struct notifier_block *nb);
-extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 int create_memory_block_devices(unsigned long start, unsigned long size);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
-extern int memory_isolate_notify(unsigned long val, void *v);
 extern struct memory_block *find_memory_block(struct mem_section *);
 typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
 extern int walk_memory_blocks(unsigned long start, unsigned long size,
--- a/mm/page_isolation.c~mm-remove-the-memory-isolate-notifier
+++ a/mm/page_isolation.c
@@ -18,9 +18,7 @@
 static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
 {
 	struct zone *zone;
-	unsigned long flags, pfn;
-	struct memory_isolate_notify arg;
-	int notifier_ret;
+	unsigned long flags;
 	int ret = -EBUSY;
 
 	zone = page_zone(page);
@@ -35,41 +33,11 @@ static int set_migratetype_isolate(struc
 	if (is_migrate_isolate_page(page))
 		goto out;
 
-	pfn = page_to_pfn(page);
-	arg.start_pfn = pfn;
-	arg.nr_pages = pageblock_nr_pages;
-	arg.pages_found = 0;
-
-	/*
-	 * It may be possible to isolate a pageblock even if the
-	 * migratetype is not MIGRATE_MOVABLE. The memory isolation
-	 * notifier chain is used by balloon drivers to return the
-	 * number of pages in a range that are held by the balloon
-	 * driver to shrink memory. If all the pages are accounted for
-	 * by balloons, are free, or on the LRU, isolation can continue.
-	 * Later, for example, when memory hotplug notifier runs, these
-	 * pages reported as "can be isolated" should be isolated(freed)
-	 * by the balloon driver through the memory notifier chain.
-	 */
-	notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg);
-	notifier_ret = notifier_to_errno(notifier_ret);
-	if (notifier_ret)
-		goto out;
 	/*
 	 * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
 	 * We just check MOVABLE pages.
 	 */
-	if (!has_unmovable_pages(zone, page, arg.pages_found, migratetype,
-				 isol_flags))
-		ret = 0;
-
-	/*
-	 * immobile means "not-on-lru" pages. If immobile is larger than
-	 * removable-by-driver pages reported by notifier, we'll fail.
-	 */

* [patch 054/118] mm: remove "count" parameter from has_unmovable_pages()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (52 preceding siblings ...)
  2020-01-31  6:14 ` [patch 053/118] mm: remove the memory isolate notifier Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 055/118] mm/vmscan.c: remove unused return value of shrink_node Andrew Morton
                   ` (63 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, alexander.h.duyck, anshuman.khandual, arunks, cai,
	dan.j.williams, david, glider, kernelfans, linux-mm, mgorman,
	mhocko, mm-commits, mpe, osalvador, pasha.tatashin,
	richardw.yang, rppt, sfr, torvalds, vbabka

From: David Hildenbrand <david@redhat.com>
Subject: mm: remove "count" parameter from has_unmovable_pages()

Now that the memory isolate notifier is gone, the parameter is always 0.
Drop it and clean up has_unmovable_pages().

Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-isolation.h |    4 ++--
 mm/memory_hotplug.c            |    2 +-
 mm/page_alloc.c                |   21 +++++++--------------
 mm/page_isolation.c            |    2 +-
 4 files changed, 11 insertions(+), 18 deletions(-)

--- a/include/linux/page-isolation.h~mm-remove-count-parameter-from-has_unmovable_pages
+++ a/include/linux/page-isolation.h
@@ -33,8 +33,8 @@ static inline bool is_migrate_isolate(in
 #define MEMORY_OFFLINE	0x1
 #define REPORT_FAILURE	0x2
 
-bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
-			 int migratetype, int flags);
+bool has_unmovable_pages(struct zone *zone, struct page *page, int migratetype,
+			 int flags);
 void set_pageblock_migratetype(struct page *page, int migratetype);
 int move_freepages_block(struct zone *zone, struct page *page,
 				int migratetype, int *num_movable);
--- a/mm/memory_hotplug.c~mm-remove-count-parameter-from-has_unmovable_pages
+++ a/mm/memory_hotplug.c
@@ -1182,7 +1182,7 @@ static bool is_pageblock_removable_noloc
 	if (!zone_spans_pfn(zone, pfn))
 		return false;
 
-	return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE,
+	return !has_unmovable_pages(zone, page, MIGRATE_MOVABLE,
 				    MEMORY_OFFLINE);
 }
 
--- a/mm/page_alloc.c~mm-remove-count-parameter-from-has_unmovable_pages
+++ a/mm/page_alloc.c
@@ -8180,17 +8180,15 @@ void *__init alloc_large_system_hash(con
 
 /*
  * This function checks whether pageblock includes unmovable pages or not.
- * If @count is not zero, it is okay to include less @count unmovable pages
  *
  * PageLRU check without isolation or lru_lock could race so that
  * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
  * check without lock_page also may miss some movable non-lru pages at
  * race condition. So you can't expect this function should be exact.
  */
-bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
-			 int migratetype, int flags)
+bool has_unmovable_pages(struct zone *zone, struct page *page, int migratetype,
+			 int flags)
 {
-	unsigned long found;
 	unsigned long iter = 0;
 	unsigned long pfn = page_to_pfn(page);
 	const char *reason = "unmovable page";
@@ -8216,13 +8214,11 @@ bool has_unmovable_pages(struct zone *zo
 		goto unmovable;
 	}
 
-	for (found = 0; iter < pageblock_nr_pages; iter++) {
-		unsigned long check = pfn + iter;
-
-		if (!pfn_valid_within(check))
+	for (; iter < pageblock_nr_pages; iter++) {
+		if (!pfn_valid_within(pfn + iter))
 			continue;
 
-		page = pfn_to_page(check);
+		page = pfn_to_page(pfn + iter);
 
 		if (PageReserved(page))
 			goto unmovable;
@@ -8271,11 +8267,9 @@ bool has_unmovable_pages(struct zone *zo
 		if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
 			continue;
 
-		if (__PageMovable(page))
+		if (__PageMovable(page) || PageLRU(page))
 			continue;
 
-		if (!PageLRU(page))
-			found++;
 		/*
 		 * If there are RECLAIMABLE pages, we need to check
 		 * it.  But now, memory offline itself doesn't call
@@ -8289,8 +8283,7 @@ bool has_unmovable_pages(struct zone *zo
 		 * is set to both of a memory hole page and a _used_ kernel
 		 * page at boot.
 		 */
-		if (found > count)
-			goto unmovable;
+		goto unmovable;
 	}
 	return false;
 unmovable:
--- a/mm/page_isolation.c~mm-remove-count-parameter-from-has_unmovable_pages
+++ a/mm/page_isolation.c
@@ -37,7 +37,7 @@ static int set_migratetype_isolate(struc
 	 * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
 	 * We just check MOVABLE pages.
 	 */
-	if (!has_unmovable_pages(zone, page, 0, migratetype, isol_flags)) {
+	if (!has_unmovable_pages(zone, page, migratetype, isol_flags)) {
 		unsigned long nr_pages;
 		int mt = get_pageblock_migratetype(page);
 
_

* [patch 055/118] mm/vmscan.c: remove unused return value of shrink_node
  2020-01-31  6:10 incoming Andrew Morton
                   ` (53 preceding siblings ...)
  2020-01-31  6:14 ` [patch 054/118] mm: remove "count" parameter from has_unmovable_pages() Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 056/118] mm/vmscan: remove prefetch_prev_lru_page Andrew Morton
                   ` (62 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, david, linux-mm, liu.song11, mm-commits, torvalds

From: Liu Song <liu.song11@zte.com.cn>
Subject: mm/vmscan.c: remove unused return value of shrink_node

The return value of shrink_node() is not used, so remove the
unnecessary operations.

Link: http://lkml.kernel.org/r/20191128143524.3223-1-fishland@aliyun.com
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/mm/vmscan.c~mm-vmscanc-remove-unused-return-value-of-shrink_node
+++ a/mm/vmscan.c
@@ -2695,7 +2695,7 @@ static void shrink_node_memcgs(pg_data_t
 	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
 }
 
-static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
+static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
@@ -2874,8 +2874,6 @@ again:
 	 */
 	if (reclaimable)
 		pgdat->kswapd_failures = 0;

* [patch 056/118] mm/vmscan: remove prefetch_prev_lru_page
  2020-01-31  6:10 incoming Andrew Morton
                   ` (54 preceding siblings ...)
  2020-01-31  6:14 ` [patch 055/118] mm/vmscan.c: remove unused return value of shrink_node Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 057/118] mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE Andrew Morton
                   ` (61 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, alex.shi, cai, linux-mm, mm-commits, torvalds

From: Alex Shi <alex.shi@linux.alibaba.com>
Subject: mm/vmscan: remove prefetch_prev_lru_page

This macro has never been used anywhere in git history, so remove it.

Link: http://lkml.kernel.org/r/1579006500-127143-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |   14 --------------
 1 file changed, 14 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-remove-prefetch_prev_lru_page
+++ a/mm/vmscan.c
@@ -146,20 +146,6 @@ struct scan_control {
 	struct reclaim_state reclaim_state;
 };
 
-#ifdef ARCH_HAS_PREFETCH
-#define prefetch_prev_lru_page(_page, _base, _field)			\
-	do {								\
-		if ((_page)->lru.prev != _base) {			\
-			struct page *prev;				\
-									\
-			prev = lru_to_page(&(_page->lru));		\
-			prefetch(&prev->_field);			\
-		}							\
-	} while (0)
-#else
-#define prefetch_prev_lru_page(_page, _base, _field) do { } while (0)
-#endif

* [patch 057/118] mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE
  2020-01-31  6:10 incoming Andrew Morton
                   ` (55 preceding siblings ...)
  2020-01-31  6:14 ` [patch 056/118] mm/vmscan: remove prefetch_prev_lru_page Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 058/118] tools/vm/slabinfo: fix sanity checks enabling Andrew Morton
                   ` (60 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, alex.shi, cl, dwagner, linux-mm, mm-commits, tobin, torvalds

From: Alex Shi <alex.shi@linux.alibaba.com>
Subject: mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE

Commit 1b2ffb7896ad ("[PATCH] Zone reclaim: Allow modification of zone
reclaim behavior") defined RECLAIM_OFF/RECLAIM_ZONE but never used them,
so remove them.

[dwagner@suse.de: fix sanity checks enabling]
  Link: http://lkml.kernel.org/r/20200116131642.642-1-dwagner@suse.de
[akpm@linux-foundation.org: renumber the bits for neatness]
Link: http://lkml.kernel.org/r/1579005573-58923-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-remove-unused-reclaim_off-reclaim_zone
+++ a/mm/vmscan.c
@@ -4110,10 +4110,8 @@ module_init(kswapd_init)
  */
 int node_reclaim_mode __read_mostly;
 
-#define RECLAIM_OFF 0
-#define RECLAIM_ZONE (1<<0)	/* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1)	/* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2)	/* Unmap pages during reclaim */
+#define RECLAIM_WRITE (1<<0)	/* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<1)	/* Unmap pages during reclaim */
 
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
_

* [patch 058/118] tools/vm/slabinfo: fix sanity checks enabling
  2020-01-31  6:10 incoming Andrew Morton
                   ` (56 preceding siblings ...)
  2020-01-31  6:14 ` [patch 057/118] mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 059/118] mm/memblock: define memblock_physmem_add() Andrew Morton
                   ` (59 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, cl, dwagner, linux-mm, mm-commits, tobin, torvalds

From: Daniel Wagner <dwagner@suse.de>
Subject: tools/vm/slabinfo: fix sanity checks enabling

The sysfs file name for enabling sanity checking is called 'sanity_checks'
and not 'sanity'.

The name of the file has never changed since the introduction of the slub
allocator.  Obviously, most people turn the checks on via the command line
option and not during runtime using slabinfo.

Link: http://lkml.kernel.org/r/20200116131642.642-1-dwagner@suse.de
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/vm/slabinfo.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/tools/vm/slabinfo.c~tools-vm-slabinfo-fix-sanity-checks-enabling
+++ a/tools/vm/slabinfo.c
@@ -720,11 +720,11 @@ static void slab_debug(struct slabinfo *
 		return;
 
 	if (sanity && !s->sanity_checks) {
-		set_obj(s, "sanity", 1);
+		set_obj(s, "sanity_checks", 1);
 	}
 	if (!sanity && s->sanity_checks) {
 		if (slab_empty(s))
-			set_obj(s, "sanity", 0);
+			set_obj(s, "sanity_checks", 0);
 		else
 			fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name);
 	}
_
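
For reference, a sketch of what slabinfo's set_obj() boils down to:
writing a value into a slab attribute under /sys/kernel/slab.  The
helper name and the kmalloc-64 cache are illustrative, and whether the
write succeeds depends on the kernel configuration:

#include <stdio.h>

static int set_attr(const char *cache, const char *attr, int val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/kernel/slab/%s/%s", cache, attr);
	f = fopen(path, "w");
	if (!f)
		return -1;	/* e.g. ENOENT for a misspelled attribute */
	fprintf(f, "%d", val);
	return fclose(f);
}

int main(void)
{
	/* The attribute is "sanity_checks"; opening ".../sanity" fails,
	 * which is exactly the bug fixed here. */
	return set_attr("kmalloc-64", "sanity_checks", 1) ? 1 : 0;
}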

* [patch 059/118] mm/memblock: define memblock_physmem_add()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (57 preceding siblings ...)
  2020-01-31  6:14 ` [patch 058/118] tools/vm/slabinfo: fix sanity checks enabling Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 060/118] memblock: Use __func__ in remaining memblock_dbg() call sites Andrew Morton
                   ` (58 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, anshuman.khandual, borntraeger, gerald.schaefer, gor,
	heiko.carstens, linux-mm, mm-commits, prudo, rppt, schwidefsky,
	torvalds, walling

From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: mm/memblock: define memblock_physmem_add()

On the s390 platform, the memblock.physmem array is built by calling
directly into memblock_add_range(), a low-level function not intended to
be used outside of memblock.  Hence, conditionally add helper functions
for the physmem array when HAVE_MEMBLOCK_PHYS_MAP is enabled.  Also use
MAX_NUMNODES instead of 0 as the node ID, matching memblock_add() and
memblock_reserve(), and make memblock_add_range() a static function as it
is no longer used outside of memblock.

Link: http://lkml.kernel.org/r/1578283835-21969-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Collin Walling <walling@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Philipp Rudo <prudo@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/kernel/setup.c |   12 +++---------
 include/linux/memblock.h |    7 +++----
 mm/memblock.c            |   14 +++++++++++++-
 3 files changed, 19 insertions(+), 14 deletions(-)

--- a/arch/s390/kernel/setup.c~mm-memblock-define-memblock_physmem_add
+++ a/arch/s390/kernel/setup.c
@@ -759,14 +759,6 @@ static void __init free_mem_detect_info(
 		memblock_free(start, size);
 }
 
-static void __init memblock_physmem_add(phys_addr_t start, phys_addr_t size)
-{
-	memblock_dbg("memblock_physmem_add: [%#016llx-%#016llx]\n",
-		     start, start + size - 1);
-	memblock_add_range(&memblock.memory, start, size, 0, 0);
-	memblock_add_range(&memblock.physmem, start, size, 0, 0);
-}
-
 static const char * __init get_mem_info_source(void)
 {
 	switch (mem_detect.info_source) {
@@ -791,8 +783,10 @@ static void __init memblock_add_mem_dete
 		     get_mem_info_source(), mem_detect.info_source);
 	/* keep memblock lists close to the kernel */
 	memblock_set_bottom_up(true);
-	for_each_mem_detect_block(i, &start, &end)
+	for_each_mem_detect_block(i, &start, &end) {
+		memblock_add(start, end - start);
 		memblock_physmem_add(start, end - start);
+	}
 	memblock_set_bottom_up(false);
 	memblock_dump_all();
 }
--- a/include/linux/memblock.h~mm-memblock-define-memblock_physmem_add
+++ a/include/linux/memblock.h
@@ -113,6 +113,9 @@ int memblock_add(phys_addr_t base, phys_
 int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
+#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
+int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
+#endif
 void memblock_trim_memory(phys_addr_t align);
 bool memblock_overlaps_region(struct memblock_type *type,
 			      phys_addr_t base, phys_addr_t size);
@@ -127,10 +130,6 @@ void reset_node_managed_pages(pg_data_t
 void reset_all_zones_managed_pages(void);
 
 /* Low level functions */
-int memblock_add_range(struct memblock_type *type,
-		       phys_addr_t base, phys_addr_t size,
-		       int nid, enum memblock_flags flags);
-
 void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags,
 		      struct memblock_type *type_a,
 		      struct memblock_type *type_b, phys_addr_t *out_start,
--- a/mm/memblock.c~mm-memblock-define-memblock_physmem_add
+++ a/mm/memblock.c
@@ -575,7 +575,7 @@ static void __init_memblock memblock_ins
  * Return:
  * 0 on success, -errno on failure.
  */
-int __init_memblock memblock_add_range(struct memblock_type *type,
+static int __init_memblock memblock_add_range(struct memblock_type *type,
 				phys_addr_t base, phys_addr_t size,
 				int nid, enum memblock_flags flags)
 {
@@ -830,6 +830,18 @@ int __init_memblock memblock_reserve(phy
 	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
+int __init_memblock memblock_physmem_add(phys_addr_t base, phys_addr_t size)
+{
+	phys_addr_t end = base + size - 1;
+
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
+		     &base, &end, (void *)_RET_IP_);
+
+	return memblock_add_range(&memblock.physmem, base, size, MAX_NUMNODES, 0);
+}
+#endif
+
 /**
  * memblock_setclr_flag - set or clear flag for a memory region
  * @base: base address of the region
_
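
The shape of the change, as a small userspace model: the generic
add-range function becomes static and callers go through typed wrappers.
All names below are illustrative stand-ins for the memblock ones:

#include <stdio.h>

struct region_list { const char *name; int count; };

static struct region_list memory  = { "memory", 0 };
static struct region_list physmem = { "physmem", 0 };

/* Previously visible to callers; now static so nothing outside this
 * file can bypass the wrappers. */
static int add_range(struct region_list *list, unsigned long base,
		     unsigned long size)
{
	list->count++;
	printf("%s: add [%#lx-%#lx]\n", list->name, base, base + size - 1);
	return 0;
}

int region_memory_add(unsigned long base, unsigned long size)
{
	return add_range(&memory, base, size);
}

int region_physmem_add(unsigned long base, unsigned long size)
{
	return add_range(&physmem, base, size);
}

int main(void)
{
	/* Mirrors the reworked s390 loop: each block goes to both lists. */
	region_memory_add(0x1000, 0x1000);
	region_physmem_add(0x1000, 0x1000);
	return 0;
}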

* [patch 060/118] memblock: Use __func__ in remaining memblock_dbg() call sites
  2020-01-31  6:10 incoming Andrew Morton
                   ` (58 preceding siblings ...)
  2020-01-31  6:14 ` [patch 059/118] mm/memblock: define memblock_physmem_add() Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 061/118] mm, oom: dump stack of victim when reaping failed Andrew Morton
                   ` (57 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, anshuman.khandual, linux-mm, mm-commits, rppt, torvalds

From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: memblock: Use __func__ in remaining memblock_dbg() call sites

Replace open-coded function name strings with %s (__func__) in all
remaining memblock_dbg() call sites.

Link: http://lkml.kernel.org/r/1578285510-28261-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memblock.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/memblock.c~memblock-use-__func__-in-remaining-memblock_dbg-call-sites
+++ a/mm/memblock.c
@@ -694,7 +694,7 @@ int __init_memblock memblock_add(phys_ad
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("memblock_add: [%pa-%pa] %pS\n",
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
 	return memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
@@ -795,7 +795,7 @@ int __init_memblock memblock_remove(phys
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("memblock_remove: [%pa-%pa] %pS\n",
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
 	return memblock_remove_range(&memblock.memory, base, size);
@@ -813,7 +813,7 @@ int __init_memblock memblock_free(phys_a
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("   memblock_free: [%pa-%pa] %pS\n",
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
 	kmemleak_free_part_phys(base, size);
@@ -824,7 +824,7 @@ int __init_memblock memblock_reserve(phy
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("memblock_reserve: [%pa-%pa] %pS\n",
+	memblock_dbg("%s: [%pa-%pa] %pS\n", __func__,
 		     &base, &end, (void *)_RET_IP_);
 
 	return memblock_add_range(&memblock.reserved, base, size, MAX_NUMNODES, 0);
_
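
A minimal model of the pattern, with a dbg() macro standing in for
memblock_dbg():

#include <stdio.h>

/* Embed the function name via "%s"/__func__ instead of hard-coding it
 * into every format string. */
#define dbg(fmt, ...) printf("%s: " fmt "\n", __func__, ##__VA_ARGS__)

static void region_reserve(unsigned long base, unsigned long size)
{
	/* Prints "region_reserve: [0x1000-0x2fff]" and stays correct
	 * if the function is ever renamed. */
	dbg("[%#lx-%#lx]", base, base + size - 1);
}

int main(void)
{
	region_reserve(0x1000, 0x2000);
	return 0;
}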

* [patch 061/118] mm, oom: dump stack of victim when reaping failed
  2020-01-31  6:10 incoming Andrew Morton
                   ` (59 preceding siblings ...)
  2020-01-31  6:14 ` [patch 060/118] memblock: Use __func__ in remaining memblock_dbg() call sites Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 062/118] mm/huge_memory.c: use head to check huge zero page Andrew Morton
                   ` (56 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, linux-mm, mhocko, mm-commits, penguin-kernel, rientjes, torvalds

From: David Rientjes <rientjes@google.com>
Subject: mm, oom: dump stack of victim when reaping failed

When a process cannot be oom reaped, for whatever reason, the list of
locks that are held is currently dumped to the kernel log.

Much more interesting is the stack trace of the victim that cannot be
reaped.  If the stack trace is dumped, we have the ability to find related
occurrences in the same kernel code and hopefully solve the issue that is
making it wedged.

Dump the stack trace when a process fails to be oom reaped.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/oom_kill.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/mm/oom_kill.c~mm-oom-dump-stack-of-victim-when-reaping-failed
+++ a/mm/oom_kill.c
@@ -26,6 +26,7 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
 #include <linux/sched/task.h>
+#include <linux/sched/debug.h>
 #include <linux/swap.h>
 #include <linux/timex.h>
 #include <linux/jiffies.h>
@@ -620,6 +621,7 @@ static void oom_reap_task(struct task_st
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
+	sched_show_task(tsk);
 	debug_show_all_locks();
 
 done:
_

* [patch 062/118] mm/huge_memory.c: use head to check huge zero page
  2020-01-31  6:10 incoming Andrew Morton
                   ` (60 preceding siblings ...)
  2020-01-31  6:14 ` [patch 061/118] mm, oom: dump stack of victim when reaping failed Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 063/118] mm/huge_memory.c: use head to emphasize the purpose of page Andrew Morton
                   ` (55 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, kirill.shutemov, linux-mm, mm-commits, richardw.yang, torvalds

From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/huge_memory.c: use head to check huge zero page

The page could be a tail page; if this is the case, this BUG_ON() will
never trigger.

Link: http://lkml.kernel.org/r/20200110032610.26499-1-richardw.yang@linux.intel.com
Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/huge_memory.c~mm-huge_memoryc-use-head-to-check-huge-zero-page
+++ a/mm/huge_memory.c
@@ -2712,7 +2712,7 @@ int split_huge_page_to_list(struct page
 	unsigned long flags;
 	pgoff_t end;
 
-	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
+	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
_
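
A toy model of why the check must look at the head page.  The real
struct page encodes head/tail state differently; everything here is
illustrative:

#include <stdio.h>

struct page {
	struct page *compound_head;	/* NULL for a head (or normal) page */
	int is_huge_zero;
};

static struct page *compound_head(struct page *page)
{
	return page->compound_head ? page->compound_head : page;
}

int main(void)
{
	struct page head = { NULL, 1 };
	struct page tail = { &head, 0 };

	/* Checking the tail page misses the property; checking the
	 * head, as the patch does, catches it. */
	printf("check on tail: %d\n", tail.is_huge_zero);
	printf("check on head: %d\n", compound_head(&tail)->is_huge_zero);
	return 0;
}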

* [patch 063/118] mm/huge_memory.c: use head to emphasize the purpose of page
  2020-01-31  6:10 incoming Andrew Morton
                   ` (61 preceding siblings ...)
  2020-01-31  6:14 ` [patch 062/118] mm/huge_memory.c: use head to check huge zero page Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 064/118] mm/huge_memory.c: reduce critical section protected by split_queue_lock Andrew Morton
                   ` (54 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, kirill.shutemov, linux-mm, mm-commits, richardw.yang, torvalds

From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/huge_memory.c: use head to emphasize the purpose of page

When splitting a huge page, we check several properties of the page.
Currently some of these checks are done on page and some on head, without
making it clear that they all concern the compound page.  If the page
passed to split_huge_page_to_list() is a tail page, readers have to stop
and work out whether a given check applies to the compound page or to the
tail page itself.

To make this explicit, use head for those checks.  After this change it
is clear that the checks operate on the compound page, while page is only
used to perform the split and to dump an error message on failure.

Link: http://lkml.kernel.org/r/20200110032610.26499-2-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- a/mm/huge_memory.c~mm-huge_memoryc-use-head-to-emphasize-the-purpose-of-page
+++ a/mm/huge_memory.c
@@ -2704,7 +2704,7 @@ int split_huge_page_to_list(struct page
 {
 	struct page *head = compound_head(page);
 	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
-	struct deferred_split *ds_queue = get_deferred_split_queue(page);
+	struct deferred_split *ds_queue = get_deferred_split_queue(head);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
@@ -2713,10 +2713,10 @@ int split_huge_page_to_list(struct page
 	pgoff_t end;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(head), head);
+	VM_BUG_ON_PAGE(!PageCompound(head), head);
 
-	if (PageWriteback(page))
+	if (PageWriteback(head))
 		return -EBUSY;
 
 	if (PageAnon(head)) {
@@ -2767,7 +2767,7 @@ int split_huge_page_to_list(struct page
 		goto out_unlock;
 	}
 
-	mlocked = PageMlocked(page);
+	mlocked = PageMlocked(head);
 	unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
@@ -2800,10 +2800,10 @@ int split_huge_page_to_list(struct page
 			list_del(page_deferred_list(head));
 		}
 		if (mapping) {
-			if (PageSwapBacked(page))
-				__dec_node_page_state(page, NR_SHMEM_THPS);
+			if (PageSwapBacked(head))
+				__dec_node_page_state(head, NR_SHMEM_THPS);
 			else
-				__dec_node_page_state(page, NR_FILE_THPS);
+				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
 		spin_unlock(&ds_queue->split_queue_lock);
_

* [patch 064/118] mm/huge_memory.c: reduce critical section protected by split_queue_lock
  2020-01-31  6:10 incoming Andrew Morton
                   ` (62 preceding siblings ...)
  2020-01-31  6:14 ` [patch 063/118] mm/huge_memory.c: use head to emphasize the purpose of page Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 065/118] mm/migrate: remove useless mask of start address Andrew Morton
                   ` (53 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, kirill.shutemov, linux-mm, mm-commits, richardw.yang, torvalds

From: Wei Yang <richardw.yang@linux.intel.com>
Subject: mm/huge_memory.c: reduce critical section protected by split_queue_lock

split_queue_lock protects the data in struct deferred_split, so we can
release the lock after deleting the page from the deferred split queue.

This patch moves the THP accounting, introduced in commit 65c453778aea
("mm, rmap: account shmem thp pages"), out of the section protected by
the lock.

Link: http://lkml.kernel.org/r/20200110025516.23996-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/huge_memory.c~mm-huge_memoryc-reduce-critical-section-protected-by-split_queue_lock
+++ a/mm/huge_memory.c
@@ -2799,6 +2799,7 @@ int split_huge_page_to_list(struct page
 			ds_queue->split_queue_len--;
 			list_del(page_deferred_list(head));
 		}
+		spin_unlock(&ds_queue->split_queue_lock);
 		if (mapping) {
 			if (PageSwapBacked(head))
 				__dec_node_page_state(head, NR_SHMEM_THPS);
@@ -2806,7 +2807,6 @@ int split_huge_page_to_list(struct page
 				__dec_node_page_state(head, NR_FILE_THPS);
 		}
 
-		spin_unlock(&ds_queue->split_queue_lock);
 		__split_huge_page(page, list, end, flags);
 		if (PageSwapCache(head)) {
 			swp_entry_t entry = { .val = page_private(head) };
_
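
The pattern, sketched in plain C with made-up names: only the queue
manipulation needs the lock, so the statistics update moves out of the
critical section:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int queue_len = 2;	/* protected by queue_lock */
static int stats_count = 2;	/* per-CPU-style counter; no lock needed */

static void split_one(void)
{
	pthread_mutex_lock(&queue_lock);
	queue_len--;			/* the only state the lock protects */
	pthread_mutex_unlock(&queue_lock);

	stats_count--;			/* now done after the unlock */
	printf("queue_len=%d stats_count=%d\n", queue_len, stats_count);
}

int main(void)
{
	split_one();
	return 0;
}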

* [patch 065/118] mm/migrate: remove useless mask of start address
  2020-01-31  6:10 incoming Andrew Morton
                   ` (63 preceding siblings ...)
  2020-01-31  6:14 ` [patch 064/118] mm/huge_memory.c: reduce critical section protected by split_queue_lock Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 066/118] mm/migrate: clean up some minor coding style Andrew Morton
                   ` (52 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, bharata, chris, hch, jgg, jglisse, jhubbard, linux-mm,
	mhocko, mm-commits, rcampbell, torvalds

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm/migrate: remove useless mask of start address

Addresses passed to walk_page_range() callback functions are already page
aligned and don't need to be masked with PAGE_MASK.

Link: http://lkml.kernel.org/r/20200107211208.24595-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/migrate.c~mm-migrate-remove-useless-mask-of-start-address
+++ a/mm/migrate.c
@@ -2156,7 +2156,7 @@ static int migrate_vma_collect_hole(unsi
 	struct migrate_vma *migrate = walk->private;
 	unsigned long addr;
 
-	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
 		migrate->dst[migrate->npages] = 0;
 		migrate->npages++;
@@ -2173,7 +2173,7 @@ static int migrate_vma_collect_skip(unsi
 	struct migrate_vma *migrate = walk->private;
 	unsigned long addr;
 
-	for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
 		migrate->dst[migrate->npages] = 0;
 		migrate->src[migrate->npages++] = 0;
 	}
_

* [patch 066/118] mm/migrate: clean up some minor coding style
  2020-01-31  6:10 incoming Andrew Morton
                   ` (64 preceding siblings ...)
  2020-01-31  6:14 ` [patch 065/118] mm/migrate: remove useless mask of start address Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 067/118] mm/migrate: add stable check in migrate_vma_insert_page() Andrew Morton
                   ` (51 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, bharata, chris, hch, jgg, jglisse, jhubbard, linux-mm,
	mhocko, mm-commits, rcampbell, torvalds

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm/migrate: clean up some minor coding style

Fix some comment typos and clean up coding style in preparation for the
next patch.  No functional changes.

Link: http://lkml.kernel.org/r/20200107211208.24595-3-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |   34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

--- a/mm/migrate.c~mm-migrate-clean-up-some-minor-coding-style
+++ a/mm/migrate.c
@@ -986,7 +986,7 @@ static int move_to_new_page(struct page
 		}
 
 		/*
-		 * Anonymous and movable page->mapping will be cleard by
+		 * Anonymous and movable page->mapping will be cleared by
 		 * free_pages_prepare so don't reset it here for keeping
 		 * the type to work PageAnon, for example.
 		 */
@@ -1199,8 +1199,7 @@ out:
 		/*
 		 * A page that has been migrated has all references
 		 * removed and will be freed. A page that has not been
-		 * migrated will have kepts its references and be
-		 * restored.
+		 * migrated will have kept its references and be restored.
 		 */
 		list_del(&page->lru);
 
@@ -2779,27 +2778,18 @@ static void migrate_vma_insert_page(stru
 	if (pte_present(*ptep)) {
 		unsigned long pfn = pte_pfn(*ptep);
 
-		if (!is_zero_pfn(pfn)) {
-			pte_unmap_unlock(ptep, ptl);
-			mem_cgroup_cancel_charge(page, memcg, false);
-			goto abort;
-		}
+		if (!is_zero_pfn(pfn))
+			goto unlock_abort;
 		flush = true;
-	} else if (!pte_none(*ptep)) {
-		pte_unmap_unlock(ptep, ptl);
-		mem_cgroup_cancel_charge(page, memcg, false);
-		goto abort;
-	}
+	} else if (!pte_none(*ptep))
+		goto unlock_abort;
 
 	/*
-	 * Check for usefaultfd but do not deliver the fault. Instead,
+	 * Check for userfaultfd but do not deliver the fault. Instead,
 	 * just back off.
 	 */
-	if (userfaultfd_missing(vma)) {
-		pte_unmap_unlock(ptep, ptl);
-		mem_cgroup_cancel_charge(page, memcg, false);
-		goto abort;
-	}
+	if (userfaultfd_missing(vma))
+		goto unlock_abort;
 
 	inc_mm_counter(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, addr, false);
@@ -2823,6 +2813,9 @@ static void migrate_vma_insert_page(stru
 	*src = MIGRATE_PFN_MIGRATE;
 	return;
 
+unlock_abort:
+	pte_unmap_unlock(ptep, ptl);
+	mem_cgroup_cancel_charge(page, memcg, false);
 abort:
 	*src &= ~MIGRATE_PFN_MIGRATE;
 }
@@ -2855,9 +2848,8 @@ void migrate_vma_pages(struct migrate_vm
 		}
 
 		if (!page) {
-			if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) {
+			if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
 				continue;
-			}
 			if (!notified) {
 				notified = true;
 
_
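
The cleanup pattern in isolation, with illustrative names: three
duplicated unlock-and-cancel sequences collapse into one unlock_abort
label:

#include <stdio.h>

static int locked, charged;

static void unlock(void)   { locked = 0; }
static void uncharge(void) { charged = 0; }

static int insert(int present, int none, int uffd)
{
	locked = charged = 1;

	if (present)
		goto unlock_abort;
	if (!none)
		goto unlock_abort;
	if (uffd)
		goto unlock_abort;

	unlock();
	return 0;		/* success path keeps the charge */

unlock_abort:
	unlock();		/* one copy of the error sequence */
	uncharge();
	return -1;
}

int main(void)
{
	printf("ok=%d fail=%d\n", insert(0, 1, 0), insert(1, 1, 0));
	return 0;
}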

* [patch 067/118] mm/migrate: add stable check in migrate_vma_insert_page()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (65 preceding siblings ...)
  2020-01-31  6:14 ` [patch 066/118] mm/migrate: clean up some minor coding style Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 068/118] mm, thp: fix defrag setting if newline is not used Andrew Morton
                   ` (50 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, bharata, chris, hch, jgg, jglisse, jhubbard, linux-mm,
	mhocko, mm-commits, rcampbell, torvalds

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm/migrate: add stable check in migrate_vma_insert_page()

migrate_vma_insert_page() closely follows the code in:
  __handle_mm_fault()
    handle_pte_fault()
      do_anonymous_page()

Add a call to check_stable_address_space() after locking the page table
entry, before inserting a ZONE_DEVICE private zero page mapping, similar
to page faulting a new anonymous page.

Link: http://lkml.kernel.org/r/20200107211208.24595-4-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/migrate.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

--- a/mm/migrate.c~mm-migrate-add-stable-check-in-migrate_vma_insert_page
+++ a/mm/migrate.c
@@ -48,6 +48,7 @@
 #include <linux/page_owner.h>
 #include <linux/sched/mm.h>
 #include <linux/ptrace.h>
+#include <linux/oom.h>
 
 #include <asm/tlbflush.h>
 
@@ -2695,6 +2696,14 @@ int migrate_vma_setup(struct migrate_vma
 }
 EXPORT_SYMBOL(migrate_vma_setup);
 
+/*
+ * This code closely matches the code in:
+ *   __handle_mm_fault()
+ *     handle_pte_fault()
+ *       do_anonymous_page()
+ * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
+ * private page.
+ */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
 				    struct page *page,
@@ -2775,6 +2784,9 @@ static void migrate_vma_insert_page(stru
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 
+	if (check_stable_address_space(mm))
+		goto unlock_abort;
+
 	if (pte_present(*ptep)) {
 		unsigned long pfn = pte_pfn(*ptep);
 
_

* [patch 068/118] mm, thp: fix defrag setting if newline is not used
  2020-01-31  6:10 incoming Andrew Morton
                   ` (66 preceding siblings ...)
  2020-01-31  6:14 ` [patch 067/118] mm/migrate: add stable check in migrate_vma_insert_page() Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 069/118] mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma() Andrew Morton
                   ` (49 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, linux-mm, mgorman, mm-commits, rientjes, torvalds, vbabka

From: David Rientjes <rientjes@google.com>
Subject: mm, thp: fix defrag setting if newline is not used

If thp defrag setting "defer" is used and a newline is *not* used when
writing to the sysfs file, this is interpreted as the "defer+madvise"
option.

This is because we do prefix matching and if five characters are written
without a newline, the current code ends up comparing to the first five
bytes of the "defer+madvise" option and using that instead.

Use the more appropriate sysfs_streq() that handles the trailing newline
for us.  Since this doubles as a nice cleanup, do it in enabled_store() as
well.

The current implementation relies on prefix matching: the number of
bytes compared is either the number of bytes written or the length of
the option being compared.  With a newline, "defer\n" does not match
"defer+madvise"; without a newline, however, "defer" is considered to
match "defer+madvise" (prefix matching only compares the first five
bytes).  The end result is that writing "defer" is broken unless it has
an additional trailing character.

This means that writing "madv" in the past would match and set
"madvise".  With strict checking, that is no longer the case, but it is
unlikely anybody is currently doing this.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001171411020.56385@chino.kir.corp.google.com
Fixes: 21440d7eb904 ("mm, thp: add new defer+madvise defrag option")
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |   24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

--- a/mm/huge_memory.c~mm-thp-fix-defrag-setting-if-newline-is-not-used
+++ a/mm/huge_memory.c
@@ -177,16 +177,13 @@ static ssize_t enabled_store(struct kobj
 {
 	ssize_t ret = count;
 
-	if (!memcmp("always", buf,
-		    min(sizeof("always")-1, count))) {
+	if (sysfs_streq(buf, "always")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("madvise", buf,
-			   min(sizeof("madvise")-1, count))) {
+	} else if (sysfs_streq(buf, "madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("never", buf,
-			   min(sizeof("never")-1, count))) {
+	} else if (sysfs_streq(buf, "never")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
 	} else
@@ -250,32 +247,27 @@ static ssize_t defrag_store(struct kobje
 			    struct kobj_attribute *attr,
 			    const char *buf, size_t count)
 {
-	if (!memcmp("always", buf,
-		    min(sizeof("always")-1, count))) {
+	if (sysfs_streq(buf, "always")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("defer+madvise", buf,
-		    min(sizeof("defer+madvise")-1, count))) {
+	} else if (sysfs_streq(buf, "defer+madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("defer", buf,
-		    min(sizeof("defer")-1, count))) {
+	} else if (sysfs_streq(buf, "defer")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("madvise", buf,
-			   min(sizeof("madvise")-1, count))) {
+	} else if (sysfs_streq(buf, "madvise")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
 		set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
-	} else if (!memcmp("never", buf,
-			   min(sizeof("never")-1, count))) {
+	} else if (sysfs_streq(buf, "never")) {
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
 		clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
_
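
To see the difference concretely, here is a userspace model; the
streq_newline() helper re-implements the sysfs_streq() semantics
(exact match, tolerating one trailing newline) for illustration:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool streq_newline(const char *s1, const char *s2)
{
	while (*s1 && *s1 == *s2) {
		s1++;
		s2++;
	}
	if (*s1 == *s2)
		return true;
	if (!strcmp(s1, "\n") && !*s2)
		return true;
	if (!*s1 && !strcmp(s2, "\n"))
		return true;
	return false;
}

int main(void)
{
	const char *buf = "defer";	/* written without a newline */
	size_t count = strlen(buf);

	/* Old logic: only the first five bytes are compared, so "defer"
	 * prefix-matches "defer+madvise", which is checked first. */
	printf("memcmp: %s\n", !memcmp("defer+madvise", buf, count) ?
	       "matches defer+madvise (the bug)" : "no match");

	/* New logic: exact match modulo a trailing newline. */
	printf("streq vs defer+madvise: %d\n",
	       streq_newline(buf, "defer+madvise"));
	printf("streq \"defer\\n\" vs defer: %d\n",
	       streq_newline("defer\n", "defer"));
	return 0;
}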

* [patch 069/118] mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (67 preceding siblings ...)
  2020-01-31  6:14 ` [patch 068/118] mm, thp: fix defrag setting if newline is not used Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 070/118] mm/memory_hotplug: pass in nid to online_pages() Andrew Morton
                   ` (48 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, david, jhubbard, linmiaohe, linux-mm, mm-commits,
	richardw.yang, rientjes, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma()

The jump labels try_prev and none are not really needed in
find_mergeable_anon_vma(); eliminate them to improve readability.

Link: http://lkml.kernel.org/r/1574079844-17493-1-git-send-email-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap.c |   30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

--- a/mm/mmap.c~mm-get-rid-of-odd-jump-labels-in-find_mergeable_anon_vma
+++ a/mm/mmap.c
@@ -1270,26 +1270,22 @@ static struct anon_vma *reusable_anon_vm
  */
 struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
-	struct anon_vma *anon_vma;
-	struct vm_area_struct *near;
+	struct anon_vma *anon_vma = NULL;
 
-	near = vma->vm_next;
-	if (!near)
-		goto try_prev;
+	/* Try next first. */
+	if (vma->vm_next) {
+		anon_vma = reusable_anon_vma(vma->vm_next, vma, vma->vm_next);
+		if (anon_vma)
+			return anon_vma;
+	}
 
-	anon_vma = reusable_anon_vma(near, vma, near);
-	if (anon_vma)
-		return anon_vma;
-try_prev:
-	near = vma->vm_prev;
-	if (!near)
-		goto none;
+	/* Try prev next. */
+	if (vma->vm_prev)
+		anon_vma = reusable_anon_vma(vma->vm_prev, vma->vm_prev, vma);
 
-	anon_vma = reusable_anon_vma(near, near, vma);
-	if (anon_vma)
-		return anon_vma;
-none:
 	/*
+	 * We might reach here with anon_vma == NULL if we can't find
+	 * any reusable anon_vma.
 	 * There's no absolute need to look only at touching neighbours:
 	 * we could search further afield for "compatible" anon_vmas.
 	 * But it would probably just be a waste of time searching,
@@ -1297,7 +1293,7 @@ none:
 	 * We're trying to allow mprotect remerging later on,
 	 * not trying to minimize memory used for anon_vmas.
 	 */
-	return NULL;
+	return anon_vma;
 }
 
 /*
_

* [patch 070/118] mm/memory_hotplug: pass in nid to online_pages()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (68 preceding siblings ...)
  2020-01-31  6:14 ` [patch 069/118] mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma() Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:14 ` [patch 071/118] mm/hotplug: silence a lockdep splat with printk() Andrew Morton
                   ` (47 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, david, gregkh, linux-mm,
	mhocko, mm-commits, osalvador, pasha.tatashin, rafael, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: pass in nid to online_pages()

Patch series "mm/memory_hotplug: pass in nid to online_pages()".

Simplify onlining code and get rid of find_memory_block(). Pass in the
nid from the memory block we are trying to online directly, instead of
manually looking it up.


This patch (of 2):

No need to look up the memory block; we can directly pass in the nid.

Link: http://lkml.kernel.org/r/20200113113354.6341-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c          |    6 +++---
 include/linux/memory_hotplug.h |    3 ++-
 mm/memory_hotplug.c            |   13 ++-----------
 3 files changed, 7 insertions(+), 15 deletions(-)

--- a/drivers/base/memory.c~mm-memory_hotplug-pass-in-nid-to-online_pages
+++ a/drivers/base/memory.c
@@ -206,7 +206,7 @@ static bool pages_correctly_probed(unsig
  */
 static int
 memory_block_action(unsigned long start_section_nr, unsigned long action,
-		    int online_type)
+		    int online_type, int nid)
 {
 	unsigned long start_pfn;
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
@@ -219,7 +219,7 @@ memory_block_action(unsigned long start_
 		if (!pages_correctly_probed(start_pfn))
 			return -EBUSY;
 
-		ret = online_pages(start_pfn, nr_pages, online_type);
+		ret = online_pages(start_pfn, nr_pages, online_type, nid);
 		break;
 	case MEM_OFFLINE:
 		ret = offline_pages(start_pfn, nr_pages);
@@ -245,7 +245,7 @@ static int memory_block_change_state(str
 		mem->state = MEM_GOING_OFFLINE;
 
 	ret = memory_block_action(mem->start_section_nr, to_state,
-				mem->online_type);
+				  mem->online_type, mem->nid);
 
 	mem->state = ret ? from_state_req : to_state;
 
--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-pass-in-nid-to-online_pages
+++ a/include/linux/memory_hotplug.h
@@ -94,7 +94,8 @@ extern int zone_grow_free_lists(struct z
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 /* VM interface that may be used by firmware interface */
-extern int online_pages(unsigned long, unsigned long, int);
+extern int online_pages(unsigned long pfn, unsigned long nr_pages,
+			int online_type, int nid);
 extern int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long *valid_start, unsigned long *valid_end);
 extern unsigned long __offline_isolated_pages(unsigned long start_pfn,
--- a/mm/memory_hotplug.c~mm-memory_hotplug-pass-in-nid-to-online_pages
+++ a/mm/memory_hotplug.c
@@ -783,27 +783,18 @@ struct zone * zone_for_pfn_range(int onl
 	return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
 
-int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
+		       int online_type, int nid)
 {
 	unsigned long flags;
 	unsigned long onlined_pages = 0;
 	struct zone *zone;
 	int need_zonelists_rebuild = 0;
-	int nid;
 	int ret;
 	struct memory_notify arg;
-	struct memory_block *mem;
 
 	mem_hotplug_begin();
 
-	/*
-	 * We can't use pfn_to_nid() because nid might be stored in struct page
-	 * which is not yet initialized. Instead, we find nid from memory block.
-	 */
-	mem = find_memory_block(__pfn_to_section(pfn));
-	nid = mem->nid;
-	put_device(&mem->dev);

* [patch 071/118] mm/hotplug: silence a lockdep splat with printk()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (69 preceding siblings ...)
  2020-01-31  6:14 ` [patch 070/118] mm/memory_hotplug: pass in nid to online_pages() Andrew Morton
@ 2020-01-31  6:14 ` Andrew Morton
  2020-01-31  6:15 ` [patch 072/118] mm/page_isolation: fix potential warning from user Andrew Morton
                   ` (46 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:14 UTC (permalink / raw)
  To: akpm, cai, david, linux-mm, mhocko, mm-commits, peterz, pmladek,
	rostedt, sergey.senozhatsky.work, torvalds

From: Qian Cai <cai@lca.pw>
Subject: mm/hotplug: silence a lockdep splat with printk()

It is not that hard to trigger lockdep splats by calling printk from under
zone->lock.  Most of them are false positives caused by lock chains
introduced early in the boot process and they do not cause any real
problems (although most of the early boot lock dependencies could happen
after boot as well).  There are some console drivers which do allocate
from the printk context as well and those should be fixed.  In any case,
false positives are not that trivial to work around and it is far from
optimal to lose lockdep functionality for something that is a non-issue.

So change has_unmovable_pages() so that it no longer calls dump_page()
itself - instead it returns a "struct page *" of the unmovable page back
to the caller so that in the case of a has_unmovable_pages() failure, the
caller can call dump_page() after releasing zone->lock.  Also, make
dump_page() able to report a CMA page as well, so the reason string
from has_unmovable_pages() can be removed.

Even though has_unmovable_pages() doesn't hold any reference to the
returned page, this should be reasonably safe for the purpose of reporting
the page (dump_page) because it cannot be hot-removed in the context of
memory unplug.  The state of the page might change, but that is the case
even with the existing code, as zone->lock only plays a role for free pages.
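
In a condensed sketch of the new flow (isolate_one_block() is a made-up
name for illustration; the real caller is set_migratetype_isolate() in
the diff below, which does more work under the lock):

	static int isolate_one_block(struct zone *zone, struct page *page,
				     int migratetype, int isol_flags)
	{
		struct page *unmovable;
		unsigned long flags;
		int ret = -EBUSY;

		spin_lock_irqsave(&zone->lock, flags);
		unmovable = has_unmovable_pages(zone, page, migratetype,
						isol_flags);
		if (!unmovable)
			ret = 0;	/* pageblock can be isolated */
		spin_unlock_irqrestore(&zone->lock, flags);

		/* printk()-based reporting runs with zone->lock released */
		if (ret && (isol_flags & REPORT_FAILURE) && unmovable)
			dump_page(unmovable, "unmovable page");

		return ret;
	}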

While at it, remove a similar but unnecessary debug-only printk() as well.
A sample of one of those lockdep splats:

WARNING: possible circular locking dependency detected
------------------------------------------------------
test.sh/8653 is trying to acquire lock:
ffffffff865a4460 (console_owner){-.-.}, at:
console_unlock+0x207/0x750

but task is already holding lock:
ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&(&zone->lock)->rlock){-.-.}:
       __lock_acquire+0x5b3/0xb40
       lock_acquire+0x126/0x280
       _raw_spin_lock+0x2f/0x40
       rmqueue_bulk.constprop.21+0xb6/0x1160
       get_page_from_freelist+0x898/0x22c0
       __alloc_pages_nodemask+0x2f3/0x1cd0
       alloc_pages_current+0x9c/0x110
       allocate_slab+0x4c6/0x19c0
       new_slab+0x46/0x70
       ___slab_alloc+0x58b/0x960
       __slab_alloc+0x43/0x70
       __kmalloc+0x3ad/0x4b0
       __tty_buffer_request_room+0x100/0x250
       tty_insert_flip_string_fixed_flag+0x67/0x110
       pty_write+0xa2/0xf0
       n_tty_write+0x36b/0x7b0
       tty_write+0x284/0x4c0
       __vfs_write+0x50/0xa0
       vfs_write+0x105/0x290
       redirected_tty_write+0x6a/0xc0
       do_iter_write+0x248/0x2a0
       vfs_writev+0x106/0x1e0
       do_writev+0xd4/0x180
       __x64_sys_writev+0x45/0x50
       do_syscall_64+0xcc/0x76c
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #2 (&(&port->lock)->rlock){-.-.}:
       __lock_acquire+0x5b3/0xb40
       lock_acquire+0x126/0x280
       _raw_spin_lock_irqsave+0x3a/0x50
       tty_port_tty_get+0x20/0x60
       tty_port_default_wakeup+0xf/0x30
       tty_port_tty_wakeup+0x39/0x40
       uart_write_wakeup+0x2a/0x40
       serial8250_tx_chars+0x22e/0x440
       serial8250_handle_irq.part.8+0x14a/0x170
       serial8250_default_handle_irq+0x5c/0x90
       serial8250_interrupt+0xa6/0x130
       __handle_irq_event_percpu+0x78/0x4f0
       handle_irq_event_percpu+0x70/0x100
       handle_irq_event+0x5a/0x8b
       handle_edge_irq+0x117/0x370
       do_IRQ+0x9e/0x1e0
       ret_from_intr+0x0/0x2a
       cpuidle_enter_state+0x156/0x8e0
       cpuidle_enter+0x41/0x70
       call_cpuidle+0x5e/0x90
       do_idle+0x333/0x370
       cpu_startup_entry+0x1d/0x1f
       start_secondary+0x290/0x330
       secondary_startup_64+0xb6/0xc0

-> #1 (&port_lock_key){-.-.}:
       __lock_acquire+0x5b3/0xb40
       lock_acquire+0x126/0x280
       _raw_spin_lock_irqsave+0x3a/0x50
       serial8250_console_write+0x3e4/0x450
       univ8250_console_write+0x4b/0x60
       console_unlock+0x501/0x750
       vprintk_emit+0x10d/0x340
       vprintk_default+0x1f/0x30
       vprintk_func+0x44/0xd4
       printk+0x9f/0xc5

-> #0 (console_owner){-.-.}:
       check_prev_add+0x107/0xea0
       validate_chain+0x8fc/0x1200
       __lock_acquire+0x5b3/0xb40
       lock_acquire+0x126/0x280
       console_unlock+0x269/0x750
       vprintk_emit+0x10d/0x340
       vprintk_default+0x1f/0x30
       vprintk_func+0x44/0xd4
       printk+0x9f/0xc5
       __offline_isolated_pages.cold.52+0x2f/0x30a
       offline_isolated_pages_cb+0x17/0x30
       walk_system_ram_range+0xda/0x160
       __offline_pages+0x79c/0xa10
       offline_pages+0x11/0x20
       memory_subsys_offline+0x7e/0xc0
       device_offline+0xd5/0x110
       state_store+0xc6/0xe0
       dev_attr_store+0x3f/0x60
       sysfs_kf_write+0x89/0xb0
       kernfs_fop_write+0x188/0x240
       __vfs_write+0x50/0xa0
       vfs_write+0x105/0x290
       ksys_write+0xc6/0x160
       __x64_sys_write+0x43/0x50
       do_syscall_64+0xcc/0x76c
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
  console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&zone->lock)->rlock);
                               lock(&(&port->lock)->rlock);
                               lock(&(&zone->lock)->rlock);
  lock(console_owner);

 *** DEADLOCK ***

9 locks held by test.sh/8653:
 #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
vfs_write+0x25f/0x290
 #1: ffff888277618880 (&of->mutex){+.+.}, at:
kernfs_fop_write+0x128/0x240
 #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
kernfs_fop_write+0x138/0x240
 #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
lock_device_hotplug_sysfs+0x16/0x50
 #4: ffff8884374f4990 (&dev->mutex){....}, at:
device_offline+0x70/0x110
 #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
__offline_pages+0xbf/0xa10
 #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
percpu_down_write+0x87/0x2f0
 #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
__offline_isolated_pages+0x179/0x3e0
 #8: ffffffff865a4920 (console_lock){+.+.}, at:
vprintk_emit+0x100/0x340

stack backtrace:
Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
BIOS U34 05/21/2019
Call Trace:
 dump_stack+0x86/0xca
 print_circular_bug.cold.31+0x243/0x26e
 check_noncircular+0x29e/0x2e0
 check_prev_add+0x107/0xea0
 validate_chain+0x8fc/0x1200
 __lock_acquire+0x5b3/0xb40
 lock_acquire+0x126/0x280
 console_unlock+0x269/0x750
 vprintk_emit+0x10d/0x340
 vprintk_default+0x1f/0x30
 vprintk_func+0x44/0xd4
 printk+0x9f/0xc5
 __offline_isolated_pages.cold.52+0x2f/0x30a
 offline_isolated_pages_cb+0x17/0x30
 walk_system_ram_range+0xda/0x160
 __offline_pages+0x79c/0xa10
 offline_pages+0x11/0x20
 memory_subsys_offline+0x7e/0xc0
 device_offline+0xd5/0x110
 state_store+0xc6/0xe0
 dev_attr_store+0x3f/0x60
 sysfs_kf_write+0x89/0xb0
 kernfs_fop_write+0x188/0x240
 __vfs_write+0x50/0xa0
 vfs_write+0x105/0x290
 ksys_write+0xc6/0x160
 __x64_sys_write+0x43/0x50
 do_syscall_64+0xcc/0x76c
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-isolation.h |    4 ++--
 mm/debug.c                     |   10 +++++++++-
 mm/page_alloc.c                |   23 ++++++++++-------------
 mm/page_isolation.c            |   11 ++++++++++-
 4 files changed, 31 insertions(+), 17 deletions(-)

--- a/include/linux/page-isolation.h~mm-hotplug-silence-a-lockdep-splat-with-printk
+++ a/include/linux/page-isolation.h
@@ -33,8 +33,8 @@ static inline bool is_migrate_isolate(in
 #define MEMORY_OFFLINE	0x1
 #define REPORT_FAILURE	0x2
 
-bool has_unmovable_pages(struct zone *zone, struct page *page, int migratetype,
-			 int flags);
+struct page *has_unmovable_pages(struct zone *zone, struct page *page,
+				 int migratetype, int flags);
 void set_pageblock_migratetype(struct page *page, int migratetype);
 int move_freepages_block(struct zone *zone, struct page *page,
 				int migratetype, int *num_movable);
--- a/mm/debug.c~mm-hotplug-silence-a-lockdep-splat-with-printk
+++ a/mm/debug.c
@@ -46,6 +46,13 @@ void __dump_page(struct page *page, cons
 {
 	struct address_space *mapping;
 	bool page_poisoned = PagePoisoned(page);
+	/*
+	 * Accessing the pageblock without the zone lock. It could change to
+	 * "isolate" again in the meantime, but since we are just dumping the
+	 * state for debugging, it should be fine to accept a bit of
+	 * inaccuracy here due to racing.
+	 */
+	bool page_cma = is_migrate_cma_page(page);
 	int mapcount;
 	char *type = "";
 
@@ -92,7 +99,8 @@ void __dump_page(struct page *page, cons
 	}
 	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1);
 
-	pr_warn("%sflags: %#lx(%pGp)\n", type, page->flags, &page->flags);
+	pr_warn("%sflags: %#lx(%pGp)%s\n", type, page->flags, &page->flags,
+		page_cma ? " CMA" : "");
 
 hex_only:
 	print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
--- a/mm/page_alloc.c~mm-hotplug-silence-a-lockdep-splat-with-printk
+++ a/mm/page_alloc.c
@@ -8185,13 +8185,17 @@ void *__init alloc_large_system_hash(con
  * MIGRATE_MOVABLE block might include unmovable pages. And __PageMovable
  * check without lock_page also may miss some movable non-lru pages at
  * race condition. So you can't expect this function should be exact.
+ *
+ * Returns a page without holding a reference. If the caller wants to
+ * dereference that page (e.g., dumping), it has to make sure that it
+ * cannot get removed (e.g., via memory unplug) concurrently.
+ *
  */
-bool has_unmovable_pages(struct zone *zone, struct page *page, int migratetype,
-			 int flags)
+struct page *has_unmovable_pages(struct zone *zone, struct page *page,
+				 int migratetype, int flags)
 {
 	unsigned long iter = 0;
 	unsigned long pfn = page_to_pfn(page);
-	const char *reason = "unmovable page";
 
 	/*
 	 * TODO we could make this much more efficient by not checking every
@@ -8208,9 +8212,8 @@ bool has_unmovable_pages(struct zone *zo
 		 * so consider them movable here.
 		 */
 		if (is_migrate_cma(migratetype))
-			return false;
+			return NULL;
 
-		reason = "CMA page";
 		goto unmovable;
 	}
 
@@ -8285,12 +8288,10 @@ bool has_unmovable_pages(struct zone *zo
 		 */
 		goto unmovable;
 	}
-	return false;
+	return NULL;
 unmovable:
 	WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
-	if (flags & REPORT_FAILURE)
-		dump_page(pfn_to_page(pfn + iter), reason);
-	return true;
+	return pfn_to_page(pfn + iter);
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
@@ -8694,10 +8695,6 @@ __offline_isolated_pages(unsigned long s
 		BUG_ON(!PageBuddy(page));
 		order = page_order(page);
 		offlined_pages += 1 << order;
-#ifdef CONFIG_DEBUG_VM
-		pr_info("remove from free list %lx %d %lx\n",
-			pfn, 1 << order, end_pfn);
-#endif
 		del_page_from_free_area(page, &zone->free_area[order]);
 		pfn += (1 << order);
 	}
--- a/mm/page_isolation.c~mm-hotplug-silence-a-lockdep-splat-with-printk
+++ a/mm/page_isolation.c
@@ -17,6 +17,7 @@
 
 static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
 {
+	struct page *unmovable = NULL;
 	struct zone *zone;
 	unsigned long flags;
 	int ret = -EBUSY;
@@ -37,7 +38,8 @@ static int set_migratetype_isolate(struc
 	 * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
 	 * We just check MOVABLE pages.
 	 */
-	if (!has_unmovable_pages(zone, page, migratetype, isol_flags)) {
+	unmovable = has_unmovable_pages(zone, page, migratetype, isol_flags);
+	if (!unmovable) {
 		unsigned long nr_pages;
 		int mt = get_pageblock_migratetype(page);
 
@@ -54,6 +56,13 @@ out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 	if (!ret)
 		drain_all_pages(zone);
+	else if ((isol_flags & REPORT_FAILURE) && unmovable)
+		/*
+		 * printk() with zone->lock held will guarantee to trigger a
+		 * lockdep splat, so defer it here.
+		 */
+		dump_page(unmovable, "unmovable page");
+
 	return ret;
 }
 
_


* [patch 072/118] mm/page_isolation: fix potential warning from user
  2020-01-31  6:10 incoming Andrew Morton
                   ` (70 preceding siblings ...)
  2020-01-31  6:14 ` [patch 071/118] mm/hotplug: silence a lockdep splat with printk() Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 073/118] mm/zswap.c: add allocation hysteresis if pool limit is hit Andrew Morton
                   ` (45 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, cai, david, linux-mm, mhocko, mhocko, mm-commits, torvalds

From: Qian Cai <cai@lca.pw>
Subject: mm/page_isolation: fix potential warning from user

It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE)
from start_isolate_page_range(), but we should avoid triggering it from
userspace, i.e., from is_mem_section_removable(), because a non-root user
could crash the system if panic_on_warn is set.

While at it, simplify the code a bit by removing an unnecessary jump
label.

Link: http://lkml.kernel.org/r/20200120163915.1469-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c     |   11 ++++-------
 mm/page_isolation.c |   18 +++++++++++-------
 2 files changed, 15 insertions(+), 14 deletions(-)

--- a/mm/page_alloc.c~mm-page_isolation-fix-potential-warning-from-user
+++ a/mm/page_alloc.c
@@ -8214,7 +8214,7 @@ struct page *has_unmovable_pages(struct
 		if (is_migrate_cma(migratetype))
 			return NULL;
 
-		goto unmovable;
+		return page;
 	}
 
 	for (; iter < pageblock_nr_pages; iter++) {
@@ -8224,7 +8224,7 @@ struct page *has_unmovable_pages(struct
 		page = pfn_to_page(pfn + iter);
 
 		if (PageReserved(page))
-			goto unmovable;
+			return page;
 
 		/*
 		 * If the zone is movable and we have ruled out all reserved
@@ -8244,7 +8244,7 @@ struct page *has_unmovable_pages(struct
 			unsigned int skip_pages;
 
 			if (!hugepage_migration_supported(page_hstate(head)))
-				goto unmovable;
+				return page;
 
 			skip_pages = compound_nr(head) - (page - head);
 			iter += skip_pages - 1;
@@ -8286,12 +8286,9 @@ struct page *has_unmovable_pages(struct
 		 * is set to both of a memory hole page and a _used_ kernel
 		 * page at boot.
 		 */
-		goto unmovable;
+		return page;
 	}
 	return NULL;
-unmovable:
-	WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
-	return pfn_to_page(pfn + iter);
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
--- a/mm/page_isolation.c~mm-page_isolation-fix-potential-warning-from-user
+++ a/mm/page_isolation.c
@@ -54,14 +54,18 @@ static int set_migratetype_isolate(struc
 
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
-	if (!ret)
+	if (!ret) {
 		drain_all_pages(zone);
-	else if ((isol_flags & REPORT_FAILURE) && unmovable)
-		/*
-		 * printk() with zone->lock held will guarantee to trigger a
-		 * lockdep splat, so defer it here.
-		 */
-		dump_page(unmovable, "unmovable page");
+	} else {
+		WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
+
+		if ((isol_flags & REPORT_FAILURE) && unmovable)
+			/*
+			 * printk() with zone->lock held will likely trigger a
+			 * lockdep splat, so defer it here.
+			 */
+			dump_page(unmovable, "unmovable page");
+	}
 
 	return ret;
 }
_


* [patch 073/118] mm/zswap.c: add allocation hysteresis if pool limit is hit
  2020-01-31  6:10 incoming Andrew Morton
                   ` (71 preceding siblings ...)
  2020-01-31  6:15 ` [patch 072/118] mm/page_isolation: fix potential warning from user Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 074/118] zswap: potential NULL dereference on error in init_zswap() Andrew Morton
                   ` (44 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, ddstreet, linux-mm, mm-commits, torvalds, vitaly.wool

From: Vitaly Wool <vitaly.wool@konsulko.com>
Subject: mm/zswap.c: add allocation hysteresis if pool limit is hit

zswap currently always tries to shrink the pool when it is full.  Under
high pressure on zswap this results in flipping pages in and out of the
zswap pool without any real benefit, and the overall system performance
drops.  The previous discussion on this subject [1] ended up with a
suggestion to implement a form of hysteresis: once the limit has been hit,
refuse to take pages into the zswap pool until it again has sufficient
space.  This is my take on that.

Hysteresis is controlled with a sysfs-configurable parameter (namely,
/sys/module/zswap/parameters/accept_threshold_percent).  It specifies the
threshold at which zswap starts accepting pages again after it became
full.  Setting this parameter to 100 disables the hysteresis and restores
the pre-hysteresis behavior.
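
A minimal sketch of the resulting store-path check (zswap_store_allowed()
is a made-up helper name here; zswap_is_full(), zswap_can_accept() and
zswap_pool_reached_full come from the patch, and the real
zswap_frontswap_store() additionally queues the shrink worker when full):

	static bool zswap_store_allowed(void)
	{
		if (zswap_is_full()) {
			/* trip the hysteresis: stop accepting new pages */
			zswap_pool_reached_full = true;
			return false;
		}
		if (zswap_pool_reached_full) {
			/* stay closed until back under the accept threshold */
			if (!zswap_can_accept())
				return false;
			zswap_pool_reached_full = false;
		}
		return true;
	}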

[1] https://lkml.org/lkml/2019/11/8/949

Link: http://lkml.kernel.org/r/20200108200118.15563-1-vitaly.wool@konsulko.com
Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/zswap.rst |   13 +++++
 mm/zswap.c                 |   85 ++++++++++++++++++++++-------------
 2 files changed, 67 insertions(+), 31 deletions(-)

--- a/Documentation/vm/zswap.rst~zswap-add-allocation-hysteresis-if-pool-limit-is-hit
+++ a/Documentation/vm/zswap.rst
@@ -130,6 +130,19 @@ checking for the same-value filled pages
 existing pages which are marked as same-value filled pages remain stored
 unchanged in zswap until they are either loaded or invalidated.
 
+To prevent zswap from shrinking the pool when it is full while there is
+high pressure on swap (which would flip pages in and out of the zswap pool
+without any real benefit but with a performance drop for the system), a
+special parameter implements a form of hysteresis: once the limit has been
+hit, zswap refuses to take pages until the pool again has sufficient space.
+To set the threshold at which zswap starts accepting pages
+again after it became full, use the sysfs ``accept_threshold_percent``
+attribute, e.g.::
+
+	echo 80 > /sys/module/zswap/parameters/accept_threshold_percent
+
+Setting this parameter to 100 will disable the hysteresis.
+
 A debugfs interface is provided for various statistic about pool size, number
 of pages stored, same-value filled pages and various counters for the reasons
 pages are rejected.
--- a/mm/zswap.c~zswap-add-allocation-hysteresis-if-pool-limit-is-hit
+++ a/mm/zswap.c
@@ -32,6 +32,7 @@
 #include <linux/swapops.h>
 #include <linux/writeback.h>
 #include <linux/pagemap.h>
+#include <linux/workqueue.h>
 
 /*********************************
 * statistics
@@ -65,6 +66,11 @@ static u64 zswap_reject_kmemcache_fail;
 /* Duplicate store was encountered (rare) */
 static u64 zswap_duplicate_entry;
 
+/* Shrinker work queue */
+static struct workqueue_struct *shrink_wq;
+/* Pool limit was hit, we need to calm down */
+static bool zswap_pool_reached_full;
+
 /*********************************
 * tunables
 **********************************/
@@ -109,6 +115,11 @@ module_param_cb(zpool, &zswap_zpool_para
 static unsigned int zswap_max_pool_percent = 20;
 module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
 
+/* The threshold for accepting new pages after the max_pool_percent was hit */
+static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
+module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
+		   uint, 0644);
+
 /* Enable/disable handling same-value filled pages (enabled by default) */
 static bool zswap_same_filled_pages_enabled = true;
 module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled,
@@ -123,7 +134,8 @@ struct zswap_pool {
 	struct crypto_comp * __percpu *tfm;
 	struct kref kref;
 	struct list_head list;
-	struct work_struct work;
+	struct work_struct release_work;
+	struct work_struct shrink_work;
 	struct hlist_node node;
 	char tfm_name[CRYPTO_MAX_ALG_NAME];
 };
@@ -214,6 +226,13 @@ static bool zswap_is_full(void)
 			DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
 }
 
+static bool zswap_can_accept(void)
+{
+	return totalram_pages() * zswap_accept_thr_percent / 100 *
+				zswap_max_pool_percent / 100 >
+			DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
+}
+
 static void zswap_update_total_size(void)
 {
 	struct zswap_pool *pool;
@@ -501,6 +520,16 @@ static struct zswap_pool *zswap_pool_fin
 	return NULL;
 }
 
+static void shrink_worker(struct work_struct *w)
+{
+	struct zswap_pool *pool = container_of(w, typeof(*pool),
+						shrink_work);
+
+	if (zpool_shrink(pool->zpool, 1, NULL))
+		zswap_reject_reclaim_fail++;
+	zswap_pool_put(pool);
+}
+
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
 	struct zswap_pool *pool;
@@ -551,6 +580,7 @@ static struct zswap_pool *zswap_pool_cre
 	 */
 	kref_init(&pool->kref);
 	INIT_LIST_HEAD(&pool->list);
+	INIT_WORK(&pool->shrink_work, shrink_worker);
 
 	zswap_pool_debug("created", pool);
 
@@ -624,7 +654,8 @@ static int __must_check zswap_pool_get(s
 
 static void __zswap_pool_release(struct work_struct *work)
 {
-	struct zswap_pool *pool = container_of(work, typeof(*pool), work);
+	struct zswap_pool *pool = container_of(work, typeof(*pool),
+						release_work);
 
 	synchronize_rcu();
 
@@ -647,8 +678,8 @@ static void __zswap_pool_empty(struct kr
 
 	list_del_rcu(&pool->list);
 
-	INIT_WORK(&pool->work, __zswap_pool_release);
-	schedule_work(&pool->work);
+	INIT_WORK(&pool->release_work, __zswap_pool_release);
+	schedule_work(&pool->release_work);
 
 	spin_unlock(&zswap_pools_lock);
 }
@@ -942,22 +973,6 @@ end:
 	return ret;
 }
 
-static int zswap_shrink(void)
-{
-	struct zswap_pool *pool;
-	int ret;
-
-	pool = zswap_pool_last_get();
-	if (!pool)
-		return -ENOENT;
-
-	ret = zpool_shrink(pool->zpool, 1, NULL);
-
-	zswap_pool_put(pool);
-
-	return ret;
-}
-
 static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
 {
 	unsigned int pos;
@@ -1011,21 +1026,23 @@ static int zswap_frontswap_store(unsigne
 
 	/* reclaim space if needed */
 	if (zswap_is_full()) {
+		struct zswap_pool *pool;
+
 		zswap_pool_limit_hit++;
-		if (zswap_shrink()) {
-			zswap_reject_reclaim_fail++;
-			ret = -ENOMEM;
-			goto reject;
-		}
+		zswap_pool_reached_full = true;
+		pool = zswap_pool_last_get();
+		if (pool)
+			queue_work(shrink_wq, &pool->shrink_work);
+		ret = -ENOMEM;
+		goto reject;
+	}
 
-		/* A second zswap_is_full() check after
-		 * zswap_shrink() to make sure it's now
-		 * under the max_pool_percent
-		 */
-		if (zswap_is_full()) {
+	if (zswap_pool_reached_full) {
+	       if (!zswap_can_accept()) {
 			ret = -ENOMEM;
 			goto reject;
-		}
+		} else
+			zswap_pool_reached_full = false;
 	}
 
 	/* allocate entry */
@@ -1332,11 +1349,17 @@ static int __init init_zswap(void)
 		zswap_enabled = false;
 	}
 
+	shrink_wq = create_workqueue("zswap-shrink");
+	if (!shrink_wq)
+		goto fallback_fail;
+
 	frontswap_register_ops(&zswap_frontswap_ops);
 	if (zswap_debugfs_init())
 		pr_warn("debugfs initialization failed\n");
 	return 0;
 
+fallback_fail:
+	zswap_pool_destroy(pool);
 hp_fail:
 	cpuhp_remove_state(CPUHP_MM_ZSWP_MEM_PREPARE);
 dstmem_fail:
_


* [patch 074/118] zswap: potential NULL dereference on error in init_zswap()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (72 preceding siblings ...)
  2020-01-31  6:15 ` [patch 073/118] mm/zswap.c: add allocation hysteresis if pool limit is hit Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 075/118] include/linux/mm.h: clean up obsolete check on space in page->flags Andrew Morton
                   ` (43 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, dan.carpenter, linux-mm, mm-commits, torvalds, vitaly.wool

From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: zswap: potential NULL dereference on error in init_zswap()

The "pool" pointer can be NULL at the end of the init_zswap().  (We would
allocate a new pool later in that situation.) So in the error handling
then we need to make sure pool is a valid pointer before calling
"zswap_pool_destroy(pool);" because that function dereferences the
argument.

Link: http://lkml.kernel.org/r/20200114050902.og32fkllkod5ycf5@kili.mountain
Fixes: 93d4dfa9fbd0 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zswap.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/zswap.c~zswap-potential-null-dereference-on-error-in-init_zswap
+++ a/mm/zswap.c
@@ -1359,7 +1359,8 @@ static int __init init_zswap(void)
 	return 0;
 
 fallback_fail:
-	zswap_pool_destroy(pool);
+	if (pool)
+		zswap_pool_destroy(pool);
 hp_fail:
 	cpuhp_remove_state(CPUHP_MM_ZSWP_MEM_PREPARE);
 dstmem_fail:
_


* [patch 075/118] include/linux/mm.h: clean up obsolete check on space in page->flags
  2020-01-31  6:10 incoming Andrew Morton
                   ` (73 preceding siblings ...)
  2020-01-31  6:15 ` [patch 074/118] zswap: potential NULL dereference on error in init_zswap() Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 076/118] include/linux/mm.h: remove dead code totalram_pages_set() Andrew Morton
                   ` (42 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, arnd, david, linux-mm, mm-commits, torvalds, yuzhao

From: Yu Zhao <yuzhao@google.com>
Subject: include/linux/mm.h: clean up obsolete check on space in page->flags

The check was intended to make sure we don't overrun page flags.  But it's
obsolete because it includes neither LAST_CPUPID_WIDTH nor KASAN_TAG_WIDTH.

Just remove the check since we already have it covered in
linux/page-flags-layout.h (near the end of the file).

Link: http://lkml.kernel.org/r/20191208183508.89177-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    4 ----
 1 file changed, 4 deletions(-)

--- a/include/linux/mm.h~mm-clean-up-obsolete-check-on-space-in-page-flags
+++ a/include/linux/mm.h
@@ -916,10 +916,6 @@ vm_fault_t finish_mkwrite_fault(struct v
 
 #define ZONEID_PGSHIFT		(ZONEID_PGOFF * (ZONEID_SHIFT != 0))
 
-#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
-#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
-#endif


* [patch 076/118] include/linux/mm.h: remove dead code totalram_pages_set()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (74 preceding siblings ...)
  2020-01-31  6:15 ` [patch 075/118] include/linux/mm.h: clean up obsolete check on space in page->flags Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 077/118] include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block Andrew Morton
                   ` (41 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, david, linux-mm, mm-commits, richardw.yang, torvalds

From: Wei Yang <richardw.yang@linux.intel.com>
Subject: include/linux/mm.h: remove dead code totalram_pages_set()

totalram_pages_set() was introduced in commit ca79b0c211af ("mm:
convert totalram_pages and totalhigh_pages variables to atomic"), but
no one uses it.

Link: http://lkml.kernel.org/r/20191218005543.24146-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    5 -----
 1 file changed, 5 deletions(-)

--- a/include/linux/mm.h~mm-remove-dead-code-totalram_pages_set
+++ a/include/linux/mm.h
@@ -70,11 +70,6 @@ static inline void totalram_pages_add(lo
 	atomic_long_add(count, &_totalram_pages);
 }
 
-static inline void totalram_pages_set(long val)
-{
-	atomic_long_set(&_totalram_pages, val);
-}


* [patch 077/118] include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block
  2020-01-31  6:10 incoming Andrew Morton
                   ` (75 preceding siblings ...)
  2020-01-31  6:15 ` [patch 076/118] include/linux/mm.h: remove dead code totalram_pages_set() Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 078/118] mm: fix comments related to node reclaim Andrew Morton
                   ` (40 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, david, linux-mm, mhocko,
	mm-commits, pasha.tatashin, torvalds

From: Anshuman Khandual <anshuman.khandual@arm.com>
Subject: include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block

The memory_block structure fields 'hw' and 'phys_callback' are unused.
They were originally added with commit 3947be1969a9 ("[PATCH] memory
hotplug: sysfs and add/remove functions") but never seem to have been
used.  Just drop them now.

Link: http://lkml.kernel.org/r/1576728650-13867-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memory.h |    2 --
 1 file changed, 2 deletions(-)

--- a/include/linux/memory.h~mm-drop-elements-hw-and-phys_callback-from-struct-memory_block
+++ a/include/linux/memory.h
@@ -29,8 +29,6 @@ struct memory_block {
 	int section_count;		/* serialized by mem_sysfs_mutex */
 	int online_type;		/* for passing data to online routine */
 	int phys_device;		/* to which fru does this belong? */
-	void *hw;			/* optional pointer to fw/hw data */
-	int (*phys_callback)(struct memory_block *);
 	struct device dev;
 	int nid;			/* NID for this memory block */
 };
_


* [patch 078/118] mm: fix comments related to node reclaim
  2020-01-31  6:10 incoming Andrew Morton
                   ` (76 preceding siblings ...)
  2020-01-31  6:15 ` [patch 077/118] include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 079/118] zram: try to avoid worst-case scenario on same element pages Andrew Morton
                   ` (39 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, anshuman.khandual, haolee.swjtu, linux-mm, mgorman,
	mm-commits, torvalds

From: Hao Lee <haolee.swjtu@gmail.com>
Subject: mm: fix comments related to node reclaim

As zone reclaim has been replaced by node reclaim, this patch fixes related
comments.

Link: http://lkml.kernel.org/r/20191126141346.GA22665@haolee.github.io
Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h      |    2 +-
 include/uapi/linux/sysctl.h |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/mmzone.h~mm-fix-comments-related-to-node-reclaim
+++ a/include/linux/mmzone.h
@@ -758,7 +758,7 @@ typedef struct pglist_data {
 
 #ifdef CONFIG_NUMA
 	/*
-	 * zone reclaim becomes active if more unmapped pages exist.
+	 * node reclaim becomes active if more unmapped pages exist.
 	 */
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
--- a/include/uapi/linux/sysctl.h~mm-fix-comments-related-to-node-reclaim
+++ a/include/uapi/linux/sysctl.h
@@ -195,7 +195,7 @@ enum
 	VM_MIN_UNMAPPED=32,	/* Set min percent of unmapped pages */
 	VM_PANIC_ON_OOM=33,	/* panic at out-of-memory */
 	VM_VDSO_ENABLED=34,	/* map VDSO into new processes? */
-	VM_MIN_SLAB=35,		 /* Percent pages ignored by zone reclaim */
+	VM_MIN_SLAB=35,		 /* Percent pages ignored by node reclaim */
 };
 
 
_


* [patch 079/118] zram: try to avoid worst-case scenario on same element pages
  2020-01-31  6:10 incoming Andrew Morton
                   ` (77 preceding siblings ...)
  2020-01-31  6:15 ` [patch 078/118] mm: fix comments related to node reclaim Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 080/118] drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store Andrew Morton
                   ` (38 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, axboe, linux-mm, minchan, mm-commits,
	sergey.senozhatsky.work, taejoon.song, torvalds

From: Taejoon Song <taejoon.song@lge.com>
Subject: zram: try to avoid worst-case scenario on same element pages

The worst-case scenario when detecting same-element pages is that almost
all elements look identical at first glance but only the last few elements
differ.

Since identical elements tend to be grouped at the beginning of a page,
comparing the first element against the last element before looping
through all elements gives us a good chance of quickly detecting
non-same-element pages.

1. Test is done under LG webOS TV (64-bit arch)
2. Dump the swap-out pages (~819200 pages)
3. Analyze the pages with simple test script which counts the iteration
   number and measures the speed at off-line

Under a 64-bit arch, the worst-case iteration count is PAGE_SIZE / 8 bytes
= 512.  The speed is based on the time spent in the page_same_filled()
function only.  The averaged results are listed below:

                                   Num of Iter    Speed(MB/s)
Looping-Forward (Orig)                 38            99265
Looping-Backward                       36           102725
Last-element-check (This Patch)        33           125072

The result shows that the average iteration count decreases by 13% and the
speed increases by 25% with this patch.  This patch does not increase the
overall time complexity, though.

I also ran a simpler version which uses a backward loop.  Just looping
backward also brings some improvement, but less than this patch.

[taejoon.song@lge.com: fix off-by-one]
  Link: http://lkml.kernel.org/r/1578642001-11765-1-git-send-email-taejoon.song@lge.com
Link: http://lkml.kernel.org/r/1575424418-16119-1-git-send-email-taejoon.song@lge.com
Signed-off-by: Taejoon Song <taejoon.song@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/zram/zram_drv.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

--- a/drivers/block/zram/zram_drv.c~zram-try-to-avoid-worst-case-scenario-on-same-element-pages
+++ a/drivers/block/zram/zram_drv.c
@@ -207,14 +207,17 @@ static inline void zram_fill_page(void *
 
 static bool page_same_filled(void *ptr, unsigned long *element)
 {
-	unsigned int pos;
 	unsigned long *page;
 	unsigned long val;
+	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;
 
 	page = (unsigned long *)ptr;
 	val = page[0];
 
-	for (pos = 1; pos < PAGE_SIZE / sizeof(*page); pos++) {
+	if (val != page[last_pos])
+		return false;
+
+	for (pos = 1; pos < last_pos; pos++) {
 		if (val != page[pos])
 			return false;
 	}
_


* [patch 080/118] drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store
  2020-01-31  6:10 incoming Andrew Morton
                   ` (78 preceding siblings ...)
  2020-01-31  6:15 ` [patch 079/118] zram: try to avoid worst-case scenario on same element pages Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 081/118] include/linux/units.h: add helpers for kelvin to/from Celsius conversion Andrew Morton
                   ` (37 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akpm, axboe, colin.king, linux-mm, minchan, mm-commits,
	sergey.senozhatsky, torvalds

From: Colin Ian King <colin.king@canonical.com>
Subject: drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store

Currently, when an error code of -EIO or -ENOSPC is assigned to ret in the
for-loop of writeback_store(), it is overwritten by the ret = len
assignment at the end of the function, so the error codes are lost.
Fix this by assigning ret = len at the start of the function and removing
the assignment from the end, allowing ret to be preserved when error
codes are assigned to it.
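
As a stand-alone sketch of that pattern (demo_store() and its failure
condition are invented here for illustration):

	static ssize_t demo_store(const char *buf, size_t len)
	{
		ssize_t ret = len;	/* assume success up front */
		size_t i;

		for (i = 0; i < len; i++) {
			if (buf[i] == '\0') {	/* stand-in for -EIO/-ENOSPC */
				ret = -EIO;
				break;
			}
		}
		/* no trailing "ret = len;" to clobber an earlier error */
		return ret;
	}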

Addresses Coverity ("Unused value")

Link: http://lkml.kernel.org/r/20191128122958.178290-1-colin.king@canonical.com
Fixes: a939888ec38b ("zram: support idle/huge page writeback")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/zram/zram_drv.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/drivers/block/zram/zram_drv.c~zram-fix-error-return-codes-not-being-returned-in-writeback_store
+++ a/drivers/block/zram/zram_drv.c
@@ -629,7 +629,7 @@ static ssize_t writeback_store(struct de
 	struct bio bio;
 	struct bio_vec bio_vec;
 	struct page *page;
-	ssize_t ret;
+	ssize_t ret = len;
 	int mode;
 	unsigned long blk_idx = 0;
 
@@ -765,7 +765,6 @@ next:
 
 	if (blk_idx)
 		free_block_bdev(zram, blk_idx);
-	ret = len;
 	__free_page(page);
 release_init_lock:
 	up_read(&zram->init_lock);
_


* [patch 081/118] include/linux/units.h: add helpers for kelvin to/from Celsius conversion
  2020-01-31  6:10 incoming Andrew Morton
                   ` (79 preceding siblings ...)
  2020-01-31  6:15 ` [patch 080/118] drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 082/118] ACPI: thermal: switch to use <linux/units.h> helpers Andrew Morton
                   ` (36 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: include/linux/units.h: add helpers for kelvin to/from Celsius conversion

Patch series "add header file for kelvin to/from Celsius conversion
helpers", v4.

There are several helper macros to convert kelvin to/from Celsius in
<linux/thermal.h> for thermal drivers.  These are useful for any other
drivers or subsystems, but it's odd to include <linux/thermal.h> just for
the helpers.

This adds a new <linux/units.h> that provides the equivalent inline
functions for any drivers or subsystems, and switches all the users of
conversion helpers in <linux/thermal.h> to use <linux/units.h> helpers.


This patch (of 12):

There are several helper macros to convert kelvin to/from Celsius in
<linux/thermal.h> for thermal drivers.  These are useful for any other
drivers or subsystems, but it's odd to include <linux/thermal.h> just for
the helpers.

This adds a new <linux/units.h> that provides the equivalent inline
functions for any drivers or subsystems.  It is intended to replace the
helpers in <linux/thermal.h>.
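
For illustration, a hedged usage sketch (units_demo() is invented here;
the expected values follow from the definitions added below):

	#include <linux/units.h>

	static void units_demo(void)
	{
		pr_info("25 C = %ld dK\n", celsius_to_deci_kelvin(25));     /* 2982 */
		pr_info("2982 dK = %ld C\n", deci_kelvin_to_celsius(2982)); /* 25 */
		/* offset variant: 2731 dK with a 273100 mC offset is 0 mC */
		pr_info("%ld mC\n",
			deci_kelvin_to_millicelsius_with_offset(2731, 273100));
	}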

Link: http://lkml.kernel.org/r/1576386975-7941-2-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/units.h |   84 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

--- /dev/null
+++ a/include/linux/units.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNITS_H
+#define _LINUX_UNITS_H
+
+#include <linux/kernel.h>
+
+#define ABSOLUTE_ZERO_MILLICELSIUS -273150
+
+static inline long milli_kelvin_to_millicelsius(long t)
+{
+	return t + ABSOLUTE_ZERO_MILLICELSIUS;
+}
+
+static inline long millicelsius_to_milli_kelvin(long t)
+{
+	return t - ABSOLUTE_ZERO_MILLICELSIUS;
+}
+
+#define MILLIDEGREE_PER_DEGREE 1000
+#define MILLIDEGREE_PER_DECIDEGREE 100
+
+static inline long kelvin_to_millicelsius(long t)
+{
+	return milli_kelvin_to_millicelsius(t * MILLIDEGREE_PER_DEGREE);
+}
+
+static inline long millicelsius_to_kelvin(long t)
+{
+	t = millicelsius_to_milli_kelvin(t);
+
+	return DIV_ROUND_CLOSEST(t, MILLIDEGREE_PER_DEGREE);
+}
+
+static inline long deci_kelvin_to_celsius(long t)
+{
+	t = milli_kelvin_to_millicelsius(t * MILLIDEGREE_PER_DECIDEGREE);
+
+	return DIV_ROUND_CLOSEST(t, MILLIDEGREE_PER_DEGREE);
+}
+
+static inline long celsius_to_deci_kelvin(long t)
+{
+	t = millicelsius_to_milli_kelvin(t * MILLIDEGREE_PER_DEGREE);
+
+	return DIV_ROUND_CLOSEST(t, MILLIDEGREE_PER_DECIDEGREE);
+}
+
+/**
+ * deci_kelvin_to_millicelsius_with_offset - convert Kelvin to Celsius
+ * @t: temperature value in decidegrees Kelvin
+ * @offset: difference between Kelvin and Celsius in millidegrees
+ *
+ * Return: temperature value in millidegrees Celsius
+ */
+static inline long deci_kelvin_to_millicelsius_with_offset(long t, long offset)
+{
+	return t * MILLIDEGREE_PER_DECIDEGREE - offset;
+}
+
+static inline long deci_kelvin_to_millicelsius(long t)
+{
+	return milli_kelvin_to_millicelsius(t * MILLIDEGREE_PER_DECIDEGREE);
+}
+
+static inline long millicelsius_to_deci_kelvin(long t)
+{
+	t = millicelsius_to_milli_kelvin(t);
+
+	return DIV_ROUND_CLOSEST(t, MILLIDEGREE_PER_DECIDEGREE);
+}
+
+static inline long kelvin_to_celsius(long t)
+{
+	return t + DIV_ROUND_CLOSEST(ABSOLUTE_ZERO_MILLICELSIUS,
+				     MILLIDEGREE_PER_DEGREE);
+}
+
+static inline long celsius_to_kelvin(long t)
+{
+	return t - DIV_ROUND_CLOSEST(ABSOLUTE_ZERO_MILLICELSIUS,
+				     MILLIDEGREE_PER_DEGREE);
+}
+
+#endif /* _LINUX_UNITS_H */
_


* [patch 082/118] ACPI: thermal: switch to use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (80 preceding siblings ...)
  2020-01-31  6:15 ` [patch 081/118] include/linux/units.h: add helpers for kelvin to/from Celsius conversion Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 083/118] platform/x86: asus-wmi: " Andrew Morton
                   ` (35 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: ACPI: thermal: switch to use <linux/units.h> helpers

This switches the ACPI thermal zone driver to use
celsius_to_deci_kelvin(), deci_kelvin_to_celsius(), and
deci_kelvin_to_millicelsius_with_offset() in <linux/units.h> instead of
helpers in <linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius conversion
helpers in <linux/units.h>.
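
A quick arithmetic check with assumed values (offset_demo() is invented
and assumes <linux/units.h>) shows the offset now doing its work in
millidegrees:

	static void offset_demo(void)
	{
		long temp_dk = 3031;	/* 303.1 K, e.g. as read via ACPI _TMP */
		long offset = 273100;	/* tz->kelvin_offset, in millidegrees */

		/* 3031 * 100 - 273100 = 30000 mC, i.e. 30.000 C */
		pr_info("%ld mC\n",
			deci_kelvin_to_millicelsius_with_offset(temp_dk, offset));
	}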

Link: http://lkml.kernel.org/r/1576386975-7941-3-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/acpi/thermal.c |   34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

--- a/drivers/acpi/thermal.c~acpi-thermal-switch-to-use-linux-unitsh-helpers
+++ a/drivers/acpi/thermal.c
@@ -27,6 +27,7 @@
 #include <linux/acpi.h>
 #include <linux/workqueue.h>
 #include <linux/uaccess.h>
+#include <linux/units.h>
 
 #define PREFIX "ACPI: "
 
@@ -172,7 +173,7 @@ struct acpi_thermal {
 	struct acpi_handle_list devices;
 	struct thermal_zone_device *thermal_zone;
 	int tz_enabled;
-	int kelvin_offset;
+	int kelvin_offset;	/* in millidegrees */
 	struct work_struct thermal_check_work;
 };
 
@@ -297,7 +298,8 @@ static int acpi_thermal_trips_update(str
 			if (crt == -1) {
 				tz->trips.critical.flags.valid = 0;
 			} else if (crt > 0) {
-				unsigned long crt_k = CELSIUS_TO_DECI_KELVIN(crt);
+				unsigned long crt_k = celsius_to_deci_kelvin(crt);
+
 				/*
 				 * Allow override critical threshold
 				 */
@@ -333,7 +335,7 @@ static int acpi_thermal_trips_update(str
 		if (psv == -1) {
 			status = AE_SUPPORT;
 		} else if (psv > 0) {
-			tmp = CELSIUS_TO_DECI_KELVIN(psv);
+			tmp = celsius_to_deci_kelvin(psv);
 			status = AE_OK;
 		} else {
 			status = acpi_evaluate_integer(tz->device->handle,
@@ -413,7 +415,7 @@ static int acpi_thermal_trips_update(str
 					break;
 				if (i == 1)
 					tz->trips.active[0].temperature =
-						CELSIUS_TO_DECI_KELVIN(act);
+						celsius_to_deci_kelvin(act);
 				else
 					/*
 					 * Don't allow override higher than
@@ -421,9 +423,9 @@ static int acpi_thermal_trips_update(str
 					 */
 					tz->trips.active[i - 1].temperature =
 						(tz->trips.active[i - 2].temperature <
-						CELSIUS_TO_DECI_KELVIN(act) ?
+						celsius_to_deci_kelvin(act) ?
 						tz->trips.active[i - 2].temperature :
-						CELSIUS_TO_DECI_KELVIN(act));
+						celsius_to_deci_kelvin(act));
 				break;
 			} else {
 				tz->trips.active[i].temperature = tmp;
@@ -519,7 +521,7 @@ static int thermal_get_temp(struct therm
 	if (result)
 		return result;
 
-	*temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(tz->temperature,
+	*temp = deci_kelvin_to_millicelsius_with_offset(tz->temperature,
 							tz->kelvin_offset);
 	return 0;
 }
@@ -624,7 +626,7 @@ static int thermal_get_trip_temp(struct
 
 	if (tz->trips.critical.flags.valid) {
 		if (!trip) {
-			*temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(
+			*temp = deci_kelvin_to_millicelsius_with_offset(
 				tz->trips.critical.temperature,
 				tz->kelvin_offset);
 			return 0;
@@ -634,7 +636,7 @@ static int thermal_get_trip_temp(struct
 
 	if (tz->trips.hot.flags.valid) {
 		if (!trip) {
-			*temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(
+			*temp = deci_kelvin_to_millicelsius_with_offset(
 				tz->trips.hot.temperature,
 				tz->kelvin_offset);
 			return 0;
@@ -644,7 +646,7 @@ static int thermal_get_trip_temp(struct
 
 	if (tz->trips.passive.flags.valid) {
 		if (!trip) {
-			*temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(
+			*temp = deci_kelvin_to_millicelsius_with_offset(
 				tz->trips.passive.temperature,
 				tz->kelvin_offset);
 			return 0;
@@ -655,7 +657,7 @@ static int thermal_get_trip_temp(struct
 	for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE &&
 		tz->trips.active[i].flags.valid; i++) {
 		if (!trip) {
-			*temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(
+			*temp = deci_kelvin_to_millicelsius_with_offset(
 				tz->trips.active[i].temperature,
 				tz->kelvin_offset);
 			return 0;
@@ -672,7 +674,7 @@ static int thermal_get_crit_temp(struct
 	struct acpi_thermal *tz = thermal->devdata;
 
 	if (tz->trips.critical.flags.valid) {
-		*temperature = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(
+		*temperature = deci_kelvin_to_millicelsius_with_offset(
 				tz->trips.critical.temperature,
 				tz->kelvin_offset);
 		return 0;
@@ -692,7 +694,7 @@ static int thermal_get_trend(struct ther
 
 	if (type == THERMAL_TRIP_ACTIVE) {
 		int trip_temp;
-		int temp = DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(
+		int temp = deci_kelvin_to_millicelsius_with_offset(
 					tz->temperature, tz->kelvin_offset);
 		if (thermal_get_trip_temp(thermal, trip, &trip_temp))
 			return -EINVAL;
@@ -1043,9 +1045,9 @@ static void acpi_thermal_guess_offset(st
 {
 	if (tz->trips.critical.flags.valid &&
 	    (tz->trips.critical.temperature % 5) == 1)
-		tz->kelvin_offset = 2731;
+		tz->kelvin_offset = 273100;
 	else
-		tz->kelvin_offset = 2732;
+		tz->kelvin_offset = 273200;
 }
 
 static void acpi_thermal_check_fn(struct work_struct *work)
@@ -1087,7 +1089,7 @@ static int acpi_thermal_add(struct acpi_
 	INIT_WORK(&tz->thermal_check_work, acpi_thermal_check_fn);
 
 	pr_info(PREFIX "%s [%s] (%ld C)\n", acpi_device_name(device),
-		acpi_device_bid(device), DECI_KELVIN_TO_CELSIUS(tz->temperature));
+		acpi_device_bid(device), deci_kelvin_to_celsius(tz->temperature));
 	goto end;
 
 free_memory:
_


* [patch 083/118] platform/x86: asus-wmi: switch to use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (81 preceding siblings ...)
  2020-01-31  6:15 ` [patch 082/118] ACPI: thermal: switch to use <linux/units.h> helpers Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 084/118] platform/x86: intel_menlow: " Andrew Morton
                   ` (34 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: platform/x86: asus-wmi: switch to use <linux/units.h> helpers

The asus-wmi driver doesn't implement the thermal device functionality
directly, so including <linux/thermal.h> just for DECI_KELVIN_TO_CELSIUS()
is a bit odd.

This switches the asus-wmi driver to use deci_kelvin_to_millicelsius() in
<linux/units.h>.

The format string is changed from %d to %ld to match the function's return type.

Link: http://lkml.kernel.org/r/1576386975-7941-4-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/platform/x86/asus-wmi.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/drivers/platform/x86/asus-wmi.c~platform-x86-asus-wmi-switch-to-use-linux-unitsh-helpers
+++ a/drivers/platform/x86/asus-wmi.c
@@ -33,9 +33,9 @@
 #include <linux/seq_file.h>
 #include <linux/platform_data/x86/asus-wmi.h>
 #include <linux/platform_device.h>
-#include <linux/thermal.h>
 #include <linux/acpi.h>
 #include <linux/dmi.h>
+#include <linux/units.h>
 
 #include <acpi/battery.h>
 #include <acpi/video.h>
@@ -1514,9 +1514,8 @@ static ssize_t asus_hwmon_temp1(struct d
 	if (err < 0)
 		return err;
 
-	value = DECI_KELVIN_TO_CELSIUS((value & 0xFFFF)) * 1000;


* [patch 084/118] platform/x86: intel_menlow: switch to use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (82 preceding siblings ...)
  2020-01-31  6:15 ` [patch 083/118] platform/x86: asus-wmi: " Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 085/118] thermal: int340x: " Andrew Morton
                   ` (33 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: platform/x86: intel_menlow: switch to use <linux/units.h> helpers

This switches the intel_menlow driver to use deci_kelvin_to_celsius() and
celsius_to_deci_kelvin() in <linux/units.h> instead of helpers in
<linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius conversion
helpers in <linux/units.h>.

This also removes a trailing space, while we're at it.

Link: http://lkml.kernel.org/r/1576386975-7941-5-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/platform/x86/intel_menlow.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- a/drivers/platform/x86/intel_menlow.c~platform-x86-intel_menlow-switch-to-use-linux-unitsh-helpers
+++ a/drivers/platform/x86/intel_menlow.c
@@ -22,6 +22,7 @@
 #include <linux/slab.h>
 #include <linux/thermal.h>
 #include <linux/types.h>
+#include <linux/units.h>
 
 MODULE_AUTHOR("Thomas Sujith");
 MODULE_AUTHOR("Zhang Rui");
@@ -302,8 +303,10 @@ static ssize_t aux_show(struct device *d
 	int result;
 
 	result = sensor_get_auxtrip(attr->handle, idx, &value);
+	if (result)
+		return result;
 
-	return result ? result : sprintf(buf, "%lu", DECI_KELVIN_TO_CELSIUS(value));
+	return sprintf(buf, "%lu", deci_kelvin_to_celsius(value));
 }
 
 static ssize_t aux0_show(struct device *dev,
@@ -332,8 +335,8 @@ static ssize_t aux_store(struct device *
 	if (value < 0)
 		return -EINVAL;
 
-	result = sensor_set_auxtrip(attr->handle, idx, 
-				    CELSIUS_TO_DECI_KELVIN(value));
+	result = sensor_set_auxtrip(attr->handle, idx,
+				    celsius_to_deci_kelvin(value));
 	return result ? result : count;
 }
 
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 085/118] thermal: int340x: switch to use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (83 preceding siblings ...)
  2020-01-31  6:15 ` [patch 084/118] platform/x86: intel_menlow: " Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 086/118] thermal: intel_pch: " Andrew Morton
                   ` (32 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: thermal: int340x: switch to use <linux/units.h> helpers

This switches the int340x thermal zone driver to use
deci_kelvin_to_millicelsius() and millicelsius_to_deci_kelvin() in
<linux/units.h> instead of helpers in <linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius conversion
helpers in <linux/units.h>.
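
The conversion itself is just the 2732 dK offset scaled to millidegrees;
schematically (same arithmetic as the old <linux/thermal.h> macros, not
necessarily the exact <linux/units.h> implementation):

    deci_kelvin_to_millicelsius(t)  /* (t - 2732) * 100 */
    millicelsius_to_deci_kelvin(t)  /* (t / 100) + 2732 */

    /* e.g. 3032 dK -> 30000 mC, and 30000 mC -> 3032 dK */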

Link: http://lkml.kernel.org/r/1576386975-7941-6-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c~thermal-int340x-switch-to-use-linux-unitsh-helpers
+++ a/drivers/thermal/intel/int340x_thermal/int340x_thermal_zone.c
@@ -8,6 +8,7 @@
 #include <linux/init.h>
 #include <linux/acpi.h>
 #include <linux/thermal.h>
+#include <linux/units.h>
 #include "int340x_thermal_zone.h"
 
 static int int340x_thermal_get_zone_temp(struct thermal_zone_device *zone,
@@ -34,7 +35,7 @@ static int int340x_thermal_get_zone_temp
 		*temp = (unsigned long)conv_temp * 10;
 	} else
 		/* _TMP returns the temperature in tenths of degrees Kelvin */
-		*temp = DECI_KELVIN_TO_MILLICELSIUS(tmp);
+		*temp = deci_kelvin_to_millicelsius(tmp);
 
 	return 0;
 }
@@ -116,7 +117,7 @@ static int int340x_thermal_set_trip_temp
 
 	snprintf(name, sizeof(name), "PAT%d", trip);
 	status = acpi_execute_simple_method(d->adev->handle, name,
-			MILLICELSIUS_TO_DECI_KELVIN(temp));
+			millicelsius_to_deci_kelvin(temp));
 	if (ACPI_FAILURE(status))
 		return -EIO;
 
@@ -163,7 +164,7 @@ static int int340x_thermal_get_trip_conf
 	if (ACPI_FAILURE(status))
 		return -EIO;
 
-	*temp = DECI_KELVIN_TO_MILLICELSIUS(r);
+	*temp = deci_kelvin_to_millicelsius(r);
 
 	return 0;
 }
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 086/118] thermal: intel_pch: switch to use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (84 preceding siblings ...)
  2020-01-31  6:15 ` [patch 085/118] thermal: int340x: " Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 087/118] nvme: hwmon: " Andrew Morton
                   ` (31 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: thermal: intel_pch: switch to use <linux/units.h> helpers

This switches the Intel PCH thermal driver to use
deci_kelvin_to_millicelsius() in <linux/units.h> instead of helpers in
<linux/thermal.h>.

This is preparation for centralizing the kelvin to/from Celsius conversion
helpers in <linux/units.h>.

Link: http://lkml.kernel.org/r/1576386975-7941-7-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/thermal/intel/intel_pch_thermal.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/drivers/thermal/intel/intel_pch_thermal.c~thermal-intel_pch-switch-to-use-linux-unitsh-helpers
+++ a/drivers/thermal/intel/intel_pch_thermal.c
@@ -13,6 +13,7 @@
 #include <linux/pci.h>
 #include <linux/acpi.h>
 #include <linux/thermal.h>
+#include <linux/units.h>
 #include <linux/pm.h>
 
 /* Intel PCH thermal Device IDs */
@@ -93,7 +94,7 @@ static void pch_wpt_add_acpi_psv_trip(st
 		if (ACPI_SUCCESS(status)) {
 			unsigned long trip_temp;
 
-			trip_temp = DECI_KELVIN_TO_MILLICELSIUS(r);
+			trip_temp = deci_kelvin_to_millicelsius(r);
 			if (trip_temp) {
 				ptd->psv_temp = trip_temp;
 				ptd->psv_trip_id = *nr_trips;
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 087/118] nvme: hwmon: switch to use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (85 preceding siblings ...)
  2020-01-31  6:15 ` [patch 086/118] thermal: intel_pch: " Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:15 ` [patch 088/118] thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h> Andrew Morton
                   ` (30 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: nvme: hwmon: switch to use <linux/units.h> helpers

This switches the nvme driver to use kelvin_to_millicelsius() and
millicelsius_to_kelvin() in <linux/units.h>.
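
The driver-local macros removed below already had the same semantics as
the new helpers; a worked example of the round trip (illustrative only):

    kelvin_to_millicelsius(343);    /* 343 * 1000L - 273150 == 69850 mC */

    /* millicelsius_to_kelvin() uses DIV_ROUND_CLOSEST(t + 273150, 1000),
     * i.e. it rounds to the nearest whole kelvin rather than truncating */
    millicelsius_to_kelvin(69850);  /* == 343 K */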

Link: http://lkml.kernel.org/r/1576386975-7941-8-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/nvme/host/hwmon.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

--- a/drivers/nvme/host/hwmon.c~nvme-hwmon-switch-to-use-linux-unitsh-helpers
+++ a/drivers/nvme/host/hwmon.c
@@ -5,14 +5,11 @@
  */
 
 #include <linux/hwmon.h>
+#include <linux/units.h>
 #include <asm/unaligned.h>
 
 #include "nvme.h"
 
-/* These macros should be moved to linux/temperature.h */
-#define MILLICELSIUS_TO_KELVIN(t) DIV_ROUND_CLOSEST((t) + 273150, 1000)
-#define KELVIN_TO_MILLICELSIUS(t) ((t) * 1000L - 273150)
-
 struct nvme_hwmon_data {
 	struct nvme_ctrl *ctrl;
 	struct nvme_smart_log log;
@@ -35,7 +32,7 @@ static int nvme_get_temp_thresh(struct n
 		return -EIO;
 	if (ret < 0)
 		return ret;
-	*temp = KELVIN_TO_MILLICELSIUS(status & NVME_TEMP_THRESH_MASK);
+	*temp = kelvin_to_millicelsius(status & NVME_TEMP_THRESH_MASK);
 
 	return 0;
 }
@@ -46,7 +43,7 @@ static int nvme_set_temp_thresh(struct n
 	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
 	int ret;
 
-	temp = MILLICELSIUS_TO_KELVIN(temp);
+	temp = millicelsius_to_kelvin(temp);
 	threshold |= clamp_val(temp, 0, NVME_TEMP_THRESH_MASK);
 
 	if (under)
@@ -88,7 +85,7 @@ static int nvme_hwmon_read(struct device
 	case hwmon_temp_min:
 		return nvme_get_temp_thresh(data->ctrl, channel, true, val);
 	case hwmon_temp_crit:
-		*val = KELVIN_TO_MILLICELSIUS(data->ctrl->cctemp);
+		*val = kelvin_to_millicelsius(data->ctrl->cctemp);
 		return 0;
 	default:
 		break;
@@ -105,7 +102,7 @@ static int nvme_hwmon_read(struct device
 			temp = get_unaligned_le16(log->temperature);
 		else
 			temp = le16_to_cpu(log->temp_sensor[channel - 1]);
-		*val = KELVIN_TO_MILLICELSIUS(temp);
+		*val = kelvin_to_millicelsius(temp);
 		break;
 	case hwmon_temp_alarm:
 		*val = !!(log->critical_warning & NVME_SMART_CRIT_TEMPERATURE);
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 088/118] thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h>
  2020-01-31  6:10 incoming Andrew Morton
                   ` (86 preceding siblings ...)
  2020-01-31  6:15 ` [patch 087/118] nvme: hwmon: " Andrew Morton
@ 2020-01-31  6:15 ` Andrew Morton
  2020-01-31  6:16 ` [patch 089/118] iwlegacy: use <linux/units.h> helpers Andrew Morton
                   ` (29 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:15 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h>

This removes the kelvin to/from Celsius conversion helper macros in
<linux/thermal.h>, which have been replaced by the inline helper functions
in <linux/units.h>.
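
For reviewers comparing semantics: the DECI_KELVIN_TO_CELSIUS() macro
below rounds half away from zero in both directions, e.g.:

    DECI_KELVIN_TO_CELSIUS(2737);   /* (5 + 5) / 10   ->  1 */
    DECI_KELVIN_TO_CELSIUS(2727);   /* (-5 - 5) / 10  -> -1 */

The <linux/units.h> replacements also round to nearest, though against
the exact -273150 mC absolute-zero offset rather than the 2732 dK
approximation used here.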

Link: http://lkml.kernel.org/r/1576386975-7941-9-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/thermal.h |   11 -----------
 1 file changed, 11 deletions(-)

--- a/include/linux/thermal.h~thermal-remove-kelvin-to-from-celsius-conversion-helpers-from-linux-thermalh
+++ a/include/linux/thermal.h
@@ -32,17 +32,6 @@
 /* use value, which < 0K, to indicate an invalid/uninitialized temperature */
 #define THERMAL_TEMP_INVALID	-274000
 
-/* Unit conversion macros */
-#define DECI_KELVIN_TO_CELSIUS(t)	({			\
-	long _t = (t);						\
-	((_t-2732 >= 0) ? (_t-2732+5)/10 : (_t-2732-5)/10);	\
-})
-#define CELSIUS_TO_DECI_KELVIN(t)	((t)*10+2732)
-#define DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(t, off) (((t) - (off)) * 100)
-#define DECI_KELVIN_TO_MILLICELSIUS(t) DECI_KELVIN_TO_MILLICELSIUS_WITH_OFFSET(t, 2732)
-#define MILLICELSIUS_TO_DECI_KELVIN_WITH_OFFSET(t, off) (((t) / 100) + (off))
-#define MILLICELSIUS_TO_DECI_KELVIN(t) MILLICELSIUS_TO_DECI_KELVIN_WITH_OFFSET(t, 2732)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 089/118] iwlegacy: use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (87 preceding siblings ...)
  2020-01-31  6:15 ` [patch 088/118] thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h> Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 090/118] iwlwifi: " Andrew Morton
                   ` (28 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, kbusch, knaack.h, kvalo, lars, linux-mm, linux,
	luciano.coelho, mm-commits, pmeerw, rui.zhang, sagi, sfr,
	sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: iwlegacy: use <linux/units.h> helpers

This switches the iwlegacy driver to use celsius_to_kelvin() and
kelvin_to_celsius() in <linux/units.h>.
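
The build-warning fixup folded in below stems from the type change: the
old driver-local macros were plain int arithmetic, while the
<linux/units.h> helpers take and return long, so callers printing the
result need %ld instead of %d.  An illustrative sketch:

    long c = kelvin_to_celsius(temp);

    printk("%dC\n", c);   /* -Wformat warning: 'int' expected, 'long' given */
    printk("%ldC\n", c);  /* OK - matches the D_TEMP changes below */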

[akinobu.mita@gmail.com: fix build warnings with format string]
  Link: http://lkml.kernel.org/r/1579014483-9226-1-git-send-email-akinobu.mita@gmail.com
  Link: https://lore.kernel.org/r/20200106171452.201c3b4c@canb.auug.org.au
Link: http://lkml.kernel.org/r/1576386975-7941-10-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/net/wireless/intel/iwlegacy/4965-mac.c |    3 +-
 drivers/net/wireless/intel/iwlegacy/4965.c     |   17 +++++++--------
 drivers/net/wireless/intel/iwlegacy/common.h   |    3 --
 3 files changed, 11 insertions(+), 12 deletions(-)

--- a/drivers/net/wireless/intel/iwlegacy/4965.c~iwlegacy-use-linux-unitsh-helpers
+++ a/drivers/net/wireless/intel/iwlegacy/4965.c
@@ -17,6 +17,7 @@
 #include <linux/sched.h>
 #include <linux/skbuff.h>
 #include <linux/netdevice.h>
+#include <linux/units.h>
 #include <net/mac80211.h>
 #include <linux/etherdevice.h>
 #include <asm/unaligned.h>
@@ -1104,7 +1105,7 @@ il4965_fill_txpower_tbl(struct il_priv *
 	/* get current temperature (Celsius) */
 	current_temp = max(il->temperature, IL_TX_POWER_TEMPERATURE_MIN);
 	current_temp = min(il->temperature, IL_TX_POWER_TEMPERATURE_MAX);
-	current_temp = KELVIN_TO_CELSIUS(current_temp);
+	current_temp = kelvin_to_celsius(current_temp);
 
 	/* select thermal txpower adjustment params, based on channel group
 	 *   (same frequency group used for mimo txatten adjustment) */
@@ -1610,8 +1611,8 @@ il4965_hw_get_temperature(struct il_priv
 	temperature =
 	    (temperature * 97) / 100 + TEMPERATURE_CALIB_KELVIN_OFFSET;
 
-	D_TEMP("Calibrated temperature: %dK, %dC\n", temperature,
-	       KELVIN_TO_CELSIUS(temperature));
+	D_TEMP("Calibrated temperature: %dK, %ldC\n", temperature,
+	       kelvin_to_celsius(temperature));
 
 	return temperature;
 }
@@ -1670,12 +1671,12 @@ il4965_temperature_calib(struct il_priv
 
 	if (il->temperature != temp) {
 		if (il->temperature)
-			D_TEMP("Temperature changed " "from %dC to %dC\n",
-			       KELVIN_TO_CELSIUS(il->temperature),
-			       KELVIN_TO_CELSIUS(temp));
+			D_TEMP("Temperature changed " "from %ldC to %ldC\n",
+			       kelvin_to_celsius(il->temperature),
+			       kelvin_to_celsius(temp));
 		else
-			D_TEMP("Temperature " "initialized to %dC\n",
-			       KELVIN_TO_CELSIUS(temp));
+			D_TEMP("Temperature " "initialized to %ldC\n",
+			       kelvin_to_celsius(temp));
 	}
 
 	il->temperature = temp;
--- a/drivers/net/wireless/intel/iwlegacy/4965-mac.c~iwlegacy-use-linux-unitsh-helpers
+++ a/drivers/net/wireless/intel/iwlegacy/4965-mac.c
@@ -27,6 +27,7 @@
 #include <linux/firmware.h>
 #include <linux/etherdevice.h>
 #include <linux/if_arp.h>
+#include <linux/units.h>
 
 #include <net/mac80211.h>
 
@@ -6468,7 +6469,7 @@ il4965_set_hw_params(struct il_priv *il)
 	il->hw_params.valid_rx_ant = il->cfg->valid_rx_ant;
 
 	il->hw_params.ct_kill_threshold =
-	   CELSIUS_TO_KELVIN(CT_KILL_THRESHOLD_LEGACY);
+	   celsius_to_kelvin(CT_KILL_THRESHOLD_LEGACY);
 
 	il->hw_params.sens = &il4965_sensitivity;
 	il->hw_params.beacon_time_tsf_bits = IL4965_EXT_BEACON_TIME_POS;
--- a/drivers/net/wireless/intel/iwlegacy/common.h~iwlegacy-use-linux-unitsh-helpers
+++ a/drivers/net/wireless/intel/iwlegacy/common.h
@@ -779,9 +779,6 @@ struct il_sensitivity_ranges {
 	u16 nrg_th_cca;
 };
 
-#define KELVIN_TO_CELSIUS(x) ((x)-273)
-#define CELSIUS_TO_KELVIN(x) ((x)+273)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 090/118] iwlwifi: use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (88 preceding siblings ...)
  2020-01-31  6:16 ` [patch 089/118] iwlegacy: use <linux/units.h> helpers Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 091/118] thermal: armada: remove unused TO_MCELSIUS macro Andrew Morton
                   ` (27 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: iwlwifi: use <linux/units.h> helpers

This switches the iwlwifi driver to use celsius_to_kelvin() and
kelvin_to_celsius() in <linux/units.h>.

Link: http://lkml.kernel.org/r/1576386975-7941-11-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Luca Coelho <luciano.coelho@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/net/wireless/intel/iwlwifi/dvm/dev.h     |    5 -----
 drivers/net/wireless/intel/iwlwifi/dvm/devices.c |    6 ++++--
 2 files changed, 4 insertions(+), 7 deletions(-)

--- a/drivers/net/wireless/intel/iwlwifi/dvm/dev.h~iwlwifi-use-linux-unitsh-helpers
+++ a/drivers/net/wireless/intel/iwlwifi/dvm/dev.h
@@ -237,11 +237,6 @@ struct iwl_sensitivity_ranges {
 	u16 nrg_th_cca;
 };
 
-
-#define KELVIN_TO_CELSIUS(x) ((x)-273)
-#define CELSIUS_TO_KELVIN(x) ((x)+273)
-
-
 /******************************************************************************
  *
  * Functions implemented in core module which are forward declared here
--- a/drivers/net/wireless/intel/iwlwifi/dvm/devices.c~iwlwifi-use-linux-unitsh-helpers
+++ a/drivers/net/wireless/intel/iwlwifi/dvm/devices.c
@@ -10,6 +10,8 @@
  *
  *****************************************************************************/
 
+#include <linux/units.h>
+
 /*
  * DVM device-specific data & functions
  */
@@ -345,7 +347,7 @@ static s32 iwl_temp_calib_to_offset(stru
 static void iwl5150_set_ct_threshold(struct iwl_priv *priv)
 {
 	const s32 volt2temp_coef = IWL_5150_VOLTAGE_TO_TEMPERATURE_COEFF;
-	s32 threshold = (s32)CELSIUS_TO_KELVIN(CT_KILL_THRESHOLD_LEGACY) -
+	s32 threshold = (s32)celsius_to_kelvin(CT_KILL_THRESHOLD_LEGACY) -
 			iwl_temp_calib_to_offset(priv);
 
 	priv->hw_params.ct_kill_threshold = threshold * volt2temp_coef;
@@ -381,7 +383,7 @@ static void iwl5150_temperature(struct i
 	vt = le32_to_cpu(priv->statistics.common.temperature);
 	vt = vt / IWL_5150_VOLTAGE_TO_TEMPERATURE_COEFF + offset;
 	/* now vt hold the temperature in Kelvin */
-	priv->temperature = KELVIN_TO_CELSIUS(vt);
+	priv->temperature = kelvin_to_celsius(vt);
 	iwl_tt_handler(priv);
 }
 
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 091/118] thermal: armada: remove unused TO_MCELSIUS macro
  2020-01-31  6:10 incoming Andrew Morton
                   ` (89 preceding siblings ...)
  2020-01-31  6:16 ` [patch 090/118] iwlwifi: " Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 092/118] iio: adc: qcom-vadc-common: use <linux/units.h> helpers Andrew Morton
                   ` (26 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: thermal: armada: remove unused TO_MCELSIUS macro

This removes the unused TO_MCELSIUS() macro.

Link: http://lkml.kernel.org/r/1576386975-7941-12-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/thermal/armada_thermal.c |    2 --
 1 file changed, 2 deletions(-)

--- a/drivers/thermal/armada_thermal.c~thermal-armada-remove-unused-to_mcelsius-macro
+++ a/drivers/thermal/armada_thermal.c
@@ -21,8 +21,6 @@
 
 #include "thermal_core.h"
 
-#define TO_MCELSIUS(c)			((c) * 1000)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 092/118] iio: adc: qcom-vadc-common: use <linux/units.h> helpers
  2020-01-31  6:10 incoming Andrew Morton
                   ` (90 preceding siblings ...)
  2020-01-31  6:16 ` [patch 091/118] thermal: armada: remove unused TO_MCELSIUS macro Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 093/118] lib/zlib: add s390 hardware support for kernel zlib_deflate Andrew Morton
                   ` (25 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akinobu.mita, akpm, amit.kucheria, andy.shevchenko, andy, axboe,
	daniel.lezcano, dvhart, emmanuel.grumbach, hch, jdelvare, jic23,
	johannes.berg, Jonathan.Cameron, kbusch, knaack.h, kvalo, lars,
	linux-mm, linux, luciano.coelho, mm-commits, pmeerw, rui.zhang,
	sagi, sgruszka, sujith.thomas, torvalds

From: Akinobu Mita <akinobu.mita@gmail.com>
Subject: iio: adc: qcom-vadc-common: use <linux/units.h> helpers

This switches the qcom-vadc-common code to use
milli_kelvin_to_millicelsius() in <linux/units.h>.
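
The helper is a drop-in replacement for the open-coded subtraction of
KELVINMIL_CELSIUSMIL (273150) that the hunks below remove;
schematically:

    /* before */ voltage -= KELVINMIL_CELSIUSMIL; *result_mdec = voltage;
    /* after  */ *result_mdec = milli_kelvin_to_millicelsius(voltage);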

Link: http://lkml.kernel.org/r/1576386975-7941-13-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/iio/adc/qcom-vadc-common.c |    6 +++---
 drivers/iio/adc/qcom-vadc-common.h |    1 -
 2 files changed, 3 insertions(+), 4 deletions(-)

--- a/drivers/iio/adc/qcom-vadc-common.c~iio-adc-qcom-vadc-common-use-linux-unitsh-helpers
+++ a/drivers/iio/adc/qcom-vadc-common.c
@@ -6,6 +6,7 @@
 #include <linux/log2.h>
 #include <linux/err.h>
 #include <linux/module.h>
+#include <linux/units.h>
 
 #include "qcom-vadc-common.h"
 
@@ -236,8 +237,7 @@ static int qcom_vadc_scale_die_temp(cons
 		voltage = 0;
 	}
 
-	voltage -= KELVINMIL_CELSIUSMIL;
-	*result_mdec = voltage;
+	*result_mdec = milli_kelvin_to_millicelsius(voltage);
 
 	return 0;
 }
@@ -325,7 +325,7 @@ static int qcom_vadc_scale_hw_calib_die_
 {
 	*result_mdec = qcom_vadc_scale_code_voltage_factor(adc_code,
 				prescale, data, 2);
-	*result_mdec -= KELVINMIL_CELSIUSMIL;
+	*result_mdec = milli_kelvin_to_millicelsius(*result_mdec);
 
 	return 0;
 }
--- a/drivers/iio/adc/qcom-vadc-common.h~iio-adc-qcom-vadc-common-use-linux-unitsh-helpers
+++ a/drivers/iio/adc/qcom-vadc-common.h
@@ -38,7 +38,6 @@
 #define VADC_AVG_SAMPLES_MAX			512
 #define ADC5_AVG_SAMPLES_MAX			16
 
-#define KELVINMIL_CELSIUSMIL			273150
 #define PMIC5_CHG_TEMP_SCALE_FACTOR		377500
 #define PMIC5_SMB_TEMP_CONSTANT			419400
 #define PMIC5_SMB_TEMP_SCALE_FACTOR		356
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 093/118] lib/zlib: add s390 hardware support for kernel zlib_deflate
  2020-01-31  6:10 incoming Andrew Morton
                   ` (91 preceding siblings ...)
  2020-01-31  6:16 ` [patch 092/118] iio: adc: qcom-vadc-common: use <linux/units.h> helpers Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 094/118] s390/boot: rename HEAP_SIZE due to name collision Andrew Morton
                   ` (24 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, borntraeger, clm, dsterba, edward6, gor, heiko.carstens,
	iii, josef, linux-mm, mm-commits, rpurdie, torvalds, zaslonko

From: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Subject: lib/zlib: add s390 hardware support for kernel zlib_deflate

Patch series "S390 hardware support for kernel zlib", v3.

With the IBM z15 mainframe the new DFLTCC instruction is available.  It
implements the deflate algorithm in hardware (Nest Acceleration Unit -
NXU), with estimated compression and decompression performance orders of
magnitude faster than the current software zlib.

This patchset adds s390 hardware compression support to kernel zlib.  The
code is based on the userspace zlib implementation:

	https://github.com/madler/zlib/pull/410

The coding style is also preserved for future maintainability.  Only a
limited set of userspace zlib functions is represented in the kernel.
Apart from that, all memory allocation must be performed in advance.
Thus, the workarea structures are extended with the parameter lists
required for the DEFLATE CONVERSION CALL instruction.

Since kernel zlib itself does not support gzip headers, only the Adler-32
checksum is processed (it can also be produced by the DFLTCC facility).
As in the userspace implementation, kernel zlib will compress in hardware
on level 1, and in software on all other levels.  Decompression will
always happen in hardware (when enabled).
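
A note on how the level-1-only policy is implemented: the tuning state
carries a bitmask of compression levels eligible for hardware deflate,
and dfltcc_can_deflate() (in the dfltcc_deflate.c hunk below) consults
it before anything else.  A simplified sketch; the real check in
dfltcc_util.h also validates the window bits and strategy:

    #define DFLTCC_LEVEL_MASK 0x2   /* assumed value: only bit 1 (level 1) */

    static int dfltcc_are_params_ok(int level, uInt level_mask)
    {
        return (level_mask & (1 << level)) != 0;
    }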

Two DFLTCC compression calls produce the same results only when they both
are made on machines of the same generation, and when the respective
buffers have the same offset relative to the start of the page.  Therefore
care should be taken when using hardware compression where reproducible
results are desired.  However, the output always conforms to the DEFLATE
standard and can be decompressed by any inflate implementation.

The new kernel command line parameter 'dfltcc' is introduced to configure
s390 zlib hardware support:

    Format: { on | off | def_only | inf_only | always }
     on:       s390 zlib hardware support for compression on
               level 1 and decompression (default)
     off:      No s390 zlib hardware support
     def_only: s390 zlib hardware support for deflate
               only (compression on level 1)
     inf_only: s390 zlib hardware support for inflate
               only (decompression)
     always:   Same as 'on' but ignores the selected compression
               level always using hardware support (used for debugging)
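
For example, to use hardware deflate while keeping inflate in software,
one would boot with:

    dfltcc=def_only

on the kernel command line.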

The main purpose of the integration of the NXU support into the kernel
zlib is the use of hardware deflate in the btrfs filesystem with on-the-fly
compression enabled.  Apart from that, hardware support can also be used
during boot for decompressing the kernel or the ramdisk image.

With the patch for btrfs expanding zlib buffer from 1 to 4 pages (patch 6)
the following performance results have been achieved using the ramdisk
with btrfs.  These are relative numbers based on throughput rate and
compression ratio for zlib level 1:

  Input data              Deflate rate   Inflate rate   Compression ratio
                          NXU/Software   NXU/Software   NXU/Software
  stream of zeroes        1.46           1.02           1.00
  random ASCII data       10.44          3.00           0.96
  ASCII text (dickens)    6.21           3.33           0.94
  binary data (vmlinux)   8.37           3.90           1.02

This means that s390 hardware deflate can provide up to 10 times faster
compression (on level 1) and up to 4 times faster decompression (refers to
all compression levels) for btrfs zlib.

Disclaimer: Performance results are based on IBM internal tests using DD
command-line utility on btrfs on a Fedora 30 based internal driver in
native LPAR on a z15 system.  Results may vary based on individual
workload, configuration and software levels.


This patch (of 9):

Create zlib_dfltcc library with the s390 DEFLATE CONVERSION CALL
implementation and related compression functions.  Update zlib_deflate
functions with the hooks for s390 hardware support and adjust workspace
structures with extra parameter lists required for hardware deflate.
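
Callers of kernel zlib need no changes for any of this; with
CONFIG_ZLIB_DFLTCC=y and compression level 1 the usual call sequence
transparently runs on the NXU.  A minimal sketch of such a caller
(cf. the btrfs user touched later in this series; error handling
omitted):

    z_stream strm;

    strm.workspace = vmalloc(zlib_deflate_workspacesize(MAX_WBITS,
                                                        MAX_MEM_LEVEL));
    zlib_deflateInit(&strm, 1);   /* level 1: eligible for hardware */
    strm.next_in  = in;   strm.avail_in  = in_len;
    strm.next_out = out;  strm.avail_out = out_len;
    zlib_deflate(&strm, Z_FINISH);
    zlib_deflateEnd(&strm);
    vfree(strm.workspace);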

Link: http://lkml.kernel.org/r/20200103223334.20669-2-zaslonko@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Co-developed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig                      |    7 
 lib/Makefile                     |    1 
 lib/zlib_deflate/deflate.c       |   79 ++++----
 lib/zlib_deflate/deftree.c       |   54 -----
 lib/zlib_deflate/defutil.h       |  134 +++++++++++++-
 lib/zlib_dfltcc/Makefile         |   11 +
 lib/zlib_dfltcc/dfltcc.c         |   52 +++++
 lib/zlib_dfltcc/dfltcc.h         |  115 ++++++++++++
 lib/zlib_dfltcc/dfltcc_deflate.c |  274 +++++++++++++++++++++++++++++
 lib/zlib_dfltcc/dfltcc_syms.c    |   17 +
 lib/zlib_dfltcc/dfltcc_util.h    |  110 +++++++++++
 11 files changed, 752 insertions(+), 102 deletions(-)

--- a/lib/Kconfig~lib-zlib-add-s390-hardware-support-for-kernel-zlib_deflate
+++ a/lib/Kconfig
@@ -278,6 +278,13 @@ config ZLIB_DEFLATE
 	tristate
 	select BITREVERSE
 
+config ZLIB_DFLTCC
+	def_bool y
+	depends on S390
+	prompt "Enable s390x DEFLATE CONVERSION CALL support for kernel zlib"
+	help
+	 Enable s390x hardware support for zlib in the kernel.
+
 config LZO_COMPRESS
 	tristate
 
--- a/lib/Makefile~lib-zlib-add-s390-hardware-support-for-kernel-zlib_deflate
+++ a/lib/Makefile
@@ -140,6 +140,7 @@ obj-$(CONFIG_842_COMPRESS) += 842/
 obj-$(CONFIG_842_DECOMPRESS) += 842/
 obj-$(CONFIG_ZLIB_INFLATE) += zlib_inflate/
 obj-$(CONFIG_ZLIB_DEFLATE) += zlib_deflate/
+obj-$(CONFIG_ZLIB_DFLTCC) += zlib_dfltcc/
 obj-$(CONFIG_REED_SOLOMON) += reed_solomon/
 obj-$(CONFIG_BCH) += bch.o
 obj-$(CONFIG_LZO_COMPRESS) += lzo/
--- a/lib/zlib_deflate/deflate.c~lib-zlib-add-s390-hardware-support-for-kernel-zlib_deflate
+++ a/lib/zlib_deflate/deflate.c
@@ -52,16 +52,18 @@
 #include <linux/zutil.h>
 #include "defutil.h"
 
+/* architecture-specific bits */
+#ifdef CONFIG_ZLIB_DFLTCC
+#  include "../zlib_dfltcc/dfltcc.h"
+#else
+#define DEFLATE_RESET_HOOK(strm) do {} while (0)
+#define DEFLATE_HOOK(strm, flush, bstate) 0
+#define DEFLATE_NEED_CHECKSUM(strm) 1
+#endif
 
 /* ===========================================================================
  *  Function prototypes.
  */
-typedef enum {
-    need_more,      /* block not completed, need more input or more output */
-    block_done,     /* block flush performed */
-    finish_started, /* finish started, need only more output at next deflate */
-    finish_done     /* finish done, accept no more input or output */
-} block_state;
 
 typedef block_state (*compress_func) (deflate_state *s, int flush);
 /* Compression function. Returns the block state after the call. */
@@ -72,7 +74,6 @@ static block_state deflate_fast   (defla
 static block_state deflate_slow   (deflate_state *s, int flush);
 static void lm_init        (deflate_state *s);
 static void putShortMSB    (deflate_state *s, uInt b);
-static void flush_pending  (z_streamp strm);
 static int read_buf        (z_streamp strm, Byte *buf, unsigned size);
 static uInt longest_match  (deflate_state *s, IPos cur_match);
 
@@ -98,6 +99,25 @@ static  void check_match (deflate_state
  * See deflate.c for comments about the MIN_MATCH+1.
  */
 
+/* Workspace to be allocated for deflate processing */
+typedef struct deflate_workspace {
+    /* State memory for the deflator */
+    deflate_state deflate_memory;
+#ifdef CONFIG_ZLIB_DFLTCC
+    /* State memory for s390 hardware deflate */
+    struct dfltcc_state dfltcc_memory;
+#endif
+    Byte *window_memory;
+    Pos *prev_memory;
+    Pos *head_memory;
+    char *overlay_memory;
+} deflate_workspace;
+
+#ifdef CONFIG_ZLIB_DFLTCC
+/* dfltcc_state must be doubleword aligned for DFLTCC call */
+static_assert(offsetof(struct deflate_workspace, dfltcc_memory) % 8 == 0);
+#endif
+
 /* Values for max_lazy_match, good_match and max_chain_length, depending on
  * the desired pack level (0..9). The values given below have been tuned to
  * exclude worst case performance for pathological files. Better values may be
@@ -207,7 +227,15 @@ int zlib_deflateInit2(
      */
     next = (char *) mem;
     next += sizeof(*mem);
+#ifdef CONFIG_ZLIB_DFLTCC
+    /*
+     *  DFLTCC requires the window to be page aligned.
+     *  Thus, we overallocate and take the aligned portion of the buffer.
+     */
+    mem->window_memory = (Byte *) PTR_ALIGN(next, PAGE_SIZE);
+#else
     mem->window_memory = (Byte *) next;
+#endif
     next += zlib_deflate_window_memsize(windowBits);
     mem->prev_memory = (Pos *) next;
     next += zlib_deflate_prev_memsize(windowBits);
@@ -277,6 +305,8 @@ int zlib_deflateReset(
     zlib_tr_init(s);
     lm_init(s);
 
+    DEFLATE_RESET_HOOK(strm);
+
     return Z_OK;
 }
 
@@ -294,35 +324,6 @@ static void putShortMSB(
     put_byte(s, (Byte)(b & 0xff));
 }   
 
-/* =========================================================================
- * Flush as much pending output as possible. All deflate() output goes
- * through this function so some applications may wish to modify it
- * to avoid allocating a large strm->next_out buffer and copying into it.
- * (See also read_buf()).
- */
-static void flush_pending(
-	z_streamp strm
-)
-{
-    deflate_state *s = (deflate_state *) strm->state;
-    unsigned len = s->pending;
-
-    if (len > strm->avail_out) len = strm->avail_out;
-    if (len == 0) return;
-
-    if (strm->next_out != NULL) {
-	memcpy(strm->next_out, s->pending_out, len);
-	strm->next_out += len;
-    }
-    s->pending_out += len;
-    strm->total_out += len;
-    strm->avail_out  -= len;
-    s->pending -= len;
-    if (s->pending == 0) {
-        s->pending_out = s->pending_buf;
-    }
-}
-
 /* ========================================================================= */
 int zlib_deflate(
 	z_streamp strm,
@@ -404,7 +405,8 @@ int zlib_deflate(
         (flush != Z_NO_FLUSH && s->status != FINISH_STATE)) {
         block_state bstate;
 
-	bstate = (*(configuration_table[s->level].func))(s, flush);
+	bstate = DEFLATE_HOOK(strm, flush, &bstate) ? bstate :
+		 (*(configuration_table[s->level].func))(s, flush);
 
         if (bstate == finish_started || bstate == finish_done) {
             s->status = FINISH_STATE;
@@ -503,7 +505,8 @@ static int read_buf(
 
     strm->avail_in  -= len;
 
-    if (!((deflate_state *)(strm->state))->noheader) {
+    if (!DEFLATE_NEED_CHECKSUM(strm)) {}
+    else if (!((deflate_state *)(strm->state))->noheader) {
         strm->adler = zlib_adler32(strm->adler, strm->next_in, len);
     }
     memcpy(buf, strm->next_in, len);
--- a/lib/zlib_deflate/deftree.c~lib-zlib-add-s390-hardware-support-for-kernel-zlib_deflate
+++ a/lib/zlib_deflate/deftree.c
@@ -76,11 +76,6 @@ static const uch bl_order[BL_CODES]
  * probability, to avoid transmitting the lengths for unused bit length codes.
  */
 
-#define Buf_size (8 * 2*sizeof(char))
-/* Number of bits used within bi_buf. (bi_buf might be implemented on
- * more than 16 bits on some systems.)
- */
-
 /* ===========================================================================
  * Local data. These are initialized only once.
  */
@@ -147,7 +142,6 @@ static void send_all_trees (deflate_stat
 static void compress_block (deflate_state *s, ct_data *ltree,
                            ct_data *dtree);
 static void set_data_type  (deflate_state *s);
-static void bi_windup      (deflate_state *s);
 static void bi_flush       (deflate_state *s);
 static void copy_block     (deflate_state *s, char *buf, unsigned len,
                            int header);
@@ -170,54 +164,6 @@ static void copy_block     (deflate_stat
  */
 
 /* ===========================================================================
- * Send a value on a given number of bits.
- * IN assertion: length <= 16 and value fits in length bits.
- */
-#ifdef DEBUG_ZLIB
-static void send_bits      (deflate_state *s, int value, int length);
-
-static void send_bits(
-	deflate_state *s,
-	int value,  /* value to send */
-	int length  /* number of bits */
-)
-{
-    Tracevv((stderr," l %2d v %4x ", length, value));
-    Assert(length > 0 && length <= 15, "invalid length");
-    s->bits_sent += (ulg)length;
-
-    /* If not enough room in bi_buf, use (valid) bits from bi_buf and
-     * (16 - bi_valid) bits from value, leaving (width - (16-bi_valid))
-     * unused bits in value.
-     */
-    if (s->bi_valid > (int)Buf_size - length) {
-        s->bi_buf |= (value << s->bi_valid);
-        put_short(s, s->bi_buf);
-        s->bi_buf = (ush)value >> (Buf_size - s->bi_valid);
-        s->bi_valid += length - Buf_size;
-    } else {
-        s->bi_buf |= value << s->bi_valid;
-        s->bi_valid += length;
-    }
-}
-#else /* !DEBUG_ZLIB */
-
-#define send_bits(s, value, length) \
-{ int len = length;\
-  if (s->bi_valid > (int)Buf_size - len) {\
-    int val = value;\
-    s->bi_buf |= (val << s->bi_valid);\
-    put_short(s, s->bi_buf);\
-    s->bi_buf = (ush)val >> (Buf_size - s->bi_valid);\
-    s->bi_valid += len - Buf_size;\
-  } else {\
-    s->bi_buf |= (value) << s->bi_valid;\
-    s->bi_valid += len;\
-  }\
-}
-#endif /* DEBUG_ZLIB */
-
-/* ===========================================================================
  * Initialize the various 'constant' tables. In a multi-threaded environment,
  * this function may be called by two threads concurrently, but this is
  * harmless since both invocations do exactly the same thing.
--- a/lib/zlib_deflate/defutil.h~lib-zlib-add-s390-hardware-support-for-kernel-zlib_deflate
+++ a/lib/zlib_deflate/defutil.h
@@ -1,5 +1,7 @@
+#ifndef DEFUTIL_H
+#define DEFUTIL_H
 
-
+#include <linux/zutil.h>
 
 #define Assert(err, str) 
 #define Trace(dummy) 
@@ -238,17 +240,13 @@ typedef struct deflate_state {
 
 } deflate_state;
 
-typedef struct deflate_workspace {
-    /* State memory for the deflator */
-    deflate_state deflate_memory;
-    Byte *window_memory;
-    Pos *prev_memory;
-    Pos *head_memory;
-    char *overlay_memory;
-} deflate_workspace;
-
+#ifdef CONFIG_ZLIB_DFLTCC
+#define zlib_deflate_window_memsize(windowBits) \
+	(2 * (1 << (windowBits)) * sizeof(Byte) + PAGE_SIZE)
+#else
 #define zlib_deflate_window_memsize(windowBits) \
 	(2 * (1 << (windowBits)) * sizeof(Byte))
+#endif
 #define zlib_deflate_prev_memsize(windowBits) \
 	((1 << (windowBits)) * sizeof(Pos))
 #define zlib_deflate_head_memsize(memLevel) \
@@ -293,6 +291,24 @@ void zlib_tr_stored_type_only (deflate_s
 }
 
 /* ===========================================================================
+ * Reverse the first len bits of a code, using straightforward code (a faster
+ * method would use a table)
+ * IN assertion: 1 <= len <= 15
+ */
+static inline unsigned  bi_reverse(
+    unsigned code, /* the value to invert */
+    int len        /* its bit length */
+)
+{
+    register unsigned res = 0;
+    do {
+        res |= code & 1;
+        code >>= 1, res <<= 1;
+    } while (--len > 0);
+    return res >> 1;
+}
+
+/* ===========================================================================
  * Flush the bit buffer, keeping at most 7 bits in it.
  */
 static inline void bi_flush(deflate_state *s)
@@ -325,3 +341,101 @@ static inline void bi_windup(deflate_sta
 #endif
 }
 
+typedef enum {
+    need_more,      /* block not completed, need more input or more output */
+    block_done,     /* block flush performed */
+    finish_started, /* finish started, need only more output at next deflate */
+    finish_done     /* finish done, accept no more input or output */
+} block_state;
+
+#define Buf_size (8 * 2*sizeof(char))
+/* Number of bits used within bi_buf. (bi_buf might be implemented on
+ * more than 16 bits on some systems.)
+ */
+
+/* ===========================================================================
+ * Send a value on a given number of bits.
+ * IN assertion: length <= 16 and value fits in length bits.
+ */
+#ifdef DEBUG_ZLIB
+static void send_bits      (deflate_state *s, int value, int length);
+
+static void send_bits(
+    deflate_state *s,
+    int value,  /* value to send */
+    int length  /* number of bits */
+)
+{
+    Tracevv((stderr," l %2d v %4x ", length, value));
+    Assert(length > 0 && length <= 15, "invalid length");
+    s->bits_sent += (ulg)length;
+
+    /* If not enough room in bi_buf, use (valid) bits from bi_buf and
+     * (16 - bi_valid) bits from value, leaving (width - (16-bi_valid))
+     * unused bits in value.
+     */
+    if (s->bi_valid > (int)Buf_size - length) {
+        s->bi_buf |= (value << s->bi_valid);
+        put_short(s, s->bi_buf);
+        s->bi_buf = (ush)value >> (Buf_size - s->bi_valid);
+        s->bi_valid += length - Buf_size;
+    } else {
+        s->bi_buf |= value << s->bi_valid;
+        s->bi_valid += length;
+    }
+}
+#else /* !DEBUG_ZLIB */
+
+#define send_bits(s, value, length) \
+{ int len = length;\
+  if (s->bi_valid > (int)Buf_size - len) {\
+    int val = value;\
+    s->bi_buf |= (val << s->bi_valid);\
+    put_short(s, s->bi_buf);\
+    s->bi_buf = (ush)val >> (Buf_size - s->bi_valid);\
+    s->bi_valid += len - Buf_size;\
+  } else {\
+    s->bi_buf |= (value) << s->bi_valid;\
+    s->bi_valid += len;\
+  }\
+}
+#endif /* DEBUG_ZLIB */
+
+static inline void zlib_tr_send_bits(
+    deflate_state *s,
+    int value,
+    int length
+)
+{
+    send_bits(s, value, length);
+}
+
+/* =========================================================================
+ * Flush as much pending output as possible. All deflate() output goes
+ * through this function so some applications may wish to modify it
+ * to avoid allocating a large strm->next_out buffer and copying into it.
+ * (See also read_buf()).
+ */
+static inline void flush_pending(
+	z_streamp strm
+)
+{
+    deflate_state *s = (deflate_state *) strm->state;
+    unsigned len = s->pending;
+
+    if (len > strm->avail_out) len = strm->avail_out;
+    if (len == 0) return;
+
+    if (strm->next_out != NULL) {
+	memcpy(strm->next_out, s->pending_out, len);
+	strm->next_out += len;
+    }
+    s->pending_out += len;
+    strm->total_out += len;
+    strm->avail_out  -= len;
+    s->pending -= len;
+    if (s->pending == 0) {
+        s->pending_out = s->pending_buf;
+    }
+}
+#endif /* DEFUTIL_H */
--- /dev/null
+++ a/lib/zlib_dfltcc/dfltcc.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: Zlib
+/* dfltcc.c - SystemZ DEFLATE CONVERSION CALL support. */
+
+#include <linux/zutil.h>
+#include "dfltcc_util.h"
+#include "dfltcc.h"
+
+char *oesc_msg(
+    char *buf,
+    int oesc
+)
+{
+    if (oesc == 0x00)
+        return NULL; /* Successful completion */
+    else {
+#ifdef STATIC
+        return NULL; /* Ignore for pre-boot decompressor */
+#else
+        sprintf(buf, "Operation-Ending-Supplemental Code is 0x%.2X", oesc);
+        return buf;
+#endif
+    }
+}
+
+void dfltcc_reset(
+    z_streamp strm,
+    uInt size
+)
+{
+    struct dfltcc_state *dfltcc_state =
+        (struct dfltcc_state *)((char *)strm->state + size);
+    struct dfltcc_qaf_param *param =
+        (struct dfltcc_qaf_param *)&dfltcc_state->param;
+
+    /* Initialize available functions */
+    if (is_dfltcc_enabled()) {
+        dfltcc(DFLTCC_QAF, param, NULL, NULL, NULL, NULL, NULL);
+        memmove(&dfltcc_state->af, param, sizeof(dfltcc_state->af));
+    } else
+        memset(&dfltcc_state->af, 0, sizeof(dfltcc_state->af));
+
+    /* Initialize parameter block */
+    memset(&dfltcc_state->param, 0, sizeof(dfltcc_state->param));
+    dfltcc_state->param.nt = 1;
+
+    /* Initialize tuning parameters */
+    dfltcc_state->level_mask = DFLTCC_LEVEL_MASK;
+    dfltcc_state->block_size = DFLTCC_BLOCK_SIZE;
+    dfltcc_state->block_threshold = DFLTCC_FIRST_FHT_BLOCK_SIZE;
+    dfltcc_state->dht_threshold = DFLTCC_DHT_MIN_SAMPLE_SIZE;
+    dfltcc_state->param.ribm = DFLTCC_RIBM;
+}
--- /dev/null
+++ a/lib/zlib_dfltcc/dfltcc_deflate.c
@@ -0,0 +1,274 @@
+// SPDX-License-Identifier: Zlib
+
+#include "../zlib_deflate/defutil.h"
+#include "dfltcc_util.h"
+#include "dfltcc.h"
+#include <linux/zutil.h>
+
+/*
+ * Compress.
+ */
+int dfltcc_can_deflate(
+    z_streamp strm
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+
+    /* Unsupported compression settings */
+    if (!dfltcc_are_params_ok(state->level, state->w_bits, state->strategy,
+                              dfltcc_state->level_mask))
+        return 0;
+
+    /* Unsupported hardware */
+    if (!is_bit_set(dfltcc_state->af.fns, DFLTCC_GDHT) ||
+            !is_bit_set(dfltcc_state->af.fns, DFLTCC_CMPR) ||
+            !is_bit_set(dfltcc_state->af.fmts, DFLTCC_FMT0))
+        return 0;
+
+    return 1;
+}
+
+static void dfltcc_gdht(
+    z_streamp strm
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+    struct dfltcc_param_v0 *param = &GET_DFLTCC_STATE(state)->param;
+    size_t avail_in = strm->avail_in;
+
+    dfltcc(DFLTCC_GDHT,
+           param, NULL, NULL,
+           &strm->next_in, &avail_in, NULL);
+}
+
+static dfltcc_cc dfltcc_cmpr(
+    z_streamp strm
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+    struct dfltcc_param_v0 *param = &GET_DFLTCC_STATE(state)->param;
+    size_t avail_in = strm->avail_in;
+    size_t avail_out = strm->avail_out;
+    dfltcc_cc cc;
+
+    cc = dfltcc(DFLTCC_CMPR | HBT_CIRCULAR,
+                param, &strm->next_out, &avail_out,
+                &strm->next_in, &avail_in, state->window);
+    strm->total_in += (strm->avail_in - avail_in);
+    strm->total_out += (strm->avail_out - avail_out);
+    strm->avail_in = avail_in;
+    strm->avail_out = avail_out;
+    return cc;
+}
+
+static void send_eobs(
+    z_streamp strm,
+    const struct dfltcc_param_v0 *param
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+
+    zlib_tr_send_bits(
+          state,
+          bi_reverse(param->eobs >> (15 - param->eobl), param->eobl),
+          param->eobl);
+    flush_pending(strm);
+    if (state->pending != 0) {
+        /* The remaining data is located in pending_out[0:pending]. If someone
+         * calls put_byte() - this might happen in deflate() - the byte will be
+         * placed into pending_buf[pending], which is incorrect. Move the
+         * remaining data to the beginning of pending_buf so that put_byte() is
+         * usable again.
+         */
+        memmove(state->pending_buf, state->pending_out, state->pending);
+        state->pending_out = state->pending_buf;
+    }
+#ifdef ZLIB_DEBUG
+    state->compressed_len += param->eobl;
+#endif
+}
+
+int dfltcc_deflate(
+    z_streamp strm,
+    int flush,
+    block_state *result
+)
+{
+    deflate_state *state = (deflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+    struct dfltcc_param_v0 *param = &dfltcc_state->param;
+    uInt masked_avail_in;
+    dfltcc_cc cc;
+    int need_empty_block;
+    int soft_bcc;
+    int no_flush;
+
+    if (!dfltcc_can_deflate(strm))
+        return 0;
+
+again:
+    masked_avail_in = 0;
+    soft_bcc = 0;
+    no_flush = flush == Z_NO_FLUSH;
+
+    /* Trailing empty block. Switch to software, except when Continuation Flag
+     * is set, which means that DFLTCC has buffered some output in the
+     * parameter block and needs to be called again in order to flush it.
+     */
+    if (flush == Z_FINISH && strm->avail_in == 0 && !param->cf) {
+        if (param->bcf) {
+            /* A block is still open, and the hardware does not support closing
+             * blocks without adding data. Thus, close it manually.
+             */
+            send_eobs(strm, param);
+            param->bcf = 0;
+        }
+        return 0;
+    }
+
+    if (strm->avail_in == 0 && !param->cf) {
+        *result = need_more;
+        return 1;
+    }
+
+    /* There is an open non-BFINAL block which we are not going to close just
+     * yet, we have compressed more than DFLTCC_BLOCK_SIZE bytes, and we see
+     * more than DFLTCC_DHT_MIN_SAMPLE_SIZE bytes of input. Open a new block
+     * with a new DHT in order to adapt to a possibly changed input data
+     * distribution.
+     */
+    if (param->bcf && no_flush &&
+            strm->total_in > dfltcc_state->block_threshold &&
+            strm->avail_in >= dfltcc_state->dht_threshold) {
+        if (param->cf) {
+            /* We need to flush the DFLTCC buffer before writing the
+             * End-of-block Symbol. Mask the input data and proceed as usual.
+             */
+            masked_avail_in += strm->avail_in;
+            strm->avail_in = 0;
+            no_flush = 0;
+        } else {
+            /* DFLTCC buffer is empty, so we can manually write the
+             * End-of-block Symbol right away.
+             */
+            send_eobs(strm, param);
+            param->bcf = 0;
+            dfltcc_state->block_threshold =
+                strm->total_in + dfltcc_state->block_size;
+            if (strm->avail_out == 0) {
+                *result = need_more;
+                return 1;
+            }
+        }
+    }
+
+    /* The caller gave us too much data. Pass only one block worth of
+     * uncompressed data to DFLTCC and mask the rest, so that on the next
+     * iteration we start a new block.
+     */
+    if (no_flush && strm->avail_in > dfltcc_state->block_size) {
+        masked_avail_in += (strm->avail_in - dfltcc_state->block_size);
+        strm->avail_in = dfltcc_state->block_size;
+    }
+
+    /* When we have an open non-BFINAL deflate block and caller indicates that
+     * the stream is ending, we need to close an open deflate block and open a
+     * BFINAL one.
+     */
+    need_empty_block = flush == Z_FINISH && param->bcf && !param->bhf;
+
+    /* Translate stream to parameter block */
+    param->cvt = CVT_ADLER32;
+    if (!no_flush)
+        /* We need to close a block. Always do this in software - when there is
+         * no input data, the hardware will not honor BCC. */
+        soft_bcc = 1;
+    if (flush == Z_FINISH && !param->bcf)
+        /* We are about to open a BFINAL block, set Block Header Final bit
+         * until the stream ends.
+         */
+        param->bhf = 1;
+    /* DFLTCC-CMPR will write to next_out, so make sure that buffers with
+     * higher precedence are empty.
+     */
+    Assert(state->pending == 0, "There must be no pending bytes");
+    Assert(state->bi_valid < 8, "There must be less than 8 pending bits");
+    param->sbb = (unsigned int)state->bi_valid;
+    if (param->sbb > 0)
+        *strm->next_out = (Byte)state->bi_buf;
+    if (param->hl)
+        param->nt = 0; /* Honor history */
+    param->cv = strm->adler;
+
+    /* When opening a block, choose a Huffman-Table Type */
+    if (!param->bcf) {
+        if (strm->total_in == 0 && dfltcc_state->block_threshold > 0) {
+            param->htt = HTT_FIXED;
+        }
+        else {
+            param->htt = HTT_DYNAMIC;
+            dfltcc_gdht(strm);
+        }
+    }
+
+    /* Deflate */
+    do {
+        cc = dfltcc_cmpr(strm);
+        if (strm->avail_in < 4096 && masked_avail_in > 0)
+            /* We are about to call DFLTCC with a small input buffer, which is
+             * inefficient. Since there is masked data, there will be at least
+             * one more DFLTCC call, so skip the current one and make the next
+             * one handle more data.
+             */
+            break;
+    } while (cc == DFLTCC_CC_AGAIN);
+
+    /* Translate parameter block to stream */
+    strm->msg = oesc_msg(dfltcc_state->msg, param->oesc);
+    state->bi_valid = param->sbb;
+    if (state->bi_valid == 0)
+        state->bi_buf = 0; /* Avoid accessing next_out */
+    else
+        state->bi_buf = *strm->next_out & ((1 << state->bi_valid) - 1);
+    strm->adler = param->cv;
+
+    /* Unmask the input data */
+    strm->avail_in += masked_avail_in;
+    masked_avail_in = 0;
+
+    /* If we encounter an error, it means there is a bug in the DFLTCC call */
+    Assert(cc != DFLTCC_CC_OP2_CORRUPT || param->oesc == 0, "BUG");
+
+    /* Update Block-Continuation Flag. It will be used to check whether to call
+     * GDHT the next time.
+     */
+    if (cc == DFLTCC_CC_OK) {
+        if (soft_bcc) {
+            send_eobs(strm, param);
+            param->bcf = 0;
+            dfltcc_state->block_threshold =
+                strm->total_in + dfltcc_state->block_size;
+        } else
+            param->bcf = 1;
+        if (flush == Z_FINISH) {
+            if (need_empty_block)
+                /* Make the current deflate() call also close the stream */
+                return 0;
+            else {
+                bi_windup(state);
+                *result = finish_done;
+            }
+        } else {
+            if (flush == Z_FULL_FLUSH)
+                param->hl = 0; /* Clear history */
+            *result = flush == Z_NO_FLUSH ? need_more : block_done;
+        }
+    } else {
+        param->bcf = 1;
+        *result = need_more;
+    }
+    if (strm->avail_in != 0 && strm->avail_out != 0)
+        goto again; /* deflate() must use all input or all output */
+    return 1;
+}
+
--- /dev/null
+++ a/lib/zlib_dfltcc/dfltcc.h
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: Zlib
+#ifndef DFLTCC_H
+#define DFLTCC_H
+
+#include "../zlib_deflate/defutil.h"
+
+/*
+ * Tuning parameters.
+ */
+#define DFLTCC_LEVEL_MASK 0x2 /* DFLTCC compression for level 1 only */
+#define DFLTCC_BLOCK_SIZE 1048576
+#define DFLTCC_FIRST_FHT_BLOCK_SIZE 4096
+#define DFLTCC_DHT_MIN_SAMPLE_SIZE 4096
+#define DFLTCC_RIBM 0
+
+/*
+ * Parameter Block for Query Available Functions.
+ */
+struct dfltcc_qaf_param {
+    char fns[16];
+    char reserved1[8];
+    char fmts[2];
+    char reserved2[6];
+};
+
+static_assert(sizeof(struct dfltcc_qaf_param) == 32);
+
+#define DFLTCC_FMT0 0
+
+/*
+ * Parameter Block for Generate Dynamic-Huffman Table, Compress and Expand.
+ */
+struct dfltcc_param_v0 {
+    uint16_t pbvn;                     /* Parameter-Block-Version Number */
+    uint8_t mvn;                       /* Model-Version Number */
+    uint8_t ribm;                      /* Reserved for IBM use */
+    unsigned reserved32 : 31;
+    unsigned cf : 1;                   /* Continuation Flag */
+    uint8_t reserved64[8];
+    unsigned nt : 1;                   /* New Task */
+    unsigned reserved129 : 1;
+    unsigned cvt : 1;                  /* Check Value Type */
+    unsigned reserved131 : 1;
+    unsigned htt : 1;                  /* Huffman-Table Type */
+    unsigned bcf : 1;                  /* Block-Continuation Flag */
+    unsigned bcc : 1;                  /* Block Closing Control */
+    unsigned bhf : 1;                  /* Block Header Final */
+    unsigned reserved136 : 1;
+    unsigned reserved137 : 1;
+    unsigned dhtgc : 1;                /* DHT Generation Control */
+    unsigned reserved139 : 5;
+    unsigned reserved144 : 5;
+    unsigned sbb : 3;                  /* Sub-Byte Boundary */
+    uint8_t oesc;                      /* Operation-Ending-Supplemental Code */
+    unsigned reserved160 : 12;
+    unsigned ifs : 4;                  /* Incomplete-Function Status */
+    uint16_t ifl;                      /* Incomplete-Function Length */
+    uint8_t reserved192[8];
+    uint8_t reserved256[8];
+    uint8_t reserved320[4];
+    uint16_t hl;                       /* History Length */
+    unsigned reserved368 : 1;
+    uint16_t ho : 15;                  /* History Offset */
+    uint32_t cv;                       /* Check Value */
+    unsigned eobs : 15;                /* End-of-block Symbol */
+    unsigned reserved431: 1;
+    uint8_t eobl : 4;                  /* End-of-block Length */
+    unsigned reserved436 : 12;
+    unsigned reserved448 : 4;
+    uint16_t cdhtl : 12;               /* Compressed-Dynamic-Huffman Table
+                                          Length */
+    uint8_t reserved464[6];
+    uint8_t cdht[288];
+    uint8_t reserved[32];
+    uint8_t csb[1152];
+};
+
+static_assert(sizeof(struct dfltcc_param_v0) == 1536);
+
+#define CVT_CRC32 0
+#define CVT_ADLER32 1
+#define HTT_FIXED 0
+#define HTT_DYNAMIC 1
+
+/*
+ *  Extension of inflate_state and deflate_state for DFLTCC.
+ */
+struct dfltcc_state {
+    struct dfltcc_param_v0 param;      /* Parameter block */
+    struct dfltcc_qaf_param af;        /* Available functions */
+    uLong level_mask;                  /* Levels on which to use DFLTCC */
+    uLong block_size;                  /* New block each X bytes */
+    uLong block_threshold;             /* New block after total_in > X */
+    uLong dht_threshold;               /* New block only if avail_in >= X */
+    char msg[64];                      /* Buffer for strm->msg */
+};
+
+/* Resides right after inflate_state or deflate_state */
+#define GET_DFLTCC_STATE(state) ((struct dfltcc_state *)((state) + 1))
+
+/* External functions */
+int dfltcc_can_deflate(z_streamp strm);
+int dfltcc_deflate(z_streamp strm,
+                   int flush,
+                   block_state *result);
+void dfltcc_reset(z_streamp strm, uInt size);
+
+#define DEFLATE_RESET_HOOK(strm) \
+    dfltcc_reset((strm), sizeof(deflate_state))
+
+#define DEFLATE_HOOK dfltcc_deflate
+
+#define DEFLATE_NEED_CHECKSUM(strm) (!dfltcc_can_deflate((strm)))
+
+#endif /* DFLTCC_H */
--- /dev/null
+++ a/lib/zlib_dfltcc/dfltcc_syms.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * linux/lib/zlib_dfltcc/dfltcc_syms.c
+ *
+ * Exported symbols for the s390 zlib dfltcc support.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/zlib.h>
+#include "dfltcc.h"
+
+EXPORT_SYMBOL(dfltcc_can_deflate);
+EXPORT_SYMBOL(dfltcc_deflate);
+EXPORT_SYMBOL(dfltcc_reset);
+MODULE_LICENSE("GPL");
--- /dev/null
+++ a/lib/zlib_dfltcc/dfltcc_util.h
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: Zlib
+#ifndef DFLTCC_UTIL_H
+#define DFLTCC_UTIL_H
+
+#include <linux/zutil.h>
+#include <asm/facility.h>
+
+/*
+ * C wrapper for the DEFLATE CONVERSION CALL instruction.
+ */
+typedef enum {
+    DFLTCC_CC_OK = 0,
+    DFLTCC_CC_OP1_TOO_SHORT = 1,
+    DFLTCC_CC_OP2_TOO_SHORT = 2,
+    DFLTCC_CC_OP2_CORRUPT = 2,
+    DFLTCC_CC_AGAIN = 3,
+} dfltcc_cc;
+
+#define DFLTCC_QAF 0
+#define DFLTCC_GDHT 1
+#define DFLTCC_CMPR 2
+#define DFLTCC_XPND 4
+#define HBT_CIRCULAR (1 << 7)
+#define HB_BITS 15
+#define HB_SIZE (1 << HB_BITS)
+#define DFLTCC_FACILITY 151
+
+static inline dfltcc_cc dfltcc(
+    int fn,
+    void *param,
+    Byte **op1,
+    size_t *len1,
+    const Byte **op2,
+    size_t *len2,
+    void *hist
+)
+{
+    Byte *t2 = op1 ? *op1 : NULL;
+    size_t t3 = len1 ? *len1 : 0;
+    const Byte *t4 = op2 ? *op2 : NULL;
+    size_t t5 = len2 ? *len2 : 0;
+    register int r0 __asm__("r0") = fn;
+    register void *r1 __asm__("r1") = param;
+    register Byte *r2 __asm__("r2") = t2;
+    register size_t r3 __asm__("r3") = t3;
+    register const Byte *r4 __asm__("r4") = t4;
+    register size_t r5 __asm__("r5") = t5;
+    int cc;
+
+    __asm__ volatile(
+                     ".insn rrf,0xb9390000,%[r2],%[r4],%[hist],0\n"
+                     "ipm %[cc]\n"
+                     : [r2] "+r" (r2)
+                     , [r3] "+r" (r3)
+                     , [r4] "+r" (r4)
+                     , [r5] "+r" (r5)
+                     , [cc] "=r" (cc)
+                     : [r0] "r" (r0)
+                     , [r1] "r" (r1)
+                     , [hist] "r" (hist)
+                     : "cc", "memory");
+    t2 = r2; t3 = r3; t4 = r4; t5 = r5;
+
+    if (op1)
+        *op1 = t2;
+    if (len1)
+        *len1 = t3;
+    if (op2)
+        *op2 = t4;
+    if (len2)
+        *len2 = t5;
+    return (cc >> 28) & 3;
+}
+
+static inline int is_bit_set(
+    const char *bits,
+    int n
+)
+{
+    return bits[n / 8] & (1 << (7 - (n % 8)));
+}
+
+static inline void turn_bit_off(
+    char *bits,
+    int n
+)
+{
+    bits[n / 8] &= ~(1 << (7 - (n % 8)));
+}
+
+static inline int dfltcc_are_params_ok(
+    int level,
+    uInt window_bits,
+    int strategy,
+    uLong level_mask
+)
+{
+    return (level_mask & (1 << level)) != 0 &&
+        (window_bits == HB_BITS) &&
+        (strategy == Z_DEFAULT_STRATEGY);
+}
+
+static inline int is_dfltcc_enabled(void)
+{
+    return test_facility(DFLTCC_FACILITY);
+}
+
+char *oesc_msg(char *buf, int oesc);
+
+#endif /* DFLTCC_UTIL_H */
--- /dev/null
+++ a/lib/zlib_dfltcc/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# This is a modified version of zlib, which does all memory
+# allocation ahead of time.
+#
+# This is the code for s390 zlib hardware support.
+#
+
+obj-$(CONFIG_ZLIB_DFLTCC) += zlib_dfltcc.o
+
+zlib_dfltcc-objs := dfltcc.o dfltcc_deflate.o dfltcc_syms.o
_


* [patch 094/118] s390/boot: rename HEAP_SIZE due to name collision
  2020-01-31  6:10 incoming Andrew Morton
                   ` (92 preceding siblings ...)
  2020-01-31  6:16 ` [patch 093/118] lib/zlib: add s390 hardware support for kernel zlib_deflate Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 095/118] lib/zlib: add s390 hardware support for kernel zlib_inflate Andrew Morton
                   ` (23 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, borntraeger, linux-mm, mm-commits, torvalds, zaslonko

From: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Subject: s390/boot: rename HEAP_SIZE due to name collision

Change the conflicting macro name in preparation for zlib_inflate hardware
support.
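
For illustration, the clash looks roughly like this (a sketch, assuming the
conflicting definition is zlib's own Huffman-heap HEAP_SIZE reached through
lib/zlib_deflate/defutil.h once the boot decompressor starts including the
zlib_dfltcc headers; that include chain is an assumption, not part of this
patch):

	/* arch/s390/boot/compressed/decompressor.c (before the rename) */
	#define HEAP_SIZE	0x400000		/* decompressor boot heap */

	/* lib/zlib_dfltcc/dfltcc.h -> ../zlib_deflate/defutil.h */
	#define HEAP_SIZE	(2 * L_CODES + 1)	/* zlib Huffman heap */

	/* -> macro redefinition; hence BOOT_HEAP_SIZE on the boot side */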

Link: http://lkml.kernel.org/r/20200103223334.20669-3-zaslonko@linux.ibm.com
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/boot/compressed/decompressor.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/arch/s390/boot/compressed/decompressor.c~s390-boot-rename-heap_size-due-to-name-collision
+++ a/arch/s390/boot/compressed/decompressor.c
@@ -30,13 +30,13 @@ extern unsigned char _compressed_start[]
 extern unsigned char _compressed_end[];
 
 #ifdef CONFIG_HAVE_KERNEL_BZIP2
-#define HEAP_SIZE	0x400000
+#define BOOT_HEAP_SIZE	0x400000
 #else
-#define HEAP_SIZE	0x10000
+#define BOOT_HEAP_SIZE	0x10000
 #endif
 
 static unsigned long free_mem_ptr = (unsigned long) _end;
-static unsigned long free_mem_end_ptr = (unsigned long) _end + HEAP_SIZE;
+static unsigned long free_mem_end_ptr = (unsigned long) _end + BOOT_HEAP_SIZE;
 
 #ifdef CONFIG_KERNEL_GZIP
 #include "../../../../lib/decompress_inflate.c"
@@ -62,7 +62,7 @@ static unsigned long free_mem_end_ptr =
 #include "../../../../lib/decompress_unxz.c"
 #endif
 
-#define decompress_offset ALIGN((unsigned long)_end + HEAP_SIZE, PAGE_SIZE)
+#define decompress_offset ALIGN((unsigned long)_end + BOOT_HEAP_SIZE, PAGE_SIZE)
 
 unsigned long mem_safe_offset(void)
 {
_


* [patch 095/118] lib/zlib: add s390 hardware support for kernel zlib_inflate
  2020-01-31  6:10 incoming Andrew Morton
                   ` (93 preceding siblings ...)
  2020-01-31  6:16 ` [patch 094/118] s390/boot: rename HEAP_SIZE due to name collision Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 096/118] s390/boot: add dfltcc= kernel command line parameter Andrew Morton
                   ` (22 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, borntraeger, clm, dsterba, edward6, gor, heiko.carstens,
	iii, josef, linux-mm, mm-commits, rpurdie, torvalds, zaslonko

From: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Subject: lib/zlib: add s390 hardware support for kernel zlib_inflate

Add decompression functions to the zlib_dfltcc library.  Update the
zlib_inflate functions with hooks for s390 hardware support and adjust the
workspace structures with the extra parameter list required for hardware
inflate decompression.
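
The hook pattern itself is worth spelling out: the generic inflate code
compiles against no-op defaults, and an accelerated backend overrides them
at build time.  A standalone toy of the same shape (the names here are
hypothetical, not the kernel macros; the real hooks appear in the hunks
below):

	#include <stdio.h>

	#ifdef USE_HW_BACKEND
	#  define TYPEDO_HOOK(st)  printf("hw consumed block %d\n", (st))
	#  define NEED_CHECKSUM()  0	/* hardware computed it already */
	#else
	#  define TYPEDO_HOOK(st)  do {} while (0)	/* software: no-op */
	#  define NEED_CHECKSUM()  1
	#endif

	int main(void)
	{
		TYPEDO_HOOK(42);
		if (NEED_CHECKSUM())
			printf("software checksum path\n");
		return 0;
	}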

Link: http://lkml.kernel.org/r/20200103223334.20669-4-zaslonko@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Co-developed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/decompress_inflate.c         |   13 ++
 lib/zlib_dfltcc/Makefile         |    2 
 lib/zlib_dfltcc/dfltcc.h         |   28 +++++
 lib/zlib_dfltcc/dfltcc_inflate.c |  143 +++++++++++++++++++++++++++++
 lib/zlib_inflate/inflate.c       |   32 ++++--
 lib/zlib_inflate/inflate.h       |    8 +
 lib/zlib_inflate/infutil.h       |   18 +++
 7 files changed, 233 insertions(+), 11 deletions(-)

--- a/lib/decompress_inflate.c~lib-zlib-add-s390-hardware-support-for-kernel-zlib_inflate
+++ a/lib/decompress_inflate.c
@@ -10,6 +10,10 @@
 #include "zlib_inflate/inftrees.c"
 #include "zlib_inflate/inffast.c"
 #include "zlib_inflate/inflate.c"
+#ifdef CONFIG_ZLIB_DFLTCC
+#include "zlib_dfltcc/dfltcc.c"
+#include "zlib_dfltcc/dfltcc_inflate.c"
+#endif
 
 #else /* STATIC */
 /* initramfs et al: linked */
@@ -76,7 +80,12 @@ STATIC int INIT __gunzip(unsigned char *
 	}
 
 	strm->workspace = malloc(flush ? zlib_inflate_workspacesize() :
+#ifdef CONFIG_ZLIB_DFLTCC
+	/* Always allocate the full workspace for DFLTCC */
+				 zlib_inflate_workspacesize());
+#else
 				 sizeof(struct inflate_state));
+#endif
 	if (strm->workspace == NULL) {
 		error("Out of memory while allocating workspace");
 		goto gunzip_nomem4;
@@ -123,10 +132,14 @@ STATIC int INIT __gunzip(unsigned char *
 
 	rc = zlib_inflateInit2(strm, -MAX_WBITS);
 
+#ifdef CONFIG_ZLIB_DFLTCC
+	/* Always keep the window for DFLTCC */
+#else
 	if (!flush) {
 		WS(strm)->inflate_state.wsize = 0;
 		WS(strm)->inflate_state.window = NULL;
 	}
+#endif
 
 	while (rc == Z_OK) {
 		if (strm->avail_in == 0) {
--- a/lib/zlib_dfltcc/dfltcc.h~lib-zlib-add-s390-hardware-support-for-kernel-zlib_inflate
+++ a/lib/zlib_dfltcc/dfltcc.h
@@ -104,6 +104,14 @@ int dfltcc_deflate(z_streamp strm,
                    int flush,
                    block_state *result);
 void dfltcc_reset(z_streamp strm, uInt size);
+int dfltcc_can_inflate(z_streamp strm);
+typedef enum {
+    DFLTCC_INFLATE_CONTINUE,
+    DFLTCC_INFLATE_BREAK,
+    DFLTCC_INFLATE_SOFTWARE,
+} dfltcc_inflate_action;
+dfltcc_inflate_action dfltcc_inflate(z_streamp strm,
+                                     int flush, int *ret);
 
 #define DEFLATE_RESET_HOOK(strm) \
     dfltcc_reset((strm), sizeof(deflate_state))
@@ -112,4 +120,24 @@ void dfltcc_reset(z_streamp strm, uInt s
 
 #define DEFLATE_NEED_CHECKSUM(strm) (!dfltcc_can_deflate((strm)))
 
+#define INFLATE_RESET_HOOK(strm) \
+    dfltcc_reset((strm), sizeof(struct inflate_state))
+
+#define INFLATE_TYPEDO_HOOK(strm, flush) \
+    if (dfltcc_can_inflate((strm))) { \
+        dfltcc_inflate_action action; \
+\
+        RESTORE(); \
+        action = dfltcc_inflate((strm), (flush), &ret); \
+        LOAD(); \
+        if (action == DFLTCC_INFLATE_CONTINUE) \
+            break; \
+        else if (action == DFLTCC_INFLATE_BREAK) \
+            goto inf_leave; \
+    }
+
+#define INFLATE_NEED_CHECKSUM(strm) (!dfltcc_can_inflate((strm)))
+
+#define INFLATE_NEED_UPDATEWINDOW(strm) (!dfltcc_can_inflate((strm)))
+
 #endif /* DFLTCC_H */
--- /dev/null
+++ a/lib/zlib_dfltcc/dfltcc_inflate.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: Zlib
+
+#include "../zlib_inflate/inflate.h"
+#include "dfltcc_util.h"
+#include "dfltcc.h"
+#include <linux/zutil.h>
+
+/*
+ * Expand.
+ */
+int dfltcc_can_inflate(
+    z_streamp strm
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+
+    /* Unsupported compression settings */
+    if (state->wbits != HB_BITS)
+        return 0;
+
+    /* Unsupported hardware */
+    return is_bit_set(dfltcc_state->af.fns, DFLTCC_XPND) &&
+               is_bit_set(dfltcc_state->af.fmts, DFLTCC_FMT0);
+}
+
+static int dfltcc_was_inflate_used(
+    z_streamp strm
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_param_v0 *param = &GET_DFLTCC_STATE(state)->param;
+
+    return !param->nt;
+}
+
+static int dfltcc_inflate_disable(
+    z_streamp strm
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+
+    if (!dfltcc_can_inflate(strm))
+        return 0;
+    if (dfltcc_was_inflate_used(strm))
+        /* DFLTCC has already decompressed some data. Since there is not
+         * enough information to resume decompression in software, the call
+         * must fail.
+         */
+        return 1;
+    /* DFLTCC was not used yet - decompress in software */
+    memset(&dfltcc_state->af, 0, sizeof(dfltcc_state->af));
+    return 0;
+}
+
+static dfltcc_cc dfltcc_xpnd(
+    z_streamp strm
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_param_v0 *param = &GET_DFLTCC_STATE(state)->param;
+    size_t avail_in = strm->avail_in;
+    size_t avail_out = strm->avail_out;
+    dfltcc_cc cc;
+
+    cc = dfltcc(DFLTCC_XPND | HBT_CIRCULAR,
+                param, &strm->next_out, &avail_out,
+                &strm->next_in, &avail_in, state->window);
+    strm->avail_in = avail_in;
+    strm->avail_out = avail_out;
+    return cc;
+}
+
+dfltcc_inflate_action dfltcc_inflate(
+    z_streamp strm,
+    int flush,
+    int *ret
+)
+{
+    struct inflate_state *state = (struct inflate_state *)strm->state;
+    struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
+    struct dfltcc_param_v0 *param = &dfltcc_state->param;
+    dfltcc_cc cc;
+
+    if (flush == Z_BLOCK) {
+        /* DFLTCC does not support stopping on block boundaries */
+        if (dfltcc_inflate_disable(strm)) {
+            *ret = Z_STREAM_ERROR;
+            return DFLTCC_INFLATE_BREAK;
+        } else
+            return DFLTCC_INFLATE_SOFTWARE;
+    }
+
+    if (state->last) {
+        if (state->bits != 0) {
+            strm->next_in++;
+            strm->avail_in--;
+            state->bits = 0;
+        }
+        state->mode = CHECK;
+        return DFLTCC_INFLATE_CONTINUE;
+    }
+
+    if (strm->avail_in == 0 && !param->cf)
+        return DFLTCC_INFLATE_BREAK;
+
+    if (!state->window || state->wsize == 0) {
+        state->mode = MEM;
+        return DFLTCC_INFLATE_CONTINUE;
+    }
+
+    /* Translate stream to parameter block */
+    param->cvt = CVT_ADLER32;
+    param->sbb = state->bits;
+    param->hl = state->whave; /* Software and hardware history formats match */
+    param->ho = (state->write - state->whave) & ((1 << HB_BITS) - 1);
+    if (param->hl)
+        param->nt = 0; /* Honor history for the first block */
+    param->cv = state->flags ? REVERSE(state->check) : state->check;
+
+    /* Inflate */
+    do {
+        cc = dfltcc_xpnd(strm);
+    } while (cc == DFLTCC_CC_AGAIN);
+
+    /* Translate parameter block to stream */
+    strm->msg = oesc_msg(dfltcc_state->msg, param->oesc);
+    state->last = cc == DFLTCC_CC_OK;
+    state->bits = param->sbb;
+    state->whave = param->hl;
+    state->write = (param->ho + param->hl) & ((1 << HB_BITS) - 1);
+    state->check = state->flags ? REVERSE(param->cv) : param->cv;
+    if (cc == DFLTCC_CC_OP2_CORRUPT && param->oesc != 0) {
+        /* Report an error if stream is corrupted */
+        state->mode = BAD;
+        return DFLTCC_INFLATE_CONTINUE;
+    }
+    state->mode = TYPEDO;
+    /* Break if operands are exhausted, otherwise continue looping */
+    return (cc == DFLTCC_CC_OP1_TOO_SHORT || cc == DFLTCC_CC_OP2_TOO_SHORT) ?
+        DFLTCC_INFLATE_BREAK : DFLTCC_INFLATE_CONTINUE;
+}
--- a/lib/zlib_dfltcc/Makefile~lib-zlib-add-s390-hardware-support-for-kernel-zlib_inflate
+++ a/lib/zlib_dfltcc/Makefile
@@ -8,4 +8,4 @@
 
 obj-$(CONFIG_ZLIB_DFLTCC) += zlib_dfltcc.o
 
-zlib_dfltcc-objs := dfltcc.o dfltcc_deflate.o dfltcc_syms.o
+zlib_dfltcc-objs := dfltcc.o dfltcc_deflate.o dfltcc_inflate.o dfltcc_syms.o
--- a/lib/zlib_inflate/inflate.c~lib-zlib-add-s390-hardware-support-for-kernel-zlib_inflate
+++ a/lib/zlib_inflate/inflate.c
@@ -15,6 +15,16 @@
 #include "inffast.h"
 #include "infutil.h"
 
+/* architecture-specific bits */
+#ifdef CONFIG_ZLIB_DFLTCC
+#  include "../zlib_dfltcc/dfltcc.h"
+#else
+#define INFLATE_RESET_HOOK(strm) do {} while (0)
+#define INFLATE_TYPEDO_HOOK(strm, flush) do {} while (0)
+#define INFLATE_NEED_UPDATEWINDOW(strm) 1
+#define INFLATE_NEED_CHECKSUM(strm) 1
+#endif
+
 int zlib_inflate_workspacesize(void)
 {
     return sizeof(struct inflate_workspace);
@@ -42,6 +52,7 @@ int zlib_inflateReset(z_streamp strm)
     state->write = 0;
     state->whave = 0;
 
+    INFLATE_RESET_HOOK(strm);
     return Z_OK;
 }
 
@@ -66,7 +77,15 @@ int zlib_inflateInit2(z_streamp strm, in
         return Z_STREAM_ERROR;
     }
     state->wbits = (unsigned)windowBits;
+#ifdef CONFIG_ZLIB_DFLTCC
+    /*
+     * DFLTCC requires the window to be page aligned.
+     * Thus, we overallocate and take the aligned portion of the buffer.
+     */
+    state->window = PTR_ALIGN(&WS(strm)->working_window[0], PAGE_SIZE);
+#else
     state->window = &WS(strm)->working_window[0];
+#endif
 
     return zlib_inflateReset(strm);
 }
@@ -227,11 +246,6 @@ static int zlib_inflateSyncPacket(z_stre
         bits -= bits & 7; \
     } while (0)
 
-/* Reverse the bytes in a 32-bit value */
-#define REVERSE(q) \
-    ((((q) >> 24) & 0xff) + (((q) >> 8) & 0xff00) + \
-     (((q) & 0xff00) << 8) + (((q) & 0xff) << 24))
-
 /*
    inflate() uses a state machine to process as much input data and generate as
    much output data as possible before returning.  The state machine is
@@ -395,6 +409,7 @@ int zlib_inflate(z_streamp strm, int flu
             if (flush == Z_BLOCK) goto inf_leave;
 	    /* fall through */
         case TYPEDO:
+            INFLATE_TYPEDO_HOOK(strm, flush);
             if (state->last) {
                 BYTEBITS();
                 state->mode = CHECK;
@@ -692,7 +707,7 @@ int zlib_inflate(z_streamp strm, int flu
                 out -= left;
                 strm->total_out += out;
                 state->total += out;
-                if (out)
+                if (INFLATE_NEED_CHECKSUM(strm) && out)
                     strm->adler = state->check =
                         UPDATE(state->check, put - out, out);
                 out = left;
@@ -726,7 +741,8 @@ int zlib_inflate(z_streamp strm, int flu
      */
   inf_leave:
     RESTORE();
-    if (state->wsize || (state->mode < CHECK && out != strm->avail_out))
+    if (INFLATE_NEED_UPDATEWINDOW(strm) &&
+            (state->wsize || (state->mode < CHECK && out != strm->avail_out)))
         zlib_updatewindow(strm, out);
 
     in -= strm->avail_in;
@@ -734,7 +750,7 @@ int zlib_inflate(z_streamp strm, int flu
     strm->total_in += in;
     strm->total_out += out;
     state->total += out;
-    if (state->wrap && out)
+    if (INFLATE_NEED_CHECKSUM(strm) && state->wrap && out)
         strm->adler = state->check =
             UPDATE(state->check, strm->next_out - out, out);
 
--- a/lib/zlib_inflate/inflate.h~lib-zlib-add-s390-hardware-support-for-kernel-zlib_inflate
+++ a/lib/zlib_inflate/inflate.h
@@ -11,6 +11,8 @@
    subject to change. Applications should only use zlib.h.
  */
 
+#include "inftrees.h"
+
 /* Possible inflate modes between inflate() calls */
 typedef enum {
     HEAD,       /* i: waiting for magic header */
@@ -108,4 +110,10 @@ struct inflate_state {
     unsigned short work[288];   /* work area for code table building */
     code codes[ENOUGH];         /* space for code tables */
 };
+
+/* Reverse the bytes in a 32-bit value */
+#define REVERSE(q) \
+    ((((q) >> 24) & 0xff) + (((q) >> 8) & 0xff00) + \
+     (((q) & 0xff00) << 8) + (((q) & 0xff) << 24))
+
 #endif
--- a/lib/zlib_inflate/infutil.h~lib-zlib-add-s390-hardware-support-for-kernel-zlib_inflate
+++ a/lib/zlib_inflate/infutil.h
@@ -12,14 +12,28 @@
 #define _INFUTIL_H
 
 #include <linux/zlib.h>
+#ifdef CONFIG_ZLIB_DFLTCC
+#include "../zlib_dfltcc/dfltcc.h"
+#include <asm/page.h>
+#endif
 
 /* memory allocation for inflation */
 
 struct inflate_workspace {
 	struct inflate_state inflate_state;
-	unsigned char working_window[1 << MAX_WBITS];
+#ifdef CONFIG_ZLIB_DFLTCC
+	struct dfltcc_state dfltcc_state;
+	unsigned char working_window[(1 << MAX_WBITS) + PAGE_SIZE];
+#else
+	unsigned char working_window[(1 << MAX_WBITS)];
+#endif
 };
 
-#define WS(z) ((struct inflate_workspace *)(z->workspace))
+#ifdef CONFIG_ZLIB_DFLTCC
+/* dfltcc_state must be doubleword aligned for DFLTCC call */
+static_assert(offsetof(struct inflate_workspace, dfltcc_state) % 8 == 0);
+#endif
+
+#define WS(strm) ((struct inflate_workspace *)(strm->workspace))
 
 #endif
_


* [patch 096/118] s390/boot: add dfltcc= kernel command line parameter
  2020-01-31  6:10 incoming Andrew Morton
                   ` (94 preceding siblings ...)
  2020-01-31  6:16 ` [patch 095/118] lib/zlib: add s390 hardware support for kernel zlib_inflate Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 097/118] lib/zlib: add zlib_deflate_dfltcc_enabled() function Andrew Morton
                   ` (21 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, borntraeger, clm, dsterba, edward6, gor, heiko.carstens,
	iii, josef, linux-mm, mm-commits, rpurdie, torvalds, zaslonko

From: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Subject: s390/boot: add dfltcc= kernel command line parameter

Add the new kernel command line parameter 'dfltcc=' to configure s390 zlib
hardware support.

Format: { on | off | def_only | inf_only | always }
 on:       s390 zlib hardware support for compression on
           level 1 and decompression (default)
 off:      No s390 zlib hardware support
 def_only: s390 zlib hardware support for deflate
           only (compression on level 1)
 inf_only: s390 zlib hardware support for inflate
           only (decompression)
 always:   Same as 'on' but ignores the selected compression
           level always using hardware support (used for debugging)
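
For example, to enable the accelerator for compression only, boot with
dfltcc=def_only appended to the kernel command line; with the usual s390
zipl setup that would look like this (the file path and the remaining
parameters are illustrative only):

	# /etc/zipl.conf
	[defaultboot]
	parameters = "root=/dev/dasda1 dfltcc=def_only"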

Link: http://lkml.kernel.org/r/20200103223334.20669-5-zaslonko@linux.ibm.com
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/kernel-parameters.txt |   12 ++++++++++++
 arch/s390/boot/ipl_parm.c                       |   14 ++++++++++++++
 arch/s390/include/asm/setup.h                   |    7 +++++++
 arch/s390/kernel/setup.c                        |    2 ++
 lib/zlib_dfltcc/dfltcc.c                        |    5 ++++-
 lib/zlib_dfltcc/dfltcc.h                        |    1 +
 lib/zlib_dfltcc/dfltcc_deflate.c                |    6 ++++++
 lib/zlib_dfltcc/dfltcc_inflate.c                |    6 ++++++
 lib/zlib_dfltcc/dfltcc_util.h                   |    4 +++-
 9 files changed, 55 insertions(+), 2 deletions(-)

--- a/arch/s390/boot/ipl_parm.c~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/arch/s390/boot/ipl_parm.c
@@ -14,6 +14,7 @@
 char __bootdata(early_command_line)[COMMAND_LINE_SIZE];
 struct ipl_parameter_block __bootdata_preserved(ipl_block);
 int __bootdata_preserved(ipl_block_valid);
+unsigned int __bootdata_preserved(zlib_dfltcc_support) = ZLIB_DFLTCC_FULL;
 
 unsigned long __bootdata(vmalloc_size) = VMALLOC_DEFAULT_SIZE;
 unsigned long __bootdata(memory_end);
@@ -229,6 +230,19 @@ void parse_boot_command_line(void)
 		if (!strcmp(param, "vmalloc") && val)
 			vmalloc_size = round_up(memparse(val, NULL), PAGE_SIZE);
 
+		if (!strcmp(param, "dfltcc")) {
+			if (!strcmp(val, "off"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_DISABLED;
+			else if (!strcmp(val, "on"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_FULL;
+			else if (!strcmp(val, "def_only"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_DEFLATE_ONLY;
+			else if (!strcmp(val, "inf_only"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_INFLATE_ONLY;
+			else if (!strcmp(val, "always"))
+				zlib_dfltcc_support = ZLIB_DFLTCC_FULL_DEBUG;
+		}
+
 		if (!strcmp(param, "noexec")) {
 			rc = kstrtobool(val, &enabled);
 			if (!rc && !enabled)
--- a/arch/s390/include/asm/setup.h~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/arch/s390/include/asm/setup.h
@@ -79,6 +79,13 @@ struct parmarea {
 	char command_line[ARCH_COMMAND_LINE_SIZE];	/* 0x10480 */
 };
 
+extern unsigned int zlib_dfltcc_support;
+#define ZLIB_DFLTCC_DISABLED		0
+#define ZLIB_DFLTCC_FULL		1
+#define ZLIB_DFLTCC_DEFLATE_ONLY	2
+#define ZLIB_DFLTCC_INFLATE_ONLY	3
+#define ZLIB_DFLTCC_FULL_DEBUG		4
+
 extern int noexec_disabled;
 extern int memory_end_set;
 extern unsigned long memory_end;
--- a/arch/s390/kernel/setup.c~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/arch/s390/kernel/setup.c
@@ -111,6 +111,8 @@ unsigned long __bootdata_preserved(__ete
 unsigned long __bootdata_preserved(__sdma);
 unsigned long __bootdata_preserved(__edma);
 unsigned long __bootdata_preserved(__kaslr_offset);
+unsigned int __bootdata_preserved(zlib_dfltcc_support);
+EXPORT_SYMBOL(zlib_dfltcc_support);
 
 unsigned long VMALLOC_START;
 EXPORT_SYMBOL(VMALLOC_START);
--- a/Documentation/admin-guide/kernel-parameters.txt~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -834,6 +834,18 @@
 			dump out devices still on the deferred probe list after
 			retrying.
 
+	dfltcc=		[HW,S390]
+			Format: { on | off | def_only | inf_only | always }
+			on:       s390 zlib hardware support for compression on
+			          level 1 and decompression (default)
+			off:      No s390 zlib hardware support
+			def_only: s390 zlib hardware support for deflate
+			          only (compression on level 1)
+			inf_only: s390 zlib hardware support for inflate
+			          only (decompression)
+			always:   Same as 'on' but ignores the selected compression
+			          level always using hardware support (used for debugging)
+
 	dhash_entries=	[KNL]
 			Set number of hash buckets for dentry cache.
 
--- a/lib/zlib_dfltcc/dfltcc.c~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/lib/zlib_dfltcc/dfltcc.c
@@ -44,7 +44,10 @@ void dfltcc_reset(
     dfltcc_state->param.nt = 1;
 
     /* Initialize tuning parameters */
-    dfltcc_state->level_mask = DFLTCC_LEVEL_MASK;
+    if (zlib_dfltcc_support == ZLIB_DFLTCC_FULL_DEBUG)
+        dfltcc_state->level_mask = DFLTCC_LEVEL_MASK_DEBUG;
+    else
+        dfltcc_state->level_mask = DFLTCC_LEVEL_MASK;
     dfltcc_state->block_size = DFLTCC_BLOCK_SIZE;
     dfltcc_state->block_threshold = DFLTCC_FIRST_FHT_BLOCK_SIZE;
     dfltcc_state->dht_threshold = DFLTCC_DHT_MIN_SAMPLE_SIZE;
--- a/lib/zlib_dfltcc/dfltcc_deflate.c~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/lib/zlib_dfltcc/dfltcc_deflate.c
@@ -3,6 +3,7 @@
 #include "../zlib_deflate/defutil.h"
 #include "dfltcc_util.h"
 #include "dfltcc.h"
+#include <asm/setup.h>
 #include <linux/zutil.h>
 
 /*
@@ -15,6 +16,11 @@ int dfltcc_can_deflate(
     deflate_state *state = (deflate_state *)strm->state;
     struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
 
+    /* Check for kernel dfltcc command line parameter */
+    if (zlib_dfltcc_support == ZLIB_DFLTCC_DISABLED ||
+            zlib_dfltcc_support == ZLIB_DFLTCC_INFLATE_ONLY)
+        return 0;
+
     /* Unsupported compression settings */
     if (!dfltcc_are_params_ok(state->level, state->w_bits, state->strategy,
                               dfltcc_state->level_mask))
--- a/lib/zlib_dfltcc/dfltcc.h~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/lib/zlib_dfltcc/dfltcc.h
@@ -8,6 +8,7 @@
  * Tuning parameters.
  */
 #define DFLTCC_LEVEL_MASK 0x2 /* DFLTCC compression for level 1 only */
+#define DFLTCC_LEVEL_MASK_DEBUG 0x3fe /* DFLTCC compression for all levels */
 #define DFLTCC_BLOCK_SIZE 1048576
 #define DFLTCC_FIRST_FHT_BLOCK_SIZE 4096
 #define DFLTCC_DHT_MIN_SAMPLE_SIZE 4096
--- a/lib/zlib_dfltcc/dfltcc_inflate.c~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/lib/zlib_dfltcc/dfltcc_inflate.c
@@ -3,6 +3,7 @@
 #include "../zlib_inflate/inflate.h"
 #include "dfltcc_util.h"
 #include "dfltcc.h"
+#include <asm/setup.h>
 #include <linux/zutil.h>
 
 /*
@@ -15,6 +16,11 @@ int dfltcc_can_inflate(
     struct inflate_state *state = (struct inflate_state *)strm->state;
     struct dfltcc_state *dfltcc_state = GET_DFLTCC_STATE(state);
 
+    /* Check for kernel dfltcc command line parameter */
+    if (zlib_dfltcc_support == ZLIB_DFLTCC_DISABLED ||
+            zlib_dfltcc_support == ZLIB_DFLTCC_DEFLATE_ONLY)
+        return 0;
+
     /* Unsupported compression settings */
     if (state->wbits != HB_BITS)
         return 0;
--- a/lib/zlib_dfltcc/dfltcc_util.h~s390-boot-add-dfltcc=-kernel-command-line-parameter
+++ a/lib/zlib_dfltcc/dfltcc_util.h
@@ -4,6 +4,7 @@
 
 #include <linux/zutil.h>
 #include <asm/facility.h>
+#include <asm/setup.h>
 
 /*
  * C wrapper for the DEFLATE CONVERSION CALL instruction.
@@ -102,7 +103,8 @@ static inline int dfltcc_are_params_ok(
 
 static inline int is_dfltcc_enabled(void)
 {
-    return test_facility(DFLTCC_FACILITY);
+    return (zlib_dfltcc_support != ZLIB_DFLTCC_DISABLED &&
+            test_facility(DFLTCC_FACILITY));
 }
 
 char *oesc_msg(char *buf, int oesc);
_


* [patch 097/118] lib/zlib: add zlib_deflate_dfltcc_enabled() function
  2020-01-31  6:10 incoming Andrew Morton
                   ` (95 preceding siblings ...)
  2020-01-31  6:16 ` [patch 096/118] s390/boot: add dfltcc= kernel command line parameter Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 098/118] btrfs: use larger zlib buffer for s390 hardware compression Andrew Morton
                   ` (20 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, borntraeger, clm, dsterba, edward6, gor, heiko.carstens,
	iii, josef, linux-mm, mm-commits, rpurdie, torvalds, zaslonko

From: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Subject: lib/zlib: add zlib_deflate_dfltcc_enabled() function

Add a new function to zlib.h checking whether the s390 Deflate-Conversion
facility is installed and enabled.
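
A hedged sketch of an intended caller (the sizing policy below is made up
for illustration; btrfs adds a real version of it in the next patch):

	#include <linux/zlib.h>
	#include <asm/page.h>

	/* Pick a zlib buffer size depending on hardware support. */
	static unsigned int pick_zlib_buf_size(void)
	{
		if (zlib_deflate_dfltcc_enabled())
			return 4 * PAGE_SIZE;	/* feed DFLTCC larger chunks */
		return PAGE_SIZE;		/* plain software zlib */
	}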

Link: http://lkml.kernel.org/r/20200103223334.20669-6-zaslonko@linux.ibm.com
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/zlib.h            |    6 ++++++
 lib/zlib_deflate/deflate.c      |    6 ++++++
 lib/zlib_deflate/deflate_syms.c |    1 +
 lib/zlib_dfltcc/dfltcc.h        |   11 +++++++++++
 lib/zlib_dfltcc/dfltcc_util.h   |    9 ---------
 5 files changed, 24 insertions(+), 9 deletions(-)

--- a/include/linux/zlib.h~lib-zlib-add-zlib_deflate_dfltcc_enabled-function
+++ a/include/linux/zlib.h
@@ -191,6 +191,12 @@ extern int zlib_deflate_workspacesize (i
    exceed those passed here.
 */
 
+extern int zlib_deflate_dfltcc_enabled (void);
+/*
+   Returns 1 if Deflate-Conversion facility is installed and enabled,
+   otherwise 0.
+*/
+
 /* 
 extern int deflateInit (z_streamp strm, int level);
 
--- a/lib/zlib_deflate/deflate.c~lib-zlib-add-zlib_deflate_dfltcc_enabled-function
+++ a/lib/zlib_deflate/deflate.c
@@ -59,6 +59,7 @@
 #define DEFLATE_RESET_HOOK(strm) do {} while (0)
 #define DEFLATE_HOOK(strm, flush, bstate) 0
 #define DEFLATE_NEED_CHECKSUM(strm) 1
+#define DEFLATE_DFLTCC_ENABLED() 0
 #endif
 
 /* ===========================================================================
@@ -1138,3 +1139,8 @@ int zlib_deflate_workspacesize(int windo
         + zlib_deflate_head_memsize(memLevel)
         + zlib_deflate_overlay_memsize(memLevel);
 }
+
+int zlib_deflate_dfltcc_enabled(void)
+{
+	return DEFLATE_DFLTCC_ENABLED();
+}
--- a/lib/zlib_deflate/deflate_syms.c~lib-zlib-add-zlib_deflate_dfltcc_enabled-function
+++ a/lib/zlib_deflate/deflate_syms.c
@@ -12,6 +12,7 @@
 #include <linux/zlib.h>
 
 EXPORT_SYMBOL(zlib_deflate_workspacesize);
+EXPORT_SYMBOL(zlib_deflate_dfltcc_enabled);
 EXPORT_SYMBOL(zlib_deflate);
 EXPORT_SYMBOL(zlib_deflateInit2);
 EXPORT_SYMBOL(zlib_deflateEnd);
--- a/lib/zlib_dfltcc/dfltcc.h~lib-zlib-add-zlib_deflate_dfltcc_enabled-function
+++ a/lib/zlib_dfltcc/dfltcc.h
@@ -3,6 +3,8 @@
 #define DFLTCC_H
 
 #include "../zlib_deflate/defutil.h"
+#include <asm/facility.h>
+#include <asm/setup.h>
 
 /*
  * Tuning parameters.
@@ -14,6 +16,8 @@
 #define DFLTCC_DHT_MIN_SAMPLE_SIZE 4096
 #define DFLTCC_RIBM 0
 
+#define DFLTCC_FACILITY 151
+
 /*
  * Parameter Block for Query Available Functions.
  */
@@ -113,6 +117,11 @@ typedef enum {
 } dfltcc_inflate_action;
 dfltcc_inflate_action dfltcc_inflate(z_streamp strm,
                                      int flush, int *ret);
+static inline int is_dfltcc_enabled(void)
+{
+    return (zlib_dfltcc_support != ZLIB_DFLTCC_DISABLED &&
+            test_facility(DFLTCC_FACILITY));
+}
 
 #define DEFLATE_RESET_HOOK(strm) \
     dfltcc_reset((strm), sizeof(deflate_state))
@@ -121,6 +130,8 @@ dfltcc_inflate_action dfltcc_inflate(z_s
 
 #define DEFLATE_NEED_CHECKSUM(strm) (!dfltcc_can_deflate((strm)))
 
+#define DEFLATE_DFLTCC_ENABLED() is_dfltcc_enabled()
+
 #define INFLATE_RESET_HOOK(strm) \
     dfltcc_reset((strm), sizeof(struct inflate_state))
 
--- a/lib/zlib_dfltcc/dfltcc_util.h~lib-zlib-add-zlib_deflate_dfltcc_enabled-function
+++ a/lib/zlib_dfltcc/dfltcc_util.h
@@ -3,8 +3,6 @@
 #define DFLTCC_UTIL_H
 
 #include <linux/zutil.h>
-#include <asm/facility.h>
-#include <asm/setup.h>
 
 /*
  * C wrapper for the DEFLATE CONVERSION CALL instruction.
@@ -24,7 +22,6 @@ typedef enum {
 #define HBT_CIRCULAR (1 << 7)
 #define HB_BITS 15
 #define HB_SIZE (1 << HB_BITS)
-#define DFLTCC_FACILITY 151
 
 static inline dfltcc_cc dfltcc(
     int fn,
@@ -101,12 +98,6 @@ static inline int dfltcc_are_params_ok(
         (strategy == Z_DEFAULT_STRATEGY);
 }
 
-static inline int is_dfltcc_enabled(void)
-{
-    return (zlib_dfltcc_support != ZLIB_DFLTCC_DISABLED &&
-            test_facility(DFLTCC_FACILITY));
-}


* [patch 098/118] btrfs: use larger zlib buffer for s390 hardware compression
  2020-01-31  6:10 incoming Andrew Morton
                   ` (96 preceding siblings ...)
  2020-01-31  6:16 ` [patch 097/118] lib/zlib: add zlib_deflate_dfltcc_enabled() function Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 099/118] lib/scatterlist.c: adjust indentation in __sg_alloc_table Andrew Morton
                   ` (19 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, borntraeger, clm, dsterba, edward6, gor, heiko.carstens,
	iii, josef, linux-mm, mm-commits, rpurdie, torvalds, zaslonko

From: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Subject: btrfs: use larger zlib buffer for s390 hardware compression

In order to benefit from s390 zlib hardware compression support, increase
the btrfs zlib workspace buffer size from 1 to 4 pages (if s390 zlib
hardware support is enabled on the machine).  This brings up to 60% better
performance in hardware on s390 compared to the PAGE_SIZE buffer and much
more compared to the software zlib processing in btrfs.  In case of memory
pressure, fall back to a single page buffer during workspace allocation.

Data compressed with larger input buffers will still conform to the zlib
standard and can thus also be decompressed on systems that use only a
PAGE_SIZE buffer for btrfs zlib.
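
The workspace setup then boils down to this opportunistic pattern (a digest
of the hunk below; handling of the other workspace fields is omitted):

	workspace->buf = NULL;
	if (zlib_deflate_dfltcc_enabled()) {
		/* Try 4 pages, but never stall under memory pressure. */
		workspace->buf = kmalloc(ZLIB_DFLTCC_BUF_SIZE,
					 __GFP_NOMEMALLOC | __GFP_NORETRY |
					 __GFP_NOWARN | GFP_NOIO);
		workspace->buf_size = ZLIB_DFLTCC_BUF_SIZE;
	}
	if (!workspace->buf) {
		workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
		workspace->buf_size = PAGE_SIZE;
	}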

Link: http://lkml.kernel.org/r/20200108105103.29028-1-zaslonko@linux.ibm.com
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/btrfs/compression.c |    2 
 fs/btrfs/zlib.c        |  135 ++++++++++++++++++++++++++++-----------
 2 files changed, 101 insertions(+), 36 deletions(-)

--- a/fs/btrfs/compression.c~btrfs-use-larger-zlib-buffer-for-s390-hardware-compression
+++ a/fs/btrfs/compression.c
@@ -1290,7 +1290,7 @@ int btrfs_decompress_buf2page(const char
 	/* copy bytes from the working buffer into the pages */
 	while (working_bytes > 0) {
 		bytes = min_t(unsigned long, bvec.bv_len,
-				PAGE_SIZE - buf_offset);
+				PAGE_SIZE - (buf_offset % PAGE_SIZE));
 		bytes = min(bytes, working_bytes);
 
 		kaddr = kmap_atomic(bvec.bv_page);
--- a/fs/btrfs/zlib.c~btrfs-use-larger-zlib-buffer-for-s390-hardware-compression
+++ a/fs/btrfs/zlib.c
@@ -20,9 +20,13 @@
 #include <linux/refcount.h>
 #include "compression.h"
 
+/* workspace buffer size for s390 zlib hardware support */
+#define ZLIB_DFLTCC_BUF_SIZE    (4 * PAGE_SIZE)
+
 struct workspace {
 	z_stream strm;
 	char *buf;
+	unsigned int buf_size;
 	struct list_head list;
 	int level;
 };
@@ -61,7 +65,21 @@ struct list_head *zlib_alloc_workspace(u
 			zlib_inflate_workspacesize());
 	workspace->strm.workspace = kvmalloc(workspacesize, GFP_KERNEL);
 	workspace->level = level;
-	workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	workspace->buf = NULL;
+	/*
+	 * In case of s390 zlib hardware support, allocate a larger workspace
+	 * buffer. If the allocation fails, fall back to a single-page buffer.
+	 */
+	if (zlib_deflate_dfltcc_enabled()) {
+		workspace->buf = kmalloc(ZLIB_DFLTCC_BUF_SIZE,
+					 __GFP_NOMEMALLOC | __GFP_NORETRY |
+					 __GFP_NOWARN | GFP_NOIO);
+		workspace->buf_size = ZLIB_DFLTCC_BUF_SIZE;
+	}
+	if (!workspace->buf) {
+		workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+		workspace->buf_size = PAGE_SIZE;
+	}
 	if (!workspace->strm.workspace || !workspace->buf)
 		goto fail;
 
@@ -85,6 +103,7 @@ int zlib_compress_pages(struct list_head
 	struct page *in_page = NULL;
 	struct page *out_page = NULL;
 	unsigned long bytes_left;
+	unsigned int in_buf_pages;
 	unsigned long len = *total_out;
 	unsigned long nr_dest_pages = *out_pages;
 	const unsigned long max_out = nr_dest_pages * PAGE_SIZE;
@@ -102,9 +121,6 @@ int zlib_compress_pages(struct list_head
 	workspace->strm.total_in = 0;
 	workspace->strm.total_out = 0;
 
-	in_page = find_get_page(mapping, start >> PAGE_SHIFT);
-	data_in = kmap(in_page);
-
 	out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
 	if (out_page == NULL) {
 		ret = -ENOMEM;
@@ -114,12 +130,51 @@ int zlib_compress_pages(struct list_head
 	pages[0] = out_page;
 	nr_pages = 1;
 
-	workspace->strm.next_in = data_in;
+	workspace->strm.next_in = workspace->buf;
+	workspace->strm.avail_in = 0;
 	workspace->strm.next_out = cpage_out;
 	workspace->strm.avail_out = PAGE_SIZE;
-	workspace->strm.avail_in = min(len, PAGE_SIZE);
 
 	while (workspace->strm.total_in < len) {
+		/*
+		 * Get next input pages and copy the contents to
+		 * the workspace buffer if required.
+		 */
+		if (workspace->strm.avail_in == 0) {
+			bytes_left = len - workspace->strm.total_in;
+			in_buf_pages = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE),
+					   workspace->buf_size / PAGE_SIZE);
+			if (in_buf_pages > 1) {
+				int i;
+
+				for (i = 0; i < in_buf_pages; i++) {
+					if (in_page) {
+						kunmap(in_page);
+						put_page(in_page);
+					}
+					in_page = find_get_page(mapping,
+								start >> PAGE_SHIFT);
+					data_in = kmap(in_page);
+					memcpy(workspace->buf + i * PAGE_SIZE,
+					       data_in, PAGE_SIZE);
+					start += PAGE_SIZE;
+				}
+				workspace->strm.next_in = workspace->buf;
+			} else {
+				if (in_page) {
+					kunmap(in_page);
+					put_page(in_page);
+				}
+				in_page = find_get_page(mapping,
+							start >> PAGE_SHIFT);
+				data_in = kmap(in_page);
+				start += PAGE_SIZE;
+				workspace->strm.next_in = data_in;
+			}
+			workspace->strm.avail_in = min(bytes_left,
+						       (unsigned long) workspace->buf_size);
+		}
+
 		ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH);
 		if (ret != Z_OK) {
 			pr_debug("BTRFS: deflate in loop returned %d\n",
@@ -161,33 +216,43 @@ int zlib_compress_pages(struct list_head
 		/* we're all done */
 		if (workspace->strm.total_in >= len)
 			break;
-
-		/* we've read in a full page, get a new one */
-		if (workspace->strm.avail_in == 0) {
-			if (workspace->strm.total_out > max_out)
-				break;
-
-			bytes_left = len - workspace->strm.total_in;
-			kunmap(in_page);
-			put_page(in_page);
-
-			start += PAGE_SIZE;
-			in_page = find_get_page(mapping,
-						start >> PAGE_SHIFT);
-			data_in = kmap(in_page);
-			workspace->strm.avail_in = min(bytes_left,
-							   PAGE_SIZE);
-			workspace->strm.next_in = data_in;
-		}
+		if (workspace->strm.total_out > max_out)
+			break;
 	}
 	workspace->strm.avail_in = 0;
-	ret = zlib_deflate(&workspace->strm, Z_FINISH);
-	zlib_deflateEnd(&workspace->strm);
-
-	if (ret != Z_STREAM_END) {
-		ret = -EIO;
-		goto out;
+	/*
+	 * Call deflate with Z_FINISH flush parameter providing more output
+	 * space but no more input data, until it returns with Z_STREAM_END.
+	 */
+	while (ret != Z_STREAM_END) {
+		ret = zlib_deflate(&workspace->strm, Z_FINISH);
+		if (ret == Z_STREAM_END)
+			break;
+		if (ret != Z_OK && ret != Z_BUF_ERROR) {
+			zlib_deflateEnd(&workspace->strm);
+			ret = -EIO;
+			goto out;
+		} else if (workspace->strm.avail_out == 0) {
+			/* get another page for the stream end */
+			kunmap(out_page);
+			if (nr_pages == nr_dest_pages) {
+				out_page = NULL;
+				ret = -E2BIG;
+				goto out;
+			}
+			out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+			if (out_page == NULL) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			cpage_out = kmap(out_page);
+			pages[nr_pages] = out_page;
+			nr_pages++;
+			workspace->strm.avail_out = PAGE_SIZE;
+			workspace->strm.next_out = cpage_out;
+		}
 	}
+	zlib_deflateEnd(&workspace->strm);
 
 	if (workspace->strm.total_out >= workspace->strm.total_in) {
 		ret = -E2BIG;
@@ -231,7 +296,7 @@ int zlib_decompress_bio(struct list_head
 
 	workspace->strm.total_out = 0;
 	workspace->strm.next_out = workspace->buf;
-	workspace->strm.avail_out = PAGE_SIZE;
+	workspace->strm.avail_out = workspace->buf_size;
 
 	/* If it's deflate, and it's got no preset dictionary, then
 	   we can tell zlib to skip the adler32 check. */
@@ -270,7 +335,7 @@ int zlib_decompress_bio(struct list_head
 		}
 
 		workspace->strm.next_out = workspace->buf;
-		workspace->strm.avail_out = PAGE_SIZE;
+		workspace->strm.avail_out = workspace->buf_size;
 
 		if (workspace->strm.avail_in == 0) {
 			unsigned long tmp;
@@ -320,7 +385,7 @@ int zlib_decompress(struct list_head *ws
 	workspace->strm.total_in = 0;
 
 	workspace->strm.next_out = workspace->buf;
-	workspace->strm.avail_out = PAGE_SIZE;
+	workspace->strm.avail_out = workspace->buf_size;
 	workspace->strm.total_out = 0;
 	/* If it's deflate, and it's got no preset dictionary, then
 	   we can tell zlib to skip the adler32 check. */
@@ -364,7 +429,7 @@ int zlib_decompress(struct list_head *ws
 			buf_offset = 0;
 
 		bytes = min(PAGE_SIZE - pg_offset,
-			    PAGE_SIZE - buf_offset);
+			    PAGE_SIZE - (buf_offset % PAGE_SIZE));
 		bytes = min(bytes, bytes_left);
 
 		kaddr = kmap_atomic(dest_page);
@@ -375,7 +440,7 @@ int zlib_decompress(struct list_head *ws
 		bytes_left -= bytes;
 next:
 		workspace->strm.next_out = workspace->buf;
-		workspace->strm.avail_out = PAGE_SIZE;
+		workspace->strm.avail_out = workspace->buf_size;
 	}
 
 	if (ret != Z_STREAM_END && bytes_left != 0)
_


* [patch 099/118] lib/scatterlist.c: adjust indentation in __sg_alloc_table
  2020-01-31  6:10 incoming Andrew Morton
                   ` (97 preceding siblings ...)
  2020-01-31  6:16 ` [patch 098/118] btrfs: use larger zlib buffer for s390 hardware compression Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 100/118] uapi: rename ext2_swab() to swab() and share globally in swab.h Andrew Morton
                   ` (18 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, natechancellor, torvalds

From: Nathan Chancellor <natechancellor@gmail.com>
Subject: lib/scatterlist.c: adjust indentation in __sg_alloc_table

Clang warns:

../lib/scatterlist.c:314:5: warning: misleading indentation; statement
is not part of the previous 'if' [-Wmisleading-indentation]
                        return -ENOMEM;
                        ^
../lib/scatterlist.c:311:4: note: previous statement is here
                        if (prv)
                        ^
1 warning generated.

This warning occurs because there is a space before the tab on this line. 
Remove it so that the indentation is consistent with the Linux kernel
coding style and clang no longer warns.

Link: http://lkml.kernel.org/r/20191218033606.11942-1-natechancellor@gmail.com
Link: https://github.com/ClangBuiltLinux/linux/issues/830
Fixes: edce6820a9fd ("scatterlist: prevent invalid free when alloc fails")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/scatterlist.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/scatterlist.c~lib-scatterlist-adjust-indentation-in-__sg_alloc_table
+++ a/lib/scatterlist.c
@@ -311,7 +311,7 @@ int __sg_alloc_table(struct sg_table *ta
 			if (prv)
 				table->nents = ++table->orig_nents;
 
- 			return -ENOMEM;
+			return -ENOMEM;
 		}
 
 		sg_init_table(sg, alloc_size);
_


* [patch 100/118] uapi: rename ext2_swab() to swab() and share globally in swab.h
  2020-01-31  6:10 incoming Andrew Morton
                   ` (98 preceding siblings ...)
  2020-01-31  6:16 ` [patch 099/118] lib/scatterlist.c: adjust indentation in __sg_alloc_table Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 101/118] lib/find_bit.c: join _find_next_bit{_le} Andrew Morton
                   ` (17 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, allison, joe, linux-mm, mm-commits, tglx, torvalds,
	vilhelm.gray, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: uapi: rename ext2_swab() to swab() and share globally in swab.h

ext2_swab() is defined locally in lib/find_bit.c.  However, it is not
specific to ext2, nor to bitmaps.

There are many potential users of it, so rename it to just swab() and move
it to include/uapi/linux/swab.h.

The ABI guarantees that the size of unsigned long corresponds to
BITS_PER_LONG, therefore drop the unneeded cast.
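
A quick example of the renamed helper (64-bit values shown; on a 32-bit
build swab() maps to __swab32 instead):

	#include <linux/swab.h>

	unsigned long w = 0x0123456789abcdefUL;
	unsigned long r = swab(w);	/* 0xefcdab8967452301UL */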

Link: http://lkml.kernel.org/r/20200103202846.21616-1-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Joe Perches <joe@perches.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: William Breathitt Gray <vilhelm.gray@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swab.h      |    1 +
 include/uapi/linux/swab.h |   10 ++++++++++
 lib/find_bit.c            |   16 ++--------------
 3 files changed, 13 insertions(+), 14 deletions(-)

--- a/include/linux/swab.h~uapi-rename-ext2_swab-to-swab-and-share-globally-in-swabh
+++ a/include/linux/swab.h
@@ -7,6 +7,7 @@
 # define swab16 __swab16
 # define swab32 __swab32
 # define swab64 __swab64
+# define swab __swab
 # define swahw32 __swahw32
 # define swahb32 __swahb32
 # define swab16p __swab16p
--- a/include/uapi/linux/swab.h~uapi-rename-ext2_swab-to-swab-and-share-globally-in-swabh
+++ a/include/uapi/linux/swab.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <asm/bitsperlong.h>
 #include <asm/swab.h>
 
 /*
@@ -132,6 +133,15 @@ static inline __attribute_const__ __u32
 	__fswab64(x))
 #endif
 
+static __always_inline unsigned long __swab(const unsigned long y)
+{
+#if BITS_PER_LONG == 64
+	return __swab64(y);
+#else /* BITS_PER_LONG == 32 */
+	return __swab32(y);
+#endif
+}
+
 /**
  * __swahw32 - return a word-swapped 32-bit value
  * @x: value to wordswap
--- a/lib/find_bit.c~uapi-rename-ext2_swab-to-swab-and-share-globally-in-swabh
+++ a/lib/find_bit.c
@@ -149,18 +149,6 @@ EXPORT_SYMBOL(find_last_bit);
 
 #ifdef __BIG_ENDIAN
 
-/* include/linux/byteorder does not support "unsigned long" type */
-static inline unsigned long ext2_swab(const unsigned long y)
-{
-#if BITS_PER_LONG == 64
-	return (unsigned long) __swab64((u64) y);
-#elif BITS_PER_LONG == 32
-	return (unsigned long) __swab32((u32) y);
-#else
-#error BITS_PER_LONG not defined
-#endif
-}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 101/118] lib/find_bit.c: join _find_next_bit{_le}
  2020-01-31  6:10 incoming Andrew Morton
                   ` (99 preceding siblings ...)
  2020-01-31  6:16 ` [patch 100/118] uapi: rename ext2_swab() to swab() and share globally in swab.h Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 102/118] lib/find_bit.c: uninline helper _find_next_bit() Andrew Morton
                   ` (16 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, allison, joe, linux-mm, mm-commits, tglx, torvalds,
	vilhelm.gray, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib/find_bit.c: join _find_next_bit{_le}

_find_next_bit and _find_next_bit_le are very similar functions.  It's
possible to join them by adding one parameter and a couple of simple
checks.  This simplifies maintenance and makes it possible to shrink the
size of .text by un-inlining the unified function (in the following
patch).

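With the extra "le" parameter, the little-endian variants presumably
become thin wrappers around the unified helper; a sketch of the idea
(the wrapper hunks are not part of the excerpt below):

	unsigned long find_next_bit_le(const void *addr,
			unsigned long size, unsigned long offset)
	{
		return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
	}
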
Link: http://lkml.kernel.org/r/20200103202846.21616-2-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Cc: Allison Randal <allison@lohutok.net>
Cc: Joe Perches <joe@perches.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: William Breathitt Gray <vilhelm.gray@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/find_bit.c |   64 +++++++++++++----------------------------------
 1 file changed, 19 insertions(+), 45 deletions(-)

--- a/lib/find_bit.c~lib-find_bitc-join-_find_next_bit_le
+++ a/lib/find_bit.c
@@ -17,9 +17,9 @@
 #include <linux/export.h>
 #include <linux/kernel.h>
 
-#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
-		!defined(find_next_and_bit)
-
+#if !defined(find_next_bit) || !defined(find_next_zero_bit) ||			\
+	!defined(find_next_bit_le) || !defined(find_next_zero_bit_le) ||	\
+	!defined(find_next_and_bit)
 /*
  * This is a common helper function for find_next_bit, find_next_zero_bit, and
  * find_next_and_bit. The differences are:
@@ -29,9 +29,9 @@
  */
 static inline unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert)
+		unsigned long start, unsigned long invert, unsigned long le)
 {
-	unsigned long tmp;
+	unsigned long tmp, mask;
 
 	if (unlikely(start >= nbits))
 		return nbits;
@@ -42,7 +42,12 @@ static inline unsigned long _find_next_b
 	tmp ^= invert;
 
 	/* Handle 1st word. */
-	tmp &= BITMAP_FIRST_WORD_MASK(start);
+	mask = BITMAP_FIRST_WORD_MASK(start);
+	if (le)
+		mask = swab(mask);
+
+	tmp &= mask;
+
 	start = round_down(start, BITS_PER_LONG);
 
 	while (!tmp) {
@@ -56,6 +61,9 @@ static inline unsigned long _find_next_b
 		tmp ^= invert;
 	}
 
+	if (le)
+		tmp = swab(tmp);
+
 	return min(start + __ffs(tmp), nbits);
 }
 #endif
@@ -67,7 +75,7 @@ static inline unsigned long _find_next_b
 unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 			    unsigned long offset)
 {
-	return _find_next_bit(addr, NULL, size, offset, 0UL);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
 }
 EXPORT_SYMBOL(find_next_bit);
 #endif
@@ -76,7 +84,7 @@ EXPORT_SYMBOL(find_next_bit);
 unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 				 unsigned long offset)
 {
-	return _find_next_bit(addr, NULL, size, offset, ~0UL);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
 }
 EXPORT_SYMBOL(find_next_zero_bit);
 #endif
@@ -86,7 +94,7 @@ unsigned long find_next_and_bit(const un
 		const unsigned long *addr2, unsigned long size,
 		unsigned long offset)
 {
-	return _find_next_bit(addr1, addr2, size, offset, 0UL);
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
 }
 EXPORT_SYMBOL(find_next_and_bit);
 #endif
@@ -149,45 +157,11 @@ EXPORT_SYMBOL(find_last_bit);
 
 #ifdef __BIG_ENDIAN
 
-#if !defined(find_next_bit_le) || !defined(find_next_zero_bit_le)
-static inline unsigned long _find_next_bit_le(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert)
-{
-	unsigned long tmp;
-
-	if (unlikely(start >= nbits))
-		return nbits;
-
-	tmp = addr1[start / BITS_PER_LONG];
-	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
-	tmp ^= invert;
-
-	/* Handle 1st word. */
-	tmp &= swab(BITMAP_FIRST_WORD_MASK(start));
-	start = round_down(start, BITS_PER_LONG);
-
-	while (!tmp) {
-		start += BITS_PER_LONG;
-		if (start >= nbits)
-			return nbits;
-
-		tmp = addr1[start / BITS_PER_LONG];
-		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
-		tmp ^= invert;
-	}
-
-	return min(start + __ffs(swab(tmp)), nbits);
-}
-#endif

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 102/118] lib/find_bit.c: uninline helper _find_next_bit()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (100 preceding siblings ...)
  2020-01-31  6:16 ` [patch 101/118] lib/find_bit.c: join _find_next_bit{_le} Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 103/118] fs/binfmt_elf.c: smaller code generation around auxv vector fill Andrew Morton
                   ` (15 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: akpm, allison, joe, linux-mm, mm-commits, tglx, torvalds,
	vilhelm.gray, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib/find_bit.c: uninline helper _find_next_bit()

It saves 25% of .text for arm64, and more for BE architectures.

Before:
$ size lib/find_bit.o
   text    data     bss     dec     hex filename
   1012      56       0    1068     42c lib/find_bit.o

After:
$ size lib/find_bit.o
   text    data     bss     dec     hex filename
    776      56       0     832     340 lib/find_bit.o

Link: http://lkml.kernel.org/r/20200103202846.21616-3-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Allison Randal <allison@lohutok.net>
Cc: William Breathitt Gray <vilhelm.gray@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/find_bit.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/find_bit.c~lib-find_bitc-uninline-helper-_find_next_bit
+++ a/lib/find_bit.c
@@ -27,7 +27,7 @@
  *    searching it for one bits.
  *  - The optional "addr2", which is anded with "addr1" if present.
  */
-static inline unsigned long _find_next_bit(const unsigned long *addr1,
+static unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
 		unsigned long start, unsigned long invert, unsigned long le)
 {
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 103/118] fs/binfmt_elf.c: smaller code generation around auxv vector fill
  2020-01-31  6:10 incoming Andrew Morton
                   ` (101 preceding siblings ...)
  2020-01-31  6:16 ` [patch 102/118] lib/find_bit.c: uninline helper _find_next_bit() Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 104/118] fs/binfmt_elf.c: fix ->start_code calculation Andrew Morton
                   ` (14 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: fs/binfmt_elf.c: smaller code generation around auxv vector fill

Filling the auxv vector as an array with an index (auxv[i++] = ...)
generates terrible code.  "saved_auxv" should be reworked because it is
the worst member of mm_struct by size/usefulness ratio, but that can be
done later.

Meanwhile, help gcc a little with the *auxv++ idiom.

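The two code shapes being compared, side by side (taken from the hunk
below):

	elf_info[ei_index++] = id;	/* indexed: address is recomputed   */
	elf_info[ei_index++] = val;	/* and the index bumped separately  */

	*elf_info++ = id;		/* pointer: one post-incremented    */
	*elf_info++ = val;		/* store per entry                  */
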
Space savings on x86_64:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-127 (-127)
	Function                                     old     new   delta
	load_elf_binary                             5470    5343    -127

Link: http://lkml.kernel.org/r/20191208172301.GD19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |   15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

--- a/fs/binfmt_elf.c~elf-smaller-code-generation-around-auxv-vector-fill
+++ a/fs/binfmt_elf.c
@@ -176,7 +176,7 @@ create_elf_tables(struct linux_binprm *b
 	unsigned char k_rand_bytes[16];
 	int items;
 	elf_addr_t *elf_info;
-	int ei_index = 0;
+	int ei_index;
 	const struct cred *cred = current_cred();
 	struct vm_area_struct *vma;
 
@@ -230,8 +230,8 @@ create_elf_tables(struct linux_binprm *b
 	/* update AT_VECTOR_SIZE_BASE if the number of NEW_AUX_ENT() changes */
 #define NEW_AUX_ENT(id, val) \
 	do { \
-		elf_info[ei_index++] = id; \
-		elf_info[ei_index++] = val; \
+		*elf_info++ = id; \
+		*elf_info++ = val; \
 	} while (0)
 
 #ifdef ARCH_DLINFO
@@ -275,12 +275,13 @@ create_elf_tables(struct linux_binprm *b
 	}
 #undef NEW_AUX_ENT
 	/* AT_NULL is zero; clear the rest too */
-	memset(&elf_info[ei_index], 0,
-	       sizeof current->mm->saved_auxv - ei_index * sizeof elf_info[0]);
+	memset(elf_info, 0, (char *)current->mm->saved_auxv +
+			sizeof(current->mm->saved_auxv) - (char *)elf_info);
 
 	/* And advance past the AT_NULL entry.  */
-	ei_index += 2;
+	elf_info += 2;
 
+	ei_index = elf_info - (elf_addr_t *)current->mm->saved_auxv;
 	sp = STACK_ADD(p, ei_index);
 
 	items = (argc + 1) + (envc + 1) + 1;
@@ -338,7 +339,7 @@ create_elf_tables(struct linux_binprm *b
 	current->mm->env_end = p;
 
 	/* Put the elf_info on the stack in the right place.  */
-	if (copy_to_user(sp, elf_info, ei_index * sizeof(elf_addr_t)))
+	if (copy_to_user(sp, current->mm->saved_auxv, ei_index * sizeof(elf_addr_t)))
 		return -EFAULT;
 	return 0;
 }
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 104/118] fs/binfmt_elf.c: fix ->start_code calculation
  2020-01-31  6:10 incoming Andrew Morton
                   ` (102 preceding siblings ...)
  2020-01-31  6:16 ` [patch 103/118] fs/binfmt_elf.c: smaller code generation around auxv vector fill Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:16 ` [patch 105/118] fs/binfmt_elf.c: don't copy ELF header around Andrew Morton
                   ` (13 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: fs/binfmt_elf.c: fix ->start_code calculation

Only executable segments should be accounted to ->start_code, just as
they already are (correctly) for ->end_code.

Link: http://lkml.kernel.org/r/20191208171410.GB19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/binfmt_elf.c~elf-fix-start_code-calculation
+++ a/fs/binfmt_elf.c
@@ -999,7 +999,7 @@ out_free_interp:
 			}
 		}
 		k = elf_ppnt->p_vaddr;
-		if (k < start_code)
+		if ((elf_ppnt->p_flags & PF_X) && k < start_code)
 			start_code = k;
 		if (start_data < k)
 			start_data = k;
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 105/118] fs/binfmt_elf.c: don't copy ELF header around
  2020-01-31  6:10 incoming Andrew Morton
                   ` (103 preceding siblings ...)
  2020-01-31  6:16 ` [patch 104/118] fs/binfmt_elf.c: fix ->start_code calculation Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2023-11-22  7:15   ` Jinjie Ruan
  2020-01-31  6:16 ` [patch 106/118] fs/binfmt_elf.c: better codegen around current->mm Andrew Morton
                   ` (12 subsequent siblings)
  117 siblings, 1 reply; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: fs/binfmt_elf.c: don't copy ELF header around

The ELF header is read into bprm->buf[] by the generic execve code.

Save a memcpy and allocate just one header for the interpreter instead of
two headers (64 bytes instead of 128 on 64-bit).

Link: http://lkml.kernel.org/r/20191208171242.GA19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |   55 ++++++++++++++++++++++------------------------
 1 file changed, 27 insertions(+), 28 deletions(-)

--- a/fs/binfmt_elf.c~elf-dont-copy-elf-header-around
+++ a/fs/binfmt_elf.c
@@ -161,8 +161,9 @@ static int padzero(unsigned long elf_bss
 #endif
 
 static int
-create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
-		unsigned long load_addr, unsigned long interp_load_addr)
+create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
+		unsigned long load_addr, unsigned long interp_load_addr,
+		unsigned long e_entry)
 {
 	unsigned long p = bprm->p;
 	int argc = bprm->argc;
@@ -251,7 +252,7 @@ create_elf_tables(struct linux_binprm *b
 	NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
 	NEW_AUX_ENT(AT_BASE, interp_load_addr);
 	NEW_AUX_ENT(AT_FLAGS, 0);
-	NEW_AUX_ENT(AT_ENTRY, exec->e_entry);
+	NEW_AUX_ENT(AT_ENTRY, e_entry);
 	NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid));
 	NEW_AUX_ENT(AT_EUID, from_kuid_munged(cred->user_ns, cred->euid));
 	NEW_AUX_ENT(AT_GID, from_kgid_munged(cred->user_ns, cred->gid));
@@ -690,12 +691,13 @@ static int load_elf_binary(struct linux_
 	int bss_prot = 0;
 	int retval, i;
 	unsigned long elf_entry;
+	unsigned long e_entry;
 	unsigned long interp_load_addr = 0;
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long reloc_func_desc __maybe_unused = 0;
 	int executable_stack = EXSTACK_DEFAULT;
+	struct elfhdr *elf_ex = (struct elfhdr *)bprm->buf;
 	struct {
-		struct elfhdr elf_ex;
 		struct elfhdr interp_elf_ex;
 	} *loc;
 	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
@@ -706,30 +708,27 @@ static int load_elf_binary(struct linux_
 		retval = -ENOMEM;
 		goto out_ret;
 	}
-	
-	/* Get the exec-header */
-	loc->elf_ex = *((struct elfhdr *)bprm->buf);
 
 	retval = -ENOEXEC;
 	/* First of all, some simple consistency checks */
-	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
+	if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0)
 		goto out;
 
-	if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
+	if (elf_ex->e_type != ET_EXEC && elf_ex->e_type != ET_DYN)
 		goto out;
-	if (!elf_check_arch(&loc->elf_ex))
+	if (!elf_check_arch(elf_ex))
 		goto out;
-	if (elf_check_fdpic(&loc->elf_ex))
+	if (elf_check_fdpic(elf_ex))
 		goto out;
 	if (!bprm->file->f_op->mmap)
 		goto out;
 
-	elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
+	elf_phdata = load_elf_phdrs(elf_ex, bprm->file);
 	if (!elf_phdata)
 		goto out;
 
 	elf_ppnt = elf_phdata;
-	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
+	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
 		char *elf_interpreter;
 
 		if (elf_ppnt->p_type != PT_INTERP)
@@ -783,7 +782,7 @@ out_free_interp:
 	}
 
 	elf_ppnt = elf_phdata;
-	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++)
+	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++)
 		switch (elf_ppnt->p_type) {
 		case PT_GNU_STACK:
 			if (elf_ppnt->p_flags & PF_X)
@@ -793,7 +792,7 @@ out_free_interp:
 			break;
 
 		case PT_LOPROC ... PT_HIPROC:
-			retval = arch_elf_pt_proc(&loc->elf_ex, elf_ppnt,
+			retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
 						  bprm->file, false,
 						  &arch_state);
 			if (retval)
@@ -837,7 +836,7 @@ out_free_interp:
 	 * still possible to return an error to the code that invoked
 	 * the exec syscall.
 	 */
-	retval = arch_check_elf(&loc->elf_ex,
+	retval = arch_check_elf(elf_ex,
 				!!interpreter, &loc->interp_elf_ex,
 				&arch_state);
 	if (retval)
@@ -850,8 +849,8 @@ out_free_interp:
 
 	/* Do this immediately, since STACK_TOP as used in setup_arg_pages
 	   may depend on the personality.  */
-	SET_PERSONALITY2(loc->elf_ex, &arch_state);
-	if (elf_read_implies_exec(loc->elf_ex, executable_stack))
+	SET_PERSONALITY2(*elf_ex, &arch_state);
+	if (elf_read_implies_exec(*elf_ex, executable_stack))
 		current->personality |= READ_IMPLIES_EXEC;
 
 	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
@@ -878,7 +877,7 @@ out_free_interp:
 	/* Now we do a little grungy work by mmapping the ELF image into
 	   the correct location in memory. */
 	for(i = 0, elf_ppnt = elf_phdata;
-	    i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
+	    i < elf_ex->e_phnum; i++, elf_ppnt++) {
 		int elf_prot, elf_flags;
 		unsigned long k, vaddr;
 		unsigned long total_size = 0;
@@ -922,9 +921,9 @@ out_free_interp:
 		 * If we are loading ET_EXEC or we have already performed
 		 * the ET_DYN load_addr calculations, proceed normally.
 		 */
-		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
+		if (elf_ex->e_type == ET_EXEC || load_addr_set) {
 			elf_flags |= MAP_FIXED;
-		} else if (loc->elf_ex.e_type == ET_DYN) {
+		} else if (elf_ex->e_type == ET_DYN) {
 			/*
 			 * This logic is run once for the first LOAD Program
 			 * Header for ET_DYN binaries to calculate the
@@ -973,7 +972,7 @@ out_free_interp:
 			load_bias = ELF_PAGESTART(load_bias - vaddr);
 
 			total_size = total_mapping_size(elf_phdata,
-							loc->elf_ex.e_phnum);
+							elf_ex->e_phnum);
 			if (!total_size) {
 				retval = -EINVAL;
 				goto out_free_dentry;
@@ -991,7 +990,7 @@ out_free_interp:
 		if (!load_addr_set) {
 			load_addr_set = 1;
 			load_addr = (elf_ppnt->p_vaddr - elf_ppnt->p_offset);
-			if (loc->elf_ex.e_type == ET_DYN) {
+			if (elf_ex->e_type == ET_DYN) {
 				load_bias += error -
 				             ELF_PAGESTART(load_bias + vaddr);
 				load_addr += load_bias;
@@ -1032,7 +1031,7 @@ out_free_interp:
 		}
 	}
 
-	loc->elf_ex.e_entry += load_bias;
+	e_entry = elf_ex->e_entry + load_bias;
 	elf_bss += load_bias;
 	elf_brk += load_bias;
 	start_code += load_bias;
@@ -1075,7 +1074,7 @@ out_free_interp:
 		allow_write_access(interpreter);
 		fput(interpreter);
 	} else {
-		elf_entry = loc->elf_ex.e_entry;
+		elf_entry = e_entry;
 		if (BAD_ADDR(elf_entry)) {
 			retval = -EINVAL;
 			goto out_free_dentry;
@@ -1093,8 +1092,8 @@ out_free_interp:
 		goto out;
 #endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */
 
-	retval = create_elf_tables(bprm, &loc->elf_ex,
-			  load_addr, interp_load_addr);
+	retval = create_elf_tables(bprm, elf_ex,
+			  load_addr, interp_load_addr, e_entry);
 	if (retval < 0)
 		goto out;
 	current->mm->end_code = end_code;
@@ -1112,7 +1111,7 @@ out_free_interp:
 		 * growing down), and into the unused ELF_ET_DYN_BASE region.
 		 */
 		if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
-		    loc->elf_ex.e_type == ET_DYN && !interpreter)
+		    elf_ex->e_type == ET_DYN && !interpreter)
 			current->mm->brk = current->mm->start_brk =
 				ELF_ET_DYN_BASE;
 
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 106/118] fs/binfmt_elf.c: better codegen around current->mm
  2020-01-31  6:10 incoming Andrew Morton
                   ` (104 preceding siblings ...)
  2020-01-31  6:16 ` [patch 105/118] fs/binfmt_elf.c: don't copy ELF header around Andrew Morton
@ 2020-01-31  6:16 ` Andrew Morton
  2020-01-31  6:17 ` [patch 107/118] fs/binfmt_elf.c: make BAD_ADDR() unlikely Andrew Morton
                   ` (11 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:16 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: fs/binfmt_elf.c: better codegen around current->mm

"current->mm" pointer is stable in general except few cases one of which
execve(2).  Compiler can't treat is as stable but it _is_ stable most of
the time.  During ELF loading process ->mm becomes stable right after
flush_old_exec().

Help compiler by caching current->mm, otherwise it continues to refetch it.

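The shape of the change, in miniature (taken from the hunks below):

	struct mm_struct *mm = current->mm;	/* fetch once...       */
	mm->end_code = end_code;		/* ...reuse thereafter */
	mm->start_code = start_code;
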
	add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-141 (-141)
	Function                                     old     new   delta
	elf_core_dump                               5062    5039     -23
	load_elf_binary                             5426    5308    -118

Note: other cases are left as-is because caching there is either a
pessimisation or makes no change in binary size.

Link: http://lkml.kernel.org/r/20191215124755.GB21124@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |   52 ++++++++++++++++++++++++----------------------
 1 file changed, 28 insertions(+), 24 deletions(-)

--- a/fs/binfmt_elf.c~elf-better-codegen-around-current-mm
+++ a/fs/binfmt_elf.c
@@ -165,6 +165,7 @@ create_elf_tables(struct linux_binprm *b
 		unsigned long load_addr, unsigned long interp_load_addr,
 		unsigned long e_entry)
 {
+	struct mm_struct *mm = current->mm;
 	unsigned long p = bprm->p;
 	int argc = bprm->argc;
 	int envc = bprm->envc;
@@ -227,7 +228,7 @@ create_elf_tables(struct linux_binprm *b
 		return -EFAULT;
 
 	/* Create the ELF interpreter info */
-	elf_info = (elf_addr_t *)current->mm->saved_auxv;
+	elf_info = (elf_addr_t *)mm->saved_auxv;
 	/* update AT_VECTOR_SIZE_BASE if the number of NEW_AUX_ENT() changes */
 #define NEW_AUX_ENT(id, val) \
 	do { \
@@ -276,13 +277,13 @@ create_elf_tables(struct linux_binprm *b
 	}
 #undef NEW_AUX_ENT
 	/* AT_NULL is zero; clear the rest too */
-	memset(elf_info, 0, (char *)current->mm->saved_auxv +
-			sizeof(current->mm->saved_auxv) - (char *)elf_info);
+	memset(elf_info, 0, (char *)mm->saved_auxv +
+			sizeof(mm->saved_auxv) - (char *)elf_info);
 
 	/* And advance past the AT_NULL entry.  */
 	elf_info += 2;
 
-	ei_index = elf_info - (elf_addr_t *)current->mm->saved_auxv;
+	ei_index = elf_info - (elf_addr_t *)mm->saved_auxv;
 	sp = STACK_ADD(p, ei_index);
 
 	items = (argc + 1) + (envc + 1) + 1;
@@ -301,7 +302,7 @@ create_elf_tables(struct linux_binprm *b
 	 * Grow the stack manually; some architectures have a limit on how
 	 * far ahead a user-space access may be in order to grow the stack.
 	 */
-	vma = find_extend_vma(current->mm, bprm->p);
+	vma = find_extend_vma(mm, bprm->p);
 	if (!vma)
 		return -EFAULT;
 
@@ -310,7 +311,7 @@ create_elf_tables(struct linux_binprm *b
 		return -EFAULT;
 
 	/* Populate list of argv pointers back to argv strings. */
-	p = current->mm->arg_end = current->mm->arg_start;
+	p = mm->arg_end = mm->arg_start;
 	while (argc-- > 0) {
 		size_t len;
 		if (__put_user((elf_addr_t)p, sp++))
@@ -322,10 +323,10 @@ create_elf_tables(struct linux_binprm *b
 	}
 	if (__put_user(0, sp++))
 		return -EFAULT;
-	current->mm->arg_end = p;
+	mm->arg_end = p;
 
 	/* Populate list of envp pointers back to envp strings. */
-	current->mm->env_end = current->mm->env_start = p;
+	mm->env_end = mm->env_start = p;
 	while (envc-- > 0) {
 		size_t len;
 		if (__put_user((elf_addr_t)p, sp++))
@@ -337,10 +338,10 @@ create_elf_tables(struct linux_binprm *b
 	}
 	if (__put_user(0, sp++))
 		return -EFAULT;
-	current->mm->env_end = p;
+	mm->env_end = p;
 
 	/* Put the elf_info on the stack in the right place.  */
-	if (copy_to_user(sp, current->mm->saved_auxv, ei_index * sizeof(elf_addr_t)))
+	if (copy_to_user(sp, mm->saved_auxv, ei_index * sizeof(elf_addr_t)))
 		return -EFAULT;
 	return 0;
 }
@@ -701,6 +702,7 @@ static int load_elf_binary(struct linux_
 		struct elfhdr interp_elf_ex;
 	} *loc;
 	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
+	struct mm_struct *mm;
 	struct pt_regs *regs;
 
 	loc = kmalloc(sizeof(*loc), GFP_KERNEL);
@@ -1096,11 +1098,13 @@ out_free_interp:
 			  load_addr, interp_load_addr, e_entry);
 	if (retval < 0)
 		goto out;
-	current->mm->end_code = end_code;
-	current->mm->start_code = start_code;
-	current->mm->start_data = start_data;
-	current->mm->end_data = end_data;
-	current->mm->start_stack = bprm->p;
+
+	mm = current->mm;
+	mm->end_code = end_code;
+	mm->start_code = start_code;
+	mm->start_data = start_data;
+	mm->end_data = end_data;
+	mm->start_stack = bprm->p;
 
 	if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) {
 		/*
@@ -1111,12 +1115,11 @@ out_free_interp:
 		 * growing down), and into the unused ELF_ET_DYN_BASE region.
 		 */
 		if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
-		    elf_ex->e_type == ET_DYN && !interpreter)
-			current->mm->brk = current->mm->start_brk =
-				ELF_ET_DYN_BASE;
+		    elf_ex->e_type == ET_DYN && !interpreter) {
+			mm->brk = mm->start_brk = ELF_ET_DYN_BASE;
+		}
 
-		current->mm->brk = current->mm->start_brk =
-			arch_randomize_brk(current->mm);
+		mm->brk = mm->start_brk = arch_randomize_brk(mm);
 #ifdef compat_brk_randomized
 		current->brk_randomized = 1;
 #endif
@@ -1574,6 +1577,7 @@ static void fill_siginfo_note(struct mem
  */
 static int fill_files_note(struct memelfnote *note)
 {
+	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned count, size, names_ofs, remaining, n;
 	user_long_t *data;
@@ -1581,7 +1585,7 @@ static int fill_files_note(struct memelf
 	char *name_base, *name_curpos;
 
 	/* *Estimated* file count and total data size needed */
-	count = current->mm->map_count;
+	count = mm->map_count;
 	if (count > UINT_MAX / 64)
 		return -EINVAL;
 	size = count * 64;
@@ -1599,7 +1603,7 @@ static int fill_files_note(struct memelf
 	name_base = name_curpos = ((char *)data) + names_ofs;
 	remaining = size - names_ofs;
 	count = 0;
-	for (vma = current->mm->mmap; vma != NULL; vma = vma->vm_next) {
+	for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
 		struct file *file;
 		const char *filename;
 
@@ -1633,10 +1637,10 @@ static int fill_files_note(struct memelf
 	data[0] = count;
 	data[1] = PAGE_SIZE;
 	/*
-	 * Count usually is less than current->mm->map_count,
+	 * Count usually is less than mm->map_count,
 	 * we need to move filenames down.
 	 */
-	n = current->mm->map_count - count;
+	n = mm->map_count - count;
 	if (n != 0) {
 		unsigned shift_bytes = n * 3 * sizeof(data[0]);
 		memmove(name_base - shift_bytes, name_base,
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 107/118] fs/binfmt_elf.c: make BAD_ADDR() unlikely
  2020-01-31  6:10 incoming Andrew Morton
                   ` (105 preceding siblings ...)
  2020-01-31  6:16 ` [patch 106/118] fs/binfmt_elf.c: better codegen around current->mm Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 108/118] fs/binfmt_elf.c: coredump: allocate core ELF header on stack Andrew Morton
                   ` (10 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: fs/binfmt_elf.c: make BAD_ADDR() unlikely

If some mapping goes past TASK_SIZE it will be rejected by the kernel,
which means no such userspace binaries exist.

Mark every such check as unlikely.

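For reference, unlikely() is the usual __builtin_expect() annotation
from include/linux/compiler.h:

	#define unlikely(x)	__builtin_expect(!!(x), 0)
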
Link: http://lkml.kernel.org/r/20191215124355.GA21124@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/binfmt_elf.c~elf-make-bad_addr-unlikely
+++ a/fs/binfmt_elf.c
@@ -97,7 +97,7 @@ static struct linux_binfmt elf_format =
 	.min_coredump	= ELF_EXEC_PAGESIZE,
 };
 
-#define BAD_ADDR(x) ((unsigned long)(x) >= TASK_SIZE)
+#define BAD_ADDR(x) (unlikely((unsigned long)(x) >= TASK_SIZE))
 
 static int set_brk(unsigned long start, unsigned long end, int prot)
 {
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 108/118] fs/binfmt_elf.c: coredump: allocate core ELF header on stack
  2020-01-31  6:10 incoming Andrew Morton
                   ` (106 preceding siblings ...)
  2020-01-31  6:17 ` [patch 107/118] fs/binfmt_elf.c: make BAD_ADDR() unlikely Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 109/118] fs/binfmt_elf.c: coredump: delete duplicated overflow check Andrew Morton
                   ` (9 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: fs/binfmt_elf.c: coredump: allocate core ELF header on stack

The comment says the ELF header is "too large to be on stack".  64 bytes
on 64-bit is not large by any means.

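For reference, struct elfhdr is Elf64_Ehdr on 64-bit, whose size is
fixed by the ELF specification:

	sizeof(Elf64_Ehdr) == 64	/* e_ident[16] + 13 scalar fields */
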
Link: http://lkml.kernel.org/r/20191222143850.GA24341@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |   16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

--- a/fs/binfmt_elf.c~elf-coredump-allocate-core-elf-header-on-stack
+++ a/fs/binfmt_elf.c
@@ -2186,7 +2186,7 @@ static int elf_core_dump(struct coredump
 	int segs, i;
 	size_t vma_data_size = 0;
 	struct vm_area_struct *vma, *gate_vma;
-	struct elfhdr *elf = NULL;
+	struct elfhdr elf;
 	loff_t offset = 0, dataoff;
 	struct elf_note_info info = { };
 	struct elf_phdr *phdr4note = NULL;
@@ -2207,10 +2207,6 @@ static int elf_core_dump(struct coredump
 	 * exists while dumping the mm->vm_next areas to the core file.
 	 */
   
-	/* alloc memory for large data structures: too large to be on stack */
-	elf = kmalloc(sizeof(*elf), GFP_KERNEL);
-	if (!elf)
-		goto out;
 	/*
 	 * The number of segs are recored into ELF header as 16bit value.
 	 * Please check DEFAULT_MAX_MAP_COUNT definition when you modify here.
@@ -2234,7 +2230,7 @@ static int elf_core_dump(struct coredump
 	 * Collect all the non-memory information about the process for the
 	 * notes.  This also sets up the file header.
 	 */
-	if (!fill_note_info(elf, e_phnum, &info, cprm->siginfo, cprm->regs))
+	if (!fill_note_info(&elf, e_phnum, &info, cprm->siginfo, cprm->regs))
 		goto cleanup;
 
 	has_dumped = 1;
@@ -2242,7 +2238,7 @@ static int elf_core_dump(struct coredump
 	fs = get_fs();
 	set_fs(KERNEL_DS);
 
-	offset += sizeof(*elf);				/* Elf header */
+	offset += sizeof(elf);				/* Elf header */
 	offset += segs * sizeof(struct elf_phdr);	/* Program headers */
 
 	/* Write notes phdr entry */
@@ -2285,12 +2281,12 @@ static int elf_core_dump(struct coredump
 		shdr4extnum = kmalloc(sizeof(*shdr4extnum), GFP_KERNEL);
 		if (!shdr4extnum)
 			goto end_coredump;
-		fill_extnum_info(elf, shdr4extnum, e_shoff, segs);
+		fill_extnum_info(&elf, shdr4extnum, e_shoff, segs);
 	}
 
 	offset = dataoff;
 
-	if (!dump_emit(cprm, elf, sizeof(*elf)))
+	if (!dump_emit(cprm, &elf, sizeof(elf)))
 		goto end_coredump;
 
 	if (!dump_emit(cprm, phdr4note, sizeof(*phdr4note)))
@@ -2374,8 +2370,6 @@ cleanup:
 	kfree(shdr4extnum);
 	kvfree(vma_filesz);
 	kfree(phdr4note);
-	kfree(elf);
-out:
 	return has_dumped;
 }
 
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 109/118] fs/binfmt_elf.c: coredump: delete duplicated overflow check
  2020-01-31  6:10 incoming Andrew Morton
                   ` (107 preceding siblings ...)
  2020-01-31  6:17 ` [patch 108/118] fs/binfmt_elf.c: coredump: allocate core ELF header on stack Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 110/118] fs/binfmt_elf.c: coredump: allow process with empty address space to coredump Andrew Morton
                   ` (8 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: fs/binfmt_elf.c: coredump: delete duplicated overflow check

The array_size() macro will do the overflow check anyway.

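array_size(), from include/linux/overflow.h, saturates on overflow
rather than wrapping, so the allocation fails cleanly; roughly:

	size_t bytes = array_size(sizeof(*vma_filesz), segs - 1);
	/* on multiplication overflow: bytes == SIZE_MAX, and
	   kvmalloc(SIZE_MAX, GFP_KERNEL) returns NULL */
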
Link: http://lkml.kernel.org/r/20191222144009.GB24341@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |    2 --
 1 file changed, 2 deletions(-)

--- a/fs/binfmt_elf.c~elf-coredump-delete-duplicated-overflow-check
+++ a/fs/binfmt_elf.c
@@ -2257,8 +2257,6 @@ static int elf_core_dump(struct coredump
 
 	dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
 
-	if (segs - 1 > ULONG_MAX / sizeof(*vma_filesz))
-		goto end_coredump;
 	vma_filesz = kvmalloc(array_size(sizeof(*vma_filesz), (segs - 1)),
 			      GFP_KERNEL);
 	if (ZERO_OR_NULL_PTR(vma_filesz))
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 110/118] fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
  2020-01-31  6:10 incoming Andrew Morton
                   ` (108 preceding siblings ...)
  2020-01-31  6:17 ` [patch 109/118] fs/binfmt_elf.c: coredump: delete duplicated overflow check Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 111/118] init/main.c: log arguments and environment passed to init Andrew Morton
                   ` (7 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: adobriyan, akpm, linux-mm, mm-commits, torvalds

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: fs/binfmt_elf.c: coredump: allow process with empty address space to coredump

Unmapping the whole address space at once with

	munmap(0, (1ULL<<47) - 4096)

or equivalent will create an empty coredump.

It is a silly way to exit; however, the register contents may still be
useful.

The right to coredump is a fundamental right of a process!

Link: http://lkml.kernel.org/r/20191222150137.GA1277@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/binfmt_elf.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/fs/binfmt_elf.c~elf-coredump-allow-process-with-empty-address-space-to-coredump
+++ a/fs/binfmt_elf.c
@@ -1595,6 +1595,10 @@ static int fill_files_note(struct memelf
 	if (size >= MAX_FILE_NOTE_SIZE) /* paranoia check */
 		return -EINVAL;
 	size = round_up(size, PAGE_SIZE);
+	/*
+	 * "size" can be 0 here legitimately.
+	 * Let it ENOMEM and omit NT_FILE section which will be empty anyway.
+	 */
 	data = kvmalloc(size, GFP_KERNEL);
 	if (ZERO_OR_NULL_PTR(data))
 		return -ENOMEM;
@@ -2257,9 +2261,13 @@ static int elf_core_dump(struct coredump
 
 	dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
 
+	/*
+	 * Zero vma process will get ZERO_SIZE_PTR here.
+	 * Let coredump continue for register state at least.
+	 */
 	vma_filesz = kvmalloc(array_size(sizeof(*vma_filesz), (segs - 1)),
 			      GFP_KERNEL);
-	if (ZERO_OR_NULL_PTR(vma_filesz))
+	if (!vma_filesz)
 		goto end_coredump;
 
 	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 111/118] init/main.c: log arguments and environment passed to init
  2020-01-31  6:10 incoming Andrew Morton
                   ` (109 preceding siblings ...)
  2020-01-31  6:17 ` [patch 110/118] fs/binfmt_elf.c: coredump: allow process with empty address space to coredump Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 112/118] init/main.c: remove unnecessary repair_env_string in do_initcall_level Andrew Morton
                   ` (6 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: akpm, cmetcalf, krzysiek, linux-mm, mm-commits, nivedita, torvalds

From: Arvind Sankar <nivedita@alum.mit.edu>
Subject: init/main.c: log arguments and environment passed to init

Extend logging in `run_init_process` to also show the arguments and
environment that we are passing to init.

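With dynamic debug enabled for init/main.c, the boot log would look
roughly like this (hypothetical values; envp_init defaults to "HOME=/"
and "TERM=linux"):

	Run /sbin/init as init process
	  with arguments:
	    /sbin/init
	  with environment:
	    HOME=/
	    TERM=linux
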
Link: http://lkml.kernel.org/r/20191212180023.24339-2-nivedita@alum.mit.edu
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Krzysztof Mazur <krzysiek@podlesie.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/main.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/init/main.c~init-mainc-log-arguments-and-environment-passed-to-init
+++ a/init/main.c
@@ -1043,8 +1043,16 @@ static void __init do_pre_smp_initcalls(
 
 static int run_init_process(const char *init_filename)
 {
+	const char *const *p;
+
 	argv_init[0] = init_filename;
 	pr_info("Run %s as init process\n", init_filename);
+	pr_debug("  with arguments:\n");
+	for (p = argv_init; *p; p++)
+		pr_debug("    %s\n", *p);
+	pr_debug("  with environment:\n");
+	for (p = envp_init; *p; p++)
+		pr_debug("    %s\n", *p);
 	return do_execve(getname_kernel(init_filename),
 		(const char __user *const __user *)argv_init,
 		(const char __user *const __user *)envp_init);
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 112/118] init/main.c: remove unnecessary repair_env_string in do_initcall_level
  2020-01-31  6:10 incoming Andrew Morton
                   ` (110 preceding siblings ...)
  2020-01-31  6:17 ` [patch 111/118] init/main.c: log arguments and environment passed to init Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 113/118] init/main.c: fix quoted value handling in unknown_bootoption Andrew Morton
                   ` (5 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: akpm, cmetcalf, krzysiek, linux-mm, mm-commits, nivedita, torvalds

From: Arvind Sankar <nivedita@alum.mit.edu>
Subject: init/main.c: remove unnecessary repair_env_string in do_initcall_level

Since commit 08746a65c296 ("init: fix in-place parameter modification
regression"), parse_args in do_initcall_level is called on a copy of
saved_command_line.  It is unnecessary to call repair_env_string during
this parsing, as this copy is not used for anything later.

Remove the now unnecessary arguments from repair_env_string as well.

Link: http://lkml.kernel.org/r/20191212180023.24339-3-nivedita@alum.mit.edu
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Krzysztof Mazur <krzysiek@podlesie.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/main.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

--- a/init/main.c~init-mainc-remove-unnecessary-repair_env_string-in-do_initcall_level
+++ a/init/main.c
@@ -246,8 +246,7 @@ static int __init loglevel(char *str)
 early_param("loglevel", loglevel);
 
 /* Change NUL term back to "=", to make "param" the whole string. */
-static int __init repair_env_string(char *param, char *val,
-				    const char *unused, void *arg)
+static void __init repair_env_string(char *param, char *val)
 {
 	if (val) {
 		/* param=val or param="val"? */
@@ -260,7 +259,6 @@ static int __init repair_env_string(char
 		} else
 			BUG();
 	}
-	return 0;
 }
 
 /* Anything after -- gets handed straight to init. */
@@ -272,7 +270,7 @@ static int __init set_init_arg(char *par
 	if (panic_later)
 		return 0;
 
-	repair_env_string(param, val, unused, NULL);
+	repair_env_string(param, val);
 
 	for (i = 0; argv_init[i]; i++) {
 		if (i == MAX_INIT_ARGS) {
@@ -292,7 +290,7 @@ static int __init set_init_arg(char *par
 static int __init unknown_bootoption(char *param, char *val,
 				     const char *unused, void *arg)
 {
-	repair_env_string(param, val, unused, NULL);
+	repair_env_string(param, val);
 
 	/* Handle obsolete-style parameters */
 	if (obsolete_checksetup(param))
@@ -991,6 +989,12 @@ static const char *initcall_level_names[
 	"late",
 };
 
+static int __init ignore_unknown_bootoption(char *param, char *val,
+			       const char *unused, void *arg)
+{
+	return 0;
+}
+
 static void __init do_initcall_level(int level)
 {
 	initcall_entry_t *fn;
@@ -1000,7 +1004,7 @@ static void __init do_initcall_level(int
 		   initcall_command_line, __start___param,
 		   __stop___param - __start___param,
 		   level, level,
-		   NULL, &repair_env_string);
+		   NULL, ignore_unknown_bootoption);
 
 	trace_initcall_level(initcall_level_names[level]);
 	for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 113/118] init/main.c: fix quoted value handling in unknown_bootoption
  2020-01-31  6:10 incoming Andrew Morton
                   ` (111 preceding siblings ...)
  2020-01-31  6:17 ` [patch 112/118] init/main.c: remove unnecessary repair_env_string in do_initcall_level Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 114/118] init/main.c: fix misleading "This architecture does not have kernel memory protection" message Andrew Morton
                   ` (4 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: akpm, cmetcalf, krzysiek, linux-mm, mm-commits, nivedita, torvalds

From: Arvind Sankar <nivedita@alum.mit.edu>
Subject: init/main.c: fix quoted value handling in unknown_bootoption

Patch series "init/main.c: minor cleanup/bugfix of envvar handling", v2.

unknown_bootoption passes unrecognized command line arguments to init as
either environment variables or arguments.  Some of the logic in the
function is broken for quoted command line arguments.

When an argument of the form param="value" is processed by parse_args and
passed to unknown_bootoption, the command line has
  param\0"value\0
with val pointing to the beginning of value.  The helper function
repair_env_string is then used to restore the '=' character that was
removed by parse_args, and strip the quotes off fully.  This results in
  param=value\0\0
and val ends up pointing to the 'a' instead of the 'v' in value.  This bug
was introduced when repair_env_string was refactored into a separate
function, and the decrement of val in repair_env_string became dead code.

This causes two problems in unknown_bootoption in the two places where the
val pointer is used as a substitute for the length of param:

1. An argument of the form param=".value" is misinterpreted as a
   potential module parameter, with the result that it will not be placed
   in init's environment.

2. An argument of the form param="value" is checked to see if param is
   an existing environment variable that should be overwritten, but the
   comparison is off-by-one and compares 'param=v' instead of 'param='
   against the existing environment.  So passing, for example,
   TERM="vt100" on the command line results in init being passed both
   TERM=linux and TERM=vt100 in its environment.

Patch 1 adds logging for the arguments and environment passed to init and
is independent of the rest: it can be dropped if this is unnecessarily
verbose.

Patch 2 removes repair_env_string from initcall parameter parsing in
do_initcall_level, as that uses a separate copy of the command line now
and the repairing is no longer necessary.

Patch 3 fixes the bug in unknown_bootoption by recording the length of
param explicitly instead of implying it from val-param.


This patch (of 3):

Commit a99cd1125189 ("init: fix bug where environment vars can't be passed
via boot args") introduced two minor bugs in unknown_bootoption by
factoring out the quoted value handling into a separate function.

When value is quoted, repair_env_string will move the value up 1 byte to
strip the quotes, so val in unknown_bootoption no longer points to the
actual location of the value.

The result is that an argument of the form param=".value" is mistakenly
treated as a potential module parameter and is not placed in init's
environment, and an argument of the form param="value" can result in a
duplicate environment variable: eg TERM="vt100" on the command line will
result in both TERM=linux and TERM=vt100 being placed into init's
environment.

Fix this by recording the length of the param before calling
repair_env_string instead of relying on val.

Link: http://lkml.kernel.org/r/20191212180023.24339-4-nivedita@alum.mit.edu
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Krzysztof Mazur <krzysiek@podlesie.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/main.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/init/main.c~init-mainc-fix-quoted-value-handling-in-unknown_bootoption
+++ a/init/main.c
@@ -255,7 +255,6 @@ static void __init repair_env_string(cha
 		else if (val == param+strlen(param)+2) {
 			val[-2] = '=';
 			memmove(val-1, val, strlen(val)+1);
-			val--;
 		} else
 			BUG();
 	}
@@ -290,6 +289,8 @@ static int __init set_init_arg(char *par
 static int __init unknown_bootoption(char *param, char *val,
 				     const char *unused, void *arg)
 {
+	size_t len = strlen(param);
+
 	repair_env_string(param, val);
 
 	/* Handle obsolete-style parameters */
@@ -297,7 +298,7 @@ static int __init unknown_bootoption(cha
 		return 0;
 
 	/* Unused module parameter. */
-	if (strchr(param, '.') && (!val || strchr(param, '.') < val))
+	if (strnchr(param, len, '.'))
 		return 0;
 
 	if (panic_later)
@@ -311,7 +312,7 @@ static int __init unknown_bootoption(cha
 				panic_later = "env";
 				panic_param = param;
 			}
-			if (!strncmp(param, envp_init[i], val - param))
+			if (!strncmp(param, envp_init[i], len+1))
 				break;
 		}
 		envp_init[i] = param;
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 114/118] init/main.c: fix misleading "This architecture does not have kernel memory protection" message
  2020-01-31  6:10 incoming Andrew Morton
                   ` (112 preceding siblings ...)
  2020-01-31  6:17 ` [patch 113/118] init/main.c: fix quoted value handling in unknown_bootoption Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 115/118] reiserfs: prevent NULL pointer dereference in reiserfs_insert_item() Andrew Morton
                   ` (3 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, keescook, linux-mm, mm-commits,
	mpe, paulus, torvalds

From: Christophe Leroy <christophe.leroy@c-s.fr>
Subject: init/main.c: fix misleading "This architecture does not have kernel memory protection" message

This message suggests that memory protection is not implemented for the
architecture in question, whereas the absence of CONFIG_STRICT_KERNEL_RWX
only means that memory protection has not been selected at compile time.

Don't print this message when CONFIG_ARCH_HAS_STRICT_KERNEL_RWX is
selected by the architecture.  Instead, print "Kernel memory protection
not selected by kernel config."

Link: http://lkml.kernel.org/r/62477e446d9685459d4f27d193af6ff1bd69d55f.1578557581.git.christophe.leroy@c-s.fr
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/main.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/init/main.c~init-fix-misleading-this-architecture-does-not-have-kernel-memory-protection-message
+++ a/init/main.c
@@ -1104,6 +1104,11 @@ static void mark_readonly(void)
 	} else
 		pr_info("Kernel memory protection disabled.\n");
 }
+#elif defined(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)
+static inline void mark_readonly(void)
+{
+	pr_warn("Kernel memory protection not selected by kernel config.\n");
+}
 #else
 static inline void mark_readonly(void)
 {
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 115/118] reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (113 preceding siblings ...)
  2020-01-31  6:17 ` [patch 114/118] init/main.c: fix misleading "This architecture does not have kernel memory protection" message Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 116/118] execve: warn if process starts with executable stack Andrew Morton
                   ` (2 subsequent siblings)
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: akpm, hushiyuan, jack, linfeilong, linux-mm, mm-commits,
	torvalds, yeyunfeng, zhengbin13

From: Yunfeng Ye <yeyunfeng@huawei.com>
Subject: reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()

The variable inode may be NULL in reiserfs_insert_item(), but there is
no check before its members are accessed.

Fix this by adding a NULL pointer check before calling reiserfs_debug().

Link: http://lkml.kernel.org/r/79c5135d-ff25-1cc9-4e99-9f572b88cc00@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Cc: zhengbin <zhengbin13@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Cc: Feilong Lin <linfeilong@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/reiserfs/stree.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/fs/reiserfs/stree.c~reiserfs-prevent-null-pointer-dereference-in-reiserfs_insert_item
+++ a/fs/reiserfs/stree.c
@@ -2246,7 +2246,8 @@ error_out:
 	/* also releases the path */
 	unfix_nodes(&s_ins_balance);
 #ifdef REISERQUOTA_DEBUG
-	reiserfs_debug(th->t_super, REISERFS_DEBUG_CODE,
+	if (inode)
+		reiserfs_debug(th->t_super, REISERFS_DEBUG_CODE,
 		       "reiserquota insert_item(): freeing %u id=%u type=%c",
 		       quota_bytes, inode->i_uid, head2type(ih));
 #endif
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 116/118] execve: warn if process starts with executable stack
  2020-01-31  6:10 incoming Andrew Morton
                   ` (114 preceding siblings ...)
  2020-01-31  6:17 ` [patch 115/118] reiserfs: prevent NULL pointer dereference in reiserfs_insert_item() Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 117/118] include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc() Andrew Morton
  2020-01-31  6:17 ` [patch 118/118] kcov: ignore fault-inject and stacktrace Andrew Morton
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: adobriyan, akpm, dan.carpenter, ebiederm, linux-mm, mm-commits,
	torvalds, will

From: Alexey Dobriyan <adobriyan@gmail.com>
Subject: execve: warn if process starts with executable stack

There were a few episodes of silent downgrade to an executable stack
over the years:

1) linking an innocent-looking assembly file will silently add an
   executable stack if the proper linker options are not also given:

	$ cat f.S
	.intel_syntax noprefix
	.text
	.globl f
	f:
	        ret

	$ cat main.c
	void f(void);
	int main(void)
	{
	        f();
	        return 0;
	}

	$ gcc main.c f.S
	$ readelf -l ./a.out
	  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                         0x0000000000000000 0x0000000000000000  RWE    0x10
			 					 ^^^

2) converting C99 nested function into a closure
https://nullprogram.com/blog/2019/11/15/

	void intsort2(int *base, size_t nmemb, _Bool invert)
	{
	    int cmp(const void *a, const void *b)
	    {
	        int r = *(int *)a - *(int *)b;
	        return invert ? -r : r;
	    }
	    qsort(base, nmemb, sizeof(*base), cmp);
	}

will silently require stack trampolines, while the non-closure version
will not.

Without doubt this behaviour is documented somewhere; add a warning so
that developers and users can at least notice.  After so many years of
x86_64 having proper executable stack support, it should not cause too
many problems.

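With the format string added below, the one-time warning reads roughly
(hypothetical binary name):

	process 'a.out' started with executable stack
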
Link: http://lkml.kernel.org/r/20191208171918.GC19716@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Will Deacon <will@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/exec.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/fs/exec.c~execve-warn-if-process-starts-with-executable-stack
+++ a/fs/exec.c
@@ -761,6 +761,11 @@ int setup_arg_pages(struct linux_binprm
 		goto out_unlock;
 	BUG_ON(prev != vma);
 
+	if (unlikely(vm_flags & VM_EXEC)) {
+		pr_warn_once("process '%pD4' started with executable stack\n",
+			     bprm->file);
+	}
+
 	/* Move stack pages down in memory. */
 	if (stack_shift) {
 		ret = shift_arg_pages(vma, stack_shift);
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 117/118] include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
  2020-01-31  6:10 incoming Andrew Morton
                   ` (115 preceding siblings ...)
  2020-01-31  6:17 ` [patch 116/118] execve: warn if process starts with executable stack Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  2020-01-31  6:17 ` [patch 118/118] kcov: ignore fault-inject and stacktrace Andrew Morton
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, linux-mm, mm-commits, torvalds

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()

Use the PHYS_PFN() macro in io_mapping_map_atomic_wc() instead of the
open-coded variant.

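For reference, PHYS_PFN() in include/linux/pfn.h is the same shift,
spelled once:

	#define PHYS_PFN(x)	((unsigned long)((x) >> PAGE_SHIFT))
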
Link: http://lkml.kernel.org/r/20191209165624.56351-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/io-mapping.h |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- a/include/linux/io-mapping.h~io-mapping-use-phys_pfn-macro-in-io_mapping_map_atomic_wc
+++ a/include/linux/io-mapping.h
@@ -28,6 +28,7 @@ struct io_mapping {
 
 #ifdef CONFIG_HAVE_ATOMIC_IOMAP
 
+#include <linux/pfn.h>
 #include <asm/iomap.h>
 /*
  * For small address space machines, mapping large objects
@@ -64,12 +65,10 @@ io_mapping_map_atomic_wc(struct io_mappi
 			 unsigned long offset)
 {
 	resource_size_t phys_addr;
-	unsigned long pfn;
 
 	BUG_ON(offset >= mapping->size);
 	phys_addr = mapping->base + offset;
-	pfn = (unsigned long) (phys_addr >> PAGE_SHIFT);
-	return iomap_atomic_prot_pfn(pfn, mapping->prot);
+	return iomap_atomic_prot_pfn(PHYS_PFN(phys_addr), mapping->prot);
 }
 
 static inline void
_

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [patch 118/118] kcov: ignore fault-inject and stacktrace
  2020-01-31  6:10 incoming Andrew Morton
                   ` (116 preceding siblings ...)
  2020-01-31  6:17 ` [patch 117/118] include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc() Andrew Morton
@ 2020-01-31  6:17 ` Andrew Morton
  117 siblings, 0 replies; 120+ messages in thread
From: Andrew Morton @ 2020-01-31  6:17 UTC (permalink / raw)
  To: akpm, andreyknvl, dvyukov, linux-mm, mm-commits, torvalds

From: Dmitry Vyukov <dvyukov@google.com>
Subject: kcov: ignore fault-inject and stacktrace

Don't instrument 3 more files that contain debugging facilities and
produce large amounts of uninteresting coverage for every syscall.  The
following snippets are sprinkled all over the place in kcov traces in a
debugging kernel.  We already try to disable instrumentation of stack
unwinding code and of most debug facilities.  I guess we did not use
fault-inject.c at the time, and stacktrace.c was somehow missed (or
something has changed in kernel/configs).  This change both speeds up kcov
(kernel doesn't need to store these PCs, user-space doesn't need to
process them) and frees trace buffer capacity for more useful coverage.

should_fail
lib/fault-inject.c:149
fail_dump
lib/fault-inject.c:45

stack_trace_save
kernel/stacktrace.c:124
stack_trace_consume_entry
kernel/stacktrace.c:86
stack_trace_consume_entry
kernel/stacktrace.c:89
... a hundred frames skipped ...
stack_trace_consume_entry
kernel/stacktrace.c:93
stack_trace_consume_entry
kernel/stacktrace.c:86

Link: http://lkml.kernel.org/r/20200116111449.217744-1-dvyukov@gmail.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/Makefile |    1 +
 lib/Makefile    |    1 +
 mm/Makefile     |    1 +
 3 files changed, 3 insertions(+)

--- a/kernel/Makefile~kcov-ignore-fault-inject-and-stacktrace
+++ a/kernel/Makefile
@@ -27,6 +27,7 @@ KCOV_INSTRUMENT_softirq.o := n
 # and produce insane amounts of uninteresting coverage.
 KCOV_INSTRUMENT_module.o := n
 KCOV_INSTRUMENT_extable.o := n
+KCOV_INSTRUMENT_stacktrace.o := n
 # Don't self-instrument.
 KCOV_INSTRUMENT_kcov.o := n
 KASAN_SANITIZE_kcov.o := n
--- a/lib/Makefile~kcov-ignore-fault-inject-and-stacktrace
+++ a/lib/Makefile
@@ -16,6 +16,7 @@ KCOV_INSTRUMENT_rbtree.o := n
 KCOV_INSTRUMENT_list_debug.o := n
 KCOV_INSTRUMENT_debugobjects.o := n
 KCOV_INSTRUMENT_dynamic_debug.o := n
+KCOV_INSTRUMENT_fault-inject.o := n
 
 # Early boot use of cmdline, don't instrument it
 ifdef CONFIG_AMD_MEM_ENCRYPT
--- a/mm/Makefile~kcov-ignore-fault-inject-and-stacktrace
+++ a/mm/Makefile
@@ -20,6 +20,7 @@ KCOV_INSTRUMENT_kmemleak.o := n
 KCOV_INSTRUMENT_memcontrol.o := n
 KCOV_INSTRUMENT_mmzone.o := n
 KCOV_INSTRUMENT_vmstat.o := n
+KCOV_INSTRUMENT_failslab.o := n
 
 CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
 CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
_
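
For readers wondering how these PCs reach user space: a collector opens
the kcov debugfs file, mmaps a shared buffer, and reads back one PC per
covered basic block after each syscall.  A condensed sketch of the
documented interface (adapted from Documentation/dev-tools/kcov.rst;
error handling omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define KCOV_INIT_TRACE	_IOR('c', 1, unsigned long)
#define KCOV_ENABLE	_IO('c', 100)
#define KCOV_DISABLE	_IO('c', 101)
#define COVER_SIZE	(64 << 10)
#define KCOV_TRACE_PC	0

int main(void)
{
	int fd = open("/sys/kernel/debug/kcov", O_RDWR);

	/* cover[0] holds the entry count, cover[1..] the kernel PCs. */
	ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
	unsigned long *cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
				    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	ioctl(fd, KCOV_ENABLE, KCOV_TRACE_PC);
	__atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);

	read(-1, NULL, 0);	/* the syscall whose coverage we want */

	unsigned long n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
	for (unsigned long i = 0; i < n; i++)
		printf("0x%lx\n", cover[i + 1]);
	ioctl(fd, KCOV_DISABLE, 0);
	close(fd);
	return 0;
}

With this patch, PCs from fault-inject.c, failslab.c and stacktrace.c
simply never land in cover[], so the same COVER_SIZE buffer holds more
syscall-specific entries.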

* Re: [patch 105/118] fs/binfmt_elf.c: don't copy ELF header around
  2020-01-31  6:16 ` [patch 105/118] fs/binfmt_elf.c: don't copy ELF header around Andrew Morton
@ 2023-11-22  7:15   ` Jinjie Ruan
  0 siblings, 0 replies; 120+ messages in thread
From: Jinjie Ruan @ 2023-11-22  7:15 UTC (permalink / raw)
  To: linux-kernel, adobriyan, akpm, linux-mm, mm-commits, torvalds; +Cc: chris.zjh



On 2020/1/31 14:16, Andrew Morton wrote:
> From: Alexey Dobriyan <adobriyan@gmail.com>
> Subject: fs/binfmt_elf.c: don't copy ELF header around
> 
> ELF header is read into bprm->buf[] by generic execve code.
> 
> Save a memcpy and allocate just one header for the interpreter instead of
> two headers (64 bytes instead of 128 on 64-bit).
> 
> Link: http://lkml.kernel.org/r/20191208171242.GA19716@avx2
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  fs/binfmt_elf.c |   55 ++++++++++++++++++++++------------------------
>  1 file changed, 27 insertions(+), 28 deletions(-)
> 
> --- a/fs/binfmt_elf.c~elf-dont-copy-elf-header-around
> +++ a/fs/binfmt_elf.c
> @@ -161,8 +161,9 @@ static int padzero(unsigned long elf_bss
>  #endif
>  
>  static int
> -create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
> -		unsigned long load_addr, unsigned long interp_load_addr)
> +create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
> +		unsigned long load_addr, unsigned long interp_load_addr,
> +		unsigned long e_entry)
>  {
>  	unsigned long p = bprm->p;
>  	int argc = bprm->argc;
> @@ -251,7 +252,7 @@ create_elf_tables(struct linux_binprm *b
>  	NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
>  	NEW_AUX_ENT(AT_BASE, interp_load_addr);
>  	NEW_AUX_ENT(AT_FLAGS, 0);
> -	NEW_AUX_ENT(AT_ENTRY, exec->e_entry);
> +	NEW_AUX_ENT(AT_ENTRY, e_entry);
>  	NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid));
>  	NEW_AUX_ENT(AT_EUID, from_kuid_munged(cred->user_ns, cred->euid));
>  	NEW_AUX_ENT(AT_GID, from_kgid_munged(cred->user_ns, cred->gid));
> @@ -690,12 +691,13 @@ static int load_elf_binary(struct linux_
>  	int bss_prot = 0;
>  	int retval, i;
>  	unsigned long elf_entry;
> +	unsigned long e_entry;
>  	unsigned long interp_load_addr = 0;
>  	unsigned long start_code, end_code, start_data, end_data;
>  	unsigned long reloc_func_desc __maybe_unused = 0;
>  	int executable_stack = EXSTACK_DEFAULT;
> +	struct elfhdr *elf_ex = (struct elfhdr *)bprm->buf;
>  	struct {
> -		struct elfhdr elf_ex;
>  		struct elfhdr interp_elf_ex;
>  	} *loc;
>  	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
> @@ -706,30 +708,27 @@ static int load_elf_binary(struct linux_
>  		retval = -ENOMEM;
>  		goto out_ret;
>  	}
> -	
> -	/* Get the exec-header */
> -	loc->elf_ex = *((struct elfhdr *)bprm->buf);
>  
>  	retval = -ENOEXEC;
>  	/* First of all, some simple consistency checks */
> -	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
> +	if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0)
>  		goto out;
>  
> -	if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
> +	if (elf_ex->e_type != ET_EXEC && elf_ex->e_type != ET_DYN)
>  		goto out;
> -	if (!elf_check_arch(&loc->elf_ex))
> +	if (!elf_check_arch(elf_ex))
>  		goto out;
> -	if (elf_check_fdpic(&loc->elf_ex))
> +	if (elf_check_fdpic(elf_ex))
>  		goto out;
>  	if (!bprm->file->f_op->mmap)
>  		goto out;
>  
> -	elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
> +	elf_phdata = load_elf_phdrs(elf_ex, bprm->file);
>  	if (!elf_phdata)
>  		goto out;
>  
>  	elf_ppnt = elf_phdata;
> -	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
> +	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
>  		char *elf_interpreter;
>  
>  		if (elf_ppnt->p_type != PT_INTERP)
> @@ -783,7 +782,7 @@ out_free_interp:
>  	}
>  
>  	elf_ppnt = elf_phdata;
> -	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++)
> +	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++)
>  		switch (elf_ppnt->p_type) {
>  		case PT_GNU_STACK:
>  			if (elf_ppnt->p_flags & PF_X)
> @@ -793,7 +792,7 @@ out_free_interp:
>  			break;
>  
>  		case PT_LOPROC ... PT_HIPROC:
> -			retval = arch_elf_pt_proc(&loc->elf_ex, elf_ppnt,
> +			retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
>  						  bprm->file, false,
>  						  &arch_state);
>  			if (retval)
> @@ -837,7 +836,7 @@ out_free_interp:
>  	 * still possible to return an error to the code that invoked
>  	 * the exec syscall.
>  	 */
> -	retval = arch_check_elf(&loc->elf_ex,
> +	retval = arch_check_elf(elf_ex,
>  				!!interpreter, &loc->interp_elf_ex,
>  				&arch_state);
>  	if (retval)
> @@ -850,8 +849,8 @@ out_free_interp:
>  
>  	/* Do this immediately, since STACK_TOP as used in setup_arg_pages
>  	   may depend on the personality.  */
> -	SET_PERSONALITY2(loc->elf_ex, &arch_state);
> -	if (elf_read_implies_exec(loc->elf_ex, executable_stack))
> +	SET_PERSONALITY2(*elf_ex, &arch_state);

It seems that SET_PERSONALITY2() is called a little late.

When a 32-bit compatible user-mode program forks and then execve()s a
64-bit program, the vma's vm_start and vm_end set up in __bprm_mm_init()
(called from alloc_bprm()) are based on the 32-bit STACK_TOP_MAX, because
the 32-bit compat flag has not yet been cleared.  Yet setup_arg_pages()
later uses the 64-bit STACK_TOP, since SET_PERSONALITY2() has by then
cleared that flag.  This mismatch doesn't seem reasonable.
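
A rough sketch of the ordering in question (simplified call flow, not
literal kernel code; function names as in fs/exec.c and fs/binfmt_elf.c):

do_execve()
    alloc_bprm()
        __bprm_mm_init();   /* stack vma placed against STACK_TOP_MAX:
                               compat flag still set => 32-bit limit */
    ...
    load_elf_binary()
        SET_PERSONALITY2(); /* clears the 32-bit compat flag */
        setup_arg_pages();  /* stack relocated against STACK_TOP:
                               now the 64-bit value */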

> +	if (elf_read_implies_exec(*elf_ex, executable_stack))
>  		current->personality |= READ_IMPLIES_EXEC;
>  
>  	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
> @@ -878,7 +877,7 @@ out_free_interp:
>  	/* Now we do a little grungy work by mmapping the ELF image into
>  	   the correct location in memory. */
>  	for(i = 0, elf_ppnt = elf_phdata;
> -	    i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
> +	    i < elf_ex->e_phnum; i++, elf_ppnt++) {
>  		int elf_prot, elf_flags;
>  		unsigned long k, vaddr;
>  		unsigned long total_size = 0;
> @@ -922,9 +921,9 @@ out_free_interp:
>  		 * If we are loading ET_EXEC or we have already performed
>  		 * the ET_DYN load_addr calculations, proceed normally.
>  		 */
> -		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
> +		if (elf_ex->e_type == ET_EXEC || load_addr_set) {
>  			elf_flags |= MAP_FIXED;
> -		} else if (loc->elf_ex.e_type == ET_DYN) {
> +		} else if (elf_ex->e_type == ET_DYN) {
>  			/*
>  			 * This logic is run once for the first LOAD Program
>  			 * Header for ET_DYN binaries to calculate the
> @@ -973,7 +972,7 @@ out_free_interp:
>  			load_bias = ELF_PAGESTART(load_bias - vaddr);
>  
>  			total_size = total_mapping_size(elf_phdata,
> -							loc->elf_ex.e_phnum);
> +							elf_ex->e_phnum);
>  			if (!total_size) {
>  				retval = -EINVAL;
>  				goto out_free_dentry;
> @@ -991,7 +990,7 @@ out_free_interp:
>  		if (!load_addr_set) {
>  			load_addr_set = 1;
>  			load_addr = (elf_ppnt->p_vaddr - elf_ppnt->p_offset);
> -			if (loc->elf_ex.e_type == ET_DYN) {
> +			if (elf_ex->e_type == ET_DYN) {
>  				load_bias += error -
>  				             ELF_PAGESTART(load_bias + vaddr);
>  				load_addr += load_bias;
> @@ -1032,7 +1031,7 @@ out_free_interp:
>  		}
>  	}
>  
> -	loc->elf_ex.e_entry += load_bias;
> +	e_entry = elf_ex->e_entry + load_bias;
>  	elf_bss += load_bias;
>  	elf_brk += load_bias;
>  	start_code += load_bias;
> @@ -1075,7 +1074,7 @@ out_free_interp:
>  		allow_write_access(interpreter);
>  		fput(interpreter);
>  	} else {
> -		elf_entry = loc->elf_ex.e_entry;
> +		elf_entry = e_entry;
>  		if (BAD_ADDR(elf_entry)) {
>  			retval = -EINVAL;
>  			goto out_free_dentry;
> @@ -1093,8 +1092,8 @@ out_free_interp:
>  		goto out;
>  #endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */
>  
> -	retval = create_elf_tables(bprm, &loc->elf_ex,
> -			  load_addr, interp_load_addr);
> +	retval = create_elf_tables(bprm, elf_ex,
> +			  load_addr, interp_load_addr, e_entry);
>  	if (retval < 0)
>  		goto out;
>  	current->mm->end_code = end_code;
> @@ -1112,7 +1111,7 @@ out_free_interp:
>  		 * growing down), and into the unused ELF_ET_DYN_BASE region.
>  		 */
>  		if (IS_ENABLED(CONFIG_ARCH_HAS_ELF_RANDOMIZE) &&
> -		    loc->elf_ex.e_type == ET_DYN && !interpreter)
> +		    elf_ex->e_type == ET_DYN && !interpreter)
>  			current->mm->brk = current->mm->start_brk =
>  				ELF_ET_DYN_BASE;
>  
> _

end of thread, other threads:[~2023-11-22  7:15 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-31  6:10 incoming Andrew Morton
2020-01-31  6:11 ` [patch 001/118] lib/test_bitmap: correct test data offsets for 32-bit Andrew Morton
2020-01-31  6:11 ` [patch 002/118] memcg: fix a crash in wb_workfn when a device disappears Andrew Morton
2020-01-31  6:11 ` [patch 003/118] mm/mempolicy.c: fix out of bounds write in mpol_parse_str() Andrew Morton
2020-01-31  6:11 ` [patch 004/118] mm/sparse.c: reset section's mem_map when fully deactivated Andrew Morton
2020-01-31  6:11 ` [patch 005/118] mm/migrate.c: also overwrite error when it is bigger than zero Andrew Morton
2020-01-31  6:11 ` [patch 006/118] mm/memory_hotplug: fix remove_memory() lockdep splat Andrew Morton
2020-01-31  6:11 ` [patch 007/118] mm: thp: don't need care deferred split queue in memcg charge move path Andrew Morton
2020-01-31  6:11 ` [patch 008/118] mm: move_pages: report the number of non-attempted pages Andrew Morton
2020-01-31  6:11 ` [patch 009/118] scripts/spelling.txt: add more spellings to spelling.txt Andrew Morton
2020-01-31  6:11 ` [patch 010/118] scripts/spelling.txt: add "issus" typo Andrew Morton
2020-01-31  6:11 ` [patch 011/118] fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres Andrew Morton
2020-01-31  6:11 ` [patch 012/118] ocfs2: remove unneeded semicolons Andrew Morton
2020-01-31  6:11 ` [patch 013/118] ocfs2: make local header paths relative to C files Andrew Morton
2020-01-31  6:11 ` [patch 014/118] ocfs2/dlm: remove redundant assignment to ret Andrew Morton
2020-01-31  6:11 ` [patch 015/118] ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use Andrew Morton
2020-01-31  6:11 ` [patch 016/118] ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans() Andrew Morton
2020-01-31  6:11 ` [patch 017/118] ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction Andrew Morton
2020-01-31  6:11 ` [patch 018/118] mm/slub.c: avoid slub allocation while holding list_lock Andrew Morton
2020-01-31  6:12 ` [patch 019/118] mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t Andrew Morton
2020-01-31  6:12 ` [patch 020/118] mm/debug.c: always print flags in dump_page() Andrew Morton
2020-01-31  6:12 ` [patch 021/118] mm/filemap.c: clean up filemap_write_and_wait() Andrew Morton
2020-01-31  6:12 ` [patch 022/118] mm: fix gup_pud_range Andrew Morton
2020-01-31  6:12 ` [patch 023/118] mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge Andrew Morton
2020-01-31  6:12 ` [patch 024/118] mm/gup: factor out duplicate code from four routines Andrew Morton
2020-01-31  6:12 ` [patch 025/118] mm/gup: move try_get_compound_head() to top, fix minor issues Andrew Morton
2020-01-31  6:12 ` [patch 026/118] mm: Cleanup __put_devmap_managed_page() vs ->page_free() Andrew Morton
2020-01-31  6:12 ` [patch 027/118] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages Andrew Morton
2020-01-31  6:12 ` [patch 028/118] goldish_pipe: rename local pin_user_pages() routine Andrew Morton
2020-01-31  6:12 ` [patch 029/118] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM Andrew Morton
2020-01-31  6:12 ` [patch 030/118] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call Andrew Morton
2020-01-31  6:12 ` [patch 031/118] mm/gup: allow FOLL_FORCE for get_user_pages_fast() Andrew Morton
2020-01-31  6:12 ` [patch 032/118] IB/umem: use get_user_pages_fast() to pin DMA pages Andrew Morton
2020-01-31  6:12 ` [patch 033/118] media/v4l2-core: set pages dirty upon releasing DMA buffers Andrew Morton
2020-01-31  6:12 ` [patch 034/118] mm/gup: introduce pin_user_pages*() and FOLL_PIN Andrew Morton
2020-01-31  6:12 ` [patch 035/118] goldish_pipe: convert to pin_user_pages() and put_user_page() Andrew Morton
2020-01-31  6:13 ` [patch 036/118] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP Andrew Morton
2020-01-31  6:13 ` [patch 037/118] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() Andrew Morton
2020-01-31  6:13 ` [patch 038/118] drm/via: set FOLL_PIN via pin_user_pages_fast() Andrew Morton
2020-01-31  6:13 ` [patch 039/118] fs/io_uring: set FOLL_PIN via pin_user_pages() Andrew Morton
2020-01-31  6:13 ` [patch 040/118] net/xdp: " Andrew Morton
2020-01-31  6:13 ` [patch 041/118] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion Andrew Morton
2020-01-31  6:13 ` [patch 042/118] vfio, mm: " Andrew Morton
2020-01-31  6:13 ` [patch 043/118] powerpc: book3s64: convert to pin_user_pages() and put_user_page() Andrew Morton
2020-01-31  6:13 ` [patch 044/118] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" Andrew Morton
2020-01-31  6:13 ` [patch 045/118] mm, tree-wide: rename put_user_page*() to unpin_user_page*() Andrew Morton
2020-01-31  6:13 ` [patch 046/118] mm/swapfile.c: swap_next should increase position index Andrew Morton
2020-01-31  6:13 ` [patch 047/118] mm/memcontrol.c: cleanup some useless code Andrew Morton
2020-01-31  6:13 ` [patch 048/118] mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page Andrew Morton
2020-01-31  6:13 ` [patch 049/118] mm, tracing: print symbol name for kmem_alloc_node call_site events Andrew Morton
2020-01-31  6:13 ` [patch 050/118] lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more() Andrew Morton
2020-01-31  6:13 ` [patch 051/118] mm/early_ioremap.c: use %pa to print resource_size_t variables Andrew Morton
2020-01-31  6:13 ` [patch 052/118] mm/page_alloc: skip non present sections on zone initialization Andrew Morton
2020-01-31  6:14 ` [patch 053/118] mm: remove the memory isolate notifier Andrew Morton
2020-01-31  6:14 ` [patch 054/118] mm: remove "count" parameter from has_unmovable_pages() Andrew Morton
2020-01-31  6:14 ` [patch 055/118] mm/vmscan.c: remove unused return value of shrink_node Andrew Morton
2020-01-31  6:14 ` [patch 056/118] mm/vmscan: remove prefetch_prev_lru_page Andrew Morton
2020-01-31  6:14 ` [patch 057/118] mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE Andrew Morton
2020-01-31  6:14 ` [patch 058/118] tools/vm/slabinfo: fix sanity checks enabling Andrew Morton
2020-01-31  6:14 ` [patch 059/118] mm/memblock: define memblock_physmem_add() Andrew Morton
2020-01-31  6:14 ` [patch 060/118] memblock: Use __func__ in remaining memblock_dbg() call sites Andrew Morton
2020-01-31  6:14 ` [patch 061/118] mm, oom: dump stack of victim when reaping failed Andrew Morton
2020-01-31  6:14 ` [patch 062/118] mm/huge_memory.c: use head to check huge zero page Andrew Morton
2020-01-31  6:14 ` [patch 063/118] mm/huge_memory.c: use head to emphasize the purpose of page Andrew Morton
2020-01-31  6:14 ` [patch 064/118] mm/huge_memory.c: reduce critical section protected by split_queue_lock Andrew Morton
2020-01-31  6:14 ` [patch 065/118] mm/migrate: remove useless mask of start address Andrew Morton
2020-01-31  6:14 ` [patch 066/118] mm/migrate: clean up some minor coding style Andrew Morton
2020-01-31  6:14 ` [patch 067/118] mm/migrate: add stable check in migrate_vma_insert_page() Andrew Morton
2020-01-31  6:14 ` [patch 068/118] mm, thp: fix defrag setting if newline is not used Andrew Morton
2020-01-31  6:14 ` [patch 069/118] mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma() Andrew Morton
2020-01-31  6:14 ` [patch 070/118] mm/memory_hotplug: pass in nid to online_pages() Andrew Morton
2020-01-31  6:14 ` [patch 071/118] mm/hotplug: silence a lockdep splat with printk() Andrew Morton
2020-01-31  6:15 ` [patch 072/118] mm/page_isolation: fix potential warning from user Andrew Morton
2020-01-31  6:15 ` [patch 073/118] mm/zswap.c: add allocation hysteresis if pool limit is hit Andrew Morton
2020-01-31  6:15 ` [patch 074/118] zswap: potential NULL dereference on error in init_zswap() Andrew Morton
2020-01-31  6:15 ` [patch 075/118] include/linux/mm.h: clean up obsolete check on space in page->flags Andrew Morton
2020-01-31  6:15 ` [patch 076/118] include/linux/mm.h: remove dead code totalram_pages_set() Andrew Morton
2020-01-31  6:15 ` [patch 077/118] include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block Andrew Morton
2020-01-31  6:15 ` [patch 078/118] mm: fix comments related to node reclaim Andrew Morton
2020-01-31  6:15 ` [patch 079/118] zram: try to avoid worst-case scenario on same element pages Andrew Morton
2020-01-31  6:15 ` [patch 080/118] drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store Andrew Morton
2020-01-31  6:15 ` [patch 081/118] include/linux/units.h: add helpers for kelvin to/from Celsius conversion Andrew Morton
2020-01-31  6:15 ` [patch 082/118] ACPI: thermal: switch to use <linux/units.h> helpers Andrew Morton
2020-01-31  6:15 ` [patch 083/118] platform/x86: asus-wmi: " Andrew Morton
2020-01-31  6:15 ` [patch 084/118] platform/x86: intel_menlow: " Andrew Morton
2020-01-31  6:15 ` [patch 085/118] thermal: int340x: " Andrew Morton
2020-01-31  6:15 ` [patch 086/118] thermal: intel_pch: " Andrew Morton
2020-01-31  6:15 ` [patch 087/118] nvme: hwmon: " Andrew Morton
2020-01-31  6:15 ` [patch 088/118] thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h> Andrew Morton
2020-01-31  6:16 ` [patch 089/118] iwlegacy: use <linux/units.h> helpers Andrew Morton
2020-01-31  6:16 ` [patch 090/118] iwlwifi: " Andrew Morton
2020-01-31  6:16 ` [patch 091/118] thermal: armada: remove unused TO_MCELSIUS macro Andrew Morton
2020-01-31  6:16 ` [patch 092/118] iio: adc: qcom-vadc-common: use <linux/units.h> helpers Andrew Morton
2020-01-31  6:16 ` [patch 093/118] lib/zlib: add s390 hardware support for kernel zlib_deflate Andrew Morton
2020-01-31  6:16 ` [patch 094/118] s390/boot: rename HEAP_SIZE due to name collision Andrew Morton
2020-01-31  6:16 ` [patch 095/118] lib/zlib: add s390 hardware support for kernel zlib_inflate Andrew Morton
2020-01-31  6:16 ` [patch 096/118] s390/boot: add dfltcc= kernel command line parameter Andrew Morton
2020-01-31  6:16 ` [patch 097/118] lib/zlib: add zlib_deflate_dfltcc_enabled() function Andrew Morton
2020-01-31  6:16 ` [patch 098/118] btrfs: use larger zlib buffer for s390 hardware compression Andrew Morton
2020-01-31  6:16 ` [patch 099/118] lib/scatterlist.c: adjust indentation in __sg_alloc_table Andrew Morton
2020-01-31  6:16 ` [patch 100/118] uapi: rename ext2_swab() to swab() and share globally in swab.h Andrew Morton
2020-01-31  6:16 ` [patch 101/118] lib/find_bit.c: join _find_next_bit{_le} Andrew Morton
2020-01-31  6:16 ` [patch 102/118] lib/find_bit.c: uninline helper _find_next_bit() Andrew Morton
2020-01-31  6:16 ` [patch 103/118] fs/binfmt_elf.c: smaller code generation around auxv vector fill Andrew Morton
2020-01-31  6:16 ` [patch 104/118] fs/binfmt_elf.c: fix ->start_code calculation Andrew Morton
2020-01-31  6:16 ` [patch 105/118] fs/binfmt_elf.c: don't copy ELF header around Andrew Morton
2023-11-22  7:15   ` Jinjie Ruan
2020-01-31  6:16 ` [patch 106/118] fs/binfmt_elf.c: better codegen around current->mm Andrew Morton
2020-01-31  6:17 ` [patch 107/118] fs/binfmt_elf.c: make BAD_ADDR() unlikely Andrew Morton
2020-01-31  6:17 ` [patch 108/118] fs/binfmt_elf.c: coredump: allocate core ELF header on stack Andrew Morton
2020-01-31  6:17 ` [patch 109/118] fs/binfmt_elf.c: coredump: delete duplicated overflow check Andrew Morton
2020-01-31  6:17 ` [patch 110/118] fs/binfmt_elf.c: coredump: allow process with empty address space to coredump Andrew Morton
2020-01-31  6:17 ` [patch 111/118] init/main.c: log arguments and environment passed to init Andrew Morton
2020-01-31  6:17 ` [patch 112/118] init/main.c: remove unnecessary repair_env_string in do_initcall_level Andrew Morton
2020-01-31  6:17 ` [patch 113/118] init/main.c: fix quoted value handling in unknown_bootoption Andrew Morton
2020-01-31  6:17 ` [patch 114/118] init/main.c: fix misleading "This architecture does not have kernel memory protection" message Andrew Morton
2020-01-31  6:17 ` [patch 115/118] reiserfs: prevent NULL pointer dereference in reiserfs_insert_item() Andrew Morton
2020-01-31  6:17 ` [patch 116/118] execve: warn if process starts with executable stack Andrew Morton
2020-01-31  6:17 ` [patch 117/118] include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc() Andrew Morton
2020-01-31  6:17 ` [patch 118/118] kcov: ignore fault-inject and stacktrace Andrew Morton

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).