* incoming
From: Andrew Morton @ 2020-08-15 0:29 UTC
To: Linus Torvalds; +Cc: linux-mm, mm-commits
39 patches, based on b923f1247b72fc100b87792fd2129d026bb10e66.
Subsystems affected by this patch series:
mm/hotfixes
lz4
exec
mailmap
mm/thp
autofs
mm/madvise
sysctl
mm/kmemleak
mm/misc
lib
Subsystem: mm/hotfixes
Mike Rapoport <rppt@linux.ibm.com>:
asm-generic: pgalloc.h: use correct #ifdef to enable pud_alloc_one()
Baoquan He <bhe@redhat.com>:
Revert "mm/vmstat.c: do not show lowmem reserve protection information of empty zone"
Subsystem: lz4
Nick Terrell <terrelln@fb.com>:
lz4: fix kernel decompression speed
Subsystem: exec
Kees Cook <keescook@chromium.org>:
Patch series "Fix S_ISDIR execve() errno":
exec: restore EACCES of S_ISDIR execve()
selftests/exec: add file type errno tests
Subsystem: mailmap
Greg Kurz <groug@kaod.org>:
mailmap: add entry for Greg Kurz
Subsystem: mm/thp
"Matthew Wilcox (Oracle)" <willy@infradead.org>:
Patch series "THP prep patches":
mm: store compound_nr as well as compound_order
mm: move page-flags include to top of file
mm: add thp_order
mm: add thp_size
mm: replace hpage_nr_pages with thp_nr_pages
mm: add thp_head
mm: introduce offset_in_thp
Subsystem: autofs
Randy Dunlap <rdunlap@infradead.org>:
fs: autofs: delete repeated words in comments
Subsystem: mm/madvise
Minchan Kim <minchan@kernel.org>:
Patch series "introduce memory hinting API for external process", v8:
mm/madvise: pass task and mm to do_madvise
pid: move pidfd_get_pid() to pid.c
mm/madvise: introduce process_madvise() syscall: an external memory hinting API
mm/madvise: check fatal signal pending of target process
Subsystem: sysctl
Xiaoming Ni <nixiaoming@huawei.com>:
all arch: remove system call sys_sysctl
Subsystem: mm/kmemleak
Qian Cai <cai@lca.pw>:
mm/kmemleak: silence KCSAN splats in checksum
Subsystem: mm/misc
Qian Cai <cai@lca.pw>:
mm/frontswap: mark various intentional data races
mm/page_io: mark various intentional data races
mm/swap_state: mark various intentional data races
Kirill A. Shutemov <kirill@shutemov.name>:
mm/filemap.c: fix a data race in filemap_fault()
Qian Cai <cai@lca.pw>:
mm/swapfile: fix and annotate various data races
mm/page_counter: fix various data races at memsw
mm/memcontrol: fix a data race in scan count
mm/list_lru: fix a data race in list_lru_count_one
mm/mempool: fix a data race in mempool_free()
mm/rmap: annotate a data race at tlb_flush_batched
mm/swap.c: annotate data races for lru_rotate_pvecs
mm: annotate a data race in page_zonenum()
Romain Naour <romain.naour@gmail.com>:
include/asm-generic/vmlinux.lds.h: align ro_after_init
Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>:
sh: clkfwk: remove r8/r16/r32
sh: use generic strncpy()
Subsystem: lib
Krzysztof Kozlowski <krzk@kernel.org>:
Patch series "iomap: Constify ioreadX() iomem argument", v3:
iomap: constify ioreadX() iomem argument (as in generic implementation)
rtl818x: constify ioreadX() iomem argument (as in generic implementation)
ntb: intel: constify ioreadX() iomem argument (as in generic implementation)
virtio: pci: constify ioreadX() iomem argument (as in generic implementation)
.mailmap | 1
arch/alpha/include/asm/core_apecs.h | 6
arch/alpha/include/asm/core_cia.h | 6
arch/alpha/include/asm/core_lca.h | 6
arch/alpha/include/asm/core_marvel.h | 4
arch/alpha/include/asm/core_mcpcia.h | 6
arch/alpha/include/asm/core_t2.h | 2
arch/alpha/include/asm/io.h | 12 -
arch/alpha/include/asm/io_trivial.h | 16 -
arch/alpha/include/asm/jensen.h | 2
arch/alpha/include/asm/machvec.h | 6
arch/alpha/kernel/core_marvel.c | 2
arch/alpha/kernel/io.c | 12 -
arch/alpha/kernel/syscalls/syscall.tbl | 3
arch/arm/configs/am200epdkit_defconfig | 1
arch/arm/tools/syscall.tbl | 3
arch/arm64/include/asm/unistd.h | 2
arch/arm64/include/asm/unistd32.h | 6
arch/ia64/kernel/syscalls/syscall.tbl | 3
arch/m68k/kernel/syscalls/syscall.tbl | 3
arch/microblaze/kernel/syscalls/syscall.tbl | 3
arch/mips/configs/cu1000-neo_defconfig | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 3
arch/mips/kernel/syscalls/syscall_n64.tbl | 3
arch/mips/kernel/syscalls/syscall_o32.tbl | 3
arch/parisc/include/asm/io.h | 4
arch/parisc/kernel/syscalls/syscall.tbl | 3
arch/parisc/lib/iomap.c | 72 +++---
arch/powerpc/kernel/iomap.c | 28 +-
arch/powerpc/kernel/syscalls/syscall.tbl | 3
arch/s390/kernel/syscalls/syscall.tbl | 3
arch/sh/configs/dreamcast_defconfig | 1
arch/sh/configs/espt_defconfig | 1
arch/sh/configs/hp6xx_defconfig | 1
arch/sh/configs/landisk_defconfig | 1
arch/sh/configs/lboxre2_defconfig | 1
arch/sh/configs/microdev_defconfig | 1
arch/sh/configs/migor_defconfig | 1
arch/sh/configs/r7780mp_defconfig | 1
arch/sh/configs/r7785rp_defconfig | 1
arch/sh/configs/rts7751r2d1_defconfig | 1
arch/sh/configs/rts7751r2dplus_defconfig | 1
arch/sh/configs/se7206_defconfig | 1
arch/sh/configs/se7343_defconfig | 1
arch/sh/configs/se7619_defconfig | 1
arch/sh/configs/se7705_defconfig | 1
arch/sh/configs/se7750_defconfig | 1
arch/sh/configs/se7751_defconfig | 1
arch/sh/configs/secureedge5410_defconfig | 1
arch/sh/configs/sh03_defconfig | 1
arch/sh/configs/sh7710voipgw_defconfig | 1
arch/sh/configs/sh7757lcr_defconfig | 1
arch/sh/configs/sh7763rdp_defconfig | 1
arch/sh/configs/shmin_defconfig | 1
arch/sh/configs/titan_defconfig | 1
arch/sh/include/asm/string_32.h | 26 --
arch/sh/kernel/iomap.c | 22 -
arch/sh/kernel/syscalls/syscall.tbl | 3
arch/sparc/kernel/syscalls/syscall.tbl | 3
arch/x86/entry/syscalls/syscall_32.tbl | 3
arch/x86/entry/syscalls/syscall_64.tbl | 4
arch/xtensa/kernel/syscalls/syscall.tbl | 3
drivers/mailbox/bcm-pdc-mailbox.c | 2
drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h | 6
drivers/ntb/hw/intel/ntb_hw_gen1.c | 2
drivers/ntb/hw/intel/ntb_hw_gen3.h | 2
drivers/ntb/hw/intel/ntb_hw_intel.h | 2
drivers/nvdimm/btt.c | 4
drivers/nvdimm/pmem.c | 6
drivers/sh/clk/cpg.c | 25 --
drivers/virtio/virtio_pci_modern.c | 6
fs/autofs/dev-ioctl.c | 4
fs/io_uring.c | 2
fs/namei.c | 4
include/asm-generic/iomap.h | 28 +-
include/asm-generic/pgalloc.h | 2
include/asm-generic/vmlinux.lds.h | 1
include/linux/compat.h | 5
include/linux/huge_mm.h | 58 ++++-
include/linux/io-64-nonatomic-hi-lo.h | 4
include/linux/io-64-nonatomic-lo-hi.h | 4
include/linux/memcontrol.h | 2
include/linux/mm.h | 16 -
include/linux/mm_inline.h | 6
include/linux/mm_types.h | 1
include/linux/pagemap.h | 6
include/linux/pid.h | 1
include/linux/syscalls.h | 4
include/linux/sysctl.h | 6
include/uapi/asm-generic/unistd.h | 4
kernel/Makefile | 2
kernel/exit.c | 17 -
kernel/pid.c | 17 +
kernel/sys_ni.c | 3
kernel/sysctl_binary.c | 171 --------------
lib/iomap.c | 30 +-
lib/lz4/lz4_compress.c | 4
lib/lz4/lz4_decompress.c | 18 -
lib/lz4/lz4defs.h | 10
lib/lz4/lz4hc_compress.c | 2
mm/compaction.c | 2
mm/filemap.c | 22 +
mm/frontswap.c | 8
mm/gup.c | 2
mm/internal.h | 4
mm/kmemleak.c | 2
mm/list_lru.c | 2
mm/madvise.c | 190 ++++++++++++++--
mm/memcontrol.c | 10
mm/memory.c | 4
mm/memory_hotplug.c | 7
mm/mempolicy.c | 2
mm/mempool.c | 2
mm/migrate.c | 18 -
mm/mlock.c | 9
mm/page_alloc.c | 5
mm/page_counter.c | 13 -
mm/page_io.c | 12 -
mm/page_vma_mapped.c | 6
mm/rmap.c | 10
mm/swap.c | 21 -
mm/swap_state.c | 10
mm/swapfile.c | 33 +-
mm/vmscan.c | 6
mm/vmstat.c | 12 -
mm/workingset.c | 6
tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2
tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2
tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 2
tools/testing/selftests/exec/.gitignore | 1
tools/testing/selftests/exec/Makefile | 5
tools/testing/selftests/exec/non-regular.c | 196 +++++++++++++++++
132 files changed, 815 insertions(+), 614 deletions(-)
* [patch 01/39] asm-generic: pgalloc.h: use correct #ifdef to enable pud_alloc_one()
From: Andrew Morton @ 2020-08-15 0:30 UTC
To: akpm, linux-mm, lkp, mm-commits, rppt, torvalds
From: Mike Rapoport <rppt@linux.ibm.com>
Subject: asm-generic: pgalloc.h: use correct #ifdef to enable pud_alloc_one()
The preprocessor guard (#ifndef) around the generic version of
pud_alloc_one() mistakenly tested __HAVE_ARCH_PUD_FREE instead of
__HAVE_ARCH_PUD_ALLOC_ONE. Fix it.
Link: http://lkml.kernel.org/r/20200812191415.GE163101@linux.ibm.com
Fixes: d9e8b929670b ("asm-generic: pgalloc: provide generic pud_alloc_one() and pud_free_one()")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/asm-generic/pgalloc.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/include/asm-generic/pgalloc.h~asm-generic-pgalloch-use-correct-ifdef-to-enable-pud_alloc_one
+++ a/include/asm-generic/pgalloc.h
@@ -147,7 +147,7 @@ static inline void pmd_free(struct mm_st
#if CONFIG_PGTABLE_LEVELS > 3
-#ifndef __HAVE_ARCH_PUD_FREE
+#ifndef __HAVE_ARCH_PUD_ALLOC_ONE
/**
* pud_alloc_one - allocate a page for PUD-level page table
* @mm: the mm_struct of the current context
_
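To illustrate why the guard macro has to match the function it protects, here is a minimal userspace sketch of the override pattern. The names HAVE_ARCH_DEMO_ALLOC_ONE, HAVE_ARCH_DEMO_FREE, demo_alloc_one() and demo_free() are hypothetical stand-ins, not kernel symbols. The pretend architecture supplies its own free routine but relies on the generic allocator.

```c
#include <stdlib.h>

/*
 * Hypothetical stand-ins for __HAVE_ARCH_PUD_FREE and
 * __HAVE_ARCH_PUD_ALLOC_ONE. This "architecture" overrides only the
 * free routine and expects the generic allocator to be provided.
 */
#define HAVE_ARCH_DEMO_FREE

static int arch_free_calls;

/* Architecture-specific free: counts calls so the override is observable. */
static void demo_free(void *table)
{
	arch_free_calls++;
	free(table);
}

/* Correct guard: test the macro that belongs to the function defined here. */
#ifndef HAVE_ARCH_DEMO_ALLOC_ONE
static void *demo_alloc_one(void)
{
	return calloc(1, 4096);
}
#endif

/* Generic free, compiled out because the architecture provides its own. */
#ifndef HAVE_ARCH_DEMO_FREE
static void demo_free(void *table)
{
	free(table);
}
#endif
```

If the allocator were guarded by HAVE_ARCH_DEMO_FREE instead, as in the bug fixed by this patch, demo_alloc_one() would be compiled out and any caller would fail to build.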
* [patch 02/39] Revert "mm/vmstat.c: do not show lowmem reserve protection information of empty zone"
From: Andrew Morton @ 2020-08-15 0:30 UTC
To: akpm, bhe, david, linux-mm, mm-commits, rientjes, sonnyrao,
stable, torvalds
From: Baoquan He <bhe@redhat.com>
Subject: Revert "mm/vmstat.c: do not show lowmem reserve protection information of empty zone"
This reverts commit 26e7deadaae175.
Sonny reported that one of their tests started failing on the latest
kernel on their Chrome OS platform. The root cause is that the above
commit removed the protection line for empty zones, while the parser used
in the test relies on the protection line to mark the end of each zone.
Let's revert it to avoid breaking userspace testing or applications.
Link: http://lkml.kernel.org/r/20200811075412.12872-1-bhe@redhat.com
Fixes: 26e7deadaae175 ("mm/vmstat.c: do not show lowmem reserve protection information of empty zone")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [5.8.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/vmstat.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
--- a/mm/vmstat.c~revert-mm-vmstatc-do-not-show-lowmem-reserve-protection-information-of-empty-zone
+++ a/mm/vmstat.c
@@ -1642,12 +1642,6 @@ static void zoneinfo_show_print(struct s
zone->present_pages,
zone_managed_pages(zone));
- /* If unpopulated, no other information is useful */
- if (!populated_zone(zone)) {
- seq_putc(m, '\n');
- return;
- }
-
seq_printf(m,
"\n protection: (%ld",
zone->lowmem_reserve[0]);
@@ -1655,6 +1649,12 @@ static void zoneinfo_show_print(struct s
seq_printf(m, ", %ld", zone->lowmem_reserve[i]);
seq_putc(m, ')');
+ /* If unpopulated, no other information is useful */
+ if (!populated_zone(zone)) {
+ seq_putc(m, '\n');
+ return;
+ }
+
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
seq_printf(m, "\n %-12s %lu", zone_stat_name(i),
zone_page_state(zone, i));
_
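The breakage described above can be pictured with a small sketch. The actual Chrome OS test parser is not shown in this thread, so count_unterminated_zones() below is a hypothetical parser written only to illustrate the dependency: it treats a "Node" line as the start of a zone record and a "protection:" line as its end marker.

```c
#include <string.h>

/*
 * Hypothetical parser, assumed for illustration only. A zone record
 * starts at a line beginning with "Node" and is complete once a
 * "protection:" line has been seen.
 * Returns the number of records that never saw their end marker.
 */
static int count_unterminated_zones(const char *text)
{
	int open = 0;		/* 1 while a record awaits "protection:" */
	int unterminated = 0;
	const char *line = text;

	while (*line) {
		const char *end = strchr(line, '\n');
		size_t len = end ? (size_t)(end - line) : strlen(line);
		const char *p = line;

		/* Skip leading blanks so that indented fields match too. */
		while (len && (*p == ' ' || *p == '\t')) {
			p++;
			len--;
		}

		if (len >= 4 && strncmp(p, "Node", 4) == 0) {
			if (open)
				unterminated++;	/* previous record had no end */
			open = 1;
		} else if (len >= 11 && strncmp(p, "protection:", 11) == 0) {
			open = 0;
		}

		/* Advance to the next line, or to the terminating NUL. */
		line = end ? end + 1 : p + len;
	}
	return unterminated + open;
}
```

With the reverted commit applied, an empty zone printed no "protection:" line, so a parser of this shape would report an unterminated record for it.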
* [patch 03/39] lz4: fix kernel decompression speed
From: Andrew Morton @ 2020-08-15 0:30 UTC
To: 4sschmid, akpm, gaoxiang25, gregkh, linux-mm, mingo, mm-commits,
nivedita, terrelln, torvalds, yann.collet.73
From: Nick Terrell <terrelln@fb.com>
Subject: lz4: fix kernel decompression speed
This patch replaces all memcpy() calls with LZ4_memcpy(), which expands to
__builtin_memcpy() so that the compiler can inline it.
LZ4 relies heavily on memcpy() with a constant size being inlined. In x86
and i386 pre-boot environments memcpy() cannot be inlined because memcpy()
doesn't get defined as __builtin_memcpy().
An equivalent patch has been applied upstream so that the next import
won't lose this change [1].
I've measured the kernel decompression speed using QEMU before and after
this patch for the x86_64 and i386 architectures. The speed-up is about
10x as shown below.
Code   Arch    Kernel Size  Time    Speed
v5.8   x86_64  11504832 B   148 ms   79 MB/s
patch  x86_64  11503872 B    13 ms  885 MB/s
v5.8   i386     9621216 B    91 ms  106 MB/s
patch  i386     9620224 B    10 ms  962 MB/s
I also measured the time to decompress the initramfs on x86_64, i386, and
arm. All three show the same decompression speed before and after, as
expected.
[1] https://github.com/lz4/lz4/pull/890
Link: http://lkml.kernel.org/r/20200803194022.2966806-1-nickrterrell@gmail.com
Signed-off-by: Nick Terrell <terrelln@fb.com>
Cc: Yann Collet <yann.collet.73@gmail.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Sven Schmidt <4sschmid@informatik.uni-hamburg.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
lib/lz4/lz4_compress.c | 4 ++--
lib/lz4/lz4_decompress.c | 18 +++++++++---------
lib/lz4/lz4defs.h | 10 ++++++++++
lib/lz4/lz4hc_compress.c | 2 +-
4 files changed, 22 insertions(+), 12 deletions(-)
--- a/lib/lz4/lz4_compress.c~lz4-fix-kernel-decompression-speed
+++ a/lib/lz4/lz4_compress.c
@@ -446,7 +446,7 @@ _last_literals:
*op++ = (BYTE)(lastRun << ML_BITS);
}
- memcpy(op, anchor, lastRun);
+ LZ4_memcpy(op, anchor, lastRun);
op += lastRun;
}
@@ -708,7 +708,7 @@ _last_literals:
} else {
*op++ = (BYTE)(lastRunSize<<ML_BITS);
}
- memcpy(op, anchor, lastRunSize);
+ LZ4_memcpy(op, anchor, lastRunSize);
op += lastRunSize;
}
--- a/lib/lz4/lz4_decompress.c~lz4-fix-kernel-decompression-speed
+++ a/lib/lz4/lz4_decompress.c
@@ -153,7 +153,7 @@ static FORCE_INLINE int LZ4_decompress_g
&& likely((endOnInput ? ip < shortiend : 1) &
(op <= shortoend))) {
/* Copy the literals */
- memcpy(op, ip, endOnInput ? 16 : 8);
+ LZ4_memcpy(op, ip, endOnInput ? 16 : 8);
op += length; ip += length;
/*
@@ -172,9 +172,9 @@ static FORCE_INLINE int LZ4_decompress_g
(offset >= 8) &&
(dict == withPrefix64k || match >= lowPrefix)) {
/* Copy the match. */
- memcpy(op + 0, match + 0, 8);
- memcpy(op + 8, match + 8, 8);
- memcpy(op + 16, match + 16, 2);
+ LZ4_memcpy(op + 0, match + 0, 8);
+ LZ4_memcpy(op + 8, match + 8, 8);
+ LZ4_memcpy(op + 16, match + 16, 2);
op += length + MINMATCH;
/* Both stages worked, load the next token. */
continue;
@@ -263,7 +263,7 @@ static FORCE_INLINE int LZ4_decompress_g
}
}
- memcpy(op, ip, length);
+ LZ4_memcpy(op, ip, length);
ip += length;
op += length;
@@ -350,7 +350,7 @@ _copy_match:
size_t const copySize = (size_t)(lowPrefix - match);
size_t const restSize = length - copySize;
- memcpy(op, dictEnd - copySize, copySize);
+ LZ4_memcpy(op, dictEnd - copySize, copySize);
op += copySize;
if (restSize > (size_t)(op - lowPrefix)) {
/* overlap copy */
@@ -360,7 +360,7 @@ _copy_match:
while (op < endOfMatch)
*op++ = *copyFrom++;
} else {
- memcpy(op, lowPrefix, restSize);
+ LZ4_memcpy(op, lowPrefix, restSize);
op += restSize;
}
}
@@ -386,7 +386,7 @@ _copy_match:
while (op < copyEnd)
*op++ = *match++;
} else {
- memcpy(op, match, mlen);
+ LZ4_memcpy(op, match, mlen);
}
op = copyEnd;
if (op == oend)
@@ -400,7 +400,7 @@ _copy_match:
op[2] = match[2];
op[3] = match[3];
match += inc32table[offset];
- memcpy(op + 4, match, 4);
+ LZ4_memcpy(op + 4, match, 4);
match -= dec64table[offset];
} else {
LZ4_copy8(op, match);
--- a/lib/lz4/lz4defs.h~lz4-fix-kernel-decompression-speed
+++ a/lib/lz4/lz4defs.h
@@ -137,6 +137,16 @@ static FORCE_INLINE void LZ4_writeLE16(v
return put_unaligned_le16(value, memPtr);
}
+/*
+ * LZ4 relies on memcpy with a constant size being inlined. In freestanding
+ * environments, the compiler can't assume the implementation of memcpy() is
+ * standard compliant, so apply its specialized memcpy() inlining logic. When
+ * possible, use __builtin_memcpy() to tell the compiler to analyze memcpy()
+ * as-if it were standard compliant, so it can inline it in freestanding
+ * environments. This is needed when decompressing the Linux Kernel, for example.
+ */
+#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)
+
static FORCE_INLINE void LZ4_copy8(void *dst, const void *src)
{
#if LZ4_ARCH64
--- a/lib/lz4/lz4hc_compress.c~lz4-fix-kernel-decompression-speed
+++ a/lib/lz4/lz4hc_compress.c
@@ -570,7 +570,7 @@ _Search3:
*op++ = (BYTE) lastRun;
} else
*op++ = (BYTE)(lastRun<<ML_BITS);
- memcpy(op, anchor, iend - anchor);
+ LZ4_memcpy(op, anchor, iend - anchor);
op += iend - anchor;
}
_
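As a small illustration of the idea behind LZ4_memcpy(), the sketch below reuses the macro definition from the patch and wraps it in two helpers. demo_copy8() and demo_read32() are hypothetical names that do not exist in lib/lz4, and the code assumes a GCC or Clang compatible compiler that provides __builtin_memcpy().

```c
#include <stdint.h>

/* Same definition as in the patch: route copies to the compiler builtin. */
#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)

/*
 * Hypothetical helper: a copy with a compile-time constant size. Because
 * the builtin is used, the compiler may expand it inline (for example as
 * a single 8-byte load and store) even when the build is freestanding.
 */
static inline void demo_copy8(void *dst, const void *src)
{
	LZ4_memcpy(dst, src, 8);
}

/* Hypothetical helper: read 32 bits from a possibly unaligned address. */
static inline uint32_t demo_read32(const void *p)
{
	uint32_t v;

	LZ4_memcpy(&v, p, sizeof(v));
	return v;
}
```

Whether the copy is actually inlined depends on the compiler and optimization level; the builtin only permits it, which a plain memcpy() call in a freestanding build may not.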
* [patch 04/39] exec: restore EACCES of S_ISDIR execve()
From: Andrew Morton @ 2020-08-15 0:30 UTC
To: akpm, gregkh, gregkh, keescook, linux-mm, maz, mm-commits, torvalds
From: Kees Cook <keescook@chromium.org>
Subject: exec: restore EACCES of S_ISDIR execve()
Patch series "Fix S_ISDIR execve() errno".
Fix an errno change for execve() of directories, noticed by Marc Zyngier.
Along with the fix, include a regression test so that the errno change does
not come back in the future.
This patch (of 2):
The return code for attempting to execute a directory has always been
EACCES. Adjust the S_ISDIR exec test to reflect the old errno instead of
the general EISDIR for other kinds of "open" attempts on directories.
Link: http://lkml.kernel.org/r/20200813231723.2725102-2-keescook@chromium.org
Link: https://lore.kernel.org/lkml/20200813151305.6191993b@why
Fixes: 633fb6ac3980 ("exec: move S_ISREG() check earlier")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@google.com>
Tested-by: Greg Kroah-Hartman <gregkh@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/namei.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/fs/namei.c~exec-restore-eacces-of-s_isdir-execve
+++ a/fs/namei.c
@@ -2849,8 +2849,10 @@ static int may_open(const struct path *p
case S_IFLNK:
return -ELOOP;
case S_IFDIR:
- if (acc_mode & (MAY_WRITE | MAY_EXEC))
+ if (acc_mode & MAY_WRITE)
return -EISDIR;
+ if (acc_mode & MAY_EXEC)
+ return -EACCES;
break;
case S_IFBLK:
case S_IFCHR:
_
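The user-visible behaviour restored by this patch can be checked from userspace with a few lines of C. This is a minimal sketch, not part of the patch; exec_errno() is a hypothetical helper name, and the EACCES expectation assumes a kernel that contains this fix (or one that predates the regression).

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to execute `path` and return the errno that
 * execv() reports. execv() only returns on failure, so this must not be
 * called with a path that can actually be executed.
 */
static int exec_errno(const char *path)
{
	char *const argv[] = { (char *)path, NULL };

	errno = 0;
	execv(path, argv);
	return errno;
}
```

For example, exec_errno("/") is expected to yield EACCES on a fixed kernel, whereas a kernel with the regression would report EISDIR.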
* [patch 05/39] selftests/exec: add file type errno tests
From: Andrew Morton @ 2020-08-15 0:30 UTC
To: akpm, keescook, linux-mm, maz, mm-commits, torvalds
From: Kees Cook <keescook@chromium.org>
Subject: selftests/exec: add file type errno tests
Make sure execve() returns the expected errno values for non-regular
files.
Link: http://lkml.kernel.org/r/20200813231723.2725102-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
tools/testing/selftests/exec/.gitignore | 1
tools/testing/selftests/exec/Makefile | 5
tools/testing/selftests/exec/non-regular.c | 196 +++++++++++++++++++
3 files changed, 200 insertions(+), 2 deletions(-)
--- a/tools/testing/selftests/exec/.gitignore~selftests-exec-add-file-type-errno-tests
+++ a/tools/testing/selftests/exec/.gitignore
@@ -10,3 +10,4 @@ execveat.denatured
/recursion-depth
xxxxxxxx*
pipe
+S_I*.test
--- a/tools/testing/selftests/exec/Makefile~selftests-exec-add-file-type-errno-tests
+++ a/tools/testing/selftests/exec/Makefile
@@ -3,7 +3,7 @@ CFLAGS = -Wall
CFLAGS += -Wno-nonnull
CFLAGS += -D_GNU_SOURCE
-TEST_PROGS := binfmt_script
+TEST_PROGS := binfmt_script non-regular
TEST_GEN_PROGS := execveat
TEST_GEN_FILES := execveat.symlink execveat.denatured script subdir pipe
# Makefile is a run-time dependency, since it's accessed by the execveat test
@@ -11,7 +11,8 @@ TEST_FILES := Makefile
TEST_GEN_PROGS += recursion-depth
-EXTRA_CLEAN := $(OUTPUT)/subdir.moved $(OUTPUT)/execveat.moved $(OUTPUT)/xxxxx*
+EXTRA_CLEAN := $(OUTPUT)/subdir.moved $(OUTPUT)/execveat.moved $(OUTPUT)/xxxxx* \
+ $(OUTPUT)/S_I*.test
include ../lib.mk
--- /dev/null
+++ a/tools/testing/selftests/exec/non-regular.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/sysmacros.h>
+#include <sys/types.h>
+
+#include "../kselftest_harness.h"
+
+/* Remove a file, ignoring the result if it didn't exist. */
+void rm(struct __test_metadata *_metadata, const char *pathname,
+ int is_dir)
+{
+ int rc;
+
+ if (is_dir)
+ rc = rmdir(pathname);
+ else
+ rc = unlink(pathname);
+
+ if (rc < 0) {
+ ASSERT_EQ(errno, ENOENT) {
+ TH_LOG("Not ENOENT: %s", pathname);
+ }
+ } else {
+ ASSERT_EQ(rc, 0) {
+ TH_LOG("Failed to remove: %s", pathname);
+ }
+ }
+}
+
+FIXTURE(file) {
+ char *pathname;
+ int is_dir;
+};
+
+FIXTURE_VARIANT(file)
+{
+ const char *name;
+ int expected;
+ int is_dir;
+ void (*setup)(struct __test_metadata *_metadata,
+ FIXTURE_DATA(file) *self,
+ const FIXTURE_VARIANT(file) *variant);
+ int major, minor, mode; /* for mknod() */
+};
+
+void setup_link(struct __test_metadata *_metadata,
+ FIXTURE_DATA(file) *self,
+ const FIXTURE_VARIANT(file) *variant)
+{
+ const char * const paths[] = {
+ "/bin/true",
+ "/usr/bin/true",
+ };
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(paths); i++) {
+ if (access(paths[i], X_OK) == 0) {
+ ASSERT_EQ(symlink(paths[i], self->pathname), 0);
+ return;
+ }
+ }
+ ASSERT_EQ(1, 0) {
+ TH_LOG("Could not find viable 'true' binary");
+ }
+}
+
+FIXTURE_VARIANT_ADD(file, S_IFLNK)
+{
+ .name = "S_IFLNK",
+ .expected = ELOOP,
+ .setup = setup_link,
+};
+
+void setup_dir(struct __test_metadata *_metadata,
+ FIXTURE_DATA(file) *self,
+ const FIXTURE_VARIANT(file) *variant)
+{
+ ASSERT_EQ(mkdir(self->pathname, 0755), 0);
+}
+
+FIXTURE_VARIANT_ADD(file, S_IFDIR)
+{
+ .name = "S_IFDIR",
+ .is_dir = 1,
+ .expected = EACCES,
+ .setup = setup_dir,
+};
+
+void setup_node(struct __test_metadata *_metadata,
+ FIXTURE_DATA(file) *self,
+ const FIXTURE_VARIANT(file) *variant)
+{
+ dev_t dev;
+ int rc;
+
+ dev = makedev(variant->major, variant->minor);
+ rc = mknod(self->pathname, 0755 | variant->mode, dev);
+ ASSERT_EQ(rc, 0) {
+ if (errno == EPERM)
+ SKIP(return, "Please run as root; cannot mknod(%s)",
+ variant->name);
+ }
+}
+
+FIXTURE_VARIANT_ADD(file, S_IFBLK)
+{
+ .name = "S_IFBLK",
+ .expected = EACCES,
+ .setup = setup_node,
+ /* /dev/loop0 */
+ .major = 7,
+ .minor = 0,
+ .mode = S_IFBLK,
+};
+
+FIXTURE_VARIANT_ADD(file, S_IFCHR)
+{
+ .name = "S_IFCHR",
+ .expected = EACCES,
+ .setup = setup_node,
+ /* /dev/zero */
+ .major = 1,
+ .minor = 5,
+ .mode = S_IFCHR,
+};
+
+void setup_fifo(struct __test_metadata *_metadata,
+ FIXTURE_DATA(file) *self,
+ const FIXTURE_VARIANT(file) *variant)
+{
+ ASSERT_EQ(mkfifo(self->pathname, 0755), 0);
+}
+
+FIXTURE_VARIANT_ADD(file, S_IFIFO)
+{
+ .name = "S_IFIFO",
+ .expected = EACCES,
+ .setup = setup_fifo,
+};
+
+FIXTURE_SETUP(file)
+{
+ ASSERT_GT(asprintf(&self->pathname, "%s.test", variant->name), 6);
+ self->is_dir = variant->is_dir;
+
+ rm(_metadata, self->pathname, variant->is_dir);
+ variant->setup(_metadata, self, variant);
+}
+
+FIXTURE_TEARDOWN(file)
+{
+ rm(_metadata, self->pathname, self->is_dir);
+}
+
+TEST_F(file, exec_errno)
+{
+ char * const argv[2] = { (char * const)self->pathname, NULL };
+
+ EXPECT_LT(execv(argv[0], argv), 0);
+ EXPECT_EQ(errno, variant->expected);
+}
+
+/* S_IFSOCK */
+FIXTURE(sock)
+{
+ int fd;
+};
+
+FIXTURE_SETUP(sock)
+{
+ self->fd = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(self->fd, 0);
+}
+
+FIXTURE_TEARDOWN(sock)
+{
+ if (self->fd >= 0)
+ ASSERT_EQ(close(self->fd), 0);
+}
+
+TEST_F(sock, exec_errno)
+{
+ char * const argv[2] = { " magic socket ", NULL };
+ char * const envp[1] = { NULL };
+
+ EXPECT_LT(fexecve(self->fd, argv, envp), 0);
+ EXPECT_EQ(errno, EACCES);
+}
+
+TEST_HARNESS_MAIN
_
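For readers who want to try one of the cases outside the kselftest harness, the S_IFIFO variant can be approximated as below. This is a simplified sketch under stated assumptions: fifo_exec_errno() is a hypothetical function, the FIFO is created under /tmp (which must be writable), and the per-process file name is an illustrative choice rather than what the selftest uses.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical standalone version of the S_IFIFO case: create a FIFO with
 * execute permission, try to execv() it, and return the resulting errno.
 * Returns -1 if the FIFO could not be created.
 */
static int fifo_exec_errno(void)
{
	char path[64];
	int err;

	/* Illustrative name; the PID keeps parallel runs from colliding. */
	snprintf(path, sizeof(path), "/tmp/S_IFIFO.%ld.test", (long)getpid());

	unlink(path);		/* ignore the result if it did not exist */
	if (mkfifo(path, 0755) != 0)
		return -1;

	char *const argv[] = { path, NULL };

	errno = 0;
	execv(path, argv);	/* returns only on failure */
	err = errno;

	unlink(path);
	return err;
}
```

As in the selftest's S_IFIFO variant, the expected result is EACCES.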
* [patch 06/39] mailmap: add entry for Greg Kurz
From: Andrew Morton @ 2020-08-15 0:30 UTC
To: akpm, groug, linux-mm, mm-commits, torvalds
From: Greg Kurz <groug@kaod.org>
Subject: mailmap: add entry for Greg Kurz
I had stopped using gkurz@linux.vnet.ibm.com a while back already, but this
email address was shut down last June when I quit IBM. It's about time to
map it to groug@kaod.org.
Link: http://lkml.kernel.org/r/159724692879.76040.4938578139173154028.stgit@bahia.lan
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
.mailmap | 1 +
1 file changed, 1 insertion(+)
--- a/.mailmap~mailmap-add-entry-for-greg-kurz
+++ a/.mailmap
@@ -104,6 +104,7 @@ Gerald Schaefer <gerald.schaefer@linux.i
Greg Kroah-Hartman <greg@echidna.(none)>
Greg Kroah-Hartman <gregkh@suse.de>
Greg Kroah-Hartman <greg@kroah.com>
+Greg Kurz <groug@kaod.org> <gkurz@linux.vnet.ibm.com>
Gregory CLEMENT <gregory.clement@bootlin.com> <gregory.clement@free-electrons.com>
Hanjun Guo <guohanjun@huawei.com> <hanjun.guo@linaro.org>
Heiko Carstens <hca@linux.ibm.com> <h.carstens@de.ibm.com>
_
* [patch 07/39] mm: store compound_nr as well as compound_order
From: Andrew Morton @ 2020-08-15 0:30 UTC
To: akpm, david, kirill.shutemov, linux-mm, mike.kravetz, mm-commits,
torvalds, william.kucharski, willy, ziy
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: store compound_nr as well as compound_order
Patch series "THP prep patches".
These are some generic cleanups and improvements, which I would like
merged into mmotm soon. The first one should be a performance improvement
for all users of compound pages, and the others are aimed at getting code
to compile away when CONFIG_TRANSPARENT_HUGEPAGE is disabled (i.e. small
systems). The new helpers are also better documented and less confusing
than the current mixture of compound, hpage and thp prefixes.
This patch (of 7):
This removes a few instructions from functions which need to know how many
pages are in a compound page. The storage used is either page->mapping on
64-bit or page->index on 32-bit. Both of these are fine to overlay on
tail pages.
Link: http://lkml.kernel.org/r/20200629151959.15779-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200629151959.15779-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/mm.h | 5 ++++-
include/linux/mm_types.h | 1 +
mm/page_alloc.c | 5 +++--
3 files changed, 8 insertions(+), 3 deletions(-)
--- a/include/linux/mm.h~mm-store-compound_nr-as-well-as-compound_order
+++ a/include/linux/mm.h
@@ -922,12 +922,15 @@ static inline int compound_pincount(stru
static inline void set_compound_order(struct page *page, unsigned int order)
{
page[1].compound_order = order;
+ page[1].compound_nr = 1U << order;
}
/* Returns the number of pages in this potentially compound page. */
static inline unsigned long compound_nr(struct page *page)
{
- return 1UL << compound_order(page);
+ if (!PageHead(page))
+ return 1;
+ return page[1].compound_nr;
}
/* Returns the number of bytes in this potentially compound page. */
--- a/include/linux/mm_types.h~mm-store-compound_nr-as-well-as-compound_order
+++ a/include/linux/mm_types.h
@@ -134,6 +134,7 @@ struct page {
unsigned char compound_dtor;
unsigned char compound_order;
atomic_t compound_mapcount;
+ unsigned int compound_nr; /* 1 << compound_order */
};
struct { /* Second tail page of compound page */
unsigned long _compound_pad_1; /* compound_head */
--- a/mm/page_alloc.c~mm-store-compound_nr-as-well-as-compound_order
+++ a/mm/page_alloc.c
@@ -666,8 +666,6 @@ void prep_compound_page(struct page *pag
int i;
int nr_pages = 1 << order;
- set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
- set_compound_order(page, order);
__SetPageHead(page);
for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
@@ -675,6 +673,9 @@ void prep_compound_page(struct page *pag
p->mapping = TAIL_MAPPING;
set_compound_head(p, page);
}
+
+ set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+ set_compound_order(page, order);
atomic_set(compound_mapcount_ptr(page), -1);
if (hpage_pincount_available(page))
atomic_set(compound_pincount_ptr(page), 0);
_
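The caching scheme in this patch can be modelled in userspace as follows. This is a sketch only: struct demo_page and the demo_* functions are hypothetical stand-ins for struct page, set_compound_order() and compound_nr(), and the is_head field merely models PageHead().

```c
/*
 * Userspace model, not kernel code. Only the fields relevant to this
 * patch are represented.
 */
struct demo_page {
	int is_head;			/* models PageHead() */
	unsigned char compound_order;	/* stored in page[1] */
	unsigned int compound_nr;	/* cached 1 << compound_order */
};

/* Record both the order and the derived page count in the first tail page. */
static void demo_set_compound_order(struct demo_page *page, unsigned int order)
{
	page[1].compound_order = order;
	page[1].compound_nr = 1U << order;
}

/* Number of pages: 1 for a non-compound page, else the cached count. */
static unsigned long demo_compound_nr(struct demo_page *page)
{
	if (!page->is_head)
		return 1;
	return page[1].compound_nr;
}
```

The reader side then needs a flag test and a load rather than a load followed by a shift, which is the saving of "a few instructions" that the changelog refers to.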
* [patch 08/39] mm: move page-flags include to top of file
From: Andrew Morton @ 2020-08-15 0:30 UTC
To: akpm, david, kirill.shutemov, linux-mm, mike.kravetz, mm-commits,
torvalds, william.kucharski, willy, ziy
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: move page-flags include to top of file
Give up on the notion that we can remove page-flags.h from mm.h. There
are currently 14 inline functions which use a PageFoo function. Also, two
of the files directly included by mm.h include page-flags.h themselves,
and there are probably more indirect inclusions. So just include it at
the top like any other header file.
Link: http://lkml.kernel.org/r/20200629151959.15779-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/mm.h | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
--- a/include/linux/mm.h~mm-move-page-flags-include-to-top-of-file
+++ a/include/linux/mm.h
@@ -24,6 +24,7 @@
#include <linux/resource.h>
#include <linux/page_ext.h>
#include <linux/err.h>
+#include <linux/page-flags.h>
#include <linux/page_ref.h>
#include <linux/memremap.h>
#include <linux/overflow.h>
@@ -668,11 +669,6 @@ int vma_is_stack_for_current(struct vm_a
struct mmu_gather;
struct inode;
-/*
- * FIXME: take this include out, include page-flags.h in
- * files which need it (119 of them)
- */
-#include <linux/page-flags.h>
#include <linux/huge_mm.h>
/*
_
^ permalink raw reply [flat|nested] 49+ messages in thread
* [patch 09/39] mm: add thp_order
2020-08-15 0:29 incoming Andrew Morton
` (7 preceding siblings ...)
2020-08-15 0:30 ` [patch 08/39] mm: move page-flags include to top of file Andrew Morton
@ 2020-08-15 0:30 ` Andrew Morton
2020-08-15 0:30 ` [patch 10/39] mm: add thp_size Andrew Morton
` (30 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, david, kirill.shutemov, linux-mm, mike.kravetz, mm-commits,
torvalds, william.kucharski, willy, ziy
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: add thp_order
This function returns the order of a transparent huge page. It compiles
to 0 if CONFIG_TRANSPARENT_HUGEPAGE is disabled.
Link: http://lkml.kernel.org/r/20200629151959.15779-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/huge_mm.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
--- a/include/linux/huge_mm.h~mm-add-thp_order
+++ a/include/linux/huge_mm.h
@@ -258,6 +258,19 @@ static inline spinlock_t *pud_trans_huge
else
return NULL;
}
+
+/**
+ * thp_order - Order of a transparent huge page.
+ * @page: Head page of a transparent huge page.
+ */
+static inline unsigned int thp_order(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageTail(page), page);
+ if (PageHead(page))
+ return HPAGE_PMD_ORDER;
+ return 0;
+}
+
static inline int hpage_nr_pages(struct page *page)
{
if (unlikely(PageTransHuge(page)))
@@ -317,6 +330,12 @@ static inline struct list_head *page_def
#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
+static inline unsigned int thp_order(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageTail(page), page);
+ return 0;
+}
+
static inline int hpage_nr_pages(struct page *page)
{
VM_BUG_ON_PAGE(PageTail(page), page);
_
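For readers following along outside the kernel tree, the two thp_order() variants above can be modelled in userspace. This is only a sketch under stated assumptions: struct sketch_page, its head/tail flags and SKETCH_HPAGE_PMD_ORDER (set to 9, which matches x86-64 with 4KiB base pages) are stand-ins invented for illustration, SKETCH_THP plays the role of CONFIG_TRANSPARENT_HUGEPAGE, and assert() replaces VM_BUG_ON_PGFLAGS().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the two flags we need. */
struct sketch_page {
	bool head;	/* models PageHead() */
	bool tail;	/* models PageTail() */
};

/* Assumed value: 2MiB PMD / 4KiB base page = 512 = 1 << 9. */
#define SKETCH_HPAGE_PMD_ORDER 9

/* Define to 0 to model a build without CONFIG_TRANSPARENT_HUGEPAGE. */
#ifndef SKETCH_THP
#define SKETCH_THP 1
#endif

static inline unsigned int sketch_thp_order(const struct sketch_page *page)
{
	/* Callers must pass a head or base page, never a tail page. */
	assert(!page->tail);
#if SKETCH_THP
	if (page->head)
		return SKETCH_HPAGE_PMD_ORDER;
#endif
	return 0;
}
```

With SKETCH_THP left at 1, a head page reports the PMD order and a base page reports 0. Building with -DSKETCH_THP=0 makes every call return 0, which is what lets the compiler fold away code that depends on the order.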
* [patch 10/39] mm: add thp_size
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, david, kirill.shutemov, linux-mm, mike.kravetz, mm-commits,
torvalds, william.kucharski, willy, ziy
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: add thp_size
This function returns the number of bytes in a THP. It is like
page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
is disabled.
Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
drivers/nvdimm/btt.c | 4 +---
drivers/nvdimm/pmem.c | 6 ++----
include/linux/huge_mm.h | 11 +++++++++++
mm/internal.h | 2 +-
mm/page_io.c | 2 +-
mm/page_vma_mapped.c | 4 ++--
6 files changed, 18 insertions(+), 11 deletions(-)
--- a/drivers/nvdimm/btt.c~mm-add-thp_size
+++ a/drivers/nvdimm/btt.c
@@ -1490,10 +1490,8 @@ static int btt_rw_page(struct block_devi
{
struct btt *btt = bdev->bd_disk->private_data;
int rc;
- unsigned int len;
- len = hpage_nr_pages(page) * PAGE_SIZE;
- rc = btt_do_bvec(btt, NULL, page, len, 0, op, sector);
+ rc = btt_do_bvec(btt, NULL, page, thp_size(page), 0, op, sector);
if (rc == 0)
page_endio(page, op_is_write(op), 0);
--- a/drivers/nvdimm/pmem.c~mm-add-thp_size
+++ a/drivers/nvdimm/pmem.c
@@ -238,11 +238,9 @@ static int pmem_rw_page(struct block_dev
blk_status_t rc;
if (op_is_write(op))
- rc = pmem_do_write(pmem, page, 0, sector,
- hpage_nr_pages(page) * PAGE_SIZE);
+ rc = pmem_do_write(pmem, page, 0, sector, thp_size(page));
else
- rc = pmem_do_read(pmem, page, 0, sector,
- hpage_nr_pages(page) * PAGE_SIZE);
+ rc = pmem_do_read(pmem, page, 0, sector, thp_size(page));
/*
* The ->rw_page interface is subtle and tricky. The core
* retries on any error, so we can only invoke page_endio() in
--- a/include/linux/huge_mm.h~mm-add-thp_size
+++ a/include/linux/huge_mm.h
@@ -462,4 +462,15 @@ static inline bool thp_migration_support
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+/**
+ * thp_size - Size of a transparent huge page.
+ * @page: Head page of a transparent huge page.
+ *
+ * Return: Number of bytes in this page.
+ */
+static inline unsigned long thp_size(struct page *page)
+{
+ return PAGE_SIZE << thp_order(page);
+}
+
#endif /* _LINUX_HUGE_MM_H */
--- a/mm/internal.h~mm-add-thp_size
+++ a/mm/internal.h
@@ -396,7 +396,7 @@ vma_address(struct page *page, struct vm
unsigned long start, end;
start = __vma_address(page, vma);
- end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+ end = start + thp_size(page) - PAGE_SIZE;
/* page should be within @vma mapping range */
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
--- a/mm/page_io.c~mm-add-thp_size
+++ a/mm/page_io.c
@@ -40,7 +40,7 @@ static struct bio *get_swap_bio(gfp_t gf
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
bio->bi_end_io = end_io;
- bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
+ bio_add_page(bio, page, thp_size(page), 0);
}
return bio;
}
--- a/mm/page_vma_mapped.c~mm-add-thp_size
+++ a/mm/page_vma_mapped.c
@@ -227,7 +227,7 @@ next_pte:
if (pvmw->address >= pvmw->vma->vm_end ||
pvmw->address >=
__vma_address(pvmw->page, pvmw->vma) +
- hpage_nr_pages(pvmw->page) * PAGE_SIZE)
+ thp_size(pvmw->page))
return not_found(pvmw);
/* Did we cross page table boundary? */
if (pvmw->address % PMD_SIZE == 0) {
@@ -268,7 +268,7 @@ int page_mapped_in_vma(struct page *page
unsigned long start, end;
start = __vma_address(page, vma);
- end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+ end = start + thp_size(page) - PAGE_SIZE;
if (unlikely(end < vma->vm_start || start >= vma->vm_end))
return 0;
_
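The conversions in this patch rely on PAGE_SIZE << order being the same quantity as nr_pages * PAGE_SIZE when nr_pages is 1 << order. The following userspace sketch checks that arithmetic for the vma_address() style "end" computation. The sketch_ names are hypothetical, and the 4KiB base page size is an assumption (the real value is per-architecture).

```c
#include <assert.h>

/* Assumed base page size of 4KiB; the real value is per-architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Mirrors thp_size(): bytes in a page of the given order. */
static inline unsigned long sketch_thp_size(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/* Old form used in vma_address(): PAGE_SIZE * (nr_pages - 1). */
static inline unsigned long sketch_end_old(unsigned long start,
					   unsigned int order)
{
	unsigned long nr_pages = 1UL << order;

	return start + SKETCH_PAGE_SIZE * (nr_pages - 1);
}

/* New form: thp_size() - PAGE_SIZE. */
static inline unsigned long sketch_end_new(unsigned long start,
					   unsigned int order)
{
	return start + sketch_thp_size(order) - SKETCH_PAGE_SIZE;
}

/* Checks that both forms agree for one start address and order. */
static inline void sketch_check_end(unsigned long start, unsigned int order)
{
	assert(sketch_end_old(start, order) == sketch_end_new(start, order));
}
```

For order 0 both forms reduce to start, so the base-page case is unchanged by the conversion.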
* [patch 11/39] mm: replace hpage_nr_pages with thp_nr_pages
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, david, kirill.shutemov, linux-mm, mike.kravetz, mm-commits,
torvalds, william.kucharski, willy, ziy
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.
[akpm@linux-foundation.org: fix mm/migrate.c]
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/huge_mm.h | 13 +++++++++----
include/linux/mm_inline.h | 6 +++---
include/linux/pagemap.h | 6 +++---
mm/compaction.c | 2 +-
mm/filemap.c | 2 +-
mm/gup.c | 2 +-
mm/internal.h | 2 +-
mm/memcontrol.c | 10 +++++-----
mm/memory_hotplug.c | 7 +++----
mm/mempolicy.c | 2 +-
mm/migrate.c | 18 +++++++++---------
mm/mlock.c | 9 ++++-----
mm/page_io.c | 2 +-
mm/page_vma_mapped.c | 2 +-
mm/rmap.c | 8 ++++----
mm/swap.c | 16 ++++++++--------
mm/swap_state.c | 6 +++---
mm/swapfile.c | 2 +-
mm/vmscan.c | 6 +++---
mm/workingset.c | 6 +++---
20 files changed, 65 insertions(+), 62 deletions(-)
--- a/include/linux/huge_mm.h~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/include/linux/huge_mm.h
@@ -271,9 +271,14 @@ static inline unsigned int thp_order(str
return 0;
}
-static inline int hpage_nr_pages(struct page *page)
+/**
+ * thp_nr_pages - The number of regular pages in this huge page.
+ * @page: The head page of a huge page.
+ */
+static inline int thp_nr_pages(struct page *page)
{
- if (unlikely(PageTransHuge(page)))
+ VM_BUG_ON_PGFLAGS(PageTail(page), page);
+ if (PageHead(page))
return HPAGE_PMD_NR;
return 1;
}
@@ -336,9 +341,9 @@ static inline unsigned int thp_order(str
return 0;
}
-static inline int hpage_nr_pages(struct page *page)
+static inline int thp_nr_pages(struct page *page)
{
- VM_BUG_ON_PAGE(PageTail(page), page);
+ VM_BUG_ON_PGFLAGS(PageTail(page), page);
return 1;
}
--- a/include/linux/mm_inline.h~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/include/linux/mm_inline.h
@@ -48,14 +48,14 @@ static __always_inline void update_lru_s
static __always_inline void add_page_to_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
- update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+ update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
list_add(&page->lru, &lruvec->lists[lru]);
}
static __always_inline void add_page_to_lru_list_tail(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
- update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+ update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
list_add_tail(&page->lru, &lruvec->lists[lru]);
}
@@ -63,7 +63,7 @@ static __always_inline void del_page_fro
struct lruvec *lruvec, enum lru_list lru)
{
list_del(&page->lru);
- update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
+ update_lru_size(lruvec, lru, page_zonenum(page), -thp_nr_pages(page));
}
/**
--- a/include/linux/pagemap.h~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/include/linux/pagemap.h
@@ -381,7 +381,7 @@ static inline struct page *find_subpage(
if (PageHuge(head))
return head;
- return head + (index & (hpage_nr_pages(head) - 1));
+ return head + (index & (thp_nr_pages(head) - 1));
}
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
@@ -773,7 +773,7 @@ static inline struct page *readahead_pag
page = xa_load(&rac->mapping->i_pages, rac->_index);
VM_BUG_ON_PAGE(!PageLocked(page), page);
- rac->_batch_count = hpage_nr_pages(page);
+ rac->_batch_count = thp_nr_pages(page);
return page;
}
@@ -796,7 +796,7 @@ static inline unsigned int __readahead_b
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageTail(page), page);
array[i++] = page;
- rac->_batch_count += hpage_nr_pages(page);
+ rac->_batch_count += thp_nr_pages(page);
/*
* The page cache isn't using multi-index entries yet,
--- a/mm/compaction.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/compaction.c
@@ -1009,7 +1009,7 @@ isolate_migratepages_block(struct compac
del_page_from_lru_list(page, lruvec, page_lru(page));
mod_node_page_state(page_pgdat(page),
NR_ISOLATED_ANON + page_is_file_lru(page),
- hpage_nr_pages(page));
+ thp_nr_pages(page));
isolate_success:
list_add(&page->lru, &cc->migratepages);
--- a/mm/filemap.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/filemap.c
@@ -198,7 +198,7 @@ static void unaccount_page_cache_page(st
if (PageHuge(page))
return;
- nr = hpage_nr_pages(page);
+ nr = thp_nr_pages(page);
__mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
if (PageSwapBacked(page)) {
--- a/mm/gup.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/gup.c
@@ -1637,7 +1637,7 @@ check_again:
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON +
page_is_file_lru(head),
- hpage_nr_pages(head));
+ thp_nr_pages(head));
}
}
}
--- a/mm/internal.h~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/internal.h
@@ -369,7 +369,7 @@ extern void clear_page_mlock(struct page
static inline void mlock_migrate_page(struct page *newpage, struct page *page)
{
if (TestClearPageMlocked(page)) {
- int nr_pages = hpage_nr_pages(page);
+ int nr_pages = thp_nr_pages(page);
/* Holding pmd lock, no change in irq context: __mod is safe */
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
--- a/mm/memcontrol.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/memcontrol.c
@@ -5589,7 +5589,7 @@ static int mem_cgroup_move_account(struc
{
struct lruvec *from_vec, *to_vec;
struct pglist_data *pgdat;
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ unsigned int nr_pages = compound ? thp_nr_pages(page) : 1;
int ret;
VM_BUG_ON(from == to);
@@ -6682,7 +6682,7 @@ void mem_cgroup_calculate_protection(str
*/
int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
- unsigned int nr_pages = hpage_nr_pages(page);
+ unsigned int nr_pages = thp_nr_pages(page);
struct mem_cgroup *memcg = NULL;
int ret = 0;
@@ -6912,7 +6912,7 @@ void mem_cgroup_migrate(struct page *old
return;
/* Force-charge the new page. The old one will be freed soon */
- nr_pages = hpage_nr_pages(newpage);
+ nr_pages = thp_nr_pages(newpage);
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
@@ -7114,7 +7114,7 @@ void mem_cgroup_swapout(struct page *pag
* ancestor for the swap instead and transfer the memory+swap charge.
*/
swap_memcg = mem_cgroup_id_get_online(memcg);
- nr_entries = hpage_nr_pages(page);
+ nr_entries = thp_nr_pages(page);
/* Get references for the tail pages, too */
if (nr_entries > 1)
mem_cgroup_id_get_many(swap_memcg, nr_entries - 1);
@@ -7158,7 +7158,7 @@ void mem_cgroup_swapout(struct page *pag
*/
int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
{
- unsigned int nr_pages = hpage_nr_pages(page);
+ unsigned int nr_pages = thp_nr_pages(page);
struct page_counter *counter;
struct mem_cgroup *memcg;
unsigned short oldid;
--- a/mm/memory_hotplug.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/memory_hotplug.c
@@ -1299,7 +1299,7 @@ static int
do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
{
unsigned long pfn;
- struct page *page;
+ struct page *page, *head;
int ret = 0;
LIST_HEAD(source);
@@ -1307,15 +1307,14 @@ do_migrate_range(unsigned long start_pfn
if (!pfn_valid(pfn))
continue;
page = pfn_to_page(pfn);
+ head = compound_head(page);
if (PageHuge(page)) {
- struct page *head = compound_head(page);
pfn = page_to_pfn(head) + compound_nr(head) - 1;
isolate_huge_page(head, &source);
continue;
} else if (PageTransHuge(page))
- pfn = page_to_pfn(compound_head(page))
- + hpage_nr_pages(page) - 1;
+ pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
/*
* HWPoison pages have elevated reference counts so the migration would
--- a/mm/mempolicy.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/mempolicy.c
@@ -1049,7 +1049,7 @@ static int migrate_page_add(struct page
list_add_tail(&head->lru, pagelist);
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON + page_is_file_lru(head),
- hpage_nr_pages(head));
+ thp_nr_pages(head));
} else if (flags & MPOL_MF_STRICT) {
/*
* Non-movable page may reach here. And, there may be
--- a/mm/migrate.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/migrate.c
@@ -193,7 +193,7 @@ void putback_movable_pages(struct list_h
put_page(page);
} else {
mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
- page_is_file_lru(page), -hpage_nr_pages(page));
+ page_is_file_lru(page), -thp_nr_pages(page));
putback_lru_page(page);
}
}
@@ -386,7 +386,7 @@ static int expected_page_refs(struct add
*/
expected_count += is_device_private_page(page);
if (mapping)
- expected_count += hpage_nr_pages(page) + page_has_private(page);
+ expected_count += thp_nr_pages(page) + page_has_private(page);
return expected_count;
}
@@ -441,7 +441,7 @@ int migrate_page_move_mapping(struct add
*/
newpage->index = page->index;
newpage->mapping = page->mapping;
- page_ref_add(newpage, hpage_nr_pages(page)); /* add cache reference */
+ page_ref_add(newpage, thp_nr_pages(page)); /* add cache reference */
if (PageSwapBacked(page)) {
__SetPageSwapBacked(newpage);
if (PageSwapCache(page)) {
@@ -474,7 +474,7 @@ int migrate_page_move_mapping(struct add
* to one less reference.
* We know this isn't the last reference.
*/
- page_ref_unfreeze(page, expected_count - hpage_nr_pages(page));
+ page_ref_unfreeze(page, expected_count - thp_nr_pages(page));
xas_unlock(&xas);
/* Leave irq disabled to prevent preemption while updating stats */
@@ -591,7 +591,7 @@ static void copy_huge_page(struct page *
} else {
/* thp page */
BUG_ON(!PageTransHuge(src));
- nr_pages = hpage_nr_pages(src);
+ nr_pages = thp_nr_pages(src);
}
for (i = 0; i < nr_pages; i++) {
@@ -1213,7 +1213,7 @@ out:
*/
if (likely(!__PageMovable(page)))
mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
- page_is_file_lru(page), -hpage_nr_pages(page));
+ page_is_file_lru(page), -thp_nr_pages(page));
}
/*
@@ -1446,7 +1446,7 @@ retry:
* during migration.
*/
is_thp = PageTransHuge(page);
- nr_subpages = hpage_nr_pages(page);
+ nr_subpages = thp_nr_pages(page);
cond_resched();
if (PageHuge(page))
@@ -1670,7 +1670,7 @@ static int add_page_for_migration(struct
list_add_tail(&head->lru, pagelist);
mod_node_page_state(page_pgdat(head),
NR_ISOLATED_ANON + page_is_file_lru(head),
- hpage_nr_pages(head));
+ thp_nr_pages(head));
}
out_putpage:
/*
@@ -2034,7 +2034,7 @@ static int numamigrate_isolate_page(pg_d
page_lru = page_is_file_lru(page);
mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
- hpage_nr_pages(page));
+ thp_nr_pages(page));
/*
* Isolating the page has taken another reference, so the
--- a/mm/mlock.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/mlock.c
@@ -61,8 +61,7 @@ void clear_page_mlock(struct page *page)
if (!TestClearPageMlocked(page))
return;
- mod_zone_page_state(page_zone(page), NR_MLOCK,
- -hpage_nr_pages(page));
+ mod_zone_page_state(page_zone(page), NR_MLOCK, -thp_nr_pages(page));
count_vm_event(UNEVICTABLE_PGCLEARED);
/*
* The previous TestClearPageMlocked() corresponds to the smp_mb()
@@ -95,7 +94,7 @@ void mlock_vma_page(struct page *page)
if (!TestSetPageMlocked(page)) {
mod_zone_page_state(page_zone(page), NR_MLOCK,
- hpage_nr_pages(page));
+ thp_nr_pages(page));
count_vm_event(UNEVICTABLE_PGMLOCKED);
if (!isolate_lru_page(page))
putback_lru_page(page);
@@ -192,7 +191,7 @@ unsigned int munlock_vma_page(struct pag
/*
* Serialize with any parallel __split_huge_page_refcount() which
* might otherwise copy PageMlocked to part of the tail pages before
- * we clear it in the head page. It also stabilizes hpage_nr_pages().
+ * we clear it in the head page. It also stabilizes thp_nr_pages().
*/
spin_lock_irq(&pgdat->lru_lock);
@@ -202,7 +201,7 @@ unsigned int munlock_vma_page(struct pag
goto unlock_out;
}
- nr_pages = hpage_nr_pages(page);
+ nr_pages = thp_nr_pages(page);
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
if (__munlock_isolate_lru_page(page, true)) {
--- a/mm/page_io.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/page_io.c
@@ -274,7 +274,7 @@ static inline void count_swpout_vm_event
if (unlikely(PageTransHuge(page)))
count_vm_event(THP_SWPOUT);
#endif
- count_vm_events(PSWPOUT, hpage_nr_pages(page));
+ count_vm_events(PSWPOUT, thp_nr_pages(page));
}
#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
--- a/mm/page_vma_mapped.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/page_vma_mapped.c
@@ -61,7 +61,7 @@ static inline bool pfn_is_match(struct p
return page_pfn == pfn;
/* THP can be referenced by any subpage */
- return pfn >= page_pfn && pfn - page_pfn < hpage_nr_pages(page);
+ return pfn >= page_pfn && pfn - page_pfn < thp_nr_pages(page);
}
/**
--- a/mm/rmap.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/rmap.c
@@ -1130,7 +1130,7 @@ void do_page_add_anon_rmap(struct page *
}
if (first) {
- int nr = compound ? hpage_nr_pages(page) : 1;
+ int nr = compound ? thp_nr_pages(page) : 1;
/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
@@ -1169,7 +1169,7 @@ void do_page_add_anon_rmap(struct page *
void page_add_new_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address, bool compound)
{
- int nr = compound ? hpage_nr_pages(page) : 1;
+ int nr = compound ? thp_nr_pages(page) : 1;
VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
__SetPageSwapBacked(page);
@@ -1860,7 +1860,7 @@ static void rmap_walk_anon(struct page *
return;
pgoff_start = page_to_pgoff(page);
- pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+ pgoff_end = pgoff_start + thp_nr_pages(page) - 1;
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
pgoff_start, pgoff_end) {
struct vm_area_struct *vma = avc->vma;
@@ -1913,7 +1913,7 @@ static void rmap_walk_file(struct page *
return;
pgoff_start = page_to_pgoff(page);
- pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+ pgoff_end = pgoff_start + thp_nr_pages(page) - 1;
if (!locked)
i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap,
--- a/mm/swap.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/swap.c
@@ -241,7 +241,7 @@ static void pagevec_move_tail_fn(struct
del_page_from_lru_list(page, lruvec, page_lru(page));
ClearPageActive(page);
add_page_to_lru_list_tail(page, lruvec, page_lru(page));
- (*pgmoved) += hpage_nr_pages(page);
+ (*pgmoved) += thp_nr_pages(page);
}
}
@@ -312,7 +312,7 @@ void lru_note_cost(struct lruvec *lruvec
void lru_note_cost_page(struct page *page)
{
lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
- page_is_file_lru(page), hpage_nr_pages(page));
+ page_is_file_lru(page), thp_nr_pages(page));
}
static void __activate_page(struct page *page, struct lruvec *lruvec,
@@ -320,7 +320,7 @@ static void __activate_page(struct page
{
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
int lru = page_lru_base_type(page);
- int nr_pages = hpage_nr_pages(page);
+ int nr_pages = thp_nr_pages(page);
del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
@@ -500,7 +500,7 @@ void lru_cache_add_inactive_or_unevictab
* lock is held(spinlock), which implies preemption disabled.
*/
__mod_zone_page_state(page_zone(page), NR_MLOCK,
- hpage_nr_pages(page));
+ thp_nr_pages(page));
count_vm_event(UNEVICTABLE_PGMLOCKED);
}
lru_cache_add(page);
@@ -532,7 +532,7 @@ static void lru_deactivate_file_fn(struc
{
int lru;
bool active;
- int nr_pages = hpage_nr_pages(page);
+ int nr_pages = thp_nr_pages(page);
if (!PageLRU(page))
return;
@@ -580,7 +580,7 @@ static void lru_deactivate_fn(struct pag
{
if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
int lru = page_lru_base_type(page);
- int nr_pages = hpage_nr_pages(page);
+ int nr_pages = thp_nr_pages(page);
del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
ClearPageActive(page);
@@ -599,7 +599,7 @@ static void lru_lazyfree_fn(struct page
if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
!PageSwapCache(page) && !PageUnevictable(page)) {
bool active = PageActive(page);
- int nr_pages = hpage_nr_pages(page);
+ int nr_pages = thp_nr_pages(page);
del_page_from_lru_list(page, lruvec,
LRU_INACTIVE_ANON + active);
@@ -972,7 +972,7 @@ static void __pagevec_lru_add_fn(struct
{
enum lru_list lru;
int was_unevictable = TestClearPageUnevictable(page);
- int nr_pages = hpage_nr_pages(page);
+ int nr_pages = thp_nr_pages(page);
VM_BUG_ON_PAGE(PageLRU(page), page);
--- a/mm/swapfile.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/swapfile.c
@@ -1370,7 +1370,7 @@ void put_swap_page(struct page *page, sw
unsigned char *map;
unsigned int i, free_entries = 0;
unsigned char val;
- int size = swap_entry_size(hpage_nr_pages(page));
+ int size = swap_entry_size(thp_nr_pages(page));
si = _swap_info_get(entry);
if (!si)
--- a/mm/swap_state.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/swap_state.c
@@ -130,7 +130,7 @@ int add_to_swap_cache(struct page *page,
struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
- unsigned long i, nr = hpage_nr_pages(page);
+ unsigned long i, nr = thp_nr_pages(page);
void *old;
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -183,7 +183,7 @@ void __delete_from_swap_cache(struct pag
swp_entry_t entry, void *shadow)
{
struct address_space *address_space = swap_address_space(entry);
- int i, nr = hpage_nr_pages(page);
+ int i, nr = thp_nr_pages(page);
pgoff_t idx = swp_offset(entry);
XA_STATE(xas, &address_space->i_pages, idx);
@@ -278,7 +278,7 @@ void delete_from_swap_cache(struct page
xa_unlock_irq(&address_space->i_pages);
put_swap_page(page, entry);
- page_ref_sub(page, hpage_nr_pages(page));
+ page_ref_sub(page, thp_nr_pages(page));
}
void clear_shadow_from_swap_cache(int type, unsigned long begin,
--- a/mm/vmscan.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/vmscan.c
@@ -1354,7 +1354,7 @@ static unsigned int shrink_page_list(str
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
- stat->nr_pageout += hpage_nr_pages(page);
+ stat->nr_pageout += thp_nr_pages(page);
if (PageWriteback(page))
goto keep;
@@ -1862,7 +1862,7 @@ static unsigned noinline_for_stack move_
SetPageLRU(page);
lru = page_lru(page);
- nr_pages = hpage_nr_pages(page);
+ nr_pages = thp_nr_pages(page);
update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
list_move(&page->lru, &lruvec->lists[lru]);
@@ -2065,7 +2065,7 @@ static void shrink_active_list(unsigned
* so we ignore them here.
*/
if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) {
- nr_rotated += hpage_nr_pages(page);
+ nr_rotated += thp_nr_pages(page);
list_add(&page->lru, &l_active);
continue;
}
--- a/mm/workingset.c~mm-replace-hpage_nr_pages-with-thp_nr_pages
+++ a/mm/workingset.c
@@ -263,7 +263,7 @@ void *workingset_eviction(struct page *p
VM_BUG_ON_PAGE(!PageLocked(page), page);
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
- workingset_age_nonresident(lruvec, hpage_nr_pages(page));
+ workingset_age_nonresident(lruvec, thp_nr_pages(page));
/* XXX: target_memcg can be NULL, go through lruvec */
memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
eviction = atomic_long_read(&lruvec->nonresident_age);
@@ -374,7 +374,7 @@ void workingset_refault(struct page *pag
goto out;
SetPageActive(page);
- workingset_age_nonresident(lruvec, hpage_nr_pages(page));
+ workingset_age_nonresident(lruvec, thp_nr_pages(page));
inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);
/* Page was active prior to eviction */
@@ -411,7 +411,7 @@ void workingset_activation(struct page *
if (!mem_cgroup_disabled() && !memcg)
goto out;
lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- workingset_age_nonresident(lruvec, hpage_nr_pages(page));
+ workingset_age_nonresident(lruvec, thp_nr_pages(page));
out:
rcu_read_unlock();
}
_
* [patch 12/39] mm: add thp_head
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, david, kirill.shutemov, linux-mm, mike.kravetz, mm-commits,
torvalds, william.kucharski, willy, ziy
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: add thp_head
This is like compound_head() but compiles away when
CONFIG_TRANSPARENT_HUGEPAGE is not enabled.
Link: http://lkml.kernel.org/r/20200629151959.15779-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/huge_mm.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
--- a/include/linux/huge_mm.h~mm-add-thp_head
+++ a/include/linux/huge_mm.h
@@ -260,6 +260,15 @@ static inline spinlock_t *pud_trans_huge
}
/**
+ * thp_head - Head page of a transparent huge page.
+ * @page: Any page (tail, head or regular) found in the page cache.
+ */
+static inline struct page *thp_head(struct page *page)
+{
+ return compound_head(page);
+}
+
+/**
* thp_order - Order of a transparent huge page.
* @page: Head page of a transparent huge page.
*/
@@ -335,6 +344,12 @@ static inline struct list_head *page_def
#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
+static inline struct page *thp_head(struct page *page)
+{
+ VM_BUG_ON_PGFLAGS(PageTail(page), page);
+ return page;
+}
+
static inline unsigned int thp_order(struct page *page)
{
VM_BUG_ON_PGFLAGS(PageTail(page), page);
_
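Since thp_head() simply forwards to compound_head() when THP is enabled, it may help to see how a tail page can find its head. The sketch below is a userspace model, assuming the encoding in which a tail page stores the head page's address with bit 0 set. struct sketch_page and the sketch_ helpers are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page; only the field the sketch needs. */
struct sketch_page {
	uintptr_t compound_head;	/* tail pages: head pointer with bit 0 set */
};

/* Marks @tail as a tail page belonging to @head. */
static inline void sketch_set_compound_head(struct sketch_page *tail,
					    struct sketch_page *head)
{
	/* Bit 0 must be free, i.e. the struct is at least 2-byte aligned. */
	assert(((uintptr_t)head & 1) == 0);
	tail->compound_head = (uintptr_t)head | 1;
}

/* Returns the head for a tail page, or the page itself otherwise. */
static inline struct sketch_page *sketch_thp_head(struct sketch_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct sketch_page *)(head - 1);
	return page;
}
```

In a build without THP there are no tail pages to resolve, which is why the !CONFIG_TRANSPARENT_HUGEPAGE variant in the patch can assert that and return its argument.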
* [patch 13/39] mm: introduce offset_in_thp
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, david, kirill.shutemov, linux-mm, mike.kravetz, mm-commits,
torvalds, william.kucharski, willy, ziy
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: introduce offset_in_thp
Mirroring offset_in_page(), this gives you the offset within this
particular page, no matter what size page it is. It optimises down to
offset_in_page() if CONFIG_TRANSPARENT_HUGEPAGE is not set.
Link: http://lkml.kernel.org/r/20200629151959.15779-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/mm.h | 1 +
1 file changed, 1 insertion(+)
--- a/include/linux/mm.h~mm-introduce-offset_in_thp
+++ a/include/linux/mm.h
@@ -1594,6 +1594,7 @@ static inline void clear_page_pfmemalloc
extern void pagefault_out_of_memory(void);
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
+#define offset_in_thp(page, p) ((unsigned long)(p) & (thp_size(page) - 1))
/*
* Flags passed to show_mem() and show_free_areas() to suppress output in
_
* [patch 14/39] fs: autofs: delete repeated words in comments
2020-08-15 0:29 incoming Andrew Morton
` (12 preceding siblings ...)
2020-08-15 0:30 ` [patch 13/39] mm: introduce offset_in_thp Andrew Morton
@ 2020-08-15 0:30 ` Andrew Morton
2020-08-15 0:30 ` [patch 15/39] mm/madvise: pass task and mm to do_madvise Andrew Morton
` (25 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, linux-mm, mm-commits, raven, rdunlap, torvalds
From: Randy Dunlap <rdunlap@infradead.org>
Subject: fs: autofs: delete repeated words in comments
Drop duplicated words {the, at} in comments.
Link: http://lkml.kernel.org/r/20200811021817.24982-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/autofs/dev-ioctl.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/fs/autofs/dev-ioctl.c~fs-autofs-delete-repeated-words-in-comments
+++ a/fs/autofs/dev-ioctl.c
@@ -20,7 +20,7 @@
* another mount. This situation arises when starting automount(8)
* or other user space daemon which uses direct mounts or offset
* mounts (used for autofs lazy mount/umount of nested mount trees),
- * which have been left busy at at service shutdown.
+ * which have been left busy at service shutdown.
*/
typedef int (*ioctl_fn)(struct file *, struct autofs_sb_info *,
@@ -496,7 +496,7 @@ static int autofs_dev_ioctl_askumount(st
* located path is the root of a mount we return 1 along with
* the super magic of the mount or 0 otherwise.
*
- * In both cases the the device number (as returned by
+ * In both cases the device number (as returned by
* new_encode_dev()) is also returned.
*/
static int autofs_dev_ioctl_ismountpoint(struct file *fp,
_
* [patch 15/39] mm/madvise: pass task and mm to do_madvise
2020-08-15 0:29 incoming Andrew Morton
` (13 preceding siblings ...)
2020-08-15 0:30 ` [patch 14/39] fs: autofs: delete repeated words in comments Andrew Morton
@ 2020-08-15 0:30 ` Andrew Morton
2020-08-15 0:30 ` [patch 16/39] pid: move pidfd_get_pid() to pid.c Andrew Morton
` (24 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, alexander.h.duyck, axboe, bgeffon, christian.brauner,
christian, dancol, hannes, jannh, joaodias, joel, ktkhai,
linux-man, linux-mm, mhocko, minchan, mm-commits, oleksandr,
rientjes, shakeelb, sj38.park, sjpark, sonnyrao, sspatil, surenb,
timmurray, torvalds, vbabka
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: pass task and mm to do_madvise
Patch series "introduce memory hinting API for external process", v8.
We now have MADV_PAGEOUT and MADV_COLD as madvise hints. With them, an
application can tell the kernel which memory ranges it would prefer to
have reclaimed. However, on some platforms (e.g., Android), the
information required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon (e.g.,
ActivityManagerService), and that daemon must be able to initiate reclaim
on its own without any app involvement.
To address this, this patch series introduces a new syscall,
process_madvise(2). It is basically the same as the madvise(2) syscall,
with some differences:
1. It needs the pidfd of the target process to provide the hint.
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at the
moment. Other madvise hints will be opened up when there are explicit
requests from the community, to prevent unexpected bugs we couldn't support.
3. Only privileged processes can operate on another process's
address space.
For more detail of the new API, please see "mm: introduce external memory
hinting API" description in this patchset.
This patch (of 4):
In upcoming patches, do_madvise will be called from the context of an
external process, so we shouldn't assume "current" is always the hinted
process's task_struct.
Furthermore, we must not access mm_struct via task->mm, but obtain it
via access_mm() once (in the following patch) and only use that pointer
[1], so pass it to do_madvise() as well. Note the vma->vm_mm pointers
are safe, so we can use them further down the call stack.
So pass *current* and current->mm as arguments to do_madvise. This does
not change existing behavior, but prepares for the next patch and makes
review easier.
Note: io_madvise passes NULL as the target_task argument of do_madvise
because it cannot know which task is the target.
[1] http://lore.kernel.org/r/CAG48ez27=pwm5m_N_988xT1huO7g7h6arTQL44zev6TD-h-7Tg@mail.gmail.com
[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]
Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
fs/io_uring.c | 2 +-
include/linux/mm.h | 3 ++-
mm/madvise.c | 40 +++++++++++++++++++++++-----------------
3 files changed, 26 insertions(+), 19 deletions(-)
--- a/fs/io_uring.c~mm-madvise-pass-task-and-mm-to-do_madvise
+++ a/fs/io_uring.c
@@ -3731,7 +3731,7 @@ static int io_madvise(struct io_kiocb *r
if (force_nonblock)
return -EAGAIN;
- ret = do_madvise(ma->addr, ma->len, ma->advice);
+ ret = do_madvise(NULL, current->mm, ma->addr, ma->len, ma->advice);
if (ret < 0)
req_set_fail_links(req);
io_req_complete(req, ret);
--- a/include/linux/mm.h~mm-madvise-pass-task-and-mm-to-do_madvise
+++ a/include/linux/mm.h
@@ -2547,7 +2547,8 @@ extern int __do_munmap(struct mm_struct
struct list_head *uf, bool downgrade);
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
struct list_head *uf);
-extern int do_madvise(unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior);
#ifdef CONFIG_MMU
extern int __mm_populate(unsigned long addr, unsigned long len,
--- a/mm/madvise.c~mm-madvise-pass-task-and-mm-to-do_madvise
+++ a/mm/madvise.c
@@ -22,12 +22,14 @@
#include <linux/file.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
+#include <linux/compat.h>
#include <linux/pagewalk.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/shmem_fs.h>
#include <linux/mmu_notifier.h>
#include <linux/sched/mm.h>
+#include <linux/uio.h>
#include <asm/tlb.h>
@@ -255,6 +257,7 @@ static long madvise_willneed(struct vm_a
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
+ struct mm_struct *mm = vma->vm_mm;
struct file *file = vma->vm_file;
loff_t offset;
@@ -289,12 +292,12 @@ static long madvise_willneed(struct vm_a
*/
*prev = NULL; /* tell sys_madvise we drop mmap_lock */
get_file(file);
- mmap_read_unlock(current->mm);
+ mmap_read_unlock(mm);
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
fput(file);
- mmap_read_lock(current->mm);
+ mmap_read_lock(mm);
return 0;
}
@@ -683,7 +686,6 @@ out:
if (nr_swap) {
if (current->mm == mm)
sync_mm_rss(mm);
-
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
}
arch_leave_lazy_mmu_mode();
@@ -763,6 +765,8 @@ static long madvise_dontneed_free(struct
unsigned long start, unsigned long end,
int behavior)
{
+ struct mm_struct *mm = vma->vm_mm;
+
*prev = vma;
if (!can_madv_lru_vma(vma))
return -EINVAL;
@@ -770,8 +774,8 @@ static long madvise_dontneed_free(struct
if (!userfaultfd_remove(vma, start, end)) {
*prev = NULL; /* mmap_lock has been dropped, prev is stale */
- mmap_read_lock(current->mm);
- vma = find_vma(current->mm, start);
+ mmap_read_lock(mm);
+ vma = find_vma(mm, start);
if (!vma)
return -ENOMEM;
if (start < vma->vm_start) {
@@ -825,6 +829,7 @@ static long madvise_remove(struct vm_are
loff_t offset;
int error;
struct file *f;
+ struct mm_struct *mm = vma->vm_mm;
*prev = NULL; /* tell sys_madvise we drop mmap_lock */
@@ -852,13 +857,13 @@ static long madvise_remove(struct vm_are
get_file(f);
if (userfaultfd_remove(vma, start, end)) {
/* mmap_lock was not released by userfaultfd_remove() */
- mmap_read_unlock(current->mm);
+ mmap_read_unlock(mm);
}
error = vfs_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, end - start);
fput(f);
- mmap_read_lock(current->mm);
+ mmap_read_lock(mm);
return error;
}
@@ -1051,7 +1056,8 @@ madvise_behavior_valid(int behavior)
* -EBADF - map exists, but area maps something that isn't a file.
* -EAGAIN - a kernel resource was temporarily unavailable.
*/
-int do_madvise(unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -1089,7 +1095,7 @@ int do_madvise(unsigned long start, size
write = madvise_need_mmap_write(behavior);
if (write) {
- if (mmap_write_lock_killable(current->mm))
+ if (mmap_write_lock_killable(mm))
return -EINTR;
/*
@@ -1104,12 +1110,12 @@ int do_madvise(unsigned long start, size
* but for now we have the mmget_still_valid()
* model.
*/
- if (!mmget_still_valid(current->mm)) {
- mmap_write_unlock(current->mm);
+ if (!mmget_still_valid(mm)) {
+ mmap_write_unlock(mm);
return -EINTR;
}
} else {
- mmap_read_lock(current->mm);
+ mmap_read_lock(mm);
}
/*
@@ -1117,7 +1123,7 @@ int do_madvise(unsigned long start, size
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(current->mm, start, &prev);
+ vma = find_vma_prev(mm, start, &prev);
if (vma && start > vma->vm_start)
prev = vma;
@@ -1154,19 +1160,19 @@ int do_madvise(unsigned long start, size
if (prev)
vma = prev->vm_next;
else /* madvise_remove dropped mmap_lock */
- vma = find_vma(current->mm, start);
+ vma = find_vma(mm, start);
}
out:
blk_finish_plug(&plug);
if (write)
- mmap_write_unlock(current->mm);
+ mmap_write_unlock(mm);
else
- mmap_read_unlock(current->mm);
+ mmap_read_unlock(mm);
return error;
}
SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
- return do_madvise(start, len_in, behavior);
+ return do_madvise(current, current->mm, start, len_in, behavior);
}
_
* [patch 16/39] pid: move pidfd_get_pid() to pid.c
2020-08-15 0:29 incoming Andrew Morton
` (14 preceding siblings ...)
2020-08-15 0:30 ` [patch 15/39] mm/madvise: pass task and mm to do_madvise Andrew Morton
@ 2020-08-15 0:30 ` Andrew Morton
2020-08-15 0:30 ` [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API Andrew Morton
` (23 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, alexander.h.duyck, axboe, bgeffon, christian.brauner,
dancol, hannes, jannh, joaodias, joel, ktkhai, linux-man,
linux-mm, mhocko, minchan, mm-commits, oleksandr, rientjes,
shakeelb, sj38.park, sjpark, sonnyrao, sspatil, surenb,
timmurray, torvalds, vbabka
From: Minchan Kim <minchan@kernel.org>
Subject: pid: move pidfd_get_pid() to pid.c
The process_madvise syscall needs the pidfd_get_pid() function to translate
a pidfd to a pid, so this patch moves the function to kernel/pid.c.
Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/pid.h | 1 +
kernel/exit.c | 17 -----------------
kernel/pid.c | 17 +++++++++++++++++
3 files changed, 18 insertions(+), 17 deletions(-)
--- a/include/linux/pid.h~pid-move-pidfd_get_pid-to-pidc
+++ a/include/linux/pid.h
@@ -77,6 +77,7 @@ extern const struct file_operations pidf
struct file;
extern struct pid *pidfd_pid(const struct file *file);
+struct pid *pidfd_get_pid(unsigned int fd);
static inline struct pid *get_pid(struct pid *pid)
{
--- a/kernel/exit.c~pid-move-pidfd_get_pid-to-pidc
+++ a/kernel/exit.c
@@ -1474,23 +1474,6 @@ end:
return retval;
}
-static struct pid *pidfd_get_pid(unsigned int fd)
-{
- struct fd f;
- struct pid *pid;
-
- f = fdget(fd);
- if (!f.file)
- return ERR_PTR(-EBADF);
-
- pid = pidfd_pid(f.file);
- if (!IS_ERR(pid))
- get_pid(pid);
-
- fdput(f);
- return pid;
-}
-
static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
int options, struct rusage *ru)
{
--- a/kernel/pid.c~pid-move-pidfd_get_pid-to-pidc
+++ a/kernel/pid.c
@@ -519,6 +519,23 @@ struct pid *find_ge_pid(int nr, struct p
return idr_get_next(&ns->idr, &nr);
}
+struct pid *pidfd_get_pid(unsigned int fd)
+{
+ struct fd f;
+ struct pid *pid;
+
+ f = fdget(fd);
+ if (!f.file)
+ return ERR_PTR(-EBADF);
+
+ pid = pidfd_pid(f.file);
+ if (!IS_ERR(pid))
+ get_pid(pid);
+
+ fdput(f);
+ return pid;
+}
+
/**
* pidfd_create() - Create a new pid file descriptor.
*
_
* [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
2020-08-15 0:29 incoming Andrew Morton
` (15 preceding siblings ...)
2020-08-15 0:30 ` [patch 16/39] pid: move pidfd_get_pid() to pid.c Andrew Morton
@ 2020-08-15 0:30 ` Andrew Morton
2020-08-16 8:12 ` Christian Brauner
2020-08-15 0:31 ` [patch 18/39] mm/madvise: check fatal signal pending of target process Andrew Morton
` (22 subsequent siblings)
39 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:30 UTC (permalink / raw)
To: akpm, alexander.h.duyck, axboe, bgeffon, christian.brauner,
christian, dancol, hannes, jannh, joaodias, joel, ktkhai,
linux-man, linux-mm, mhocko, minchan, mm-commits, oleksandr,
rientjes, shakeelb, sj38.park, sjpark, sonnyrao, sspatil, surenb,
timmurray, torvalds, vbabka
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
There is a use case in which System Management Software (SMS) wants to give
a memory hint like MADV_[COLD|PAGEOUT] to other processes. In the case of
Android, the SMS is the ActivityManagerService.
The information required to make the reclaim decision is not known to
the app. Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to
initiate reclaim on its own without any app involvement.
To solve the issue, this patch introduces a new syscall, process_madvise(2).
It uses the pidfd of an external process to give the hint. It also supports
a vector of address ranges, because an Android app has thousands of vmas
due to zygote, so it would be a waste of CPU and power to call the
syscall once for each vma. (Testing 2000 per-vma syscalls against a
single vectored syscall showed a 15% performance improvement. I think the
gain would be bigger in real practice because the test ran in a very
cache-friendly environment.)
Another potential use case for the vector range is to amortize the cost
of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this
could benefit users like TCP receive zerocopy and malloc implementations.
In the future, we could find more use cases for other advice values, so
let's make this part of the API now, since we are introducing a new
syscall anyway. With that, existing madvise(2) users could replace it with
process_madvise(2) on their own pid if they want batched address range
support.
Since it could affect another process's address range, only a privileged
process (PTRACE_MODE_ATTACH_FSCREDS), or one that otherwise has the right
to ptrace the target (e.g., by having the same UID), can use it
successfully. The flags argument is reserved for future use if we need to
extend the API.
I think supporting in process_madvise every hint that madvise supports or
will support is rather risky, because we are not sure all hints
make sense from an external process, and the implementation of a hint may
rely on the caller being in the current context, so it could be
error-prone. Thus, I limited the hints to MADV_[COLD|PAGEOUT] in this
patch.
If someone wants to add other hints, we can hear the use case and
review it for each hint. That is safer for maintenance than
introducing a buggy syscall that is hard to fix later.
So finally, the API is as follows:
ssize_t process_madvise(int pidfd, const struct iovec *iovec,
unsigned long vlen, int advice, unsigned int flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges from external process as well as
local process. It provides the advice to address ranges of process
described by iovec and vlen. The goal of such advice is to improve system
or application performance.
The pidfd selects the process referred to by the PID file descriptor
specified in pidfd. (See pidfd_open(2) for further information.)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
struct iovec {
void *iov_base; /* starting address */
size_t iov_len; /* number of bytes to be advised */
};
The iovec describes address ranges beginning at address iov_base
and spanning iov_len bytes.
The vlen represents the number of elements in iovec.
The advice is indicated in the advice argument, which is one of the
following at this moment if the target process specified by pidfd is
external.
MADV_COLD
MADV_PAGEOUT
Permission to provide a hint to external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
process_madvise supports every advice value that madvise(2) has if the
target process is in the same thread group as the calling process, so a
user could use process_madvise(2) to extend existing madvise(2) usage to
support vectored address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes, if an error occurred. The caller should check return value
to determine whether a partial advice occurred.
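A minimal userspace sketch of calling the interface described above,
assuming no libc wrapper is available so the raw syscall(2) is used. The
fallback syscall number 440 is taken from the syscall tables in this patch
(alpha uses 550) and is only used when the headers do not define it; the
fallback MADV_COLD value 20 is likewise an assumption for older headers.
demo_process_madvise() and demo_iov_total() are hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_process_madvise
#define __NR_process_madvise 440 /* assumption: number from this patch */
#endif
#ifndef MADV_COLD
#define MADV_COLD 20 /* assumption: value for headers that lack it */
#endif

/* Thin wrapper around the raw syscall; returns -1 and sets errno on error. */
static ssize_t demo_process_madvise(int pidfd, const struct iovec *iov,
				    unsigned long vlen, int advice,
				    unsigned int flags)
{
	return (ssize_t)syscall(__NR_process_madvise, pidfd, iov, vlen,
				advice, flags);
}

/* Total number of bytes requested, for detecting partial advice. */
static size_t demo_iov_total(const struct iovec *iov, unsigned long vlen)
{
	size_t total = 0;
	unsigned long i;

	assert(iov != NULL || vlen == 0);
	for (i = 0; i < vlen; i++)
		total += iov[i].iov_len;
	return total;
}
```

A caller would compare the return value with demo_iov_total() to detect
partial advice. On a kernel without the syscall, the wrapper returns -1
with errno set to ENOSYS.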
FAQ:
Q.1 - Why does any external entity have better knowledge?
Quote from Sandeep
"For Android, every application (including the special SystemServer)
are forked from Zygote. The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.
After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.
In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.
So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.
Besides, we can never rely on applications to clean things up
themselves. We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.
So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.
- ssp
Q.2 - How is the race (i.e., object validation) handled between an
external process giving a hint and the target process receiving it?
process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called. If the
target process can run between the time the calling process
inspects the target process address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect. It's the
responsibility of the process calling process_madvise to close this
race condition. For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called. Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process. Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm. The suggested API itself does not provide synchronization. The
same applies to other APIs like move_pages and process_vm_writev.
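One of the options above, suspending the target with SIGSTOP, could look
roughly like the following. This is a sketch under a strong assumption:
the caller is the parent of the target, so it can use waitpid(2) with
WUNTRACED to confirm that the stop has taken effect. A daemon that is not
the parent would need another way to confirm it (for example, the freezer
cgroup). The demo_* names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stop a child and wait until the stop has actually taken effect. */
static int demo_stop_child(pid_t child)
{
	int status;

	assert(child > 0);
	if (kill(child, SIGSTOP) < 0)
		return -1;
	/* WUNTRACED makes waitpid() return once the child has stopped. */
	if (waitpid(child, &status, WUNTRACED) < 0)
		return -1;
	return WIFSTOPPED(status) ? 0 : -1;
}

/* Let the child run again after the hint has been issued. */
static int demo_resume_child(pid_t child)
{
	assert(child > 0);
	return kill(child, SIGCONT);
}
```

The address space inspection and the process_madvise() call would go
between demo_stop_child() and demo_resume_child().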
The race isn't really a problem though. Why is it so wrong to require
that callers do their own synchronization in some manner? Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something. Think about mmap. It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before. That's where we need synchronization through another API or
userspace design. It shouldn't be part of the API itself. If someone
needs more fine-grained synchronization than process level,
two ideas were suggested: cookie[2] and anon-fd[3]. Both can be
implemented using the reserved last argument of the API, but I don't
think it's necessary right now, since we already have ways to prevent
the race and I don't want to add complexity with a more
fine-grained model.
To keep the API extensible, the last argument (flags) is reserved
so we could support this in the future if someone really needs it.
Q.3 - Why doesn't ptrace work?
Injecting an madvise in the target process using ptrace would not work
for us because such an injected madvise would have to be executed by the
target process, which means that process would have to be runnable, and
that creates the risk of the abovementioned race and of hinting the wrong
VMA. Furthermore, we want to act on the hint in the caller's context, not
the callee's, because the callee is usually restricted by cpuset/cgroups
or even frozen, so it can't act quickly enough by itself, which
causes more thrashing/kills. It also doesn't work if the target process is
already ptraced (e.g., by strace, a debugger, or minidump) because a
process can have at most one ptracer.
[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever
vma of process address layout are changed - Daniel Colascione -
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object(i.e., address range)
validation - Michal Hocko -
https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
[minchan@kernel.org: fix process_madvise build break for arm64]
Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/include/asm/unistd.h | 2
arch/arm64/include/asm/unistd32.h | 2
arch/ia64/kernel/syscalls/syscall.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 2
arch/xtensa/kernel/syscalls/syscall.tbl | 1
include/linux/compat.h | 4
include/linux/syscalls.h | 2
include/uapi/asm-generic/unistd.h | 4
kernel/sys_ni.c | 2
mm/madvise.c | 121 ++++++++++++++++++
23 files changed, 152 insertions(+), 2 deletions(-)
--- a/arch/alpha/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/alpha/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
547 common openat2 sys_openat2
548 common pidfd_getfd sys_pidfd_getfd
549 common faccessat2 sys_faccessat2
+550 common process_madvise sys_process_madvise
--- a/arch/arm64/include/asm/unistd32.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/arm64/include/asm/unistd32.h
@@ -887,6 +887,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_process_madvise 440
+__SYSCALL(__NR_process_madvise, compat_sys_process_madvise)
/*
* Please add new compat syscalls above this comment and update
--- a/arch/arm64/include/asm/unistd.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 440
+#define __NR_compat_syscalls 441
#endif
#define __ARCH_WANT_SYS_CLONE
--- a/arch/arm/tools/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/arm/tools/syscall.tbl
@@ -453,3 +453,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/ia64/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/ia64/kernel/syscalls/syscall.tbl
@@ -360,3 +360,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/m68k/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/m68k/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/microblaze/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -445,3 +445,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -378,3 +378,4 @@
437 n32 openat2 sys_openat2
438 n32 pidfd_getfd sys_pidfd_getfd
439 n32 faccessat2 sys_faccessat2
+440 n32 process_madvise compat_sys_process_madvise
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -354,3 +354,4 @@
437 n64 openat2 sys_openat2
438 n64 pidfd_getfd sys_pidfd_getfd
439 n64 faccessat2 sys_faccessat2
+440 n64 process_madvise sys_process_madvise
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -427,3 +427,4 @@
437 o32 openat2 sys_openat2
438 o32 pidfd_getfd sys_pidfd_getfd
439 o32 faccessat2 sys_faccessat2
+440 o32 process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/parisc/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/parisc/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/powerpc/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -529,3 +529,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/s390/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/s390/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
437 common openat2 sys_openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/sh/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/sh/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/arch/sparc/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/sparc/kernel/syscalls/syscall.tbl
@@ -485,3 +485,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/x86/entry/syscalls/syscall_32.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@
437 i386 openat2 sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd
439 i386 faccessat2 sys_faccessat2
+440 i386 process_madvise sys_process_madvise compat_sys_process_madvise
--- a/arch/x86/entry/syscalls/syscall_64.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 64 process_madvise sys_process_madvise
#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -404,3 +405,4 @@
545 x32 execveat compat_sys_execveat
546 x32 preadv2 compat_sys_preadv64v2
547 x32 pwritev2 compat_sys_pwritev64v2
+548 x32 process_madvise compat_sys_process_madvise
--- a/arch/xtensa/kernel/syscalls/syscall.tbl~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -410,3 +410,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common process_madvise sys_process_madvise
--- a/include/linux/compat.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/include/linux/compat.h
@@ -823,6 +823,10 @@ asmlinkage long compat_sys_pwritev64v2(u
unsigned long vlen, loff_t pos, rwf_t flags);
#endif
+asmlinkage ssize_t compat_sys_process_madvise(compat_int_t pidfd,
+ const struct compat_iovec __user *vec,
+ compat_ulong_t vlen, compat_int_t behavior,
+ compat_uint_t flags);
/*
* Deprecated system calls which are still defined in
--- a/include/linux/syscalls.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/include/linux/syscalls.h
@@ -880,6 +880,8 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec,
+ unsigned long vlen, int behavior, unsigned int flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
--- a/include/uapi/asm-generic/unistd.h~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_process_madvise 440
+__SC_COMP(__NR_process_madvise, sys_process_madvise, compat_sys_process_madvise)
#undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
/*
* 32 bit systems traditionally used different
--- a/kernel/sys_ni.c~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/kernel/sys_ni.c
@@ -280,6 +280,8 @@ COND_SYSCALL(mlockall);
COND_SYSCALL(munlockall);
COND_SYSCALL(mincore);
COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
+COND_SYSCALL_COMPAT(process_madvise);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
COND_SYSCALL_COMPAT(mbind);
--- a/mm/madvise.c~mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api
+++ a/mm/madvise.c
@@ -17,6 +17,7 @@
#include <linux/falloc.h>
#include <linux/fadvise.h>
#include <linux/sched.h>
+#include <linux/sched/mm.h>
#include <linux/ksm.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -995,6 +996,18 @@ madvise_behavior_valid(int behavior)
}
}
+static bool
+process_madvise_behavior_valid(int behavior)
+{
+ switch (behavior) {
+ case MADV_COLD:
+ case MADV_PAGEOUT:
+ return true;
+ default:
+ return false;
+ }
+}
+
/*
* The madvise(2) system call.
*
@@ -1042,6 +1055,11 @@ madvise_behavior_valid(int behavior)
* MADV_DONTDUMP - the application wants to prevent pages in the given range
* from being included in its core dump.
* MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ * MADV_COLD - the application is not expected to use this memory soon,
+ * deactivate pages in this range so that they can be reclaimed
+ * easily if memory pressure happens.
+ * MADV_PAGEOUT - the application is not expected to use this memory soon,
+ * page out the pages in this range immediately.
*
* return values:
* zero - success
@@ -1176,3 +1194,106 @@ SYSCALL_DEFINE3(madvise, unsigned long,
{
return do_madvise(current, current->mm, start, len_in, behavior);
}
+
+static int process_madvise_vec(struct task_struct *target_task,
+ struct mm_struct *mm, struct iov_iter *iter, int behavior)
+{
+ struct iovec iovec;
+ int ret = 0;
+
+ while (iov_iter_count(iter)) {
+ iovec = iov_iter_iovec(iter);
+ ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
+ iovec.iov_len, behavior);
+ if (ret < 0)
+ break;
+ iov_iter_advance(iter, iovec.iov_len);
+ }
+
+ return ret;
+}
+
+static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter,
+ int behavior, unsigned int flags)
+{
+ ssize_t ret;
+ struct pid *pid;
+ struct task_struct *task;
+ struct mm_struct *mm;
+ size_t total_len = iov_iter_count(iter);
+
+ if (flags != 0)
+ return -EINVAL;
+
+ pid = pidfd_get_pid(pidfd);
+ if (IS_ERR(pid))
+ return PTR_ERR(pid);
+
+ task = get_pid_task(pid, PIDTYPE_PID);
+ if (!task) {
+ ret = -ESRCH;
+ goto put_pid;
+ }
+
+ if (task->mm != current->mm &&
+ !process_madvise_behavior_valid(behavior)) {
+ ret = -EINVAL;
+ goto release_task;
+ }
+
+ mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+ if (IS_ERR_OR_NULL(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ goto release_task;
+ }
+
+ ret = process_madvise_vec(task, mm, iter, behavior);
+ if (ret >= 0)
+ ret = total_len - iov_iter_count(iter);
+
+ mmput(mm);
+release_task:
+ put_task_struct(task);
+put_pid:
+ put_pid(pid);
+ return ret;
+}
+
+SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
+ unsigned long, vlen, int, behavior, unsigned int, flags)
+{
+ ssize_t ret;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ struct iov_iter iter;
+
+ ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
+ if (ret >= 0) {
+ ret = do_process_madvise(pidfd, &iter, behavior, flags);
+ kfree(iov);
+ }
+ return ret;
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE5(process_madvise, compat_int_t, pidfd,
+ const struct compat_iovec __user *, vec,
+ compat_ulong_t, vlen,
+ compat_int_t, behavior,
+ compat_uint_t, flags)
+
+{
+ ssize_t ret;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ struct iov_iter iter;
+
+ ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
+ &iov, &iter);
+ if (ret >= 0) {
+ ret = do_process_madvise(pidfd, &iter, behavior, flags);
+ kfree(iov);
+ }
+ return ret;
+}
+#endif
_
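For readers who want to try the new interface from user space, the following is a minimal sketch rather than part of the patch. It assumes a Linux system without a libc wrapper and therefore goes through syscall(2). The number 440 is taken from the syscall tables above (x32 uses 548), and the helper names hint_fill_iovec() and hint_process_madvise() are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall(2) */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Number from the syscall tables in this patch; x32 uses 548 instead. */
#ifndef __NR_process_madvise
#define __NR_process_madvise 440
#endif

/* Advice values from the uapi mman headers, in case libc predates them. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/*
 * Hypothetical helper: describe n address ranges of the target process,
 * one iovec per range, and return the total number of bytes described.
 */
size_t hint_fill_iovec(struct iovec *vec, void *const *addrs,
		       const size_t *lens, size_t n)
{
	size_t total = 0;

	assert(vec != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		vec[i].iov_base = addrs[i];
		vec[i].iov_len = lens[i];
		total += lens[i];
	}
	return total;
}

/*
 * Hypothetical thin wrapper around the raw syscall. Returns the number
 * of bytes advised, or -1 with errno set.
 */
ssize_t hint_process_madvise(int pidfd, const struct iovec *vec,
			     unsigned long vlen, int behavior,
			     unsigned int flags)
{
	return (ssize_t)syscall(__NR_process_madvise, pidfd, vec, vlen,
				behavior, flags);
}
```

A caller would obtain pidfd from pidfd_open(2), pass addresses that are valid in the target's address space, and set flags to 0. As the code above shows, this version accepts only MADV_COLD and MADV_PAGEOUT for a foreign mm, requires PTRACE_MODE_ATTACH_FSCREDS access, and returns the byte count actually processed.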
* [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 0:29 incoming Andrew Morton
` (16 preceding siblings ...)
2020-08-15 0:30 ` [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 2:53 ` Linus Torvalds
2020-08-15 0:31 ` [patch 19/39] all arch: remove system call sys_sysctl Andrew Morton
` (21 subsequent siblings)
39 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, alexander.h.duyck, axboe, bgeffon, christian.brauner,
christian, dancol, hannes, jannh, joaodias, joel, ktkhai,
linux-man, linux-mm, mhocko, minchan, mm-commits, oleksandr,
rientjes, shakeelb, sj38.park, sjpark, sonnyrao, sspatil, surenb,
timmurray, torvalds, vbabka
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: check fatal signal pending of target process
Bail out to avoid unnecessary CPU overhead if the target process has a
pending fatal signal during a MADV_COLD or MADV_PAGEOUT operation.
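The order of the two checks can be modelled in user space. The sketch below is an illustration only, not kernel code: struct model_task, model_should_bail() and model_walk() are hypothetical names, and the boolean stands in for what fatal_signal_pending() would report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct: only the bit we need. */
struct model_task {
	bool fatal_signal_pending;
};

/*
 * Mirrors the order of the checks in the page-table walk callback:
 * first the caller, then the (optional) target. Returns 0 or -EINTR.
 */
int model_should_bail(const struct model_task *current_task,
		      const struct model_task *target_task)
{
	assert(current_task != NULL);
	if (current_task->fatal_signal_pending)
		return -EINTR;
	/* target_task may be NULL, like private->target_task in the patch. */
	if (target_task && target_task->fatal_signal_pending)
		return -EINTR;
	return 0;
}

/*
 * Walk n_ranges ranges and stop at the first one where a bail-out is
 * requested. A fatal signal for the target is simulated just before
 * range kill_at. Returns the number of ranges fully processed.
 */
size_t model_walk(const struct model_task *cur, struct model_task *tgt,
		  size_t n_ranges, size_t kill_at, int *err)
{
	size_t done = 0;

	*err = 0;
	for (size_t i = 0; i < n_ranges; i++) {
		if (i == kill_at)
			tgt->fatal_signal_pending = true;
		*err = model_should_bail(cur, tgt);
		if (*err)
			break;
		done++;
	}
	return done;
}
```

The point of the second check is that the caller may be perfectly healthy while the target is already dying, in which case continuing to deactivate or page out its memory is wasted work.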
Link: http://lkml.kernel.org/r/20200302193630.68771-4-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/madvise.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
--- a/mm/madvise.c~mm-madvise-check-fatal-signal-pending-of-target-process
+++ a/mm/madvise.c
@@ -39,6 +39,7 @@
struct madvise_walk_private {
struct mmu_gather *tlb;
bool pageout;
+ struct task_struct *target_task;
};
/*
@@ -319,6 +320,10 @@ static int madvise_cold_or_pageout_pte_r
if (fatal_signal_pending(current))
return -EINTR;
+ if (private->target_task &&
+ fatal_signal_pending(private->target_task))
+ return -EINTR;
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (pmd_trans_huge(*pmd)) {
pmd_t orig_pmd;
@@ -480,12 +485,14 @@ static const struct mm_walk_ops cold_wal
};
static void madvise_cold_page_range(struct mmu_gather *tlb,
+ struct task_struct *task,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
struct madvise_walk_private walk_private = {
.pageout = false,
.tlb = tlb,
+ .target_task = task,
};
tlb_start_vma(tlb, vma);
@@ -493,7 +500,8 @@ static void madvise_cold_page_range(stru
tlb_end_vma(tlb, vma);
}
-static long madvise_cold(struct vm_area_struct *vma,
+static long madvise_cold(struct task_struct *task,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start_addr, unsigned long end_addr)
{
@@ -506,19 +514,21 @@ static long madvise_cold(struct vm_area_
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
- madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+ madvise_cold_page_range(&tlb, task, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb, start_addr, end_addr);
return 0;
}
static void madvise_pageout_page_range(struct mmu_gather *tlb,
+ struct task_struct *task,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
struct madvise_walk_private walk_private = {
.pageout = true,
.tlb = tlb,
+ .target_task = task,
};
tlb_start_vma(tlb, vma);
@@ -542,7 +552,8 @@ static inline bool can_do_pageout(struct
inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
}
-static long madvise_pageout(struct vm_area_struct *vma,
+static long madvise_pageout(struct task_struct *task,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start_addr, unsigned long end_addr)
{
@@ -558,7 +569,7 @@ static long madvise_pageout(struct vm_ar
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
- madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb, start_addr, end_addr);
return 0;
@@ -938,7 +949,8 @@ static int madvise_inject_error(int beha
#endif
static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
+madvise_vma(struct task_struct *task, struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
switch (behavior) {
@@ -947,9 +959,9 @@ madvise_vma(struct vm_area_struct *vma,
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
- return madvise_cold(vma, prev, start, end);
+ return madvise_cold(task, vma, prev, start, end);
case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
+ return madvise_pageout(task, vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -1166,7 +1178,8 @@ int do_madvise(struct task_struct *targe
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma(vma, &prev, start, tmp, behavior);
+ error = madvise_vma(target_task, vma, &prev,
+ start, tmp, behavior);
if (error)
goto out;
start = tmp;
_
* [patch 19/39] all arch: remove system call sys_sysctl
2020-08-15 0:29 incoming Andrew Morton
` (17 preceding siblings ...)
2020-08-15 0:31 ` [patch 18/39] mm/madvise: check fatal signal pending of target process Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 20/39] mm/kmemleak: silence KCSAN splats in checksum Andrew Morton
` (20 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: acme, ak, akpm, alexander.shishkin, arnd, axboe, bauerman, benh,
bin.meng, borntraeger, bp, brgerst, catalin.marinas, chenzefeng2,
chris, christian, cyphar, dalias, davem, deller, dhowells,
dvyukov, ebiederm, elver, fenghua.yu, flameeyes, geert, gor,
heiko.carstens, hpa, ink, James.Bottomley, jcmvbkbc, jolsa,
jongk, keescook, krzk, linux-mm, linux, linux, luto,
mark.rutland, martin.petersen, mattst88, mcgrof, minchan, mingo,
mm-commits, monstr, mpe, mszeredi, namhyung, naveen.n.rao,
nixiaoming, npiggin, oleg, olof, paulburton, paulmck, paulus,
peterz, ravi.bangoria, rdunlap, rth, samitolvanen, sargun, sfr,
sudeep.holla, svens, tglx, tony.luck, torvalds, tsbogend, vbabka,
viro, will, yamada.masahiro, ysato, yzaikin, zhouyanjie
From: Xiaoming Ni <nixiaoming@huawei.com>
Subject: all arch: remove system call sys_sysctl
Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"),
sys_sysctl is actually unavailable: any input can only return an error.
We have been warning people about using the sysctl system call for years
and believe there are no more users. Even if there are users of this
interface, if they have not complained or fixed their code by now, they
probably are not going to, so there is no point in warning them any
longer.
So completely remove sys_sysctl on all architectures.
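Any remaining caller would need to move to the /proc/sys files. The following is a minimal user-space sketch of that, not something this patch adds. The helper names sysctl_name_to_path() and read_proc_sysctl() are hypothetical, and the dot-to-slash mapping is a simplification that does not handle names whose components themselves contain dots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: map a dotted name such as "kernel.ostype" to
 * "/proc/sys/kernel/ostype". Returns 0 on success, -1 if the result
 * does not fit in path.
 */
int sysctl_name_to_path(const char *name, char *path, size_t len)
{
	int n;

	assert(name != NULL && path != NULL);
	n = snprintf(path, len, "/proc/sys/%s", name);
	if (n < 0 || (size_t)n >= len)
		return -1;
	for (char *p = path + strlen("/proc/sys/"); *p; p++)
		if (*p == '.')
			*p = '/';
	return 0;
}

/*
 * Hypothetical helper: read the first line of the value into buf,
 * without the trailing newline. Returns 0 on success, -1 on failure.
 */
int read_proc_sysctl(const char *name, char *buf, size_t len)
{
	char path[256];
	FILE *f;

	if (len == 0 || sysctl_name_to_path(name, path, sizeof(path)) != 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```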
[nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu]
Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com
Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Will Deacon <will@kernel.org> [arm/arm64]
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bin Meng <bin.meng@windriver.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: chenzefeng <chenzefeng2@huawei.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Diego Elio Pettenò <flameeyes@flameeyes.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kars de Jong <jongk@linux-m68k.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Paul Burton <paulburton@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Sven Schnelle <svens@stackframe.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/alpha/kernel/syscalls/syscall.tbl | 2
arch/arm/configs/am200epdkit_defconfig | 1
arch/arm/tools/syscall.tbl | 2
arch/arm64/include/asm/unistd32.h | 4
arch/ia64/kernel/syscalls/syscall.tbl | 2
arch/m68k/kernel/syscalls/syscall.tbl | 2
arch/microblaze/kernel/syscalls/syscall.tbl | 2
arch/mips/configs/cu1000-neo_defconfig | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 2
arch/mips/kernel/syscalls/syscall_n64.tbl | 2
arch/mips/kernel/syscalls/syscall_o32.tbl | 2
arch/parisc/kernel/syscalls/syscall.tbl | 2
arch/powerpc/kernel/syscalls/syscall.tbl | 2
arch/s390/kernel/syscalls/syscall.tbl | 2
arch/sh/configs/dreamcast_defconfig | 1
arch/sh/configs/espt_defconfig | 1
arch/sh/configs/hp6xx_defconfig | 1
arch/sh/configs/landisk_defconfig | 1
arch/sh/configs/lboxre2_defconfig | 1
arch/sh/configs/microdev_defconfig | 1
arch/sh/configs/migor_defconfig | 1
arch/sh/configs/r7780mp_defconfig | 1
arch/sh/configs/r7785rp_defconfig | 1
arch/sh/configs/rts7751r2d1_defconfig | 1
arch/sh/configs/rts7751r2dplus_defconfig | 1
arch/sh/configs/se7206_defconfig | 1
arch/sh/configs/se7343_defconfig | 1
arch/sh/configs/se7619_defconfig | 1
arch/sh/configs/se7705_defconfig | 1
arch/sh/configs/se7750_defconfig | 1
arch/sh/configs/se7751_defconfig | 1
arch/sh/configs/secureedge5410_defconfig | 1
arch/sh/configs/sh03_defconfig | 1
arch/sh/configs/sh7710voipgw_defconfig | 1
arch/sh/configs/sh7757lcr_defconfig | 1
arch/sh/configs/sh7763rdp_defconfig | 1
arch/sh/configs/shmin_defconfig | 1
arch/sh/configs/titan_defconfig | 1
arch/sh/kernel/syscalls/syscall.tbl | 2
arch/sparc/kernel/syscalls/syscall.tbl | 2
arch/x86/entry/syscalls/syscall_32.tbl | 2
arch/x86/entry/syscalls/syscall_64.tbl | 2
arch/xtensa/kernel/syscalls/syscall.tbl | 2
include/linux/compat.h | 1
include/linux/syscalls.h | 2
include/linux/sysctl.h | 6
kernel/Makefile | 2
kernel/sys_ni.c | 1
kernel/sysctl_binary.c | 171 -----------
tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2
tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2
tools/perf/arch/x86/entry/syscalls/syscall_64.tbl | 2
52 files changed, 24 insertions(+), 227 deletions(-)
--- a/arch/alpha/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/alpha/kernel/syscalls/syscall.tbl
@@ -249,7 +249,7 @@
316 common mlockall sys_mlockall
317 common munlockall sys_munlockall
318 common sysinfo sys_sysinfo
-319 common _sysctl sys_sysctl
+319 common _sysctl sys_ni_syscall
# 320 was sys_idle
321 common oldumount sys_oldumount
322 common swapon sys_swapon
--- a/arch/arm64/include/asm/unistd32.h~all-arch-remove-system-call-sys_sysctl
+++ a/arch/arm64/include/asm/unistd32.h
@@ -308,8 +308,8 @@ __SYSCALL(__NR_writev, compat_sys_writev
__SYSCALL(__NR_getsid, sys_getsid)
#define __NR_fdatasync 148
__SYSCALL(__NR_fdatasync, sys_fdatasync)
-#define __NR__sysctl 149
-__SYSCALL(__NR__sysctl, compat_sys_sysctl)
+ /* 149 was sys_sysctl */
+__SYSCALL(149, sys_ni_syscall)
#define __NR_mlock 150
__SYSCALL(__NR_mlock, sys_mlock)
#define __NR_munlock 151
--- a/arch/arm/configs/am200epdkit_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/arm/configs/am200epdkit_defconfig
@@ -3,7 +3,6 @@ CONFIG_LOCALVERSION="gum"
CONFIG_SYSVIPC=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_EXPERT=y
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_EPOLL is not set
# CONFIG_SHMEM is not set
# CONFIG_VM_EVENT_COUNTERS is not set
--- a/arch/arm/tools/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/arm/tools/syscall.tbl
@@ -162,7 +162,7 @@
146 common writev sys_writev
147 common getsid sys_getsid
148 common fdatasync sys_fdatasync
-149 common _sysctl sys_sysctl
+149 common _sysctl sys_ni_syscall
150 common mlock sys_mlock
151 common munlock sys_munlock
152 common mlockall sys_mlockall
--- a/arch/ia64/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/ia64/kernel/syscalls/syscall.tbl
@@ -135,7 +135,7 @@
123 common writev sys_writev
124 common pread64 sys_pread64
125 common pwrite64 sys_pwrite64
-126 common _sysctl sys_sysctl
+126 common _sysctl sys_ni_syscall
127 common mmap sys_mmap
128 common munmap sys_munmap
129 common mlock sys_mlock
--- a/arch/m68k/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/m68k/kernel/syscalls/syscall.tbl
@@ -156,7 +156,7 @@
146 common writev sys_writev
147 common getsid sys_getsid
148 common fdatasync sys_fdatasync
-149 common _sysctl sys_sysctl
+149 common _sysctl sys_ni_syscall
150 common mlock sys_mlock
151 common munlock sys_munlock
152 common mlockall sys_mlockall
--- a/arch/microblaze/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -156,7 +156,7 @@
146 common writev sys_writev
147 common getsid sys_getsid
148 common fdatasync sys_fdatasync
-149 common _sysctl sys_sysctl
+149 common _sysctl sys_ni_syscall
150 common mlock sys_mlock
151 common munlock sys_munlock
152 common mlockall sys_mlockall
--- a/arch/mips/configs/cu1000-neo_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/mips/configs/cu1000-neo_defconfig
@@ -17,7 +17,6 @@ CONFIG_CGROUP_CPUACCT=y
CONFIG_NAMESPACES=y
CONFIG_USER_NS=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
-CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS_ALL=y
CONFIG_EMBEDDED=y
# CONFIG_VM_EVENT_COUNTERS is not set
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -159,7 +159,7 @@
149 n32 munlockall sys_munlockall
150 n32 vhangup sys_vhangup
151 n32 pivot_root sys_pivot_root
-152 n32 _sysctl compat_sys_sysctl
+152 n32 _sysctl sys_ni_syscall
153 n32 prctl sys_prctl
154 n32 adjtimex sys_adjtimex_time32
155 n32 setrlimit compat_sys_setrlimit
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -159,7 +159,7 @@
149 n64 munlockall sys_munlockall
150 n64 vhangup sys_vhangup
151 n64 pivot_root sys_pivot_root
-152 n64 _sysctl sys_sysctl
+152 n64 _sysctl sys_ni_syscall
153 n64 prctl sys_prctl
154 n64 adjtimex sys_adjtimex
155 n64 setrlimit sys_setrlimit
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -164,7 +164,7 @@
150 o32 unused150 sys_ni_syscall
151 o32 getsid sys_getsid
152 o32 fdatasync sys_fdatasync
-153 o32 _sysctl sys_sysctl compat_sys_sysctl
+153 o32 _sysctl sys_ni_syscall
154 o32 mlock sys_mlock
155 o32 munlock sys_munlock
156 o32 mlockall sys_mlockall
--- a/arch/parisc/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/parisc/kernel/syscalls/syscall.tbl
@@ -163,7 +163,7 @@
146 common writev sys_writev compat_sys_writev
147 common getsid sys_getsid
148 common fdatasync sys_fdatasync
-149 common _sysctl sys_sysctl compat_sys_sysctl
+149 common _sysctl sys_ni_syscall
150 common mlock sys_mlock
151 common munlock sys_munlock
152 common mlockall sys_mlockall
--- a/arch/powerpc/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -197,7 +197,7 @@
146 common writev sys_writev compat_sys_writev
147 common getsid sys_getsid
148 common fdatasync sys_fdatasync
-149 nospu _sysctl sys_sysctl compat_sys_sysctl
+149 nospu _sysctl sys_ni_syscall
150 common mlock sys_mlock
151 common munlock sys_munlock
152 common mlockall sys_mlockall
--- a/arch/s390/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/s390/kernel/syscalls/syscall.tbl
@@ -138,7 +138,7 @@
146 common writev sys_writev compat_sys_writev
147 common getsid sys_getsid sys_getsid
148 common fdatasync sys_fdatasync sys_fdatasync
-149 common _sysctl sys_sysctl compat_sys_sysctl
+149 common _sysctl - -
150 common mlock sys_mlock sys_mlock
151 common munlock sys_munlock sys_munlock
152 common mlockall sys_mlockall sys_mlockall
--- a/arch/sh/configs/dreamcast_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/dreamcast_defconfig
@@ -1,7 +1,6 @@
CONFIG_SYSVIPC=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_LOG_BUF_SHIFT=14
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_MODULES=y
--- a/arch/sh/configs/espt_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/espt_defconfig
@@ -5,7 +5,6 @@ CONFIG_LOG_BUF_SHIFT=14
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=y
--- a/arch/sh/configs/hp6xx_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/hp6xx_defconfig
@@ -3,7 +3,6 @@ CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_CPU_SUBTYPE_SH7709=y
--- a/arch/sh/configs/landisk_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/landisk_defconfig
@@ -1,6 +1,5 @@
CONFIG_SYSVIPC=y
CONFIG_LOG_BUF_SHIFT=14
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_SLAB=y
CONFIG_MODULES=y
--- a/arch/sh/configs/lboxre2_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/lboxre2_defconfig
@@ -1,6 +1,5 @@
CONFIG_SYSVIPC=y
CONFIG_LOG_BUF_SHIFT=14
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_SLAB=y
CONFIG_MODULES=y
--- a/arch/sh/configs/microdev_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/microdev_defconfig
@@ -2,7 +2,6 @@ CONFIG_BSD_PROCESS_ACCT=y
CONFIG_LOG_BUF_SHIFT=14
CONFIG_BLK_DEV_INITRD=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
# CONFIG_BLK_DEV_BSG is not set
CONFIG_CPU_SUBTYPE_SH4_202=y
--- a/arch/sh/configs/migor_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/migor_defconfig
@@ -4,7 +4,6 @@ CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=14
CONFIG_BLK_DEV_INITRD=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=y
--- a/arch/sh/configs/r7780mp_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/r7780mp_defconfig
@@ -3,7 +3,6 @@ CONFIG_BSD_PROCESS_ACCT=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=14
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_FUTEX is not set
# CONFIG_EPOLL is not set
CONFIG_SLAB=y
--- a/arch/sh/configs/r7785rp_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/r7785rp_defconfig
@@ -7,7 +7,6 @@ CONFIG_RCU_TRACE=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=14
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=y
--- a/arch/sh/configs/rts7751r2d1_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/rts7751r2d1_defconfig
@@ -1,7 +1,6 @@
CONFIG_SYSVIPC=y
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=y
--- a/arch/sh/configs/rts7751r2dplus_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/rts7751r2dplus_defconfig
@@ -1,7 +1,6 @@
CONFIG_SYSVIPC=y
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=y
--- a/arch/sh/configs/se7206_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/se7206_defconfig
@@ -18,7 +18,6 @@ CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_BLK_DEV_INITRD=y
# CONFIG_UID16 is not set
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS_ALL=y
# CONFIG_ELF_CORE is not set
# CONFIG_COMPAT_BRK is not set
--- a/arch/sh/configs/se7343_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/se7343_defconfig
@@ -2,7 +2,6 @@
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
CONFIG_LOG_BUF_SHIFT=14
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_FUTEX is not set
# CONFIG_EPOLL is not set
# CONFIG_SHMEM is not set
--- a/arch/sh/configs/se7619_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/se7619_defconfig
@@ -1,7 +1,6 @@
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_UID16 is not set
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_KALLSYMS is not set
# CONFIG_HOTPLUG is not set
# CONFIG_ELF_CORE is not set
--- a/arch/sh/configs/se7705_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/se7705_defconfig
@@ -2,7 +2,6 @@
CONFIG_LOG_BUF_SHIFT=14
CONFIG_BLK_DEV_INITRD=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_KALLSYMS is not set
# CONFIG_HOTPLUG is not set
CONFIG_SLAB=y
--- a/arch/sh/configs/se7750_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/se7750_defconfig
@@ -5,7 +5,6 @@ CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_HOTPLUG is not set
CONFIG_SLAB=y
CONFIG_MODULES=y
--- a/arch/sh/configs/se7751_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/se7751_defconfig
@@ -3,7 +3,6 @@ CONFIG_BSD_PROCESS_ACCT=y
CONFIG_LOG_BUF_SHIFT=14
CONFIG_BLK_DEV_INITRD=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_HOTPLUG is not set
CONFIG_SLAB=y
CONFIG_MODULES=y
--- a/arch/sh/configs/secureedge5410_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/secureedge5410_defconfig
@@ -1,7 +1,6 @@
# CONFIG_SWAP is not set
CONFIG_LOG_BUF_SHIFT=14
CONFIG_BLK_DEV_INITRD=y
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_HOTPLUG is not set
CONFIG_SLAB=y
# CONFIG_BLK_DEV_BSG is not set
--- a/arch/sh/configs/sh03_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/sh03_defconfig
@@ -3,7 +3,6 @@ CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_LOG_BUF_SHIFT=14
CONFIG_BLK_DEV_INITRD=y
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=m
--- a/arch/sh/configs/sh7710voipgw_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/sh7710voipgw_defconfig
@@ -2,7 +2,6 @@
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
CONFIG_LOG_BUF_SHIFT=14
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_FUTEX is not set
# CONFIG_EPOLL is not set
# CONFIG_SHMEM is not set
--- a/arch/sh/configs/sh7757lcr_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/sh7757lcr_defconfig
@@ -8,7 +8,6 @@ CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_LOG_BUF_SHIFT=14
CONFIG_BLK_DEV_INITRD=y
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS_ALL=y
CONFIG_SLAB=y
CONFIG_MODULES=y
--- a/arch/sh/configs/sh7763rdp_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/sh7763rdp_defconfig
@@ -5,7 +5,6 @@ CONFIG_LOG_BUF_SHIFT=14
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_PROFILING=y
CONFIG_OPROFILE=y
--- a/arch/sh/configs/shmin_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/shmin_defconfig
@@ -1,7 +1,6 @@
# CONFIG_SWAP is not set
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_UID16 is not set
-# CONFIG_SYSCTL_SYSCALL is not set
# CONFIG_KALLSYMS is not set
# CONFIG_HOTPLUG is not set
# CONFIG_BUG is not set
--- a/arch/sh/configs/titan_defconfig~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/configs/titan_defconfig
@@ -6,7 +6,6 @@ CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_BLK_DEV_INITRD=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
-# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_SLAB=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
--- a/arch/sh/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sh/kernel/syscalls/syscall.tbl
@@ -156,7 +156,7 @@
146 common writev sys_writev
147 common getsid sys_getsid
148 common fdatasync sys_fdatasync
-149 common _sysctl sys_sysctl
+149 common _sysctl sys_ni_syscall
150 common mlock sys_mlock
151 common munlock sys_munlock
152 common mlockall sys_mlockall
--- a/arch/sparc/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/sparc/kernel/syscalls/syscall.tbl
@@ -300,7 +300,7 @@
249 64 nanosleep sys_nanosleep
250 32 mremap sys_mremap
250 64 mremap sys_64_mremap
-251 common _sysctl sys_sysctl compat_sys_sysctl
+251 common _sysctl sys_ni_syscall
252 common getsid sys_getsid
253 common fdatasync sys_fdatasync
254 32 nfsservctl sys_ni_syscall sys_nis_syscall
--- a/arch/x86/entry/syscalls/syscall_32.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/x86/entry/syscalls/syscall_32.tbl
@@ -160,7 +160,7 @@
146 i386 writev sys_writev compat_sys_writev
147 i386 getsid sys_getsid
148 i386 fdatasync sys_fdatasync
-149 i386 _sysctl sys_sysctl compat_sys_sysctl
+149 i386 _sysctl sys_ni_syscall
150 i386 mlock sys_mlock
151 i386 munlock sys_munlock
152 i386 mlockall sys_mlockall
--- a/arch/x86/entry/syscalls/syscall_64.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/x86/entry/syscalls/syscall_64.tbl
@@ -164,7 +164,7 @@
153 common vhangup sys_vhangup
154 common modify_ldt sys_modify_ldt
155 common pivot_root sys_pivot_root
-156 64 _sysctl sys_sysctl
+156 64 _sysctl sys_ni_syscall
157 common prctl sys_prctl
158 common arch_prctl sys_arch_prctl
159 common adjtimex sys_adjtimex
--- a/arch/xtensa/kernel/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -222,7 +222,7 @@
204 common quotactl sys_quotactl
# 205 was old nfsservctl
205 common nfsservctl sys_ni_syscall
-206 common _sysctl sys_sysctl
+206 common _sysctl sys_ni_syscall
207 common bdflush sys_bdflush
208 common uname sys_newuname
209 common sysinfo sys_sysinfo
--- a/include/linux/compat.h~all-arch-remove-system-call-sys_sysctl
+++ a/include/linux/compat.h
@@ -855,7 +855,6 @@ asmlinkage long compat_sys_select(int n,
asmlinkage long compat_sys_ustat(unsigned dev, struct compat_ustat __user *u32);
asmlinkage long compat_sys_recv(int fd, void __user *buf, compat_size_t len,
unsigned flags);
-asmlinkage long compat_sys_sysctl(struct compat_sysctl_args __user *args);
/* obsolete: fs/readdir.c */
asmlinkage long compat_sys_old_readdir(unsigned int fd,
--- a/include/linux/syscalls.h~all-arch-remove-system-call-sys_sysctl
+++ a/include/linux/syscalls.h
@@ -47,7 +47,6 @@ struct stat64;
struct statfs;
struct statfs64;
struct statx;
-struct __sysctl_args;
struct sysinfo;
struct timespec;
struct __kernel_old_timeval;
@@ -1119,7 +1118,6 @@ asmlinkage long sys_send(int, void __use
asmlinkage long sys_bdflush(int func, long data);
asmlinkage long sys_oldumount(char __user *name);
asmlinkage long sys_uselib(const char __user *library);
-asmlinkage long sys_sysctl(struct __sysctl_args __user *args);
asmlinkage long sys_sysfs(int option,
unsigned long arg1, unsigned long arg2);
asmlinkage long sys_fork(void);
--- a/include/linux/sysctl.h~all-arch-remove-system-call-sys_sysctl
+++ a/include/linux/sysctl.h
@@ -74,15 +74,13 @@ int proc_do_static_key(struct ctl_table
* sysctl names can be mirrored automatically under /proc/sys. The
* procname supplied controls /proc naming.
*
- * The table's mode will be honoured both for sys_sysctl(2) and
- * proc-fs access.
+ * The table's mode will be honoured for proc-fs access.
*
* Leaf nodes in the sysctl tree will be represented by a single file
* under /proc; non-leaf nodes will be represented by directories. A
* null procname disables /proc mirroring at this node.
*
- * sysctl(2) can automatically manage read and write requests through
- * the sysctl table. The data and maxlen fields of the ctl_table
+ * The data and maxlen fields of the ctl_table
* struct enable minimal validation of the values being written to be
* performed, and the mode field allows minimal authentication.
*
--- a/kernel/Makefile~all-arch-remove-system-call-sys_sysctl
+++ a/kernel/Makefile
@@ -5,7 +5,7 @@
obj-y = fork.o exec_domain.o panic.o \
cpu.o exit.o softirq.o resource.o \
- sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
+ sysctl.o capability.o ptrace.o user.o \
signal.o sys.o umh.o workqueue.o pid.o task_work.o \
extable.o params.o \
kthread.o sys_ni.o nsproxy.o \
--- a/kernel/sysctl_binary.c
+++ /dev/null
@@ -1,171 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/stat.h>
-#include <linux/sysctl.h>
-#include "../fs/xfs/xfs_sysctl.h"
-#include <linux/sunrpc/debug.h>
-#include <linux/string.h>
-#include <linux/syscalls.h>
-#include <linux/namei.h>
-#include <linux/mount.h>
-#include <linux/fs.h>
-#include <linux/nsproxy.h>
-#include <linux/pid_namespace.h>
-#include <linux/file.h>
-#include <linux/ctype.h>
-#include <linux/netdevice.h>
-#include <linux/kernel.h>
-#include <linux/uuid.h>
-#include <linux/slab.h>
-#include <linux/compat.h>
-
-static ssize_t binary_sysctl(const int *name, int nlen,
- void __user *oldval, size_t oldlen, void __user *newval, size_t newlen)
-{
- return -ENOSYS;
-}
-
-static void deprecated_sysctl_warning(const int *name, int nlen)
-{
- int i;
-
- /*
- * CTL_KERN/KERN_VERSION is used by older glibc and cannot
- * ever go away.
- */
- if (nlen >= 2 && name[0] == CTL_KERN && name[1] == KERN_VERSION)
- return;
-
- if (printk_ratelimit()) {
- printk(KERN_INFO
- "warning: process `%s' used the deprecated sysctl "
- "system call with ", current->comm);
- for (i = 0; i < nlen; i++)
- printk(KERN_CONT "%d.", name[i]);
- printk(KERN_CONT "\n");
- }
- return;
-}
-
-#define WARN_ONCE_HASH_BITS 8
-#define WARN_ONCE_HASH_SIZE (1<<WARN_ONCE_HASH_BITS)
-
-static DECLARE_BITMAP(warn_once_bitmap, WARN_ONCE_HASH_SIZE);
-
-#define FNV32_OFFSET 2166136261U
-#define FNV32_PRIME 0x01000193
-
-/*
- * Print each legacy sysctl (approximately) only once.
- * To avoid making the tables non-const use a external
- * hash-table instead.
- * Worst case hash collision: 6, but very rarely.
- * NOTE! We don't use the SMP-safe bit tests. We simply
- * don't care enough.
- */
-static void warn_on_bintable(const int *name, int nlen)
-{
- int i;
- u32 hash = FNV32_OFFSET;
-
- for (i = 0; i < nlen; i++)
- hash = (hash ^ name[i]) * FNV32_PRIME;
- hash %= WARN_ONCE_HASH_SIZE;
- if (__test_and_set_bit(hash, warn_once_bitmap))
- return;
- deprecated_sysctl_warning(name, nlen);
-}
-
-static ssize_t do_sysctl(int __user *args_name, int nlen,
- void __user *oldval, size_t oldlen, void __user *newval, size_t newlen)
-{
- int name[CTL_MAXNAME];
- int i;
-
- /* Check args->nlen. */
- if (nlen < 0 || nlen > CTL_MAXNAME)
- return -ENOTDIR;
- /* Read in the sysctl name for simplicity */
- for (i = 0; i < nlen; i++)
- if (get_user(name[i], args_name + i))
- return -EFAULT;
-
- warn_on_bintable(name, nlen);
-
- return binary_sysctl(name, nlen, oldval, oldlen, newval, newlen);
-}
-
-SYSCALL_DEFINE1(sysctl, struct __sysctl_args __user *, args)
-{
- struct __sysctl_args tmp;
- size_t oldlen = 0;
- ssize_t result;
-
- if (copy_from_user(&tmp, args, sizeof(tmp)))
- return -EFAULT;
-
- if (tmp.oldval && !tmp.oldlenp)
- return -EFAULT;
-
- if (tmp.oldlenp && get_user(oldlen, tmp.oldlenp))
- return -EFAULT;
-
- result = do_sysctl(tmp.name, tmp.nlen, tmp.oldval, oldlen,
- tmp.newval, tmp.newlen);
-
- if (result >= 0) {
- oldlen = result;
- result = 0;
- }
-
- if (tmp.oldlenp && put_user(oldlen, tmp.oldlenp))
- return -EFAULT;
-
- return result;
-}
-
-
-#ifdef CONFIG_COMPAT
-
-struct compat_sysctl_args {
- compat_uptr_t name;
- int nlen;
- compat_uptr_t oldval;
- compat_uptr_t oldlenp;
- compat_uptr_t newval;
- compat_size_t newlen;
- compat_ulong_t __unused[4];
-};
-
-COMPAT_SYSCALL_DEFINE1(sysctl, struct compat_sysctl_args __user *, args)
-{
- struct compat_sysctl_args tmp;
- compat_size_t __user *compat_oldlenp;
- size_t oldlen = 0;
- ssize_t result;
-
- if (copy_from_user(&tmp, args, sizeof(tmp)))
- return -EFAULT;
-
- if (tmp.oldval && !tmp.oldlenp)
- return -EFAULT;
-
- compat_oldlenp = compat_ptr(tmp.oldlenp);
- if (compat_oldlenp && get_user(oldlen, compat_oldlenp))
- return -EFAULT;
-
- result = do_sysctl(compat_ptr(tmp.name), tmp.nlen,
- compat_ptr(tmp.oldval), oldlen,
- compat_ptr(tmp.newval), tmp.newlen);
-
- if (result >= 0) {
- oldlen = result;
- result = 0;
- }
-
- if (compat_oldlenp && put_user(oldlen, compat_oldlenp))
- return -EFAULT;
-
- return result;
-}
-
-#endif /* CONFIG_COMPAT */
--- a/kernel/sys_ni.c~all-arch-remove-system-call-sys_sysctl
+++ a/kernel/sys_ni.c
@@ -366,7 +366,6 @@ COND_SYSCALL(socketcall);
COND_SYSCALL_COMPAT(socketcall);
/* compat syscalls for arm64, x86, ... */
-COND_SYSCALL_COMPAT(sysctl);
COND_SYSCALL_COMPAT(fanotify_mark);
/* x86 */
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -193,7 +193,7 @@
146 common writev sys_writev compat_sys_writev
147 common getsid sys_getsid
148 common fdatasync sys_fdatasync
-149 nospu _sysctl sys_sysctl compat_sys_sysctl
+149 nospu _sysctl sys_ni_syscall
150 common mlock sys_mlock
151 common munlock sys_munlock
152 common mlockall sys_mlockall
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -138,7 +138,7 @@
146 common writev sys_writev compat_sys_writev
147 common getsid sys_getsid sys_getsid
148 common fdatasync sys_fdatasync sys_fdatasync
-149 common _sysctl sys_sysctl compat_sys_sysctl
+149 common _sysctl - -
150 common mlock sys_mlock compat_sys_mlock
151 common munlock sys_munlock compat_sys_munlock
152 common mlockall sys_mlockall sys_mlockall
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl~all-arch-remove-system-call-sys_sysctl
+++ a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -164,7 +164,7 @@
153 common vhangup sys_vhangup
154 common modify_ldt sys_modify_ldt
155 common pivot_root sys_pivot_root
-156 64 _sysctl sys_sysctl
+156 64 _sysctl sys_ni_syscall
157 common prctl sys_prctl
158 common arch_prctl sys_arch_prctl
159 common adjtimex sys_adjtimex
_
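With the _sysctl table entries now pointing at sys_ni_syscall, the path that remains for userspace is the /proc/sys mirror described in the include/linux/sysctl.h comment above. The following is a minimal userspace sketch of reading one value that way; read_sysctl_file() is a hypothetical helper name, not a kernel or libc interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read one sysctl value from its procfs file, for example
 * "/proc/sys/kernel/osrelease". Returns the number of bytes stored in
 * buf (a trailing newline is stripped), or -1 if the file cannot be read.
 * read_sysctl_file() is a hypothetical helper used for illustration only.
 */
static long read_sysctl_file(const char *path, char *buf, size_t len)
{
	FILE *f;
	size_t n;

	if (len == 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* Leave room for the terminating NUL. */
	n = fread(buf, 1, len - 1, f);
	if (ferror(f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[n] = '\0';
	if (n > 0 && buf[n - 1] == '\n')
		buf[--n] = '\0';

	return (long)n;
}
```

A caller would pass a path such as "/proc/sys/kernel/osrelease" and a buffer; a return value of -1 covers both a missing file and a read error.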
* [patch 20/39] mm/kmemleak: silence KCSAN splats in checksum
2020-08-15 0:29 incoming Andrew Morton
` (18 preceding siblings ...)
2020-08-15 0:31 ` [patch 19/39] all arch: remove system call sys_sysctl Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 21/39] mm/frontswap: mark various intentional data races Andrew Morton
` (19 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, catalin.marinas, elver, linux-mm, mm-commits, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/kmemleak: silence KCSAN splats in checksum
Even if KCSAN is disabled for kmemleak, update_checksum() could still call
crc32() (which is outside of kmemleak.c) to dereference object->pointer.
Thus, the value of object->pointer could be accessed concurrently as
noticed by KCSAN,
BUG: KCSAN: data-race in crc32_le_base / do_raw_spin_lock
write to 0xffffb0ea683a7d50 of 4 bytes by task 23575 on cpu 12:
do_raw_spin_lock+0x114/0x200
debug_spin_lock_after at kernel/locking/spinlock_debug.c:91
(inlined by) do_raw_spin_lock at kernel/locking/spinlock_debug.c:115
_raw_spin_lock+0x40/0x50
__handle_mm_fault+0xa9e/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffffb0ea683a7d50 of 4 bytes by task 839 on cpu 60:
crc32_le_base+0x67/0x350
crc32_le_base+0x67/0x350:
crc32_body at lib/crc32.c:106
(inlined by) crc32_le_generic at lib/crc32.c:179
(inlined by) crc32_le at lib/crc32.c:197
kmemleak_scan+0x528/0xd90
update_checksum at mm/kmemleak.c:1172
(inlined by) kmemleak_scan at mm/kmemleak.c:1497
kmemleak_scan_thread+0xcc/0xfa
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
If a torn value is read because of a data race, the checksum will be
corrected in the next scan. Thus, let KCSAN ignore all reads in the region
in case the write side is non-atomic.
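The self-correcting behaviour can be illustrated with a small userspace sketch. It is only an analogy, not the kmemleak code: struct tracked_object and crc32_le_simple() are hypothetical stand-ins, and the KCSAN annotations themselves have no userspace equivalent here. A checksum computed from a momentarily inconsistent read is simply recomputed, and reported as changed, on a later scan.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for kmemleak's per-object bookkeeping. */
struct tracked_object {
	const void *pointer;	/* start of the tracked memory block */
	size_t size;		/* size of the block in bytes */
	uint32_t checksum;	/* checksum recorded by the previous scan */
};

/* Bitwise little-endian CRC-32 (polynomial 0xEDB88320), seed passed in. */
static uint32_t crc32_le_simple(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
	}
	return crc;
}

/* Recompute the checksum and report whether it changed since the last scan. */
static bool update_checksum(struct tracked_object *object)
{
	uint32_t old_csum = object->checksum;

	object->checksum = crc32_le_simple(0, object->pointer, object->size);
	return object->checksum != old_csum;
}
```

Calling update_checksum() twice on an unchanged block returns false the second time; after any byte of the block changes, the next call returns true again.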
Link: http://lkml.kernel.org/r/20200317182754.2180-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Marco Elver <elver@google.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/kmemleak.c | 2 ++
1 file changed, 2 insertions(+)
--- a/mm/kmemleak.c~mm-kmemleak-silence-kcsan-splats-in-checksum
+++ a/mm/kmemleak.c
@@ -1169,8 +1169,10 @@ static bool update_checksum(struct kmeml
u32 old_csum = object->checksum;
kasan_disable_current();
+ kcsan_disable_current();
object->checksum = crc32(0, (void *)object->pointer, object->size);
kasan_enable_current();
+ kcsan_enable_current();
return object->checksum != old_csum;
}
_
* [patch 21/39] mm/frontswap: mark various intentional data races
2020-08-15 0:29 incoming Andrew Morton
` (19 preceding siblings ...)
2020-08-15 0:31 ` [patch 20/39] mm/kmemleak: silence KCSAN splats in checksum Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 22/39] mm/page_io: " Andrew Morton
` (18 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, konrad.wilk, linux-mm, mm-commits, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/frontswap: mark various intentional data races
There are a few information counters that are intentionally not protected
against increment races, so just annotate them using the data_race()
macro.
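As a rough userspace sketch of the annotation: the data_race() below is a stand-in that only evaluates its argument, whereas the kernel macro additionally tells KCSAN not to report the access. Only two of the counters are shown.

```c
#include <stdint.h>

/*
 * Userspace stand-in (an assumption for this sketch): the kernel's
 * data_race() also marks the access as an intentional race for KCSAN.
 * Here it just evaluates the expression.
 */
#define data_race(expr) (expr)

/* Statistics only; lost increments under contention are tolerated. */
static uint64_t frontswap_loads;
static uint64_t frontswap_failed_stores;

static inline void inc_frontswap_loads(void)
{
	data_race(frontswap_loads++);
}

static inline void inc_frontswap_failed_stores(void)
{
	data_race(frontswap_failed_stores++);
}
```

The annotation does not make the increment atomic. It documents that an occasional lost update is acceptable for these counters; code that needs exact counts would have to use an atomic type instead.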
BUG: KCSAN: data-race in __frontswap_store / __frontswap_store
write to 0xffffffff8b7174d8 of 8 bytes by task 6396 on cpu 103:
__frontswap_store+0x2d0/0x344
inc_frontswap_failed_stores at mm/frontswap.c:70
(inlined by) __frontswap_store at mm/frontswap.c:280
swap_writepage+0x83/0xf0
pageout+0x33e/0xae0
shrink_page_list+0x1f57/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffffffff8b7174d8 of 8 bytes by task 6405 on cpu 47:
__frontswap_store+0x2b9/0x344
inc_frontswap_failed_stores at mm/frontswap.c:70
(inlined by) __frontswap_store at mm/frontswap.c:280
swap_writepage+0x83/0xf0
pageout+0x33e/0xae0
shrink_page_list+0x1f57/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Link: http://lkml.kernel.org/r/1581114499-5042-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/frontswap.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/mm/frontswap.c~mm-frontswap-mark-various-intentional-data-races
+++ a/mm/frontswap.c
@@ -61,16 +61,16 @@ static u64 frontswap_failed_stores;
static u64 frontswap_invalidates;
static inline void inc_frontswap_loads(void) {
- frontswap_loads++;
+ data_race(frontswap_loads++);
}
static inline void inc_frontswap_succ_stores(void) {
- frontswap_succ_stores++;
+ data_race(frontswap_succ_stores++);
}
static inline void inc_frontswap_failed_stores(void) {
- frontswap_failed_stores++;
+ data_race(frontswap_failed_stores++);
}
static inline void inc_frontswap_invalidates(void) {
- frontswap_invalidates++;
+ data_race(frontswap_invalidates++);
}
#else
static inline void inc_frontswap_loads(void) { }
_
* [patch 22/39] mm/page_io: mark various intentional data races
2020-08-15 0:29 incoming Andrew Morton
` (20 preceding siblings ...)
2020-08-15 0:31 ` [patch 21/39] mm/frontswap: mark various intentional data races Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 23/39] mm/swap_state: " Andrew Morton
` (17 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, linux-mm, mm-commits, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/page_io: mark various intentional data races
struct swap_info_struct si.flags could be accessed concurrently as noticed
by KCSAN,
BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage
write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
scan_swap_map_slots+0x6fe/0xb50
scan_swap_map_slots at mm/swapfile.c:887
get_swap_pages+0x39d/0x5c0
get_swap_page+0x377/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1740/0x2820
shrink_inactive_list+0x316/0x8b0
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
swap_readpage+0x204/0x6a0
swap_readpage at mm/page_io.c:380
read_swap_cache_async+0xa2/0xb0
swapin_readahead+0x6a0/0x890
do_swap_page+0x465/0xeb0
__handle_mm_fault+0xc7a/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 5422 Comm: gmain Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
Other reads,
read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
__swap_writepage+0x140/0xc20
__swap_writepage at mm/page_io.c:289
read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
swap_set_page_dirty+0x44/0x1f4
swap_set_page_dirty at mm/page_io.c:442
The write is under &si->lock, but the reads are lockless. Since the
reads only check for a specific bit in the flags, they are harmless even
if load tearing happens. Thus, just mark them as intentional data races
using the data_race() macro.
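A minimal userspace sketch of such a single-bit check follows. The data_race() here is a stand-in that only evaluates its argument, struct swap_info_sketch and swap_is_fs_backed() are hypothetical names, and the SWP_* bit values are illustrative and may not match the kernel's.

```c
#include <stdbool.h>

/* Userspace stand-in; the kernel macro also informs KCSAN. */
#define data_race(expr) (expr)

/* Illustrative bit values only; the kernel's SWP_* values may differ. */
#define SWP_BLKDEV	(1UL << 6)
#define SWP_FS		(1UL << 8)

/* Hypothetical reduced version of struct swap_info_struct. */
struct swap_info_sketch {
	unsigned long flags;
};

/*
 * Lockless check of one flag bit. Whatever mix of old and new bytes a
 * torn load returns, the tested bit is either its old or its new value.
 */
static bool swap_is_fs_backed(const struct swap_info_sketch *sis)
{
	return data_race(sis->flags & SWP_FS) != 0;
}
```

A reader that needed a consistent view of several bits at once could not rely on this argument and would have to take the lock or use a single marked load.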
[cai@lca.pw: add a missing annotation]
Link: http://lkml.kernel.org/r/1581612585-5812-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/20200207003601.1526-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/page_io.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--- a/mm/page_io.c~mm-page_io-mark-various-intentional-data-races
+++ a/mm/page_io.c
@@ -85,7 +85,7 @@ static void swap_slot_free_notify(struct
return;
sis = page_swap_info(page);
- if (!(sis->flags & SWP_BLKDEV))
+ if (data_race(!(sis->flags & SWP_BLKDEV)))
return;
/*
@@ -302,7 +302,7 @@ int __swap_writepage(struct page *page,
struct swap_info_struct *sis = page_swap_info(page);
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
- if (sis->flags & SWP_FS) {
+ if (data_race(sis->flags & SWP_FS)) {
struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
@@ -393,7 +393,7 @@ int swap_readpage(struct page *page, boo
goto out;
}
- if (sis->flags & SWP_FS) {
+ if (data_race(sis->flags & SWP_FS)) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
@@ -455,7 +455,7 @@ int swap_set_page_dirty(struct page *pag
{
struct swap_info_struct *sis = page_swap_info(page);
- if (sis->flags & SWP_FS) {
+ if (data_race(sis->flags & SWP_FS)) {
struct address_space *mapping = sis->swap_file->f_mapping;
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
_
* [patch 23/39] mm/swap_state: mark various intentional data races
2020-08-15 0:29 incoming Andrew Morton
` (21 preceding siblings ...)
2020-08-15 0:31 ` [patch 22/39] mm/page_io: " Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 24/39] mm/filemap.c: fix a data race in filemap_fault() Andrew Morton
` (16 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, linux-mm, mm-commits, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/swap_state: mark various intentional data races
swap_cache_info.* could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in lookup_swap_cache / lookup_swap_cache
write to 0xffffffff85517318 of 8 bytes by task 94138 on cpu 101:
lookup_swap_cache+0x12e/0x460
lookup_swap_cache at mm/swap_state.c:322
do_swap_page+0x112/0xeb0
__handle_mm_fault+0xc7a/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffffffff85517318 of 8 bytes by task 91655 on cpu 100:
lookup_swap_cache+0x117/0x460
lookup_swap_cache at mm/swap_state.c:322
shmem_swapin_page+0xc7/0x9e0
shmem_getpage_gfp+0x2ca/0x16c0
shmem_fault+0xef/0x3c0
__do_fault+0x9e/0x220
do_fault+0x4a0/0x920
__handle_mm_fault+0xc69/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 100 PID: 91655 Comm: systemd-journal Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
write to 0xffffffff8d717308 of 8 bytes by task 11365 on cpu 87:
__delete_from_swap_cache+0x681/0x8b0
__delete_from_swap_cache at mm/swap_state.c:178
read to 0xffffffff8d717308 of 8 bytes by task 11275 on cpu 53:
__delete_from_swap_cache+0x66e/0x8b0
__delete_from_swap_cache at mm/swap_state.c:178
Both the read and the write are lockless. Since swap_cache_info.* is
only used to print out counter information, it is harmless if any of the
counters misses a few increments due to data races, so just mark the
accesses as intentional data races using the data_race() macro.
While at it, fix a checkpatch.pl warning,
WARNING: Single statement macros should not use a do {} while (0) loop
Link: http://lkml.kernel.org/r/20200207003715.1578-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/swap_state.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/swap_state.c~mm-swap_state-mark-various-intentional-data-races
+++ a/mm/swap_state.c
@@ -57,8 +57,8 @@ static bool enable_vma_readahead __read_
#define GET_SWAP_RA_VAL(vma) \
(atomic_long_read(&(vma)->swap_readahead_info) ? : 4)
-#define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0)
-#define ADD_CACHE_INFO(x, nr) do { swap_cache_info.x += (nr); } while (0)
+#define INC_CACHE_INFO(x) data_race(swap_cache_info.x++)
+#define ADD_CACHE_INFO(x, nr) data_race(swap_cache_info.x += (nr))
static struct {
unsigned long add_total;
_
* [patch 24/39] mm/filemap.c: fix a data race in filemap_fault()
2020-08-15 0:29 incoming Andrew Morton
` (22 preceding siblings ...)
2020-08-15 0:31 ` [patch 23/39] mm/swap_state: " Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 25/39] mm/swapfile: fix and annotate various data races Andrew Morton
` (15 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, kirill, linux-mm, mm-commits, torvalds, willy
From: Kirill A. Shutemov <kirill@shutemov.name>
Subject: mm/filemap.c: fix a data race in filemap_fault()
struct file_ra_state ra.mmap_miss could be accessed concurrently during
page faults as noticed by KCSAN,
BUG: KCSAN: data-race in filemap_fault / filemap_map_pages
write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
filemap_fault+0x920/0xfc0
do_sync_mmap_readahead at mm/filemap.c:2384
(inlined by) filemap_fault at mm/filemap.c:2486
__xfs_filemap_fault+0x112/0x3e0 [xfs]
xfs_filemap_fault+0x74/0x90 [xfs]
__do_fault+0x9e/0x220
do_fault+0x4a0/0x920
__handle_mm_fault+0xc69/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
filemap_map_pages+0xc2e/0xd80
filemap_map_pages at mm/filemap.c:2625
do_fault+0x3da/0x920
__handle_mm_fault+0xc69/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G W L 5.5.0-next-20200210+ #1
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
ra.mmap_miss contributes to readahead decisions, so a data race on it is
undesirable. Both the read and the write happen only under the
non-exclusive mmap_sem, so two concurrent writers could even underflow the
counter. Fix the underflow by working on a local variable and committing a
single final store to ra.mmap_miss; a small inaccuracy in the counter
should be acceptable.
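The snapshot-then-store pattern can be sketched in userspace as follows. The READ_ONCE()/WRITE_ONCE() definitions are simplified volatile-cast stand-ins (relying on the GNU __typeof__ extension), struct file_ra_sketch and the two helper functions are hypothetical names, and the value of MMAP_LOTSAMISS is an assumption for the sketch.

```c
/* Simplified stand-ins for the kernel macros (GNU __typeof__ extension). */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

#define MMAP_LOTSAMISS 100	/* assumed value for this sketch */

/* Hypothetical reduced version of struct file_ra_state. */
struct file_ra_sketch {
	unsigned int mmap_miss;
};

/* Count a miss, saturating at MMAP_LOTSAMISS * 10; returns the new value. */
static unsigned int note_mmap_miss(struct file_ra_sketch *ra)
{
	unsigned int mmap_miss = READ_ONCE(ra->mmap_miss);

	if (mmap_miss < MMAP_LOTSAMISS * 10)
		WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
	return mmap_miss;
}

/* Count a hit, never going below zero; returns the new value. */
static unsigned int note_mmap_hit(struct file_ra_sketch *ra)
{
	unsigned int mmap_miss = READ_ONCE(ra->mmap_miss);

	if (mmap_miss)
		WRITE_ONCE(ra->mmap_miss, --mmap_miss);
	return mmap_miss;
}
```

Because the bounds check and the stored value both come from the same local snapshot, a concurrent decrement between the check and the store can no longer push the counter below zero. Updates can still be lost, which is the small inaccuracy the message accepts.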
Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Qian Cai <cai@lca.pw>
Tested-by: Qian Cai <cai@lca.pw>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/filemap.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
--- a/mm/filemap.c~mm-filemap-fix-a-data-race-in-filemap_fault
+++ a/mm/filemap.c
@@ -2468,6 +2468,7 @@ static struct file *do_sync_mmap_readahe
struct address_space *mapping = file->f_mapping;
struct file *fpin = NULL;
pgoff_t offset = vmf->pgoff;
+ unsigned int mmap_miss;
/* If we don't want any read-ahead, don't bother */
if (vmf->vma->vm_flags & VM_RAND_READ)
@@ -2483,14 +2484,15 @@ static struct file *do_sync_mmap_readahe
}
/* Avoid banging the cache line if not needed */
- if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
- ra->mmap_miss++;
+ mmap_miss = READ_ONCE(ra->mmap_miss);
+ if (mmap_miss < MMAP_LOTSAMISS * 10)
+ WRITE_ONCE(ra->mmap_miss, ++mmap_miss);
/*
* Do we miss much more than hit in this file? If so,
* stop bothering with read-ahead. It will only hurt.
*/
- if (ra->mmap_miss > MMAP_LOTSAMISS)
+ if (mmap_miss > MMAP_LOTSAMISS)
return fpin;
/*
@@ -2516,13 +2518,15 @@ static struct file *do_async_mmap_readah
struct file_ra_state *ra = &file->f_ra;
struct address_space *mapping = file->f_mapping;
struct file *fpin = NULL;
+ unsigned int mmap_miss;
pgoff_t offset = vmf->pgoff;
/* If we don't want any read-ahead, don't bother */
if (vmf->vma->vm_flags & VM_RAND_READ || !ra->ra_pages)
return fpin;
- if (ra->mmap_miss > 0)
- ra->mmap_miss--;
+ mmap_miss = READ_ONCE(ra->mmap_miss);
+ if (mmap_miss)
+ WRITE_ONCE(ra->mmap_miss, --mmap_miss);
if (PageReadahead(page)) {
fpin = maybe_unlock_mmap_for_io(vmf, fpin);
page_cache_async_readahead(mapping, ra, file,
@@ -2688,6 +2692,7 @@ void filemap_map_pages(struct vm_fault *
unsigned long max_idx;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct page *page;
+ unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
rcu_read_lock();
xas_for_each(&xas, page, end_pgoff) {
@@ -2724,8 +2729,8 @@ void filemap_map_pages(struct vm_fault *
if (page->index >= max_idx)
goto unlock;
- if (file->f_ra.mmap_miss > 0)
- file->f_ra.mmap_miss--;
+ if (mmap_miss > 0)
+ mmap_miss--;
vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
if (vmf->pte)
@@ -2745,6 +2750,7 @@ next:
break;
}
rcu_read_unlock();
+ WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
}
EXPORT_SYMBOL(filemap_map_pages);
_
* [patch 25/39] mm/swapfile: fix and annotate various data races
2020-08-15 0:29 incoming Andrew Morton
` (23 preceding siblings ...)
2020-08-15 0:31 ` [patch 24/39] mm/filemap.c: fix a data race in filemap_fault() Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 26/39] mm/page_counter: fix various data races at memsw Andrew Morton
` (14 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, hughd, linux-mm, mm-commits, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/swapfile: fix and annotate various data races
The swap_info_struct fields si.highest_bit, si.swap_map[offset] and si.flags
can each be accessed concurrently, as reported by KCSAN:
=== si.highest_bit ===
write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
swap_range_alloc+0x81/0x130
swap_range_alloc at mm/swapfile.c:681
scan_swap_map_slots+0x371/0xb90
get_swap_pages+0x39d/0x5c0
get_swap_page+0xf2/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1795/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
scan_swap_map_slots+0x4a6/0xb90
scan_swap_map_slots at mm/swapfile.c:892
get_swap_pages+0x39d/0x5c0
get_swap_page+0xf2/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1795/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
Reported by Kernel Concurrency Sanitizer on:
CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
=== si.swap_map[offset] ===
write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
__swap_entry_free_locked+0x8c/0x100
__swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
__swap_entry_free.constprop.20+0x69/0xb0
free_swap_and_cache+0x53/0xa0
unmap_page_range+0x7f8/0x1d70
unmap_single_vma+0xcd/0x170
unmap_vmas+0x18b/0x220
exit_mmap+0xee/0x220
mmput+0x10e/0x270
do_exit+0x59b/0xf40
do_group_exit+0x8b/0x180
read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
_swap_info_get+0x81/0xa0
_swap_info_get at mm/swapfile.c:1140
free_swap_and_cache+0x40/0xa0
unmap_page_range+0x7f8/0x1d70
unmap_single_vma+0xcd/0x170
unmap_vmas+0x18b/0x220
exit_mmap+0xee/0x220
mmput+0x10e/0x270
do_exit+0x59b/0xf40
do_group_exit+0x8b/0x180
=== si.flags ===
write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
scan_swap_map_slots+0x6fe/0xb50
scan_swap_map_slots at mm/swapfile.c:887
get_swap_pages+0x39d/0x5c0
get_swap_page+0x377/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1795/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
_swap_info_get+0x41/0xa0
__swap_info_get at mm/swapfile.c:1114
put_swap_page+0x84/0x490
__remove_mapping+0x384/0x5f0
shrink_page_list+0xff1/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
The writes are under si->lock but the reads are not. For si.highest_bit
and si.swap_map[offset], a data race could trigger logic bugs, so fix them
by using WRITE_ONCE() for the writes and READ_ONCE() for the reads. The
exception is those isolated reads that only compare against zero, where a
data race would cause no harm; annotate those as intentional data races
using the data_race() macro.
For si.flags, the readers are only interested in a single bit, so a data
race there would cause no issue; those reads are annotated with
data_race() too.
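For readers following along outside the kernel tree, here is a minimal
userspace sketch of the marking pattern described above. It assumes
simplified stand-ins for READ_ONCE(), WRITE_ONCE() and data_race() (the
kernel's real macros do more), and struct swap_model, model_set_usage()
and model_scan() are hypothetical names used only for illustration:

```c
/* Simplified userspace stand-ins (assumption); not the kernel definitions. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
/* In the kernel, data_race() tells KCSAN the race is intentional. */
#define data_race(expr)    (expr)

/* Hypothetical, cut-down model of the fields discussed above. */
struct swap_model {
	unsigned int highest_bit;
	unsigned char swap_map[8];
};

/* Writer side: in the kernel this runs with si->lock held. */
static void model_set_usage(struct swap_model *si, unsigned int offset,
			    unsigned char usage)
{
	WRITE_ONCE(si->swap_map[offset], usage);
}

/*
 * Lockless reader: return the first offset after 'start' whose map entry
 * is zero, or 0 if there is none up to and including highest_bit.
 */
static unsigned int model_scan(struct swap_model *si, unsigned int start)
{
	unsigned int offset = start;

	while (++offset <= READ_ONCE(si->highest_bit)) {
		/* Zero check only: a stale value here is harmless. */
		if (data_race(!si->swap_map[offset]))
			return offset;
	}
	return 0;
}
```

In the real code the lockless scan result is only a hint: the caller
takes si->lock and re-validates the slot before using it.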
[cai@lca.pw: add a missing annotation for si->flags in memory.c]
Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw
Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/memory.c | 4 ++--
mm/swapfile.c | 31 ++++++++++++++++++-------------
2 files changed, 20 insertions(+), 15 deletions(-)
--- a/mm/memory.c~mm-swapfile-fix-and-annotate-various-data-races
+++ a/mm/memory.c
@@ -3130,8 +3130,8 @@ vm_fault_t do_swap_page(struct vm_fault
if (!page) {
struct swap_info_struct *si = swp_swap_info(entry);
- if (si->flags & SWP_SYNCHRONOUS_IO &&
- __swap_count(entry) == 1) {
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
+ __swap_count(entry) == 1) {
/* skip swapcache */
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
--- a/mm/swapfile.c~mm-swapfile-fix-and-annotate-various-data-races
+++ a/mm/swapfile.c
@@ -672,7 +672,7 @@ static void swap_range_alloc(struct swap
if (offset == si->lowest_bit)
si->lowest_bit += nr_entries;
if (end == si->highest_bit)
- si->highest_bit -= nr_entries;
+ WRITE_ONCE(si->highest_bit, si->highest_bit - nr_entries);
si->inuse_pages += nr_entries;
if (si->inuse_pages == si->pages) {
si->lowest_bit = si->max;
@@ -705,7 +705,7 @@ static void swap_range_free(struct swap_
if (end > si->highest_bit) {
bool was_full = !si->highest_bit;
- si->highest_bit = end;
+ WRITE_ONCE(si->highest_bit, end);
if (was_full && (si->flags & SWP_WRITEOK))
add_to_avail_list(si);
}
@@ -870,7 +870,7 @@ checks:
else
goto done;
}
- si->swap_map[offset] = usage;
+ WRITE_ONCE(si->swap_map[offset], usage);
inc_cluster_info_page(si, si->cluster_info, offset);
unlock_cluster(ci);
@@ -929,12 +929,13 @@ done:
scan:
spin_unlock(&si->lock);
- while (++offset <= si->highest_bit) {
- if (!si->swap_map[offset]) {
+ while (++offset <= READ_ONCE(si->highest_bit)) {
+ if (data_race(!si->swap_map[offset])) {
spin_lock(&si->lock);
goto checks;
}
- if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+ if (vm_swap_full() &&
+ READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
spin_lock(&si->lock);
goto checks;
}
@@ -946,11 +947,12 @@ scan:
}
offset = si->lowest_bit;
while (offset < scan_base) {
- if (!si->swap_map[offset]) {
+ if (data_race(!si->swap_map[offset])) {
spin_lock(&si->lock);
goto checks;
}
- if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
+ if (vm_swap_full() &&
+ READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
spin_lock(&si->lock);
goto checks;
}
@@ -1149,7 +1151,7 @@ static struct swap_info_struct *__swap_i
p = swp_swap_info(entry);
if (!p)
goto bad_nofile;
- if (!(p->flags & SWP_USED))
+ if (data_race(!(p->flags & SWP_USED)))
goto bad_device;
offset = swp_offset(entry);
if (offset >= p->max)
@@ -1175,7 +1177,7 @@ static struct swap_info_struct *_swap_in
p = __swap_info_get(entry);
if (!p)
goto out;
- if (!p->swap_map[swp_offset(entry)])
+ if (data_race(!p->swap_map[swp_offset(entry)]))
goto bad_free;
return p;
@@ -1244,7 +1246,10 @@ static unsigned char __swap_entry_free_l
}
usage = count | has_cache;
- p->swap_map[offset] = usage ? : SWAP_HAS_CACHE;
+ if (usage)
+ WRITE_ONCE(p->swap_map[offset], usage);
+ else
+ WRITE_ONCE(p->swap_map[offset], SWAP_HAS_CACHE);
return usage;
}
@@ -1296,7 +1301,7 @@ struct swap_info_struct *get_swap_device
goto bad_nofile;
rcu_read_lock();
- if (!(si->flags & SWP_VALID))
+ if (data_race(!(si->flags & SWP_VALID)))
goto unlock_out;
offset = swp_offset(entry);
if (offset >= si->max)
@@ -3484,7 +3489,7 @@ static int __swap_duplicate(swp_entry_t
} else
err = -ENOENT; /* unused swap entry */
- p->swap_map[offset] = count | has_cache;
+ WRITE_ONCE(p->swap_map[offset], count | has_cache);
unlock_out:
unlock_cluster_or_swap_info(p, ci);
_
* [patch 26/39] mm/page_counter: fix various data races at memsw
2020-08-15 0:29 incoming Andrew Morton
` (24 preceding siblings ...)
2020-08-15 0:31 ` [patch 25/39] mm/swapfile: fix and annotate various data races Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 27/39] mm/memcontrol: fix a data race in scan count Andrew Morton
` (13 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, david, dvyukov, elver, hannes, linux-mm, mhocko,
mm-commits, penguin-kernel, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/page_counter: fix various data races at memsw
Since commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters"),
memcg->memsw->watermark and memcg->memsw->failcnt can be accessed
concurrently, as reported by KCSAN:
BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
try_charge+0x131/0xd50 mm/memcontrol.c:2405
__memcg_kmem_charge_memcg+0x58/0x140
__memcg_kmem_charge+0xcc/0x280
__alloc_pages_nodemask+0x1e1/0x450
alloc_pages_current+0xa6/0x120
pte_alloc_one+0x17/0xd0
__pte_alloc+0x3a/0x1f0
copy_p4d_range+0xc36/0x1990
copy_page_range+0x21d/0x360
dup_mmap+0x5f5/0x7a0
dup_mm+0xa2/0x240
copy_process+0x1b3f/0x3460
_do_fork+0xaa/0xa20
__x64_sys_clone+0x13b/0x170
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe
write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
try_charge+0x131/0xd50 mm/memcontrol.c:2405
mem_cgroup_try_charge+0x159/0x460
mem_cgroup_try_charge_delay+0x3d/0xa0
wp_page_copy+0x14d/0x930
do_wp_page+0x107/0x7b0
__handle_mm_fault+0xce6/0xd40
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
try_charge+0x185/0xbf0 mm/memcontrol.c:2405
__memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
__memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
__alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
try_charge+0x185/0xbf0 mm/memcontrol.c:2405
__memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
__memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
__alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
Since watermark could be compared against or set to garbage due to a data
race, which would change the code logic, fix it by adding a pair of
READ_ONCE() and WRITE_ONCE() in those places.
The "failcnt" counter is tolerant of some degree of inaccuracy and is only
used to report stats, so a data race will not be harmful. Thus, mark it as
an intentional data race using the data_race() macro.
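The two treatments can be sketched in userspace as follows. This is only
an illustration under stated assumptions: the macros are simplified
stand-ins for the kernel ones, and struct counter_model,
model_charge_ok() and model_charge_failed() are hypothetical names:

```c
/* Simplified userspace stand-ins (assumption); not the kernel definitions. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define data_race(expr)    (expr)

/* Hypothetical, cut-down model of the two page_counter fields. */
struct counter_model {
	unsigned long watermark;
	unsigned long failcnt;
};

/*
 * Racy "remember the maximum" update. A concurrent updater can still
 * overwrite a larger value, but each load and store is a single access,
 * so the watermark is never compared against or set to a torn value.
 */
static void model_charge_ok(struct counter_model *c, unsigned long new_usage)
{
	if (new_usage > READ_ONCE(c->watermark))
		WRITE_ONCE(c->watermark, new_usage);
}

/* Statistics only: a lost increment is acceptable, so just annotate it. */
static void model_charge_failed(struct counter_model *c)
{
	data_race(c->failcnt++);
}
```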
Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw
Fixes: 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
Signed-off-by: Qian Cai <cai@lca.pw>
Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/page_counter.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
--- a/mm/page_counter.c~mm-page_counter-fix-various-data-races-at-memsw
+++ a/mm/page_counter.c
@@ -77,8 +77,8 @@ void page_counter_charge(struct page_cou
* This is indeed racy, but we can live with some
* inaccuracy in the watermark.
*/
- if (new > c->watermark)
- c->watermark = new;
+ if (new > READ_ONCE(c->watermark))
+ WRITE_ONCE(c->watermark, new);
}
}
@@ -119,9 +119,10 @@ bool page_counter_try_charge(struct page
propagate_protected_usage(c, new);
/*
* This is racy, but we can live with some
- * inaccuracy in the failcnt.
+ * inaccuracy in the failcnt which is only used
+ * to report stats.
*/
- c->failcnt++;
+ data_race(c->failcnt++);
*fail = c;
goto failed;
}
@@ -130,8 +131,8 @@ bool page_counter_try_charge(struct page
* Just like with failcnt, we can live with some
* inaccuracy in the watermark.
*/
- if (new > c->watermark)
- c->watermark = new;
+ if (new > READ_ONCE(c->watermark))
+ WRITE_ONCE(c->watermark, new);
}
return true;
_
* [patch 27/39] mm/memcontrol: fix a data race in scan count
2020-08-15 0:29 incoming Andrew Morton
` (25 preceding siblings ...)
2020-08-15 0:31 ` [patch 26/39] mm/page_counter: fix various data races at memsw Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 28/39] mm/list_lru: fix a data race in list_lru_count_one Andrew Morton
` (12 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, hannes, linux-mm, mhocko, mm-commits, torvalds, vdavydov.dev
From: Qian Cai <cai@lca.pw>
Subject: mm/memcontrol: fix a data race in scan count
struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
accessed concurrently as noticed by KCSAN,
BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size
write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
mem_cgroup_update_lru_size+0x11c/0x1d0
mem_cgroup_update_lru_size at mm/memcontrol.c:1266
isolate_lru_pages+0x6a9/0xf30
shrink_active_list+0x123/0xcc0
shrink_lruvec+0x8fd/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
lruvec_lru_size+0xbb/0x270
mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
(inlined by) lruvec_lru_size at mm/vmscan.c:326
shrink_lruvec+0x1d0/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_current+0xa6/0x120
alloc_slab_page+0x3b1/0x540
allocate_slab+0x70/0x660
new_slab+0x46/0x70
___slab_alloc+0x4ad/0x7d0
__slab_alloc+0x43/0x70
kmem_cache_alloc+0x2c3/0x420
getname_flags+0x4c/0x230
getname+0x22/0x30
do_sys_openat2+0x205/0x3b0
do_sys_open+0x9a/0xf0
__x64_sys_openat+0x62/0x80
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Reported by Kernel Concurrency Sanitizer on:
CPU: 95 PID: 50964 Comm: cc1 Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
The write is under lru_lock, but the read is lockless. The scan
count is used to determine how aggressively the anon and file LRU lists
should be scanned. Load tearing could produce a bogus count and skew that
heuristic, so fix it by adding READ_ONCE() for the read.
Link: http://lkml.kernel.org/r/20200206034945.2481-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/memcontrol.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/include/linux/memcontrol.h~mm-memcontrol-fix-a-data-race-in-scan-count
+++ a/include/linux/memcontrol.h
@@ -630,7 +630,7 @@ unsigned long mem_cgroup_get_zone_lru_si
struct mem_cgroup_per_node *mz;
mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- return mz->lru_zone_size[zone_idx][lru];
+ return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
}
void mem_cgroup_handle_over_high(void);
_
* [patch 28/39] mm/list_lru: fix a data race in list_lru_count_one
2020-08-15 0:29 incoming Andrew Morton
` (26 preceding siblings ...)
2020-08-15 0:31 ` [patch 27/39] mm/memcontrol: fix a data race in scan count Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 29/39] mm/mempool: fix a data race in mempool_free() Andrew Morton
` (11 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, konrad.wilk, linux-mm, mm-commits, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/list_lru: fix a data race in list_lru_count_one
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,
BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move
write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
list_lru_isolate_move+0xf9/0x130
list_lru_isolate_move at mm/list_lru.c:180
inode_lru_isolate+0x12b/0x2a0
__list_lru_walk_one+0x122/0x3d0
list_lru_walk_one+0x75/0xa0
prune_icache_sb+0x8b/0xc0
super_cache_scan+0x1b8/0x250
do_shrink_slab+0x256/0x6d0
shrink_slab+0x41b/0x4a0
shrink_node+0x35c/0xd80
balance_pgdat+0x652/0xd90
kswapd+0x396/0x8d0
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
list_lru_count_one+0x116/0x2f0
list_lru_count_one at mm/list_lru.c:193
super_cache_count+0xe8/0x170
do_shrink_slab+0x95/0x6d0
shrink_slab+0x41b/0x4a0
shrink_node+0x35c/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #4
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
A torn read of l.nr_items caused by a data race could affect the shrinker
behaviour. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and at most word-sized, assume those are safe from data races to
avoid the readability cost of writing WRITE_ONCE(var, var + val).
Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/list_lru.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/list_lru.c~mm-list_lru-fix-a-data-race-in-list_lru_count_one
+++ a/mm/list_lru.c
@@ -180,7 +180,7 @@ unsigned long list_lru_count_one(struct
rcu_read_lock();
l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
- count = l->nr_items;
+ count = READ_ONCE(l->nr_items);
rcu_read_unlock();
return count;
_
* [patch 29/39] mm/mempool: fix a data race in mempool_free()
2020-08-15 0:29 incoming Andrew Morton
` (27 preceding siblings ...)
2020-08-15 0:31 ` [patch 28/39] mm/list_lru: fix a data race in list_lru_count_one Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 30/39] mm/rmap: annotate a data race at tlb_flush_batched Andrew Morton
` (10 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, linux-mm, mm-commits, oleg, tj, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/mempool: fix a data race in mempool_free()
mempool_t pool.curr_nr could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in mempool_free / remove_element
write to 0xffffffffa937638c of 4 bytes by task 6359 on cpu 113:
remove_element+0x4a/0x1c0
remove_element at mm/mempool.c:132
mempool_alloc+0x102/0x210
(inlined by) mempool_alloc at mm/mempool.c:399
bio_alloc_bioset+0x106/0x2c0
get_swap_bio+0x49/0x230
__swap_writepage+0x680/0xc30
swap_writepage+0x9c/0xf0
pageout+0x33e/0xae0
shrink_page_list+0x1f57/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
<snip>
read to 0xffffffffa937638c of 4 bytes by interrupt on cpu 64:
mempool_free+0x3e/0x150
mempool_free at mm/mempool.c:492
bio_free+0x192/0x280
bio_put+0x91/0xd0
end_swap_bio_write+0x1d8/0x280
bio_endio+0x2c2/0x5b0
dec_pending+0x22b/0x440 [dm_mod]
clone_endio+0xe4/0x2c0 [dm_mod]
bio_endio+0x2c2/0x5b0
blk_update_request+0x217/0x940
scsi_end_request+0x6b/0x4d0
scsi_io_completion+0xb7/0x7e0
scsi_finish_command+0x223/0x310
scsi_softirq_done+0x1d5/0x210
blk_mq_complete_request+0x224/0x250
scsi_mq_done+0xc2/0x250
pqi_raid_io_complete+0x5a/0x70 [smartpqi]
pqi_irq_handler+0x150/0x1410 [smartpqi]
__handle_irq_event_percpu+0x90/0x540
handle_irq_event_percpu+0x49/0xd0
handle_irq_event+0x85/0xca
handle_edge_irq+0x13f/0x3e0
do_IRQ+0x86/0x190
<snip>
The write is under pool->lock, but the read is lockless.
Although commit 5b990546e334 ("mempool: fix and document
synchronization and memory barrier usage") introduced the smp_wmb() and
smp_rmb() pair to improve the situation, that pair is not adequate to
protect the read from data races which could lead to a logic bug, so fix
it by adding READ_ONCE() for the read.
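The shape of the check in mempool_free() is a lockless peek followed by
an authoritative re-check under the lock. A minimal userspace sketch of
that shape, assuming a simplified READ_ONCE() stand-in and a C11
atomic_flag spinlock in place of pool->lock (struct pool_model and the
model_*() helpers are hypothetical names):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified userspace stand-in (assumption); not the kernel definition. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical, cut-down model of a mempool's bookkeeping. */
struct pool_model {
	atomic_flag lock;	/* stands in for pool->lock */
	int curr_nr;		/* elements currently held in reserve */
	int min_nr;		/* reserve size the pool tries to keep */
};

static void model_lock(struct pool_model *p)
{
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void model_unlock(struct pool_model *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Returns true if the freed element went back into the reserve. */
static bool model_free(struct pool_model *p)
{
	bool refilled = false;

	/* Lockless peek: may be stale, so it is only a hint. */
	if (READ_ONCE(p->curr_nr) < p->min_nr) {
		model_lock(p);
		/* Authoritative re-check with the lock held. */
		if (p->curr_nr < p->min_nr) {
			p->curr_nr++;
			refilled = true;
		}
		model_unlock(p);
	}
	return refilled;
}
```

The marked read only guarantees that the hint is loaded in one access;
correctness still rests on the re-check under the lock.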
Link: http://lkml.kernel.org/r/1581446384-2131-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/mempool.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/mempool.c~mm-mempool-fix-a-data-race-in-mempool_free
+++ a/mm/mempool.c
@@ -489,7 +489,7 @@ void mempool_free(void *element, mempool
* ensures that there will be frees which return elements to the
* pool waking up the waiters.
*/
- if (unlikely(pool->curr_nr < pool->min_nr)) {
+ if (unlikely(READ_ONCE(pool->curr_nr) < pool->min_nr)) {
spin_lock_irqsave(&pool->lock, flags);
if (likely(pool->curr_nr < pool->min_nr)) {
add_element(pool, element);
_
* [patch 30/39] mm/rmap: annotate a data race at tlb_flush_batched
2020-08-15 0:29 incoming Andrew Morton
` (28 preceding siblings ...)
2020-08-15 0:31 ` [patch 29/39] mm/mempool: fix a data race in mempool_free() Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 31/39] mm/swap.c: annotate data races for lru_rotate_pvecs Andrew Morton
` (9 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, linux-mm, mm-commits, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/rmap: annotate a data race at tlb_flush_batched
mm->tlb_flush_batched could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
try_to_unmap_one+0x59a/0x1ab0
set_tlb_ubc_flush_pending at mm/rmap.c:635
(inlined by) try_to_unmap_one at mm/rmap.c:1538
rmap_walk_anon+0x296/0x650
rmap_walk+0xdf/0x100
try_to_unmap+0x18a/0x2f0
shrink_page_list+0xef6/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
balance_pgdat+0x652/0xd90
kswapd+0x396/0x8d0
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
flush_tlb_batched_pending+0x29/0x90
flush_tlb_batched_pending at mm/rmap.c:682
change_p4d_range+0x5dd/0x1030
change_pte_range at mm/mprotect.c:44
(inlined by) change_pmd_range at mm/mprotect.c:212
(inlined by) change_pud_range at mm/mprotect.c:240
(inlined by) change_p4d_range at mm/mprotect.c:260
change_protection+0x222/0x310
change_prot_numa+0x3e/0x60
task_numa_work+0x219/0x350
task_work_run+0xed/0x140
prepare_exit_to_usermode+0x2cc/0x2e0
ret_from_intr+0x32/0x42
Reported by Kernel Concurrency Sanitizer on:
CPU: 4 PID: 6364 Comm: mtest01 Tainted: G W L 5.5.0-next-20200210+ #5
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
The read in flush_tlb_batched_pending() is under the PTL but the write is
not. However, mm->tlb_flush_batched is only a bool, so the value is
unlikely to be torn. Thus, mark the read as an intentional data race by
using the data_race() macro.
Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/rmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/rmap.c~mm-rmap-annotate-a-data-race-at-tlb_flush_batched
+++ a/mm/rmap.c
@@ -672,7 +672,7 @@ static bool should_defer_flush(struct mm
*/
void flush_tlb_batched_pending(struct mm_struct *mm)
{
- if (mm->tlb_flush_batched) {
+ if (data_race(mm->tlb_flush_batched)) {
flush_tlb_mm(mm);
/*
_
* [patch 31/39] mm/swap.c: annotate data races for lru_rotate_pvecs
2020-08-15 0:29 incoming Andrew Morton
` (29 preceding siblings ...)
2020-08-15 0:31 ` [patch 30/39] mm/rmap: annotate a data race at tlb_flush_batched Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 32/39] mm: annotate a data race in page_zonenum() Andrew Morton
` (8 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, elver, linux-mm, mm-commits, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm/swap.c: annotate data races for lru_rotate_pvecs
A read of lru_add_pvec->nr can be interrupted, and the interrupt can then
write to the same variable. The write is done with local interrupts
disabled, but the plain reads result in data races. However, it is unlikely
that compilers could do much damage here, given that lru_add_pvec->nr is an
"unsigned char" and there is an existing compiler barrier. Thus, annotate
the reads using the data_race() macro. The data races were reported by
KCSAN:
BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page
write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
rotate_reclaimable_page+0x2df/0x490
pagevec_add at include/linux/pagevec.h:81
(inlined by) rotate_reclaimable_page at mm/swap.c:259
end_page_writeback+0x1b5/0x2b0
end_swap_bio_write+0x1d0/0x280
bio_endio+0x297/0x560
dec_pending+0x218/0x430 [dm_mod]
clone_endio+0xe4/0x2c0 [dm_mod]
bio_endio+0x297/0x560
blk_update_request+0x201/0x920
scsi_end_request+0x6b/0x4a0
scsi_io_completion+0xb7/0x7e0
scsi_finish_command+0x1ed/0x2a0
scsi_softirq_done+0x1c9/0x1d0
blk_done_softirq+0x181/0x1d0
__do_softirq+0xd9/0x57c
irq_exit+0xa2/0xc0
do_IRQ+0x8b/0x190
ret_from_intr+0x0/0x42
delay_tsc+0x46/0x80
__const_udelay+0x3c/0x40
__udelay+0x10/0x20
kcsan_setup_watchpoint+0x202/0x3a0
__tsan_read1+0xc2/0x100
lru_add_drain_cpu+0xb8/0x3f0
lru_add_drain+0x25/0x40
shrink_active_list+0xe1/0xc80
shrink_lruvec+0x766/0xb70
shrink_node+0x2d6/0xca0
do_try_to_free_pages+0x1f7/0x9a0
try_to_free_pages+0x252/0x5b0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x16e/0x6f0
__handle_mm_fault+0xcd5/0xd40
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
lru_add_drain_cpu+0xb8/0x3f0
lru_add_drain_cpu at mm/swap.c:602
lru_add_drain+0x25/0x40
shrink_active_list+0xe1/0xc80
shrink_lruvec+0x766/0xb70
shrink_node+0x2d6/0xca0
do_try_to_free_pages+0x1f7/0x9a0
try_to_free_pages+0x252/0x5b0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x16e/0x6f0
__handle_mm_fault+0xcd5/0xd40
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
2 locks held by oom02/37761:
#0: ffff9281e5928808 (&mm->mmap_sem#2){++++}, at: do_page_fault
#1: ffffffffb3ade380 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part
irq event stamp: 1949217
trace_hardirqs_on_thunk+0x1a/0x1c
__do_softirq+0x2e7/0x57c
__do_softirq+0x34c/0x57c
irq_exit+0xa2/0xc0
Reported by Kernel Concurrency Sanitizer on:
CPU: 23 PID: 37761 Comm: oom02 Not tainted 5.6.0-rc3-next-20200226+ #6
Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018
Link: http://lkml.kernel.org/r/20200228044018.1263-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/swap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- a/mm/swap.c~mm-swap-annotate-data-races-for-lru_rotate_pvecs
+++ a/mm/swap.c
@@ -632,7 +632,8 @@ void lru_add_drain_cpu(int cpu)
__pagevec_lru_add(pvec);
pvec = &per_cpu(lru_rotate.pvec, cpu);
- if (pagevec_count(pvec)) {
+ /* Disabling interrupts below acts as a compiler barrier. */
+ if (data_race(pagevec_count(pvec))) {
unsigned long flags;
/* No harm done if a racing interrupt already did this */
@@ -793,7 +794,7 @@ void lru_add_drain_all(void)
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
- pagevec_count(&per_cpu(lru_rotate.pvec, cpu)) ||
+ data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
_
* [patch 32/39] mm: annotate a data race in page_zonenum()
2020-08-15 0:29 incoming Andrew Morton
` (30 preceding siblings ...)
2020-08-15 0:31 ` [patch 31/39] mm/swap.c: annotate data races for lru_rotate_pvecs Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:31 ` [patch 33/39] include/asm-generic/vmlinux.lds.h: align ro_after_init Andrew Morton
` (7 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, cai, dan.j.williams, david, elver, ira.weiny, jack,
jhubbard, linux-mm, mm-commits, paulmck, torvalds
From: Qian Cai <cai@lca.pw>
Subject: mm: annotate a data race in page_zonenum()
BUG: KCSAN: data-race in page_cpupid_xchg_last / put_page
write (marked) to 0xfffffc0d48ec1a00 of 8 bytes by task 91442 on cpu 3:
page_cpupid_xchg_last+0x51/0x80
page_cpupid_xchg_last at mm/mmzone.c:109 (discriminator 11)
wp_page_reuse+0x3e/0xc0
wp_page_reuse at mm/memory.c:2453
do_wp_page+0x472/0x7b0
do_wp_page at mm/memory.c:2798
__handle_mm_fault+0xcb0/0xd00
handle_pte_fault at mm/memory.c:4049
(inlined by) __handle_mm_fault at mm/memory.c:4163
handle_mm_fault+0xfc/0x2f0
handle_mm_fault at mm/memory.c:4200
do_page_fault+0x263/0x6f9
do_user_addr_fault at arch/x86/mm/fault.c:1465
(inlined by) do_page_fault at arch/x86/mm/fault.c:1539
page_fault+0x34/0x40
read to 0xfffffc0d48ec1a00 of 8 bytes by task 94817 on cpu 69:
put_page+0x15a/0x1f0
page_zonenum at include/linux/mm.h:923
(inlined by) is_zone_device_page at include/linux/mm.h:929
(inlined by) page_is_devmap_managed at include/linux/mm.h:948
(inlined by) put_page at include/linux/mm.h:1023
wp_page_copy+0x571/0x930
wp_page_copy at mm/memory.c:2615
do_wp_page+0x107/0x7b0
__handle_mm_fault+0xcb0/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 69 PID: 94817 Comm: systemd-udevd Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
A page never changes its zone number. The zone number happens to be
stored in the same word as other bits which are modified, but the zone
number bits will never be modified by any other write, so the read can
tolerate a reload of the zone bits after an intervening write and does
not need to use READ_ONCE(). Thus, annotate this data race using
ASSERT_EXCLUSIVE_BITS(), which also asserts that there are no concurrent
writes to those bits.
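To illustrate why a racing write to the other bits of the word is
harmless here, a small userspace sketch follows. The shift and mask
values are made up for the example (the kernel derives ZONES_PGSHIFT and
ZONES_MASK from its configuration), and model_zonenum() is a
hypothetical helper:

```c
/* Hypothetical layout for illustration only: 2 zone bits at bit 4. */
#define MODEL_ZONES_PGSHIFT 4
#define MODEL_ZONES_MASK    0x3UL

/* Extract the zone number from a flags word, as page_zonenum() does. */
static unsigned long model_zonenum(unsigned long flags)
{
	return (flags >> MODEL_ZONES_PGSHIFT) & MODEL_ZONES_MASK;
}

/*
 * Model a writer that changes only bits outside the zone field, e.g. a
 * cpupid update sharing the same word. The zone bits pass through
 * untouched, so a reader sees the same zone before and after.
 */
static unsigned long model_update_other_bits(unsigned long flags,
					     unsigned long new_bits)
{
	unsigned long zone_field = MODEL_ZONES_MASK << MODEL_ZONES_PGSHIFT;

	return (flags & zone_field) | (new_bits & ~zone_field);
}
```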
Link: http://lkml.kernel.org/r/1581619089-14472-1-git-send-email-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/mm.h | 1 +
1 file changed, 1 insertion(+)
--- a/include/linux/mm.h~mm-annotate-a-data-race-in-page_zonenum
+++ a/include/linux/mm.h
@@ -1067,6 +1067,7 @@ vm_fault_t finish_mkwrite_fault(struct v
static inline enum zone_type page_zonenum(const struct page *page)
{
+ ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}
_
^ permalink raw reply [flat|nested] 49+ messages in thread
* [patch 33/39] include/asm-generic/vmlinux.lds.h: align ro_after_init
2020-08-15 0:29 incoming Andrew Morton
` (31 preceding siblings ...)
2020-08-15 0:31 ` [patch 32/39] mm: annotate a data race in page_zonenum() Andrew Morton
@ 2020-08-15 0:31 ` Andrew Morton
2020-08-15 0:32 ` [patch 34/39] sh: clkfwk: remove r8/r16/r32 Andrew Morton
` (6 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:31 UTC (permalink / raw)
To: akpm, amodra, arnd, bin.meng, chenzhou10, dalias, geert+renesas,
glaubitz, krzk, kuninori.morimoto.gx, linux-mm, mm-commits,
romain.naour, sam, stable, torvalds, ysato
From: Romain Naour <romain.naour@gmail.com>
Subject: include/asm-generic/vmlinux.lds.h: align ro_after_init
Since binutils commit [1], a kernel built with a toolchain based on
binutils 2.33.1 fails to boot on a sh4 system under QEMU. Apply the patch
provided by Alan Modra [2], which fixes the alignment of rodata.
[1] https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=ebd2263ba9a9124d93bbc0ece63d7e0fae89b40e
[2] https://www.sourceware.org/ml/binutils/2019-12/msg00112.html
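For readers unfamiliar with linker scripts: ". = ALIGN(8);" advances the
location counter to the next 8-byte boundary. A minimal C sketch of the
same round-up arithmetic follows; demo_align_up() is a hypothetical
helper written for this illustration, not kernel or binutils code.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align; align must be a power of two. */
static uintptr_t demo_align_up(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}
```

An address that is already aligned is returned unchanged, so adding the
ALIGN(8) costs at most 7 bytes of padding before __start_ro_after_init.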
Link: https://marc.info/?l=linux-sh&m=158429470221261
Signed-off-by: Romain Naour <romain.naour@gmail.com>
Cc: Alan Modra <amodra@gmail.com>
Cc: Bin Meng <bin.meng@windriver.com>
Cc: Chen Zhou <chenzhou10@huawei.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/asm-generic/vmlinux.lds.h | 1 +
1 file changed, 1 insertion(+)
--- a/include/asm-generic/vmlinux.lds.h~include-asm-generic-vmlinuxldsh-align-ro_after_init
+++ a/include/asm-generic/vmlinux.lds.h
@@ -394,6 +394,7 @@
*/
#ifndef RO_AFTER_INIT_DATA
#define RO_AFTER_INIT_DATA \
+ . = ALIGN(8); \
__start_ro_after_init = .; \
*(.data..ro_after_init) \
JUMP_TABLE_DATA \
_
^ permalink raw reply [flat|nested] 49+ messages in thread
* [patch 34/39] sh: clkfwk: remove r8/r16/r32
2020-08-15 0:29 incoming Andrew Morton
` (32 preceding siblings ...)
2020-08-15 0:31 ` [patch 33/39] include/asm-generic/vmlinux.lds.h: align ro_after_init Andrew Morton
@ 2020-08-15 0:32 ` Andrew Morton
2020-08-15 0:32 ` [patch 35/39] sh: use generic strncpy() Andrew Morton
` (5 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:32 UTC (permalink / raw)
To: akpm, amodra, bin.meng, chenzhou10, dalias, geert+renesas,
glaubitz, krzk, kuninori.morimoto.gx, linux-mm, mm-commits,
romain.naour, sam, torvalds, ysato
From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Subject: sh: clkfwk: remove r8/r16/r32
SH builds produce the warning below:
${LINUX}/drivers/sh/clk/cpg.c: In function 'r8':
${LINUX}/drivers/sh/clk/cpg.c:41:17: warning: passing argument 1 of 'ioread8'
discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
return ioread8(addr);
^~~~
In file included from ${LINUX}/arch/sh/include/asm/io.h:21,
from ${LINUX}/include/linux/io.h:13,
from ${LINUX}/drivers/sh/clk/cpg.c:14:
${LINUX}/include/asm-generic/iomap.h:29:29: note: expected 'void *' but
argument is of type 'const void *'
extern unsigned int ioread8(void __iomem *);
^~~~~~~~~~~~~~
We don't need "const" for r8/r16/r32, and we don't need r8/r16/r32
themselves. This patch removes them.
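The resulting pattern can be sketched in plain C, assuming stand-in
readers over ordinary memory (the demo_* names are hypothetical and are
not the kernel's ioread8/16/32). The function pointer type matches the
readers' signature exactly, so no wrapper functions are needed to select
an access width.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for width-specific readers that take a non-const pointer. */
static unsigned int demo_read8(void *addr)  { return *(volatile uint8_t *)addr; }
static unsigned int demo_read16(void *addr) { return *(volatile uint16_t *)addr; }
static unsigned int demo_read32(void *addr) { return *(volatile uint32_t *)addr; }

/* Pointer type with the same signature as the readers above. */
typedef unsigned int (*demo_read_fn)(void *addr);

/* Pick a reader by register width, as sh_clk_mstp_enable() does by flag. */
static demo_read_fn demo_pick_reader(unsigned int width_bits)
{
	assert(width_bits == 8 || width_bits == 16 || width_bits == 32);
	if (width_bits == 8)
		return demo_read8;
	if (width_bits == 16)
		return demo_read16;
	return demo_read32;
}
```

Had the pointer type kept a const-qualified parameter while the readers
did not, the assignments would mix incompatible function pointer types.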
X-MARC-Message: https://marc.info/?l=linux-renesas-soc&m=157852973916903
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Alan Modra <amodra@gmail.com>
Cc: Bin Meng <bin.meng@windriver.com>
Cc: Chen Zhou <chenzhou10@huawei.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Romain Naour <romain.naour@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
drivers/sh/clk/cpg.c | 23 ++++-------------------
1 file changed, 4 insertions(+), 19 deletions(-)
--- a/drivers/sh/clk/cpg.c~sh-clkfwk-remove-r8-r16-r32
+++ a/drivers/sh/clk/cpg.c
@@ -36,36 +36,21 @@ static void sh_clk_write(int value, stru
iowrite32(value, clk->mapped_reg);
}
-static unsigned int r8(const void __iomem *addr)
-{
- return ioread8(addr);
-}
-
-static unsigned int r16(const void __iomem *addr)
-{
- return ioread16(addr);
-}
-
-static unsigned int r32(const void __iomem *addr)
-{
- return ioread32(addr);
-}
-
static int sh_clk_mstp_enable(struct clk *clk)
{
sh_clk_write(sh_clk_read(clk) & ~(1 << clk->enable_bit), clk);
if (clk->status_reg) {
- unsigned int (*read)(const void __iomem *addr);
+ unsigned int (*read)(void __iomem *addr);
int i;
void __iomem *mapped_status = (phys_addr_t)clk->status_reg -
(phys_addr_t)clk->enable_reg + clk->mapped_reg;
if (clk->flags & CLK_ENABLE_REG_8BIT)
- read = r8;
+ read = ioread8;
else if (clk->flags & CLK_ENABLE_REG_16BIT)
- read = r16;
+ read = ioread16;
else
- read = r32;
+ read = ioread32;
for (i = 1000;
(read(mapped_status) & (1 << clk->enable_bit)) && i;
_
^ permalink raw reply [flat|nested] 49+ messages in thread
* [patch 35/39] sh: use generic strncpy()
2020-08-15 0:29 incoming Andrew Morton
` (33 preceding siblings ...)
2020-08-15 0:32 ` [patch 34/39] sh: clkfwk: remove r8/r16/r32 Andrew Morton
@ 2020-08-15 0:32 ` Andrew Morton
2020-08-15 0:32 ` [patch 36/39] iomap: constify ioreadX() iomem argument (as in generic implementation) Andrew Morton
` (4 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:32 UTC (permalink / raw)
To: akpm, amodra, bin.meng, chenzhou10, dalias, geert+renesas,
glaubitz, krzk, kuninori.morimoto.gx, linux-mm, mm-commits,
romain.naour, sam, torvalds, ysato
From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Subject: sh: use generic strncpy()
Current SH builds produce the warning below at strncpy():
In file included from ${LINUX}/arch/sh/include/asm/string.h:3,
from ${LINUX}/include/linux/string.h:20,
from ${LINUX}/include/linux/bitmap.h:9,
from ${LINUX}/include/linux/nodemask.h:95,
from ${LINUX}/include/linux/mmzone.h:17,
from ${LINUX}/include/linux/gfp.h:6,
from ${LINUX}/innclude/linux/slab.h:15,
from ${LINUX}/linux/drivers/mmc/host/vub300.c:38:
${LINUX}/drivers/mmc/host/vub300.c: In function 'new_system_port_status':
${LINUX}/arch/sh/include/asm/string_32.h:51:42: warning: array subscript\
80 is above array bounds of 'char[26]' [-Warray-bounds]
: "0" (__dest), "1" (__src), "r" (__src+__n)
~~~~~^~~~
In general, strncpy() should behave as follows:
char dest[10];
char *src = "12345";
strncpy(dest, src, 10);
// dest = {'1', '2', '3', '4', '5',
'\0','\0','\0','\0','\0'}
But the current SH strncpy() has two issues.
First, it will access out-of-bounds memory (= src + 10).
Second, fixing that needs a large rework, and maintaining __asm__
code is difficult.
To solve these issues, this patch simply uses the generic strncpy()
instead of the architecture-specific one.
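The expected behaviour described above can be sketched as portable C.
This is an illustration written for this message, not the kernel's
lib/string.c code; demo_strncpy() is a hypothetical name. It never reads
src past its terminating NUL and pads the rest of dest with NUL bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Copy at most n bytes from src to dest, padding with NULs up to n. */
static char *demo_strncpy(char *dest, const char *src, size_t n)
{
	size_t i = 0;

	assert(dest != NULL && src != NULL);

	/* Stop at n bytes or at the terminator, whichever comes first. */
	for (; i < n && src[i] != '\0'; i++)
		dest[i] = src[i];

	/* Fill the remainder of dest with NUL bytes. */
	for (; i < n; i++)
		dest[i] = '\0';

	return dest;
}
```

As with the standard function, dest is not NUL-terminated when src holds
n or more characters before its terminator.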
Link: https://marc.info/?l=linux-renesas-soc&m=157664657013309
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Alan Modra <amodra@gmail.com>
Cc: Bin Meng <bin.meng@windriver.com>
Cc: Chen Zhou <chenzhou10@huawei.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Romain Naour <romain.naour@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/sh/include/asm/string_32.h | 26 --------------------------
1 file changed, 26 deletions(-)
--- a/arch/sh/include/asm/string_32.h~sh-use-generic-strncpy
+++ a/arch/sh/include/asm/string_32.h
@@ -28,32 +28,6 @@ static inline char *strcpy(char *__dest,
return __xdest;
}
-#define __HAVE_ARCH_STRNCPY
-static inline char *strncpy(char *__dest, const char *__src, size_t __n)
-{
- register char *__xdest = __dest;
- unsigned long __dummy;
-
- if (__n == 0)
- return __xdest;
-
- __asm__ __volatile__(
- "1:\n"
- "mov.b @%1+, %2\n\t"
- "mov.b %2, @%0\n\t"
- "cmp/eq #0, %2\n\t"
- "bt/s 2f\n\t"
- " cmp/eq %5,%1\n\t"
- "bf/s 1b\n\t"
- " add #1, %0\n"
- "2:"
- : "=r" (__dest), "=r" (__src), "=&z" (__dummy)
- : "0" (__dest), "1" (__src), "r" (__src+__n)
- : "memory", "t");
-
- return __xdest;
-}
-
#define __HAVE_ARCH_STRCMP
static inline int strcmp(const char *__cs, const char *__ct)
{
_
^ permalink raw reply [flat|nested] 49+ messages in thread
* [patch 36/39] iomap: constify ioreadX() iomem argument (as in generic implementation)
2020-08-15 0:29 incoming Andrew Morton
` (34 preceding siblings ...)
2020-08-15 0:32 ` [patch 35/39] sh: use generic strncpy() Andrew Morton
@ 2020-08-15 0:32 ` Andrew Morton
2020-08-15 0:32 ` [patch 37/39] rtl818x: " Andrew Morton
` (3 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:32 UTC (permalink / raw)
To: akpm, allenbh, arnd, benh, dalias, dave.jiang, davem, deller,
geert+renesas, geert, ink, James.Bottomley, jasowang, jdmason,
krzk, kuba, kvalo, linux-mm, mattst88, mm-commits, mpe, mst,
paulus, rth, torvalds, ysato
From: Krzysztof Kozlowski <krzk@kernel.org>
Subject: iomap: constify ioreadX() iomem argument (as in generic implementation)
Patch series "iomap: Constify ioreadX() iomem argument", v3.
The ioread8/16/32() helpers and others have an inconsistent interface
across architectures: some take the address as a pointer to const, some
do not.
Nothing seems to prevent all of them from taking a pointer to const.
This patch (of 4):
The ioreadX() and ioreadX_rep() helpers have an inconsistent interface.
On some architectures the void __iomem * address argument is a pointer
to const, on others it is not.
Implementations of ioreadX() do not modify the memory under the address so
they can be converted to a "const" version for const-safety and
consistency among architectures.
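The benefit for callers can be sketched in plain C, using a hypothetical
stand-in that reads ordinary memory (demo_ioread32() and
demo_read_status() are illustrative names, and the sparse __iomem
annotation is omitted). Once the reader takes a pointer to const, code
that holds only a const pointer can call it without a cast, while
callers with non-const pointers are unaffected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a constified reader: it only reads through addr. */
static unsigned int demo_ioread32(const void *addr)
{
	assert(addr != NULL);
	return *(const volatile uint32_t *)addr;
}

/* A caller holding only a const pointer needs no cast to read. */
static unsigned int demo_read_status(const uint32_t *status_reg)
{
	return demo_ioread32(status_reg);
}
```

The conversion from a non-const to a const pointer is implicit in C, so
existing callers compile unchanged; only function pointer types that
name the old signature need updating, as the rest of this patch does.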
[krzk@kernel.org: sh: clk: fix assignment from incompatible pointer type for ioreadX()]
Link: http://lkml.kernel.org/r/20200723082017.24053-1-krzk@kernel.org
[akpm@linux-foundation.org: fix drivers/mailbox/bcm-pdc-mailbox.c]
Link: http://lkml.kernel.org/r/202007132209.Rxmv4QyS%25lkp@intel.com
Link: http://lkml.kernel.org/r/20200709072837.5869-1-krzk@kernel.org
Link: http://lkml.kernel.org/r/20200709072837.5869-2-krzk@kernel.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Allen Hubbe <allenbh@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/alpha/include/asm/core_apecs.h | 6 +-
arch/alpha/include/asm/core_cia.h | 6 +-
arch/alpha/include/asm/core_lca.h | 6 +-
arch/alpha/include/asm/core_marvel.h | 4 -
arch/alpha/include/asm/core_mcpcia.h | 6 +-
arch/alpha/include/asm/core_t2.h | 2
arch/alpha/include/asm/io.h | 12 ++--
arch/alpha/include/asm/io_trivial.h | 16 ++---
arch/alpha/include/asm/jensen.h | 2
arch/alpha/include/asm/machvec.h | 6 +-
arch/alpha/kernel/core_marvel.c | 2
arch/alpha/kernel/io.c | 12 ++--
arch/parisc/include/asm/io.h | 4 -
arch/parisc/lib/iomap.c | 72 ++++++++++++------------
arch/powerpc/kernel/iomap.c | 28 ++++-----
arch/sh/kernel/iomap.c | 22 +++----
drivers/mailbox/bcm-pdc-mailbox.c | 2
drivers/sh/clk/cpg.c | 2
include/asm-generic/iomap.h | 28 ++++-----
include/linux/io-64-nonatomic-hi-lo.h | 4 -
include/linux/io-64-nonatomic-lo-hi.h | 4 -
lib/iomap.c | 30 +++++-----
22 files changed, 138 insertions(+), 138 deletions(-)
--- a/arch/alpha/include/asm/core_apecs.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/core_apecs.h
@@ -384,7 +384,7 @@ struct el_apecs_procdata
} \
} while (0)
-__EXTERN_INLINE unsigned int apecs_ioread8(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int apecs_ioread8(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -420,7 +420,7 @@ __EXTERN_INLINE void apecs_iowrite8(u8 b
*(vuip) ((addr << 5) + base_and_type) = w;
}
-__EXTERN_INLINE unsigned int apecs_ioread16(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int apecs_ioread16(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -456,7 +456,7 @@ __EXTERN_INLINE void apecs_iowrite16(u16
*(vuip) ((addr << 5) + base_and_type) = w;
}
-__EXTERN_INLINE unsigned int apecs_ioread32(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int apecs_ioread32(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
if (addr < APECS_DENSE_MEM)
--- a/arch/alpha/include/asm/core_cia.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/core_cia.h
@@ -342,7 +342,7 @@ struct el_CIA_sysdata_mcheck {
#define vuip volatile unsigned int __force *
#define vulp volatile unsigned long __force *
-__EXTERN_INLINE unsigned int cia_ioread8(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int cia_ioread8(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -374,7 +374,7 @@ __EXTERN_INLINE void cia_iowrite8(u8 b,
*(vuip) ((addr << 5) + base_and_type) = w;
}
-__EXTERN_INLINE unsigned int cia_ioread16(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int cia_ioread16(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -404,7 +404,7 @@ __EXTERN_INLINE void cia_iowrite16(u16 b
*(vuip) ((addr << 5) + base_and_type) = w;
}
-__EXTERN_INLINE unsigned int cia_ioread32(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int cia_ioread32(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
if (addr < CIA_DENSE_MEM)
--- a/arch/alpha/include/asm/core_lca.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/core_lca.h
@@ -230,7 +230,7 @@ union el_lca {
} while (0)
-__EXTERN_INLINE unsigned int lca_ioread8(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int lca_ioread8(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -266,7 +266,7 @@ __EXTERN_INLINE void lca_iowrite8(u8 b,
*(vuip) ((addr << 5) + base_and_type) = w;
}
-__EXTERN_INLINE unsigned int lca_ioread16(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int lca_ioread16(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -302,7 +302,7 @@ __EXTERN_INLINE void lca_iowrite16(u16 b
*(vuip) ((addr << 5) + base_and_type) = w;
}
-__EXTERN_INLINE unsigned int lca_ioread32(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int lca_ioread32(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
if (addr < LCA_DENSE_MEM)
--- a/arch/alpha/include/asm/core_marvel.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/core_marvel.h
@@ -332,10 +332,10 @@ struct io7 {
#define vucp volatile unsigned char __force *
#define vusp volatile unsigned short __force *
-extern unsigned int marvel_ioread8(void __iomem *);
+extern unsigned int marvel_ioread8(const void __iomem *);
extern void marvel_iowrite8(u8 b, void __iomem *);
-__EXTERN_INLINE unsigned int marvel_ioread16(void __iomem *addr)
+__EXTERN_INLINE unsigned int marvel_ioread16(const void __iomem *addr)
{
return __kernel_ldwu(*(vusp)addr);
}
--- a/arch/alpha/include/asm/core_mcpcia.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/core_mcpcia.h
@@ -267,7 +267,7 @@ extern inline int __mcpcia_is_mmio(unsig
return (addr & 0x80000000UL) == 0;
}
-__EXTERN_INLINE unsigned int mcpcia_ioread8(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int mcpcia_ioread8(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long)xaddr & MCPCIA_MEM_MASK;
unsigned long hose = (unsigned long)xaddr & ~MCPCIA_MEM_MASK;
@@ -291,7 +291,7 @@ __EXTERN_INLINE void mcpcia_iowrite8(u8
*(vuip) ((addr << 5) + hose + 0x00) = w;
}
-__EXTERN_INLINE unsigned int mcpcia_ioread16(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int mcpcia_ioread16(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long)xaddr & MCPCIA_MEM_MASK;
unsigned long hose = (unsigned long)xaddr & ~MCPCIA_MEM_MASK;
@@ -315,7 +315,7 @@ __EXTERN_INLINE void mcpcia_iowrite16(u1
*(vuip) ((addr << 5) + hose + 0x08) = w;
}
-__EXTERN_INLINE unsigned int mcpcia_ioread32(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int mcpcia_ioread32(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long)xaddr;
--- a/arch/alpha/include/asm/core_t2.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/core_t2.h
@@ -572,7 +572,7 @@ __EXTERN_INLINE int t2_is_mmio(const vol
it doesn't make sense to merge the pio and mmio routines. */
#define IOPORT(OS, NS) \
-__EXTERN_INLINE unsigned int t2_ioread##NS(void __iomem *xaddr) \
+__EXTERN_INLINE unsigned int t2_ioread##NS(const void __iomem *xaddr) \
{ \
if (t2_is_mmio(xaddr)) \
return t2_read##OS(xaddr); \
--- a/arch/alpha/include/asm/io.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/io.h
@@ -150,9 +150,9 @@ static inline void generic_##NAME(TYPE b
alpha_mv.mv_##NAME(b, addr); \
}
-REMAP1(unsigned int, ioread8, /**/)
-REMAP1(unsigned int, ioread16, /**/)
-REMAP1(unsigned int, ioread32, /**/)
+REMAP1(unsigned int, ioread8, const)
+REMAP1(unsigned int, ioread16, const)
+REMAP1(unsigned int, ioread32, const)
REMAP1(u8, readb, const volatile)
REMAP1(u16, readw, const volatile)
REMAP1(u32, readl, const volatile)
@@ -307,7 +307,7 @@ static inline int __is_mmio(const volati
*/
#if IO_CONCAT(__IO_PREFIX,trivial_io_bw)
-extern inline unsigned int ioread8(void __iomem *addr)
+extern inline unsigned int ioread8(const void __iomem *addr)
{
unsigned int ret;
mb();
@@ -316,7 +316,7 @@ extern inline unsigned int ioread8(void
return ret;
}
-extern inline unsigned int ioread16(void __iomem *addr)
+extern inline unsigned int ioread16(const void __iomem *addr)
{
unsigned int ret;
mb();
@@ -359,7 +359,7 @@ extern inline void outw(u16 b, unsigned
#endif
#if IO_CONCAT(__IO_PREFIX,trivial_io_lq)
-extern inline unsigned int ioread32(void __iomem *addr)
+extern inline unsigned int ioread32(const void __iomem *addr)
{
unsigned int ret;
mb();
--- a/arch/alpha/include/asm/io_trivial.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/io_trivial.h
@@ -7,15 +7,15 @@
#if IO_CONCAT(__IO_PREFIX,trivial_io_bw)
__EXTERN_INLINE unsigned int
-IO_CONCAT(__IO_PREFIX,ioread8)(void __iomem *a)
+IO_CONCAT(__IO_PREFIX,ioread8)(const void __iomem *a)
{
- return __kernel_ldbu(*(volatile u8 __force *)a);
+ return __kernel_ldbu(*(const volatile u8 __force *)a);
}
__EXTERN_INLINE unsigned int
-IO_CONCAT(__IO_PREFIX,ioread16)(void __iomem *a)
+IO_CONCAT(__IO_PREFIX,ioread16)(const void __iomem *a)
{
- return __kernel_ldwu(*(volatile u16 __force *)a);
+ return __kernel_ldwu(*(const volatile u16 __force *)a);
}
__EXTERN_INLINE void
@@ -33,9 +33,9 @@ IO_CONCAT(__IO_PREFIX,iowrite16)(u16 b,
#if IO_CONCAT(__IO_PREFIX,trivial_io_lq)
__EXTERN_INLINE unsigned int
-IO_CONCAT(__IO_PREFIX,ioread32)(void __iomem *a)
+IO_CONCAT(__IO_PREFIX,ioread32)(const void __iomem *a)
{
- return *(volatile u32 __force *)a;
+ return *(const volatile u32 __force *)a;
}
__EXTERN_INLINE void
@@ -73,14 +73,14 @@ IO_CONCAT(__IO_PREFIX,writew)(u16 b, vol
__EXTERN_INLINE u8
IO_CONCAT(__IO_PREFIX,readb)(const volatile void __iomem *a)
{
- void __iomem *addr = (void __iomem *)a;
+ const void __iomem *addr = (const void __iomem *)a;
return IO_CONCAT(__IO_PREFIX,ioread8)(addr);
}
__EXTERN_INLINE u16
IO_CONCAT(__IO_PREFIX,readw)(const volatile void __iomem *a)
{
- void __iomem *addr = (void __iomem *)a;
+ const void __iomem *addr = (const void __iomem *)a;
return IO_CONCAT(__IO_PREFIX,ioread16)(addr);
}
--- a/arch/alpha/include/asm/jensen.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/jensen.h
@@ -305,7 +305,7 @@ __EXTERN_INLINE int jensen_is_mmio(const
that it doesn't make sense to merge them. */
#define IOPORT(OS, NS) \
-__EXTERN_INLINE unsigned int jensen_ioread##NS(void __iomem *xaddr) \
+__EXTERN_INLINE unsigned int jensen_ioread##NS(const void __iomem *xaddr) \
{ \
if (jensen_is_mmio(xaddr)) \
return jensen_read##OS(xaddr - 0x100000000ul); \
--- a/arch/alpha/include/asm/machvec.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/include/asm/machvec.h
@@ -46,9 +46,9 @@ struct alpha_machine_vector
void (*mv_pci_tbi)(struct pci_controller *hose,
dma_addr_t start, dma_addr_t end);
- unsigned int (*mv_ioread8)(void __iomem *);
- unsigned int (*mv_ioread16)(void __iomem *);
- unsigned int (*mv_ioread32)(void __iomem *);
+ unsigned int (*mv_ioread8)(const void __iomem *);
+ unsigned int (*mv_ioread16)(const void __iomem *);
+ unsigned int (*mv_ioread32)(const void __iomem *);
void (*mv_iowrite8)(u8, void __iomem *);
void (*mv_iowrite16)(u16, void __iomem *);
--- a/arch/alpha/kernel/core_marvel.c~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/kernel/core_marvel.c
@@ -806,7 +806,7 @@ void __iomem *marvel_ioportmap (unsigned
}
unsigned int
-marvel_ioread8(void __iomem *xaddr)
+marvel_ioread8(const void __iomem *xaddr)
{
unsigned long addr = (unsigned long) xaddr;
if (__marvel_is_port_kbd(addr))
--- a/arch/alpha/kernel/io.c~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/alpha/kernel/io.c
@@ -14,7 +14,7 @@
"generic", which bumps through the machine vector. */
unsigned int
-ioread8(void __iomem *addr)
+ioread8(const void __iomem *addr)
{
unsigned int ret;
mb();
@@ -23,7 +23,7 @@ ioread8(void __iomem *addr)
return ret;
}
-unsigned int ioread16(void __iomem *addr)
+unsigned int ioread16(const void __iomem *addr)
{
unsigned int ret;
mb();
@@ -32,7 +32,7 @@ unsigned int ioread16(void __iomem *addr
return ret;
}
-unsigned int ioread32(void __iomem *addr)
+unsigned int ioread32(const void __iomem *addr)
{
unsigned int ret;
mb();
@@ -257,7 +257,7 @@ EXPORT_SYMBOL(readq_relaxed);
/*
* Read COUNT 8-bit bytes from port PORT into memory starting at SRC.
*/
-void ioread8_rep(void __iomem *port, void *dst, unsigned long count)
+void ioread8_rep(const void __iomem *port, void *dst, unsigned long count)
{
while ((unsigned long)dst & 0x3) {
if (!count)
@@ -300,7 +300,7 @@ EXPORT_SYMBOL(insb);
* the interfaces seems to be slow: just using the inlined version
* of the inw() breaks things.
*/
-void ioread16_rep(void __iomem *port, void *dst, unsigned long count)
+void ioread16_rep(const void __iomem *port, void *dst, unsigned long count)
{
if (unlikely((unsigned long)dst & 0x3)) {
if (!count)
@@ -340,7 +340,7 @@ EXPORT_SYMBOL(insw);
* but the interfaces seems to be slow: just using the inlined version
* of the inl() breaks things.
*/
-void ioread32_rep(void __iomem *port, void *dst, unsigned long count)
+void ioread32_rep(const void __iomem *port, void *dst, unsigned long count)
{
if (unlikely((unsigned long)dst & 0x3)) {
while (count--) {
--- a/arch/parisc/include/asm/io.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/parisc/include/asm/io.h
@@ -303,8 +303,8 @@ extern void outsl (unsigned long port, c
#define ioread64be ioread64be
#define iowrite64 iowrite64
#define iowrite64be iowrite64be
-extern u64 ioread64(void __iomem *addr);
-extern u64 ioread64be(void __iomem *addr);
+extern u64 ioread64(const void __iomem *addr);
+extern u64 ioread64be(const void __iomem *addr);
extern void iowrite64(u64 val, void __iomem *addr);
extern void iowrite64be(u64 val, void __iomem *addr);
--- a/arch/parisc/lib/iomap.c~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/parisc/lib/iomap.c
@@ -43,13 +43,13 @@
#endif
struct iomap_ops {
- unsigned int (*read8)(void __iomem *);
- unsigned int (*read16)(void __iomem *);
- unsigned int (*read16be)(void __iomem *);
- unsigned int (*read32)(void __iomem *);
- unsigned int (*read32be)(void __iomem *);
- u64 (*read64)(void __iomem *);
- u64 (*read64be)(void __iomem *);
+ unsigned int (*read8)(const void __iomem *);
+ unsigned int (*read16)(const void __iomem *);
+ unsigned int (*read16be)(const void __iomem *);
+ unsigned int (*read32)(const void __iomem *);
+ unsigned int (*read32be)(const void __iomem *);
+ u64 (*read64)(const void __iomem *);
+ u64 (*read64be)(const void __iomem *);
void (*write8)(u8, void __iomem *);
void (*write16)(u16, void __iomem *);
void (*write16be)(u16, void __iomem *);
@@ -57,9 +57,9 @@ struct iomap_ops {
void (*write32be)(u32, void __iomem *);
void (*write64)(u64, void __iomem *);
void (*write64be)(u64, void __iomem *);
- void (*read8r)(void __iomem *, void *, unsigned long);
- void (*read16r)(void __iomem *, void *, unsigned long);
- void (*read32r)(void __iomem *, void *, unsigned long);
+ void (*read8r)(const void __iomem *, void *, unsigned long);
+ void (*read16r)(const void __iomem *, void *, unsigned long);
+ void (*read32r)(const void __iomem *, void *, unsigned long);
void (*write8r)(void __iomem *, const void *, unsigned long);
void (*write16r)(void __iomem *, const void *, unsigned long);
void (*write32r)(void __iomem *, const void *, unsigned long);
@@ -69,17 +69,17 @@ struct iomap_ops {
#define ADDR2PORT(addr) ((unsigned long __force)(addr) & 0xffffff)
-static unsigned int ioport_read8(void __iomem *addr)
+static unsigned int ioport_read8(const void __iomem *addr)
{
return inb(ADDR2PORT(addr));
}
-static unsigned int ioport_read16(void __iomem *addr)
+static unsigned int ioport_read16(const void __iomem *addr)
{
return inw(ADDR2PORT(addr));
}
-static unsigned int ioport_read32(void __iomem *addr)
+static unsigned int ioport_read32(const void __iomem *addr)
{
return inl(ADDR2PORT(addr));
}
@@ -99,17 +99,17 @@ static void ioport_write32(u32 datum, vo
outl(datum, ADDR2PORT(addr));
}
-static void ioport_read8r(void __iomem *addr, void *dst, unsigned long count)
+static void ioport_read8r(const void __iomem *addr, void *dst, unsigned long count)
{
insb(ADDR2PORT(addr), dst, count);
}
-static void ioport_read16r(void __iomem *addr, void *dst, unsigned long count)
+static void ioport_read16r(const void __iomem *addr, void *dst, unsigned long count)
{
insw(ADDR2PORT(addr), dst, count);
}
-static void ioport_read32r(void __iomem *addr, void *dst, unsigned long count)
+static void ioport_read32r(const void __iomem *addr, void *dst, unsigned long count)
{
insl(ADDR2PORT(addr), dst, count);
}
@@ -150,37 +150,37 @@ static const struct iomap_ops ioport_ops
/* Legacy I/O memory ops */
-static unsigned int iomem_read8(void __iomem *addr)
+static unsigned int iomem_read8(const void __iomem *addr)
{
return readb(addr);
}
-static unsigned int iomem_read16(void __iomem *addr)
+static unsigned int iomem_read16(const void __iomem *addr)
{
return readw(addr);
}
-static unsigned int iomem_read16be(void __iomem *addr)
+static unsigned int iomem_read16be(const void __iomem *addr)
{
return __raw_readw(addr);
}
-static unsigned int iomem_read32(void __iomem *addr)
+static unsigned int iomem_read32(const void __iomem *addr)
{
return readl(addr);
}
-static unsigned int iomem_read32be(void __iomem *addr)
+static unsigned int iomem_read32be(const void __iomem *addr)
{
return __raw_readl(addr);
}
-static u64 iomem_read64(void __iomem *addr)
+static u64 iomem_read64(const void __iomem *addr)
{
return readq(addr);
}
-static u64 iomem_read64be(void __iomem *addr)
+static u64 iomem_read64be(const void __iomem *addr)
{
return __raw_readq(addr);
}
@@ -220,7 +220,7 @@ static void iomem_write64be(u64 datum, v
__raw_writel(datum, addr);
}
-static void iomem_read8r(void __iomem *addr, void *dst, unsigned long count)
+static void iomem_read8r(const void __iomem *addr, void *dst, unsigned long count)
{
while (count--) {
*(u8 *)dst = __raw_readb(addr);
@@ -228,7 +228,7 @@ static void iomem_read8r(void __iomem *a
}
}
-static void iomem_read16r(void __iomem *addr, void *dst, unsigned long count)
+static void iomem_read16r(const void __iomem *addr, void *dst, unsigned long count)
{
while (count--) {
*(u16 *)dst = __raw_readw(addr);
@@ -236,7 +236,7 @@ static void iomem_read16r(void __iomem *
}
}
-static void iomem_read32r(void __iomem *addr, void *dst, unsigned long count)
+static void iomem_read32r(const void __iomem *addr, void *dst, unsigned long count)
{
while (count--) {
*(u32 *)dst = __raw_readl(addr);
@@ -297,49 +297,49 @@ static const struct iomap_ops *iomap_ops
};
-unsigned int ioread8(void __iomem *addr)
+unsigned int ioread8(const void __iomem *addr)
{
if (unlikely(INDIRECT_ADDR(addr)))
return iomap_ops[ADDR_TO_REGION(addr)]->read8(addr);
return *((u8 *)addr);
}
-unsigned int ioread16(void __iomem *addr)
+unsigned int ioread16(const void __iomem *addr)
{
if (unlikely(INDIRECT_ADDR(addr)))
return iomap_ops[ADDR_TO_REGION(addr)]->read16(addr);
return le16_to_cpup((u16 *)addr);
}
-unsigned int ioread16be(void __iomem *addr)
+unsigned int ioread16be(const void __iomem *addr)
{
if (unlikely(INDIRECT_ADDR(addr)))
return iomap_ops[ADDR_TO_REGION(addr)]->read16be(addr);
return *((u16 *)addr);
}
-unsigned int ioread32(void __iomem *addr)
+unsigned int ioread32(const void __iomem *addr)
{
if (unlikely(INDIRECT_ADDR(addr)))
return iomap_ops[ADDR_TO_REGION(addr)]->read32(addr);
return le32_to_cpup((u32 *)addr);
}
-unsigned int ioread32be(void __iomem *addr)
+unsigned int ioread32be(const void __iomem *addr)
{
if (unlikely(INDIRECT_ADDR(addr)))
return iomap_ops[ADDR_TO_REGION(addr)]->read32be(addr);
return *((u32 *)addr);
}
-u64 ioread64(void __iomem *addr)
+u64 ioread64(const void __iomem *addr)
{
if (unlikely(INDIRECT_ADDR(addr)))
return iomap_ops[ADDR_TO_REGION(addr)]->read64(addr);
return le64_to_cpup((u64 *)addr);
}
-u64 ioread64be(void __iomem *addr)
+u64 ioread64be(const void __iomem *addr)
{
if (unlikely(INDIRECT_ADDR(addr)))
return iomap_ops[ADDR_TO_REGION(addr)]->read64be(addr);
@@ -411,7 +411,7 @@ void iowrite64be(u64 datum, void __iomem
/* Repeating interfaces */
-void ioread8_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread8_rep(const void __iomem *addr, void *dst, unsigned long count)
{
if (unlikely(INDIRECT_ADDR(addr))) {
iomap_ops[ADDR_TO_REGION(addr)]->read8r(addr, dst, count);
@@ -423,7 +423,7 @@ void ioread8_rep(void __iomem *addr, voi
}
}
-void ioread16_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread16_rep(const void __iomem *addr, void *dst, unsigned long count)
{
if (unlikely(INDIRECT_ADDR(addr))) {
iomap_ops[ADDR_TO_REGION(addr)]->read16r(addr, dst, count);
@@ -435,7 +435,7 @@ void ioread16_rep(void __iomem *addr, vo
}
}
-void ioread32_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread32_rep(const void __iomem *addr, void *dst, unsigned long count)
{
if (unlikely(INDIRECT_ADDR(addr))) {
iomap_ops[ADDR_TO_REGION(addr)]->read32r(addr, dst, count);
--- a/arch/powerpc/kernel/iomap.c~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/powerpc/kernel/iomap.c
@@ -15,23 +15,23 @@
* Here comes the ppc64 implementation of the IOMAP
* interfaces.
*/
-unsigned int ioread8(void __iomem *addr)
+unsigned int ioread8(const void __iomem *addr)
{
return readb(addr);
}
-unsigned int ioread16(void __iomem *addr)
+unsigned int ioread16(const void __iomem *addr)
{
return readw(addr);
}
-unsigned int ioread16be(void __iomem *addr)
+unsigned int ioread16be(const void __iomem *addr)
{
return readw_be(addr);
}
-unsigned int ioread32(void __iomem *addr)
+unsigned int ioread32(const void __iomem *addr)
{
return readl(addr);
}
-unsigned int ioread32be(void __iomem *addr)
+unsigned int ioread32be(const void __iomem *addr)
{
return readl_be(addr);
}
@@ -41,27 +41,27 @@ EXPORT_SYMBOL(ioread16be);
EXPORT_SYMBOL(ioread32);
EXPORT_SYMBOL(ioread32be);
#ifdef __powerpc64__
-u64 ioread64(void __iomem *addr)
+u64 ioread64(const void __iomem *addr)
{
return readq(addr);
}
-u64 ioread64_lo_hi(void __iomem *addr)
+u64 ioread64_lo_hi(const void __iomem *addr)
{
return readq(addr);
}
-u64 ioread64_hi_lo(void __iomem *addr)
+u64 ioread64_hi_lo(const void __iomem *addr)
{
return readq(addr);
}
-u64 ioread64be(void __iomem *addr)
+u64 ioread64be(const void __iomem *addr)
{
return readq_be(addr);
}
-u64 ioread64be_lo_hi(void __iomem *addr)
+u64 ioread64be_lo_hi(const void __iomem *addr)
{
return readq_be(addr);
}
-u64 ioread64be_hi_lo(void __iomem *addr)
+u64 ioread64be_hi_lo(const void __iomem *addr)
{
return readq_be(addr);
}
@@ -139,15 +139,15 @@ EXPORT_SYMBOL(iowrite64be_hi_lo);
* FIXME! We could make these do EEH handling if we really
* wanted. Not clear if we do.
*/
-void ioread8_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread8_rep(const void __iomem *addr, void *dst, unsigned long count)
{
readsb(addr, dst, count);
}
-void ioread16_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread16_rep(const void __iomem *addr, void *dst, unsigned long count)
{
readsw(addr, dst, count);
}
-void ioread32_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread32_rep(const void __iomem *addr, void *dst, unsigned long count)
{
readsl(addr, dst, count);
}
--- a/arch/sh/kernel/iomap.c~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/arch/sh/kernel/iomap.c
@@ -8,31 +8,31 @@
#include <linux/module.h>
#include <linux/io.h>
-unsigned int ioread8(void __iomem *addr)
+unsigned int ioread8(const void __iomem *addr)
{
return readb(addr);
}
EXPORT_SYMBOL(ioread8);
-unsigned int ioread16(void __iomem *addr)
+unsigned int ioread16(const void __iomem *addr)
{
return readw(addr);
}
EXPORT_SYMBOL(ioread16);
-unsigned int ioread16be(void __iomem *addr)
+unsigned int ioread16be(const void __iomem *addr)
{
return be16_to_cpu(__raw_readw(addr));
}
EXPORT_SYMBOL(ioread16be);
-unsigned int ioread32(void __iomem *addr)
+unsigned int ioread32(const void __iomem *addr)
{
return readl(addr);
}
EXPORT_SYMBOL(ioread32);
-unsigned int ioread32be(void __iomem *addr)
+unsigned int ioread32be(const void __iomem *addr)
{
return be32_to_cpu(__raw_readl(addr));
}
@@ -74,7 +74,7 @@ EXPORT_SYMBOL(iowrite32be);
* convert to CPU byte order. We write in "IO byte
* order" (we also don't have IO barriers).
*/
-static inline void mmio_insb(void __iomem *addr, u8 *dst, int count)
+static inline void mmio_insb(const void __iomem *addr, u8 *dst, int count)
{
while (--count >= 0) {
u8 data = __raw_readb(addr);
@@ -83,7 +83,7 @@ static inline void mmio_insb(void __iome
}
}
-static inline void mmio_insw(void __iomem *addr, u16 *dst, int count)
+static inline void mmio_insw(const void __iomem *addr, u16 *dst, int count)
{
while (--count >= 0) {
u16 data = __raw_readw(addr);
@@ -92,7 +92,7 @@ static inline void mmio_insw(void __iome
}
}
-static inline void mmio_insl(void __iomem *addr, u32 *dst, int count)
+static inline void mmio_insl(const void __iomem *addr, u32 *dst, int count)
{
while (--count >= 0) {
u32 data = __raw_readl(addr);
@@ -125,19 +125,19 @@ static inline void mmio_outsl(void __iom
}
}
-void ioread8_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread8_rep(const void __iomem *addr, void *dst, unsigned long count)
{
mmio_insb(addr, dst, count);
}
EXPORT_SYMBOL(ioread8_rep);
-void ioread16_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread16_rep(const void __iomem *addr, void *dst, unsigned long count)
{
mmio_insw(addr, dst, count);
}
EXPORT_SYMBOL(ioread16_rep);
-void ioread32_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread32_rep(const void __iomem *addr, void *dst, unsigned long count)
{
mmio_insl(addr, dst, count);
}
--- a/drivers/mailbox/bcm-pdc-mailbox.c~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/drivers/mailbox/bcm-pdc-mailbox.c
@@ -679,7 +679,7 @@ pdc_receive(struct pdc_state *pdcs)
/* read last_rx_curr from register once */
pdcs->last_rx_curr =
- (ioread32(&pdcs->rxregs_64->status0) &
+ (ioread32((const void __iomem *)&pdcs->rxregs_64->status0) &
CRYPTO_D64_RS0_CD_MASK) / RING_ENTRY_SIZE;
do {
--- a/drivers/sh/clk/cpg.c~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/drivers/sh/clk/cpg.c
@@ -40,7 +40,7 @@ static int sh_clk_mstp_enable(struct clk
{
sh_clk_write(sh_clk_read(clk) & ~(1 << clk->enable_bit), clk);
if (clk->status_reg) {
- unsigned int (*read)(void __iomem *addr);
+ unsigned int (*read)(const void __iomem *addr);
int i;
void __iomem *mapped_status = (phys_addr_t)clk->status_reg -
(phys_addr_t)clk->enable_reg + clk->mapped_reg;
--- a/include/asm-generic/iomap.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/include/asm-generic/iomap.h
@@ -26,14 +26,14 @@
* in the low address range. Architectures for which this is not
* true can't use this generic implementation.
*/
-extern unsigned int ioread8(void __iomem *);
-extern unsigned int ioread16(void __iomem *);
-extern unsigned int ioread16be(void __iomem *);
-extern unsigned int ioread32(void __iomem *);
-extern unsigned int ioread32be(void __iomem *);
+extern unsigned int ioread8(const void __iomem *);
+extern unsigned int ioread16(const void __iomem *);
+extern unsigned int ioread16be(const void __iomem *);
+extern unsigned int ioread32(const void __iomem *);
+extern unsigned int ioread32be(const void __iomem *);
#ifdef CONFIG_64BIT
-extern u64 ioread64(void __iomem *);
-extern u64 ioread64be(void __iomem *);
+extern u64 ioread64(const void __iomem *);
+extern u64 ioread64be(const void __iomem *);
#endif
#ifdef readq
@@ -41,10 +41,10 @@ extern u64 ioread64be(void __iomem *);
#define ioread64_hi_lo ioread64_hi_lo
#define ioread64be_lo_hi ioread64be_lo_hi
#define ioread64be_hi_lo ioread64be_hi_lo
-extern u64 ioread64_lo_hi(void __iomem *addr);
-extern u64 ioread64_hi_lo(void __iomem *addr);
-extern u64 ioread64be_lo_hi(void __iomem *addr);
-extern u64 ioread64be_hi_lo(void __iomem *addr);
+extern u64 ioread64_lo_hi(const void __iomem *addr);
+extern u64 ioread64_hi_lo(const void __iomem *addr);
+extern u64 ioread64be_lo_hi(const void __iomem *addr);
+extern u64 ioread64be_hi_lo(const void __iomem *addr);
#endif
extern void iowrite8(u8, void __iomem *);
@@ -79,9 +79,9 @@ extern void iowrite64be_hi_lo(u64 val, v
* memory across multiple ports, use "memcpy_toio()"
* and friends.
*/
-extern void ioread8_rep(void __iomem *port, void *buf, unsigned long count);
-extern void ioread16_rep(void __iomem *port, void *buf, unsigned long count);
-extern void ioread32_rep(void __iomem *port, void *buf, unsigned long count);
+extern void ioread8_rep(const void __iomem *port, void *buf, unsigned long count);
+extern void ioread16_rep(const void __iomem *port, void *buf, unsigned long count);
+extern void ioread32_rep(const void __iomem *port, void *buf, unsigned long count);
extern void iowrite8_rep(void __iomem *port, const void *buf, unsigned long count);
extern void iowrite16_rep(void __iomem *port, const void *buf, unsigned long count);
--- a/include/linux/io-64-nonatomic-hi-lo.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/include/linux/io-64-nonatomic-hi-lo.h
@@ -57,7 +57,7 @@ static inline void hi_lo_writeq_relaxed(
#ifndef ioread64_hi_lo
#define ioread64_hi_lo ioread64_hi_lo
-static inline u64 ioread64_hi_lo(void __iomem *addr)
+static inline u64 ioread64_hi_lo(const void __iomem *addr)
{
u32 low, high;
@@ -79,7 +79,7 @@ static inline void iowrite64_hi_lo(u64 v
#ifndef ioread64be_hi_lo
#define ioread64be_hi_lo ioread64be_hi_lo
-static inline u64 ioread64be_hi_lo(void __iomem *addr)
+static inline u64 ioread64be_hi_lo(const void __iomem *addr)
{
u32 low, high;
--- a/include/linux/io-64-nonatomic-lo-hi.h~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/include/linux/io-64-nonatomic-lo-hi.h
@@ -57,7 +57,7 @@ static inline void lo_hi_writeq_relaxed(
#ifndef ioread64_lo_hi
#define ioread64_lo_hi ioread64_lo_hi
-static inline u64 ioread64_lo_hi(void __iomem *addr)
+static inline u64 ioread64_lo_hi(const void __iomem *addr)
{
u32 low, high;
@@ -79,7 +79,7 @@ static inline void iowrite64_lo_hi(u64 v
#ifndef ioread64be_lo_hi
#define ioread64be_lo_hi ioread64be_lo_hi
-static inline u64 ioread64be_lo_hi(void __iomem *addr)
+static inline u64 ioread64be_lo_hi(const void __iomem *addr)
{
u32 low, high;
--- a/lib/iomap.c~iomap-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/lib/iomap.c
@@ -70,27 +70,27 @@ static void bad_io_access(unsigned long
#define mmio_read64be(addr) swab64(readq(addr))
#endif
-unsigned int ioread8(void __iomem *addr)
+unsigned int ioread8(const void __iomem *addr)
{
IO_COND(addr, return inb(port), return readb(addr));
return 0xff;
}
-unsigned int ioread16(void __iomem *addr)
+unsigned int ioread16(const void __iomem *addr)
{
IO_COND(addr, return inw(port), return readw(addr));
return 0xffff;
}
-unsigned int ioread16be(void __iomem *addr)
+unsigned int ioread16be(const void __iomem *addr)
{
IO_COND(addr, return pio_read16be(port), return mmio_read16be(addr));
return 0xffff;
}
-unsigned int ioread32(void __iomem *addr)
+unsigned int ioread32(const void __iomem *addr)
{
IO_COND(addr, return inl(port), return readl(addr));
return 0xffffffff;
}
-unsigned int ioread32be(void __iomem *addr)
+unsigned int ioread32be(const void __iomem *addr)
{
IO_COND(addr, return pio_read32be(port), return mmio_read32be(addr));
return 0xffffffff;
@@ -142,26 +142,26 @@ static u64 pio_read64be_hi_lo(unsigned l
return lo | (hi << 32);
}
-u64 ioread64_lo_hi(void __iomem *addr)
+u64 ioread64_lo_hi(const void __iomem *addr)
{
IO_COND(addr, return pio_read64_lo_hi(port), return readq(addr));
return 0xffffffffffffffffULL;
}
-u64 ioread64_hi_lo(void __iomem *addr)
+u64 ioread64_hi_lo(const void __iomem *addr)
{
IO_COND(addr, return pio_read64_hi_lo(port), return readq(addr));
return 0xffffffffffffffffULL;
}
-u64 ioread64be_lo_hi(void __iomem *addr)
+u64 ioread64be_lo_hi(const void __iomem *addr)
{
IO_COND(addr, return pio_read64be_lo_hi(port),
return mmio_read64be(addr));
return 0xffffffffffffffffULL;
}
-u64 ioread64be_hi_lo(void __iomem *addr)
+u64 ioread64be_hi_lo(const void __iomem *addr)
{
IO_COND(addr, return pio_read64be_hi_lo(port),
return mmio_read64be(addr));
@@ -275,7 +275,7 @@ EXPORT_SYMBOL(iowrite64be_hi_lo);
* order" (we also don't have IO barriers).
*/
#ifndef mmio_insb
-static inline void mmio_insb(void __iomem *addr, u8 *dst, int count)
+static inline void mmio_insb(const void __iomem *addr, u8 *dst, int count)
{
while (--count >= 0) {
u8 data = __raw_readb(addr);
@@ -283,7 +283,7 @@ static inline void mmio_insb(void __iome
dst++;
}
}
-static inline void mmio_insw(void __iomem *addr, u16 *dst, int count)
+static inline void mmio_insw(const void __iomem *addr, u16 *dst, int count)
{
while (--count >= 0) {
u16 data = __raw_readw(addr);
@@ -291,7 +291,7 @@ static inline void mmio_insw(void __iome
dst++;
}
}
-static inline void mmio_insl(void __iomem *addr, u32 *dst, int count)
+static inline void mmio_insl(const void __iomem *addr, u32 *dst, int count)
{
while (--count >= 0) {
u32 data = __raw_readl(addr);
@@ -325,15 +325,15 @@ static inline void mmio_outsl(void __iom
}
#endif
-void ioread8_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread8_rep(const void __iomem *addr, void *dst, unsigned long count)
{
IO_COND(addr, insb(port,dst,count), mmio_insb(addr, dst, count));
}
-void ioread16_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread16_rep(const void __iomem *addr, void *dst, unsigned long count)
{
IO_COND(addr, insw(port,dst,count), mmio_insw(addr, dst, count));
}
-void ioread32_rep(void __iomem *addr, void *dst, unsigned long count)
+void ioread32_rep(const void __iomem *addr, void *dst, unsigned long count)
{
IO_COND(addr, insl(port,dst,count), mmio_insl(addr, dst, count));
}
_
* [patch 37/39] rtl818x: constify ioreadX() iomem argument (as in generic implementation)
2020-08-15 0:29 incoming Andrew Morton
` (35 preceding siblings ...)
2020-08-15 0:32 ` [patch 36/39] iomap: constify ioreadX() iomem argument (as in generic implementation) Andrew Morton
@ 2020-08-15 0:32 ` Andrew Morton
2020-08-15 0:32 ` [patch 38/39] ntb: intel: " Andrew Morton
` (2 subsequent siblings)
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:32 UTC (permalink / raw)
To: akpm, allenbh, arnd, benh, dalias, dave.jiang, davem, deller,
geert+renesas, geert, ink, James.Bottomley, jasowang, jdmason,
krzk, kuba, kvalo, linux-mm, mattst88, mm-commits, mpe, mst,
paulus, rth, torvalds, ysato
From: Krzysztof Kozlowski <krzk@kernel.org>
Subject: rtl818x: constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have an inconsistent interface. On some architectures
the void __iomem * address argument is a pointer to const, on others it is not.
Implementations of ioreadX() do not modify the memory under the address so
they can be converted to a "const" version for const-safety and
consistency among architectures.
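To illustrate the const-safety point, here is a minimal userspace sketch,
not kernel code. DEMO_IOMEM, demo_ioread32(), struct demo_regs and
demo_read_status() are hypothetical names invented for this illustration,
and the volatile qualifier is an assumption of the sketch (the kernel
prototypes take a plain const void __iomem *). It shows that a caller
holding a pointer to const can use a constified accessor without casting
the qualifier away:

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's __iomem annotation; expands to nothing. */
#define DEMO_IOMEM

/* Read accessor modelled on the constified ioread32() prototype. */
static inline unsigned int demo_ioread32(const volatile void DEMO_IOMEM *addr)
{
	/* The accessor only reads, so a pointer to const is sufficient. */
	return *(const volatile uint32_t *)addr;
}

/* Hypothetical register block, for illustration only. */
struct demo_regs {
	uint32_t status0;
	uint32_t ctrl;
};

/* A caller that only ever sees the registers through a pointer to const. */
unsigned int demo_read_status(const struct demo_regs DEMO_IOMEM *regs)
{
	/* No cast is needed because the accessor takes a pointer to const. */
	return demo_ioread32(&regs->status0);
}
```

With a non-const parameter, the call in demo_read_status() would discard
the const qualifier and the compiler would warn about it.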
Link: http://lkml.kernel.org/r/20200709072837.5869-3-krzk@kernel.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Cc: Allen Hubbe <allenbh@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h~rtl818x-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h
@@ -150,17 +150,17 @@ void rtl8180_write_phy(struct ieee80211_
void rtl8180_set_anaparam(struct rtl8180_priv *priv, u32 anaparam);
void rtl8180_set_anaparam2(struct rtl8180_priv *priv, u32 anaparam2);
-static inline u8 rtl818x_ioread8(struct rtl8180_priv *priv, u8 __iomem *addr)
+static inline u8 rtl818x_ioread8(struct rtl8180_priv *priv, const u8 __iomem *addr)
{
return ioread8(addr);
}
-static inline u16 rtl818x_ioread16(struct rtl8180_priv *priv, __le16 __iomem *addr)
+static inline u16 rtl818x_ioread16(struct rtl8180_priv *priv, const __le16 __iomem *addr)
{
return ioread16(addr);
}
-static inline u32 rtl818x_ioread32(struct rtl8180_priv *priv, __le32 __iomem *addr)
+static inline u32 rtl818x_ioread32(struct rtl8180_priv *priv, const __le32 __iomem *addr)
{
return ioread32(addr);
}
_
* [patch 38/39] ntb: intel: constify ioreadX() iomem argument (as in generic implementation)
2020-08-15 0:29 incoming Andrew Morton
` (36 preceding siblings ...)
2020-08-15 0:32 ` [patch 37/39] rtl818x: " Andrew Morton
@ 2020-08-15 0:32 ` Andrew Morton
2020-08-15 0:32 ` [patch 39/39] virtio: pci: " Andrew Morton
2020-08-19 23:09 ` mmotm 2020-08-19-16-09 uploaded Andrew Morton
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:32 UTC (permalink / raw)
To: akpm, allenbh, arnd, benh, dalias, dave.jiang, davem, deller,
geert+renesas, geert, ink, James.Bottomley, jasowang, jdmason,
krzk, kuba, kvalo, linux-mm, mattst88, mm-commits, mpe, mst,
paulus, rth, torvalds, ysato
From: Krzysztof Kozlowski <krzk@kernel.org>
Subject: ntb: intel: constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have an inconsistent interface. On some architectures
the void __iomem * address argument is a pointer to const, on others it is not.
Implementations of ioreadX() do not modify the memory under the address so
they can be converted to a "const" version for const-safety and
consistency among architectures.
Link: http://lkml.kernel.org/r/20200709072837.5869-4-krzk@kernel.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
drivers/ntb/hw/intel/ntb_hw_gen1.c | 2 +-
drivers/ntb/hw/intel/ntb_hw_gen3.h | 2 +-
drivers/ntb/hw/intel/ntb_hw_intel.h | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/ntb/hw/intel/ntb_hw_gen1.c~ntb-intel-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/drivers/ntb/hw/intel/ntb_hw_gen1.c
@@ -1205,7 +1205,7 @@ int intel_ntb_peer_spad_write(struct ntb
ndev->peer_reg->spad);
}
-static u64 xeon_db_ioread(void __iomem *mmio)
+static u64 xeon_db_ioread(const void __iomem *mmio)
{
return (u64)ioread16(mmio);
}
--- a/drivers/ntb/hw/intel/ntb_hw_gen3.h~ntb-intel-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/drivers/ntb/hw/intel/ntb_hw_gen3.h
@@ -91,7 +91,7 @@
#define GEN3_DB_TOTAL_SHIFT 33
#define GEN3_SPAD_COUNT 16
-static inline u64 gen3_db_ioread(void __iomem *mmio)
+static inline u64 gen3_db_ioread(const void __iomem *mmio)
{
return ioread64(mmio);
}
--- a/drivers/ntb/hw/intel/ntb_hw_intel.h~ntb-intel-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/drivers/ntb/hw/intel/ntb_hw_intel.h
@@ -103,7 +103,7 @@ struct intel_ntb_dev;
struct intel_ntb_reg {
int (*poll_link)(struct intel_ntb_dev *ndev);
int (*link_is_up)(struct intel_ntb_dev *ndev);
- u64 (*db_ioread)(void __iomem *mmio);
+ u64 (*db_ioread)(const void __iomem *mmio);
void (*db_iowrite)(u64 db_bits, void __iomem *mmio);
unsigned long ntb_ctl;
resource_size_t db_size;
_
* [patch 39/39] virtio: pci: constify ioreadX() iomem argument (as in generic implementation)
2020-08-15 0:29 incoming Andrew Morton
` (37 preceding siblings ...)
2020-08-15 0:32 ` [patch 38/39] ntb: intel: " Andrew Morton
@ 2020-08-15 0:32 ` Andrew Morton
2020-08-19 23:09 ` mmotm 2020-08-19-16-09 uploaded Andrew Morton
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-15 0:32 UTC (permalink / raw)
To: akpm, allenbh, arnd, benh, dalias, dave.jiang, davem, deller,
geert+renesas, geert, ink, James.Bottomley, jasowang, jdmason,
krzk, kuba, kvalo, linux-mm, mattst88, mm-commits, mpe, mst,
paulus, rth, torvalds, ysato
From: Krzysztof Kozlowski <krzk@kernel.org>
Subject: virtio: pci: constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have an inconsistent interface. On some architectures
the void __iomem * address argument is a pointer to const, on others it is not.
Implementations of ioreadX() do not modify the memory under the address so
they can be converted to a "const" version for const-safety and
consistency among architectures.
Link: http://lkml.kernel.org/r/20200709072837.5869-5-krzk@kernel.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Allen Hubbe <allenbh@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
drivers/virtio/virtio_pci_modern.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/virtio/virtio_pci_modern.c~virtio-pci-constify-ioreadx-iomem-argument-as-in-generic-implementation
+++ a/drivers/virtio/virtio_pci_modern.c
@@ -27,16 +27,16 @@
* method, i.e. 32-bit accesses for 32-bit fields, 16-bit accesses
* for 16-bit fields and 8-bit accesses for 8-bit fields.
*/
-static inline u8 vp_ioread8(u8 __iomem *addr)
+static inline u8 vp_ioread8(const u8 __iomem *addr)
{
return ioread8(addr);
}
-static inline u16 vp_ioread16 (__le16 __iomem *addr)
+static inline u16 vp_ioread16 (const __le16 __iomem *addr)
{
return ioread16(addr);
}
-static inline u32 vp_ioread32(__le32 __iomem *addr)
+static inline u32 vp_ioread32(const __le32 __iomem *addr)
{
return ioread32(addr);
}
_
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 0:31 ` [patch 18/39] mm/madvise: check fatal signal pending of target process Andrew Morton
@ 2020-08-15 2:53 ` Linus Torvalds
2020-08-15 4:59 ` Minchan Kim
0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2020-08-15 2:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexander Duyck, Jens Axboe, Brian Geffon, Christian Brauner,
Christian Brauner, Daniel Colascione, Johannes Weiner, Jann Horn,
John Dias, Joel Fernandes, Kirill Tkhai, linux-man, Linux-MM,
Michal Hocko, Minchan Kim, mm-commits, Oleksandr Natalenko,
David Rientjes, Shakeel Butt, sj38.park, sjpark, Sonny Rao,
Sandeep Patil, Suren Baghdasaryan, Tim Murray, Vlastimil Babka
On Fri, Aug 14, 2020 at 5:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Minchan Kim <minchan@kernel.org>
> Subject: mm/madvise: check fatal signal pending of target process
>
> Bail out to prevent unnecessary CPU overhead if target process has pending
> fatal signal during (MADV_COLD|MADV_PAGEOUT) operation.
This seems bogus.
Returning -EINTR when *SOMEBODY ELSE* has a signal is crazy talk.
It also seems to be the reason for the previous patches inexplicably
passing in the task pointer.
Finally, it has absolutely no explanations for why this would matter,
and why it's magically and suddenly an issue for process_madvise(),
when in the history of the *real* madvise() this hasn't been an issue
for "current".
I'm dropping the madvise() series.
If the issue is that you can generate a long series of areas with that
iovec, maybe the code should re-consider. Or maybe the signal pending
case should be done there, not passing down an odd task pointer to the
low-level madvise code.
Linus
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 2:53 ` Linus Torvalds
@ 2020-08-15 4:59 ` Minchan Kim
2020-08-15 14:57 ` Linus Torvalds
0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2020-08-15 4:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Fri, Aug 14, 2020 at 07:53:09PM -0700, Linus Torvalds wrote:
> On Fri, Aug 14, 2020 at 5:31 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Minchan Kim <minchan@kernel.org>
> > Subject: mm/madvise: check fatal signal pending of target process
> >
> > Bail out to prevent unnecessary CPU overhead if target process has pending
> > fatal signal during (MADV_COLD|MADV_PAGEOUT) operation.
>
> This seems bogus.
>
> Returning -EINTR when *SOMEBODY ELSE* has a signal is crazy talk.
It doesn't propagate -EINTR to the user; it just aims to cancel
the entire operation.
>
> It also seems to be the reason for the previous patches inexplicably
> passing in the task pointer.
>
> Finally, it has absolutely no explanations for why this would matter,
> and why it's magically and suddenly an issue for process_madvise(),
> when in the history of the *real* madvise() this hasn't been an issue
> for "current".
Currently, madvise(MADV_COLD|PAGEOUT) already does this. I just wanted
to keep process_madvise in sync with it. The thing was that process_madvise
couldn't get the target task while madvise could get it easily.
>
> I'm dropping the madvise() series.
>
> If the issue is that you can generate a long series of areas with that
> iovec, maybe the code should re-consider. Or maybe the signal pending
> case should be done there, not passing down an odd task pointer to the
> low-level madvise code.
Do you mean you want to drop the target signal check from madvise as well
as from process_madvise?
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 4:59 ` Minchan Kim
@ 2020-08-15 14:57 ` Linus Torvalds
2020-08-15 18:34 ` Minchan Kim
0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2020-08-15 14:57 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Fri, Aug 14, 2020 at 9:59 PM Minchan Kim <minchan@kernel.org> wrote:
>
> Currently, madvise(MADV_COLD|PAGEOUT) already does this. I just wanted
> to keep process_madvise in sync with it. The thing was that process_madvise
> couldn't get the target task while madvise could get it easily.
The thing is, for "current" it makes sense.
It makes sense because "current" is also the one doing the action, so
when current is dying, stopping the action is sane too.
But when you target somebody else, the signal handling simply doesn't
make any sense at all.
It doesn't make sense because the error code doesn't make sense (EINTR
really is about the _actor_ getting interrupted, not the target), but
it also doesn't make sense simply because there is no 1:1 relationship
between the target mm and the target thread.
The pid that was the target may be dying, but that does *not* mean
that the mm itself is dying. So stopping the operation arbitrarily
somewhere in the middle is a fundamentally insane operation. You've
done something partial to a mm that may well still be active.
So I think it's simply conceptually wrong to look at some "target
thread signal state" in ways that it isn't to look at "current signal
state".
Now, it might be worth it to have some kind of "this mm is dying,
don't bother" thing. We _kind_ of have things like that already in the
form of the MMF_OOM_VICTIM flag (and TIF_MEMDIE is the per-thread
image of it).
It might be reasonable to have a MMF_DYING flag, but I'm not even sure
how to implement it, exactly because of that "this thread group may be
dying, but the MM might still be attached to other tasks" issue.
For example, if you do "vfork()" and the child is killed, the mm is
still active and attached to the vfork() parent.
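A rough userspace sketch of that situation, under some assumptions: it
uses clone(CLONE_VM) rather than vfork() so that the parent keeps running
while the child is alive, and shared_value, child_fn() and
mm_survives_child_kill() are names made up for the example. The child
shares the parent's address space, is killed with SIGKILL, and the parent
carries on using the same memory afterwards:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in the address space shared by parent and child. */
static volatile int shared_value;

static int child_fn(void *arg)
{
	(void)arg;
	shared_value = 42;	/* visible to the parent because of CLONE_VM */
	for (;;)
		pause();	/* wait here until the parent kills us */
	return 0;
}

/*
 * Returns the value the parent reads after the child has been killed,
 * or -1 if something went wrong along the way.
 */
int mm_survives_child_kill(void)
{
	const size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);
	int status = 0;
	pid_t pid;

	if (!stack)
		return -1;

	/* The child runs on its own stack but in the parent's address space. */
	pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
	if (pid < 0) {
		free(stack);
		return -1;
	}

	/* Wait until the child has written through the shared mapping. */
	while (shared_value != 42)
		usleep(1000);

	/* The target task dies here; the address space does not. */
	kill(pid, SIGKILL);
	if (waitpid(pid, &status, 0) != pid) {
		free(stack);
		return -1;
	}
	free(stack);

	/* The parent still reads what the dead child wrote. */
	return WIFSIGNALED(status) ? shared_value : -1;
}
```

The child here is a separate process as far as signals are concerned, so
its fatal signal says nothing about whether the memory it shared is still
in use.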
Similarly, on a trivial level, a particular thread might be killed
without the rest of the threads being necessarily killed.
Again, for regular "madvise()" it makes sense to look at whether the
_current_ thread is being killed - because that fundamentally
interrupts the operator. But for somebody else, operating on the mm of
a thread, I really think it's wrong to look at the target thread state
and leave the MM operation in some half-way state.
Linus
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 14:57 ` Linus Torvalds
@ 2020-08-15 18:34 ` Minchan Kim
2020-08-16 1:43 ` Linus Torvalds
0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2020-08-15 18:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Sat, Aug 15, 2020 at 07:57:15AM -0700, Linus Torvalds wrote:
> On Fri, Aug 14, 2020 at 9:59 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > Currently, madvise(MADV_COLD|PAGEOUT) already does this. I just wanted
> > to keep process_madvise in sync with it. The thing was that process_madvise
> > couldn't get the target task while madvise could get it easily.
>
> The thing is, for "current" it makes sense.
>
> It makes sense because "current" is also the one doing the action, so
> when current is dying, stopping the action is sane too.
True.
>
> But when you target somebody else, the signal handling simply doesn't
> make any sense at all.
>
> It doesn't make sense because the error code doesn't make sense (EINTR
> really is about the _actor_ getting interrupted, not the target), but
> it also doesn't make sense simply because there is no 1:1 relationship
> between the target mm and the target thread.
>
> The pid that was the target may be dying, but that does *not* mean
> that the mm itself is dying. So stopping the operation arbitrarily
> somewhere in the middle is a fundamentally insane operation. You've
> done something partial to a mm that may well still be active.
>
> So I think it's simply conceptually wrong to look at some "target
> thread signal state" in ways that it isn't to look at "current signal
> state".
Agreed.
>
> Now, it might be worth it to have some kind of "this mm is dying,
> don't bother" thing. We _kind_ of have things like that already in the
> form of the MMF_OOM_VICTIM flag (and TIF_MEMDIE is the per-thread
> image of it).
>
> It might be reasonable to have a MMF_DYING flag, but I'm not even sure
> how to implement it, exactly because of that "this thread group may be
> dying, but the MM might still be attached to other tasks" issue.
>
> For example, if you do "vfork()" and the child is killed, the mm is
> still active and attached to the vfork() parent.
Maybe we could use mm_struct's mm_users to check whether the caller is
the exclusive owner, which would mean all the other users are exiting.
>
> Similarly, on a trivial level, a particular thread might be killed
> without the rest of the threads being necessarily killed.
>
> Again, for regular "madvise()" it makes sense to look at whether the
> _current_ thread is being killed - because that fundamentally
> interrupts the operator. But for somebody else, operating on the mm of
> a thread, I really think it's wrong to look at the target thread state
> and leave the MM operation in some half-way state.
I agree. I will drop this single patch and revise the previous patch
so that it no longer passes the task_struct, given the above.
We could revisit this if some real trouble happens.
Please tell me if you find anything weird in this patch series
so that the next cycle goes smoothly.
Thanks for the review, Linus.
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-15 18:34 ` Minchan Kim
@ 2020-08-16 1:43 ` Linus Torvalds
2020-08-16 5:58 ` Minchan Kim
0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2020-08-16 1:43 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Sat, Aug 15, 2020 at 11:35 AM Minchan Kim <minchan@kernel.org> wrote:
>
> > Now, it might be worth it to have some kind of "this mm is dying,
> > don't bother" thing. We _kind_ of have things like that already in the
> > form of the MMF_OOM_VICTIM flag (and TIF_MEMDIE is the per-thread
> > image of it).
>
> Maybe we could use mm_struct's mm_users to check whether the caller is
> the exclusive owner, which would mean all the other users are exiting.
Hmm. Checking mm_users sounds sane. But I think the /proc reference by
any get_task_mm() site will also count as a mm_user, so it's not quite
as useful as it could be.
In an optimal world, all the temporary "grab a reference to the mm"
would use mmgrab/mmdrop() that increments the mm_count, and "mm_users"
would mean the number of actual threads that are actively using it.
But that's not how it ends up working. mmgrab/mmdrop only keeps the
"struct mm_struct" around - it doesn't keep the vma's or the page
tables. So all the /proc users really do want to increase mm_users.
I don't see any obvious thing to check for that would be about "this
mm no longer makes sense to madvise on, because nobody cares any
more".
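To make the two counters concrete, here is a minimal userspace sketch of
the semantics described above. It is a toy model, not kernel code: the
toy_* names and the two boolean fields are hypothetical stand-ins, and
only the counting rules (mm_users pins the address space, mm_count pins
the struct, and all users together hold a single mm_count reference) are
taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct mm_struct: two counters, two lifetimes. */
struct toy_mm {
	int users;		/* models mm_users: holders of the address space */
	int count;		/* models mm_count: holders of the struct itself */
	bool mappings_live;	/* vmas and page tables still present */
	bool struct_live;	/* the struct has not been freed yet */
};

static void toy_mm_init(struct toy_mm *mm)
{
	mm->users = 1;		/* the owning thread */
	mm->count = 1;		/* one reference shared by all users */
	mm->mappings_live = true;
	mm->struct_live = true;
}

/* Models mmgrab(): keeps only the struct around. */
static void toy_mmgrab(struct toy_mm *mm)
{
	mm->count++;
}

/* Models mmdrop(): the last reference frees the struct. */
static void toy_mmdrop(struct toy_mm *mm)
{
	if (--mm->count == 0)
		mm->struct_live = false;
}

/* Models get_task_mm()/mmget(): keeps vmas and page tables around. */
static void toy_mmget(struct toy_mm *mm)
{
	mm->users++;
}

/* Models mmput(): the last user tears down the mappings and then
 * drops the single mm_count reference held on behalf of all users. */
static void toy_mmput(struct toy_mm *mm)
{
	if (--mm->users == 0) {
		mm->mappings_live = false;
		toy_mmdrop(mm);
	}
}

/* The "simple check" under discussion: is there exactly one user left? */
static bool toy_sole_user(const struct toy_mm *mm)
{
	return mm->users == 1;
}

/* A temporary reader that needs the page tables (e.g. via /proc) has
 * to use toy_mmget(), so the check reports "shared" for as long as
 * the reader holds its reference. Returns true if that happened. */
static bool toy_reader_defeats_check(void)
{
	struct toy_mm mm;
	bool shared_during_read;

	toy_mm_init(&mm);
	toy_mmget(&mm);				/* reader arrives */
	shared_during_read = !toy_sole_user(&mm);
	toy_mmput(&mm);				/* reader leaves */
	assert(toy_sole_user(&mm));		/* check is accurate again */
	return shared_during_read;
}
```

In this model a toy_mmgrab() holder never disturbs toy_sole_user(), but
it also cannot rely on mappings_live, which is why such readers end up
in the mm_users count.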
> Please tell me if you found something weird in this patchset series
> so that in next cycle we could go smooth.
No, the only other thing that worried me was just possible locking,
but it looked like we already have all the "access page tables from
other processes" situations and it didn't seem to introduce anything
new in that respect.
Linus
* Re: [patch 18/39] mm/madvise: check fatal signal pending of target process
2020-08-16 1:43 ` Linus Torvalds
@ 2020-08-16 5:58 ` Minchan Kim
0 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2020-08-16 5:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Alexander Duyck, Jens Axboe, Brian Geffon,
Christian Brauner, Christian Brauner, Daniel Colascione,
Johannes Weiner, Jann Horn, John Dias, Joel Fernandes,
Kirill Tkhai, linux-man, Linux-MM, Michal Hocko, mm-commits,
Oleksandr Natalenko, David Rientjes, Shakeel Butt, sj38.park,
sjpark, Sonny Rao, Sandeep Patil, Suren Baghdasaryan, Tim Murray,
Vlastimil Babka
On Sat, Aug 15, 2020 at 06:43:08PM -0700, Linus Torvalds wrote:
> On Sat, Aug 15, 2020 at 11:35 AM Minchan Kim <minchan@kernel.org> wrote:
> >
> > > Now, it might be worth it to have some kind of "this mm is dying,
> > > don't bother" thing. We _kind_ of have things like that already in the
> > > form of the MMF_OOM_VICTIM flag (and TIF_MEMDIE is the per-thread
> > > image of it).
> >
> > Maybe we could use mm_struct's mm_users to check whether the caller is
> > the exclusive owner, which would mean all the other users are exiting.
>
> Hmm. Checking mm_users sounds sane. But I think the /proc reference by
> any get_task_mm() site will also count as a mm_user, so it's not quite
> as useful as it could be.
>
> In an optimal world, all the temporary "grab a reference to the mm"
> would use mmgrab/mmdrop() that increments the mm_count, and "mm_users"
> would mean the number of actual threads that are actively using it.
>
> But that's not how it ends up working. mmgrab/mmdrop only keeps the
> "struct mm_struct" around - it doesn't keep the vma's or the page
> tables. So all the /proc users really do want to increase mm_users.
>
> I don't see any obvious thing to check for that would be about "this
> mm no longer makes sense to madvise on, because nobody cares any
> more".
Yeah, there are a bunch of places besides /proc that could potentially
cause false negatives, but I expected that to be rather rare. Even when
it happens, we can eventually catch up if they hold the refcount only
temporarily while our operation runs long.
In the worst case the operation is void and we just waste the work, but
if I consider the logic an optimization, it wouldn't be harmful to
start with such a *simple check* rather than adding more complication.
If you still don't like the idea, I will drop the single patch at this
point, as I mentioned, because I don't think I have strong justification
to add more complication here.
>
> > Please tell me if you found something weird in this patchset series
> > so that in next cycle we could go smooth.
>
> No, the only other thing that worried me was just possible locking,
> but it looked like we already have all the "access page tables from
> other processes" situations and it didn't seem to introduce anything
> new in that respect.
Thanks for the review, Linus.
* Re: [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
2020-08-15 0:30 ` [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API Andrew Morton
@ 2020-08-16 8:12 ` Christian Brauner
2020-08-17 15:10 ` Minchan Kim
0 siblings, 1 reply; 49+ messages in thread
From: Christian Brauner @ 2020-08-16 8:12 UTC (permalink / raw)
To: minchan
Cc: alexander.h.duyck, axboe, bgeffon, christian, dancol, hannes,
jannh, joaodias, joel, ktkhai, linux-man, linux-mm, mhocko,
mm-commits, oleksandr, rientjes, shakeelb, sj38.park, sjpark,
sonnyrao, sspatil, surenb, timmurray, torvalds, vbabka,
Andrew Morton
On Fri, Aug 14, 2020 at 05:30:58PM -0700, Andrew Morton wrote:
> From: Minchan Kim <minchan@kernel.org>
> Subject: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
>
<snip>
> +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> +		unsigned long, vlen, int, behavior, unsigned int, flags)
> +{
> +	ssize_t ret;
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = iovstack;
> +	struct iov_iter iter;
> +
> +	ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
> +	if (ret >= 0) {
> +		ret = do_process_madvise(pidfd, &iter, behavior, flags);
> +		kfree(iov);
> +	}
> +	return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +COMPAT_SYSCALL_DEFINE5(process_madvise, compat_int_t, pidfd,
> +			const struct compat_iovec __user *, vec,
> +			compat_ulong_t, vlen,
> +			compat_int_t, behavior,
> +			compat_uint_t, flags)
> +
> +{
> +	ssize_t ret;
> +	struct iovec iovstack[UIO_FASTIOV];
> +	struct iovec *iov = iovstack;
> +	struct iov_iter iter;
> +
> +	ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> +				&iov, &iter);
> +	if (ret >= 0) {
> +		ret = do_process_madvise(pidfd, &iter, behavior, flags);
> +		kfree(iov);
> +	}
> +	return ret;
> +}
> +#endif
Note, I'm only commenting on this patch because it has already been
dropped for this merge window. Otherwise I wouldn't interfere with stuff
that has already been sent for inclusion.
I haven't noticed this before but why do you need this
COMPAT_SYSCALL_DEFINE5()? New code we add today tries pretty hard to
avoid the compat syscall definitions. (See what we did for
pidfd_send_signal(), seccomp, and in io_uring and in various other places.)
Afaict, this could just be something like (__completely untested__):
static inline int madv_import_iovec(int type, const struct iovec __user *uvec, unsigned nr_segs,
				    unsigned fast_segs, struct iovec **iov, struct iov_iter *i)
{
#ifdef CONFIG_COMPAT
	if (in_compat_syscall())
		return compat_import_iovec(type, (struct compat_iovec __user *)uvec, nr_segs,
					   fast_segs, iov, i);
#endif

	return import_iovec(type, uvec, nr_segs, fast_segs, iov, i);
}

SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
		unsigned long, vlen, int, behavior, unsigned int, flags)
{
	ssize_t ret;
	struct iovec iovstack[UIO_FASTIOV];
	struct iovec *iov = iovstack;
	struct iov_iter iter;

	ret = madv_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
	if (ret < 0)
		return ret;

	ret = do_process_madvise(pidfd, &iter, behavior, flags);
	kfree(iov);
	return ret;
}
or is there a specific reason this wouldn't work here?
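For readers wondering why the iovec import needs a compat path at all:
a 32-bit caller passes entries made of a 32-bit pointer and a 32-bit
length, so they have to be widened before native code can walk them.
Below is a rough userspace sketch of that dispatch. All toy_* names are
hypothetical, struct toy_compat_iovec only mirrors the layout of the
kernel's struct compat_iovec, and the is_compat argument stands in for
what in_compat_syscall() would report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical mirror of struct compat_iovec: 32-bit base and length. */
struct toy_compat_iovec {
	uint32_t iov_base;
	uint32_t iov_len;
};

/* Widen n 32-bit entries into native struct iovec entries and return
 * the total number of bytes they describe. */
static uint64_t toy_widen_iovec(const struct toy_compat_iovec *src,
				size_t n, struct iovec *dst)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		dst[i].iov_base = (void *)(uintptr_t)src[i].iov_base;
		dst[i].iov_len = src[i].iov_len;
		total += src[i].iov_len;
	}
	return total;
}

/* One entry point for both ABIs, in the spirit of the helper above:
 * pick the layout at run time instead of defining a second syscall. */
static uint64_t toy_import_iovec(int is_compat, const void *uvec,
				 size_t n, struct iovec *dst)
{
	uint64_t total = 0;
	size_t i;

	if (is_compat)
		return toy_widen_iovec(uvec, n, dst);

	/* Native layout: copy the entries through unchanged. */
	memcpy(dst, uvec, n * sizeof(*dst));
	for (i = 0; i < n; i++)
		total += dst[i].iov_len;
	return total;
}
```

Everything after the import only sees native struct iovec entries,
which is what lets a single do_process_madvise() serve both callers.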
Christian
* Re: [patch 17/39] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
2020-08-16 8:12 ` Christian Brauner
@ 2020-08-17 15:10 ` Minchan Kim
0 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2020-08-17 15:10 UTC (permalink / raw)
To: Christian Brauner
Cc: alexander.h.duyck, axboe, bgeffon, christian, dancol, hannes,
jannh, joaodias, joel, ktkhai, linux-man, linux-mm, mhocko,
mm-commits, oleksandr, rientjes, shakeelb, sj38.park, sjpark,
sonnyrao, sspatil, surenb, timmurray, torvalds, vbabka,
Andrew Morton
On Sun, Aug 16, 2020 at 10:12:27AM +0200, Christian Brauner wrote:
> On Fri, Aug 14, 2020 at 05:30:58PM -0700, Andrew Morton wrote:
> > From: Minchan Kim <minchan@kernel.org>
> > Subject: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
> >
>
> <snip>
>
> > +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > +		unsigned long, vlen, int, behavior, unsigned int, flags)
> > +{
> > +	ssize_t ret;
> > +	struct iovec iovstack[UIO_FASTIOV];
> > +	struct iovec *iov = iovstack;
> > +	struct iov_iter iter;
> > +
> > +	ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
> > +	if (ret >= 0) {
> > +		ret = do_process_madvise(pidfd, &iter, behavior, flags);
> > +		kfree(iov);
> > +	}
> > +	return ret;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +COMPAT_SYSCALL_DEFINE5(process_madvise, compat_int_t, pidfd,
> > +			const struct compat_iovec __user *, vec,
> > +			compat_ulong_t, vlen,
> > +			compat_int_t, behavior,
> > +			compat_uint_t, flags)
> > +
> > +{
> > +	ssize_t ret;
> > +	struct iovec iovstack[UIO_FASTIOV];
> > +	struct iovec *iov = iovstack;
> > +	struct iov_iter iter;
> > +
> > +	ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> > +				&iov, &iter);
> > +	if (ret >= 0) {
> > +		ret = do_process_madvise(pidfd, &iter, behavior, flags);
> > +		kfree(iov);
> > +	}
> > +	return ret;
> > +}
> > +#endif
>
> Note, I'm only commenting on this patch because it has already been
> dropped for this merge window. Otherwise I wouldn't interfere with stuff
> that has already been sent for inclusion.
>
> I haven't noticed this before but why do you need this
> COMPAT_SYSCALL_DEFINE5()? New code we add today tries pretty hard to
> avoid the compat syscall definitions. (See what we did for
> pidfd_send_signal(), seccomp, and in io_uring and in various other places.)
>
> Afaict, this could just be something like (__completely untested__):
>
> static inline int madv_import_iovec(int type, const struct iovec __user *uvec, unsigned nr_segs,
> 				    unsigned fast_segs, struct iovec **iov, struct iov_iter *i)
> {
> #ifdef CONFIG_COMPAT
> 	if (in_compat_syscall())
> 		return compat_import_iovec(type, (struct compat_iovec __user *)uvec, nr_segs,
> 					   fast_segs, iov, i);
> #endif
>
> 	return import_iovec(type, uvec, nr_segs, fast_segs, iov, i);
> }
>
> SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> 		unsigned long, vlen, int, behavior, unsigned int, flags)
> {
> 	ssize_t ret;
> 	struct iovec iovstack[UIO_FASTIOV];
> 	struct iovec *iov = iovstack;
> 	struct iov_iter iter;
>
> 	ret = madv_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
> 	if (ret < 0)
> 		return ret;
>
> 	ret = do_process_madvise(pidfd, &iter, behavior, flags);
> 	kfree(iov);
> 	return ret;
> }
>
> or is there a specific reason this wouldn't work here?
No, I just didn't know about the trend to avoid compat syscall definitions.
Thanks for the information.
I think your suggestion will work. I will include it in the respin.
Thanks!
* mmotm 2020-08-19-16-09 uploaded
2020-08-15 0:29 incoming Andrew Morton
` (38 preceding siblings ...)
2020-08-15 0:32 ` [patch 39/39] virtio: pci: " Andrew Morton
@ 2020-08-19 23:09 ` Andrew Morton
39 siblings, 0 replies; 49+ messages in thread
From: Andrew Morton @ 2020-08-19 23:09 UTC (permalink / raw)
To: broonie, linux-fsdevel, linux-kernel, linux-mm, linux-next,
mhocko, mm-commits, sfr
The mm-of-the-moment snapshot 2020-08-19-16-09 has been uploaded to
https://www.ozlabs.org/~akpm/mmotm/
mmotm-readme.txt says
README for mm-of-the-moment:
https://www.ozlabs.org/~akpm/mmotm/
This is a snapshot of my -mm patch queue. Uploaded at random hopefully
more than once a week.
You will need quilt to apply these patches to the latest Linus release (5.x
or 5.x-rcY). The series file is in broken-out.tar.gz and is duplicated in
https://ozlabs.org/~akpm/mmotm/series
The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE-yyyy-mm-dd-hh-mm-ss. Both contain the string yyyy-mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.
This tree is partially included in linux-next. To see which patches are
included in linux-next, consult the `series' file. Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.
A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release. Individual mmotm releases are tagged. The master branch always
points to the latest release, so it's constantly rebasing.
https://github.com/hnaz/linux-mm
The directory https://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree. It is updated more frequently
than mmotm, and is untested.
A git copy of this tree is also available at
https://github.com/hnaz/linux-mm
This mmotm tree contains the following patches against 5.9-rc1:
(patches marked "*" will be included in linux-next)
origin.patch
* mailmap-add-andi-kleen.patch
* hugetlb_cgroup-convert-comma-to-semicolon.patch
* khugepaged-adjust-vm_bug_on_mm-in-__khugepaged_enter.patch
* mm-vunmap-add-cond_resched-in-vunmap_pmd_range.patch
* mm-slub-fix-conversion-of-freelist_corrupted.patch
* proc-kpageflags-prevent-an-integer-overflow-in-stable_page_flags.patch
* proc-kpageflags-do-not-use-uninitialized-struct-pages.patch
* mm-fix-missing-function-declaration.patch
* fork-silence-a-false-postive-warning-in-__mmdrop.patch
* romfs-fix-uninitialized-memory-leak-in-romfs_dev_read.patch
* kernel-relayc-fix-memleak-on-destroy-relay-channel.patch
* uprobes-__replace_page-avoid-bug-in-munlock_vma_page.patch
* squashfs-avoid-bio_alloc-failure-with-1mbyte-blocks.patch
* mm-include-cma-pages-in-lowmem_reserve-at-boot.patch
* mm-page_alloc-fix-core-hung-in-free_pcppages_bulk.patch
* mm-slub-re-initialize-randomized-freelist-sequence-in-calculate_sizes.patch
* mm-slub-re-initialize-randomized-freelist-sequence-in-calculate_sizes-fix.patch
* mm-thp-swap-fix-allocating-cluster-for-swapfile-by-mistake.patch
* checkpatch-test-git_dir-changes.patch
* scripts-tagssh-exclude-tools-directory-from-tags-generation.patch
* fs-ocfs2-delete-repeated-words-in-comments.patch
* ocfs2-clear-links-count-in-ocfs2_mknod-if-an-error-occurs.patch
* ocfs2-fix-ocfs2-corrupt-when-iputting-an-inode.patch
* ramfs-support-o_tmpfile.patch
* kernel-watchdog-flush-all-printk-nmi-buffers-when-hardlockup-detected.patch
mm.patch
* mm-slub-branch-optimization-in-free-slowpath.patch
* mm-slub-fix-missing-alloc_slowpath-stat-when-bulk-alloc.patch
* mm-slub-make-add_full-condition-more-explicit.patch
* device-dax-fix-mismatches-of-request_mem_region.patch
* mm-debug-do-not-dereference-i_ino-blindly.patch
* mm-dump_page-rename-head_mapcount-head_compound_mapcount.patch
* mm-gup_benchmark-use-pin_user_pages-for-foll_longterm-flag.patch
* mm-gup-dont-permit-users-to-call-get_user_pages-with-foll_longterm.patch
* mm-remove-activate_page-from-unuse_pte.patch
* mm-remove-superfluous-__clearpageactive.patch
* mm-remove-superfluous-__clearpagewaiters.patch
* memremap-convert-devmap-static-branch-to-incdec.patch
* mm-memcg-warning-on-memcg-after-readahead-page-charged.patch
* mm-memcg-remove-useless-check-on-page-mem_cgroup.patch
* mm-thp-move-lru_add_page_tail-func-to-huge_memoryc.patch
* mm-thp-clean-up-lru_add_page_tail.patch
* mm-thp-remove-code-path-which-never-got-into.patch
* mm-thp-narrow-lru-locking.patch
* mm-memcontrol-use-flex_array_size-helper-in-memcpy.patch
* mm-memcontrol-use-the-preferred-form-for-passing-the-size-of-a-structure-type.patch
* mm-account-pmd-tables-like-pte-tables.patch
* mm-memory-fix-typo-in-__do_fault-comment.patch
* mm-memoryc-replace-vmf-vma-with-variable-vma.patch
* mm-mmap-rename-__vma_unlink_common-to-__vma_unlink.patch
* mm-mmap-leverage-vma_rb_erase_ignore-to-implement-vma_rb_erase.patch
* mmap-locking-api-add-mmap_lock_is_contended.patch
* mm-smaps-extend-smap_gather_stats-to-support-specified-beginning.patch
* mm-proc-smaps_rollup-do-not-stall-write-attempts-on-mmap_lock.patch
* mm-mmap-fix-the-adjusted-length-error.patch
* mm-dmapoolc-replace-open-coded-list_for_each_entry_safe.patch
* mm-dmapoolc-replace-hard-coded-function-name-with-__func__.patch
* mm-memory-failure-do-pgoff-calculation-before-for_each_process.patch
* docs-vm-fix-mm_count-vs-mm_users-counter-confusion.patch
* mm-page_alloc-tweak-comments-in-has_unmovable_pages.patch
* mm-page_isolation-exit-early-when-pageblock-is-isolated-in-set_migratetype_isolate.patch
* mm-page_isolation-drop-warn_on_once-in-set_migratetype_isolate.patch
* mm-page_isolation-cleanup-set_migratetype_isolate.patch
* virtio-mem-dont-special-case-zone_movable.patch
* mm-document-semantics-of-zone_movable.patch
* mm-huge_memoryc-update-tlb-entry-if-pmd-is-changed.patch
* mips-do-not-call-flush_tlb_all-when-setting-pmd-entry.patch
* kvm-ppc-book3s-hv-simplify-kvm_cma_reserve.patch
* dma-contiguous-simplify-cma_early_percent_memory.patch
* arm-xtensa-simplify-initialization-of-high-memory-pages.patch
* arm64-numa-simplify-dummy_numa_init.patch
* h8300-nds32-openrisc-simplify-detection-of-memory-extents.patch
* riscv-drop-unneeded-node-initialization.patch
* mircoblaze-drop-unneeded-numa-and-sparsemem-initializations.patch
* memblock-make-for_each_memblock_type-iterator-private.patch
* memblock-make-memblock_debug-and-related-functionality-private.patch
* memblock-make-memblock_debug-and-related-functionality-private-fix.patch
* memblock-reduce-number-of-parameters-in-for_each_mem_range.patch
* arch-mm-replace-for_each_memblock-with-for_each_mem_pfn_range.patch
* arch-drivers-replace-for_each_membock-with-for_each_mem_range.patch
* x86-setup-simplify-initrd-relocation-and-reservation.patch
* x86-setup-simplify-reserve_crashkernel.patch
* memblock-remove-unused-memblock_mem_size.patch
* memblock-implement-for_each_reserved_mem_region-using-__next_mem_region.patch
* memblock-use-separate-iterators-for-memory-and-reserved-regions.patch
* mmhwpoison-cleanup-unused-pagehuge-check.patch
* mm-hwpoison-remove-recalculating-hpage.patch
* mmhwpoison-inject-dont-pin-for-hwpoison_filter.patch
* mmhwpoison-un-export-get_hwpoison_page-and-make-it-static.patch
* mmhwpoison-kill-put_hwpoison_page.patch
* mmhwpoison-unify-thp-handling-for-hard-and-soft-offline.patch
* mmhwpoison-rework-soft-offline-for-free-pages.patch
* mmhwpoison-rework-soft-offline-for-in-use-pages.patch
* mmhwpoison-refactor-soft_offline_huge_page-and-__soft_offline_page.patch
* mmhwpoison-return-0-if-the-page-is-already-poisoned-in-soft-offline.patch
* mmhwpoison-introduce-mf_msg_unsplit_thp.patch
* mmhwpoison-double-check-page-count-in-__get_any_page.patch
* mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings.patch
* mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix.patch
* mm-vmstat-fix-proc-sys-vm-stat_refresh-generating-false-warnings-fix-2.patch
* mm-util-update-the-kerneldoc-for-kstrdup_const.patch
* mm-memory_hotplug-inline-__offline_pages-into-offline_pages.patch
* mm-memory_hotplug-enforce-section-granularity-when-onlining-offlining.patch
* mm-memory_hotplug-simplify-page-offlining.patch
* mm-page_alloc-simplify-__offline_isolated_pages.patch
* mm-memory_hotplug-drop-nr_isolate_pageblock-in-offline_pages.patch
* mm-page_isolation-simplify-return-value-of-start_isolate_page_range.patch
* mm-memory_hotplug-simplify-page-onlining.patch
* mm-page_alloc-drop-stale-pageblock-comment-in-memmap_init_zone.patch
* mm-pass-migratetype-into-memmap_init_zone-and-move_pfn_range_to_zone.patch
* mm-memory_hotplug-mark-pageblocks-migrate_isolate-while-onlining-memory.patch
* mm-slab-remove-duplicate-include.patch
* mm-page_reporting-drop-stale-list-head-check-in-page_reporting_cycle.patch
* mm-highmem-clean-up-endif-comments.patch
* info-task-hung-in-generic_file_write_iter.patch
* info-task-hung-in-generic_file_write-fix.patch
* kernel-hung_taskc-monitor-killed-tasks.patch
* proc-sysctl-make-protected_-world-readable.patch
* fs-configfs-delete-repeated-words-in-comments.patch
* bitops-simplify-get_count_order_long.patch
* bitops-use-the-same-mechanism-for-get_count_order.patch
* checkpatch-add-kconfig-prefix.patch
* checkpatch-move-repeated-word-test.patch
* checkpatch-add-test-for-comma-use-that-should-be-semicolon.patch
* panic-dump-registers-on-panic_on_warn.patch
* aio-simplify-read_events.patch
* proc-add-struct-mount-struct-super_block-addr-in-lx-mounts-command.patch
* tasks-add-headers-and-improve-spacing-format.patch
* romfs-support-inode-blocks-calculation.patch
linux-next.patch
* ia64-fix-build-error-with-coredump.patch
* mm-madvise-pass-task-and-mm-to-do_madvise.patch
* pid-move-pidfd_get_pid-to-pidc.patch
* mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api.patch
* mm-madvise-introduce-process_madvise-syscall-an-external-memory-hinting-api-fix.patch
* mm-madvise-check-fatal-signal-pending-of-target-process.patch
* mm-memory-failure-remove-a-wrapper-for-alloc_migration_target.patch
* mm-memory_hotplug-remove-a-wrapper-for-alloc_migration_target.patch
* mm-migrate-avoid-possible-unnecessary-process-right-check-in-kernel_move_pages.patch
* mm-mmap-add-inline-vma_next-for-readability-of-mmap-code.patch
* mm-mmap-add-inline-munmap_vma_range-for-code-readability.patch
make-sure-nobodys-leaking-resources.patch
releasing-resources-with-children.patch
mutex-subsystem-synchro-test-module.patch
kernel-forkc-export-kernel_thread-to-modules.patch
workaround-for-a-pci-restoring-bug.patch