linux-kernel.vger.kernel.org archive mirror
* [RFC 0/6] x86: prefetch_page() vDSO call
@ 2021-02-25  7:29 Nadav Amit
  2021-02-25  7:29 ` [RFC 1/6] vdso/extable: fix calculation of base Nadav Amit
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  7:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andy Lutomirski, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Borislav Petkov, Nadav Amit, Sean Christopherson,
	Andrew Morton, x86

From: Nadav Amit <namit@vmware.com>

Just as applications can use prefetch instructions to overlap
computation with memory accesses, applications may want to overlap
page-faults with computation, or to overlap the I/O accesses that are
required to service page-faults of different pages.

Applications can use multiple threads and cores for this purpose, by
running one thread that prefetches the data (i.e., faults in the data)
and another that does the compute, but this scheme is inefficient.
Using mincore() can tell whether a page is mapped, but it might not
tell whether the page is in the page-cache, and it does not fault in
the data.

Introduce a prefetch_page() vDSO call to prefetch, i.e. fault in,
memory asynchronously. The semantics of this call are: try to prefetch
the page at a given address and return zero if the page is accessible
following the call. Start I/O operations to retrieve the page if such
operations are required and there is no high memory pressure that might
introduce slowdowns.

Note that, as usual, the page might be paged out at any point and
therefore, similarly to mincore(), there is no guarantee that the page
will be present at the time the user application uses the data that
resides on the page. Nevertheless, it is expected that in the vast
majority of cases this would not happen, since prefetch_page()
accesses the page and therefore sets the PTE access-bit (if it is
clear).

The implementation is as follows. The vDSO code accesses the data,
triggering a page-fault if it is not present. The handler detects,
based on the instruction pointer, that this is an asynchronous #PF,
using the recently introduced vDSO exception tables. If the page can be
brought in without waiting (e.g., the page is already in the
page-cache), the kernel handles the fault and returns success (zero).
If there is memory pressure that prevents the proper handling of the
fault (i.e., it requires heavy-weight reclamation), it returns a
failure. Otherwise, it starts an I/O to bring in the page and returns
failure.

Compilers can be extended to issue the prefetch_page() calls when
needed.
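
As a rough illustration of the intended usage pattern (hypothetical
process_chunks() and compute(); prefetch_page is assumed to point at
__vdso_prefetch_page, resolved the way the selftest in patch 6/6 does),
an application could overlap faulting-in of the next chunk with compute
on the current one:

#include <stddef.h>

/* Hypothetical plumbing, for illustration only. */
typedef long (*prefetch_page_t)(const void *p);
extern prefetch_page_t prefetch_page;	/* resolved from the vDSO */
extern void compute(const char *chunk, size_t len);

static void process_chunks(char *buf, size_t nr_chunks, size_t chunk_sz)
{
	for (size_t i = 0; i < nr_chunks; i++) {
		/* Start faulting-in the first page of the next chunk. */
		if (i + 1 < nr_chunks)
			prefetch_page(buf + (i + 1) * chunk_sz);

		compute(buf + i * chunk_sz, chunk_sz);
	}
}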

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: x86@kernel.org

Nadav Amit (6):
  vdso/extable: fix calculation of base
  x86/vdso: add mask and flags to extable
  x86/vdso: introduce page_prefetch()
  mm/swap_state: respect FAULT_FLAG_RETRY_NOWAIT
  mm: use lightweight reclaim on FAULT_FLAG_RETRY_NOWAIT
  testing/selftest: test vDSO prefetch_page()

 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/vdso/Makefile                  |   1 +
 arch/x86/entry/vdso/extable.c                 |  70 +++--
 arch/x86/entry/vdso/extable.h                 |  21 +-
 arch/x86/entry/vdso/vdso.lds.S                |   1 +
 arch/x86/entry/vdso/vprefetch.S               |  39 +++
 arch/x86/entry/vdso/vsgx.S                    |   9 +-
 arch/x86/include/asm/vdso.h                   |  38 ++-
 arch/x86/mm/fault.c                           |  11 +-
 lib/vdso/Kconfig                              |   5 +
 mm/memory.c                                   |  47 +++-
 mm/shmem.c                                    |   1 +
 mm/swap_state.c                               |  12 +-
 tools/testing/selftests/vDSO/Makefile         |   2 +
 .../selftests/vDSO/vdso_test_prefetch_page.c  | 265 ++++++++++++++++++
 15 files changed, 470 insertions(+), 53 deletions(-)
 create mode 100644 arch/x86/entry/vdso/vprefetch.S
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_prefetch_page.c

-- 
2.25.1



* [RFC 1/6] vdso/extable: fix calculation of base
  2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
@ 2021-02-25  7:29 ` Nadav Amit
  2021-02-25 21:16   ` Sean Christopherson
  2021-02-25  7:29 ` [RFC 2/6] x86/vdso: add mask and flags to extable Nadav Amit
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  7:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andy Lutomirski, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Borislav Petkov, Nadav Amit, Sean Christopherson,
	Andrew Morton, x86

From: Nadav Amit <namit@vmware.com>

Apparently, the assembler treats __ex_table as the location at which
the pushsection directive was issued. Therefore, when there is more
than a single entry in the vDSO exception table, the calculations of
the base and fixup addresses are wrong.

Fix the calculations of the expected fault IP and new IP by adjusting
the base after each entry.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/entry/vdso/extable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index afcf5b65beef..c81e78636220 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -32,7 +32,7 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 	nr_entries = image->extable_len / (sizeof(*extable));
 	extable = image->extable;
 
-	for (i = 0; i < nr_entries; i++) {
+	for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {
 		if (regs->ip == base + extable[i].insn) {
 			regs->ip = base + extable[i].fixup;
 			regs->di = trapnr;
-- 
2.25.1



* [RFC 2/6] x86/vdso: add mask and flags to extable
  2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
  2021-02-25  7:29 ` [RFC 1/6] vdso/extable: fix calculation of base Nadav Amit
@ 2021-02-25  7:29 ` Nadav Amit
  2021-02-25  7:29 ` [RFC 3/6] x86/vdso: introduce page_prefetch() Nadav Amit
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  7:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andy Lutomirski, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Borislav Petkov, Nadav Amit, Sean Christopherson,
	Andrew Morton, x86

From: Nadav Amit <namit@vmware.com>

Add a "mask" field to the vDSO exception-table entries that indicates
which exceptions should be handled.

Add a "flags" field to the entries as well to provide additional
information about the exception.

The existing preprocessor macro _ASM_VDSO_EXTABLE_HANDLE for assembly is
not easy to use as it requires the user to stringify the expanded C
macro. Remove _ASM_VDSO_EXTABLE_HANDLE and use a similar scheme to
ALTERNATIVE, using assembly macros directly in assembly without wrapping
them in C macros.

Move the test for the exceptions that vsgx supports out of the generic
C code and into vsgx-specific assembly, by encoding the supported
exceptions in the mask.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/entry/vdso/extable.c |  9 +--------
 arch/x86/entry/vdso/extable.h | 21 +++++++++++++--------
 arch/x86/entry/vdso/vsgx.S    |  9 +++++++--
 3 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index c81e78636220..93fb37bd32ad 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -7,6 +7,7 @@
 
 struct vdso_exception_table_entry {
 	int insn, fixup;
+	unsigned int mask, flags;
 };
 
 bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
@@ -17,14 +18,6 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 	unsigned int nr_entries, i;
 	unsigned long base;
 
-	/*
-	 * Do not attempt to fixup #DB or #BP.  It's impossible to identify
-	 * whether or not a #DB/#BP originated from within an SGX enclave and
-	 * SGX enclaves are currently the only use case for vDSO fixup.
-	 */
-	if (trapnr == X86_TRAP_DB || trapnr == X86_TRAP_BP)
-		return false;
-
 	if (!current->mm->context.vdso)
 		return false;
 
diff --git a/arch/x86/entry/vdso/extable.h b/arch/x86/entry/vdso/extable.h
index b56f6b012941..7ca8a0776805 100644
--- a/arch/x86/entry/vdso/extable.h
+++ b/arch/x86/entry/vdso/extable.h
@@ -2,26 +2,31 @@
 #ifndef __VDSO_EXTABLE_H
 #define __VDSO_EXTABLE_H
 
+#include <asm/trapnr.h>
+
+#define ASM_VDSO_ASYNC_FLAGS	(1 << 0)
+
 /*
  * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
  * vDSO uses a dedicated handler the addresses are relative to the overall
  * exception table, not each individual entry.
  */
 #ifdef __ASSEMBLY__
-#define _ASM_VDSO_EXTABLE_HANDLE(from, to)	\
-	ASM_VDSO_EXTABLE_HANDLE from to
-
-.macro ASM_VDSO_EXTABLE_HANDLE from:req to:req
+.macro ASM_VDSO_EXTABLE_HANDLE from:req to:req mask:req flags:req
 	.pushsection __ex_table, "a"
 	.long (\from) - __ex_table
 	.long (\to) - __ex_table
+	.long (\mask)
+	.long (\flags)
 	.popsection
 .endm
 #else
-#define _ASM_VDSO_EXTABLE_HANDLE(from, to)	\
-	".pushsection __ex_table, \"a\"\n"      \
-	".long (" #from ") - __ex_table\n"      \
-	".long (" #to ") - __ex_table\n"        \
+#define ASM_VDSO_EXTABLE_HANDLE(from, to, mask, flags)		\
+	".pushsection __ex_table, \"a\"\n"			\
+	".long (" #from ") - __ex_table\n"			\
+	".long (" #to ") - __ex_table\n"			\
+	".long (" #mask ")\n"					\
+	".long (" #flags ")\n"					\
 	".popsection\n"
 #endif
 
diff --git a/arch/x86/entry/vdso/vsgx.S b/arch/x86/entry/vdso/vsgx.S
index 86a0e94f68df..c588255af480 100644
--- a/arch/x86/entry/vdso/vsgx.S
+++ b/arch/x86/entry/vdso/vsgx.S
@@ -4,6 +4,7 @@
 #include <asm/export.h>
 #include <asm/errno.h>
 #include <asm/enclu.h>
+#include <asm/trapnr.h>
 
 #include "extable.h"
 
@@ -146,6 +147,10 @@ SYM_FUNC_START(__vdso_sgx_enter_enclave)
 
 	.cfi_endproc
 
-_ASM_VDSO_EXTABLE_HANDLE(.Lenclu_eenter_eresume, .Lhandle_exception)
-
+/*
+ * Do not attempt to fixup #DB or #BP.  It's impossible to identify
+ * whether or not a #DB/#BP originated from within an SGX enclave.
+ */
+ASM_VDSO_EXTABLE_HANDLE .Lenclu_eenter_eresume, .Lhandle_exception,	\
+			~((1<<X86_TRAP_DB)+(1<<X86_TRAP_BP)), 0
 SYM_FUNC_END(__vdso_sgx_enter_enclave)
-- 
2.25.1



* [RFC 3/6] x86/vdso: introduce page_prefetch()
  2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
  2021-02-25  7:29 ` [RFC 1/6] vdso/extable: fix calculation of base Nadav Amit
  2021-02-25  7:29 ` [RFC 2/6] x86/vdso: add mask and flags to extable Nadav Amit
@ 2021-02-25  7:29 ` Nadav Amit
  2021-02-25  7:29 ` [RFC 4/6] mm/swap_state: respect FAULT_FLAG_RETRY_NOWAIT Nadav Amit
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  7:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andy Lutomirski, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Borislav Petkov, Nadav Amit, Sean Christopherson,
	Andrew Morton, x86

From: Nadav Amit <namit@vmware.com>

Introduce a new vDSO function, page_prefetch(), to be used when memory
that might be paged out is expected to be used soon. The function
prefetches the page if needed. It returns zero if the page is
accessible after the call and -1 otherwise.

page_prefetch() is intended to be very lightweight both when the page is
already present and when the page is prefetched.

The implementation leverages the new vDSO exception-tables mechanism.
page_prefetch() accesses the page for read and has a corresponding vDSO
exception-table entry that indicates that a #PF might occur and that in
such a case the page should be brought in asynchronously. If a #PF
indeed occurs, the page-fault handler sets the FAULT_FLAG_RETRY_NOWAIT
flag.

If the page-fault is not resolved, the page-fault handler does not
retry; instead, it jumps to the new IP that is marked in the exception
table, and the vDSO code returns the corresponding error value.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/Kconfig                |  1 +
 arch/x86/entry/vdso/Makefile    |  1 +
 arch/x86/entry/vdso/extable.c   | 59 +++++++++++++++++++++++++--------
 arch/x86/entry/vdso/vdso.lds.S  |  1 +
 arch/x86/entry/vdso/vprefetch.S | 39 ++++++++++++++++++++++
 arch/x86/include/asm/vdso.h     | 38 +++++++++++++++++++--
 arch/x86/mm/fault.c             | 11 ++++--
 lib/vdso/Kconfig                |  5 +++
 8 files changed, 136 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/entry/vdso/vprefetch.S

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..86a4c265e8af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -136,6 +136,7 @@ config X86
 	select GENERIC_TIME_VSYSCALL
 	select GENERIC_GETTIMEOFDAY
 	select GENERIC_VDSO_TIME_NS
+	select GENERIC_VDSO_PREFETCH
 	select GUP_GET_PTE_LOW_HIGH		if X86_PAE
 	select HARDIRQS_SW_RESEND
 	select HARDLOCKUP_CHECK_TIMESTAMP	if X86_64
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 02e3e42f380b..e32ca1375b84 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -28,6 +28,7 @@ vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
 vobjs32-y := vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
 vobjs32-y += vdso32/vclock_gettime.o
 vobjs-$(CONFIG_X86_SGX)	+= vsgx.o
+vobjs-$(CONFIG_GENERIC_VDSO_PREFETCH) += vprefetch.o
 
 # files to link into kernel
 obj-y				+= vma.o extable.o
diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index 93fb37bd32ad..e821887112ce 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -4,36 +4,67 @@
 #include <asm/current.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
+#include "extable.h"
 
 struct vdso_exception_table_entry {
 	int insn, fixup;
 	unsigned int mask, flags;
 };
 
-bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
-			  unsigned long error_code, unsigned long fault_addr)
+static unsigned long
+get_vdso_exception_table_entry(const struct pt_regs *regs, int trapnr,
+			       unsigned int *flags)
 {
 	const struct vdso_image *image = current->mm->context.vdso_image;
 	const struct vdso_exception_table_entry *extable;
 	unsigned int nr_entries, i;
 	unsigned long base;
+	unsigned long ip = regs->ip;
+	unsigned long vdso_base = (unsigned long)current->mm->context.vdso;
 
-	if (!current->mm->context.vdso)
-		return false;
-
-	base =  (unsigned long)current->mm->context.vdso + image->extable_base;
+	base = vdso_base + image->extable_base;
 	nr_entries = image->extable_len / (sizeof(*extable));
 	extable = image->extable;
 
 	for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {
-		if (regs->ip == base + extable[i].insn) {
-			regs->ip = base + extable[i].fixup;
-			regs->di = trapnr;
-			regs->si = error_code;
-			regs->dx = fault_addr;
-			return true;
-		}
+		if (ip != base + extable[i].insn)
+			continue;
+
+		if (!((1u << trapnr) & extable[i].mask))
+			continue;
+
+		/* found */
+		if (flags)
+			*flags = extable[i].flags;
+		return base + extable[i].fixup;
 	}
 
-	return false;
+	return 0;
+}
+
+bool __fixup_vdso_exception(struct pt_regs *regs, int trapnr,
+			    unsigned long error_code, unsigned long fault_addr)
+{
+	unsigned long new_ip;
+
+	new_ip = get_vdso_exception_table_entry(regs, trapnr, NULL);
+	if (!new_ip)
+		return false;
+
+	instruction_pointer_set(regs, new_ip);
+	regs->di = trapnr;
+	regs->si = error_code;
+	regs->dx = fault_addr;
+	return true;
+}
+
+__attribute_const__ bool __is_async_vdso_exception(struct pt_regs *regs,
+						   int trapnr)
+{
+	unsigned long new_ip;
+	unsigned int flags;
+
+	new_ip = get_vdso_exception_table_entry(regs, trapnr, &flags);
+
+	return new_ip && (flags & ASM_VDSO_ASYNC_FLAGS);
 }
diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
index 4bf48462fca7..fd4ba24571c8 100644
--- a/arch/x86/entry/vdso/vdso.lds.S
+++ b/arch/x86/entry/vdso/vdso.lds.S
@@ -28,6 +28,7 @@ VERSION {
 		clock_getres;
 		__vdso_clock_getres;
 		__vdso_sgx_enter_enclave;
+		__vdso_prefetch_page;
 	local: *;
 	};
 }
diff --git a/arch/x86/entry/vdso/vprefetch.S b/arch/x86/entry/vdso/vprefetch.S
new file mode 100644
index 000000000000..a0fcafb7d546
--- /dev/null
+++ b/arch/x86/entry/vdso/vprefetch.S
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/linkage.h>
+#include <asm/export.h>
+#include <asm/errno.h>
+#include <asm/enclu.h>
+
+#include "extable.h"
+
+.code64
+.section .text, "ax"
+
+SYM_FUNC_START(__vdso_prefetch_page)
+	/* Prolog */
+	.cfi_startproc
+	push	%rbp
+	.cfi_adjust_cfa_offset	8
+	.cfi_rel_offset		%rbp, 0
+	mov	%rsp, %rbp
+	.cfi_def_cfa_register	%rbp
+
+	xor	%rax, %rax
+.Laccess_page:
+	movb	(%rdi), %dil
+.Lout:
+
+	/* Epilog */
+	pop	%rbp
+	.cfi_def_cfa		%rsp, 8
+	ret
+
+.Lhandle_exception:
+	mov	$-1ll, %rax
+	jmp	.Lout
+	.cfi_endproc
+ASM_VDSO_EXTABLE_HANDLE .Laccess_page, .Lhandle_exception,		\
+			(1<<X86_TRAP_PF), ASM_VDSO_ASYNC_FLAGS
+
+SYM_FUNC_END(__vdso_prefetch_page)
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 98aa103eb4ab..ee47660fcd0d 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -9,6 +9,7 @@
 #ifndef __ASSEMBLER__
 
 #include <linux/mm_types.h>
+#include <linux/sched.h>
 
 struct vdso_image {
 	void *data;
@@ -49,9 +50,40 @@ extern void __init init_vdso_image(const struct vdso_image *image);
 
 extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
 
-extern bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
-				 unsigned long error_code,
-				 unsigned long fault_addr);
+extern bool __fixup_vdso_exception(struct pt_regs *regs, int trapnr,
+				   unsigned long error_code,
+				   unsigned long fault_addr);
+
+extern __attribute_const__ bool __is_async_vdso_exception(struct pt_regs *regs,
+							  int trapnr);
+
+static inline bool is_exception_in_vdso(struct pt_regs *regs)
+{
+	const struct vdso_image *image = current->mm->context.vdso_image;
+	unsigned long vdso_base = (unsigned long)current->mm->context.vdso;
+
+	return regs->ip >= vdso_base && regs->ip < vdso_base + image->size &&
+		vdso_base != 0;
+}
+
+static inline bool is_async_vdso_exception(struct pt_regs *regs, int trapnr)
+{
+	if (!is_exception_in_vdso(regs))
+		return false;
+
+	return __is_async_vdso_exception(regs, trapnr);
+}
+
+static inline bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
+				   unsigned long error_code,
+				   unsigned long fault_addr)
+{
+	if (is_exception_in_vdso(regs))
+		return __fixup_vdso_exception(regs, trapnr, error_code,
+					      fault_addr);
+	return false;
+}
+
 #endif /* __ASSEMBLER__ */
 
 #endif /* _ASM_X86_VDSO_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a0956a..87d8ae46510c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1289,6 +1289,10 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (user_mode(regs)) {
 		local_irq_enable();
 		flags |= FAULT_FLAG_USER;
+		if (IS_ENABLED(CONFIG_GENERIC_VDSO_PREFETCH) &&
+		    is_async_vdso_exception(regs, X86_TRAP_PF))
+			flags |= FAULT_FLAG_ALLOW_RETRY |
+				 FAULT_FLAG_RETRY_NOWAIT;
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
 			local_irq_enable();
@@ -1407,8 +1411,11 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 	if (unlikely((fault & VM_FAULT_RETRY) &&
 		     (flags & FAULT_FLAG_ALLOW_RETRY))) {
-		flags |= FAULT_FLAG_TRIED;
-		goto retry;
+		if (!(flags & FAULT_FLAG_RETRY_NOWAIT)) {
+			flags |= FAULT_FLAG_TRIED;
+			goto retry;
+		}
+		fixup_vdso_exception(regs, X86_TRAP_PF, hw_error_code, address);
 	}
 
 	mmap_read_unlock(mm);
diff --git a/lib/vdso/Kconfig b/lib/vdso/Kconfig
index d883ac299508..a64d2b08b6f4 100644
--- a/lib/vdso/Kconfig
+++ b/lib/vdso/Kconfig
@@ -30,4 +30,9 @@ config GENERIC_VDSO_TIME_NS
 	  Selected by architectures which support time namespaces in the
 	  VDSO
 
+config GENERIC_VDSO_PREFETCH
+	bool
+	help
+	  Selected by architectures which support page prefetch VDSO
+
 endif
-- 
2.25.1



* [RFC 4/6] mm/swap_state: respect FAULT_FLAG_RETRY_NOWAIT
  2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
                   ` (2 preceding siblings ...)
  2021-02-25  7:29 ` [RFC 3/6] x86/vdso: introduce page_prefetch() Nadav Amit
@ 2021-02-25  7:29 ` Nadav Amit
  2021-02-25  7:29 ` [RFC 5/6] mm: use lightweight reclaim on FAULT_FLAG_RETRY_NOWAIT Nadav Amit
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  7:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andy Lutomirski, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Borislav Petkov, Nadav Amit, Sean Christopherson,
	Andrew Morton, x86

From: Nadav Amit <namit@vmware.com>

Certain use-cases (e.g., prefetch_page()) may want to avoid polling
while a page is brought in from swap. Yet, swap_cluster_readahead()
and swap_vma_readahead() do not respect FAULT_FLAG_RETRY_NOWAIT.

Add support to respect FAULT_FLAG_RETRY_NOWAIT by not polling in these
cases.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/memory.c     | 15 +++++++++++++--
 mm/shmem.c      |  1 +
 mm/swap_state.c | 12 +++++++++---
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..13b9cf36268f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3326,12 +3326,23 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		}
 
 		if (!page) {
+			/*
+			 * Back out if we failed to bring the page while we
+			 * tried to avoid I/O.
+			 */
+			if (fault_flag_allow_retry_first(vmf->flags) &&
+			    (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
+				ret = VM_FAULT_RETRY;
+				delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+				goto out;
+			}
+
 			/*
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-					vmf->address, &vmf->ptl);
+			vmf->pte = pte_offset_map_lock(vma->vm_mm,
+				vmf->pmd, vmf->address, &vmf->ptl);
 			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..b108e9ba9e89 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1525,6 +1525,7 @@ static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
 	shmem_pseudo_vma_init(&pvma, info, index);
 	vmf.vma = &pvma;
 	vmf.address = 0;
+	vmf.flags = 0;
 	page = swap_cluster_readahead(swap, gfp, &vmf);
 	shmem_pseudo_vma_destroy(&pvma);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 751c1ef2fe0e..1e930f7ff8b3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -656,10 +656,13 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long mask;
 	struct swap_info_struct *si = swp_swap_info(entry);
 	struct blk_plug plug;
-	bool do_poll = true, page_allocated;
+	bool page_allocated, do_poll;
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address;
 
+	do_poll = !fault_flag_allow_retry_first(vmf->flags) ||
+		!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
+
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
 		goto skip;
@@ -838,7 +841,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 	pte_t *pte, pentry;
 	swp_entry_t entry;
 	unsigned int i;
-	bool page_allocated;
+	bool page_allocated, do_poll;
 	struct vma_swap_readahead ra_info = {
 		.win = 1,
 	};
@@ -873,9 +876,12 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 	}
 	blk_finish_plug(&plug);
 	lru_add_drain();
+
 skip:
+	do_poll = (!fault_flag_allow_retry_first(vmf->flags) ||
+		!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) && ra_info.win == 1;
 	return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
-				     ra_info.win == 1);
+				     do_poll);
 }
 
 /**
-- 
2.25.1



* [RFC 5/6] mm: use lightweight reclaim on FAULT_FLAG_RETRY_NOWAIT
  2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
                   ` (3 preceding siblings ...)
  2021-02-25  7:29 ` [RFC 4/6] mm/swap_state: respect FAULT_FLAG_RETRY_NOWAIT Nadav Amit
@ 2021-02-25  7:29 ` Nadav Amit
  2021-02-25  7:29 ` [PATCH 6/6] testing/selftest: test vDSO prefetch_page() Nadav Amit
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  7:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andy Lutomirski, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Borislav Petkov, Nadav Amit, Sean Christopherson,
	Andrew Morton, x86

From: Nadav Amit <namit@vmware.com>

When FAULT_FLAG_RETRY_NOWAIT is set, the caller arguably wants only
lightweight reclaim, since a long reclamation would not respect the
"NOWAIT" semantics. Honor the request accordingly for swap and
file-backed page-faults during the first try.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/memory.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 13b9cf36268f..70899c92a9e6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2679,18 +2679,31 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	return ret;
 }
 
-static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
+static gfp_t massage_page_gfp_mask(gfp_t gfp_mask, unsigned long vmf_flags)
 {
-	struct file *vm_file = vma->vm_file;
+	if (fault_flag_allow_retry_first(vmf_flags) &&
+	    (vmf_flags & FAULT_FLAG_RETRY_NOWAIT))
+		gfp_mask |= __GFP_NORETRY | __GFP_NOWARN;
 
-	if (vm_file)
-		return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+	return gfp_mask;
+}
+
+static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma,
+				  unsigned long flags)
+{
+	struct file *vm_file = vma->vm_file;
+	gfp_t gfp_mask;
 
 	/*
 	 * Special mappings (e.g. VDSO) do not have any file so fake
 	 * a default GFP_KERNEL for them.
 	 */
-	return GFP_KERNEL;
+	if (!vm_file)
+		return GFP_KERNEL;
+
+	gfp_mask = mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+
+	return massage_page_gfp_mask(gfp_mask, flags);
 }
 
 /*
@@ -3253,6 +3266,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
  */
 vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
+	gfp_t gfp_mask = massage_page_gfp_mask(GFP_HIGHUSER_MOVABLE, vmf->flags);
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = NULL, *swapcache;
 	swp_entry_t entry;
@@ -3293,8 +3307,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
 			/* skip swapcache */
-			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
-							vmf->address);
+			page = alloc_page_vma(gfp_mask, vma, vmf->address);
 			if (page) {
 				int err;
 
@@ -3320,8 +3333,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 				swap_readpage(page, true);
 			}
 		} else {
-			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-						vmf);
+			page = swapin_readahead(entry, gfp_mask, vmf);
 			swapcache = page;
 		}
 
@@ -4452,7 +4464,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.address = address & PAGE_MASK,
 		.flags = flags,
 		.pgoff = linear_page_index(vma, address),
-		.gfp_mask = __get_fault_gfp_mask(vma),
+		.gfp_mask = __get_fault_gfp_mask(vma, flags),
 	};
 	unsigned int dirty = flags & FAULT_FLAG_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
-- 
2.25.1



* [PATCH 6/6] testing/selftest: test vDSO prefetch_page()
  2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
                   ` (4 preceding siblings ...)
  2021-02-25  7:29 ` [RFC 5/6] mm: use lightweight reclaim on FAULT_FLAG_RETRY_NOWAIT Nadav Amit
@ 2021-02-25  7:29 ` Nadav Amit
  2021-02-25  8:40 ` [RFC 0/6] x86: prefetch_page() vDSO call Peter Zijlstra
  2021-02-25 12:16 ` Matthew Wilcox
  7 siblings, 0 replies; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  7:29 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hugh Dickins, Andy Lutomirski, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Borislav Petkov, Nadav Amit, Sean Christopherson,
	Andrew Morton, x86

From: Nadav Amit <namit@vmware.com>

Test prefetch_page() with an invalid pointer, a file mmap and
anonymous memory. Partial checks are also done with the mincore()
syscall to ensure that the output of prefetch_page() is consistent with
mincore() (taking into account the different semantics of the two).

The tests are not fool-proof, as they rely on the behavior of the
page-cache and the page-reclamation mechanism to get a major
page-fault. They should be robust in the sense that a test is skipped
if its preconditions cannot be met.

An open question remains: how much memory should the anonymous-memory
test access in order to force the eviction of a page and trigger a
refault?

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 tools/testing/selftests/vDSO/Makefile         |   2 +
 .../selftests/vDSO/vdso_test_prefetch_page.c  | 265 ++++++++++++++++++
 2 files changed, 267 insertions(+)
 create mode 100644 tools/testing/selftests/vDSO/vdso_test_prefetch_page.c

diff --git a/tools/testing/selftests/vDSO/Makefile b/tools/testing/selftests/vDSO/Makefile
index d53a4d8008f9..dcd1ede8c0f7 100644
--- a/tools/testing/selftests/vDSO/Makefile
+++ b/tools/testing/selftests/vDSO/Makefile
@@ -11,6 +11,7 @@ ifeq ($(ARCH),$(filter $(ARCH),x86 x86_64))
 TEST_GEN_PROGS += $(OUTPUT)/vdso_standalone_test_x86
 endif
 TEST_GEN_PROGS += $(OUTPUT)/vdso_test_correctness
+TEST_GEN_PROGS += $(OUTPUT)/vdso_test_prefetch_page
 
 CFLAGS := -std=gnu99
 CFLAGS_vdso_standalone_test_x86 := -nostdlib -fno-asynchronous-unwind-tables -fno-stack-protector
@@ -33,3 +34,4 @@ $(OUTPUT)/vdso_test_correctness: vdso_test_correctness.c
 		vdso_test_correctness.c \
 		-o $@ \
 		$(LDFLAGS_vdso_test_correctness)
+$(OUTPUT)/vdso_test_prefetch_page: vdso_test_prefetch_page.c parse_vdso.c
diff --git a/tools/testing/selftests/vDSO/vdso_test_prefetch_page.c b/tools/testing/selftests/vDSO/vdso_test_prefetch_page.c
new file mode 100644
index 000000000000..35928c3f36ca
--- /dev/null
+++ b/tools/testing/selftests/vDSO/vdso_test_prefetch_page.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vdso_test_prefetch_page.c: Test vDSO's prefetch_page()
+ */
+
+#define _GNU_SOURCE
+
+#include <stdint.h>
+#include <elf.h>
+#include <stdio.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdbool.h>
+#include <string.h>
+#include <sys/auxv.h>
+#include <sys/time.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include "../kselftest.h"
+#include "parse_vdso.h"
+
+const char *version = "LINUX_2.6";
+const char *name = "__vdso_prefetch_page";
+
+struct getcpu_cache;
+typedef long (*prefetch_page_t)(const void *p);
+
+#define MEM_SIZE_K	(9500000ull)
+#define PAGE_SIZE	(4096ull)
+
+#define SKIP_MINCORE_BEFORE	(1 << 0)
+#define SKIP_MINCORE_AFTER	(1 << 1)
+
+static prefetch_page_t prefetch_page;
+
+static const void *ptr_align(const void *p)
+{
+	return (const void *)((unsigned long)p & ~(PAGE_SIZE - 1));
+}
+
+
+static int __test_prefetch(const void *p, bool expected_no_io,
+			   const char *test_name, unsigned int skip_mincore)
+{
+	bool no_io;
+	char vec;
+	long r;
+	uint64_t start;
+
+	p = ptr_align(p);
+
+	/*
+	 * First, use mincore() as a sanity check to see if the page is in
+	 * memory when we expect it not to be.  We can only trust mincore to
+	 * tell us when a page is already in memory when it should not be.
+	 */
+	if (!(skip_mincore & SKIP_MINCORE_BEFORE)) {
+		if (mincore((void *)p, PAGE_SIZE, &vec)) {
+			printf("[SKIP]\t%s: mincore failed: %s\n", test_name,
+			       strerror(errno));
+			return 0;
+		}
+
+		no_io = vec & 1;
+		if (!skip_mincore && no_io && !expected_no_io) {
+			printf("[SKIP]\t%s: unexpected page state: %s\n",
+			       test_name,
+			       no_io ? "in memory" : "not in memory");
+			return 0;
+		}
+	}
+
+	/*
+	 * Check we got the expected result from prefetch page.
+	 */
+	r = prefetch_page(p);
+
+	no_io = r == 0;
+	if (no_io != expected_no_io) {
+		printf("[FAIL]\t%s: prefetch_page() returned %ld\n",
+			       test_name, r);
+		return KSFT_FAIL;
+	}
+
+	if (skip_mincore & SKIP_MINCORE_AFTER)
+		return 0;
+
+	/*
+	 * Check again using mincore that the page state is as expected.
+	 * A bit racy. Skip the test if mincore fails.
+	 */
+	if (mincore((void *)p, PAGE_SIZE, &vec)) {
+		printf("[SKIP]\t%s: mincore failed: %s\n", test_name,
+		       strerror(errno));
+		return 0;
+	}
+
+	no_io = vec & 1;
+	if (no_io != expected_no_io) {
+		printf("[FAIL]\t%s: mincore reported page is %s\n",
+			       test_name, no_io ? "in memory" : "not in memory");
+		return KSFT_FAIL;
+
+	}
+	return 0;
+}
+
+#define test_prefetch(p, expected_no_io, test_name, skip_mincore)	\
+	do {								\
+		long _r = __test_prefetch(p, expected_no_io,		\
+					  test_name, skip_mincore);	\
+									\
+		if (_r)							\
+			return _r;					\
+	} while (0)
+
+static void wait_for_io_completion(const void *p)
+{
+	char vec;
+	int i;
+
+	/* Wait to allow the I/O to complete */
+	p = ptr_align(p);
+
+	vec = 0;
+
+	/* Wait for 5 seconds and keep probing the page to get it */
+	for (i = 0; i < 5000; i++) {
+		if (mincore((void *)p, PAGE_SIZE, &vec) == 0 && (vec & 1))
+			break;
+		prefetch_page(p);
+		usleep(1000);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	unsigned long sysinfo_ehdr;
+	long ret, i, test_ret = 0;
+	int fd, drop_fd;
+	char *p, vec;
+
+	printf("[RUN]\tTesting vdso_prefetch_page\n");
+
+	sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
+	if (!sysinfo_ehdr) {
+		printf("[SKIP]\tAT_SYSINFO_EHDR is not present!\n");
+		return KSFT_SKIP;
+	}
+
+	vdso_init_from_sysinfo_ehdr(getauxval(AT_SYSINFO_EHDR));
+
+	prefetch_page = (prefetch_page_t)vdso_sym(version, name);
+	if (!prefetch_page) {
+		printf("[SKIP]\tCould not find %s in vdso\n", name);
+		return KSFT_SKIP;
+	}
+
+	test_prefetch(NULL, false, "NULL access",
+		      SKIP_MINCORE_BEFORE|SKIP_MINCORE_AFTER);
+
+	test_prefetch(name, true, "present", 0);
+
+	p = mmap(0, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
+	if (p == MAP_FAILED) {
+		perror("mmap anon");
+		return KSFT_FAIL;
+	}
+
+	/*
+	 * Mincore would not tell us that no I/O is needed to retrieve the page,
+	 * so tell test_prefetch() to skip it.
+	 */
+	test_prefetch(p, true, "anon prefetch", SKIP_MINCORE_BEFORE);
+
+	/* Drop the caches before testing file mmap */
+	drop_fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
+	if (drop_fd < 0) {
+		perror("open /proc/sys/vm/drop_caches");
+		return KSFT_FAIL;
+	}
+
+	sync();
+	ret = write(drop_fd, "3", 1);
+	if (ret != 1) {
+		perror("write to /proc/sys/vm/drop_caches");
+		return KSFT_FAIL;
+	}
+
+	/* close, which would also flush */
+	ret = close(drop_fd);
+	if (ret) {
+		perror("close /proc/sys/vm/drop_caches");
+		return KSFT_FAIL;
+	}
+
+	/* Using /etc/hosts as a file that should always exist */
+	fd = open("/etc/hosts", O_RDONLY);
+	if (fd < 0) {
+		perror("open /etc/hosts");
+		return KSFT_FAIL;
+	}
+
+	p = mmap(0, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
+	if (p == MAP_FAILED) {
+		perror("mmap file");
+		return KSFT_FAIL;
+	}
+
+	test_prefetch(p, false, "Minor-fault (io) file prefetch", 0);
+
+	wait_for_io_completion(p);
+
+	test_prefetch(p, true, "Minor-fault (cached) file prefetch", 0);
+
+	munmap(p, PAGE_SIZE);
+
+	/*
+	 * Try to lock everything in memory to avoid unrelated page-faults
+	 * before we create memory pressure.
+	 */
+	mlockall(MCL_CURRENT);
+
+	p = mmap(0, 1024 * MEM_SIZE_K, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
+	if (p == MAP_FAILED) {
+		perror("mmap anon");
+		return KSFT_FAIL;
+	}
+
+	/*
+	 * Write a random value to try to prevent KSM from deduplicating
+	 * this page.
+	 */
+	*(volatile unsigned long *)p = 0x43454659;
+	ret = madvise(p, PAGE_SIZE, MADV_PAGEOUT);
+	if (ret != 0) {
+		perror("madvise(MADV_PAGEOUT)");
+		return KSFT_FAIL;
+	}
+
+	/* Wait to allow the page-out to complete */
+	usleep(2000000);
+
+	/* Cause some memory pressure */
+	for (i = PAGE_SIZE; i < MEM_SIZE_K * 1024; i += PAGE_SIZE)
+		*(volatile unsigned long *)((unsigned long)p + i) = i + 1;
+
+	/* Check if we managed to evict the page */
+	ret = mincore(p, PAGE_SIZE, &vec);
+	if (ret != 0) {
+		perror("mincore");
+		return KSFT_FAIL;
+	}
+
+	test_prefetch(p, false, "Minor-fault (io) anon prefetch", 0);
+	wait_for_io_completion(p);
+
+	test_prefetch(p, true, "Minor-fault (cached) anon prefetch", false);
+
+	printf("[PASS]\tvdso_prefetch_page\n");
+	return 0;
+}
-- 
2.25.1



* Re: [RFC 0/6] x86: prefetch_page() vDSO call
  2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
                   ` (5 preceding siblings ...)
  2021-02-25  7:29 ` [PATCH 6/6] testing/selftest: test vDSO prefetch_page() Nadav Amit
@ 2021-02-25  8:40 ` Peter Zijlstra
  2021-02-25  8:52   ` Nadav Amit
  2021-02-25 12:16 ` Matthew Wilcox
  7 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2021-02-25  8:40 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Nadav Amit,
	Sean Christopherson, Andrew Morton, x86

On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Just as applications can use prefetch instructions to overlap
> computations and memory accesses, applications may want to overlap the
> page-faults and compute or overlap the I/O accesses that are required
> for page-faults of different pages.
> 
> Applications can use multiple threads and cores for this matter, by
> running one thread that prefetches the data (i.e., faults in the data)
> and another that does the compute, but this scheme is inefficient. Using
> mincore() can tell whether a page is mapped, but might not tell whether
> the page is in the page-cache and does not fault in the data.
> 
> Introduce prefetch_page() vDSO-call to prefetch, i.e. fault-in memory
> asynchronously. The semantic of this call is: try to prefetch a page of
> in a given address and return zero if the page is accessible following
> the call. Start I/O operations to retrieve the page if such operations
> are required and there is no high memory pressure that might introduce
> slowdowns.
> 
> Note that as usual the page might be paged-out at any point and
> therefore, similarly to mincore(), there is no guarantee that the page
> will be present at the time that the user application uses the data that
> resides on the page. Nevertheless, it is expected that in the vast
> majority of the cases this would not happen, since prefetch_page()
> accesses the page and therefore sets the PTE access-bit (if it is
> clear). 
> 
> The implementation is as follows. The vDSO code accesses the data,
> triggering a page-fault it is not present. The handler detects based on
> the instruction pointer that this is an asynchronous-#PF, using the
> recently introduce vDSO exception tables. If the page can be brought
> without waiting (e.g., the page is already in the page-cache), the
> kernel handles the fault and returns success (zero). If there is memory
> pressure that prevents the proper handling of the fault (i.e., requires
> heavy-weight reclamation) it returns a failure. Otherwise, it starts an
> I/O to bring the page and returns failure.
> 
> Compilers can be extended to issue the prefetch_page() calls when
> needed.

Interesting, but given we've been removing explicit prefetch from some
parts of the kernel how useful is this in actual use? I'm thinking there
should at least be a real user and performance numbers with this before
merging.


* Re: [RFC 0/6] x86: prefetch_page() vDSO call
  2021-02-25  8:40 ` [RFC 0/6] x86: prefetch_page() vDSO call Peter Zijlstra
@ 2021-02-25  8:52   ` Nadav Amit
  2021-02-25  9:32     ` Nadav Amit
  0 siblings, 1 reply; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  8:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Sean Christopherson, Andrew Morton,
	x86




> On Feb 25, 2021, at 12:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Just as applications can use prefetch instructions to overlap
>> computations and memory accesses, applications may want to overlap the
>> page-faults and compute or overlap the I/O accesses that are required
>> for page-faults of different pages.
>> 
>> Applications can use multiple threads and cores for this matter, by
>> running one thread that prefetches the data (i.e., faults in the data)
>> and another that does the compute, but this scheme is inefficient. Using
>> mincore() can tell whether a page is mapped, but might not tell whether
>> the page is in the page-cache and does not fault in the data.
>> 
>> Introduce prefetch_page() vDSO-call to prefetch, i.e. fault-in memory
>> asynchronously. The semantic of this call is: try to prefetch a page of
>> in a given address and return zero if the page is accessible following
>> the call. Start I/O operations to retrieve the page if such operations
>> are required and there is no high memory pressure that might introduce
>> slowdowns.
>> 
>> Note that as usual the page might be paged-out at any point and
>> therefore, similarly to mincore(), there is no guarantee that the page
>> will be present at the time that the user application uses the data that
>> resides on the page. Nevertheless, it is expected that in the vast
>> majority of the cases this would not happen, since prefetch_page()
>> accesses the page and therefore sets the PTE access-bit (if it is
>> clear).
>> 
>> The implementation is as follows. The vDSO code accesses the data,
>> triggering a page-fault it is not present. The handler detects based on
>> the instruction pointer that this is an asynchronous-#PF, using the
>> recently introduce vDSO exception tables. If the page can be brought
>> without waiting (e.g., the page is already in the page-cache), the
>> kernel handles the fault and returns success (zero). If there is memory
>> pressure that prevents the proper handling of the fault (i.e., requires
>> heavy-weight reclamation) it returns a failure. Otherwise, it starts an
>> I/O to bring the page and returns failure.
>> 
>> Compilers can be extended to issue the prefetch_page() calls when
>> needed.
> 
> Interesting, but given we've been removing explicit prefetch from some
> parts of the kernel how useful is this in actual use? I'm thinking there
> should at least be a real user and performance numbers with this before
> merging.

Can you give me a reference to the “removing explicit prefetch from some
parts of the kernel”?

I will work on an llvm/gcc plugin to provide some performance numbers.
I wanted to make sure that the idea is not a complete obscenity first.




* Re: [RFC 0/6] x86: prefetch_page() vDSO call
  2021-02-25  8:52   ` Nadav Amit
@ 2021-02-25  9:32     ` Nadav Amit
  2021-02-25  9:55       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Nadav Amit @ 2021-02-25  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Sean Christopherson, Andrew Morton,
	x86




> On Feb 25, 2021, at 12:52 AM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
> 
>> On Feb 25, 2021, at 12:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> Just as applications can use prefetch instructions to overlap
>>> computations and memory accesses, applications may want to overlap the
>>> page-faults and compute or overlap the I/O accesses that are required
>>> for page-faults of different pages.
[ snip ]

>> Interesting, but given we've been removing explicit prefetch from some
>> parts of the kernel how useful is this in actual use? I'm thinking there
>> should at least be a real user and performance numbers with this before
>> merging.
> 
> Can you give me a reference to the “removing explicit prefetch from some
> parts of the kernel”?

Oh. I get it - you mean we removed the use of explicit memory prefetch
from the kernel code. Well, I don’t think it is really related, but yes,
performance numbers are needed.




* Re: [RFC 0/6] x86: prefetch_page() vDSO call
  2021-02-25  9:32     ` Nadav Amit
@ 2021-02-25  9:55       ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2021-02-25  9:55 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Sean Christopherson, Andrew Morton,
	x86

On Thu, Feb 25, 2021 at 01:32:56AM -0800, Nadav Amit wrote:
> > On Feb 25, 2021, at 12:52 AM, Nadav Amit <nadav.amit@gmail.com> wrote:

> > Can you give me a reference to the “removing explicit prefetch from some
> > parts of the kernel”?

75d65a425c01 ("hlist: remove software prefetching in hlist iterators")
e66eed651fd1 ("list: remove prefetching from regular list iterators")

> Oh. I get it - you mean we remove we remove the use of explicit memory
> prefetch from the kernel code. Well, I don’t think it is really related,
> but yes, performance numbers are needed.

Right, so my main worry was that use of the prefetch instruction
actually hurt performance once the hardware prefetchers got smart
enough; this, being a very similar construct (just on a different level
of the stack), should be careful not to suffer the same fate.


* Re: [RFC 0/6] x86: prefetch_page() vDSO call
  2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
                   ` (6 preceding siblings ...)
  2021-02-25  8:40 ` [RFC 0/6] x86: prefetch_page() vDSO call Peter Zijlstra
@ 2021-02-25 12:16 ` Matthew Wilcox
  2021-02-25 16:56   ` Nadav Amit
  7 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2021-02-25 12:16 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andy Lutomirski,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Nadav Amit, Sean Christopherson, Andrew Morton, x86

On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
> Just as applications can use prefetch instructions to overlap
> computations and memory accesses, applications may want to overlap the
> page-faults and compute or overlap the I/O accesses that are required
> for page-faults of different pages.

Isn't this madvise(MADV_WILLNEED)?


* Re: [RFC 0/6] x86: prefetch_page() vDSO call
  2021-02-25 12:16 ` Matthew Wilcox
@ 2021-02-25 16:56   ` Nadav Amit
  2021-02-25 17:32     ` Matthew Wilcox
  0 siblings, 1 reply; 19+ messages in thread
From: Nadav Amit @ 2021-02-25 16:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Sean Christopherson, Andrew Morton, x86


> On Feb 25, 2021, at 4:16 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>> Just as applications can use prefetch instructions to overlap
>> computations and memory accesses, applications may want to overlap the
>> page-faults and compute or overlap the I/O accesses that are required
>> for page-faults of different pages.
> 
> Isn't this madvise(MADV_WILLNEED)?

Good point that I should have mentioned. In a way prefetch_page() is a
combination of mincore() and MADV_WILLNEED.

There are 4 main differences from MADV_WILLNEED:

1. Much lower invocation cost if the readahead is not needed: this allows
prefetching pages more abundantly.

2. Return value: return value tells you whether the page is accessible.
This makes it usable for coroutines, for instance. In this regard the
call is more similar to mincore() than MADV_WILLNEED.

3. The PTEs are mapped if the pages are already present in the
swap/page-cache, preventing an additional page-fault just to map them.

4. Avoiding heavy-weight reclamation on low memory (this may need to
be selective, and can be integrated with MADV_WILLNEED).
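
To illustrate point 2, a rough sketch of a coroutine-style scheduler
(hypothetical process()/requeue() helpers; prefetch_page is assumed to
point at __vdso_prefetch_page): work items whose page is already
accessible are processed immediately, while the rest are requeued and
retried once the I/O that the call may have started completes.

#include <stddef.h>

typedef long (*prefetch_page_t)(const void *p);
extern prefetch_page_t prefetch_page;	/* resolved from the vDSO */

struct item {
	const void *data;
	/* ... */
};

extern void process(struct item *it);	/* hypothetical */
extern void requeue(struct item *it);	/* hypothetical */

static void run_queue(struct item **items, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (prefetch_page(items[i]->data) == 0)
			process(items[i]);	/* page accessible, no fault expected */
		else
			requeue(items[i]);	/* I/O may have been started; retry later */
	}
}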




* Re: [RFC 0/6] x86: prefetch_page() vDSO call
  2021-02-25 16:56   ` Nadav Amit
@ 2021-02-25 17:32     ` Matthew Wilcox
  2021-02-25 17:53       ` Nadav Amit
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2021-02-25 17:32 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Sean Christopherson, Andrew Morton, x86

On Thu, Feb 25, 2021 at 04:56:50PM +0000, Nadav Amit wrote:
> 
> > On Feb 25, 2021, at 4:16 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > 
> > On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
> >> Just as applications can use prefetch instructions to overlap
> >> computations and memory accesses, applications may want to overlap the
> >> page-faults and compute or overlap the I/O accesses that are required
> >> for page-faults of different pages.
> > 
> > Isn't this madvise(MADV_WILLNEED)?
> 
> Good point that I should have mentioned. In a way prefetch_page() a
> combination of mincore() and MADV_WILLNEED.
> 
> There are 4 main differences from MADV_WILLNEED:
> 
> 1. Much lower invocation cost if the readahead is not needed: this allows
> to prefetch pages more abundantly.

That seems like something that could be fixed in libc -- if we add a
page prefetch vdso call, an application calling posix_madvise() could
be implemented by calling this fast path.  Assuming the performance
increase justifies this extra complexity.

> 2. Return value: return value tells you whether the page is accessible.
> This makes it usable for coroutines, for instance. In this regard the
> call is more similar to mincore() than MADV_WILLNEED.

I don't quite understand the programming model you're describing here.

> 3. The PTEs are mapped if the pages are already present in the
> swap/page-cache, preventing an additional page-fault just to map them.

We could enhance madvise() to do this, no?

> 4. Avoiding heavy-weight reclamation on low memory (this may need to
> be selective, and can be integrated with MADV_WILLNEED).

Likewise.

I don't want to add a new Linux-specific call when there's already a
POSIX interface that communicates the exact same thing.  The return
value seems like the only problem.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_madvise.html
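
As a rough sketch of the libc fast path suggested above (hypothetical;
it assumes a prefetch_page pointer resolved from the vDSO at startup
and falls back to the regular syscall for anything it does not handle):

#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

typedef long (*prefetch_page_t)(const void *p);
extern prefetch_page_t prefetch_page;	/* resolved from the vDSO, or NULL */

/* Hypothetical fast path for posix_madvise(..., POSIX_MADV_WILLNEED). */
static int willneed(void *addr, size_t len)
{
	/* Single-page requests can use the cheap vDSO call. */
	if (prefetch_page && len <= 4096) {
		prefetch_page(addr);	/* best effort; may start I/O */
		return 0;
	}
	return madvise(addr, len, MADV_WILLNEED);
}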


* Re: [RFC 0/6] x86: prefetch_page() vDSO call
  2021-02-25 17:32     ` Matthew Wilcox
@ 2021-02-25 17:53       ` Nadav Amit
  0 siblings, 0 replies; 19+ messages in thread
From: Nadav Amit @ 2021-02-25 17:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Sean Christopherson, Andrew Morton, x86



> On Feb 25, 2021, at 9:32 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Thu, Feb 25, 2021 at 04:56:50PM +0000, Nadav Amit wrote:
>> 
>>> On Feb 25, 2021, at 4:16 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>> 
>>> On Wed, Feb 24, 2021 at 11:29:04PM -0800, Nadav Amit wrote:
>>>> Just as applications can use prefetch instructions to overlap
>>>> computations and memory accesses, applications may want to overlap the
>>>> page-faults and compute or overlap the I/O accesses that are required
>>>> for page-faults of different pages.
>>> 
>>> Isn't this madvise(MADV_WILLNEED)?
>> 
>> Good point that I should have mentioned. In a way prefetch_page() a
>> combination of mincore() and MADV_WILLNEED.
>> 
>> There are 4 main differences from MADV_WILLNEED:
>> 
>> 1. Much lower invocation cost if the readahead is not needed: this allows
>> to prefetch pages more abundantly.
> 
> That seems like something that could be fixed in libc -- if we add a
> page prefetch vdso call, an application calling posix_madvise() could
> be implemented by calling this fast path.  Assuming the performance
> increase justifies this extra complexity.
> 
>> 2. Return value: return value tells you whether the page is accessible.
>> This makes it usable for coroutines, for instance. In this regard the
>> call is more similar to mincore() than MADV_WILLNEED.
> 
> I don't quite understand the programming model you're describing here.
> 
>> 3. The PTEs are mapped if the pages are already present in the
>> swap/page-cache, preventing an additional page-fault just to map them.
> 
> We could enhance madvise() to do this, no?
> 
>> 4. Avoiding heavy-weight reclamation on low memory (this may need to
>> be selective, and can be integrated with MADV_WILLNEED).
> 
> Likewise.
> 
> I don't want to add a new Linux-specific call when there's already a
> POSIX interface that communicates the exact same thing.  The return
> value seems like the only problem.

I agree that this call does not have to be exposed to the application.

I am not sure there is a lot of extra complexity now, but obviously
some evaluations are needed.




* Re: [RFC 1/6] vdso/extable: fix calculation of base
  2021-02-25  7:29 ` [RFC 1/6] vdso/extable: fix calculation of base Nadav Amit
@ 2021-02-25 21:16   ` Sean Christopherson
  2021-02-26 17:24     ` Nadav Amit
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2021-02-25 21:16 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andy Lutomirski,
	Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Borislav Petkov,
	Nadav Amit, Andrew Morton, x86

On Wed, Feb 24, 2021, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Apparently, the assembler resolves __ex_table to the location where the
> pushsection directive was issued. Therefore, when there is more than a
> single entry in the vDSO exception table, the calculations of the base
> and fixup are wrong.
> 
> Fix the calculations of the expected fault IP and new IP by adjusting
> the base after each entry.
> 
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: x86@kernel.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/entry/vdso/extable.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
> index afcf5b65beef..c81e78636220 100644
> --- a/arch/x86/entry/vdso/extable.c
> +++ b/arch/x86/entry/vdso/extable.c
> @@ -32,7 +32,7 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
>  	nr_entries = image->extable_len / (sizeof(*extable));
>  	extable = image->extable;
>  
> -	for (i = 0; i < nr_entries; i++) {
> +	for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {

It's been literally years since I wrote this code, but I distinctly remember the
addresses being relative to the base.  I also remember testing multiple entries,
but again, that was a long time ago.

Assuming things have changed, or I was flat out wrong, the comment above the
macro magic should also be updated.

/*
 * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
 * vDSO uses a dedicated handler the addresses are relative to the overall
 * exception table, not each individual entry.
 */

>  		if (regs->ip == base + extable[i].insn) {
>  			regs->ip = base + extable[i].fixup;
>  			regs->di = trapnr;
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 1/6] vdso/extable: fix calculation of base
  2021-02-25 21:16   ` Sean Christopherson
@ 2021-02-26 17:24     ` Nadav Amit
  2021-02-26 17:47       ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Nadav Amit @ 2021-02-26 17:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andrew Morton, x86

[-- Attachment #1: Type: text/plain, Size: 2400 bytes --]



> On Feb 25, 2021, at 1:16 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Wed, Feb 24, 2021, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Apparently, the assembler resolves __ex_table to the location where the
>> pushsection directive was issued. Therefore, when there is more than a
>> single entry in the vDSO exception table, the calculations of the base
>> and fixup are wrong.
>> 
>> Fix the calculations of the expected fault IP and new IP by adjusting
>> the base after each entry.
>> 
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Sean Christopherson <seanjc@google.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: x86@kernel.org
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> arch/x86/entry/vdso/extable.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
>> index afcf5b65beef..c81e78636220 100644
>> --- a/arch/x86/entry/vdso/extable.c
>> +++ b/arch/x86/entry/vdso/extable.c
>> @@ -32,7 +32,7 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
>> 	nr_entries = image->extable_len / (sizeof(*extable));
>> 	extable = image->extable;
>> 
>> -	for (i = 0; i < nr_entries; i++) {
>> +	for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {
> 
> It's been literally years since I wrote this code, but I distinctly remember the
> addresses being relative to the base.  I also remember testing multiple entries,
> but again, that was a long time ago.
> 
> Assuming things have changed, or I was flat out wrong, the comment above the
> macro magic should also be updated.
> 
> /*
> * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
> * vDSO uses a dedicated handler the addresses are relative to the overall
> * exception table, not each individual entry.
> */

I will update the comment. I am not very familiar with pushsection stuff,
but the offsets were wrong.

Since you say you checked it, I wonder whether it can somehow be caused
by having exception table entries defined from multiple object files.

Anyhow, this change follows the kernel’s (not vDSO) exception table
scheme.
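
For comparison, a rough sketch of the kernel-side scheme being referred to:
each field stores an offset relative to its own address, so the lookup does
not depend on where a given object's section contribution ends up after
linking. The real x86 struct has a third field for the fixup handler, omitted
here; cf. ex_to_insn() in lib/extable.c and ex_fixup_addr() in
arch/x86/mm/extable.c:

struct exception_table_entry {
	int insn;	/* faulting insn, relative to &insn */
	int fixup;	/* fixup code, relative to &fixup   */
};

static inline unsigned long ex_to_insn(const struct exception_table_entry *x)
{
	return (unsigned long)&x->insn + x->insn;
}

static inline unsigned long ex_fixup_addr(const struct exception_table_entry *x)
{
	return (unsigned long)&x->fixup + x->fixup;
}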


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 1/6] vdso/extable: fix calculation of base
  2021-02-26 17:24     ` Nadav Amit
@ 2021-02-26 17:47       ` Sean Christopherson
  2021-02-28  9:20         ` Nadav Amit
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2021-02-26 17:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andrew Morton, x86

On Fri, Feb 26, 2021, Nadav Amit wrote:
> 
> > On Feb 25, 2021, at 1:16 PM, Sean Christopherson <seanjc@google.com> wrote:
> > It's been literally years since I wrote this code, but I distinctly remember the
> > addresses being relative to the base.  I also remember testing multiple entries,
> > but again, that was a long time ago.
> > 
> > Assuming things have changed, or I was flat out wrong, the comment above the
> > macro magic should also be updated.
> > 
> > /*
> > * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
> > * vDSO uses a dedicated handler the addresses are relative to the overall
> > * exception table, not each individual entry.
> > */
> 
> I will update the comment. I am not very familiar with pushsection stuff,
> but the offsets were wrong.
> 
> Since you say you checked it, I wonder whether it can somehow be caused
> by having exception table entries defined from multiple object files.

Oooh, I think that would do it.  Have you checked what happens if there are
multiple object files and multiple fixups within an object file?

> Anyhow, this change follows the kernel’s (not vDSO) exception table
> scheme.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 1/6] vdso/extable: fix calculation of base
  2021-02-26 17:47       ` Sean Christopherson
@ 2021-02-28  9:20         ` Nadav Amit
  0 siblings, 0 replies; 19+ messages in thread
From: Nadav Amit @ 2021-02-28  9:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Linux-MM, LKML, Hugh Dickins, Andy Lutomirski, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Borislav Petkov, Andrew Morton, x86

[-- Attachment #1: Type: text/plain, Size: 2982 bytes --]



> On Feb 26, 2021, at 9:47 AM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, Feb 26, 2021, Nadav Amit wrote:
>> 
>>> On Feb 25, 2021, at 1:16 PM, Sean Christopherson <seanjc@google.com> wrote:
>>> It's been literally years since I wrote this code, but I distinctly remember the
>>> addresses being relative to the base.  I also remember testing multiple entries,
>>> but again, that was a long time ago.
>>> 
>>> Assuming things have changed, or I was flat out wrong, the comment above the
>>> macro magic should also be updated.
>>> 
>>> /*
>>> * Inject exception fixup for vDSO code.  Unlike normal exception fixup,
>>> * vDSO uses a dedicated handler the addresses are relative to the overall
>>> * exception table, not each individual entry.
>>> */
>> 
>> I will update the comment. I am not very familiar with pushsection stuff,
>> but the offsets were wrong.
>> 
>> Since you say you checked it, I wonder whether it can somehow be caused
>> by having exception table entries defined from multiple object files.
> 
> Oooh, I think that would do it.  Have you checked what happens if there are
> multiple object files and multiple fixups within an object file?

Good thing that you insisted...

I certainly do not know the assembler section directives well enough,
but after some experiments it does seem that referring to the section
symbol yields a different value in each object file.

So both the current code (yours) and this patch (mine) are broken. I
think the easiest thing is to fall back to the kernel's exception table
scheme. I checked the following with the two entries both in the same
object and in different objects, and it seems to work correctly:

-- >8 --

diff --git a/arch/x86/entry/vdso/extable.c b/arch/x86/entry/vdso/extable.c
index afcf5b65beef..3f395b782553 100644
--- a/arch/x86/entry/vdso/extable.c
+++ b/arch/x86/entry/vdso/extable.c
@@ -32,9 +32,11 @@ bool fixup_vdso_exception(struct pt_regs *regs, int trapnr,
 	nr_entries = image->extable_len / (sizeof(*extable));
 	extable = image->extable;

-	for (i = 0; i < nr_entries; i++) {
-		if (regs->ip == base + extable[i].insn) {
-			regs->ip = base + extable[i].fixup;
+	for (i = 0; i < nr_entries; i++, base += sizeof(*extable)) {
+		if (regs->ip == base + extable[i].insn +
+			offsetof(struct vdso_exception_table_entry, insn)) {
+			regs->ip = base + extable[i].fixup +
+				offsetof(struct vdso_exception_table_entry, fixup);
 			regs->di = trapnr;
 			regs->si = error_code;
 			regs->dx = fault_addr;
diff --git a/arch/x86/entry/vdso/extable.h b/arch/x86/entry/vdso/extable.h
index b56f6b012941..4ffe3d533148 100644
--- a/arch/x86/entry/vdso/extable.h
+++ b/arch/x86/entry/vdso/extable.h
@@ -13,8 +13,8 @@

 .macro ASM_VDSO_EXTABLE_HANDLE from:req to:req
 	.pushsection __ex_table, "a"
-	.long (\from) - __ex_table
-	.long (\to) - __ex_table
+	.long (\from) - .
+	.long (\to) - .
 	.popsection
 .endm
 #else
--
2.25.1
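
A short note on why the relative scheme above is immune to the
multiple-object problem, spelled out as a hypothetical helper equivalent to
the unrolled check in the loop. The struct is the one currently declared in
arch/x86/entry/vdso/extable.h (before this series adds further fields), and
"base" is assumed to be the user-space address of __ex_table inside the
mapped vDSO, as in the existing code; the helper itself is not part of the
patch:

#include <stddef.h>

struct vdso_exception_table_entry {
	int insn, fixup;
};

/*
 * With ".long (\from) - .", entry i's insn field sits at
 * base + i * sizeof(entry) + offsetof(entry, insn) in the mapped vDSO and
 * stores the faulting instruction's address relative to that location, so
 * summing the terms recovers the address regardless of which object file
 * emitted the entry.  The same holds for the fixup field.
 */
static unsigned long
vdso_ex_insn_addr(unsigned long base, unsigned int i,
		  const struct vdso_exception_table_entry *extable)
{
	return base + i * sizeof(*extable) +
	       offsetof(struct vdso_exception_table_entry, insn) +
	       extable[i].insn;
}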



[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-02-28  9:22 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-25  7:29 [RFC 0/6] x86: prefetch_page() vDSO call Nadav Amit
2021-02-25  7:29 ` [RFC 1/6] vdso/extable: fix calculation of base Nadav Amit
2021-02-25 21:16   ` Sean Christopherson
2021-02-26 17:24     ` Nadav Amit
2021-02-26 17:47       ` Sean Christopherson
2021-02-28  9:20         ` Nadav Amit
2021-02-25  7:29 ` [RFC 2/6] x86/vdso: add mask and flags to extable Nadav Amit
2021-02-25  7:29 ` [RFC 3/6] x86/vdso: introduce page_prefetch() Nadav Amit
2021-02-25  7:29 ` [RFC 4/6] mm/swap_state: respect FAULT_FLAG_RETRY_NOWAIT Nadav Amit
2021-02-25  7:29 ` [RFC 5/6] mm: use lightweight reclaim on FAULT_FLAG_RETRY_NOWAIT Nadav Amit
2021-02-25  7:29 ` [PATCH 6/6] testing/selftest: test vDSO prefetch_page() Nadav Amit
2021-02-25  8:40 ` [RFC 0/6] x86: prefetch_page() vDSO call Peter Zijlstra
2021-02-25  8:52   ` Nadav Amit
2021-02-25  9:32     ` Nadav Amit
2021-02-25  9:55       ` Peter Zijlstra
2021-02-25 12:16 ` Matthew Wilcox
2021-02-25 16:56   ` Nadav Amit
2021-02-25 17:32     ` Matthew Wilcox
2021-02-25 17:53       ` Nadav Amit
