All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] de-asmify the x86-64 system call slowpath
@ 2014-01-26 22:28 Linus Torvalds
  2014-01-27  0:22 ` Al Viro
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Linus Torvalds @ 2014-01-26 22:28 UTC (permalink / raw)
  To: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2092 bytes --]

The x86-64 (and 32-bit, for that matter) system call slowpaths are all
in C, but the *selection* of which slow-path to take is a mixture of
complicated assembler ("sysret_check -> sysret_careful ->
sysret_signal ->sysret_audit -> int_check_syscall_exit_work" etc), and
oddly named and placed C code ("schedule_user" vs
"__audit_syscall_exit" vs "do_notify_resume").

This attached patch tries to take the "do_notify_resume()" approach,
and renaming it to something sane ("syscall_exit_slowpath") and call
out to *all* the different slow cases from that one place, instead of
having some cases hardcoded in asm, and some in C. And instead of
hardcoding which cases result in a "iretq" and which cases result in a
faster sysret case, it's now simply a return value from that
syscall_exit_slowpath() function, so it's very natural and easy to say
"taking a signal will force us to do the slow iretq case, but we can
do the task exit work and still do the sysret".

I've marked this as an RFC, because I didn't bother trying to clean up
the 32-bit code similarly (no test-cases, and trust me, if you get
this wrong, it will fail spectacularly but in very subtle and
hard-to-debug ways), and I also didn't bother with the slow cases in
the "iretq" path, so that path still has the odd asm cases and calls
the old (now legacy) do_notify_resume() path.

But this is actually tested, and seems to work (including limited
testing with strace, gdb etc), and while it adds a few more lines than
it removes, the removed lines are mostly asm, and added lines are C
(and part of them are the temporary still-extant do_notify_resume()
wrapper). In particular, it should be fairly straightforward to take
this as a starting point, removing the extant
do_notify_resume/schedule/etc cases one by one, and get rid of more
asm code and finally the wrapper.

Comments? This was obviously brought on by my frustration with the
currently nasty do_notify_resume() always returning to iret for the
task_work case, and PeterZ's patch that fixed that, but made the asm
mess even *worse*.

                        Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 4573 bytes --]

 arch/x86/kernel/entry_64.S | 51 ++++++++++++++--------------------------------
 arch/x86/kernel/signal.c   | 37 ++++++++++++++++++++++++++++-----
 2 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..15b5953da958 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -658,30 +658,24 @@ sysret_check:
 	/* Handle reschedules */
 	/* edx:	work, edi: workmask */
 sysret_careful:
-	bt $TIF_NEED_RESCHED,%edx
-	jnc sysret_signal
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq_cfi %rdi
-	SCHEDULE_USER
-	popq_cfi %rdi
-	jmp sysret_check
+	SAVE_REST
+	FIXUP_TOP_OF_STACK %r11
+	movq %rsp,%rdi
+	movl %edx,%esi
+	call syscall_exit_slowpath
+	movl $_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT,%edi
+	testl %eax,%eax
+	jne sysret_using_iret
+	addq $REST_SKIP,%rsp
+	jmp sysret_check;
 
-	/* Handle a signal */
-sysret_signal:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
-	bt $TIF_SYSCALL_AUDIT,%edx
-	jc sysret_audit
-#endif
-	/*
-	 * We have a signal, or exit tracing or single-step.
-	 * These all wind up with the iret return path anyway,
-	 * so just join that path right now.
-	 */
-	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
-	jmp int_check_syscall_exit_work
+sysret_using_iret:
+	RESTORE_REST
+	DISABLE_INTERRUPTS(CLBR_NONE)
+	TRACE_IRQS_OFF
+	jmp int_with_check
 
 badsys:
 	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
@@ -703,20 +697,6 @@ auditsys:
 	call __audit_syscall_entry
 	LOAD_ARGS 0		/* reload call-clobbered registers */
 	jmp system_call_fastpath
-
-	/*
-	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
-	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
-	 * masked off.
-	 */
-sysret_audit:
-	movq RAX-ARGOFFSET(%rsp),%rsi	/* second arg, syscall return value */
-	cmpq $-MAX_ERRNO,%rsi	/* is it < -MAX_ERRNO? */
-	setbe %al		/* 1 if so, 0 if not */
-	movzbl %al,%edi		/* zero-extend that into %edi */
-	call __audit_syscall_exit
-	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-	jmp sysret_check
 #endif	/* CONFIG_AUDITSYSCALL */
 
 	/* Do syscall tracing */
@@ -786,7 +766,6 @@ int_careful:
 int_very_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
-int_check_syscall_exit_work:
 	SAVE_REST
 	/* Check for syscall exit trace */
 	testl $_TIF_WORK_SYSCALL_EXIT,%edx
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9e5de6813e1f..25d02f6a65f1 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -725,13 +725,30 @@ static void do_signal(struct pt_regs *regs)
 }
 
 /*
- * notification of userspace execution resumption
- * - triggered by the TIF_WORK_MASK flags
+ * This is the system call exit slowpath, called if any of
+ * the thread info flags are set that cause us to not just
+ * return using a 'sysret' instruction.
+ *
+ * Return zero if the user register state hasn't changed,
+ * and a 'sysret' is still ok. Return nonzero if we need
+ * to go to the slow 'iret' path due to possible register
+ * changes (signals, ptrace, whatever).
  */
-__visible void
-do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
+__visible int syscall_exit_slowpath(struct pt_regs *regs, u32 thread_info_flags)
 {
+	int retval = 0;
+
 	user_exit();
+	if (thread_info_flags & _TIF_NEED_RESCHED)
+		schedule();
+
+	if (thread_info_flags & _TIF_WORK_SYSCALL_EXIT)
+		syscall_trace_leave(regs);
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (thread_info_flags & _TIF_SYSCALL_AUDIT)
+		__audit_syscall_exit(regs->ax, regs->ax < -MAX_ERRNO);
+#endif
 
 #ifdef CONFIG_X86_MCE
 	/* notify userspace of pending MCEs */
@@ -743,8 +760,10 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		uprobe_notify_resume(regs);
 
 	/* deal with pending signal delivery */
-	if (thread_info_flags & _TIF_SIGPENDING)
+	if (thread_info_flags & _TIF_SIGPENDING) {
 		do_signal(regs);
+		retval = 1;
+	}
 
 	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
 		clear_thread_flag(TIF_NOTIFY_RESUME);
@@ -754,8 +773,16 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		fire_user_return_notifiers();
 
 	user_enter();
+	return retval;
 }
 
+/* Get rid of this old interface some day.. */
+__visible void do_notify_resume(struct pt_regs *regs, void *unused, u32 thread_info_flags)
+{
+	syscall_exit_slowpath(regs, thread_info_flags & ~(_TIF_NEED_RESCHED|_TIF_SYSCALL_AUDIT));
+}
+
+
 void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
 {
 	struct task_struct *me = current;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-26 22:28 [RFC] de-asmify the x86-64 system call slowpath Linus Torvalds
@ 2014-01-27  0:22 ` Al Viro
  2014-01-27  4:32   ` Linus Torvalds
  2014-01-27 10:27 ` Peter Zijlstra
  2014-02-06  0:32 ` Linus Torvalds
  2 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-27  0:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 02:28:15PM -0800, Linus Torvalds wrote:
> The x86-64 (and 32-bit, for that matter) system call slowpaths are all
> in C, but the *selection* of which slow-path to take is a mixture of
> complicated assembler ("sysret_check -> sysret_careful ->
> sysret_signal ->sysret_audit -> int_check_syscall_exit_work" etc), and
> oddly named and placed C code ("schedule_user" vs
> "__audit_syscall_exit" vs "do_notify_resume").
> 
> This attached patch tries to take the "do_notify_resume()" approach,
> and renaming it to something sane ("syscall_exit_slowpath") and call
> out to *all* the different slow cases from that one place, instead of
> having some cases hardcoded in asm, and some in C. And instead of
> hardcoding which cases result in a "iretq" and which cases result in a
> faster sysret case, it's now simply a return value from that
> syscall_exit_slowpath() function, so it's very natural and easy to say
> "taking a signal will force us to do the slow iretq case, but we can
> do the task exit work and still do the sysret".
> 
> I've marked this as an RFC, because I didn't bother trying to clean up
> the 32-bit code similarly (no test-cases, and trust me, if you get
> this wrong, it will fail spectacularly but in very subtle and
> hard-to-debug ways), and I also didn't bother with the slow cases in
> the "iretq" path, so that path still has the odd asm cases and calls
> the old (now legacy) do_notify_resume() path.

Umm...  Can't uprobe_notify_resume() modify regs as well?  While we
are at it, when we start using the same thing on 32bit kernels, we'll
need to watch out for execve() - the reason why start_thread() sets
TIF_NOTIFY_RESUME is to force us away from sysexit path.  IIRC, vm86
is another thing to watch out for (same reasons).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27  0:22 ` Al Viro
@ 2014-01-27  4:32   ` Linus Torvalds
  2014-01-27  4:48     ` H. Peter Anvin
  2014-01-27  7:42     ` Al Viro
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2014-01-27  4:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 4:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  Can't uprobe_notify_resume() modify regs as well?

Probably.

.. and on the other hand, we should actually be able to use 'sysret'
for signal handling on x86-64, because while sysret destroys %rcx and
doesn't allow for returning to odd modes, for calling a signal handler
I don't think we really care..

> While we
> are at it, when we start using the same thing on 32bit kernels, we'll
> need to watch out for execve() - the reason why start_thread() sets
> TIF_NOTIFY_RESUME is to force us away from sysexit path.  IIRC, vm86
> is another thing to watch out for (same reasons).

Yes, the 32-bit code I didn't want to touch, partly because I no
longer have a test-case. And it does end up having some more
interesting cases.

                 Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27  4:32   ` Linus Torvalds
@ 2014-01-27  4:48     ` H. Peter Anvin
  2014-01-27  7:42     ` Al Viro
  1 sibling, 0 replies; 32+ messages in thread
From: H. Peter Anvin @ 2014-01-27  4:48 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 01/26/2014 08:32 PM, Linus Torvalds wrote:
> On Sun, Jan 26, 2014 at 4:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> Umm...  Can't uprobe_notify_resume() modify regs as well?
> 
> Probably.
> 
> .. and on the other hand, we should actually be able to use 'sysret'
> for signal handling on x86-64, because while sysret destroys %rcx and
> doesn't allow for returning to odd modes, for calling a signal handler
> I don't think we really care..
> 

Yes, it is the fourth argument register, but we only have three
arguments to a signal handler.  I had to think about that one.

>> While we
>> are at it, when we start using the same thing on 32bit kernels, we'll
>> need to watch out for execve() - the reason why start_thread() sets
>> TIF_NOTIFY_RESUME is to force us away from sysexit path.  IIRC, vm86
>> is another thing to watch out for (same reasons).
> 
> Yes, the 32-bit code I didn't want to touch, partly because I no
> longer have a test-case. And it does end up having some more
> interesting cases.

That is one way to put it.  However, this code is incredibly ugly and
getting it cleaned up would really, really help, of course.

	-hpa



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27  4:32   ` Linus Torvalds
  2014-01-27  4:48     ` H. Peter Anvin
@ 2014-01-27  7:42     ` Al Viro
  2014-01-27 22:06       ` Andy Lutomirski
  1 sibling, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-27  7:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 08:32:09PM -0800, Linus Torvalds wrote:
> On Sun, Jan 26, 2014 at 4:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Umm...  Can't uprobe_notify_resume() modify regs as well?
> 
> Probably.
> 
> .. and on the other hand, we should actually be able to use 'sysret'
> for signal handling on x86-64, because while sysret destroys %rcx and
> doesn't allow for returning to odd modes, for calling a signal handler
> I don't think we really care..

I'm afraid we might:

 * When user can change the frames always force IRET. That is because
 * it deals with uncanonical addresses better. SYSRET has trouble
 * with them due to bugs in both AMD and Intel CPUs.

IIRC, that was about SYSRET with something unpleasant left in RCX, which
comes from regs->ip, which is set to sa_handler by __setup_rt_frame().
And we do not normalize or validate that - not in __setup_rt_frame() and
not at sigaction(2) time.  Something about GPF triggered and buggering
attacker-chosen memory area?  I don't remember details, but IIRC the
conclusion had been "just don't go there"...

Note that we can manipulate regs->ip and regs->sp, regardless of validation
at sigaction(2) or __setup_rt_frame() - just have the sucker ptraced, send
it a signal and it'll stop on delivery.  Then tracer can use ptrace to modify
registers and issue PTRACE_CONT with zero signal.  Voila - regs->[is]p
set to arbitrary values, no signal handlers triggered...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-26 22:28 [RFC] de-asmify the x86-64 system call slowpath Linus Torvalds
  2014-01-27  0:22 ` Al Viro
@ 2014-01-27 10:27 ` Peter Zijlstra
  2014-01-27 11:36   ` Al Viro
  2014-02-06  0:32 ` Linus Torvalds
  2 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-01-27 10:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 02:28:15PM -0800, Linus Torvalds wrote:
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1e96c3628bf2..15b5953da958 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -658,30 +658,24 @@ sysret_check:
>  	/* Handle reschedules */
>  	/* edx:	work, edi: workmask */
>  sysret_careful:
>  	TRACE_IRQS_ON
>  	ENABLE_INTERRUPTS(CLBR_NONE)
> +	SAVE_REST
> +	FIXUP_TOP_OF_STACK %r11
> +	movq %rsp,%rdi
> +	movl %edx,%esi
> +	call syscall_exit_slowpath
> +	movl $_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT,%edi
> +	testl %eax,%eax
> +	jne sysret_using_iret
> +	addq $REST_SKIP,%rsp
> +	jmp sysret_check;

Obviously I don't particularly like the SAVE_REST/FIXUP_TOP_OF_STACK
being added to the reschedule path.

Can't we do as Al suggested earlier and have 2 slowpath calls, one
without PT_REGS and one with?

That said, yes its a nice cleanup, entry.S always hurts my brain.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 10:27 ` Peter Zijlstra
@ 2014-01-27 11:36   ` Al Viro
  2014-01-27 17:39     ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-27 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 11:27:59AM +0100, Peter Zijlstra wrote:

> Obviously I don't particularly like the SAVE_REST/FIXUP_TOP_OF_STACK
> being added to the reschedule path.
> 
> Can't we do as Al suggested earlier and have 2 slowpath calls, one
> without PT_REGS and one with?
> 
> That said, yes its a nice cleanup, entry.S always hurts my brain.

BTW, there's an additional pile of obfuscation:
/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK                                                  \
        (0x0000FFFF &                                                   \
         ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|                       \
           _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))

/* work to do on any return to user space */
#define _TIF_ALLWORK_MASK                                               \
        ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |       \
        _TIF_NOHZ)

These guys are
	_TIF_NOTIFY_RESUME | _TIF_SIGPENDING | _TIF_MCE_NOTIFY |
	_TIF_USER_RETURN_NOTIFY | _TIF_UPROBE | _TIF_NEED_RESCHED | 0xe200
and
	_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |
	_TIF_NEED_RESCHED | _TIF_SINGLESTEP | _TIF_SYSCALL_EMU |
	_TIF_SYSCALL_AUDIT |  _TIF_MCE_NOTIFY | _TIF_SYSCALL_TRACEPOINT |
	_TIF_NOHZ | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE | 0xe200
resp., or
	_TIF_DO_NOTIFY_MASK | _TIF_UPROBE | _TIF_NEED_RESCHED | 0xe200
and
	_TIF_DO_NOTIFY_MASK | _TIF_WORK_SYSCALL_EXIT | _TIF_NEED_RESCHED |
	_TIF_SYSCALL_EMU | _TIF_UPROBE | 0xe200

0xe200 (aka bits 15,14,13,9) consists of the bits that are never set by
anybody, so short of really deep magic it can be discarded.  The rest
is also interesting, to put it politely.  Why is _TIF_UPROBE *not* a part
of _TIF_DO_NOTIFY_MASK, for example?  Note that do_notify_resume() checks
and clears it, but on syscall (and interrupt) exit paths we only call it
with something in _TIF_DO_NOTIFY_MASK.  If UPROBE is set, but nothing
else in that set is, we'll be looping forever, right?  There's pending
work (according to _TIF_WORK_MASK), so we won't just leave.  And we won't
be calling do_notify_resume(), so there's nothing to clear that bit.
Only it gets even nastier - on the paranoid_userspace path we call
do_notify_resume() if anything in _TIF_WORK_MASK besides NEED_RESCHED 
happens to be set.  So _there_ getting solitary UPROBE is legitimate.

_TIF_SYSCALL_EMU is also an interesting story - on the way out it
	* forces us on iret path
	* does *not* trigger trace_syscall_leave() on its own
(trace_syscall_leave() is aware of that sucker, though, with rather
confusing comment)
	* hits do_notify_resume() (for no good reason - do_notify_resume()
silently ignores it)
	* gets cleared from the workmask (i.e. %edi), so on the next
iteration through the loop it gets completely ignored.

AFAICS, all of that is pointless, since SYSCALL_EMU wants to avoid
SYSRET only if we had entered with it and in that case we would've
gone through tracesys and stayed the fsck away from SYSRET path
anyway (similar on 32bit - if we hit syscall_trace_enter(), we
do not rejoin the sysenter path).  IOW, no reason for it to be
in _TIF_ALLWORK_MASK...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 11:36   ` Al Viro
@ 2014-01-27 17:39     ` Oleg Nesterov
  2014-01-28  1:18       ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2014-01-27 17:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers,
	Linux Kernel Mailing List

On 01/27, Al Viro wrote:
>
> BTW, there's an additional pile of obfuscation:
> /* work to do on interrupt/exception return */
> #define _TIF_WORK_MASK                                                  \
>         (0x0000FFFF &                                                   \
>          ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|                       \
>            _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
>
> /* work to do on any return to user space */
> #define _TIF_ALLWORK_MASK                                               \
>         ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |       \
>         _TIF_NOHZ)

Heh, yes ;)

> Why is _TIF_UPROBE *not* a part
> of _TIF_DO_NOTIFY_MASK, for example?

Yes, please see another email. That is why uprobe_deny_signal()
sets TIF_NOTIFY_RESUME along with TIF_UPROBE.

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27  7:42     ` Al Viro
@ 2014-01-27 22:06       ` Andy Lutomirski
  2014-01-27 22:17         ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2014-01-27 22:06 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 01/26/2014 11:42 PM, Al Viro wrote:
> On Sun, Jan 26, 2014 at 08:32:09PM -0800, Linus Torvalds wrote:
>> On Sun, Jan 26, 2014 at 4:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>
>>> Umm...  Can't uprobe_notify_resume() modify regs as well?
>>
>> Probably.
>>
>> .. and on the other hand, we should actually be able to use 'sysret'
>> for signal handling on x86-64, because while sysret destroys %rcx and
>> doesn't allow for returning to odd modes, for calling a signal handler
>> I don't think we really care..
> 
> I'm afraid we might:
> 
>  * When user can change the frames always force IRET. That is because
>  * it deals with uncanonical addresses better. SYSRET has trouble
>  * with them due to bugs in both AMD and Intel CPUs.
> 
> IIRC, that was about SYSRET with something unpleasant left in RCX, which
> comes from regs->ip, which is set to sa_handler by __setup_rt_frame().
> And we do not normalize or validate that - not in __setup_rt_frame() and
> not at sigaction(2) time.  Something about GPF triggered and buggering
> attacker-chosen memory area?  I don't remember details, but IIRC the
> conclusion had been "just don't go there"...
> 
> Note that we can manipulate regs->ip and regs->sp, regardless of validation
> at sigaction(2) or __setup_rt_frame() - just have the sucker ptraced, send
> it a signal and it'll stop on delivery.  Then tracer can use ptrace to modify
> registers and issue PTRACE_CONT with zero signal.  Voila - regs->[is]p
> set to arbitrary values, no signal handlers triggered...
> 

It's not just ip and sp -- cs matters here, too, I think.

(I may be the only one to have ever tried it, but it's possible to
far-call from 64-bit to 32-bit cs, and it works.  I've never tried
switching cs using ptrace, but someone may want that to work. too.)

That being said, the last time I benchmarked it, sysret was *way* faster
than iret.  So maybe the thing to do is to validate the registers on the
way out and, if they're appropriate for sysret, do the sysret.

I'm not quite sure how to express "I don't care about rcx" in pt_regs.
Maybe use the actual value that the CPU will stick in there (assuming
that anyone knows that this is).

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:06       ` Andy Lutomirski
@ 2014-01-27 22:17         ` Linus Torvalds
  2014-01-27 22:32           ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-01-27 22:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 2:06 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It's not just ip and sp -- cs matters here, too, I think.

For signal *delivery*, CS will always be __USER_CS, and %rcx can be
crap, so sysret should be fine. We could easily check that %rip is
valid in the whole slow-path instead of saying "return 1 if we did
do_signal()".

Now, it's a different thing wrt signal handler *return*, because at
that point we really cannot return with some random value in %rcx. We
absolutely do need to use 'iretq' in that whole [rt_]sigreturn() path,
but on x86-64 that is all handled by the system call itself (see the
stub_*_sigreturn stuff in entry_64.S) and it very much uses iret
explicitly (the 32-bit case also does that, by forcing the sigreturn
to be done with an "int 0x80" instruction - we could change that to
use syscall+iret, but frankly, I'm not all that inclined to care,
although it might be worth trying to do just to unify the models a
bit).

                   Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:17         ` Linus Torvalds
@ 2014-01-27 22:32           ` Al Viro
  2014-01-27 22:43             ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-27 22:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 02:17:23PM -0800, Linus Torvalds wrote:
> On Mon, Jan 27, 2014 at 2:06 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > It's not just ip and sp -- cs matters here, too, I think.
> 
> For signal *delivery*, CS will always be __USER_CS, and %rcx can be
> crap, so sysret should be fine. We could easily check that %rip is
> valid in the whole slow-path instead of saying "return 1 if we did
> do_signal()".

do_signal() is also a place where arbitrary changes to regs might've
been done by tracer, so regs->cs might need to be checked in the same
place where we validate regs->rip ;-/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:32           ` Al Viro
@ 2014-01-27 22:43             ` Linus Torvalds
  2014-01-27 22:46               ` Andy Lutomirski
  2014-01-27 23:07               ` Al Viro
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2014-01-27 22:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Andy Lutomirski, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> do_signal() is also a place where arbitrary changes to regs might've
> been done by tracer, so regs->cs might need to be checked in the same
> place where we validate regs->rip ;-/

Fair enough. But it would still be really easy, and make the common
case signal delivery a bit faster.

Now, sadly, most signal delivery is then followed by sigreturn (the
exceptions being dying or doing a longjmp), so we'd still get the
iretq then. But it would cut the iretq's related to signals in half.

We *could* try to do sigreturn with sysret and a small trampoline too,
of course. But I'm not sure how far I'd want to take it.

               Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:43             ` Linus Torvalds
@ 2014-01-27 22:46               ` Andy Lutomirski
  2014-01-28  0:22                 ` H. Peter Anvin
  2014-01-27 23:07               ` Al Viro
  1 sibling, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2014-01-27 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 2:43 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 27, 2014 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> do_signal() is also a place where arbitrary changes to regs might've
>> been done by tracer, so regs->cs might need to be checked in the same
>> place where we validate regs->rip ;-/
>
> Fair enough. But it would still be really easy, and make the common
> case signal delivery a bit faster.
>
> Now, sadly, most signal delivery is then followed by sigreturn (the
> exceptions being dying or doing a longjmp), so we'd still get the
> iretq then. But it would cut the iretq's related to signals in half.
>
> We *could* try to do sigreturn with sysret and a small trampoline too,
> of course. But I'm not sure how far I'd want to take it.

I once spent a while thinking about how to do this.  The best I could
come up with was to use something like 'ret 128' for the trampoline.
(The issue here is that there's no good place to shove a global
variable with the missing register values, fs and gs aren't really
available for these games, and the red zone is in the way.)

I think that sysret for sigreturn is probably not very interesting.
On the other hand, sysret for #PF might be a huge win, despite being
even scarier.

(Or someone could politely ask Intel for a couple of non-serializing
msrs that set the values of rcx and whatever other registers get
clobbered by sysret.)

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:43             ` Linus Torvalds
  2014-01-27 22:46               ` Andy Lutomirski
@ 2014-01-27 23:07               ` Al Viro
  1 sibling, 0 replies; 32+ messages in thread
From: Al Viro @ 2014-01-27 23:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 02:43:14PM -0800, Linus Torvalds wrote:
> On Mon, Jan 27, 2014 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > do_signal() is also a place where arbitrary changes to regs might've
> > been done by tracer, so regs->cs might need to be checked in the same
> > place where we validate regs->rip ;-/
> 
> Fair enough. But it would still be really easy, and make the common
> case signal delivery a bit faster.
> 
> Now, sadly, most signal delivery is then followed by sigreturn (the
> exceptions being dying or doing a longjmp), so we'd still get the
> iretq then. But it would cut the iretq's related to signals in half.
> 
> We *could* try to do sigreturn with sysret and a small trampoline too,
> of course. But I'm not sure how far I'd want to take it.

The problem with validation is that we'll suddenly become sensitive to
hard-to-estimate pile of hardware bugs ;-/  E.g. which AMD-specific
errata is that comment in entry_64.S about?  The one I kinda-sorta
remember is Intel-specific, and that was about uncanonical RIP; looking
for AMD one has turned #353 (with suggested workaround being "have bit 16 set
in whatever you put into R11"), but I've no idea whether that's the only
potential issue there...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:46               ` Andy Lutomirski
@ 2014-01-28  0:22                 ` H. Peter Anvin
  2014-01-28  0:44                   ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: H. Peter Anvin @ 2014-01-28  0:22 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Al Viro, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 01/27/2014 02:46 PM, Andy Lutomirski wrote:
> 
> I think that sysret for sigreturn is probably not very interesting.
> On the other hand, sysret for #PF might be a huge win, despite being
> even scarier.
> 

SYSRET for #PF or other exceptions is a nonstarter; register state is
live at that point.

	-hpa



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-28  0:22                 ` H. Peter Anvin
@ 2014-01-28  0:44                   ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2014-01-28  0:44 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Al Viro, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 4:22 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 01/27/2014 02:46 PM, Andy Lutomirski wrote:
>>
>> I think that sysret for sigreturn is probably not very interesting.
>> On the other hand, sysret for #PF might be a huge win, despite being
>> even scarier.
>>
>
> SYSRET for #PF or other exceptions is a nonstarter; register state is
> live at that point.

I mean sysret-via-trampoline for #PF.

It's scary, it probably has issues with ptrace and interrupts that hit
while the trampoline is still running, and it could break anything
that writes past the red zone, but I think it could work.

No, I don't particularly want to implement (and debug) such a beast.

(I will continue cursing Intel -- why can't we have a fast way to
return to 64-bit userspace with complete control of all non-segment
registers?  sysret is *almost* the right thing.)

--Andy

>
>         -hpa
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 17:39     ` Oleg Nesterov
@ 2014-01-28  1:18       ` Al Viro
  2014-01-28 16:38         ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-28  1:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 06:39:31PM +0100, Oleg Nesterov wrote:
> On 01/27, Al Viro wrote:
> >
> > BTW, there's an additional pile of obfuscation:
> > /* work to do on interrupt/exception return */
> > #define _TIF_WORK_MASK                                                  \
> >         (0x0000FFFF &                                                   \
> >          ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|                       \
> >            _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
> >
> > /* work to do on any return to user space */
> > #define _TIF_ALLWORK_MASK                                               \
> >         ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |       \
> >         _TIF_NOHZ)
> 
> Heh, yes ;)
> 
> > Why is _TIF_UPROBE *not* a part
> > of _TIF_DO_NOTIFY_MASK, for example?
> 
> Yes, please see another email. That is why uprobe_deny_signal()
> sets TIF_NOTIFY_RESUME along with TIF_UPROBE.

*grumble* Can it end up modifying *regs?  From very cursory reading of
kernel/events/uprobe.c it seems to do so, so we probably want to leave
via iretq if that has hit, right?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-28  1:18       ` Al Viro
@ 2014-01-28 16:38         ` Oleg Nesterov
  2014-01-28 16:48           ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2014-01-28 16:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers,
	Linux Kernel Mailing List, Srikar Dronamraju

On 01/28, Al Viro wrote:
>
> On Mon, Jan 27, 2014 at 06:39:31PM +0100, Oleg Nesterov wrote:
> > On 01/27, Al Viro wrote:
> > >
> > > Why is _TIF_UPROBE *not* a part
> > > of _TIF_DO_NOTIFY_MASK, for example?
> >
> > Yes, please see another email. That is why uprobe_deny_signal()
> > sets TIF_NOTIFY_RESUME along with TIF_UPROBE.
>
> *grumble* Can it end up modifying *regs?  From very cursory reading of
> kernel/events/uprobe.c it seems to do so, so we probably want to leave
> via iretq if that has hit, right?

But we do this anyway, restore_args path does iretq?

I mean, uprobe_notify_resume() is called from do_notify_resume(), it
should be fine to modify*regs there?

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-28 16:38         ` Oleg Nesterov
@ 2014-01-28 16:48           ` Al Viro
  2014-01-28 17:19             ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-28 16:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers,
	Linux Kernel Mailing List, Srikar Dronamraju

On Tue, Jan 28, 2014 at 05:38:08PM +0100, Oleg Nesterov wrote:
> On 01/28, Al Viro wrote:
> >
> > On Mon, Jan 27, 2014 at 06:39:31PM +0100, Oleg Nesterov wrote:
> > > On 01/27, Al Viro wrote:
> > > >
> > > > Why is _TIF_UPROBE *not* a part
> > > > of _TIF_DO_NOTIFY_MASK, for example?
> > >
> > > Yes, please see another email. That is why uprobe_deny_signal()
> > > sets TIF_NOTIFY_RESUME along with TIF_UPROBE.
> >
> > *grumble* Can it end up modifying *regs?  From very cursory reading of
> > kernel/events/uprobe.c it seems to do so, so we probably want to leave
> > via iretq if that has hit, right?
> 
> But we do this anyway, restore_args path does iretq?
> 
> I mean, uprobe_notify_resume() is called from do_notify_resume(), it
> should be fine to modify*regs there?

See Linus' patch trying to avoid iretq path; it's really costly.  Looks
like that patch will have to treat _TIF_UPROBE the same way it treats
_TIF_SIGPENDING...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-28 16:48           ` Al Viro
@ 2014-01-28 17:19             ` Oleg Nesterov
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-01-28 17:19 UTC (permalink / raw)
  To: Al Viro
  Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers,
	Linux Kernel Mailing List, Srikar Dronamraju

On 01/28, Al Viro wrote:
>
> On Tue, Jan 28, 2014 at 05:38:08PM +0100, Oleg Nesterov wrote:
> > On 01/28, Al Viro wrote:
> > >
> > > On Mon, Jan 27, 2014 at 06:39:31PM +0100, Oleg Nesterov wrote:
> > > > On 01/27, Al Viro wrote:
> > > > >
> > > > > Why is _TIF_UPROBE *not* a part
> > > > > of _TIF_DO_NOTIFY_MASK, for example?
> > > >
> > > > Yes, please see another email. That is why uprobe_deny_signal()
> > > > sets TIF_NOTIFY_RESUME along with TIF_UPROBE.
> > >
> > > *grumble* Can it end up modifying *regs?  From very cursory reading of
> > > kernel/events/uprobe.c it seems to do so, so we probably want to leave
> > > via iretq if that has hit, right?
> >
> > But we do this anyway, restore_args path does iretq?
> >
> > I mean, uprobe_notify_resume() is called from do_notify_resume(), it
> > should be fine to modify*regs there?
>
> See Linus' patch trying to avoid iretq path; it's really costly.  Looks
> like that patch will have to treat _TIF_UPROBE the same way it treats
> _TIF_SIGPENDING...

Ah, this one I guess: http://marc.info/?l=linux-kernel&m=139077532507926

I think this should be fine wrt uprobes, unless I misread this patch
syscall_exit_slowpath() is actually only called by ret_from_sys_call
path, it this case TIF_UPROBE should not be set. But perhaps
"retval = 1" after uprobe_notify_resume() makes sense anyway.

And while I am almost sure I missed something, can't we (with or without
that patch) simply add TIF_UPROBE into _TIF_DO_NOTIFY_MASK and remove
set_tsk_thread_flag(t, TIF_NOTIFY_RESUME) from uprobe_deny_signal ?

Oleg.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-26 22:28 [RFC] de-asmify the x86-64 system call slowpath Linus Torvalds
  2014-01-27  0:22 ` Al Viro
  2014-01-27 10:27 ` Peter Zijlstra
@ 2014-02-06  0:32 ` Linus Torvalds
  2014-02-06  0:55   ` Kirill A. Shutemov
  2 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-02-06  0:32 UTC (permalink / raw)
  To: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 2:28 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Comments? This was obviously brought on by my frustration with the
> currently nasty do_notify_resume() always returning to iret for the
> task_work case, and PeterZ's patch that fixed that, but made the asm
> mess even *worse*.

Actually, I should have taken a closer look.

Yes, do_notify_resume() is a real issue, and my stupid open/close
test-case showed that part of the profile.

But the "iretq" that dominates on the kernel build is actually the
page fault one.

I noticed this when I compared "-e cycles:pp" with "-e cycles:p". The
single-p version shows largely the same profile for the kernel, except
that instead of showing "iretq" as the big cost, it shows the first
instruction in "page_fault".

In fact, even when *not* zoomed into the kernel DSO, "page_fault"
actually takes 5% of CPU time according to pref report. That's really
quite impressive.

I suspect the Haswell architecture has made everything else cheaper,
and the exception overhead hasn't kept up. I'm wondering if there is
anything we could do to speed this up - like doing gang lookup in the
page cache and pre-populating the page tables opportunistically.

We're using an interrupt gate for the page fault handling, and I don't
think we can avoid that. For all I know, a trap gate might be slightly
faster (but likely not really noticeable - the microcode is surely
expensive, but the pipeline unwinding is probably the biggest cost of
the page fault), but we have the issue of interrupts causing page
faults for vmalloc pages.. And obviously we can't avoid the iretq for
the return path.

So as far as I can see, there's no sane way to make the page fault
itself cheaper. Looking at opportunistically prepopulating page tables
when it's cheap and easy might be the best we can do..

                 Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  0:32 ` Linus Torvalds
@ 2014-02-06  0:55   ` Kirill A. Shutemov
  2014-02-06  2:32     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2014-02-06  0:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Wed, Feb 05, 2014 at 04:32:55PM -0800, Linus Torvalds wrote:
> On Sun, Jan 26, 2014 at 2:28 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Comments? This was obviously brought on by my frustration with the
> > currently nasty do_notify_resume() always returning to iret for the
> > task_work case, and PeterZ's patch that fixed that, but made the asm
> > mess even *worse*.
> 
> Actually, I should have taken a closer look.
> 
> Yes, do_notify_resume() is a real issue, and my stupid open/close
> test-case showed that part of the profile.
> 
> But the "iretq" that dominates on the kernel build is actually the
> page fault one.
> 
> I noticed this when I compared "-e cycles:pp" with "-e cycles:p". The
> single-p version shows largely the same profile for the kernel, except
> that instead of showing "iretq" as the big cost, it shows the first
> instruction in "page_fault".
> 
> In fact, even when *not* zoomed into the kernel DSO, "page_fault"
> actually takes 5% of CPU time according to pref report. That's really
> quite impressive.
> 
> I suspect the Haswell architecture has made everything else cheaper,
> and the exception overhead hasn't kept up. I'm wondering if there is
> anything we could do to speed this up - like doing gang lookup in the
> page cache and pre-populating the page tables opportunistically.

One thing that could help is THP for file-backed pages. And there's
prototype with basic infrasturure and support for ramfs and
shmem/tmpfs (by Ning Qu). Work in progress.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  0:55   ` Kirill A. Shutemov
@ 2014-02-06  2:32     ` Linus Torvalds
  2014-02-06  4:33       ` Linus Torvalds
  2014-02-06  5:42       ` [RFC] de-asmify the x86-64 system call slowpath Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2014-02-06  2:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Wed, Feb 5, 2014 at 4:55 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> One thing that could help is THP for file-backed pages. And there's
> prototype with basic infrasturure and support for ramfs and
> shmem/tmpfs (by Ning Qu). Work in progress.

THP isn't really realistic for any general-purpose loads. 2MB pages
are just too big of a granularity.

On a machine with just a few gigs (ie anything but big servers), you
count the number of large pages in a few thousands. That makes
fragmentation a big issue. Also, ELF sections aren't actually
2MB-aligned, nor do you really want to throw away 9 bits of virtual
address space randomization. Plus common binaries aren't even big
enough. Look at "bash", which is a pig - it's still less than a
megabyte as just the binary, with the code segment being 900kB or so
for me. For something like "perl", it's even smaller, with a number of
shared object files that get loaded dynamically etc.

IOW, THP is completely useless for any kind of scripting. It can be
good for individual large binaries that have long lifetimes (DB
processes, HPC, stuff like that), but no, it is *not* the answer to
complex loads. And almost certainly never will be.

It's possible that some special case (glibc for example) could be
reasonably hacked to use a THP code segment, but that sounds unlikely.
And the virtual address randomization really gets hurt.

No, I was thinking "try to optimistically map 8 adjacent aligned pages
at a time" - that would be the same cacheline in the page tables, so
it would be fairly cheap if we couple it with a gang-lookup of the
pages in the page cache (or, for anonymous pages, by just
optimistically trying to do an order-3 page allocation, and if that
works, just map the 32kB allocation you got as eight individual
pages).

I know it's been discussed at some point, and I even have a dim memory
of having seen some really ugly patches.

             Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  2:32     ` Linus Torvalds
@ 2014-02-06  4:33       ` Linus Torvalds
  2014-02-06 21:29         ` Linus Torvalds
  2014-02-06  5:42       ` [RFC] de-asmify the x86-64 system call slowpath Ingo Molnar
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-02-06  4:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

[-- Attachment #1: Type: text/plain, Size: 816 bytes --]

On Wed, Feb 5, 2014 at 6:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> No, I was thinking "try to optimistically map 8 adjacent aligned pages
> at a time" - that would be the same cacheline in the page tables, so
> it would be fairly cheap if we couple it with a gang-lookup of the
> pages in the page cache

Doing the gang-lookup is hard, since it's all abstracted away, but the
attached patch kind of tries to do what I described.

This patch probably doesn't work, but something *like* this might be
worth playing with.

Except I suspect the page-backed faults are actually the minority,
judging by how high clear_page_c_e is in the profile it's probably
mostly anonymous memory. I have no idea why I started with the (more
complex) case of file-backed prefaulting. Oh well.

            Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1753 bytes --]

 mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index be6a0c0d4ae0..d52ec6a344dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3487,15 +3487,55 @@ uncharge_out:
 	return ret;
 }
 
+#define no_fault_around(x) (x & (VM_FAULT_ERROR | VM_FAULT_MAJOR | VM_FAULT_RETRY))
+#define FAULT_AROUND_SHIFT (3)
+
+static void fault_around(struct mm_struct *mm, struct vm_area_struct *vma,
+	unsigned long address, pmd_t *pmd)
+{
+	int nr = 1 << FAULT_AROUND_SHIFT;
+
+	address &= PAGE_MASK << FAULT_AROUND_SHIFT;
+	if (address < vma->vm_start)
+		return;
+
+	do {
+		pte_t *pte;
+		pte_t entry;
+		pgoff_t pgoff;
+
+		pte = pte_offset_map(pmd, address);
+		entry = *pte;
+
+		pte_unmap(pte);
+		if (!pte_none(entry))
+			continue;
+		pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
+		pgoff += vma->vm_pgoff;
+		if (no_fault_around(__do_fault(mm, vma, address, pmd, pgoff, 0, entry)))
+			break;
+	} while (address += PAGE_SIZE, address < vma->vm_end && --nr > 0);
+}
+
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		unsigned int flags, pte_t orig_pte)
 {
+	int ret;
 	pgoff_t pgoff = (((address & PAGE_MASK)
 			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
 	pte_unmap(page_table);
-	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+	ret = __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+
+	/*
+	 * If the page we were looking for succeeded with no retries,
+	 * see if we can fault around it too..
+	 */
+	if (!no_fault_around(ret) && (flags & FAULT_FLAG_ALLOW_RETRY))
+		fault_around(mm, vma, address, pmd);
+
+	return ret;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  2:32     ` Linus Torvalds
  2014-02-06  4:33       ` Linus Torvalds
@ 2014-02-06  5:42       ` Ingo Molnar
  1 sibling, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2014-02-06  5:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, Ning Qu


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [...]
> 
> No, I was thinking "try to optimistically map 8 adjacent aligned 
> pages at a time" - that would be the same cacheline in the page 
> tables, so it would be fairly cheap if we couple it with a 
> gang-lookup of the pages in the page cache (or, for anonymous pages, 
> by just optimistically trying to do an order-3 page allocation, and 
> if that works, just map the 32kB allocation you got as eight 
> individual pages).
> 
> I know it's been discussed at some point, and I even have a dim 
> memory of having seen some really ugly patches.

I have a dim memory of having written such group-prefaulting patches 
myself a decade ago or so - IIRC the main problem was that at that 
time we never found a common load where it really mattered, and it was 
easy to spend more time doing all this extra work and not see the 
prefaulted pages used.

But the cost/benefit balance has indeed changed so IMO it's worth a 
try.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  4:33       ` Linus Torvalds
@ 2014-02-06 21:29         ` Linus Torvalds
  2014-02-06 22:24           ` Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-02-06 21:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Wed, Feb 5, 2014 at 8:33 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Doing the gang-lookup is hard, since it's all abstracted away, but the
> attached patch kind of tries to do what I described.
>
> This patch probably doesn't work, but something *like* this might be
> worth playing with.

Interesting. Here are some pte fault statistics with and without the patch.

I added a few new count_vm_event() counters: PTEFAULT, PTEFILE,
PTEANON, PTEWP, PTESPECULATIVE for the handle_pte_fault,
do_linear_fault, do_anonymous_page, do_wp_page and the "let's
speculatively fill the page tables" case.

This is what the statistics look like for me doing a "make -j" of a
fully built almodconfig build:

  5007450       ptefault
  3272038       ptefile
  1470242       pteanon
   265121       ptewp
        0       ptespeculative

where obviously the ptespeculative count is zero, and I was wrong
about anon faults being most common - the file mapping faults really
are the most common for this load (it's fairly fork/exec heavy, I
guess).

This is what happens with that patch I posted:

  2962090       ptefault
  1195130       ptefile
  1490460       pteanon
   276479       ptewp
  5690724       ptespeculative

about 200k page faults went away, and the numbers make sense (ie they
got removed from the ptefile column - the other number changes are
just noise).

Now, we filled 600k extra page table entries to do that (that
ptespeculative number), so the "hitrate" for the speculative filling
was basically about 33%. Which doesn't sound crazy - the code
basically populates the 8 aligned pages around the faulting page.

Now, because I didn't make this easily dynamically configurable I have
no good way to really test timing, but the numbers says at least the
concept works.

Whether the reduced number of page faults and presumably better
locality for the speculative prefilling makes up for the fact that 66%
of the prefilled entries never get used is very debatable. But I think
it's a somewhat interesting experiment, and the patch was certainly
not hugely complicated.

I should add a switch to turn this on/off and then do many builds in
sequence to get some kind of idea of whether it actually changes
performance. But if 5% of the overall time was literally spent on the
*exception* part of the page fault (ie not counting all the work we do
in the kernel), I think it's worth looking at this.

                        Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06 21:29         ` Linus Torvalds
@ 2014-02-06 22:24           ` Kirill A. Shutemov
  2014-02-07  1:31             ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2014-02-06 22:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Thu, Feb 06, 2014 at 01:29:13PM -0800, Linus Torvalds wrote:
> On Wed, Feb 5, 2014 at 8:33 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Doing the gang-lookup is hard, since it's all abstracted away, but the
> > attached patch kind of tries to do what I described.
> >
> > This patch probably doesn't work, but something *like* this might be
> > worth playing with.
> 
> Interesting. Here are some pte fault statistics with and without the patch.
> 
> I added a few new count_vm_event() counters: PTEFAULT, PTEFILE,
> PTEANON, PTEWP, PTESPECULATIVE for the handle_pte_fault,
> do_linear_fault, do_anonymous_page, do_wp_page and the "let's
> speculatively fill the page tables" case.
> 
> This is what the statistics look like for me doing a "make -j" of a
> fully built almodconfig build:
> 
>   5007450       ptefault
>   3272038       ptefile
>   1470242       pteanon
>    265121       ptewp
>         0       ptespeculative
> 
> where obviously the ptespeculative count is zero, and I was wrong
> about anon faults being most common - the file mapping faults really
> are the most common for this load (it's fairly fork/exec heavy, I
> guess).
> 
> This is what happens with that patch I posted:
> 
>   2962090       ptefault
>   1195130       ptefile
>   1490460       pteanon
>    276479       ptewp
>   5690724       ptespeculative
> 
> about 200k page faults went away, and the numbers make sense (ie they
> got removed from the ptefile column - the other number changes are
> just noise).
> 
> Now, we filled 600k extra page table entries to do that (that
> ptespeculative number), so the "hitrate" for the speculative filling
> was basically about 33%. Which doesn't sound crazy - the code
> basically populates the 8 aligned pages around the faulting page.
> 
> Now, because I didn't make this easily dynamically configurable I have
> no good way to really test timing, but the numbers says at least the
> concept works.
> 
> Whether the reduced number of page faults and presumably better
> locality for the speculative prefilling makes up for the fact that 66%
> of the prefilled entries never get used is very debatable. But I think
> it's a somewhat interesting experiment, and the patch was certainly
> not hugely complicated.
> 
> I should add a switch to turn this on/off and then do many builds in
> sequence to get some kind of idea of whether it actually changes
> performance. But if 5% of the overall time was literally spent on the
> *exception* part of the page fault (ie not counting all the work we do
> in the kernel), I think it's worth looking at this.

If we want to reduce number of page fault with less overhead we probably
should concentrate on minor page fault -- populate pte around fault
address which already in page cache. It should cover scripting use-case
pretty well.

We also can do reasonable batching in this case: make changes under the
same page table lock and group radix tree lookup. It may help with
scalability.

I'm working on prototype of this. It's more invasive and too ugly at the
moment. Will see how it will go. I'll try to show something tomorrow.

Other idea: we could introduce less strict version of MAP_POPULATE to
populate only what we already have in page cache.

Heh! `man 3 mmap' actually suggests MAP_POPULATE | MAP_NONBLOCK.
What's the story beyond MAP_NONBLOCK?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06 22:24           ` Kirill A. Shutemov
@ 2014-02-07  1:31             ` Linus Torvalds
  2014-02-07 15:42               ` [RFC, PATCH] mm: map few pages around fault address if they are in page cache Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-02-07  1:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Thu, Feb 6, 2014 at 2:24 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> If we want to reduce number of page fault with less overhead we probably
> should concentrate on minor page fault -- populate pte around fault
> address which already in page cache. It should cover scripting use-case
> pretty well.

That's what my patch largely does. Except I screwed up and didn't use
FAULT_FLAG_ALLOW_RETRY in fault_around().

Anyway, my patch kind of works, but I'm starting to hate it. I think I
want to try to extend the "->fault()" interface to allow
filemap_fault() to just fill in multiple pages.

We alread have that "vmf->page" thing, we could make it a small array
easily. That would allow proper gang lookup, and much more efficient
"fill in multiple entries in one go" in mm/memory.c.

> Heh! `man 3 mmap' actually suggests MAP_POPULATE | MAP_NONBLOCK.
> What's the story beyond MAP_NONBLOCK?

It does nothing, afaik.

                  Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC, PATCH] mm: map few pages around fault address if they are in page cache
  2014-02-07  1:31             ` Linus Torvalds
@ 2014-02-07 15:42               ` Kirill A. Shutemov
  2014-02-07 17:32                 ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2014-02-07 15:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
	the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu,
	Dave Hansen, Matthew Wilcox, Andi Kleen

On Thu, Feb 06, 2014 at 05:31:55PM -0800, Linus Torvalds wrote:
> On Thu, Feb 6, 2014 at 2:24 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > If we want to reduce number of page fault with less overhead we probably
> > should concentrate on minor page fault -- populate pte around fault
> > address which already in page cache. It should cover scripting use-case
> > pretty well.
> 
> That's what my patch largely does. Except I screwed up and didn't use
> FAULT_FLAG_ALLOW_RETRY in fault_around().

I fail to see how you avoid touching backing storage.

> Anyway, my patch kind of works, but I'm starting to hate it. I think I
> want to try to extend the "->fault()" interface to allow
> filemap_fault() to just fill in multiple pages.
> 
> We alread have that "vmf->page" thing, we could make it a small array
> easily. That would allow proper gang lookup, and much more efficient
> "fill in multiple entries in one go" in mm/memory.c.

See patch below.

I extended ->fault() interface: added pointer to array of pages and
nr_pages. If ->fault() nows about new interface and use it, it fills array
and set VM_FAULT_AROUND bit in return code. Only filemap_fault()
converted at the moment.

I haven't tested it much, but my kvm boots. There're few places where code
should be fixed. __do_fault() and filemap_fault() are too ugly and need to
be cleaned.

I don't have any performance data yet.

Any thoughts?

---
 include/linux/mm.h |  11 +++++
 mm/filemap.c       | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/memory.c        |  58 ++++++++++++++++++++--
 3 files changed, 204 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35527173cf50..deb65c0850a9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -181,6 +181,9 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x40	/* second try */
 #define FAULT_FLAG_USER		0x80	/* The fault originated in userspace */
+#define FAULT_FLAG_AROUND	0x100	/* Get ->nr_pages pages around fault
+					 * address
+					 */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -200,6 +203,13 @@ struct vm_fault {
 					 * is set (which is also implied by
 					 * VM_FAULT_ERROR).
 					 */
+
+	int nr_pages;			/* Number of pages to faultin, naturally
+					 * aligned around virtual_address
+					 */
+	struct page **pages;		/* Pointer to array to store nr_pages
+					 * pages.
+					 */
 };
 
 /*
@@ -962,6 +972,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
+#define VM_FAULT_AROUND 0x1000
 
 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a92021c..1cabb15a5847 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -884,6 +884,64 @@ repeat:
 	return ret;
 }
 
+void find_get_pages_or_null(struct address_space *mapping, pgoff_t start,
+			    unsigned int nr_pages, struct page **pages)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+
+	if (unlikely(!nr_pages))
+		return;
+
+	memset(pages, 0, sizeof(struct page *) * nr_pages);
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			/*
+			 * Otherwise, shmem/tmpfs must be storing a swap entry
+			 * here as an exceptional entry: so skip over it -
+			 * we only reach this from invalidate_mapping_pages().
+			 */
+			continue;
+		}
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		if (page->index - start >= nr_pages) {
+			page_cache_release(page);
+			break;
+		}
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		pages[page->index - start] = page;
+	}
+
+	rcu_read_unlock();
+}
+
 /**
  * find_get_pages_contig - gang contiguous pagecache lookup
  * @mapping:	The address_space to search
@@ -1595,6 +1653,67 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
 					   page, offset, ra->ra_pages);
 }
 
+static void lock_secondary_pages(struct vm_area_struct *vma,
+		struct vm_fault *vmf)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	int primary_idx = (address >> PAGE_SHIFT) & (vmf->nr_pages - 1);
+	pgoff_t size;
+	int i;
+
+	for (i = 0; i < vmf->nr_pages; i++) {
+		unsigned long addr = address + PAGE_SIZE * (i - primary_idx);
+
+		if (i == primary_idx || !vmf->pages[i])
+			continue;
+
+		if (addr < vma->vm_start || addr >= vma->vm_end)
+			goto put;
+		if (!trylock_page(vmf->pages[i]))
+			goto put;
+		/* Truncated? */
+		if (unlikely(vmf->pages[i]->mapping != mapping))
+			goto unlock;
+
+		if (unlikely(!PageUptodate(vmf->pages[i])))
+			goto unlock;
+
+		/*
+		 * We have a locked page in the page cache, now we need to
+		 * check that it's up-to-date. If not, it is going to be due to
+		 * an error.
+		 */
+		size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+			>> PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pages[i]->index >= size))
+			goto unlock;
+
+		continue;
+unlock:
+		unlock_page(vmf->pages[i]);
+put:
+		put_page(vmf->pages[i]);
+		vmf->pages[i] = NULL;
+	}
+}
+
+static void unlock_and_put_secondary_pages(struct vm_fault *vmf)
+{
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	int primary_idx = (address >> PAGE_SHIFT) & (vmf->nr_pages - 1);
+	int i;
+
+	for (i = 0; i < vmf->nr_pages; i++) {
+		if (i == primary_idx || !vmf->pages[i])
+			continue;
+		unlock_page(vmf->pages[i]);
+		page_cache_release(vmf->pages[i]);
+		vmf->pages[i] = NULL;
+	}
+}
+
 /**
  * filemap_fault - read in file data for page fault handling
  * @vma:	vma in which the fault was taken
@@ -1617,6 +1736,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	pgoff_t offset = vmf->pgoff;
 	struct page *page;
 	pgoff_t size;
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	int primary_idx = (address >> PAGE_SHIFT) & (vmf->nr_pages - 1);
 	int ret = 0;
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -1626,7 +1747,15 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	/*
 	 * Do we have something in the page cache already?
 	 */
-	page = find_get_page(mapping, offset);
+	if (vmf->flags & FAULT_FLAG_AROUND) {
+		find_get_pages_or_null(mapping, offset - primary_idx,
+				vmf->nr_pages, vmf->pages);
+		page = vmf->pages[primary_idx];
+		lock_secondary_pages(vma, vmf);
+		ret |= VM_FAULT_AROUND;
+	} else
+		page = find_get_page(mapping, offset);
+
 	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
@@ -1638,14 +1767,18 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-		ret = VM_FAULT_MAJOR;
+		ret |= VM_FAULT_MAJOR;
 retry_find:
 		page = find_get_page(mapping, offset);
 		if (!page)
 			goto no_cached_page;
+		if (vmf->flags & FAULT_FLAG_AROUND)
+			vmf->pages[primary_idx] = page;
 	}
 
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+		unlock_and_put_secondary_pages(vmf);
+		ret &= ~VM_FAULT_AROUND;
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
 	}
@@ -1671,6 +1804,7 @@ retry_find:
 	 */
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (unlikely(offset >= size)) {
+		unlock_and_put_secondary_pages(vmf);
 		unlock_page(page);
 		page_cache_release(page);
 		return VM_FAULT_SIGBUS;
@@ -1694,6 +1828,8 @@ no_cached_page:
 	if (error >= 0)
 		goto retry_find;
 
+	unlock_and_put_secondary_pages(vmf);
+
 	/*
 	 * An error return from page_cache_read can result if the
 	 * system is low on memory, or a problem occurs while trying
@@ -1722,6 +1858,8 @@ page_not_uptodate:
 	if (!error || error == AOP_TRUNCATED_PAGE)
 		goto retry_find;
 
+	unlock_and_put_secondary_pages(vmf);
+
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
 	return VM_FAULT_SIGBUS;
diff --git a/mm/memory.c b/mm/memory.c
index 6768ce9e57d2..e94d86ac7d5a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3283,6 +3283,9 @@ oom:
 	return VM_FAULT_OOM;
 }
 
+#define FAULT_AROUND_PAGES 32
+#define FAULT_AROUND_MASK (FAULT_AROUND_PAGES - 1)
+
 /*
  * __do_fault() tries to create a new page mapping. It aggressively
  * tries to share with existing pages, but makes a separate copy if
@@ -3303,12 +3306,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *page_table;
 	spinlock_t *ptl;
 	struct page *page;
+	struct page *pages[FAULT_AROUND_PAGES];
 	struct page *cow_page;
 	pte_t entry;
 	int anon = 0;
 	struct page *dirty_page = NULL;
 	struct vm_fault vmf;
-	int ret;
+	int i, ret;
 	int page_mkwrite = 0;
 
 	/*
@@ -3336,14 +3340,35 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vmf.flags = flags;
 	vmf.page = NULL;
 
+	/* Do fault around only for read faults on linear mapping */
+	if (flags & (FAULT_FLAG_WRITE | FAULT_FLAG_NONLINEAR)) {
+		vmf.nr_pages = 0;
+		vmf.pages = NULL;
+	} else {
+		vmf.nr_pages = FAULT_AROUND_PAGES;
+		vmf.pages = pages;
+		vmf.flags |= FAULT_FLAG_AROUND;
+	}
 	ret = vma->vm_ops->fault(vma, &vmf);
+
+	/* ->fault don't know about FAULT_FLAG_AROUND */
+	if ((vmf.flags & FAULT_FLAG_AROUND) && !(ret & VM_FAULT_AROUND)) {
+		vmf.flags &= ~FAULT_FLAG_AROUND;
+	}
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
 			    VM_FAULT_RETRY)))
 		goto uncharge_out;
 
 	if (unlikely(PageHWPoison(vmf.page))) {
-		if (ret & VM_FAULT_LOCKED)
-			unlock_page(vmf.page);
+		if (ret & VM_FAULT_LOCKED) {
+			if (vmf.flags & FAULT_FLAG_AROUND) {
+				for (i = 0; i < FAULT_AROUND_PAGES; i++) {
+					if (pages[i])
+						unlock_page(pages[i]);
+				}
+			} else
+				unlock_page(vmf.page);
+		}
 		ret = VM_FAULT_HWPOISON;
 		goto uncharge_out;
 	}
@@ -3352,9 +3377,10 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * For consistency in subsequent calls, make the faulted page always
 	 * locked.
 	 */
-	if (unlikely(!(ret & VM_FAULT_LOCKED)))
+	if (unlikely(!(ret & VM_FAULT_LOCKED))) {
+		BUG_ON(ret & VM_FAULT_AROUND); // FIXME
 		lock_page(vmf.page);
-	else
+	} else
 		VM_BUG_ON(!PageLocked(vmf.page));
 
 	/*
@@ -3401,6 +3427,28 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 
+	for (i = 0; (vmf.flags & FAULT_FLAG_AROUND) && i < FAULT_AROUND_PAGES;
+			i++) {
+		int primary_idx = (address >> PAGE_SHIFT) & FAULT_AROUND_MASK;
+		pte_t *pte = page_table - primary_idx + i;
+		unsigned long addr = address + PAGE_SIZE * (i - primary_idx);
+
+		if (!pages[i])
+			continue;
+		if (i == primary_idx || !pte_none(*pte))
+			goto skip;
+		if (PageHWPoison(vmf.page))
+			goto skip;
+		flush_icache_page(vma, pages[i]);
+		entry = mk_pte(pages[i], vma->vm_page_prot);
+		inc_mm_counter_fast(mm, MM_FILEPAGES);
+		page_add_file_rmap(pages[i]);
+		set_pte_at(mm, addr, pte, entry);
+		update_mmu_cache(vma, addr, pte);
+skip:
+		unlock_page(pages[i]);
+	}
+
 	/*
 	 * This silly early PAGE_DIRTY setting removes a race
 	 * due to the bad i386 page protection. But it's valid
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC, PATCH] mm: map few pages around fault address if they are in page cache
  2014-02-07 15:42               ` [RFC, PATCH] mm: map few pages around fault address if they are in page cache Kirill A. Shutemov
@ 2014-02-07 17:32                 ` Andi Kleen
  2014-02-07 17:56                   ` Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2014-02-07 17:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, Ning Qu, Dave Hansen, Matthew Wilcox

> I haven't tested it much, but my kvm boots. There're few places where code
> should be fixed. __do_fault() and filemap_fault() are too ugly and need to
> be cleaned.
> 
> I don't have any performance data yet.
> 
> Any thoughts?

It seems very drastic to do it unconditionally. How about at least a simple
stream detection heuristic and perhaps also madvise?

There are some extreme cases where workloads could use a lot more memory
than before, if they access their memory sparsely in the right pattern.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC, PATCH] mm: map few pages around fault address if they are in page cache
  2014-02-07 17:32                 ` Andi Kleen
@ 2014-02-07 17:56                   ` Kirill A. Shutemov
  2014-02-07 18:11                     ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2014-02-07 17:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, Ning Qu, Dave Hansen, Matthew Wilcox

On Fri, Feb 07, 2014 at 09:32:00AM -0800, Andi Kleen wrote:
> > I haven't tested it much, but my kvm boots. There're few places where code
> > should be fixed. __do_fault() and filemap_fault() are too ugly and need to
> > be cleaned.
> > 
> > I don't have any performance data yet.
> > 
> > Any thoughts?
> 
> It seems very drastic to do it unconditionally. How about at least a simple
> stream detection heuristic and perhaps also madvise?

We already have readahead here it can be reused here. But see below.

> There are some extreme cases where workloads could use a lot more memory
> than before, if they access their memory sparsely in the right pattern.

Have you noticied that we don't actually allocate any memory: only reuse
what's already there. Sure, it will increase VmSize, but do we care?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC, PATCH] mm: map few pages around fault address if they are in page cache
  2014-02-07 17:56                   ` Kirill A. Shutemov
@ 2014-02-07 18:11                     ` Andi Kleen
  0 siblings, 0 replies; 32+ messages in thread
From: Andi Kleen @ 2014-02-07 18:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Peter Zijlstra, the arch/x86 maintainers,
	Linux Kernel Mailing List, Ning Qu, Dave Hansen, Matthew Wilcox

> > There are some extreme cases where workloads could use a lot more memory
> > than before, if they access their memory sparsely in the right pattern.
> 
> Have you noticied that we don't actually allocate any memory: only reuse
> what's already there. Sure, it will increase VmSize, but do we care?

Good point. With that probably readahead is overkill, agree.

We would need it though if we ever do it for anonymous.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2014-02-07 18:14 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-26 22:28 [RFC] de-asmify the x86-64 system call slowpath Linus Torvalds
2014-01-27  0:22 ` Al Viro
2014-01-27  4:32   ` Linus Torvalds
2014-01-27  4:48     ` H. Peter Anvin
2014-01-27  7:42     ` Al Viro
2014-01-27 22:06       ` Andy Lutomirski
2014-01-27 22:17         ` Linus Torvalds
2014-01-27 22:32           ` Al Viro
2014-01-27 22:43             ` Linus Torvalds
2014-01-27 22:46               ` Andy Lutomirski
2014-01-28  0:22                 ` H. Peter Anvin
2014-01-28  0:44                   ` Andy Lutomirski
2014-01-27 23:07               ` Al Viro
2014-01-27 10:27 ` Peter Zijlstra
2014-01-27 11:36   ` Al Viro
2014-01-27 17:39     ` Oleg Nesterov
2014-01-28  1:18       ` Al Viro
2014-01-28 16:38         ` Oleg Nesterov
2014-01-28 16:48           ` Al Viro
2014-01-28 17:19             ` Oleg Nesterov
2014-02-06  0:32 ` Linus Torvalds
2014-02-06  0:55   ` Kirill A. Shutemov
2014-02-06  2:32     ` Linus Torvalds
2014-02-06  4:33       ` Linus Torvalds
2014-02-06 21:29         ` Linus Torvalds
2014-02-06 22:24           ` Kirill A. Shutemov
2014-02-07  1:31             ` Linus Torvalds
2014-02-07 15:42               ` [RFC, PATCH] mm: map few pages around fault address if they are in page cache Kirill A. Shutemov
2014-02-07 17:32                 ` Andi Kleen
2014-02-07 17:56                   ` Kirill A. Shutemov
2014-02-07 18:11                     ` Andi Kleen
2014-02-06  5:42       ` [RFC] de-asmify the x86-64 system call slowpath Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.