linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Add a text_poke syscall v2
@ 2013-11-26  0:37 Andi Kleen
  2013-11-26 19:05 ` Andy Lutomirski
  2013-11-29 18:35 ` Oleg Nesterov
  0 siblings, 2 replies; 65+ messages in thread
From: Andi Kleen @ 2013-11-26  0:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: torvalds, x86, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

[Addressed all addressable review feedback in v2]

Properly patching running code ("cross modification")
is a quite complicated business on x86.

The CPU has specific rules that need to be followed, including
multiple global barriers.

Self modifying code is getting more popular, so it's important
to make it easy to follow the rules.

The kernel does it properly with text_poke_bp(). But the same
method is hard to do for user programs.

This patch adds a (x86 specific) text_poke() syscall that exposes
the text_poke_bp() machinery to user programs.

The interface is practically the same as text_poke_bp, just as
a syscall. I added an extra flags parameter, for future
extension.  Right now it is enforced to be 0.

The call also still has a global lock, so it has some scaling
limitations. If it was commonly used this could be fixed
by setting up a list of break point locations. Then
a lock would only be hold to modify the list.

Right now the implementation is just as simple as possible.

Proposed man page:

NAME
	text_poke - Safely modify running instructions (x86)

SYNOPSYS
	int text_poke(void *addr, const void *opcode, size_t len,
	              void (*handler)(void), int timeout);

DESCRIPTION
	The text_poke system allows to safely modify code that may
	be currently executing in parallel on other threads.
	Patch the instruction at addr with the new instructions
	at opcode of length len. The target instruction will temporarily
	be patched with a break point, before it is replaced
	with the final replacement instruction. When the break point
	hits the code handler will be called in the context
	of the thread. The handler does not save any registers
	and cannot return. Typically it would consist of the
	original instruction and then a jump to after the original
	instruction. The handler is only needed during the
	patching process and can be overwritten once the syscall
	returns. timeout defines an optional timout to indicate
	to the kernel how long the patching could be delayed.
	Right now it has to be 0.
	The page containing the target instruction has to
	be writable.

EXAMPLE

  #include <pthread.h>
  #include <unistd.h>
  #include <stdio.h>
  #include <assert.h>
  #include <sys/mman.h>
  #include <string.h>
  #include <time.h>
  #include <stdlib.h>

volatile int finished;

extern char patch[], recovery[], repl[];

struct res {
        long total;
        long val1, val2, handler;
};

int text_poke(void *insn, void *repl, int len, void *handler, int to)
{
        return syscall(314, insn, repl, len, handler, to);
}

void *tfunc(void *arg)
{
        struct res *res = (struct res *)arg;

        while (!finished) {
                int val;
                asm volatile(   ".globl patch\n"
                                ".globl recovery\n"
                                ".global repl\n"
				/* original code to be patched */
                                "patch: mov $1,%0\n"
                                "1:\n"
                                ".section \".text.patchup\",\"x\"\n"
				/* Called when a race happens during patching.
				   Just execute the original code and jump back. */
                                "recovery:\n"
                                " mov $3,%0\n"
                                " jmp 1b\n"
				/* replacement code that gets patched in: */
                                "repl:\n"
                                " mov $2,%0\n"
                                ".previous" : "=a" (val));
                        if (val == 1)
                                res->val1++;
                        else if (val == 3)
                                res->handler++;
                        else
                                res->val2++;
                        res->total++;
        }
        return NULL;
}

int main(int ac, char **av)
{
        int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        int ps = sysconf(_SC_PAGESIZE);
        pthread_t pthr[ncpus];
        struct res res[ncpus];
        int i;

        srand(1);
        memset(&res, 0, sizeof(struct res) * ncpus);
        mprotect(patch - (unsigned long)patch % ps, ps,
		 PROT_READ|PROT_WRITE|PROT_EXEC);
        for (i = 0; i < ncpus - 1; i++)
                pthread_create(&pthr[i], NULL, tfunc, &res[i]);
        for (i = 0; i < 500000; i++) {
                text_poke(patch, repl, 5, recovery, 0);
                nanosleep(&((struct timespec) { 0, rand() % 100 }), NULL);
                text_poke(repl, patch, 5, recovery, 0);
        }
        finished = 1;
        for (i = 0; i < ncpus - 1; i++) {
                pthread_join(pthr[i], NULL);
                printf("%d: val1 %lu val2 %lu handler %lu to %lu\n",
                                i, res[i].val1, res[i].val2, res[i].handler,
				res[i].total);
                assert(res[i].val1 + res[i].val2 + res[i].handler
				== res[i].total);
        }
        return 0;
}

RETURN VALUE
	On success, text_poke returns 0, otherwise -1 is returned
	and errno is set appropiately.

ERRORS
	EINVAL		len was too long
			timeout was an invalid value
	EFAULT		An error happened while accessing opcode
			or the target instruction.

VERSIONS
	text_poke has been added with the Linux XXX kernel.

CONFORMING TO
	The call is Linux and x86 specific and should not be used
	in programs intended to be portable.

v2: Fix off by one (Andy Lutomirski)
    Improve man page slightly
    Fix error path (Oleg N.)
    Rename timeout to flags
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/alternative.c    | 126 ++++++++++++++++++++++++++++++++-------
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 3 files changed, 107 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index df94598..b93aa03 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -12,6 +12,7 @@
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/syscalls.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
 #include <asm/pgtable.h>
@@ -538,6 +539,23 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	return addr;
 }
 
+static void __kprobes __text_poke(struct page **pages, 
+				  int offset,
+				  const void *opcode, size_t len);
+
+static void resolve_pages(struct page **pages, void *addr)
+{
+	if (!core_kernel_text((unsigned long)addr)) {
+		pages[0] = vmalloc_to_page(addr);
+		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+	} else {
+		pages[0] = virt_to_page(addr);
+		WARN_ON(!PageReserved(pages[0]));
+		pages[1] = virt_to_page(addr + PAGE_SIZE);
+	}
+	BUG_ON(!pages[0]);
+}
+
 /**
  * text_poke - Update instructions on a live kernel
  * @addr: address to modify
@@ -553,26 +571,27 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
  */
 void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
 {
+	struct page *pages[2];
+
+	resolve_pages(pages, addr);
+	__text_poke(pages, (unsigned long)addr & ~PAGE_MASK,
+		    opcode, len);
+	return addr;
+}
+
+static void __kprobes __text_poke(struct page **pages,
+				   int offset,
+				   const void *opcode, size_t len)
+{
 	unsigned long flags;
 	char *vaddr;
-	struct page *pages[2];
-	int i;
 
-	if (!core_kernel_text((unsigned long)addr)) {
-		pages[0] = vmalloc_to_page(addr);
-		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
-	} else {
-		pages[0] = virt_to_page(addr);
-		WARN_ON(!PageReserved(pages[0]));
-		pages[1] = virt_to_page(addr + PAGE_SIZE);
-	}
-	BUG_ON(!pages[0]);
 	local_irq_save(flags);
 	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
 	if (pages[1])
 		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
 	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
-	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
+	memcpy(&vaddr[offset], opcode, len);
 	clear_fixmap(FIX_TEXT_POKE0);
 	if (pages[1])
 		clear_fixmap(FIX_TEXT_POKE1);
@@ -580,10 +599,7 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
 	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
-	for (i = 0; i < len; i++)
-		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
 	local_irq_restore(flags);
-	return addr;
 }
 
 static void do_sync_core(void *info)
@@ -593,6 +609,7 @@ static void do_sync_core(void *info)
 
 static bool bp_patching_in_progress;
 static void *bp_int3_handler, *bp_int3_addr;
+static struct mm_struct *bp_target_mm;
 
 int poke_int3_handler(struct pt_regs *regs)
 {
@@ -602,6 +619,14 @@ int poke_int3_handler(struct pt_regs *regs)
 	if (likely(!bp_patching_in_progress))
 		return 0;
 
+	if (user_mode_vm(regs) &&
+		bp_target_mm &&
+		current->mm == bp_target_mm &&
+		regs->ip == (unsigned long)bp_int3_addr) {
+		regs->ip = (unsigned long) bp_int3_handler;
+		return 1;
+	}
+
 	if (user_mode_vm(regs) || regs->ip != (unsigned long)bp_int3_addr)
 		return 0;
 
@@ -612,6 +637,9 @@ int poke_int3_handler(struct pt_regs *regs)
 
 }
 
+static void __text_poke_bp(struct page **pages, int offset,
+			   const void *opcode, size_t len, void *handler);
+
 /**
  * text_poke_bp() -- update instructions on live kernel on SMP
  * @addr:	address to patch
@@ -636,10 +664,22 @@ int poke_int3_handler(struct pt_regs *regs)
  */
 void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 {
+	struct page *pages[2];
+
+	resolve_pages(pages, addr);
+	bp_int3_addr = (u8 *)addr + 1;
+	__text_poke_bp(pages, (unsigned long)addr & ~PAGE_MASK,
+		       opcode, len, handler);
+	return addr;
+}
+
+/* Caller must set up bp_int3_addr and hold text_mutex */
+static void __text_poke_bp(struct page **pages, int offset,
+		     const void *opcode, size_t len, void *handler)
+{
 	unsigned char int3 = 0xcc;
 
 	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
 	bp_patching_in_progress = true;
 	/*
 	 * Corresponding read barrier in int3 notifier for
@@ -648,13 +688,14 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 */
 	smp_wmb();
 
-	text_poke(addr, &int3, sizeof(int3));
+	__text_poke(pages, offset, &int3, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
 	if (len - sizeof(int3) > 0) {
 		/* patch all but the first byte */
-		text_poke((char *)addr + sizeof(int3),
+		__text_poke(pages, 
+			  offset + sizeof(int3),
 			  (const char *) opcode + sizeof(int3),
 			  len - sizeof(int3));
 		/*
@@ -666,13 +707,56 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	}
 
 	/* patch the first byte */
-	text_poke(addr, opcode, sizeof(int3));
+	__text_poke(pages, offset, opcode, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
 	bp_patching_in_progress = false;
 	smp_wmb();
-
-	return addr;
 }
 
+#define MAX_INSN_LEN 32
+
+SYSCALL_DEFINE5(text_poke,
+		__user void *, addr,
+		const void *, opcode, 
+		size_t, len,
+		void *, handler,
+		int, flags)
+{
+	struct page *pages[2];
+	char insn[MAX_INSN_LEN];
+	int err, npages;
+
+	if ((unsigned long)len > MAX_INSN_LEN || len == 0)
+		return -EINVAL;
+	if (copy_from_user(insn, opcode, len))
+		return -EFAULT;
+	/* For future extension */
+	if (flags != 0)
+		return -EINVAL;
+	pages[1] = NULL;
+	npages = ((unsigned long)addr & PAGE_MASK) == 
+		(((unsigned long)addr + len - 1) & PAGE_MASK) ? 1 : 2;
+	err = get_user_pages_fast((unsigned long)addr, npages, 1, pages);
+	if (err < 0)
+		return err;
+	if (err != npages) {
+		err = -EFAULT;
+		goto out;
+	}
+	err = 0;
+	mutex_lock(&text_mutex);
+	bp_target_mm = current->mm;
+	bp_int3_addr = (u8 *)addr + 1;
+	__text_poke_bp(pages, 
+		       (unsigned long)addr & ~PAGE_MASK,
+		       insn, len, handler);
+	bp_target_mm = NULL;
+	mutex_unlock(&text_mutex);
+out:
+	put_page(pages[0]);
+	if (pages[1])
+		put_page(pages[1]);
+	return err;
+}
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index aabfb83..5d3cb72 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,4 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	text_poke		sys_text_poke
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..77e60f9 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	text_poke		sys_text_poke
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-26  0:37 [PATCH] Add a text_poke syscall v2 Andi Kleen
@ 2013-11-26 19:05 ` Andy Lutomirski
  2013-11-26 19:11   ` Andi Kleen
  2013-11-26 20:03   ` Linus Torvalds
  2013-11-29 18:35 ` Oleg Nesterov
  1 sibling, 2 replies; 65+ messages in thread
From: Andy Lutomirski @ 2013-11-26 19:05 UTC (permalink / raw)
  To: Andi Kleen, linux-kernel; +Cc: torvalds, x86, Andi Kleen

On 11/25/2013 04:37 PM, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> [Addressed all addressable review feedback in v2]
> 
> Properly patching running code ("cross modification")
> is a quite complicated business on x86.
> 
> The CPU has specific rules that need to be followed, including
> multiple global barriers.
> 
> Self modifying code is getting more popular, so it's important
> to make it easy to follow the rules.
> 
> The kernel does it properly with text_poke_bp(). But the same
> method is hard to do for user programs.
> 
> This patch adds a (x86 specific) text_poke() syscall that exposes
> the text_poke_bp() machinery to user programs.
> 
> The interface is practically the same as text_poke_bp, just as
> a syscall. I added an extra flags parameter, for future
> extension.  Right now it is enforced to be 0.
> 
> The call also still has a global lock, so it has some scaling
> limitations. If it was commonly used this could be fixed
> by setting up a list of break point locations. Then
> a lock would only be hold to modify the list.
> 
> Right now the implementation is just as simple as possible.

IIRC someone proposed that, rather than specifying a "handler", that any
user thread that traps just wait until the poke completes.  This would
complicate the kernel implementation a bit, but it would make the user
code a good deal simpler.  Is there any reason that this is a bad idea?

--Andy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-26 19:05 ` Andy Lutomirski
@ 2013-11-26 19:11   ` Andi Kleen
  2013-11-26 20:03   ` Linus Torvalds
  1 sibling, 0 replies; 65+ messages in thread
From: Andi Kleen @ 2013-11-26 19:11 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen

> IIRC someone proposed that, rather than specifying a "handler", that any
> user thread that traps just wait until the poke completes.  This would
> complicate the kernel implementation a bit, but it would make the user
> code a good deal simpler.  Is there any reason that this is a bad idea?

Should be doable yes.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-26 19:05 ` Andy Lutomirski
  2013-11-26 19:11   ` Andi Kleen
@ 2013-11-26 20:03   ` Linus Torvalds
  2013-11-27 19:57     ` H. Peter Anvin
  1 sibling, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2013-11-26 20:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen

On Tue, Nov 26, 2013 at 11:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> IIRC someone proposed that, rather than specifying a "handler", that any
> user thread that traps just wait until the poke completes.  This would
> complicate the kernel implementation a bit, but it would make the user
> code a good deal simpler.  Is there any reason that this is a bad idea?

Please do it this way instead, because user space will get the
callback version wrong and then - because it never actually triggers
in practice in normal situations - it will cause very *very* subtle
bugs that we can't fix.

Making the kernel serialize the accesses is the right thing to do.
Just a new per-mm mutex should trivially do it, then you don't even
have to check the "current->mm == bp_target_mm" thing at all, you just
make the bp handler do a simple

    if (mutex_lock_interruptible(&mm->text_poke_mutex) >= 0)
        mutex_unlock(&mm->text_poke_mutex);

and return. All done.

Plus the callback thing is pointless if we can do the instruction
switch atomically (which would be true for UP, for single-thread, and
potentially for certain sizes/alignments coupled with known rules for
particular micro-architectures). So it's not a particularly good
interface anyway.

Btw, I also think that there's a separate problem wrt shared pages.
Should we perhaps only allow this in private mappings? Because right
now it has that "current->mm == bp_target_mm" thing, and generally it
only works on one particular mm, but by using "get_user_pages_fast(,
1,..)" it really only requires write permissions on the page. So it
could be shared mapping, and I could easily see people doing that on
purpose ("open executable file, then use text_poke() to change it for
this architecture") and THAT DOES NOT WORK with the current patch if
somebody else is also running that app..

               Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-26 20:03   ` Linus Torvalds
@ 2013-11-27 19:57     ` H. Peter Anvin
  2013-11-27 22:02       ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 19:57 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen

On 11/26/2013 12:03 PM, Linus Torvalds wrote:
> On Tue, Nov 26, 2013 at 11:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> IIRC someone proposed that, rather than specifying a "handler", that any
>> user thread that traps just wait until the poke completes.  This would
>> complicate the kernel implementation a bit, but it would make the user
>> code a good deal simpler.  Is there any reason that this is a bad idea?
> 
> Please do it this way instead, because user space will get the
> callback version wrong and then - because it never actually triggers
> in practice in normal situations - it will cause very *very* subtle
> bugs that we can't fix.
> 
> Making the kernel serialize the accesses is the right thing to do.
> Just a new per-mm mutex should trivially do it, then you don't even
> have to check the "current->mm == bp_target_mm" thing at all, you just
> make the bp handler do a simple
> 
>     if (mutex_lock_interruptible(&mm->text_poke_mutex) >= 0)
>         mutex_unlock(&mm->text_poke_mutex);
> 
> and return. All done.
> 
> Plus the callback thing is pointless if we can do the instruction
> switch atomically (which would be true for UP, for single-thread, and
> potentially for certain sizes/alignments coupled with known rules for
> particular micro-architectures). So it's not a particularly good
> interface anyway.
> 
> Btw, I also think that there's a separate problem wrt shared pages.
> Should we perhaps only allow this in private mappings? Because right
> now it has that "current->mm == bp_target_mm" thing, and generally it
> only works on one particular mm, but by using "get_user_pages_fast(,
> 1,..)" it really only requires write permissions on the page. So it
> could be shared mapping, and I could easily see people doing that on
> purpose ("open executable file, then use text_poke() to change it for
> this architecture") and THAT DOES NOT WORK with the current patch if
> somebody else is also running that app..
> 

I have already said I don't think this is a good interface as it doesn't
provide sane semantics for batching changes -- the whole handler and
timeout mechanism (which isn't even implemented yet, and as a result
probably will never be used by userspace) is basically a hack for that.

The other alternative is that we only expose the "synchronize all cores"
operation as a system call, and let user space do the rest of the work.
 Anything else can be done in user space, and a lot of the subtleties
with locking pages, user space access and so on simply go away... user
space will be responsible or will get protection faults.

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 19:57     ` H. Peter Anvin
@ 2013-11-27 22:02       ` H. Peter Anvin
  2013-11-27 22:21         ` Andy Lutomirski
                           ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 22:02 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Andi Kleen, Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 271 bytes --]

For the record, this is the entire patch necessary to do the
sync_cores() system call -- and no potential interactions with security
frameworks or whatnot, simply because no security-sensitive operations
are performed of any kind.

Comments/opinions appreciated.

	-hpa


[-- Attachment #2: 0001-x86-Add-a-sync_cores-system-call-for-user-space-code.patch --]
[-- Type: text/x-patch, Size: 2435 bytes --]

>From c0246c43c30453e4f88a314e437d4504e6a36c08 Mon Sep 17 00:00:00 2001
From: "H. Peter Anvin" <hpa@linux.intel.com>
Date: Wed, 27 Nov 2013 13:57:29 -0800
Subject: [PATCH] x86: Add a sync_cores() system call for user space code
 patching

Add a system call to synchronize all processors (guarantee execution
of a serializing instruction on all processors before execution
resumes) so that user space can do INT3-style patching, or any other
kind of patching that it wants to do.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/kernel/alternative.c    | 15 +++++++++++++++
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  1 +
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index df94598ad05a..8227eee42114 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -12,6 +12,7 @@
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/syscalls.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
 #include <asm/pgtable.h>
@@ -676,3 +677,17 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	return addr;
 }
 
+/**
+ * sys_sync_cores() -- synchronize cores for userspace patching
+ *
+ * This function provides the core synchronization operation
+ * so that userspace can do int3 breakpoint patching.
+ *
+ * Everything else can be handled in userspace.
+ */
+SYSCALL_DEFINE0(sync_cores)
+{
+	on_each_cpu(do_sync_core, NULL, 1);
+
+	return 0;
+}
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index aabfb8380a1c..acfdfe2b95e7 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,4 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	sync_cores		sys_sync_cores
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65dfd14f..6556f1e6e920 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	sync_cores		sys_sync_cores
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:02       ` H. Peter Anvin
@ 2013-11-27 22:21         ` Andy Lutomirski
  2013-11-27 22:21         ` Borislav Petkov
  2013-11-27 22:41         ` Linus Torvalds
  2 siblings, 0 replies; 65+ messages in thread
From: Andy Lutomirski @ 2013-11-27 22:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andi Kleen, Linux Kernel Mailing List,
	Ingo Molnar, Andi Kleen, Thomas Gleixner, Mathieu Desnoyers

On Wed, Nov 27, 2013 at 2:02 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> For the record, this is the entire patch necessary to do the
> sync_cores() system call -- and no potential interactions with security
> frameworks or whatnot, simply because no security-sensitive operations
> are performed of any kind.
>
> Comments/opinions appreciated.

Presumably, if this were done, it should be merged with the
sys_membarrier work.  Maybe there should be a syscall for executing
some kind of barrier on all cpus that might be looking at the current
mm (and, eventually, something for MAP_SHARED as well if anyone
cares).

A benefit of the explicit text_poke approach is that it can
potentially get MAP_SHARED patching right.  I don't know if anyone
would ever want to do that, though.

--Andy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:02       ` H. Peter Anvin
  2013-11-27 22:21         ` Andy Lutomirski
@ 2013-11-27 22:21         ` Borislav Petkov
  2013-11-27 22:24           ` H. Peter Anvin
  2013-11-27 22:25           ` H. Peter Anvin
  2013-11-27 22:41         ` Linus Torvalds
  2 siblings, 2 replies; 65+ messages in thread
From: Borislav Petkov @ 2013-11-27 22:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On Wed, Nov 27, 2013 at 02:02:50PM -0800, H. Peter Anvin wrote:
> For the record, this is the entire patch necessary to do the
> sync_cores() system call -- and no potential interactions with security
> frameworks or whatnot, simply because no security-sensitive operations
> are performed of any kind.
> 
> Comments/opinions appreciated.

And we do this in the kernel not because userspace can't execute
sync_core() aka CPUID but because for userspace it is hard to IPI all
cores easily?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:21         ` Borislav Petkov
@ 2013-11-27 22:24           ` H. Peter Anvin
  2013-11-27 22:25           ` H. Peter Anvin
  1 sibling, 0 replies; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 22:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

Correct.

Borislav Petkov <bp@alien8.de> wrote:
>On Wed, Nov 27, 2013 at 02:02:50PM -0800, H. Peter Anvin wrote:
>> For the record, this is the entire patch necessary to do the
>> sync_cores() system call -- and no potential interactions with
>security
>> frameworks or whatnot, simply because no security-sensitive
>operations
>> are performed of any kind.
>> 
>> Comments/opinions appreciated.
>
>And we do this in the kernel not because userspace can't execute
>sync_core() aka CPUID but because for userspace it is hard to IPI all
>cores easily?

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:21         ` Borislav Petkov
  2013-11-27 22:24           ` H. Peter Anvin
@ 2013-11-27 22:25           ` H. Peter Anvin
  2013-11-27 22:29             ` Borislav Petkov
  1 sibling, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 22:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

Actually the CPUID is superfluous on anything than the executing CPU since IRET is serializing.

Borislav Petkov <bp@alien8.de> wrote:
>On Wed, Nov 27, 2013 at 02:02:50PM -0800, H. Peter Anvin wrote:
>> For the record, this is the entire patch necessary to do the
>> sync_cores() system call -- and no potential interactions with
>security
>> frameworks or whatnot, simply because no security-sensitive
>operations
>> are performed of any kind.
>> 
>> Comments/opinions appreciated.
>
>And we do this in the kernel not because userspace can't execute
>sync_core() aka CPUID but because for userspace it is hard to IPI all
>cores easily?

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:25           ` H. Peter Anvin
@ 2013-11-27 22:29             ` Borislav Petkov
  2013-11-27 22:31               ` H. Peter Anvin
  2013-11-27 22:40               ` H. Peter Anvin
  0 siblings, 2 replies; 65+ messages in thread
From: Borislav Petkov @ 2013-11-27 22:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On Wed, Nov 27, 2013 at 02:25:29PM -0800, H. Peter Anvin wrote:
> Actually the CPUID is superfluous on anything than the executing CPU
> since IRET is serializing.

Why not on the executing core too? We're IRET-ting there too.

Which would mean that a dummy empty syscall would be enough too since we
need the serialization only.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:29             ` Borislav Petkov
@ 2013-11-27 22:31               ` H. Peter Anvin
  2013-11-27 23:04                 ` Linus Torvalds
  2013-11-27 22:40               ` H. Peter Anvin
  1 sibling, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 22:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

No, we're not... sysexit/sysret doesn't count.

Borislav Petkov <bp@alien8.de> wrote:
>On Wed, Nov 27, 2013 at 02:25:29PM -0800, H. Peter Anvin wrote:
>> Actually the CPUID is superfluous on anything than the executing CPU
>> since IRET is serializing.
>
>Why not on the executing core too? We're IRET-ting there too.
>
>Which would mean that a dummy empty syscall would be enough too since
>we
>need the serialization only.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:29             ` Borislav Petkov
  2013-11-27 22:31               ` H. Peter Anvin
@ 2013-11-27 22:40               ` H. Peter Anvin
  2013-11-27 23:10                 ` Borislav Petkov
  1 sibling, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 22:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On 11/27/2013 02:29 PM, Borislav Petkov wrote:
> On Wed, Nov 27, 2013 at 02:25:29PM -0800, H. Peter Anvin wrote:
>> Actually the CPUID is superfluous on anything than the executing CPU
>> since IRET is serializing.
> 
> Why not on the executing core too? We're IRET-ting there too.
> 
> Which would mean that a dummy empty syscall would be enough too since we
> need the serialization only.
> 

Also don't forget we need the IPIs, too...

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:02       ` H. Peter Anvin
  2013-11-27 22:21         ` Andy Lutomirski
  2013-11-27 22:21         ` Borislav Petkov
@ 2013-11-27 22:41         ` Linus Torvalds
  2013-11-27 22:53           ` H. Peter Anvin
  2 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2013-11-27 22:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Andi Kleen, Linux Kernel Mailing List,
	Ingo Molnar, Andi Kleen, Thomas Gleixner

On Wed, Nov 27, 2013 at 2:02 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> For the record, this is the entire patch necessary to do the
> sync_cores() system call -- and no potential interactions with security
> frameworks or whatnot, simply because no security-sensitive operations
> are performed of any kind.
>
> Comments/opinions appreciated.

If we just care about a core sync, then:

 - on_each_cpu() is overkill, since you really only want each CPU that
can run that process.

   And we have that bitmask already: it's the same bitmask that we use
for TLB flush purposes.

 - calling "sync_cores()" cross-cpu is overkill, since the IPI will
already sync the other cores. And the current core is already going to
be synchronized wrt the code change, since we're in kernel mode and
the code is in user mode.

So I actually think your patch does too much - although it's possible
that we should have a system call argument saying "sync just this
process" or "sync all cores" so that people can literally do some
"modify global instruction in shared library" games if they really
really want to.

That said, the thing I really do *not* like about this patch is that
it makes the "insert 'int3' instruction" visible to user space. That
makes the "global instruction" replacement impossible, for example,
because while we'd sync all cores, we have no way to protect against
other processes getting the breakpoint exception and just SIGSEGV'ing
randomly as a result of *that*.

And even if we say "well, we don't care about the global case" (which
is quite possibly the right thign to do), it's actually hard for
threaded libraries to sanely catch SIGSGV for the breakpoint case too.
And since the only reason for this existing IN THE FIRST PLACE is the
threaded case, I have to say that I absolutely despise this "simpler"
interface.

The interface may be simpler, but it's garbage.

I didn't like the first patch either, but the first patch with
"replace the stupid pseudo-signal back-call with just waiting" at
least seems to be a good interface. Unlike this kind of stuff.

               Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:41         ` Linus Torvalds
@ 2013-11-27 22:53           ` H. Peter Anvin
  2013-11-27 23:15             ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 22:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andi Kleen, Linux Kernel Mailing List,
	Ingo Molnar, Andi Kleen, Thomas Gleixner

On 11/27/2013 02:41 PM, Linus Torvalds wrote:
> On Wed, Nov 27, 2013 at 2:02 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> For the record, this is the entire patch necessary to do the
>> sync_cores() system call -- and no potential interactions with security
>> frameworks or whatnot, simply because no security-sensitive operations
>> are performed of any kind.
>>
>> Comments/opinions appreciated.
> 
> If we just care about a core sync, then:
> 
>  - on_each_cpu() is overkill, since you really only want each CPU that
> can run that process.
> 
>    And we have that bitmask already: it's the same bitmask that we use
> for TLB flush purposes.
> 
>  - calling "sync_cores()" cross-cpu is overkill, since the IPI will
> already sync the other cores. And the current core is already going to
> be synchronized wrt the code change, since we're in kernel mode and
> the code is in user mode.
> 
> So I actually think your patch does too much - although it's possible
> that we should have a system call argument saying "sync just this
> process" or "sync all cores" so that people can literally do some
> "modify global instruction in shared library" games if they really
> really want to.

Yes, that was my main concern.

> That said, the thing I really do *not* like about this patch is that
> it makes the "insert 'int3' instruction" visible to user space. That
> makes the "global instruction" replacement impossible, for example,
> because while we'd sync all cores, we have no way to protect against
> other processes getting the breakpoint exception and just SIGSEGV'ing
> randomly as a result of *that*.
> 
> And even if we say "well, we don't care about the global case" (which
> is quite possibly the right thign to do), it's actually hard for
> threaded libraries to sanely catch SIGSGV for the breakpoint case too.
> And since the only reason for this existing IN THE FIRST PLACE is the
> threaded case, I have to say that I absolutely despise this "simpler"
> interface.
> 
> The interface may be simpler, but it's garbage.
> 
> I didn't like the first patch either, but the first patch with
> "replace the stupid pseudo-signal back-call with just waiting" at
> least seems to be a good interface. Unlike this kind of stuff.

If we are going to go down that route, I would like to see a list of
patch sites, not just one with a "timeout" that won't get used.  For
function tracing, for example, one may want to patch every function in
the program, and on the kernel side we already had to do batching for
that.  Similarly, for CPU feature patching, batching the operations make
sense.

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:31               ` H. Peter Anvin
@ 2013-11-27 23:04                 ` Linus Torvalds
  2013-11-27 23:13                   ` Borislav Petkov
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2013-11-27 23:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On Wed, Nov 27, 2013 at 2:31 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> No, we're not... sysexit/sysret doesn't count.

So sysexit/sysret doesn't count as a serializing instruction, no. But
it doesn't need to, because *self*-modifying code doesn't need a
serializing instruction, only a branch. It's only *cross*-modifying
code that needs a serializing instruction.

So the IPI is sufficient for the cross-modifying case, and the sysret
is sufficient for the self-modifying case. And we also don't need to
worry about "what happens if we schedule to another CPU, and
self-modifying becomes cross-modifying", because the scheduling will
then do the serializing instruction.

So IPI for other CPU's (limited to the mm-mask) and just a system call
for local CPU should be perfectly fine.

          Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:40               ` H. Peter Anvin
@ 2013-11-27 23:10                 ` Borislav Petkov
  2013-11-27 23:20                   ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: Borislav Petkov @ 2013-11-27 23:10 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On Wed, Nov 27, 2013 at 02:40:04PM -0800, H. Peter Anvin wrote:
> Also don't forget we need the IPIs, too...

Yeah, I was simply looking at whether we could get away with executing
an empty syscall, i.e. save us the CPUID and rely only on the IPIs and
IRET.

But we don't IPI ourselves in smp_call_function; actually we remove
ourselves from the cpumask because of deadlocking scenarios. So on
this_cpu we only execute the function with IRQs disabled and CLI/STI is
not serializing.

I wonder if we could use MFENCE instead of CPUID, though, and save us
the clobbering of e*x, maybe even save us some cycles since MFENCE
should be faster than hundred-ish cycles of microcoded CPUID.

:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 23:04                 ` Linus Torvalds
@ 2013-11-27 23:13                   ` Borislav Petkov
  0 siblings, 0 replies; 65+ messages in thread
From: Borislav Petkov @ 2013-11-27 23:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On Wed, Nov 27, 2013 at 03:04:41PM -0800, Linus Torvalds wrote:
> So sysexit/sysret doesn't count as a serializing instruction, no.
> But it doesn't need to, because *self*-modifying code doesn't need a
> serializing instruction, only a branch. It's only *cross*-modifying
> code that needs a serializing instruction.
>
> So the IPI is sufficient for the cross-modifying case, and the sysret
> is sufficient for the self-modifying case. And we also don't need
> to worry about "what happens if we schedule to another CPU, and
> self-modifying becomes cross-modifying", because the scheduling will
> then do the serializing instruction.
>
> So IPI for other CPU's (limited to the mm-mask) and just a system call
> for local CPU should be perfectly fine.

Cool, so basically an empty dummy syscall IPI-ed to all cores. With a
big fat comment on top.

:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 22:53           ` H. Peter Anvin
@ 2013-11-27 23:15             ` Linus Torvalds
  2013-11-27 23:28               ` H. Peter Anvin
  2013-11-27 23:44               ` Andi Kleen
  0 siblings, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2013-11-27 23:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Andi Kleen, Linux Kernel Mailing List,
	Ingo Molnar, Andi Kleen, Thomas Gleixner

On Wed, Nov 27, 2013 at 2:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> If we are going to go down that route, I would like to see a list of
> patch sites, not just one with a "timeout" that won't get used.

Oh, I agree. The interface of the original patch was just inane/insane.

The timeout and the callback is pointless. The only thing the system
call should get as an argument is the address and the replacement
instruction.  So

  int text_poke(void *addr, const void *opcode, size_t len)

sounds fine to me. And it would do:
 - take some (possibly per-mm) mutex
 - write the one-byte int3
 - do the IPI
 - write the other bytes
 - do the IPI
 - do the first byte
 - release the (possibly per-mm) mutex

and then in the BP handler we'd just take the mutex, see if the first
byte of the exception is still int3, if it's not, just return silently
(because that means that we hit the race).

And I would seriously suggest just open-coding the above simple
sequence instead of trying to force-reuse the text_poke_bp() function
we already have. Because I think doing this on kernel code is
*very*different* (for irq reasons _and_ for IPI mask reasons).

Hmm? It doesn't sound too bad. And I really don't see the point of
some timeout handling or anything like that.

           Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 23:10                 ` Borislav Petkov
@ 2013-11-27 23:20                   ` H. Peter Anvin
  2013-11-27 23:40                     ` Borislav Petkov
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 23:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On 11/27/2013 03:10 PM, Borislav Petkov wrote:
> On Wed, Nov 27, 2013 at 02:40:04PM -0800, H. Peter Anvin wrote:
>> Also don't forget we need the IPIs, too...
> 
> Yeah, I was simply looking at whether we could get away with executing
> an empty syscall, i.e. save us the CPUID and rely only on the IPIs and
> IRET.
> 
> But we don't IPI ourselves in smp_call_function; actually we remove
> ourselves from the cpumask because of deadlocking scenarios. So on
> this_cpu we only execute the function with IRQs disabled and CLI/STI is
> not serializing.
> 

Although as Linus correctly pointed out, on the modifying CPU itself we
only need a branch.  For the standalone system call that doesn't work,
because you can't assume that the modifying CPU and the syscall CPU are
the same CPU.

> I wonder if we could use MFENCE instead of CPUID, though, and save us
> the clobbering of e*x, maybe even save us some cycles since MFENCE
> should be faster than hundred-ish cycles of microcoded CPUID.
> 

No.  MFENCE doesn't serialize the front end.

	-hpa




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 23:15             ` Linus Torvalds
@ 2013-11-27 23:28               ` H. Peter Anvin
  2013-11-28  2:01                 ` Linus Torvalds
  2013-11-27 23:44               ` Andi Kleen
  1 sibling, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 23:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andi Kleen, Linux Kernel Mailing List,
	Ingo Molnar, Andi Kleen, Thomas Gleixner

On 11/27/2013 03:15 PM, Linus Torvalds wrote:
> 
> Oh, I agree. The interface of the original patch was just inane/insane.
> 
> The timeout and the callback is pointless. The only thing the system
> call should get as an argument is the address and the replacement
> instruction.  So
> 
>   int text_poke(void *addr, const void *opcode, size_t len)
> 
> sounds fine to me. And it would do:
>  - take some (possibly per-mm) mutex
>  - write the one-byte int3
>  - do the IPI
>  - write the other bytes
>  - do the IPI
>  - do the first byte
>  - release the (possibly per-mm) mutex
> 
> and then in the BP handler we'd just take the mutex, see if the first
> byte of the exception is still int3, if it's not, just return silently
> (because that means that we hit the race).
> 

I was going to say we can just re-execute the instruction until it goes
away, but this is unsafe for user space since you might have CD 03
(INT 3) somewhere instead of CC (INT3) and backing up to the previous
byte would be bad in the former case.

Even if matched against patch sites it would be iffy.

> Hmm? It doesn't sound too bad. And I really don't see the point of
> some timeout handling or anything like that.

The timeout bit was an acknowledgment that some kind of batching
interface is necessary.  If you are doing this for function tracing, for
example, you can easily have a hundred thousand patch sites or more, and
you may not want to have to go through this process anew for each single
site.  Hence my earlier comment about feeling that we would need a
batched interface of some sort.  Which, unfortunately, has its own set
of problems relating to restartability and potential delay.

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 23:20                   ` H. Peter Anvin
@ 2013-11-27 23:40                     ` Borislav Petkov
  2013-11-27 23:47                       ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: Borislav Petkov @ 2013-11-27 23:40 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On Wed, Nov 27, 2013 at 03:20:07PM -0800, H. Peter Anvin wrote:
> No.  MFENCE doesn't serialize the front end.

So my manual says

"The MFENCE instruction is weakly-ordered with respect to data and
instruction prefetches. Speculative loads initiated by the processor, or
specified explicitly using cache-prefetch instructions, can be reordered
around an MFENCE."

and the SDM has only one sentence stating "MFENCE does not serialize
the instruction stream." which I'm assuming means the same, i.e. MFENCE
doesn't have control on I$ prefetches and they can be reordered around
it.

But I'd guess that depends on the uarch because Bulldozer, reportedly,
implements MFENCE by causing a pipeline stall which controls the
prefetches too:

"[MFENCE] stalls the pipeline and the processor core cannot begin
processing any further instructions until all previous instructions are
completed and any outstanding memory operations (such as prefetches
and stores) have completed. (This stall applies only to the individual
integer unit of the compute unit where the MFENCE instruction is
executed.) Architecturally serializing instructions such as CPUID have
the same pipeline stall behavior as MFENCE."

It is not clear, though, whether with "memory operations" and
"prefetches" they mean I$ prefetches too.

Cool, one gets to learn new stuff every day so thanks for making me look
:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 23:15             ` Linus Torvalds
  2013-11-27 23:28               ` H. Peter Anvin
@ 2013-11-27 23:44               ` Andi Kleen
  1 sibling, 0 replies; 65+ messages in thread
From: Andi Kleen @ 2013-11-27 23:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner


One reason why I like having a syscall is that if there is some
CPU which needs additional workarounds here it could be all
done in a central place (I don't know of any that does right now
but traditionally it has been a bug intensive area)

> Hmm? It doesn't sound too bad. And I really don't see the point of
> some timeout handling or anything like that.

Sounds good to me. I'll work on it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 23:40                     ` Borislav Petkov
@ 2013-11-27 23:47                       ` H. Peter Anvin
  0 siblings, 0 replies; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-27 23:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On 11/27/2013 03:40 PM, Borislav Petkov wrote:
> 
> But I'd guess that depends on the uarch because Bulldozer, reportedly,
> implements MFENCE by causing a pipeline stall which controls the
> prefetches too:
> 

An implementation can always make the memory model stronger, but not weaker.

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-27 23:28               ` H. Peter Anvin
@ 2013-11-28  2:01                 ` Linus Torvalds
  2013-11-28  2:10                   ` H. Peter Anvin
  2013-11-28  9:12                   ` Jiri Kosina
  0 siblings, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2013-11-28  2:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Andi Kleen, Linux Kernel Mailing List,
	Ingo Molnar, Andi Kleen, Thomas Gleixner

On Wed, Nov 27, 2013 at 3:28 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> The timeout bit was an acknowledgment that some kind of batching
> interface is necessary.

That's just moronic. People would make up totally random timeouts, so
from an interface standpoint it's just horrid, horrid.

Giving user space random knobs that you don't understand yourself, and
the monkeys in user space are guaranteed to mis-use is just entirely
the wrong thing to do.

Much better to then just making the interface itself be about
batching, which isn't as hard as you make it out to be. Make it an
array of those addr/replace/len things. And we have that
"restart_block" for system calls, and we'd limit batching to some
random smallish number ("128 instructions, just because"), while still
being easily interruptible in between those blocks. That limits you to
two IPI's per 128 instructions replaced - and at that point even
*that* is just an internal kernel random tuning thing, not some insane
user interface.

But is such batching really even worth it? If' it's not *that* much
more effort, maybe it's worth it, but do we have known users that
really would have thousands and thousands of cases all at once?

            Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-28  2:01                 ` Linus Torvalds
@ 2013-11-28  2:10                   ` H. Peter Anvin
  2013-11-28  9:12                   ` Jiri Kosina
  1 sibling, 0 replies; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-28  2:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Andi Kleen, Linux Kernel Mailing List,
	Ingo Molnar, Andi Kleen, Thomas Gleixner

ftrace is the flagship example.

And yes, agreed about timeouts.

Linus Torvalds <torvalds@linux-foundation.org> wrote:
>On Wed, Nov 27, 2013 at 3:28 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> The timeout bit was an acknowledgment that some kind of batching
>> interface is necessary.
>
>That's just moronic. People would make up totally random timeouts, so
>from an interface standpoint it's just horrid, horrid.
>
>Giving user space random knobs that you don't understand yourself, and
>the monkeys in user space are guaranteed to mis-use is just entirely
>the wrong thing to do.
>
>Much better to then just making the interface itself be about
>batching, which isn't as hard as you make it out to be. Make it an
>array of those addr/replace/len things. And we have that
>"restart_block" for system calls, and we'd limit batching to some
>random smallish number ("128 instructions, just because"), while still
>being easily interruptible in between those blocks. That limits you to
>two IPI's per 128 instructions replaced - and at that point even
>*that* is just an internal kernel random tuning thing, not some insane
>user interface.
>
>But is such batching really even worth it? If' it's not *that* much
>more effort, maybe it's worth it, but do we have known users that
>really would have thousands and thousands of cases all at once?
>
>            Linus

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-28  2:01                 ` Linus Torvalds
  2013-11-28  2:10                   ` H. Peter Anvin
@ 2013-11-28  9:12                   ` Jiri Kosina
  1 sibling, 0 replies; 65+ messages in thread
From: Jiri Kosina @ 2013-11-28  9:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Andy Lutomirski, Andi Kleen,
	Linux Kernel Mailing List, Ingo Molnar, Andi Kleen,
	Thomas Gleixner

On Wed, 27 Nov 2013, Linus Torvalds wrote:

> But is such batching really even worth it? If' it's not *that* much more 
> effort, maybe it's worth it, but do we have known users that really 
> would have thousands and thousands of cases all at once?

If you want to base tracing infrastructure on it, then you might easily 
want to patch all the existing functions. We have in-kernel facility 
(ftrace) doing exactly that.

-- 
Jiri Kosina
SUSE Labs


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-26  0:37 [PATCH] Add a text_poke syscall v2 Andi Kleen
  2013-11-26 19:05 ` Andy Lutomirski
@ 2013-11-29 18:35 ` Oleg Nesterov
  2013-11-29 19:54   ` Andi Kleen
  1 sibling, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-11-29 18:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, torvalds, x86, Andi Kleen, H. Peter Anvin,
	Ingo Molnar, Borislav Petkov

On 11/25, Andi Kleen wrote:
>
> +	err = get_user_pages_fast((unsigned long)addr, npages, 1, pages);
> +	if (err < 0)
> +		return err;
> +	if (err != npages) {
> +		err = -EFAULT;
> +		goto out;
> +	}
> +	err = 0;
> +	mutex_lock(&text_mutex);
> +	bp_target_mm = current->mm;
> +	bp_int3_addr = (u8 *)addr + 1;
> +	__text_poke_bp(pages,
> +		       (unsigned long)addr & ~PAGE_MASK,
> +		       insn, len, handler);

We are going to change this user page, it seems that we need
set_page_dirty() ?

Andi, et al. I am going to discuss the things I do not really
understand, probably this can't make any sense, but...

I am wondering if sys_text_poke() can do something like

	down_write(mmap_sem);

	get_user_pages(write, page, vma);

	ptep = page_check_address(...);

	pte = ptep_clear_flush(ptep);

	copy-opcode-to-page;

	pte = pte_mkdirty(pte);
	set_pte_at(ptep, pte);

	pte_unmap_unlock(...);
	up_write(mmap_sem);

?

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 18:35 ` Oleg Nesterov
@ 2013-11-29 19:54   ` Andi Kleen
  2013-11-29 20:05     ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2013-11-29 19:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen,
	H. Peter Anvin, Ingo Molnar, Borislav Petkov

Hi Oleg,

Thanks for finding all these problems.

> We are going to change this user page, it seems that we need
> set_page_dirty() ?

That's true. Will add it next rev.

> 
> Andi, et al. I am going to discuss the things I do not really
> understand, probably this can't make any sense, but...

I think it's enough to set the dirty bit in the underlying
struct page, no need to play games with the PTE.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 19:54   ` Andi Kleen
@ 2013-11-29 20:05     ` Oleg Nesterov
  2013-11-29 20:17       ` H. Peter Anvin
                         ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-11-29 20:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, torvalds, x86, Andi Kleen, H. Peter Anvin,
	Ingo Molnar, Borislav Petkov

On 11/29, Andi Kleen wrote:
>
> > Andi, et al. I am going to discuss the things I do not really
> > understand, probably this can't make any sense, but...
>
> I think it's enough to set the dirty bit in the underlying
> struct page, no need to play games with the PTE.

Ah, sorry for confusion, I guess you misunderstood.

I meant, perhaps sys_text_poke() doesn't the in-kernel text_poke
machinery altogether?

Can't we invalidate pte (so that any user will stuck in page fault),
update the page(s), restore the pte and drop the locks?

This way sys_text_poke() won't be x86-specific, and it will be per-mm.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 20:05     ` Oleg Nesterov
@ 2013-11-29 20:17       ` H. Peter Anvin
  2013-11-29 20:35         ` Oleg Nesterov
  2013-11-29 23:24       ` Jiri Kosina
  2013-11-30  5:16       ` H. Peter Anvin
  2 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-29 20:17 UTC (permalink / raw)
  To: Oleg Nesterov, Andi Kleen
  Cc: linux-kernel, torvalds, x86, Andi Kleen, Ingo Molnar, Borislav Petkov

On 11/29/2013 12:05 PM, Oleg Nesterov wrote:
> 
> Can't we invalidate pte (so that any user will stuck in page fault),
> update the page(s), restore the pte and drop the locks?
> 

That would require a global TLB shootdown (and wouldn't help
shared-memory code segments, if we care about that at all.)  It also
means much bigger code ranges would be affected.

The performance requirements are part of what makes the INT3 solution
attractive, but anything is going to inherently be slow.

> This way sys_text_poke() won't be x86-specific, and it will be per-mm.

That is definitely an appeal.

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 20:17       ` H. Peter Anvin
@ 2013-11-29 20:35         ` Oleg Nesterov
  2013-11-29 21:24           ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-11-29 20:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen, Ingo Molnar,
	Borislav Petkov

On 11/29, H. Peter Anvin wrote:
>
> On 11/29/2013 12:05 PM, Oleg Nesterov wrote:
> >
> > Can't we invalidate pte (so that any user will stuck in page fault),
> > update the page(s), restore the pte and drop the locks?
> >
>
> That would require a global TLB shootdown

Well, it is not really global, it is for mm_cpumask() and for good
reason?

And is it really worse than on_each_cpu(do_sync_core) and the usage
of text_mutex?

> (and wouldn't help
> shared-memory code segments, if we care about that at all.)

Well, I think this should only support the private mappings.

> It also
> means much bigger code ranges would be affected.

Yes, this is true.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 20:35         ` Oleg Nesterov
@ 2013-11-29 21:24           ` H. Peter Anvin
  2013-11-30 14:56             ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-29 21:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen, Ingo Molnar,
	Borislav Petkov

On 11/29/2013 12:35 PM, Oleg Nesterov wrote:
> On 11/29, H. Peter Anvin wrote:
>>
>> On 11/29/2013 12:05 PM, Oleg Nesterov wrote:
>>>
>>> Can't we invalidate pte (so that any user will stuck in page fault),
>>> update the page(s), restore the pte and drop the locks?
>>
>> That would require a global TLB shootdown
> 
> Well, it is not really global, it is for mm_cpumask() and for good
> reason?
> 
> And is it really worse than on_each_cpu(do_sync_core) and the usage
> of text_mutex?
> 

Probably not, but one would have to consider the total amount of
synchronization needed.

>> (and wouldn't help
>> shared-memory code segments, if we care about that at all.)
> 
> Well, I think this should only support the private mappings.
> 

Well, what do you do if someone tries this on a MAP_SHARED mapping?
Error out?

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 20:05     ` Oleg Nesterov
  2013-11-29 20:17       ` H. Peter Anvin
@ 2013-11-29 23:24       ` Jiri Kosina
  2013-11-30  0:22         ` Linus Torvalds
                           ` (2 more replies)
  2013-11-30  5:16       ` H. Peter Anvin
  2 siblings, 3 replies; 65+ messages in thread
From: Jiri Kosina @ 2013-11-29 23:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen,
	H. Peter Anvin, Ingo Molnar, Borislav Petkov

On Fri, 29 Nov 2013, Oleg Nesterov wrote:

> > > Andi, et al. I am going to discuss the things I do not really
> > > understand, probably this can't make any sense, but...
> >
> > I think it's enough to set the dirty bit in the underlying
> > struct page, no need to play games with the PTE.
> 
> Ah, sorry for confusion, I guess you misunderstood.
> 
> I meant, perhaps sys_text_poke() doesn't the in-kernel text_poke
> machinery altogether?
> 
> Can't we invalidate pte (so that any user will stuck in page fault),
> update the page(s), restore the pte and drop the locks?

Do you think this'd be faster than the int3-based aproach?

We have moved from using stop_machine() to int3-based patching exactly 
because it's much more lightweight.

> This way sys_text_poke() won't be x86-specific, and it will be per-mm.

Good point, it'd be really nice to have this as an alternative on archs 
where breakpoint-based patching wouldn't be possible for some reason.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 23:24       ` Jiri Kosina
@ 2013-11-30  0:22         ` Linus Torvalds
  2013-12-03 18:49           ` [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly Oleg Nesterov
  2013-11-30 15:20         ` [PATCH] Add a text_poke syscall v2 Oleg Nesterov
  2013-11-30 16:51         ` Oleg Nesterov
  2 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2013-11-30  0:22 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Oleg Nesterov, Andi Kleen, Linux Kernel Mailing List,
	the arch/x86 maintainers, Andi Kleen, H. Peter Anvin,
	Ingo Molnar, Borislav Petkov

On Fri, Nov 29, 2013 at 3:24 PM, Jiri Kosina <jkosina@suse.cz> wrote:
>
> Do you think this'd be faster than the int3-based aproach?

Unlikely to be faster, but perhaps more robust and more portable. Maybe.

And the reason we use the int3-based approach is that doing TLB
shootdowns of kernel mappings is completely out of the question, not
to mention not portable (kernel code may not even be in a page table
to begin with).

Remember: the current text_poke is all about kernel code. So the rules
for user space may well be entirely different.

           Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 20:05     ` Oleg Nesterov
  2013-11-29 20:17       ` H. Peter Anvin
  2013-11-29 23:24       ` Jiri Kosina
@ 2013-11-30  5:16       ` H. Peter Anvin
  2013-11-30 14:52         ` Oleg Nesterov
  2 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-11-30  5:16 UTC (permalink / raw)
  To: Oleg Nesterov, Andi Kleen
  Cc: linux-kernel, torvalds, x86, Andi Kleen, Ingo Molnar, Borislav Petkov

On 11/29/2013 12:05 PM, Oleg Nesterov wrote:
> On 11/29, Andi Kleen wrote:
>>
>>> Andi, et al. I am going to discuss the things I do not really
>>> understand, probably this can't make any sense, but...
>>
>> I think it's enough to set the dirty bit in the underlying
>> struct page, no need to play games with the PTE.
> 
> Ah, sorry for confusion, I guess you misunderstood.
> 
> I meant, perhaps sys_text_poke() doesn't the in-kernel text_poke
> machinery altogether?
> 
> Can't we invalidate pte (so that any user will stuck in page fault),
> update the page(s), restore the pte and drop the locks?
> 
> This way sys_text_poke() won't be x86-specific, and it will be per-mm.
> 

Hmmm... if we hold mmap_sem() this pretty much will be the net result,
no?  That would mean no additional tests needed on the page fault path.
 What I'm not sure of is whether or not it is actually safe to hold
mmap_sem across all the code we need -- on the other hand, we may very
well need to do so.

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-30  5:16       ` H. Peter Anvin
@ 2013-11-30 14:52         ` Oleg Nesterov
  0 siblings, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-11-30 14:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen, Ingo Molnar,
	Borislav Petkov

On 11/29, H. Peter Anvin wrote:
>
> On 11/29/2013 12:05 PM, Oleg Nesterov wrote:
> >
> > Can't we invalidate pte (so that any user will stuck in page fault),
> > update the page(s), restore the pte and drop the locks?
> >
> > This way sys_text_poke() won't be x86-specific, and it will be per-mm.
> >
>
> Hmmm... if we hold mmap_sem() this pretty much will be the net result,
> no?

Yes, down_write(mmap_sem) is enough to block the page faults.

But we need pte-lock anyway, to avoid the races with, say, try_to_unmap().
(and of course, we need to retry if page_check_address() fails).

This actually means that if we want to update a single page we could
use down_read(). But in general we need to update 2 pages. Or even more
if we generalize sys_text_poke(), perhaps it should be renamed in this
case, but this is off-topic.

> That would mean no additional tests needed on the page fault path.

Not sure I understand... but of course we should not change the fault
paths in any case.

>  What I'm not sure of is whether or not it is actually safe to hold
> mmap_sem across all the code we need

Let me repeat, I do not understand vm enough to answer authoritatively.

But I think this should be safe. I do not see why it should not, but
see above.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 21:24           ` H. Peter Anvin
@ 2013-11-30 14:56             ` Oleg Nesterov
  0 siblings, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-11-30 14:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen, Ingo Molnar,
	Borislav Petkov

On 11/29, H. Peter Anvin wrote:
>
> On 11/29/2013 12:35 PM, Oleg Nesterov wrote:
> > On 11/29, H. Peter Anvin wrote:
> >>
> >> On 11/29/2013 12:05 PM, Oleg Nesterov wrote:
> >>>
> >>> Can't we invalidate pte (so that any user will stuck in page fault),
> >>> update the page(s), restore the pte and drop the locks?
> >>
> >> That would require a global TLB shootdown
> >
> > Well, it is not really global, it is for mm_cpumask() and for good
> > reason?
> >
> > And is it really worse than on_each_cpu(do_sync_core) and the usage
> > of text_mutex?
> >
>
> Probably not, but one would have to consider the total amount of
> synchronization needed.

Yes, agreed, this should be justified.

> >> (and wouldn't help
> >> shared-memory code segments, if we care about that at all.)
> >
> > Well, I think this should only support the private mappings.
> >
>
> Well, what do you do if someone tries this on a MAP_SHARED mapping?
> Error out?

Imho yes. But this needs more discussion, and afaics this is a bit
offtopic.

Please note that this patch relies on bp_target_mm, this can't help
in VM_SHARED case too.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 23:24       ` Jiri Kosina
  2013-11-30  0:22         ` Linus Torvalds
@ 2013-11-30 15:20         ` Oleg Nesterov
  2013-11-30 16:51         ` Oleg Nesterov
  2 siblings, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-11-30 15:20 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen,
	H. Peter Anvin, Ingo Molnar, Borislav Petkov

On 11/30, Jiri Kosina wrote:
>
> On Fri, 29 Nov 2013, Oleg Nesterov wrote:
>
> > > > Andi, et al. I am going to discuss the things I do not really
> > > > understand, probably this can't make any sense, but...
> > >
> > > I think it's enough to set the dirty bit in the underlying
> > > struct page, no need to play games with the PTE.
> >
> > Ah, sorry for confusion, I guess you misunderstood.
> >
> > I meant, perhaps sys_text_poke() doesn't the in-kernel text_poke
> > machinery altogether?
> >
> > Can't we invalidate pte (so that any user will stuck in page fault),
> > update the page(s), restore the pte and drop the locks?
>
> Do you think this'd be faster than the int3-based aproach?

No.

And more, I simply do not know if it would be slower or faster, and how
much. Just I hope that this won't be "much" slower.

OTOH, this is obviously more scalable, and this way sys_text_poke() won't
block, say, jump_label or kprobes. Not sure this actually matters though.

> We have moved from using stop_machine() to int3-based patching exactly
> because it's much more lightweight.

Oh, I do not think it makes sense to compare stop_machine() with this
approach...

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-29 23:24       ` Jiri Kosina
  2013-11-30  0:22         ` Linus Torvalds
  2013-11-30 15:20         ` [PATCH] Add a text_poke syscall v2 Oleg Nesterov
@ 2013-11-30 16:51         ` Oleg Nesterov
  2013-11-30 17:31           ` Oleg Nesterov
  2 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-11-30 16:51 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen,
	H. Peter Anvin, Ingo Molnar, Borislav Petkov

Sorry for completely offtopic question, but while we are here...

On 11/30, Jiri Kosina wrote:
>
> We have moved from using stop_machine() to int3-based patching exactly
> because it's much more lightweight.

I don't really understans the barriers in poke_int3_handler() and
text_poke_bp(). To the point, I do not really understand why do we
actually need bp_patching_in_progress, but lets ignore this.

Lets look at the end of text_poke_bp(), it does

	on_each_cpu(do_sync_core, NULL, 1);

	bp_patching_in_progress = false;
	smp_wmb();

First of all, this smp_wmb() is not clear. But what I actually
can't understand is why it is safe to clear bp_patching_in_progress.

OK, on_each_cpu() should serialize us with do_int3(), but only if
poke_int3_handler() is called with irqs disabled.

However, do_int3() does preempt_conditional_sti() and this looks
as if it can be called with irqs enabled? If this is actually
possible then text_poke_bp() needs synchronize_sched() to avoid
the races with poke_int3_handler(), afaics.

OTOH, int3 is GATE_INTERRUPT, doesn't this mean that that do_int3()
can enable irqs unconditionally and on_each_cpu() also acts as a
synchronization barrier?

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] Add a text_poke syscall v2
  2013-11-30 16:51         ` Oleg Nesterov
@ 2013-11-30 17:31           ` Oleg Nesterov
  0 siblings, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-11-30 17:31 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Andi Kleen, linux-kernel, torvalds, x86, Andi Kleen,
	H. Peter Anvin, Ingo Molnar, Borislav Petkov

On 11/30, Oleg Nesterov wrote:
>
> First of all, this smp_wmb() is not clear.

Yes, but...

> However, do_int3() does preempt_conditional_sti() and this looks
> as if it can be called with irqs enabled?

Ah, please ignore, I misread preempt_conditional_sti(). It enables
irqs if they were enabled before the exception, unless I misread it
again.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-11-30  0:22         ` Linus Torvalds
@ 2013-12-03 18:49           ` Oleg Nesterov
  2013-12-03 19:00             ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-03 18:49 UTC (permalink / raw)
  To: Linus Torvalds, Hugh Dickins, Peter Zijlstra
  Cc: Jiri Kosina, Andi Kleen, Linux Kernel Mailing List,
	the arch/x86 maintainers, Andi Kleen, H. Peter Anvin,
	Ingo Molnar, Borislav Petkov

On 11/29, Linus Torvalds wrote:
>
> On Fri, Nov 29, 2013 at 3:24 PM, Jiri Kosina <jkosina@suse.cz> wrote:
> >
> > Do you think this'd be faster than the int3-based aproach?
>
> Unlikely to be faster, but perhaps more robust and more portable. Maybe.

Can't we at least change uprobe_write_opcode() and kill the
nontrivial __replace_page() logic?

See the patch below. For review only, I can't understand if
it needs mmu_notifier_invalidate* and set_pte_notify(), and
of course I can easily miss something else.

Currently uprobe_write_opcode() always allocs/installs the new
page, even if the previous uprobe_write_opcode() has already
created the cowed anonymous page.

With this patch uprobe_write_opcode() calls gup(write, force),
then invalidates pte, then modifies the page, and restores the
old pte.

The patch is hardly readable, this is how the code looks:

	int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
				uprobe_opcode_t opcode)
	{
		struct page *page;
		struct vm_area_struct *vma;
		pte_t *ptep, entry;
		spinlock_t *ptlp;
		int ret;

		/* Read the page with vaddr into memory */
		ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
		if (ret <= 0)
			return ret;

		ret = verify_opcode(page, vaddr, &opcode);
		if (ret <= 0)
			goto put;

	 retry:
		put_page(page);
		/* COW this page if not writable */
		ret = get_user_pages(NULL, mm, vaddr, 1, 1, 1, &page, &vma);
		if (ret <= 0)
			goto put;

		ptep = page_check_address(page, mm, vaddr, &ptlp, 0);
		if (!ptep)
			goto retry;

		/* Unmap this page to ensure that nobody can execute it */
		flush_cache_page(vma, vaddr, pte_pfn(*ptep));
		entry = ptep_clear_flush(vma, vaddr, ptep);

		/* Nobody can fault in this page, modify it */
		copy_to_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);

		/* Restore the old mapping */
		entry = pte_mkdirty(entry);
		set_pte_at(mm, vaddr, ptep, entry);
		update_mmu_cache(vma, vaddr, ptep);
		pte_unmap_unlock(ptep, ptlp);
	 put:
		put_page(page);
		return ret;
	}

you can safely ignore the fist get_user_pages() and verify_opcode().

I'll appretiate any review, thanks in advance.

Oleg.


--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -114,65 +114,6 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
 }
 
 /**
- * __replace_page - replace page in vma by new page.
- * based on replace_page in mm/ksm.c
- *
- * @vma:      vma that holds the pte pointing to page
- * @addr:     address the old @page is mapped at
- * @page:     the cowed page we are replacing by kpage
- * @kpage:    the modified page we replace page by
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
-				struct page *page, struct page *kpage)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pte_t *ptep;
-	int err;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = addr;
-	const unsigned long mmun_end   = addr + PAGE_SIZE;
-
-	/* For try_to_free_swap() and munlock_vma_page() below */
-	lock_page(page);
-
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	err = -EAGAIN;
-	ptep = page_check_address(page, mm, addr, &ptl, 0);
-	if (!ptep)
-		goto unlock;
-
-	get_page(kpage);
-	page_add_new_anon_rmap(kpage, vma, addr);
-
-	if (!PageAnon(page)) {
-		dec_mm_counter(mm, MM_FILEPAGES);
-		inc_mm_counter(mm, MM_ANONPAGES);
-	}
-
-	flush_cache_page(vma, addr, pte_pfn(*ptep));
-	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
-
-	page_remove_rmap(page);
-	if (!page_mapped(page))
-		try_to_free_swap(page);
-	pte_unmap_unlock(ptep, ptl);
-
-	if (vma->vm_flags & VM_LOCKED)
-		munlock_vma_page(page);
-	put_page(page);
-
-	err = 0;
- unlock:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	unlock_page(page);
-	return err;
-}
-
-/**
  * is_swbp_insn - check if instruction is breakpoint instruction.
  * @insn: instruction to be checked.
  * Default implementation of is_swbp_insn
@@ -264,43 +205,46 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
 			uprobe_opcode_t opcode)
 {
-	struct page *old_page, *new_page;
+	struct page *page;
 	struct vm_area_struct *vma;
+	pte_t *ptep, entry;
+	spinlock_t *ptlp;
 	int ret;
 
-retry:
 	/* Read the page with vaddr into memory */
-	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
+	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
 	if (ret <= 0)
 		return ret;
 
-	ret = verify_opcode(old_page, vaddr, &opcode);
+	ret = verify_opcode(page, vaddr, &opcode);
 	if (ret <= 0)
-		goto put_old;
-
-	ret = -ENOMEM;
-	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
-	if (!new_page)
-		goto put_old;
+		goto put;
 
-	__SetPageUptodate(new_page);
-
-	copy_highpage(new_page, old_page);
-	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+ retry:
+	put_page(page);
+	/* COW this page if not writable */
+	ret = get_user_pages(NULL, mm, vaddr, 1, 1, 1, &page, &vma);
+	if (ret <= 0)
+		goto put;
 
-	ret = anon_vma_prepare(vma);
-	if (ret)
-		goto put_new;
+	ptep = page_check_address(page, mm, vaddr, &ptlp, 0);
+	if (!ptep)
+		goto retry;
 
-	ret = __replace_page(vma, vaddr, old_page, new_page);
+	/* Unmap this page to ensure that nobody can execute it */
+	flush_cache_page(vma, vaddr, pte_pfn(*ptep));
+	entry = ptep_clear_flush(vma, vaddr, ptep);
 
-put_new:
-	page_cache_release(new_page);
-put_old:
-	put_page(old_page);
+	/* Nobody can fault in this page, modify it */
+	copy_to_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
 
-	if (unlikely(ret == -EAGAIN))
-		goto retry;
+	/* Restore the old mapping */
+	entry = pte_mkdirty(entry);
+	set_pte_at(mm, vaddr, ptep, entry);
+	update_mmu_cache(vma, vaddr, ptep);
+	pte_unmap_unlock(ptep, ptlp);
+ put:
+	put_page(page);
 	return ret;
 }
 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 18:49           ` [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly Oleg Nesterov
@ 2013-12-03 19:00             ` Linus Torvalds
  2013-12-03 19:20               ` H. Peter Anvin
  2013-12-03 19:53               ` Oleg Nesterov
  0 siblings, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2013-12-03 19:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Peter Zijlstra, Jiri Kosina, Andi Kleen,
	Linux Kernel Mailing List, the arch/x86 maintainers, Andi Kleen,
	H. Peter Anvin, Ingo Molnar, Borislav Petkov

On Tue, Dec 3, 2013 at 10:49 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> See the patch below. For review only

Looks completely broken. Where do you guarantee that it's just a single page?

Yes, on x86, UPROBE_SWBP_INSN_SIZE is a single byte. But quite
frankly, on x86, exactly *because* it's a single byte, I don't
understand why we don't just write the damn thing with a single
"put_user()", and stop with all the idiotic games. No need to
invalidate caches, even, because if you overwrite the first byte of an
instruction, it all "just works". Either the instruction decoding gets
the old one, or it gets the new one. We already rely on that for the
kernel bp instruction replacement.

And on non-x86, UPROBE_SWBP_INSN_SIZE is not necessarily 1, so it
could cross a page boundary. Yes, many architectures will have
alignment constraints, but I don't see this testing it.

Whatever. I think that code is bad, and you should feel bad. But hey,
I think it was pretty bad before too.

            Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 19:00             ` Linus Torvalds
@ 2013-12-03 19:20               ` H. Peter Anvin
  2013-12-03 20:01                 ` Oleg Nesterov
  2013-12-03 19:53               ` Oleg Nesterov
  1 sibling, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-03 19:20 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Hugh Dickins, Peter Zijlstra, Jiri Kosina, Andi Kleen,
	Linux Kernel Mailing List, the arch/x86 maintainers, Andi Kleen,
	Ingo Molnar, Borislav Petkov

On 12/03/2013 11:00 AM, Linus Torvalds wrote:
> 
> Yes, on x86, UPROBE_SWBP_INSN_SIZE is a single byte. But quite
> frankly, on x86, exactly *because* it's a single byte, I don't
> understand why we don't just write the damn thing with a single
> "put_user()", and stop with all the idiotic games. No need to
> invalidate caches, even, because if you overwrite the first byte of an
> instruction, it all "just works". Either the instruction decoding gets
> the old one, or it gets the new one. We already rely on that for the
> kernel bp instruction replacement.
> 
> And on non-x86, UPROBE_SWBP_INSN_SIZE is not necessarily 1, so it
> could cross a page boundary. Yes, many architectures will have
> alignment constraints, but I don't see this testing it.
> 
> Whatever. I think that code is bad, and you should feel bad. But hey,
> I think it was pretty bad before too.
> 

I guess it would have to be checked, but I would be *highly* surprised
if UPROBE_SWBP_INSN_SIZE ever[1] could be anything than the fundamental
instruction quantum, which means it should never be able to wrap a page,
but *also* should mean it should be able to just be put_user()'d
followed by whatever synchronization necessary to make it globally visible.

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 19:00             ` Linus Torvalds
  2013-12-03 19:20               ` H. Peter Anvin
@ 2013-12-03 19:53               ` Oleg Nesterov
  1 sibling, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-03 19:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Jiri Kosina, Andi Kleen,
	Linux Kernel Mailing List, the arch/x86 maintainers, Andi Kleen,
	H. Peter Anvin, Ingo Molnar, Borislav Petkov

On 12/03, Linus Torvalds wrote:
>
> On Tue, Dec 3, 2013 at 10:49 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > See the patch below. For review only
>
> Looks completely broken. Where do you guarantee that it's just a single page?

Yes, it is always a single page on all supported architectures.

This is even documented. I believe that "NOTE:" comment above
uprobe_write_opcode() tries to say this but I guess this comment
should be cleanuped.

And note also

	/* uprobe_write_opcode() assumes we don't cross page boundary */
	BUG_ON((uprobe->offset & ~PAGE_MASK) +
			UPROBE_SWBP_INSN_SIZE > PAGE_SIZE);

in prepare_uprobe().

> Yes, on x86, UPROBE_SWBP_INSN_SIZE is a single byte.

And powerpc checks addr & 3 to ensure it doesn't cross the page.

> frankly, on x86, exactly *because* it's a single byte, I don't
> understand why we don't just write the damn thing with a single
> "put_user()", and stop with all the idiotic games.

Well, put_user() obviously can't work, mm != current->mm.
So we need get_user_pages() at least.

> No need to
> invalidate caches, even, because if you overwrite the first byte of an
> instruction, it all "just works".

I can't comment this, I do not know how the hardware actually works.

> Either the instruction decoding gets
> the old one, or it gets the new one.

Funny that.

I have asked why access_process_vm() can't work when I saw the initial
version of uprobes patches. I was told this can't work (even on x86).

And I was told that this idiotic games were suggested by someone
named Linus Torvalds ;)

> And on non-x86, UPROBE_SWBP_INSN_SIZE is not necessarily 1, so it
> could cross a page boundary.

Yes. If we support such an architecture we should obviously update
uprobe_write_opcode() accordingly.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 19:20               ` H. Peter Anvin
@ 2013-12-03 20:01                 ` Oleg Nesterov
  2013-12-03 20:21                   ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-03 20:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/03, H. Peter Anvin wrote:
>
> I guess it would have to be checked, but I would be *highly* surprised
> if UPROBE_SWBP_INSN_SIZE ever[1] could be anything than the fundamental
> instruction quantum, which means it should never be able to wrap a page,

Yes, it can't.

> but *also* should mean it should be able to just be put_user()'d

put_user() obviously can't work, we need access_remote_vm-like code.

> followed by whatever synchronization necessary to make it globally visible.

Could you explain what this synchronization should actually do ?

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 20:01                 ` Oleg Nesterov
@ 2013-12-03 20:21                   ` H. Peter Anvin
  2013-12-03 20:38                     ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-03 20:21 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/03/2013 12:01 PM, Oleg Nesterov wrote:
> 
>> followed by whatever synchronization necessary to make it globally visible.
> 
> Could you explain what this synchronization should actually do ?
> 

That is architecture-dependent.  On x86 it requires each CPU on which
this code is visible to be IPId, on others it involves complex icache
flushing protocols.

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 20:21                   ` H. Peter Anvin
@ 2013-12-03 20:38                     ` Oleg Nesterov
  2013-12-03 20:43                       ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-03 20:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/03, H. Peter Anvin wrote:
>
> On 12/03/2013 12:01 PM, Oleg Nesterov wrote:
> >
> >> followed by whatever synchronization necessary to make it globally visible.
> >
> > Could you explain what this synchronization should actually do ?
> >
>
> That is architecture-dependent. On x86 it requires each CPU on which
> this code is visible to be IPId, on others it involves complex icache
> flushing protocols.

Hmm. And why this would be better than arch-independent code I sent?

IOW. Could you please comment the patch I sent?

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 20:38                     ` Oleg Nesterov
@ 2013-12-03 20:43                       ` H. Peter Anvin
  2013-12-03 20:54                         ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-03 20:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

There is no architecture-independent way to make code globally visible.

Oleg Nesterov <oleg@redhat.com> wrote:
>On 12/03, H. Peter Anvin wrote:
>>
>> On 12/03/2013 12:01 PM, Oleg Nesterov wrote:
>> >
>> >> followed by whatever synchronization necessary to make it globally
>visible.
>> >
>> > Could you explain what this synchronization should actually do ?
>> >
>>
>> That is architecture-dependent. On x86 it requires each CPU on which
>> this code is visible to be IPId, on others it involves complex icache
>> flushing protocols.
>
>Hmm. And why this would be better than arch-independent code I sent?
>
>IOW. Could you please comment the patch I sent?
>
>Oleg.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 20:43                       ` H. Peter Anvin
@ 2013-12-03 20:54                         ` Oleg Nesterov
  2013-12-03 22:01                           ` Linus Torvalds
  2013-12-03 22:42                           ` H. Peter Anvin
  0 siblings, 2 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-03 20:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/03, H. Peter Anvin wrote:
>
> There is no architecture-independent way to make code globally visible.

Then I probably misunderstood "globally". Or something else.

So do you think the patch I sent is wrong? Why?

I am totally confused.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 20:54                         ` Oleg Nesterov
@ 2013-12-03 22:01                           ` Linus Torvalds
  2013-12-03 23:47                             ` H. Peter Anvin
  2013-12-04 11:11                             ` Oleg Nesterov
  2013-12-03 22:42                           ` H. Peter Anvin
  1 sibling, 2 replies; 65+ messages in thread
From: Linus Torvalds @ 2013-12-03 22:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: H. Peter Anvin, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On Tue, Dec 3, 2013 at 12:54 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> So do you think the patch I sent is wrong? Why?

I think the TLB shootdown should guarantee that it's ok on other
CPU's, since that's basically what we do on mmap.

But looking closer at this, I think I see why the old code did what it
did. I think it's breaking shared mmap pages on purpose rather than
dirtying them. Which is probably the right thing to do.

              Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 20:54                         ` Oleg Nesterov
  2013-12-03 22:01                           ` Linus Torvalds
@ 2013-12-03 22:42                           ` H. Peter Anvin
  1 sibling, 0 replies; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-03 22:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/03/2013 12:54 PM, Oleg Nesterov wrote:
> On 12/03, H. Peter Anvin wrote:
>>
>> There is no architecture-independent way to make code globally visible.
> 
> Then I probably misunderstood "globally". Or something else.
> 
> So do you think the patch I sent is wrong? Why?
> 

It is wrong in the sense that without an architecture-specific
synchronization at the end it is not guaranteed to work.  It will work
fine on x86 and presumably any other architecture where icache-dcache
coherency is enforced in hardware (I am *guessing* that includes all or
most SMP platforms, but it is definitely *not* true on all UP platforms.)

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 22:01                           ` Linus Torvalds
@ 2013-12-03 23:47                             ` H. Peter Anvin
  2013-12-04 11:30                               ` Oleg Nesterov
  2013-12-04 11:11                             ` Oleg Nesterov
  1 sibling, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-03 23:47 UTC (permalink / raw)
  To: Linus Torvalds, Oleg Nesterov
  Cc: Hugh Dickins, Peter Zijlstra, Jiri Kosina, Andi Kleen,
	Linux Kernel Mailing List, the arch/x86 maintainers, Andi Kleen,
	Ingo Molnar, Borislav Petkov

On 12/03/2013 02:01 PM, Linus Torvalds wrote:
> On Tue, Dec 3, 2013 at 12:54 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> So do you think the patch I sent is wrong? Why?
> 
> I think the TLB shootdown should guarantee that it's ok on other
> CPU's, since that's basically what we do on mmap.
> 

I think that is true for other CPUs; however, there are definitely CPUs
out there (which Linux supports) for which you have to synchronize the I
and D sides "manually" after writing code through memory, at least
through the CPU.  That is at least one reason why MIPS has a
cacheflush() system call, for example.

> But looking closer at this, I think I see why the old code did what it
> did. I think it's breaking shared mmap pages on purpose rather than
> dirtying them. Which is probably the right thing to do.
> 

In other words, treating them as MAP_PRIVATE?  Wouldn't it be better to
throw an error if we can't honor the semantics of the mapping that we
are using?

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 22:01                           ` Linus Torvalds
  2013-12-03 23:47                             ` H. Peter Anvin
@ 2013-12-04 11:11                             ` Oleg Nesterov
  2013-12-04 16:01                               ` H. Peter Anvin
  1 sibling, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-04 11:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/03, Linus Torvalds wrote:
>
> On Tue, Dec 3, 2013 at 12:54 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > So do you think the patch I sent is wrong? Why?
>
> I think the TLB shootdown should guarantee that it's ok on other
> CPU's, since that's basically what we do on mmap.

OK, thanks. I'll resend this patch.

It is still not clear to me if we can simply change a single byte on
x86 or not, but at least on powerpc we need to update 4 bytes. Perhaps
we can conditionalize these pte games later.

> But looking closer at this, I think I see why the old code did what it
> did. I think it's breaking shared mmap pages on purpose rather than
> dirtying them. Which is probably the right thing to do.

Ah, no. uprobes never writes to the shared pages. (hmm, it seems that
VM_SHARED check is buggy, but this is offtopic). Otherwise this patch
would be very wrong.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-03 23:47                             ` H. Peter Anvin
@ 2013-12-04 11:30                               ` Oleg Nesterov
  0 siblings, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-04 11:30 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/03, H. Peter Anvin wrote:
>
> On 12/03/2013 02:01 PM, Linus Torvalds wrote:
> > On Tue, Dec 3, 2013 at 12:54 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >>
> >> So do you think the patch I sent is wrong? Why?
> >
> > I think the TLB shootdown should guarantee that it's ok on other
> > CPU's, since that's basically what we do on mmap.
> >
>
> I think that is true for other CPUs; however, there are definitely CPUs
> out there (which Linux supports) for which you have to synchronize the I
> and D sides "manually" after writing code through memory, at least
> through the CPU.  That is at least one reason why MIPS has a
> cacheflush() system call, for example.

OK, probably (with or without the patch I sent) uprobe_write_opcode() needs
flush_icache_page(). Altough it is nop on x86 and powerpc (architectures
we currently support).

But I still can't understand your "There is no architecture-independent
way to make code globally visible". If this is true, then how, say,
do_swap_page() can work?

So I still think the patch should work (I'll add flush_icache_page).

> > But looking closer at this, I think I see why the old code did what it
> > did. I think it's breaking shared mmap pages on purpose rather than
> > dirtying them. Which is probably the right thing to do.
>
> In other words, treating them as MAP_PRIVATE?  Wouldn't it be better to
> throw an error if we can't honor the semantics of the mapping that we
> are using?

Yes, uprobes never writes to MAP_SHARED vmas.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-04 11:11                             ` Oleg Nesterov
@ 2013-12-04 16:01                               ` H. Peter Anvin
  2013-12-04 16:48                                 ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-04 16:01 UTC (permalink / raw)
  To: Oleg Nesterov, Linus Torvalds
  Cc: Hugh Dickins, Peter Zijlstra, Jiri Kosina, Andi Kleen,
	Linux Kernel Mailing List, the arch/x86 maintainers, Andi Kleen,
	Ingo Molnar, Borislav Petkov

On 12/04/2013 03:11 AM, Oleg Nesterov wrote:
> 
> It is still not clear to me if we can simply change a single byte on
> x86 or not, but at least on powerpc we need to update 4 bytes. Perhaps
> we can conditionalize these pte games later.
> 

But 4 aligned bytes can be written as a single transaction.

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-04 16:01                               ` H. Peter Anvin
@ 2013-12-04 16:48                                 ` Oleg Nesterov
  2013-12-04 16:54                                   ` H. Peter Anvin
  0 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-04 16:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/04, H. Peter Anvin wrote:
>
> On 12/04/2013 03:11 AM, Oleg Nesterov wrote:
> >
> > It is still not clear to me if we can simply change a single byte on
> > x86 or not, but at least on powerpc we need to update 4 bytes. Perhaps
> > we can conditionalize these pte games later.
> >
>
> But 4 aligned bytes can be written as a single transaction.

Ah yes, this is true.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-04 16:48                                 ` Oleg Nesterov
@ 2013-12-04 16:54                                   ` H. Peter Anvin
  2013-12-04 17:15                                     ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-04 16:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/04/2013 08:48 AM, Oleg Nesterov wrote:
> On 12/04, H. Peter Anvin wrote:
>>
>> On 12/04/2013 03:11 AM, Oleg Nesterov wrote:
>>>
>>> It is still not clear to me if we can simply change a single byte on
>>> x86 or not, but at least on powerpc we need to update 4 bytes. Perhaps
>>> we can conditionalize these pte games later.
>>>
>>
>> But 4 aligned bytes can be written as a single transaction.
> 
> Ah yes, this is true.
> 

That is why I talk about the atomic instruction word... most (but not
*all*) architectures have a fundamental minimum unit of instructions
which is aligned and can be atomically written.  Typically this is 1, 2,
or 4 bytes.

	-hpa


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-04 16:54                                   ` H. Peter Anvin
@ 2013-12-04 17:15                                     ` Linus Torvalds
  2013-12-04 17:43                                       ` Oleg Nesterov
                                                         ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Linus Torvalds @ 2013-12-04 17:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Oleg Nesterov, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On Wed, Dec 4, 2013 at 8:54 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> That is why I talk about the atomic instruction word... most (but not
> *all*) architectures have a fundamental minimum unit of instructions
> which is aligned and can be atomically written.  Typically this is 1, 2,
> or 4 bytes.

Note that it's not just about the "atomically written", it's also
about the guarantee that it's atomically *read*.

x86 can certainly atomically write a 4-byte instruction too, it's just
that there's no guarantee - even if the instruction is aligned etc -
that the actual instruction decoding always ends up reading it that
way. It might re-read an instruction after encountering a prefix byte
etc etc. So even if it's all properly aligned, the reading side might
do something odd.

            Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-04 17:15                                     ` Linus Torvalds
@ 2013-12-04 17:43                                       ` Oleg Nesterov
  2013-12-05 17:23                                         ` Oleg Nesterov
  2013-12-04 18:32                                       ` H. Peter Anvin
  2013-12-05  8:28                                       ` Jon Medhurst (Tixy)
  2 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-04 17:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov,
	Ananth N Mavinakayanahalli, Srikar Dronamraju

Peter, Linus, I got lost.

So what do you finally think about this change? Please see v2 below:

	- update the comment above gup(write, force)

	- add flush_icache_page() before set_pte_at() (nop on x86
	  and powerpc)

	- fix the returned value, and with this change it seems
	  to work although I do not trust my testing.

I am attaching the code:

	int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
				uprobe_opcode_t opcode)
	{
		struct page *page;
		struct vm_area_struct *vma;
		pte_t *ptep, entry;
		spinlock_t *ptlp;
		int ret;

		/* Read the page with vaddr into memory */
		ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
		if (ret < 0)
			return ret;

		ret = verify_opcode(page, vaddr, &opcode);
		if (ret < 0)
			goto put;

	 retry:
		put_page(page);
		/* Break the mapping unless the page is already anonymous */
		ret = get_user_pages(NULL, mm, vaddr, 1, 1, 1, &page, &vma);
		if (ret <= 0)
			goto put;

		ptep = page_check_address(page, mm, vaddr, &ptlp, 0);
		if (!ptep)
			goto retry;

		/* Unmap this page to ensure that nobody can execute it */
		flush_cache_page(vma, vaddr, pte_pfn(*ptep));
		entry = ptep_clear_flush(vma, vaddr, ptep);

		/* Nobody can fault in this page, modify it */
		copy_to_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);

		/* Restore the old mapping */
		entry = pte_mkdirty(entry);
		flush_icache_page(vma, page);
		set_pte_at(mm, vaddr, ptep, entry);
		update_mmu_cache(vma, vaddr, ptep);
		pte_unmap_unlock(ptep, ptlp);
		ret = 0;
	 put:
		put_page(page);
		return ret;
	}

And/Or. Are you still saying that on x86 (and powerpc?) we do not need
these pte games at all and uprobe_write_opcode() can simply do:

		/* Break the mapping unless the page is already anonymous */
		ret = get_user_pages(NULL, mm, vaddr, 1, 1, 1, &page, &vma);

		// this page can't be swapped out due to page_freeze_refs()
		copy_to_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
		
		set_page_dirty_lock(page);

Just in case... I have no idea if this matters or not, but
uprobe_write_opcode() is also used to restore the original insn which
can even cross the page. We still write a single (1st) byte in this
case, the byte which was previously replaced by int3.

Oleg.


--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -114,65 +114,6 @@ static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr)
 }
 
 /**
- * __replace_page - replace page in vma by new page.
- * based on replace_page in mm/ksm.c
- *
- * @vma:      vma that holds the pte pointing to page
- * @addr:     address the old @page is mapped at
- * @page:     the cowed page we are replacing by kpage
- * @kpage:    the modified page we replace page by
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
-				struct page *page, struct page *kpage)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pte_t *ptep;
-	int err;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = addr;
-	const unsigned long mmun_end   = addr + PAGE_SIZE;
-
-	/* For try_to_free_swap() and munlock_vma_page() below */
-	lock_page(page);
-
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	err = -EAGAIN;
-	ptep = page_check_address(page, mm, addr, &ptl, 0);
-	if (!ptep)
-		goto unlock;
-
-	get_page(kpage);
-	page_add_new_anon_rmap(kpage, vma, addr);
-
-	if (!PageAnon(page)) {
-		dec_mm_counter(mm, MM_FILEPAGES);
-		inc_mm_counter(mm, MM_ANONPAGES);
-	}
-
-	flush_cache_page(vma, addr, pte_pfn(*ptep));
-	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
-
-	page_remove_rmap(page);
-	if (!page_mapped(page))
-		try_to_free_swap(page);
-	pte_unmap_unlock(ptep, ptl);
-
-	if (vma->vm_flags & VM_LOCKED)
-		munlock_vma_page(page);
-	put_page(page);
-
-	err = 0;
- unlock:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	unlock_page(page);
-	return err;
-}
-
-/**
  * is_swbp_insn - check if instruction is breakpoint instruction.
  * @insn: instruction to be checked.
  * Default implementation of is_swbp_insn
@@ -264,43 +205,48 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
 			uprobe_opcode_t opcode)
 {
-	struct page *old_page, *new_page;
+	struct page *page;
 	struct vm_area_struct *vma;
+	pte_t *ptep, entry;
+	spinlock_t *ptlp;
 	int ret;
 
-retry:
 	/* Read the page with vaddr into memory */
-	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &old_page, &vma);
-	if (ret <= 0)
+	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 1, &page, NULL);
+	if (ret < 0)
 		return ret;
 
-	ret = verify_opcode(old_page, vaddr, &opcode);
-	if (ret <= 0)
-		goto put_old;
-
-	ret = -ENOMEM;
-	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
-	if (!new_page)
-		goto put_old;
-
-	__SetPageUptodate(new_page);
-
-	copy_highpage(new_page, old_page);
-	copy_to_page(new_page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
-
-	ret = anon_vma_prepare(vma);
-	if (ret)
-		goto put_new;
-
-	ret = __replace_page(vma, vaddr, old_page, new_page);
+	ret = verify_opcode(page, vaddr, &opcode);
+	if (ret < 0)
+		goto put;
 
-put_new:
-	page_cache_release(new_page);
-put_old:
-	put_page(old_page);
+ retry:
+	put_page(page);
+	/* Break the mapping unless the page is already anonymous */
+	ret = get_user_pages(NULL, mm, vaddr, 1, 1, 1, &page, &vma);
+	if (ret <= 0)
+		goto put;
 
-	if (unlikely(ret == -EAGAIN))
+	ptep = page_check_address(page, mm, vaddr, &ptlp, 0);
+	if (!ptep)
 		goto retry;
+
+	/* Unmap this page to ensure that nobody can execute it */
+	flush_cache_page(vma, vaddr, pte_pfn(*ptep));
+	entry = ptep_clear_flush(vma, vaddr, ptep);
+
+	/* Nobody can fault in this page, modify it */
+	copy_to_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
+
+	/* Restore the old mapping */
+	entry = pte_mkdirty(entry);
+	flush_icache_page(vma, page);
+	set_pte_at(mm, vaddr, ptep, entry);
+	update_mmu_cache(vma, vaddr, ptep);
+	pte_unmap_unlock(ptep, ptlp);
+	ret = 0;
+ put:
+	put_page(page);
 	return ret;
 }
 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-04 17:15                                     ` Linus Torvalds
  2013-12-04 17:43                                       ` Oleg Nesterov
@ 2013-12-04 18:32                                       ` H. Peter Anvin
  2013-12-05  8:28                                       ` Jon Medhurst (Tixy)
  2 siblings, 0 replies; 65+ messages in thread
From: H. Peter Anvin @ 2013-12-04 18:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov

On 12/04/2013 09:15 AM, Linus Torvalds wrote:
> On Wed, Dec 4, 2013 at 8:54 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> That is why I talk about the atomic instruction word... most (but not
>> *all*) architectures have a fundamental minimum unit of instructions
>> which is aligned and can be atomically written.  Typically this is 1, 2,
>> or 4 bytes.
> 
> Note that it's not just about the "atomically written", it's also
> about the guarantee that it's atomically *read*.
> 
> x86 can certainly atomically write a 4-byte instruction too, it's just
> that there's no guarantee - even if the instruction is aligned etc -
> that the actual instruction decoding always ends up reading it that
> way. It might re-read an instruction after encountering a prefix byte
> etc etc. So even if it's all properly aligned, the reading side might
> do something odd.
> 

True, at least in theory, but the atomic instruction quantum on x86 is a
byte.

	-hpa



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-04 17:15                                     ` Linus Torvalds
  2013-12-04 17:43                                       ` Oleg Nesterov
  2013-12-04 18:32                                       ` H. Peter Anvin
@ 2013-12-05  8:28                                       ` Jon Medhurst (Tixy)
  2 siblings, 0 replies; 65+ messages in thread
From: Jon Medhurst (Tixy) @ 2013-12-05  8:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Oleg Nesterov, Hugh Dickins, Peter Zijlstra,
	Jiri Kosina, Andi Kleen, Linux Kernel Mailing List,
	the arch/x86 maintainers, Andi Kleen, Ingo Molnar,
	Borislav Petkov

On Wed, 2013-12-04 at 09:15 -0800, Linus Torvalds wrote:
> On Wed, Dec 4, 2013 at 8:54 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> >
> > That is why I talk about the atomic instruction word... most (but not
> > *all*) architectures have a fundamental minimum unit of instructions
> > which is aligned and can be atomically written.  Typically this is 1, 2,
> > or 4 bytes.
> 
> Note that it's not just about the "atomically written", it's also
> about the guarantee that it's atomically *read*.
> 
> x86 can certainly atomically write a 4-byte instruction too, it's just
> that there's no guarantee - even if the instruction is aligned etc -
> that the actual instruction decoding always ends up reading it that
> way. It might re-read an instruction after encountering a prefix byte
> etc etc. So even if it's all properly aligned, the reading side might
> do something odd.

The ARM architecture has similar issues. Even though the instruction
size is mostly fixed, the architecture specification itself only
guarantees a very tiny subset of instructions are safe to modify whilst
there may be concurrent execution of that instruction. I'm quoting a
discussion from a while ago: http://lkml.org/lkml/2012/12/10/346

-- 
Tixy




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-04 17:43                                       ` Oleg Nesterov
@ 2013-12-05 17:23                                         ` Oleg Nesterov
  2013-12-05 17:49                                           ` Borislav Petkov
  0 siblings, 1 reply; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-05 17:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Hugh Dickins, Peter Zijlstra, Jiri Kosina,
	Andi Kleen, Linux Kernel Mailing List, the arch/x86 maintainers,
	Andi Kleen, Ingo Molnar, Borislav Petkov,
	Ananth N Mavinakayanahalli, Srikar Dronamraju

On 12/04, Oleg Nesterov wrote:
>
> And/Or. Are you still saying that on x86 (and powerpc?) we do not need
> these pte games at all and uprobe_write_opcode() can simply do:
>
> 		/* Break the mapping unless the page is already anonymous */
> 		ret = get_user_pages(NULL, mm, vaddr, 1, 1, 1, &page, &vma);
>
> 		// this page can't be swapped out due to page_freeze_refs()
> 		copy_to_page(page, vaddr, &opcode, UPROBE_SWBP_INSN_SIZE);
> 		
> 		set_page_dirty_lock(page);

This is almost off-topic, but I am wondering if (in the long term)
we can avoid this "insert the bp into every mm" altogether.

Instead, uprobe_write_opcode() should only unmap this page and set
SWP_UPROBE_ENTRY. The probed application should install the page
with breakpoints itself on demand during the fault, do_swap_page()
->non_swap_entry() can do this.

This also allows to share the patched page, uprobes should read it
at uprobe_register() time, make a copy, add the breakpoint, and keep
it for do_swap_page().

Perhaps. Not sure this is really possible, and of course this is not
that simple, and the filtering complicates this even more. And we
should ensure that the application itself or ptracer can't write to
this page (or even cow it), but this doesn't look too hard.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-05 17:23                                         ` Oleg Nesterov
@ 2013-12-05 17:49                                           ` Borislav Petkov
  2013-12-05 18:45                                             ` Oleg Nesterov
  0 siblings, 1 reply; 65+ messages in thread
From: Borislav Petkov @ 2013-12-05 17:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, H. Peter Anvin, Hugh Dickins, Peter Zijlstra,
	Jiri Kosina, Andi Kleen, Linux Kernel Mailing List,
	the arch/x86 maintainers, Andi Kleen, Ingo Molnar,
	Ananth N Mavinakayanahalli, Srikar Dronamraju

On Thu, Dec 05, 2013 at 06:23:55PM +0100, Oleg Nesterov wrote:
> This is almost off-topic, but I am wondering if (in the long term) we
> can avoid this "insert the bp into every mm" altogether.
>
> Instead, uprobe_write_opcode() should only unmap this page and set

Ok, sorry if I'm completely off base here but have you guys tried
unmapping the page from all other VMs, patching it and causing all
the VMs to refault it thereby getting the updated content? During the
patching you'd probably need to cause the #PF handler to "loop" until
patching is complete though.

I don't know whether that is even doable/makes sense - just a dumb
idea...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly
  2013-12-05 17:49                                           ` Borislav Petkov
@ 2013-12-05 18:45                                             ` Oleg Nesterov
  0 siblings, 0 replies; 65+ messages in thread
From: Oleg Nesterov @ 2013-12-05 18:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, H. Peter Anvin, Hugh Dickins, Peter Zijlstra,
	Jiri Kosina, Andi Kleen, Linux Kernel Mailing List,
	the arch/x86 maintainers, Andi Kleen, Ingo Molnar,
	Ananth N Mavinakayanahalli, Srikar Dronamraju

On 12/05, Borislav Petkov wrote:
>
> On Thu, Dec 05, 2013 at 06:23:55PM +0100, Oleg Nesterov wrote:
> > This is almost off-topic, but I am wondering if (in the long term) we
> > can avoid this "insert the bp into every mm" altogether.
> >
> > Instead, uprobe_write_opcode() should only unmap this page and set
>
> Ok, sorry if I'm completely off base here but have you guys tried
> unmapping the page from all other VMs,

This is what I meant, but we can't simply clear this pte,

> and causing all
> the VMs to refault

why? it would be better to install the page on demand.

> patching you'd probably need to cause the #PF handler to "loop" until
> patching is complete though.

We can't do this, but I do not think we need to block #PF handler.

However, somehow the #PF handler should know that it should install
the patched page owned by uprobes. That is why I talked about
SWP_UPROBE_ENTRY (or something similar)

But again, in any case this is not trivial.

And perhaps I misundestood you... If you actually want to cause
all the VMs to refault, then why we can't unmap + refault every
mm like the patch I sent does? Just in case, note that we can't
share the same page anyway without more complications.

Oleg.


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2013-12-05 18:46 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-26  0:37 [PATCH] Add a text_poke syscall v2 Andi Kleen
2013-11-26 19:05 ` Andy Lutomirski
2013-11-26 19:11   ` Andi Kleen
2013-11-26 20:03   ` Linus Torvalds
2013-11-27 19:57     ` H. Peter Anvin
2013-11-27 22:02       ` H. Peter Anvin
2013-11-27 22:21         ` Andy Lutomirski
2013-11-27 22:21         ` Borislav Petkov
2013-11-27 22:24           ` H. Peter Anvin
2013-11-27 22:25           ` H. Peter Anvin
2013-11-27 22:29             ` Borislav Petkov
2013-11-27 22:31               ` H. Peter Anvin
2013-11-27 23:04                 ` Linus Torvalds
2013-11-27 23:13                   ` Borislav Petkov
2013-11-27 22:40               ` H. Peter Anvin
2013-11-27 23:10                 ` Borislav Petkov
2013-11-27 23:20                   ` H. Peter Anvin
2013-11-27 23:40                     ` Borislav Petkov
2013-11-27 23:47                       ` H. Peter Anvin
2013-11-27 22:41         ` Linus Torvalds
2013-11-27 22:53           ` H. Peter Anvin
2013-11-27 23:15             ` Linus Torvalds
2013-11-27 23:28               ` H. Peter Anvin
2013-11-28  2:01                 ` Linus Torvalds
2013-11-28  2:10                   ` H. Peter Anvin
2013-11-28  9:12                   ` Jiri Kosina
2013-11-27 23:44               ` Andi Kleen
2013-11-29 18:35 ` Oleg Nesterov
2013-11-29 19:54   ` Andi Kleen
2013-11-29 20:05     ` Oleg Nesterov
2013-11-29 20:17       ` H. Peter Anvin
2013-11-29 20:35         ` Oleg Nesterov
2013-11-29 21:24           ` H. Peter Anvin
2013-11-30 14:56             ` Oleg Nesterov
2013-11-29 23:24       ` Jiri Kosina
2013-11-30  0:22         ` Linus Torvalds
2013-12-03 18:49           ` [PATCH?] uprobes: change uprobe_write_opcode() to modify the page directly Oleg Nesterov
2013-12-03 19:00             ` Linus Torvalds
2013-12-03 19:20               ` H. Peter Anvin
2013-12-03 20:01                 ` Oleg Nesterov
2013-12-03 20:21                   ` H. Peter Anvin
2013-12-03 20:38                     ` Oleg Nesterov
2013-12-03 20:43                       ` H. Peter Anvin
2013-12-03 20:54                         ` Oleg Nesterov
2013-12-03 22:01                           ` Linus Torvalds
2013-12-03 23:47                             ` H. Peter Anvin
2013-12-04 11:30                               ` Oleg Nesterov
2013-12-04 11:11                             ` Oleg Nesterov
2013-12-04 16:01                               ` H. Peter Anvin
2013-12-04 16:48                                 ` Oleg Nesterov
2013-12-04 16:54                                   ` H. Peter Anvin
2013-12-04 17:15                                     ` Linus Torvalds
2013-12-04 17:43                                       ` Oleg Nesterov
2013-12-05 17:23                                         ` Oleg Nesterov
2013-12-05 17:49                                           ` Borislav Petkov
2013-12-05 18:45                                             ` Oleg Nesterov
2013-12-04 18:32                                       ` H. Peter Anvin
2013-12-05  8:28                                       ` Jon Medhurst (Tixy)
2013-12-03 22:42                           ` H. Peter Anvin
2013-12-03 19:53               ` Oleg Nesterov
2013-11-30 15:20         ` [PATCH] Add a text_poke syscall v2 Oleg Nesterov
2013-11-30 16:51         ` Oleg Nesterov
2013-11-30 17:31           ` Oleg Nesterov
2013-11-30  5:16       ` H. Peter Anvin
2013-11-30 14:52         ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).