linux-kernel.vger.kernel.org archive mirror
* [PATCH] Add a text_poke syscall
@ 2013-11-19  0:27 Andi Kleen
  2013-11-19  2:32 ` Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Andi Kleen @ 2013-11-19  0:27 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Properly patching running code ("cross modification")
is quite a complicated business on x86.

The CPU has specific rules that need to be followed, including
multiple global barriers.

Self-modifying code is getting more popular, so it's important
to make it easy to follow the rules.

The kernel does this properly with text_poke_bp(), but the same
method is hard to implement for user programs.

This patch adds an (x86-specific) text_poke() syscall that exposes
the text_poke_bp() machinery to user programs.

The interface is practically the same as text_poke_bp, just
exposed as a syscall. I added an extra timeout parameter, which
will potentially allow batching the global barriers in
the future. Right now it is required to be 0.

The call also still takes a global lock, so it has some scaling
limitations. If it were commonly used, this could be fixed
by setting up a list of break point locations; a lock would
then only be held to modify the list.

Right now the implementation is just as simple as possible.

Proposed man page:

NAME
	text_poke - Safely modify running instructions (x86)

SYNOPSIS
	int text_poke(void *addr, const void *opcode, size_t len,
	              void (*handler)(void), int timeout);

DESCRIPTION
	The text_poke system call allows safely modifying code that
	may be concurrently executing on other threads.
	It patches the instruction at addr with the new instructions
	at opcode, of length len. The target instruction is temporarily
	patched with a break point before it is replaced
	with the final replacement instructions. When the break point
	is hit, handler is called in the context
	of the trapping thread. The handler does not save any registers
	and cannot return. Typically it consists of a copy of the
	original instruction followed by a jump to the address after
	the original instruction. The handler is only needed during
	the patching process and can be overwritten once the syscall
	returns. timeout defines an optional timeout that indicates
	to the kernel how long the patching may be delayed.
	Right now it has to be 0.

EXAMPLE

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <assert.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/syscall.h>

volatile int finished;

extern char patch[], recovery[], repl[];

struct res {
        long total;
        long val1, val2, handler;
};

int text_poke(void *insn, void *repl, int len, void *handler, int to)
{
        /* 314 is the x86-64 syscall number added by this patch */
        return syscall(314, insn, repl, len, handler, to);
}

void *tfunc(void *arg)
{
        struct res *res = (struct res *)arg;

        while (!finished) {
                int val;
                asm volatile(   ".globl patch\n"
                                ".globl recovery\n"
                                ".global repl\n"
				/* original code to be patched */
                                "patch: mov $1,%0\n"
                                "1:\n"
                                ".section \".text.patchup\",\"x\"\n"
				/* Called when a race happens during patching.
				   Just execute the original code and jump back. */
                                "recovery:\n"
                                " mov $3,%0\n"
                                " jmp 1b\n"
				/* replacement code that gets patched in: */
                                "repl:\n"
                                " mov $2,%0\n"
                                ".previous" : "=a" (val));
                if (val == 1)
                        res->val1++;
                else if (val == 3)
                        res->handler++;
                else
                        res->val2++;
                res->total++;
        }
        return NULL;
}

int main(int ac, char **av)
{
        int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        int ps = sysconf(_SC_PAGESIZE);
        pthread_t pthr[ncpus];
        struct res res[ncpus];
        int i;

        srand(1);
        memset(&res, 0, sizeof(struct res) * ncpus);
        mprotect(patch - (unsigned long)patch % ps, ps,
		 PROT_READ|PROT_WRITE|PROT_EXEC);
        for (i = 0; i < ncpus - 1; i++)
                pthread_create(&pthr[i], NULL, tfunc, &res[i]);
        for (i = 0; i < 500000; i++) {
                text_poke(patch, repl, 5, recovery, 0);
                nanosleep(&((struct timespec) { 0, rand() % 100 }), NULL);
                text_poke(repl, patch, 5, recovery, 0);
        }
        finished = 1;
        for (i = 0; i < ncpus - 1; i++) {
                pthread_join(pthr[i], NULL);
                printf("%d: val1 %ld val2 %ld handler %ld total %ld\n",
                       i, res[i].val1, res[i].val2, res[i].handler,
                       res[i].total);
                assert(res[i].val1 + res[i].val2 + res[i].handler
				== res[i].total);
        }
        return 0;
}

RETURN VALUE
	On success, text_poke returns 0. On error, -1 is returned
	and errno is set appropriately.

ERRORS
	EINVAL		len was zero or too long, or
			timeout was an invalid value
	EFAULT		An error occurred while accessing opcode

VERSIONS
	text_poke was added in the Linux XXX kernel.

CONFORMING TO
	The call is Linux and x86 specific and should not be used
	in programs intended to be portable.
---
 arch/x86/kernel/alternative.c    | 121 ++++++++++++++++++++++++++++++++-------
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 3 files changed, 102 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index df94598..c143ff5 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -12,6 +12,7 @@
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/syscalls.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
 #include <asm/pgtable.h>
@@ -538,6 +539,23 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
 	return addr;
 }
 
+static void __kprobes __text_poke(struct page **pages, 
+				  int offset,
+				  const void *opcode, size_t len);
+
+static void resolve_pages(struct page **pages, void *addr)
+{
+	if (!core_kernel_text((unsigned long)addr)) {
+		pages[0] = vmalloc_to_page(addr);
+		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+	} else {
+		pages[0] = virt_to_page(addr);
+		WARN_ON(!PageReserved(pages[0]));
+		pages[1] = virt_to_page(addr + PAGE_SIZE);
+	}
+	BUG_ON(!pages[0]);
+}
+
 /**
  * text_poke - Update instructions on a live kernel
  * @addr: address to modify
@@ -553,26 +571,27 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
  */
 void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
 {
+	struct page *pages[2];
+
+	resolve_pages(pages, addr);
+	__text_poke(pages, (unsigned long)addr & ~PAGE_MASK,
+		    opcode, len);
+	return addr;
+}
+
+static void __kprobes __text_poke(struct page **pages,
+				   int offset,
+				   const void *opcode, size_t len)
+{
 	unsigned long flags;
 	char *vaddr;
-	struct page *pages[2];
-	int i;
 
-	if (!core_kernel_text((unsigned long)addr)) {
-		pages[0] = vmalloc_to_page(addr);
-		pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
-	} else {
-		pages[0] = virt_to_page(addr);
-		WARN_ON(!PageReserved(pages[0]));
-		pages[1] = virt_to_page(addr + PAGE_SIZE);
-	}
-	BUG_ON(!pages[0]);
 	local_irq_save(flags);
 	set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
 	if (pages[1])
 		set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
 	vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
-	memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
+	memcpy(&vaddr[offset], opcode, len);
 	clear_fixmap(FIX_TEXT_POKE0);
 	if (pages[1])
 		clear_fixmap(FIX_TEXT_POKE1);
@@ -580,10 +599,7 @@ void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
 	sync_core();
 	/* Could also do a CLFLUSH here to speed up CPU recovery; but
 	   that causes hangs on some VIA CPUs. */
-	for (i = 0; i < len; i++)
-		BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
 	local_irq_restore(flags);
-	return addr;
 }
 
 static void do_sync_core(void *info)
@@ -593,6 +609,7 @@ static void do_sync_core(void *info)
 
 static bool bp_patching_in_progress;
 static void *bp_int3_handler, *bp_int3_addr;
+static struct mm_struct *bp_target_mm;
 
 int poke_int3_handler(struct pt_regs *regs)
 {
@@ -602,6 +619,14 @@ int poke_int3_handler(struct pt_regs *regs)
 	if (likely(!bp_patching_in_progress))
 		return 0;
 
+	if (user_mode_vm(regs) &&
+		bp_target_mm &&
+		current->mm == bp_target_mm &&
+		regs->ip == (unsigned long)bp_int3_addr) {
+		regs->ip = (unsigned long) bp_int3_handler;
+		return 1;
+	}
+
 	if (user_mode_vm(regs) || regs->ip != (unsigned long)bp_int3_addr)
 		return 0;
 
@@ -612,6 +637,9 @@ int poke_int3_handler(struct pt_regs *regs)
 
 }
 
+static void __text_poke_bp(struct page **pages, int offset,
+			   const void *opcode, size_t len, void *handler);
+
 /**
  * text_poke_bp() -- update instructions on live kernel on SMP
  * @addr:	address to patch
@@ -636,10 +664,22 @@ int poke_int3_handler(struct pt_regs *regs)
  */
 void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 {
+	struct page *pages[2];
+
+	resolve_pages(pages, addr);
+	bp_int3_addr = (u8 *)addr + 1;
+	__text_poke_bp(pages, (unsigned long)addr & ~PAGE_MASK,
+		       opcode, len, handler);
+	return addr;
+}
+
+/* Caller must set up bp_int3_addr and hold text_mutex */
+static void __text_poke_bp(struct page **pages, int offset,
+		     const void *opcode, size_t len, void *handler)
+{
 	unsigned char int3 = 0xcc;
 
 	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
 	bp_patching_in_progress = true;
 	/*
 	 * Corresponding read barrier in int3 notifier for
@@ -648,13 +688,14 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 */
 	smp_wmb();
 
-	text_poke(addr, &int3, sizeof(int3));
+	__text_poke(pages, offset, &int3, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
 	if (len - sizeof(int3) > 0) {
 		/* patch all but the first byte */
-		text_poke((char *)addr + sizeof(int3),
+		__text_poke(pages, 
+			  offset + sizeof(int3),
 			  (const char *) opcode + sizeof(int3),
 			  len - sizeof(int3));
 		/*
@@ -666,13 +707,51 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	}
 
 	/* patch the first byte */
-	text_poke(addr, opcode, sizeof(int3));
+	__text_poke(pages, offset, opcode, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
 	bp_patching_in_progress = false;
 	smp_wmb();
-
-	return addr;
 }
 
+#define MAX_INSN_LEN 32
+
+SYSCALL_DEFINE5(text_poke,
+		__user void *, addr,
+		const void *, opcode, 
+		size_t, len,
+		void *, handler,
+		int, timeout)
+{
+	struct page *pages[2];
+	char insn[MAX_INSN_LEN];
+	int err, npages;
+
+	if ((unsigned long)len > MAX_INSN_LEN || len == 0)
+		return -EINVAL;
+	/* For future extension */
+	if (timeout != 0)
+		return -EINVAL;
+	if (copy_from_user(insn, opcode, len))
+		return -EFAULT;
+	pages[1] = NULL;
+	npages = ((unsigned long)addr & PAGE_MASK) == 
+		(((unsigned long)addr + len) & PAGE_MASK) ? 1 : 2;
+	err = get_user_pages_fast((unsigned long)addr, npages, 1, pages);
+	if (err < 0)
+		return err;
+	err = 0;
+	mutex_lock(&text_mutex);
+	bp_target_mm = current->mm;
+	bp_int3_addr = (u8 *)addr + 1;
+	__text_poke_bp(pages, 
+		       (unsigned long)addr & ~PAGE_MASK,
+		       insn, len, handler);
+	bp_target_mm = NULL;
+	mutex_unlock(&text_mutex);
+	put_page(pages[0]);
+	if (pages[1])
+		put_page(pages[1]);
+	return err;
+}
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index aabfb83..5d3cb72 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,4 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	text_poke		sys_text_poke
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..77e60f9 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	text_poke		sys_text_poke
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.8.3.1



* Re: [PATCH] Add a text_poke syscall
  2013-11-19  0:27 [PATCH] Add a text_poke syscall Andi Kleen
@ 2013-11-19  2:32 ` Andy Lutomirski
  2013-11-19 20:16   ` Andi Kleen
  2013-11-19  5:23 ` H. Peter Anvin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2013-11-19  2:32 UTC (permalink / raw)
  To: Andi Kleen, x86; +Cc: linux-kernel, Andi Kleen

On 11/18/2013 04:27 PM, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> [...]
> +	pages[1] = NULL;
> +	npages = ((unsigned long)addr & PAGE_MASK) == 
> +		(((unsigned long)addr + len) & PAGE_MASK) ? 1 : 2;

This is off by one, I think.  That should be addr + len - 1.

> +	err = get_user_pages_fast((unsigned long)addr, npages, 1, pages);
> +	if (err < 0)
> +		return err;
> +	err = 0;
> +	mutex_lock(&text_mutex);
> +	bp_target_mm = current->mm;
> +	bp_int3_addr = (u8 *)addr + 1;

Do you need an smp_wmb here?  (Maybe there's a strong enough barrier in
__text_poke_bp.)

It might pay to verify that the pages are executable, although I don't
see what the harm would be to poking at non-executable pages.

--Andy


* Re: [PATCH] Add a text_poke syscall
  2013-11-19  0:27 [PATCH] Add a text_poke syscall Andi Kleen
  2013-11-19  2:32 ` Andy Lutomirski
@ 2013-11-19  5:23 ` H. Peter Anvin
  2013-11-19 18:49   ` Andi Kleen
  2013-11-19  6:49 ` Ingo Molnar
  2013-11-21 18:26 ` Oleg Nesterov
  3 siblings, 1 reply; 17+ messages in thread
From: H. Peter Anvin @ 2013-11-19  5:23 UTC (permalink / raw)
  To: Andi Kleen, x86; +Cc: linux-kernel, Andi Kleen, Ingo Molnar, Thomas Gleixner

On 11/18/2013 04:27 PM, Andi Kleen wrote:
> 
> Proposed man page:
> 
> NAME
> 	text_poke - Safely modify running instructions (x86)
> 
> SYNOPSIS
> 	int text_poke(void *addr, const void *opcode, size_t len,
> 	              void (*handler)(void), int timeout);
> 
> DESCRIPTION
> 	[...]
> 

I think I would prefer an interface which took a list of patch points,
or implemented only the aspects which are impossible to do in user space.

All we really need in the kernel is the IPI broadcasts - the rest can be
done in user space, including intercepting SIGTRAP.  For userspace it is
probably best to just put a thread to sleep until the patching is
done, which can be done with a futex.

One advantage of doing this in userspace is that the kernel doesn't
have to be responsible for avoiding holding a thread due to a slightly
different SIGTRAP -- it will all come out after the signal handler is
restored, anyway.

That being said, the user space code would really need to be librarized.

	-hpa



* Re: [PATCH] Add a text_poke syscall
  2013-11-19  0:27 [PATCH] Add a text_poke syscall Andi Kleen
  2013-11-19  2:32 ` Andy Lutomirski
  2013-11-19  5:23 ` H. Peter Anvin
@ 2013-11-19  6:49 ` Ingo Molnar
  2013-11-19 19:10   ` Andi Kleen
  2013-11-21 18:26 ` Oleg Nesterov
  3 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2013-11-19  6:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Andi Kleen, Linus Torvalds, H. Peter Anvin,
	Thomas Gleixner, Andrew Morton


* Andi Kleen <andi@firstfloor.org> wrote:

> From: Andi Kleen <ak@linux.intel.com>
> 
> Properly patching running code ("cross modification")
> is a quite complicated business on x86.
> 
> The CPU has specific rules that need to be followed, including
> multiple global barriers.
> 
> Self modifying code is getting more popular, so it's important
> to make it easy to follow the rules.
> 
> The kernel does it properly with text_poke_bp(). But the same
> method is hard to do for user programs.
> 
> This patch adds a (x86 specific) text_poke() syscall that exposes
> the text_poke_bp() machinery to user programs.
> 
> The interface is practically the same as text_poke_bp, just as
> a syscall. I added an extra timeout parameter, that
> will potentially allow batching the global barriers in
> the future. Right now it is enforced to be 0.
> 
> The call also still has a global lock, so it has some scaling
> limitations. If it was commonly used this could be fixed
> by setting up a list of break point locations. Then
> a lock would only be hold to modify the list.
> 
> Right now the implementation is just as simple as possible.
> 
> Proposed man page:
> 
> NAME
> 	text_poke - Safely modify running instructions (x86)
> 
> SYNOPSYS
> 	int text_poke(void *addr, const void *opcode, size_t len,
> 	              void (*handler)(void), int timeout);
> 
> DESCRIPTION
> 	The text_poke system allows to safely modify code that may
> 	be currently executing in parallel on other threads.
> 	Patch the instruction at addr with the new instructions
> 	at opcode of length len. The target instruction will temporarily
> 	be patched with a break point, before it is replaced
> 	with the final replacement instruction. When the break point
> 	hits the code handler will be called in the context
> 	of the thread. The handler does not save any registers
> 	and cannot return. Typically it would consist of the
> 	original instruction and then a jump to after the original
> 	instruction. The handler is only needed during the
> 	patching process and can be overwritten once the syscall
> 	returns. timeout defines an optional timout to indicate
> 	to the kernel how long the patching could be delayed.
> 	Right now it has to be 0.
> 
> EXAMPLE
> 
> volatile int finished;
> 
> extern char patch[], recovery[], repl[];
> 
> struct res {
>         long total;
>         long val1, val2, handler;
> };
> 
> int text_poke(void *insn, void *repl, int len, void *handler, int to)
> {
>         return syscall(314, insn, repl, len, handler, to);
> }
> 
> void *tfunc(void *arg)
> {
>         struct res *res = (struct res *)arg;
> 
>         while (!finished) {
>                 int val;
>                 asm volatile(   ".globl patch\n"
>                                 ".globl recovery\n"
>                                 ".global repl\n"
> 				/* original code to be patched */
>                                 "patch: mov $1,%0\n"
>                                 "1:\n"
>                                 ".section \".text.patchup\",\"x\"\n"
> 				/* Called when a race happens during patching.
> 				   Just execute the original code and jump back. */
>                                 "recovery:\n"
>                                 " mov $3,%0\n"
>                                 " jmp 1b\n"
> 				/* replacement code that gets patched in: */
>                                 "repl:\n"
>                                 " mov $2,%0\n"
>                                 ".previous" : "=a" (val));
>                         if (val == 1)
>                                 res->val1++;
>                         else if (val == 3)
>                                 res->handler++;
>                         else
>                                 res->val2++;
>                         res->total++;
>         }
>         return NULL;
> }
> 
> int main(int ac, char **av)
> {
>         int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
>         int ps = sysconf(_SC_PAGESIZE);
>         pthread_t pthr[ncpus];
>         struct res res[ncpus];
>         int i;
> 
>         srand(1);
>         memset(&res, 0, sizeof(struct res) * ncpus);
>         mprotect(patch - (unsigned long)patch % ps, ps,
> 		 PROT_READ|PROT_WRITE|PROT_EXEC);
>         for (i = 0; i < ncpus - 1; i++)
>                 pthread_create(&pthr[i], NULL, tfunc, &res[i]);
>         for (i = 0; i < 500000; i++) {
>                 text_poke(patch, repl, 5, recovery, 0);
>                 nanosleep(&((struct timespec) { 0, rand() % 100 }), NULL);
>                 text_poke(repl, patch, 5, recovery, 0);
>         }
>         finished = 1;
>         for (i = 0; i < ncpus - 1; i++) {
>                 pthread_join(pthr[i], NULL);
>                 printf("%d: val1 %ld val2 %ld handler %ld total %ld\n",
>                                 i, res[i].val1, res[i].val2, res[i].handler,
> 				res[i].total);
>                 assert(res[i].val1 + res[i].val2 + res[i].handler
> 				== res[i].total);
>         }
>         return 0;
> }
> 
> RETURN VALUE
> 	On success, text_poke returns 0, otherwise -1 is returned
> 	and errno is set appropriately.
> 
> ERRORS
> 	EINVAL		len was too long
> 			timeout was an invalid value
> 	EFAULT		An error happened while accessing opcode
> 
> VERSIONS
> 	text_poke has been added with the Linux XXX kernel.
> 
> CONFORMING TO
> 	The call is Linux and x86 specific and should not be used
> 	in programs intended to be portable.
> ---
>  arch/x86/kernel/alternative.c    | 121 ++++++++++++++++++++++++++++++++-------
>  arch/x86/syscalls/syscall_32.tbl |   1 +
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  3 files changed, 102 insertions(+), 21 deletions(-)

A couple of observations:

1)

Documentation: as usual you hide information: please _explain_ in the 
changelog and in the manpage why self-modifying code is a 'complicated 
business' on x86 - it's not rocket science: on x86, in-flight 
speculative instructions which may correspond to the old, 
pre-modification state need to be flushed before code can be modified, 
and not even atomic ops (can) achieve this.

So x86 code has to do an at least two-step dance of adding a 
single-byte breakpoint, flushing instructions, then modifying the 
first byte and flushing instructions again.

The 'flushing instructions' has to happen on all CPUs that may execute 
that code region, to be safe. (The kernel code does a 3-step 
synchronization dance but that is paranoia.)

2)

Locking: why should kernel-space code modifications and user-space 
code modifications be synchronized by the same single system-global 
mutex (text_mutex)?

Also, why should possibly unrelated user-space be synchronized with 
each other when they do a flush?

3)

Design: more fundamentally, you don't explain the design: why is this 
architecture specific and why is it a new syscall?

In particular I'm somewhat sceptical about doing this as a separate 
syscall, because such Linux-only syscall specials tend to propagate to 
the right tools rather slowly - especially if it's an x86-only 
Linux-special syscall ...

If we want to do this then it could be shaped as a straightforward 
ptrace() extension: ptrace already has the concept of self-tracing 
(PTRACE_TRACEME), so adding PTRACE_POKETEXT with pid==0 (or a special 
flag to denote 'careful text self-modification') would achieve that, 
and would make it instantly available to tooling, without fragile 
syscall wrappers.

That would also allow other SMP architectures with speculative 
execution to implement such code modification helpers as well, by 
reusing the same new ptrace ABI.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Add a text_poke syscall
  2013-11-19  5:23 ` H. Peter Anvin
@ 2013-11-19 18:49   ` Andi Kleen
  2013-11-20 16:42     ` H. Peter Anvin
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2013-11-19 18:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, x86, linux-kernel, Andi Kleen, Ingo Molnar, Thomas Gleixner

> I think I would prefer an interface which took a list of patch points,
> or implemented only the aspects which are impossible to do in user space.

We already have all the code. Why not just use it?

Note I'm not adding any new mechanism, just exporting the existing one.
So the usual "do things in user space" arguments do not really
apply here.

Also this is subtle enough that there is definitely benefit from
having only a single canonical piece of code that does it.

If we ever need any new errata workarounds for this they could also
all be done in a single central place.

> 
> All we really need in the kernel is the IPI broadcasts - the rest can be
> done in user space, including intercepting SIGTRAP.  For userspace it is
> probably the best to just put a thread to sleep until the patching is
> done, which can be done with a futex.

I'm not sure that's worth it. IPIs are reasonably fast (a few thousand cycles).
Sleeping likely only becomes beneficial with much longer delays, like
ms. But if the IPIs start taking ms we have much bigger problems.

> One advantage with doing this in userspace is that the kernel doesn't
> have to be responsible avoiding holding a thread due to a slightly
> different SIGTRAP -- it will all come out after the signal handler is
> restored, anyway.

It's just some spinning, not a new task state. I don't think any
task states make sense here.

-Andi


* Re: [PATCH] Add a text_poke syscall
  2013-11-19  6:49 ` Ingo Molnar
@ 2013-11-19 19:10   ` Andi Kleen
  2013-11-21 13:07     ` Steven Rostedt
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2013-11-19 19:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, Andi Kleen, Linus Torvalds,
	H. Peter Anvin, Thomas Gleixner, Andrew Morton

> Design: more fundamentally, you don't explain the design: why is this 
> architecture specific and why is it a new syscall?
> 
> In particular I'm somewhat sceptical about doing this as a separate 
> syscall, because such Linux-only syscall specials tend to propagate to 
> the right tools rather slowly - especially if it's an x86-only 
> Linux-special syscall ...

Self modifying code is by definition x86 specific.

Very likely any other method to do this would be Linux specific
in some form or another, as there doesn't seem to be any
existing prior art.
> 
> If we want to do this then it could be shaped as a straightforward 
> ptrace() extension: ptrace already has the concept of self-tracing 
> (PTRACE_TRACEME), so adding PTRACE_POKETEXT with pid==0 (or a special 
> flag to denote 'careful text self-modification') would achieve that, 
> and would make it instantly available to tooling, without fragile 
> syscall wrappers.

Hmm, my impression was that ptrace would always stop, but yes
that TRACEME thing would be possible. Of course once you 
do PTRACE_TRACEME the debugger won't work anymore. But there's no
reason for text_poke to exclude debuggers.  That would seem
like a major disadvantage to me.

My personal feeling is that ptrace is already very complicated,
hard to understand and best left alone. So I preferred to 
do it in a clean separate system call.

But if people really think ptrace is somehow better it would
be a possibility.

Is the motivation here just to limit the number of syscalls, 
or some other reason too?

> That would also allow other SMP architectures with speculative 
> execution to implement such code modification helpers as well, by 
> reusing the same new ptrace ABI.

They could always add text_poke too. But their requirements
may be different (for example they may or may not need the 
handler) and any code using it would be necessarily
architecture specific anyways.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] Add a text_poke syscall
  2013-11-19  2:32 ` Andy Lutomirski
@ 2013-11-19 20:16   ` Andi Kleen
  2013-11-21 10:02     ` Jiri Kosina
  0 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2013-11-19 20:16 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Andi Kleen, x86, linux-kernel, Andi Kleen

> > +	pages[1] = NULL;
> > +	npages = ((unsigned long)addr & PAGE_MASK) == 
> > +		(((unsigned long)addr + len) & PAGE_MASK) ? 1 : 2;
> 
> This is off by one, I think.  That should be addr + len - 1.

Thanks fixed.

> 
> > +	err = get_user_pages_fast((unsigned long)addr, npages, 1, pages);
> > +	if (err < 0)
> > +		return err;
> > +	err = 0;
> > +	mutex_lock(&text_mutex);
> > +	bp_target_mm = current->mm;
> > +	bp_int3_addr = (u8 *)addr + 1;
> 
> Do you need an smp_wmb here?  (Maybe there's a strong enough barrier in
> __text_poke_bp.)

__text_poke_bp already has enough barriers (although I don't
think they are really needed in any case).

-Andi


* Re: [PATCH] Add a text_poke syscall
  2013-11-19 18:49   ` Andi Kleen
@ 2013-11-20 16:42     ` H. Peter Anvin
  2013-11-20 17:25       ` Andi Kleen
  2013-11-21 12:49       ` Steven Rostedt
  0 siblings, 2 replies; 17+ messages in thread
From: H. Peter Anvin @ 2013-11-20 16:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, Andi Kleen, Ingo Molnar, Thomas Gleixner

On 11/19/2013 10:49 AM, Andi Kleen wrote:
> 
> We already have all the code. Why not just use it?
> 

We're talking user space here, which has different requirements (COW,
memory protection, ...) which means it is not really the same code.  You
can't take a page fault while patching the kernel.

> Note I'm not adding any new mechanism, just exporting the existing one.
> So the usual "do things in user space" arguments do not really
> apply here.
> 
> Also this is subtle enough that there is definitely benefit from
> having only a single canonical code that does it.
> 
> If we ever need any new errata workarounds for this they could be also
> all done in a single central place.
> 
>>
>> All we really need in the kernel is the IPI broadcasts - the rest can be
>> done in user space, including intercepting SIGTRAP.  For userspace it is
>> probably the best to just put a thread to sleep until the patching is
>> done, which can be done with a futex.
> 
> I'm not sure that's worth it. IPIs are reasonably fast (a few 1000s cycles).
> Sleeping likely only becomes beneficial with much longer delays, like
> ms. But if the IPIs start taking ms we have much more problems.

I'm referring to if some thread actually stumbles over INT 3, which is
indeed not very long for one patch site (as long as you don't end up
with page faults.)  However, for tracing, you may want to do tens of
thousands of patches, and you really want to batch them.

>> One advantage with doing this in userspace is that the kernel doesn't
>> have to be responsible avoiding holding a thread due to a slightly
>> different SIGTRAP -- it will all come out after the signal handler is
>> restored, anyway.
> 
> It's just some spinning, not a new task state. I don't think any
> task states make sense here.

I'd rather spin in user space, though.

	-hpa



* Re: [PATCH] Add a text_poke syscall
  2013-11-20 16:42     ` H. Peter Anvin
@ 2013-11-20 17:25       ` Andi Kleen
  2013-11-20 18:30         ` H. Peter Anvin
  2013-11-20 21:00         ` H. Peter Anvin
  2013-11-21 12:49       ` Steven Rostedt
  1 sibling, 2 replies; 17+ messages in thread
From: Andi Kleen @ 2013-11-20 17:25 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: x86, linux-kernel, Ingo Molnar, Thomas Gleixner

"H. Peter Anvin" <hpa@zytor.com> writes:
>
> We're talking user space here, which has different requirement (COW,
> memory protection, ...) which means it is not really the same code.  You
> can't take a page fault while patching the kernel.

The syscall pins the page, then it calls the kernel code.
There are no page faults during patching.

> I'm referring to if some thread actually stumbles over INT 3, which is
> indeed not very long for one patch site (as long as you don't end up
> with page faults.)  However, for tracing, you may want to do tens of
> thousands of patches, and you really want to batch them.

For tens of thousands of patches you very likely don't want 
live patching, but a stop-everything approach.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH] Add a text_poke syscall
  2013-11-20 17:25       ` Andi Kleen
@ 2013-11-20 18:30         ` H. Peter Anvin
  2013-11-20 21:00         ` H. Peter Anvin
  1 sibling, 0 replies; 17+ messages in thread
From: H. Peter Anvin @ 2013-11-20 18:30 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, Ingo Molnar, Thomas Gleixner

Why? Live patching works fine and lets the other threads run along just fine.  We do this for ftrace already...

Andi Kleen <andi@firstfloor.org> wrote:
>"H. Peter Anvin" <hpa@zytor.com> writes:
>>
>> We're talking user space here, which has different requirement (COW,
>> memory protection, ...) which means it is not really the same code. 
>You
>> can't take a page fault while patching the kernel.
>
>The syscall pins the page, then it calls the kernel code.
>There are no page faults during patching.
>
>> I'm referring to if some thread actually stumbles over INT 3, which
>is
>> indeed not very long for one patch site (as long as you don't end up
>> with page faults.)  However, for tracing, you may want to do tens of
>> thousands of patches, and you really want to batch them.
>
>For tens of thousand of patches you very likely don't want 
>live patching, but a stop everything approach.
>
>-Andi

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: [PATCH] Add a text_poke syscall
  2013-11-20 17:25       ` Andi Kleen
  2013-11-20 18:30         ` H. Peter Anvin
@ 2013-11-20 21:00         ` H. Peter Anvin
  1 sibling, 0 replies; 17+ messages in thread
From: H. Peter Anvin @ 2013-11-20 21:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, linux-kernel, Ingo Molnar, Thomas Gleixner

On 11/20/2013 09:25 AM, Andi Kleen wrote:
> "H. Peter Anvin" <hpa@zytor.com> writes:
>>
>> We're talking user space here, which has different requirement (COW,
>> memory protection, ...) which means it is not really the same code.  You
>> can't take a page fault while patching the kernel.
> 
> The syscall pins the page, then it calls the kernel code.
> There are no page faults during patching.
> 

But you don't need to pin anything if you just do it in userspace.
Again, it is particularly poignant for large batched operations.

	-hpa



* Re: [PATCH] Add a text_poke syscall
  2013-11-19 20:16   ` Andi Kleen
@ 2013-11-21 10:02     ` Jiri Kosina
  0 siblings, 0 replies; 17+ messages in thread
From: Jiri Kosina @ 2013-11-21 10:02 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andy Lutomirski, x86, linux-kernel, Andi Kleen

On Tue, 19 Nov 2013, Andi Kleen wrote:

> > > +	err = get_user_pages_fast((unsigned long)addr, npages, 1, pages);
> > > +	if (err < 0)
> > > +		return err;
> > > +	err = 0;
> > > +	mutex_lock(&text_mutex);
> > > +	bp_target_mm = current->mm;
> > > +	bp_int3_addr = (u8 *)addr + 1;
> > 
> > Do you need an smp_wmb here?  (Maybe there's a strong enough barrier in
> > __text_poke_bp.)
> 
> __text_poke_bp already has enough barriers (although I don't
> think they are really needed in any case)

As an author of those barriers I have to ask -- why do you think so? The 
first one is there to avoid int3 handler seeing !patching_in_progress 
while it has actually been entered due to text_poke_bp() kicking in.

The second one is there for exactly the inverse reason.

Both are pairing to smp_rmb() in poke_int3_handler().

Thanks,

-- 
Jiri Kosina
SUSE Labs



* Re: [PATCH] Add a text_poke syscall
  2013-11-20 16:42     ` H. Peter Anvin
  2013-11-20 17:25       ` Andi Kleen
@ 2013-11-21 12:49       ` Steven Rostedt
  1 sibling, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2013-11-21 12:49 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, x86, linux-kernel, Andi Kleen, Ingo Molnar, Thomas Gleixner

On Wed, Nov 20, 2013 at 08:42:03AM -0800, H. Peter Anvin wrote:
> We're talking user space here, which has different requirement (COW,
> memory protection, ...) which means it is not really the same code.  You
> can't take a page fault while patching the kernel.

Well, technically you can take a page fault while patching the kernel. It
happens all the time when patching modules (vmalloc'd).

But I get your point. You don't have to worry about COW or being paged
out. "page fault" is just too general of a term. And yes, I'm being anal ;-)

-- Steve



* Re: [PATCH] Add a text_poke syscall
  2013-11-19 19:10   ` Andi Kleen
@ 2013-11-21 13:07     ` Steven Rostedt
  2013-11-22  3:26       ` Andi Kleen
  0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2013-11-21 13:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, linux-kernel, Andi Kleen, Linus Torvalds,
	H. Peter Anvin, Thomas Gleixner, Andrew Morton

On Tue, Nov 19, 2013 at 08:10:34PM +0100, Andi Kleen wrote:
> 
> Self modifying code is by definition x86 specific.

How so? Ftrace works on several archs, and does so with self modifying code.

> 
> Very likely any other method to do this would be Linux specific
> in some form or another, as there doesn't seem to be any
> existing art.

Seems you need to solve a catch-22 on this. You need to show that there's a
need (an application) that requires this new feature, and also show that
all the existing kernel features are insufficient for what is needed.

The catch-22 may be that an application won't be created without the feature,
but that's the problem you need to solve.

Personally, I'd rather see the userspace patching be completely separate from
the kernel space patching, as the kernel space patching must not be
complicated by any requirement of userspace patching.

I don't see why userspace needs anything more than a CPU sync call to do
the required sync. Add a breakpoint, sync, change the code, sync, remove
the breakpoint, done!

And as a bonus, doing this in userspace could easily extend it. Say you want
to do special things with that breakpoint handler.

-- Steve


* Re: [PATCH] Add a text_poke syscall
  2013-11-19  0:27 [PATCH] Add a text_poke syscall Andi Kleen
                   ` (2 preceding siblings ...)
  2013-11-19  6:49 ` Ingo Molnar
@ 2013-11-21 18:26 ` Oleg Nesterov
  2013-11-22  3:29   ` Andi Kleen
  3 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2013-11-21 18:26 UTC (permalink / raw)
  To: Andi Kleen, Ingo Molnar; +Cc: x86, linux-kernel, Andi Kleen

Can't really comment on this patch, just a couple of nits...

On 11/18, Andi Kleen wrote:
>
>  int poke_int3_handler(struct pt_regs *regs)
>  {
> @@ -602,6 +619,14 @@ int poke_int3_handler(struct pt_regs *regs)
>  	if (likely(!bp_patching_in_progress))
>  		return 0;
>  
> +	if (user_mode_vm(regs) &&
> +		bp_target_mm &&
> +		current->mm == bp_target_mm &&
> +		regs->ip == (unsigned long)bp_int3_addr) {
> +		regs->ip = (unsigned long) bp_int3_handler;
> +		return 1;
> +	}
> +
>  	if (user_mode_vm(regs) || regs->ip != (unsigned long)bp_int3_addr)
>  		return 0;

This looks a bit confusing, afaics we can avoid the duplications:

	// also handles bp_target_mm == NULL case
	if (user_mode_vm(regs) && current->mm != bp_target_mm)
		return 0;

	if (regs->ip != (unsigned long)bp_int3_addr)
		return 0;

	/* set up the specified breakpoint handler */
	regs->ip = (unsigned long) bp_int3_handler;

	return 1;

> +SYSCALL_DEFINE5(text_poke,
> +		__user void *, addr,
> +		const void *, opcode, 
> +		size_t, len,
> +		void *, handler,
> +		int, timeout)
> +{
> +	struct page *pages[2];
> +	char insn[MAX_INSN_LEN];
> +	int err, npages;
> +
> +	if ((unsigned long)len > MAX_INSN_LEN || len == 0)
> +		return -EINVAL;
> +	/* For future extension */
> +	if (timeout != 0)
> +		return -EINVAL;
> +	if (copy_from_user(insn, opcode, len))
> +		return -EFAULT;
> +	pages[1] = NULL;
> +	npages = ((unsigned long)addr & PAGE_MASK) == 
> +		(((unsigned long)addr + len) & PAGE_MASK) ? 1 : 2;
> +	err = get_user_pages_fast((unsigned long)addr, npages, 1, pages);
> +	if (err < 0)
> +		return err;
> +	err = 0;
> +	mutex_lock(&text_mutex);
> +	bp_target_mm = current->mm;
> +	bp_int3_addr = (u8 *)addr + 1;
> +	__text_poke_bp(pages, 
> +		       (unsigned long)addr & ~PAGE_MASK,
> +		       insn, len, handler);

This doesn't look right if npages == 2 but get_user_pages_fast() returns 1.

__text_poke() checks pages[1] != NULL, but in this case it assumes
that memcpy(vaddr, opcode, len) should fit into the 1st page.


Ingo, Andi, I do not think that it is good idea to implement this
via ptrace. If nothing else, you need to fork the tracer which can
do PTRACE_POKETEXT.

Please note that PTRACE_TRACEME does not mean self-tracing, it means
that current->real_parent becomes the tracer (imho this interface
should die but we obviously can't kill it). So I don't see how it
can help. And you can't ptrace yourself or a sub-thread; this is
explicitly forbidden by ptrace_attach()->same_thread_group() check.

And ptrace can't really help with CLONE_VM tasks.

Or I misunderstood the suggestion.

Oleg.



* Re: [PATCH] Add a text_poke syscall
  2013-11-21 13:07     ` Steven Rostedt
@ 2013-11-22  3:26       ` Andi Kleen
  0 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2013-11-22  3:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, Andi Kleen,
	Linus Torvalds, H. Peter Anvin, Thomas Gleixner, Andrew Morton

> Seems you need to solve a catch-22 on this. You need to show that there's a
> need (application), that requires this new feature and also show that all the

Patching code is quite common these days.

I'm just stating that.

> existing kernel features are inefficient for what is needed.

I believe it's possible to do it without kernel support, just hard and
ugly. 

So the idea was: the kernel already knows how to do it. It does it
in a (not perfect but reasonable) way. Let's just export it.

Will make everyone's life easier. 

It's a single place, so if any changes are needed they are all
centralized. If there are ever any new bugs in this area
they can be worked around in a single place.

It's too hard to leave it to user space programmers @-)

> 
> The catch-22 may be that a application wont be created without the feature,
> but that's the problem you need to solve.

Of course the applications get created, they just likely have extremely
subtle bugs if they don't quite conform to the official protocol.

Or they may suddenly break when running on some new CPUs which
has different requirements.

-Andi


* Re: [PATCH] Add a text_poke syscall
  2013-11-21 18:26 ` Oleg Nesterov
@ 2013-11-22  3:29   ` Andi Kleen
  0 siblings, 0 replies; 17+ messages in thread
From: Andi Kleen @ 2013-11-22  3:29 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Andi Kleen, Ingo Molnar, x86, linux-kernel, Andi Kleen

Thanks for the code review.

> This looks a bit confusing, afaics we can avoid the duplications:
> 
> 	// also handles bp_target_mm == NULL case
> 	if (user_mode_vm(regs) && current->mm != bp_target_mm)
> 		return 0;

I had that originally in my code, but I was worried there
might be threads with mm == NULL and user_mode_vm still true
(maybe some BIOS code or some such).

So I ended up with the longer, but safer, variant.


> > +	err = 0;
> > +	mutex_lock(&text_mutex);
> > +	bp_target_mm = current->mm;
> > +	bp_int3_addr = (u8 *)addr + 1;
> > +	__text_poke_bp(pages, 
> > +		       (unsigned long)addr & ~PAGE_MASK,
> > +		       insn, len, handler);
> 
> This doesn't look right if npages == 2 but get_user_pages_fast() returns 1.

True, i'll add an error out.

> 
> __text_poke() checks pages[1] != NULL, but in this case it assumes
> that memcpy(vaddr, opcode, len) should fit into the 1st page.
> 
> 
> Ingo, Andi, I do not think that it is good idea to implement this
> via ptrace. If nothing else, you need to fork the tracer which can
> do PTRACE_POKETEXT.

Yes I agree, ptrace is not the right way to do this.


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

