* [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key
@ 2018-12-21 10:27 Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 1/9] jump_label: Add for_each_label_entry helper Daniel Bristot de Oliveira
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

While tuning a system with CPUs isolated as much as possible, we've
noticed that isolated CPUs were receiving bursts of 12 IPIs, periodically.
Tracing the functions that emit IPIs, we saw chronyd - an unprivileged
process - generating the IPIs when changing a static key, enabling
network timestamping on sockets.

For instance, the trace showed:

# trace-cmd record --func-stack -p function -l smp_call_function_single -e irq_vectors -f 'vector == 251'...
# trace-cmd report...
[...]
         chronyd-858   [000]   433.286406: function:             smp_call_function_single
         chronyd-858   [000]   433.286408: kernel_stack:         <stack trace>
=> smp_call_function_many (ffffffffbc10ab22)
=> smp_call_function (ffffffffbc10abaa)
=> on_each_cpu (ffffffffbc10ac0b)
=> text_poke_bp (ffffffffbc02436a)
=> arch_jump_label_transform (ffffffffbc020ec8)
=> __jump_label_update (ffffffffbc1b300f)
=> jump_label_update (ffffffffbc1b30ed)
=> static_key_slow_inc (ffffffffbc1b334d)
=> net_enable_timestamp (ffffffffbc625c44)
=> sock_enable_timestamp (ffffffffbc6127d5)
=> sock_setsockopt (ffffffffbc612d2f)
=> SyS_setsockopt (ffffffffbc60d5e6)
=> tracesys (ffffffffbc7681c5)
          <idle>-0     [001]   433.286416: call_function_single_entry: vector=251
          <idle>-0     [001]   433.286419: call_function_single_exit: vector=251
  
[... The IPI takes place 12 times]

The static key in this case was the netstamp_needed_key. A static key
change from enabled->disabled/disabled->enabled causes the code to be
patched, and this is done in three steps:

-- Pseudo-code #1 - Current implementation ---
For each key to be updated:
	1) add an int3 trap to the address that will be patched
	    sync cores (send IPI to all other CPUs)
	2) update all but the first byte of the patched range
	    sync cores (send IPI to all other CPUs)
	3) replace the first byte (int3) by the first byte of replacing opcode 
	    sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

As the static key netstamp_needed_key has four entries (used in four
places in the code) in our kernel, 3 IPIs were generated for each
entry, resulting in 12 IPIs. The number of IPIs then is linear with
regard to the number 'n' of entries of a key: O(n*3), which is O(n).

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
which the thread that is enabling the key is _not_ running.

Rather than applying each update immediately, it is possible to queue all
updates first, and then apply them all at once, rewriting pseudo-code #1
in this way:

-- Pseudo-code #2 - proposal  ---
1)  for each key in the queue:
        add an int3 trap to the address that will be patched
    sync cores (send IPI to all other CPUs)

2)  for each key in the queue:
        update all but the first byte of the patched range
    sync cores (send IPI to all other CPUs)

3)  for each key in the queue:
        replace the first byte (int3) by the first byte of replacing opcode 
    sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - proposal  ---

Doing the update in this way, the number of IPIs becomes O(3) with regard
to the number of keys, which is O(1).
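
To make the counting concrete, here is a tiny user-space sketch (illustration
only, not kernel code; it just reproduces the figures quoted above) that prints
the sync-core rounds of pseudo-code #1 versus pseudo-code #2:

-------------------------- %< ----------------------------
/* ipi_count.c - compare IPI rounds: per-entry vs. batched patching. */
#include <stdio.h>

/* Pseudo-code #1: three sync-core rounds for every entry of the key. */
static unsigned int rounds_per_entry(unsigned int nr_entries)
{
	return 3 * nr_entries;
}

/* Pseudo-code #2: three sync-core rounds for the whole queue. */
static unsigned int rounds_batched(unsigned int nr_entries)
{
	return nr_entries ? 3 : 0;
}

int main(void)
{
	/* 4 entries: netstamp_needed_key; 56 entries: sched_schedstats. */
	unsigned int sizes[] = { 4, 56 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%2u entries: %3u IPI rounds today, %u batched\n",
		       sizes[i], rounds_per_entry(sizes[i]),
		       rounds_batched(sizes[i]));
	return 0;
}
-------------------------- >% ----------------------------

For the 4-entry netstamp_needed_key it prints 12 vs. 3 rounds; for the 56-entry
sched_schedstats key, 168 vs. 3.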

Currently, the jump label of a static key is transformed via the arch
specific function:

    void arch_jump_label_transform(struct jump_entry *entry,
                                   enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has the
same arguments as arch_jump_label_transform():

    void arch_jump_label_transform_queue(struct jump_entry *entry,
                         enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated.

After queuing all jump_entries, the function:
  
    void arch_jump_label_transform_apply(void)

Applies the changes in the queue.
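
For reference, a minimal sketch of how a caller is expected to drive this pair
of functions (a user-space model with made-up names and a tiny queue; the real
generic loop, including the overflow handling, is in patch 8 of this series):

-------------------------- %< ----------------------------
/* queue_apply.c - user-space model of the queue/apply pattern. */
#include <stdio.h>

#define QUEUE_MAX 4	/* the real queue is one page of text_to_poke entries */

static unsigned long queue[QUEUE_MAX];
static unsigned int queue_nr;
static unsigned int batches;	/* each apply costs one burst of sync-core IPIs */

/* Model of arch_jump_label_transform_queue(): returns 0 when the queue is full. */
static int transform_queue(unsigned long addr)
{
	if (queue_nr == QUEUE_MAX)
		return 0;
	queue[queue_nr++] = addr;
	return 1;
}

/* Model of arch_jump_label_transform_apply(): patch and drain the queue. */
static void transform_apply(void)
{
	if (!queue_nr)
		return;
	batches++;
	queue_nr = 0;
}

int main(void)
{
	unsigned long entry;

	for (entry = 0; entry < 10; entry++) {
		if (transform_queue(entry))
			continue;
		/* Queue overflow: apply what is queued so far, then retry. */
		transform_apply();
		transform_queue(entry);
	}
	transform_apply();
	printf("10 entries patched in %u batches\n", batches);
	return 0;
}
-------------------------- >% ----------------------------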

One easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ---------------------------- 
#!/bin/bash

while [ true ]; do 
    sysctl -w kernel.sched_schedstats=1
    sleep 2
    sysctl -w kernel.sched_schedstats=0
    sleep 2
done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ---------------------------- 
# watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see roughly 168 IPIs every 2 seconds,
while with the batch mode the number of IPIs drops to 3 every 2 seconds.

Regarding the performance impact of this patch set, I made two measurements:

    The time to update a key (the task that is causing the change)
    The time to run the int3 handler (the side effect on a thread that
                                      hits the code being changed)

The sched_schedstats static key was chosen as the key to be switched on and
off. The reason is that it is used in more than 56 places, in a hot path.
The change in the schedstats static key will be done with the following
command:

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    usleep 500000
    sysctl -w kernel.sched_schedstats=0
    usleep 500000
done

In this way, the key will be updated twice per second. To force the hit of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two nodes/24 CPUs box with an Intel Xeon processor
@2.27GHz.

Regarding the update part, on average, the regular kernel takes 57 ms to update
the static key, while the kernel with the batch updates takes just 1.4 ms on
average. Although it seems too good to be true, it makes sense: the
sched_schedstats key is used in 56 places, so the current implementation was
expected to take around 56 times longer, as the IPIs are the most expensive
part of the update.

Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, that seems like a high
value. But it is not, considering that it is handling 56 updates rather than
one: it takes only around four times longer. This is possible because the
handler does a binary search in the vector: a lookup costs about log2(56) = 5.8
comparisons, so an overhead within roughly four times was expected.

(voice of tv propaganda) But, that is not all! As the int3 handler stays active
for a shorter period (because the update part takes less time), the number of
hits in the int3 handler decreased by 10%.

The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?

Considering that, in this test case, we are saving the handling of 53 IPIs,
each of which takes more than these 135 ns, it seems a meager price to pay.
Moreover, the test case was forcing hits of the int3 handler; in practice,
they do not happen that often, while the IPI takes place on all CPUs whether
they hit the int3 handler or not.

For instance, on an isolated CPU with a process running in user space
(nohz_full use-case), the chances of hitting the int3 handler are close to zero,
while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
this scenario a lot.

Changes from v2:
  - Switch the order of patches 8 and 9 (Steven Rostedt)
  - Fallback in the case of failure in the batch mode (Steven Rostedt)
  - Fix a pointer in patch 7 (Steven Rostedt)
  - Adjust the commit log of the patch 1 (Borislav Petkov)
  - Adjust the commit log of the patch 3 (Jiri Kosina/Thomas Gleixner)
Changes from v1:
  - Split the patch in jump-label/x86-jump-label/alternative (Jason Baron)
  - Use bserach in the int3 handler (Steven Rostedt)
  - Use a pre-allocated page to store the vector (Jason/Steven)
  - Do performance tests in the int3 handler (peterz)

Daniel Bristot de Oliveira (9):
  jump_label: Add for_each_label_entry helper
  jump_label: Add the jump_label_can_update_check() helper
  x86/jump_label: Move checking code away from __jump_label_transform()
  x86/jump_label: Add __jump_label_set_jump_code() helper
  x86/alternative: Split text_poke_bp() into three steps
  jump_label: Sort entries of the same key by the code
  x86/alternative: Batch of patch operations
  jump_label: Batch updates if arch supports it
  x86/jump_label: Batch jump label updates

 arch/x86/include/asm/jump_label.h    |   2 +
 arch/x86/include/asm/text-patching.h |  15 +++
 arch/x86/kernel/alternative.c        | 140 ++++++++++++++++++---
 arch/x86/kernel/jump_label.c         | 174 ++++++++++++++++++++++-----
 include/linux/jump_label.h           |   6 +
 kernel/jump_label.c                  |  73 +++++++++--
 6 files changed, 359 insertions(+), 51 deletions(-)

-- 
2.17.1



* [PATCH V3 1/9] jump_label: Add for_each_label_entry helper
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-04-19 18:36   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 2/9] jump_label: Add the jump_label_can_update_check() helper Daniel Bristot de Oliveira
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Add a helper macro to make jump entry iteration code more readable.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/jump_label.h | 3 +++
 kernel/jump_label.c        | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 5df6a621e464..c88d903befc0 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -231,6 +231,9 @@ extern void static_key_disable(struct static_key *key);
 extern void static_key_enable_cpuslocked(struct static_key *key);
 extern void static_key_disable_cpuslocked(struct static_key *key);
 
+#define for_each_label_entry(key, entry, stop)				  \
+	for (; (entry < stop) && (jump_entry_key(entry) == key); entry++)
+
 /*
  * We should be using ATOMIC_INIT() for initializing .enabled, but
  * the inclusion of atomic.h is problematic for inclusion of jump_label.h
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index b28028b08d44..65965eb1cf05 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -381,7 +381,7 @@ static void __jump_label_update(struct static_key *key,
 				struct jump_entry *stop,
 				bool init)
 {
-	for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
+	for_each_label_entry(key, entry, stop) {
 		/*
 		 * An entry->code of 0 indicates an entry which has been
 		 * disabled because it was in an init text area.
-- 
2.17.1



* [PATCH V3 2/9] jump_label: Add the jump_label_can_update_check() helper
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 1/9] jump_label: Add for_each_label_entry helper Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-04-19 18:37   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 3/9] x86/jump_label: Move checking code away from __jump_label_transform() Daniel Bristot de Oliveira
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Move the check of whether a jump_entry is valid to a specific function. No
functional change.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/jump_label.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 65965eb1cf05..26bf54251218 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -376,22 +376,32 @@ static enum jump_label_type jump_label_type(struct jump_entry *entry)
 	return enabled ^ branch;
 }
 
+bool jump_label_can_update_check(struct jump_entry *entry, bool init)
+{
+	/*
+	 * An entry->code of 0 indicates an entry which has been
+	 * disabled because it was in an init text area.
+	 */
+	if (init || !jump_entry_is_init(entry)) {
+		if (!kernel_text_address(jump_entry_code(entry))) {
+			WARN_ONCE(1, "can't patch jump_label at %pS",
+				  (void *)jump_entry_code(entry));
+			return 0;
+		}
+		return 1;
+	}
+	return 0;
+}
+
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop,
 				bool init)
 {
 	for_each_label_entry(key, entry, stop) {
-		/*
-		 * An entry->code of 0 indicates an entry which has been
-		 * disabled because it was in an init text area.
-		 */
-		if (init || !jump_entry_is_init(entry)) {
-			if (kernel_text_address(jump_entry_code(entry)))
-				arch_jump_label_transform(entry, jump_label_type(entry));
-			else
-				WARN_ONCE(1, "can't patch jump_label at %pS",
-					  (void *)jump_entry_code(entry));
+		if (jump_label_can_update_check(entry, init)) {
+			arch_jump_label_transform(entry,
+						  jump_label_type(entry));
 		}
 	}
 }
-- 
2.17.1



* [PATCH V3 3/9] x86/jump_label: Move checking code away from __jump_label_transform()
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 1/9] jump_label: Add for_each_label_entry helper Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 2/9] jump_label: Add the jump_label_can_update_check() helper Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-04-19 18:38   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 4/9] x86/jump_label: Add __jump_label_set_jump_code() helper Daniel Bristot de Oliveira
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Move the check of the current code, before updating an entry, to specialized
functions. No changes in the method, only code relocation.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/jump_label.c | 60 +++++++++++++++++++++++++-----------
 1 file changed, 42 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index aac0c1f7e354..7894080bd02f 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,16 +37,53 @@ static void bug_at(unsigned char *ip, int line)
 	BUG();
 }
 
+static inline void __jump_label_trans_check_enable(struct jump_entry *entry,
+						   enum jump_label_type type,
+						   const unsigned char *ideal_nop,
+						   int init)
+{
+	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
+	const void *expect;
+	int line;
+
+	if (init) {
+		expect = default_nop; line = __LINE__;
+	} else {
+		expect = ideal_nop; line = __LINE__;
+	}
+
+	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
+		bug_at((void *)jump_entry_code(entry), line);
+}
+
+static inline void __jump_label_trans_check_disable(struct jump_entry *entry,
+						    enum jump_label_type type,
+						    union jump_code_union *jmp,
+						    int init)
+{
+	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
+	const void *expect;
+	int line;
+
+	if (init) {
+		expect = default_nop; line = __LINE__;
+	} else {
+		expect = jmp->code; line = __LINE__;
+	}
+
+	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
+		bug_at((void *)jump_entry_code(entry), line);
+}
+
+
 static void __ref __jump_label_transform(struct jump_entry *entry,
 					 enum jump_label_type type,
 					 void *(*poker)(void *, const void *, size_t),
 					 int init)
 {
 	union jump_code_union jmp;
-	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
 	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-	const void *expect, *code;
-	int line;
+	const void *code;
 
 	jmp.jump = 0xe9;
 	jmp.offset = jump_entry_target(entry) -
@@ -56,26 +93,13 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 		poker = text_poke_early;
 
 	if (type == JUMP_LABEL_JMP) {
-		if (init) {
-			expect = default_nop; line = __LINE__;
-		} else {
-			expect = ideal_nop; line = __LINE__;
-		}
-
+		__jump_label_trans_check_enable(entry, type, ideal_nop, init);
 		code = &jmp.code;
 	} else {
-		if (init) {
-			expect = default_nop; line = __LINE__;
-		} else {
-			expect = &jmp.code; line = __LINE__;
-		}
-
+		__jump_label_trans_check_disable(entry, type, &jmp, init);
 		code = ideal_nop;
 	}
 
-	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
-		bug_at((void *)jump_entry_code(entry), line);
-
 	/*
 	 * Make text_poke_bp() a default fallback poker.
 	 *
-- 
2.17.1



* [PATCH V3 4/9] x86/jump_label: Add __jump_label_set_jump_code() helper
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (2 preceding siblings ...)
  2018-12-21 10:27 ` [PATCH V3 3/9] x86/jump_label: Move checking code away from __jump_label_transform() Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-04-19 18:38   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 5/9] x86/alternative: Split text_poke_bp() into three steps Daniel Bristot de Oliveira
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Move the definition of the code to be written from
__jump_label_transform() to a specialized function. No change in the
method, code relocation only.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/jump_label.c | 38 +++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 7894080bd02f..a039ac4ca5bf 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -75,30 +75,36 @@ static inline void __jump_label_trans_check_disable(struct jump_entry *entry,
 		bug_at((void *)jump_entry_code(entry), line);
 }
 
+static void __jump_label_set_jump_code(struct jump_entry *entry,
+				       enum jump_label_type type,
+				       union jump_code_union *code,
+				       int init)
+{
+	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
+
+	code->jump = 0xe9;
+	code->offset = jump_entry_target(entry) -
+		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+
+	if (type == JUMP_LABEL_JMP) {
+		__jump_label_trans_check_enable(entry, type, ideal_nop, init);
+	} else {
+		__jump_label_trans_check_disable(entry, type, code, init);
+		memcpy(code, ideal_nop, JUMP_LABEL_NOP_SIZE);
+	}
+}
 
 static void __ref __jump_label_transform(struct jump_entry *entry,
 					 enum jump_label_type type,
 					 void *(*poker)(void *, const void *, size_t),
 					 int init)
 {
-	union jump_code_union jmp;
-	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-	const void *code;
-
-	jmp.jump = 0xe9;
-	jmp.offset = jump_entry_target(entry) -
-		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+	union jump_code_union code;
 
 	if (early_boot_irqs_disabled)
 		poker = text_poke_early;
 
-	if (type == JUMP_LABEL_JMP) {
-		__jump_label_trans_check_enable(entry, type, ideal_nop, init);
-		code = &jmp.code;
-	} else {
-		__jump_label_trans_check_disable(entry, type, &jmp, init);
-		code = ideal_nop;
-	}
+	__jump_label_set_jump_code(entry, type, &code, init);
 
 	/*
 	 * Make text_poke_bp() a default fallback poker.
@@ -109,12 +115,12 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 	 *
 	 */
 	if (poker) {
-		(*poker)((void *)jump_entry_code(entry), code,
+		(*poker)((void *)jump_entry_code(entry), &code,
 			 JUMP_LABEL_NOP_SIZE);
 		return;
 	}
 
-	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
+	text_poke_bp((void *)jump_entry_code(entry), &code, JUMP_LABEL_NOP_SIZE,
 		     (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 }
 
-- 
2.17.1



* [PATCH V3 5/9] x86/alternative: Split text_poke_bp() into three steps
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (3 preceding siblings ...)
  2018-12-21 10:27 ` [PATCH V3 4/9] x86/jump_label: Add __jump_label_set_jump_code() helper Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-04-19 18:39   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 6/9] jump_label: Sort entries of the same key by the code Daniel Bristot de Oliveira
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

text_poke_bp() updates instructions on live kernel on SMP in three steps:
 1) add an int3 trap to the address that will be patched
 2) update all but the first byte of the patched range
 3) replace the first byte (int3) by the first byte of replacing opcode

This patch creates one function for each of these steps.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/alternative.c | 38 +++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ebeac487a20c..6f5ad8587de0 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -767,6 +767,29 @@ int poke_int3_handler(struct pt_regs *regs)
 
 }
 
+static void text_poke_bp_set_handler(void *addr, void *handler,
+				     unsigned char int3)
+{
+	bp_int3_handler = handler;
+	bp_int3_addr = (u8 *)addr + sizeof(int3);
+	text_poke(addr, &int3, sizeof(int3));
+}
+
+static void patch_all_but_first_byte(void *addr, const void *opcode,
+				     size_t len, unsigned char int3)
+{
+	/* patch all but the first byte */
+	text_poke((char *)addr + sizeof(int3),
+		  (const char *) opcode + sizeof(int3),
+		  len - sizeof(int3));
+}
+
+static void patch_first_byte(void *addr, const void *opcode, unsigned char int3)
+{
+	/* patch the first byte */
+	text_poke(addr, opcode, sizeof(int3));
+}
+
 /**
  * text_poke_bp() -- update instructions on live kernel on SMP
  * @addr:	address to patch
@@ -791,27 +814,21 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 {
 	unsigned char int3 = 0xcc;
 
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
-	bp_patching_in_progress = true;
-
 	lockdep_assert_held(&text_mutex);
 
+	bp_patching_in_progress = true;
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
 	 * in_progress and handler are correctly ordered wrt. patching.
 	 */
 	smp_wmb();
 
-	text_poke(addr, &int3, sizeof(int3));
+	text_poke_bp_set_handler(addr, handler, int3);
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
 	if (len - sizeof(int3) > 0) {
-		/* patch all but the first byte */
-		text_poke((char *)addr + sizeof(int3),
-			  (const char *) opcode + sizeof(int3),
-			  len - sizeof(int3));
+		patch_all_but_first_byte(addr, opcode, len, int3);
 		/*
 		 * According to Intel, this core syncing is very likely
 		 * not necessary and we'd be safe even without it. But
@@ -820,8 +837,7 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 		on_each_cpu(do_sync_core, NULL, 1);
 	}
 
-	/* patch the first byte */
-	text_poke(addr, opcode, sizeof(int3));
+	patch_first_byte(addr, opcode, int3);
 
 	on_each_cpu(do_sync_core, NULL, 1);
 	/*
-- 
2.17.1



* [PATCH V3 6/9] jump_label: Sort entries of the same key by the code
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (4 preceding siblings ...)
  2018-12-21 10:27 ` [PATCH V3 5/9] x86/alternative: Split text_poke_bp() into three steps Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-04-19 18:40   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 7/9] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

In the batching mode, entries with the same key should also be sorted by the
code, enabling a bsearch() of a code/addr when updating a key.
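
For illustration, a user-space model of the resulting order (made-up values,
not the real struct jump_entry), mirroring the comparator added below:

-------------------------- %< ----------------------------
/* sort_entries.c - model of the key-then-code sort order. */
#include <stdio.h>
#include <stdlib.h>

struct entry {
	unsigned long key;
	unsigned long code;
};

static int entry_cmp(const void *a, const void *b)
{
	const struct entry *ea = a, *eb = b;

	/* Entries are sorted by key... */
	if (ea->key != eb->key)
		return ea->key < eb->key ? -1 : 1;

	/* ...and, within the same key, by code address. */
	if (ea->code != eb->code)
		return ea->code < eb->code ? -1 : 1;

	return 0;
}

int main(void)
{
	struct entry e[] = {
		{ .key = 2, .code = 0x30 }, { .key = 1, .code = 0x20 },
		{ .key = 2, .code = 0x10 }, { .key = 1, .code = 0x40 },
	};
	unsigned int i;

	qsort(e, sizeof(e) / sizeof(e[0]), sizeof(e[0]), entry_cmp);

	for (i = 0; i < sizeof(e) / sizeof(e[0]); i++)
		printf("key=%lu code=0x%02lx\n", e[i].key, e[i].code);
	return 0;
}
-------------------------- >% ----------------------------

All entries of key 1 come out before those of key 2, each group ordered by
code, which is what a later bsearch() by code address relies on.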

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/jump_label.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 26bf54251218..22093b5f57c9 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -38,12 +38,28 @@ static int jump_label_cmp(const void *a, const void *b)
 	const struct jump_entry *jea = a;
 	const struct jump_entry *jeb = b;
 
+	/*
+	 * Entries are sorted by key.
+	 */
 	if (jump_entry_key(jea) < jump_entry_key(jeb))
 		return -1;
 
 	if (jump_entry_key(jea) > jump_entry_key(jeb))
 		return 1;
 
+#ifdef HAVE_JUMP_LABEL_BATCH
+	/*
+	 * In the batching mode, entries should also be sorted by the code
+	 * inside the already sorted list of entries, enabling a bsearch in
+	 * the vector.
+	 */
+	if (jump_entry_code(jea) < jump_entry_code(jeb))
+		return -1;
+
+	if (jump_entry_code(jea) > jump_entry_code(jeb))
+		return 1;
+#endif
+
 	return 0;
 }
 
-- 
2.17.1



* [PATCH V3 7/9] x86/alternative: Batch of patch operations
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (5 preceding siblings ...)
  2018-12-21 10:27 ` [PATCH V3 6/9] jump_label: Sort entries of the same key by the code Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-01-23  5:15   ` Masami Hiramatsu
  2019-04-19 18:41   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 8/9] jump_label: Batch updates if arch supports it Daniel Bristot de Oliveira
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Currently, the patch of an address is done in three steps:

-- Pseudo-code #1 - Current implementation ---
        1) add an int3 trap to the address that will be patched
            sync cores (send IPI to all other CPUs)
        2) update all but the first byte of the patched range
            sync cores (send IPI to all other CPUs)
        3) replace the first byte (int3) by the first byte of replacing opcode
            sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

When a static key has more than one entry, these steps are called once for
each entry. The number of IPIs then is linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n).

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
which the thread that is enabling the key is _not_ running.

With this patch, rather than receiving a single patch to be processed, a vector
of patches is passed, enabling the rewrite of the pseudo-code #1 in this
way:

-- Pseudo-code #2 - This patch  ---
1)  for each patch in the vector:
        add an int3 trap to the address that will be patched

    sync cores (send IPI to all other CPUs)

2)  for each patch in the vector:
        update all but the first byte of the patched range

    sync cores (send IPI to all other CPUs)

3)  for each patch in the vector:
        replace the first byte (int3) by the first byte of replacing opcode

    sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch  ---

Doing the update in this way, the number of IPIs becomes O(3) with regard
to the number of keys, which is O(1).

The batch mode is implemented by the function text_poke_bp_batch(), which
receives two arguments: a vector of "struct text_to_poke" and the number of entries
in the vector.

The vector must be sorted by the addr field of the text_to_poke structure,
enabling the binary search of a handler in the poke_int3_handler function
(a fast path).
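
For illustration, a user-space model of that lookup (made-up addresses; the
real handler is poke_int3_handler() in this patch): because the vector is
sorted by addr, a trapped address can be mapped to its handler with bsearch():

-------------------------- %< ----------------------------
/* find_handler.c - model of the sorted-vector lookup done on int3. */
#include <stdio.h>
#include <stdlib.h>

struct poke {
	unsigned long addr;	/* address being patched */
	unsigned long handler;	/* where to resume if we trap on it */
};

static int poke_cmp(const void *key, const void *elt)
{
	unsigned long ip = *(const unsigned long *)key;
	const struct poke *p = elt;

	if (ip < p->addr)
		return -1;
	if (ip > p->addr)
		return 1;
	return 0;
}

int main(void)
{
	/* Must be sorted by addr, as text_poke_bp_batch() requires. */
	struct poke pokes[] = {
		{ 0x1000, 0x1005 }, { 0x2000, 0x2005 }, { 0x3000, 0x3005 },
	};
	unsigned long trapped_ip = 0x2000;
	struct poke *hit;

	hit = bsearch(&trapped_ip, pokes, sizeof(pokes) / sizeof(pokes[0]),
		      sizeof(pokes[0]), poke_cmp);
	if (hit)
		printf("resume at 0x%lx\n", hit->handler);
	return 0;
}
-------------------------- >% ----------------------------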

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/text-patching.h |  15 ++++
 arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
 2 files changed, 117 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..42ea7846df33 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_end	NULL
 #endif
 
+/*
+ * Currently, the max observed size in the kernel code is
+ * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
+ * Raise it if needed.
+ */
+#define POKE_MAX_OPCODE_SIZE	5
+
+struct text_to_poke {
+	void *handler;
+	void *addr;
+	size_t len;
+	const char opcode[POKE_MAX_OPCODE_SIZE];
+};
+
 extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
@@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
 extern int after_bootmem;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 6f5ad8587de0..8fa47e5ec709 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -21,6 +21,7 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 #include <asm/fixmap.h>
+#include <linux/bsearch.h>
 
 int __read_mostly alternatives_patched;
 
@@ -738,10 +739,32 @@ static void do_sync_core(void *info)
 }
 
 static bool bp_patching_in_progress;
+/*
+ * Single poke.
+ */
 static void *bp_int3_handler, *bp_int3_addr;
+/*
+ * Batching poke.
+ */
+static struct text_to_poke *bp_int3_tpv;
+static unsigned int bp_int3_tpv_nr;
+
+static int text_bp_batch_bsearch(const void *key, const void *elt)
+{
+	struct text_to_poke *tp = (struct text_to_poke *) elt;
+
+	if (key < tp->addr)
+		return -1;
+	if (key > tp->addr)
+		return 1;
+	return 0;
+}
 
 int poke_int3_handler(struct pt_regs *regs)
 {
+	void *ip;
+	struct text_to_poke *tp;
+
 	/*
 	 * Having observed our INT3 instruction, we now must observe
 	 * bp_patching_in_progress.
@@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
 	if (likely(!bp_patching_in_progress))
 		return 0;
 
-	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+	if (user_mode(regs))
 		return 0;
 
-	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	/*
+	 * Single poke first.
+	 */
+	if (bp_int3_addr) {
+		if (regs->ip == (unsigned long) bp_int3_addr) {
+			regs->ip = (unsigned long) bp_int3_handler;
+			return 1;
+		}
+		return 0;
+	}
 
-	return 1;
+	/*
+	 * Batch mode.
+	 */
+	if (bp_int3_tpv_nr) {
+		ip = (void *) regs->ip - sizeof(unsigned char);
+		tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
+			     sizeof(struct text_to_poke),
+			     text_bp_batch_bsearch);
+		if (tp) {
+			/* set up the specified breakpoint handler */
+			regs->ip = (unsigned long) tp->handler;
+			return 1;
+		}
+	}
 
+	return 0;
 }
 
 static void text_poke_bp_set_handler(void *addr, void *handler,
 				     unsigned char int3)
 {
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
 	text_poke(addr, &int3, sizeof(int3));
 }
 
@@ -816,6 +859,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 
 	lockdep_assert_held(&text_mutex);
 
+	bp_int3_handler = handler;
+	bp_int3_addr = (u8 *)addr + sizeof(int3);
+
 	bp_patching_in_progress = true;
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
@@ -846,6 +892,56 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 */
 	bp_patching_in_progress = false;
 
+	bp_int3_handler = bp_int3_addr = 0;
 	return addr;
 }
 
+void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
+{
+	unsigned int i;
+	unsigned char int3 = 0xcc;
+	int patched_all_but_first = 0;
+
+	bp_int3_tpv = tp;
+	bp_int3_tpv_nr = nr_entries;
+	bp_patching_in_progress = true;
+	/*
+	 * Corresponding read barrier in int3 notifier for making sure the
+	 * in_progress and handler are correctly ordered wrt. patching.
+	 */
+	smp_wmb();
+
+	for (i = 0; i < nr_entries; i++)
+		text_poke_bp_set_handler(tp[i].addr, tp[i].handler, int3);
+
+	on_each_cpu(do_sync_core, NULL, 1);
+
+	for (i = 0; i < nr_entries; i++) {
+		if (tp[i].len - sizeof(int3) > 0) {
+			patch_all_but_first_byte(tp[i].addr, tp[i].opcode,
+						 tp[i].len, int3);
+			patched_all_but_first++;
+		}
+	}
+
+	if (patched_all_but_first) {
+		/*
+		 * According to Intel, this core syncing is very likely
+		 * not necessary and we'd be safe even without it. But
+		 * better safe than sorry (plus there's not only Intel).
+		 */
+		on_each_cpu(do_sync_core, NULL, 1);
+	}
+
+	for (i = 0; i < nr_entries; i++)
+		patch_first_byte(tp[i].addr, tp[i].opcode, int3);
+
+	on_each_cpu(do_sync_core, NULL, 1);
+	/*
+	 * sync_core() implies an smp_mb() and orders this store against
+	 * the writing of the new instruction.
+	 */
+	bp_int3_tpv_nr = 0;
+	bp_int3_tpv = NULL;
+	bp_patching_in_progress = false;
+}
-- 
2.17.1



* [PATCH V3 8/9] jump_label: Batch updates if arch supports it
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (6 preceding siblings ...)
  2018-12-21 10:27 ` [PATCH V3 7/9] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-04-19 18:42   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2018-12-21 10:27 ` [PATCH V3 9/9] x86/jump_label: Batch jump label updates Daniel Bristot de Oliveira
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

If the architecture supports the batching of jump label updates, use it!

An easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ----------------------------
#!/bin/bash

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    sleep 2
    sysctl -w kernel.sched_schedstats=0
    sleep 2
done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
# watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see roughly 168 IPIs every 2 seconds,
while with this patch the number of IPIs drops to 3 every 2 seconds.

Regarding the performance impact of this patch set, I made two measurements:

    The time to update a key (the task that is causing the change)
    The time to run the int3 handler (the side effect on a thread that
                                      hits the code being changed)

The schedstats static key was chosen as the key to be switched on and off. The
reason is that it is used in more than 56 places, in a hot path. The
change in the schedstats static key will be done with the following command:

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    usleep 500000
    sysctl -w kernel.sched_schedstats=0
    usleep 500000
done

In this way, the key will be updated twice per second. To force the hit of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two nodes/24 CPUs box with an Intel Xeon processor
@2.27GHz.

Regarding the update part, on average, the regular kernel takes 57 ms to update
the schedstats key, while the kernel with the batch updates takes just 1.4 ms
on average. Although it seems too good to be true, it makes sense: the
schedstats key is used in 56 places, so the current implementation was expected
to take around 56 times longer, as the IPIs are the most expensive part of the
update.

Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, that seems like a high
value. But it is not, considering that it is handling 56 updates rather than
one: it takes only around four times longer. This is possible because the
handler does a binary search in the vector: a lookup costs about log2(56) = 5.8
comparisons, so an overhead within roughly four times was expected.

(voice of tv propaganda) But, that is not all! As the int3 handler stays active
for a shorter period (because the update part takes less time), the number of
hits in the int3 handler decreased by 10%.

The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?

Considering that, in this test case, we are saving the handling of 53 IPIs,
each of which takes more than these 135 ns, it seems a meager price to pay.
Moreover, the test case was forcing hits of the int3 handler; in practice,
they do not happen that often, while the IPI takes place on all CPUs whether
they hit the int3 handler or not.

For instance, on an isolated CPU with a process running in user space
(nohz_full use-case), the chances of hitting the int3 handler are close to zero,
while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
this scenario a lot.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/jump_label.h |  3 +++
 kernel/jump_label.c        | 29 +++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index c88d903befc0..ed51ef3e1abd 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -219,6 +219,9 @@ extern void arch_jump_label_transform(struct jump_entry *entry,
 				      enum jump_label_type type);
 extern void arch_jump_label_transform_static(struct jump_entry *entry,
 					     enum jump_label_type type);
+extern int arch_jump_label_transform_queue(struct jump_entry *entry,
+					    enum jump_label_type type);
+extern void arch_jump_label_transform_apply(void);
 extern int jump_label_text_reserved(void *start, void *end);
 extern void static_key_slow_inc(struct static_key *key);
 extern void static_key_slow_dec(struct static_key *key);
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 22093b5f57c9..08ace142af0a 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -409,6 +409,7 @@ bool jump_label_can_update_check(struct jump_entry *entry, bool init)
 	return 0;
 }
 
+#ifndef HAVE_JUMP_LABEL_BATCH
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop,
@@ -421,6 +422,34 @@ static void __jump_label_update(struct static_key *key,
 		}
 	}
 }
+#else
+static void __jump_label_update(struct static_key *key,
+				struct jump_entry *entry,
+				struct jump_entry *stop,
+				bool init)
+{
+	for_each_label_entry(key, entry, stop) {
+
+		if (!jump_label_can_update_check(entry, init))
+			continue;
+
+		if (arch_jump_label_transform_queue(entry,
+						    jump_label_type(entry)))
+			continue;
+
+		/*
+		 * Queue's overflow: Apply the current queue, and then
+		 * queue again. If it is still not possible to queue, BUG!
+		 */
+		arch_jump_label_transform_apply();
+		if (!arch_jump_label_transform_queue(entry,
+						     jump_label_type(entry))) {
+			BUG();
+		}
+	}
+	arch_jump_label_transform_apply();
+}
+#endif
 
 void __init jump_label_init(void)
 {
-- 
2.17.1



* [PATCH V3 9/9] x86/jump_label: Batch jump label updates
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (7 preceding siblings ...)
  2018-12-21 10:27 ` [PATCH V3 8/9] jump_label: Batch updates if arch supports it Daniel Bristot de Oliveira
@ 2018-12-21 10:27 ` Daniel Bristot de Oliveira
  2019-04-19 18:42   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  2019-01-21 12:52 ` [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
  2019-04-19 19:01 ` Ingo Molnar
  10 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2018-12-21 10:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Currently, the jump label of a static key is transformed via the arch
specific function:

    void arch_jump_label_transform(struct jump_entry *entry,
                                   enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has the
same arguments as arch_jump_label_transform():

    void arch_jump_label_transform_queue(struct jump_entry *entry,
					 enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated. This function returns 1 in the case of a
successful enqueue of an entry. If it returns 0, the caller must
apply the queue and then try to queue again, for instance, because the
queue is full.

This function expects the caller to sort the entries by address before
enqueueing them. This is already done by the arch-independent code, though.

After queuing all jump_entries, the function:

    void arch_jump_label_transform_apply(void)

Applies the changes in the queue.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/jump_label.h |  2 +
 arch/x86/kernel/jump_label.c      | 88 +++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index a5fb34fe56a4..39f908e8c0d1 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_JUMP_LABEL_H
 #define _ASM_X86_JUMP_LABEL_H
 
+#define HAVE_JUMP_LABEL_BATCH
+
 #define JUMP_LABEL_NOP_SIZE 5
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index a039ac4ca5bf..633c103412e1 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -15,6 +15,7 @@
 #include <asm/kprobes.h>
 #include <asm/alternative.h>
 #include <asm/text-patching.h>
+#include <linux/slab.h>
 
 #ifdef HAVE_JUMP_LABEL
 
@@ -132,6 +133,93 @@ void arch_jump_label_transform(struct jump_entry *entry,
 	mutex_unlock(&text_mutex);
 }
 
+struct text_to_poke *entry_vector;
+unsigned int entry_vector_max_elem __read_mostly;
+unsigned int entry_vector_nr_elem;
+
+void arch_jump_label_init(void)
+{
+	entry_vector = (void *) __get_free_page(GFP_KERNEL);
+
+	if (WARN_ON_ONCE(!entry_vector))
+		return;
+
+	entry_vector_max_elem = PAGE_SIZE / sizeof(struct text_to_poke);
+	return;
+}
+
+int arch_jump_label_transform_queue(struct jump_entry *entry,
+				     enum jump_label_type type)
+{
+	void *entry_code;
+	struct text_to_poke *tp;
+
+	/*
+	 * Batch mode disabled before being able to allocate memory:
+	 * Fallback to the non-batching mode.
+	 */
+	if (unlikely(!entry_vector_max_elem)) {
+		if (!slab_is_available() || early_boot_irqs_disabled)
+			goto fallback;
+
+		arch_jump_label_init();
+	}
+
+	/*
+	 * No more space in the vector, tell upper layer to apply
+	 * the queue before continuing.
+	 */
+	if (entry_vector_nr_elem == entry_vector_max_elem)
+		return 0;
+
+	tp = &entry_vector[entry_vector_nr_elem];
+
+	entry_code = (void *)jump_entry_code(entry);
+
+	/*
+	 * The int3 handler will do a bsearch in the queue, so we need entries
+	 * to be sorted. We can survive an unsorted list by rejecting the entry,
+	 * forcing the generic jump_label code to apply the queue. Warn once
+	 * to raise attention to the case of an unsorted entry, which is better
+	 * not to happen: in the worst case we will perform just as we do
+	 * without batching, with some more overhead.
+	 */
+	if (entry_vector_nr_elem > 0) {
+		int prev_idx = entry_vector_nr_elem - 1;
+		struct text_to_poke *prev_tp = &entry_vector[prev_idx];
+
+		if (WARN_ON_ONCE(prev_tp->addr > entry_code))
+			return 0;
+	}
+
+	__jump_label_set_jump_code(entry, type,
+				   (union jump_code_union *) &tp->opcode, 0);
+
+	tp->addr = entry_code;
+	tp->handler = entry_code + JUMP_LABEL_NOP_SIZE;
+	tp->len = JUMP_LABEL_NOP_SIZE;
+
+	entry_vector_nr_elem++;
+
+	return 1;
+
+fallback:
+	arch_jump_label_transform(entry, type);
+	return 1;
+}
+
+void arch_jump_label_transform_apply(void)
+{
+	if (early_boot_irqs_disabled || !entry_vector_nr_elem)
+		return;
+
+	mutex_lock(&text_mutex);
+	text_poke_bp_batch(entry_vector, entry_vector_nr_elem);
+	mutex_unlock(&text_mutex);
+
+	entry_vector_nr_elem = 0;
+}
+
 static enum {
 	JL_STATE_START,
 	JL_STATE_NO_UPDATE,
-- 
2.17.1



* Re: [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (8 preceding siblings ...)
  2018-12-21 10:27 ` [PATCH V3 9/9] x86/jump_label: Batch jump label updates Daniel Bristot de Oliveira
@ 2019-01-21 12:52 ` Daniel Bristot de Oliveira
  2019-04-19 19:01 ` Ingo Molnar
  10 siblings, 0 replies; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-01-21 12:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

On 12/21/18 11:27 AM, Daniel Bristot de Oliveira wrote:
> While tuning a system with CPUs isolated as much as possible, we've
> noticed that isolated CPUs were receiving bursts of 12 IPIs, periodically.
> Tracing the functions that emit IPIs, we saw chronyd - an unprivileged
> process -  generating the IPIs when changing a static key, enabling
> network timestaping on sockets.
> 
> For instance, the trace pointed:
> 
> # trace-cmd record --func-stack -p function -l smp_call_function_single -e irq_vectors -f 'vector == 251'...
> # trace-cmde report...
> [...]
>          chronyd-858   [000]   433.286406: function:             smp_call_function_single
>          chronyd-858   [000]   433.286408: kernel_stack:         <stack trace>
> => smp_call_function_many (ffffffffbc10ab22)
> => smp_call_function (ffffffffbc10abaa)
> => on_each_cpu (ffffffffbc10ac0b)
> => text_poke_bp (ffffffffbc02436a)
> => arch_jump_label_transform (ffffffffbc020ec8)
> => __jump_label_update (ffffffffbc1b300f)
> => jump_label_update (ffffffffbc1b30ed)
> => static_key_slow_inc (ffffffffbc1b334d)
> => net_enable_timestamp (ffffffffbc625c44)
> => sock_enable_timestamp (ffffffffbc6127d5)
> => sock_setsockopt (ffffffffbc612d2f)
> => SyS_setsockopt (ffffffffbc60d5e6)
> => tracesys (ffffffffbc7681c5)
>           <idle>-0     [001]   433.286416: call_function_single_entry: vector=251
>           <idle>-0     [001]   433.286419: call_function_single_exit: vector=251
>   
> [... The IPI takes place 12 times]
> 
> The static key in case was the netstamp_needed_key. A static key
> change from enabled->disabled/disabled->enabled causes the code to be
> patched, and this is done in three steps:
> 
> -- Pseudo-code #1 - Current implementation ---
> For each key to be updated:
> 	1) add an int3 trap to the address that will be patched
> 	    sync cores (send IPI to all other CPUs)
> 	2) update all but the first byte of the patched range
> 	    sync cores (send IPI to all other CPUs)
> 	3) replace the first byte (int3) by the first byte of replacing opcode 
> 	    sync cores (send IPI to all other CPUs)
> -- Pseudo-code #1 ---
> 
> As the static key netstamp_needed_key has four entries (used in for
> places in the code) in our kernel, 3 IPIs were generated for each
> entry, resulting in 12 IPIs. The number of IPIs then is linear with
> regard to the number 'n' of entries of a key: O(n*3), which is O(n).
> 
> This algorithm works fine for the update of a single key. But we think
> it is possible to optimize the case in which a static key has more than
> one entry. For instance, the sched_schedstats jump label has 56 entries
> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> which the thread that is enabling is _not_ running.
> 
> Rather than doing each updated at once, it is possible to queue all updates
> first, and then apply all updates at once, rewriting the pseudo-code #1
> in this way:
> 
> -- Pseudo-code #2 - proposal  ---
> 1)  for each key in the queue:
>         add an int3 trap to the address that will be patched
>     sync cores (send IPI to all other CPUs)
> 
> 2)  for each key in the queue:
>         update all but the first byte of the patched range
>     sync cores (send IPI to all other CPUs)
> 
> 3)  for each key in the queue:
>         replace the first byte (int3) by the first byte of replacing opcode 
>     sync cores (send IPI to all other CPUs)
> -- Pseudo-code #2 - proposal  ---
> 
> Doing the update in this way, the number of IPI becomes O(3) with regard
> to the number of keys, which is O(1).
> 
> Currently, the jump label of a static key is transformed via the arch
> specific function:
> 
>     void arch_jump_label_transform(struct jump_entry *entry,
>                                    enum jump_label_type type)
> 
> The new approach (batch mode) uses two arch functions, the first has the
> same arguments of the arch_jump_label_transform(), and is the function:
> 
>     void arch_jump_label_transform_queue(struct jump_entry *entry,
>                          enum jump_label_type type)
> 
> Rather than transforming the code, it adds the jump_entry in a queue of
> entries to be updated.
> 
> After queuing all jump_entries, the function:
>   
>     void arch_jump_label_transform_apply(void)
> 
> Applies the changes in the queue.
> 
> One easy way to see the benefits of this patch is switching the
> schedstats on and off. For instance:
> 
> -------------------------- %< ---------------------------- 
> #!/bin/bash
> 
> while [ true ]; do 
>     sysctl -w kernel.sched_schedstats=1
>     sleep 2
>     sysctl -w kernel.sched_schedstats=0
>     sleep 2
> done
> -------------------------- >% ----------------------------
> 
> while watching the IPI count:
> 
> -------------------------- %< ---------------------------- 
> # watch -n1 "cat /proc/interrupts | grep Function"
> -------------------------- >% ----------------------------
> 
> With the current mode, it is possible to see around 168 IPIs every 2 seconds,
> while with the batch mode the number of IPIs goes down to 3 every 2 seconds.
> 
> Regarding the performance impact of this patch set, I made two measurements:
> 
>     The time to update a key (the task that is causing the change)
>     The time to run the int3 handler (the side effect on a thread that
>                                       hits the code being changed)
> 
> The sched_schedstats static key was chosen as the key to be switched on and
> off. The reason is that it is used in more than 56 places, in a hot path.
> The change in the schedstats static key will be done with the following
> command:
> 
> while [ true ]; do
>     sysctl -w kernel.sched_schedstats=1
>     usleep 500000
>     sysctl -w kernel.sched_schedstats=0
>     usleep 500000
> done
> 
> In this way, the key will be updated twice per second. To force hits on the
> int3 handler, the system will also run a kernel compilation with two jobs per
> CPU. The test machine is a two-node/24-CPU box with an Intel Xeon processor
> @2.27GHz.
> 
> Regarding the update part, on average, the regular kernel takes 57 ms to update
> the static key, while the kernel with the batch updates takes just 1.4 ms on
> average. Although it seems too good to be true, it makes sense: the
> sched_schedstats key is used in 56 places, so the current implementation was
> expected to take around 56 times longer to update the key, as the IPIs are the
> most expensive part of the update.
> 
> Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
> the batch version takes around 180 ns. At first glance that seems high, but it
> is not, considering that it is handling a vector of 56 updates rather than a
> single one: it takes only four times longer. This is possible because the
> handler uses a binary search in the vector: log2(56) = 5.8, so an overhead
> within a factor of four was expected.
> 
> (voice of TV propaganda) But that is not all! As the int3 handler stays armed
> for a shorter period (because the update itself takes less time), the number
> of hits in the int3 handler decreased by 10%.
> 
> The question then is: Is it worth paying the price of "135 ns" more in the int3
> handler?
> 
> Considering that, in this test case, we are saving the handling of 53 IPIs,
> each of which costs more than those 135 ns, it seems a meager price to pay.
> Moreover, the test case was forcing hits on the int3 handler; in practice,
> that does not happen so often, while the IPIs reach all CPUs whether they end
> up hitting the int3 handler or not!
> 
> For instance, on an isolated CPU with a process running in user-space
> (the nohz_full use-case), the chance of hitting the int3 handler is close to
> zero, while there is no way to avoid the IPIs. By bounding the IPIs, we are
> improving this scenario a lot.
> 
> Changes from v2:
>   - Switch the order of patches 8 and 9 (Steven Rostedt)
>   - Fallback in the case of failure in the batch mode (Steven Rostedt)
>   - Fix a pointer in patch 7 (Steven Rostedt)
>   - Adjust the commit log of the patch 1 (Borislav Petkov)
>   - Adjust the commit log of the patch 3 (Jiri Kosina/Thomas Gleixner)
> Changes from v1:
>   - Split the patch in jump-label/x86-jump-label/alternative (Jason Baron)
>   - Use bsearch in the int3 handler (Steven Rostedt)
>   - Use a pre-allocated page to store the vector (Jason/Steven)
>   - Do performance tests in the int3 handler (peterz)
> 
> Daniel Bristot de Oliveira (9):
>   jump_label: Add for_each_label_entry helper
>   jump_label: Add the jump_label_can_update_check() helper
>   x86/jump_label: Move checking code away from __jump_label_transform()
>   x86/jump_label: Add __jump_label_set_jump_code() helper
>   x86/alternative: Split text_poke_bp() into three steps
>   jump_label: Sort entries of the same key by the code
>   x86/alternative: Batch of patch operations
>   jump_label: Batch updates if arch supports it
>   x86/jump_label: Batch jump label updates
> 
>  arch/x86/include/asm/jump_label.h    |   2 +
>  arch/x86/include/asm/text-patching.h |  15 +++
>  arch/x86/kernel/alternative.c        | 140 ++++++++++++++++++---
>  arch/x86/kernel/jump_label.c         | 174 ++++++++++++++++++++++-----
>  include/linux/jump_label.h           |   6 +
>  kernel/jump_label.c                  |  73 +++++++++--
>  6 files changed, 359 insertions(+), 51 deletions(-)
> 

Hi,

This is just a gentle ping...  Any thoughts about this patch series?

Thanks!
-- Daniel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations
  2018-12-21 10:27 ` [PATCH V3 7/9] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
@ 2019-01-23  5:15   ` Masami Hiramatsu
  2019-01-26 11:52     ` Daniel Bristot de Oliveira
  2019-04-19 18:41   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
  1 sibling, 1 reply; 27+ messages in thread
From: Masami Hiramatsu @ 2019-01-23  5:15 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Masami Hiramatsu,
	Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Hi Daniel,

On Fri, 21 Dec 2018 11:27:32 +0100
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> Currently, the patch of an address is done in three steps:
> 
> -- Pseudo-code #1 - Current implementation ---
>         1) add an int3 trap to the address that will be patched
>             sync cores (send IPI to all other CPUs)
>         2) update all but the first byte of the patched range
>             sync cores (send IPI to all other CPUs)
>         3) replace the first byte (int3) by the first byte of replacing opcode
>             sync cores (send IPI to all other CPUs)
> -- Pseudo-code #1 ---
> 
> When a static key has more than one entry, these steps are called once for
> each entry. The number of IPIs then is linear with regard to the number 'n' of
> entries of a key: O(n*3), which is O(n).
> 
> This algorithm works fine for the update of a single key. But we think
> it is possible to optimize the case in which a static key has more than
> one entry. For instance, the sched_schedstats jump label has 56 entries
> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> which the thread that is enabling the key is _not_ running.
> 
> With this patch, rather than receiving a single patch to be processed, a vector
> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
> way:
> 
> -- Pseudo-code #2 - This patch  ---
> 1)  for each patch in the vector:
>         add an int3 trap to the address that will be patched
> 
>     sync cores (send IPI to all other CPUs)
> 
> 2)  for each patch in the vector:
>         update all but the first byte of the patched range
> 
>     sync cores (send IPI to all other CPUs)
> 
> 3)  for each patch in the vector:
>         replace the first byte (int3) by the first byte of replacing opcode
> 
>     sync cores (send IPI to all other CPUs)
> -- Pseudo-code #2 - This patch  ---
> 
> Doing the update in this way, the number of IPI becomes O(3) with regard
> to the number of keys, which is O(1).
> 
> The batch mode is done with the function text_poke_bp_batch(), that receives
> two arguments: a vector of "struct text_to_poke", and the number of entries
> in the vector.
> 
> The vector must be sorted by the addr field of the text_to_poke structure,
> enabling the binary search of a handler in the poke_int3_handler function
> (a fast path).
> 
> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> Cc: Jiri Kosina <jkosina@suse.cz>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> Cc: Chris von Recklinghausen <crecklin@redhat.com>
> Cc: Jason Baron <jbaron@akamai.com>
> Cc: Scott Wood <swood@redhat.com>
> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> Cc: Clark Williams <williams@redhat.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/include/asm/text-patching.h |  15 ++++
>  arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
>  2 files changed, 117 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index e85ff65c43c3..42ea7846df33 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>  #define __parainstructions_end	NULL
>  #endif
>  
> +/*
> + * Currently, the max observed size in the kernel code is
> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
> + * Raise it if needed.
> + */
> +#define POKE_MAX_OPCODE_SIZE	5
> +
> +struct text_to_poke {
> +	void *handler;
> +	void *addr;
> +	size_t len;
> +	const char opcode[POKE_MAX_OPCODE_SIZE];
> +};
> +
>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  
>  /*
> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
>  extern int after_bootmem;
>  
>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 6f5ad8587de0..8fa47e5ec709 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -21,6 +21,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/io.h>
>  #include <asm/fixmap.h>
> +#include <linux/bsearch.h>
>  
>  int __read_mostly alternatives_patched;
>  
> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
>  }
>  
>  static bool bp_patching_in_progress;
> +/*
> + * Single poke.
> + */
>  static void *bp_int3_handler, *bp_int3_addr;
> +/*
> + * Batching poke.
> + */
> +static struct text_to_poke *bp_int3_tpv;
> +static unsigned int bp_int3_tpv_nr;
> +
> +static int text_bp_batch_bsearch(const void *key, const void *elt)
> +{
> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
> +
> +	if (key < tp->addr)
> +		return -1;
> +	if (key > tp->addr)
> +		return 1;
> +	return 0;
> +}
>  
>  int poke_int3_handler(struct pt_regs *regs)
>  {
> +	void *ip;
> +	struct text_to_poke *tp;
> +
>  	/*
>  	 * Having observed our INT3 instruction, we now must observe
>  	 * bp_patching_in_progress.
> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
>  	if (likely(!bp_patching_in_progress))
>  		return 0;
>  
> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> +	if (user_mode(regs))
>  		return 0;
>  
> -	/* set up the specified breakpoint handler */
> -	regs->ip = (unsigned long) bp_int3_handler;
> +	/*
> +	 * Single poke first.
> +	 */

I wonder why you would separate the single poke and the batch poke?
It seems a single poke is just the case where bp_int3_tpv_nr == 1.
If so, you can remove bp_int3_addr and this block.

> +	if (bp_int3_addr) {
> +		if (regs->ip == (unsigned long) bp_int3_addr) {
> +			regs->ip = (unsigned long) bp_int3_handler;
> +			return 1;
> +		}
> +		return 0;
> +	}
> 
> -	return 1;
> +	/*
> +	 * Batch mode.
> +	 */
> +	if (bp_int3_tpv_nr) {

if (unlikely(bp_int3_tpv_nr))

Sorry for interrupting, but this is a hot path when we use kprobes.

Also, could you add NOKPROBE_SYMBOL() for all symbols involved in this
process?
I recently found that I had missed it for poke_int3_handler and sent a fix.
( https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898241.html )
If this increases the function call chain from poke_int3_handler, those
functions must also be marked with NOKPROBE_SYMBOL().
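
For illustration, the annotation would look something like this (just a sketch
on top of the bsearch callback from this patch, to show the pattern; not a
final patch):

------------------ %< ------------------
#include <linux/kprobes.h>

static int text_bp_batch_bsearch(const void *key, const void *elt)
{
        struct text_to_poke *tp = (struct text_to_poke *) elt;

        if (key < tp->addr)
                return -1;
        if (key > tp->addr)
                return 1;
        return 0;
}
/*
 * Everything reachable from the int3 trap must be non-probeable,
 * otherwise a kprobe placed there would recurse into the handler.
 */
NOKPROBE_SYMBOL(text_bp_batch_bsearch);
------------- >% ----------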

But basically this series looks good to me.

Thank you, 

> +		ip = (void *) regs->ip - sizeof(unsigned char);
> +		tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
> +			     sizeof(struct text_to_poke),
> +			     text_bp_batch_bsearch);
> +		if (tp) {
> +			/* set up the specified breakpoint handler */
> +			regs->ip = (unsigned long) tp->handler;
> +			return 1;
> +		}
> +	}
>  
> +	return 0;
>  }
>  
>  static void text_poke_bp_set_handler(void *addr, void *handler,
>  				     unsigned char int3)
>  {
> -	bp_int3_handler = handler;
> -	bp_int3_addr = (u8 *)addr + sizeof(int3);
>  	text_poke(addr, &int3, sizeof(int3));
>  }
>  
> @@ -816,6 +859,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  
>  	lockdep_assert_held(&text_mutex);
>  
> +	bp_int3_handler = handler;
> +	bp_int3_addr = (u8 *)addr + sizeof(int3);
> +
>  	bp_patching_in_progress = true;
>  	/*
>  	 * Corresponding read barrier in int3 notifier for making sure the
> @@ -846,6 +892,56 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  	 */
>  	bp_patching_in_progress = false;
>  
> +	bp_int3_handler = bp_int3_addr = 0;
>  	return addr;
>  }
>  
> +void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
> +{
> +	unsigned int i;
> +	unsigned char int3 = 0xcc;
> +	int patched_all_but_first = 0;
> +
> +	bp_int3_tpv = tp;
> +	bp_int3_tpv_nr = nr_entries;
> +	bp_patching_in_progress = true;
> +	/*
> +	 * Corresponding read barrier in int3 notifier for making sure the
> +	 * in_progress and handler are correctly ordered wrt. patching.
> +	 */
> +	smp_wmb();
> +
> +	for (i = 0; i < nr_entries; i++)
> +		text_poke_bp_set_handler(tp[i].addr, tp[i].handler, int3);
> +
> +	on_each_cpu(do_sync_core, NULL, 1);
> +
> +	for (i = 0; i < nr_entries; i++) {
> +		if (tp[i].len - sizeof(int3) > 0) {
> +			patch_all_but_first_byte(tp[i].addr, tp[i].opcode,
> +						 tp[i].len, int3);
> +			patched_all_but_first++;
> +		}
> +	}
> +
> +	if (patched_all_but_first) {
> +		/*
> +		 * According to Intel, this core syncing is very likely
> +		 * not necessary and we'd be safe even without it. But
> +		 * better safe than sorry (plus there's not only Intel).
> +		 */
> +		on_each_cpu(do_sync_core, NULL, 1);
> +	}
> +
> +	for (i = 0; i < nr_entries; i++)
> +		patch_first_byte(tp[i].addr, tp[i].opcode, int3);
> +
> +	on_each_cpu(do_sync_core, NULL, 1);
> +	/*
> +	 * sync_core() implies an smp_mb() and orders this store against
> +	 * the writing of the new instruction.
> +	 */
> +	bp_int3_tpv_nr = 0;
> +	bp_int3_tpv = NULL;
> +	bp_patching_in_progress = false;
> +}
> -- 
> 2.17.1
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations
  2019-01-23  5:15   ` Masami Hiramatsu
@ 2019-01-26 11:52     ` Daniel Bristot de Oliveira
  2019-01-28 13:52       ` Masami Hiramatsu
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-01-26 11:52 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

On 1/23/19 6:15 AM, Masami Hiramatsu wrote:
> Hi Daniel,
> 
> On Fri, 21 Dec 2018 11:27:32 +0100
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
>> Currently, the patch of an address is done in three steps:
>>
>> -- Pseudo-code #1 - Current implementation ---
>>         1) add an int3 trap to the address that will be patched
>>             sync cores (send IPI to all other CPUs)
>>         2) update all but the first byte of the patched range
>>             sync cores (send IPI to all other CPUs)
>>         3) replace the first byte (int3) by the first byte of replacing opcode
>>             sync cores (send IPI to all other CPUs)
>> -- Pseudo-code #1 ---
>>
>> When a static key has more than one entry, these steps are called once for
>> each entry. The number of IPIs then is linear with regard to the number 'n' of
>> entries of a key: O(n*3), which is O(n).
>>
>> This algorithm works fine for the update of a single key. But we think
>> it is possible to optimize the case in which a static key has more than
>> one entry. For instance, the sched_schedstats jump label has 56 entries
>> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
>> which the thread that is enabling the key is _not_ running.
>>
>> With this patch, rather than receiving a single patch to be processed, a vector
>> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
>> way:
>>
>> -- Pseudo-code #2 - This patch  ---
>> 1)  for each patch in the vector:
>>         add an int3 trap to the address that will be patched
>>
>>     sync cores (send IPI to all other CPUs)
>>
>> 2)  for each patch in the vector:
>>         update all but the first byte of the patched range
>>
>>     sync cores (send IPI to all other CPUs)
>>
>> 3)  for each patch in the vector:
>>         replace the first byte (int3) by the first byte of replacing opcode
>>
>>     sync cores (send IPI to all other CPUs)
>> -- Pseudo-code #2 - This patch  ---
>>
>> Doing the update in this way, the number of IPI becomes O(3) with regard
>> to the number of keys, which is O(1).
>>
>> The batch mode is done with the function text_poke_bp_batch(), that receives
>> two arguments: a vector of "struct text_to_poke", and the number of entries
>> in the vector.
>>
>> The vector must be sorted by the addr field of the text_to_poke structure,
>> enabling the binary search of a handler in the poke_int3_handler function
>> (a fast path).
>>
>> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
>> Cc: Jiri Kosina <jkosina@suse.cz>
>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>> Cc: Chris von Recklinghausen <crecklin@redhat.com>
>> Cc: Jason Baron <jbaron@akamai.com>
>> Cc: Scott Wood <swood@redhat.com>
>> Cc: Marcelo Tosatti <mtosatti@redhat.com>
>> Cc: Clark Williams <williams@redhat.com>
>> Cc: x86@kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> ---
>>  arch/x86/include/asm/text-patching.h |  15 ++++
>>  arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
>>  2 files changed, 117 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>> index e85ff65c43c3..42ea7846df33 100644
>> --- a/arch/x86/include/asm/text-patching.h
>> +++ b/arch/x86/include/asm/text-patching.h
>> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>>  #define __parainstructions_end	NULL
>>  #endif
>>  
>> +/*
>> + * Currently, the max observed size in the kernel code is
>> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
>> + * Raise it if needed.
>> + */
>> +#define POKE_MAX_OPCODE_SIZE	5
>> +
>> +struct text_to_poke {
>> +	void *handler;
>> +	void *addr;
>> +	size_t len;
>> +	const char opcode[POKE_MAX_OPCODE_SIZE];
>> +};
>> +
>>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>  
>>  /*
>> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>>  extern int poke_int3_handler(struct pt_regs *regs);
>>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
>>  extern int after_bootmem;
>>  
>>  #endif /* _ASM_X86_TEXT_PATCHING_H */
>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>> index 6f5ad8587de0..8fa47e5ec709 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -21,6 +21,7 @@
>>  #include <asm/tlbflush.h>
>>  #include <asm/io.h>
>>  #include <asm/fixmap.h>
>> +#include <linux/bsearch.h>
>>  
>>  int __read_mostly alternatives_patched;
>>  
>> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
>>  }
>>  
>>  static bool bp_patching_in_progress;
>> +/*
>> + * Single poke.
>> + */
>>  static void *bp_int3_handler, *bp_int3_addr;
>> +/*
>> + * Batching poke.
>> + */
>> +static struct text_to_poke *bp_int3_tpv;
>> +static unsigned int bp_int3_tpv_nr;
>> +
>> +static int text_bp_batch_bsearch(const void *key, const void *elt)
>> +{
>> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
>> +
>> +	if (key < tp->addr)
>> +		return -1;
>> +	if (key > tp->addr)
>> +		return 1;
>> +	return 0;
>> +}
>>  
>>  int poke_int3_handler(struct pt_regs *regs)
>>  {
>> +	void *ip;
>> +	struct text_to_poke *tp;
>> +
>>  	/*
>>  	 * Having observed our INT3 instruction, we now must observe
>>  	 * bp_patching_in_progress.
>> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
>>  	if (likely(!bp_patching_in_progress))
>>  		return 0;
>>  
>> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
>> +	if (user_mode(regs))
>>  		return 0;
>>  
>> -	/* set up the specified breakpoint handler */
>> -	regs->ip = (unsigned long) bp_int3_handler;
>> +	/*
>> +	 * Single poke first.
>> +	 */
> 
> I wonder why would you separate single poke and batch poke?
> It seems a single poke is just a case that bp_int3_tpv_nr == 1.

Hi Masami!

The single poke is used only at boot time, before the system is able to
allocate memory. After that, the batch mode becomes the default.

I was thinking of making one function for each method, but then I would have to
change do_int3() and manage how to switch between one and the other without
further overhead. I was planning to do this in a second round of improvements.

> If so, you can remove bp_int3_addr and this block.
> 
>> +	if (bp_int3_addr) {
>> +		if (regs->ip == (unsigned long) bp_int3_addr) {
>> +			regs->ip = (unsigned long) bp_int3_handler;
>> +			return 1;
>> +		}
>> +		return 0;
>> +	}
>>
>> -	return 1;
>> +	/*
>> +	 * Batch mode.
>> +	 */
>> +	if (bp_int3_tpv_nr) {
> 
> if (unlikely(bp_int3_tpv_nr))
> 
> Sorry about interrupting, but this is a "hot-path" when we use kprobes.

No problem at all! :-)

I will change this function to better deal with the hot path (the default mode
after the system boots up).

How about something like this:
------------------ %< ------------------
int poke_int3_handler(struct pt_regs *regs)
{
        void *ip;
        struct text_to_poke *tp;

        /*
         * Having observed our INT3 instruction, we now must observe
         * bp_patching_in_progress.
         *
         *      in_progress = TRUE              INT3
         *      WMB                             RMB
         *      write INT3                      if (in_progress)
         *
         * Idem for bp_int3_handler.
         */
        smp_rmb();

        if (likely(!bp_patching_in_progress))
                return 0;

        if (user_mode(regs))
                return 0;

        /*
         * Single poke is only used at the boot.
         */
        if (unlikely(!bp_int3_tpv))
                goto single_poke;

        ip = (void *) regs->ip - sizeof(unsigned char);
        tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
                     sizeof(struct text_to_poke),
                     text_bp_batch_bsearch);
        if (tp) {
                /* set up the specified breakpoint handler */
                regs->ip = (unsigned long) tp->handler;
                return 1;
        }

        return 0;

single_poke:
        if (regs->ip == (unsigned long) bp_int3_addr) {
                regs->ip = (unsigned long) bp_int3_handler;
                return 1;
        }

        return 0;
}
------------- >% ----------

In this way the default code path comes first, and the only 'if' I am using
tests a variable of the batch mode (which will be used later). If we are still
at boot, we jump to the end of the function.

Does that look better?

> 
> Also, could you add NOKPROBE_SYMBOL(); for all symbols involved in this
> process?
> Recently I found I missed it for poke_int3_handler and sent a fix.
> ( https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898241.html )
> If this increase the function-call-chain from poke_int3_handler, those
> must be marked as NOKPROBE_SYMBOL().

Ack! Doing that!

Thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations
  2019-01-26 11:52     ` Daniel Bristot de Oliveira
@ 2019-01-28 13:52       ` Masami Hiramatsu
  2019-02-01 12:49         ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 27+ messages in thread
From: Masami Hiramatsu @ 2019-01-28 13:52 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

On Sat, 26 Jan 2019 12:52:15 +0100
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> On 1/23/19 6:15 AM, Masami Hiramatsu wrote:
> > Hi Daniel,
> > 
> > On Fri, 21 Dec 2018 11:27:32 +0100
> > Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> > 
> >> Currently, the patch of an address is done in three steps:
> >>
> >> -- Pseudo-code #1 - Current implementation ---
> >>         1) add an int3 trap to the address that will be patched
> >>             sync cores (send IPI to all other CPUs)
> >>         2) update all but the first byte of the patched range
> >>             sync cores (send IPI to all other CPUs)
> >>         3) replace the first byte (int3) by the first byte of replacing opcode
> >>             sync cores (send IPI to all other CPUs)
> >> -- Pseudo-code #1 ---
> >>
> >> When a static key has more than one entry, these steps are called once for
> >> each entry. The number of IPIs then is linear with regard to the number 'n' of
> >> entries of a key: O(n*3), which is O(n).
> >>
> >> This algorithm works fine for the update of a single key. But we think
> >> it is possible to optimize the case in which a static key has more than
> >> one entry. For instance, the sched_schedstats jump label has 56 entries
> >> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> >> which the thread that is enabling the key is _not_ running.
> >>
> >> With this patch, rather than receiving a single patch to be processed, a vector
> >> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
> >> way:
> >>
> >> -- Pseudo-code #2 - This patch  ---
> >> 1)  for each patch in the vector:
> >>         add an int3 trap to the address that will be patched
> >>
> >>     sync cores (send IPI to all other CPUs)
> >>
> >> 2)  for each patch in the vector:
> >>         update all but the first byte of the patched range
> >>
> >>     sync cores (send IPI to all other CPUs)
> >>
> >> 3)  for each patch in the vector:
> >>         replace the first byte (int3) by the first byte of replacing opcode
> >>
> >>     sync cores (send IPI to all other CPUs)
> >> -- Pseudo-code #2 - This patch  ---
> >>
> >> Doing the update in this way, the number of IPI becomes O(3) with regard
> >> to the number of keys, which is O(1).
> >>
> >> The batch mode is done with the function text_poke_bp_batch(), that receives
> >> two arguments: a vector of "struct text_to_poke", and the number of entries
> >> in the vector.
> >>
> >> The vector must be sorted by the addr field of the text_to_poke structure,
> >> enabling the binary search of a handler in the poke_int3_handler function
> >> (a fast path).
> >>
> >> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Borislav Petkov <bp@alien8.de>
> >> Cc: "H. Peter Anvin" <hpa@zytor.com>
> >> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> >> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> >> Cc: Jiri Kosina <jkosina@suse.cz>
> >> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> >> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> >> Cc: Chris von Recklinghausen <crecklin@redhat.com>
> >> Cc: Jason Baron <jbaron@akamai.com>
> >> Cc: Scott Wood <swood@redhat.com>
> >> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> >> Cc: Clark Williams <williams@redhat.com>
> >> Cc: x86@kernel.org
> >> Cc: linux-kernel@vger.kernel.org
> >> ---
> >>  arch/x86/include/asm/text-patching.h |  15 ++++
> >>  arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
> >>  2 files changed, 117 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> >> index e85ff65c43c3..42ea7846df33 100644
> >> --- a/arch/x86/include/asm/text-patching.h
> >> +++ b/arch/x86/include/asm/text-patching.h
> >> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
> >>  #define __parainstructions_end	NULL
> >>  #endif
> >>  
> >> +/*
> >> + * Currently, the max observed size in the kernel code is
> >> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
> >> + * Raise it if needed.
> >> + */
> >> +#define POKE_MAX_OPCODE_SIZE	5
> >> +
> >> +struct text_to_poke {
> >> +	void *handler;
> >> +	void *addr;
> >> +	size_t len;
> >> +	const char opcode[POKE_MAX_OPCODE_SIZE];
> >> +};
> >> +
> >>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>  
> >>  /*
> >> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>  extern void *text_poke(void *addr, const void *opcode, size_t len);
> >>  extern int poke_int3_handler(struct pt_regs *regs);
> >>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> >> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
> >>  extern int after_bootmem;
> >>  
> >>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> >> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> >> index 6f5ad8587de0..8fa47e5ec709 100644
> >> --- a/arch/x86/kernel/alternative.c
> >> +++ b/arch/x86/kernel/alternative.c
> >> @@ -21,6 +21,7 @@
> >>  #include <asm/tlbflush.h>
> >>  #include <asm/io.h>
> >>  #include <asm/fixmap.h>
> >> +#include <linux/bsearch.h>
> >>  
> >>  int __read_mostly alternatives_patched;
> >>  
> >> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
> >>  }
> >>  
> >>  static bool bp_patching_in_progress;
> >> +/*
> >> + * Single poke.
> >> + */
> >>  static void *bp_int3_handler, *bp_int3_addr;
> >> +/*
> >> + * Batching poke.
> >> + */
> >> +static struct text_to_poke *bp_int3_tpv;
> >> +static unsigned int bp_int3_tpv_nr;
> >> +
> >> +static int text_bp_batch_bsearch(const void *key, const void *elt)
> >> +{
> >> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
> >> +
> >> +	if (key < tp->addr)
> >> +		return -1;
> >> +	if (key > tp->addr)
> >> +		return 1;
> >> +	return 0;
> >> +}
> >>  
> >>  int poke_int3_handler(struct pt_regs *regs)
> >>  {
> >> +	void *ip;
> >> +	struct text_to_poke *tp;
> >> +
> >>  	/*
> >>  	 * Having observed our INT3 instruction, we now must observe
> >>  	 * bp_patching_in_progress.
> >> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
> >>  	if (likely(!bp_patching_in_progress))
> >>  		return 0;
> >>  
> >> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> >> +	if (user_mode(regs))
> >>  		return 0;
> >>  
> >> -	/* set up the specified breakpoint handler */
> >> -	regs->ip = (unsigned long) bp_int3_handler;
> >> +	/*
> >> +	 * Single poke first.
> >> +	 */
> > 
> > I wonder why would you separate single poke and batch poke?
> > It seems a single poke is just a case that bp_int3_tpv_nr == 1.
> 
> Hi Masami!
> 
> The single poke is used only at the boot time, before the system is able to
> allocate memory. After that, the batch mode becomes the default.

Hmm, what's the difference from text_poke_early()?

> 
> I was thinking to make one function to each method, but then I would have to
> change the do_int3() and manage how to switch between one and the other without
> further overhead. I was planing to do this in a second round of improvements.

I didn't have such a big change in mind.
I just thought we could allocate a single-entry array on the stack, something like:

text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
{
	struct text_to_poke tp = {.handler = handler, .addr = addr, .len = len};
	if (len > POKE_MAX_OPCODE_SIZE)
		return -E2BIG;
	memcpy(tp.opcode, opcode, len);
	return text_poke_bp_batch(&tp, 1);
}

> 
> > If so, you can remove bp_int3_addr and this block.
> > 
> >> +	if (bp_int3_addr) {
> >> +		if (regs->ip == (unsigned long) bp_int3_addr) {
> >> +			regs->ip = (unsigned long) bp_int3_handler;
> >> +			return 1;
> >> +		}
> >> +		return 0;
> >> +	}
> >>
> >> -	return 1;
> >> +	/*
> >> +	 * Batch mode.
> >> +	 */
> >> +	if (bp_int3_tpv_nr) {
> > 
> > if (unlikely(bp_int3_tpv_nr))
> > 
> > Sorry about interrupting, but this is a "hot-path" when we use kprobes.
> 
> No problem at all! :-)

Thanks! :-)

> 
> I will change this function to better deal with the hot-path (the default mode
> after the system boots up).
> 
> how about something like this:
> ------------------ %< ------------------
> int poke_int3_handler(struct pt_regs *regs)
> {
>         void *ip;
>         struct text_to_poke *tp;
> 
>         /*
>          * Having observed our INT3 instruction, we now must observe
>          * bp_patching_in_progress.
>          *
>          *      in_progress = TRUE              INT3
>          *      WMB                             RMB
>          *      write INT3                      if (in_progress)
>          *
>          * Idem for bp_int3_handler.
>          */
>         smp_rmb();
> 
>         if (likely(!bp_patching_in_progress))
>                 return 0;
> 
>         if (user_mode(regs))
>                 return 0;
> 
>         /*
>          * Single poke is only used at the boot.
>          */
>         if (unlikely(!bp_int3_tpv))
>                 goto single_poke;
> 
>         ip = (void *) regs->ip - sizeof(unsigned char);
>         tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
>                      sizeof(struct text_to_poke),
>                      text_bp_batch_bsearch);
>         if (tp) {
>                 /* set up the specified breakpoint handler */
>                 regs->ip = (unsigned long) tp->handler;
>                 return 1;
>         }
> 
>         return 0;
> 
> single_poke:
>         if (regs->ip == (unsigned long) bp_int3_addr) {
>                 regs->ip = (unsigned long) bp_int3_handler;
>                 return 1;
>         }
> 
>         return 0;
> }
> ------------- >% ----------
> 
> In this way the default code is up, and the only 'if' I am using is a var of the
> batch mode (that will be used later). If are are still at the boot, we are
> jumping to the end of the function.
> 
> look better?

Yeah, it looks much better. But I just wonder why you don't consolidate both,
simply to reduce code.

> 
> > 
> > Also, could you add NOKPROBE_SYMBOL(); for all symbols involved in this
> > process?
> > Recently I found I missed it for poke_int3_handler and sent a fix.
> > ( https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1898241.html )
> > If this increase the function-call-chain from poke_int3_handler, those
> > must be marked as NOKPROBE_SYMBOL().
> 
> Ack! Doing that!

Thank you!

> 
> Thanks!


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations
  2019-01-28 13:52       ` Masami Hiramatsu
@ 2019-02-01 12:49         ` Daniel Bristot de Oliveira
  2019-02-01 14:47           ` Masami Hiramatsu
  0 siblings, 1 reply; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-02-01 12:49 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

On 1/28/19 2:52 PM, Masami Hiramatsu wrote:
> On Sat, 26 Jan 2019 12:52:15 +0100
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
>> On 1/23/19 6:15 AM, Masami Hiramatsu wrote:
>>> Hi Daniel,
>>>
>>> On Fri, 21 Dec 2018 11:27:32 +0100
>>> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
>>>
>>>> Currently, the patch of an address is done in three steps:
>>>>
>>>> -- Pseudo-code #1 - Current implementation ---
>>>>         1) add an int3 trap to the address that will be patched
>>>>             sync cores (send IPI to all other CPUs)
>>>>         2) update all but the first byte of the patched range
>>>>             sync cores (send IPI to all other CPUs)
>>>>         3) replace the first byte (int3) by the first byte of replacing opcode
>>>>             sync cores (send IPI to all other CPUs)
>>>> -- Pseudo-code #1 ---
>>>>
>>>> When a static key has more than one entry, these steps are called once for
>>>> each entry. The number of IPIs then is linear with regard to the number 'n' of
>>>> entries of a key: O(n*3), which is O(n).
>>>>
>>>> This algorithm works fine for the update of a single key. But we think
>>>> it is possible to optimize the case in which a static key has more than
>>>> one entry. For instance, the sched_schedstats jump label has 56 entries
>>>> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
>>>> which the thread that is enabling the key is _not_ running.
>>>>
>>>> With this patch, rather than receiving a single patch to be processed, a vector
>>>> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
>>>> way:
>>>>
>>>> -- Pseudo-code #2 - This patch  ---
>>>> 1)  for each patch in the vector:
>>>>         add an int3 trap to the address that will be patched
>>>>
>>>>     sync cores (send IPI to all other CPUs)
>>>>
>>>> 2)  for each patch in the vector:
>>>>         update all but the first byte of the patched range
>>>>
>>>>     sync cores (send IPI to all other CPUs)
>>>>
>>>> 3)  for each patch in the vector:
>>>>         replace the first byte (int3) by the first byte of replacing opcode
>>>>
>>>>     sync cores (send IPI to all other CPUs)
>>>> -- Pseudo-code #2 - This patch  ---
>>>>
>>>> Doing the update in this way, the number of IPI becomes O(3) with regard
>>>> to the number of keys, which is O(1).
>>>>
>>>> The batch mode is done with the function text_poke_bp_batch(), that receives
>>>> two arguments: a vector of "struct text_to_poke", and the number of entries
>>>> in the vector.
>>>>
>>>> The vector must be sorted by the addr field of the text_to_poke structure,
>>>> enabling the binary search of a handler in the poke_int3_handler function
>>>> (a fast path).
>>>>
>>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>> Cc: Borislav Petkov <bp@alien8.de>
>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>>>> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
>>>> Cc: Jiri Kosina <jkosina@suse.cz>
>>>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>>>> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>>>> Cc: Chris von Recklinghausen <crecklin@redhat.com>
>>>> Cc: Jason Baron <jbaron@akamai.com>
>>>> Cc: Scott Wood <swood@redhat.com>
>>>> Cc: Marcelo Tosatti <mtosatti@redhat.com>
>>>> Cc: Clark Williams <williams@redhat.com>
>>>> Cc: x86@kernel.org
>>>> Cc: linux-kernel@vger.kernel.org
>>>> ---
>>>>  arch/x86/include/asm/text-patching.h |  15 ++++
>>>>  arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
>>>>  2 files changed, 117 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>>>> index e85ff65c43c3..42ea7846df33 100644
>>>> --- a/arch/x86/include/asm/text-patching.h
>>>> +++ b/arch/x86/include/asm/text-patching.h
>>>> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>>>>  #define __parainstructions_end	NULL
>>>>  #endif
>>>>  
>>>> +/*
>>>> + * Currently, the max observed size in the kernel code is
>>>> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
>>>> + * Raise it if needed.
>>>> + */
>>>> +#define POKE_MAX_OPCODE_SIZE	5
>>>> +
>>>> +struct text_to_poke {
>>>> +	void *handler;
>>>> +	void *addr;
>>>> +	size_t len;
>>>> +	const char opcode[POKE_MAX_OPCODE_SIZE];
>>>> +};
>>>> +
>>>>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>>>  
>>>>  /*
>>>> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>>>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>>>>  extern int poke_int3_handler(struct pt_regs *regs);
>>>>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>>>> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
>>>>  extern int after_bootmem;
>>>>  
>>>>  #endif /* _ASM_X86_TEXT_PATCHING_H */
>>>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>>>> index 6f5ad8587de0..8fa47e5ec709 100644
>>>> --- a/arch/x86/kernel/alternative.c
>>>> +++ b/arch/x86/kernel/alternative.c
>>>> @@ -21,6 +21,7 @@
>>>>  #include <asm/tlbflush.h>
>>>>  #include <asm/io.h>
>>>>  #include <asm/fixmap.h>
>>>> +#include <linux/bsearch.h>
>>>>  
>>>>  int __read_mostly alternatives_patched;
>>>>  
>>>> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
>>>>  }
>>>>  
>>>>  static bool bp_patching_in_progress;
>>>> +/*
>>>> + * Single poke.
>>>> + */
>>>>  static void *bp_int3_handler, *bp_int3_addr;
>>>> +/*
>>>> + * Batching poke.
>>>> + */
>>>> +static struct text_to_poke *bp_int3_tpv;
>>>> +static unsigned int bp_int3_tpv_nr;
>>>> +
>>>> +static int text_bp_batch_bsearch(const void *key, const void *elt)
>>>> +{
>>>> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
>>>> +
>>>> +	if (key < tp->addr)
>>>> +		return -1;
>>>> +	if (key > tp->addr)
>>>> +		return 1;
>>>> +	return 0;
>>>> +}
>>>>  
>>>>  int poke_int3_handler(struct pt_regs *regs)
>>>>  {
>>>> +	void *ip;
>>>> +	struct text_to_poke *tp;
>>>> +
>>>>  	/*
>>>>  	 * Having observed our INT3 instruction, we now must observe
>>>>  	 * bp_patching_in_progress.
>>>> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
>>>>  	if (likely(!bp_patching_in_progress))
>>>>  		return 0;
>>>>  
>>>> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
>>>> +	if (user_mode(regs))
>>>>  		return 0;
>>>>  
>>>> -	/* set up the specified breakpoint handler */
>>>> -	regs->ip = (unsigned long) bp_int3_handler;
>>>> +	/*
>>>> +	 * Single poke first.
>>>> +	 */
>>>
>>> I wonder why would you separate single poke and batch poke?
>>> It seems a single poke is just a case that bp_int3_tpv_nr == 1.
>>
>> Hi Masami!
>>
>> The single poke is used only at the boot time, before the system is able to
>> allocate memory. After that, the batch mode becomes the default.
> 
> Hmm, what's the difference from text_poke_early()?

text_poke_early(): before enabling interrupts at boot.

text_poke_bp(): after enabling interrupts, before being able to allocate memory,
or as the fallback when the batch mode fails.

text_poke_bp_batch(): after enabling interrupts and being able to allocate memory.
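
Or, schematically (just restating the three cases above; queue_for_batch() is a
made-up name for this sketch and the conditions are paraphrases of the stages,
not the kernel's actual checks):

------------------ %< ------------------
static void patch_one_site(void *addr, const void *opcode, size_t len,
                           void *handler)
{
        if (early_boot_irqs_disabled) {
                /* Boot, before interrupts are enabled: no IPIs needed. */
                text_poke_early(addr, opcode, len);
        } else if (!slab_is_available()) {
                /* Interrupts on, but no memory for the batch vector yet. */
                text_poke_bp(addr, opcode, len, handler);
        } else {
                /* Normal case: queue it; a later _apply() flushes the batch. */
                queue_for_batch(addr, opcode, len, handler);
        }
}
------------- >% ----------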

>>
>> I was thinking to make one function to each method, but then I would have to
>> change the do_int3() and manage how to switch between one and the other without
>> further overhead. I was planing to do this in a second round of improvements.
> 
> I didn't think such big change.
> I just thought we could allocate single entry array on stack, something like

Ah!

> text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> {
> 	struct text_to_poke tp = {.handler = handler, .addr = addr, .len = len};
> 	if (len > POKE_MAX_OPCODE_SIZE)
> 		return -E2BIG;
> 	memcpy(tp.opcode, opcode, len);
> 	return text_poke_bp_batch(&tp, 1);
> }

Good idea!

>>
>>> If so, you can remove bp_int3_addr and this block.
>>>
>>>> +	if (bp_int3_addr) {
>>>> +		if (regs->ip == (unsigned long) bp_int3_addr) {
>>>> +			regs->ip = (unsigned long) bp_int3_handler;
>>>> +			return 1;
>>>> +		}
>>>> +		return 0;
>>>> +	}
>>>>
>>>> -	return 1;
>>>> +	/*
>>>> +	 * Batch mode.
>>>> +	 */
>>>> +	if (bp_int3_tpv_nr) {
>>>
>>> if (unlikely(bp_int3_tpv_nr))
>>>
>>> Sorry about interrupting, but this is a "hot-path" when we use kprobes.
>>
>> No problem at all! :-)
> 
> Thanks! :-)
> 
>>
>> I will change this function to better deal with the hot-path (the default mode
>> after the system boots up).
>>
>> how about something like this:
>> ------------------ %< ------------------
>> int poke_int3_handler(struct pt_regs *regs)
>> {
>>         void *ip;
>>         struct text_to_poke *tp;
>>
>>         /*
>>          * Having observed our INT3 instruction, we now must observe
>>          * bp_patching_in_progress.
>>          *
>>          *      in_progress = TRUE              INT3
>>          *      WMB                             RMB
>>          *      write INT3                      if (in_progress)
>>          *
>>          * Idem for bp_int3_handler.
>>          */
>>         smp_rmb();
>>
>>         if (likely(!bp_patching_in_progress))
>>                 return 0;
>>
>>         if (user_mode(regs))
>>                 return 0;
>>
>>         /*
>>          * Single poke is only used at the boot.
>>          */
>>         if (unlikely(!bp_int3_tpv))
>>                 goto single_poke;
>>
>>         ip = (void *) regs->ip - sizeof(unsigned char);
>>         tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
>>                      sizeof(struct text_to_poke),
>>                      text_bp_batch_bsearch);
>>         if (tp) {
>>                 /* set up the specified breakpoint handler */
>>                 regs->ip = (unsigned long) tp->handler;
>>                 return 1;
>>         }
>>
>>         return 0;
>>
>> single_poke:
>>         if (regs->ip == (unsigned long) bp_int3_addr) {
>>                 regs->ip = (unsigned long) bp_int3_handler;
>>                 return 1;
>>         }
>>
>>         return 0;
>> }
>> ------------- >% ----------
>>
>> In this way the default code is up, and the only 'if' I am using is a var of the
>> batch mode (that will be used later). If are are still at the boot, we are
>> jumping to the end of the function.
>>
>> look better?
> 
> yeah, it looks much better. But I just wonder why don't you consolidate both by
> just because reducing code.
> 

And so I did. How about something like this?
---------- %< ---------
---
 arch/x86/include/asm/text-patching.h |  15 ++++
 arch/x86/kernel/alternative.c        | 118 +++++++++++++++++++--------
 lib/bsearch.c                        |   2 +
 3 files changed, 100 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..42ea7846df33 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_end	NULL
 #endif
 
+/*
+ * Currently, the max observed size in the kernel code is
+ * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
+ * Raise it if needed.
+ */
+#define POKE_MAX_OPCODE_SIZE	5
+
+struct text_to_poke {
+	void *handler;
+	void *addr;
+	size_t len;
+	const char opcode[POKE_MAX_OPCODE_SIZE];
+};
+
 extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
@@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
 extern int after_bootmem;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 202af29c43c0..2196bb8bb924 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -11,6 +11,7 @@
 #include <linux/stop_machine.h>
 #include <linux/slab.h>
 #include <linux/kdebug.h>
+#include <linux/kprobes.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -21,6 +22,7 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 #include <asm/fixmap.h>
+#include <linux/bsearch.h>
 
 int __read_mostly alternatives_patched;
 
@@ -738,10 +740,26 @@ static void do_sync_core(void *info)
 }
 
 static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static struct text_to_poke *bp_int3_tpv;
+static unsigned int bp_int3_tpv_nr;
+
+static int text_bp_batch_bsearch(const void *key, const void *elt)
+{
+	struct text_to_poke *tp = (struct text_to_poke *) elt;
+
+	if (key < tp->addr)
+		return -1;
+	if (key > tp->addr)
+		return 1;
+	return 0;
+}
+NOKPROBE_SYMBOL(text_bp_batch_bsearch);
 
 int poke_int3_handler(struct pt_regs *regs)
 {
+	void *ip;
+	struct text_to_poke *tp;
+
 	/*
 	 * Having observed our INT3 instruction, we now must observe
 	 * bp_patching_in_progress.
@@ -757,21 +775,41 @@ int poke_int3_handler(struct pt_regs *regs)
 	if (likely(!bp_patching_in_progress))
 		return 0;
 
-	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+	if (user_mode(regs))
 		return 0;
 
-	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	ip = (void *) regs->ip - sizeof(unsigned char);
 
-	return 1;
+	/*
+	 * Skip the binary search if there is a single member in the vector.
+	 */
+	if (unlikely(bp_int3_tpv_nr == 1))
+		goto single_poke;
+
+	tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
+		     sizeof(struct text_to_poke),
+		     text_bp_batch_bsearch);
+	if (tp) {
+		/* set up the specified breakpoint handler */
+		regs->ip = (unsigned long) tp->handler;
+		return 1;
+	}
+
+	return 0;
+
+single_poke:
+	if (ip == bp_int3_tpv->addr) {
+		regs->ip = (unsigned long) bp_int3_tpv->handler;
+		return 1;
+	}
 
+	return 0;
 }
+NOKPROBE_SYMBOL(poke_int3_handler);
 
 static void text_poke_bp_set_handler(void *addr, void *handler,
 				     unsigned char int3)
 {
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
 	text_poke(addr, &int3, sizeof(int3));
 }
 
@@ -790,32 +828,14 @@ static void patch_first_byte(void *addr, const void *opcode, unsigned char int3)
 	text_poke(addr, opcode, sizeof(int3));
 }
 
-/**
- * text_poke_bp() -- update instructions on live kernel on SMP
- * @addr:	address to patch
- * @opcode:	opcode of new instruction
- * @len:	length to copy
- * @handler:	address to jump to when the temporary breakpoint is hit
- *
- * Modify multi-byte instruction by using int3 breakpoint on SMP.
- * We completely avoid stop_machine() here, and achieve the
- * synchronization using int3 breakpoint.
- *
- * The way it is done:
- *	- add a int3 trap to the address that will be patched
- *	- sync cores
- *	- update all but the first byte of the patched range
- *	- sync cores
- *	- replace the first byte (int3) by the first byte of
- *	  replacing opcode
- *	- sync cores
- */
-void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
 {
+	unsigned int i;
 	unsigned char int3 = 0xcc;
+	int patched_all_but_first = 0;
 
-	lockdep_assert_held(&text_mutex);
-
+	bp_int3_tpv = tp;
+	bp_int3_tpv_nr = nr_entries;
 	bp_patching_in_progress = true;
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
@@ -823,12 +843,20 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 */
 	smp_wmb();
 
-	text_poke_bp_set_handler(addr, handler, int3);
+	for (i = 0; i < nr_entries; i++)
+		text_poke_bp_set_handler(tp[i].addr, tp[i].handler, int3);
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
-	if (len - sizeof(int3) > 0) {
-		patch_all_but_first_byte(addr, opcode, len, int3);
+	for (i = 0; i < nr_entries; i++) {
+		if (tp[i].len - sizeof(int3) > 0) {
+			patch_all_but_first_byte(tp[i].addr, tp[i].opcode,
+						 tp[i].len, int3);
+			patched_all_but_first++;
+		}
+	}
+
+	if (patched_all_but_first) {
 		/*
 		 * According to Intel, this core syncing is very likely
 		 * not necessary and we'd be safe even without it. But
@@ -837,15 +865,35 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 		on_each_cpu(do_sync_core, NULL, 1);
 	}
 
-	patch_first_byte(addr, opcode, int3);
+	for (i = 0; i < nr_entries; i++)
+		patch_first_byte(tp[i].addr, tp[i].opcode, int3);
 
 	on_each_cpu(do_sync_core, NULL, 1);
 	/*
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.
 	 */
+	bp_int3_tpv_nr = 0;
+	bp_int3_tpv = NULL;
 	bp_patching_in_progress = false;
+}
+ XXX: paste the old comment here... I forgot.
+void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+{
+	struct text_to_poke tp = {
+		.handler = handler,
+		.addr = addr,
+		.len = len
+	};
+
+	if (len > POKE_MAX_OPCODE_SIZE) {
+		WARN_ONCE(1, "len is larger than %d\n", POKE_MAX_OPCODE_SIZE);
+		return NULL;
+	}
+
+	memcpy((void *)tp.opcode, opcode, len);
+
+	text_poke_bp_batch(&tp, 1);
 
 	return addr;
 }
-
diff --git a/lib/bsearch.c b/lib/bsearch.c
index 18b445b010c3..82512fe7b33c 100644
--- a/lib/bsearch.c
+++ b/lib/bsearch.c
@@ -11,6 +11,7 @@
 
 #include <linux/export.h>
 #include <linux/bsearch.h>
+#include <linux/kprobes.h>
 
 /*
  * bsearch - binary search an array of elements
@@ -53,3 +54,4 @@ void *bsearch(const void *key, const void *base, size_t num, size_t size,
 	return NULL;
 }
 EXPORT_SYMBOL(bsearch);
+NOKPROBE_SYMBOL(bsearch);
-- 

If so, I will send a v4 with this idea.

Thanks!

-- Daniel


* Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations
  2019-02-01 12:49         ` Daniel Bristot de Oliveira
@ 2019-02-01 14:47           ` Masami Hiramatsu
  2019-02-01 15:51             ` Daniel Bristot de Oliveira
  0 siblings, 1 reply; 27+ messages in thread
From: Masami Hiramatsu @ 2019-02-01 14:47 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Hi Daniel,

On Fri, 1 Feb 2019 13:49:32 +0100
Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> On 1/28/19 2:52 PM, Masami Hiramatsu wrote:
> > On Sat, 26 Jan 2019 12:52:15 +0100
> > Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> > 
> >> On 1/23/19 6:15 AM, Masami Hiramatsu wrote:
> >>> Hi Daniel,
> >>>
> >>> On Fri, 21 Dec 2018 11:27:32 +0100
> >>> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> >>>
> >>>> Currently, the patch of an address is done in three steps:
> >>>>
> >>>> -- Pseudo-code #1 - Current implementation ---
> >>>>         1) add an int3 trap to the address that will be patched
> >>>>             sync cores (send IPI to all other CPUs)
> >>>>         2) update all but the first byte of the patched range
> >>>>             sync cores (send IPI to all other CPUs)
> >>>>         3) replace the first byte (int3) by the first byte of replacing opcode
> >>>>             sync cores (send IPI to all other CPUs)
> >>>> -- Pseudo-code #1 ---
> >>>>
> >>>> When a static key has more than one entry, these steps are called once for
> >>>> each entry. The number of IPIs then is linear with regard to the number 'n' of
> >>>> entries of a key: O(n*3), which is O(n).
> >>>>
> >>>> This algorithm works fine for the update of a single key. But we think
> >>>> it is possible to optimize the case in which a static key has more than
> >>>> one entry. For instance, the sched_schedstats jump label has 56 entries
> >>>> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
> >>>> which the thread that is enabling the key is _not_ running.
> >>>>
> >>>> With this patch, rather than receiving a single patch to be processed, a vector
> >>>> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
> >>>> way:
> >>>>
> >>>> -- Pseudo-code #2 - This patch  ---
> >>>> 1)  for each patch in the vector:
> >>>>         add an int3 trap to the address that will be patched
> >>>>
> >>>>     sync cores (send IPI to all other CPUs)
> >>>>
> >>>> 2)  for each patch in the vector:
> >>>>         update all but the first byte of the patched range
> >>>>
> >>>>     sync cores (send IPI to all other CPUs)
> >>>>
> >>>> 3)  for each patch in the vector:
> >>>>         replace the first byte (int3) by the first byte of replacing opcode
> >>>>
> >>>>     sync cores (send IPI to all other CPUs)
> >>>> -- Pseudo-code #2 - This patch  ---
> >>>>
> >>>> Doing the update in this way, the number of IPI becomes O(3) with regard
> >>>> to the number of keys, which is O(1).
> >>>>
> >>>> The batch mode is done with the function text_poke_bp_batch(), that receives
> >>>> two arguments: a vector of "struct text_to_poke", and the number of entries
> >>>> in the vector.
> >>>>
> >>>> The vector must be sorted by the addr field of the text_to_poke structure,
> >>>> enabling the binary search of a handler in the poke_int3_handler function
> >>>> (a fast path).
> >>>>
> >>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
> >>>> Cc: Thomas Gleixner <tglx@linutronix.de>
> >>>> Cc: Ingo Molnar <mingo@redhat.com>
> >>>> Cc: Borislav Petkov <bp@alien8.de>
> >>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
> >>>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> >>>> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
> >>>> Cc: Jiri Kosina <jkosina@suse.cz>
> >>>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> >>>> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> >>>> Cc: Chris von Recklinghausen <crecklin@redhat.com>
> >>>> Cc: Jason Baron <jbaron@akamai.com>
> >>>> Cc: Scott Wood <swood@redhat.com>
> >>>> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> >>>> Cc: Clark Williams <williams@redhat.com>
> >>>> Cc: x86@kernel.org
> >>>> Cc: linux-kernel@vger.kernel.org
> >>>> ---
> >>>>  arch/x86/include/asm/text-patching.h |  15 ++++
> >>>>  arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
> >>>>  2 files changed, 117 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> >>>> index e85ff65c43c3..42ea7846df33 100644
> >>>> --- a/arch/x86/include/asm/text-patching.h
> >>>> +++ b/arch/x86/include/asm/text-patching.h
> >>>> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
> >>>>  #define __parainstructions_end	NULL
> >>>>  #endif
> >>>>  
> >>>> +/*
> >>>> + * Currently, the max observed size in the kernel code is
> >>>> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
> >>>> + * Raise it if needed.
> >>>> + */
> >>>> +#define POKE_MAX_OPCODE_SIZE	5
> >>>> +
> >>>> +struct text_to_poke {
> >>>> +	void *handler;
> >>>> +	void *addr;
> >>>> +	size_t len;
> >>>> +	const char opcode[POKE_MAX_OPCODE_SIZE];
> >>>> +};
> >>>> +
> >>>>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>>>  
> >>>>  /*
> >>>> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
> >>>>  extern void *text_poke(void *addr, const void *opcode, size_t len);
> >>>>  extern int poke_int3_handler(struct pt_regs *regs);
> >>>>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> >>>> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
> >>>>  extern int after_bootmem;
> >>>>  
> >>>>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> >>>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> >>>> index 6f5ad8587de0..8fa47e5ec709 100644
> >>>> --- a/arch/x86/kernel/alternative.c
> >>>> +++ b/arch/x86/kernel/alternative.c
> >>>> @@ -21,6 +21,7 @@
> >>>>  #include <asm/tlbflush.h>
> >>>>  #include <asm/io.h>
> >>>>  #include <asm/fixmap.h>
> >>>> +#include <linux/bsearch.h>
> >>>>  
> >>>>  int __read_mostly alternatives_patched;
> >>>>  
> >>>> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
> >>>>  }
> >>>>  
> >>>>  static bool bp_patching_in_progress;
> >>>> +/*
> >>>> + * Single poke.
> >>>> + */
> >>>>  static void *bp_int3_handler, *bp_int3_addr;
> >>>> +/*
> >>>> + * Batching poke.
> >>>> + */
> >>>> +static struct text_to_poke *bp_int3_tpv;
> >>>> +static unsigned int bp_int3_tpv_nr;
> >>>> +
> >>>> +static int text_bp_batch_bsearch(const void *key, const void *elt)
> >>>> +{
> >>>> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
> >>>> +
> >>>> +	if (key < tp->addr)
> >>>> +		return -1;
> >>>> +	if (key > tp->addr)
> >>>> +		return 1;
> >>>> +	return 0;
> >>>> +}
> >>>>  
> >>>>  int poke_int3_handler(struct pt_regs *regs)
> >>>>  {
> >>>> +	void *ip;
> >>>> +	struct text_to_poke *tp;
> >>>> +
> >>>>  	/*
> >>>>  	 * Having observed our INT3 instruction, we now must observe
> >>>>  	 * bp_patching_in_progress.
> >>>> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
> >>>>  	if (likely(!bp_patching_in_progress))
> >>>>  		return 0;
> >>>>  
> >>>> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> >>>> +	if (user_mode(regs))
> >>>>  		return 0;
> >>>>  
> >>>> -	/* set up the specified breakpoint handler */
> >>>> -	regs->ip = (unsigned long) bp_int3_handler;
> >>>> +	/*
> >>>> +	 * Single poke first.
> >>>> +	 */
> >>>
> >>> I wonder why would you separate single poke and batch poke?
> >>> It seems a single poke is just a case that bp_int3_tpv_nr == 1.
> >>
> >> Hi Masami!
> >>
> >> The single poke is used only at the boot time, before the system is able to
> >> allocate memory. After that, the batch mode becomes the default.
> > 
> > Hmm, what's the difference from text_poke_early()?
> 
> text_poke_early(): before enabling interrupts at boot.
> 
> text_poke_bp(): after enabling interrupts, before being able to allocate memory,
> or in the error handling with batch mode.
> 
> text_poke_bp_batch(): after enabling interrupts and being able to allocate memory.

OK, I got it. Maybe we should document this for future users.
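Something along the lines of the sketch below, as a comment on top of the
poking functions, would probably be enough; the wording is only illustrative,
based on your explanation above:

	/*
	 * There are three interfaces to patch kernel text:
	 *
	 * - text_poke_early(): used before interrupts are enabled, at boot;
	 * - text_poke_bp(): used after interrupts are enabled, but before
	 *   the system is able to allocate memory, or in the error handling
	 *   path of the batch mode;
	 * - text_poke_bp_batch(): used after interrupts are enabled and the
	 *   system is able to allocate memory.
	 */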

> >> I was thinking of making one function for each method, but then I would have to
> >> change do_int3() and manage how to switch between one and the other without
> >> further overhead. I was planning to do this in a second round of improvements.
> > 
> > I wasn't thinking of such a big change.
> > I just thought we could allocate a single-entry array on the stack, something like
> 
> Ah!
> 
> > text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> > {
> > 	struct text_to_poke tp = {.handler = handler, .addr = addr, .len = len};
> > 	if (len > POKE_MAX_OPCODE_SIZE)
> > 		return -E2BIG;
> > 	memcpy(tp.opcode, opcode, len);
> > 	return text_poke_bp_batch(&tp, 1);
> > }
> 
> Good idea!
> 
> >>
> >>> If so, you can remove bp_int3_addr and this block.
> >>>
> >>>> +	if (bp_int3_addr) {
> >>>> +		if (regs->ip == (unsigned long) bp_int3_addr) {
> >>>> +			regs->ip = (unsigned long) bp_int3_handler;
> >>>> +			return 1;
> >>>> +		}
> >>>> +		return 0;
> >>>> +	}
> >>>>
> >>>> -	return 1;
> >>>> +	/*
> >>>> +	 * Batch mode.
> >>>> +	 */
> >>>> +	if (bp_int3_tpv_nr) {
> >>>
> >>> if (unlikely(bp_int3_tpv_nr))
> >>>
> >>> Sorry about interrupting, but this is a "hot-path" when we use kprobes.
> >>
> >> No problem at all! :-)
> > 
> > Thanks! :-)
> > 
> >>
> >> I will change this function to better deal with the hot-path (the default mode
> >> after the system boots up).
> >>
> >> how about something like this:
> >> ------------------ %< ------------------
> >> int poke_int3_handler(struct pt_regs *regs)
> >> {
> >>         void *ip;
> >>         struct text_to_poke *tp;
> >>
> >>         /*
> >>          * Having observed our INT3 instruction, we now must observe
> >>          * bp_patching_in_progress.
> >>          *
> >>          *      in_progress = TRUE              INT3
> >>          *      WMB                             RMB
> >>          *      write INT3                      if (in_progress)
> >>          *
> >>          * Idem for bp_int3_handler.
> >>          */
> >>         smp_rmb();
> >>
> >>         if (likely(!bp_patching_in_progress))
> >>                 return 0;
> >>
> >>         if (user_mode(regs))
> >>                 return 0;
> >>
> >>         /*
> >>          * Single poke is only used at the boot.
> >>          */
> >>         if (unlikely(!bp_int3_tpv))
> >>                 goto single_poke;
> >>
> >>         ip = (void *) regs->ip - sizeof(unsigned char);
> >>         tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
> >>                      sizeof(struct text_to_poke),
> >>                      text_bp_batch_bsearch);
> >>         if (tp) {
> >>                 /* set up the specified breakpoint handler */
> >>                 regs->ip = (unsigned long) tp->handler;
> >>                 return 1;
> >>         }
> >>
> >>         return 0;
> >>
> >> single_poke:
> >>         if (regs->ip == (unsigned long) bp_int3_addr) {
> >>                 regs->ip = (unsigned long) bp_int3_handler;
> >>                 return 1;
> >>         }
> >>
> >>         return 0;
> >> }
> >> ------------- >% ----------
> >>
> >> In this way the default code is at the top, and the only 'if' I am using is a
> >> variable of the batch mode (that will be used later). If we are still at boot,
> >> we jump to the end of the function.
> >>
> >> look better?
> > 
> > yeah, it looks much better. But I just wonder why you don't consolidate both,
> > just to reduce the code.
> > 
> 
> and so I did. How about something like this?

OK, I just have a nitpick comment, but this version looks good to me.


> ---------- %< ---------
> ---
>  arch/x86/include/asm/text-patching.h |  15 ++++
>  arch/x86/kernel/alternative.c        | 118 +++++++++++++++++++--------
>  lib/bsearch.c                        |   2 +
>  3 files changed, 100 insertions(+), 35 deletions(-)
> 
> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> index e85ff65c43c3..42ea7846df33 100644
> --- a/arch/x86/include/asm/text-patching.h
> +++ b/arch/x86/include/asm/text-patching.h
> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>  #define __parainstructions_end	NULL
>  #endif
>  
> +/*
> + * Currently, the max observed size in the kernel code is
> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
> + * Raise it if needed.
> + */
> +#define POKE_MAX_OPCODE_SIZE	5
> +
> +struct text_to_poke {
> +	void *handler;
> +	void *addr;
> +	size_t len;
> +	const char opcode[POKE_MAX_OPCODE_SIZE];
> +};
> +
>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  
>  /*
> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>  extern int poke_int3_handler(struct pt_regs *regs);
>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
>  extern int after_bootmem;
>  
>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index 202af29c43c0..2196bb8bb924 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -11,6 +11,7 @@
>  #include <linux/stop_machine.h>
>  #include <linux/slab.h>
>  #include <linux/kdebug.h>
> +#include <linux/kprobes.h>
>  #include <asm/text-patching.h>
>  #include <asm/alternative.h>
>  #include <asm/sections.h>
> @@ -21,6 +22,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/io.h>
>  #include <asm/fixmap.h>
> +#include <linux/bsearch.h>
>  
>  int __read_mostly alternatives_patched;
>  
> @@ -738,10 +740,26 @@ static void do_sync_core(void *info)
>  }
>  
>  static bool bp_patching_in_progress;
> -static void *bp_int3_handler, *bp_int3_addr;
> +static struct text_to_poke *bp_int3_tpv;
> +static unsigned int bp_int3_tpv_nr;
> +
> +static int text_bp_batch_bsearch(const void *key, const void *elt)
> +{
> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
> +
> +	if (key < tp->addr)
> +		return -1;
> +	if (key > tp->addr)
> +		return 1;
> +	return 0;
> +}
> +NOKPROBE_SYMBOL(text_bp_batch_bsearch);
>  
>  int poke_int3_handler(struct pt_regs *regs)
>  {
> +	void *ip;
> +	struct text_to_poke *tp;
> +
>  	/*
>  	 * Having observed our INT3 instruction, we now must observe
>  	 * bp_patching_in_progress.
> @@ -757,21 +775,41 @@ int poke_int3_handler(struct pt_regs *regs)
>  	if (likely(!bp_patching_in_progress))
>  		return 0;
>  
> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> +	if (user_mode(regs))
>  		return 0;
>  
> -	/* set up the specified breakpoint handler */
> -	regs->ip = (unsigned long) bp_int3_handler;
> +	ip = (void *) regs->ip - sizeof(unsigned char);
>  
> -	return 1;
> +	/*
> +	 * Skip the binary search if there is a single member in the vector.
> +	 */
> +	if (unlikely(bp_int3_tpv_nr == 1))
> +		goto single_poke;
> +
> +	tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
> +		     sizeof(struct text_to_poke),
> +		     text_bp_batch_bsearch);
> +	if (tp) {
> +		/* set up the specified breakpoint handler */
> +		regs->ip = (unsigned long) tp->handler;
> +		return 1;
> +	}
> +
> +	return 0;
> +
> +single_poke:
> +	if (ip == (unsigned long) bp_int3_tpv->addr) {
> +		regs->ip = (unsigned long) bp_int3_tpv->handler;
> +		return 1;
> +	}
>  
> +	return 0;
>  }
> +NOKPROBE_SYMBOL(poke_int3_handler);

Ah, this will be covered by a series that I'm currently pinging Ingo about.

https://lkml.org/lkml/2019/1/11/1480


>  
>  static void text_poke_bp_set_handler(void *addr, void *handler,
>  				     unsigned char int3)
>  {
> -	bp_int3_handler = handler;
> -	bp_int3_addr = (u8 *)addr + sizeof(int3);
>  	text_poke(addr, &int3, sizeof(int3));
>  }
>  
> @@ -790,32 +828,14 @@ static void patch_first_byte(void *addr, const void *opcode, unsigned char int3)
>  	text_poke(addr, opcode, sizeof(int3));
>  }
>  
> -/**
> - * text_poke_bp() -- update instructions on live kernel on SMP
> - * @addr:	address to patch
> - * @opcode:	opcode of new instruction
> - * @len:	length to copy
> - * @handler:	address to jump to when the temporary breakpoint is hit
> - *
> - * Modify multi-byte instruction by using int3 breakpoint on SMP.
> - * We completely avoid stop_machine() here, and achieve the
> - * synchronization using int3 breakpoint.
> - *
> - * The way it is done:
> - *	- add a int3 trap to the address that will be patched
> - *	- sync cores
> - *	- update all but the first byte of the patched range
> - *	- sync cores
> - *	- replace the first byte (int3) by the first byte of
> - *	  replacing opcode
> - *	- sync cores
> - */
> -void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
>  {
> +	unsigned int i;
>  	unsigned char int3 = 0xcc;
> +	int patched_all_but_first = 0;
>  
> -	lockdep_assert_held(&text_mutex);
> -
> +	bp_int3_tpv = tp;
> +	bp_int3_tpv_nr = nr_entries;
>  	bp_patching_in_progress = true;
>  	/*
>  	 * Corresponding read barrier in int3 notifier for making sure the
> @@ -823,12 +843,20 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  	 */
>  	smp_wmb();
>  
> -	text_poke_bp_set_handler(addr, handler, int3);
> +	for (i = 0; i < nr_entries; i++)
> +		text_poke_bp_set_handler(tp[i].addr, tp[i].handler, int3);
>  
>  	on_each_cpu(do_sync_core, NULL, 1);
>  
> -	if (len - sizeof(int3) > 0) {
> -		patch_all_but_first_byte(addr, opcode, len, int3);
> +	for (i = 0; i < nr_entries; i++) {
> +		if (tp[i].len - sizeof(int3) > 0) {
> +			patch_all_but_first_byte(tp[i].addr, tp[i].opcode,
> +						 tp[i].len, int3);
> +			patched_all_but_first++;
> +		}
> +	}
> +
> +	if (patched_all_but_first) {
>  		/*
>  		 * According to Intel, this core syncing is very likely
>  		 * not necessary and we'd be safe even without it. But
> @@ -837,15 +865,35 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>  		on_each_cpu(do_sync_core, NULL, 1);
>  	}
>  
> -	patch_first_byte(addr, opcode, int3);
> +	for (i = 0; i < nr_entries; i++)
> +		patch_first_byte(tp[i].addr, tp[i].opcode, int3);
>  
>  	on_each_cpu(do_sync_core, NULL, 1);
>  	/*
>  	 * sync_core() implies an smp_mb() and orders this store against
>  	 * the writing of the new instruction.
>  	 */
> +	bp_int3_tpv_nr = 0;
> +	bp_int3_tpv = NULL;
>  	bp_patching_in_progress = false;
> +}
> + XXX: paste the old comment here... I forgot.
> +void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
> +{
> +	struct text_to_poke tp = {
> +		.handler = handler,
> +		.addr = addr,
> +		.len = len

Even for the last field assignment, you'd better add a "," here.
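i.e., just to illustrate the nitpick:

	struct text_to_poke tp = {
		.handler = handler,
		.addr = addr,
		.len = len,
	};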

> +	};
> +
> +	if (len > POKE_MAX_OPCODE_SIZE) {
> +		WARN_ONCE(1, "len is larger than %d\n", POKE_MAX_OPCODE_SIZE);
> +		return NULL;
> +	}
> +
> +	memcpy((void *)tp.opcode, opcode, len);
> +
> +	text_poke_bp_batch(&tp, 1);
>  
>  	return addr;
>  }
> -

> diff --git a/lib/bsearch.c b/lib/bsearch.c
> index 18b445b010c3..82512fe7b33c 100644
> --- a/lib/bsearch.c
> +++ b/lib/bsearch.c
> @@ -11,6 +11,7 @@
>  
>  #include <linux/export.h>
>  #include <linux/bsearch.h>
> +#include <linux/kprobes.h>
>  
>  /*
>   * bsearch - binary search an array of elements
> @@ -53,3 +54,4 @@ void *bsearch(const void *key, const void *base, size_t num, size_t size,
>  	return NULL;
>  }
>  EXPORT_SYMBOL(bsearch);
> +NOKPROBE_SYMBOL(bsearch);

Actually, this part has already been pointed out by Andrea Righi, since ftrace
is using bsearch; see below.

https://lkml.org/lkml/2019/1/12/70

It depends on which patch series is merged first, but I would like to
separate the NOKPROBE_SYMBOL() patch, since it fixes (or prevents) a bug.

Anyway, this looks good to me.

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you,

> -- 
> 
> If so, I will send a v4 with this idea.
> 
> Thanks!
> 
> -- Daniel


-- 
Masami Hiramatsu <mhiramat@kernel.org>


* Re: [PATCH V3 7/9] x86/alternative: Batch of patch operations
  2019-02-01 14:47           ` Masami Hiramatsu
@ 2019-02-01 15:51             ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-02-01 15:51 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Hi Masami!

On 2/1/19 3:47 PM, Masami Hiramatsu wrote:
> Hi Daniel,
> 
> On Fri, 1 Feb 2019 13:49:32 +0100
> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
> 
>> On 1/28/19 2:52 PM, Masami Hiramatsu wrote:
>>> On Sat, 26 Jan 2019 12:52:15 +0100
>>> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
>>>
>>>> On 1/23/19 6:15 AM, Masami Hiramatsu wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> On Fri, 21 Dec 2018 11:27:32 +0100
>>>>> Daniel Bristot de Oliveira <bristot@redhat.com> wrote:
>>>>>
>>>>>> Currently, the patch of an address is done in three steps:
>>>>>>
>>>>>> -- Pseudo-code #1 - Current implementation ---
>>>>>>         1) add an int3 trap to the address that will be patched
>>>>>>             sync cores (send IPI to all other CPUs)
>>>>>>         2) update all but the first byte of the patched range
>>>>>>             sync cores (send IPI to all other CPUs)
>>>>>>         3) replace the first byte (int3) by the first byte of replacing opcode
>>>>>>             sync cores (send IPI to all other CPUs)
>>>>>> -- Pseudo-code #1 ---
>>>>>>
>>>>>> When a static key has more than one entry, these steps are called once for
>>>>>> each entry. The number of IPIs then is linear with regard to the number 'n' of
>>>>>> entries of a key: O(n*3), which is O(n).
>>>>>>
>>>>>> This algorithm works fine for the update of a single key. But we think
>>>>>> it is possible to optimize the case in which a static key has more than
>>>>>> one entry. For instance, the sched_schedstats jump label has 56 entries
>>>>>> in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
>>>>>> which the thread that is enabling the key is _not_ running.
>>>>>>
>>>>>> With this patch, rather than receiving a single patch to be processed, a vector
>>>>>> of patches is passed, enabling the rewrite of the pseudo-code #1 in this
>>>>>> way:
>>>>>>
>>>>>> -- Pseudo-code #2 - This patch  ---
>>>>>> 1)  for each patch in the vector:
>>>>>>         add an int3 trap to the address that will be patched
>>>>>>
>>>>>>     sync cores (send IPI to all other CPUs)
>>>>>>
>>>>>> 2)  for each patch in the vector:
>>>>>>         update all but the first byte of the patched range
>>>>>>
>>>>>>     sync cores (send IPI to all other CPUs)
>>>>>>
>>>>>> 3)  for each patch in the vector:
>>>>>>         replace the first byte (int3) by the first byte of replacing opcode
>>>>>>
>>>>>>     sync cores (send IPI to all other CPUs)
>>>>>> -- Pseudo-code #2 - This patch  ---
>>>>>>
>>>>>> Doing the update in this way, the number of IPI becomes O(3) with regard
>>>>>> to the number of keys, which is O(1).
>>>>>>
>>>>>> The batch mode is done with the function text_poke_bp_batch(), that receives
>>>>>> two arguments: a vector of "struct text_to_poke", and the number of entries
>>>>>> in the vector.
>>>>>>
>>>>>> The vector must be sorted by the addr field of the text_to_poke structure,
>>>>>> enabling the binary search of a handler in the poke_int3_handler function
>>>>>> (a fast path).
>>>>>>
>>>>>> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
>>>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>>> Cc: Ingo Molnar <mingo@redhat.com>
>>>>>> Cc: Borislav Petkov <bp@alien8.de>
>>>>>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>>>>>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>>>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>>>>>> Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
>>>>>> Cc: Jiri Kosina <jkosina@suse.cz>
>>>>>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>>>>>> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>>>>>> Cc: Chris von Recklinghausen <crecklin@redhat.com>
>>>>>> Cc: Jason Baron <jbaron@akamai.com>
>>>>>> Cc: Scott Wood <swood@redhat.com>
>>>>>> Cc: Marcelo Tosatti <mtosatti@redhat.com>
>>>>>> Cc: Clark Williams <williams@redhat.com>
>>>>>> Cc: x86@kernel.org
>>>>>> Cc: linux-kernel@vger.kernel.org
>>>>>> ---
>>>>>>  arch/x86/include/asm/text-patching.h |  15 ++++
>>>>>>  arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++--
>>>>>>  2 files changed, 117 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>>>>>> index e85ff65c43c3..42ea7846df33 100644
>>>>>> --- a/arch/x86/include/asm/text-patching.h
>>>>>> +++ b/arch/x86/include/asm/text-patching.h
>>>>>> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>>>>>>  #define __parainstructions_end	NULL
>>>>>>  #endif
>>>>>>  
>>>>>> +/*
>>>>>> + * Currently, the max observed size in the kernel code is
>>>>>> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
>>>>>> + * Raise it if needed.
>>>>>> + */
>>>>>> +#define POKE_MAX_OPCODE_SIZE	5
>>>>>> +
>>>>>> +struct text_to_poke {
>>>>>> +	void *handler;
>>>>>> +	void *addr;
>>>>>> +	size_t len;
>>>>>> +	const char opcode[POKE_MAX_OPCODE_SIZE];
>>>>>> +};
>>>>>> +
>>>>>>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>>>>>  
>>>>>>  /*
>>>>>> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>>>>>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>>>>>>  extern int poke_int3_handler(struct pt_regs *regs);
>>>>>>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>>>>>> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
>>>>>>  extern int after_bootmem;
>>>>>>  
>>>>>>  #endif /* _ASM_X86_TEXT_PATCHING_H */
>>>>>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>>>>>> index 6f5ad8587de0..8fa47e5ec709 100644
>>>>>> --- a/arch/x86/kernel/alternative.c
>>>>>> +++ b/arch/x86/kernel/alternative.c
>>>>>> @@ -21,6 +21,7 @@
>>>>>>  #include <asm/tlbflush.h>
>>>>>>  #include <asm/io.h>
>>>>>>  #include <asm/fixmap.h>
>>>>>> +#include <linux/bsearch.h>
>>>>>>  
>>>>>>  int __read_mostly alternatives_patched;
>>>>>>  
>>>>>> @@ -738,10 +739,32 @@ static void do_sync_core(void *info)
>>>>>>  }
>>>>>>  
>>>>>>  static bool bp_patching_in_progress;
>>>>>> +/*
>>>>>> + * Single poke.
>>>>>> + */
>>>>>>  static void *bp_int3_handler, *bp_int3_addr;
>>>>>> +/*
>>>>>> + * Batching poke.
>>>>>> + */
>>>>>> +static struct text_to_poke *bp_int3_tpv;
>>>>>> +static unsigned int bp_int3_tpv_nr;
>>>>>> +
>>>>>> +static int text_bp_batch_bsearch(const void *key, const void *elt)
>>>>>> +{
>>>>>> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
>>>>>> +
>>>>>> +	if (key < tp->addr)
>>>>>> +		return -1;
>>>>>> +	if (key > tp->addr)
>>>>>> +		return 1;
>>>>>> +	return 0;
>>>>>> +}
>>>>>>  
>>>>>>  int poke_int3_handler(struct pt_regs *regs)
>>>>>>  {
>>>>>> +	void *ip;
>>>>>> +	struct text_to_poke *tp;
>>>>>> +
>>>>>>  	/*
>>>>>>  	 * Having observed our INT3 instruction, we now must observe
>>>>>>  	 * bp_patching_in_progress.
>>>>>> @@ -757,21 +780,41 @@ int poke_int3_handler(struct pt_regs *regs)
>>>>>>  	if (likely(!bp_patching_in_progress))
>>>>>>  		return 0;
>>>>>>  
>>>>>> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
>>>>>> +	if (user_mode(regs))
>>>>>>  		return 0;
>>>>>>  
>>>>>> -	/* set up the specified breakpoint handler */
>>>>>> -	regs->ip = (unsigned long) bp_int3_handler;
>>>>>> +	/*
>>>>>> +	 * Single poke first.
>>>>>> +	 */
>>>>>
>>>>> I wonder why would you separate single poke and batch poke?
>>>>> It seems a single poke is just a case that bp_int3_tpv_nr == 1.
>>>>
>>>> Hi Masami!
>>>>
>>>> The single poke is used only at the boot time, before the system is able to
>>>> allocate memory. After that, the batch mode becomes the default.
>>>
>>> Hmm, what's the difference from text_poke_early()?
>>
>> text_poke_early(): before enabling interrupts at boot.
>>
>> text_poke_bp(): after enabling interrupts, before being able to allocate memory,
>> or in the error handling with batch mode.
>>
>> text_poke_bp_batch(): after enabling interrupts and being able to allocate memory.
> 
> OK, I got it. Maybe we should document this for future users.
> 
>>>> I was thinking of making one function for each method, but then I would have to
>>>> change do_int3() and manage how to switch between one and the other without
>>>> further overhead. I was planning to do this in a second round of improvements.
>>>
>>> I wasn't thinking of such a big change.
>>> I just thought we could allocate a single-entry array on the stack, something like
>>
>> Ah!
>>
>>> text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>>> {
>>> 	struct text_to_poke tp = {.handler = handler, .addr = addr, .len = len};
>>> 	if (len > POKE_MAX_OPCODE_SIZE)
>>> 		return -E2BIG;
>>> 	memcpy(tp.opcode, opcode, len);
>>> 	return text_poke_bp_batch(&tp, 1);
>>> }
>>
>> Good idea!
>>
>>>>
>>>>> If so, you can remove bp_int3_addr and this block.
>>>>>
>>>>>> +	if (bp_int3_addr) {
>>>>>> +		if (regs->ip == (unsigned long) bp_int3_addr) {
>>>>>> +			regs->ip = (unsigned long) bp_int3_handler;
>>>>>> +			return 1;
>>>>>> +		}
>>>>>> +		return 0;
>>>>>> +	}
>>>>>>
>>>>>> -	return 1;
>>>>>> +	/*
>>>>>> +	 * Batch mode.
>>>>>> +	 */
>>>>>> +	if (bp_int3_tpv_nr) {
>>>>>
>>>>> if (unlikely(bp_int3_tpv_nr))
>>>>>
>>>>> Sorry about interrupting, but this is a "hot-path" when we use kprobes.
>>>>
>>>> No problem at all! :-)
>>>
>>> Thanks! :-)
>>>
>>>>
>>>> I will change this function to better deal with the hot-path (the default mode
>>>> after the system boots up).
>>>>
>>>> how about something like this:
>>>> ------------------ %< ------------------
>>>> int poke_int3_handler(struct pt_regs *regs)
>>>> {
>>>>         void *ip;
>>>>         struct text_to_poke *tp;
>>>>
>>>>         /*
>>>>          * Having observed our INT3 instruction, we now must observe
>>>>          * bp_patching_in_progress.
>>>>          *
>>>>          *      in_progress = TRUE              INT3
>>>>          *      WMB                             RMB
>>>>          *      write INT3                      if (in_progress)
>>>>          *
>>>>          * Idem for bp_int3_handler.
>>>>          */
>>>>         smp_rmb();
>>>>
>>>>         if (likely(!bp_patching_in_progress))
>>>>                 return 0;
>>>>
>>>>         if (user_mode(regs))
>>>>                 return 0;
>>>>
>>>>         /*
>>>>          * Single poke is only used at the boot.
>>>>          */
>>>>         if (unlikely(!bp_int3_tpv))
>>>>                 goto single_poke;
>>>>
>>>>         ip = (void *) regs->ip - sizeof(unsigned char);
>>>>         tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
>>>>                      sizeof(struct text_to_poke),
>>>>                      text_bp_batch_bsearch);
>>>>         if (tp) {
>>>>                 /* set up the specified breakpoint handler */
>>>>                 regs->ip = (unsigned long) tp->handler;
>>>>                 return 1;
>>>>         }
>>>>
>>>>         return 0;
>>>>
>>>> single_poke:
>>>>         if (regs->ip == (unsigned long) bp_int3_addr) {
>>>>                 regs->ip = (unsigned long) bp_int3_handler;
>>>>                 return 1;
>>>>         }
>>>>
>>>>         return 0;
>>>> }
>>>> ------------- >% ----------
>>>>
>>>> In this way the default code is at the top, and the only 'if' I am using is a
>>>> variable of the batch mode (that will be used later). If we are still at boot,
>>>> we jump to the end of the function.
>>>>
>>>> look better?
>>>
>>> yeah, it looks much better. But I just wonder why you don't consolidate both,
>>> just to reduce the code.
>>>
>>
>> and so I did. How about something like this?
> 
> OK, I just have a nitpick comment, but this version looks good to me.
> 
> 
>> ---------- %< ---------
>> ---
>>  arch/x86/include/asm/text-patching.h |  15 ++++
>>  arch/x86/kernel/alternative.c        | 118 +++++++++++++++++++--------
>>  lib/bsearch.c                        |   2 +
>>  3 files changed, 100 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
>> index e85ff65c43c3..42ea7846df33 100644
>> --- a/arch/x86/include/asm/text-patching.h
>> +++ b/arch/x86/include/asm/text-patching.h
>> @@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
>>  #define __parainstructions_end	NULL
>>  #endif
>>  
>> +/*
>> + * Currently, the max observed size in the kernel code is
>> + * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
>> + * Raise it if needed.
>> + */
>> +#define POKE_MAX_OPCODE_SIZE	5
>> +
>> +struct text_to_poke {
>> +	void *handler;
>> +	void *addr;
>> +	size_t len;
>> +	const char opcode[POKE_MAX_OPCODE_SIZE];
>> +};
>> +
>>  extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>  
>>  /*
>> @@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
>>  extern void *text_poke(void *addr, const void *opcode, size_t len);
>>  extern int poke_int3_handler(struct pt_regs *regs);
>>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
>> +extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
>>  extern int after_bootmem;
>>  
>>  #endif /* _ASM_X86_TEXT_PATCHING_H */
>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
>> index 202af29c43c0..2196bb8bb924 100644
>> --- a/arch/x86/kernel/alternative.c
>> +++ b/arch/x86/kernel/alternative.c
>> @@ -11,6 +11,7 @@
>>  #include <linux/stop_machine.h>
>>  #include <linux/slab.h>
>>  #include <linux/kdebug.h>
>> +#include <linux/kprobes.h>
>>  #include <asm/text-patching.h>
>>  #include <asm/alternative.h>
>>  #include <asm/sections.h>
>> @@ -21,6 +22,7 @@
>>  #include <asm/tlbflush.h>
>>  #include <asm/io.h>
>>  #include <asm/fixmap.h>
>> +#include <linux/bsearch.h>
>>  
>>  int __read_mostly alternatives_patched;
>>  
>> @@ -738,10 +740,26 @@ static void do_sync_core(void *info)
>>  }
>>  
>>  static bool bp_patching_in_progress;
>> -static void *bp_int3_handler, *bp_int3_addr;
>> +static struct text_to_poke *bp_int3_tpv;
>> +static unsigned int bp_int3_tpv_nr;
>> +
>> +static int text_bp_batch_bsearch(const void *key, const void *elt)
>> +{
>> +	struct text_to_poke *tp = (struct text_to_poke *) elt;
>> +
>> +	if (key < tp->addr)
>> +		return -1;
>> +	if (key > tp->addr)
>> +		return 1;
>> +	return 0;
>> +}
>> +NOKPROBE_SYMBOL(text_bp_batch_bsearch);
>>  
>>  int poke_int3_handler(struct pt_regs *regs)
>>  {
>> +	void *ip;
>> +	struct text_to_poke *tp;
>> +
>>  	/*
>>  	 * Having observed our INT3 instruction, we now must observe
>>  	 * bp_patching_in_progress.
>> @@ -757,21 +775,41 @@ int poke_int3_handler(struct pt_regs *regs)
>>  	if (likely(!bp_patching_in_progress))
>>  		return 0;
>>  
>> -	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
>> +	if (user_mode(regs))
>>  		return 0;
>>  
>> -	/* set up the specified breakpoint handler */
>> -	regs->ip = (unsigned long) bp_int3_handler;
>> +	ip = (void *) regs->ip - sizeof(unsigned char);
>>  
>> -	return 1;
>> +	/*
>> +	 * Skip the binary search if there is a single member in the vector.
>> +	 */
>> +	if (unlikely(bp_int3_tpv_nr == 1))
>> +		goto single_poke;
>> +
>> +	tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
>> +		     sizeof(struct text_to_poke),
>> +		     text_bp_batch_bsearch);
>> +	if (tp) {
>> +		/* set up the specified breakpoint handler */
>> +		regs->ip = (unsigned long) tp->handler;
>> +		return 1;
>> +	}
>> +
>> +	return 0;
>> +
>> +single_poke:
>> +	if (ip == (unsigned long) bp_int3_tpv->addr) {
>> +		regs->ip = (unsigned long) bp_int3_tpv->handler;
>> +		return 1;
>> +	}
>>  
>> +	return 0;
>>  }
>> +NOKPROBE_SYMBOL(poke_int3_handler);
> 
> Ah, this will be covered by a series that I'm currently pinging Ingo about.
> 
> https://lkml.org/lkml/2019/1/11/1480
> 
> 
>>  
>>  static void text_poke_bp_set_handler(void *addr, void *handler,
>>  				     unsigned char int3)
>>  {
>> -	bp_int3_handler = handler;
>> -	bp_int3_addr = (u8 *)addr + sizeof(int3);
>>  	text_poke(addr, &int3, sizeof(int3));
>>  }
>>  
>> @@ -790,32 +828,14 @@ static void patch_first_byte(void *addr, const void *opcode, unsigned char int3)
>>  	text_poke(addr, opcode, sizeof(int3));
>>  }
>>  
>> -/**
>> - * text_poke_bp() -- update instructions on live kernel on SMP
>> - * @addr:	address to patch
>> - * @opcode:	opcode of new instruction
>> - * @len:	length to copy
>> - * @handler:	address to jump to when the temporary breakpoint is hit
>> - *
>> - * Modify multi-byte instruction by using int3 breakpoint on SMP.
>> - * We completely avoid stop_machine() here, and achieve the
>> - * synchronization using int3 breakpoint.
>> - *
>> - * The way it is done:
>> - *	- add a int3 trap to the address that will be patched
>> - *	- sync cores
>> - *	- update all but the first byte of the patched range
>> - *	- sync cores
>> - *	- replace the first byte (int3) by the first byte of
>> - *	  replacing opcode
>> - *	- sync cores
>> - */
>> -void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>> +void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
>>  {
>> +	unsigned int i;
>>  	unsigned char int3 = 0xcc;
>> +	int patched_all_but_first = 0;
>>  
>> -	lockdep_assert_held(&text_mutex);
>> -
>> +	bp_int3_tpv = tp;
>> +	bp_int3_tpv_nr = nr_entries;
>>  	bp_patching_in_progress = true;
>>  	/*
>>  	 * Corresponding read barrier in int3 notifier for making sure the
>> @@ -823,12 +843,20 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>>  	 */
>>  	smp_wmb();
>>  
>> -	text_poke_bp_set_handler(addr, handler, int3);
>> +	for (i = 0; i < nr_entries; i++)
>> +		text_poke_bp_set_handler(tp[i].addr, tp[i].handler, int3);
>>  
>>  	on_each_cpu(do_sync_core, NULL, 1);
>>  
>> -	if (len - sizeof(int3) > 0) {
>> -		patch_all_but_first_byte(addr, opcode, len, int3);
>> +	for (i = 0; i < nr_entries; i++) {
>> +		if (tp[i].len - sizeof(int3) > 0) {
>> +			patch_all_but_first_byte(tp[i].addr, tp[i].opcode,
>> +						 tp[i].len, int3);
>> +			patched_all_but_first++;
>> +		}
>> +	}
>> +
>> +	if (patched_all_but_first) {
>>  		/*
>>  		 * According to Intel, this core syncing is very likely
>>  		 * not necessary and we'd be safe even without it. But
>> @@ -837,15 +865,35 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>>  		on_each_cpu(do_sync_core, NULL, 1);
>>  	}
>>  
>> -	patch_first_byte(addr, opcode, int3);
>> +	for (i = 0; i < nr_entries; i++)
>> +		patch_first_byte(tp[i].addr, tp[i].opcode, int3);
>>  
>>  	on_each_cpu(do_sync_core, NULL, 1);
>>  	/*
>>  	 * sync_core() implies an smp_mb() and orders this store against
>>  	 * the writing of the new instruction.
>>  	 */
>> +	bp_int3_tpv_nr = 0;
>> +	bp_int3_tpv = NULL;
>>  	bp_patching_in_progress = false;
>> +}
>> + XXX: paste the old comment here... I forgot.
>> +void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
>> +{
>> +	struct text_to_poke tp = {
>> +		.handler = handler,
>> +		.addr = addr,
>> +		.len = len
> 
> Even for the last field assignment, you'd better add a "," here.

ack!

Doing the changes, testing and submitting again.
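
Just so we are on the same page, the wrapper will probably end up roughly like
this in v4 (a sketch only; the kernel-doc block is the one removed from the
old text_poke_bp(), pasted back in place of the XXX note):

/**
 * text_poke_bp() -- update instructions on live kernel on SMP
 * @addr:	address to patch
 * @opcode:	opcode of new instruction
 * @len:	length to copy
 * @handler:	address to jump to when the temporary breakpoint is hit
 *
 * Modify multi-byte instruction by using int3 breakpoint on SMP.
 * We completely avoid stop_machine() here, and achieve the
 * synchronization using int3 breakpoint.
 *
 * The way it is done:
 *	- add a int3 trap to the address that will be patched
 *	- sync cores
 *	- update all but the first byte of the patched range
 *	- sync cores
 *	- replace the first byte (int3) by the first byte of
 *	  replacing opcode
 *	- sync cores
 */
void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
{
	struct text_to_poke tp = {
		.handler = handler,
		.addr = addr,
		.len = len,
	};

	if (len > POKE_MAX_OPCODE_SIZE) {
		WARN_ONCE(1, "len is larger than %d\n", POKE_MAX_OPCODE_SIZE);
		return NULL;
	}

	memcpy((void *)tp.opcode, opcode, len);

	text_poke_bp_batch(&tp, 1);

	return addr;
}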

Thank you very much for your comments.

-- Daniel
> 
>> +	};
>> +
>> +	if (len > POKE_MAX_OPCODE_SIZE) {
>> +		WARN_ONCE(1, "len is larger than %d\n", POKE_MAX_OPCODE_SIZE);
>> +		return NULL;
>> +	}
>> +
>> +	memcpy((void *)tp.opcode, opcode, len);
>> +
>> +	text_poke_bp_batch(&tp, 1);
>>  
>>  	return addr;
>>  }
>> -
> 
>> diff --git a/lib/bsearch.c b/lib/bsearch.c
>> index 18b445b010c3..82512fe7b33c 100644
>> --- a/lib/bsearch.c
>> +++ b/lib/bsearch.c
>> @@ -11,6 +11,7 @@
>>  
>>  #include <linux/export.h>
>>  #include <linux/bsearch.h>
>> +#include <linux/kprobes.h>
>>  
>>  /*
>>   * bsearch - binary search an array of elements
>> @@ -53,3 +54,4 @@ void *bsearch(const void *key, const void *base, size_t num, size_t size,
>>  	return NULL;
>>  }
>>  EXPORT_SYMBOL(bsearch);
>> +NOKPROBE_SYMBOL(bsearch);
> 
> Actually, this part has already been pointed out by Andrea Righi, since ftrace
> is using bsearch; see below.
> 
> https://lkml.org/lkml/2019/1/12/70
> 
> It depends on which patch series is merged first, but I would like to
> separate the NOKPROBE_SYMBOL() patch, since it fixes (or prevents) a bug.
> 
> Anyway, this looks good to me.
> 
> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
> 
> Thank you,
> 
>> -- 
>>
>> If so, I will send a v4 with this idea.
>>
>> Thanks!
>>
>> -- Daniel
> 
> 



* [tip:x86/alternatives] jump_label: Add for_each_label_entry helper
  2018-12-21 10:27 ` [PATCH V3 1/9] jump_label: Add for_each_label_entry helper Daniel Bristot de Oliveira
@ 2019-04-19 18:36   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: brgerst, jkosina, mingo, jbaron, gregkh, hpa, bristot, peterz,
	jolsa, mtosatti, rostedt, alexander.shishkin, dvlasenk, mhiramat,
	williams, jpoimboe, torvalds, bp, tglx, acme, crecklin,
	linux-kernel, swood, luto

Commit-ID:  b1b944f7230f4101b5a293d5ac3b7a9627ed5e6f
Gitweb:     https://git.kernel.org/tip/b1b944f7230f4101b5a293d5ac3b7a9627ed5e6f
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:26 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:34 +0200

jump_label: Add for_each_label_entry helper

Add a helper macro to make jump entry iteration code more readable.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/206640539a01e31b6f4ef4ef74962b1394e2f930.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/jump_label.h | 3 +++
 kernel/jump_label.c        | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 3e113a1fa0f1..7e91af98bbb1 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -227,6 +227,9 @@ extern void static_key_disable(struct static_key *key);
 extern void static_key_enable_cpuslocked(struct static_key *key);
 extern void static_key_disable_cpuslocked(struct static_key *key);
 
+#define for_each_label_entry(key, entry, stop)				  \
+	for (; (entry < stop) && (jump_entry_key(entry) == key); entry++)
+
 /*
  * We should be using ATOMIC_INIT() for initializing .enabled, but
  * the inclusion of atomic.h is problematic for inclusion of jump_label.h
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index bad96b476eb6..288d630da22d 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -379,7 +379,7 @@ static void __jump_label_update(struct static_key *key,
 				struct jump_entry *stop,
 				bool init)
 {
-	for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
+	for_each_label_entry(key, entry, stop) {
 		/*
 		 * An entry->code of 0 indicates an entry which has been
 		 * disabled because it was in an init text area.


* [tip:x86/alternatives] jump_label: Add the jump_label_can_update_check() helper
  2018-12-21 10:27 ` [PATCH V3 2/9] jump_label: Add the jump_label_can_update_check() helper Daniel Bristot de Oliveira
@ 2019-04-19 18:37   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jkosina, linux-kernel, jbaron, mhiramat, dvlasenk, bristot,
	mingo, hpa, brgerst, swood, peterz, acme, luto,
	alexander.shishkin, jpoimboe, crecklin, tglx, jolsa, rostedt,
	torvalds, williams, gregkh, mtosatti, bp

Commit-ID:  41bef31d0abe29b5888b33b526cacc8a30795318
Gitweb:     https://git.kernel.org/tip/41bef31d0abe29b5888b33b526cacc8a30795318
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:27 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:34 +0200

jump_label: Add the jump_label_can_update_check() helper

Move the check of whether a jump_entry is valid to a specific function. No
functional change.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1e758cd3b2206be96091a57a17b26585e94792b6.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/jump_label.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 288d630da22d..456c0d7cbb5b 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -374,22 +374,32 @@ static enum jump_label_type jump_label_type(struct jump_entry *entry)
 	return enabled ^ branch;
 }
 
+bool jump_label_can_update_check(struct jump_entry *entry, bool init)
+{
+	/*
+	 * An entry->code of 0 indicates an entry which has been
+	 * disabled because it was in an init text area.
+	 */
+	if (init || !jump_entry_is_init(entry)) {
+		if (!kernel_text_address(jump_entry_code(entry))) {
+			WARN_ONCE(1, "can't patch jump_label at %pS",
+				  (void *)jump_entry_code(entry));
+			return 0;
+		}
+		return 1;
+	}
+	return 0;
+}
+
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop,
 				bool init)
 {
 	for_each_label_entry(key, entry, stop) {
-		/*
-		 * An entry->code of 0 indicates an entry which has been
-		 * disabled because it was in an init text area.
-		 */
-		if (init || !jump_entry_is_init(entry)) {
-			if (kernel_text_address(jump_entry_code(entry)))
-				arch_jump_label_transform(entry, jump_label_type(entry));
-			else
-				WARN_ONCE(1, "can't patch jump_label at %pS",
-					  (void *)jump_entry_code(entry));
+		if (jump_label_can_update_check(entry, init)) {
+			arch_jump_label_transform(entry,
+						  jump_label_type(entry));
 		}
 	}
 }


* [tip:x86/alternatives] x86/jump_label: Move checking code away from __jump_label_transform()
  2018-12-21 10:27 ` [PATCH V3 3/9] x86/jump_label: Move checking code away from __jump_label_transform() Daniel Bristot de Oliveira
@ 2019-04-19 18:38   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:38 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, jpoimboe, crecklin, peterz, gregkh, jkosina, mhiramat,
	jolsa, bristot, williams, jbaron, brgerst, luto, acme, swood,
	rostedt, mtosatti, dvlasenk, alexander.shishkin, mingo, torvalds,
	hpa, linux-kernel, bp

Commit-ID:  ddbefdc167f3d492036e6b0babd11a350c3bd3e3
Gitweb:     https://git.kernel.org/tip/ddbefdc167f3d492036e6b0babd11a350c3bd3e3
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:28 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:34 +0200

x86/jump_label: Move checking code away from __jump_label_transform()

Move the check of the current code, before updating an entry, to specialized
functions. No changes in the method, only code relocation.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/296d5a75c4aee4a957ee7923127c0604903f9daa.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/jump_label.c | 60 +++++++++++++++++++++++++++++++-------------
 1 file changed, 42 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index f99bd26bd3f1..e443c43478eb 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -35,16 +35,53 @@ static void bug_at(unsigned char *ip, int line)
 	BUG();
 }
 
+static inline void __jump_label_trans_check_enable(struct jump_entry *entry,
+						   enum jump_label_type type,
+						   const unsigned char *ideal_nop,
+						   int init)
+{
+	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
+	const void *expect;
+	int line;
+
+	if (init) {
+		expect = default_nop; line = __LINE__;
+	} else {
+		expect = ideal_nop; line = __LINE__;
+	}
+
+	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
+		bug_at((void *)jump_entry_code(entry), line);
+}
+
+static inline void __jump_label_trans_check_disable(struct jump_entry *entry,
+						    enum jump_label_type type,
+						    union jump_code_union *jmp,
+						    int init)
+{
+	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
+	const void *expect;
+	int line;
+
+	if (init) {
+		expect = default_nop; line = __LINE__;
+	} else {
+		expect = jmp->code; line = __LINE__;
+	}
+
+	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
+		bug_at((void *)jump_entry_code(entry), line);
+}
+
+
 static void __ref __jump_label_transform(struct jump_entry *entry,
 					 enum jump_label_type type,
 					 void *(*poker)(void *, const void *, size_t),
 					 int init)
 {
 	union jump_code_union jmp;
-	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
 	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-	const void *expect, *code;
-	int line;
+	const void *code;
 
 	jmp.jump = 0xe9;
 	jmp.offset = jump_entry_target(entry) -
@@ -54,26 +91,13 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 		poker = text_poke_early;
 
 	if (type == JUMP_LABEL_JMP) {
-		if (init) {
-			expect = default_nop; line = __LINE__;
-		} else {
-			expect = ideal_nop; line = __LINE__;
-		}
-
+		__jump_label_trans_check_enable(entry, type, ideal_nop, init);
 		code = &jmp.code;
 	} else {
-		if (init) {
-			expect = default_nop; line = __LINE__;
-		} else {
-			expect = &jmp.code; line = __LINE__;
-		}
-
+		__jump_label_trans_check_disable(entry, type, &jmp, init);
 		code = ideal_nop;
 	}
 
-	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
-		bug_at((void *)jump_entry_code(entry), line);
-
 	/*
 	 * Make text_poke_bp() a default fallback poker.
 	 *


* [tip:x86/alternatives] x86/jump_label: Add __jump_label_set_jump_code() helper
  2018-12-21 10:27 ` [PATCH V3 4/9] x86/jump_label: Add __jump_label_set_jump_code() helper Daniel Bristot de Oliveira
@ 2019-04-19 18:38   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:38 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, mingo, jolsa, dvlasenk, tglx, rostedt, gregkh,
	linux-kernel, jbaron, crecklin, bp, jkosina, bristot, swood,
	acme, mtosatti, brgerst, alexander.shishkin, hpa, peterz,
	williams, luto, jpoimboe, mhiramat

Commit-ID:  369670e583390ce7324ba3db988de09a7fceca93
Gitweb:     https://git.kernel.org/tip/369670e583390ce7324ba3db988de09a7fceca93
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:29 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:34 +0200

x86/jump_label: Add __jump_label_set_jump_code() helper

Move the definition of the code to be written from
__jump_label_transform() to a specialized function. No change in the
method, code relocation only.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/eb97675f0d139aa6f78874db3abc81fcdba7a80f.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/jump_label.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e443c43478eb..2ef687db5a87 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -73,30 +73,36 @@ static inline void __jump_label_trans_check_disable(struct jump_entry *entry,
 		bug_at((void *)jump_entry_code(entry), line);
 }
 
+static void __jump_label_set_jump_code(struct jump_entry *entry,
+				       enum jump_label_type type,
+				       union jump_code_union *code,
+				       int init)
+{
+	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
+
+	code->jump = 0xe9;
+	code->offset = jump_entry_target(entry) -
+		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+
+	if (type == JUMP_LABEL_JMP) {
+		__jump_label_trans_check_enable(entry, type, ideal_nop, init);
+	} else {
+		__jump_label_trans_check_disable(entry, type, code, init);
+		memcpy(code, ideal_nop, JUMP_LABEL_NOP_SIZE);
+	}
+}
 
 static void __ref __jump_label_transform(struct jump_entry *entry,
 					 enum jump_label_type type,
 					 void *(*poker)(void *, const void *, size_t),
 					 int init)
 {
-	union jump_code_union jmp;
-	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-	const void *code;
-
-	jmp.jump = 0xe9;
-	jmp.offset = jump_entry_target(entry) -
-		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+	union jump_code_union code;
 
 	if (early_boot_irqs_disabled)
 		poker = text_poke_early;
 
-	if (type == JUMP_LABEL_JMP) {
-		__jump_label_trans_check_enable(entry, type, ideal_nop, init);
-		code = &jmp.code;
-	} else {
-		__jump_label_trans_check_disable(entry, type, &jmp, init);
-		code = ideal_nop;
-	}
+	__jump_label_set_jump_code(entry, type, &code, init);
 
 	/*
 	 * Make text_poke_bp() a default fallback poker.
@@ -107,12 +113,12 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 	 *
 	 */
 	if (poker) {
-		(*poker)((void *)jump_entry_code(entry), code,
+		(*poker)((void *)jump_entry_code(entry), &code,
 			 JUMP_LABEL_NOP_SIZE);
 		return;
 	}
 
-	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
+	text_poke_bp((void *)jump_entry_code(entry), &code, JUMP_LABEL_NOP_SIZE,
 		     (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 }
 


* [tip:x86/alternatives] x86/alternative: Split text_poke_bp() into three steps
  2018-12-21 10:27 ` [PATCH V3 5/9] x86/alternative: Split text_poke_bp() into three steps Daniel Bristot de Oliveira
@ 2019-04-19 18:39   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, jolsa, torvalds, linux-kernel, bp, jpoimboe, crecklin,
	williams, acme, bristot, alexander.shishkin, rostedt, mingo,
	mhiramat, tglx, jkosina, mtosatti, peterz, brgerst, swood,
	jbaron, luto, gregkh, dvlasenk

Commit-ID:  6d2b0054d9a448d384264e003474219497e70bb9
Gitweb:     https://git.kernel.org/tip/6d2b0054d9a448d384264e003474219497e70bb9
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:30 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:34 +0200

x86/alternative: Split text_poke_bp() into three steps

text_poke_bp() updates instructions on live kernel on SMP in three steps:
 1) add an int3 trap to the address that will be patched
 2) update all but the first byte of the patched range
 3) replace the first byte (int3) by the first byte of the replacing opcode

This patch creates one function for each of these steps.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/67c326c50f8efa728d3040d3563f258949ff32c4.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/alternative.c | 38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9a79c7808f9c..7fce844017f1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -768,6 +768,29 @@ int poke_int3_handler(struct pt_regs *regs)
 }
 NOKPROBE_SYMBOL(poke_int3_handler);
 
+static void text_poke_bp_set_handler(void *addr, void *handler,
+				     unsigned char int3)
+{
+	bp_int3_handler = handler;
+	bp_int3_addr = (u8 *)addr + sizeof(int3);
+	text_poke(addr, &int3, sizeof(int3));
+}
+
+static void patch_all_but_first_byte(void *addr, const void *opcode,
+				     size_t len, unsigned char int3)
+{
+	/* patch all but the first byte */
+	text_poke((char *)addr + sizeof(int3),
+		  (const char *) opcode + sizeof(int3),
+		  len - sizeof(int3));
+}
+
+static void patch_first_byte(void *addr, const void *opcode, unsigned char int3)
+{
+	/* patch the first byte */
+	text_poke(addr, opcode, sizeof(int3));
+}
+
 /**
  * text_poke_bp() -- update instructions on live kernel on SMP
  * @addr:	address to patch
@@ -792,27 +815,21 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 {
 	unsigned char int3 = 0xcc;
 
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
-	bp_patching_in_progress = true;
-
 	lockdep_assert_held(&text_mutex);
 
+	bp_patching_in_progress = true;
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
 	 * in_progress and handler are correctly ordered wrt. patching.
 	 */
 	smp_wmb();
 
-	text_poke(addr, &int3, sizeof(int3));
+	text_poke_bp_set_handler(addr, handler, int3);
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
 	if (len - sizeof(int3) > 0) {
-		/* patch all but the first byte */
-		text_poke((char *)addr + sizeof(int3),
-			  (const char *) opcode + sizeof(int3),
-			  len - sizeof(int3));
+		patch_all_but_first_byte(addr, opcode, len, int3);
 		/*
 		 * According to Intel, this core syncing is very likely
 		 * not necessary and we'd be safe even without it. But
@@ -821,8 +838,7 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 		on_each_cpu(do_sync_core, NULL, 1);
 	}
 
-	/* patch the first byte */
-	text_poke(addr, opcode, sizeof(int3));
+	patch_first_byte(addr, opcode, int3);
 
 	on_each_cpu(do_sync_core, NULL, 1);
 	/*


* [tip:x86/alternatives] jump_label: Sort entries of the same key by the code
  2018-12-21 10:27 ` [PATCH V3 6/9] jump_label: Sort entries of the same key by the code Daniel Bristot de Oliveira
@ 2019-04-19 18:40   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: jkosina, acme, crecklin, bristot, jolsa, tglx, torvalds, swood,
	jbaron, hpa, bp, dvlasenk, rostedt, luto, mhiramat,
	alexander.shishkin, gregkh, brgerst, linux-kernel, mingo,
	jpoimboe, williams, peterz, mtosatti

Commit-ID:  1ca3398278341e1db3568f2d9c015ee4f5ad0b56
Gitweb:     https://git.kernel.org/tip/1ca3398278341e1db3568f2d9c015ee4f5ad0b56
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:31 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:34 +0200

jump_label: Sort entries of the same key by the code

In the batching mode, entries with the same key should also be sorted by their
code address, enabling a bsearch() by code address when updating a key.
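
For illustration only (the keys and code addresses below are hypothetical), the
resulting ordering of the jump entry table for two keys looks like this:

-- Example ordering (hypothetical) ---
key A, code 0xffffffff81000100
key A, code 0xffffffff81000480	<- same key: entries sorted by code address
key A, code 0xffffffff81002030
key B, code 0xffffffff81000200	<- next key: the key ordering comes first
key B, code 0xffffffff81003100
-- Example ordering ---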

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/7006ec86c40e5ec5d546dcd76ccc42d46079a963.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/jump_label.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 456c0d7cbb5b..53b7c85c0b09 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -36,12 +36,28 @@ static int jump_label_cmp(const void *a, const void *b)
 	const struct jump_entry *jea = a;
 	const struct jump_entry *jeb = b;
 
+	/*
+	 * Entries are sorted by key.
+	 */
 	if (jump_entry_key(jea) < jump_entry_key(jeb))
 		return -1;
 
 	if (jump_entry_key(jea) > jump_entry_key(jeb))
 		return 1;
 
+#ifdef HAVE_JUMP_LABEL_BATCH
+	/*
+	 * In the batching mode, entries should also be sorted by the code
+	 * inside the already sorted list of entries, enabling a bsearch in
+	 * the vector.
+	 */
+	if (jump_entry_code(jea) < jump_entry_code(jeb))
+		return -1;
+
+	if (jump_entry_code(jea) > jump_entry_code(jeb))
+		return 1;
+#endif
+
 	return 0;
 }
 


* [tip:x86/alternatives] x86/alternative: Batch of patch operations
  2018-12-21 10:27 ` [PATCH V3 7/9] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
  2019-01-23  5:15   ` Masami Hiramatsu
@ 2019-04-19 18:41   ` tip-bot for Daniel Bristot de Oliveira
  1 sibling, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, peterz, jkosina, torvalds, brgerst, mingo, mhiramat,
	jpoimboe, rostedt, mtosatti, jbaron, luto, swood, gregkh,
	williams, bp, dvlasenk, linux-kernel, bristot, hpa, jolsa,
	alexander.shishkin, crecklin, acme

Commit-ID:  76ec759ad71c1fa0c4b327367e82b2650109a22f
Gitweb:     https://git.kernel.org/tip/76ec759ad71c1fa0c4b327367e82b2650109a22f
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:32 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:35 +0200

x86/alternative: Batch of patch operations

Currently, patching an address is done in three steps:

-- Pseudo-code #1 - Current implementation ---
        1) add an int3 trap to the address that will be patched
            sync cores (send IPI to all other CPUs)
        2) update all but the first byte of the patched range
            sync cores (send IPI to all other CPUs)
        3) replace the first byte (int3) by the first byte of replacing opcode
            sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

When a static key has more than one entry, these steps are called once for
each entry. The number of IPIs then is linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n).

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
which the thread that is enabling the key is _not_ running.

With this patch, rather than receiving a single patch to be processed, a vector
of patches is passed, enabling the rewrite of the pseudo-code #1 in this
way:

-- Pseudo-code #2 - This patch  ---
1)  for each patch in the vector:
        add an int3 trap to the address that will be patched

    sync cores (send IPI to all other CPUs)

2)  for each patch in the vector:
        update all but the first byte of the patched range

    sync cores (send IPI to all other CPUs)

3)  for each patch in the vector:
        replace the first byte (int3) by the first byte of replacing opcode

    sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch  ---

Doing the update in this way, the number of IPIs becomes O(3) with regard
to the number of entries of a key, which is O(1).

The batch mode is done with the function text_poke_bp_batch(), which receives
two arguments: a vector of "struct text_to_poke", and the number of entries
in the vector.

The vector must be sorted by the addr field of the text_to_poke structure,
enabling the binary search of a handler in the poke_int3_handler function
(a fast path).
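
For reference, a minimal usage sketch of how a caller could fill a vector that
is already sorted by address and apply it in a single batch. This is not part
of the patch: the jump_two_sites() helper, its arguments and the placeholder
opcodes are hypothetical; the 5-byte length and the handler pointing right
after the patched instruction mirror the jump label usage later in the series.

-- Usage sketch (hypothetical) ---
#include <linux/memory.h>		/* text_mutex */
#include <linux/mutex.h>
#include <linux/string.h>		/* memcpy() */
#include <asm/text-patching.h>		/* struct text_to_poke, text_poke_bp_batch() */

static struct text_to_poke tpv[2];

/* addr_lo must be lower than addr_hi: the vector is sorted by ->addr */
static void jump_two_sites(void *addr_lo, const char *opc_lo,
			   void *addr_hi, const char *opc_hi)
{
	tpv[0].addr	= addr_lo;
	tpv[0].handler	= (char *)addr_lo + 5;	/* resume after the patched insn */
	tpv[0].len	= 5;
	memcpy((void *)tpv[0].opcode, opc_lo, 5);

	tpv[1].addr	= addr_hi;
	tpv[1].handler	= (char *)addr_hi + 5;
	tpv[1].len	= 5;
	memcpy((void *)tpv[1].opcode, opc_hi, 5);

	mutex_lock(&text_mutex);
	text_poke_bp_batch(tpv, 2);	/* both sites patched with 3 rounds of IPIs */
	mutex_unlock(&text_mutex);
}
-- Usage sketch ---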

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e3057f0d0bddd625de6d4e7c631faf734e28628b.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/text-patching.h |  15 +++++
 arch/x86/kernel/alternative.c        | 108 +++++++++++++++++++++++++++++++++--
 2 files changed, 117 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..42ea7846df33 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_end	NULL
 #endif
 
+/*
+ * Currently, the max observed size in the kernel code is
+ * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
+ * Raise it if needed.
+ */
+#define POKE_MAX_OPCODE_SIZE	5
+
+struct text_to_poke {
+	void *handler;
+	void *addr;
+	size_t len;
+	const char opcode[POKE_MAX_OPCODE_SIZE];
+};
+
 extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
@@ -37,6 +51,7 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries);
 extern int after_bootmem;
 
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7fce844017f1..0048fd953596 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -22,6 +22,7 @@
 #include <asm/tlbflush.h>
 #include <asm/io.h>
 #include <asm/fixmap.h>
+#include <linux/bsearch.h>
 
 int __read_mostly alternatives_patched;
 
@@ -739,10 +740,32 @@ static void do_sync_core(void *info)
 }
 
 static bool bp_patching_in_progress;
+/*
+ * Single poke.
+ */
 static void *bp_int3_handler, *bp_int3_addr;
+/*
+ * Batching poke.
+ */
+static struct text_to_poke *bp_int3_tpv;
+static unsigned int bp_int3_tpv_nr;
+
+static int text_bp_batch_bsearch(const void *key, const void *elt)
+{
+	struct text_to_poke *tp = (struct text_to_poke *) elt;
+
+	if (key < tp->addr)
+		return -1;
+	if (key > tp->addr)
+		return 1;
+	return 0;
+}
 
 int poke_int3_handler(struct pt_regs *regs)
 {
+	void *ip;
+	struct text_to_poke *tp;
+
 	/*
 	 * Having observed our INT3 instruction, we now must observe
 	 * bp_patching_in_progress.
@@ -758,21 +781,41 @@ int poke_int3_handler(struct pt_regs *regs)
 	if (likely(!bp_patching_in_progress))
 		return 0;
 
-	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+	if (user_mode(regs))
 		return 0;
 
-	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	/*
+	 * Single poke first.
+	 */
+	if (bp_int3_addr) {
+		if (regs->ip == (unsigned long) bp_int3_addr) {
+			regs->ip = (unsigned long) bp_int3_handler;
+			return 1;
+		}
+		return 0;
+	}
 
-	return 1;
+	/*
+	 * Batch mode.
+	 */
+	if (bp_int3_tpv_nr) {
+		ip = (void *) regs->ip - sizeof(unsigned char);
+		tp = bsearch(ip, bp_int3_tpv, bp_int3_tpv_nr,
+			     sizeof(struct text_to_poke),
+			     text_bp_batch_bsearch);
+		if (tp) {
+			/* set up the specified breakpoint handler */
+			regs->ip = (unsigned long) tp->handler;
+			return 1;
+		}
+	}
+	return 0;
 }
 NOKPROBE_SYMBOL(poke_int3_handler);
 
 static void text_poke_bp_set_handler(void *addr, void *handler,
 				     unsigned char int3)
 {
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
 	text_poke(addr, &int3, sizeof(int3));
 }
 
@@ -817,6 +860,9 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 
 	lockdep_assert_held(&text_mutex);
 
+	bp_int3_handler = handler;
+	bp_int3_addr = (u8 *)addr + sizeof(int3);
+
 	bp_patching_in_progress = true;
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
@@ -847,6 +893,56 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 */
 	bp_patching_in_progress = false;
 
+	bp_int3_handler = bp_int3_addr = 0;
 	return addr;
 }
 
+void text_poke_bp_batch(struct text_to_poke *tp, unsigned int nr_entries)
+{
+	unsigned int i;
+	unsigned char int3 = 0xcc;
+	int patched_all_but_first = 0;
+
+	bp_int3_tpv = tp;
+	bp_int3_tpv_nr = nr_entries;
+	bp_patching_in_progress = true;
+	/*
+	 * Corresponding read barrier in int3 notifier for making sure the
+	 * in_progress and handler are correctly ordered wrt. patching.
+	 */
+	smp_wmb();
+
+	for (i = 0; i < nr_entries; i++)
+		text_poke_bp_set_handler(tp[i].addr, tp[i].handler, int3);
+
+	on_each_cpu(do_sync_core, NULL, 1);
+
+	for (i = 0; i < nr_entries; i++) {
+		if (tp[i].len - sizeof(int3) > 0) {
+			patch_all_but_first_byte(tp[i].addr, tp[i].opcode,
+						 tp[i].len, int3);
+			patched_all_but_first++;
+		}
+	}
+
+	if (patched_all_but_first) {
+		/*
+		 * According to Intel, this core syncing is very likely
+		 * not necessary and we'd be safe even without it. But
+		 * better safe than sorry (plus there's not only Intel).
+		 */
+		on_each_cpu(do_sync_core, NULL, 1);
+	}
+
+	for (i = 0; i < nr_entries; i++)
+		patch_first_byte(tp[i].addr, tp[i].opcode, int3);
+
+	on_each_cpu(do_sync_core, NULL, 1);
+	/*
+	 * sync_core() implies an smp_mb() and orders this store against
+	 * the writing of the new instruction.
+	 */
+	bp_int3_tpv_nr = 0;
+	bp_int3_tpv = NULL;
+	bp_patching_in_progress = false;
+}


* [tip:x86/alternatives] jump_label: Batch updates if arch supports it
  2018-12-21 10:27 ` [PATCH V3 8/9] jump_label: Batch updates if arch supports it Daniel Bristot de Oliveira
@ 2019-04-19 18:42   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: brgerst, jolsa, mtosatti, alexander.shishkin, jbaron, bp,
	torvalds, jkosina, rostedt, acme, tglx, gregkh, williams, swood,
	peterz, bristot, dvlasenk, hpa, mingo, jpoimboe, luto,
	linux-kernel, mhiramat, crecklin

Commit-ID:  9f4a3121e69d4e93cde30d1a274bd877f58e6b0a
Gitweb:     https://git.kernel.org/tip/9f4a3121e69d4e93cde30d1a274bd877f58e6b0a
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:33 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:35 +0200

jump_label: Batch updates if arch supports it

If the architecture supports the batching of jump label updates, use it!

An easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ----------------------------

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    sleep 2
    sysctl -w kernel.sched_schedstats=0
    sleep 2
done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
# watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see +- 168 IPIs each 2 seconds,
while with this patch the number of IPIs goes to 3 each 2 seconds.

Regarding the performance impact of this patch set, I made two measurements:

    The time to update a key (the task that is causing the change)
    The time to run the int3 handler (the side effect on a thread that
                                      hits the code being changed)

The schedstats static key was chosen as the key to be switched on and off.
The reason is that it is used in more than 56 places, in a hot path. The
change in the schedstats static key will be done with the following command:

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    usleep 500000
    sysctl -w kernel.sched_schedstats=0
    usleep 500000
done

In this way, the key will be updated twice per second. To force hits of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two-node/24-CPU box with an Intel Xeon processor
@2.27GHz.

Regarding the update part, on average, the regular kernel takes 57 ms to update
the schedstats key, while the kernel with the batch updates takes just 1.4 ms
on average. Although it seems to be too good to be true, it makes sense: the
schedstats key is used in 56 places, so it was expected that it would take
around 56 times longer to update the key with the current implementation, as
the IPIs are the most expensive part of the update.

Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, that seems to be a high
value. But it is not, considering that it is handling 56 updates rather than
one: it takes only about four times longer. This is possible because the patch
uses a binary search in the vector: each hit costs about log2(56) = 5.8
comparisons instead of a single one, so an overhead of this magnitude was
expected.

(voice of tv propaganda) But that is not all! As the int3 handler stays in
place for a shorter period (because the update part finishes sooner), the
number of hits in the int3 handler decreased by 10%.

The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?

Considering that, in this test case, we are saving the handling of 53 IPIs,
each of which takes more than these 135 ns, it seems to be a meager price to
pay. Moreover, the test case was forcing hits of the int3 handler; in practice,
they do not happen that often, while the IPI takes place on all CPUs, whether
they hit the int3 handler or not.

For instance, on an isolated CPU with a process running in user-space
(the nohz_full use-case), the chances of hitting the int3 handler are close to
zero, while there is no way to avoid the IPIs. By bounding the IPIs, we are
improving this scenario a lot.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/b39e5320498f144410d2cf85a8bd9b079302c10a.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/jump_label.h |  3 +++
 kernel/jump_label.c        | 29 +++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 7e91af98bbb1..b3dfce98edb7 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -215,6 +215,9 @@ extern void arch_jump_label_transform(struct jump_entry *entry,
 				      enum jump_label_type type);
 extern void arch_jump_label_transform_static(struct jump_entry *entry,
 					     enum jump_label_type type);
+extern int arch_jump_label_transform_queue(struct jump_entry *entry,
+					    enum jump_label_type type);
+extern void arch_jump_label_transform_apply(void);
 extern int jump_label_text_reserved(void *start, void *end);
 extern void static_key_slow_inc(struct static_key *key);
 extern void static_key_slow_dec(struct static_key *key);
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 53b7c85c0b09..944c75a0b09b 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -407,6 +407,7 @@ bool jump_label_can_update_check(struct jump_entry *entry, bool init)
 	return 0;
 }
 
+#ifndef HAVE_JUMP_LABEL_BATCH
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop,
@@ -419,6 +420,34 @@ static void __jump_label_update(struct static_key *key,
 		}
 	}
 }
+#else
+static void __jump_label_update(struct static_key *key,
+				struct jump_entry *entry,
+				struct jump_entry *stop,
+				bool init)
+{
+	for_each_label_entry(key, entry, stop) {
+
+		if (!jump_label_can_update_check(entry, init))
+			continue;
+
+		if (arch_jump_label_transform_queue(entry,
+						    jump_label_type(entry)))
+			continue;
+
+		/*
+		 * Queue overflow: apply the current queue, and then
+		 * queue again. If it is still not possible to queue, BUG!
+		 */
+		arch_jump_label_transform_apply();
+		if (!arch_jump_label_transform_queue(entry,
+						     jump_label_type(entry))) {
+			BUG();
+		}
+	}
+	arch_jump_label_transform_apply();
+}
+#endif
 
 void __init jump_label_init(void)
 {


* [tip:x86/alternatives] x86/jump_label: Batch jump label updates
  2018-12-21 10:27 ` [PATCH V3 9/9] x86/jump_label: Batch jump label updates Daniel Bristot de Oliveira
@ 2019-04-19 18:42   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 27+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-04-19 18:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, jbaron, tglx, brgerst, jpoimboe, dvlasenk, luto, acme, bp,
	bristot, jkosina, swood, mingo, alexander.shishkin, linux-kernel,
	crecklin, torvalds, peterz, jolsa, mtosatti, gregkh, mhiramat,
	rostedt, williams

Commit-ID:  1f30946b1a01baf22df6faf74c0a1e602bb6cac7
Gitweb:     https://git.kernel.org/tip/1f30946b1a01baf22df6faf74c0a1e602bb6cac7
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Fri, 21 Dec 2018 11:27:34 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Apr 2019 19:37:35 +0200

x86/jump_label: Batch jump label updates

Currently, the jump label of a static key is transformed via the arch
specific function:

    void arch_jump_label_transform(struct jump_entry *entry,
                                   enum jump_label_type type)

The new approach (batch mode) uses two arch functions; the first has the
same arguments as arch_jump_label_transform():

    void arch_jump_label_transform_queue(struct jump_entry *entry,
					 enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated. This function returns 1 when the entry is
successfully enqueued. If it returns 0 (for instance, because the queue is
full), the caller must apply the queue and then try to queue again.

This function expects the caller to sort the entries by address before
enqueueing them. This is already done by the arch-independent code, though.

After queuing all jump_entries, the function:

    void arch_jump_label_transform_apply(void)

applies the changes in the queue.
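
The expected caller is the batch-mode __jump_label_update() added by the
previous patch of this series; for reference, a condensed sketch of that
pattern (queue each entry, flush when the queue fills up, apply once at the
end):

-- Caller sketch (mirrors the generic code in kernel/jump_label.c) ---
static void __jump_label_update(struct static_key *key,
				struct jump_entry *entry,
				struct jump_entry *stop,
				bool init)
{
	for_each_label_entry(key, entry, stop) {
		if (!jump_label_can_update_check(entry, init))
			continue;

		/* Queue the entry; on success, move on to the next one. */
		if (arch_jump_label_transform_queue(entry, jump_label_type(entry)))
			continue;

		/* Queue full: apply what is pending, then retry this entry. */
		arch_jump_label_transform_apply();
		if (!arch_jump_label_transform_queue(entry, jump_label_type(entry)))
			BUG();
	}
	/* Apply whatever is still queued. */
	arch_jump_label_transform_apply();
}
-- Caller sketch ---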

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a7ff8be5f195e72d69b8a1689d13d3995056994c.1545228276.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/jump_label.h |  2 +
 arch/x86/kernel/jump_label.c      | 88 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 65191ce8e1cf..06c3cc22a058 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_JUMP_LABEL_H
 #define _ASM_X86_JUMP_LABEL_H
 
+#define HAVE_JUMP_LABEL_BATCH
+
 #define JUMP_LABEL_NOP_SIZE 5
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 2ef687db5a87..3c81cf8f06ca 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -15,6 +15,7 @@
 #include <asm/kprobes.h>
 #include <asm/alternative.h>
 #include <asm/text-patching.h>
+#include <linux/slab.h>
 
 union jump_code_union {
 	char code[JUMP_LABEL_NOP_SIZE];
@@ -130,6 +131,93 @@ void arch_jump_label_transform(struct jump_entry *entry,
 	mutex_unlock(&text_mutex);
 }
 
+struct text_to_poke *entry_vector;
+unsigned int entry_vector_max_elem __read_mostly;
+unsigned int entry_vector_nr_elem;
+
+void arch_jump_label_init(void)
+{
+	entry_vector = (void *) __get_free_page(GFP_KERNEL);
+
+	if (WARN_ON_ONCE(!entry_vector))
+		return;
+
+	entry_vector_max_elem = PAGE_SIZE / sizeof(struct text_to_poke);
+	return;
+}
+
+int arch_jump_label_transform_queue(struct jump_entry *entry,
+				     enum jump_label_type type)
+{
+	void *entry_code;
+	struct text_to_poke *tp;
+
+	/*
+	 * Batch mode disabled before being able to allocate memory:
+	 * Fallback to the non-batching mode.
+	 */
+	if (unlikely(!entry_vector_max_elem)) {
+		if (!slab_is_available() || early_boot_irqs_disabled)
+			goto fallback;
+
+		arch_jump_label_init();
+	}
+
+	/*
+	 * No more space in the vector, tell upper layer to apply
+	 * the queue before continuing.
+	 */
+	if (entry_vector_nr_elem == entry_vector_max_elem)
+		return 0;
+
+	tp = &entry_vector[entry_vector_nr_elem];
+
+	entry_code = (void *)jump_entry_code(entry);
+
+	/*
+	 * The int3 handler will do a bsearch in the queue, so we need entries
+	 * to be sorted. We can survive an unsorted list by rejecting the entry,
+	 * forcing the generic jump_label code to apply the queue. Warn once,
+	 * to raise attention to the unsorted-entry case, which had better not
+	 * happen: in the worst case we will perform the same way as we do
+	 * without batching - just with some more overhead.
+	 */
+	if (entry_vector_nr_elem > 0) {
+		int prev_idx = entry_vector_nr_elem - 1;
+		struct text_to_poke *prev_tp = &entry_vector[prev_idx];
+
+		if (WARN_ON_ONCE(prev_tp->addr > entry_code))
+			return 0;
+	}
+
+	__jump_label_set_jump_code(entry, type,
+				   (union jump_code_union *) &tp->opcode, 0);
+
+	tp->addr = entry_code;
+	tp->handler = entry_code + JUMP_LABEL_NOP_SIZE;
+	tp->len = JUMP_LABEL_NOP_SIZE;
+
+	entry_vector_nr_elem++;
+
+	return 1;
+
+fallback:
+	arch_jump_label_transform(entry, type);
+	return 1;
+}
+
+void arch_jump_label_transform_apply(void)
+{
+	if (early_boot_irqs_disabled || !entry_vector_nr_elem)
+		return;
+
+	mutex_lock(&text_mutex);
+	text_poke_bp_batch(entry_vector, entry_vector_nr_elem);
+	mutex_unlock(&text_mutex);
+
+	entry_vector_nr_elem = 0;
+}
+
 static enum {
 	JL_STATE_START,
 	JL_STATE_NO_UPDATE,


* Re: [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key
  2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (9 preceding siblings ...)
  2019-01-21 12:52 ` [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
@ 2019-04-19 19:01 ` Ingo Molnar
  10 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2019-04-19 19:01 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Masami Hiramatsu,
	Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86


* Daniel Bristot de Oliveira <bristot@redhat.com> wrote:

> Considering that, in this test case, we are saving the handling of 53 IPIs,
> that takes more than these 135 ns, it seems to be a meager price to be paid.
> Moreover, the test case was forcing the hit of the int3, in practice, it
> does not take that often. While the IPI takes place on all CPUs, hitting
> the int3 handler or not!
> 
> For instance, in an isolated CPU with a process running in user-space
> (nohz_full use-case), the chances of hitting the int3 handler is barely zero,
> while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
> this scenario a lot.
> 
> Changes from v2:
>   - Switch the order of patches 8 and 9 (Steven Rostedt)
>   - Fallback in the case of failure in the batch mode (Steven Rostedt)
>   - Fix a pointer in patch 7 (Steven Rostedt)
>   - Adjust the commit log of the patch 1 (Borislav Petkov)
>   - Adjust the commit log of the patch 3 (Jiri Kosina/Thomas Gleixner)
> Changes from v1:
>   - Split the patch in jump-label/x86-jump-label/alternative (Jason Baron)
>   - Use bserach in the int3 handler (Steven Rostedt)
>   - Use a pre-allocated page to store the vector (Jason/Steven)
>   - Do performance tests in the int3 handler (peterz)
>
> Daniel Bristot de Oliveira (9):
>   jump_label: Add for_each_label_entry helper
>   jump_label: Add the jump_label_can_update_check() helper
>   x86/jump_label: Move checking code away from __jump_label_transform()
>   x86/jump_label: Add __jump_label_set_jump_code() helper
>   x86/alternative: Split text_poke_bp() into three steps
>   jump_label: Sort entries of the same key by the code
>   x86/alternative: Batch of patch operations
>   jump_label: Batch updates if arch supports it
>   x86/jump_label: Batch jump label updates
> 
>  arch/x86/include/asm/jump_label.h    |   2 +
>  arch/x86/include/asm/text-patching.h |  15 +++
>  arch/x86/kernel/alternative.c        | 140 ++++++++++++++++++---
>  arch/x86/kernel/jump_label.c         | 174 ++++++++++++++++++++++-----
>  include/linux/jump_label.h           |   6 +
>  kernel/jump_label.c                  |  73 +++++++++--
>  6 files changed, 359 insertions(+), 51 deletions(-)

Note that I've applied these to tip:WIP.x86/alternatives to help move it 
along and help with testing, but before this can go upstream review 
feedback from PeterZ needs to be addressed.

Please send the latest series (v3, v4, etc.) with full patches to make it 
easy to review from scratch to everyone - I'll reshuffle the WIP branch 
accordingly. These are not stable SHA1's yet.

Thanks,

	Ingo



Thread overview: 27+ messages
2018-12-21 10:27 [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 1/9] jump_label: Add for_each_label_entry helper Daniel Bristot de Oliveira
2019-04-19 18:36   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 2/9] jump_label: Add the jump_label_can_update_check() helper Daniel Bristot de Oliveira
2019-04-19 18:37   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 3/9] x86/jump_label: Move checking code away from __jump_label_transform() Daniel Bristot de Oliveira
2019-04-19 18:38   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 4/9] x86/jump_label: Add __jump_label_set_jump_code() helper Daniel Bristot de Oliveira
2019-04-19 18:38   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 5/9] x86/alternative: Split text_poke_bp() into three steps Daniel Bristot de Oliveira
2019-04-19 18:39   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 6/9] jump_label: Sort entries of the same key by the code Daniel Bristot de Oliveira
2019-04-19 18:40   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 7/9] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
2019-01-23  5:15   ` Masami Hiramatsu
2019-01-26 11:52     ` Daniel Bristot de Oliveira
2019-01-28 13:52       ` Masami Hiramatsu
2019-02-01 12:49         ` Daniel Bristot de Oliveira
2019-02-01 14:47           ` Masami Hiramatsu
2019-02-01 15:51             ` Daniel Bristot de Oliveira
2019-04-19 18:41   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 8/9] jump_label: Batch updates if arch supports it Daniel Bristot de Oliveira
2019-04-19 18:42   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2018-12-21 10:27 ` [PATCH V3 9/9] x86/jump_label: Batch jump label updates Daniel Bristot de Oliveira
2019-04-19 18:42   ` [tip:x86/alternatives] " tip-bot for Daniel Bristot de Oliveira
2019-01-21 12:52 ` [PATCH V3 0/9] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
2019-04-19 19:01 ` Ingo Molnar
