* [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key
@ 2019-06-12  9:57 Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 1/6] jump_label: Add a jump_label_can_update() helper Daniel Bristot de Oliveira
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-12  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

While tuning a system with CPUs isolated as much as possible, we've
noticed that isolated CPUs were receiving bursts of 12 IPIs, periodically.
Tracing the functions that emit IPIs, we saw chronyd - an unprivileged
process - generating the IPIs when changing a static key to enable
network timestamping on sockets.

For instance, the trace showed:

# trace-cmd record --func-stack -p function -l smp_call_function_single -e irq_vectors -f 'vector == 251'...
# trace-cmd report...
[...]
         chronyd-858   [000]   433.286406: function:             smp_call_function_single
         chronyd-858   [000]   433.286408: kernel_stack:         <stack trace>
=> smp_call_function_many (ffffffffbc10ab22)
=> smp_call_function (ffffffffbc10abaa)
=> on_each_cpu (ffffffffbc10ac0b)
=> text_poke_bp (ffffffffbc02436a)
=> arch_jump_label_transform (ffffffffbc020ec8)
=> __jump_label_update (ffffffffbc1b300f)
=> jump_label_update (ffffffffbc1b30ed)
=> static_key_slow_inc (ffffffffbc1b334d)
=> net_enable_timestamp (ffffffffbc625c44)
=> sock_enable_timestamp (ffffffffbc6127d5)
=> sock_setsockopt (ffffffffbc612d2f)
=> SyS_setsockopt (ffffffffbc60d5e6)
=> tracesys (ffffffffbc7681c5)
          <idle>-0     [001]   433.286416: call_function_single_entry: vector=251
          <idle>-0     [001]   433.286419: call_function_single_exit: vector=251
  
[... The IPI takes place 12 times]

The static key in this case was netstamp_needed_key. A static key
change from enabled->disabled or disabled->enabled causes the code to be
patched, and this is done in three steps:

-- Pseudo-code #1 - Current implementation ---
For each key to be updated:
	1) add an int3 trap to the address that will be patched
	    sync cores (send IPI to all other CPUs)
	2) update all but the first byte of the patched range
	    sync cores (send IPI to all other CPUs)
	3) replace the first byte (int3) with the first byte of the replacing opcode
	    sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

As the static key netstamp_needed_key has four entries (it is used in
four places in the code) in our kernel, 3 IPIs were generated for each
entry, resulting in 12 IPIs. The number of IPIs is then linear with
regard to the number 'n' of entries of a key: O(n*3), which is O(n).

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) Fedora kernel, resulting in 168 IPIs for each CPU on
which the thread that is enabling the key is _not_ running.

Rather than doing each update individually, it is possible to queue all
updates first, and then apply them all at once, rewriting pseudo-code #1
in this way:

-- Pseudo-code #2 - proposal  ---
1)  for each key in the queue:
        add an int3 trap to the address that will be patched
    sync cores (send IPI to all other CPUs)

2)  for each key in the queue:
        update all but the first byte of the patched range
    sync cores (send IPI to all other CPUs)

3)  for each key in the queue:
        replace the first byte (int3) with the first byte of the replacing opcode
    sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - proposal  ---

Doing the update in this way, the number of IPIs becomes O(3) with regard
to the number of keys, which is O(1).
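
To make the complexity argument concrete, here is a back-of-the-envelope
helper (illustration only, not part of the series), assuming all entries
of the key fit in a single batch:

-------------------------- %< ----------------------------
/* Illustration: IPI rounds (on_each_cpu() calls) needed per key update. */
static inline unsigned int ipi_rounds_current(unsigned int nr_entries)
{
	return 3 * nr_entries;	/* three core-sync rounds per entry */
}

static inline unsigned int ipi_rounds_batch(unsigned int nr_entries)
{
	return 3;		/* three core-sync rounds for the whole queue */
}

/* netstamp_needed_key:  4 entries ->  12 vs 3 rounds */
/* sched_schedstats:    56 entries -> 168 vs 3 rounds */
-------------------------- >% ----------------------------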

Currently, the jump label of a static key is transformed via the arch
specific function:

    void arch_jump_label_transform(struct jump_entry *entry,
                                   enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has the
same arguments as arch_jump_label_transform(), and is the function:

    bool arch_jump_label_transform_queue(struct jump_entry *entry,
                                         enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated.

After queuing all jump_entries, the function:
  
    void arch_jump_label_transform_apply(void)

Applies the changes in the queue.
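
For illustration, a caller is expected to use the pair like this; it is
essentially what the generic jump_label code does in the "jump_label:
Batch updates if arch supports it" patch of this series:

-------------------------- %< ----------------------------
/* Sketch of the expected calling pattern (see kernel/jump_label.c below). */
for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
	if (!jump_label_can_update(entry, init))
		continue;

	if (!arch_jump_label_transform_queue(entry, jump_label_type(entry))) {
		/* Queue full: apply what was queued so far and retry. */
		arch_jump_label_transform_apply();
		arch_jump_label_transform_queue(entry, jump_label_type(entry));
	}
}
arch_jump_label_transform_apply();
-------------------------- >% ----------------------------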

One easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ---------------------------- 
  #!/bin/bash

  while [ true ]; do 
      sysctl -w kernel.sched_schedstats=1
      sleep 2
      sysctl -w kernel.sched_schedstats=0
      sleep 2
  done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ---------------------------- 
  # watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see around 168 IPIs every 2 seconds,
while with the batch mode the number of IPIs goes down to 3 every 2 seconds.

Regarding the performance impact of this patch set, I made two measurements:

    The time to update a key (the task that is causing the change)
    The time to run the int3 handler (the side effect on a thread that
                                      hits the code being changed)

The sched_schedstats static key was chosen as the key to be switched on and
off. The reason is that it is used in more than 56 places, in a hot path.
The change in the schedstats static key will be done with the following
command:

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    usleep 500000
    sysctl -w kernel.sched_schedstats=0
    usleep 500000
done

In this way, the key will be updated twice per second. To force hits of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two-node/24-CPU box with an Intel Xeon processor
@2.27GHz.

Regarding the update part, on average, the regular kernel takes 57 ms to update
the static key, while the kernel with the batch updates takes just 1.4 ms on
average. Although it seems too good to be true, it makes sense: the
sched_schedstats key is used in 56 places, so it was expected that the update
would take around 56 times longer with the current implementation, as the
IPIs are the most expensive part of the update.

Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, it seems to be a high
value. But it is not, considering that it is doing 56 updates rather than one!
It takes only four times longer. This gain is possible because the patch uses a
binary search in the vector: log2(56) = 5.8. So an overhead within four times
was expected.
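
For reference, the lookup done by the int3 handler boils down to a
bsearch() over the vector sorted by address (simplified sketch; the real
code is in the x86/alternative patch of this series):

-------------------------- %< ----------------------------
/* Simplified sketch of the int3 handler lookup. */
static int patch_cmp(const void *key, const void *elt)
{
	const struct text_patch_loc *tp = elt;

	if (key < tp->addr)
		return -1;
	if (key > tp->addr)
		return 1;
	return 0;
}

/* ip: address of the int3 that was hit */
tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries,
	     sizeof(struct text_patch_loc), patch_cmp);
-------------------------- >% ----------------------------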

(voice of TV propaganda) But that is not all! As the int3 breakpoint stays in
place for a shorter period (because the update part takes less time), the
number of hits in the int3 handler decreased by 10%.

The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?

Considering that, in this test case, we are saving the handling of 53 IPIs,
which take more than these 135 ns, it seems to be a meager price to pay.
Moreover, the test case was forcing hits of the int3 handler; in practice, they
do not happen that often, while the IPIs take place on all CPUs, whether they
hit the int3 handler or not!

For instance, on an isolated CPU with a process running in user-space
(nohz_full use-case), the chances of hitting the int3 handler are nearly zero,
while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
this scenario a lot.

Changes from v5:
  - Remove "Add for_each_label_entry helper" patch (Peter Zijlstra &
    Borislav Petkov)
  - Adjust the changelog of "jump_label: Sort entries of the same key by
    the code" (Peter Zijlstra)
  - Optimize poke_int3_handler() (Peter Zijlstra)
  - Optimize __jump_label_update() (Peter Zijlstra)
  - Remove "in_progress" var, use "nr_entries" instead (Peter Zijlstra)
  - Use static allocated memory rather than dynamic (Peter Zijlstra)
  - Remove all the control/fallback required by dynamic allocated memory.
  - Undo the substitution of BUG() to WARN() - Running with inconsistent
    code is a BUG (Peter Zijlstra)
Changes from v4:
  - Renamed jump_label_can_update_check() to jump_label_can_update()
    (Borislav Petkov)
  - Switch jump_label_can_update() return values to bool (Borislav Petkov)
  - Accept the fact that some lines will be > 80 characters (Borislav Petkov)
  - Local variables are now in the reverse Christmas tree order
    (Borislav Petkov)
  - Removed superfluous helpers. (Borislav Petkov)
  - Renamed text_to_poke to text_patch_loc, and handler to detour
    (Borislav Petkov & Steven Rostedt)
  - Re-applied the suggestion of not using BUG() from Steven (Masami Hiramatsu)
  - arch_jump_label_transform_queue() now returns 0 on success and
    -ENOSPC/EINVAL on errors (Masami Hiramatsu)
Changes from v3:
  - Optimizations in the hot path of poke_int3_handler() (Masami Hiramatsu)
  - Reduce code duplication in text_poke_bp()/text_poke_batch()
    (Masami Hiramatsu)
  - Set NOKPROBE_SYMBOL on functions called by poke_int3_handler()
    (Masami Hiramatsu)
Changes from v2:
  - Switch the order of patches 8 and 9 (Steven Rostedt)
  - Fallback in the case of failure in the batch mode (Steven Rostedt)
  - Fix a pointer in patch 7 (Steven Rostedt)
  - Adjust the commit log of the patch 1 (Borislav Petkov)
  - Adjust the commit log of the patch 3 (Jiri Kosina/Thomas Gleixner)
Changes from v1:
  - Split the patch in jump-label/x86-jump-label/alternative (Jason Baron)
  - Use bsearch in the int3 handler (Steven Rostedt)
  - Use a pre-allocated page to store the vector (Jason/Steven)
  - Do performance tests in the int3 handler (peterz)

Daniel Bristot de Oliveira (6):
  jump_label: Add a jump_label_can_update() helper
  x86/jump_label: Add a __jump_label_set_jump_code() helper
  jump_label: Sort entries of the same key by the code
  x86/alternative: Batch of patch operations
  jump_label: Batch updates if arch supports it
  x86/jump_label: Batch jump label updates

 arch/x86/include/asm/jump_label.h    |   2 +
 arch/x86/include/asm/text-patching.h |  15 +++
 arch/x86/kernel/alternative.c        | 154 +++++++++++++++++++++------
 arch/x86/kernel/jump_label.c         | 110 ++++++++++++++++---
 include/linux/jump_label.h           |   3 +
 kernel/jump_label.c                  |  65 +++++++++--
 6 files changed, 290 insertions(+), 59 deletions(-)

-- 
2.20.1



* [PATCH V6 1/6] jump_label: Add a jump_label_can_update() helper
  2019-06-12  9:57 [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
@ 2019-06-12  9:57 ` Daniel Bristot de Oliveira
  2019-06-17 14:13   ` [tip:locking/core] " tip-bot for Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 2/6] x86/jump_label: Add a __jump_label_set_jump_code() helper Daniel Bristot de Oliveira
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-12  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Move the check of whether a jump_entry is valid into a helper function.
No functional change.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/jump_label.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 0bfa10f4410c..468ca421ad9a 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -384,22 +384,30 @@ static enum jump_label_type jump_label_type(struct jump_entry *entry)
 	return enabled ^ branch;
 }
 
+static bool jump_label_can_update(struct jump_entry *entry, bool init)
+{
+	/*
+	 * Cannot update code that was in an init text area.
+	 */
+	if (!init && jump_entry_is_init(entry))
+		return false;
+
+	if (!kernel_text_address(jump_entry_code(entry))) {
+		WARN_ONCE(1, "can't patch jump_label at %pS", (void *)jump_entry_code(entry));
+		return false;
+	}
+
+	return true;
+}
+
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop,
 				bool init)
 {
 	for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
-		/*
-		 * An entry->code of 0 indicates an entry which has been
-		 * disabled because it was in an init text area.
-		 */
-		if (init || !jump_entry_is_init(entry)) {
-			if (kernel_text_address(jump_entry_code(entry)))
-				arch_jump_label_transform(entry, jump_label_type(entry));
-			else
-				WARN_ONCE(1, "can't patch jump_label at %pS",
-					  (void *)jump_entry_code(entry));
+		if (jump_label_can_update(entry, init)) {
+			arch_jump_label_transform(entry, jump_label_type(entry));
 		}
 	}
 }
-- 
2.20.1



* [PATCH V6 2/6] x86/jump_label: Add a __jump_label_set_jump_code() helper
  2019-06-12  9:57 [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 1/6] jump_label: Add a jump_label_can_update() helper Daniel Bristot de Oliveira
@ 2019-06-12  9:57 ` Daniel Bristot de Oliveira
  2019-06-17 14:13   ` [tip:locking/core] " tip-bot for Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 3/6] jump_label: Sort entries of the same key by the code Daniel Bristot de Oliveira
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-12  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Move the definition of the code to be written from
__jump_label_transform() to a specialized function. No functional
change.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/jump_label.c | 41 +++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e631c358f7f4..d3328062b8cf 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -35,19 +35,19 @@ static void bug_at(unsigned char *ip, int line)
 	BUG();
 }
 
-static void __ref __jump_label_transform(struct jump_entry *entry,
-					 enum jump_label_type type,
-					 int init)
+static void __jump_label_set_jump_code(struct jump_entry *entry,
+				       enum jump_label_type type,
+				       union jump_code_union *code,
+				       int init)
 {
-	union jump_code_union jmp;
 	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
 	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-	const void *expect, *code;
+	const void *expect;
 	int line;
 
-	jmp.jump = 0xe9;
-	jmp.offset = jump_entry_target(entry) -
-		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+	code->jump = 0xe9;
+	code->offset = jump_entry_target(entry) -
+		       (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 
 	if (type == JUMP_LABEL_JMP) {
 		if (init) {
@@ -56,19 +56,30 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 			expect = ideal_nop; line = __LINE__;
 		}
 
-		code = &jmp.code;
+		if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
+			bug_at((void *)jump_entry_code(entry), line);
+
 	} else {
 		if (init) {
 			expect = default_nop; line = __LINE__;
 		} else {
-			expect = &jmp.code; line = __LINE__;
+			expect = code->code; line = __LINE__;
 		}
 
-		code = ideal_nop;
+		if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
+			bug_at((void *)jump_entry_code(entry), line);
+
+		memcpy(code, ideal_nop, JUMP_LABEL_NOP_SIZE);
 	}
+}
+
+static void __ref __jump_label_transform(struct jump_entry *entry,
+					 enum jump_label_type type,
+					 int init)
+{
+	union jump_code_union code;
 
-	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
-		bug_at((void *)jump_entry_code(entry), line);
+	__jump_label_set_jump_code(entry, type, &code, init);
 
 	/*
 	 * As long as only a single processor is running and the code is still
@@ -82,12 +93,12 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 	 * always nop being the 'currently valid' instruction
 	 */
 	if (init || system_state == SYSTEM_BOOTING) {
-		text_poke_early((void *)jump_entry_code(entry), code,
+		text_poke_early((void *)jump_entry_code(entry), &code,
 				JUMP_LABEL_NOP_SIZE);
 		return;
 	}
 
-	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
+	text_poke_bp((void *)jump_entry_code(entry), &code, JUMP_LABEL_NOP_SIZE,
 		     (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 }
 
-- 
2.20.1



* [PATCH V6 3/6] jump_label: Sort entries of the same key by the code
  2019-06-12  9:57 [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 1/6] jump_label: Add a jump_label_can_update() helper Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 2/6] x86/jump_label: Add a __jump_label_set_jump_code() helper Daniel Bristot de Oliveira
@ 2019-06-12  9:57 ` Daniel Bristot de Oliveira
  2019-06-17 14:14   ` [tip:locking/core] " tip-bot for Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 4/6] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-12  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

In the batching mode, all the entries of a given key are updated at once.
During the update of a key, a hit in the int3 handler will check if the
hitting code address belongs to one of these keys.

To optimize the search for a given code address in the vector of entries
being updated, a binary search is used. The binary search relies on the
entries of a key being ordered by their code address. Hence the entries
need to be sorted by code too, so sort the entries of a given key by the code.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/jump_label.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 468ca421ad9a..2bd172e1e95f 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -37,12 +37,26 @@ static int jump_label_cmp(const void *a, const void *b)
 	const struct jump_entry *jea = a;
 	const struct jump_entry *jeb = b;
 
+	/*
+	 * Entries are sorted by key.
+	 */
 	if (jump_entry_key(jea) < jump_entry_key(jeb))
 		return -1;
 
 	if (jump_entry_key(jea) > jump_entry_key(jeb))
 		return 1;
 
+	/*
+	 * In the batching mode, entries should also be sorted by the code
+	 * inside the already sorted list of entries, enabling a bsearch in
+	 * the vector.
+	 */
+	if (jump_entry_code(jea) < jump_entry_code(jeb))
+		return -1;
+
+	if (jump_entry_code(jea) > jump_entry_code(jeb))
+		return 1;
+
 	return 0;
 }
 
-- 
2.20.1



* [PATCH V6 4/6] x86/alternative: Batch of patch operations
  2019-06-12  9:57 [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (2 preceding siblings ...)
  2019-06-12  9:57 ` [PATCH V6 3/6] jump_label: Sort entries of the same key by the code Daniel Bristot de Oliveira
@ 2019-06-12  9:57 ` Daniel Bristot de Oliveira
  2019-06-12 14:52   ` Peter Zijlstra
  2019-06-17 14:15   ` [tip:locking/core] " tip-bot for Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 5/6] jump_label: Batch updates if arch supports it Daniel Bristot de Oliveira
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-12  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Currently, the patch of an address is done in three steps:

-- Pseudo-code #1 - Current implementation ---
        1) add an int3 trap to the address that will be patched
            sync cores (send IPI to all other CPUs)
        2) update all but the first byte of the patched range
            sync cores (send IPI to all other CPUs)
        3) replace the first byte (int3) by the first byte of replacing opcode
            sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

When a static key has more than one entry, these steps are called once for
each entry. The number of IPIs then is linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n).

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) Fedora kernel, resulting in 168 IPIs for each CPU on
which the thread that is enabling the key is _not_ running.

With this patch, rather than processing a single patch at a time, a vector
of patches is passed, enabling pseudo-code #1 to be rewritten in this
way:

-- Pseudo-code #2 - This patch  ---
1)  for each patch in the vector:
        add an int3 trap to the address that will be patched

    sync cores (send IPI to all other CPUs)

2)  for each patch in the vector:
        update all but the first byte of the patched range

    sync cores (send IPI to all other CPUs)

3)  for each patch in the vector:
        replace the first byte (int3) with the first byte of the replacing opcode

    sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch  ---

Doing the update in this way, the number of IPIs becomes O(3) with regard
to the number of keys, which is O(1).

The batch mode is done with the function text_poke_bp_batch(), which receives
two arguments: a vector of "struct text_patch_loc", and the number of entries
in the vector.

The vector must be sorted by the addr field of the text_patch_loc structure,
enabling the binary search for a handler in the poke_int3_handler() function
(a fast path).
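
A hypothetical usage sketch follows (addr1/addr2 and the replacement bytes
are made up; the real user is arch_jump_label_transform_apply() in the last
patch of the series). The vector is sorted by addr, filled, and applied
while holding text_mutex:

-------------------------- %< ----------------------------
/* Hypothetical sketch: patch two 5-byte sites in a single batch. */
struct text_patch_loc vec[2] = {
	{ .addr = addr1, .detour = addr1 + 5, .len = 5 },	/* addr1 < addr2 */
	{ .addr = addr2, .detour = addr2 + 5, .len = 5 },
};

memcpy((void *)vec[0].opcode, new_insn1, 5);	/* replacement bytes */
memcpy((void *)vec[1].opcode, new_insn2, 5);

mutex_lock(&text_mutex);
text_poke_bp_batch(vec, 2);
mutex_unlock(&text_mutex);
-------------------------- >% ----------------------------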

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/text-patching.h |  15 +++
 arch/x86/kernel/alternative.c        | 154 +++++++++++++++++++++------
 2 files changed, 135 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 880b5515b1d6..e1ae038d44f8 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_end	NULL
 #endif
 
+/*
+ * Currently, the max observed size in the kernel code is
+ * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
+ * Raise it if needed.
+ */
+#define POKE_MAX_OPCODE_SIZE	5
+
+struct text_patch_loc {
+	void *detour;
+	void *addr;
+	size_t len;
+	const char opcode[POKE_MAX_OPCODE_SIZE];
+};
+
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
@@ -38,6 +52,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp_batch(struct text_patch_loc *tp, unsigned int nr_entries);
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 0d57015114e7..dc5dc697492a 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -14,6 +14,7 @@
 #include <linux/kdebug.h>
 #include <linux/kprobes.h>
 #include <linux/mmu_context.h>
+#include <linux/bsearch.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -847,81 +848,133 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
-static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static struct bp_patching_desc {
+	struct text_patch_loc *vec;
+	int nr_entries;
+} bp_patching;
+
+static int patch_cmp(const void *key, const void *elt)
+{
+	struct text_patch_loc *tp = (struct text_patch_loc *) elt;
+
+	if (key < tp->addr)
+		return -1;
+	if (key > tp->addr)
+		return 1;
+	return 0;
+}
+NOKPROBE_SYMBOL(patch_cmp);
 
 int poke_int3_handler(struct pt_regs *regs)
 {
+	struct text_patch_loc *tp;
+	unsigned char int3 = 0xcc;
+	void *ip;
+
 	/*
 	 * Having observed our INT3 instruction, we now must observe
-	 * bp_patching_in_progress.
+	 * bp_patching.nr_entries.
 	 *
-	 * 	in_progress = TRUE		INT3
+	 * 	nr_entries != 0			INT3
 	 * 	WMB				RMB
-	 * 	write INT3			if (in_progress)
+	 * 	write INT3			if (nr_entries)
 	 *
-	 * Idem for bp_int3_handler.
+	 * Idem for other elements in bp_patching.
 	 */
 	smp_rmb();
 
-	if (likely(!bp_patching_in_progress))
+	if (likely(!bp_patching.nr_entries))
 		return 0;
 
-	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+	if (user_mode(regs))
 		return 0;
 
-	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	/*
+	 * Discount the sizeof(int3). See text_poke_bp_batch().
+	 */
+	ip = (void *) regs->ip - sizeof(int3);
+
+	/*
+	 * Skip the binary search if there is a single member in the vector.
+	 */
+	if (unlikely(bp_patching.nr_entries > 1)) {
+		tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries,
+			     sizeof(struct text_patch_loc),
+			     patch_cmp);
+		if (!tp)
+			return 0;
+	} else {
+		tp = bp_patching.vec;
+		if (tp->addr != ip)
+			return 0;
+	}
+
+	/* set up the specified breakpoint detour */
+	regs->ip = (unsigned long) tp->detour;
 
 	return 1;
 }
 NOKPROBE_SYMBOL(poke_int3_handler);
 
 /**
- * text_poke_bp() -- update instructions on live kernel on SMP
- * @addr:	address to patch
- * @opcode:	opcode of new instruction
- * @len:	length to copy
- * @handler:	address to jump to when the temporary breakpoint is hit
+ * text_poke_bp_batch() -- update instructions on live kernel on SMP
+ * @tp:			vector of instructions to patch
+ * @nr_entries:		number of entries in the vector
  *
  * Modify multi-byte instruction by using int3 breakpoint on SMP.
  * We completely avoid stop_machine() here, and achieve the
  * synchronization using int3 breakpoint.
  *
  * The way it is done:
- *	- add a int3 trap to the address that will be patched
+ * 	- For each entry in the vector:
+ *		- add a int3 trap to the address that will be patched
  *	- sync cores
- *	- update all but the first byte of the patched range
+ *	- For each entry in the vector:
+ *		- update all but the first byte of the patched range
  *	- sync cores
- *	- replace the first byte (int3) by the first byte of
- *	  replacing opcode
+ *	- For each entry in the vector:
+ *		- replace the first byte (int3) by the first byte of
+ *		  replacing opcode
  *	- sync cores
  */
-void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp_batch(struct text_patch_loc *tp, unsigned int nr_entries)
 {
+	int patched_all_but_first = 0;
 	unsigned char int3 = 0xcc;
-
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
-	bp_patching_in_progress = true;
+	unsigned int i;
 
 	lockdep_assert_held(&text_mutex);
 
+	bp_patching.vec = tp;
+	bp_patching.nr_entries = nr_entries;
+
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
-	 * in_progress and handler are correctly ordered wrt. patching.
+	 * nr_entries and handler are correctly ordered wrt. patching.
 	 */
 	smp_wmb();
 
-	text_poke(addr, &int3, sizeof(int3));
+	/*
+	 * First step: add a int3 trap to the address that will be patched.
+	 */
+	for (i = 0; i < nr_entries; i++)
+		text_poke(tp[i].addr, &int3, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
-	if (len - sizeof(int3) > 0) {
-		/* patch all but the first byte */
-		text_poke((char *)addr + sizeof(int3),
-			  (const char *) opcode + sizeof(int3),
-			  len - sizeof(int3));
+	/*
+	 * Second step: update all but the first byte of the patched range.
+	 */
+	for (i = 0; i < nr_entries; i++) {
+		if (tp[i].len - sizeof(int3) > 0) {
+			text_poke((char *)tp[i].addr + sizeof(int3),
+				  (const char *)tp[i].opcode + sizeof(int3),
+				  tp[i].len - sizeof(int3));
+			patched_all_but_first++;
+		}
+	}
+
+	if (patched_all_but_first) {
 		/*
 		 * According to Intel, this core syncing is very likely
 		 * not necessary and we'd be safe even without it. But
@@ -930,14 +983,47 @@ void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 		on_each_cpu(do_sync_core, NULL, 1);
 	}
 
-	/* patch the first byte */
-	text_poke(addr, opcode, sizeof(int3));
+	/*
+	 * Third step: replace the first byte (int3) by the first byte of
+	 * replacing opcode.
+	 */
+	for (i = 0; i < nr_entries; i++)
+		text_poke(tp[i].addr, tp[i].opcode, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 	/*
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.
 	 */
-	bp_patching_in_progress = false;
+	bp_patching.vec = NULL;
+	bp_patching.nr_entries = 0;
 }
 
+/**
+ * text_poke_bp() -- update instructions on live kernel on SMP
+ * @addr:	address to patch
+ * @opcode:	opcode of new instruction
+ * @len:	length to copy
+ * @handler:	address to jump to when the temporary breakpoint is hit
+ *
+ * Update a single instruction with the vector in the stack, avoiding
+ * dynamically allocated memory. This function should be used when it is
+ * not possible to allocate memory.
+ */
+void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+{
+	struct text_patch_loc tp = {
+		.detour = handler,
+		.addr = addr,
+		.len = len,
+	};
+
+	if (len > POKE_MAX_OPCODE_SIZE) {
+		WARN_ONCE(1, "len is larger than %d\n", POKE_MAX_OPCODE_SIZE);
+		return;
+	}
+
+	memcpy((void *)tp.opcode, opcode, len);
+
+	text_poke_bp_batch(&tp, 1);
+}
-- 
2.20.1



* [PATCH V6 5/6] jump_label: Batch updates if arch supports it
  2019-06-12  9:57 [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (3 preceding siblings ...)
  2019-06-12  9:57 ` [PATCH V6 4/6] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
@ 2019-06-12  9:57 ` Daniel Bristot de Oliveira
  2019-06-17 14:15   ` [tip:locking/core] " tip-bot for Daniel Bristot de Oliveira
  2019-06-12  9:57 ` [PATCH V6 6/6] x86/jump_label: Batch jump label updates Daniel Bristot de Oliveira
  2019-06-12 17:07 ` [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Peter Zijlstra
  6 siblings, 1 reply; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-12  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

If the architecture supports the batching of jump label updates, use it!
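
For reference, an architecture opts in by defining HAVE_JUMP_LABEL_BATCH in
its asm/jump_label.h and implementing the two hooks declared by this patch
in <linux/jump_label.h> (x86 does so in the next patch); a sketch:

-------------------------- %< ----------------------------
/* arch/<arch>/include/asm/jump_label.h */
#define HAVE_JUMP_LABEL_BATCH

/* Hooks declared in include/linux/jump_label.h by this patch: */
bool arch_jump_label_transform_queue(struct jump_entry *entry,
				     enum jump_label_type type);
void arch_jump_label_transform_apply(void);
-------------------------- >% ----------------------------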

An easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ----------------------------
  #!/bin/sh
  while [ true ]; do
      sysctl -w kernel.sched_schedstats=1
      sleep 2
      sysctl -w kernel.sched_schedstats=0
      sleep 2
  done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
  # watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see around 168 IPIs every 2 seconds,
while with this patch the number of IPIs goes down to 3 every 2 seconds.

Regarding the performance impact of this patch set, I made two measurements:

    The time to update a key (the task that is causing the change)
    The time to run the int3 handler (the side effect on a thread that
                                      hits the code being changed)

The schedstats static key was chosen as the key to be switched on and off.
The reason is that it is used in more than 56 places, in a hot path. The
change in the schedstats static key will be done with the following command:

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    usleep 500000
    sysctl -w kernel.sched_schedstats=0
    usleep 500000
done

In this way, the key will be updated twice per second. To force hits of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two-node/24-CPU box with an Intel Xeon processor
@2.27GHz.

Regarding the update part, on average, the regular kernel takes 57 ms to update
the schedstats key, while the kernel with the batch updates takes just 1.4 ms
on average. Although it seems too good to be true, it makes sense: the
schedstats key is used in 56 places, so it was expected that the update would
take around 56 times longer with the current implementation, as the IPIs are
the most expensive part of the update.

Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, it seems to be a high
value. But it is not, considering that it is doing 56 updates rather than one!
It takes only four times longer. This gain is possible because the patch uses a
binary search in the vector: log2(56) = 5.8. So an overhead within four times
was expected.

(voice of TV propaganda) But that is not all! As the int3 breakpoint stays in
place for a shorter period (because the update part takes less time), the
number of hits in the int3 handler decreased by 10%.

The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?

Considering that, in this test case, we are saving the handling of 53 IPIs,
which take more than these 135 ns, it seems to be a meager price to pay.
Moreover, the test case was forcing hits of the int3 handler; in practice, they
do not happen that often, while the IPIs take place on all CPUs, whether they
hit the int3 handler or not!

For instance, on an isolated CPU with a process running in user-space
(nohz_full use-case), the chances of hitting the int3 handler are nearly zero,
while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
this scenario a lot.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/jump_label.h |  3 +++
 kernel/jump_label.c        | 23 +++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 3e113a1fa0f1..3526c0aee954 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -215,6 +215,9 @@ extern void arch_jump_label_transform(struct jump_entry *entry,
 				      enum jump_label_type type);
 extern void arch_jump_label_transform_static(struct jump_entry *entry,
 					     enum jump_label_type type);
+extern bool arch_jump_label_transform_queue(struct jump_entry *entry,
+					    enum jump_label_type type);
+extern void arch_jump_label_transform_apply(void);
 extern int jump_label_text_reserved(void *start, void *end);
 extern void static_key_slow_inc(struct static_key *key);
 extern void static_key_slow_dec(struct static_key *key);
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 2bd172e1e95f..812c7d66f339 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -414,6 +414,7 @@ static bool jump_label_can_update(struct jump_entry *entry, bool init)
 	return true;
 }
 
+#ifndef HAVE_JUMP_LABEL_BATCH
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop,
@@ -425,6 +426,28 @@ static void __jump_label_update(struct static_key *key,
 		}
 	}
 }
+#else
+static void __jump_label_update(struct static_key *key,
+				struct jump_entry *entry,
+				struct jump_entry *stop,
+				bool init)
+{
+	for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
+
+		if (!jump_label_can_update(entry, init))
+			continue;
+
+		if (!arch_jump_label_transform_queue(entry, jump_label_type(entry))) {
+			/*
+			 * Queue is full: Apply the current queue and try again.
+			 */
+			arch_jump_label_transform_apply();
+			BUG_ON(!arch_jump_label_transform_queue(entry, jump_label_type(entry)));
+		}
+	}
+	arch_jump_label_transform_apply();
+}
+#endif
 
 void __init jump_label_init(void)
 {
-- 
2.20.1



* [PATCH V6 6/6] x86/jump_label: Batch jump label updates
  2019-06-12  9:57 [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (4 preceding siblings ...)
  2019-06-12  9:57 ` [PATCH V6 5/6] jump_label: Batch updates if arch supports it Daniel Bristot de Oliveira
@ 2019-06-12  9:57 ` Daniel Bristot de Oliveira
  2019-06-17 14:16   ` [tip:locking/core] " tip-bot for Daniel Bristot de Oliveira
  2019-06-12 17:07 ` [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Peter Zijlstra
  6 siblings, 1 reply; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-12  9:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Greg Kroah-Hartman, Masami Hiramatsu, Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Peter Zijlstra (Intel),
	Chris von Recklinghausen, Jason Baron, Scott Wood,
	Marcelo Tosatti, Clark Williams, x86

Currently, the jump label of a static key is transformed via the arch
specific function:

    void arch_jump_label_transform(struct jump_entry *entry,
                                   enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has the
same arguments as arch_jump_label_transform(), and is the function:

    bool arch_jump_label_transform_queue(struct jump_entry *entry,
                                         enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated. This function returns true in the case of a
successful enqueue of an entry. If it returns false (for instance, because
the queue is full), the caller must apply the queue and then try to queue
again.

This function expects the caller to sort the entries by address before
enqueueing them. This is already done by the arch-independent code, though.

After queuing all jump_entries, the function:

    void arch_jump_label_transform_apply(void)

Applies the changes in the queue.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Scott Wood <swood@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/jump_label.h |  2 +
 arch/x86/kernel/jump_label.c      | 69 +++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 65191ce8e1cf..06c3cc22a058 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_JUMP_LABEL_H
 #define _ASM_X86_JUMP_LABEL_H
 
+#define HAVE_JUMP_LABEL_BATCH
+
 #define JUMP_LABEL_NOP_SIZE 5
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index d3328062b8cf..7391d6d90bcf 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -110,6 +110,75 @@ void arch_jump_label_transform(struct jump_entry *entry,
 	mutex_unlock(&text_mutex);
 }
 
+#define TP_VEC_MAX (PAGE_SIZE / sizeof(struct text_patch_loc))
+static struct text_patch_loc tp_vec[TP_VEC_MAX];
+int tp_vec_nr = 0;
+
+bool arch_jump_label_transform_queue(struct jump_entry *entry,
+				     enum jump_label_type type)
+{
+	struct text_patch_loc *tp;
+	void *entry_code;
+
+	if (system_state == SYSTEM_BOOTING) {
+		/*
+		 * Fallback to the non-batching mode.
+		 */
+		arch_jump_label_transform(entry, type);
+		return true;
+	}
+
+	/*
+	 * No more space in the vector, tell upper layer to apply
+	 * the queue before continuing.
+	 */
+	if (tp_vec_nr == TP_VEC_MAX)
+		return false;
+
+	tp = &tp_vec[tp_vec_nr];
+
+	entry_code = (void *)jump_entry_code(entry);
+
+	/*
+	 * The int3 handler will do a bsearch in the queue, so we need entries
+	 * to be sorted. We can survive an unsorted list by rejecting the entry,
+	 * forcing the generic jump_label code to apply the queue. Warning once,
+	 * to raise the attention to the case of an unsorted entry that is
+	 * better not happen, because, in the worst case we will perform in the
+	 * same way as we do without batching - with some more overhead.
+	 */
+	if (tp_vec_nr > 0) {
+		int prev = tp_vec_nr - 1;
+		struct text_patch_loc *prev_tp = &tp_vec[prev];
+
+		if (WARN_ON_ONCE(prev_tp->addr > entry_code))
+			return false;
+	}
+
+	__jump_label_set_jump_code(entry, type,
+				   (union jump_code_union *) &tp->opcode, 0);
+
+	tp->addr = entry_code;
+	tp->detour = entry_code + JUMP_LABEL_NOP_SIZE;
+	tp->len = JUMP_LABEL_NOP_SIZE;
+
+	tp_vec_nr++;
+
+	return true;
+}
+
+void arch_jump_label_transform_apply(void)
+{
+	if (!tp_vec_nr)
+		return;
+
+	mutex_lock(&text_mutex);
+	text_poke_bp_batch(tp_vec, tp_vec_nr);
+	mutex_unlock(&text_mutex);
+
+	tp_vec_nr = 0;
+}
+
 static enum {
 	JL_STATE_START,
 	JL_STATE_NO_UPDATE,
-- 
2.20.1



* Re: [PATCH V6 4/6] x86/alternative: Batch of patch operations
  2019-06-12  9:57 ` [PATCH V6 4/6] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
@ 2019-06-12 14:52   ` Peter Zijlstra
  2019-06-12 16:33     ` Daniel Bristot de Oliveira
  2019-06-17 14:15   ` [tip:locking/core] " tip-bot for Daniel Bristot de Oliveira
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2019-06-12 14:52 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Masami Hiramatsu,
	Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Chris von Recklinghausen,
	Jason Baron, Scott Wood, Marcelo Tosatti, Clark Williams, x86

On Wed, Jun 12, 2019 at 11:57:29AM +0200, Daniel Bristot de Oliveira wrote:

> When a static key has more than one entry, these steps are called once for
> each entry. The number of IPIs then is linear with regard to the number 'n' of
> entries of a key: O(n*3), which is O(n).

> Doing the update in this way, the number of IPIs becomes O(3) with regard
> to the number of keys, which is O(1).

That's not quite true, what you're doing is n/X, which, in the end, is
still O(n).

It just so happens your X is 128, and so any n smaller than that ends up
being 1.


* Re: [PATCH V6 4/6] x86/alternative: Batch of patch operations
  2019-06-12 14:52   ` Peter Zijlstra
@ 2019-06-12 16:33     ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-12 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Masami Hiramatsu,
	Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Chris von Recklinghausen,
	Jason Baron, Scott Wood, Marcelo Tosatti, Clark Williams, x86

On 12/06/2019 16:52, Peter Zijlstra wrote:
> On Wed, Jun 12, 2019 at 11:57:29AM +0200, Daniel Bristot de Oliveira wrote:
> 
>> When a static key has more than one entry, these steps are called once for
>> each entry. The number of IPIs then is linear with regard to the number 'n' of
>> entries of a key: O(n*3), which is O(n).
> 
>> Doing the update in this way, the number of IPIs becomes O(3) with regard
>> to the number of keys, which is O(1).
> 
> That's not quite true, what you're doing is n/X, which, in the end, is
> still O(n).
> 
> It just so happens your X is 128, and so any n smaller than that ends up
> being 1.
> 

Correct! In the v1, when I was using a (dynamic) linked list of keys, it was
O(1), now it is O(n).

Using an academic hat of easy assumptions, I could argue that:

"Doing the update in this way, the number of IPI becomes O(3) with regard to the
number of keys*, which is O(1).

* Given that the number of elements in the vector is larger than or equal to
the number of entries of a given key; O(n) otherwise."

Life is so easy when we can make such assumptions, like infinite memory :-)

So, yeah, with a fixed size vector, it is O(n) in the worst case, but still
"O(1)" in the vast majority of cases.

-- Daniel


* Re: [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key
  2019-06-12  9:57 [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Daniel Bristot de Oliveira
                   ` (5 preceding siblings ...)
  2019-06-12  9:57 ` [PATCH V6 6/6] x86/jump_label: Batch jump label updates Daniel Bristot de Oliveira
@ 2019-06-12 17:07 ` Peter Zijlstra
  2019-06-13  8:08   ` Daniel Bristot de Oliveira
  6 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2019-06-12 17:07 UTC (permalink / raw)
  To: Daniel Bristot de Oliveira
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Masami Hiramatsu,
	Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Chris von Recklinghausen,
	Jason Baron, Scott Wood, Marcelo Tosatti, Clark Williams, x86

On Wed, Jun 12, 2019 at 11:57:25AM +0200, Daniel Bristot de Oliveira wrote:
> Daniel Bristot de Oliveira (6):
>   jump_label: Add a jump_label_can_update() helper
>   x86/jump_label: Add a __jump_label_set_jump_code() helper
>   jump_label: Sort entries of the same key by the code
>   x86/alternative: Batch of patch operations
>   jump_label: Batch updates if arch supports it
>   x86/jump_label: Batch jump label updates
> 
>  arch/x86/include/asm/jump_label.h    |   2 +
>  arch/x86/include/asm/text-patching.h |  15 +++
>  arch/x86/kernel/alternative.c        | 154 +++++++++++++++++++++------
>  arch/x86/kernel/jump_label.c         | 110 ++++++++++++++++---
>  include/linux/jump_label.h           |   3 +
>  kernel/jump_label.c                  |  65 +++++++++--
>  6 files changed, 290 insertions(+), 59 deletions(-)

OK, so I like these. I did make me some change, but mostly cosmetic
(although patch #2 got a bigger shuffle than intended).

I've also done my text_poke_bp() emulation thing on top of this, to make
sure it all works properly. Funnily, that shrank the size of
text_poke_loc to 24 bytes and thus we now fit 170 locations in the one
page.

---
Subject: x86/alternatives: Teach text_poke_bp() to emulate instructions
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Jun 5 10:48:37 CEST 2019

In preparation for static_call support, teach text_poke_bp() to
emulate instructions, including CALL.

The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.

When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.

This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe() which restores random text, and for that site
we have to build an explicit emulate instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/text-patching.h |   24 +++++++--
 arch/x86/kernel/alternative.c        |   88 +++++++++++++++++++++++++----------
 arch/x86/kernel/jump_label.c         |    9 +--
 arch/x86/kernel/kprobes/opt.c        |   11 +++-
 4 files changed, 93 insertions(+), 39 deletions(-)

--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -26,10 +26,11 @@ static inline void apply_paravirt(struct
 #define POKE_MAX_OPCODE_SIZE	5
 
 struct text_poke_loc {
-	void *detour;
 	void *addr;
-	size_t len;
-	const char opcode[POKE_MAX_OPCODE_SIZE];
+	int len;
+	s32 rel32;
+	u8 opcode;
+	const char text[POKE_MAX_OPCODE_SIZE];
 };
 
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
@@ -51,8 +52,10 @@ extern void text_poke_early(void *addr,
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
-extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate);
 extern void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries);
+extern void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
+			       const void *opcode, size_t len, const void *emulate);
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
@@ -63,8 +66,17 @@ static inline void int3_emulate_jmp(stru
 	regs->ip = ip;
 }
 
-#define INT3_INSN_SIZE 1
-#define CALL_INSN_SIZE 5
+#define INT3_INSN_SIZE		1
+#define INT3_INSN_OPCODE	0xCC
+
+#define CALL_INSN_SIZE		5
+#define CALL_INSN_OPCODE	0xE9
+
+#define JMP32_INSN_SIZE		5
+#define JMP32_INSN_OPCODE	0xE8
+
+#define JMP8_INSN_SIZE		2
+#define JMP8_INSN_OPCODE	0xEB
 
 static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
 {
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -941,16 +941,15 @@ NOKPROBE_SYMBOL(patch_cmp);
 int poke_int3_handler(struct pt_regs *regs)
 {
 	struct text_poke_loc *tp;
-	unsigned char int3 = 0xcc;
 	void *ip;
 
 	/*
 	 * Having observed our INT3 instruction, we now must observe
 	 * bp_patching.nr_entries.
 	 *
-	 * 	nr_entries != 0			INT3
-	 * 	WMB				RMB
-	 * 	write INT3			if (nr_entries)
+	 *	nr_entries != 0			INT3
+	 *	WMB				RMB
+	 *	write INT3			if (nr_entries)
 	 *
 	 * Idem for other elements in bp_patching.
 	 */
@@ -963,9 +962,9 @@ int poke_int3_handler(struct pt_regs *re
 		return 0;
 
 	/*
-	 * Discount the sizeof(int3). See text_poke_bp_batch().
+	 * Discount the INT3. See text_poke_bp_batch().
 	 */
-	ip = (void *) regs->ip - sizeof(int3);
+	ip = (void *) regs->ip - INT3_INSN_SIZE;
 
 	/*
 	 * Skip the binary search if there is a single member in the vector.
@@ -982,8 +981,22 @@ int poke_int3_handler(struct pt_regs *re
 			return 0;
 	}
 
-	/* set up the specified breakpoint detour */
-	regs->ip = (unsigned long) tp->detour;
+	ip += tp->len;
+
+	switch (tp->opcode) {
+	case CALL_INSN_OPCODE:
+		int3_emulate_call(regs, (long)ip + tp->rel32);
+		break;
+
+	case JMP32_INSN_OPCODE:
+	case JMP8_INSN_OPCODE:
+		int3_emulate_jmp(regs, (long)ip + tp->rel32);
+		break;
+
+	default: /* nop */
+		int3_emulate_jmp(regs, (long)ip);
+		break;
+	}
 
 	return 1;
 }
@@ -1012,8 +1025,8 @@ NOKPROBE_SYMBOL(poke_int3_handler);
  */
 void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
+	unsigned char int3 = INT3_INSN_OPCODE;
 	int patched_all_but_first = 0;
-	unsigned char int3 = 0xcc;
 	unsigned int i;
 
 	lockdep_assert_held(&text_mutex);
@@ -1041,7 +1054,7 @@ void text_poke_bp_batch(struct text_poke
 	for (i = 0; i < nr_entries; i++) {
 		if (tp[i].len - sizeof(int3) > 0) {
 			text_poke((char *)tp[i].addr + sizeof(int3),
-				  (const char *)tp[i].opcode + sizeof(int3),
+				  (const char *)tp[i].text + sizeof(int3),
 				  tp[i].len - sizeof(int3));
 			patched_all_but_first++;
 		}
@@ -1061,7 +1074,7 @@ void text_poke_bp_batch(struct text_poke
 	 * replacing opcode.
 	 */
 	for (i = 0; i < nr_entries; i++)
-		text_poke(tp[i].addr, tp[i].opcode, sizeof(int3));
+		text_poke(tp[i].addr, tp[i].text, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 	/*
@@ -1072,6 +1085,43 @@ void text_poke_bp_batch(struct text_poke
 	bp_patching.nr_entries = 0;
 }
 
+void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
+			const void *opcode, size_t len, const void *emulate)
+{
+	struct insn insn;
+
+	if (!opcode)
+		opcode = (void *)tp->text;
+	else
+		memcpy((void *)tp->text, opcode, len);
+
+	if (!emulate)
+		emulate = opcode;
+
+	kernel_insn_init(&insn, emulate, MAX_INSN_SIZE);
+	insn_get_length(&insn);
+
+	BUG_ON(!insn_complete(&insn));
+	BUG_ON(len != insn.length);
+
+	tp->addr = addr;
+	tp->len = len;
+	tp->opcode = insn.opcode.bytes[0];
+
+	switch (tp->opcode) {
+	case CALL_INSN_OPCODE:
+	case JMP32_INSN_OPCODE:
+	case JMP8_INSN_OPCODE:
+		tp->rel32 = insn.immediate.value;
+		break;
+
+	default:
+		BUG_ON(len != 5);
+		BUG_ON(memcmp(emulate, ideal_nops[NOP_ATOMIC5], len));
+		break;
+	}
+}
+
 /**
  * text_poke_bp() -- update instructions on live kernel on SMP
  * @addr:	address to patch
@@ -1083,20 +1133,10 @@ void text_poke_bp_batch(struct text_poke
  * dynamically allocated memory. This function should be used when it is
  * not possible to allocate memory.
  */
-void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	struct text_poke_loc tp = {
-		.detour = handler,
-		.addr = addr,
-		.len = len,
-	};
-
-	if (len > POKE_MAX_OPCODE_SIZE) {
-		WARN_ONCE(1, "len is larger than %d\n", POKE_MAX_OPCODE_SIZE);
-		return;
-	}
-
-	memcpy((void *)tp.opcode, opcode, len);
+	struct text_poke_loc tp;
 
+	text_poke_loc_init(&tp, addr, opcode, len, emulate);
 	text_poke_bp_batch(&tp, 1);
 }
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -89,8 +89,7 @@ static void __ref __jump_label_transform
 		return;
 	}
 
-	text_poke_bp((void *)jump_entry_code(entry), &code, JUMP_LABEL_NOP_SIZE,
-		     (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+	text_poke_bp((void *)jump_entry_code(entry), &code, JUMP_LABEL_NOP_SIZE, NULL);
 }
 
 void arch_jump_label_transform(struct jump_entry *entry,
@@ -147,11 +146,9 @@ bool arch_jump_label_transform_queue(str
 	}
 
 	__jump_label_set_jump_code(entry, type,
-				   (union jump_code_union *) &tp->opcode, 0);
+				   (union jump_code_union *)&tp->text, 0);
 
-	tp->addr = entry_code;
-	tp->detour = entry_code + JUMP_LABEL_NOP_SIZE;
-	tp->len = JUMP_LABEL_NOP_SIZE;
+	text_poke_loc_init(tp, entry_code, NULL, JUMP_LABEL_NOP_SIZE, NULL);
 
 	tp_vec_nr++;
 
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -437,8 +437,7 @@ void arch_optimize_kprobes(struct list_h
 		insn_buff[0] = RELATIVEJUMP_OPCODE;
 		*(s32 *)(&insn_buff[1]) = rel;
 
-		text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
-			     op->optinsn.insn);
+		text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE, NULL);
 
 		list_del_init(&op->list);
 	}
@@ -448,12 +447,18 @@ void arch_optimize_kprobes(struct list_h
 void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 {
 	u8 insn_buff[RELATIVEJUMP_SIZE];
+	u8 emulate_buff[RELATIVEJUMP_SIZE];
 
 	/* Set int3 to first byte for kprobes */
 	insn_buff[0] = BREAKPOINT_INSTRUCTION;
 	memcpy(insn_buff + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
+
+	emulate_buff[0] = RELATIVEJUMP_OPCODE;
+	*(s32 *)(&emulate_buff[1]) = (s32)((long)op->optinsn.insn -
+			((long)op->kp.addr + RELATIVEJUMP_SIZE));
+
 	text_poke_bp(op->kp.addr, insn_buff, RELATIVEJUMP_SIZE,
-		     op->optinsn.insn);
+		     emulate_buff);
 }
 
 /*


* Re: [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key
  2019-06-12 17:07 ` [PATCH V6 0/6] x86/jump_label: Bound IPIs sent when updating a static key Peter Zijlstra
@ 2019-06-13  8:08   ` Daniel Bristot de Oliveira
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Bristot de Oliveira @ 2019-06-13  8:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Greg Kroah-Hartman, Masami Hiramatsu,
	Steven Rostedt (VMware),
	Jiri Kosina, Josh Poimboeuf, Chris von Recklinghausen,
	Jason Baron, Scott Wood, Marcelo Tosatti, Clark Williams, x86

On 12/06/2019 19:07, Peter Zijlstra wrote:
> On Wed, Jun 12, 2019 at 11:57:25AM +0200, Daniel Bristot de Oliveira wrote:
>> Daniel Bristot de Oliveira (6):
>>   jump_label: Add a jump_label_can_update() helper
>>   x86/jump_label: Add a __jump_label_set_jump_code() helper
>>   jump_label: Sort entries of the same key by the code
>>   x86/alternative: Batch of patch operations
>>   jump_label: Batch updates if arch supports it
>>   x86/jump_label: Batch jump label updates
>>
>>  arch/x86/include/asm/jump_label.h    |   2 +
>>  arch/x86/include/asm/text-patching.h |  15 +++
>>  arch/x86/kernel/alternative.c        | 154 +++++++++++++++++++++------
>>  arch/x86/kernel/jump_label.c         | 110 ++++++++++++++++---
>>  include/linux/jump_label.h           |   3 +
>>  kernel/jump_label.c                  |  65 +++++++++--
>>  6 files changed, 290 insertions(+), 59 deletions(-)
> 
> OK, so I like these. I did make me some change, but mostly cosmetic
> (although patch #2 got a bigger shuffle than intended).
> 
> I've also done my text_poke_bp() emulation thing on top of this, to make
> sure it all works properly. Funnily, that shrank the size of
> text_poke_loc to 24 bytes and thus we now fit 170 locations in the one
> page.
> 
> ---
> Subject: x86/alternatives: Teach text_poke_bp() to emulate instructions
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Jun 5 10:48:37 CEST 2019
> 
> In preparation for static_call support, teach text_poke_bp() to
> emulate instructions, including CALL.
> 
> The current text_poke_bp() takes a @handler argument which is used as
> a jump target when the temporary INT3 is hit by a different CPU.
> 
> When patching CALL instructions, this doesn't work because we'd miss
> the PUSH of the return address. Instead, teach poke_int3_handler() to
> emulate an instruction, typically the instruction we're patching in.
> 
> This fits almost all text_poke_bp() users, except
> arch_unoptimize_kprobe() which restores random text, and for that site
> we have to build an explicit emulate instruction.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> [...]

Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>

Thanks Peter!
-- Daniel


* [tip:locking/core] jump_label: Add a jump_label_can_update() helper
  2019-06-12  9:57 ` [PATCH V6 1/6] jump_label: Add a jump_label_can_update() helper Daniel Bristot de Oliveira
@ 2019-06-17 14:13   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-06-17 14:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, rostedt, williams, jkosina, jbaron, swood, crecklin,
	bristot, mhiramat, hpa, tglx, gregkh, mtosatti, torvalds, mingo,
	bp, jpoimboe, linux-kernel

Commit-ID:  e1aacb3f4adc1bcd4273a1d538a2dc3e50f1cbb5
Gitweb:     https://git.kernel.org/tip/e1aacb3f4adc1bcd4273a1d538a2dc3e50f1cbb5
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Wed, 12 Jun 2019 11:57:26 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:09:19 +0200

jump_label: Add a jump_label_can_update() helper

Move the check if a jump_entry is valid to a function. No functional
change.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/56b69bd3f8e644ed64f2dbde7c088030b8cbe76b.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/jump_label.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 0bfa10f4410c..24f0d3b1526b 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -384,23 +384,30 @@ static enum jump_label_type jump_label_type(struct jump_entry *entry)
 	return enabled ^ branch;
 }
 
+static bool jump_label_can_update(struct jump_entry *entry, bool init)
+{
+	/*
+	 * Cannot update code that was in an init text area.
+	 */
+	if (!init && jump_entry_is_init(entry))
+		return false;
+
+	if (!kernel_text_address(jump_entry_code(entry))) {
+		WARN_ONCE(1, "can't patch jump_label at %pS", (void *)jump_entry_code(entry));
+		return false;
+	}
+
+	return true;
+}
+
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop,
 				bool init)
 {
 	for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
-		/*
-		 * An entry->code of 0 indicates an entry which has been
-		 * disabled because it was in an init text area.
-		 */
-		if (init || !jump_entry_is_init(entry)) {
-			if (kernel_text_address(jump_entry_code(entry)))
-				arch_jump_label_transform(entry, jump_label_type(entry));
-			else
-				WARN_ONCE(1, "can't patch jump_label at %pS",
-					  (void *)jump_entry_code(entry));
-		}
+		if (jump_label_can_update(entry, init))
+			arch_jump_label_transform(entry, jump_label_type(entry));
 	}
 }
 


* [tip:locking/core] x86/jump_label: Add a __jump_label_set_jump_code() helper
  2019-06-12  9:57 ` [PATCH V6 2/6] x86/jump_label: Add a __jump_label_set_jump_code() helper Daniel Bristot de Oliveira
@ 2019-06-17 14:13   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-06-17 14:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: swood, rostedt, tglx, hpa, crecklin, bristot, linux-kernel,
	mtosatti, torvalds, jkosina, jpoimboe, mhiramat, peterz,
	williams, jbaron, bp, gregkh, mingo

Commit-ID:  4cc6620b5e4c8953c725bcfab86a57df01e83a7c
Gitweb:     https://git.kernel.org/tip/4cc6620b5e4c8953c725bcfab86a57df01e83a7c
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Wed, 12 Jun 2019 11:57:27 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:09:20 +0200

x86/jump_label: Add a __jump_label_set_jump_code() helper

Move the definition of the code to be written from
__jump_label_transform() to a specialized function. No functional
change.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/d2f52a0010ecd399cf9b02a65bcf5836571b9e52.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/jump_label.c | 52 +++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index e631c358f7f4..f33408f1c3f6 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -35,41 +35,43 @@ static void bug_at(unsigned char *ip, int line)
 	BUG();
 }
 
-static void __ref __jump_label_transform(struct jump_entry *entry,
-					 enum jump_label_type type,
-					 int init)
+static void __jump_label_set_jump_code(struct jump_entry *entry,
+				       enum jump_label_type type,
+				       union jump_code_union *code,
+				       int init)
 {
-	union jump_code_union jmp;
 	const unsigned char default_nop[] = { STATIC_KEY_INIT_NOP };
 	const unsigned char *ideal_nop = ideal_nops[NOP_ATOMIC5];
-	const void *expect, *code;
+	const void *expect;
 	int line;
 
-	jmp.jump = 0xe9;
-	jmp.offset = jump_entry_target(entry) -
-		     (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
-
-	if (type == JUMP_LABEL_JMP) {
-		if (init) {
-			expect = default_nop; line = __LINE__;
-		} else {
-			expect = ideal_nop; line = __LINE__;
-		}
+	code->jump = 0xe9;
+	code->offset = jump_entry_target(entry) -
+		       (jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 
-		code = &jmp.code;
+	if (init) {
+		expect = default_nop; line = __LINE__;
+	} else if (type == JUMP_LABEL_JMP) {
+		expect = ideal_nop; line = __LINE__;
 	} else {
-		if (init) {
-			expect = default_nop; line = __LINE__;
-		} else {
-			expect = &jmp.code; line = __LINE__;
-		}
-
-		code = ideal_nop;
+		expect = code->code; line = __LINE__;
 	}
 
 	if (memcmp((void *)jump_entry_code(entry), expect, JUMP_LABEL_NOP_SIZE))
 		bug_at((void *)jump_entry_code(entry), line);
 
+	if (type == JUMP_LABEL_NOP)
+		memcpy(code, ideal_nop, JUMP_LABEL_NOP_SIZE);
+}
+
+static void __ref __jump_label_transform(struct jump_entry *entry,
+					 enum jump_label_type type,
+					 int init)
+{
+	union jump_code_union code;
+
+	__jump_label_set_jump_code(entry, type, &code, init);
+
 	/*
 	 * As long as only a single processor is running and the code is still
 	 * not marked as RO, text_poke_early() can be used; Checking that
@@ -82,12 +84,12 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
 	 * always nop being the 'currently valid' instruction
 	 */
 	if (init || system_state == SYSTEM_BOOTING) {
-		text_poke_early((void *)jump_entry_code(entry), code,
+		text_poke_early((void *)jump_entry_code(entry), &code,
 				JUMP_LABEL_NOP_SIZE);
 		return;
 	}
 
-	text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
+	text_poke_bp((void *)jump_entry_code(entry), &code, JUMP_LABEL_NOP_SIZE,
 		     (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
 }
 


* [tip:locking/core] jump_label: Sort entries of the same key by the code
  2019-06-12  9:57 ` [PATCH V6 3/6] jump_label: Sort entries of the same key by the code Daniel Bristot de Oliveira
@ 2019-06-17 14:14   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-06-17 14:14 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: rostedt, mhiramat, tglx, crecklin, gregkh, jbaron, bp, peterz,
	jkosina, linux-kernel, mingo, mtosatti, torvalds, williams, hpa,
	jpoimboe, bristot, swood

Commit-ID:  0f133021bd82548a33580bfb7b055e8857f46c2a
Gitweb:     https://git.kernel.org/tip/0f133021bd82548a33580bfb7b055e8857f46c2a
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Wed, 12 Jun 2019 11:57:28 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:09:21 +0200

jump_label: Sort entries of the same key by the code

In the batching mode, all the entries of a given key are updated at once.
During the update of a key, a hit in the int3 handler will check if the
hitting code address belongs to one of these keys.

To optimize the search for a given code in the vector of entries being
updated, a binary search is used. The binary search relies on the entries
of a key being ordered by their code. Hence, the entries also need to be
sorted by the code, so sort the entries of a given key by the code.
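
As a stand-alone illustration of the resulting order (user-space code, not
part of the patch; names are made up), sorting by (key, code) leaves all
entries of one key contiguous and ordered by code, which is what the int3
handler's binary search depends on:

#include <stdio.h>
#include <stdlib.h>

struct fake_entry {
	unsigned long key;
	unsigned long code;
};

static int fake_cmp(const void *a, const void *b)
{
	const struct fake_entry *ea = a, *eb = b;

	if (ea->key != eb->key)
		return ea->key < eb->key ? -1 : 1;
	if (ea->code != eb->code)
		return ea->code < eb->code ? -1 : 1;
	return 0;
}

int main(void)
{
	struct fake_entry vec[] = {
		{ .key = 2, .code = 0x3000 },
		{ .key = 1, .code = 0x2000 },
		{ .key = 1, .code = 0x1000 },
	};
	int i;

	qsort(vec, 3, sizeof(vec[0]), fake_cmp);

	/* Prints key 1 first, with its two entries ordered by code. */
	for (i = 0; i < 3; i++)
		printf("key=%lu code=%#lx\n", vec[i].key, vec[i].code);

	return 0;
}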

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/f57ae83e0592418ba269866bb7ade570fc8632e0.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/jump_label.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 24f0d3b1526b..ca00ac10d9b9 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -37,12 +37,26 @@ static int jump_label_cmp(const void *a, const void *b)
 	const struct jump_entry *jea = a;
 	const struct jump_entry *jeb = b;
 
+	/*
+	 * Entries are sorted by key.
+	 */
 	if (jump_entry_key(jea) < jump_entry_key(jeb))
 		return -1;
 
 	if (jump_entry_key(jea) > jump_entry_key(jeb))
 		return 1;
 
+	/*
+	 * In the batching mode, entries should also be sorted by the code
+	 * inside the already sorted list of entries, enabling a bsearch in
+	 * the vector.
+	 */
+	if (jump_entry_code(jea) < jump_entry_code(jeb))
+		return -1;
+
+	if (jump_entry_code(jea) > jump_entry_code(jeb))
+		return 1;
+
 	return 0;
 }
 


* [tip:locking/core] x86/alternative: Batch of patch operations
  2019-06-12  9:57 ` [PATCH V6 4/6] x86/alternative: Batch of patch operations Daniel Bristot de Oliveira
  2019-06-12 14:52   ` Peter Zijlstra
@ 2019-06-17 14:15   ` tip-bot for Daniel Bristot de Oliveira
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-06-17 14:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: crecklin, peterz, jkosina, gregkh, bp, swood, bristot, hpa,
	williams, torvalds, mhiramat, jpoimboe, jbaron, mingo,
	linux-kernel, mtosatti, rostedt, tglx

Commit-ID:  c0213b0ac03cf69f90fe5c6a8fe2c986630940fa
Gitweb:     https://git.kernel.org/tip/c0213b0ac03cf69f90fe5c6a8fe2c986630940fa
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Wed, 12 Jun 2019 11:57:29 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:09:21 +0200

x86/alternative: Batch of patch operations

Currently, the patch of an address is done in three steps:

-- Pseudo-code #1 - Current implementation ---

        1) add an int3 trap to the address that will be patched
            sync cores (send IPI to all other CPUs)
        2) update all but the first byte of the patched range
            sync cores (send IPI to all other CPUs)
        3) replace the first byte (int3) by the first byte of replacing opcode
            sync cores (send IPI to all other CPUs)

-- Pseudo-code #1 ---

When a static key has more than one entry, these steps are called once for
each entry. The number of IPIs then is linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n).

This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
which the thread that is enabling the key is _not_ running.

With this patch, rather than receiving a single patch to be processed, a vector
of patches is passed, enabling the rewrite of the pseudo-code #1 in this
way:

-- Pseudo-code #2 - This patch  ---
1)  for each patch in the vector:
        add an int3 trap to the address that will be patched

    sync cores (send IPI to all other CPUs)

2)  for each patch in the vector:
        update all but the first byte of the patched range

    sync cores (send IPI to all other CPUs)

3)  for each patch in the vector:
        replace the first byte (int3) by the first byte of replacing opcode

    sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch  ---

Doing the update in this way, the number of IPIs becomes O(3) with regard
to the number 'n' of entries of a key, which is O(1).

The batch mode is done with the function text_poke_bp_batch(), which receives
two arguments: a vector of "struct text_poke_loc", and the number of entries
in the vector.

The vector must be sorted by the addr field of the text_poke_loc structure,
enabling the binary search over the vector in the poke_int3_handler() function
(a fast path).
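
As a rough sketch of the intended usage (illustrative only, not part of the
patch; the real caller is the x86 jump_label batching code added later in this
series, and the helper names below are made up), a caller fills a vector kept
sorted by ->addr and hands it to text_poke_bp_batch() under text_mutex:

#include <linux/memory.h>		/* text_mutex */
#include <linux/mutex.h>
#include <linux/string.h>
#include <asm/text-patching.h>

static struct text_poke_loc example_vec[16];
static unsigned int example_nr;

static void example_queue(void *addr, const void *opcode, size_t len, void *detour)
{
	struct text_poke_loc *tp = &example_vec[example_nr++];

	tp->addr = addr;		/* callers must keep the vector sorted by addr */
	tp->len = len;
	tp->detour = detour;		/* where other CPUs jump while this site is INT3 */
	memcpy((void *)tp->opcode, opcode, len);
}

static void example_apply(void)
{
	mutex_lock(&text_mutex);
	text_poke_bp_batch(example_vec, example_nr);
	mutex_unlock(&text_mutex);
	example_nr = 0;
}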

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/ca506ed52584c80f64de23f6f55ca288e5d079de.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/text-patching.h |  15 ++++
 arch/x86/kernel/alternative.c        | 154 +++++++++++++++++++++++++++--------
 2 files changed, 135 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 880b5515b1d6..d83e9f771d86 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -18,6 +18,20 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,
 #define __parainstructions_end	NULL
 #endif
 
+/*
+ * Currently, the max observed size in the kernel code is
+ * JUMP_LABEL_NOP_SIZE/RELATIVEJUMP_SIZE, which are 5.
+ * Raise it if needed.
+ */
+#define POKE_MAX_OPCODE_SIZE	5
+
+struct text_poke_loc {
+	void *detour;
+	void *addr;
+	size_t len;
+	const char opcode[POKE_MAX_OPCODE_SIZE];
+};
+
 extern void text_poke_early(void *addr, const void *opcode, size_t len);
 
 /*
@@ -38,6 +52,7 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern int poke_int3_handler(struct pt_regs *regs);
 extern void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries);
 extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 390596b761e3..bd542f9b0953 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -14,6 +14,7 @@
 #include <linux/kdebug.h>
 #include <linux/kprobes.h>
 #include <linux/mmu_context.h>
+#include <linux/bsearch.h>
 #include <asm/text-patching.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
@@ -848,81 +849,133 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
-static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static struct bp_patching_desc {
+	struct text_poke_loc *vec;
+	int nr_entries;
+} bp_patching;
+
+static int patch_cmp(const void *key, const void *elt)
+{
+	struct text_poke_loc *tp = (struct text_poke_loc *) elt;
+
+	if (key < tp->addr)
+		return -1;
+	if (key > tp->addr)
+		return 1;
+	return 0;
+}
+NOKPROBE_SYMBOL(patch_cmp);
 
 int poke_int3_handler(struct pt_regs *regs)
 {
+	struct text_poke_loc *tp;
+	unsigned char int3 = 0xcc;
+	void *ip;
+
 	/*
 	 * Having observed our INT3 instruction, we now must observe
-	 * bp_patching_in_progress.
+	 * bp_patching.nr_entries.
 	 *
-	 * 	in_progress = TRUE		INT3
+	 * 	nr_entries != 0			INT3
 	 * 	WMB				RMB
-	 * 	write INT3			if (in_progress)
+	 * 	write INT3			if (nr_entries)
 	 *
-	 * Idem for bp_int3_handler.
+	 * Idem for other elements in bp_patching.
 	 */
 	smp_rmb();
 
-	if (likely(!bp_patching_in_progress))
+	if (likely(!bp_patching.nr_entries))
 		return 0;
 
-	if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
+	if (user_mode(regs))
 		return 0;
 
-	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	/*
+	 * Discount the sizeof(int3). See text_poke_bp_batch().
+	 */
+	ip = (void *) regs->ip - sizeof(int3);
+
+	/*
+	 * Skip the binary search if there is a single member in the vector.
+	 */
+	if (unlikely(bp_patching.nr_entries > 1)) {
+		tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries,
+			     sizeof(struct text_poke_loc),
+			     patch_cmp);
+		if (!tp)
+			return 0;
+	} else {
+		tp = bp_patching.vec;
+		if (tp->addr != ip)
+			return 0;
+	}
+
+	/* set up the specified breakpoint detour */
+	regs->ip = (unsigned long) tp->detour;
 
 	return 1;
 }
 NOKPROBE_SYMBOL(poke_int3_handler);
 
 /**
- * text_poke_bp() -- update instructions on live kernel on SMP
- * @addr:	address to patch
- * @opcode:	opcode of new instruction
- * @len:	length to copy
- * @handler:	address to jump to when the temporary breakpoint is hit
+ * text_poke_bp_batch() -- update instructions on live kernel on SMP
+ * @tp:			vector of instructions to patch
+ * @nr_entries:		number of entries in the vector
  *
  * Modify multi-byte instruction by using int3 breakpoint on SMP.
  * We completely avoid stop_machine() here, and achieve the
  * synchronization using int3 breakpoint.
  *
  * The way it is done:
- *	- add a int3 trap to the address that will be patched
+ * 	- For each entry in the vector:
+ *		- add a int3 trap to the address that will be patched
  *	- sync cores
- *	- update all but the first byte of the patched range
+ *	- For each entry in the vector:
+ *		- update all but the first byte of the patched range
  *	- sync cores
- *	- replace the first byte (int3) by the first byte of
- *	  replacing opcode
+ *	- For each entry in the vector:
+ *		- replace the first byte (int3) by the first byte of
+ *		  replacing opcode
  *	- sync cores
  */
-void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
 {
+	int patched_all_but_first = 0;
 	unsigned char int3 = 0xcc;
-
-	bp_int3_handler = handler;
-	bp_int3_addr = (u8 *)addr + sizeof(int3);
-	bp_patching_in_progress = true;
+	unsigned int i;
 
 	lockdep_assert_held(&text_mutex);
 
+	bp_patching.vec = tp;
+	bp_patching.nr_entries = nr_entries;
+
 	/*
 	 * Corresponding read barrier in int3 notifier for making sure the
-	 * in_progress and handler are correctly ordered wrt. patching.
+	 * nr_entries and handler are correctly ordered wrt. patching.
 	 */
 	smp_wmb();
 
-	text_poke(addr, &int3, sizeof(int3));
+	/*
+	 * First step: add an int3 trap to the address that will be patched.
+	 */
+	for (i = 0; i < nr_entries; i++)
+		text_poke(tp[i].addr, &int3, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 
-	if (len - sizeof(int3) > 0) {
-		/* patch all but the first byte */
-		text_poke((char *)addr + sizeof(int3),
-			  (const char *) opcode + sizeof(int3),
-			  len - sizeof(int3));
+	/*
+	 * Second step: update all but the first byte of the patched range.
+	 */
+	for (i = 0; i < nr_entries; i++) {
+		if (tp[i].len - sizeof(int3) > 0) {
+			text_poke((char *)tp[i].addr + sizeof(int3),
+				  (const char *)tp[i].opcode + sizeof(int3),
+				  tp[i].len - sizeof(int3));
+			patched_all_but_first++;
+		}
+	}
+
+	if (patched_all_but_first) {
 		/*
 		 * According to Intel, this core syncing is very likely
 		 * not necessary and we'd be safe even without it. But
@@ -931,14 +984,47 @@ void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 		on_each_cpu(do_sync_core, NULL, 1);
 	}
 
-	/* patch the first byte */
-	text_poke(addr, opcode, sizeof(int3));
+	/*
+	 * Third step: replace the first byte (int3) by the first byte of
+	 * replacing opcode.
+	 */
+	for (i = 0; i < nr_entries; i++)
+		text_poke(tp[i].addr, tp[i].opcode, sizeof(int3));
 
 	on_each_cpu(do_sync_core, NULL, 1);
 	/*
 	 * sync_core() implies an smp_mb() and orders this store against
 	 * the writing of the new instruction.
 	 */
-	bp_patching_in_progress = false;
+	bp_patching.vec = NULL;
+	bp_patching.nr_entries = 0;
 }
 
+/**
+ * text_poke_bp() -- update instructions on live kernel on SMP
+ * @addr:	address to patch
+ * @opcode:	opcode of new instruction
+ * @len:	length to copy
+ * @handler:	address to jump to when the temporary breakpoint is hit
+ *
+ * Update a single instruction with the vector in the stack, avoiding
+ * dynamically allocated memory. This function should be used when it is
+ * not possible to allocate memory.
+ */
+void text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+{
+	struct text_poke_loc tp = {
+		.detour = handler,
+		.addr = addr,
+		.len = len,
+	};
+
+	if (len > POKE_MAX_OPCODE_SIZE) {
+		WARN_ONCE(1, "len is larger than %d\n", POKE_MAX_OPCODE_SIZE);
+		return;
+	}
+
+	memcpy((void *)tp.opcode, opcode, len);
+
+	text_poke_bp_batch(&tp, 1);
+}


* [tip:locking/core] jump_label: Batch updates if arch supports it
  2019-06-12  9:57 ` [PATCH V6 5/6] jump_label: Batch updates if arch supports it Daniel Bristot de Oliveira
@ 2019-06-17 14:15   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-06-17 14:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, torvalds, bp, swood, tglx, rostedt, bristot,
	mtosatti, mhiramat, jkosina, williams, jpoimboe, gregkh, hpa,
	jbaron, linux-kernel, crecklin

Commit-ID:  c2ba8a15f310d915f8748dd8324c91c82b12b5ff
Gitweb:     https://git.kernel.org/tip/c2ba8a15f310d915f8748dd8324c91c82b12b5ff
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Wed, 12 Jun 2019 11:57:30 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:09:22 +0200

jump_label: Batch updates if arch supports it

If the architecture supports the batching of jump label updates, use it!

An easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ----------------------------
  #!/bin/sh
  while [ true ]; do
      sysctl -w kernel.sched_schedstats=1
      sleep 2
      sysctl -w kernel.sched_schedstats=0
      sleep 2
  done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
  # watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see roughly 168 IPIs every 2 seconds,
while with this patch the number of IPIs goes down to 3 every 2 seconds.

Regarding the performance impact of this patch set, I made two measurements:

    The time to update a key (the task that is causing the change)
    The time to run the int3 handler (the side effect on a thread that
                                      hits the code being changed)

The schedstats static key was chosen as the key to be switched on and off.
The reason is that it is used in more than 56 places, in a hot path. The
change in the schedstats static key will be done with the following command:

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    usleep 500000
    sysctl -w kernel.sched_schedstats=0
    usleep 500000
done

In this way, the key will be updated twice per second. To force the hit of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two nodes/24 CPUs box with an Intel Xeon processor
@2.27GHz.

Regarding the update part, the regular kernel takes 57 ms on average to update
the schedstats key, while the kernel with the batch updates takes just 1.4 ms
on average. Although it seems too good to be true, it makes sense: the
schedstats key is used in 56 places, so the current implementation was expected
to take around 56 times longer to update the key, as the IPIs are the most
expensive part of the update.

Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, that seems like a high
value. But it is not, considering that it is handling 56 updates rather than
one: it only takes about four times longer. This is possible because the patch
uses a binary search in the vector: log2(56) = 5.8, so an overhead of around
four times was within expectation.
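
Putting these numbers together (rough arithmetic from the measurements above):

  update time:    57 ms vs 1.4 ms  -> ~40x faster key updates with batching
  int3 handler:   45 ns vs 180 ns  -> ~4x more per hit, while covering 56 entries
  bsearch depth:  log2(56) ~= 5.8 comparisons, so ~4x is within expectation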

(voice of tv propaganda) But, that is not all! As the int3 handler keeps on for
a shorter period (because the update part is on for a shorter time), the number
of hits in the int3 handler decreased by 10%.

The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?

Considering that, in this test case, we are saving the handling of 53 IPIs,
which takes more than these 135 ns, it seems a meager price to pay.
Moreover, the test case was forcing hits on the int3 handler; in practice,
that does not happen often, while the IPI reaches all CPUs whether they hit
the int3 handler or not.

For instance, on an isolated CPU with a process running in user-space
(nohz_full use-case), the chance of hitting the int3 handler is nearly zero,
while there is no way to avoid the IPIs. By bounding the IPIs, we greatly
improve this scenario.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/acc891dbc2dbc9fd616dd680529a2337b1d1274c.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/jump_label.h |  3 +++
 kernel/jump_label.c        | 23 +++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 3e113a1fa0f1..3526c0aee954 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -215,6 +215,9 @@ extern void arch_jump_label_transform(struct jump_entry *entry,
 				      enum jump_label_type type);
 extern void arch_jump_label_transform_static(struct jump_entry *entry,
 					     enum jump_label_type type);
+extern bool arch_jump_label_transform_queue(struct jump_entry *entry,
+					    enum jump_label_type type);
+extern void arch_jump_label_transform_apply(void);
 extern int jump_label_text_reserved(void *start, void *end);
 extern void static_key_slow_inc(struct static_key *key);
 extern void static_key_slow_dec(struct static_key *key);
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index ca00ac10d9b9..df3008419a1d 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -414,6 +414,7 @@ static bool jump_label_can_update(struct jump_entry *entry, bool init)
 	return true;
 }
 
+#ifndef HAVE_JUMP_LABEL_BATCH
 static void __jump_label_update(struct static_key *key,
 				struct jump_entry *entry,
 				struct jump_entry *stop,
@@ -424,6 +425,28 @@ static void __jump_label_update(struct static_key *key,
 			arch_jump_label_transform(entry, jump_label_type(entry));
 	}
 }
+#else
+static void __jump_label_update(struct static_key *key,
+				struct jump_entry *entry,
+				struct jump_entry *stop,
+				bool init)
+{
+	for (; (entry < stop) && (jump_entry_key(entry) == key); entry++) {
+
+		if (!jump_label_can_update(entry, init))
+			continue;
+
+		if (!arch_jump_label_transform_queue(entry, jump_label_type(entry))) {
+			/*
+			 * Queue is full: Apply the current queue and try again.
+			 */
+			arch_jump_label_transform_apply();
+			BUG_ON(!arch_jump_label_transform_queue(entry, jump_label_type(entry)));
+		}
+	}
+	arch_jump_label_transform_apply();
+}
+#endif
 
 void __init jump_label_init(void)
 {


* [tip:locking/core] x86/jump_label: Batch jump label updates
  2019-06-12  9:57 ` [PATCH V6 6/6] x86/jump_label: Batch jump label updates Daniel Bristot de Oliveira
@ 2019-06-17 14:16   ` tip-bot for Daniel Bristot de Oliveira
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot for Daniel Bristot de Oliveira @ 2019-06-17 14:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, hpa, rostedt, peterz, mingo, gregkh, jbaron, bp, bristot,
	mtosatti, swood, mhiramat, jpoimboe, linux-kernel, crecklin,
	torvalds, jkosina, williams

Commit-ID:  ba54f0c3f7c400a392c413d8ca21d3ada35f2584
Gitweb:     https://git.kernel.org/tip/ba54f0c3f7c400a392c413d8ca21d3ada35f2584
Author:     Daniel Bristot de Oliveira <bristot@redhat.com>
AuthorDate: Wed, 12 Jun 2019 11:57:31 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 17 Jun 2019 12:09:23 +0200

x86/jump_label: Batch jump label updates

Currently, the jump label of a static key is transformed via the arch
specific function:

    void arch_jump_label_transform(struct jump_entry *entry,
                                   enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has the
same arguments as arch_jump_label_transform(), and is the function:

    bool arch_jump_label_transform_queue(struct jump_entry *entry,
                                         enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue of
entries to be updated. This function returns true in the case of a
successful enqueue of an entry. If it returns false, for instance because
the queue is full, the caller must apply the queue and then try to queue
again.

This function expects the caller to sort the entries by address before
enqueueing them. This is already done by the arch-independent code, though.

After queuing all jump_entries, the function:

    void arch_jump_label_transform_apply(void)

Applies the changes in the queue.
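
A minimal sketch of the calling pattern (this mirrors the generic
__jump_label_update() from the previous patch, with the validity checks
omitted; entry/stop come from the jump_label core):

static void example_update_key(struct jump_entry *entry, struct jump_entry *stop)
{
	for (; entry < stop; entry++) {
		enum jump_label_type type = jump_label_type(entry);

		if (!arch_jump_label_transform_queue(entry, type)) {
			/* Queue full: flush it, then the enqueue must succeed. */
			arch_jump_label_transform_apply();
			BUG_ON(!arch_jump_label_transform_queue(entry, type));
		}
	}

	/* Flush whatever is still queued. */
	arch_jump_label_transform_apply();
}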

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/57b4caa654bad7e3b066301c9a9ae233dea065b5.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/jump_label.h |  2 ++
 arch/x86/kernel/jump_label.c      | 69 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 65191ce8e1cf..06c3cc22a058 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_JUMP_LABEL_H
 #define _ASM_X86_JUMP_LABEL_H
 
+#define HAVE_JUMP_LABEL_BATCH
+
 #define JUMP_LABEL_NOP_SIZE 5
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index f33408f1c3f6..ea13808bf6da 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -101,6 +101,75 @@ void arch_jump_label_transform(struct jump_entry *entry,
 	mutex_unlock(&text_mutex);
 }
 
+#define TP_VEC_MAX (PAGE_SIZE / sizeof(struct text_poke_loc))
+static struct text_poke_loc tp_vec[TP_VEC_MAX];
+int tp_vec_nr = 0;
+
+bool arch_jump_label_transform_queue(struct jump_entry *entry,
+				     enum jump_label_type type)
+{
+	struct text_poke_loc *tp;
+	void *entry_code;
+
+	if (system_state == SYSTEM_BOOTING) {
+		/*
+		 * Fallback to the non-batching mode.
+		 */
+		arch_jump_label_transform(entry, type);
+		return true;
+	}
+
+	/*
+	 * No more space in the vector, tell upper layer to apply
+	 * the queue before continuing.
+	 */
+	if (tp_vec_nr == TP_VEC_MAX)
+		return false;
+
+	tp = &tp_vec[tp_vec_nr];
+
+	entry_code = (void *)jump_entry_code(entry);
+
+	/*
+	 * The INT3 handler will do a bsearch in the queue, so we need entries
+	 * to be sorted. We can survive an unsorted list by rejecting the entry,
+	 * forcing the generic jump_label code to apply the queue. Warn once,
+	 * to call attention to the unsorted-entry case, which had better not
+	 * happen: in the worst case we will perform the same way as we do
+	 * without batching - just with some more overhead.
+	 */
+	if (tp_vec_nr > 0) {
+		int prev = tp_vec_nr - 1;
+		struct text_poke_loc *prev_tp = &tp_vec[prev];
+
+		if (WARN_ON_ONCE(prev_tp->addr > entry_code))
+			return false;
+	}
+
+	__jump_label_set_jump_code(entry, type,
+				   (union jump_code_union *) &tp->opcode, 0);
+
+	tp->addr = entry_code;
+	tp->detour = entry_code + JUMP_LABEL_NOP_SIZE;
+	tp->len = JUMP_LABEL_NOP_SIZE;
+
+	tp_vec_nr++;
+
+	return true;
+}
+
+void arch_jump_label_transform_apply(void)
+{
+	if (!tp_vec_nr)
+		return;
+
+	mutex_lock(&text_mutex);
+	text_poke_bp_batch(tp_vec, tp_vec_nr);
+	mutex_unlock(&text_mutex);
+
+	tp_vec_nr = 0;
+}
+
 static enum {
 	JL_STATE_START,
 	JL_STATE_NO_UPDATE,


