From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932178Ab2BUUDx (ORCPT ); Tue, 21 Feb 2012 15:03:53 -0500 Received: from mx1.redhat.com ([209.132.183.28]:7053 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756154Ab2BUUDv (ORCPT ); Tue, 21 Feb 2012 15:03:51 -0500 Date: Tue, 21 Feb 2012 15:03:30 -0500 From: Jason Baron To: a.p.zijlstra@chello.nl, mingo@elte.hu Cc: rostedt@goodmis.org, mathieu.desnoyers@efficios.com, hpa@zytor.com, davem@davemloft.net, ddaney.cavm@gmail.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org Message-Id: <52570e566e5f1914f27b67e4eafb5781b8f9f9db.1329851692.git.jbaron@redhat.com> In-Reply-To: References: Subject: [PATCH 10/10] jump label: Add docs better explaining the whole jump label mechanism Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add better documentation for jump labels. Signed-off-by: Jason Baron --- Documentation/jump-label.txt | 258 ++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 258 insertions(+), 0 deletions(-) create mode 100644 Documentation/jump-label.txt diff --git a/Documentation/jump-label.txt b/Documentation/jump-label.txt new file mode 100644 index 0000000..210b760 --- /dev/null +++ b/Documentation/jump-label.txt @@ -0,0 +1,258 @@ + Jump Label + ---------- + +By: Jason Baron + + +1) Motivation + + +Currently, tracepoints are implemented using a conditional. The conditional +check requires checking a global variable for each tracepoint. Although +the overhead of this check is small, it increases when the memory cache comes +under pressure (memory cache lines for these global variables may be shared +with other memory accesses). As we increase the number of tracepoints in the +kernel this overhead may become more of an issue. In addition, tracepoints are +often dormant (disabled) and provide no direct kernel functionality. Thus, it +is highly desirable to reduce their impact as much as possible. Although +tracepoints are the original motivation for this work, other kernel code paths +should be able to make use of the jump label optimization. + + +2) Solution + + +gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label: + +http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html + +Using the 'asm goto', we can create branches that are either taken or not taken +by default, without the need to check memory. Then, at run-time, we can patch +the branch site to change the branch direction. + +For example, if we have a simple branch that is disabled by default: + + if (very_unlikely(&jump_key)) + printk("I am the true branch\n"); + +Thus, by default the 'printk' will not be emitted. And the code generated will +consist of a single atomic 'no-op' instruction (5 bytes on x86), in the +straight-line code path. When the branch is 'flipped', we will patch the 'no-op' +in the straight-line codepath with a 'jump' instruction to the out-of-line true +branch. Thus, changing branch direction is expensive but branch selection is +basically 'free'. That is the basic tradeoff of this optimization. + + +3) Jump label API, usage and examples: + + +In order to use a jump label you much first define a key: + + struct jump_label_key jump_key; + +Which is initialized as: + + struct jump_label_key jump_key = JUMP_LABEL_INIT_TRUE; + +or: + + struct jump_label_Key jump_key = JUMP_LABEL_INIT_FALSE; + +If the key is not initialized, it is default false. The +'struct jump_label_key', must be a 'global'. That is, it can't be allocated on +the stack or dynamically allocated at run-time. + +The key is then used in code as: + + if (very_unlikely(&jump_key)) + do unlikely code + else + do likely code + +Or: + + if (very_likely(&jump_key)) + do likely code + else + do unlikely code + +A key that is initialized via 'JUMP_LABEL_INIT_FALSE', must be used in a +'very_unlikely()' construct. Likewise, a key initialized via +'JUMP_LABEL_INIT_TRUE' must be used in a 'very_likely()' construct. +A single key can be used in many branches, but all the branches must match +the way that the key has been initialized. + +The branch(es) can then be switched via: + + jump_label_inc(&jump_key); + jump_label_dec(&jump_key); + +Thus, 'jump_label_inc()' means 'make the branch true', and +'jump_label_dec()' means 'make the the branch false' with appropriate reference +counting. For example, if the key is initialized true, a jump_label_dec(), will +switch the branch to false. And a subsequent jump_label_inc(), will change +the branch back to true. Likewise, if the key is initialized false, a +'jump_label_inc()', will change the branch to true. And then a +'jump_label_dec()', will again make the branch false. + +An example usage in the kernel is the implementation of tracepoints: + + static inline void trace_##name(proto) \ + { \ + if (very_unlikely(&__tracepoint_##name.key)) \ + __DO_TRACE(&__tracepoint_##name, \ + TP_PROTO(data_proto), \ + TP_ARGS(data_args), \ + TP_CONDITION(cond)); \ + } + +Tracepoints are disabled by default, and can be placed in performance critical +pieces of the kernel. Thus, by using a jump labels, the tracepoints can have +absolutely minimal impact when not in use. + + +4) Architecture interface + + +There are a few functions and macros that architectures must implement in order +to take advantage of this optimization. If there is no architecture support, we +simply fall back to a traditional, load, test, and jump sequence. + +* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig + +* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h + +* __always_inline bool arch_static_branch(struct jump_label_key *key), see: + arch/x86/include/asm/jump_label.h + +* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type), + see: arch/x86/kernel/jump_label.c + +* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type), + see: arch/x86/kernel/jump_label.c + + +* struct jump_entry, see: arch/x86/include/asm/jump_label.h + + +5) Jump label analysis, results (x86_64): + + +As an example, let's add the following branch to 'getppid()', such that the +system call now looks like: + +SYSCALL_DEFINE0(getppid) +{ + int pid; + ++ if (very_unlikely(&key)) ++ printk("I am the true branch\n"); + + rcu_read_lock(); + pid = task_tgid_vnr(rcu_dereference(current->real_parent)); + rcu_read_unlock(); + + return pid; +} + +The resulting instructions with jump labels is: + +ffffffff81044290 : +ffffffff81044290: 55 push %rbp +ffffffff81044291: 48 89 e5 mov %rsp,%rbp +ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 +ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax +ffffffff810442a0: 00 00 +ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax +ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax +ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi +ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 +ffffffff810442bc: 5d pop %rbp +ffffffff810442bd: 48 98 cltq +ffffffff810442bf: c3 retq +ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi +ffffffff810442c7: 31 c0 xor %eax,%eax +ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f +ffffffff810442ce: eb c9 jmp ffffffff81044299 + +Without jump labels it looks like: + +ffffffff810441f0 : +ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 +ffffffff810441f6: 55 push %rbp +ffffffff810441f7: 48 89 e5 mov %rsp,%rbp +ffffffff810441fa: 85 c0 test %eax,%eax +ffffffff810441fc: 75 27 jne ffffffff81044225 +ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax +ffffffff81044205: 00 00 +ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax +ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax +ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi +ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 +ffffffff81044221: 5d pop %rbp +ffffffff81044222: 48 98 cltq +ffffffff81044224: c3 retq +ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi +ffffffff8104422c: 31 c0 xor %eax,%eax +ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 +ffffffff81044233: eb c9 jmp ffffffff810441fe +ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1) +ffffffff8104423c: 00 00 00 00 + +Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction +vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched +to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump +label case adds: + +6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes. + +If we then include the padding bytes, the jump label code saves, 16 total +bytes of instruction memory for this small fucntion. In this case the +non-jump label function is 80 bytes long. Thus, we have have saved 20% of the +instruction footprint. We can in fact improve this even further, since the +5-byte no-op really can be a 2-byte no-op since we can reach the branch with +a 2-byte jmp. However, we have not yet implemented optimal no-op sizes (they +are currently hard-coded). + +Since there are a number of jump labels in the scheduler paths, 'pipe-test', +found here: http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389, +can be used to show the performance improvement. Testing done on 3.3.0-rc2: + +jump label disabled: + + Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): + + 855.700314 task-clock # 0.534 CPUs utilized ( +- 0.11% ) + 200,003 context-switches # 0.234 M/sec ( +- 0.00% ) + 0 CPU-migrations # 0.000 M/sec ( +- 39.58% ) + 487 page-faults # 0.001 M/sec ( +- 0.02% ) + 1,474,374,262 cycles # 1.723 GHz ( +- 0.17% ) + stalled-cycles-frontend + stalled-cycles-backend + 1,178,049,567 instructions # 0.80 insns per cycle ( +- 0.06% ) + 208,368,926 branches # 243.507 M/sec ( +- 0.06% ) + 5,569,188 branch-misses # 2.67% of all branches ( +- 0.54% ) + + 1.601607384 seconds time elapsed ( +- 0.07% ) + +jump label enabled: + + Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): + + 841.043185 task-clock # 0.533 CPUs utilized ( +- 0.12% ) + 200,004 context-switches # 0.238 M/sec ( +- 0.00% ) + 0 CPU-migrations # 0.000 M/sec ( +- 40.87% ) + 487 page-faults # 0.001 M/sec ( +- 0.05% ) + 1,432,559,428 cycles # 1.703 GHz ( +- 0.18% ) + stalled-cycles-frontend + stalled-cycles-backend + 1,175,363,994 instructions # 0.82 insns per cycle ( +- 0.04% ) + 206,859,359 branches # 245.956 M/sec ( +- 0.04% ) + 4,884,119 branch-misses # 2.36% of all branches ( +- 0.85% ) + + 1.579384366 seconds time elapsed + +The percentage of saved branches is .7%, and we've saved 12% on 'branch-misses'. +This is where we would expect to get the most savings, since this optimization +is about reducing the number of branches. In addition, we've saved .2% on +instructions, and 2.8% on cycles and 1.4% on elapsed time. -- 1.7.7.5