Message-Id: <20090924133400.355409744@polymtl.ca>
References: <20090924132626.485545323@polymtl.ca>
User-Agent: quilt/0.46-1
Date: Thu, 24 Sep 2009 09:26:35 -0400
From: Mathieu Desnoyers
To: Ingo Molnar, linux-kernel@vger.kernel.org
Cc: Mathieu Desnoyers, Rusty Russell, Adrian Bunk, Andi Kleen,
    Christoph Hellwig, akpm@osdl.org, KOSAKI Motohiro
Subject: [patch 09/12] Immediate Values - Documentation
Content-Disposition: inline; filename=immediate-values-documentation.patch

Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.
- Remove non-ascii characters.

Signed-off-by: Mathieu Desnoyers
CC: Rusty Russell
CC: Adrian Bunk
CC: Andi Kleen
CC: Christoph Hellwig
CC: mingo@elte.hu
CC: akpm@osdl.org
CC: KOSAKI Motohiro
---
 Documentation/immediate.txt |  221 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt	2009-09-24 08:31:49.000000000 -0400
@@ -0,0 +1,221 @@

	       Using the Immediate Values

		   Mathieu Desnoyers


This document introduces Immediate Values and their use.


* Purpose of immediate values

An immediate value is a variable that is compiled into the kernel so that it
sits within the instruction stream. Immediate values are meant to be rarely
updated but read often; using them for such variables saves cache lines.

This infrastructure supports dynamically patching the values in the
instruction stream while multiple CPUs are running, without disturbing normal
system behavior.

Code meant to be rarely enabled at runtime can be guarded with
if (unlikely(imv_read(var))). The smallest data type required for the test
(an 8-bit char) is preferred, since some architectures, such as powerpc, only
allow immediate values of up to 16 bits.


* Usage

In order to use the "immediate" macros, you should include linux/immediate.h.

#include <linux/immediate.h>

DEFINE_IMV(char, this_immediate);
EXPORT_IMV_SYMBOL(this_immediate);


Then, in the body of a function:

Use imv_set(this_immediate) to set the immediate value.

Use imv_read(this_immediate) to read the immediate value.

The immediate mechanism supports inserting multiple instances of the same
immediate. Immediate values can be put in inline functions, inlined static
functions, and unrolled loops.

If you have to read an immediate value from a function declared as __init or
__exit, you should explicitly use _imv_read(), which falls back on a global
variable read. Failing to do so would leave a reference to the __init section
after it is freed (and generate a modpost warning).
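As a concrete illustration, here is a minimal sketch of how the pieces above
fit together. The flag name, the trace_event()/trace_set_enabled() helpers, and
the assumption that imv_set() takes the new value as its second argument are
mine for illustration; only the imv_* macros come from the API described above.

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/immediate.h>

/* Rarely updated, read very often: a good candidate for an immediate value. */
DEFINE_IMV(char, trace_on) = 0;
EXPORT_IMV_SYMBOL(trace_on);

/* Hot path: the enable test reads an operand embedded in the instruction
 * stream instead of touching a data cache line. */
static inline void trace_event(const char *msg)
{
	if (unlikely(imv_read(trace_on)))
		printk(KERN_DEBUG "trace: %s\n", msg);
}

/* Slow path: updating the value patches every imv_read(trace_on) site. */
void trace_set_enabled(char enable)
{
	imv_set(trace_on, enable);	/* assumed (name, value) signature */
}

/* __init/__exit code must use the global-variable fallback. */
static int __init trace_boot_status(void)
{
	if (_imv_read(trace_on))
		printk(KERN_INFO "tracing enabled at boot\n");
	return 0;
}

The char type keeps the patched operand small enough for all supported
architectures, as noted above.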
You can choose to set an initial static value for the immediate by using, for
instance:

DEFINE_IMV(long, myptr) = 10;


* Optimization for a given architecture

One can implement optimized immediate values for a given architecture by
replacing asm-$ARCH/immediate.h.


* Performance improvement


  * Memory hit for a data-based branch

Here are the results on a 3GHz Pentium 4:

number of tests: 100
number of branches per test: 100000
memory hit cycles per iteration (mean): 636.611
L1 cache hit cycles per iteration (mean): 89.6413
instruction stream based test, cycles per iteration (mean): 85.3438
Just getting the pointer from a modulo on a pseudo-random value, doing
  nothing with it, cycles per iteration (mean): 77.5044

So:
Base case:                       77.50 cycles
instruction stream based test:   +7.8394 cycles
L1 cache hit based test:         +12.1369 cycles
Memory load based test:          +559.1066 cycles

So let's say we have a ping flood coming in at
(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
7674 packets per second. If we put 2 tracepoints for irq entry/exit, that
brings us to 15348 tracepoint sites executed per second.

(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
We therefore have a 0.29% slowdown just on this case.

Compared to this, the instruction stream based test will cause a
slowdown of:

(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
For a 0.004% slowdown.

If we plan to use this for memory allocation, spinlocks, and all sorts of
very high event rate tracing, we can assume it will execute 10 to 100
times more sites per second, which brings us to a 0.4% slowdown with the
instruction stream based test, compared to a 29% slowdown with the memory
load based test on a system with high memory pressure.


  * Tracepoint impact under heavy memory load

I ran a kernel with my LTTng instrumentation set in a test that generates
memory pressure (from userspace) by thrashing the L1 and L2 caches between
calls to getppid(). (Note: syscall_trace is active and calls a tracepoint upon
syscall entry and syscall exit; the tracepoints are disarmed.) This test is
done in user-space, so there are some delays due to IRQs coming in and to the
scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20 nice level)

My first set of results, linear cache thrashing, turned out not to be very
interesting: it seems the linearity of the memset on a full array is somehow
detected, and it does not "really" thrash the caches.

Now the most interesting result: random walk L1 and L2 thrashing surrounding a
getppid() call.

- Tracepoints compiled out (but syscall_trace execution forced)
number of tests: 10000
No memory pressure
Reading timestamps takes 108.033 cycles
getppid: 1681.4 cycles
With memory pressure
Reading timestamps takes 102.938 cycles
getppid: 15691.6 cycles


- With the immediate-values-based tracepoints:
number of tests: 10000
No memory pressure
Reading timestamps takes 108.006 cycles
getppid: 1681.84 cycles
With memory pressure
Reading timestamps takes 100.291 cycles
getppid: 11793 cycles


- With global-variables-based tracepoints:
number of tests: 10000
No memory pressure
Reading timestamps takes 107.999 cycles
getppid: 1669.06 cycles
With memory pressure
Reading timestamps takes 102.839 cycles
getppid: 12535 cycles
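Before analyzing these numbers, it may help to make explicit what the two
instrumented kernels differ by. A hedged sketch follows: the flag names and the
do_trace() hook are hypothetical, and only DEFINE_IMV and imv_read come from
the API described earlier.

#include <linux/compiler.h>
#include <linux/immediate.h>

extern void do_trace(long id);		/* hypothetical tracing hook */

/* Global-variable-based guard: each site loads the flag from memory, so
 * under memory pressure the load can miss in the cache. */
static char tp_enabled_global;

static inline void trace_syscall_global(long id)
{
	if (unlikely(tp_enabled_global))
		do_trace(id);
}

/* Immediate-value-based guard: the flag is an immediate operand in the
 * instruction stream, so the disabled test costs no data cache access. */
DEFINE_IMV(char, tp_enabled_imv);

static inline void trace_syscall_imv(long id)
{
	if (unlikely(imv_read(tp_enabled_imv)))
		do_trace(id);
}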
The result is quite interesting in that the kernel is slower without
tracepoints than with tracepoints. I explain this by the fact that the data
accessed is not laid out in the same manner in the cache lines when the
tracepoints are compiled in or out. It seems that compiling in the tracepoints
aligns the function's data better in this case.

But since the interesting comparison is between the immediate-values-based and
global-variables-based tracepoints, and because they share the same memory
layout except for the movl being replaced by a movz, we see that the
global-variables-based tracepoints (2 tracepoints) add 742 cycles to each
system call (syscall entry and exit are traced, and the memory locations for
both global variables lie on the same cache line).


- Test redone with fewer iterations, but with error estimates

10 runs of 100 iterations each; tests done on a 3GHz Pentium 4. Here I run
getppid with syscall trace inactive, comparing the case with memory pressure
against the case without memory pressure. (Sorry, my system is not set up to
execute syscall_trace this time, but it makes the point anyway.)

No memory pressure
Reading timestamps:  150.92 cycles, std dev. 1.01 cycles
getppid:            1462.09 cycles, std dev. 18.87 cycles

With memory pressure
Reading timestamps:   578.22 cycles, std dev. 269.51 cycles
getppid:            17113.33 cycles, std dev. 1655.92 cycles


Now for memory read timing (10 runs, branches per test: 100000):
Memory read based branch:   644.09 cycles, std dev. 11.39 cycles
L1 cache hit based branch:   88.16 cycles, std dev. 1.35 cycles


So, now that we have the raw results, let's calculate:

Memory read:
644.09 +/- 11.39 - 88.16 +/- 1.35 = 555.93 +/- 11.46 cycles

getppid without memory pressure:
1462.09 +/- 18.87 - 150.92 +/- 1.01 = 1311.17 +/- 18.90 cycles

getppid with memory pressure:
17113.33 +/- 1655.92 - 578.22 +/- 269.51 = 16535.11 +/- 1677.71 cycles

Therefore, if we add 2 tracepoints not based on immediate values to the getppid
code, which would add 2 memory reads, we would add
2 * 555.93 +/- 12.74 = 1111.86 +/- 25.48 cycles

Therefore,

1111.86 +/- 25.48 / 16535.11 +/- 1677.71 = 0.0672
  relative error: sqrt(((25.48/1111.86)^2)+((1677.71/16535.11)^2)) = 0.1040
  absolute error: 0.1040 * 0.0672 = 0.0070

Therefore: 0.0672 +/- 0.0070 * 100% = 6.72 +/- 0.70 %

We can therefore affirm that adding 2 tracepoints to getppid, on a system with
high memory pressure, would have a performance hit of at least 6.0% on the
system call time, within the uncertainty limits of these tests. The same
applies to other kernel code paths: the smaller the code path, the higher the
impact ratio will be.

Therefore, not only is it interesting to use immediate values to dynamically
activate dormant code such as tracepoints; I think they should also be
considered as a replacement for many of the "read-mostly" static variables.

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68