* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
@ 2016-02-08 20:12 George Spelvin
2016-02-09 10:48 ` David Laight
0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2016-02-08 20:12 UTC (permalink / raw)
To: David.Laight, linux-kernel, netdev, tom; +Cc: mingo
David Laight wrote:
> I'd need convincing that unrolling the loop like that gives any significant gain.
> You have a dependency chain on the carry flag so have delays between the 'adcq'
> instructions (these may be more significant than the memory reads from l1 cache).
If the carry chain is a bottleneck, on Broadwell+ (feature flag
X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
separate flag bits for their carry chains and so can be interleaved.
I don't have such a machine to test on, but if someone who does
would like to do a little benchmarking, that would be an interesting
data point.
Unfortunately, that means yet another version of the main loop,
but if there's a significant benefit...
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-08 20:12 [PATCH v3 net-next] net: Implement fast csum_partial for x86_64 George Spelvin
@ 2016-02-09 10:48 ` David Laight
2016-02-10 0:53 ` George Spelvin
0 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2016-02-09 10:48 UTC (permalink / raw)
To: 'George Spelvin', linux-kernel, netdev, tom; +Cc: mingo
From: George Spelvin [mailto:linux@horizon.com]
> Sent: 08 February 2016 20:13
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant gain.
> > You have a dependency chain on the carry flag so have delays between the 'adcq'
> > instructions (these may be more significant than the memory reads from l1 cache).
>
> If the carry chain is a bottleneck, on Broadwell+ (feature flag
> X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
> separate flag bits for their carry chains and so can be interleaved.
>
> I don't have such a machine to test on, but if someone who does
> would like to do a little benchmarking, that would be an interesting
> data point.
>
> Unfortunately, that means yet another version of the main loop,
> but if there's a significant benefit...
Well, the only part actually worth writing in assembler is the 'adc' loop.
So run-time substitution of separate versions (as is done for memcpy())
wouldn't be hard.
Since adcx and adox must execute in parallel I clearly need to re-remember
how dependencies against the flags register work. I'm sure I remember
issues with 'false dependencies' against the flags.
However you still need a loop construct that doesn't modify 'o' or 'c'.
Using leal, jcxz, jmp might work.
(Unless broadwell actually has a fast 'loop' instruction.)
(I've not got a suitable test cpu.)
David

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-09 10:48 ` David Laight
@ 2016-02-10 0:53 ` George Spelvin
2016-02-10 11:39 ` David Laight
0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2016-02-10 0:53 UTC (permalink / raw)
To: David.Laight, linux-kernel, linux, netdev, tom; +Cc: mingo
David Laight wrote:
> Since adcx and adox must execute in parallel I clearly need to re-remember
> how dependencies against the flags register work. I'm sure I remember
> issues with 'false dependencies' against the flags.
The issue is with flags register bits that are *not* modified by
an instruction. If the register is treated as a monolithic entity,
then the previous values of those bits must be considered an *input*
to the instruction, forcing serialization.
The first step in avoiding this problem is to consider the rarely-modified
bits (interrupt, direction, trap, etc.) to be a separate logical register
from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
which are updated by almost every instruction.
An arithmetic instruction overwrites the arithmetic flags (so it's only
a WAW dependency which can be broken by renaming) and doesn't touch the
status flags (so no dependency).
However, on x86 even the arithmetic flags aren't updated consistently.
The biggest offenders are the (very common!) INC/DEC instructions,
which update all of the arithmetic flags *except* the carry flag.
Thus, the carry flag is also renamed separately on every superscalar
x86 implementation I've ever heard of.
The bit test instructions (BT, BTC, BTR, BTS) also affect *only*
the carry flag, leaving other flags unmodified. This is also
handled properly by renaming the carry flag separately.
Here's a brief summary chart of flags updated by common instructions:
http://www.logix.cz/michal/doc/i386/app-c.htm
and the full list with all the corner cases:
http://www.logix.cz/michal/doc/i386/app-b.htm
The other two flags that can be worth separating are the overflow
and zero flags.
The rotate instructions modify *only* the carry and overflow flags.
While overflow is undefined for multi-bit rotates (and thus leaving it
unmodified is a valid implementation), it's defined for single-bit rotates,
so must be written.
There are several less common instructions, notably BSF, BSR, CMPXCHG8B,
and a bunch of 80286 segment instructions that nobody cares about,
which report the result of a test in the zero flag and are defined to
not affect the other flags.
So an aggressive x86 implementation breaks the flags register into five
separately renamed registers:
- CF (carry)
- OF (overflow)
- ZF (zero)
- SF, AF, PF (sign, aux carry, and parity)
- DF, IF, TF, IOPL, etc.
Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
it was reasonable to commit to always renaming CF and OF separately.
> However you still need a loop construct that doesn't modify 'o' or 'c'.
> Using leal, jcxz, jmp might work.
> (Unless broadwell actually has a fast 'loop' instruction.)
According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
to K8 and Merom. The one exception is Prescott, where JCXZ and LOOP are
both 4 uops. But 64-bit in general sucked on Prescott,
so how much do we care?
AMD: LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
JCXZ is acceptable on all of them.
LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
Intel: LOOP is slow (7+ uops) on all processors up to and including Skylake.
JCXZ is 2 uops on everything from P6 to Skylake except for:
- Prescott (JCXZ & loop both 4 uops)
- 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
I can't find any Intel CPU that LOOP is fast on.
* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-10 0:53 ` George Spelvin
@ 2016-02-10 11:39 ` David Laight
2016-02-10 14:43 ` George Spelvin
0 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2016-02-10 11:39 UTC (permalink / raw)
To: 'George Spelvin', linux-kernel, netdev, tom; +Cc: mingo
From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-kernel@vger.kernel.org; linux@horizon.com; netdev@vger.kernel.org;
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register work. I'm sure I remember
> > issues with 'false dependencies' against the flags.
>
> The issue is with flags register bits that are *not* modified by
> an instruction. If the register is treated as a monolithic entity,
> then the previous values of those bits must be considered an *input*
> to the instruction, forcing serialization.
>
> The first step in avoiding this problem is to consider the rarely-modified
> bits (interrupt, direction, trap, etc.) to be a separate logical register
> from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
> which are updated by almost every instruction.
>
> An arithmetic instruction overwrites the arithmetic flags (so it's only
> a WAW dependency which can be broken by renaming) and doesn't touch the
> status flags (so no dependency).
>
> However, on x86 even the arithmetic flags aren't updated consistently.
> The biggest offenders are the (very common!) INC/DEC instructions,
> which update all of the arithmetic flags *except* the carry flag.
>
> Thus, the carry flag is also renamed separately on every superscalar
> x86 implementation I've ever heard of.
Ah, that is the little fact I'd forgotten.
...
> Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
> it was reasonable to commit to always renaming CF and OF separately.
Separate renaming allows:
1) The value to be tested without waiting for pending updates to complete.
Useful for IF (interrupt) and DF (direction).
2) Instructions that modify almost all the flags to execute without
waiting for a previous instruction to complete.
So separating 'carry' allows inc/dec to execute without waiting
for previous arithmetic to complete.
The latter should remove the dependency (both ways) between 'adc' and
'dec, jnz' in a checksum loop.
I can't see any obvious gain from separating out O or Z (even with
adcx and adox). You'd need some other instructions that don't set O (or Z)
but set some other useful flags.
(A decrement that only sets Z, for instance.)
> > However you still need a loop construct that doesn't modify 'o' or 'c'.
> > Using leal, jcxz, jmp might work.
> > (Unless broadwell actually has a fast 'loop' instruction.)
>
> According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
> JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
> to K8 and Merom. The one exception is Prescott, where JCXZ and LOOP are
> both 4 uops. But 64-bit in general sucked on Prescott,
> so how much do we care?
>
> AMD: LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
> JCXZ is acceptable on all of them.
> LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
> Intel: LOOP is slow (7+ uops) on all processors up to and including Skylake.
> JCXZ is 2 uops on everything from P6 to Skylake except for:
> - Prescott (JCXZ & loop both 4 uops)
> - 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
> I can't find any Intel CPU that LOOP is fast on.
While LOOP could be used on Bulldozer+, an equivalently fast loop
can be done with inc/dec and jnz.
So you only care about LOOP/JCXZ when ADOX is supported.
I think the fastest loop is:
10: adc 0(%rdi,%rcx,8),%rax
inc %rcx
jnz 10b
but check if any cpu adds an extra clock for the 'scaled' offset
(they might be faster if %rdi is incremented).
That loop looks like it will have no overhead on recent cpu.
David
* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-10 11:39 ` David Laight
@ 2016-02-10 14:43 ` George Spelvin
2016-02-10 15:18 ` David Laight
0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2016-02-10 14:43 UTC (permalink / raw)
To: David.Laight, linux-kernel, linux, netdev, tom; +Cc: mingo
David Laight wrote:
> Separate renaming allows:
> 1) The value to be tested without waiting for pending updates to complete.
> Useful for IF (interrupt) and DF (direction).
I don't quite follow. It allows the value to be tested without waiting
for pending updates *of other bits* to complete.
Obviously, the update of the bit being tested has to complete!
> I can't see any obvious gain from separating out O or Z (even with
> adcx and adox). You'd need some other instructions that don't set O (or Z)
> but set some other useful flags.
> (A decrement that only sets Z, for instance.)
I tried to describe the advantages in the previous message.
The problems arise much less often than with the INC/DEC pair, but there
are instructions which write only the O and C flags (ROL, ROR), or only
the Z flag (CMPXCHG).
The sign, aux carry, and parity flags are *always* updated as
a group, so they can be renamed as a group.
> While LOOP could be used on Bulldozer+ an equivalently fast loop
> can be done with inc/dec and jnz.
> So you only care about LOOP/JCXZ when ADOX is supported.
>
> I think the fastest loop is:
> 10: adc 0(%rdi,%rcx,8),%rax
> inc %rcx
> jnz 10b
> but check if any cpu adds an extra clock for the 'scaled' offset
> (they might be faster if %rdi is incremented).
> That loop looks like it will have no overhead on recent cpu.
Well, it should execute at 1 instruction/cycle. (No, a scaled offset
doesn't take extra time.) To break that requires ADCX/ADOX:
10: adcxq 0(%rdi,%rcx),%rax
adoxq 8(%rdi,%rcx),%rdx
leaq 16(%rcx),%rcx
jrcxz 11f
jmp 10b
11:
* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-10 14:43 ` George Spelvin
@ 2016-02-10 15:18 ` David Laight
0 siblings, 0 replies; 29+ messages in thread
From: David Laight @ 2016-02-10 15:18 UTC (permalink / raw)
To: 'George Spelvin', linux-kernel, netdev, tom; +Cc: mingo
From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10: adcq 0(%rdi,%rcx,8),%rax
> > inc %rcx
> > jnz 10b
> > That loop looks like it will have no overhead on recent cpu.
>
> Well, it should execute at 1 instruction/cycle.
I presume you do mean 1 adc/cycle.
If it doesn't unrolling once might help.
> (No, a scaled offset doesn't take extra time.)
Maybe I'm remembering the 386 book.
> To break that requires ADCX/ADOX:
>
> 10: adcxq 0(%rdi,%rcx),%rax
> adoxq 8(%rdi,%rcx),%rdx
> leaq 16(%rcx),%rcx
> jrcxz 11f
> jmp 10b
> 11:
Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz is two clocks - so definitely needs a second adcoxq/adcxq pair.
Experiments would be needed to confirm guesses though.
David
* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-05 8:01 ` Ingo Molnar
@ 2016-02-05 10:07 ` David Laight
0 siblings, 0 replies; 29+ messages in thread
From: David Laight @ 2016-02-05 10:07 UTC (permalink / raw)
To: 'Ingo Molnar', Tom Herbert
Cc: Linus Torvalds, David Miller, Network Development,
Thomas Gleixner, Ingo Molnar, Peter Anvin,
the arch/x86 maintainers, kernel-team, Linux Kernel Mailing List,
Peter Zijlstra
From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you manage
> to offload the logic into arithmetics and not affect the control flow then that's
> a big win. The inherent branching will be hidden by executing on massively
> parallel arithmetics units which effectively execute everything fed to them in a
> single cycle.
Not necessarily; in real life they are likely to be a cache miss.
With the parallel execution units you just need to worry about the
register dependency chains.
In this case the 'adc' have dependencies on both the result register
and the flags register.
The best loop might even be:
10: adc 0(%rdi,%rcx),%rax
lea 8(%rcx),%rcx
jcxz 20f
jmp 10b
20:
You then need to insert just enough copies of the adc so they take as long
as the other instructions.
David
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 19:24 ` Tom Herbert
@ 2016-02-05 9:24 ` Ingo Molnar
0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2016-02-05 9:24 UTC (permalink / raw)
To: Tom Herbert
Cc: David S. Miller, Linux Kernel Network Developers, tglx, mingo,
hpa, x86, Kernel Team, LKML, Peter Zijlstra, Linus Torvalds
* Tom Herbert <tom@herbertland.com> wrote:
> Thanks for the explanation and sample code. Expanding on your example, I added a
> switch statement to perform the function (code below).
So I think your new switch() based testcase is broken in a subtle way.
The problem is that in your added testcase GCC effectively optimizes out the
function call, because it is able to recognize that all the (trivial) functions
are identical. (At least here, with GCC 5.2.1)
So all overhead of the workload goes into just that do_switch() function you
added:
#
# Total Lost Samples: 0
#
# Samples: 5K of event 'cycles:pp'
# Event count (approx.): 5738959683
#
# Overhead Command Shared Object Symbol
# ........ ........... ................. ...........................
#
99.94% jump-table2 jump-table2 [.] do_switch
0.02% jump-table2 [kernel.kallsyms] [k] native_read_msr_safe
0.02% jump-table2 [kernel.kallsyms] [k] native_apic_mem_write
0.01% jump-table2 [kernel.kallsyms] [k] __fput
0.00% jump-table2 [kernel.kallsyms] [k] change_protection_range
0.00% perf [kernel.kallsyms] [k] strrchr
0.00% perf [kernel.kallsyms] [k] intel_pmu_handle_irq
0.00% perf [kernel.kallsyms] [k] native_write_msr_safe
... and per instruction overhead in the do_switch() looks like this:
:
: Disassembly of section .text:
:
: 0000000000401100 <do_switch>:
: do_switch():
0.00 : 401100: mov 0x201f61(%rip),%r8 # 603068 <loops>
0.00 : 401107: test %r8,%r8
0.00 : 40110a: je 401178 <do_switch+0x78>
0.00 : 40110c: mov 0x201f5d(%rip),%rdi # 603070 <table_size>
0.00 : 401113: xor %esi,%esi
0.00 : 401115: xor %ecx,%ecx
0.00 : 401117: nopw 0x0(%rax,%rax,1)
0.00 : 401120: mov 0x201f61(%rip),%rax # 603088 <global>
1.46 : 401127: xor %edx,%edx
0.00 : 401129: add $0x1,%rax
0.00 : 40112d: mov %rax,0x201f54(%rip) # 603088 <global>
0.00 : 401134: mov 0x201f4d(%rip),%rax # 603088 <global>
51.68 : 40113b: div %rdi
0.00 : 40113e: cmp $0x3f,%rdx
1.34 : 401142: ja 40116b <do_switch+0x6b>
0.02 : 401144: mov 0x201f3d(%rip),%r9 # 603088 <global>
0.10 : 40114b: mov 0x201f36(%rip),%rax # 603088 <global>
6.91 : 401152: jmpq *0x401420(,%rdx,8)
0.00 : 401159: nopl 0x0(%rax)
0.02 : 401160: xor %edx,%edx
35.71 : 401162: div %r9
1.07 : 401165: add %rdx,%r9
1.69 : 401168: add %r9,%rsi
0.00 : 40116b: add $0x1,%rcx
0.00 : 40116f: cmp %r8,%rcx
0.00 : 401172: jne 401120 <do_switch+0x20>
0.00 : 401174: mov %rsi,%rax
0.00 : 401177: retq
0.00 : 401178: xor %esi,%esi
0.00 : 40117a: jmp 401174 <do_switch+0x74>
Note that as you remarked there _is_ an indirect jump in there:
6.91 : 401152: jmpq *0x401420(,%rdx,8)
But:
> gcc seems to be implementing this switch as a jump table:
>
> jmpq *0x400ac8(,%rdx,8)
... the 'jump table' has identical entries:
40141f: 00 60 11 add %ah,0x11(%rax)
401422: 40 00 00 add %al,(%rax)
401425: 00 00 add %al,(%rax)
401427: 00 60 11 add %ah,0x11(%rax)
40142a: 40 00 00 add %al,(%rax)
40142d: 00 00 add %al,(%rax)
40142f: 00 60 11 add %ah,0x11(%rax)
401432: 40 00 00 add %al,(%rax)
401435: 00 00 add %al,(%rax)
401437: 00 60 11 add %ah,0x11(%rax)
40143a: 40 00 00 add %al,(%rax)
(misaligned dump - jump table starts at 0x401420)
Every jump table entry points to 0x401160 - which is the code after the indirect
jump in do_switch().
So the hardware predictor simply recognizes that it's the same target address all
the time, and thus (understandably) performs much faster - none of the functions
are actually executed.
That is why there are no function call entries in perf report, while in the
function pointer case we get:
#
# Total Lost Samples: 0
#
# Samples: 4K of event 'cycles:pp'
# Event count (approx.): 5482523153
#
# Overhead Command Shared Object Symbol
# ........ ........... ................. ..........................
#
13.47% jump-table2 jump-table2 [.] fun_1
13.22% jump-table2 jump-table2 [.] fun_6
13.12% jump-table2 jump-table2 [.] fun_5
12.87% jump-table2 jump-table2 [.] fun_7
12.23% jump-table2 jump-table2 [.] fun_2
12.09% jump-table2 jump-table2 [.] fun_0
8.23% jump-table2 jump-table2 [.] do_funcptr
7.43% jump-table2 jump-table2 [.] fun_3
7.27% jump-table2 jump-table2 [.] fun_4
0.02% jump-table2 [kernel.kallsyms] [k] page_add_new_anon_rmap
0.02% jump-table2 [kernel.kallsyms] [k] native_read_msr_safe
0.02% jump-table2 [kernel.kallsyms] [k] perf_pmu_sched_task
0.00% jump-table2 [kernel.kallsyms] [k] flush_tlb_mm_range
0.00% perf [kernel.kallsyms] [k] _raw_spin_lock
0.00% perf [kernel.kallsyms] [k] intel_bts_enable_local
0.00% perf [kernel.kallsyms] [k] native_write_msr_safe
Same-target indirect jumps were optimized well starting from the Pentium M IIRC, so
this would perform well all across the x86 spectrum.
Btw., I think it's a bit weird that GCC didn't eliminate the switch jump table
itself. (Maybe newer versions of GCC do?)
> Which is a different call than the function table implementation:
>
> callq *(%rsp,%rdx,8)
> So the switch implementation does not seem to be causing branch-misses to be
> counted and is a lot faster. This is also what I see when looking at the
> branch-misses with the checksum code-- I haven't yet found any test case
> thrashing lengths or alignments increase branch-misses and shows a performance
> degradation over the case where they are static.
That's because AFAICS there's no indirect branching in your testcase! :-)
And before you spend too much time on this, I do concede your general point: that
jump tables are simpler than call tables, and that newer x86 uarchs are able to
distinguish between multiple target branches pretty well indexed by a wide range
of input registers, resulting in pretty good branch prediction.
Anyway - I think Linus's C based approach is vastly superior, so the jump tables
vs. call tables topic is somewhat of a side track.
Thanks,
Ingo
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 22:43 ` Tom Herbert
2016-02-04 22:57 ` Linus Torvalds
@ 2016-02-05 8:01 ` Ingo Molnar
2016-02-05 10:07 ` David Laight
1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2016-02-05 8:01 UTC (permalink / raw)
To: Tom Herbert
Cc: Linus Torvalds, David Miller, Network Development,
Thomas Gleixner, Ingo Molnar, Peter Anvin,
the arch/x86 maintainers, kernel-team, Linux Kernel Mailing List,
Peter Zijlstra
* Tom Herbert <tom@herbertland.com> wrote:
> [....] gcc turns these switch statements into jump tables (not function tables
> which is what Ingo's example code was using). [...]
So to the extent this still matters, on most x86 microarchitectures that count,
jump tables and function call tables (i.e. virtual functions that C++ uses) are
generally optimized by the same branch predictor hardware mechanism. Indirect
jumps (jump tables) and indirect calls (function pointer tables) are very similar
conceptually. That is why I posted the indirect calls test code.
( The only branching variant that will perform badly even on the latest uarchs is
indirect returns: modifying the return address on the stack. )
So my narrow performance point stands, if any sort of indirect jump is used. They
should be avoided if possible, because it's pretty hard for the hardware to get it
right.
As Linus noticed, data lookup tables are the intelligent solution: if you manage
to offload the logic into arithmetics and not affect the control flow then that's
a big win. The inherent branching will be hidden by executing on massively
parallel arithmetics units which effectively execute everything fed to them in a
single cycle.
In any case, when submitting such patches, please get into the habit of looking at
and posting perf stat output - it will give us a good idea about the quality of an
implementation.
Thanks,
Ingo
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-05 1:27 ` Linus Torvalds
@ 2016-02-05 1:39 ` Linus Torvalds
0 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-02-05 1:39 UTC (permalink / raw)
To: Ingo Molnar
Cc: Tom Herbert, David Miller, Network Development, Thomas Gleixner,
Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
Linux Kernel Mailing List, Peter Zijlstra
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
> return rotate_by8_if_odd(sum, align);
Actually, that last word-sized access to "buff" might be past the end
of the buffer. The code does the right thing if "len" is zero, except
for the possible page fault or address verifier complaint.
So that very last call to "csum_partial_lt8()" either needs to be
conditional (easy enough to add an "if (len)" around that whole
statement) or the thing could be unconditional but the load needs to
use "load_unaligned_zeropad()" so that the exception is harmless.
It's probably immaterial which one you pick. The few possibly useless
ALU operations vs a possible branch misprodict penalty are probably
going to make it a wash. The exception will never happen in practice,
but if DEBUG_PAGEALLOC is enabled, or if something like KASAN is
active, it will complain loudly if it happens to go past the
allocation.
Linus
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 22:09 ` Linus Torvalds
@ 2016-02-05 1:27 ` Linus Torvalds
2016-02-05 1:39 ` Linus Torvalds
0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2016-02-05 1:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Tom Herbert, David Miller, Network Development, Thomas Gleixner,
Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
Linux Kernel Mailing List, Peter Zijlstra
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.
Ok, so I thought some more about this, and the fact is, we don't
actually want to do the byte shifting at all for the first case (the
"length < 8" case), since the address of that one hasn't been shifted.
It's only for the "we're going to align things to 8 bytes" case that
we would want to do it. But then we might as well use the
rotate_by8_if_odd() model, so I suspect the address games are just
entirely pointless.
So here is something that is actually tested (although admittedly not
well), and uses that fairly simple model.
NOTE! I did not do the unrolling of the "adcq" loop in the middle, but
that's a totally trivial thing now. So this isn't very optimized,
because it will do a *lot* of extra "adcq $0" to get rid of the carry
bit. But with that core loop unrolled, you'd get rid of most of them.
Linus
---
static unsigned long rotate_by8_if_odd(unsigned long sum, unsigned long aligned)
{
asm("rorq %b1,%0"
:"=r" (sum)
:"c" ((aligned & 1) << 3), "0" (sum));
return sum;
}
static unsigned long csum_partial_lt8(unsigned long val, int len,
unsigned long sum)
{
unsigned long mask = (1ul << len*8)-1;
val &= mask;
return add64_with_carry(val, sum);
}
static unsigned long csum_partial_64(const void *buff, unsigned long
len, unsigned long sum)
{
unsigned long align, val;
// This is the only potentially unaligned access, and it can
// also theoretically overflow into the next page
val = load_unaligned_zeropad(buff);
if (len < 8)
return csum_partial_lt8(val, len, sum);
align = 7 & -(unsigned long)buff;
sum = csum_partial_lt8(val, align, sum);
buff += align;
len -= align;
sum = rotate_by8_if_odd(sum, align);
while (len >= 8) {
val = *(unsigned long *) buff;
sum = add64_with_carry(sum, val);
buff += 8;
len -= 8;
}
sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
return rotate_by8_if_odd(sum, align);
}
__wsum csum_partial(const void *buff, unsigned long len, unsigned long sum)
{
sum = csum_partial_64(buff, len, sum);
return add32_with_carry(sum, sum >> 32);
}
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 22:43 ` Tom Herbert
@ 2016-02-04 22:57 ` Linus Torvalds
2016-02-05 8:01 ` Ingo Molnar
1 sibling, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-02-04 22:57 UTC (permalink / raw)
To: Tom Herbert
Cc: Ingo Molnar, David Miller, Network Development, Thomas Gleixner,
Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
Linux Kernel Mailing List, Peter Zijlstra
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert <tom@herbertland.com> wrote:
>
> The reason I did this in assembly is precisely about your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which uses switch() to handle alignment, excess
> length less than 8 bytes, and the odd number of quads to sum in the
> main loop. gcc turns these switch statements into jump tables (not
> function tables which is what Ingo's example code was using). The
> problem I hit was that for each case I needed to close the carry chain
> in the inline asm so fall through wouldn't have much value and each
> case is expanded. The C version using switch gave a nice performance
> gain, moving to all assembly was somewhat better.
Yeah,. so I _think_ that if my approach works, the code generated for
the 0-8 byte case looks just something like
movq %rsi, %rax
andl $1, %eax
subq %rax, %rdi
movq %rdx, %rax
movq (%rdi), %rcx
andq mask(,%rsi,8), %rcx
addq %rcx,%rax
adcq $0,%rax
which is pretty close to optimal (that's with my hopefully fixed version).
That's not with the final narrowing from 64-bit to 32-bit, but it
looks like it should perform well on most x86 cores, and then the only
remaining branches end up being the actual size checking ones.
And yes, you could avoid a few "adc $0" instructions in asm, but quite
frankly, I think that's a tradeoff that is worth it.
My guess is that you can get to within a percent of optimal with the C
approach, and I really think it makes it easier to try different
things (ie things like the above that avoids the switch table
entirely)
> There is also the question of alignment. If we really don't need to worry
> about alignment at all on x86, then we should be able to eliminate the
> complexity of dealing with it.
So most x86 alignment issues are about crossing the cache bank size,
which is usually 16 or 32 bytes. Unaligned accesses *within* one of
those banks should be basically free (there's a whopping big byte lane
shifter, so there's lots of hardware support for that).
Also, even when you do cross a cache bank, it's usually pretty cheap.
It costs extra cycles, but generally *fewer* extra cycles than it would
be to try to align things in software and doing two accesses.
The rule of thumb is that you should never worry about _individual_
unaligned accesses. It's only really worth it aligning things before
biggish loops. So aligning to an 8-byte boundary before you then do a
lot of "adcq" instructions makes sense, but worrying about unaligned
accesses for the beginning/end does generally not.
>> For example, for the actual "8 bytes or shorter" case, I think
>> something like this might just work fine: [ snip ]
>
> I will look at doing that.
So I already found and fixed one bug in that approach, but I still
think it's a viable model.
But you will almost certainly have to fix a few more of my bugs before
it really works ;)
Linus
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 21:46 ` Linus Torvalds
2016-02-04 22:09 ` Linus Torvalds
@ 2016-02-04 22:43 ` Tom Herbert
2016-02-04 22:57 ` Linus Torvalds
2016-02-05 8:01 ` Ingo Molnar
1 sibling, 2 replies; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 22:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, David Miller, Network Development, Thomas Gleixner,
Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
Linux Kernel Mailing List, Peter Zijlstra
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I missed the original email (I don't have net-devel in my mailbox),
> but based on Ingo's quoting have a more fundamental question:
>
> Why wasn't that done with C code instead of asm with odd numerical targets?
>
The reason I did this in assembly is precisely about your point of
having to close the carry chains with adcq $0. I do have a first
implementation in C which uses switch() to handle alignment, excess
length less than 8 bytes, and the odd number of quads to sum in the
main loop. gcc turns these switch statements into jump tables (not
function tables which is what Ingo's example code was using). The
problem I hit was that for each case I needed to close the carry chain
in the inline asm so fall through wouldn't have much value and each
case is expanded. The C version using switch gave a nice performance
gain, moving to all assembly was somewhat better.
There is also the question of alignment. If we really don't need to worry
about alignment at all on x86, then we should be able to eliminate the
complexity of dealing with it.
> It seems likely that the real issue is avoiding the short loops (that
> will cause branch prediction problems) and using a lookup table instead.
>
> But we can probably do better than that asm.
>
> For example, for the actual "8 bytes or shorter" case, I think
> something like this might just work fine:
>
> unsigned long csum_partial_8orless(const unsigned char *buf,
> unsigned long len, unsigned long sum)
> {
> static const unsigned long mask[9] = {
> 0x0000000000000000,
> 0x000000000000ff00,
> 0x000000000000ffff,
> 0x00000000ff00ffff,
> 0x00000000ffffffff,
> 0x0000ff00ffffffff,
> 0x0000ffffffffffff,
> 0xff00ffffffffffff,
> 0xffffffffffffffff };
> unsigned long val = load_unaligned_zeropad(buf + (len & 1));
> val &= mask[len];
> asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
> return sum;
> }
>
I will look at doing that.
Thanks,
Tom
> NOTE! The above is 100% untested. But I _think_ it may handle the
> odd-byte-case correctly, and it should result in just one 8-byte load
> (the "load_unaligned_zeropad()" is just in case that ends up
> overflowing and we have page-alloc-debug triggering a page fault at
> the end). All without looping or any conditional branches that might
> mispredict.
>
> My point is that going to assembly results in pretty hard-to-read
> code, but it's also fairly inflexible. If we stay with C, we can much
> more easily play tricks. So for example, make the above an inline
> function, and now you can do things like this:
>
> static inline unsigned long csum_partial_64bit(void *buf, unsigned
> long len, unsigned long sum)
> {
> if (len <= 8)
> return csum_partial_8orless(buf, len, sum);
>
> /* We know it's larger than 8 bytes, so handle alignment */
> align = 7 & -(unsigned long)buf;
> sum = csum_partial_8orless(buf, align, sum);
> buf += align;
>
> /* We need to do the big-endian thing */
> sum = rotate_by8_if_odd(sum, align);
>
> /* main loop for big packets */
> .. do the unrolled inline asm thing that we already do ..
>
> sum = rotate_by8_if_odd(sum, align);
>
> /* Handle the last bytes */
> return csum_partial_8orless(buf, len, sum);
> }
>
> /* Fold the 64-bit sum we computed down to 32 bits __wsum */
> __wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
> {
> unsigned long sum = csum_partial_64bit(ptr, len, partial);
> asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
> return sum;
> }
>
> or something like that.
>
> NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
> "big-endian 16-bit add" thing right by doing the odd byte masking in
> the end cases, and by rotating the sum by 8 bits around the
> 8-byte-unrolled-loop, but I didn't test the above. It's literally
> written in my mail client. So I can almost guarantee that it is buggy,
> but it is meant as an *example* of "why not do it this way" rather
> than production code.
>
> I think that writing it in C and trying to be intelligent about it
> like the above would result in more maintainable code, and it is
> possible that it would even be faster.
>
> Yes, writing it in C *does* likely result in a few more cases of "adcq
> $0" in order to finish up the carry calculations. The *only* advantage
> of inline asm is how it allows you to keep the carry flag around. So
> there is downside to the C model, and it might cause a cycle or two of
> extra work, but the upside of C is that you can try to do clever
> things without turning the code completely unreadable.
>
> For example, doing the exception handling (that will never actually
> trigger) for the "let's just do an 8-byte load" is just painful in
> assembly. In C, we already have the helper function to do it.
>
> Hmm? Would somebody be willing to take the likely very buggy code
> above, and test it and make it work? I assume there's a test harness
> for the whole csum thing somewhere.
>
> Linus
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 21:46 ` Linus Torvalds
@ 2016-02-04 22:09 ` Linus Torvalds
2016-02-05 1:27 ` Linus Torvalds
2016-02-04 22:43 ` Tom Herbert
1 sibling, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2016-02-04 22:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Tom Herbert, David Miller, Network Development, Thomas Gleixner,
Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
Linux Kernel Mailing List, Peter Zijlstra
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> static const unsigned long mask[9] = {
> 0x0000000000000000,
> 0x000000000000ff00,
> 0x000000000000ffff,
> 0x00000000ff00ffff,
> 0x00000000ffffffff,
> 0x0000ff00ffffffff,
> 0x0000ffffffffffff,
> 0xff00ffffffffffff,
> 0xffffffffffffffff };
> unsigned long val = load_unaligned_zeropad(buf + (len & 1));
> val &= mask[len];
Yeah, that was buggy. I knew it was likely buggy, but that "buf +
(len & 1)" was just stupid.
The "+" should be "-", of course - the point is to shift up the value
by 8 bits for odd cases, and we need to load starting one byte early
for that. The idea is that we use the byte shifter in the load unit to
do some work for us.
And the bitmasks are the wrong way around for the odd cases - it's the
low byte that ends up being bogus for those cases.
So it should probably look something like
static const unsigned long mask[9] = {
0x0000000000000000,
0x000000000000ff00,
0x000000000000ffff,
0x00000000ffffff00,
0x00000000ffffffff,
0x0000ffffffffff00,
0x0000ffffffffffff,
0xffffffffffffff00,
0xffffffffffffffff };
unsigned long val = load_unaligned_zeropad(buf - (len & 1));
val &= mask[len];
and then it *might* work.
Still entirely and utterly untested, I just decided to look at it a
bit more and noticed my thinko.
Linus
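In user-space C, the corrected version above can be sketched roughly as follows. This is hypothetical: memcpy stands in for load_unaligned_zeropad(), a manual end-around carry stands in for the addq/adcq pair, and it assumes little-endian, as on x86. Note that for odd lengths it reads the byte before buf, which the caller must guarantee is accessible.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space sketch of the corrected csum_partial_8orless():
 * one (possibly unaligned) 8-byte load, started one byte early for odd
 * lengths so the data lands shifted up by 8 bits, then masked.  The mask
 * table is the corrected one from the mail above. */
static uint64_t csum_partial_8orless(const unsigned char *buf,
                                     unsigned long len, uint64_t sum)
{
    static const uint64_t mask[9] = {
        0x0000000000000000, 0x000000000000ff00,
        0x000000000000ffff, 0x00000000ffffff00,
        0x00000000ffffffff, 0x0000ffffffffff00,
        0x0000ffffffffffff, 0xffffffffffffff00,
        0xffffffffffffffff };
    uint64_t val;

    memcpy(&val, buf - (len & 1), 8);   /* little-endian 8-byte load */
    val &= mask[len];
    sum += val;
    if (sum < val)                      /* end-around carry, like adcq $0 */
        sum++;
    return sum;
}
```

For even lengths the data bytes land in their natural lanes; for odd lengths they are shifted up by 8 bits, which is exactly the "big-endian 16-bit add" trick being discussed.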
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 9:30 ` Ingo Molnar
2016-02-04 10:56 ` Ingo Molnar
@ 2016-02-04 21:46 ` Linus Torvalds
2016-02-04 22:09 ` Linus Torvalds
2016-02-04 22:43 ` Tom Herbert
1 sibling, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-02-04 21:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Tom Herbert, David Miller, Network Development, Thomas Gleixner,
Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
Linux Kernel Mailing List, Peter Zijlstra
I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting have a more fundamental question:
Why wasn't that done with C code instead of asm with odd numerical targets?
It seems likely that the real issue is avoiding the short loops (that
will cause branch prediction problems) and using a lookup table instead.
But we can probably do better than that asm.
For example, for the actual "8 bytes or shorter" case, I think
something like this might just work fine:
unsigned long csum_partial_8orless(const unsigned char *buf,
unsigned long len, unsigned long sum)
{
static const unsigned long mask[9] = {
0x0000000000000000,
0x000000000000ff00,
0x000000000000ffff,
0x00000000ff00ffff,
0x00000000ffffffff,
0x0000ff00ffffffff,
0x0000ffffffffffff,
0xff00ffffffffffff,
0xffffffffffffffff };
unsigned long val = load_unaligned_zeropad(buf + (len & 1));
val &= mask[len];
asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
return sum;
}
NOTE! The above is 100% untested. But I _think_ it may handle the
odd-byte-case correctly, and it should result in just one 8-byte load
(the "load_unaligned_zeropad()" is just in case that ends up
overflowing and we have page-alloc-debug triggering a page fault at
the end). All without looping or any conditional branches that might
mispredict.
My point is that going to assembly results in pretty hard-to-read
code, but it's also fairly inflexible. If we stay with C, we can much
more easily play tricks. So for example, make the above an inline
function, and now you can do things like this:
static inline unsigned long csum_partial_64bit(void *buf, unsigned
long len, unsigned long sum)
{
if (len <= 8)
return csum_partial_8orless(buf, len, sum);
/* We know it's larger than 8 bytes, so handle alignment */
align = 7 & -(unsigned long)buf;
sum = csum_partial_8orless(buf, align, sum);
buf += align;
/* We need to do the big-endian thing */
sum = rotate_by8_if_odd(sum, align);
/* main loop for big packets */
.. do the unrolled inline asm thing that we already do ..
sum = rotate_by8_if_odd(sum, align);
/* Handle the last bytes */
return csum_partial_8orless(buf, len, sum);
}
/* Fold the 64-bit sum we computed down to 32 bits __wsum */
__wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
{
unsigned long sum = csum_partial_64bit(ptr, len, partial);
asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
return sum;
}
or something like that.
NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
"big-endian 16-bit add" thing right by doing the odd byte masking in
the end cases, and by rotating the sum by 8 bits around the
8-byte-unrolled-loop, but I didn't test the above. It's literally
written in my mail client. So I can almost guarantee that it is buggy,
but it is meant as an *example* of "why not do it this way" rather
than production code.
I think that writing it in C and trying to be intelligent about it
like the above would result in more maintainable code, and it is
possible that it would even be faster.
Yes, writing it in C *does* likely result in a few more cases of "adcq
$0" in order to finish up the carry calculations. The *only* advantage
of inline asm is how it allows you to keep the carry flag around. So
there is downside to the C model, and it might cause a cycle or two of
extra work, but the upside of C is that you can try to do clever
things without turning the code completely unreadable.
For example, doing the exception handling (that will never actually
trigger) for the "let's just do an 8-byte load" is just painful in
assembly. In C, we already have the helper function to do it.
Hmm? Would somebody be willing to take the likely very buggy code
above, and test it and make it work? I assume there's a test harness
for the whole csum thing somewhere.
Linus
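The final 64-to-32-bit fold in the csum_partial() sketch above can be written in plain C as well. This is a hypothetical stand-in for the "addl %1,%0 ; adcl $0,%0" pair:

```c
#include <stdint.h>

/* Fold a 64-bit partial sum down to 32 bits with an end-around carry:
 * add the high half into the low half, and if that addition wraps,
 * feed the carry back in (the adcl $0). */
static uint32_t fold64(uint64_t sum)
{
    uint32_t lo = (uint32_t)sum;
    uint32_t hi = (uint32_t)(sum >> 32);

    lo += hi;
    if (lo < hi)        /* addition wrapped: wrap the carry back in */
        lo++;
    return lo;
}
```

The compiler typically turns the wrap test into exactly the add-with-carry sequence the asm spells out by hand.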
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 20:59 ` Tom Herbert
@ 2016-02-04 21:09 ` Alexander Duyck
0 siblings, 0 replies; 29+ messages in thread
From: Alexander Duyck @ 2016-02-04 21:09 UTC (permalink / raw)
To: Tom Herbert
Cc: David Laight, davem, netdev, tglx, mingo, hpa, x86, kernel-team
On Thu, Feb 4, 2016 at 12:59 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Thu, Feb 4, 2016 at 9:09 AM, David Laight <David.Laight@aculab.com> wrote:
>> From: Tom Herbert
>> ...
>>> > If nothing else reducing the size of this main loop may be desirable.
>>> > I know the newer x86 is supposed to have a loop buffer so that it can
>>> > basically loop on already decoded instructions. Normally it is only
>>> > something like 64 or 128 bytes in size though. You might find that
>>> > reducing this loop to that smaller size may improve the performance
>>> > for larger payloads.
>>>
>>> I saw 128 to be better in my testing. For large packets this loop does
>>> all the work. I see performance dependent on the amount of loop
>>> overhead, i.e. we got it down to two non-adcq instructions but it is
>>> still noticeable. Also, this helps a lot on sizes up to 128 bytes
>>> since we only need to do single call in the jump table and no trip
>>> through the loop.
>>
>> But one of your 'loop overhead' instructions is 'loop'.
>> Look at http://www.agner.org/optimize/instruction_tables.pdf
>> you don't want to be using 'loop' on intel cpus.
>>
> I'm not following. We can replace loop with decl %ecx and jg, but why
> is that better?
Because loop takes something like 7 cycles whereas the decl/jg
approach takes 2 or 3. It is probably one of the reasons things look
so much better with the loop unrolled.
- Alex
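For reference, the substitution being suggested looks roughly like this (a hedged sketch, not the patch's actual code). A useful detail: inc/dec leave the carry flag alone, so an explicit decrement preserves the adcq carry chain across iterations just as 'loop' does, while decoding to cheap single-uop instructions:

```asm
        # slow: 'loop' is microcoded (~7 cycles on many Intel uarchs)
10:     adcq (%rdi), %rax
        leaq 8(%rdi), %rdi      # lea does not touch flags
        loop 10b

        # faster: explicit decrement + conditional jump (1-2 cycles),
        # and decl leaves CF alone, so the adcq chain survives
20:     adcq (%rdi), %rax
        leaq 8(%rdi), %rdi
        decl %ecx
        jnz 20b
```

Either form still needs a closing "adcq $0" after the loop to fold in the final carry.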
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 17:09 ` David Laight
@ 2016-02-04 20:59 ` Tom Herbert
2016-02-04 21:09 ` Alexander Duyck
0 siblings, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 20:59 UTC (permalink / raw)
To: David Laight
Cc: Alexander Duyck, davem, netdev, tglx, mingo, hpa, x86, kernel-team
On Thu, Feb 4, 2016 at 9:09 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Tom Herbert
> ...
>> > If nothing else reducing the size of this main loop may be desirable.
>> > I know the newer x86 is supposed to have a loop buffer so that it can
>> > basically loop on already decoded instructions. Normally it is only
>> > something like 64 or 128 bytes in size though. You might find that
>> > reducing this loop to that smaller size may improve the performance
>> > for larger payloads.
>>
>> I saw 128 to be better in my testing. For large packets this loop does
>> all the work. I see performance dependent on the amount of loop
>> overhead, i.e. we got it down to two non-adcq instructions but it is
>> still noticeable. Also, this helps a lot on sizes up to 128 bytes
>> since we only need to do single call in the jump table and no trip
>> through the loop.
>
> But one of your 'loop overhead' instructions is 'loop'.
> Look at http://www.agner.org/optimize/instruction_tables.pdf
> you don't want to be using 'loop' on intel cpus.
>
I'm not following. We can replace loop with decl %ecx and jg, but why
is that better?
Tom
> You might get some benefit from pipelining the loop (so you do
> a read to register in one iteration and a register-register adc
> the next).
>
> David
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 19:44 ` Tom Herbert
@ 2016-02-04 20:03 ` Alexander Duyck
0 siblings, 0 replies; 29+ messages in thread
From: Alexander Duyck @ 2016-02-04 20:03 UTC (permalink / raw)
To: Tom Herbert; +Cc: David Miller, Netdev, tglx, mingo, hpa, x86, Kernel Team
On Thu, Feb 4, 2016 at 11:44 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote:
>>> Implement assembly routine for csum_partial for 64 bit x86. This
>>> primarily speeds up checksum calculation for smaller lengths such as
>>> those that are present when doing skb_postpull_rcsum when getting
>>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
>>> conversion.
>>>
>>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
>>> we need to avoid unaligned accesses. Efficient unaligned accesses
>>> offer a nice additional speedup as demonstrated in the results
>>> provided below.
>>>
>>> This implementation is similar to csum_partial implemented in
>>> checksum_32.S, however since we are dealing with 8 bytes at a time
>>> there are more cases for small lengths (and alignments)-- for that
>>> we employ a jump table.
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary length buffer filled with
>>> random data. For each buffer I compared the computed checksum
>>> using the original algorithm for each possible alignment (0-7 bytes).
>>>
>>> Checksum performance:
>>>
>>> Isolating old and new implementation for some common cases:
>>>
>>> Old NewA NewA % NewNoA NewNoA %
>>> Len/Aln nsec nsec Improv nsecs Improve
>>> --------+-------+--------+-------+-------+---------------------
>>> 1400/0 192.9 175.1 10% 174.9 10% (Big packet)
>>> 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case)
>>> 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4)
>>> 14/0 10.5 7.3 30% 5.4 48% (Eth hdr)
>>> 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4)
>>> 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align)
>>> 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad)
>>>
>>> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
>>> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>>>
>>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>>
>>> Also test on these with similar results:
>>>
>>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>>
>>> Branch prediction:
>>>
>>> To test the effects of poor branch prediction in the jump tables I
>>> tested checksum performance with runs for two combinations of length
>>> and alignment. As the baseline I performed the test by doing half of
>>> calls with the first combination, followed by using the second
>>> combination for the second half. In the test case, I interleave the
>>> two combinations so that in every call the length and alignment are
>>> different to defeat the effects of branch prediction. Running several
>>> cases, I did not see any material performance difference between
>>> the baseline and the interleaving test case.
>>>
>>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>>
>> So a couple of general questions about all this.
>>
>> First why do you break the alignment part out over so many reads? It
>> seems like you could just get the offset, subtract that from your
>> starting buffer, read the aligned long, and then mask off the bits you
>> don't want. That should take maybe 4 or 5 instructions which would
>> save you a bunch of jumping around since most of it is just
>> arithmetic.
>>
> This would require reading bytes before the buffer pointer, back to an
> 8-byte boundary. Is this *always* guaranteed to be safe? Similar
> question on reading bytes past the length.
I would think we should be fine as long as we are only confining
ourselves to the nearest long. I would assume page and segmentation
boundaries are at least aligned by the size of a long. As such I
don't see how reading a few extra bytes to mask off later should cause
any issue.
- Alex
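A hypothetical C sketch of the single-read-and-mask idea for the head bytes (illustrative names, not the patch): the aligned load stays entirely inside buf's own quadword, which is why, per the argument above, it cannot cross a page boundary.

```c
#include <stdint.h>

/* Hypothetical sketch of the suggestion: instead of several small reads
 * for an unaligned head, round the pointer down to an 8-byte boundary,
 * do one aligned load, and mask off the leading bytes that precede buf.
 * The load never crosses a page because it stays inside the quadword
 * containing buf.  Assumes little-endian byte order, as on x86. */
static uint64_t aligned_head_load(const unsigned char *buf)
{
    unsigned long off = (uintptr_t)buf & 7;          /* misalignment 0..7 */
    const uint64_t *p = (const uint64_t *)((uintptr_t)buf - off);
    uint64_t val = *p;                               /* one aligned load */

    val &= ~0ULL << (off * 8);   /* zero the off bytes before buf */
    return val;
}
```

The remaining 8-off head bytes come back already in their natural lanes, ready to feed the aligned main loop.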
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 19:22 ` Alexander Duyck
2016-02-04 19:31 ` Tom Herbert
@ 2016-02-04 19:44 ` Tom Herbert
2016-02-04 20:03 ` Alexander Duyck
1 sibling, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 19:44 UTC (permalink / raw)
To: Alexander Duyck; +Cc: David Miller, Netdev, tglx, mingo, hpa, x86, Kernel Team
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
>> conversion.
>>
>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
>> we need to avoid unaligned accesses. Efficient unaligned accesses
>> offer a nice additional speedup as demonstrated in the results
>> provided below.
>>
>> This implementation is similar to csum_partial implemented in
>> checksum_32.S, however since we are dealing with 8 bytes at a time
>> there are more cases for small lengths (and alignments)-- for that
>> we employ a jump table.
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Checksum performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>> Old NewA NewA % NewNoA NewNoA %
>> Len/Aln nsec nsec Improv nsecs Improve
>> --------+-------+--------+-------+-------+---------------------
>> 1400/0 192.9 175.1 10% 174.9 10% (Big packet)
>> 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case)
>> 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4)
>> 14/0 10.5 7.3 30% 5.4 48% (Eth hdr)
>> 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4)
>> 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align)
>> 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad)
>>
>> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
>> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Also test on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>
>> Branch prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> tested checksum performance with runs for two combinations of length
>> and alignment. As the baseline I performed the test by doing half of
>> calls with the first combination, followed by using the second
>> combination for the second half. In the test case, I interleave the
>> two combinations so that in every call the length and alignment are
>> different to defeat the effects of branch prediction. Running several
>> cases, I did not see any material performance difference between
>> the baseline and the interleaving test case.
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>
> So a couple of general questions about all this.
>
> First why do you break the alignment part out over so many reads? It
> seems like you could just get the offset, subtract that from your
> starting buffer, read the aligned long, and then mask off the bits you
> don't want. That should take maybe 4 or 5 instructions which would
> save you a bunch of jumping around since most of it is just
> arithmetic.
>
This would require reading bytes before the buffer pointer, back to an
8-byte boundary. Is this *always* guaranteed to be safe? Similar
question on reading bytes past the length.
> I am still not sure about the loops, but as mentioned you probably
> don't want to use loop. You would likely be better off generating a
> pointer representing the last valid offset and just running through
> the loop that way.
>
> Then on the last read the same question as the first. Why not just do
> one read and mask the result. It should be faster.
>
> Finally I am not sure if your result is safe if the offset for the
> buffer is an odd value. The original code did a collapse to 16b and
> rotate by 8 if the offset was odd. I don't see anything like that in
> your code.
>
> - Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 19:22 ` Alexander Duyck
@ 2016-02-04 19:31 ` Tom Herbert
2016-02-04 19:44 ` Tom Herbert
1 sibling, 0 replies; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 19:31 UTC (permalink / raw)
To: Alexander Duyck; +Cc: David Miller, Netdev, tglx, mingo, hpa, x86, Kernel Team
On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
>> conversion.
>>
>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
>> we need to avoid unaligned accesses. Efficient unaligned accesses
>> offer a nice additional speedup as demonstrated in the results
>> provided below.
>>
>> This implementation is similar to csum_partial implemented in
>> checksum_32.S, however since we are dealing with 8 bytes at a time
>> there are more cases for small lengths (and alignments)-- for that
>> we employ a jump table.
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Checksum performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>> Old NewA NewA % NewNoA NewNoA %
>> Len/Aln nsec nsec Improv nsecs Improve
>> --------+-------+--------+-------+-------+---------------------
>> 1400/0 192.9 175.1 10% 174.9 10% (Big packet)
>> 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case)
>> 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4)
>> 14/0 10.5 7.3 30% 5.4 48% (Eth hdr)
>> 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4)
>> 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align)
>> 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad)
>>
>> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
>> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Also test on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>
>> Branch prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> tested checksum performance with runs for two combinations of length
>> and alignment. As the baseline I performed the test by doing half of
>> calls with the first combination, followed by using the second
>> combination for the second half. In the test case, I interleave the
>> two combinations so that in every call the length and alignment are
>> different to defeat the effects of branch prediction. Running several
>> cases, I did not see any material performance difference between
>> the baseline and the interleaving test case.
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>
> So a couple of general questions about all this.
>
> First why do you break the alignment part out over so many reads? It
> seems like you could just get the offset, subtract that from your
> starting buffer, read the aligned long, and then mask off the bits you
> don't want. That should take maybe 4 or 5 instructions which would
> save you a bunch of jumping around since most of it is just
> arithmetic.
>
I will look at doing that.
> I am still not sure about the loops, but as mentioned you probably
> don't want to use loop. You would likely be better off generating a
> pointer representing the last valid offset and just running through
> the loop that way.
>
> Then on the last read the same question as the first. Why not just do
> one read and mask the result. It should be faster.
>
> Finally I am not sure if your result is safe if the offset for the
> buffer is an odd value. The original code did a collapse to 16b and
> rotate by 8 if the offset was odd. I don't see anything like that in
> your code.
The roll $8 is done in the RETURN macro. We only need to worry about
this when alignment matters.
Thanks,
Tom
>
> - Alex
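The roll $8 fixup described here corresponds to something like this hypothetical C helper (a sketch, not the macro itself; once the sum is folded modulo 2^16-1, the rotate direction does not change the final checksum):

```c
#include <stdint.h>

/* Hypothetical sketch of the odd-offset fixup: when the buffer starts at
 * an odd address, every byte lands in the opposite lane of its 16-bit
 * checksum word, so rotating the accumulated sum by 8 bits swaps the
 * lanes back.  This mirrors the roll $8 done in the RETURN macro. */
static uint64_t rotate_by8_if_odd(uint64_t sum, unsigned long align)
{
    if (align & 1)
        sum = (sum << 8) | (sum >> 56);  /* rotate left one byte */
    return sum;
}
```

For even offsets it is a no-op, which is why the fixup only matters when alignment does.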
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 10:56 ` Ingo Molnar
@ 2016-02-04 19:24 ` Tom Herbert
2016-02-05 9:24 ` Ingo Molnar
0 siblings, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 19:24 UTC (permalink / raw)
To: Ingo Molnar
Cc: David S. Miller, Linux Kernel Network Developers, tglx, mingo,
hpa, x86, Kernel Team, LKML, Peter Zijlstra, Linus Torvalds
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>
>> > +
>> > + /* Check length */
>> > +10: cmpl $8, %esi
>> > + jg 30f
>> > + jl 20f
>> > +
>> > + /* Exactly 8 bytes length */
>> > + addl (%rdi), %eax
>> > + adcl 4(%rdi), %eax
>> > + RETURN
>> > +
>> > + /* Less than 8 bytes length */
>> > +20: clc
>> > + jmpq *branch_tbl_len(, %rsi, 8)
>> > +
>> > + /* Greater than 8 bytes length. Determine number of quads (n). Sum
>> > + * over first n % 16 quads
>> > + */
>> > +30: movl %esi, %ecx
>> > + shrl $3, %ecx
>> > + andl $0xf, %ecx
>> > + negq %rcx
>> > + lea 40f(, %rcx, 4), %r11
>> > + clc
>> > + jmp *%r11
>>
>> Are you absolutely sure that a jump table is the proper solution here? It
>> defeats branch prediction on most x86 uarchs. Why not label the loop stages and
>> jump in directly with a binary tree of branches?
>
> So just to expand on this a bit, attached below is a quick & simple & stupid
> testcase that generates a 16 entries call table. (Indirect jumps and indirect
> calls are predicted similarly on most x86 uarchs.) Just built it with:
>
> gcc -Wall -O2 -o jump-table jump-table.c
>
> Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel CPU
> and also on AMD Opteron. The numbers are from the Intel box.) this gives a high
> branch miss rate with a 16 entries jump table:
>
> triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16
> ... using 16 jump table entries.
> ... using 100000000 loop iterations.
> ... result: 10000000100000000
> [...]
>
> Performance counter stats for './jump-table 16' (10 runs):
>
> 1386.131780 task-clock (msec) # 1.001 CPUs utilized ( +- 0.18% )
> 33 context-switches # 0.024 K/sec ( +- 1.71% )
> 0 cpu-migrations # 0.000 K/sec
> 52 page-faults # 0.037 K/sec ( +- 0.71% )
> 6,247,215,683 cycles # 4.507 GHz ( +- 0.18% )
> 3,895,337,877 stalled-cycles-frontend # 62.35% frontend cycles idle ( +- 0.30% )
> 1,404,014,996 instructions # 0.22 insns per cycle
> # 2.77 stalled cycles per insn ( +- 0.02% )
> 300,820,988 branches # 217.022 M/sec ( +- 0.02% )
> 87,518,741 branch-misses # 29.09% of all branches ( +- 0.01% )
>
> 1.385240076 seconds time elapsed ( +- 0.21% )
>
> ... as you can see the branch miss rate is very significant, causing a stalled
> decoder and very low instruction throughput.
>
> I have to reduce the jump table to a single entry (!) to get good performance on
> this CPU:
>
> Performance counter stats for './jump-table 1' (10 runs):
>
> 739.173505 task-clock (msec) # 1.001 CPUs utilized ( +- 0.26% )
> 37 context-switches # 0.051 K/sec ( +- 16.79% )
> 0 cpu-migrations # 0.000 K/sec
> 52 page-faults # 0.070 K/sec ( +- 0.41% )
> 3,331,328,405 cycles # 4.507 GHz ( +- 0.26% )
> 2,012,973,596 stalled-cycles-frontend # 60.43% frontend cycles idle ( +- 0.47% )
> 1,403,880,792 instructions # 0.42 insns per cycle
> # 1.43 stalled cycles per insn ( +- 0.05% )
> 300,817,064 branches # 406.964 M/sec ( +- 0.05% )
> 12,177 branch-misses # 0.00% of all branches ( +- 12.39% )
>
> 0.738616356 seconds time elapsed ( +- 0.26% )
>
> Note how the runtime got halved: that is because stalls got halved and the
> instructions per cycle throughput doubled.
>
> Even a two entries jump table performs poorly:
>
> Performance counter stats for './jump-table 2' (10 runs):
>
> 1493.790686 task-clock (msec) # 1.001 CPUs utilized ( +- 0.06% )
> 39 context-switches # 0.026 K/sec ( +- 4.73% )
> 0 cpu-migrations # 0.000 K/sec
> 52 page-faults # 0.035 K/sec ( +- 0.26% )
> 6,732,372,612 cycles # 4.507 GHz ( +- 0.06% )
> 4,229,130,302 stalled-cycles-frontend # 62.82% frontend cycles idle ( +- 0.09% )
> 1,407,803,145 instructions # 0.21 insns per cycle
> # 3.00 stalled cycles per insn ( +- 0.08% )
> 301,688,946 branches # 201.962 M/sec ( +- 0.09% )
> 100,022,150 branch-misses # 33.15% of all branches ( +- 0.00% )
>
> 1.492820524 seconds time elapsed ( +- 0.08% )
>
> In fact it's worse than the 16 entries runtime.
>
> ( Note that Intel SkyLake improved the indirect jump/call branch predictor.
> On another Intel CPU I have the boundary between good and bad prediction is
> at 4-6 entries. So this is highly uarch (and code layout) dependent. )
>
> In contrast, a static hierarchy/tree of branches is predicted a lot better on most
> x86 uarchs, even with highly variable input. (This does not even count the extra
> calculations needed to calculate the jump table index, which, as you coded it in
> your patch, add 2-3 cycles of extra latency as well.)
>
> Such branch miss characteristics matter and they can become more visible with
> smaller skb sizes.
>
> So I'm generally sceptical of jump tables and I'd like to see careful and
> convincing perf stat runs on modern x86 uarchs, comparing the jump table solution
> to a branch hierarchy solution, before accepting a jump table solution into the
> x86 architecture.
>
> x86 uarchs are generally good at predicting function pointer indirect jumps (which
> tend to be static) - multi-target jump tables are a much harder nut to crack,
> especially if the jump index is calculated right before the jump is performed,
> like your patch does it.
>
> The impact of branch misses is more pronounced on modern Intel CPUs, because
> speculation is deeper.
>
> But as always I can be convinced with contrary numbers! :-)
>
Hi Ingo,
Thanks for the explanation and sample code. Expanding on your example,
I added a switch statement that performs the same function (code below).
gcc seems to be implementing this switch as a jump table:
jmpq *0x400ac8(,%rdx,8)
which is a different form of indirect jump than the function table implementation:
callq *(%rsp,%rdx,8)
The switch jump table method is what I coded in the assembly for the
checksum routine (I basically just copied the assembly from what gcc
was doing with the C code for the same function).
Running with the funcptr and switch implementations gives:
taskset 1 perf stat --repeat 10 ./jump_table -S 16
... using funcptr
... result: 10000000100000000
Performance counter stats for './jump_table -S 16' (10 runs):
2318.192392 task-clock (msec) # 0.999 CPUs utilized ( +- 1.02% )
9 context-switches # 0.004 K/sec ( +- 9.51% )
1 cpu-migrations # 0.000 K/sec ( +- 68.31% )
132 page-faults # 0.057 K/sec ( +- 0.10% )
6,328,766,014 cycles # 2.730 GHz ( +- 0.04% ) [49.99%]
3,325,346,741 stalled-cycles-frontend # 52.54% frontend cycles idle ( +- 0.10% ) [50.01%]
1,793,615,301 stalled-cycles-backend # 28.34% backend cycles idle ( +- 2.84% ) [50.04%]
1,509,733,276 instructions # 0.24 insns per cycle
# 2.20 stalled cycles per insn ( +- 0.02% ) [50.03%]
301,788,275 branches # 130.183 M/sec ( +- 0.01% ) [50.01%]
100,030,120 branch-misses # 33.15% of all branches ( +- 0.01% ) [49.98%]
2.320510795 seconds time elapsed ( +- 1.02% )
taskset 1 perf stat --repeat 10 ./jump_table -S 16 -x
... using switch
... result: 10000000100000000
Performance counter stats for './jump_table -S 16 -x' (10 runs):
942.117858 task-clock (msec) # 0.999 CPUs utilized ( +- 1.07% )
5 context-switches # 0.005 K/sec ( +- 13.04% )
2 cpu-migrations # 0.003 K/sec ( +- 42.27% )
132 page-faults # 0.140 K/sec ( +- 0.12% )
2,521,873,028 cycles # 2.677 GHz ( +- 0.09% ) [50.01%]
1,021,412,581 stalled-cycles-frontend # 40.50% frontend cycles idle ( +- 0.17% ) [50.06%]
255,591,030 stalled-cycles-backend # 10.13% backend cycles idle ( +- 13.11% ) [50.10%]
1,104,081,483 instructions # 0.44 insns per cycle
# 0.93 stalled cycles per insn ( +- 0.01% ) [50.05%]
300,713,493 branches # 319.189 M/sec ( +- 0.01% ) [49.98%]
11,500 branch-misses # 0.00% of all branches ( +- 12.18% ) [49.95%]
0.943088132 seconds time elapsed ( +- 1.07% )
So the switch implementation does not seem to incur branch-misses and
is a lot faster. This is also what I see when looking at the
branch-misses with the checksum code-- I haven't yet found any test
case where thrashing lengths or alignments increases branch-misses or
shows a performance degradation over the case where they are static.
Tom
/*
 * Simple testcase to generate jump table usage.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
typedef unsigned long (fn_t) (unsigned long param);
#define TABLE_SIZE_MAX 16L
unsigned long global;
unsigned long table_size = TABLE_SIZE_MAX;
unsigned long loops = 100000000;
int doswitch = 0;
#define DEFINE_FUN(x) \
\
unsigned long fun_##x(unsigned long param) \
{ \
return param+global; \
}
DEFINE_FUN( 0);
DEFINE_FUN( 1);
DEFINE_FUN( 2);
DEFINE_FUN( 3);
DEFINE_FUN( 4);
DEFINE_FUN( 5);
DEFINE_FUN( 6);
DEFINE_FUN( 7);
DEFINE_FUN( 8);
DEFINE_FUN( 9);
DEFINE_FUN(10);
DEFINE_FUN(11);
DEFINE_FUN(12);
DEFINE_FUN(13);
DEFINE_FUN(14);
DEFINE_FUN(15);
static void usage(void)
{
printf("usage: jump-table [-S <table_size>] [-L <loops>] [-h] [-x]\n");
exit(0);
}
unsigned long do_funcptr(void)
{
int i;
unsigned long local = 0;
fn_t *fn_ptr [TABLE_SIZE_MAX] = { fun_0 , fun_1 , fun_2 , fun_3 ,
fun_4 , fun_5 , fun_6 , fun_7 ,
fun_8 , fun_9, fun_10, fun_11,
fun_12, fun_13, fun_14, fun_15,
};
/* Call a set of simple arithmetic functions from a jump table: */
for (i = 0; i < loops; i++) {
global++;
local += fn_ptr[global % table_size](global);
}
return local;
}
unsigned long do_switch(void)
{
int i;
unsigned long local = 0;
/* Call a set of simple arithmetic functions via switch: */
#define CASE(X) case X: \
local += fun_##X(global); \
break;
for (i = 0; i < loops; i++) {
global++;
switch (global % table_size) {
CASE(0) CASE(1) CASE(2) CASE(3) CASE(4) CASE(5) CASE(6) CASE(7)
CASE(8) CASE(9) CASE(10) CASE(11) CASE(12) CASE(13)
CASE(14) CASE(15)
}
}
return local;
}
int main(int argc, char **argv)
{
unsigned long local = 0;
int c;
while ((c = getopt(argc, argv, "hS:xL:")) != -1)
switch (c) {
case 'S':
table_size = atoi(optarg);
if (table_size <= 1)
table_size = 1;
if (table_size > TABLE_SIZE_MAX)
table_size = TABLE_SIZE_MAX;
break;
case 'x':
doswitch = 1;
break;
case 'L':
loops = atoi(optarg);
break;
case 'h':
default:
usage();
}
printf("... using %lu jump table entries.\n", table_size);
printf("... using %lu loop iterations.\n", loops);
if (doswitch) {
printf("... using switch\n");
} else {
printf("... using funcptr\n");
}
local = doswitch ? do_switch() : do_funcptr();
printf("... result: %lu\n", local);
return 0;
}
> Thanks,
>
> Ingo
>
> -------------------------------------->
> /*
> * Simple testcase to generate jump table usage.
> */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> typedef unsigned long (fn_t) (unsigned long param);
>
> unsigned long global;
>
> #define DEFINE_FUN(x) \
> \
> unsigned long fun_##x(unsigned long param) \
> { \
> return param+global; \
> }
>
> #define TABLE_SIZE_MAX 16L
>
> DEFINE_FUN( 1);
> DEFINE_FUN( 2);
> DEFINE_FUN( 3);
> DEFINE_FUN( 4);
> DEFINE_FUN( 5);
> DEFINE_FUN( 6);
> DEFINE_FUN( 7);
> DEFINE_FUN( 8);
> DEFINE_FUN( 9);
> DEFINE_FUN(10);
> DEFINE_FUN(11);
> DEFINE_FUN(12);
> DEFINE_FUN(13);
> DEFINE_FUN(14);
> DEFINE_FUN(15);
> DEFINE_FUN(16);
>
> int main(int argc, char **argv)
> {
> fn_t *fn_ptr [TABLE_SIZE_MAX] = { fun_1 , fun_2 , fun_3 , fun_4 ,
> fun_5 , fun_6 , fun_7 , fun_8 ,
> fun_9 , fun_10, fun_11, fun_12,
> fun_13, fun_14, fun_15, fun_16,
> };
> unsigned long local = 0;
> unsigned long i;
> unsigned long table_size = TABLE_SIZE_MAX;
> unsigned long loops = 100000000;
>
>
> if ((argc >= 2 && !strcmp(argv[1], "-h")) || argc >= 4) {
> printf("usage: jump-table <table_size(1-%lu)(default: %lu)> <loops(default: %lu)>\n", TABLE_SIZE_MAX, TABLE_SIZE_MAX, loops);
> exit(0);
> }
>
> if (argc >= 2) {
> table_size = atol(argv[1]);
> if (table_size <= 1)
> table_size = 1;
> if (table_size > TABLE_SIZE_MAX)
> table_size = TABLE_SIZE_MAX;
> }
> printf("... using %lu jump table entries.\n", table_size);
>
> if (argc >= 3)
> loops = atol(argv[2]);
> printf("... using %lu loop iterations.\n", loops);
>
> /* Call a set of simple arithmetic functions from a jump table: */
> for (i = 0; i < loops; i++) {
> global++;
> local += fn_ptr[global % table_size](global);
> }
>
> printf("... result: %lu\n", local);
> }
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-03 19:18 Tom Herbert
2016-02-04 9:30 ` Ingo Molnar
2016-02-04 11:08 ` David Laight
@ 2016-02-04 19:22 ` Alexander Duyck
2016-02-04 19:31 ` Tom Herbert
2016-02-04 19:44 ` Tom Herbert
2 siblings, 2 replies; 29+ messages in thread
From: Alexander Duyck @ 2016-02-04 19:22 UTC (permalink / raw)
To: Tom Herbert; +Cc: David Miller, Netdev, tglx, mingo, hpa, x86, Kernel Team
On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.
>
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
> we need to avoid unaligned accesses. Efficient unaligned accesses
> offer a nice additional speedup as demonstrated in the results
> provided below.
>
> This implementation is similar to csum_partial implemented in
> checksum_32.S, however since we are dealing with 8 bytes at a time
> there are more cases for small lengths (and alignments)-- for that
> we employ a jump table.
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Checksum performance:
>
> Isolating old and new implementation for some common cases:
>
> Old NewA NewA % NewNoA NewNoA %
> Len/Aln nsec nsec Improv nsecs Improve
> --------+-------+--------+-------+-------+---------------------
> 1400/0 192.9 175.1 10% 174.9 10% (Big packet)
> 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case)
> 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4)
> 14/0 10.5 7.3 30% 5.4 48% (Eth hdr)
> 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4)
> 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align)
> 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad)
>
> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> Also test on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> Branch prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between
> the baseline and the interleaving test case.
>
> Signed-off-by: Tom Herbert <tom@herbertland.com>
So a couple of general questions about all this.
First why do you break the alignment part out over so many reads? It
seems like you could just get the offset, subtract that from your
starting buffer, read the aligned long, and then mask off the bits you
don't want. That should take maybe 4 or 5 instructions which would
save you a bunch of jumping around since most of it is just
arithmetic.
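In C terms, the aligned-load-and-mask idea for the head of the buffer
might look like the sketch below (a userspace illustration, not code
from the patch; leading_quad() is a hypothetical helper, and a
little-endian x86 machine is assumed):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the aligned-load-and-mask idea for the buffer head:
 * round the pointer down to an 8-byte boundary, load the whole quad,
 * and zero out the leading bytes that fall before the buffer. On
 * little-endian x86 those are the low-order bytes of the load. */
static uint64_t leading_quad(const unsigned char *buf)
{
	uintptr_t addr = (uintptr_t)buf;
	unsigned int offset = addr & 7;	/* bytes past the 8-byte boundary */
	uint64_t quad;

	memcpy(&quad, (const void *)(addr - offset), sizeof(quad));
	return quad & (~0ULL << (offset * 8));	/* drop bytes before buf */
}
```

That is a handful of arithmetic instructions plus one aligned load,
instead of a chain of byte/word/dword reads.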
I am still not sure about the loops, but as mentioned you probably
don't want to use the 'loop' instruction. You would likely be better off generating a
pointer representing the last valid offset and just running through
the loop that way.
Then on the last read, the same question as the first: why not just do
one read and mask the result? It should be faster.
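A companion sketch for the tail (again hypothetical code, not from the
patch): load the last quad in full and keep only the 'rem' valid
low-order bytes. Note the over-read is only acceptable if the extra
bytes are guaranteed to be mapped, which the caller would have to
ensure:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a single masked load for the final 'rem' bytes (rem < 8).
 * Loads a full quad and keeps only the low-order 'rem' bytes
 * (little-endian). This reads up to 7 bytes past the buffer end, so it
 * is only safe when the load cannot cross into an unmapped page. */
static uint64_t trailing_quad(const unsigned char *p, unsigned int rem)
{
	uint64_t quad;

	memcpy(&quad, p, sizeof(quad));		/* over-reads up to 7 bytes */
	if (rem >= 8)
		return quad;
	return quad & ((1ULL << (rem * 8)) - 1);
}
```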
Finally I am not sure if your result is safe if the offset for the
buffer is an odd value. The original code did a collapse to 16b and
rotate by 8 if the offset was odd. I don't see anything like that in
your code.
- Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 16:58 ` Tom Herbert
@ 2016-02-04 17:09 ` David Laight
2016-02-04 20:59 ` Tom Herbert
0 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2016-02-04 17:09 UTC (permalink / raw)
To: 'Tom Herbert', Alexander Duyck
Cc: davem, netdev, tglx, mingo, hpa, x86, kernel-team
From: Tom Herbert
...
> > If nothing else reducing the size of this main loop may be desirable.
> > I know the newer x86 is supposed to have a loop buffer so that it can
> > basically loop on already decoded instructions. Normally it is only
> > something like 64 or 128 bytes in size though. You might find that
> > reducing this loop to that smaller size may improve the performance
> > for larger payloads.
>
> I saw 128 to be better in my testing. For large packets this loop does
> all the work. I see performance dependent on the amount of loop
> overhead, i.e. we got it down to two non-adcq instructions but it is
> still noticeable. Also, this helps a lot on sizes up to 128 bytes
> since we only need to do single call in the jump table and no trip
> through the loop.
But one of your 'loop overhead' instructions is 'loop'.
Look at http://www.agner.org/optimize/instruction_tables.pdf
you don't want to be using 'loop' on intel cpus.
You might get some benefit from pipelining the loop (so you do
a read to register in one iteration and a register-register adc
the next).
David
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 16:51 ` Alexander Duyck
@ 2016-02-04 16:58 ` Tom Herbert
2016-02-04 17:09 ` David Laight
0 siblings, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 16:58 UTC (permalink / raw)
To: Alexander Duyck
Cc: David Laight, davem, netdev, tglx, mingo, hpa, x86, kernel-team
On Thu, Feb 4, 2016 at 8:51 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 3:08 AM, David Laight <David.Laight@aculab.com> wrote:
>> From: Tom Herbert
>>> Sent: 03 February 2016 19:19
>> ...
>>> + /* Main loop */
>>> +50: adcq 0*8(%rdi),%rax
>>> + adcq 1*8(%rdi),%rax
>>> + adcq 2*8(%rdi),%rax
>>> + adcq 3*8(%rdi),%rax
>>> + adcq 4*8(%rdi),%rax
>>> + adcq 5*8(%rdi),%rax
>>> + adcq 6*8(%rdi),%rax
>>> + adcq 7*8(%rdi),%rax
>>> + adcq 8*8(%rdi),%rax
>>> + adcq 9*8(%rdi),%rax
>>> + adcq 10*8(%rdi),%rax
>>> + adcq 11*8(%rdi),%rax
>>> + adcq 12*8(%rdi),%rax
>>> + adcq 13*8(%rdi),%rax
>>> + adcq 14*8(%rdi),%rax
>>> + adcq 15*8(%rdi),%rax
>>> + lea 128(%rdi), %rdi
>>> + loop 50b
>>
>> I'd need convincing that unrolling the loop like that gives any significant gain.
>> You have a dependency chain on the carry flag so have delays between the 'adcq'
>> instructions (these may be more significant than the memory reads from l1 cache).
>>
>> I also don't remember (might be wrong) the 'loop' instruction being executed quickly.
>> If 'loop' is fast then you will probably find that:
>>
>> 10: adcq 0(%rdi),%rax
>> lea 8(%rdi),%rdi
>> loop 10b
>>
>> is just as fast since the three instructions could all be executed in parallel.
>> But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as
>> a single micro-op).
>> IIRC 'adc' and 'dec' will both have dependencies on the flags register
>> so cannot execute together (which is a shame here).
>>
>> It is also possible that breaking the carry-chain dependency by doing 32bit
>> adds (possibly after 64bit reads) can be made to be faster.
>
> If nothing else reducing the size of this main loop may be desirable.
> I know the newer x86 is supposed to have a loop buffer so that it can
> basically loop on already decoded instructions. Normally it is only
> something like 64 or 128 bytes in size though. You might find that
> reducing this loop to that smaller size may improve the performance
> for larger payloads.
>
I saw 128 to be better in my testing. For large packets this loop does
all the work. I see performance dependent on the amount of loop
overhead, i.e. we got it down to two non-adcq instructions but it is
still noticeable. Also, this helps a lot on sizes up to 128 bytes
since we only need to do single call in the jump table and no trip
through the loop.
Tom
> - Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 11:08 ` David Laight
@ 2016-02-04 16:51 ` Alexander Duyck
2016-02-04 16:58 ` Tom Herbert
0 siblings, 1 reply; 29+ messages in thread
From: Alexander Duyck @ 2016-02-04 16:51 UTC (permalink / raw)
To: David Laight
Cc: Tom Herbert, davem, netdev, tglx, mingo, hpa, x86, kernel-team
On Thu, Feb 4, 2016 at 3:08 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Tom Herbert
>> Sent: 03 February 2016 19:19
> ...
>> + /* Main loop */
>> +50: adcq 0*8(%rdi),%rax
>> + adcq 1*8(%rdi),%rax
>> + adcq 2*8(%rdi),%rax
>> + adcq 3*8(%rdi),%rax
>> + adcq 4*8(%rdi),%rax
>> + adcq 5*8(%rdi),%rax
>> + adcq 6*8(%rdi),%rax
>> + adcq 7*8(%rdi),%rax
>> + adcq 8*8(%rdi),%rax
>> + adcq 9*8(%rdi),%rax
>> + adcq 10*8(%rdi),%rax
>> + adcq 11*8(%rdi),%rax
>> + adcq 12*8(%rdi),%rax
>> + adcq 13*8(%rdi),%rax
>> + adcq 14*8(%rdi),%rax
>> + adcq 15*8(%rdi),%rax
>> + lea 128(%rdi), %rdi
>> + loop 50b
>
> I'd need convincing that unrolling the loop like that gives any significant gain.
> You have a dependency chain on the carry flag so have delays between the 'adcq'
> instructions (these may be more significant than the memory reads from l1 cache).
>
> I also don't remember (might be wrong) the 'loop' instruction being executed quickly.
> If 'loop' is fast then you will probably find that:
>
> 10: adcq 0(%rdi),%rax
> lea 8(%rdi),%rdi
> loop 10b
>
> is just as fast since the three instructions could all be executed in parallel.
> But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as
> a single micro-op).
> IIRC 'adc' and 'dec' will both have dependencies on the flags register
> so cannot execute together (which is a shame here).
>
> It is also possible that breaking the carry-chain dependency by doing 32bit
> adds (possibly after 64bit reads) can be made to be faster.
If nothing else reducing the size of this main loop may be desirable.
I know the newer x86 is supposed to have a loop buffer so that it can
basically loop on already decoded instructions. Normally it is only
something like 64 or 128 bytes in size though. You might find that
reducing this loop to that smaller size may improve the performance
for larger payloads.
- Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-03 19:18 Tom Herbert
2016-02-04 9:30 ` Ingo Molnar
@ 2016-02-04 11:08 ` David Laight
2016-02-04 16:51 ` Alexander Duyck
2016-02-04 19:22 ` Alexander Duyck
2 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2016-02-04 11:08 UTC (permalink / raw)
To: 'Tom Herbert', davem, netdev; +Cc: tglx, mingo, hpa, x86, kernel-team
From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> + /* Main loop */
> +50: adcq 0*8(%rdi),%rax
> + adcq 1*8(%rdi),%rax
> + adcq 2*8(%rdi),%rax
> + adcq 3*8(%rdi),%rax
> + adcq 4*8(%rdi),%rax
> + adcq 5*8(%rdi),%rax
> + adcq 6*8(%rdi),%rax
> + adcq 7*8(%rdi),%rax
> + adcq 8*8(%rdi),%rax
> + adcq 9*8(%rdi),%rax
> + adcq 10*8(%rdi),%rax
> + adcq 11*8(%rdi),%rax
> + adcq 12*8(%rdi),%rax
> + adcq 13*8(%rdi),%rax
> + adcq 14*8(%rdi),%rax
> + adcq 15*8(%rdi),%rax
> + lea 128(%rdi), %rdi
> + loop 50b
I'd need convincing that unrolling the loop like that gives any significant gain.
You have a dependency chain on the carry flag so have delays between the 'adcq'
instructions (these may be more significant than the memory reads from l1 cache).
I also don't remember (might be wrong) the 'loop' instruction being executed quickly.
If 'loop' is fast then you will probably find that:
10: adcq 0(%rdi),%rax
lea 8(%rdi),%rdi
loop 10b
is just as fast since the three instructions could all be executed in parallel.
But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as
a single micro-op).
IIRC 'adc' and 'dec' will both have dependencies on the flags register
so cannot execute together (which is a shame here).
It is also possible that breaking the carry-chain dependency by doing 32bit
adds (possibly after 64bit reads) can be made to be faster.
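The 32-bit-adds idea can be modeled in C: accumulate 32-bit words into
a 64-bit register with ordinary ADDs, so no addition depends on the
carry flag of the previous one; the deferred carries pile up in the
accumulator's top 32 bits and are folded once at the end (an
illustrative sketch, not tuned code from the patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Carry-free ones' complement accumulation: each 32-bit word is added
 * into a 64-bit accumulator with a plain ADD, so independent additions
 * can issue in parallel. The overflow bits collect in the upper half
 * of the accumulator and are folded back at the end. */
static uint32_t csum32_add(const uint32_t *p, size_t n_words)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n_words; i++)
		sum += p[i];			/* no carry-flag dependency */

	/* Fold back to a 32-bit ones' complement sum: */
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	return (uint32_t)sum;
}
```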
David
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-04 9:30 ` Ingo Molnar
@ 2016-02-04 10:56 ` Ingo Molnar
2016-02-04 19:24 ` Tom Herbert
2016-02-04 21:46 ` Linus Torvalds
1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2016-02-04 10:56 UTC (permalink / raw)
To: Tom Herbert
Cc: davem, netdev, tglx, mingo, hpa, x86, kernel-team, linux-kernel,
Peter Zijlstra, Linus Torvalds
* Ingo Molnar <mingo@kernel.org> wrote:
> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>
> > +
> > + /* Check length */
> > +10: cmpl $8, %esi
> > + jg 30f
> > + jl 20f
> > +
> > + /* Exactly 8 bytes length */
> > + addl (%rdi), %eax
> > + adcl 4(%rdi), %eax
> > + RETURN
> > +
> > + /* Less than 8 bytes length */
> > +20: clc
> > + jmpq *branch_tbl_len(, %rsi, 8)
> > +
> > + /* Greater than 8 bytes length. Determine number of quads (n). Sum
> > + * over first n % 16 quads
> > + */
> > +30: movl %esi, %ecx
> > + shrl $3, %ecx
> > + andl $0xf, %ecx
> > + negq %rcx
> > + lea 40f(, %rcx, 4), %r11
> > + clc
> > + jmp *%r11
>
> Are you absolutely sure that a jump table is the proper solution here? It
> defeats branch prediction on most x86 uarchs. Why not label the loop stages and
> jump in directly with a binary tree of branches?
So just to expand on this a bit, attached below is a quick & simple & stupid
testcase that generates a 16 entries call table. (Indirect jumps and indirect
calls are predicted similarly on most x86 uarchs.) Just built it with:
gcc -Wall -O2 -o jump-table jump-table.c
Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel CPU
and also on AMD Opteron. The numbers are from the Intel box.) this gives a high
branch miss rate with a 16 entries jump table:
triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16
... using 16 jump table entries.
... using 100000000 loop iterations.
... result: 10000000100000000
[...]
Performance counter stats for './jump-table 16' (10 runs):
1386.131780 task-clock (msec) # 1.001 CPUs utilized ( +- 0.18% )
33 context-switches # 0.024 K/sec ( +- 1.71% )
0 cpu-migrations # 0.000 K/sec
52 page-faults # 0.037 K/sec ( +- 0.71% )
6,247,215,683 cycles # 4.507 GHz ( +- 0.18% )
3,895,337,877 stalled-cycles-frontend # 62.35% frontend cycles idle ( +- 0.30% )
1,404,014,996 instructions # 0.22 insns per cycle
# 2.77 stalled cycles per insn ( +- 0.02% )
300,820,988 branches # 217.022 M/sec ( +- 0.02% )
87,518,741 branch-misses # 29.09% of all branches ( +- 0.01% )
1.385240076 seconds time elapsed ( +- 0.21% )
... as you can see the branch miss rate is very significant, causing a stalled
decoder and very low instruction throughput.
I have to reduce the jump table to a single entry (!) to get good performance on
this CPU:
Performance counter stats for './jump-table 1' (10 runs):
739.173505 task-clock (msec) # 1.001 CPUs utilized ( +- 0.26% )
37 context-switches # 0.051 K/sec ( +- 16.79% )
0 cpu-migrations # 0.000 K/sec
52 page-faults # 0.070 K/sec ( +- 0.41% )
3,331,328,405 cycles # 4.507 GHz ( +- 0.26% )
2,012,973,596 stalled-cycles-frontend # 60.43% frontend cycles idle ( +- 0.47% )
1,403,880,792 instructions # 0.42 insns per cycle
# 1.43 stalled cycles per insn ( +- 0.05% )
300,817,064 branches # 406.964 M/sec ( +- 0.05% )
12,177 branch-misses # 0.00% of all branches ( +- 12.39% )
0.738616356 seconds time elapsed ( +- 0.26% )
Note how the runtime got halved: that is because stalls got halved and the
instructions per cycle throughput doubled.
Even a two entries jump table performs poorly:
Performance counter stats for './jump-table 2' (10 runs):
1493.790686 task-clock (msec) # 1.001 CPUs utilized ( +- 0.06% )
39 context-switches # 0.026 K/sec ( +- 4.73% )
0 cpu-migrations # 0.000 K/sec
52 page-faults # 0.035 K/sec ( +- 0.26% )
6,732,372,612 cycles # 4.507 GHz ( +- 0.06% )
4,229,130,302 stalled-cycles-frontend # 62.82% frontend cycles idle ( +- 0.09% )
1,407,803,145 instructions # 0.21 insns per cycle
# 3.00 stalled cycles per insn ( +- 0.08% )
301,688,946 branches # 201.962 M/sec ( +- 0.09% )
100,022,150 branch-misses # 33.15% of all branches ( +- 0.00% )
1.492820524 seconds time elapsed ( +- 0.08% )
In fact it's worse than the 16 entries runtime.
( Note that Intel SkyLake improved the indirect jump/call branch predictor.
On another Intel CPU I have, the boundary between good and bad prediction is
at 4-6 entries. So this is highly uarch (and code layout) dependent. )
In contrast, a static hierarchy/tree of branches is predicted a lot better on most
x86 uarchs, even with highly variable input. (This does not even count the extra
calculations needed to calculate the jump table index, which, as you coded it in
your patch, add 2-3 cycles of extra latency as well.)
Such branch miss characteristics matter and they can become more visible with
smaller skb sizes.
So I'm generally sceptical of jump tables and I'd like to see careful and
convincing perf stat runs on modern x86 uarchs, comparing the jump table solution
to a branch hierarchy solution, before accepting a jump table solution into the
x86 architecture.
x86 uarchs are generally good at predicting function pointer indirect jumps (which
tend to be static) - multi-target jump tables are a much harder nut to crack,
especially if the jump index is calculated right before the jump is performed,
like your patch does it.
The impact of branch misses is more pronounced on modern Intel CPUs, because
speculation is deeper.
But as always I can be convinced with contrary numbers! :-)
Thanks,
Ingo
-------------------------------------->
/*
* Simple testcase to generate jump table usage.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef unsigned long (fn_t) (unsigned long param);
unsigned long global;
#define DEFINE_FUN(x) \
\
unsigned long fun_##x(unsigned long param) \
{ \
return param+global; \
}
#define TABLE_SIZE_MAX 16L
DEFINE_FUN( 1);
DEFINE_FUN( 2);
DEFINE_FUN( 3);
DEFINE_FUN( 4);
DEFINE_FUN( 5);
DEFINE_FUN( 6);
DEFINE_FUN( 7);
DEFINE_FUN( 8);
DEFINE_FUN( 9);
DEFINE_FUN(10);
DEFINE_FUN(11);
DEFINE_FUN(12);
DEFINE_FUN(13);
DEFINE_FUN(14);
DEFINE_FUN(15);
DEFINE_FUN(16);
int main(int argc, char **argv)
{
fn_t *fn_ptr [TABLE_SIZE_MAX] = { fun_1 , fun_2 , fun_3 , fun_4 ,
fun_5 , fun_6 , fun_7 , fun_8 ,
fun_9 , fun_10, fun_11, fun_12,
fun_13, fun_14, fun_15, fun_16,
};
unsigned long local = 0;
unsigned long i;
unsigned long table_size = TABLE_SIZE_MAX;
unsigned long loops = 100000000;
if ((argc >= 2 && !strcmp(argv[1], "-h")) || argc >= 4) {
printf("usage: jump-table <table_size(1-%lu)(default: %lu)> <loops(default: %lu)>\n", TABLE_SIZE_MAX, TABLE_SIZE_MAX, loops);
exit(0);
}
if (argc >= 2) {
table_size = atol(argv[1]);
if (table_size <= 1)
table_size = 1;
if (table_size > TABLE_SIZE_MAX)
table_size = TABLE_SIZE_MAX;
}
printf("... using %lu jump table entries.\n", table_size);
if (argc >= 3)
loops = atol(argv[2]);
printf("... using %lu loop iterations.\n", loops);
/* Call a set of simple arithmetic functions from a jump table: */
for (i = 0; i < loops; i++) {
global++;
local += fn_ptr[global % table_size](global);
}
printf("... result: %lu\n", local);
}
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
2016-02-03 19:18 Tom Herbert
@ 2016-02-04 9:30 ` Ingo Molnar
2016-02-04 10:56 ` Ingo Molnar
2016-02-04 21:46 ` Linus Torvalds
2016-02-04 11:08 ` David Laight
2016-02-04 19:22 ` Alexander Duyck
2 siblings, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2016-02-04 9:30 UTC (permalink / raw)
To: Tom Herbert
Cc: davem, netdev, tglx, mingo, hpa, x86, kernel-team, linux-kernel,
Peter Zijlstra
* Tom Herbert <tom@herbertland.com> wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.
>
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
> we need to avoid unaligned accesses. Efficient unaligned accesses
> offer a nice additional speedup as demonstrated in the results
> provided below.
>
> This implementation is similar to csum_partial implemented in
> checksum_32.S, however since we are dealing with 8 bytes at a time
> there are more cases for small lengths (and alignments)-- for that
> we employ a jump table.
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Checksum performance:
>
> Isolating old and new implementation for some common cases:
>
> Old NewA NewA % NewNoA NewNoA %
> Len/Aln nsec nsec Improv nsecs Improve
> --------+-------+--------+-------+-------+---------------------
> 1400/0 192.9 175.1 10% 174.9 10% (Big packet)
> 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case)
> 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4)
> 14/0 10.5 7.3 30% 5.4 48% (Eth hdr)
> 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4)
> 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align)
> 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad)
>
> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> Also test on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> Branch prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between
> the baseline and the interleaving test case.
Possibly because the jumps are missing branch prediction for all these variants...
Please run something like:
triton:~/tip> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' ~/hackbench 10
...
Performance counter stats for '/home/mingo/hackbench 10' (10 runs):
1,701,689,802 instructions ( +- 0.83% )
305,439,262 branches ( +- 0.92% )
2,011,032 branch-misses # 0.66% of all branches ( +- 3.00% )
0.074311070 seconds time elapsed ( +- 2.42% )
... to see the quality of x86 branch prediction and to see the statistical quality
of the measurement (stddev).
The real comparison would be jump table against a hierarchy of open coded
branches. The latter is much more likely to be correctly branch-predicted.
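As a concrete (hypothetical) sketch of what such a hierarchy of open
coded branches looks like for a 0-7 index: a balanced tree of compares
instead of an indirect jump, so every conditional branch gets its own
predictor state. The '+rem' bodies below just stand in for the
per-length checksum stubs:

```c
/* Balanced branch-tree dispatch on a 3-bit index: at most three
 * well-predicted conditional branches, no indirect jump. */
static unsigned long dispatch_tree(unsigned int rem, unsigned long arg)
{
	if (rem < 4) {
		if (rem < 2)
			return rem == 0 ? arg : arg + 1;
		return rem == 2 ? arg + 2 : arg + 3;
	}
	if (rem < 6)
		return rem == 4 ? arg + 4 : arg + 5;
	return rem == 6 ? arg + 6 : arg + 7;
}
```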
I have a couple of (mostly stylistic) comments about the patch:
> Signed-off-by: Tom Herbert <tom@herbertland.com>
> ---
> arch/x86/include/asm/checksum_64.h | 5 +
> arch/x86/lib/csum-partial_64.S | 277 +++++++++++++++++++++++++++++++++++++
> arch/x86/lib/csum-partial_64.c | 148 --------------------
> 3 files changed, 282 insertions(+), 148 deletions(-)
> create mode 100644 arch/x86/lib/csum-partial_64.S
> delete mode 100644 arch/x86/lib/csum-partial_64.c
>
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index cd00e17..a888f65 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -128,6 +128,11 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
> */
> extern __wsum csum_partial(const void *buff, int len, __wsum sum);
>
> +static inline __sum16 ip_compute_csum(const void *buff, int len)
> +{
> + return csum_fold(csum_partial(buff, len, 0));
> +}
> +
> #define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
> #define HAVE_CSUM_COPY_USER 1
>
> diff --git a/arch/x86/lib/csum-partial_64.S b/arch/x86/lib/csum-partial_64.S
> new file mode 100644
> index 0000000..520b400
> --- /dev/null
> +++ b/arch/x86/lib/csum-partial_64.S
> @@ -0,0 +1,277 @@
> +/* Copyright 2016 Tom Herbert <tom@herbertland.com>
> + *
> + * Checksum partial calculation
s/ Partial checksum calculation
> + *
> + * __wsum csum_partial(const void *buff, int len, __wsum sum)
> + *
> + * Computes the checksum of a memory block at buff, length len,
> + * and adds in "sum" (32-bit)
So please refer to arguments consistently in an escaped manner, i.e. 'buff',
'len', 'sum'. It's a bit confusing to read otherwise.
This applies to other comments in your patch as well.
> + *
> + * Returns a 32-bit number suitable for feeding into itself
> + * or csum_tcpudp_magic
In comments please refer to functions with (), i.e. csum_tcpudp_magic().
> + *
> + * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS determines whether alignment of the
> + * buffer must be dealt with.
> + *
> + * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set then the steps are:
s/ If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y then the steps are:
> + * 1) Initialize accumulator to initial sum
> + * 2) Sum 8 bytes at a time using adcq (unroll main loop
> + * to do 128 bytes at a time)
Please refer to x86 instructions capitalized, i.e. ADCQ here. That makes them
stand out better especially when they alias to some English word.
This applies to other parts of your patch as well.
> + * 3) Sum remaining length (less than 8 bytes)
> + *
> + * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set then the steps are:
s/ If !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS then the steps are:
> + * 1) Handle buffer that is not aligned to 8 bytes, sum up to 8 byte
> + * alignment
> + * 2) Sum 8 bytes at a time using adcq (unroll main loop
> + * to do 128 bytes at a time)
> + * 3) Sum remaining length (less than 8 bytes)
> + * 4) Roll result if alignment is odd and add in initial sum argument
> + * 5) If buffer is not aligned to 8 bytes and length is less than
> + * or equal to 8 - alignment (whole buffer is in one quad), then
> + * treat that as a special case.
Had to read it 3 times to realize that '-' is not a separator, so please:
s/8-alignment
> + *
> + * Register usage:
> + * %rdi: argument #1, buff
> + * %rsi: argument #2, length
> + * %rdx: argument #3, add in value
> + * %rax,%eax: accumulator and return value
> + * %rcx,%ecx: counter and tmp
> + * %r11: tmp
> + * %r10: alignment (0-7) - when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
s/CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/errno.h>
> +#include <asm/asm.h>
> +
> +#define branch_tbl_len .L_branch_tbl_len
> +
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +
> +/* Close the carry chain and return. */
> +#define RETURN \
> + adcl $0, %eax; \
> + ret
> +
> +#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
> +
> +/* Before returning need to roll the result if alignment was odd and then add
> + * in the initial sum.
> + */
Please use the customary (multi-line) comment style:
/*
* Comment .....
* ...... goes here.
*/
specified in Documentation/CodingStyle. This applies to other comments in your
patch as well.
> +#define RETURN \
> + adcl $0, %eax; \
> + test $0x1, %r10d; \
> + jz 99f; \
> + roll $8, %eax; \
> +99: addl %edx, %eax; \
> + adcl $0, %eax; \
> + ret
> +
> +#define branch_tbl_align .L_branch_tbl_align
> +
> +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
On x86 we always set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
So this is dead, untested code that is never compiled on x86.
> +
> +ENTRY(csum_partial)
> +
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> + movl %edx, %eax /* Initialize with initial sum argument */
> +#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> + test %esi, %esi /* Zero length? */
> + jne 310f
> + movl %edx, %eax
> + ret
> +
> +310: xorl %eax, %eax
So in modern x86 assembly we try to use local named labels instead of numeric
ones. Easier to extend and the result is a lot easier to read. This applies to
most other numeric labels in your patch.
> +
> + /* Determine alignment */
> + movl %edi, %r10d
> + andl $0x7, %r10d
> + jz 10f
> + movl $8, %ecx
> + subl %r10d, %ecx
> + cmpl %ecx, %esi
> + jle 320f
> + clc
> + jmpq *branch_tbl_align(, %r10, 8)
> +
> + /* Whole buffer fits into one quad. Sum up to a four byte alignment
> + * and then call into the length table to finish.
> + */
> +320: test $0x1, %r10d
> + jz 330f
> + movb (%rdi), %ah /* Align to two bytes */
> + decl %esi
> + lea 1(%rdi), %rdi
> +330: cmpl $2, %esi
> + jl 340f
> + test $0x2, %r10d
> + jz 340f
> + addw (%rdi), %ax /* Align to four bytes */
> + adcl $0, %eax
> + lea 2(%rdi), %rdi
> + subl $2, %esi
> +340:
> + clc
> + jmpq *branch_tbl_len(, %rsi, 8)
Also, please use extra newlines to better delineate the control flow.
I.e. something like:
.L_small_buffer:
test $0x1, %r10d
jz .L_2_bytes_aligned
/* Align to two bytes: */
movb (%rdi), %ah
decl %esi
lea 1(%rdi), %rdi
.L_2_bytes_aligned:
cmpl $2, %esi
jl .L_4_bytes_aligned
test $0x2, %r10d
jz .L_4_bytes_aligned
/* Align to four bytes: */
addw (%rdi), %ax
adcl $0, %eax
lea 2(%rdi), %rdi
subl $2, %esi
.L_4_bytes_aligned:
See how much more readable such an assembly construct is than numeric labels and
no proper flow control separation?
Also note how I changed the placement and style of the existing comments.
Note that I just guessed about the labels based on patch context, some might be
wrong. Also, this code will go away as it's dead code - but the comments still
apply to the rest of the patch.
Please propagate all such readability improvements to the rest of your patch.
> +
> +/* Jumps table for alignments */
s/Jump table
> +
> +201: /* Align 1 */
> + adcw 5(%rdi), %ax
> +203: /* Align 3 */
> + adcw 3(%rdi), %ax
> +205: /* Align 5 */
> + adcw 1(%rdi), %ax
> +207: /* Align 7 */
> + adcl $0, %eax
> + addb (%rdi), %ah
> + jmp 222f
> +202: /* Align 2 */
> + adcw 4(%rdi), %ax
> +204: /* Align 4 */
> + adcw 2(%rdi), %ax
> +206: /* Align 6 */
> + adcw (%rdi), %ax
> +
> +222: adcl $0, %eax
> + subl %ecx, %esi /* %rcx is 8 - alignment */
> + addq %rcx, %rdi
> +200:
> + /* Fall through */
> +
> +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +
> + /* Check length */
> +10: cmpl $8, %esi
> + jg 30f
> + jl 20f
> +
> + /* Exactly 8 bytes length */
> + addl (%rdi), %eax
> + adcl 4(%rdi), %eax
> + RETURN
> +
> + /* Less than 8 bytes length */
> +20: clc
> + jmpq *branch_tbl_len(, %rsi, 8)
> +
> + /* Greater than 8 bytes length. Determine number of quads (n). Sum
> + * over first n % 16 quads
> + */
> +30: movl %esi, %ecx
> + shrl $3, %ecx
> + andl $0xf, %ecx
> + negq %rcx
> + lea 40f(, %rcx, 4), %r11
> + clc
> + jmp *%r11
Are you absolutely sure that a jump table is the proper solution here? It defeats
branch prediction on most x86 uarchs. Why not label the loop stages and jump in
directly with a binary tree of branches?
> +
> +.align 8
> + adcq 14*8(%rdi),%rax
> + adcq 13*8(%rdi),%rax
> + adcq 12*8(%rdi),%rax
> + adcq 11*8(%rdi),%rax
> + adcq 10*8(%rdi),%rax
> + adcq 9*8(%rdi),%rax
> + adcq 8*8(%rdi),%rax
> + adcq 7*8(%rdi),%rax
> + adcq 6*8(%rdi),%rax
> + adcq 5*8(%rdi),%rax
> + adcq 4*8(%rdi),%rax
> + adcq 3*8(%rdi),%rax
> + adcq 2*8(%rdi),%rax
> + adcq 1*8(%rdi),%rax
> + adcq 0*8(%rdi),%rax
Stray whitespace.
> + nop
> +40: /* #quads % 16 jump table base */
So both the jump table and its initial alignment add NOPs that at minimum blow
up the I$ a bit. With direct jumps we could avoid all that.
> +
> + adcq $0, %rax
> + shlq $3, %rcx
> + subq %rcx, %rdi /* %rcx is already negative length */
> +
> + /* Now determine number of blocks of 8 quads. Sum 128 bytes at a time
> + * using unrolled loop.
s/using an unrolled loop
> + */
> + movl %esi, %ecx
> + shrl $7, %ecx
> + jz 60f
> + clc
> +
> + /* Main loop */
> +50: adcq 0*8(%rdi),%rax
> + adcq 1*8(%rdi),%rax
> + adcq 2*8(%rdi),%rax
> + adcq 3*8(%rdi),%rax
> + adcq 4*8(%rdi),%rax
> + adcq 5*8(%rdi),%rax
> + adcq 6*8(%rdi),%rax
> + adcq 7*8(%rdi),%rax
> + adcq 8*8(%rdi),%rax
> + adcq 9*8(%rdi),%rax
> + adcq 10*8(%rdi),%rax
> + adcq 11*8(%rdi),%rax
> + adcq 12*8(%rdi),%rax
> + adcq 13*8(%rdi),%rax
> + adcq 14*8(%rdi),%rax
> + adcq 15*8(%rdi),%rax
> + lea 128(%rdi), %rdi
> + loop 50b
> +
> + adcq $0, %rax
> +
> + /* Handle remaining length which is <= 8 bytes */
> +60: andl $0x7, %esi
> +
> + /* Fold 64 bit sum to 32 bits */
s/64-bit
> + movq %rax, %rcx
> + shrq $32, %rcx
> + addl %ecx, %eax
> +
> + jmpq *branch_tbl_len(, %rsi, 8)
Same fundamental question as above.
> +
> +/* Length table targets */
> +
> +107: /* Length 7 */
> + adcw 4(%rdi), %ax
> +105: /* Length 5 */
> + adcw 2(%rdi), %ax
> +103: /* Length 3 */
> + adcw (%rdi), %ax
> +101: /* Length 1, grab the odd byte */
> + adcb -1(%rdi, %rsi), %al
> + adcb $0, %ah
> + RETURN
> +106: /* Length 6 */
> + adcw 4(%rdi), %ax
> +104: /* Length 4, optimized for double word access*/
> + adcl (%rdi), %eax
> + RETURN
> +102: /* Length 2 */
> + adcw (%rdi), %ax
> +100: /* Length 0 */
> + RETURN
> +
> +.section .rodata
> +.align 64
Please use the proper parametric cache alignment define. There are
subarchitectures that prefer (and need) much larger alignment.
> +.L_branch_tbl_len:
> + .quad 100b
> + .quad 101b
> + .quad 102b
> + .quad 103b
> + .quad 104b
> + .quad 105b
> + .quad 106b
> + .quad 107b
> +
> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +.L_branch_tbl_align:
> + .quad 200b
> + .quad 201b
> + .quad 202b
> + .quad 203b
> + .quad 204b
> + .quad 205b
> + .quad 206b
> + .quad 207b
> +#endif
This too is dead code.
Thanks,
Ingo
* [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
@ 2016-02-03 19:18 Tom Herbert
2016-02-04 9:30 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Tom Herbert @ 2016-02-03 19:18 UTC (permalink / raw)
To: davem, netdev; +Cc: tglx, mingo, hpa, x86, kernel-team
Implement an assembly routine for csum_partial for 64-bit x86. This
primarily speeds up checksum calculation for smaller lengths, such as
those that arise in skb_postpull_rcsum when getting CHECKSUM_COMPLETE
from a device or after a CHECKSUM_UNNECESSARY conversion.
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
we need to avoid unaligned accesses. Efficient unaligned accesses
offer a nice additional speedup as demonstrated in the results
provided below.
This implementation is similar to the csum_partial implemented in
checksum_32.S; however, since we are dealing with 8 bytes at a time
there are more cases for small lengths (and alignments), so for those
we employ a jump table.
Testing:
Correctness:
Verified correctness by testing arbitrary-length buffers filled with
random data. For each buffer I compared the checksum computed by the
new routine against the original algorithm, for each possible
alignment (0-7 bytes).
Checksum performance:
Isolating old and new implementation for some common cases:
Old NewA NewA % NewNoA NewNoA %
Len/Aln nsec nsec Improv nsecs Improve
--------+-------+--------+-------+-------+---------------------
1400/0 192.9 175.1 10% 174.9 10% (Big packet)
40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case)
8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4)
14/0 10.5 7.3 30% 5.4 48% (Eth hdr)
14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4)
14/3 11.0 9.8 11% 5.6 49% (Eth with odd align)
7/1 10.0 5.8 42% 4.8 52% (buffer in one quad)
NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
Also tested on these with similar results:
Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Branch prediction:
To test the effects of poor branch prediction in the jump tables, I
tested checksum performance with runs for two combinations of length
and alignment. As the baseline, I performed the test by doing half of
the calls with the first combination, followed by the second
combination for the second half. In the test case, I interleaved the
two combinations so that the length and alignment are different in
every call, to defeat the effects of branch prediction. Running several
cases, I did not see any material performance difference between
the baseline and the interleaving test case.
Signed-off-by: Tom Herbert <tom@herbertland.com>
---
arch/x86/include/asm/checksum_64.h | 5 +
arch/x86/lib/csum-partial_64.S | 277 +++++++++++++++++++++++++++++++++++++
arch/x86/lib/csum-partial_64.c | 148 --------------------
3 files changed, 282 insertions(+), 148 deletions(-)
create mode 100644 arch/x86/lib/csum-partial_64.S
delete mode 100644 arch/x86/lib/csum-partial_64.c
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index cd00e17..a888f65 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -128,6 +128,11 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
*/
extern __wsum csum_partial(const void *buff, int len, __wsum sum);
+static inline __sum16 ip_compute_csum(const void *buff, int len)
+{
+ return csum_fold(csum_partial(buff, len, 0));
+}
+
#define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
#define HAVE_CSUM_COPY_USER 1
diff --git a/arch/x86/lib/csum-partial_64.S b/arch/x86/lib/csum-partial_64.S
new file mode 100644
index 0000000..520b400
--- /dev/null
+++ b/arch/x86/lib/csum-partial_64.S
@@ -0,0 +1,277 @@
+/* Copyright 2016 Tom Herbert <tom@herbertland.com>
+ *
+ * Checksum partial calculation
+ *
+ * __wsum csum_partial(const void *buff, int len, __wsum sum)
+ *
+ * Computes the checksum of a memory block at buff, length len,
+ * and adds in "sum" (32-bit)
+ *
+ * Returns a 32-bit number suitable for feeding into itself
+ * or csum_tcpudp_magic
+ *
+ * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS determines whether alignment of the
+ * buffer must be dealt with.
+ *
+ * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set then the steps are:
+ * 1) Initialize accumulator to initial sum
+ * 2) Sum 8 bytes at a time using adcq (unroll main loop
+ * to do 128 bytes at a time)
+ * 3) Sum remaining length (less than 8 bytes)
+ *
+ * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set then the steps are:
+ * 1) Handle buffer that is not aligned to 8 bytes, sum up to 8 byte
+ * alignment
+ * 2) Sum 8 bytes at a time using adcq (unroll main loop
+ * to do 128 bytes at a time)
+ * 3) Sum remaining length (less than 8 bytes)
+ * 4) Roll result if alignment is odd and add in initial sum argument
+ * 5) If buffer is not aligned to 8 bytes and length is less than
+ * or equal to 8 - alignment (whole buffer is in one quad), then
+ * treat that as a special case.
+ *
+ * Register usage:
+ * %rdi: argument #1, buff
+ * %rsi: argument #2, length
+ * %rdx: argument #3, add in value
+ * %rax,%eax: accumulator and return value
+ * %rcx,%ecx: counter and tmp
+ * %r11: tmp
+ * %r10: alignment (0-7) - when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
+ */
+
+#include <linux/linkage.h>
+#include <asm/errno.h>
+#include <asm/asm.h>
+
+#define branch_tbl_len .L_branch_tbl_len
+
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+
+/* Close the carry chain and return. */
+#define RETURN \
+ adcl $0, %eax; \
+ ret
+
+#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+
+/* Before returning need to roll the result if alignment was odd and then add
+ * in the initial sum.
+ */
+#define RETURN \
+ adcl $0, %eax; \
+ test $0x1, %r10d; \
+ jz 99f; \
+ roll $8, %eax; \
+99: addl %edx, %eax; \
+ adcl $0, %eax; \
+ ret
+
+#define branch_tbl_align .L_branch_tbl_align
+
+#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+
+ENTRY(csum_partial)
+
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ movl %edx, %eax /* Initialize with initial sum argument */
+#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+ test %esi, %esi /* Zero length? */
+ jne 310f
+ movl %edx, %eax
+ ret
+
+310: xorl %eax, %eax
+
+ /* Determine alignment */
+ movl %edi, %r10d
+ andl $0x7, %r10d
+ jz 10f
+ movl $8, %ecx
+ subl %r10d, %ecx
+ cmpl %ecx, %esi
+ jle 320f
+ clc
+ jmpq *branch_tbl_align(, %r10, 8)
+
+ /* Whole buffer fits into one quad. Sum up to a four byte alignment
+ * and then call into the length table to finish.
+ */
+320: test $0x1, %r10d
+ jz 330f
+ movb (%rdi), %ah /* Align to two bytes */
+ decl %esi
+ lea 1(%rdi), %rdi
+330: cmpl $2, %esi
+ jl 340f
+ test $0x2, %r10d
+ jz 340f
+ addw (%rdi), %ax /* Align to four bytes */
+ adcl $0, %eax
+ lea 2(%rdi), %rdi
+ subl $2, %esi
+340:
+ clc
+ jmpq *branch_tbl_len(, %rsi, 8)
+
+/* Jumps table for alignments */
+
+201: /* Align 1 */
+ adcw 5(%rdi), %ax
+203: /* Align 3 */
+ adcw 3(%rdi), %ax
+205: /* Align 5 */
+ adcw 1(%rdi), %ax
+207: /* Align 7 */
+ adcl $0, %eax
+ addb (%rdi), %ah
+ jmp 222f
+202: /* Align 2 */
+ adcw 4(%rdi), %ax
+204: /* Align 4 */
+ adcw 2(%rdi), %ax
+206: /* Align 6 */
+ adcw (%rdi), %ax
+
+222: adcl $0, %eax
+ subl %ecx, %esi /* %rcx is 8 - alignment */
+ addq %rcx, %rdi
+200:
+ /* Fall through */
+
+#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+
+ /* Check length */
+10: cmpl $8, %esi
+ jg 30f
+ jl 20f
+
+ /* Exactly 8 bytes length */
+ addl (%rdi), %eax
+ adcl 4(%rdi), %eax
+ RETURN
+
+ /* Less than 8 bytes length */
+20: clc
+ jmpq *branch_tbl_len(, %rsi, 8)
+
+ /* Greater than 8 bytes length. Determine number of quads (n). Sum
+ * over first n % 16 quads
+ */
+30: movl %esi, %ecx
+ shrl $3, %ecx
+ andl $0xf, %ecx
+ negq %rcx
+ lea 40f(, %rcx, 4), %r11
+ clc
+ jmp *%r11
+
+.align 8
+ adcq 14*8(%rdi),%rax
+ adcq 13*8(%rdi),%rax
+ adcq 12*8(%rdi),%rax
+ adcq 11*8(%rdi),%rax
+ adcq 10*8(%rdi),%rax
+ adcq 9*8(%rdi),%rax
+ adcq 8*8(%rdi),%rax
+ adcq 7*8(%rdi),%rax
+ adcq 6*8(%rdi),%rax
+ adcq 5*8(%rdi),%rax
+ adcq 4*8(%rdi),%rax
+ adcq 3*8(%rdi),%rax
+ adcq 2*8(%rdi),%rax
+ adcq 1*8(%rdi),%rax
+ adcq 0*8(%rdi),%rax
+ nop
+40: /* #quads % 16 jump table base */
+
+ adcq $0, %rax
+ shlq $3, %rcx
+ subq %rcx, %rdi /* %rcx is already negative length */
+
+ /* Now determine number of blocks of 8 quads. Sum 128 bytes at a time
+ * using unrolled loop.
+ */
+ movl %esi, %ecx
+ shrl $7, %ecx
+ jz 60f
+ clc
+
+ /* Main loop */
+50: adcq 0*8(%rdi),%rax
+ adcq 1*8(%rdi),%rax
+ adcq 2*8(%rdi),%rax
+ adcq 3*8(%rdi),%rax
+ adcq 4*8(%rdi),%rax
+ adcq 5*8(%rdi),%rax
+ adcq 6*8(%rdi),%rax
+ adcq 7*8(%rdi),%rax
+ adcq 8*8(%rdi),%rax
+ adcq 9*8(%rdi),%rax
+ adcq 10*8(%rdi),%rax
+ adcq 11*8(%rdi),%rax
+ adcq 12*8(%rdi),%rax
+ adcq 13*8(%rdi),%rax
+ adcq 14*8(%rdi),%rax
+ adcq 15*8(%rdi),%rax
+ lea 128(%rdi), %rdi
+ loop 50b
+
+ adcq $0, %rax
+
+ /* Handle remaining length which is <= 8 bytes */
+60: andl $0x7, %esi
+
+ /* Fold 64 bit sum to 32 bits */
+ movq %rax, %rcx
+ shrq $32, %rcx
+ addl %ecx, %eax
+
+ jmpq *branch_tbl_len(, %rsi, 8)
+
+/* Length table targets */
+
+107: /* Length 7 */
+ adcw 4(%rdi), %ax
+105: /* Length 5 */
+ adcw 2(%rdi), %ax
+103: /* Length 3 */
+ adcw (%rdi), %ax
+101: /* Length 1, grab the odd byte */
+ adcb -1(%rdi, %rsi), %al
+ adcb $0, %ah
+ RETURN
+106: /* Length 6 */
+ adcw 4(%rdi), %ax
+104: /* Length 4, optimized for double word access*/
+ adcl (%rdi), %eax
+ RETURN
+102: /* Length 2 */
+ adcw (%rdi), %ax
+100: /* Length 0 */
+ RETURN
+
+.section .rodata
+.align 64
+.L_branch_tbl_len:
+ .quad 100b
+ .quad 101b
+ .quad 102b
+ .quad 103b
+ .quad 104b
+ .quad 105b
+ .quad 106b
+ .quad 107b
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+.L_branch_tbl_align:
+ .quad 200b
+ .quad 201b
+ .quad 202b
+ .quad 203b
+ .quad 204b
+ .quad 205b
+ .quad 206b
+ .quad 207b
+#endif
+
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
deleted file mode 100644
index 9845371..0000000
--- a/arch/x86/lib/csum-partial_64.c
+++ /dev/null
@@ -1,148 +0,0 @@
-/*
- * arch/x86_64/lib/csum-partial.c
- *
- * This file contains network checksum routines that are better done
- * in an architecture-specific manner due to speed.
- */
-
-#include <linux/compiler.h>
-#include <linux/module.h>
-#include <asm/checksum.h>
-
-static inline unsigned short from32to16(unsigned a)
-{
- unsigned short b = a >> 16;
- asm("addw %w2,%w0\n\t"
- "adcw $0,%w0\n"
- : "=r" (b)
- : "0" (b), "r" (a));
- return b;
-}
-
-/*
- * Do a 64-bit checksum on an arbitrary memory area.
- * Returns a 32bit checksum.
- *
- * This isn't as time critical as it used to be because many NICs
- * do hardware checksumming these days.
- *
- * Things tried and found to not make it faster:
- * Manual Prefetching
- * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
- */
-static unsigned do_csum(const unsigned char *buff, unsigned len)
-{
- unsigned odd, count;
- unsigned long result = 0;
-
- if (unlikely(len == 0))
- return result;
- odd = 1 & (unsigned long) buff;
- if (unlikely(odd)) {
- result = *buff << 8;
- len--;
- buff++;
- }
- count = len >> 1; /* nr of 16-bit words.. */
- if (count) {
- if (2 & (unsigned long) buff) {
- result += *(unsigned short *)buff;
- count--;
- len -= 2;
- buff += 2;
- }
- count >>= 1; /* nr of 32-bit words.. */
- if (count) {
- unsigned long zero;
- unsigned count64;
- if (4 & (unsigned long) buff) {
- result += *(unsigned int *) buff;
- count--;
- len -= 4;
- buff += 4;
- }
- count >>= 1; /* nr of 64-bit words.. */
-
- /* main loop using 64byte blocks */
- zero = 0;
- count64 = count >> 3;
- while (count64) {
- asm("addq 0*8(%[src]),%[res]\n\t"
- "adcq 1*8(%[src]),%[res]\n\t"
- "adcq 2*8(%[src]),%[res]\n\t"
- "adcq 3*8(%[src]),%[res]\n\t"
- "adcq 4*8(%[src]),%[res]\n\t"
- "adcq 5*8(%[src]),%[res]\n\t"
- "adcq 6*8(%[src]),%[res]\n\t"
- "adcq 7*8(%[src]),%[res]\n\t"
- "adcq %[zero],%[res]"
- : [res] "=r" (result)
- : [src] "r" (buff), [zero] "r" (zero),
- "[res]" (result));
- buff += 64;
- count64--;
- }
-
- /* last up to 7 8byte blocks */
- count %= 8;
- while (count) {
- asm("addq %1,%0\n\t"
- "adcq %2,%0\n"
- : "=r" (result)
- : "m" (*(unsigned long *)buff),
- "r" (zero), "0" (result));
- --count;
- buff += 8;
- }
- result = add32_with_carry(result>>32,
- result&0xffffffff);
-
- if (len & 4) {
- result += *(unsigned int *) buff;
- buff += 4;
- }
- }
- if (len & 2) {
- result += *(unsigned short *) buff;
- buff += 2;
- }
- }
- if (len & 1)
- result += *buff;
- result = add32_with_carry(result>>32, result & 0xffffffff);
- if (unlikely(odd)) {
- result = from32to16(result);
- result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
- }
- return result;
-}
-
-/*
- * computes the checksum of a memory block at buff, length len,
- * and adds in "sum" (32-bit)
- *
- * returns a 32-bit number suitable for feeding into itself
- * or csum_tcpudp_magic
- *
- * this function must be called with even lengths, except
- * for the last fragment, which may be odd
- *
- * it's best to have buff aligned on a 64-bit boundary
- */
-__wsum csum_partial(const void *buff, int len, __wsum sum)
-{
- return (__force __wsum)add32_with_carry(do_csum(buff, len),
- (__force u32)sum);
-}
-
-/*
- * this routine is used for miscellaneous IP-like checksums, mainly
- * in icmp.c
- */
-__sum16 ip_compute_csum(const void *buff, int len)
-{
- return csum_fold(csum_partial(buff,len,0));
-}
-EXPORT_SYMBOL(ip_compute_csum);
-
--
2.4.6
Thread overview: 29+ messages
2016-02-08 20:12 [PATCH v3 net-next] net: Implement fast csum_partial for x86_64 George Spelvin
2016-02-09 10:48 ` David Laight
2016-02-10 0:53 ` George Spelvin
2016-02-10 11:39 ` David Laight
2016-02-10 14:43 ` George Spelvin
2016-02-10 15:18 ` David Laight
-- strict thread matches above, loose matches on Subject: below --
2016-02-03 19:18 Tom Herbert
2016-02-04 9:30 ` Ingo Molnar
2016-02-04 10:56 ` Ingo Molnar
2016-02-04 19:24 ` Tom Herbert
2016-02-05 9:24 ` Ingo Molnar
2016-02-04 21:46 ` Linus Torvalds
2016-02-04 22:09 ` Linus Torvalds
2016-02-05 1:27 ` Linus Torvalds
2016-02-05 1:39 ` Linus Torvalds
2016-02-04 22:43 ` Tom Herbert
2016-02-04 22:57 ` Linus Torvalds
2016-02-05 8:01 ` Ingo Molnar
2016-02-05 10:07 ` David Laight
2016-02-04 11:08 ` David Laight
2016-02-04 16:51 ` Alexander Duyck
2016-02-04 16:58 ` Tom Herbert
2016-02-04 17:09 ` David Laight
2016-02-04 20:59 ` Tom Herbert
2016-02-04 21:09 ` Alexander Duyck
2016-02-04 19:22 ` Alexander Duyck
2016-02-04 19:31 ` Tom Herbert
2016-02-04 19:44 ` Tom Herbert
2016-02-04 20:03 ` Alexander Duyck