* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
@ 2016-02-08 20:12 George Spelvin
  2016-02-09 10:48 ` David Laight
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2016-02-08 20:12 UTC (permalink / raw)
  To: David.Laight, linux-kernel, netdev, tom; +Cc: mingo

David Laight wrote:
> I'd need convincing that unrolling the loop like that gives any significant gain.
> You have a dependency chain on the carry flag so have delays between the 'adcq'
> instructions (these may be more significant than the memory reads from l1 cache).

If the carry chain is a bottleneck, on Broadwell+ (feature flag
X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
separate flag bits for their carry chains and so can be interleaved.

I don't have such a machine to test on, but if someone who does
would like to do a little benchmarking, that would be an interesting
data point.

Unfortunately, that means yet another version of the main loop,
but if there's a significant benefit...
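
For illustration only (an untested sketch, register choices arbitrary),
the idea is two independent accumulators on two independent flag bits:

	adcxq	0(%rdi),%rax	# chain 1: carries via CF into %rax
	adoxq	8(%rdi),%rdx	# chain 2: carries via OF into %rdx
	adcxq	16(%rdi),%rax
	adoxq	24(%rdi),%rdx
	# ... fold %rdx (and the final CF/OF bits) into %rax at the end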

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-08 20:12 [PATCH v3 net-next] net: Implement fast csum_partial for x86_64 George Spelvin
@ 2016-02-09 10:48 ` David Laight
  2016-02-10  0:53   ` George Spelvin
  0 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2016-02-09 10:48 UTC (permalink / raw)
  To: 'George Spelvin', linux-kernel, netdev, tom; +Cc: mingo

From: George Spelvin [mailto:linux@horizon.com]
> Sent: 08 February 2016 20:13
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant gain.
> > You have a dependency chain on the carry flag so have delays between the 'adcq'
> > instructions (these may be more significant than the memory reads from l1 cache).
> 
> If the carry chain is a bottleneck, on Broadwell+ (feature flag
> X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
> separate flag bits for their carry chains and so can be interleaved.
> 
> I don't have such a machine to test on, but if someone who does
> would like to do a little benchmarking, that would be an interesting
> data point.
> 
> Unfortunately, that means yet another version of the main loop,
> but if there's a significant benefit...

Well, the only part actually worth writing in assembler is the 'adc' loop.
So run-time substitution of separate versions (as is done for memcpy())
wouldn't be hard.
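
For example (untested sketch; csum_partial_adx()/csum_partial_adc() are
made-up names for the two loop variants, but static_cpu_has() and
X86_FEATURE_ADX already exist):

	if (static_cpu_has(X86_FEATURE_ADX))
		sum = csum_partial_adx(buff, len, sum);
	else
		sum = csum_partial_adc(buff, len, sum);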

Since adcx and adox must execute in parallel I clearly need to re-remember
how dependencies against the flags register work. I'm sure I remember
issues with 'false dependencies' against the flags.

However you still need a loop construct that doesn't modify 'o' or 'c'.
Using leal, jcxz, jmp might work.
(Unless broadwell actually has a fast 'loop' instruction.)

(I've not got a suitable test cpu.)

	David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-09 10:48 ` David Laight
@ 2016-02-10  0:53   ` George Spelvin
  2016-02-10 11:39     ` David Laight
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2016-02-10  0:53 UTC (permalink / raw)
  To: David.Laight, linux-kernel, linux, netdev, tom; +Cc: mingo

David Laight wrote:
> Since adcx and adox must execute in parallel I clearly need to re-remember
> how dependencies against the flags register work. I'm sure I remember
> issues with 'false dependencies' against the flags.

The issue is with flags register bits that are *not* modified by
an instruction.  If the register is treated as a monolithic entity,
then the previous values of those bits must be considered an *input*
to the instruction, forcing serialization.

The first step in avoiding this problem is to consider the rarely-modified
bits (interrupt, direction, trap, etc.) to be a separate logical register
from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
which are updated by almost every instruction.

An arithmetic instruction overwrites the arithmetic flags (so it's only
a WAW dependency which can be broken by renaming) and doesn't touch the
rarely-modified flags (so no dependency).

However, on x86 even the arithmetic flags aren't updated consistently.
The biggest offenders are the (very common!) INC/DEC instructions,
which update all of the arithmetic flags *except* the carry flag.

Thus, the carry flag is also renamed separately on every superscalar
x86 implementation I've ever heard of.

The bit test instructions (BT, BTC, BTR, BTS) also affect *only*
the carry flag, leaving other flags unmodified.  This is also
handled properly by renaming the carry flag separately.
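
As a concrete (untested) illustration of why this matters for carry
chains:

	stc			# CF = 1
	incq	%rcx		# writes ZF/SF/etc. but leaves CF alone
	adcq	$0,%rax		# still adds the carry set by stc

	stc			# CF = 1
	addq	$1,%rcx		# rewrites CF (cleared here unless %rcx wrapped)
	adcq	$0,%rax		# sees the CF produced by the addq, not by stc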


Here's a brief summary chart of flags updated by common instructions:
http://www.logix.cz/michal/doc/i386/app-c.htm
and the full list with all the corner cases:
http://www.logix.cz/michal/doc/i386/app-b.htm

The other two flags that can be worth separating are the overflow
and zero flags.

The rotate instructions modify *only* the carry and overflow flags.
While overflow is undefined for multi-bit rotates (and thus leaving it
unmodified is a valid implementation), it's defined for single-bit rotates,
so must be written.

There are several less common instructions, notably BSF, BSR, CMPXCHG8B,
and a bunch of 80286 segment instructions that nobody cares about,
which report the result of a test in the zero flag and are defined to
not affect the other flags.


So an aggressive x86 implementation breaks the flags register into five
separately renamed registers:
- CF (carry)
- OF (overflow)
- ZF (zero)
- SF, AF, PF (sign, aux carry, and parity)
- DF, IF, TF, IOPL, etc.

Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
it was reasonable to commit to always renaming CF and OF separately.

> However you still need a loop construct that doesn't modify 'o' or 'c'.
> Using leal, jcxz, jmp might work.
> (Unless broadwell actually has a fast 'loop' instruction.)

According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
to K8 and Merom.  The one exception is Prescott.  JCXZ and LOOP are 4
uops on those processors.  But 64 bit in general sucked on Prescott,
so how much do we care?

AMD:	LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
	JCXZ is acceptable on all of them.
	LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
Intel:	LOOP is slow (7+ uops) on all processors up to and including Skylake.
	JCXZ is 2 uops on everything from P6 to Skylake except for:
	- Prescott (JCXZ & loop both 4 uops)
	- 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
	I can't find any that it's fast on.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-10  0:53   ` George Spelvin
@ 2016-02-10 11:39     ` David Laight
  2016-02-10 14:43       ` George Spelvin
  0 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2016-02-10 11:39 UTC (permalink / raw)
  To: 'George Spelvin', linux-kernel, netdev, tom; +Cc: mingo

From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-kernel@vger.kernel.org; linux@horizon.com; netdev@vger.kernel.org;
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register work. I'm sure I remember
> > issues with 'false dependencies' against the flags.
> 
> The issue is with flags register bits that are *not* modified by
> an instruction.  If the register is treated as a monolithic entity,
> then the previous values of those bits must be considered an *input*
> to the instruction, forcing serialization.
> 
> The first step in avoiding this problem is to consider the rarely-modified
> bits (interrupt, direction, trap, etc.) to be a separate logical register
> from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
> which are updated by almost every instruction.
> 
> An arithmetic instruction overwrites the arithmetic flags (so it's only
> a WAW dependency which can be broken by renaming) and doesn't touch the
> rarely-modified flags (so no dependency).
> 
> However, on x86 even the arithmetic flags aren't updated consistently.
> The biggest offenders are the (very common!) INC/DEC instructions,
> which update all of the arithmetic flags *except* the carry flag.
> 
> Thus, the carry flag is also renamed separately on every superscalar
> x86 implementation I've ever heard of.

Ah, that is the little fact I'd forgotten.
...
> Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
> it was reasonable to commit to always renaming CF and OF separately.

Separate renaming allows:
1) The value to be tested without waiting for pending updates to complete.
   Useful for IE and DIR.
2) Instructions that modify almost all the flags to execute without
   waiting for a previous instruction to complete.
   So separating 'carry' allows inc/dec to execute without waiting
   for previous arithmetic to complete.

The latter should remove the dependency (both ways) between 'adc' and
'dec, jnz' in a checksum loop.

I can't see any obvious gain from separating out O or Z (even with
adcx and adox). You'd need some other instructions that don't set O (or Z)
but set some other useful flags.
   (A decrement that only sets Z, for instance.)

> > However you still need a loop construct that doesn't modify 'o' or 'c'.
> > Using leal, jcxz, jmp might work.
> > (Unless broadwell actually has a fast 'loop' instruction.)
> 
> According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
> JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
> to K8 and Merom.  The one exception is Prescott.  JCXZ and LOOP are 4
> uops on those processors.  But 64 bit in general sucked on Prescott,
> so how much do we care?
> 
> AMD:	LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
> 	JCXZ is acceptable on all of them.
> 	LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
> Intel:	LOOP is slow (7+ uops) on all processors up to and including Skylake.
> 	JCXZ is 2 uops on everything from P6 to Skylake except for:
> 	- Prescott (JCXZ & loop both 4 uops)
> 	- 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
> 	I can't find any that it's fast on.

While LOOP could be used on Bulldozer+, an equivalently fast loop
can be done with inc/dec and jnz.
So you only care about LOOP/JCXZ when ADOX is supported.

I think the fastest loop is:
10:	adcq	0(%rdi,%rcx,8),%rax
	inc	%rcx
	jnz	10b
but check if any cpu adds an extra clock for the 'scaled' offset
(they might be faster if %rdi is incremented).
That loop looks like it will have no overhead on recent cpu.
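
For completeness, the setup for that would be roughly (untested sketch):

	# %rdi = buffer, %rcx = number of quadwords, %rax = incoming sum
	leaq	(%rdi,%rcx,8),%rdi	# point %rdi just past the end
	negq	%rcx			# count up from -n towards 0
	clc				# clean carry going into the loop
10:	adcq	0(%rdi,%rcx,8),%rax
	inc	%rcx			# doesn't touch CF
	jnz	10b
	adcq	$0,%rax			# fold the final carry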

	David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-10 11:39     ` David Laight
@ 2016-02-10 14:43       ` George Spelvin
  2016-02-10 15:18         ` David Laight
  0 siblings, 1 reply; 29+ messages in thread
From: George Spelvin @ 2016-02-10 14:43 UTC (permalink / raw)
  To: David.Laight, linux-kernel, linux, netdev, tom; +Cc: mingo

David Laight wrote:
> Separate renaming allows:
> 1) The value to be tested without waiting for pending updates to complete.
>    Useful for IE and DIR.

I don't quite follow.  It allows the value to be tested without waiting
for pending updates *of other bits* to complete.

Obviously, the update of the bit being tested has to complete!

> I can't see any obvious gain from separating out O or Z (even with
> adcx and adox). You'd need some other instructions that don't set O (or Z)
> but set some other useful flags.
>    (A decrement that only sets Z, for instance.)

I tried to describe the advantages in the previous message.

The problems arise much less often than with the INC/DEC pair, but there are
instructions which write only the O and C flags (ROL, ROR), or only
the Z flag (CMPXCHG).

The sign, aux carry, and parity flags are *always* updated as
a group, so they can be renamed as a group.

> While LOOP could be used on Bulldozer+, an equivalently fast loop
> can be done with inc/dec and jnz.
> So you only care about LOOP/JCXZ when ADOX is supported.
> 
> I think the fastest loop is:
> 10:	adcq	0(%rdi,%rcx,8),%rax
> 	inc	%rcx
> 	jnz	10b
> but check if any cpu adds an extra clock for the 'scaled' offset
> (they might be faster if %rdi is incremented).
> That loop looks like it will have no overhead on recent cpu.

Well, it should execute at 1 instruction/cycle.  (No, a scaled offset
doesn't take extra time.)  To break that requires ADCX/ADOX:

10:	adcxq	0(%rdi,%rcx),%rax
	adoxq	8(%rdi,%rcx),%rdx
 	leaq	16(%rcx),%rcx
	jrcxz	11f
 	jmp	10b
11:

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-10 14:43       ` George Spelvin
@ 2016-02-10 15:18         ` David Laight
  0 siblings, 0 replies; 29+ messages in thread
From: David Laight @ 2016-02-10 15:18 UTC (permalink / raw)
  To: 'George Spelvin', linux-kernel, netdev, tom; +Cc: mingo

From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10:	adcq	0(%rdi,%rcx,8),%rax
> > 	inc	%rcx
> > 	jnz	10b
> > That loop looks like it will have no overhead on recent cpu.
> 
> Well, it should execute at 1 instruction/cycle.

I presume you do mean 1 adc/cycle.
If it doesn't, unrolling once might help.

> (No, a scaled offset doesn't take extra time.)
Maybe I'm remembering the 386 book.

> To break that requires ADCX/ADOX:
> 
> 10:	adcxq	0(%rdi,%rcx),%rax
> 	adoxq	8(%rdi,%rcx),%rdx
>  	leaq	16(%rcx),%rcx
> 	jrcxz	11f
>  	jmp	10b
> 11:

Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz is two clocks - so definitely needs a second adoxq/adcxq pair.

Experiments would be needed to confirm guesses though.
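
Something like this, perhaps (untested; %rcx counting up from a negative
multiple of 32 as before):

10:	adcxq	0(%rdi,%rcx),%rax
	adoxq	8(%rdi,%rcx),%rdx
	adcxq	16(%rdi,%rcx),%rax
	adoxq	24(%rdi,%rcx),%rdx
	leaq	32(%rcx),%rcx
	jrcxz	11f
	jmp	10b
11:	# fold %rdx and the final CF/OF bits into %rax afterwards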

	David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-05  8:01       ` Ingo Molnar
@ 2016-02-05 10:07         ` David Laight
  0 siblings, 0 replies; 29+ messages in thread
From: David Laight @ 2016-02-05 10:07 UTC (permalink / raw)
  To: 'Ingo Molnar', Tom Herbert
  Cc: Linus Torvalds, David Miller, Network Development,
	Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, kernel-team, Linux Kernel Mailing List,
	Peter Zijlstra

From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you manage
> to offload the logic into arithmetics and not affect the control flow then that's
> a big win. The inherent branching will be hidden by executing on massively
> parallel arithmetics units which effectively execute everything fed to them in a
> single cycle.

Not necessarily; in real life they are likely to be a cache miss.

With the parallel execution units you just need to worry about the
register dependency chains.
In this case the 'adc' instructions have dependencies on both the result
register and the flags register.

The best loop might even be:
10:	adcq	0(%rdi,%rcx),%rax
	leaq	8(%rcx),%rcx
	jcxz	20f
	jmp	10b
20:

You then need to insert just enough copies of the adc so they take as long
as the other instructions.
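
For example (untested; jrcxz rather than jcxz since %rcx is 64-bit
here), with four adc per iteration:

10:	adcq	0(%rdi,%rcx),%rax
	adcq	8(%rdi,%rcx),%rax
	adcq	16(%rdi,%rcx),%rax
	adcq	24(%rdi,%rcx),%rax
	leaq	32(%rcx),%rcx
	jrcxz	20f
	jmp	10b
20: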

	David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 19:24     ` Tom Herbert
@ 2016-02-05  9:24       ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2016-02-05  9:24 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers, tglx, mingo,
	hpa, x86, Kernel Team, LKML, Peter Zijlstra, Linus Torvalds

* Tom Herbert <tom@herbertland.com> wrote:

> Thanks for the explanation and sample code. Expanding on your example, I added a 
> switch statement to perform the function (code below).

So I think your new switch() based testcase is broken in a subtle way.

The problem is that in your added testcase GCC effectively optimizes out the 
function call, because it is able to recognize that all the (trivial) functions 
are identical. (At least here, with GCC 5.2.1)
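
Roughly this shape, I mean (illustrative only, not your actual test
program): if every case body is an identical trivial function, GCC can
merge them, and the indirect jump always lands on the same target:

	static unsigned long fun_0(unsigned long x) { return x % 10; }
	static unsigned long fun_1(unsigned long x) { return x % 10; }
	/* ... fun_2 .. fun_7, all with identical bodies ... */

	unsigned long do_switch(unsigned long idx, unsigned long val)
	{
		switch (idx & 7) {
		case 0: return fun_0(val);
		case 1: return fun_1(val);
		/* ... */
		}
		return 0;
	}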

So all overhead of the workload goes into just that do_switch() function you 
added:

#
# Total Lost Samples: 0
#
# Samples: 5K of event 'cycles:pp'
# Event count (approx.): 5738959683
#
# Overhead  Command      Shared Object      Symbol                     
# ........  ...........  .................  ...........................
#
    99.94%  jump-table2  jump-table2        [.] do_switch              
     0.02%  jump-table2  [kernel.kallsyms]  [k] native_read_msr_safe   
     0.02%  jump-table2  [kernel.kallsyms]  [k] native_apic_mem_write  
     0.01%  jump-table2  [kernel.kallsyms]  [k] __fput                 
     0.00%  jump-table2  [kernel.kallsyms]  [k] change_protection_range
     0.00%  perf         [kernel.kallsyms]  [k] strrchr                
     0.00%  perf         [kernel.kallsyms]  [k] intel_pmu_handle_irq   
     0.00%  perf         [kernel.kallsyms]  [k] native_write_msr_safe  


... and per instruction overhead in the do_switch() looks like this:

         :
         :      Disassembly of section .text:
         :
         :      0000000000401100 <do_switch>:
         :      do_switch():
    0.00 :        401100:       mov    0x201f61(%rip),%r8        # 603068 <loops>
    0.00 :        401107:       test   %r8,%r8
    0.00 :        40110a:       je     401178 <do_switch+0x78>
    0.00 :        40110c:       mov    0x201f5d(%rip),%rdi        # 603070 <table_size>
    0.00 :        401113:       xor    %esi,%esi
    0.00 :        401115:       xor    %ecx,%ecx
    0.00 :        401117:       nopw   0x0(%rax,%rax,1)
    0.00 :        401120:       mov    0x201f61(%rip),%rax        # 603088 <global>
    1.46 :        401127:       xor    %edx,%edx
    0.00 :        401129:       add    $0x1,%rax
    0.00 :        40112d:       mov    %rax,0x201f54(%rip)        # 603088 <global>
    0.00 :        401134:       mov    0x201f4d(%rip),%rax        # 603088 <global>
   51.68 :        40113b:       div    %rdi
    0.00 :        40113e:       cmp    $0x3f,%rdx
    1.34 :        401142:       ja     40116b <do_switch+0x6b>
    0.02 :        401144:       mov    0x201f3d(%rip),%r9        # 603088 <global>
    0.10 :        40114b:       mov    0x201f36(%rip),%rax        # 603088 <global>
    6.91 :        401152:       jmpq   *0x401420(,%rdx,8)
    0.00 :        401159:       nopl   0x0(%rax)
    0.02 :        401160:       xor    %edx,%edx
   35.71 :        401162:       div    %r9
    1.07 :        401165:       add    %rdx,%r9
    1.69 :        401168:       add    %r9,%rsi
    0.00 :        40116b:       add    $0x1,%rcx
    0.00 :        40116f:       cmp    %r8,%rcx
    0.00 :        401172:       jne    401120 <do_switch+0x20>
    0.00 :        401174:       mov    %rsi,%rax
    0.00 :        401177:       retq   
    0.00 :        401178:       xor    %esi,%esi
    0.00 :        40117a:       jmp    401174 <do_switch+0x74>

Note that as you remarked there _is_ an indirect jump in there:

    6.91 :        401152:       jmpq   *0x401420(,%rdx,8)

But:

> gcc seems to be implementing this switch as a jump table:
> 
>    jmpq   *0x400ac8(,%rdx,8)

... the 'jump table' has identical entries:

  40141f:       00 60 11                add    %ah,0x11(%rax)
  401422:       40 00 00                add    %al,(%rax)
  401425:       00 00                   add    %al,(%rax)
  401427:       00 60 11                add    %ah,0x11(%rax)
  40142a:       40 00 00                add    %al,(%rax)
  40142d:       00 00                   add    %al,(%rax)
  40142f:       00 60 11                add    %ah,0x11(%rax)
  401432:       40 00 00                add    %al,(%rax)
  401435:       00 00                   add    %al,(%rax)
  401437:       00 60 11                add    %ah,0x11(%rax)
  40143a:       40 00 00                add    %al,(%rax)

(misaligned dump - jump table starts at 0x401420)

Every jump table entry points to 0x401160 - which is the code after the indirect 
jump in do_switch().

So the hardware predictor simply recognizes that it's the same target address all 
the time, and thus (understandably) performs much faster - none of the functions 
are actually executed.

That is why there are no function call entries in perf report, while in the 
function pointer case we get:

#
# Total Lost Samples: 0
#
# Samples: 4K of event 'cycles:pp'
# Event count (approx.): 5482523153
#
# Overhead  Command      Shared Object      Symbol                    
# ........  ...........  .................  ..........................
#
    13.47%  jump-table2  jump-table2        [.] fun_1                 
    13.22%  jump-table2  jump-table2        [.] fun_6                 
    13.12%  jump-table2  jump-table2        [.] fun_5                 
    12.87%  jump-table2  jump-table2        [.] fun_7                 
    12.23%  jump-table2  jump-table2        [.] fun_2                 
    12.09%  jump-table2  jump-table2        [.] fun_0                 
     8.23%  jump-table2  jump-table2        [.] do_funcptr            
     7.43%  jump-table2  jump-table2        [.] fun_3                 
     7.27%  jump-table2  jump-table2        [.] fun_4                 
     0.02%  jump-table2  [kernel.kallsyms]  [k] page_add_new_anon_rmap
     0.02%  jump-table2  [kernel.kallsyms]  [k] native_read_msr_safe  
     0.02%  jump-table2  [kernel.kallsyms]  [k] perf_pmu_sched_task   
     0.00%  jump-table2  [kernel.kallsyms]  [k] flush_tlb_mm_range    
     0.00%  perf         [kernel.kallsyms]  [k] _raw_spin_lock        
     0.00%  perf         [kernel.kallsyms]  [k] intel_bts_enable_local
     0.00%  perf         [kernel.kallsyms]  [k] native_write_msr_safe 

Same-target indirect jumps were optimized well starting from PentiumM IIRC, so 
this would perform well all across the x86 spectrum.

Btw., I think it's a bit weird that GCC didn't eliminate the switch jump table 
itself. (Maybe newer versions of GCC do?)

> Which is a different call than the function table implementation:
> 
>    callq  *(%rsp,%rdx,8)

> So the switch implementation does not seem to be causing branch-misses to be 
> counted and is a lot faster. This is also what I see when looking at the 
> branch-misses with the checksum code -- I haven't yet found any test case where
> thrashing lengths or alignments increases branch-misses and shows a performance
> degradation over the case where they are static.

That's because AFAICS there's no indirect branching in your testcase! :-)

And before you spend too much time on this, I do concede your general point: that 
jump tables are simpler than call tables, and that newer x86 uarchs are able to 
distinguish between multiple target branches pretty well indexed by a wide range 
of input registers, resulting in pretty good branch prediction.

Anyway - I think Linus's C based approach is vastly superior, so the jump tables 
vs. call tables topic is somewhat of a side track.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 22:43     ` Tom Herbert
  2016-02-04 22:57       ` Linus Torvalds
@ 2016-02-05  8:01       ` Ingo Molnar
  2016-02-05 10:07         ` David Laight
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2016-02-05  8:01 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linus Torvalds, David Miller, Network Development,
	Thomas Gleixner, Ingo Molnar, Peter Anvin,
	the arch/x86 maintainers, kernel-team, Linux Kernel Mailing List,
	Peter Zijlstra


* Tom Herbert <tom@herbertland.com> wrote:

> [....] gcc turns these switch statements into jump tables (not function tables 
> which is what Ingo's example code was using). [...]

So to the extent this still matters, on most x86 microarchitectures that count, 
jump tables and function call tables (i.e. virtual functions that C++ uses) are 
generally optimized by the same branch predictor hardware mechanism. Indirect 
jumps (jump tables) and indirect calls (function pointer tables) are very similar 
conceptually. That is why I posted the indirect calls test code.

( The only branching variant that will perform badly even on the latest uarchs is
  the indirect return, i.e. modifying the return address on the stack. )

So my narrow performance point stands, if any sort of indirect jump is used. They 
should be avoided if possible, because it's pretty hard for the hardware to get it 
right.

As Linus noticed, data lookup tables are the intelligent solution: if you manage 
to offload the logic into arithmetics and not affect the control flow then that's 
a big win. The inherent branching will be hidden by executing on massively 
parallel arithmetics units which effectively execute everything fed to them in a 
single cycle.

In any case, when submitting such patches, please get into the habit of looking at 
and posting perf stat output - it will give us a good idea about the quality of an 
implementation.
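
Something as simple as this gives a lot of information (the benchmark
binary name here is just a placeholder):

  perf stat -r 10 -d ./csum-bench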

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-05  1:27       ` Linus Torvalds
@ 2016-02-05  1:39         ` Linus Torvalds
  0 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-02-05  1:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tom Herbert, David Miller, Network Development, Thomas Gleixner,
	Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
	Linux Kernel Mailing List, Peter Zijlstra

On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>         sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
>         return rotate_by8_if_odd(sum, align);

Actually, that last word-sized access to "buff" might be past the end
of the buffer. The code does the right thing if "len" is zero, except
for the possible page fault or address verifier complaint.

So that very last call to "csum_partial_lt8()" either needs to be
conditional (easy enough to add an "if (len)" around that whole
statement) or the thing could be unconditional but the load needs to
use "load_unaligned_zeropad()" so that the exception is harmless.
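
I.e. (untested) either

        if (len)
                sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
        return rotate_by8_if_odd(sum, align);

or

        sum = csum_partial_lt8(load_unaligned_zeropad(buff), len, sum);
        return rotate_by8_if_odd(sum, align);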

It's probably immaterial which one you pick. The few possibly useless
ALU operations vs a possible branch mispredict penalty are probably
going to make it a wash. The exception will never happen in practice,
but if DEBUG_PAGEALLOC is enabled, or if something like KASAN is
active, it will complain loudly if it happens to go past the
allocation.

               Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 22:09     ` Linus Torvalds
@ 2016-02-05  1:27       ` Linus Torvalds
  2016-02-05  1:39         ` Linus Torvalds
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2016-02-05  1:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tom Herbert, David Miller, Network Development, Thomas Gleixner,
	Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
	Linux Kernel Mailing List, Peter Zijlstra

On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.

Ok, so I thought some more about this, and the fact is, we don't
actually want to do the byte shifting at all for the first case (the
"length < 8" case), since the address of that one hasn't been shifted.

it's only for the "we're going to align things to 8 bytes" case that
we would want to do it. But then we might as well use the
rotate_by8_if_odd() model, so I suspect the address games are just
entirely pointless.

So here is something that is actually tested (although admittedly not
well), and uses that fairly simple model.

NOTE! I did not do the unrolling of the "adcq" loop in the middle, but
that's a totally trivial thing now. So this isn't very optimized,
because it will do a *lot* of extra "adcq $0" to get rid of the carry
bit. But with that core loop unrolled, you'd get rid of most of them.

                  Linus

---
static unsigned long rotate_by8_if_odd(unsigned long sum, unsigned long aligned)
{
        asm("rorq %b1,%0"
                :"=r" (sum)
                :"c" ((aligned & 1) << 3), "0" (sum));
        return sum;
}
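
/*
 * Not defined above: add64_with_carry() would be the 64-bit analogue of
 * the existing add32_with_carry() helper - e.g. (untested sketch):
 */
static unsigned long add64_with_carry(unsigned long a, unsigned long b)
{
        asm("addq %2,%0 ; adcq $0,%0" : "=r" (a) : "0" (a), "rm" (b));
        return a;
}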

static unsigned long csum_partial_lt8(unsigned long val, int len,
unsigned long sum)
{
        unsigned long mask = (1ul << len*8)-1;
        val &= mask;
        return add64_with_carry(val, sum);
}

static unsigned long csum_partial_64(const void *buff, unsigned long
len, unsigned long sum)
{
        unsigned long align, val;

        // This is the only potentially unaligned access, and it can
        // also theoretically overflow into the next page
        val = load_unaligned_zeropad(buff);
        if (len < 8)
                return csum_partial_lt8(val, len, sum);

        align = 7 & -(unsigned long)buff;
        sum = csum_partial_lt8(val, align, sum);
        buff += align;
        len -= align;

        sum = rotate_by8_if_odd(sum, align);
        while (len >= 8) {
                val = *(unsigned long *) buff;
                sum = add64_with_carry(sum, val);
                buff += 8;
                len -= 8;
        }
        sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
        return rotate_by8_if_odd(sum, align);
}

__wsum csum_partial(const void *buff, unsigned long len, unsigned long sum)
{
        sum = csum_partial_64(buff, len, sum);
        return add32_with_carry(sum, sum >> 32);
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 22:43     ` Tom Herbert
@ 2016-02-04 22:57       ` Linus Torvalds
  2016-02-05  8:01       ` Ingo Molnar
  1 sibling, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-02-04 22:57 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Ingo Molnar, David Miller, Network Development, Thomas Gleixner,
	Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
	Linux Kernel Mailing List, Peter Zijlstra

On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert <tom@herbertland.com> wrote:
>
> The reason I did this in assembly is precisely your point about having
> to close the carry chains with adcq $0. I do have a first
> implementation in C which uses switch() to handle alignment, excess
> length less than 8 bytes, and the odd number of quads to sum in the
> main loop. gcc turns these switch statements into jump tables (not
> function tables, which is what Ingo's example code was using). The
> problem I hit was that for each case I needed to close the carry chain
> in the inline asm, so falling through wouldn't have much value and each
> case gets expanded. The C version using switch gave a nice performance
> gain; moving to all assembly was somewhat better.

Yeah, so I _think_ that if my approach works, the code generated for
the 0-8 byte case looks something like

    movq %rsi, %rax
    andl $1, %eax
    subq %rax, %rdi
    movq %rdx, %rax
    movq (%rdi), %rcx
    andq mask(,%rsi,8), %rcx
    addq %rcx,%rax
    adcq $0,%rax

which is pretty close to optimal (that's with my hopefully fixed version).

That's not with the final narrowing from 64-bit to 32-bit, but it
looks like it should perform well on most x86 cores, and then the only
remaining branches end up being the actual size checking ones.

And yes, you could avoid a few "adc $0" instructions in asm, but quite
frankly, I think that's a tradeoff that is worth it.

My guess is that you can get to within a percent of optimal with the C
approach, and I really think it makes it easier to try different
things (ie things like the above that avoids the switch table
entirely)

> There is also the question of alignment. If we really don't need to worry
> about alignment at all on x86, then we should be able to eliminate the
> complexity of dealing with it.

So most x86 alignment issues are about crossing the cache bank size,
which is usually 16 or 32 bytes. Unaligned accesses *within* one of
those banks should be basically free (there's a whopping big byte lane
shifter, so there's lots of hardware support for that).

Also, even when you do cross a cache bank, it's usually pretty cheap.
It takes extra cycles, but it's generally *fewer* extra cycles than it
would be to try to align things in software and do two accesses.

The rule of thumb is that you should never worry about _individual_
unaligned accesses. It's only really worth aligning things before
biggish loops. So aligning to an 8-byte boundary before you then do a
lot of "adcq" instructions makes sense, but worrying about unaligned
accesses for the beginning/end generally does not.

>>> For example, for the actual "8 bytes or shorter" case, I think
>> something like this might just work fine: [ snip ]
>
> I will look at doing that.

So I already found and fixed one bug in that approach, but I still
think it's a viable model.

But you will almost certainly have to fix a few more of my bugs before
it really works ;)

                   Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 21:46   ` Linus Torvalds
  2016-02-04 22:09     ` Linus Torvalds
@ 2016-02-04 22:43     ` Tom Herbert
  2016-02-04 22:57       ` Linus Torvalds
  2016-02-05  8:01       ` Ingo Molnar
  1 sibling, 2 replies; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 22:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, David Miller, Network Development, Thomas Gleixner,
	Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
	Linux Kernel Mailing List, Peter Zijlstra

On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I missed the original email (I don't have net-devel in my mailbox),
> but based on Ingo's quoting I have a more fundamental question:
>
> Why wasn't that done with C code instead of asm with odd numerical targets?
>
The reason I did this in assembly is precisely your point about having
to close the carry chains with adcq $0. I do have a first
implementation in C which uses switch() to handle alignment, excess
length less than 8 bytes, and the odd number of quads to sum in the
main loop. gcc turns these switch statements into jump tables (not
function tables, which is what Ingo's example code was using). The
problem I hit was that for each case I needed to close the carry chain
in the inline asm, so falling through wouldn't have much value and each
case gets expanded. The C version using switch gave a nice performance
gain; moving to all assembly was somewhat better.

There is also the question of alignment. If we really don't need to worry
about alignment at all on x86, then we should be able to eliminate the
complexity of dealing with it.

> It seems likely that the real issue is avoiding the short loops (that
> will cause branch prediction problems) and using a lookup table instead.
>
> But we can probably do better than that asm.
>
> For example, for the actual "8 bytes or shorter" case, I think
> something like this might just work fine:
>
>   unsigned long csum_partial_8orless(const unsigned char *buf,
> unsigned long len, unsigned long sum)
>   {
>         static const unsigned long mask[9] = {
>                 0x0000000000000000,
>                 0x000000000000ff00,
>                 0x000000000000ffff,
>                 0x00000000ff00ffff,
>                 0x00000000ffffffff,
>                 0x0000ff00ffffffff,
>                 0x0000ffffffffffff,
>                 0xff00ffffffffffff,
>                 0xffffffffffffffff };
>         unsigned long val = load_unaligned_zeropad(buf + (len & 1));
>         val &= mask[len];
>         asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
>         return sum;
>   }
>
I will look at doing that.

Thanks,
Tom

> NOTE! The above is 100% untested. But I _think_ it may handle the
> odd-byte-case correctly, and it should result in just one 8-byte load
> (the "load_unaligned_zeropad()" is just in case that ends up
> overflowing and we have page-alloc-debug triggering a page fault at
> the end). All without looping or any conditional branches that might
> mispredict.
>
> My point is that going to assembly results in pretty hard-to-read
> code, but it's also fairly inflexible. If we stay with C, we can much
> more easily play tricks. So for example, make the above an inline
> function, and now you can do things like this:
>
>   static inline unsigned long csum_partial_64bit(void *buf, unsigned
> long len, unsigned long sum)
>   {
>         if (len <= 8)
>                 return csum_partial_8orless(buf, len, sum);
>
>         /* We know it's larger than 8 bytes, so handle alignment */
>         align = 7 & -(unsigned long)buf;
>         sum = csum_partial_8orless(buf, align, sum);
>         buf += align;
>
>         /* We need to do the big-endian thing */
>         sum = rotate_by8_if_odd(sum, align);
>
>         /* main loop for big packets */
>         .. do the unrolled inline asm thing that we already do ..
>
>         sum = rotate_by8_if_odd(sum, align);
>
>         /* Handle the last bytes */
>         return csum_partial_8orless(buf, len, sum);
>   }
>
>   /* Fold the 64-bit sum we computed down to 32 bits __wsum */
>   __wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
>   {
>         unsigned long sum = csum_partial_64bit(ptr, len, partial);
>         asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
>         return sum;
>  }
>
> or something like that.
>
> NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
> "big-endian 16-bit add" thing right by doing the odd byte masking in
> the end cases, and by rotating the sum by 8 bits around the
> 8-byte-unrolled-loop, but I didn't test the above. It's literally
> written in my mail client. So I can almost guarantee that it is buggy,
> but it is meant as an *example* of "why not do it this way" rather
> than production code.
>
> I think that writing it in C and trying to be intelligent about it
> like the above would result in more maintainable code, and it is
> possible that it would even be faster.
>
> Yes, writing it in C *does* likely result in a few more cases of "adcq
> $0" in order to finish up the carry calculations. The *only* advantage
> of inline asm is how it allows you to keep the carry flag around. So
> there is downside to the C model, and it might cause a cycle or two of
> extra work, but the upside of C is that you can try to do clever
> things without turning the code completely unreadable.
>
> For example, doing the exception handling (that will never actually
> trigger) for the "let's just do a 8-byte load" is just painful in
> assembly. In C, we already have the helper function to do it.
>
> Hmm? Would somebody be willing to take the likely very buggy code
> above, and test it and make it work? I assume there's a test harness
> for the whole csum thing somewhere.
>
>                      Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 21:46   ` Linus Torvalds
@ 2016-02-04 22:09     ` Linus Torvalds
  2016-02-05  1:27       ` Linus Torvalds
  2016-02-04 22:43     ` Tom Herbert
  1 sibling, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2016-02-04 22:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tom Herbert, David Miller, Network Development, Thomas Gleixner,
	Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
	Linux Kernel Mailing List, Peter Zijlstra

On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>         static const unsigned long mask[9] = {
>                 0x0000000000000000,
>                 0x000000000000ff00,
>                 0x000000000000ffff,
>                 0x00000000ff00ffff,
>                 0x00000000ffffffff,
>                 0x0000ff00ffffffff,
>                 0x0000ffffffffffff,
>                 0xff00ffffffffffff,
>                 0xffffffffffffffff };
>         unsigned long val = load_unaligned_zeropad(buf + (len & 1));
>         val &= mask[len];

Yeah, that was buggy. I knew it was likely buggy,  but that "buf +
(len & 1)" was just stupid.

The "+" should be "-", of course - the point is to shift up the value
by 8 bits for odd cases, and we need to load starting one byte early
for that. The idea is that we use the byte shifter in the load unit to
do some work for us.

And the bitmasks are the wrong way around for the odd cases - it's the
low byte that ends up being bogus for those cases.

So it should probably look something like

        static const unsigned long mask[9] = {
                0x0000000000000000,
                0x000000000000ff00,
                0x000000000000ffff,
                0x00000000ffffff00,
                0x00000000ffffffff,
                0x0000ffffffffff00,
                0x0000ffffffffffff,
                0xffffffffffffff00,
                0xffffffffffffffff };
        unsigned long val = load_unaligned_zeropad(buf - (len & 1));
        val &= mask[len];

and then it *might* work.

Still entirely and utterly untested, I just decided to look at it a
bit more and noticed my thinko.

                Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04  9:30 ` Ingo Molnar
  2016-02-04 10:56   ` Ingo Molnar
@ 2016-02-04 21:46   ` Linus Torvalds
  2016-02-04 22:09     ` Linus Torvalds
  2016-02-04 22:43     ` Tom Herbert
  1 sibling, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2016-02-04 21:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Tom Herbert, David Miller, Network Development, Thomas Gleixner,
	Ingo Molnar, Peter Anvin, the arch/x86 maintainers, kernel-team,
	Linux Kernel Mailing List, Peter Zijlstra

I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting I have a more fundamental question:

Why wasn't that done with C code instead of asm with odd numerical targets?

It seems likely that the real issue is avoiding the short loops (that
will cause branch prediction problems) and using a lookup table instead.

But we can probably do better than that asm.

For example, for the actual "8 bytes or shorter" case, I think
something like this might just work fine:

  unsigned long csum_partial_8orless(const unsigned char *buf,
unsigned long len, unsigned long sum)
  {
        static const unsigned long mask[9] = {
                0x0000000000000000,
                0x000000000000ff00,
                0x000000000000ffff,
                0x00000000ff00ffff,
                0x00000000ffffffff,
                0x0000ff00ffffffff,
                0x0000ffffffffffff,
                0xff00ffffffffffff,
                0xffffffffffffffff };
        unsigned long val = load_unaligned_zeropad(buf + (len & 1));
        val &= mask[len];
        asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
        return sum;
  }

NOTE! The above is 100% untested. But I _think_ it may handle the
odd-byte-case correctly, and it should result in just one 8-byte load
(the "load_unaligned_zeropad()" is just in case that ends up
overflowing and we have page-alloc-debug triggering a page fault at
the end). All without looping or any conditional branches that might
mispredict.

My point is that going to assembly results in pretty hard-to-read
code, but it's also fairly inflexible. If we stay with C, we can much
more easily play tricks. So for example, make the above an inline
function, and now you can do things like this:

  static inline unsigned long csum_partial_64bit(void *buf, unsigned
long len, unsigned long sum)
  {
        if (len <= 8)
                return csum_partial_8orless(buf, len, sum);

        /* We know it's larger than 8 bytes, so handle alignment */
        align = 7 & -(unsigned long)buf;
        sum = csum_partial_8orless(buf, align, sum);
        buf += align;

        /* We need to do the big-endian thing */
        sum = rotate_by8_if_odd(sum, align);

        /* main loop for big packets */
        .. do the unrolled inline asm thing that we already do ..

        sum = rotate_by8_if_odd(sum, align);

        /* Handle the last bytes */
        return csum_partial_8orless(buf, len, sum);
  }

  /* Fold the 64-bit sum we computed down to 32 bits __wsum */
  __wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
  {
        unsigned long sum = csum_partial_64bit(ptr, len, partial);
        asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
        return sum;
 }

or something like that.

NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
"big-endian 16-bit add" thing right by doing the odd byte masking in
the end cases, and by rotating the sum by 8 bits around the
8-byte-unrolled-loop, but I didn't test the above. It's literally
written in my mail client. So I can almost guarantee that it is buggy,
but it is meant as an *example* of "why not do it this way" rather
than production code.

I think that writing it in C and trying to be intelligent about it
like the above would result in more maintainable code, and it is
possible that it would even be faster.

Yes, writing it in C *does* likely result in a few more cases of "adcq
$0" in order to finish up the carry calculations. The *only* advantage
of inline asm is how it allows you to keep the carry flag around. So
there is downside to the C model, and it might cause a cycle or two of
extra work, but the upside of C is that you can try to do clever
things without turning the code completely unreadable.

For example, doing the exception handling (that will never actually
trigger) for the "let's just do a 8-byte load" is just painful in
assembly. In C, we already have the helper function to do it.

Hmm? Would somebody be willing to take the likely very buggy code
above, and test it and make it work? I assume there's a test harness
for the whole csum thing somewhere.

                     Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 20:59         ` Tom Herbert
@ 2016-02-04 21:09           ` Alexander Duyck
  0 siblings, 0 replies; 29+ messages in thread
From: Alexander Duyck @ 2016-02-04 21:09 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Laight, davem, netdev, tglx, mingo, hpa, x86, kernel-team

On Thu, Feb 4, 2016 at 12:59 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Thu, Feb 4, 2016 at 9:09 AM, David Laight <David.Laight@aculab.com> wrote:
>> From: Tom Herbert
>> ...
>>> > If nothing else reducing the size of this main loop may be desirable.
>>> > I know the newer x86 is supposed to have a loop buffer so that it can
>>> > basically loop on already decoded instructions.  Normally it is only
>>> > something like 64 or 128 bytes in size though.  You might find that
>>> > reducing this loop to that smaller size may improve the performance
>>> > for larger payloads.
>>>
>>> I saw 128 to be better in my testing. For large packets this loop does
>>> all the work. I see performance dependent on the amount of loop
>>> overhead, i.e. we got it down to two non-adcq instructions but it is
>>> still noticeable. Also, this helps a lot on sizes up to 128 bytes
>>> since we only need to do single call in the jump table and no trip
>>> through the loop.
>>
>> But one of your 'loop overhead' instructions is 'loop'.
>> Look at http://www.agner.org/optimize/instruction_tables.pdf
>> you don't want to be using 'loop' on intel cpus.
>>
> I'm not following. We can replace loop with decl %ecx and jg, but why
> is that better?

Because loop takes something like 7 cycles whereas the decl/jg
approach takes 2 or 3.  It is probably one of the reasons things look
so much better with the loop unrolled.

- Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 17:09       ` David Laight
@ 2016-02-04 20:59         ` Tom Herbert
  2016-02-04 21:09           ` Alexander Duyck
  0 siblings, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 20:59 UTC (permalink / raw)
  To: David Laight
  Cc: Alexander Duyck, davem, netdev, tglx, mingo, hpa, x86, kernel-team

On Thu, Feb 4, 2016 at 9:09 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Tom Herbert
> ...
>> > If nothing else reducing the size of this main loop may be desirable.
>> > I know the newer x86 is supposed to have a loop buffer so that it can
>> > basically loop on already decoded instructions.  Normally it is only
>> > something like 64 or 128 bytes in size though.  You might find that
>> > reducing this loop to that smaller size may improve the performance
>> > for larger payloads.
>>
>> I saw 128 to be better in my testing. For large packets this loop does
>> all the work. I see performance dependent on the amount of loop
>> overhead, i.e. we got it down to two non-adcq instructions but it is
>> still noticeable. Also, this helps a lot on sizes up to 128 bytes
>> since we only need to do single call in the jump table and no trip
>> through the loop.
>
> But one of your 'loop overhead' instructions is 'loop'.
> Look at http://www.agner.org/optimize/instruction_tables.pdf
> you don't want to be using 'loop' on intel cpus.
>
I'm not following. We can replace loop with decl %ecx and jg, but why
is that better?

Tom

> You might get some benefit from pipelining the loop (so you do
> a read to register in one iteration and a register-register adc
> the next).
>
>         David
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 19:44   ` Tom Herbert
@ 2016-02-04 20:03     ` Alexander Duyck
  0 siblings, 0 replies; 29+ messages in thread
From: Alexander Duyck @ 2016-02-04 20:03 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Netdev, tglx, mingo, hpa, x86, Kernel Team

On Thu, Feb 4, 2016 at 11:44 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote:
>>> Implement assembly routine for csum_partial for 64 bit x86. This
>>> primarily speeds up checksum calculation for smaller lengths such as
>>> those that are present when doing skb_postpull_rcsum when getting
>>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
>>> conversion.
>>>
>>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
>>> we need to avoid unaligned accesses. Efficient unaligned accesses
>>> offer a nice additional speedup as demonstrated in the results
>>> provided below.
>>>
>>> This implementation is similar to csum_partial implemented in
>>> checksum_32.S, however since we are dealing with 8 bytes at a time
>>> there are more cases for small lengths (and alignments)-- for that
>>> we employ a jump table.
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary length buffer filled with
>>> random data. For each buffer I compared the computed checksum
>>> using the original algorithm for each possible alignment (0-7 bytes).
>>>
>>> Checksum performance:
>>>
>>> Isolating old and new implementation for some common cases:
>>>
>>>          Old      NewA     NewA %   NewNoA   NewNoA %
>>> Len/Aln  nsec     nsec     Improv   nsecs    Improve
>>> --------+-------+--------+-------+-------+---------------------
>>> 1400/0    192.9    175.1   10%     174.9     10%   (Big packet)
>>> 40/0      13.8     7.7     44%     5.7       58%   (Ipv6 hdr cmn case)
>>> 8/4       8.4      6.9     18%     2.8       67%   (UDP, VXLAN in IPv4)
>>> 14/0      10.5     7.3     30%     5.4       48%   (Eth hdr)
>>> 14/4      10.8     8.7     19%     5.4       50%   (Eth hdr in IPv4)
>>> 14/3      11.0     9.8     11%     5.6       49%   (Eth with odd align)
>>> 7/1       10.0     5.8     42%     4.8       52%   (buffer in one quad)
>>>
>>> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
>>> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>>>
>>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>>
>>> Also test on these with similar results:
>>>
>>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>>
>>> Branch prediction:
>>>
>>> To test the effects of poor branch prediction in the jump tables I
>>> tested checksum performance with runs for two combinations of length
>>> and alignment. As the baseline I performed the test by doing half of
>>> calls with the first combination, followed by using the second
>>> combination for the second half. In the test case, I interleave the
>>> two combinations so that in every call the length and alignment are
>>> different to defeat the effects of branch prediction. Running several
>>> cases, I did not see any material performance difference between
>>> the baseline and the interleaving test case.
>>>
>>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>>
>> So a couple of general questions about all this.
>>
>> First why do you break the alignment part out over so many reads?  It
>> seems like you could just get the offset, subtract that from your
>> starting buffer, read the aligned long, and then mask off the bits you
>> don't want.  That should take maybe 4 or 5 instructions which would
>> save you a bunch of jumping around since most of it is just
>> arithmetic.
>>
> This would require reading bytes before the buffer pointer back to an
> 8 byte offset. Is this *always* guaranteed to be safe? Similar
> question on reading bytes past the length.

I would think we should be fine as long as we are only confining
ourselves to the nearest long.  I would assume page and segmentation
boundaries are at least aligned by the size of a long.  As such I
don't see how reading a few extra bytes to mask off later should cause
any issue.
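
Roughly this (untested, little-endian only, names purely illustrative):

    unsigned long offset = (unsigned long)buf & 7;
    unsigned long val = *(const unsigned long *)(buf - offset); /* aligned load */

    val &= ~0UL << (8 * offset);    /* zero the bytes that precede buf */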

- Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 19:22 ` Alexander Duyck
  2016-02-04 19:31   ` Tom Herbert
@ 2016-02-04 19:44   ` Tom Herbert
  2016-02-04 20:03     ` Alexander Duyck
  1 sibling, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 19:44 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Netdev, tglx, mingo, hpa, x86, Kernel Team

On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
>> conversion.
>>
>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
>> we need to avoid unaligned accesses. Efficient unaligned accesses
>> offer a nice additional speedup as demonstrated in the results
>> provided below.
>>
>> This implementation is similar to csum_partial implemented in
>> checksum_32.S, however since we are dealing with 8 bytes at a time
>> there are more cases for small lengths (and alignments)-- for that
>> we employ a jump table.
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Checksum performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>>          Old      NewA     NewA %   NewNoA   NewNoA %
>> Len/Aln  nsec     nsec     Improv   nsecs    Improve
>> --------+-------+--------+-------+-------+---------------------
>> 1400/0    192.9    175.1   10%     174.9     10%   (Big packet)
>> 40/0      13.8     7.7     44%     5.7       58%   (Ipv6 hdr cmn case)
>> 8/4       8.4      6.9     18%     2.8       67%   (UDP, VXLAN in IPv4)
>> 14/0      10.5     7.3     30%     5.4       48%   (Eth hdr)
>> 14/4      10.8     8.7     19%     5.4       50%   (Eth hdr in IPv4)
>> 14/3      11.0     9.8     11%     5.6       49%   (Eth with odd align)
>> 7/1       10.0     5.8     42%     4.8       52%   (buffer in one quad)
>>
>> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
>> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Also test on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>
>> Branch prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> tested checksum performance with runs for two combinations of length
>> and alignment. As the baseline I performed the test by doing half of
>> calls with the first combination, followed by using the second
>> combination for the second half. In the test case, I interleave the
>> two combinations so that in every call the length and alignment are
>> different to defeat the effects of branch prediction. Running several
>> cases, I did not see any material performance difference between
>> the baseline and the interleaving test case.
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>
> So a couple of general questions about all this.
>
> First why do you break the alignment part out over so many reads?  It
> seems like you could just get the offset, subtract that from your
> starting buffer, read the aligned long, and then mask off the bits you
> don't want.  That should take maybe 4 or 5 instructions which would
> save you a bunch of jumping around since most of it is just
> arithmetic.
>
This would require reading bytes before the buffer pointer, back to the
previous 8-byte boundary. Is this *always* guaranteed to be safe? A
similar question applies to reading bytes past the length.

> I am still not sure about the loops, but as mentioned you probably
> don't want to use loop.  You would likely be better off generating a
> pointer representing the last valid offset and just running through
> the loop that way.
>
> Then on the last read the same question as the first.  Why not just do
> one read and mask the result.  It should be faster.
>
> Finally I am not sure if your result is safe if the offset for the
> buffer is an odd value.  The original code did a collapse to 16b and
> rotate by 8 if the offset was odd.  I don't see anything like that in
> your code.
>
> - Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 19:22 ` Alexander Duyck
@ 2016-02-04 19:31   ` Tom Herbert
  2016-02-04 19:44   ` Tom Herbert
  1 sibling, 0 replies; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 19:31 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Netdev, tglx, mingo, hpa, x86, Kernel Team

On Thu, Feb 4, 2016 at 11:22 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote:
>> Implement assembly routine for csum_partial for 64 bit x86. This
>> primarily speeds up checksum calculation for smaller lengths such as
>> those that are present when doing skb_postpull_rcsum when getting
>> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
>> conversion.
>>
>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
>> we need to avoid unaligned accesses. Efficient unaligned accesses
>> offer a nice additional speedup as demonstrated in the results
>> provided below.
>>
>> This implementation is similar to csum_partial implemented in
>> checksum_32.S, however since we are dealing with 8 bytes at a time
>> there are more cases for small lengths (and alignments)-- for that
>> we employ a jump table.
>>
>> Testing:
>>
>> Correctness:
>>
>> Verified correctness by testing arbitrary length buffer filled with
>> random data. For each buffer I compared the computed checksum
>> using the original algorithm for each possible alignment (0-7 bytes).
>>
>> Checksum performance:
>>
>> Isolating old and new implementation for some common cases:
>>
>>          Old      NewA     NewA %   NewNoA   NewNoA %
>> Len/Aln  nsec     nsec     Improv   nsecs    Improve
>> --------+-------+--------+-------+-------+---------------------
>> 1400/0    192.9    175.1   10%     174.9     10%   (Big packet)
>> 40/0      13.8     7.7     44%     5.7       58%   (Ipv6 hdr cmn case)
>> 8/4       8.4      6.9     18%     2.8       67%   (UDP, VXLAN in IPv4)
>> 14/0      10.5     7.3     30%     5.4       48%   (Eth hdr)
>> 14/4      10.8     8.7     19%     5.4       50%   (Eth hdr in IPv4)
>> 14/3      11.0     9.8     11%     5.6       49%   (Eth with odd align)
>> 7/1       10.0     5.8     42%     4.8       52%   (buffer in one quad)
>>
>> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
>> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>>
>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>
>> Also test on these with similar results:
>>
>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>
>> Branch prediction:
>>
>> To test the effects of poor branch prediction in the jump tables I
>> tested checksum performance with runs for two combinations of length
>> and alignment. As the baseline I performed the test by doing half of
>> calls with the first combination, followed by using the second
>> combination for the second half. In the test case, I interleave the
>> two combinations so that in every call the length and alignment are
>> different to defeat the effects of branch prediction. Running several
>> cases, I did not see any material performance difference between
>> the baseline and the interleaving test case.
>>
>> Signed-off-by: Tom Herbert <tom@herbertland.com>
>
> So a couple of general questions about all this.
>
> First why do you break the alignment part out over so many reads?  It
> seems like you could just get the offset, subtract that from your
> starting buffer, read the aligned long, and then mask off the bits you
> don't want.  That should take maybe 4 or 5 instructions which would
> save you a bunch of jumping around since most of it is just
> arithmetic.
>
I will look at doing that.

> I am still not sure about the loops, but as mentioned you probably
> don't want to use loop.  You would likely be better off generating a
> pointer representing the last valid offset and just running through
> the loop that way.
>
> Then on the last read the same question as the first.  Why not just do
> one read and mask the result.  It should be faster.
>
> Finally I am not sure if your result is safe if the offset for the
> buffer is an odd value.  The original code did a collapse to 16b and
> rotate by 8 if the offset was odd.  I don't see anything like that in
> your code.

The roll $8 is done in the RETURN macro. We only need to worry about
this when alignment matters.
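
For reference, a minimal C sketch of what that fixup amounts to
(illustration only, not the patch's code): end-around-carry addition
commutes with rotation, so a sum accumulated from an odd starting
address -- i.e. with the bytes in swapped lanes -- can be corrected
afterwards by rotating the 32-bit accumulator by 8 bits before the
final fold.

static unsigned int csum_fixup_odd(unsigned int sum, int odd_start)
{
        if (odd_start)
                sum = (sum << 8) | (sum >> 24); /* C equivalent of roll $8, %eax */
        return sum;
}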

Thanks,
Tom

>
> - Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 10:56   ` Ingo Molnar
@ 2016-02-04 19:24     ` Tom Herbert
  2016-02-05  9:24       ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 19:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David S. Miller, Linux Kernel Network Developers, tglx, mingo,
	hpa, x86, Kernel Team, LKML, Peter Zijlstra, Linus Torvalds

On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>
>> > +
>> > +   /* Check length */
>> > +10:        cmpl    $8, %esi
>> > +   jg      30f
>> > +   jl      20f
>> > +
>> > +   /* Exactly 8 bytes length */
>> > +   addl    (%rdi), %eax
>> > +   adcl    4(%rdi), %eax
>> > +   RETURN
>> > +
>> > +   /* Less than 8 bytes length */
>> > +20:        clc
>> > +   jmpq *branch_tbl_len(, %rsi, 8)
>> > +
>> > +   /* Greater than 8 bytes length. Determine number of quads (n). Sum
>> > +    * over first n % 16 quads
>> > +    */
>> > +30:        movl    %esi, %ecx
>> > +   shrl    $3, %ecx
>> > +   andl    $0xf, %ecx
>> > +   negq    %rcx
>> > +   lea     40f(, %rcx, 4), %r11
>> > +   clc
>> > +   jmp     *%r11
>>
>> Are you absolutely sure that a jump table is the proper solution here? It
>> defeats branch prediction on most x86 uarchs. Why not label the loop stages and
>> jump in directly with a binary tree of branches?
>
> So just to expand on this a bit, attached below is a quick & simple & stupid
> testcase that generates a 16 entries call table. (Indirect jumps and indirect
> calls are predicted similarly on most x86 uarchs.) Just built it with:
>
>   gcc -Wall -O2 -o jump-table jump-table.c
>
> Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel CPU
> and also on AMD Opteron. The numbers are from the Intel box.) this gives a high
> branch miss rate with a 16 entries jump table:
>
>  triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16
>  ... using 16 jump table entries.
>  ... using 100000000 loop iterations.
>  ... result: 10000000100000000
>  [...]
>
>  Performance counter stats for './jump-table 16' (10 runs):
>
>        1386.131780      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.18% )
>                 33      context-switches          #    0.024 K/sec                    ( +-  1.71% )
>                  0      cpu-migrations            #    0.000 K/sec
>                 52      page-faults               #    0.037 K/sec                    ( +-  0.71% )
>      6,247,215,683      cycles                    #    4.507 GHz                      ( +-  0.18% )
>      3,895,337,877      stalled-cycles-frontend   #   62.35% frontend cycles idle     ( +-  0.30% )
>      1,404,014,996      instructions              #    0.22  insns per cycle
>                                                   #    2.77  stalled cycles per insn  ( +-  0.02% )
>        300,820,988      branches                  #  217.022 M/sec                    ( +-  0.02% )
>         87,518,741      branch-misses             #   29.09% of all branches          ( +-  0.01% )
>
>        1.385240076 seconds time elapsed                                          ( +-  0.21% )
>
> ... as you can see the branch miss rate is very significant, causing a stalled
> decoder and very low instruction throughput.
>
> I have to reduce the jump table to a single entry (!) to get good performance on
> this CPU:
>
>  Performance counter stats for './jump-table 1' (10 runs):
>
>         739.173505      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.26% )
>                 37      context-switches          #    0.051 K/sec                    ( +- 16.79% )
>                  0      cpu-migrations            #    0.000 K/sec
>                 52      page-faults               #    0.070 K/sec                    ( +-  0.41% )
>      3,331,328,405      cycles                    #    4.507 GHz                      ( +-  0.26% )
>      2,012,973,596      stalled-cycles-frontend   #   60.43% frontend cycles idle     ( +-  0.47% )
>      1,403,880,792      instructions              #    0.42  insns per cycle
>                                                   #    1.43  stalled cycles per insn  ( +-  0.05% )
>        300,817,064      branches                  #  406.964 M/sec                    ( +-  0.05% )
>             12,177      branch-misses             #    0.00% of all branches          ( +- 12.39% )
>
>        0.738616356 seconds time elapsed                                          ( +-  0.26% )
>
> Note how the runtime got halved: that is because stalls got halved and the
> instructions per cycle throughput doubled.
>
> Even a two entries jump table performs poorly:
>
>  Performance counter stats for './jump-table 2' (10 runs):
>
>        1493.790686      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.06% )
>                 39      context-switches          #    0.026 K/sec                    ( +-  4.73% )
>                  0      cpu-migrations            #    0.000 K/sec
>                 52      page-faults               #    0.035 K/sec                    ( +-  0.26% )
>      6,732,372,612      cycles                    #    4.507 GHz                      ( +-  0.06% )
>      4,229,130,302      stalled-cycles-frontend   #   62.82% frontend cycles idle     ( +-  0.09% )
>      1,407,803,145      instructions              #    0.21  insns per cycle
>                                                   #    3.00  stalled cycles per insn  ( +-  0.08% )
>        301,688,946      branches                  #  201.962 M/sec                    ( +-  0.09% )
>        100,022,150      branch-misses             #   33.15% of all branches          ( +-  0.00% )
>
>        1.492820524 seconds time elapsed                                          ( +-  0.08% )
>
> In fact it's worse than the 16 entries runtime.
>
> ( Note that Intel SkyLake improved the indirect jump/call branch predictor.
>   On another Intel CPU I have the boundary between good and bad prediction is
>   at 4-6 entries. So this is highly uarch (and code layout) dependent. )
>
> In contrast, a static hierarchy/tree of branches is predicted a lot better on most
> x86 uarchs, even with highly variable input. (This does not even count the extra
> calculations needed to calculate the jump table index, which, as you coded it in
> your patch, add 2-3 cycles of extra latency as well.)
>
> Such branch miss characteristics matter and they can become more visible with
> smaller skb sizes.
>
> So I'm generally sceptical of jump tables and I'd like to see careful and
> convincing perf stat runs on modern x86 uarchs, comparing the jump table solution
> to a branch hierarchy solution, before accepting a jump table solution into the
> x86 architecture.
>
> x86 uarchs are generally good at predicting function pointer indirect jumps (which
> tend to be static) - multi-target jump tables are a lot harder nut to crack,
> especially if the jump index is calculated right before the jump is performed,
> like your patch does it.
>
> The impact of branch misses is more pronounced on modern Intel CPUs, because
> speculation is deeper.
>
> But as always I can be convinced with contrary numbers! :-)
>
Hi Ingo,

Thanks for the explanation and sample code. Expanding on your example,
I added a switch statement that performs the same function (code below).

gcc seems to be implementing this switch as a jump table:

   jmpq   *0x400ac8(,%rdx,8)

This is a different kind of indirect branch than the indirect call used
by the function pointer table implementation:

   callq  *(%rsp,%rdx,8)

The switch jump table method is what I coded in the assembly for the
checksum routine (I basically just copied the assembly from what gcc
was doing with the C code for the same function).

Running using the functptr and switch table gives:

taskset 1 perf stat --repeat 10 ./jump_table -S 16
... using funcptr
... result: 10000000100000000

 Performance counter stats for './jump_table -S 16' (10 runs):

       2318.192392      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.02% )
                 9      context-switches          #    0.004 K/sec                    ( +-  9.51% )
                 1      cpu-migrations            #    0.000 K/sec                    ( +- 68.31% )
               132      page-faults               #    0.057 K/sec                    ( +-  0.10% )
     6,328,766,014      cycles                    #    2.730 GHz                      ( +-  0.04% ) [49.99%]
     3,325,346,741      stalled-cycles-frontend   #   52.54% frontend cycles idle     ( +-  0.10% ) [50.01%]
     1,793,615,301      stalled-cycles-backend    #   28.34% backend cycles idle      ( +-  2.84% ) [50.04%]
     1,509,733,276      instructions              #    0.24  insns per cycle
                                                  #    2.20  stalled cycles per insn  ( +-  0.02% ) [50.03%]
       301,788,275      branches                  #  130.183 M/sec                    ( +-  0.01% ) [50.01%]
       100,030,120      branch-misses             #   33.15% of all branches          ( +-  0.01% ) [49.98%]

       2.320510795 seconds time elapsed                                          ( +-  1.02% )

taskset 1 perf stat --repeat 10 ./jump_table -S 16 -x
... using switch
... result: 10000000100000000

 Performance counter stats for './jump_table -S 16 -x' (10 runs):

        942.117858      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.07% )
                 5      context-switches          #    0.005 K/sec                    ( +- 13.04% )
                 2      cpu-migrations            #    0.003 K/sec                    ( +- 42.27% )
               132      page-faults               #    0.140 K/sec                    ( +-  0.12% )
     2,521,873,028      cycles                    #    2.677 GHz                      ( +-  0.09% ) [50.01%]
     1,021,412,581      stalled-cycles-frontend   #   40.50% frontend cycles idle     ( +-  0.17% ) [50.06%]
       255,591,030      stalled-cycles-backend    #   10.13% backend cycles idle      ( +- 13.11% ) [50.10%]
     1,104,081,483      instructions              #    0.44  insns per cycle
                                                  #    0.93  stalled cycles per insn  ( +-  0.01% ) [50.05%]
       300,713,493      branches                  #  319.189 M/sec                    ( +-  0.01% ) [49.98%]
            11,500      branch-misses             #    0.00% of all branches          ( +- 12.18% ) [49.95%]

       0.943088132 seconds time elapsed                                          ( +-  1.07% )

So the switch implementation does not appear to incur branch-misses
and is a lot faster. This is also what I see when looking at the
branch-misses with the checksum code -- I haven't yet found any test
case where thrashing lengths or alignments increases branch-misses and
shows a performance degradation over the case where they are static.

Tom

/*
 * Simple testcase to generate jump table usage.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>

typedef unsigned long (fn_t) (unsigned long param);

#define TABLE_SIZE_MAX 16L

unsigned long global;
unsigned long table_size = TABLE_SIZE_MAX;
unsigned long loops = 100000000;
int doswitch = 0;

#define DEFINE_FUN(x)                           \
                                                \
unsigned long fun_##x(unsigned long param)      \
{                                               \
        return param+global;                    \
}

DEFINE_FUN( 0);
DEFINE_FUN( 1);
DEFINE_FUN( 2);
DEFINE_FUN( 3);
DEFINE_FUN( 4);
DEFINE_FUN( 5);
DEFINE_FUN( 6);
DEFINE_FUN( 7);
DEFINE_FUN( 8);
DEFINE_FUN( 9);
DEFINE_FUN(10);
DEFINE_FUN(11);
DEFINE_FUN(12);
DEFINE_FUN(13);
DEFINE_FUN(14);
DEFINE_FUN(15);

static void usage(void)
{
                printf("usage: jump-table [-S <table_size>] [-L
<loops> ] [-h] [-x]\n");
                exit(0);
}

unsigned long do_funcptr(void)
{
        int i;
        unsigned long local = 0;
        fn_t *fn_ptr [TABLE_SIZE_MAX] = {        fun_0 , fun_1 , fun_2 , fun_3 ,
                                                 fun_4 , fun_5 , fun_6 , fun_7 ,
                                                 fun_8 , fun_9, fun_10, fun_11,
                                                 fun_12, fun_13, fun_14, fun_15,
                                        };

        /* Call a set of simple arithmetic functions from a jump table: */
        for (i = 0; i < loops; i++) {
                global++;
                local += fn_ptr[global % table_size](global);
        }

        return local;
}

unsigned long do_switch(void)
{
        int i;
        unsigned long local = 0;

        /* Call a set of simple arithmetic functions via switch: */


#define CASE(X) case X:                                 \
                        local += fun_##X(global);       \
                        break;

        for (i = 0; i < loops; i++) {
                global++;
                switch (global % table_size) {
                CASE(0) CASE(1) CASE(2) CASE(3) CASE(4) CASE(5) CASE(6) CASE(7)
                CASE(8) CASE(9) CASE(10) CASE(11) CASE(12) CASE(13) CASE(14) CASE(15)
                }
        }

        return local;
}

int main(int argc, char **argv)
{
        unsigned long local = 0;
        int c;

        while ((c = getopt(argc, argv, "hS:xL:")) != -1)
                switch (c) {
                case 'S':
                        table_size = atoi(optarg);
                        if (table_size <= 1)
                                table_size = 1;
                        if (table_size > TABLE_SIZE_MAX)
                                table_size = TABLE_SIZE_MAX;
                        break;
                case 'x':
                        doswitch = 1;
                        break;
                case 'L':
                        loops = atoi(optarg);
                        break;
                case 'h':
                default:
                        usage();
        }

        printf("... using %lu jump table entries.\n", table_size);
        printf("... using %lu loop iterations.\n", loops);

        if (doswitch) {
                printf("... using switch\n");
        } else {
                printf("... using funcptr\n");
        }

        local = doswitch ? do_switch() : do_funcptr();

        printf("... result: %lu\n", local);
        return 0;
}

> Thanks,
>
>         Ingo
>
> -------------------------------------->
> /*
>  * Simple testcase to generate jump table usage.
>  */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> typedef unsigned long (fn_t) (unsigned long param);
>
> unsigned long global;
>
> #define DEFINE_FUN(x)                           \
>                                                 \
> unsigned long fun_##x(unsigned long param)      \
> {                                               \
>         return param+global;                    \
> }
>
> #define TABLE_SIZE_MAX 16L
>
> DEFINE_FUN( 1);
> DEFINE_FUN( 2);
> DEFINE_FUN( 3);
> DEFINE_FUN( 4);
> DEFINE_FUN( 5);
> DEFINE_FUN( 6);
> DEFINE_FUN( 7);
> DEFINE_FUN( 8);
> DEFINE_FUN( 9);
> DEFINE_FUN(10);
> DEFINE_FUN(11);
> DEFINE_FUN(12);
> DEFINE_FUN(13);
> DEFINE_FUN(14);
> DEFINE_FUN(15);
> DEFINE_FUN(16);
>
> int main(int argc, char **argv)
> {
>         fn_t *fn_ptr [TABLE_SIZE_MAX] = {        fun_1 , fun_2 , fun_3 , fun_4 ,
>                                                  fun_5 , fun_6 , fun_7 , fun_8 ,
>                                                  fun_9 , fun_10, fun_11, fun_12,
>                                                  fun_13, fun_14, fun_15, fun_16,
>                                                                                         };
>         unsigned long local = 0;
>         unsigned long i;
>         unsigned long table_size = TABLE_SIZE_MAX;
>         unsigned long loops = 100000000;
>
>
>         if ((argc >= 2 && !strcmp(argv[1], "-h")) || argc >= 4) {
>                 printf("usage: jump-table <table_size(1-%lu)(default: %lu)> <loops(default: %lu)>\n", TABLE_SIZE_MAX, TABLE_SIZE_MAX, loops);
>                 exit(0);
>         }
>
>         if (argc >= 2) {
>                 table_size = atol(argv[1]);
>                 if (table_size <= 1)
>                         table_size = 1;
>                 if (table_size > TABLE_SIZE_MAX)
>                         table_size = TABLE_SIZE_MAX;
>         }
>         printf("... using %lu jump table entries.\n", table_size);
>
>         if (argc >= 3)
>                 loops = atol(argv[2]);
>         printf("... using %lu loop iterations.\n", loops);
>
>         /* Call a set of simple arithmetic functions from a jump table: */
>         for (i = 0; i < loops; i++) {
>                 global++;
>                 local += fn_ptr[global % table_size](global);
>         }
>
>         printf("... result: %lu\n", local);
> }

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-03 19:18 Tom Herbert
  2016-02-04  9:30 ` Ingo Molnar
  2016-02-04 11:08 ` David Laight
@ 2016-02-04 19:22 ` Alexander Duyck
  2016-02-04 19:31   ` Tom Herbert
  2016-02-04 19:44   ` Tom Herbert
  2 siblings, 2 replies; 29+ messages in thread
From: Alexander Duyck @ 2016-02-04 19:22 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Netdev, tglx, mingo, hpa, x86, Kernel Team

On Wed, Feb 3, 2016 at 11:18 AM, Tom Herbert <tom@herbertland.com> wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.
>
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
> we need to avoid unaligned accesses. Efficient unaligned accesses
> offer a nice additional speedup as demonstrated in the results
> provided below.
>
> This implementation is similar to csum_partial implemented in
> checksum_32.S, however since we are dealing with 8 bytes at a time
> there are more cases for small lengths (and alignments)-- for that
> we employ a jump table.
>
> Testing:
>
> Correctness:
>
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
>
> Checksum performance:
>
> Isolating old and new implementation for some common cases:
>
>          Old      NewA     NewA %   NewNoA   NewNoA %
> Len/Aln  nsec     nsec     Improv   nsecs    Improve
> --------+-------+--------+-------+-------+---------------------
> 1400/0    192.9    175.1   10%     174.9     10%   (Big packet)
> 40/0      13.8     7.7     44%     5.7       58%   (Ipv6 hdr cmn case)
> 8/4       8.4      6.9     18%     2.8       67%   (UDP, VXLAN in IPv4)
> 14/0      10.5     7.3     30%     5.4       48%   (Eth hdr)
> 14/4      10.8     8.7     19%     5.4       50%   (Eth hdr in IPv4)
> 14/3      11.0     9.8     11%     5.6       49%   (Eth with odd align)
> 7/1       10.0     5.8     42%     4.8       52%   (buffer in one quad)
>
> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
>
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>
> Also test on these with similar results:
>
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> Branch prediction:
>
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between
> the baseline and the interleaving test case.
>
> Signed-off-by: Tom Herbert <tom@herbertland.com>

So a couple of general questions about all this.

First why do you break the alignment part out over so many reads?  It
seems like you could just get the offset, subtract that from your
starting buffer, read the aligned long, and then mask off the bits you
don't want.  That should take maybe 4 or 5 instructions which would
save you a bunch of jumping around since most of it is just
arithmetic.

I am still not sure about the loops, but as mentioned you probably
don't want to use loop.  You would likely be better off generating a
pointer representing the last valid offset and just running through
the loop that way.

Then on the last read the same question as the first.  Why not just do
one read and mask the result.  It should be faster.
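
A rough sketch of that one-read-and-mask tail (illustration only, not
the patch's code; 'rem' is the 1..7 valid bytes left at the end and
little-endian is assumed):

static unsigned long tail_quad(const unsigned long *last_quad, unsigned int rem)
{
        unsigned long word = *last_quad;

        /* Keep the low 'rem' bytes, clear whatever lies past the length. */
        return word & (~0UL >> ((8 - rem) * 8));
}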

Finally I am not sure if your result is safe if the offset for the
buffer is an odd value.  The original code did a collapse to 16b and
rotate by 8 if the offset was odd.  I don't see anything like that in
your code.

- Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 16:58     ` Tom Herbert
@ 2016-02-04 17:09       ` David Laight
  2016-02-04 20:59         ` Tom Herbert
  0 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2016-02-04 17:09 UTC (permalink / raw)
  To: 'Tom Herbert', Alexander Duyck
  Cc: davem, netdev, tglx, mingo, hpa, x86, kernel-team

From: Tom Herbert
...
> > If nothing else reducing the size of this main loop may be desirable.
> > I know the newer x86 is supposed to have a loop buffer so that it can
> > basically loop on already decoded instructions.  Normally it is only
> > something like 64 or 128 bytes in size though.  You might find that
> > reducing this loop to that smaller size may improve the performance
> > for larger payloads.
>
> I saw 128 to be better in my testing. For large packets this loop does
> all the work. Performance depends on the amount of loop overhead,
> i.e. we got it down to two non-adcq instructions but it is still
> noticeable. Also, this helps a lot on sizes up to 128 bytes since we
> only need a single call through the jump table and no trip through
> the loop.

But one of your 'loop overhead' instructions is 'loop'.
Look at http://www.agner.org/optimize/instruction_tables.pdf
you don't want to be using 'loop' on intel cpus.
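
As a rough sketch of a LOOP-less inner loop (a guess for illustration,
not the posted patch; the function name is made up): DEC leaves CF
untouched and LEA writes no flags, so the ADCQ carry chain survives
across iterations.

static unsigned long csum_quads(const unsigned long *p, unsigned long n,
                                unsigned long sum)
{
        /* n must be non-zero and p must point at n * 8 readable bytes */
        asm("clc\n\t"
            "1:\n\t"
            "adcq   (%[p]), %[sum]\n\t"
            "lea    8(%[p]), %[p]\n\t"
            "dec    %[n]\n\t"
            "jnz    1b\n\t"
            "adcq   $0, %[sum]"         /* fold in the final carry */
            : [sum] "+r" (sum), [p] "+r" (p), [n] "+r" (n)
            :
            : "memory", "cc");
        return sum;
}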

You might get some benefit from pipelining the loop (so you do
a read to register in one iteration and a register-register adc
the next).

	David


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 16:51   ` Alexander Duyck
@ 2016-02-04 16:58     ` Tom Herbert
  2016-02-04 17:09       ` David Laight
  0 siblings, 1 reply; 29+ messages in thread
From: Tom Herbert @ 2016-02-04 16:58 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, davem, netdev, tglx, mingo, hpa, x86, kernel-team

On Thu, Feb 4, 2016 at 8:51 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 3:08 AM, David Laight <David.Laight@aculab.com> wrote:
>> From: Tom Herbert
>>> Sent: 03 February 2016 19:19
>> ...
>>> +     /* Main loop */
>>> +50:  adcq    0*8(%rdi),%rax
>>> +     adcq    1*8(%rdi),%rax
>>> +     adcq    2*8(%rdi),%rax
>>> +     adcq    3*8(%rdi),%rax
>>> +     adcq    4*8(%rdi),%rax
>>> +     adcq    5*8(%rdi),%rax
>>> +     adcq    6*8(%rdi),%rax
>>> +     adcq    7*8(%rdi),%rax
>>> +     adcq    8*8(%rdi),%rax
>>> +     adcq    9*8(%rdi),%rax
>>> +     adcq    10*8(%rdi),%rax
>>> +     adcq    11*8(%rdi),%rax
>>> +     adcq    12*8(%rdi),%rax
>>> +     adcq    13*8(%rdi),%rax
>>> +     adcq    14*8(%rdi),%rax
>>> +     adcq    15*8(%rdi),%rax
>>> +     lea     128(%rdi), %rdi
>>> +     loop    50b
>>
>> I'd need convincing that unrolling the loop like that gives any significant gain.
>> You have a dependency chain on the carry flag so have delays between the 'adcq'
>> instructions (these may be more significant than the memory reads from l1 cache).
>>
>> I also don't remember (might be wrong) the 'loop' instruction being executed quickly.
>> If 'loop' is fast then you will probably find that:
>>
>> 10:     adcq 0(%rdi),%rax
>>         lea  8(%rdi),%rdi
>>         loop 10b
>>
>> is just as fast since the three instructions could all be executed in parallel.
>> But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as
>> a single micro-op).
>> IIRC 'adc' and 'dec' will both have dependencies on the flags register
>> so cannot execute together (which is a shame here).
>>
>> It is also possible that breaking the carry-chain dependency by doing 32bit
>> adds (possibly after 64bit reads) can be made to be faster.
>
> If nothing else reducing the size of this main loop may be desirable.
> I know the newer x86 is supposed to have a loop buffer so that it can
> basically loop on already decoded instructions.  Normally it is only
> something like 64 or 128 bytes in size though.  You might find that
> reducing this loop to that smaller size may improve the performance
> for larger payloads.
>
I saw 128 to be better in my testing. For large packets this loop does
all the work. Performance depends on the amount of loop overhead,
i.e. we got it down to two non-adcq instructions but it is still
noticeable. Also, this helps a lot on sizes up to 128 bytes since we
only need a single call through the jump table and no trip through
the loop.

Tom

> - Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04 11:08 ` David Laight
@ 2016-02-04 16:51   ` Alexander Duyck
  2016-02-04 16:58     ` Tom Herbert
  0 siblings, 1 reply; 29+ messages in thread
From: Alexander Duyck @ 2016-02-04 16:51 UTC (permalink / raw)
  To: David Laight
  Cc: Tom Herbert, davem, netdev, tglx, mingo, hpa, x86, kernel-team

On Thu, Feb 4, 2016 at 3:08 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Tom Herbert
>> Sent: 03 February 2016 19:19
> ...
>> +     /* Main loop */
>> +50:  adcq    0*8(%rdi),%rax
>> +     adcq    1*8(%rdi),%rax
>> +     adcq    2*8(%rdi),%rax
>> +     adcq    3*8(%rdi),%rax
>> +     adcq    4*8(%rdi),%rax
>> +     adcq    5*8(%rdi),%rax
>> +     adcq    6*8(%rdi),%rax
>> +     adcq    7*8(%rdi),%rax
>> +     adcq    8*8(%rdi),%rax
>> +     adcq    9*8(%rdi),%rax
>> +     adcq    10*8(%rdi),%rax
>> +     adcq    11*8(%rdi),%rax
>> +     adcq    12*8(%rdi),%rax
>> +     adcq    13*8(%rdi),%rax
>> +     adcq    14*8(%rdi),%rax
>> +     adcq    15*8(%rdi),%rax
>> +     lea     128(%rdi), %rdi
>> +     loop    50b
>
> I'd need convincing that unrolling the loop like that gives any significant gain.
> You have a dependency chain on the carry flag so have delays between the 'adcq'
> instructions (these may be more significant than the memory reads from l1 cache).
>
> I also don't remember (might be wrong) the 'loop' instruction being executed quickly.
> If 'loop' is fast then you will probably find that:
>
> 10:     adcq 0(%rdi),%rax
>         lea  8(%rdi),%rdi
>         loop 10b
>
> is just as fast since the three instructions could all be executed in parallel.
> But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as
> a single micro-op).
> IIRC 'adc' and 'dec' will both have dependencies on the flags register
> so cannot execute together (which is a shame here).
>
> It is also possible that breaking the carry-chain dependency by doing 32bit
> adds (possibly after 64bit reads) can be made to be faster.

If nothing else reducing the size of this main loop may be desirable.
I know the newer x86 is supposed to have a loop buffer so that it can
basically loop on already decoded instructions.  Normally it is only
something like 64 or 128 bytes in size though.  You might find that
reducing this loop to that smaller size may improve the performance
for larger payloads.

- Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-03 19:18 Tom Herbert
  2016-02-04  9:30 ` Ingo Molnar
@ 2016-02-04 11:08 ` David Laight
  2016-02-04 16:51   ` Alexander Duyck
  2016-02-04 19:22 ` Alexander Duyck
  2 siblings, 1 reply; 29+ messages in thread
From: David Laight @ 2016-02-04 11:08 UTC (permalink / raw)
  To: 'Tom Herbert', davem, netdev; +Cc: tglx, mingo, hpa, x86, kernel-team

From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> +	/* Main loop */
> +50:	adcq	0*8(%rdi),%rax
> +	adcq	1*8(%rdi),%rax
> +	adcq	2*8(%rdi),%rax
> +	adcq	3*8(%rdi),%rax
> +	adcq	4*8(%rdi),%rax
> +	adcq	5*8(%rdi),%rax
> +	adcq	6*8(%rdi),%rax
> +	adcq	7*8(%rdi),%rax
> +	adcq	8*8(%rdi),%rax
> +	adcq	9*8(%rdi),%rax
> +	adcq	10*8(%rdi),%rax
> +	adcq	11*8(%rdi),%rax
> +	adcq	12*8(%rdi),%rax
> +	adcq	13*8(%rdi),%rax
> +	adcq	14*8(%rdi),%rax
> +	adcq	15*8(%rdi),%rax
> +	lea	128(%rdi), %rdi
> +	loop	50b

I'd need convincing that unrolling the loop like that gives any significant gain.
You have a dependency chain on the carry flag so have delays between the 'adcq'
instructions (these may be more significant than the memory reads from l1 cache).

I also don't remember (might be wrong) the 'loop' instruction being executed quickly.
If 'loop' is fast then you will probably find that:

10:	adcq 0(%rdi),%rax
	lea  8(%rdi),%rdi
	loop 10b

is just as fast since the three instructions could all be executed in parallel.
But I suspect that 'dec %cx; jnz 10b' is actually better (and might execute as
a single micro-op).
IIRC 'adc' and 'dec' will both have dependencies on the flags register
so cannot execute together (which is a shame here).

It is also possible that breaking the carry-chain dependency by doing 32bit
adds (possibly after 64bit reads) can be made to be faster.
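
A plain-C sketch of that last idea (illustration only, not proposed
code): do 64-bit reads but add the two 32-bit halves into a 64-bit
accumulator with ordinary adds, so there is no carry chain at all until
the final fold.

static unsigned long long sum32_halves(const unsigned long long *p,
                                       unsigned long nquads)
{
        unsigned long long acc = 0;
        unsigned long i;

        for (i = 0; i < nquads; i++) {
                unsigned long long q = p[i];    /* one 64-bit read */
                acc += (unsigned int)q;         /* low 32 bits */
                acc += (unsigned int)(q >> 32); /* high 32 bits */
        }
        /* Carries pile up in the high half of acc; fold them in afterwards. */
        return acc;
}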

	David

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-04  9:30 ` Ingo Molnar
@ 2016-02-04 10:56   ` Ingo Molnar
  2016-02-04 19:24     ` Tom Herbert
  2016-02-04 21:46   ` Linus Torvalds
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2016-02-04 10:56 UTC (permalink / raw)
  To: Tom Herbert
  Cc: davem, netdev, tglx, mingo, hpa, x86, kernel-team, linux-kernel,
	Peter Zijlstra, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> 
> > +
> > +	/* Check length */
> > +10:	cmpl	$8, %esi
> > +	jg	30f
> > +	jl	20f
> > +
> > +	/* Exactly 8 bytes length */
> > +	addl	(%rdi), %eax
> > +	adcl	4(%rdi), %eax
> > +	RETURN
> > +
> > +	/* Less than 8 bytes length */
> > +20:	clc
> > +	jmpq *branch_tbl_len(, %rsi, 8)
> > +
> > +	/* Greater than 8 bytes length. Determine number of quads (n). Sum
> > +	 * over first n % 16 quads
> > +	 */
> > +30:	movl	%esi, %ecx
> > +	shrl	$3, %ecx
> > +	andl	$0xf, %ecx
> > +	negq	%rcx
> > +	lea	40f(, %rcx, 4), %r11
> > +	clc
> > +	jmp	*%r11
> 
> Are you absolutely sure that a jump table is the proper solution here? It 
> defeats branch prediction on most x86 uarchs. Why not label the loop stages and 
> jump in directly with a binary tree of branches?

So just to expand on this a bit, attached below is a quick & simple & stupid 
testcase that generates a 16 entries call table. (Indirect jumps and indirect 
calls are predicted similarly on most x86 uarchs.) Just built it with:

  gcc -Wall -O2 -o jump-table jump-table.c

Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel CPU 
and also on AMD Opteron. The numbers are from the Intel box.) this gives a high 
branch miss rate with a 16 entries jump table:

 triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16
 ... using 16 jump table entries.
 ... using 100000000 loop iterations.
 ... result: 10000000100000000
 [...]

 Performance counter stats for './jump-table 16' (10 runs):

       1386.131780      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.18% )
                33      context-switches          #    0.024 K/sec                    ( +-  1.71% )
                 0      cpu-migrations            #    0.000 K/sec                  
                52      page-faults               #    0.037 K/sec                    ( +-  0.71% )
     6,247,215,683      cycles                    #    4.507 GHz                      ( +-  0.18% )
     3,895,337,877      stalled-cycles-frontend   #   62.35% frontend cycles idle     ( +-  0.30% )
     1,404,014,996      instructions              #    0.22  insns per cycle        
                                                  #    2.77  stalled cycles per insn  ( +-  0.02% )
       300,820,988      branches                  #  217.022 M/sec                    ( +-  0.02% )
        87,518,741      branch-misses             #   29.09% of all branches          ( +-  0.01% )

       1.385240076 seconds time elapsed                                          ( +-  0.21% )

... as you can see the branch miss rate is very significant, causing a stalled 
decoder and very low instruction throughput.

I have to reduce the jump table to a single entry (!) to get good performance on 
this CPU:

 Performance counter stats for './jump-table 1' (10 runs):

        739.173505      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.26% )
                37      context-switches          #    0.051 K/sec                    ( +- 16.79% )
                 0      cpu-migrations            #    0.000 K/sec                  
                52      page-faults               #    0.070 K/sec                    ( +-  0.41% )
     3,331,328,405      cycles                    #    4.507 GHz                      ( +-  0.26% )
     2,012,973,596      stalled-cycles-frontend   #   60.43% frontend cycles idle     ( +-  0.47% )
     1,403,880,792      instructions              #    0.42  insns per cycle        
                                                  #    1.43  stalled cycles per insn  ( +-  0.05% )
       300,817,064      branches                  #  406.964 M/sec                    ( +-  0.05% )
            12,177      branch-misses             #    0.00% of all branches          ( +- 12.39% )

       0.738616356 seconds time elapsed                                          ( +-  0.26% )

Note how the runtime got halved: that is because stalls got halved and the 
instructions per cycle throughput doubled.

Even a two entries jump table performs poorly:

 Performance counter stats for './jump-table 2' (10 runs):

       1493.790686      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.06% )
                39      context-switches          #    0.026 K/sec                    ( +-  4.73% )
                 0      cpu-migrations            #    0.000 K/sec                  
                52      page-faults               #    0.035 K/sec                    ( +-  0.26% )
     6,732,372,612      cycles                    #    4.507 GHz                      ( +-  0.06% )
     4,229,130,302      stalled-cycles-frontend   #   62.82% frontend cycles idle     ( +-  0.09% )
     1,407,803,145      instructions              #    0.21  insns per cycle        
                                                  #    3.00  stalled cycles per insn  ( +-  0.08% )
       301,688,946      branches                  #  201.962 M/sec                    ( +-  0.09% )
       100,022,150      branch-misses             #   33.15% of all branches          ( +-  0.00% )

       1.492820524 seconds time elapsed                                          ( +-  0.08% )

In fact it's worse than the 16 entries runtime.

( Note that Intel SkyLake improved the indirect jump/call branch predictor.
  On another Intel CPU I have the boundary between good and bad prediction is
  at 4-6 entries. So this is highly uarch (and code layout) dependent. )

In contrast, a static hierarchy/tree of branches is predicted a lot better on most 
x86 uarchs, even with highly variable input. (This does not even count the extra 
calculations needed to calculate the jump table index, which, as you coded it in 
your patch, add 2-3 cycles of extra latency as well.)
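
For a concrete (if simplified) picture of that alternative -- a sketch
only, reusing the fun_N helpers from the testcase further below -- a
4-way dispatch done as a tree of conditional branches instead of an
indirect jump would look like:

static unsigned long dispatch_tree4(unsigned long idx, unsigned long param)
{
        if (idx < 2)
                return (idx == 0) ? fun_1(param) : fun_2(param);
        else
                return (idx == 2) ? fun_3(param) : fun_4(param);
}

Each compare/branch here gets its own predictor entry; a 16-way version
just adds two more levels to the tree.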

Such branch miss characteristics matter and they can become more visible with 
smaller skb sizes.

So I'm generally sceptical of jump tables and I'd like to see careful and 
convincing perf stat runs on modern x86 uarchs, comparing the jump table solution 
to a branch hierarchy solution, before accepting a jump table solution into the 
x86 architecture.

x86 uarchs are generally good at predicting function pointer indirect jumps (which 
tend to be static) - multi-target jump tables are a lot harder nut to crack, 
especially if the jump index is calculated right before the jump is performed, 
like your patch does it.

The impact of branch misses is more pronounced on modern Intel CPUs, because 
speculation is deeper.

But as always I can be convinced with contrary numbers! :-)

Thanks,

	Ingo

-------------------------------------->
/*
 * Simple testcase to generate jump table usage.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long (fn_t) (unsigned long param);

unsigned long global;

#define DEFINE_FUN(x)				\
						\
unsigned long fun_##x(unsigned long param)	\
{						\
	return param+global;			\
}

#define TABLE_SIZE_MAX 16L

DEFINE_FUN( 1);
DEFINE_FUN( 2);
DEFINE_FUN( 3);
DEFINE_FUN( 4);
DEFINE_FUN( 5);
DEFINE_FUN( 6);
DEFINE_FUN( 7);
DEFINE_FUN( 8);
DEFINE_FUN( 9);
DEFINE_FUN(10);
DEFINE_FUN(11);
DEFINE_FUN(12);
DEFINE_FUN(13);
DEFINE_FUN(14);
DEFINE_FUN(15);
DEFINE_FUN(16);

int main(int argc, char **argv)
{
	fn_t *fn_ptr [TABLE_SIZE_MAX] = {	 fun_1 , fun_2 , fun_3 , fun_4 ,
						 fun_5 , fun_6 , fun_7 , fun_8 ,
						 fun_9 , fun_10, fun_11, fun_12,
						 fun_13, fun_14, fun_15, fun_16,
											};
	unsigned long local = 0;
	unsigned long i;
	unsigned long table_size = TABLE_SIZE_MAX;
	unsigned long loops = 100000000;


	if ((argc >= 2 && !strcmp(argv[1], "-h")) || argc >= 4) {
		printf("usage: jump-table <table_size(1-%lu)(default: %lu)> <loops(default: %lu)>\n", TABLE_SIZE_MAX, TABLE_SIZE_MAX, loops);
		exit(0);
	}

	if (argc >= 2) {
		table_size = atol(argv[1]);
		if (table_size <= 1)
			table_size = 1;
		if (table_size > TABLE_SIZE_MAX)
			table_size = TABLE_SIZE_MAX;
	}
	printf("... using %lu jump table entries.\n", table_size);

	if (argc >= 3)
		loops = atol(argv[2]);
	printf("... using %lu loop iterations.\n", loops);

	/* Call a set of simple arithmetic functions from a jump table: */
	for (i = 0; i < loops; i++) {
		global++;
		local += fn_ptr[global % table_size](global);
	}

	printf("... result: %lu\n", local);
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
  2016-02-03 19:18 Tom Herbert
@ 2016-02-04  9:30 ` Ingo Molnar
  2016-02-04 10:56   ` Ingo Molnar
  2016-02-04 21:46   ` Linus Torvalds
  2016-02-04 11:08 ` David Laight
  2016-02-04 19:22 ` Alexander Duyck
  2 siblings, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2016-02-04  9:30 UTC (permalink / raw)
  To: Tom Herbert
  Cc: davem, netdev, tglx, mingo, hpa, x86, kernel-team, linux-kernel,
	Peter Zijlstra


* Tom Herbert <tom@herbertland.com> wrote:

> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.
> 
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
> we need to avoid unaligned accesses. Efficient unaligned accesses
> offer a nice additional speedup as demonstrated in the results
> provided below.
> 
> This implementation is similar to csum_partial implemented in
> checksum_32.S, however since we are dealing with 8 bytes at a time
> there are more cases for small lengths (and alignments)-- for that
> we employ a jump table.
> 
> Testing:
> 
> Correctness:
> 
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
> 
> Checksum performance:
> 
> Isolating old and new implementation for some common cases:
> 
>          Old      NewA     NewA %   NewNoA   NewNoA %
> Len/Aln  nsec     nsec     Improv   nsecs    Improve
> --------+-------+--------+-------+-------+---------------------
> 1400/0    192.9    175.1   10%     174.9     10%   (Big packet)
> 40/0      13.8     7.7     44%     5.7       58%   (Ipv6 hdr cmn case)
> 8/4       8.4      6.9     18%     2.8       67%   (UDP, VXLAN in IPv4)
> 14/0      10.5     7.3     30%     5.4       48%   (Eth hdr)
> 14/4      10.8     8.7     19%     5.4       50%   (Eth hdr in IPv4)
> 14/3      11.0     9.8     11%     5.6       49%   (Eth with odd align)
> 7/1       10.0     5.8     42%     4.8       52%   (buffer in one quad)
> 
> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
> 
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
> 
> Also test on these with similar results:
> 
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
> 
> Branch prediction:
> 
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between
> the baseline and the interleaving test case.

Possibly because the jumps are missing branch prediction for all these variants...

Please run something like:

 triton:~/tip> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' ~/hackbench 10
 ...

 Performance counter stats for '/home/mingo/hackbench 10' (10 runs):

     1,701,689,802      instructions                                                  ( +-  0.83% )
       305,439,262       branches                                                     ( +-  0.92% )
         2,011,032       branch-misses            #    0.66% of all branches          ( +-  3.00% )

       0.074311070 seconds time elapsed                                          ( +-  2.42% )


... to see the quality of x86 branch prediction and to see the statistical quality 
of the measurement (stddev).

The real comparison would be jump table against a hierarchy of open coded 
branches. The latter is much more likely to be correctly branch-predicted.

I have a couple of (mostly stylistic) comments about the patch:

> Signed-off-by: Tom Herbert <tom@herbertland.com>
> ---
>  arch/x86/include/asm/checksum_64.h |   5 +
>  arch/x86/lib/csum-partial_64.S     | 277 +++++++++++++++++++++++++++++++++++++
>  arch/x86/lib/csum-partial_64.c     | 148 --------------------
>  3 files changed, 282 insertions(+), 148 deletions(-)
>  create mode 100644 arch/x86/lib/csum-partial_64.S
>  delete mode 100644 arch/x86/lib/csum-partial_64.c
> 
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index cd00e17..a888f65 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -128,6 +128,11 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
>   */
>  extern __wsum csum_partial(const void *buff, int len, __wsum sum);
>  
> +static inline __sum16 ip_compute_csum(const void *buff, int len)
> +{
> +	return csum_fold(csum_partial(buff, len, 0));
> +}
> +
>  #define  _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
>  #define HAVE_CSUM_COPY_USER 1
>  
> diff --git a/arch/x86/lib/csum-partial_64.S b/arch/x86/lib/csum-partial_64.S
> new file mode 100644
> index 0000000..520b400
> --- /dev/null
> +++ b/arch/x86/lib/csum-partial_64.S
> @@ -0,0 +1,277 @@
> +/* Copyright 2016 Tom Herbert <tom@herbertland.com>
> + *
> + * Checksum partial calculation

s/    Partial checksum calculation

> + *
> + * __wsum csum_partial(const void *buff, int len, __wsum sum)
> + *
> + * Computes the checksum of a memory block at buff, length len,
> + * and adds in "sum" (32-bit)

So please refer to arguments consistently in an escaped manner, i.e. 'buff', 
'len', 'sum'. It's a bit confusing to read otherwise.

This applies to other comments in your patch as well.

> + *
> + * Returns a 32-bit number suitable for feeding into itself
> + * or csum_tcpudp_magic

In comments please refer to functions with (), i.e. csum_tcpudp_magic().

> + *
> + * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS determines whether alignment of the
> + * buffer must be dealt with.
> + *
> + * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set then the steps are:

s/    If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y then the steps are:


> + *     1) Initialize accumulator to initial sum
> + *     2) Sum 8 bytes at a time using adcq (unroll main loop
> + *        to do 128 bytes at a time)

Please refer to x86 instructions capitalized, i.e. ADCQ here. That makes them 
stand out better especially when they alias to some English word.

This applies to other parts of your patch as well.

> + *     3) Sum remaining length (less than 8 bytes)
> + *
> + * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set then the steps are:

s/    If !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS then the steps are:

> + *     1) Handle buffer that is not aligned to 8 bytes, sum up to 8 byte
> + *        alignment
> + *     2) Sum 8 bytes at a time using adcq (unroll main loop
> + *        to do 128 bytes at a time)
> + *     3) Sum remaining length (less than 8 bytes)
> + *     4) Roll result if alignment is odd and add in initial sum argument
> + *     5) If buffer is not aligned to 8 bytes and length is less than
> + *        or equal to 8 - alignment (whole buffer is in one quad), then
> + *        treat that as a special case.

Had to read it 3 times to realize that '-' is not a separator, so please:

s/8-alignment

> + *
> + * Register usage:
> + *   %rdi: argument #1, buff
> + *   %rsi: argument #2, length
> + *   %rdx: argument #3, add in value
> + *   %rax,%eax: accumulator and return value
> + *   %rcx,%ecx: counter and tmp
> + *   %r11: tmp
> + *   %r10: alignment (0-7) - when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set

s/CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y

> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/errno.h>
> +#include <asm/asm.h>
> +
> +#define branch_tbl_len .L_branch_tbl_len
> +
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +
> +/* Close the carry chain and return. */
> +#define	RETURN			\
> +	adcl	$0, %eax;	\
> +	ret
> +
> +#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
> +
> +/* Before returning need to roll the result if alignment was odd and then add
> + * in the initial sum.
> + */

Please use the customary (multi-line) comment style:

  /*
   * Comment .....
   * ...... goes here.
   */

specified in Documentation/CodingStyle. This applies to other comments in your 
patch as well.

> +#define	RETURN			\
> +	adcl	$0, %eax;	\
> +	test	$0x1, %r10d;	\
> +	jz	99f;		\
> +	roll	$8, %eax;	\
> +99:	addl	%edx, %eax;	\
> +	adcl	$0, %eax;	\
> +	ret
> +
> +#define branch_tbl_align .L_branch_tbl_align
> +
> +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */

On x86 we always set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

So this is dead, untested code that is never compiled on x86.

> +
> +ENTRY(csum_partial)
> +
> +#ifdef	CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +	movl	%edx, %eax	/* Initialize with initial sum argument */
> +#else	/* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */

s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

> +	test	%esi, %esi	/* Zero length? */
> +	jne	310f
> +	movl	%edx, %eax
> +	ret
> +
> +310:	xorl	%eax, %eax

So in modern x86 assembly we try to use local named labels instead of numeric 
ones. Easier to extend and the result is a lot easier to read. This applies to 
most other numeric labels in your patch.

> +
> +	/* Determine alignment */
> +	movl	%edi, %r10d
> +	andl	$0x7, %r10d
> +	jz	10f
> +	movl	$8, %ecx
> +	subl	%r10d, %ecx
> +	cmpl	%ecx, %esi
> +	jle	320f
> +	clc
> +	jmpq	*branch_tbl_align(, %r10, 8)
> +
> +	/* Whole buffer fits into one quad. Sum up to a four byte alignment
> +	 * and then call into the length table to finish.
> +	 */
> +320:	test	$0x1, %r10d
> +	jz	330f
> +	movb	(%rdi), %ah /* Align to two bytes */
> +	decl	%esi
> +	lea	1(%rdi), %rdi
> +330:	cmpl	$2, %esi
> +	jl	340f
> +	test	$0x2, %r10d
> +	jz	340f
> +	addw	(%rdi), %ax /* Align to four bytes */
> +	adcl	$0, %eax
> +	lea	2(%rdi), %rdi
> +	subl	$2, %esi
> +340:
> +	clc
> +	jmpq *branch_tbl_len(, %rsi, 8)

Also, please use extra newlines to better delineate the control flow.

I.e. something like:

.L_small_buffer:
	test	$0x1, %r10d
	jz	L_2_bytes_aligned:

	/* Align to two bytes: */
	movb	(%rdi), %ah
	decl	%esi
	lea	1(%rdi), %rdi

.L_2_bytes_aligned:
	cmpl	$2, %esi
	jl	.L_4_bytes_aligned

	test	$0x2, %r10d
	jz	.L_4_bytes_aligned

	/* Align to four bytes: */
	addw	(%rdi), %ax
	adcl	$0, %eax
	lea	2(%rdi), %rdi
	subl	$2, %esi

.L_4_bytes_aligned:

See how much more readable such an assembly construct is than numeric labels and 
no proper flow control separation?

Also note how I changed the placement and style of the existing comments.

Note that I just guessed the labels based on patch context, so some might be
wrong. Also, this code will go away since it's dead code, but the comments still
apply to the rest of the patch.

Please propagate all such readability improvements to the rest of your patch.

> +
> +/* Jumps table for alignments */

s/Jump table

> +
> +201:	/* Align 1 */
> +	adcw	5(%rdi), %ax
> +203:	/* Align 3 */
> +	adcw	3(%rdi), %ax
> +205:	/* Align 5 */
> +	adcw	1(%rdi), %ax
> +207:	/* Align 7 */
> +	adcl	$0, %eax
> +	addb	(%rdi), %ah
> +	jmp	222f
> +202:	/* Align 2 */
> +	adcw	4(%rdi), %ax
> +204:	/* Align 4 */
> +	adcw	2(%rdi), %ax
> +206:	/* Align 6 */
> +	adcw	(%rdi), %ax
> +
> +222:	adcl	$0, %eax
> +	subl	%ecx, %esi /* %rcx is 8 - alignment */
> +	addq	%rcx, %rdi
> +200:
> +	/* Fall through */
> +
> +#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */

s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

> +
> +	/* Check length */
> +10:	cmpl	$8, %esi
> +	jg	30f
> +	jl	20f
> +
> +	/* Exactly 8 bytes length */
> +	addl	(%rdi), %eax
> +	adcl	4(%rdi), %eax
> +	RETURN
> +
> +	/* Less than 8 bytes length */
> +20:	clc
> +	jmpq *branch_tbl_len(, %rsi, 8)
> +
> +	/* Greater than 8 bytes length. Determine number of quads (n). Sum
> +	 * over first n % 16 quads
> +	 */
> +30:	movl	%esi, %ecx
> +	shrl	$3, %ecx
> +	andl	$0xf, %ecx
> +	negq	%rcx
> +	lea	40f(, %rcx, 4), %r11
> +	clc
> +	jmp	*%r11

Are you absolutely sure that a jump table is the proper solution here? It defeats 
branch prediction on most x86 uarchs. Why not label the loop stages and jump in 
directly with a binary tree of branches?
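
For instance, something like this completely untested sketch (labels invented
here) would consume the (quads % 16) remainder without any indirect jump, by
testing the bits of the count and running fixed blocks of 8/4/2/1 adds:

	/*
	 * Untested sketch: %ecx = number of quads % 16, %rdi = buffer,
	 * %rax = accumulator. Each block folds its own carry, the same
	 * way the rest of the patch already does.
	 */
	test	$8, %cl
	jz	.L_rem_4

	addq	0*8(%rdi), %rax
	adcq	1*8(%rdi), %rax
	adcq	2*8(%rdi), %rax
	adcq	3*8(%rdi), %rax
	adcq	4*8(%rdi), %rax
	adcq	5*8(%rdi), %rax
	adcq	6*8(%rdi), %rax
	adcq	7*8(%rdi), %rax
	adcq	$0, %rax
	lea	8*8(%rdi), %rdi

.L_rem_4:
	test	$4, %cl
	jz	.L_rem_2

	addq	0*8(%rdi), %rax
	adcq	1*8(%rdi), %rax
	adcq	2*8(%rdi), %rax
	adcq	3*8(%rdi), %rax
	adcq	$0, %rax
	lea	4*8(%rdi), %rdi

.L_rem_2:
	test	$2, %cl
	jz	.L_rem_1

	addq	0*8(%rdi), %rax
	adcq	1*8(%rdi), %rax
	adcq	$0, %rax
	lea	2*8(%rdi), %rdi

.L_rem_1:
	test	$1, %cl
	jz	.L_rem_done

	addq	(%rdi), %rax
	adcq	$0, %rax
	lea	8(%rdi), %rdi

.L_rem_done:

That is not literally "jump into labeled stages", but it trades the one
hard-to-predict indirect branch for four easily predicted conditional branches
plus a handful of extra folds; whether that wins would need measuring.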

> +
> +.align 8
> +	adcq	14*8(%rdi),%rax
> +	adcq	13*8(%rdi),%rax
> +	adcq	12*8(%rdi),%rax
> +	adcq	11*8(%rdi),%rax
> +	adcq	10*8(%rdi),%rax
> +	adcq	9*8(%rdi),%rax
> +	adcq	8*8(%rdi),%rax
> +	adcq	7*8(%rdi),%rax
> +	adcq	6*8(%rdi),%rax
> +	adcq	5*8(%rdi),%rax
> +	adcq	4*8(%rdi),%rax
> +	adcq	3*8(%rdi),%rax
> +	adcq	2*8(%rdi),%rax
> +	adcq	1*8(%rdi),%rax
> +	adcq 	0*8(%rdi),%rax

Stray whitespace.

> +	nop
> +40:	/* #quads % 16 jump table base */

So both the jump table and its initial alignment add NOPs that at minimum blow
up the I$ a bit. With direct jumps we could avoid all that.

> +
> +	adcq	$0, %rax
> +	shlq	$3, %rcx
> +	subq	%rcx, %rdi /* %rcx is already negative length */
> +
> +	/* Now determine number of blocks of 8 quads. Sum 128 bytes at a time
> +	 * using unrolled loop.

s/using an unrolled loop

> +	 */
> +	movl	%esi, %ecx
> +	shrl	$7, %ecx
> +	jz	60f
> +	clc
> +
> +	/* Main loop */
> +50:	adcq	0*8(%rdi),%rax
> +	adcq	1*8(%rdi),%rax
> +	adcq	2*8(%rdi),%rax
> +	adcq	3*8(%rdi),%rax
> +	adcq	4*8(%rdi),%rax
> +	adcq	5*8(%rdi),%rax
> +	adcq	6*8(%rdi),%rax
> +	adcq	7*8(%rdi),%rax
> +	adcq	8*8(%rdi),%rax
> +	adcq	9*8(%rdi),%rax
> +	adcq	10*8(%rdi),%rax
> +	adcq	11*8(%rdi),%rax
> +	adcq	12*8(%rdi),%rax
> +	adcq	13*8(%rdi),%rax
> +	adcq	14*8(%rdi),%rax
> +	adcq	15*8(%rdi),%rax
> +	lea	128(%rdi), %rdi
> +	loop	50b
> +
> +	adcq	$0, %rax
> +
> +	/* Handle remaining length which is <= 8 bytes */
> +60:	andl	$0x7, %esi
> +
> +	/* Fold 64 bit sum to 32 bits */

s/64-bit

> +	movq	%rax, %rcx
> +	shrq	$32, %rcx
> +	addl	%ecx, %eax
> +
> +	jmpq *branch_tbl_len(, %rsi, 8)

Same fundamental question as above.
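
The same trick works for the 0-7 byte tail, e.g. (again a completely untested
sketch, labels invented here):

	/*
	 * Untested sketch: %esi = remaining length (0-7), %eax = folded
	 * 32-bit sum, CF may still be pending from the 64->32 fold above.
	 */
	adcl	$0, %eax		/* absorb the pending carry */

	cmpl	$4, %esi
	jb	.L_tail_below_4

	addl	(%rdi), %eax		/* 4-7 bytes: one dword */
	adcl	$0, %eax
	lea	4(%rdi), %rdi
	subl	$4, %esi

.L_tail_below_4:
	cmpl	$2, %esi
	jb	.L_tail_below_2

	addw	(%rdi), %ax		/* 2-3 bytes: one word */
	adcl	$0, %eax
	lea	2(%rdi), %rdi
	subl	$2, %esi

.L_tail_below_2:
	test	%esi, %esi
	jz	.L_tail_done

	addb	(%rdi), %al		/* trailing odd byte, low position */
	adcb	$0, %ah

.L_tail_done:
	RETURN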

> +
> +/* Length table targets */
> +
> +107:	/* Length 7 */
> +	adcw	4(%rdi), %ax
> +105:	/* Length 5 */
> +	adcw	2(%rdi), %ax
> +103:	/* Length 3 */
> +	adcw	(%rdi), %ax
> +101:	/* Length 1, grab the odd byte */
> +	adcb	-1(%rdi, %rsi), %al
> +	adcb	$0, %ah
> +	RETURN
> +106:	/* Length 6 */
> +	adcw	4(%rdi), %ax
> +104:	/* Length 4, optimized  for double word access*/
> +	adcl	(%rdi), %eax
> +	RETURN
> +102:	/* Length 2 */
> +	adcw	(%rdi), %ax
> +100:	/* Length 0 */
> +	RETURN
> +
> +.section .rodata
> +.align 64

Please use the proper parametric cache alignment define. There are 
subarchitectures that prefer (and need) much larger alignment.
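
Something along these lines ought to do it (untested; .S files go through the
C preprocessor, so the <asm/cache.h> constant should be usable here):

#include <asm/cache.h>

	.section .rodata
	.balign	L1_CACHE_BYTES	/* tracks CONFIG_X86_L1_CACHE_SHIFT */
.L_branch_tbl_len:
	.quad	100b
	.quad	101b
	.quad	102b
	.quad	103b
	.quad	104b
	.quad	105b
	.quad	106b
	.quad	107b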

> +.L_branch_tbl_len:
> +	.quad	100b
> +	.quad	101b
> +	.quad	102b
> +	.quad	103b
> +	.quad	104b
> +	.quad	105b
> +	.quad	106b
> +	.quad	107b
> +
> +#ifndef	CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +.L_branch_tbl_align:
> +	.quad	200b
> +	.quad	201b
> +	.quad	202b
> +	.quad	203b
> +	.quad	204b
> +	.quad	205b
> +	.quad	206b
> +	.quad	207b
> +#endif

This too is dead code.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
@ 2016-02-03 19:18 Tom Herbert
  2016-02-04  9:30 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Tom Herbert @ 2016-02-03 19:18 UTC (permalink / raw)
  To: davem, netdev; +Cc: tglx, mingo, hpa, x86, kernel-team

Implement an assembly routine for csum_partial for 64-bit x86. This
primarily speeds up checksum calculation for smaller lengths, such as
those seen in skb_postpull_rcsum when getting CHECKSUM_COMPLETE from
the device or after a CHECKSUM_UNNECESSARY conversion.

CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
we need to avoid unaligned accesses. Efficient unaligned accesses
offer a nice additional speedup as demonstrated in the results
provided below.

This implementation is similar to the csum_partial implemented in
checksum_32.S; however, since we are dealing with 8 bytes at a time
there are more cases for small lengths (and alignments), and for those
we employ a jump table.

Testing:

Correctness:

Verified correctness by testing arbitrary-length buffers filled with
random data. For each buffer I compared the computed checksum against
the original algorithm for each possible alignment (0-7 bytes).

Checksum performance:

Isolating the old and new implementations for some common cases:

         Old      NewA     NewA %   NewNoA   NewNoA %
Len/Aln  nsecs    nsecs    Improv   nsecs    Improv
--------+-------+--------+-------+-------+---------------------
1400/0    192.9    175.1   10%     174.9     10%   (Big packet)
40/0      13.8     7.7     44%     5.7       58%   (Ipv6 hdr cmn case)
8/4       8.4      6.9     18%     2.8       67%   (UDP, VXLAN in IPv4)
14/0      10.5     7.3     30%     5.4       48%   (Eth hdr)
14/4      10.8     8.7     19%     5.4       50%   (Eth hdr in IPv4)
14/3      11.0     9.8     11%     5.6       49%   (Eth with odd align)
7/1       10.0     5.8     42%     4.8       52%   (buffer in one quad)

NewA   => CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set (alignment handled)
NewNoA => CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set (no alignment handling)

Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz

Also tested on these CPUs, with similar results:

Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

Branch prediction:

To test the effects of poor branch prediction in the jump tables, I
measured checksum performance with runs over two combinations of
length and alignment. As the baseline, I did half of the calls with
the first combination, followed by the second combination for the
second half. In the test case, I interleaved the two combinations so
that the length and alignment differ on every call, defeating branch
prediction. Running several cases, I did not see any material
performance difference between the baseline and the interleaved test
case.
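
For reference, the interleaving test is conceptually something like the
sketch below. This is only an illustrative user-space harness with made-up
names, not the actual test code; it assumes csum_partial() has been extracted
so that it can be linked standalone, and treats __wsum as a plain unsigned
int:

#include <stdio.h>
#include <time.h>

extern unsigned int csum_partial(const void *buff, int len, unsigned int sum);

static unsigned char buf[2048];	/* contents do not matter for timing */

static double run_ns(const int len[2], const int align[2], int interleave,
		     long iters)
{
	struct timespec t0, t1;
	unsigned int sum = 0;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		/* blocked: first combo for the first half, then the second;
		 * interleaved: alternate the two combos on every call
		 */
		int which = interleave ? (int)(i & 1) : (i >= iters / 2);

		sum += csum_partial(buf + align[which], len[which], 0);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* keep 'sum' alive so the calls cannot be optimized away */
	__asm__ __volatile__("" : : "r" (sum));

	return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
	const int len[2] = { 40, 14 }, align[2] = { 0, 4 };
	const long iters = 10 * 1000 * 1000;

	printf("blocked:     %.1f ns/call\n", run_ns(len, align, 0, iters) / iters);
	printf("interleaved: %.1f ns/call\n", run_ns(len, align, 1, iters) / iters);
	return 0;
}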

Signed-off-by: Tom Herbert <tom@herbertland.com>
---
 arch/x86/include/asm/checksum_64.h |   5 +
 arch/x86/lib/csum-partial_64.S     | 277 +++++++++++++++++++++++++++++++++++++
 arch/x86/lib/csum-partial_64.c     | 148 --------------------
 3 files changed, 282 insertions(+), 148 deletions(-)
 create mode 100644 arch/x86/lib/csum-partial_64.S
 delete mode 100644 arch/x86/lib/csum-partial_64.c

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index cd00e17..a888f65 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -128,6 +128,11 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
  */
 extern __wsum csum_partial(const void *buff, int len, __wsum sum);
 
+static inline __sum16 ip_compute_csum(const void *buff, int len)
+{
+	return csum_fold(csum_partial(buff, len, 0));
+}
+
 #define  _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
 #define HAVE_CSUM_COPY_USER 1
 
diff --git a/arch/x86/lib/csum-partial_64.S b/arch/x86/lib/csum-partial_64.S
new file mode 100644
index 0000000..520b400
--- /dev/null
+++ b/arch/x86/lib/csum-partial_64.S
@@ -0,0 +1,277 @@
+/* Copyright 2016 Tom Herbert <tom@herbertland.com>
+ *
+ * Checksum partial calculation
+ *
+ * __wsum csum_partial(const void *buff, int len, __wsum sum)
+ *
+ * Computes the checksum of a memory block at buff, length len,
+ * and adds in "sum" (32-bit)
+ *
+ * Returns a 32-bit number suitable for feeding into itself
+ * or csum_tcpudp_magic
+ *
+ * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS determines whether alignment of the
+ * buffer must be dealt with.
+ *
+ * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set then the steps are:
+ *     1) Initialize accumulator to initial sum
+ *     2) Sum 8 bytes at a time using adcq (unroll main loop
+ *        to do 128 bytes at a time)
+ *     3) Sum remaining length (less than 8 bytes)
+ *
+ * If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set then the steps are:
+ *     1) Handle buffer that is not aligned to 8 bytes, sum up to 8 byte
+ *        alignment
+ *     2) Sum 8 bytes at a time using adcq (unroll main loop
+ *        to do 128 bytes at a time)
+ *     3) Sum remaining length (less than 8 bytes)
+ *     4) Roll result if alignment is odd and add in initial sum argument
+ *     5) If buffer is not aligned to 8 bytes and length is less than
+ *        or equal to 8 - alignment (whole buffer is in one quad), then
+ *        treat that as a special case.
+ *
+ * Register usage:
+ *   %rdi: argument #1, buff
+ *   %rsi: argument #2, length
+ *   %rdx: argument #3, add in value
+ *   %rax,%eax: accumulator and return value
+ *   %rcx,%ecx: counter and tmp
+ *   %r11: tmp
+ *   %r10: alignment (0-7) - when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
+ */
+
+#include <linux/linkage.h>
+#include <asm/errno.h>
+#include <asm/asm.h>
+
+#define branch_tbl_len .L_branch_tbl_len
+
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+
+/* Close the carry chain and return. */
+#define	RETURN			\
+	adcl	$0, %eax;	\
+	ret
+
+#else /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+
+/* Before returning need to roll the result if alignment was odd and then add
+ * in the initial sum.
+ */
+#define	RETURN			\
+	adcl	$0, %eax;	\
+	test	$0x1, %r10d;	\
+	jz	99f;		\
+	roll	$8, %eax;	\
+99:	addl	%edx, %eax;	\
+	adcl	$0, %eax;	\
+	ret
+
+#define branch_tbl_align .L_branch_tbl_align
+
+#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+
+ENTRY(csum_partial)
+
+#ifdef	CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	movl	%edx, %eax	/* Initialize with initial sum argument */
+#else	/* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+	test	%esi, %esi	/* Zero length? */
+	jne	310f
+	movl	%edx, %eax
+	ret
+
+310:	xorl	%eax, %eax
+
+	/* Determine alignment */
+	movl	%edi, %r10d
+	andl	$0x7, %r10d
+	jz	10f
+	movl	$8, %ecx
+	subl	%r10d, %ecx
+	cmpl	%ecx, %esi
+	jle	320f
+	clc
+	jmpq	*branch_tbl_align(, %r10, 8)
+
+	/* Whole buffer fits into one quad. Sum up to a four byte alignment
+	 * and then call into the length table to finish.
+	 */
+320:	test	$0x1, %r10d
+	jz	330f
+	movb	(%rdi), %ah /* Align to two bytes */
+	decl	%esi
+	lea	1(%rdi), %rdi
+330:	cmpl	$2, %esi
+	jl	340f
+	test	$0x2, %r10d
+	jz	340f
+	addw	(%rdi), %ax /* Align to four bytes */
+	adcl	$0, %eax
+	lea	2(%rdi), %rdi
+	subl	$2, %esi
+340:
+	clc
+	jmpq *branch_tbl_len(, %rsi, 8)
+
+/* Jumps table for alignments */
+
+201:	/* Align 1 */
+	adcw	5(%rdi), %ax
+203:	/* Align 3 */
+	adcw	3(%rdi), %ax
+205:	/* Align 5 */
+	adcw	1(%rdi), %ax
+207:	/* Align 7 */
+	adcl	$0, %eax
+	addb	(%rdi), %ah
+	jmp	222f
+202:	/* Align 2 */
+	adcw	4(%rdi), %ax
+204:	/* Align 4 */
+	adcw	2(%rdi), %ax
+206:	/* Align 6 */
+	adcw	(%rdi), %ax
+
+222:	adcl	$0, %eax
+	subl	%ecx, %esi /* %rcx is 8 - alignment */
+	addq	%rcx, %rdi
+200:
+	/* Fall through */
+
+#endif /* CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS */
+
+	/* Check length */
+10:	cmpl	$8, %esi
+	jg	30f
+	jl	20f
+
+	/* Exactly 8 bytes length */
+	addl	(%rdi), %eax
+	adcl	4(%rdi), %eax
+	RETURN
+
+	/* Less than 8 bytes length */
+20:	clc
+	jmpq *branch_tbl_len(, %rsi, 8)
+
+	/* Greater than 8 bytes length. Determine number of quads (n). Sum
+	 * over first n % 16 quads
+	 */
+30:	movl	%esi, %ecx
+	shrl	$3, %ecx
+	andl	$0xf, %ecx
+	negq	%rcx
+	lea	40f(, %rcx, 4), %r11
+	clc
+	jmp	*%r11
+
+.align 8
+	adcq	14*8(%rdi),%rax
+	adcq	13*8(%rdi),%rax
+	adcq	12*8(%rdi),%rax
+	adcq	11*8(%rdi),%rax
+	adcq	10*8(%rdi),%rax
+	adcq	9*8(%rdi),%rax
+	adcq	8*8(%rdi),%rax
+	adcq	7*8(%rdi),%rax
+	adcq	6*8(%rdi),%rax
+	adcq	5*8(%rdi),%rax
+	adcq	4*8(%rdi),%rax
+	adcq	3*8(%rdi),%rax
+	adcq	2*8(%rdi),%rax
+	adcq	1*8(%rdi),%rax
+	adcq 	0*8(%rdi),%rax
+	nop
+40:	/* #quads % 16 jump table base */
+
+	adcq	$0, %rax
+	shlq	$3, %rcx
+	subq	%rcx, %rdi /* %rcx is already negative length */
+
+	/* Now determine number of blocks of 8 quads. Sum 128 bytes at a time
+	 * using unrolled loop.
+	 */
+	movl	%esi, %ecx
+	shrl	$7, %ecx
+	jz	60f
+	clc
+
+	/* Main loop */
+50:	adcq	0*8(%rdi),%rax
+	adcq	1*8(%rdi),%rax
+	adcq	2*8(%rdi),%rax
+	adcq	3*8(%rdi),%rax
+	adcq	4*8(%rdi),%rax
+	adcq	5*8(%rdi),%rax
+	adcq	6*8(%rdi),%rax
+	adcq	7*8(%rdi),%rax
+	adcq	8*8(%rdi),%rax
+	adcq	9*8(%rdi),%rax
+	adcq	10*8(%rdi),%rax
+	adcq	11*8(%rdi),%rax
+	adcq	12*8(%rdi),%rax
+	adcq	13*8(%rdi),%rax
+	adcq	14*8(%rdi),%rax
+	adcq	15*8(%rdi),%rax
+	lea	128(%rdi), %rdi
+	loop	50b
+
+	adcq	$0, %rax
+
+	/* Handle remaining length which is <= 8 bytes */
+60:	andl	$0x7, %esi
+
+	/* Fold 64 bit sum to 32 bits */
+	movq	%rax, %rcx
+	shrq	$32, %rcx
+	addl	%ecx, %eax
+
+	jmpq *branch_tbl_len(, %rsi, 8)
+
+/* Length table targets */
+
+107:	/* Length 7 */
+	adcw	4(%rdi), %ax
+105:	/* Length 5 */
+	adcw	2(%rdi), %ax
+103:	/* Length 3 */
+	adcw	(%rdi), %ax
+101:	/* Length 1, grab the odd byte */
+	adcb	-1(%rdi, %rsi), %al
+	adcb	$0, %ah
+	RETURN
+106:	/* Length 6 */
+	adcw	4(%rdi), %ax
+104:	/* Length 4, optimized  for double word access*/
+	adcl	(%rdi), %eax
+	RETURN
+102:	/* Length 2 */
+	adcw	(%rdi), %ax
+100:	/* Length 0 */
+	RETURN
+
+.section .rodata
+.align 64
+.L_branch_tbl_len:
+	.quad	100b
+	.quad	101b
+	.quad	102b
+	.quad	103b
+	.quad	104b
+	.quad	105b
+	.quad	106b
+	.quad	107b
+
+#ifndef	CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+.L_branch_tbl_align:
+	.quad	200b
+	.quad	201b
+	.quad	202b
+	.quad	203b
+	.quad	204b
+	.quad	205b
+	.quad	206b
+	.quad	207b
+#endif
+
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
deleted file mode 100644
index 9845371..0000000
--- a/arch/x86/lib/csum-partial_64.c
+++ /dev/null
@@ -1,148 +0,0 @@
-/*
- * arch/x86_64/lib/csum-partial.c
- *
- * This file contains network checksum routines that are better done
- * in an architecture-specific manner due to speed.
- */
- 
-#include <linux/compiler.h>
-#include <linux/module.h>
-#include <asm/checksum.h>
-
-static inline unsigned short from32to16(unsigned a) 
-{
-	unsigned short b = a >> 16; 
-	asm("addw %w2,%w0\n\t"
-	    "adcw $0,%w0\n" 
-	    : "=r" (b)
-	    : "0" (b), "r" (a));
-	return b;
-}
-
-/*
- * Do a 64-bit checksum on an arbitrary memory area.
- * Returns a 32bit checksum.
- *
- * This isn't as time critical as it used to be because many NICs
- * do hardware checksumming these days.
- * 
- * Things tried and found to not make it faster:
- * Manual Prefetching
- * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
- */
-static unsigned do_csum(const unsigned char *buff, unsigned len)
-{
-	unsigned odd, count;
-	unsigned long result = 0;
-
-	if (unlikely(len == 0))
-		return result; 
-	odd = 1 & (unsigned long) buff;
-	if (unlikely(odd)) {
-		result = *buff << 8;
-		len--;
-		buff++;
-	}
-	count = len >> 1;		/* nr of 16-bit words.. */
-	if (count) {
-		if (2 & (unsigned long) buff) {
-			result += *(unsigned short *)buff;
-			count--;
-			len -= 2;
-			buff += 2;
-		}
-		count >>= 1;		/* nr of 32-bit words.. */
-		if (count) {
-			unsigned long zero;
-			unsigned count64;
-			if (4 & (unsigned long) buff) {
-				result += *(unsigned int *) buff;
-				count--;
-				len -= 4;
-				buff += 4;
-			}
-			count >>= 1;	/* nr of 64-bit words.. */
-
-			/* main loop using 64byte blocks */
-			zero = 0;
-			count64 = count >> 3;
-			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
-				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
-				buff += 64;
-				count64--;
-			}
-
-			/* last up to 7 8byte blocks */
-			count %= 8; 
-			while (count) { 
-				asm("addq %1,%0\n\t"
-				    "adcq %2,%0\n" 
-					    : "=r" (result)
-				    : "m" (*(unsigned long *)buff), 
-				    "r" (zero),  "0" (result));
-				--count; 
-					buff += 8;
-			}
-			result = add32_with_carry(result>>32,
-						  result&0xffffffff); 
-
-			if (len & 4) {
-				result += *(unsigned int *) buff;
-				buff += 4;
-			}
-		}
-		if (len & 2) {
-			result += *(unsigned short *) buff;
-			buff += 2;
-		}
-	}
-	if (len & 1)
-		result += *buff;
-	result = add32_with_carry(result>>32, result & 0xffffffff); 
-	if (unlikely(odd)) { 
-		result = from32to16(result);
-		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
-	}
-	return result;
-}
-
-/*
- * computes the checksum of a memory block at buff, length len,
- * and adds in "sum" (32-bit)
- *
- * returns a 32-bit number suitable for feeding into itself
- * or csum_tcpudp_magic
- *
- * this function must be called with even lengths, except
- * for the last fragment, which may be odd
- *
- * it's best to have buff aligned on a 64-bit boundary
- */
-__wsum csum_partial(const void *buff, int len, __wsum sum)
-{
-	return (__force __wsum)add32_with_carry(do_csum(buff, len),
-						(__force u32)sum);
-}
-
-/*
- * this routine is used for miscellaneous IP-like checksums, mainly
- * in icmp.c
- */
-__sum16 ip_compute_csum(const void *buff, int len)
-{
-	return csum_fold(csum_partial(buff,len,0));
-}
-EXPORT_SYMBOL(ip_compute_csum);
-
-- 
2.4.6

^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-02-10 15:21 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-08 20:12 [PATCH v3 net-next] net: Implement fast csum_partial for x86_64 George Spelvin
2016-02-09 10:48 ` David Laight
2016-02-10  0:53   ` George Spelvin
2016-02-10 11:39     ` David Laight
2016-02-10 14:43       ` George Spelvin
2016-02-10 15:18         ` David Laight
  -- strict thread matches above, loose matches on Subject: below --
2016-02-03 19:18 Tom Herbert
2016-02-04  9:30 ` Ingo Molnar
2016-02-04 10:56   ` Ingo Molnar
2016-02-04 19:24     ` Tom Herbert
2016-02-05  9:24       ` Ingo Molnar
2016-02-04 21:46   ` Linus Torvalds
2016-02-04 22:09     ` Linus Torvalds
2016-02-05  1:27       ` Linus Torvalds
2016-02-05  1:39         ` Linus Torvalds
2016-02-04 22:43     ` Tom Herbert
2016-02-04 22:57       ` Linus Torvalds
2016-02-05  8:01       ` Ingo Molnar
2016-02-05 10:07         ` David Laight
2016-02-04 11:08 ` David Laight
2016-02-04 16:51   ` Alexander Duyck
2016-02-04 16:58     ` Tom Herbert
2016-02-04 17:09       ` David Laight
2016-02-04 20:59         ` Tom Herbert
2016-02-04 21:09           ` Alexander Duyck
2016-02-04 19:22 ` Alexander Duyck
2016-02-04 19:31   ` Tom Herbert
2016-02-04 19:44   ` Tom Herbert
2016-02-04 20:03     ` Alexander Duyck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.