All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-11 16:51 Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
                   ` (3 more replies)
  0 siblings, 4 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-11 16:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Sébastien Dugué reported to me that devices implementing ipoib (which don't have
checksum offload hardware were spending a significant amount of time computing
checksums.  We found that by splitting the checksum computation into two
separate streams, each skipping successive elements of the buffer being summed,
we could parallelize the checksum operation accros multiple alus.  Since neither
chain is dependent on the result of the other, we get a speedup in execution (on
hardware that has multiple alu's available, which is almost ubiquitous on x86),
and only a negligible decrease on hardware that has only a single alu (an extra
addition is introduced).  Since addition in commutative, the result is the same,
only faster

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 arch/x86/lib/csum-partial_64.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..2c7bc50 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,11 +29,12 @@ static inline unsigned short from32to16(unsigned a)
  * Things tried and found to not make it faster:
  * Manual Prefetching
  * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
  */
 static unsigned do_csum(const unsigned char *buff, unsigned len)
 {
 	unsigned odd, count;
+	unsigned long result1 = 0;
+	unsigned long result2 = 0;
 	unsigned long result = 0;
 
 	if (unlikely(len == 0))
@@ -68,22 +69,34 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			zero = 0;
 			count64 = count >> 3;
 			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
+				asm("addq 0*8(%[src]),%[res1]\n\t"
+				    "adcq 2*8(%[src]),%[res1]\n\t"
+				    "adcq 4*8(%[src]),%[res1]\n\t"
+				    "adcq 6*8(%[src]),%[res1]\n\t"
+				    "adcq %[zero],%[res1]\n\t"
+
+				    "addq 1*8(%[src]),%[res2]\n\t"
+				    "adcq 3*8(%[src]),%[res2]\n\t"
+				    "adcq 5*8(%[src]),%[res2]\n\t"
+				    "adcq 7*8(%[src]),%[res2]\n\t"
+				    "adcq %[zero],%[res2]"
+				    : [res1] "=r" (result1),
+				      [res2] "=r" (result2)
 				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
+				      "[res1]" (result1), "[res2]" (result2));
 				buff += 64;
 				count64--;
 			}
 
+			asm("addq %[res1],%[res]\n\t"
+			    "adcq %[res2],%[res]\n\t"
+			    "adcq %[zero],%[res]"
+			    : [res] "=r" (result)
+			    : [res1] "r" (result1),
+			      [res2] "r" (result2),
+			      [zero] "r" (zero),
+			      "0" (result));
+
 			/* last up to 7 8byte blocks */
 			count %= 8; 
 			while (count) { 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
@ 2013-10-12 17:21 ` Ingo Molnar
  2013-10-13 12:53   ` Neil Horman
  2013-10-14 20:28   ` Neil Horman
  2013-10-12 22:29 ` H. Peter Anvin
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-12 17:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Sébastien Dugué reported to me that devices implementing ipoib (which 
> don't have checksum offload hardware were spending a significant amount 
> of time computing checksums.  We found that by splitting the checksum 
> computation into two separate streams, each skipping successive elements 
> of the buffer being summed, we could parallelize the checksum operation 
> accros multiple alus.  Since neither chain is dependent on the result of 
> the other, we get a speedup in execution (on hardware that has multiple 
> alu's available, which is almost ubiquitous on x86), and only a 
> negligible decrease on hardware that has only a single alu (an extra 
> addition is introduced).  Since addition in commutative, the result is 
> the same, only faster

This patch should really come with measurement numbers: what performance 
increase (and drop) did you get on what CPUs.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
@ 2013-10-12 22:29 ` H. Peter Anvin
  2013-10-13 12:53   ` Neil Horman
  2013-10-18 16:42   ` Neil Horman
  2013-10-14  4:38 ` Andi Kleen
  2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
  3 siblings, 2 replies; 132+ messages in thread
From: H. Peter Anvin @ 2013-10-12 22:29 UTC (permalink / raw)
  To: Neil Horman, linux-kernel
  Cc: sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On 10/11/2013 09:51 AM, Neil Horman wrote:
> Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> checksum offload hardware were spending a significant amount of time computing
> checksums.  We found that by splitting the checksum computation into two
> separate streams, each skipping successive elements of the buffer being summed,
> we could parallelize the checksum operation accros multiple alus.  Since neither
> chain is dependent on the result of the other, we get a speedup in execution (on
> hardware that has multiple alu's available, which is almost ubiquitous on x86),
> and only a negligible decrease on hardware that has only a single alu (an extra
> addition is introduced).  Since addition in commutative, the result is the same,
> only faster

On hardware that implement ADCX/ADOX then you should also be able to
have additional streams interleaved since those instructions allow for
dual carry chains.

	-hpa




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 17:21 ` Ingo Molnar
@ 2013-10-13 12:53   ` Neil Horman
  2013-10-14 20:28   ` Neil Horman
  1 sibling, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-13 12:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant amount 
> > of time computing checksums.  We found that by splitting the checksum 
> > computation into two separate streams, each skipping successive elements 
> > of the buffer being summed, we could parallelize the checksum operation 
> > accros multiple alus.  Since neither chain is dependent on the result of 
> > the other, we get a speedup in execution (on hardware that has multiple 
> > alu's available, which is almost ubiquitous on x86), and only a 
> > negligible decrease on hardware that has only a single alu (an extra 
> > addition is introduced).  Since addition in commutative, the result is 
> > the same, only faster
> 
> This patch should really come with measurement numbers: what performance 
> increase (and drop) did you get on what CPUs.
> 
> Thanks,
> 
Sure, I can gather some stats for you.  I'll post them later this week
Neil

> 	Ingo
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 22:29 ` H. Peter Anvin
@ 2013-10-13 12:53   ` Neil Horman
  2013-10-18 16:42   ` Neil Horman
  1 sibling, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-13 12:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> > checksums.  We found that by splitting the checksum computation into two
> > separate streams, each skipping successive elements of the buffer being summed,
> > we could parallelize the checksum operation accros multiple alus.  Since neither
> > chain is dependent on the result of the other, we get a speedup in execution (on
> > hardware that has multiple alu's available, which is almost ubiquitous on x86),
> > and only a negligible decrease on hardware that has only a single alu (an extra
> > addition is introduced).  Since addition in commutative, the result is the same,
> > only faster
> 
> On hardware that implement ADCX/ADOX then you should also be able to
> have additional streams interleaved since those instructions allow for
> dual carry chains.
> 
Ok, thats a good idea, I'll look into those instructions this week
Neil

> 	-hpa
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
  2013-10-12 22:29 ` H. Peter Anvin
@ 2013-10-14  4:38 ` Andi Kleen
  2013-10-14  7:49   ` Ingo Molnar
  2013-10-14 20:25   ` Neil Horman
  2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
  3 siblings, 2 replies; 132+ messages in thread
From: Andi Kleen @ 2013-10-14  4:38 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Neil Horman <nhorman@tuxdriver.com> writes:

> Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> checksum offload hardware were spending a significant amount of time computing

Must be an odd workload, most TCP/UDP workloads do copy-checksum
anyways. I would rather investigate why that doesn't work.

That said the change looks reasonable, but may not fix the root cause.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  4:38 ` Andi Kleen
@ 2013-10-14  7:49   ` Ingo Molnar
  2013-10-14 21:07     ` Eric Dumazet
  2013-10-14 20:25   ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-14  7:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Neil Horman, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Andi Kleen <andi@firstfloor.org> wrote:

> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant 
> > amount of time computing
> 
> Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> anyways. I would rather investigate why that doesn't work.

There's a fair amount of csum_partial()-only workloads, a packet does not 
need to hit user-space to be a significant portion of the system's 
workload.

That said, it would indeed be nice to hear which particular code path was 
hit in this case, if nothing else then for education purposes.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  4:38 ` Andi Kleen
  2013-10-14  7:49   ` Ingo Molnar
@ 2013-10-14 20:25   ` Neil Horman
  2013-10-15  7:12     ` Sébastien Dugué
  1 sibling, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-14 20:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> 
> Must be an odd workload, most TCP/UDP workloads do copy-checksum
> anyways. I would rather investigate why that doesn't work.
> 
FWIW, the reporter was reporting this using an IP over Infiniband network.
Neil

> That said the change looks reasonable, but may not fix the root cause.
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 17:21 ` Ingo Molnar
  2013-10-13 12:53   ` Neil Horman
@ 2013-10-14 20:28   ` Neil Horman
  2013-10-14 21:19     ` Eric Dumazet
  2013-10-15  7:32     ` Ingo Molnar
  1 sibling, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-14 20:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant amount 
> > of time computing checksums.  We found that by splitting the checksum 
> > computation into two separate streams, each skipping successive elements 
> > of the buffer being summed, we could parallelize the checksum operation 
> > accros multiple alus.  Since neither chain is dependent on the result of 
> > the other, we get a speedup in execution (on hardware that has multiple 
> > alu's available, which is almost ubiquitous on x86), and only a 
> > negligible decrease on hardware that has only a single alu (an extra 
> > addition is introduced).  Since addition in commutative, the result is 
> > the same, only faster
> 
> This patch should really come with measurement numbers: what performance 
> increase (and drop) did you get on what CPUs.
> 
> Thanks,
> 
> 	Ingo
> 


So, early testing results today.  I wrote a test module that, allocated a 4k
buffer, initalized it with random data, and called csum_partial on it 100000
times, recording the time at the start and end of that loop.  Results on a 2.4
GHz Intel Xeon processor:

Without patch: Average execute time for csum_partial was 808 ns
With patch: Average execute time for csum_partial was 438 ns


I'm looking into hpa's suggestion to use alternate instructions where available
right now.  I'll have more soon
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  7:49   ` Ingo Molnar
@ 2013-10-14 21:07     ` Eric Dumazet
  2013-10-15 13:17       ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-14 21:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Neil Horman <nhorman@tuxdriver.com> writes:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > don't have checksum offload hardware were spending a significant 
> > > amount of time computing
> > 
> > Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> > anyways. I would rather investigate why that doesn't work.
> 
> There's a fair amount of csum_partial()-only workloads, a packet does not 
> need to hit user-space to be a significant portion of the system's 
> workload.
> 
> That said, it would indeed be nice to hear which particular code path was 
> hit in this case, if nothing else then for education purposes.

Many NIC do not provide a CHECKSUM_COMPLETE information for encapsulated
frames, meaning we have to fallback to software csum to validate
TCP frames, once tunnel header is pulled.

So to reproduce the issue, all you need is to setup a GRE tunnel between
two hosts, and use any tcp stream workload.

Then receiver profile looks like :

11.45%	[kernel]	 [k] csum_partial
 3.08%	[kernel]	 [k] _raw_spin_lock
 3.04%	[kernel]	 [k] intel_idle
 2.73%	[kernel]	 [k] ipt_do_table
 2.57%	[kernel]	 [k] __netif_receive_skb_core
 2.15%	[kernel]	 [k] copy_user_generic_string
 2.05%	[kernel]	 [k] __hrtimer_start_range_ns
 1.42%	[kernel]	 [k] ip_rcv
 1.39%	[kernel]	 [k] kmem_cache_free
 1.36%	[kernel]	 [k] _raw_spin_unlock_irqrestore
 1.24%	[kernel]	 [k] __schedule
 1.13%	[bnx2x] 	 [k] bnx2x_rx_int
 1.12%	[bnx2x] 	 [k] bnx2x_start_xmit
 1.11%	[kernel]	 [k] fib_table_lookup
 0.99%	[ip_tunnel]  [k] ip_tunnel_lookup
 0.91%	[ip_tunnel]  [k] ip_tunnel_rcv
 0.90%	[kernel]	 [k] check_leaf.isra.7
 0.89%	[kernel]	 [k] nf_iterate



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:28   ` Neil Horman
@ 2013-10-14 21:19     ` Eric Dumazet
  2013-10-14 22:18       ` Eric Dumazet
  2013-10-15  7:32     ` Ingo Molnar
  1 sibling, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-14 21:19 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:

> So, early testing results today.  I wrote a test module that, allocated a 4k
> buffer, initalized it with random data, and called csum_partial on it 100000
> times, recording the time at the start and end of that loop.  Results on a 2.4
> GHz Intel Xeon processor:
> 
> Without patch: Average execute time for csum_partial was 808 ns
> With patch: Average execute time for csum_partial was 438 ns

Impressive, but could you try again with data out of cache ?




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 21:19     ` Eric Dumazet
@ 2013-10-14 22:18       ` Eric Dumazet
  2013-10-14 22:37         ` Joe Perches
  2013-10-17  0:34         ` Neil Horman
  0 siblings, 2 replies; 132+ messages in thread
From: Eric Dumazet @ 2013-10-14 22:18 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> 
> > So, early testing results today.  I wrote a test module that, allocated a 4k
> > buffer, initalized it with random data, and called csum_partial on it 100000
> > times, recording the time at the start and end of that loop.  Results on a 2.4
> > GHz Intel Xeon processor:
> > 
> > Without patch: Average execute time for csum_partial was 808 ns
> > With patch: Average execute time for csum_partial was 438 ns
> 
> Impressive, but could you try again with data out of cache ?

So I tried your patch on a GRE tunnel and got following results on a
single TCP flow. (short result : no visible difference)


Using a prefetch 5*64([%src]) helps more (see at the end)

cpus : model name : Intel Xeon(R) CPU X5660 @ 2.80GHz


Before patch :

lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7651.61   2.51     5.45     0.645   1.399  


After patch :

lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7239.78   2.09     5.19     0.569   1.408  

Profile on receiver

   PerfTop:    1358 irqs/sec  kernel:98.5%  exact:  0.0% [1000Hz cycles],  (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------

    19.99%  [kernel]     [k] csum_partial                
     7.04%  [kernel]     [k] copy_user_generic_string    
     4.92%  [bnx2x]      [k] bnx2x_rx_int                
     3.50%  [kernel]     [k] ipt_do_table                
     2.86%  [kernel]     [k] __netif_receive_skb_core    
     2.35%  [kernel]     [k] fib_table_lookup            
     2.19%  [kernel]     [k] netif_receive_skb           
     1.87%  [kernel]     [k] intel_idle                  
     1.65%  [kernel]     [k] kmem_cache_alloc            
     1.64%  [kernel]     [k] ip_rcv                      
     1.51%  [kernel]     [k] kmem_cache_free             


And attached patch brings much better results

lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..f0e10fc 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			zero = 0;
 			count64 = count >> 3;
 			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
+				asm("prefetch 5*64(%[src])\n\t"
+				    "addq 0*8(%[src]),%[res]\n\t"
 				    "adcq 1*8(%[src]),%[res]\n\t"
 				    "adcq 2*8(%[src]),%[res]\n\t"
 				    "adcq 3*8(%[src]),%[res]\n\t"



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:18       ` Eric Dumazet
@ 2013-10-14 22:37         ` Joe Perches
  2013-10-14 22:44           ` Eric Dumazet
  2013-10-17  0:34         ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-10-14 22:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> attached patch brings much better results
> 
> lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> 
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
[]
> @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>  			zero = 0;
>  			count64 = count >> 3;
>  			while (count64) { 
> -				asm("addq 0*8(%[src]),%[res]\n\t"
> +				asm("prefetch 5*64(%[src])\n\t"

Might the prefetch size be too big here?

0x140 is pretty big and is always multiple cachelines no?

> +				    "addq 0*8(%[src]),%[res]\n\t"
>  				    "adcq 1*8(%[src]),%[res]\n\t"
>  				    "adcq 2*8(%[src]),%[res]\n\t"
>  				    "adcq 3*8(%[src]),%[res]\n\t"




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:37         ` Joe Perches
@ 2013-10-14 22:44           ` Eric Dumazet
  2013-10-14 22:49             ` Joe Perches
  0 siblings, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-14 22:44 UTC (permalink / raw)
  To: Joe Perches
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > attached patch brings much better results
> > 
> > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > Recv   Send    Send                          Utilization       Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > 
> >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > 
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> []
> > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> >  			zero = 0;
> >  			count64 = count >> 3;
> >  			while (count64) { 
> > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > +				asm("prefetch 5*64(%[src])\n\t"
> 
> Might the prefetch size be too big here?

To be effective, you need to prefetch well ahead of time.

5*64 seems common practice (check arch/x86/lib/copy_page_64.S)




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:44           ` Eric Dumazet
@ 2013-10-14 22:49             ` Joe Perches
  2013-10-15  7:41               ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-10-14 22:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > attached patch brings much better results
> > > 
> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > Recv   Send    Send                          Utilization       Service Demand
> > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > 
> > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > 
> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > []
> > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > >  			zero = 0;
> > >  			count64 = count >> 3;
> > >  			while (count64) { 
> > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > +				asm("prefetch 5*64(%[src])\n\t"
> > 
> > Might the prefetch size be too big here?
> 
> To be effective, you need to prefetch well ahead of time.

No doubt.

> 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)

5 cachelines for some processors seems like a lot.

Given you've got a test rig, maybe you could experiment
with 2 and increase it until it doesn't get better.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:25   ` Neil Horman
@ 2013-10-15  7:12     ` Sébastien Dugué
  2013-10-15 13:33       ` Andi Kleen
  0 siblings, 1 reply; 132+ messages in thread
From: Sébastien Dugué @ 2013-10-15  7:12 UTC (permalink / raw)
  To: Neil Horman
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


  Hi Neil, Andi,

On Mon, 14 Oct 2013 16:25:28 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:

> On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> > Neil Horman <nhorman@tuxdriver.com> writes:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > > checksum offload hardware were spending a significant amount of time computing
> > 
> > Must be an odd workload, most TCP/UDP workloads do copy-checksum
> > anyways. I would rather investigate why that doesn't work.
> > 
> FWIW, the reporter was reporting this using an IP over Infiniband network.
> Neil

  indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
where one cannot benefit from hardware offloads.

  For a bit of background on the issue:

  It all started nearly 3 years ago when trying to understand why IPoIB BW was
so low in our setups and why ksoftirqd used 100% of one CPU. A kernel profile
trace showed that the CPU spent most of it's time in checksum computation (from
the only old trace I managed to unearth):

  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                              1730    629976998 us     364148.5 us
  csum_partial                      10813465    20944414 us     1.936 us
  mwait_idle_with_hints                 1451    9858861 us     6794.529 us
  get_page_from_freelist            10110434    8120524 us     0.803 us
  alloc_pages_current               10093675    5180650 us     0.513 us
  __phys_addr                       35554783    4471387 us     0.125 us
  zone_statistics                   10110434    4360871 us     0.431 us
  ipoib_cm_alloc_rx_skb               673899    4343949 us     6.445 us

  After having recoded the checksum to use 2 ALUs, csum_partial() disappeared
from the tracer radar. IPoIB BW got from ~12Gb/s to ~ 20Gb/s and ksoftirqd load
dropped down drastically. Sorry, I could not manage to locate my old traces and
results, those seem to have been lost in the mist of time.

  I did some micro benchmark (dirty hack code below) of different solutions.
It looks like processing 128-byte blocks in 4 chains allows the best performance,
but there are plenty other possibilities.

  FWIW, this code has been running as is at our customers sites for 3 years now.

  Sébastien.

> 
> > That said the change looks reasonable, but may not fix the root cause.
> > 
> > -Andi
> > 
> > -- 
> > ak@linux.intel.com -- Speaking for myself only
> > 

8<----------------------------------------------------------------------


/*
 * gcc -Wall -O3 -o csum_test csum_test.c -lrt
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#include <errno.h>

#define __force
#define unlikely(x)	(x)

typedef uint32_t u32;
typedef uint16_t u16;

typedef u16 __sum16;
typedef u32 __wsum;

#define NUM_LOOPS	100000
#define BUF_LEN		65536
unsigned char buf[BUF_LEN];


/*
 * csum_fold - Fold and invert a 32bit checksum.
 * sum: 32bit unfolded sum
 *
 * Fold a 32bit running checksum to 16bit and invert it. This is usually
 * the last step before putting a checksum into a packet.
 * Make sure not to mix with 64bit checksums.
 */
static inline __sum16 csum_fold(__wsum sum)
{
	asm("  addl %1,%0\n"
	    "  adcl $0xffff,%0"
	    : "=r" (sum)
	    : "r" ((__force u32)sum << 16),
	      "0" ((__force u32)sum & 0xffff0000));
	return (__force __sum16)(~(__force u32)sum >> 16);
}

static inline unsigned short from32to16(unsigned a)
{
	unsigned short b = a >> 16;
	asm("addw %w2,%w0\n\t"
	    "adcw $0,%w0\n"
	    : "=r" (b)
	    : "0" (b), "r" (a));
	return b;
}

static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
	asm("addl %2,%0\n\t"
	    "adcl $0,%0"
	    : "=r" (a)
	    : "0" (a), "r" (b));
	return a;
}

/*
 * Do a 64-bit checksum on an arbitrary memory area.
 * Returns a 32bit checksum.
 *
 * This isn't as time critical as it used to be because many NICs
 * do hardware checksumming these days.
 *
 * Things tried and found to not make it faster:
 * Manual Prefetching
 * Unrolling to an 128 bytes inner loop.
 * Using interleaving with more registers to break the carry chains.
 */
static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res]\n\t"
				    "adcq 1*8(%[src]),%[res]\n\t"
				    "adcq 2*8(%[src]),%[res]\n\t"
				    "adcq 3*8(%[src]),%[res]\n\t"
				    "adcq 4*8(%[src]),%[res]\n\t"
				    "adcq 5*8(%[src]),%[res]\n\t"
				    "adcq 6*8(%[src]),%[res]\n\t"
				    "adcq 7*8(%[src]),%[res]\n\t"
				    "adcq %[zero],%[res]"
				    : [res] "=r" (result)
				    : [src] "r" (buff), [zero] "r" (zero),
				    "[res]" (result));
				buff += 64;
				count64--;
			}
			/* printf("csum %lx\n", result); */

			/* last upto 7 8byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

static unsigned do_csum1(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res1]\n\t"
				    "adcq 2*8(%[src]),%[res1]\n\t"
				    "adcq 4*8(%[src]),%[res1]\n\t"
				    "adcq 6*8(%[src]),%[res1]\n\t"
				    "adcq %[zero],%[res1]\n\t"

				    "addq 1*8(%[src]),%[res2]\n\t"
				    "adcq 3*8(%[src]),%[res2]\n\t"
				    "adcq 5*8(%[src]),%[res2]\n\t"
				    "adcq 7*8(%[src]),%[res2]\n\t"
				    "adcq %[zero],%[res2]"
				    : [res1] "=r" (result1),
				      [res2] "=r" (result2)
				    : [src] "r" (buff), [zero] "r" (zero),
				      "[res1]" (result1), "[res2]" (result2));
				buff += 64;
				count64--;
			}

			asm("addq %[res1],%[res]\n\t"
			    "adcq %[res2],%[res]\n\t"
			    "adcq %[zero],%[res]"
			    : [res] "=r" (result)
			    : [res1] "r" (result1),
			      [res2] "r" (result2),
			      [zero] "r" (zero),
			      "0" (result));

			/* last upto 7 8byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

static unsigned do_csum2(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result3 = 0;
	unsigned long result4 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;

	odd = 1 & (unsigned long) buff;

	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}

	count = len >> 1;		/* nr of 16-bit words.. */

	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}

		count >>= 1;		/* nr of 32-bit words.. */

		if (count) {

			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}

			count >>= 1;	/* nr of 64-bit words.. */

			if (count) {
				unsigned long zero = 0;
				unsigned count128;

				if (8 & (unsigned long) buff) {
					asm("addq %1,%0\n\t"
					    "adcq %2,%0\n"
					    : "=r" (result)
					    : "m" (*(unsigned long *)buff),
					      "r" (zero),  "0" (result));
					count--;
					buff += 8;
				}

				/* main loop using 128 byte blocks */
				count128 = count >> 4;

				while (count128) {
					asm("addq 0*8(%[src]),%[res1]\n\t"
					    "adcq 4*8(%[src]),%[res1]\n\t"
					    "adcq 8*8(%[src]),%[res1]\n\t"
					    "adcq 12*8(%[src]),%[res1]\n\t"
					    "adcq %[zero],%[res1]\n\t"

					    "addq 1*8(%[src]),%[res2]\n\t"
					    "adcq 5*8(%[src]),%[res2]\n\t"
					    "adcq 9*8(%[src]),%[res2]\n\t"
					    "adcq 13*8(%[src]),%[res2]\n\t"
					    "adcq %[zero],%[res2]\n\t"

					    "addq 2*8(%[src]),%[res3]\n\t"
					    "adcq 6*8(%[src]),%[res3]\n\t"
					    "adcq 10*8(%[src]),%[res3]\n\t"
					    "adcq 14*8(%[src]),%[res3]\n\t"
					    "adcq %[zero],%[res3]\n\t"

					    "addq 3*8(%[src]),%[res4]\n\t"
					    "adcq 7*8(%[src]),%[res4]\n\t"
					    "adcq 11*8(%[src]),%[res4]\n\t"
					    "adcq 15*8(%[src]),%[res4]\n\t"
					    "adcq %[zero],%[res4]"

					    : [res1] "=r" (result1),
					      [res2] "=r" (result2),
					      [res3] "=r" (result3),
					      [res4] "=r" (result4)

					    : [src] "r" (buff),
					      [zero] "r" (zero),
					      "[res1]" (result1),
					      "[res2]" (result2),
					      "[res3]" (result3),
					      "[res4]" (result4));
					buff += 128;
					count128--;
				}

				asm("addq %[res1],%[res]\n\t"
				    "adcq %[res2],%[res]\n\t"
				    "adcq %[res3],%[res]\n\t"
				    "adcq %[res4],%[res]\n\t"
				    "adcq %[zero],%[res]"
				    : [res] "=r" (result)
				    : [res1] "r" (result1),
				      [res2] "r" (result2),
				      [res3] "r" (result3),
				      [res4] "r" (result4),
				      [zero] "r" (zero),
				      "0" (result));

				/* last upto 15 8byte blocks */
				count %= 16;
				while (count) {
					asm("addq %1,%0\n\t"
					    "adcq %2,%0\n"
					    : "=r" (result)
					    : "m" (*(unsigned long *)buff),
					      "r" (zero),  "0" (result));
					--count;
					buff += 8;
				}
				result = add32_with_carry(result>>32,
							  result&0xffffffff);

				if (len & 8) {
					asm("addq %1,%0\n\t"
					    "adcq %2,%0\n"
					    : "=r" (result)
					    : "m" (*(unsigned long *)buff),
					      "r" (zero),  "0" (result));
					buff += 8;
				}
			}

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}


static unsigned do_csum3(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result3 = 0;
	unsigned long result4 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res1]\n\t"
				    "adcq 4*8(%[src]),%[res1]\n\t"
				    "adcq %[zero],%[res1]\n\t"

				    "addq 1*8(%[src]),%[res2]\n\t"
				    "adcq 5*8(%[src]),%[res2]\n\t"
				    "adcq %[zero],%[res2]\n\t"

				    "addq 2*8(%[src]),%[res3]\n\t"
				    "adcq 6*8(%[src]),%[res3]\n\t"
				    "adcq %[zero],%[res3]\n\t"

				    "addq 3*8(%[src]),%[res4]\n\t"
				    "adcq 7*8(%[src]),%[res4]\n\t"
				    "adcq %[zero],%[res4]\n\t"

				    : [res1] "=r" (result1),
				      [res2] "=r" (result2),
				      [res3] "=r" (result3),
				      [res4] "=r" (result4)
				    : [src] "r" (buff),
				      [zero] "r" (zero),
				      "[res1]" (result1),
				      "[res2]" (result2),
				      "[res3]" (result3),
				      "[res4]" (result4));
				buff += 64;
				count64--;
			}

			asm("addq %[res1],%[res]\n\t"
			    "adcq %[res2],%[res]\n\t"
			    "adcq %[res3],%[res]\n\t"
			    "adcq %[res4],%[res]\n\t"
			    "adcq %[zero],%[res]"
			    : [res] "=r" (result)
			    : [res1] "r" (result1),
			      [res2] "r" (result2),
			      [res3] "r" (result3),
			      [res4] "r" (result4),
			      [zero] "r" (zero),
			      "0" (result));

			/* printf("csum1 %lx\n", result); */

			/* last upto 7 8byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

long long delta_ns(struct timespec *t1, struct timespec *t2)
{
	long long tt1, tt2, delta;

	tt1 = t1->tv_sec * 1000000000 + t1->tv_nsec;
	tt2 = t2->tv_sec * 1000000000 + t2->tv_nsec;
	delta = tt2 - tt1;

	return delta;
}

int main(int argc, char **argv)
{
	FILE *f;
	unsigned csum1, csum2, csum3, csum4;
	struct timespec t1;
	struct timespec t2;
	double delta;
	int i;
	unsigned int offset = 0;
	unsigned char *ptr;
	unsigned int size;

	if ((f = fopen("data.bin", "r")) == NULL) {
		printf("Failed to open input file data.bin: %s\n",
		       strerror(errno));
		return -1;
	}

	if (fread(buf, 1, BUF_LEN, f) != BUF_LEN) {
		printf("Failed to read data.bin: %s\n",
		       strerror(errno));
		fclose(f);
		return -1;
	}

	fclose(f);

	if (argc > 1)
		offset = atoi(argv[1]);

	printf("Using offset=%d\n", offset);

	ptr = &buf[offset];
	size = BUF_LEN - offset;

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum1 = do_csum((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("Original:    %.8x %f us\n",
	       csum1, (double)delta/(double)NUM_LOOPS);

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum2 = do_csum1((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("64B Split2:  %.8x %f us\n",
	       csum2, (double)delta/(double)NUM_LOOPS);


	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum3 = do_csum2((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("128B Split4: %.8x %f us\n",
	       csum3, (double)delta/(double)NUM_LOOPS);

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum4 = do_csum3((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("64B Split4:  %.8x %f us\n",
	       csum4, (double)delta/(double)NUM_LOOPS);

	if ((csum1 != csum2) || (csum1 != csum3) || (csum1 != csum4))
		printf("Wrong checksum\n");

	return 0;
}



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:28   ` Neil Horman
  2013-10-14 21:19     ` Eric Dumazet
@ 2013-10-15  7:32     ` Ingo Molnar
  2013-10-15 13:14       ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-15  7:32 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > don't have checksum offload hardware were spending a significant amount 
> > > of time computing checksums.  We found that by splitting the checksum 
> > > computation into two separate streams, each skipping successive elements 
> > > of the buffer being summed, we could parallelize the checksum operation 
> > > accros multiple alus.  Since neither chain is dependent on the result of 
> > > the other, we get a speedup in execution (on hardware that has multiple 
> > > alu's available, which is almost ubiquitous on x86), and only a 
> > > negligible decrease on hardware that has only a single alu (an extra 
> > > addition is introduced).  Since addition in commutative, the result is 
> > > the same, only faster
> > 
> > This patch should really come with measurement numbers: what performance 
> > increase (and drop) did you get on what CPUs.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> 
> So, early testing results today.  I wrote a test module that, allocated 
> a 4k buffer, initalized it with random data, and called csum_partial on 
> it 100000 times, recording the time at the start and end of that loop.  

It would be nice to stick that testcase into tools/perf/bench/, see how we 
are able to benchmark the kernel's mempcy and memset implementation there:

 $ perf bench mem memcpy -r help
 # Running 'mem/memcpy' benchmark:
 Unknown routine:help
 Available routines...
        default ... Default memcpy() provided by glibc
        x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
        x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
        x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S

In a similar fashion we could build the csum_partial() code as well and do 
measurements. (We could change arch/x86/ code as well to make such 
embedding/including easier, as long as it does not change performance.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:49             ` Joe Perches
@ 2013-10-15  7:41               ` Ingo Molnar
  2013-10-15 10:51                 ` Borislav Petkov
  2013-10-15 16:21                 ` Joe Perches
  0 siblings, 2 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-15  7:41 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Joe Perches <joe@perches.com> wrote:

> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > attached patch brings much better results
> > > > 
> > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > Recv   Send    Send                          Utilization       Service Demand
> > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > 
> > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > 
> > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > []
> > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > >  			zero = 0;
> > > >  			count64 = count >> 3;
> > > >  			while (count64) { 
> > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > 
> > > Might the prefetch size be too big here?
> > 
> > To be effective, you need to prefetch well ahead of time.
> 
> No doubt.

So why did you ask then?

> > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> 
> 5 cachelines for some processors seems like a lot.

What processors would that be?

Most processors have hundreds of cachelines even in their L1 cache. 
Thousands in the L2 cache, up to hundreds of thousands.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:41               ` Ingo Molnar
@ 2013-10-15 10:51                 ` Borislav Petkov
  2013-10-15 12:04                   ` Ingo Molnar
  2013-10-15 16:21                 ` Joe Perches
  1 sibling, 1 reply; 132+ messages in thread
From: Borislav Petkov @ 2013-10-15 10:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Joe Perches, Eric Dumazet, Neil Horman, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86

On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> Most processors have hundreds of cachelines even in their L1 cache.
> Thousands in the L2 cache, up to hundreds of thousands.

Also, I have this hazy memory of prefetch hints being harmful in some
situations: https://lwn.net/Articles/444344/

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 10:51                 ` Borislav Petkov
@ 2013-10-15 12:04                   ` Ingo Molnar
  0 siblings, 0 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-15 12:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Joe Perches, Eric Dumazet, Neil Horman, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86


* Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> > Most processors have hundreds of cachelines even in their L1 cache.
> > Thousands in the L2 cache, up to hundreds of thousands.
> 
> Also, I have this hazy memory of prefetch hints being harmful in some
> situations: https://lwn.net/Articles/444344/

Yes, for things like random list walks they tend to be harmful - the 
hardware is smarter.

For something like a controlled packet stream they might be helpful.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:32     ` Ingo Molnar
@ 2013-10-15 13:14       ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-15 13:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > > don't have checksum offload hardware were spending a significant amount 
> > > > of time computing checksums.  We found that by splitting the checksum 
> > > > computation into two separate streams, each skipping successive elements 
> > > > of the buffer being summed, we could parallelize the checksum operation 
> > > > accros multiple alus.  Since neither chain is dependent on the result of 
> > > > the other, we get a speedup in execution (on hardware that has multiple 
> > > > alu's available, which is almost ubiquitous on x86), and only a 
> > > > negligible decrease on hardware that has only a single alu (an extra 
> > > > addition is introduced).  Since addition in commutative, the result is 
> > > > the same, only faster
> > > 
> > > This patch should really come with measurement numbers: what performance 
> > > increase (and drop) did you get on what CPUs.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > 
> > So, early testing results today.  I wrote a test module that, allocated 
> > a 4k buffer, initalized it with random data, and called csum_partial on 
> > it 100000 times, recording the time at the start and end of that loop.  
> 
> It would be nice to stick that testcase into tools/perf/bench/, see how we 
> are able to benchmark the kernel's mempcy and memset implementation there:
> 
Sure, my module is a mess currently.  But as soon as I investigate the use of
ADCX/ADOX that Anvin suggested I'll see about integrating that
Neil

>  $ perf bench mem memcpy -r help
>  # Running 'mem/memcpy' benchmark:
>  Unknown routine:help
>  Available routines...
>         default ... Default memcpy() provided by glibc
>         x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
>         x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
>         x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S
> 
> In a similar fashion we could build the csum_partial() code as well and do 
> measurements. (We could change arch/x86/ code as well to make such 
> embedding/including easier, as long as it does not change performance.)
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 21:07     ` Eric Dumazet
@ 2013-10-15 13:17       ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-15 13:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Andi Kleen, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> > * Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > Neil Horman <nhorman@tuxdriver.com> writes:
> > > 
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > > don't have checksum offload hardware were spending a significant 
> > > > amount of time computing
> > > 
> > > Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> > > anyways. I would rather investigate why that doesn't work.
> > 
> > There's a fair amount of csum_partial()-only workloads, a packet does not 
> > need to hit user-space to be a significant portion of the system's 
> > workload.
> > 
> > That said, it would indeed be nice to hear which particular code path was 
> > hit in this case, if nothing else then for education purposes.
> 
> Many NIC do not provide a CHECKSUM_COMPLETE information for encapsulated
> frames, meaning we have to fallback to software csum to validate
> TCP frames, once tunnel header is pulled.
> 
> So to reproduce the issue, all you need is to setup a GRE tunnel between
> two hosts, and use any tcp stream workload.
> 
> Then receiver profile looks like :
> 
> 11.45%	[kernel]	 [k] csum_partial
>  3.08%	[kernel]	 [k] _raw_spin_lock
>  3.04%	[kernel]	 [k] intel_idle
>  2.73%	[kernel]	 [k] ipt_do_table
>  2.57%	[kernel]	 [k] __netif_receive_skb_core
>  2.15%	[kernel]	 [k] copy_user_generic_string
>  2.05%	[kernel]	 [k] __hrtimer_start_range_ns
>  1.42%	[kernel]	 [k] ip_rcv
>  1.39%	[kernel]	 [k] kmem_cache_free
>  1.36%	[kernel]	 [k] _raw_spin_unlock_irqrestore
>  1.24%	[kernel]	 [k] __schedule
>  1.13%	[bnx2x] 	 [k] bnx2x_rx_int
>  1.12%	[bnx2x] 	 [k] bnx2x_start_xmit
>  1.11%	[kernel]	 [k] fib_table_lookup
>  0.99%	[ip_tunnel]  [k] ip_tunnel_lookup
>  0.91%	[ip_tunnel]  [k] ip_tunnel_rcv
>  0.90%	[kernel]	 [k] check_leaf.isra.7
>  0.89%	[kernel]	 [k] nf_iterate
> 
As I noted previously the workload that this got reported on was ipoib, which
has a simmilar profile, since infiniband cards tend to not be able to do
checksum offload for ip frames.

Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:12     ` Sébastien Dugué
@ 2013-10-15 13:33       ` Andi Kleen
  2013-10-15 13:56         ` Sébastien Dugué
  0 siblings, 1 reply; 132+ messages in thread
From: Andi Kleen @ 2013-10-15 13:33 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Neil Horman, Andi Kleen, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

>   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> where one cannot benefit from hardware offloads.

Is this with sendfile? 

For normal send() the checksum is done in the user copy and for receiving it
can be also done during the copy in most cases

-Andi

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 13:33       ` Andi Kleen
@ 2013-10-15 13:56         ` Sébastien Dugué
  2013-10-15 14:06           ` Eric Dumazet
  0 siblings, 1 reply; 132+ messages in thread
From: Sébastien Dugué @ 2013-10-15 13:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Neil Horman, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, 15 Oct 2013 15:33:36 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > where one cannot benefit from hardware offloads.
> 
> Is this with sendfile?

  Tests were done with iperf at the time without any extra funky options, and
looking at the code it looks like it does plain write() / recv() on the socket.

  Sébastien.

> 
> For normal send() the checksum is done in the user copy and for receiving it
> can be also done during the copy in most cases
> 
> -Andi

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 13:56         ` Sébastien Dugué
@ 2013-10-15 14:06           ` Eric Dumazet
  2013-10-15 14:15             ` Sébastien Dugué
  0 siblings, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:06 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> On Tue, 15 Oct 2013 15:33:36 +0200
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > > where one cannot benefit from hardware offloads.
> > 
> > Is this with sendfile?
> 
>   Tests were done with iperf at the time without any extra funky options, and
> looking at the code it looks like it does plain write() / recv() on the socket.
> 

But the csum cost is both for sender and receiver ?

Please post the following :

perf record -g "your iperf session"

perf report | head -n 200




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:06           ` Eric Dumazet
@ 2013-10-15 14:15             ` Sébastien Dugué
  2013-10-15 14:26               ` Eric Dumazet
  0 siblings, 1 reply; 132+ messages in thread
From: Sébastien Dugué @ 2013-10-15 14:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

Hi Eric,

On Tue, 15 Oct 2013 07:06:25 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> > On Tue, 15 Oct 2013 15:33:36 +0200
> > Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > > > where one cannot benefit from hardware offloads.
> > > 
> > > Is this with sendfile?
> > 
> >   Tests were done with iperf at the time without any extra funky options, and
> > looking at the code it looks like it does plain write() / recv() on the socket.
> > 
> 
> But the csum cost is both for sender and receiver ?

  No, it was only on the receiver side that I noticed it.

> 
> Please post the following :
> 
> perf record -g "your iperf session"
> 
> perf report | head -n 200

  Sorry, but this is 3 years old stuff and I do not have the
setup anymore to reproduce.

  Sébastien.



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:15             ` Sébastien Dugué
@ 2013-10-15 14:26               ` Eric Dumazet
  2013-10-15 14:52                 ` Eric Dumazet
  0 siblings, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:26 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote:
> Hi Eric,
> 
> On Tue, 15 Oct 2013 07:06:25 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > But the csum cost is both for sender and receiver ?
> 
>   No, it was only on the receiver side that I noticed it.
> 

Yes, as Andi said, we do the csum while copying the data for the
sender : (I disabled hardware assist tx checksum using 'ethtool -K eth0
tx off')

    17.21%  netperf  [kernel.kallsyms]       [k]
csum_partial_copy_generic                 
            |
            --- csum_partial_copy_generic
               |          
               |--97.39%-- __libc_send
               |          
                --2.61%-- tcp_sendmsg
                          inet_sendmsg
                          sock_sendmsg
                          _sys_sendto
                          sys_sendto
                          system_call_fastpath
                          __libc_send



>   Sorry, but this is 3 years old stuff and I do not have the
> setup anymore to reproduce.

And the receiver should also do the same : (ethtool -K eth0 rx off)

    10.55%    netserver  [kernel.kallsyms]  [k]
csum_partial_copy_generic            
              |
              --- csum_partial_copy_generic
                 |          
                 |--98.24%-- __libc_recv
                 |          
                  --1.76%-- skb_copy_and_csum_datagram
                            skb_copy_and_csum_datagram_iovec
                            tcp_rcv_established
                            tcp_v4_do_rcv
                            |          
                            |--73.05%-- tcp_prequeue_process
                            |          tcp_recvmsg
                            |          inet_recvmsg
                            |          sock_recvmsg
                            |          SYSC_recvfrom
                            |          SyS_recvfrom
                            |          system_call_fastpath
                            |          __libc_recv
                            |          

So I suspect something is wrong with IPoIB.





^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:26               ` Eric Dumazet
@ 2013-10-15 14:52                 ` Eric Dumazet
  2013-10-15 16:02                   ` Andi Kleen
  0 siblings, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:52 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote:

> And the receiver should also do the same : (ethtool -K eth0 rx off)
> 
>     10.55%    netserver  [kernel.kallsyms]  [k]
> csum_partial_copy_generic            

I get the csum_partial() if disabling prequeue.

echo 1 >/proc/sys/net/ipv4/tcp_low_latency

    24.49%      swapper  [kernel.kallsyms]  [k]
csum_partial                        
                |
                --- csum_partial
                    skb_checksum
                    __skb_checksum_complete_head
                    __skb_checksum_complete
                    tcp_rcv_established
                    tcp_v4_do_rcv
                    tcp_v4_rcv
                    ip_local_deliver_finish
                    ip_local_deliver
                    ip_rcv_finish
                    ip_rcv

So yes, we can call csum_partial() in receive path in this case.




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:52                 ` Eric Dumazet
@ 2013-10-15 16:02                   ` Andi Kleen
  2013-10-16  0:28                     ` Eric Dumazet
  0 siblings, 1 reply; 132+ messages in thread
From: Andi Kleen @ 2013-10-15 16:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Sébastien Dugué,
	Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

> I get the csum_partial() if disabling prequeue.

At least in the ipoib case i would consider that a misconfiguration.

"don't do this if it hurts"

There may be more such problems.

-Andi

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:41               ` Ingo Molnar
  2013-10-15 10:51                 ` Borislav Petkov
@ 2013-10-15 16:21                 ` Joe Perches
  2013-10-16  0:34                   ` Eric Dumazet
  2013-10-16  6:25                   ` Ingo Molnar
  1 sibling, 2 replies; 132+ messages in thread
From: Joe Perches @ 2013-10-15 16:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> * Joe Perches <joe@perches.com> wrote:
> 
> > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > attached patch brings much better results
> > > > > 
> > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > Recv   Send    Send                          Utilization       Service Demand
> > > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > > 
> > > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > > 
> > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > []
> > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > >  			zero = 0;
> > > > >  			count64 = count >> 3;
> > > > >  			while (count64) { 
> > > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > > 
> > > > Might the prefetch size be too big here?
> > > 
> > > To be effective, you need to prefetch well ahead of time.
> > 
> > No doubt.
> 
> So why did you ask then?
> 
> > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> > 
> > 5 cachelines for some processors seems like a lot.
> 
> What processors would that be?

The ones where conservatism in L1 cache use is good
because there are multiple threads running concurrently.

> Most processors have hundreds of cachelines even in their L1 cache. 

And sometimes that many executable processes too.

> Thousands in the L2 cache, up to hundreds of thousands.

Irrelevant because prefetch doesn't apply there.

Ingo, Eric _showed_ that the prefetch is good here.
How about looking at a little optimization to the minimal
prefetch that gives that level of performance.

You could argue that prefetching PAGE_SIZE or larger
would be better still otherwise.

I suspect that using a smaller multiple of
L1_CACHE_BYTES like 2 or 3 would perform the same.

The last time it was looked at for copy_page_64.S was
quite awhile ago.  It looks like maybe 2003.



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:02                   ` Andi Kleen
@ 2013-10-16  0:28                     ` Eric Dumazet
  0 siblings, 0 replies; 132+ messages in thread
From: Eric Dumazet @ 2013-10-16  0:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Sébastien Dugué,
	Neil Horman, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote:
> > I get the csum_partial() if disabling prequeue.
> 
> At least in the ipoib case i would consider that a misconfiguration.

There is nothing you can do, if application is not blocked on recv(),
but using poll()/epoll()/select(), prequeue is not used at all.

In this case, we need to csum_partial() frame before sending an ACK,
don't you think ? ;)




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:21                 ` Joe Perches
@ 2013-10-16  0:34                   ` Eric Dumazet
  2013-10-16  6:25                   ` Ingo Molnar
  1 sibling, 0 replies; 132+ messages in thread
From: Eric Dumazet @ 2013-10-16  0:34 UTC (permalink / raw)
  To: Joe Perches
  Cc: Ingo Molnar, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote:

> Ingo, Eric _showed_ that the prefetch is good here.
> How about looking at a little optimization to the minimal
> prefetch that gives that level of performance.

Wait a minute, my point was to remind that main cost is the
memory fetching.

Its nice to optimize cpu cycles if we are short of them,
but in the csum_partial() case, the bottleneck is the memory.

Also I was wondering on the implications of changing reads order,
as it might fool cpu predictions.

I do not particularly care about finding the right prefetch stride,
I think Intel guys know better than me.



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:21                 ` Joe Perches
  2013-10-16  0:34                   ` Eric Dumazet
@ 2013-10-16  6:25                   ` Ingo Molnar
  2013-10-16 16:55                     ` Joe Perches
  1 sibling, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-16  6:25 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Joe Perches <joe@perches.com> wrote:

> On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> > * Joe Perches <joe@perches.com> wrote:
> > 
> > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > > attached patch brings much better results
> > > > > > 
> > > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > > Recv   Send    Send                          Utilization       Service Demand
> > > > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > > > 
> > > > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > > > 
> > > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > > []
> > > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > > >  			zero = 0;
> > > > > >  			count64 = count >> 3;
> > > > > >  			while (count64) { 
> > > > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > > > 
> > > > > Might the prefetch size be too big here?
> > > > 
> > > > To be effective, you need to prefetch well ahead of time.
> > > 
> > > No doubt.
> > 
> > So why did you ask then?
> > 
> > > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> > > 
> > > 5 cachelines for some processors seems like a lot.
> > 
> > What processors would that be?
> 
> The ones where conservatism in L1 cache use is good because there are 
> multiple threads running concurrently.

What specific processor models would that be?

> > Most processors have hundreds of cachelines even in their L1 cache.
>
> And sometimes that many executable processes too.

Nonsense, this is an unrolled loop running in softirq context most of the 
time that does not get preempted.

> > Thousands in the L2 cache, up to hundreds of thousands.
> 
> Irrelevant because prefetch doesn't apply there.

What planet are you living on? Prefetch takes memory from L2->L1 memory 
just as much as it moves it cachelines from memory to the L2 cache. 

Especially in the usecase cited here there will be a second use of the 
data (when it's finally copied over into user-space), so the L2 cache size 
very much matters.

The prefetches here matter mostly to the packet being processed: the ideal 
size of the look-ahead window in csum_partial() is dictated by typical 
memory latencies and bandwidth. The amount of parallelism is limited by 
the number of carry bits we can maintain independently.

> Ingo, Eric _showed_ that the prefetch is good here. How about looking at 
> a little optimization to the minimal prefetch that gives that level of 
> performance.

Joe, instead of using a condescending tone in matters you clearly have 
very little clue about you might want to start doing some real kernel 
hacking in more serious kernel areas, beyond trivial matters such as 
printk strings, to gain a bit of experience and respect ...

Every word you uttered in this thread made it more likely for me to 
redirect you to /dev/null, permanently.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-16  6:25                   ` Ingo Molnar
@ 2013-10-16 16:55                     ` Joe Perches
  0 siblings, 0 replies; 132+ messages in thread
From: Joe Perches @ 2013-10-16 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote:
>  Prefetch takes memory from L2->L1 memory 
> just as much as it moves it cachelines from memory to the L2 cache. 

Yup, mea culpa.
I thought the prefetch was still to L1 like the Pentium.



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:18       ` Eric Dumazet
  2013-10-14 22:37         ` Joe Perches
@ 2013-10-17  0:34         ` Neil Horman
  2013-10-17  1:42           ` Eric Dumazet
  2013-10-17  8:41           ` Ingo Molnar
  1 sibling, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-17  0:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > 
> > > So, early testing results today.  I wrote a test module that, allocated a 4k
> > > buffer, initalized it with random data, and called csum_partial on it 100000
> > > times, recording the time at the start and end of that loop.  Results on a 2.4
> > > GHz Intel Xeon processor:
> > > 
> > > Without patch: Average execute time for csum_partial was 808 ns
> > > With patch: Average execute time for csum_partial was 438 ns
> > 
> > Impressive, but could you try again with data out of cache ?
> 
> So I tried your patch on a GRE tunnel and got following results on a
> single TCP flow. (short result : no visible difference)
> 
> 

So I went to reproduce these results, but was unable to (due to the fact that I
only have a pretty jittery network to do testing accross at the moment with
these devices).  So instead I figured that I would go back to just doing
measurements with the module that I cobbled together (operating under the
assumption that it would give me accurate, relatively jitter free results (I've
attached the module code for reference below).  My results show slightly
different behavior:

Base results runs:
89417240
85170397
85208407
89422794
91645494
103655144
86063791
75647774
83502921
85847372
AVG = 875 ns

Prefetch only runs:
70962849
77555099
81898170
68249290
72636538
83039294
78561494
83393369
85317556
79570951
AVG = 781 ns

Parallel addition only runs:
42024233
44313064
48304416
64762297
42994259
41811628
55654282
64892958
55125582
42456403
AVG = 510 ns


Both prefetch and parallel addition:
41329930
40689195
61106622
46332422
49398117
52525171
49517101
61311153
43691814
49043084
AVG = 494 ns


For reference, each of the above large numbers is the number of nanoseconds
taken to compute the checksum of a 4kb buffer 100000 times.  To get my average
results, I ran the test in a loop 10 times, averaged them, and divided by
100000.


Based on these, prefetching is obviously a a good improvement, but not as good
as parallel execution, and the winner by far is doing both.

Thoughts?

Neil



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;

	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

	if (!buf) {
		printk(KERN_CRIT "UNABLE TO ALLOCATE A BUFFER OF %lu bytes\n", PAGE_SIZE);
		return -ENOMEM;
	}

	printk(KERN_CRIT "INITALIZING BUFFER\n");
	get_random_bytes(buf, PAGE_SIZE);

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);

	for(i=0;i<100000;i++)
		sum = csum_partial(buf, PAGE_SIZE, sum);
	getnstimeofday(&end);
	preempt_enable();
	if (start.tv_nsec > end.tv_nsec)
		time = (ULLONG_MAX - end.tv_nsec) + start.tv_nsec;
	else 
		time = end.tv_nsec - start.tv_nsec;

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	kfree(buf);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  0:34         ` Neil Horman
@ 2013-10-17  1:42           ` Eric Dumazet
  2013-10-18 16:50             ` Neil Horman
  2013-10-17  8:41           ` Ingo Molnar
  1 sibling, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-17  1:42 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote:

> > 
> 
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing accross at the moment with
> these devices).  So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together (operating under the
> assumption that it would give me accurate, relatively jitter free results (I've
> attached the module code for reference below).  My results show slightly
> different behavior:
> 
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
> 
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
> 
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
> 
> 
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
> 
> 
> For reference, each of the above large numbers is the number of nanoseconds
> taken to compute the checksum of a 4kb buffer 100000 times.  To get my average
> results, I ran the test in a loop 10 times, averaged them, and divided by
> 100000.
> 
> 
> Based on these, prefetching is obviously a a good improvement, but not as good
> as parallel execution, and the winner by far is doing both.
> 
> Thoughts?
> 
> Neil
> 


Your benchmark uses a single 4K page, so data is _super_ hot in cpu
caches.
( prefetch should give no speedups, I am surprised it makes any
difference)

Try now with 32 huges pages, to get 64 MBytes of working set.

Because in reality we never csum_partial() data in cpu cache.
(Unless the NIC preloaded the data into cpu cache before sending the
interrupt)

Really, if Sebastien got a speed up, it means that something fishy was
going on, like :

- A copy of data into some area of memory, prefilling cpu caches
- csum_partial() done while data is hot in cache.

This is exactly a "should not happen" scenario, because the csum in this
case should happen _while_ doing the copy, for 0 ns.




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  0:34         ` Neil Horman
  2013-10-17  1:42           ` Eric Dumazet
@ 2013-10-17  8:41           ` Ingo Molnar
  2013-10-17 18:19             ` H. Peter Anvin
  2013-10-28 16:01             ` Neil Horman
  1 sibling, 2 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-17  8:41 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > > 
> > > > So, early testing results today.  I wrote a test module that, allocated a 4k
> > > > buffer, initalized it with random data, and called csum_partial on it 100000
> > > > times, recording the time at the start and end of that loop.  Results on a 2.4
> > > > GHz Intel Xeon processor:
> > > > 
> > > > Without patch: Average execute time for csum_partial was 808 ns
> > > > With patch: Average execute time for csum_partial was 438 ns
> > > 
> > > Impressive, but could you try again with data out of cache ?
> > 
> > So I tried your patch on a GRE tunnel and got following results on a
> > single TCP flow. (short result : no visible difference)
> > 
> > 
> 
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing accross at the moment with
> these devices).  So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together (operating under the
> assumption that it would give me accurate, relatively jitter free results (I've
> attached the module code for reference below).  My results show slightly
> different behavior:
> 
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
> 
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
> 
> 
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
> 
> 
> For reference, each of the above large numbers is the number of 
> nanoseconds taken to compute the checksum of a 4kb buffer 100000 times.  
> To get my average results, I ran the test in a loop 10 times, averaged 
> them, and divided by 100000.
> 
> Based on these, prefetching is obviously a a good improvement, but not 
> as good as parallel execution, and the winner by far is doing both.

But in the actual usecase mentioned the packet data was likely cache-cold, 
it just arrived in the NIC and an IRQ got sent. Your testcase uses a 
super-hot 4K buffer that fits into the L1 cache. So it's apples to 
oranges.

To correctly simulate the workload you'd have to:

 - allocate a buffer larger than your L2 cache.

 - to measure the effects of the prefetches you'd also have to randomize
   the individual buffer positions. See how 'perf bench numa' implements a
   random walk via --data_rand_walk, in tools/perf/bench/numa.c.
   Otherwise the CPU might learn your simplistic stream direction and the
   L2 cache might hw-prefetch your data, interfering with any explicit 
   prefetches the code does. In many real-life usecases packet buffers are
   scattered.

Also, it would be nice to see standard deviation noise numbers when two 
averages are close to each other, to be able to tell whether differences 
are statistically significant or not.

For example 'perf stat --repeat' will output stddev for you:

  comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'

   Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):

       0.189084480 seconds time elapsed                                          ( +- 11.95% )

The last '+-' percentage is the noise of the measurement.

Also note that you can inspect many cache behavior details of your 
algorithm via perf stat - the -ddd option will give you a laundry list:

  aldebaran:~> perf stat --repeat 20 -ddd perf bench sched messaging
  ...

     Total time: 0.095 [sec]

 Performance counter stats for 'perf bench sched messaging' (20 runs):

       1519.128721 task-clock (msec)         #   12.305 CPUs utilized            ( +-  0.34% )
            22,882 context-switches          #    0.015 M/sec                    ( +-  2.84% )
             3,927 cpu-migrations            #    0.003 M/sec                    ( +-  2.74% )
            16,616 page-faults               #    0.011 M/sec                    ( +-  0.17% )
     2,327,978,366 cycles                    #    1.532 GHz                      ( +-  1.61% ) [36.43%]
     1,715,561,189 stalled-cycles-frontend   #   73.69% frontend cycles idle     ( +-  1.76% ) [38.05%]
       715,715,454 stalled-cycles-backend    #   30.74% backend  cycles idle     ( +-  2.25% ) [39.85%]
     1,253,106,346 instructions              #    0.54  insns per cycle        
                                             #    1.37  stalled cycles per insn  ( +-  1.71% ) [49.68%]
       241,181,126 branches                  #  158.763 M/sec                    ( +-  1.43% ) [47.83%]
         4,232,053 branch-misses             #    1.75% of all branches          ( +-  1.23% ) [48.63%]
       431,907,354 L1-dcache-loads           #  284.313 M/sec                    ( +-  1.00% ) [48.37%]
        20,550,528 L1-dcache-load-misses     #    4.76% of all L1-dcache hits    ( +-  0.82% ) [47.61%]
         7,435,847 LLC-loads                 #    4.895 M/sec                    ( +-  0.94% ) [36.11%]
         2,419,201 LLC-load-misses           #   32.53% of all LL-cache hits     ( +-  2.93% ) [ 7.33%]
       448,638,547 L1-icache-loads           #  295.326 M/sec                    ( +-  2.43% ) [21.75%]
        22,066,490 L1-icache-load-misses     #    4.92% of all L1-icache hits    ( +-  2.54% ) [30.66%]
       475,557,948 dTLB-loads                #  313.047 M/sec                    ( +-  1.96% ) [37.96%]
         6,741,523 dTLB-load-misses          #    1.42% of all dTLB cache hits   ( +-  2.38% ) [37.05%]
     1,268,628,660 iTLB-loads                #  835.103 M/sec                    ( +-  1.75% ) [36.45%]
            74,192 iTLB-load-misses          #    0.01% of all iTLB cache hits   ( +-  2.88% ) [36.19%]
         4,466,526 L1-dcache-prefetches      #    2.940 M/sec                    ( +-  1.61% ) [36.17%]
         2,396,311 L1-dcache-prefetch-misses #    1.577 M/sec                    ( +-  1.55% ) [35.71%]

       0.123459566 seconds time elapsed                                          ( +-  0.58% )

There's also a number of prefetch counters that might be useful:

 aldebaran:~> perf list | grep prefetch
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  node-prefetches                                    [Hardware cache event]
  node-prefetch-misses                               [Hardware cache event]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  8:41           ` Ingo Molnar
@ 2013-10-17 18:19             ` H. Peter Anvin
  2013-10-17 18:48               ` Eric Dumazet
  2013-10-18  6:43               ` Ingo Molnar
  2013-10-28 16:01             ` Neil Horman
  1 sibling, 2 replies; 132+ messages in thread
From: H. Peter Anvin @ 2013-10-17 18:19 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, x86, netdev

On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> 
> To correctly simulate the workload you'd have to:
> 
>  - allocate a buffer larger than your L2 cache.
> 
>  - to measure the effects of the prefetches you'd also have to randomize
>    the individual buffer positions. See how 'perf bench numa' implements a
>    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
>    Otherwise the CPU might learn your simplistic stream direction and the
>    L2 cache might hw-prefetch your data, interfering with any explicit 
>    prefetches the code does. In many real-life usecases packet buffers are
>    scattered.
> 
> Also, it would be nice to see standard deviation noise numbers when two 
> averages are close to each other, to be able to tell whether differences 
> are statistically significant or not.
> 

Seriously, though, how much does it matter?  All the above seems likely
to do is to drown the signal by adding noise.

If the parallel (threaded) checksumming is faster, which theory says it
should and microbenchmarking confirms, how important are the
macrobenchmarks?

	-hpa



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17 18:19             ` H. Peter Anvin
@ 2013-10-17 18:48               ` Eric Dumazet
  2013-10-18  6:43               ` Ingo Molnar
  1 sibling, 0 replies; 132+ messages in thread
From: Eric Dumazet @ 2013-10-17 18:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, x86, netdev

On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote:

> Seriously, though, how much does it matter?  All the above seems likely
> to do is to drown the signal by adding noise.

I don't think so.

> 
> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenchmarking confirms, how important are the
> macrobenchmarks?

Seriously, micro benchmarks are very misleading.

I spent time on this patch, and found no changes on real workloads.

I was excited first, then disappointed.

I hope we will find the real issue, as I really don't care of micro
benchmarks.



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17 18:19             ` H. Peter Anvin
  2013-10-17 18:48               ` Eric Dumazet
@ 2013-10-18  6:43               ` Ingo Molnar
  1 sibling, 0 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-18  6:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Neil Horman, Eric Dumazet, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, x86, netdev


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> > 
> > To correctly simulate the workload you'd have to:
> > 
> >  - allocate a buffer larger than your L2 cache.
> > 
> >  - to measure the effects of the prefetches you'd also have to randomize
> >    the individual buffer positions. See how 'perf bench numa' implements a
> >    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> >    Otherwise the CPU might learn your simplistic stream direction and the
> >    L2 cache might hw-prefetch your data, interfering with any explicit 
> >    prefetches the code does. In many real-life usecases packet buffers are
> >    scattered.
> > 
> > Also, it would be nice to see standard deviation noise numbers when two 
> > averages are close to each other, to be able to tell whether differences 
> > are statistically significant or not.
> 
> 
> Seriously, though, how much does it matter?  All the above seems likely 
> to do is to drown the signal by adding noise.

I think it matters a lot and I don't think it 'adds' noise - it measures 
something else (cache cold behavior - which is the common case for 
first-time csum_partial() use for network packets), which was not measured 
before, and that that is by its nature has different noise patterns.

I've done many cache-cold measurements myself and had no trouble achieving 
statistically significant results and high precision.

> If the parallel (threaded) checksumming is faster, which theory says it 
> should and microbenchmarking confirms, how important are the 
> macrobenchmarks?

Microbenchmarks can be totally blind to things like the ideal prefetch 
window size. (or whether a prefetch should be done at all: some CPUs will 
throw away prefetches if enough regular fetches arrive.)

Also, 'naive' single-threaded algorithms can occasionally be better in the 
cache-cold case because a linear, predictable stream of memory accesses 
might saturate the memory bus better than a somewhat random looking, 
interleaved web of accesses that might not harmonize with buffer depths.

I _think_ if correctly tuned then the parallel algorithm should be better 
in the cache cold case, I just don't know with what parameters (and the 
algorithm has at least one free parameter: the prefetch window size), and 
I don't know how significant the effect is.

Also, more fundamentally, I absolutely detest doing no measurements or 
measuring the wrong thing - IMHO there are too many 'blind' optimization 
commits in the kernel with little to no observational data attached.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 22:29 ` H. Peter Anvin
  2013-10-13 12:53   ` Neil Horman
@ 2013-10-18 16:42   ` Neil Horman
  2013-10-18 17:09     ` H. Peter Anvin
  1 sibling, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-18 16:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> > checksums.  We found that by splitting the checksum computation into two
> > separate streams, each skipping successive elements of the buffer being summed,
> > we could parallelize the checksum operation accros multiple alus.  Since neither
> > chain is dependent on the result of the other, we get a speedup in execution (on
> > hardware that has multiple alu's available, which is almost ubiquitous on x86),
> > and only a negligible decrease on hardware that has only a single alu (an extra
> > addition is introduced).  Since addition in commutative, the result is the same,
> > only faster
> 
> On hardware that implement ADCX/ADOX then you should also be able to
> have additional streams interleaved since those instructions allow for
> dual carry chains.
> 
> 	-hpa
> 
I've been looking into this a bit more, and I'm a bit confused.  According to
this:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html

by my read, this pair of instructions simply supports 2 carry bit chains,
allowing for two parallel execution paths through the cpu that won't block on
one another.  Its exactly the same as whats being done with the universally
available addcq instruction, so theres no real speedup (that I can see).  Since
we'd either have to use the alternatives macro to support adcx/adox here or the
old instruction set, it seems not overly worth the effort to support the
extension.  

Or am I missing something?

Neil

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  1:42           ` Eric Dumazet
@ 2013-10-18 16:50             ` Neil Horman
  2013-10-18 17:20               ` Eric Dumazet
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-18 16:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

> 
> Your benchmark uses a single 4K page, so data is _super_ hot in cpu
> caches.
> ( prefetch should give no speedups, I am surprised it makes any
> difference)
> 
> Try now with 32 huges pages, to get 64 MBytes of working set.
> 
> Because in reality we never csum_partial() data in cpu cache.
> (Unless the NIC preloaded the data into cpu cache before sending the
> interrupt)
> 
> Really, if Sebastien got a speed up, it means that something fishy was
> going on, like :
> 
> - A copy of data into some area of memory, prefilling cpu caches
> - csum_partial() done while data is hot in cache.
> 
> This is exactly a "should not happen" scenario, because the csum in this
> case should happen _while_ doing the copy, for 0 ns.
> 
> 
> 
> 


So, I took your suggestion, and modified my test module to allocate 32 huge
pages instead of a single 4k page.  I've attached the module changes and the
results below.  Contrary to your assertion above, results came out the same as
in my first run.  See below:

base results:
80381491
85279536
99537729
80398029
121385411
109478429
85369632
99242786
80250395
98170542

AVG=939 ns

prefetch only results:
86803812
101891541
85762713
95866956
102316712
93529111
90473728
79374183
93744053
90075501

AVG=919 ns

parallel only results:
68994797
63503221
64298412
63784256
75350022
66398821
77776050
79158271
91006098
67822318

AVG=718 ns

both prefetch and parallel results:
68852213
77536525
63963560
67255913
76169867
80418081
63485088
62386262
75533808
57731705

AVG=693 ns


So based on these, it seems that your assertion that prefetching is the key to
speedup here isn't quite correct.  Either that or the testing continues to be
invalid.  I'm going to try to do some of ingos microbenchmarking just to see if
that provides any further details.  But any other thoughts about what might be
going awry are appreciated.

My module code:



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

#define BUFSIZ_ORDER 4
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;
	struct page *page;
	u32 offset = 0;

	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
	if (!page) {
		printk(KERN_CRIT "NO MEMORY FOR ALLOCATION");
		return -ENOMEM;
	}
	buf = page_address(page); 

	
	printk(KERN_CRIT "INITALIZING BUFFER\n");

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);
	
	for(i=0;i<100000;i++) {
		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
	}
	getnstimeofday(&end);
	preempt_enable();
	if ((unsigned long)start.tv_nsec > (unsigned long)end.tv_nsec)
		time = (ULONG_MAX - (unsigned long)end.tv_nsec) + (unsigned long)start.tv_nsec;
	else 
		time = (unsigned long)end.tv_nsec - (unsigned long)start.tv_nsec;

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	__free_pages(page, BUFSIZ_ORDER);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 16:42   ` Neil Horman
@ 2013-10-18 17:09     ` H. Peter Anvin
  2013-10-25 13:06       ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: H. Peter Anvin @ 2013-10-18 17:09 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.

Neil Horman <nhorman@tuxdriver.com> wrote:
>On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
>> On 10/11/2013 09:51 AM, Neil Horman wrote:
>> > Sébastien Dugué reported to me that devices implementing ipoib
>(which don't have
>> > checksum offload hardware were spending a significant amount of
>time computing
>> > checksums.  We found that by splitting the checksum computation
>into two
>> > separate streams, each skipping successive elements of the buffer
>being summed,
>> > we could parallelize the checksum operation accros multiple alus. 
>Since neither
>> > chain is dependent on the result of the other, we get a speedup in
>execution (on
>> > hardware that has multiple alu's available, which is almost
>ubiquitous on x86),
>> > and only a negligible decrease on hardware that has only a single
>alu (an extra
>> > addition is introduced).  Since addition in commutative, the result
>is the same,
>> > only faster
>> 
>> On hardware that implement ADCX/ADOX then you should also be able to
>> have additional streams interleaved since those instructions allow
>for
>> dual carry chains.
>> 
>> 	-hpa
>> 
>I've been looking into this a bit more, and I'm a bit confused. 
>According to
>this:
>http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
>
>by my read, this pair of instructions simply supports 2 carry bit
>chains,
>allowing for two parallel execution paths through the cpu that won't
>block on
>one another.  Its exactly the same as whats being done with the
>universally
>available addcq instruction, so theres no real speedup (that I can
>see).  Since
>we'd either have to use the alternatives macro to support adcx/adox
>here or the
>old instruction set, it seems not overly worth the effort to support
>the
>extension.  
>
>Or am I missing something?
>
>Neil

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 16:50             ` Neil Horman
@ 2013-10-18 17:20               ` Eric Dumazet
  2013-10-18 20:11                 ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-18 17:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > 

> 	for(i=0;i<100000;i++) {
> 		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> 		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
> 	}

Please replace this by random accesses, and use the more standard 1500
length.

offset = prandom_u32() % (BUFSIZ - 1500);
offset &= ~1U;

sum = csum_partial(buf + offset, 1500, sum);

You are basically doing sequential accesses, so prefetch should
be automatically done by cpu itself.

Thanks !



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 17:20               ` Eric Dumazet
@ 2013-10-18 20:11                 ` Neil Horman
  2013-10-18 21:15                   ` Eric Dumazet
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-18 20:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > > 
> 
> > 	for(i=0;i<100000;i++) {
> > 		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> > 		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
> > 	}
> 
> Please replace this by random accesses, and use the more standard 1500
> length.
> 
> offset = prandom_u32() % (BUFSIZ - 1500);
> offset &= ~1U;
> 
> sum = csum_partial(buf + offset, 1500, sum);
> 
> You are basically doing sequential accesses, so prefetch should
> be automatically done by cpu itself.
> 
> Thanks !
> 
> 
> 

Sure, you got it!  Results below.  However, they continue to bear out that
parallel execution beats prefetch only execution, and both is better than either
one.

base results:
53156647
59670931
62839770
44842780
39297190
44905905
53300688
53287805
39436951
43021730

AVG=493 ns

prefetch-only results:
40337434
51986404
43509199
53128857
52973171
53520649
53536338
50325466
44864664
47908398

AVG=492 ns


parallel-only results:
52157183
44496511
36180011
38298368
36258099
43263531
45365519
54116344
62529241
63118224

AVG = 475 ns


both prefetch and parallel:
44317078
44526464
45761272
44477906
34868814
44637904
49478309
49718417
58681403
58304972

AVG = 474 ns


Heres the code I was using



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

#define BUFSIZ_ORDER 4
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;
	struct page *page;
	u32 offset = 0;

	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
	if (!page) {
		printk(KERN_CRIT "NO MEMORY FOR ALLOCATION");
		return -ENOMEM;
	}
	buf = page_address(page); 

	
	printk(KERN_CRIT "INITALIZING BUFFER\n");

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);
	
	for(i=0;i<100000;i++) {
		sum = csum_partial(buf+offset, 1500, sum);
		offset = prandom_u32() % (BUFSIZ - 1500);
		offset &= ~1U;
	}
	getnstimeofday(&end);
	preempt_enable();
	if ((unsigned long)start.tv_nsec > (unsigned long)end.tv_nsec)
		time = (ULONG_MAX - (unsigned long)end.tv_nsec) + (unsigned long)start.tv_nsec;
	else 
		time = (unsigned long)end.tv_nsec - (unsigned long)start.tv_nsec;

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	__free_pages(page, BUFSIZ_ORDER);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 20:11                 ` Neil Horman
@ 2013-10-18 21:15                   ` Eric Dumazet
  2013-10-20 21:29                     ` Neil Horman
  2013-10-21 19:21                     ` Neil Horman
  0 siblings, 2 replies; 132+ messages in thread
From: Eric Dumazet @ 2013-10-18 21:15 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:

> #define BUFSIZ_ORDER 4
> #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> static int __init csum_init_module(void)
> {
> 	int i;
> 	__wsum sum = 0;
> 	struct timespec start, end;
> 	u64 time;
> 	struct page *page;
> 	u32 offset = 0;
> 
> 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);

Not sure what you are doing here, but its not correct.

You have a lot of variations in your results, I suspect a NUMA affinity
problem.

You can try the following code, and use taskset to make sure you run
this on a cpu on node 0

#define BUFSIZ 2*1024*1024
#define NBPAGES 16

static int __init csum_init_module(void)
{
        int i;
        __wsum sum = 0;
        u64 start, end;
	void *base, *addrs[NBPAGES];
        u32 rnd, offset;

	memset(addrs, 0, sizeof(addrs));
	for (i = 0; i < NBPAGES; i++) {
		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
		if (!addrs[i])
			goto out;
	}

        local_bh_disable();
        pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
        start = ktime_to_ns(ktime_get());
        
        for (i = 0; i < 100000; i++) {
		rnd = prandom_u32();
		base = addrs[rnd % NBPAGES];
		rnd /= NBPAGES;
		offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial_opt(base + offset, 1500, sum);
        }
        end = ktime_to_ns(ktime_get());
        local_bh_enable();

        pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);

out:
	for (i = 0; i < NBPAGES; i++)
		kfree(addrs[i]);

        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 21:15                   ` Eric Dumazet
@ 2013-10-20 21:29                     ` Neil Horman
  2013-10-21 17:31                       ` Eric Dumazet
  2013-10-21 19:21                     ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-20 21:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> 
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > 	int i;
> > 	__wsum sum = 0;
> > 	struct timespec start, end;
> > 	u64 time;
> > 	struct page *page;
> > 	u32 offset = 0;
> > 
> > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> 
> Not sure what you are doing here, but its not correct.
> 
Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.

> You have a lot of variations in your results, I suspect a NUMA affinity
> problem.
> 
I do have some variation, you're correct, but I don't think its a numa issue

> You can try the following code, and use taskset to make sure you run
> this on a cpu on node 0
> 
I did run this with taskset to do exactly that (hence my comment above).  I'll
be glad to run your variant on monday morning though and provide results.


Best
Neil


> #define BUFSIZ 2*1024*1024
> #define NBPAGES 16
> 
> static int __init csum_init_module(void)
> {
>         int i;
>         __wsum sum = 0;
>         u64 start, end;
> 	void *base, *addrs[NBPAGES];
>         u32 rnd, offset;
> 
> 	memset(addrs, 0, sizeof(addrs));
> 	for (i = 0; i < NBPAGES; i++) {
> 		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
> 		if (!addrs[i])
> 			goto out;
> 	}
> 
>         local_bh_disable();
>         pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
>         start = ktime_to_ns(ktime_get());
>         
>         for (i = 0; i < 100000; i++) {
> 		rnd = prandom_u32();
> 		base = addrs[rnd % NBPAGES];
> 		rnd /= NBPAGES;
> 		offset = rnd % (BUFSIZ - 1500);
>                 offset &= ~1U;
>                 sum = csum_partial_opt(base + offset, 1500, sum);
>         }
>         end = ktime_to_ns(ktime_get());
>         local_bh_enable();
> 
>         pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);
> 
> out:
> 	for (i = 0; i < NBPAGES; i++)
> 		kfree(addrs[i]);
> 
>         return 0;
> }
> 
> static void __exit csum_cleanup_module(void)
> {
>         return;
> }
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-20 21:29                     ` Neil Horman
@ 2013-10-21 17:31                       ` Eric Dumazet
  2013-10-21 17:46                         ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-21 17:31 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > 
> > > #define BUFSIZ_ORDER 4
> > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > static int __init csum_init_module(void)
> > > {
> > > 	int i;
> > > 	__wsum sum = 0;
> > > 	struct timespec start, end;
> > > 	u64 time;
> > > 	struct page *page;
> > > 	u32 offset = 0;
> > > 
> > > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> > 
> > Not sure what you are doing here, but its not correct.
> > 
> Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.

Not really. We cannot allocate 64 Mbytes in a single alloc_pages() call
on x86. (MAX_ORDER = 11)

You noticed nothing because you did not 
write anything on the 64Mbytes area (and corrupt memory) or
use CONFIG_DEBUG_PAGEALLOC=y.

Your code read data out of bounds and was lucky, thats all...

You in fact allocated a page of (4096<<4) bytes




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 17:31                       ` Eric Dumazet
@ 2013-10-21 17:46                         ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-21 17:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote:
> On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > > 
> > > > #define BUFSIZ_ORDER 4
> > > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > > static int __init csum_init_module(void)
> > > > {
> > > > 	int i;
> > > > 	__wsum sum = 0;
> > > > 	struct timespec start, end;
> > > > 	u64 time;
> > > > 	struct page *page;
> > > > 	u32 offset = 0;
> > > > 
> > > > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> > > 
> > > Not sure what you are doing here, but its not correct.
> > > 
> > Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.
> 
> Not really. We cannot allocate 64 Mbytes in a single alloc_pages() call
> on x86. (MAX_ORDER = 11)
> 
> You noticed nothing because you did not 
> write anything on the 64Mbytes area (and corrupt memory) or
> use CONFIG_DEBUG_PAGEALLOC=y.
> 
> Your code read data out of bounds and was lucky, thats all...
> 
> You in fact allocated a page of (4096<<4) bytes
> 
Gahh!  I see what I did, the order in the alloc_pages call is the order of
hugepages, it still allocates that order as typically sized pages, and then
treats them as huge.  Stupid of me...

I'll have results on your version of the test case in just a bit here
Neil

> 
> 
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 21:15                   ` Eric Dumazet
  2013-10-20 21:29                     ` Neil Horman
@ 2013-10-21 19:21                     ` Neil Horman
  2013-10-21 19:44                       ` Eric Dumazet
  1 sibling, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-21 19:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> 
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > 	int i;
> > 	__wsum sum = 0;
> > 	struct timespec start, end;
> > 	u64 time;
> > 	struct page *page;
> > 	u32 offset = 0;
> > 
> > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> 
> Not sure what you are doing here, but its not correct.
> 
> You have a lot of variations in your results, I suspect a NUMA affinity
> problem.
> 
> You can try the following code, and use taskset to make sure you run
> this on a cpu on node 0
> 
> #define BUFSIZ 2*1024*1024
> #define NBPAGES 16
> 
> static int __init csum_init_module(void)
> {
>         int i;
>         __wsum sum = 0;
>         u64 start, end;
> 	void *base, *addrs[NBPAGES];
>         u32 rnd, offset;
> 
> 	memset(addrs, 0, sizeof(addrs));
> 	for (i = 0; i < NBPAGES; i++) {
> 		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
> 		if (!addrs[i])
> 			goto out;
> 	}
> 
>         local_bh_disable();
>         pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
>         start = ktime_to_ns(ktime_get());
>         
>         for (i = 0; i < 100000; i++) {
> 		rnd = prandom_u32();
> 		base = addrs[rnd % NBPAGES];
> 		rnd /= NBPAGES;
> 		offset = rnd % (BUFSIZ - 1500);
>                 offset &= ~1U;
>                 sum = csum_partial_opt(base + offset, 1500, sum);
>         }
>         end = ktime_to_ns(ktime_get());
>         local_bh_enable();
> 
>         pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);
> 
> out:
> 	for (i = 0; i < NBPAGES; i++)
> 		kfree(addrs[i]);
> 
>         return 0;
> }
> 
> static void __exit csum_cleanup_module(void)
> {
>         return;
> }
> 
> 
> 
> 


Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
such that no interrupts (save for local ones), would occur on that cpu.  Note
that I had to convert csum_partial_opt to csum_partial, as the _opt variant
doesn't exist in my tree, nor do I see it in any upstream tree or in the history
anywhere.

base results:
53569916
43506025
43476542
44048436
45048042
48550429
53925556
53927374
53489708
53003915

AVG = 492 ns

prefetching only:
53279213
45518140
49585388
53176179
44071822
43588822
44086546
47507065
53646812
54469118

AVG = 488 ns


parallel alu's only:
46226844
44458101
46803498
45060002
46187624
37542946
45632866
46275249
45031141
46281204

AVG = 449 ns


both optimizations:
45708837
45631124
45697135
45647011
45036679
39418544
44481577
46820868
44496471
35523928

AVG = 438 ns


We continue to see a small savings in execution time with prefetching (4 ns, or
about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
the best savings with both optimizations (54 ns, or 10.9%).  

These results, while they've changed as we've modified the test case slightly
have remained consistent in their sppedup ordinality.  Prefetching helps, but
not as much as using multiple alu's, and neither is as good as doing both
together.

Unless you see something else that I'm doing wrong here.  It seems like a win to
do both.

Regards
Neil




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 19:21                     ` Neil Horman
@ 2013-10-21 19:44                       ` Eric Dumazet
  2013-10-21 20:19                         ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-10-21 19:44 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:

> 
> Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> such that no interrupts (save for local ones), would occur on that cpu.  Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> anywhere.

This csum_partial_opt() was a private implementation of csum_partial()
so that I could load the module without rebooting the kernel ;)

> 
> base results:
> 53569916
> 43506025
> 43476542
> 44048436
> 45048042
> 48550429
> 53925556
> 53927374
> 53489708
> 53003915
> 
> AVG = 492 ns
> 
> prefetching only:
> 53279213
> 45518140
> 49585388
> 53176179
> 44071822
> 43588822
> 44086546
> 47507065
> 53646812
> 54469118
> 
> AVG = 488 ns
> 
> 
> parallel alu's only:
> 46226844
> 44458101
> 46803498
> 45060002
> 46187624
> 37542946
> 45632866
> 46275249
> 45031141
> 46281204
> 
> AVG = 449 ns
> 
> 
> both optimizations:
> 45708837
> 45631124
> 45697135
> 45647011
> 45036679
> 39418544
> 44481577
> 46820868
> 44496471
> 35523928
> 
> AVG = 438 ns
> 
> 
> We continue to see a small savings in execution time with prefetching (4 ns, or
> about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> the best savings with both optimizations (54 ns, or 10.9%).  
> 
> These results, while they've changed as we've modified the test case slightly
> have remained consistent in their sppedup ordinality.  Prefetching helps, but
> not as much as using multiple alu's, and neither is as good as doing both
> together.
> 
> Unless you see something else that I'm doing wrong here.  It seems like a win to
> do both.
> 

Well, I only said (or maybe I forgot), that on my machines, I got no
improvements at all with the multiple alu or the prefetch. (I tried
different strides)

Only noises in the results.

It seems it depends on cpus and/or multiple factors.

Last machine I used for the tests had :

processor	: 23
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping	: 2
microcode	: 0x13
cpu MHz		: 2800.256
cache size	: 12288 KB
physical id	: 1
siblings	: 12
core id		: 10
cpu cores	: 6




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 19:44                       ` Eric Dumazet
@ 2013-10-21 20:19                         ` Neil Horman
  2013-10-26 12:01                           ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-21 20:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> 
> > 
> > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > anywhere.
> 
> This csum_partial_opt() was a private implementation of csum_partial()
> so that I could load the module without rebooting the kernel ;)
> 
> > 
> > base results:
> > 53569916
> > 43506025
> > 43476542
> > 44048436
> > 45048042
> > 48550429
> > 53925556
> > 53927374
> > 53489708
> > 53003915
> > 
> > AVG = 492 ns
> > 
> > prefetching only:
> > 53279213
> > 45518140
> > 49585388
> > 53176179
> > 44071822
> > 43588822
> > 44086546
> > 47507065
> > 53646812
> > 54469118
> > 
> > AVG = 488 ns
> > 
> > 
> > parallel alu's only:
> > 46226844
> > 44458101
> > 46803498
> > 45060002
> > 46187624
> > 37542946
> > 45632866
> > 46275249
> > 45031141
> > 46281204
> > 
> > AVG = 449 ns
> > 
> > 
> > both optimizations:
> > 45708837
> > 45631124
> > 45697135
> > 45647011
> > 45036679
> > 39418544
> > 44481577
> > 46820868
> > 44496471
> > 35523928
> > 
> > AVG = 438 ns
> > 
> > 
> > We continue to see a small savings in execution time with prefetching (4 ns, or
> > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > the best savings with both optimizations (54 ns, or 10.9%).  
> > 
> > These results, while they've changed as we've modified the test case slightly
> > have remained consistent in their sppedup ordinality.  Prefetching helps, but
> > not as much as using multiple alu's, and neither is as good as doing both
> > together.
> > 
> > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > do both.
> > 
> 
> Well, I only said (or maybe I forgot), that on my machines, I got no
> improvements at all with the multiple alu or the prefetch. (I tried
> different strides)
> 
> Only noises in the results.
> 
I thought you previously said that running netperf gave you a stastically
significant performance boost when you added prefetching:
http://marc.info/?l=linux-kernel&m=138178914124863&w=2

But perhaps I missed a note somewhere.

> It seems it depends on cpus and/or multiple factors.
> 
> Last machine I used for the tests had :
> 
> processor	: 23
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 44
> model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> stepping	: 2
> microcode	: 0x13
> cpu MHz		: 2800.256
> cache size	: 12288 KB
> physical id	: 1
> siblings	: 12
> core id		: 10
> cpu cores	: 6
> 
> 
> 
> 

Thats about what I'm running with:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping        : 2
microcode       : 0x13
cpu MHz         : 1600.000
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4


I can't imagine what would cause the discrepancy in our results (a 10% savings
in execution time seems significant to me). My only thought would be that
possibly the alu's on your cpu are faster than mine, and reduce the speedup
obtained by preforming operation in parallel, though I can't imagine thats the
case with these processors being so closely matched.

Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 17:09     ` H. Peter Anvin
@ 2013-10-25 13:06       ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-25 13:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote:
> If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.
> 
Ok, fair enough.  Unfotunately, I'm not going to be able to get my hands on a
stepping of this CPU to test any code using these instructions for some time, so
I'll back burner their use and revisit them later.  I'm still working on the
parallel alu/prefetch angle though.

Neil

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 20:19                         ` Neil Horman
@ 2013-10-26 12:01                           ` Ingo Molnar
  2013-10-26 13:58                             ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-26 12:01 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > 
> > > 
> > > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > anywhere.
> > 
> > This csum_partial_opt() was a private implementation of csum_partial()
> > so that I could load the module without rebooting the kernel ;)
> > 
> > > 
> > > base results:
> > > 53569916
> > > 43506025
> > > 43476542
> > > 44048436
> > > 45048042
> > > 48550429
> > > 53925556
> > > 53927374
> > > 53489708
> > > 53003915
> > > 
> > > AVG = 492 ns
> > > 
> > > prefetching only:
> > > 53279213
> > > 45518140
> > > 49585388
> > > 53176179
> > > 44071822
> > > 43588822
> > > 44086546
> > > 47507065
> > > 53646812
> > > 54469118
> > > 
> > > AVG = 488 ns
> > > 
> > > 
> > > parallel alu's only:
> > > 46226844
> > > 44458101
> > > 46803498
> > > 45060002
> > > 46187624
> > > 37542946
> > > 45632866
> > > 46275249
> > > 45031141
> > > 46281204
> > > 
> > > AVG = 449 ns
> > > 
> > > 
> > > both optimizations:
> > > 45708837
> > > 45631124
> > > 45697135
> > > 45647011
> > > 45036679
> > > 39418544
> > > 44481577
> > > 46820868
> > > 44496471
> > > 35523928
> > > 
> > > AVG = 438 ns
> > > 
> > > 
> > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > > the best savings with both optimizations (54 ns, or 10.9%).  
> > > 
> > > These results, while they've changed as we've modified the test case slightly
> > > have remained consistent in their sppedup ordinality.  Prefetching helps, but
> > > not as much as using multiple alu's, and neither is as good as doing both
> > > together.
> > > 
> > > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > > do both.
> > > 
> > 
> > Well, I only said (or maybe I forgot), that on my machines, I got no
> > improvements at all with the multiple alu or the prefetch. (I tried
> > different strides)
> > 
> > Only noises in the results.
> > 
> I thought you previously said that running netperf gave you a stastically
> significant performance boost when you added prefetching:
> http://marc.info/?l=linux-kernel&m=138178914124863&w=2
> 
> But perhaps I missed a note somewhere.
> 
> > It seems it depends on cpus and/or multiple factors.
> > 
> > Last machine I used for the tests had :
> > 
> > processor	: 23
> > vendor_id	: GenuineIntel
> > cpu family	: 6
> > model		: 44
> > model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> > stepping	: 2
> > microcode	: 0x13
> > cpu MHz		: 2800.256
> > cache size	: 12288 KB
> > physical id	: 1
> > siblings	: 12
> > core id		: 10
> > cpu cores	: 6
> > 
> > 
> > 
> > 
> 
> Thats about what I'm running with:
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 44
> model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
> stepping        : 2
> microcode       : 0x13
> cpu MHz         : 1600.000
> cache size      : 12288 KB
> physical id     : 0
> siblings        : 8
> core id         : 0
> cpu cores       : 4
> 
> 
> I can't imagine what would cause the discrepancy in our results (a 
> 10% savings in execution time seems significant to me). My only 
> thought would be that possibly the alu's on your cpu are faster 
> than mine, and reduce the speedup obtained by preforming operation 
> in parallel, though I can't imagine thats the case with these 
> processors being so closely matched.

You keep ignoring my request to calculate and account for noise of 
the measurement.

For example you are talking about a 0.8% prefetch effect while the 
noise in the results is obviously much larger than that, with a 
min/max distance of around 5%:

> > > 43476542
> > > 53927374

so the noise of 10 measurements would be around 5-10%. (back of the 
envelope calculation)

So you might be right in the end, but the posted data does not 
support your claims, statistically.

It's your responsibility to come up with convincing measurements and 
results, not of those who review your work.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-26 12:01                           ` Ingo Molnar
@ 2013-10-26 13:58                             ` Neil Horman
  2013-10-27  7:26                               ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-26 13:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > > 
> > > > 
> > > > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > > > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > > anywhere.
> > > 
> > > This csum_partial_opt() was a private implementation of csum_partial()
> > > so that I could load the module without rebooting the kernel ;)
> > > 
> > > > 
> > > > base results:
> > > > 53569916
> > > > 43506025
> > > > 43476542
> > > > 44048436
> > > > 45048042
> > > > 48550429
> > > > 53925556
> > > > 53927374
> > > > 53489708
> > > > 53003915
> > > > 
> > > > AVG = 492 ns
> > > > 
> > > > prefetching only:
> > > > 53279213
> > > > 45518140
> > > > 49585388
> > > > 53176179
> > > > 44071822
> > > > 43588822
> > > > 44086546
> > > > 47507065
> > > > 53646812
> > > > 54469118
> > > > 
> > > > AVG = 488 ns
> > > > 
> > > > 
> > > > parallel alu's only:
> > > > 46226844
> > > > 44458101
> > > > 46803498
> > > > 45060002
> > > > 46187624
> > > > 37542946
> > > > 45632866
> > > > 46275249
> > > > 45031141
> > > > 46281204
> > > > 
> > > > AVG = 449 ns
> > > > 
> > > > 
> > > > both optimizations:
> > > > 45708837
> > > > 45631124
> > > > 45697135
> > > > 45647011
> > > > 45036679
> > > > 39418544
> > > > 44481577
> > > > 46820868
> > > > 44496471
> > > > 35523928
> > > > 
> > > > AVG = 438 ns
> > > > 
> > > > 
> > > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > > > the best savings with both optimizations (54 ns, or 10.9%).  
> > > > 
> > > > These results, while they've changed as we've modified the test case slightly
> > > > have remained consistent in their sppedup ordinality.  Prefetching helps, but
> > > > not as much as using multiple alu's, and neither is as good as doing both
> > > > together.
> > > > 
> > > > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > > > do both.
> > > > 
> > > 
> > > Well, I only said (or maybe I forgot), that on my machines, I got no
> > > improvements at all with the multiple alu or the prefetch. (I tried
> > > different strides)
> > > 
> > > Only noises in the results.
> > > 
> > I thought you previously said that running netperf gave you a stastically
> > significant performance boost when you added prefetching:
> > http://marc.info/?l=linux-kernel&m=138178914124863&w=2
> > 
> > But perhaps I missed a note somewhere.
> > 
> > > It seems it depends on cpus and/or multiple factors.
> > > 
> > > Last machine I used for the tests had :
> > > 
> > > processor	: 23
> > > vendor_id	: GenuineIntel
> > > cpu family	: 6
> > > model		: 44
> > > model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> > > stepping	: 2
> > > microcode	: 0x13
> > > cpu MHz		: 2800.256
> > > cache size	: 12288 KB
> > > physical id	: 1
> > > siblings	: 12
> > > core id		: 10
> > > cpu cores	: 6
> > > 
> > > 
> > > 
> > > 
> > 
> > Thats about what I'm running with:
> > processor       : 0
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 44
> > model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
> > stepping        : 2
> > microcode       : 0x13
> > cpu MHz         : 1600.000
> > cache size      : 12288 KB
> > physical id     : 0
> > siblings        : 8
> > core id         : 0
> > cpu cores       : 4
> > 
> > 
> > I can't imagine what would cause the discrepancy in our results (a 
> > 10% savings in execution time seems significant to me). My only 
> > thought would be that possibly the alu's on your cpu are faster 
> > than mine, and reduce the speedup obtained by preforming operation 
> > in parallel, though I can't imagine thats the case with these 
> > processors being so closely matched.
> 
> You keep ignoring my request to calculate and account for noise of 
> the measurement.
> 
Don't confuse "ignoring" with "haven't gotten there yet".  Sometimes we all have
to wait, Ingo.  I'm working on it now, but I hit a snag on the machine I'm
working with and am trying to figure it out now.

> For example you are talking about a 0.8% prefetch effect while the 
> noise in the results is obviously much larger than that, with a 
> min/max distance of around 5%:
> 
> > > > 43476542
> > > > 53927374
> 
> so the noise of 10 measurements would be around 5-10%. (back of the 
> envelope calculation)
> 
> So you might be right in the end, but the posted data does not 
> support your claims, statistically.
> 
> It's your responsibility to come up with convincing measurements and 
> results, not of those who review your work.
> 
Be patient, I'm getting there

Thanks
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-26 13:58                             ` Neil Horman
@ 2013-10-27  7:26                               ` Ingo Molnar
  2013-10-27 17:05                                 ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-27  7:26 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > You keep ignoring my request to calculate and account for noise of the 
> > measurement.
> 
> Don't confuse "ignoring" with "haven't gotten there yet".  [...]

So, instead of replying to my repeated feedback with a single line mail 
that you plan to address it, you repeated the same measurement mistakes 
again and again, posting invalid results, and forced me to spend time to 
repeat this same argument 2-3 times?

> [...] Sometimes we all have to wait, Ingo. [...]

I'm making bog standard technical requests to which you've not replied in 
substance, there's no need for the patronizing tone really.

Anyway, to simplify the workflow I'm NAK-ing it all until it's done 
convincingly.

  NAKed-by: Ingo Molnar <mingo@kernel.org>

I'll lift the NAK once my technical concerns and questions are resolved.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-27  7:26                               ` Ingo Molnar
@ 2013-10-27 17:05                                 ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-27 17:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > > You keep ignoring my request to calculate and account for noise of the 
> > > measurement.
> > 
> > Don't confuse "ignoring" with "haven't gotten there yet".  [...]
> 
> So, instead of replying to my repeated feedback with a single line mail 
> that you plan to address it, you repeated the same measurement mistakes 
> again and again, posting invalid results, and forced me to spend time to 
> repeat this same argument 2-3 times?
> 
No one forced you to do anything Ingo.  I was finishing a valid line of
discussion with Eric prior to addressing your questions, while handling several
other unrelated issues (and related issues with my test system), that cropped up
in parallel.  

> > [...] Sometimes we all have to wait, Ingo. [...]
> 
> I'm making bog standard technical requests to which you've not replied in 
> substance, there's no need for the patronizing tone really.
> 
No one said they weren't easy to do, Ingo, I said I was getting to your request.
And now I am.  I'll be running the tests tomorrow.

> Anyway, to simplify the workflow I'm NAK-ing it all until it's done 
> convincingly.
> 
>   NAKed-by: Ingo Molnar <mingo@kernel.org>
> 
> I'll lift the NAK once my technical concerns and questions are resolved.
> 
Ok, if that helps you wait.  I'll have your test results in the next day or so.

Thanks
Neil

> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  8:41           ` Ingo Molnar
  2013-10-17 18:19             ` H. Peter Anvin
@ 2013-10-28 16:01             ` Neil Horman
  2013-10-28 16:20               ` Ingo Molnar
  2013-10-28 16:24               ` Ingo Molnar
  1 sibling, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-28 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev



Ingo, et al.-
	Ok, sorry for the delay, here are the test results you've been asking
for.


First, some information about what I did.  I attached the module that I ran this
test with at the bottom of this email.  You'll note that I started using a
module parameter write patch to trigger the csum rather than the module load
path.  The latter seemed to be giving me lots of variance in my run times, which
I wanted to eliminate.  I attributed it to the module load mechanism itself, and
by using the parameter write path, I was able to get more consistent results.

First, the run time tests:

I ran this command:
for i in `seq 0 1 3`
do
	 echo $i > /sys/module/csum_test/parameters/module_test_mode
	 perf stat --repeat 20 --null echo 1 > echo 1 > /sys/module/csum_test/parameters/test_fire
done

The for loop allows me to chagne the module_test_mode, which is tied to a switch
statement in do_csum that selects which checksumming method we use
(base/prefetch/parallel alu/both).  The results are:


Base:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.093269042 seconds time elapsed                                          ( +-  2.24% )

Prefetch (5x64):
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.079440009 seconds time elapsed                                          ( +-  2.29% )

Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.087666677 seconds time elapsed                                          ( +-  4.01% )

Prefetch + Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.080758702 seconds time elapsed                                          ( +-  2.34% )

So we can see here that we get about a 1% speedup between the base and the both
(Prefetch + Parallel ALU) case, with prefetch accounting for most of that
speedup.

Looking at the specific cpu counters we get this:


Base:
     Total time: 0.179 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
            14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
             2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
            75,402 page-faults               #    0.048 M/sec                    ( +-  0.07% )
     1,597,349,326 cycles                    #    1.017 GHz                      ( +-  1.74% ) [40.51%]
       104,882,858 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.25% ) [40.33%]
     1,043,429,984 stalled-cycles-backend    #   65.32% backend  cycles idle     ( +-  1.25% ) [39.73%]
       868,372,132 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.43% ) [39.88%]
       161,143,820 branches                  #  102.554 M/sec                    ( +-  1.49% ) [39.76%]
         4,348,075 branch-misses             #    2.70% of all branches          ( +-  1.43% ) [39.99%]
       457,042,576 L1-dcache-loads           #  290.868 M/sec                    ( +-  1.25% ) [40.63%]
         8,928,240 L1-dcache-load-misses     #    1.95% of all L1-dcache hits    ( +-  1.26% ) [41.17%]
        15,821,051 LLC-loads                 #   10.069 M/sec                    ( +-  1.56% ) [41.20%]
         4,902,576 LLC-load-misses           #   30.99% of all LL-cache hits     ( +-  1.51% ) [41.36%]
       235,775,688 L1-icache-loads           #  150.051 M/sec                    ( +-  1.39% ) [41.10%]
         3,116,106 L1-icache-load-misses     #    1.32% of all L1-icache hits    ( +-  3.43% ) [40.96%]
       461,315,416 dTLB-loads                #  293.588 M/sec                    ( +-  1.43% ) [41.18%]
           140,280 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.30% ) [40.96%]
       236,127,031 iTLB-loads                #  150.275 M/sec                    ( +-  1.63% ) [41.43%]
            46,173 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.40% ) [41.11%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.82%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.37%]

       0.301414024 seconds time elapsed                                          ( +-  0.47% )

Prefetch (5x64):
     Total time: 0.172 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1565.797128 task-clock                #    5.238 CPUs utilized            ( +-  0.46% )
            13,845 context-switches          #    0.009 M/sec                    ( +-  4.20% )
             2,624 cpu-migrations            #    0.002 M/sec                    ( +-  2.72% )
            75,452 page-faults               #    0.048 M/sec                    ( +-  0.08% )
     1,642,106,355 cycles                    #    1.049 GHz                      ( +-  1.33% ) [40.17%]
       107,786,666 stalled-cycles-frontend   #    6.56% frontend cycles idle     ( +-  1.37% ) [39.90%]
     1,065,286,880 stalled-cycles-backend    #   64.87% backend  cycles idle     ( +-  1.59% ) [39.14%]
       888,815,001 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.29% ) [38.92%]
       163,106,907 branches                  #  104.169 M/sec                    ( +-  1.32% ) [38.93%]
         4,333,456 branch-misses             #    2.66% of all branches          ( +-  1.94% ) [39.77%]
       459,779,806 L1-dcache-loads           #  293.639 M/sec                    ( +-  1.60% ) [40.23%]
         8,827,680 L1-dcache-load-misses     #    1.92% of all L1-dcache hits    ( +-  1.77% ) [41.38%]
        15,556,816 LLC-loads                 #    9.935 M/sec                    ( +-  1.76% ) [41.16%]
         4,885,618 LLC-load-misses           #   31.40% of all LL-cache hits     ( +-  1.40% ) [40.84%]
       236,131,778 L1-icache-loads           #  150.806 M/sec                    ( +-  1.32% ) [40.59%]
         3,037,537 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.23% ) [41.13%]
       454,835,028 dTLB-loads                #  290.481 M/sec                    ( +-  1.23% ) [41.34%]
           139,907 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.18% ) [41.21%]
       236,357,655 iTLB-loads                #  150.950 M/sec                    ( +-  1.31% ) [41.29%]
            46,633 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.74% ) [40.67%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.16%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.09%]

       0.298948767 seconds time elapsed                                          ( +-  0.36% )

Here it appears everything between the two runs is about the same.  We reduced
the number of dcache misses by a small amount (0.03 percentage points), which is
nice, but I'm not sure would account for the speedup we see in the run time.

Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1553.544876 task-clock                #    5.217 CPUs utilized            ( +-  0.42% )
            14,066 context-switches          #    0.009 M/sec                    ( +-  6.24% )
             2,831 cpu-migrations            #    0.002 M/sec                    ( +-  3.33% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.08% )
     1,659,509,743 cycles                    #    1.068 GHz                      ( +-  1.27% ) [40.10%]
       106,466,680 stalled-cycles-frontend   #    6.42% frontend cycles idle     ( +-  1.50% ) [39.98%]
     1,035,481,957 stalled-cycles-backend    #   62.40% backend  cycles idle     ( +-  1.23% ) [39.38%]
       875,104,201 instructions              #    0.53  insns per cycle        
                                             #    1.18  stalled cycles per insn  ( +-  1.30% ) [38.66%]
       160,553,275 branches                  #  103.346 M/sec                    ( +-  1.32% ) [38.85%]
         4,329,119 branch-misses             #    2.70% of all branches          ( +-  1.39% ) [39.59%]
       448,195,116 L1-dcache-loads           #  288.498 M/sec                    ( +-  1.91% ) [41.07%]
         8,632,347 L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  1.90% ) [41.56%]
        15,143,145 LLC-loads                 #    9.747 M/sec                    ( +-  1.89% ) [41.05%]
         4,698,204 LLC-load-misses           #   31.03% of all LL-cache hits     ( +-  1.03% ) [41.23%]
       224,316,468 L1-icache-loads           #  144.390 M/sec                    ( +-  1.27% ) [41.39%]
         2,902,842 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.65% ) [42.60%]
       433,914,588 dTLB-loads                #  279.306 M/sec                    ( +-  1.75% ) [43.07%]
           132,090 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.15% ) [43.12%]
       230,701,361 iTLB-loads                #  148.500 M/sec                    ( +-  1.77% ) [43.47%]
            45,562 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.76% ) [42.88%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [42.29%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [41.32%]

       0.297758185 seconds time elapsed                                          ( +-  0.40% )

Here It seems the major advantage was backend stall cycles saved (which makes
sense to me).  Since we split the instruction path into two units that could run
independently of each other we spent less time waiting for prior instructions to
retire.  As a result we dropped two percentage points in our stall number.

Prefetch + Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1549.171283 task-clock                #    5.231 CPUs utilized            ( +-  0.50% )
            13,717 context-switches          #    0.009 M/sec                    ( +-  4.32% )
             2,721 cpu-migrations            #    0.002 M/sec                    ( +-  2.47% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.07% )
     1,579,140,244 cycles                    #    1.019 GHz                      ( +-  1.71% ) [40.06%]
       103,803,034 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.74% ) [39.60%]
     1,016,582,613 stalled-cycles-backend    #   64.38% backend  cycles idle     ( +-  1.79% ) [39.57%]
       881,036,653 instructions              #    0.56  insns per cycle        
                                             #    1.15  stalled cycles per insn  ( +-  1.61% ) [39.29%]
       164,333,010 branches                  #  106.078 M/sec                    ( +-  1.51% ) [39.38%]
         4,385,459 branch-misses             #    2.67% of all branches          ( +-  1.62% ) [40.29%]
       463,987,526 L1-dcache-loads           #  299.507 M/sec                    ( +-  1.52% ) [40.20%]
         8,739,535 L1-dcache-load-misses     #    1.88% of all L1-dcache hits    ( +-  1.95% ) [40.37%]
        15,318,497 LLC-loads                 #    9.888 M/sec                    ( +-  1.80% ) [40.43%]
         4,846,148 LLC-load-misses           #   31.64% of all LL-cache hits     ( +-  1.68% ) [40.59%]
       231,982,874 L1-icache-loads           #  149.746 M/sec                    ( +-  1.43% ) [41.25%]
         3,141,106 L1-icache-load-misses     #    1.35% of all L1-icache hits    ( +-  2.32% ) [41.76%]
       459,688,615 dTLB-loads                #  296.732 M/sec                    ( +-  1.75% ) [41.87%]
           138,667 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  1.97% ) [42.31%]
       235,629,204 iTLB-loads                #  152.100 M/sec                    ( +-  1.40% ) [42.04%]
            46,038 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.75% ) [41.20%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.77%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.27%]

       0.296173305 seconds time elapsed                                          ( +-  0.44% )
Here, with both optimizations, we've reduced both our backend stall cycles, and
our dcache miss rate (though our load misses here is higher than it was when we
are just doing parallel ALU execution.  I wonder if the separation of the adcx
path is leading to multiple load requests before the prefetch completes.  I'll
try messing with the stride a bit more to see if I can get some more insight
there.

So there you have it.  I think, looking at this, I can say that its not as big a
win as my initial measurements were indicating, but still a win.

Thoughts?

Regards
Neil

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

#define BUFSIZ 2*1024*1024
#define NBPAGES 16

extern int csum_mode;
int module_test_mode = 0;
int test_fire = 0;

static int __init csum_init_module(void)
{
        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}

static int set_param_str(const char *val, const struct kernel_param *kp)
{
        int i;
        __wsum sum = 0;
        /*u64 start, end;*/
        void *base, *addrs[NBPAGES];
        u32 rnd, offset;

	
        memset(addrs, 0, sizeof(addrs));
        for (i = 0; i < NBPAGES; i++) {
                addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
                if (!addrs[i])
                        goto out;
        }

	csum_mode = module_test_mode;

	local_bh_disable();
        /*pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());*/
        /*start = ktime_to_ns(ktime_get());*/

        for (i = 0; i < 100000; i++) {
                rnd = prandom_u32();
                base = addrs[rnd % NBPAGES];
                rnd /= NBPAGES;
                offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial(base + offset, 1500, sum);
        }
        /*end = ktime_to_ns(ktime_get());*/
        local_bh_enable();

	/*pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);*/

	csum_mode = 0;
out:
        for (i = 0; i < NBPAGES; i++)
                kfree(addrs[i]);

        return 0;
}

static int get_param_str(char *buffer, const struct kernel_param *kp)
{
	return sprintf(buffer, "%d\n", test_fire);
}

static struct kernel_param_ops param_ops_str = {
	.set = set_param_str,
	.get = get_param_str,
};

module_param_named(module_test_mode, module_test_mode, int, 0644);
MODULE_PARM_DESC(module_test_mode, "csum test mode");
module_param_cb(test_fire, &param_ops_str, &test_fire, 0644);
module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:01             ` Neil Horman
@ 2013-10-28 16:20               ` Ingo Molnar
  2013-10-28 17:49                 ` Neil Horman
  2013-10-28 16:24               ` Ingo Molnar
  1 sibling, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-28 16:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Base:
>        0.093269042 seconds time elapsed                                          ( +-  2.24% )
> Prefetch (5x64):
>        0.079440009 seconds time elapsed                                          ( +-  2.29% )
> Parallel ALU:
>        0.087666677 seconds time elapsed                                          ( +-  4.01% )
> Prefetch + Parallel ALU:
>        0.080758702 seconds time elapsed                                          ( +-  2.34% )
> 
> So we can see here that we get about a 1% speedup between the base 
> and the both (Prefetch + Parallel ALU) case, with prefetch 
> accounting for most of that speedup.

Hm, there's still something strange about these results. So the 
range of the results is 790-930 nsecs. The noise of the measurements 
is 2%-4%, i.e. 20-40 nsecs.

The prefetch-only result itself is the fastest of all - 
statistically equivalent to the prefetch+parallel-ALU result, within 
the noise range.

So if prefetch is enabled, turning on parallel-ALU has no measurable 
effect - which is counter-intuitive. Do you have an 
theory/explanation for that?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:01             ` Neil Horman
  2013-10-28 16:20               ` Ingo Molnar
@ 2013-10-28 16:24               ` Ingo Molnar
  2013-10-28 16:49                 ` David Ahern
  2013-10-28 17:46                 ` Neil Horman
  1 sibling, 2 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-28 16:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Looking at the specific cpu counters we get this:
> 
> Base:
>      Total time: 0.179 [sec]
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
>        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
>             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
>              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )

Hm, for these second round of measurements were you using 'perf stat 
-a -C ...'?

The most accurate method of measurement for such single-threaded 
workloads is something like:

	taskset 0x1 perf stat -a -C 1 --repeat 20 ...

this will bind your workload to CPU#0, and will do PMU measurements 
only there - without mixing in other CPUs or workloads.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:24               ` Ingo Molnar
@ 2013-10-28 16:49                 ` David Ahern
  2013-10-28 17:46                 ` Neil Horman
  1 sibling, 0 replies; 132+ messages in thread
From: David Ahern @ 2013-10-28 16:49 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On 10/28/13 10:24 AM, Ingo Molnar wrote:

> The most accurate method of measurement for such single-threaded
> workloads is something like:
>
> 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
>
> this will bind your workload to CPU#0, and will do PMU measurements
> only there - without mixing in other CPUs or workloads.

you can drop the -a if you only want a specific CPU (-C arg). And -C in 
perf is cpu number starting with 0, so in your example above -C 1 means 
cpu1, not cpu0.

David



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:24               ` Ingo Molnar
  2013-10-28 16:49                 ` David Ahern
@ 2013-10-28 17:46                 ` Neil Horman
  2013-10-28 18:29                   ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-28 17:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Looking at the specific cpu counters we get this:
> > 
> > Base:
> >      Total time: 0.179 [sec]
> > 
> >  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > 
> >        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
> >             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
> >              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
> 
> Hm, for these second round of measurements were you using 'perf stat 
> -a -C ...'?
> 
> The most accurate method of measurement for such single-threaded 
> workloads is something like:
> 
> 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> 
> this will bind your workload to CPU#0, and will do PMU measurements 
> only there - without mixing in other CPUs or workloads.
> 
> Thanks,
> 
> 	Ingo
I wasn't, but I will...
Neil

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:20               ` Ingo Molnar
@ 2013-10-28 17:49                 ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-28 17:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Base:
> >        0.093269042 seconds time elapsed                                          ( +-  2.24% )
> > Prefetch (5x64):
> >        0.079440009 seconds time elapsed                                          ( +-  2.29% )
> > Parallel ALU:
> >        0.087666677 seconds time elapsed                                          ( +-  4.01% )
> > Prefetch + Parallel ALU:
> >        0.080758702 seconds time elapsed                                          ( +-  2.34% )
> > 
> > So we can see here that we get about a 1% speedup between the base 
> > and the both (Prefetch + Parallel ALU) case, with prefetch 
> > accounting for most of that speedup.
> 
> Hm, there's still something strange about these results. So the 
> range of the results is 790-930 nsecs. The noise of the measurements 
> is 2%-4%, i.e. 20-40 nsecs.
> 
> The prefetch-only result itself is the fastest of all - 
> statistically equivalent to the prefetch+parallel-ALU result, within 
> the noise range.
> 
> So if prefetch is enabled, turning on parallel-ALU has no measurable 
> effect - which is counter-intuitive. Do you have an 
> theory/explanation for that?
> 
> Thanks,
 I mentioned it farther down, loosely theorizing that running with parallel
alu's in conjunction with a prefetch, puts more pressure on the load/store unit
causing stalls while both alu's wait for the L1 cache to fill.  Not sure if that
makes sense, but I did note that in the both (prefetch+alu case) our data cache
hit rate was somewhat degraded, so I was going to play with the prefetch stride
to see if that fixed the situation.  Regardless I agree, the lack of improvement
in the both case is definately counter-intuitive.

Neil

> 
> 	Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 17:46                 ` Neil Horman
@ 2013-10-28 18:29                   ` Neil Horman
  2013-10-29  8:25                     ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-28 18:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote:
> On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Looking at the specific cpu counters we get this:
> > > 
> > > Base:
> > >      Total time: 0.179 [sec]
> > > 
> > >  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > > 
> > >        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
> > >             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
> > >              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
> > 
> > Hm, for these second round of measurements were you using 'perf stat 
> > -a -C ...'?
> > 
> > The most accurate method of measurement for such single-threaded 
> > workloads is something like:
> > 
> > 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> > 
> > this will bind your workload to CPU#0, and will do PMU measurements 
> > only there - without mixing in other CPUs or workloads.
> > 
> > Thanks,
> > 
> > 	Ingo
> I wasn't, but I will...
> Neil
> 
> > --

Heres my data for running the same test with taskset restricting execution to
only cpu0.  I'm not quite sure whats going on here, but doing so resulted in a
10x slowdown of the runtime of each iteration which I can't explain.  As before
however, both the parallel alu run and the prefetch run resulted in speedups,
but the two together were not in any way addative.  I'm going to keep playing
with the prefetch stride, unless you have an alternate theory.

Regards
Neil


Base:
     Total time: 1.013 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1140.286043 task-clock                #    1.001 CPUs utilized            ( +-  0.65% ) [100.00%]
            48,779 context-switches          #    0.043 M/sec                    ( +- 10.08% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,398 page-faults               #    0.066 M/sec                    ( +-  0.05% )
     2,950,225,491 cycles                    #    2.587 GHz                      ( +-  0.65% ) [16.63%]
       263,349,439 stalled-cycles-frontend   #    8.93% frontend cycles idle     ( +-  1.87% ) [16.70%]
     1,615,723,017 stalled-cycles-backend    #   54.77% backend  cycles idle     ( +-  0.64% ) [16.76%]
     2,168,440,946 instructions              #    0.74  insns per cycle        
                                             #    0.75  stalled cycles per insn  ( +-  0.52% ) [16.76%]
       406,885,149 branches                  #  356.827 M/sec                    ( +-  0.61% ) [16.74%]
        10,099,789 branch-misses             #    2.48% of all branches          ( +-  0.73% ) [16.73%]
     1,138,829,982 L1-dcache-loads           #  998.723 M/sec                    ( +-  0.57% ) [16.71%]
        21,341,094 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  1.22% ) [16.69%]
        38,453,870 LLC-loads                 #   33.723 M/sec                    ( +-  1.46% ) [16.67%]
         9,587,987 LLC-load-misses           #   24.93% of all LL-cache hits     ( +-  0.48% ) [16.66%]
       566,241,820 L1-icache-loads           #  496.579 M/sec                    ( +-  0.70% ) [16.65%]
         9,061,979 L1-icache-load-misses     #    1.60% of all L1-icache hits    ( +-  3.39% ) [16.65%]
     1,130,620,555 dTLB-loads                #  991.524 M/sec                    ( +-  0.64% ) [16.64%]
           423,302 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.89% ) [16.63%]
       563,371,089 iTLB-loads                #  494.061 M/sec                    ( +-  0.62% ) [16.62%]
           215,406 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.97% ) [16.60%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.59%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.58%]

       1.139598762 seconds time elapsed                                          ( +-  0.65% )

Prefetch:
     Total time: 0.981 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1128.603117 task-clock                #    1.001 CPUs utilized            ( +-  0.66% ) [100.00%]
            45,992 context-switches          #    0.041 M/sec                    ( +-  9.47% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,428 page-faults               #    0.067 M/sec                    ( +-  0.06% )
     2,920,666,228 cycles                    #    2.588 GHz                      ( +-  0.66% ) [16.59%]
       255,998,006 stalled-cycles-frontend   #    8.77% frontend cycles idle     ( +-  1.78% ) [16.67%]
     1,601,090,475 stalled-cycles-backend    #   54.82% backend  cycles idle     ( +-  0.69% ) [16.75%]
     2,164,301,312 instructions              #    0.74  insns per cycle        
                                             #    0.74  stalled cycles per insn  ( +-  0.59% ) [16.78%]
       404,920,928 branches                  #  358.781 M/sec                    ( +-  0.54% ) [16.77%]
        10,025,146 branch-misses             #    2.48% of all branches          ( +-  0.66% ) [16.75%]
     1,133,764,674 L1-dcache-loads           # 1004.573 M/sec                    ( +-  0.47% ) [16.74%]
        21,251,432 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  1.01% ) [16.72%]
        38,006,432 LLC-loads                 #   33.676 M/sec                    ( +-  1.56% ) [16.70%]
         9,625,034 LLC-load-misses           #   25.32% of all LL-cache hits     ( +-  0.40% ) [16.68%]
       565,712,289 L1-icache-loads           #  501.250 M/sec                    ( +-  0.57% ) [16.66%]
         8,726,826 L1-icache-load-misses     #    1.54% of all L1-icache hits    ( +-  3.40% ) [16.64%]
     1,130,140,463 dTLB-loads                # 1001.362 M/sec                    ( +-  0.53% ) [16.63%]
           419,645 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.44% ) [16.62%]
       560,199,307 iTLB-loads                #  496.365 M/sec                    ( +-  0.51% ) [16.61%]
           213,413 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.65% ) [16.59%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.56%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.54%]

       1.127934534 seconds time elapsed                                          ( +-  0.66% )


Parallel ALU:
     Total time: 0.986 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1131.914738 task-clock                #    1.001 CPUs utilized            ( +-  0.49% ) [100.00%]
            40,807 context-switches          #    0.036 M/sec                    ( +- 10.72% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% ) [100.00%]
            75,329 page-faults               #    0.067 M/sec                    ( +-  0.04% )
     2,929,149,996 cycles                    #    2.588 GHz                      ( +-  0.49% ) [16.58%]
       250,428,558 stalled-cycles-frontend   #    8.55% frontend cycles idle     ( +-  1.75% ) [16.66%]
     1,621,074,968 stalled-cycles-backend    #   55.34% backend  cycles idle     ( +-  0.46% ) [16.73%]
     2,147,405,781 instructions              #    0.73  insns per cycle        
                                             #    0.75  stalled cycles per insn  ( +-  0.56% ) [16.77%]
       401,196,771 branches                  #  354.441 M/sec                    ( +-  0.58% ) [16.76%]
         9,941,701 branch-misses             #    2.48% of all branches          ( +-  0.67% ) [16.74%]
     1,126,651,774 L1-dcache-loads           #  995.350 M/sec                    ( +-  0.50% ) [16.73%]
        21,075,294 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  0.96% ) [16.72%]
        37,885,850 LLC-loads                 #   33.471 M/sec                    ( +-  1.10% ) [16.71%]
         9,729,116 LLC-load-misses           #   25.68% of all LL-cache hits     ( +-  0.62% ) [16.69%]
       562,058,495 L1-icache-loads           #  496.556 M/sec                    ( +-  0.54% ) [16.67%]
         8,617,450 L1-icache-load-misses     #    1.53% of all L1-icache hits    ( +-  3.06% ) [16.65%]
     1,121,765,737 dTLB-loads                #  991.034 M/sec                    ( +-  0.57% ) [16.63%]
           388,875 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  4.27% ) [16.62%]
       556,029,393 iTLB-loads                #  491.229 M/sec                    ( +-  0.64% ) [16.61%]
           189,181 iTLB-load-misses          #    0.03% of all iTLB cache hits   ( +-  6.98% ) [16.60%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.58%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.56%]

       1.131247174 seconds time elapsed                                          ( +-  0.49% )


Both:
     Total time: 0.993 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1130.912197 task-clock                #    1.001 CPUs utilized            ( +-  0.60% ) [100.00%]
            45,859 context-switches          #    0.041 M/sec                    ( +-  9.00% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,398 page-faults               #    0.067 M/sec                    ( +-  0.07% )
     2,926,527,048 cycles                    #    2.588 GHz                      ( +-  0.60% ) [16.60%]
       255,482,254 stalled-cycles-frontend   #    8.73% frontend cycles idle     ( +-  1.62% ) [16.67%]
     1,608,247,364 stalled-cycles-backend    #   54.95% backend  cycles idle     ( +-  0.73% ) [16.74%]
     2,162,135,903 instructions              #    0.74  insns per cycle        
                                             #    0.74  stalled cycles per insn  ( +-  0.46% ) [16.77%]
       403,436,790 branches                  #  356.736 M/sec                    ( +-  0.44% ) [16.76%]
        10,062,572 branch-misses             #    2.49% of all branches          ( +-  0.85% ) [16.75%]
     1,133,889,264 L1-dcache-loads           # 1002.632 M/sec                    ( +-  0.56% ) [16.74%]
        21,460,116 L1-dcache-load-misses     #    1.89% of all L1-dcache hits    ( +-  1.31% ) [16.73%]
        38,070,119 LLC-loads                 #   33.663 M/sec                    ( +-  1.63% ) [16.72%]
         9,593,162 LLC-load-misses           #   25.20% of all LL-cache hits     ( +-  0.42% ) [16.71%]
       562,867,188 L1-icache-loads           #  497.711 M/sec                    ( +-  0.59% ) [16.68%]
         8,472,343 L1-icache-load-misses     #    1.51% of all L1-icache hits    ( +-  3.02% ) [16.64%]
     1,126,997,403 dTLB-loads                #  996.538 M/sec                    ( +-  0.53% ) [16.61%]
           414,900 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.12% ) [16.60%]
       561,156,032 iTLB-loads                #  496.198 M/sec                    ( +-  0.56% ) [16.59%]
           212,482 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.10% ) [16.58%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.57%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.56%]

       1.130242195 seconds time elapsed                                          ( +-  0.60% )

> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 18:29                   ` Neil Horman
@ 2013-10-29  8:25                     ` Ingo Molnar
  2013-10-29 11:20                       ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-29  8:25 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Heres my data for running the same test with taskset restricting 
> execution to only cpu0.  I'm not quite sure whats going on here, 
> but doing so resulted in a 10x slowdown of the runtime of each 
> iteration which I can't explain.  As before however, both the 
> parallel alu run and the prefetch run resulted in speedups, but 
> the two together were not in any way addative.  I'm going to keep 
> playing with the prefetch stride, unless you have an alternate 
> theory.

Could you please cite the exact command-line you used for running 
the test?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29  8:25                     ` Ingo Molnar
@ 2013-10-29 11:20                       ` Neil Horman
  2013-10-29 11:30                         ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-29 11:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Heres my data for running the same test with taskset restricting 
> > execution to only cpu0.  I'm not quite sure whats going on here, 
> > but doing so resulted in a 10x slowdown of the runtime of each 
> > iteration which I can't explain.  As before however, both the 
> > parallel alu run and the prefetch run resulted in speedups, but 
> > the two together were not in any way addative.  I'm going to keep 
> > playing with the prefetch stride, unless you have an alternate 
> > theory.
> 
> Could you please cite the exact command-line you used for running 
> the test?
> 
> Thanks,
> 
> 	Ingo
> 

Sure it was this:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
done >> counters.txt 2>&1

where test.sh is:
#!/bin/sh
echo 1 > /sys/module/csum_test/parameters/test_fire


As before, module_test_mode selects a case in a switch statement I added in
do_csum to test one of the 4 csum variants we've been discusing (base, prefetch,
parallel ALU or both), and test_fire is a callback trigger I use in the test
module to run 100000 iterations of a checksum operation.  As you requested, I
ran the above on cpu 0 (-C 0 on perf and -c 0 on taskset), and I removed all irq
affinity to cpu 0.

Regards
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:20                       ` Neil Horman
@ 2013-10-29 11:30                         ` Ingo Molnar
  2013-10-29 11:49                           ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-29 11:30 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Sure it was this:
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> done >> counters.txt 2>&1
> 
> where test.sh is:
> #!/bin/sh
> echo 1 > /sys/module/csum_test/parameters/test_fire

What does '-- /root/test.sh' do?

Unless I'm missing something, the line above will run:

  perf bench sched messaging -- /root/test.sh

which should be equivalent to:

  perf bench sched messaging

i.e. /root/test.sh won't be run.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:30                         ` Ingo Molnar
@ 2013-10-29 11:49                           ` Neil Horman
  2013-10-29 12:52                             ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-29 11:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sure it was this:
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > done >> counters.txt 2>&1
> > 
> > where test.sh is:
> > #!/bin/sh
> > echo 1 > /sys/module/csum_test/parameters/test_fire
> 
> What does '-- /root/test.sh' do?
> 
> Unless I'm missing something, the line above will run:
> 
>   perf bench sched messaging -- /root/test.sh
> 
> which should be equivalent to:
> 
>   perf bench sched messaging
> 
> i.e. /root/test.sh won't be run.
> 
According to the perf man page, I'm supposed to be able to use -- to separate
perf command line parameters from the command I want to run.  And it definately
executed test.sh, I added an echo to stdout in there as a test run and observed
them get captured in counters.txt

Neil

> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:49                           ` Neil Horman
@ 2013-10-29 12:52                             ` Ingo Molnar
  2013-10-29 13:07                               ` Neil Horman
  2013-10-29 14:12                               ` David Ahern
  0 siblings, 2 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-29 12:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Sure it was this:
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > done >> counters.txt 2>&1
> > > 
> > > where test.sh is:
> > > #!/bin/sh
> > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > 
> > What does '-- /root/test.sh' do?
> > 
> > Unless I'm missing something, the line above will run:
> > 
> >   perf bench sched messaging -- /root/test.sh
> > 
> > which should be equivalent to:
> > 
> >   perf bench sched messaging
> > 
> > i.e. /root/test.sh won't be run.
> 
> According to the perf man page, I'm supposed to be able to use -- 
> to separate perf command line parameters from the command I want 
> to run.  And it definately executed test.sh, I added an echo to 
> stdout in there as a test run and observed them get captured in 
> counters.txt

Well, '--' can be used to delineate the command portion for cases 
where it's ambiguous.

Here's it's unambiguous though. This:

  perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh

stops parsing a valid option after the -ddd option, so in theory it 
should execute 'perf bench sched messaging -- /root/test.sh' where 
'-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
ignored.

The message output you provided seems to suggest that to be the 
case:

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

See how the command executed by perf stat was 'perf bench ...'.

Did you want to run:

  perf stat --repeat 20 -C 0 -ddd /root/test.sh

?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 12:52                             ` Ingo Molnar
@ 2013-10-29 13:07                               ` Neil Horman
  2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 14:12                               ` David Ahern
  1 sibling, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-29 13:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > Sure it was this:
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > > done >> counters.txt 2>&1
> > > > 
> > > > where test.sh is:
> > > > #!/bin/sh
> > > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > > 
> > > What does '-- /root/test.sh' do?
> > > 
> > > Unless I'm missing something, the line above will run:
> > > 
> > >   perf bench sched messaging -- /root/test.sh
> > > 
> > > which should be equivalent to:
> > > 
> > >   perf bench sched messaging
> > > 
> > > i.e. /root/test.sh won't be run.
> > 
> > According to the perf man page, I'm supposed to be able to use -- 
> > to separate perf command line parameters from the command I want 
> > to run.  And it definately executed test.sh, I added an echo to 
> > stdout in there as a test run and observed them get captured in 
> > counters.txt
> 
> Well, '--' can be used to delineate the command portion for cases 
> where it's ambiguous.
> 
> Here's it's unambiguous though. This:
> 
>   perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> 
> stops parsing a valid option after the -ddd option, so in theory it 
> should execute 'perf bench sched messaging -- /root/test.sh' where 
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
> ignored.
> 
> The message output you provided seems to suggest that to be the 
> case:
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
> See how the command executed by perf stat was 'perf bench ...'.
> 
> Did you want to run:
> 
>   perf stat --repeat 20 -C 0 -ddd /root/test.sh
> 
I'm sure it worked properly on my system here, I specificially checked it, but
I'll gladly run it again.  You have to give me an hour as I have a meeting to
run to, but I'll have results shortly.
Neil

> ?
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:07                               ` Neil Horman
@ 2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 13:20                                   ` Neil Horman
  2013-10-29 14:17                                   ` Neil Horman
  0 siblings, 2 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-29 13:11 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> I'm sure it worked properly on my system here, I specificially 
> checked it, but I'll gladly run it again.  You have to give me an 
> hour as I have a meeting to run to, but I'll have results shortly.

So what I tried to react to was this observation of yours:

> > > Heres my data for running the same test with taskset 
> > > restricting execution to only cpu0.  I'm not quite sure whats 
> > > going on here, but doing so resulted in a 10x slowdown of the 
> > > runtime of each iteration which I can't explain. [...]

A 10x slowdown would be consistent with not running your testcase 
but 'perf bench sched messaging' by accident, or so.

But I was really just guessing wildly here.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:11                                 ` Ingo Molnar
@ 2013-10-29 13:20                                   ` Neil Horman
  2013-10-29 14:17                                   ` Neil Horman
  1 sibling, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-29 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specificially 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > > Heres my data for running the same test with taskset 
> > > > restricting execution to only cpu0.  I'm not quite sure whats 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 
Ok, well, I'll run it again in just a bit here.
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 12:52                             ` Ingo Molnar
  2013-10-29 13:07                               ` Neil Horman
@ 2013-10-29 14:12                               ` David Ahern
  1 sibling, 0 replies; 132+ messages in thread
From: David Ahern @ 2013-10-29 14:12 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On 10/29/13 6:52 AM, Ingo Molnar wrote:
>> According to the perf man page, I'm supposed to be able to use --
>> to separate perf command line parameters from the command I want
>> to run.  And it definately executed test.sh, I added an echo to
>> stdout in there as a test run and observed them get captured in
>> counters.txt
>
> Well, '--' can be used to delineate the command portion for cases
> where it's ambiguous.
>
> Here's it's unambiguous though. This:
>
>    perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
>
> stops parsing a valid option after the -ddd option, so in theory it
> should execute 'perf bench sched messaging -- /root/test.sh' where
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus
> ignored.

Normally with perf commands a workload can be specified to state how 
long to collect perf data. That is not the case for perf-bench.

David

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 13:20                                   ` Neil Horman
@ 2013-10-29 14:17                                   ` Neil Horman
  2013-10-29 14:27                                     ` Ingo Molnar
  1 sibling, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-29 14:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specificially 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > > Heres my data for running the same test with taskset 
> > > > restricting execution to only cpu0.  I'm not quite sure whats 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 


So, I apologize, you were right.  I was running the test.sh script but perf was
measuring itself.  Using this command line:

for i in `seq 0 1 3`
do
echo $i > /sys/modules/csum_test/parameters/module_test_mode; taskset -c 0 perf stat --repeat -C 0 -ddd /root/test.sh
done >> counters.txt 2>&1

with test.sh unchanged I get these results:


Base:
 Performance counter stats for '/root/test.sh' (20 runs):

         56.069737 task-clock                #    1.005 CPUs utilized            ( +-  0.13% ) [100.00%]
                 5 context-switches          #    0.091 K/sec                    ( +-  5.11% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               366 page-faults               #    0.007 M/sec                    ( +-  0.08% )
       144,264,737 cycles                    #    2.573 GHz                      ( +-  0.23% ) [17.49%]
         9,239,760 stalled-cycles-frontend   #    6.40% frontend cycles idle     ( +-  3.77% ) [19.19%]
       110,635,829 stalled-cycles-backend    #   76.69% backend  cycles idle     ( +-  0.14% ) [19.68%]
        54,291,496 instructions              #    0.38  insns per cycle        
                                             #    2.04  stalled cycles per insn  ( +-  0.14% ) [18.30%]
         5,844,933 branches                  #  104.244 M/sec                    ( +-  2.81% ) [16.58%]
           301,523 branch-misses             #    5.16% of all branches          ( +-  0.12% ) [16.09%]
        23,645,797 L1-dcache-loads           #  421.721 M/sec                    ( +-  0.05% ) [16.06%]
           494,467 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.06% ) [16.06%]
         2,907,250 LLC-loads                 #   51.851 M/sec                    ( +-  0.08% ) [16.06%]
           486,329 LLC-load-misses           #   16.73% of all LL-cache hits     ( +-  0.11% ) [16.06%]
        11,113,848 L1-icache-loads           #  198.215 M/sec                    ( +-  0.07% ) [16.06%]
             5,378 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  1.34% ) [16.06%]
        23,742,876 dTLB-loads                #  423.453 M/sec                    ( +-  0.06% ) [16.06%]
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits  [16.06%]
        11,108,538 iTLB-loads                #  198.120 M/sec                    ( +-  0.06% ) [16.06%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [16.07%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.07%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.07%]

       0.055817066 seconds time elapsed                                          ( +-  0.10% )

Prefetch(5*64):
 Performance counter stats for '/root/test.sh' (20 runs):

         47.423853 task-clock                #    1.005 CPUs utilized            ( +-  0.62% ) [100.00%]
                 6 context-switches          #    0.116 K/sec                    ( +-  4.27% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               368 page-faults               #    0.008 M/sec                    ( +-  0.07% )
       120,423,860 cycles                    #    2.539 GHz                      ( +-  0.85% ) [14.23%]
         8,555,632 stalled-cycles-frontend   #    7.10% frontend cycles idle     ( +-  0.56% ) [16.23%]
        87,438,794 stalled-cycles-backend    #   72.61% backend  cycles idle     ( +-  1.13% ) [18.33%]
        55,039,308 instructions              #    0.46  insns per cycle        
                                             #    1.59  stalled cycles per insn  ( +-  0.05% ) [18.98%]
         5,619,298 branches                  #  118.491 M/sec                    ( +-  2.32% ) [18.98%]
           303,686 branch-misses             #    5.40% of all branches          ( +-  0.08% ) [18.98%]
        26,577,868 L1-dcache-loads           #  560.432 M/sec                    ( +-  0.05% ) [18.98%]
         1,323,630 L1-dcache-load-misses     #    4.98% of all L1-dcache hits    ( +-  0.14% ) [18.98%]
         3,426,016 LLC-loads                 #   72.242 M/sec                    ( +-  0.05% ) [18.98%]
         1,304,201 LLC-load-misses           #   38.07% of all LL-cache hits     ( +-  0.13% ) [18.98%]
        13,190,316 L1-icache-loads           #  278.137 M/sec                    ( +-  0.21% ) [18.98%]
            33,881 L1-icache-load-misses     #    0.26% of all L1-icache hits    ( +-  4.63% ) [17.93%]
        25,366,685 dTLB-loads                #  534.893 M/sec                    ( +-  0.24% ) [15.93%]
               734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
        13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [12.98%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [12.98%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [12.87%]

       0.047194407 seconds time elapsed                                          ( +-  0.62% )

Parallel ALU:
 Performance counter stats for '/root/test.sh' (20 runs):

         57.395070 task-clock                #    1.004 CPUs utilized            ( +-  1.71% ) [100.00%]
                 5 context-switches          #    0.092 K/sec                    ( +-  3.90% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               367 page-faults               #    0.006 M/sec                    ( +-  0.10% )
       143,232,396 cycles                    #    2.496 GHz                      ( +-  1.68% ) [16.73%]
         7,299,843 stalled-cycles-frontend   #    5.10% frontend cycles idle     ( +-  2.69% ) [18.47%]
       109,485,845 stalled-cycles-backend    #   76.44% backend  cycles idle     ( +-  2.01% ) [19.99%]
        56,867,669 instructions              #    0.40  insns per cycle        
                                             #    1.93  stalled cycles per insn  ( +-  0.22% ) [19.49%]
         6,646,323 branches                  #  115.800 M/sec                    ( +-  2.15% ) [17.75%]
           304,671 branch-misses             #    4.58% of all branches          ( +-  0.37% ) [16.23%]
        23,612,428 L1-dcache-loads           #  411.402 M/sec                    ( +-  0.05% ) [15.95%]
           518,988 L1-dcache-load-misses     #    2.20% of all L1-dcache hits    ( +-  0.11% ) [15.95%]
         2,934,119 LLC-loads                 #   51.121 M/sec                    ( +-  0.06% ) [15.95%]
           509,027 LLC-load-misses           #   17.35% of all LL-cache hits     ( +-  0.15% ) [15.95%]
        11,103,819 L1-icache-loads           #  193.463 M/sec                    ( +-  0.08% ) [15.95%]
             5,381 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  2.45% ) [15.95%]
        23,727,164 dTLB-loads                #  413.401 M/sec                    ( +-  0.06% ) [15.95%]
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits  [15.95%]
        11,104,205 iTLB-loads                #  193.470 M/sec                    ( +-  0.06% ) [15.95%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [15.95%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [15.95%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [15.96%]

       0.057151644 seconds time elapsed                                          ( +-  1.69% )

Both:
 Performance counter stats for '/root/test.sh' (20 runs):

         48.377833 task-clock                #    1.005 CPUs utilized            ( +-  0.67% ) [100.00%]
                 5 context-switches          #    0.113 K/sec                    ( +-  3.88% ) [100.00%]
                 0 cpu-migrations            #    0.001 K/sec                    ( +-100.00% ) [100.00%]
               367 page-faults               #    0.008 M/sec                    ( +-  0.08% )
       122,529,490 cycles                    #    2.533 GHz                      ( +-  1.05% ) [14.24%]
         8,796,729 stalled-cycles-frontend   #    7.18% frontend cycles idle     ( +-  0.56% ) [16.20%]
        88,936,550 stalled-cycles-backend    #   72.58% backend  cycles idle     ( +-  1.48% ) [18.16%]
        58,405,660 instructions              #    0.48  insns per cycle        
                                             #    1.52  stalled cycles per insn  ( +-  0.07% ) [18.61%]
         5,742,738 branches                  #  118.706 M/sec                    ( +-  1.54% ) [18.61%]
           303,555 branch-misses             #    5.29% of all branches          ( +-  0.09% ) [18.61%]
        26,321,789 L1-dcache-loads           #  544.088 M/sec                    ( +-  0.07% ) [18.61%]
         1,236,101 L1-dcache-load-misses     #    4.70% of all L1-dcache hits    ( +-  0.08% ) [18.61%]
         3,409,768 LLC-loads                 #   70.482 M/sec                    ( +-  0.05% ) [18.61%]
         1,212,511 LLC-load-misses           #   35.56% of all LL-cache hits     ( +-  0.08% ) [18.61%]
        10,579,372 L1-icache-loads           #  218.682 M/sec                    ( +-  0.05% ) [18.61%]
            19,426 L1-icache-load-misses     #    0.18% of all L1-icache hits    ( +- 14.70% ) [18.61%]
        25,329,963 dTLB-loads                #  523.586 M/sec                    ( +-  0.27% ) [17.29%]
               802 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  5.43% ) [15.33%]
        10,635,524 iTLB-loads                #  219.843 M/sec                    ( +-  0.09% ) [13.38%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [12.72%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [12.72%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [12.72%]

       0.048140073 seconds time elapsed                                          ( +-  0.67% )


Which overall looks alot more like I expect, save for the parallel ALU cases.
It seems here that the parallel ALU changes actually hurt performance, which
really seems counter-intuitive.  I don't yet have any explination for that.  I
do note that we seem to have more stalls in the both case so perhaps the
parallel chains call for a more agressive prefetch.  Do you have any thoughts?

Regards
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 14:17                                   ` Neil Horman
@ 2013-10-29 14:27                                     ` Ingo Molnar
  2013-10-29 20:26                                       ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-29 14:27 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> So, I apologize, you were right.  I was running the test.sh script 
> but perf was measuring itself. [...]

Ok, cool - one mystery less!

> Which overall looks alot more like I expect, save for the parallel 
> ALU cases. It seems here that the parallel ALU changes actually 
> hurt performance, which really seems counter-intuitive.  I don't 
> yet have any explination for that.  I do note that we seem to have 
> more stalls in the both case so perhaps the parallel chains call 
> for a more agressive prefetch.  Do you have any thoughts?

Note that with -ddd you 'overload' the PMU with more counters than 
can be run at once, which introduces extra noise. Since you are 
running the tests for 0.150 secs or so, the results are not very 
representative:

               734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
        13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]

with such low runtimes those results are very hard to trust.

So -ddd is typically used to pick up the most interesting PMU events 
you want to see measured, and then use them like this:

   -e dTLB-load-misses -e iTLB-loads

etc. For such short runtimes make sure the last column displays 
close to 100%, so that the PMU results become trustable.

A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
plus generics like 'cycles', 'instructions' can be added 'for free' 
because they get counted in a separate (fixed purpose) PMU register.

The last colum tells you what percentage of the runtime that 
particular event was actually active. 100% (or empty last column) 
means it was active all the time.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 14:27                                     ` Ingo Molnar
@ 2013-10-29 20:26                                       ` Neil Horman
  2013-10-31 10:22                                         ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-29 20:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > So, I apologize, you were right.  I was running the test.sh script 
> > but perf was measuring itself. [...]
> 
> Ok, cool - one mystery less!
> 
> > Which overall looks alot more like I expect, save for the parallel 
> > ALU cases. It seems here that the parallel ALU changes actually 
> > hurt performance, which really seems counter-intuitive.  I don't 
> > yet have any explination for that.  I do note that we seem to have 
> > more stalls in the both case so perhaps the parallel chains call 
> > for a more agressive prefetch.  Do you have any thoughts?
> 
> Note that with -ddd you 'overload' the PMU with more counters than 
> can be run at once, which introduces extra noise. Since you are 
> running the tests for 0.150 secs or so, the results are not very 
> representative:
> 
>                734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
>         13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]
> 
> with such low runtimes those results are very hard to trust.
> 
> So -ddd is typically used to pick up the most interesting PMU events 
> you want to see measured, and then use them like this:
> 
>    -e dTLB-load-misses -e iTLB-loads
> 
> etc. For such short runtimes make sure the last column displays 
> close to 100%, so that the PMU results become trustable.
> 
> A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> plus generics like 'cycles', 'instructions' can be added 'for free' 
> because they get counted in a separate (fixed purpose) PMU register.
> 
> The last colum tells you what percentage of the runtime that 
> particular event was actually active. 100% (or empty last column) 
> means it was active all the time.
> 
> Thanks,
> 
> 	Ingo
> 

Hmm, 

I ran this test:

for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
done

And I updated the test module to run for a million iterations rather than 100000 to increase the sample size and got this:


Base:
 Performance counter stats for './test.sh' (20 runs):

        47,305,064 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.04% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.75%]
    13,906,212,348 cycles                    #    0.000 GHz                      ( +-  0.05% ) [18.76%]
     4,426,395,949 instructions              #    0.32  insns per cycle          ( +-  0.01% ) [18.77%]
     2,261,551,278 L1-dcache-loads                                               ( +-  0.02% ) [18.76%]
        47,287,226 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.04% ) [18.76%]
       276,842,685 LLC-loads                                                     ( +-  0.01% ) [18.76%]
        46,454,114 LLC-load-misses           #   16.78% of all LL-cache hits     ( +-  0.05% ) [18.76%]
     1,048,894,486 L1-icache-loads                                               ( +-  0.07% ) [18.76%]
           472,205 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  1.19% ) [18.76%]
     2,260,639,613 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
               172 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 35.14% ) [18.74%]
     1,048,732,481 iTLB-loads                                                    ( +-  0.07% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.75% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       5.370546698 seconds time elapsed                                          ( +-  0.05% )


Prefetch:
 Performance counter stats for './test.sh' (20 runs):

       124,885,469 L1-dcache-load-misses     #    4.96% of all L1-dcache hits    ( +-  0.09% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.75%]
    11,434,328,889 cycles                    #    0.000 GHz                      ( +-  1.11% ) [18.77%]
     4,601,831,553 instructions              #    0.40  insns per cycle          ( +-  0.01% ) [18.77%]
     2,515,483,814 L1-dcache-loads                                               ( +-  0.01% ) [18.77%]
       124,928,127 L1-dcache-load-misses     #    4.97% of all L1-dcache hits    ( +-  0.09% ) [18.76%]
       323,355,145 LLC-loads                                                     ( +-  0.02% ) [18.76%]
       123,008,548 LLC-load-misses           #   38.04% of all LL-cache hits     ( +-  0.10% ) [18.75%]
     1,256,391,060 L1-icache-loads                                               ( +-  0.01% ) [18.75%]
           374,691 L1-icache-load-misses     #    0.03% of all L1-icache hits    ( +-  1.41% ) [18.75%]
     2,514,984,046 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
                67 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 51.81% ) [18.74%]
     1,256,333,548 iTLB-loads                                                    ( +-  0.01% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.74% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       4.496839773 seconds time elapsed                                          ( +-  0.64% )


Parallel ALU:
 Performance counter stats for './test.sh' (20 runs):

        49,489,518 L1-dcache-load-misses     #    2.19% of all L1-dcache hits    ( +-  0.09% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.76%]
    13,777,501,365 cycles                    #    0.000 GHz                      ( +-  1.73% ) [18.78%]
     4,707,160,703 instructions              #    0.34  insns per cycle          ( +-  0.01% ) [18.78%]
     2,261,693,074 L1-dcache-loads                                               ( +-  0.02% ) [18.78%]
        49,468,878 L1-dcache-load-misses     #    2.19% of all L1-dcache hits    ( +-  0.09% ) [18.77%]
       279,524,254 LLC-loads                                                     ( +-  0.01% ) [18.76%]
        48,491,934 LLC-load-misses           #   17.35% of all LL-cache hits     ( +-  0.12% ) [18.75%]
     1,057,877,680 L1-icache-loads                                               ( +-  0.02% ) [18.74%]
           461,784 L1-icache-load-misses     #    0.04% of all L1-icache hits    ( +-  1.87% ) [18.74%]
     2,260,978,836 dTLB-loads                                                    ( +-  0.02% ) [18.74%]
                27 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 89.96% ) [18.74%]
     1,057,886,632 iTLB-loads                                                    ( +-  0.02% ) [18.74%]
                 4 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +-100.00% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       5.500417234 seconds time elapsed                                          ( +-  1.60% )


Both:
 Performance counter stats for './test.sh' (20 runs):

       116,621,570 L1-dcache-load-misses     #    4.68% of all L1-dcache hits    ( +-  0.04% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.75%]
    11,597,067,510 cycles                    #    0.000 GHz                      ( +-  1.73% ) [18.77%]
     4,952,251,361 instructions              #    0.43  insns per cycle          ( +-  0.01% ) [18.77%]
     2,493,003,710 L1-dcache-loads                                               ( +-  0.02% ) [18.77%]
       116,640,333 L1-dcache-load-misses     #    4.68% of all L1-dcache hits    ( +-  0.04% ) [18.77%]
       322,246,216 LLC-loads                                                     ( +-  0.03% ) [18.76%]
       114,528,956 LLC-load-misses           #   35.54% of all LL-cache hits     ( +-  0.04% ) [18.76%]
       999,371,469 L1-icache-loads                                               ( +-  0.02% ) [18.76%]
           406,679 L1-icache-load-misses     #    0.04% of all L1-icache hits    ( +-  1.97% ) [18.75%]
     2,492,708,710 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
               140 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 38.46% ) [18.74%]
       999,320,389 iTLB-loads                                                    ( +-  0.01% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.90% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.72%]

       4.634419247 seconds time elapsed                                          ( +-  1.60% )


I note a few oddities here:

1) We seem to be getting more counter results than I specified, not sure why
2) The % active column is adding up to way more than 100 (which from my read of
the man page makes sense, given that multiple counters might increment in
response to a single instruction execution
3) The run times are proportionally larger, but still indicate that Parallel ALU
execution is hurting rather than helping, which is counter-intuitive.  I'm
looking into it, but thought you might want to see these results in case
something jumped out at you

Regards
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 20:26                                       ` Neil Horman
@ 2013-10-31 10:22                                         ` Ingo Molnar
  2013-10-31 14:33                                           ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-31 10:22 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > etc. For such short runtimes make sure the last column displays 
> > close to 100%, so that the PMU results become trustable.
> > 
> > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > because they get counted in a separate (fixed purpose) PMU register.
> > 
> > The last colum tells you what percentage of the runtime that 
> > particular event was actually active. 100% (or empty last column) 
> > means it was active all the time.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> Hmm, 
> 
> I ran this test:
> 
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> done

You need to remove '-ddd' which is a shortcut for a ton of useful 
events, but here you want to use fewer events, to increase the 
precision of the measurement.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 10:22                                         ` Ingo Molnar
@ 2013-10-31 14:33                                           ` Neil Horman
  2013-11-01  9:13                                             ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-10-31 14:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > > etc. For such short runtimes make sure the last column displays 
> > > close to 100%, so that the PMU results become trustable.
> > > 
> > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > because they get counted in a separate (fixed purpose) PMU register.
> > > 
> > > The last colum tells you what percentage of the runtime that 
> > > particular event was actually active. 100% (or empty last column) 
> > > means it was active all the time.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Hmm, 
> > 
> > I ran this test:
> > 
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > done
> 
> You need to remove '-ddd' which is a shortcut for a ton of useful 
> events, but here you want to use fewer events, to increase the 
> precision of the measurement.
> 
> Thanks,
> 
> 	Ingo
> 

Thank you ingo, that fixed it.  I'm trying some other variants of the csum
algorithm that Doug and I discussed last night, but FWIW, the relative
performance of the 4 test cases (base/prefetch/parallel/both) remains unchanged.
I'm starting to feel like at this point, theres very little point in doing
parallel alu operations (unless we can find a way to break the dependency on the
carry flag, which is what I'm tinkering with now).
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 14:33                                           ` Neil Horman
@ 2013-11-01  9:13                                             ` Ingo Molnar
  2013-11-01 14:06                                               ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-11-01  9:13 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > > etc. For such short runtimes make sure the last column displays 
> > > > close to 100%, so that the PMU results become trustable.
> > > > 
> > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > 
> > > > The last colum tells you what percentage of the runtime that 
> > > > particular event was actually active. 100% (or empty last column) 
> > > > means it was active all the time.
> > > > 
> > > > Thanks,
> > > > 
> > > > 	Ingo
> > > > 
> > > 
> > > Hmm, 
> > > 
> > > I ran this test:
> > > 
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > done
> > 
> > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > events, but here you want to use fewer events, to increase the 
> > precision of the measurement.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> Thank you ingo, that fixed it.  I'm trying some other variants of 
> the csum algorithm that Doug and I discussed last night, but FWIW, 
> the relative performance of the 4 test cases 
> (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> feel like at this point, theres very little point in doing 
> parallel alu operations (unless we can find a way to break the 
> dependency on the carry flag, which is what I'm tinkering with 
> now).

I would still like to encourage you to pick up the improvements that 
Doug measured (mostly via prefetch tweaking?) - that looked like 
some significant speedups that we don't want to lose!

Also, trying to stick the in-kernel implementation into 'perf bench' 
would be a useful first step as well, for this and future efforts.

See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
up the in-kernel assembly memcpy implementations:

#define memcpy MEMCPY /* don't hide glibc's memcpy() */
#define altinstr_replacement text
#define globl p2align 4; .globl
#define Lmemcpy_c globl memcpy_c; memcpy_c
#define Lmemcpy_c_e globl memcpy_c_e; memcpy_c_e

#include "../../../arch/x86/lib/memcpy_64.S"

So it needed a bit of trickery/wrappery for 'perf bench mem memcpy', 
but that is a one-time effort - once it's done then the current 
in-kernel csum_partial() implementation would be easily measurable 
(and any performance regression in it bisectable, etc.) from that 
point on.

In user-space it would also be easier to add various parameters and 
experimental implementations and background cache-stressing 
workloads automatically.

Something similar might be possible for csum_partial(), 
csum_partial_copy*(), etc.

Note, if any of you ventures to add checksum-benchmarking to perf 
bench, please base any patches on top of tip:perf/core:

  git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core

as there are a couple of perf bench enhancements in the pipeline 
already for v3.13.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01  9:13                                             ` Ingo Molnar
@ 2013-11-01 14:06                                               ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-01 14:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > > etc. For such short runtimes make sure the last column displays 
> > > > > close to 100%, so that the PMU results become trustable.
> > > > > 
> > > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > > 
> > > > > The last colum tells you what percentage of the runtime that 
> > > > > particular event was actually active. 100% (or empty last column) 
> > > > > means it was active all the time.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > 	Ingo
> > > > > 
> > > > 
> > > > Hmm, 
> > > > 
> > > > I ran this test:
> > > > 
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > > done
> > > 
> > > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > > events, but here you want to use fewer events, to increase the 
> > > precision of the measurement.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Thank you ingo, that fixed it.  I'm trying some other variants of 
> > the csum algorithm that Doug and I discussed last night, but FWIW, 
> > the relative performance of the 4 test cases 
> > (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> > feel like at this point, theres very little point in doing 
> > parallel alu operations (unless we can find a way to break the 
> > dependency on the carry flag, which is what I'm tinkering with 
> > now).
> 
> I would still like to encourage you to pick up the improvements that 
> Doug measured (mostly via prefetch tweaking?) - that looked like 
> some significant speedups that we don't want to lose!
> 
Well, yes, I made a line item of that in my subsequent note below.  I'm going to
repost that shortly, and I suggested that we revisit this when the AVX
instruction extensions are available.

> Also, trying to stick the in-kernel implementation into 'perf bench' 
> would be a useful first step as well, for this and future efforts.
> 
> See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
> up the in-kernel assembly memcpy implementations:
> 
Yes, I'll look into adding this as well
Regards
Neil



^ permalink raw reply	[flat|nested] 132+ messages in thread

* x86: Enhance perf checksum profiling and x86 implementation
  2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
                   ` (2 preceding siblings ...)
  2013-10-14  4:38 ` Andi Kleen
@ 2013-11-06 15:23 ` Neil Horman
  2013-11-06 15:23   ` [PATCH v2 1/2] perf: Add csum benchmark tests to perf Neil Horman
  2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
  3 siblings, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-06 15:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Hey all-
	Sorry for the delay here, but it took me a bit to get the perf bits
working to my satisfaction.  As Ingo requested I added do_csum to the perf
benchmarking utility (as part of the mem suite, since it didn't seem right to
create its own suite).  I've also revamped the do_csum routine to do some smart
prefetching, as it yielded slightly better performance over simple prefetching
at a fixed stride:

Without prefetch:
[root@rdma-dev-02 perf]# ./perf bench mem csum -r x86-64-csum -l 1500B -s 512MB
-i 1000000 -c
# Running mem/csum benchmark...
# Copying 1500B Bytes ...

       0.955977 Cycle/Byte

With prefetch:
[root@rdma-dev-02 perf]# ./perf bench mem csum -r x86-64-csum -l 1500B -s 512MB
-i 1000000 -c
# Running mem/csum benchmark...
# Copying 1500B Bytes ...

       0.922540 Cycle/Byte


About a 3% improvement.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org


^ permalink raw reply	[flat|nested] 132+ messages in thread

* [PATCH v2 1/2] perf: Add csum benchmark tests to perf
  2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
@ 2013-11-06 15:23   ` Neil Horman
  2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
  1 sibling, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-06 15:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Adding perf benchmarks to test the arch independent and x86[64] versions of
do_csum to the perf suite.  Other arches can be added as needed.  To avoid
creating a new suite instance (as I didn't think it was warranted), the csum
benchmarks have been added to the mem suite

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 tools/perf/Makefile.perf               |   3 +
 tools/perf/bench/bench.h               |   2 +
 tools/perf/bench/mem-csum-generic.c    |  21 +++
 tools/perf/bench/mem-csum-x86-64-def.h |   8 +
 tools/perf/bench/mem-csum-x86-64.c     |  51 +++++++
 tools/perf/bench/mem-csum.c            | 266 +++++++++++++++++++++++++++++++++
 tools/perf/bench/mem-csum.h            |  46 ++++++
 tools/perf/builtin-bench.c             |   1 +
 8 files changed, 398 insertions(+)
 create mode 100644 tools/perf/bench/mem-csum-generic.c
 create mode 100644 tools/perf/bench/mem-csum-x86-64-def.h
 create mode 100644 tools/perf/bench/mem-csum-x86-64.c
 create mode 100644 tools/perf/bench/mem-csum.c
 create mode 100644 tools/perf/bench/mem-csum.h

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 5b86390..d0ac05b 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -413,9 +413,12 @@ BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
 ifeq ($(RAW_ARCH),x86_64)
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset-x86-64-asm.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-csum-x86-64.o
 endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-csum.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-csum-generic.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 0fdc852..3bbe43e 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -32,6 +32,8 @@ extern int bench_mem_memcpy(int argc, const char **argv,
 			    const char *prefix __maybe_unused);
 extern int bench_mem_memset(int argc, const char **argv, const char *prefix);
 
+extern int bench_mem_csum(int argc, const char **argv, const char *prefix);
+
 #define BENCH_FORMAT_DEFAULT_STR	"default"
 #define BENCH_FORMAT_DEFAULT		0
 #define BENCH_FORMAT_SIMPLE_STR		"simple"
diff --git a/tools/perf/bench/mem-csum-generic.c b/tools/perf/bench/mem-csum-generic.c
new file mode 100644
index 0000000..3e77b0d
--- /dev/null
+++ b/tools/perf/bench/mem-csum-generic.c
@@ -0,0 +1,21 @@
+#include "mem-csum.h"
+
+u32 generic_do_csum(unsigned char *buff, unsigned int len);
+
+__wsum csum_partial_copy(const void *src, void *dst, int len, __wsum sum);
+
+/*
+ * Each arch specific implementation file exports these functions,
+ * So we get link time conflicts.  Since we're not testing these paths right now
+ * just rename them to something generic here
+ */
+#define csum_partial(x, y, z) csum_partial_generic(x, y, z)
+#define ip_compute_csum(x, y) ip_complete_csum_generic(x, y)
+
+#include "../../../lib/checksum.c"
+
+u32 generic_do_csum(unsigned char *buff, unsigned int len)
+{
+	return do_csum(buff, len);
+}
+
diff --git a/tools/perf/bench/mem-csum-x86-64-def.h b/tools/perf/bench/mem-csum-x86-64-def.h
new file mode 100644
index 0000000..6698193
--- /dev/null
+++ b/tools/perf/bench/mem-csum-x86-64-def.h
@@ -0,0 +1,8 @@
+/*
+ * Arch specific bench tests for x86[_64]
+ */
+
+CSUM_FN(x86_do_csum, x86_do_csum_init,
+	"x86-64-csum",
+	"x86 unrolled optimized csum() from kernel")
+
diff --git a/tools/perf/bench/mem-csum-x86-64.c b/tools/perf/bench/mem-csum-x86-64.c
new file mode 100644
index 0000000..72bc855
--- /dev/null
+++ b/tools/perf/bench/mem-csum-x86-64.c
@@ -0,0 +1,51 @@
+#include "mem-csum.h"
+
+static int clflush_size;
+
+/*
+ * This overrides the cache_line_size() function from the kernel
+ * The kernel version returns the size of the processor cache line, so 
+ * we emulate that here
+ */
+static inline int cache_line_size(void)
+{
+	return clflush_size;
+}
+
+/*
+ * userspace has no idea what these macros do, and since we don't 
+ * need them to do anything for perf, just make them go away
+ */
+#define unlikely(x) x
+#define EXPORT_SYMBOL(x)
+
+u32 x86_do_csum(unsigned char *buff, unsigned int len);
+void x86_do_csum_init(void);
+
+#include "../../../arch/x86/lib/csum-partial_64.c"
+
+u32 x86_do_csum(unsigned char *buff, unsigned int len)
+{
+	return do_csum(buff, len);
+}
+
+void x86_do_csum_init(void)
+{
+	/*
+	 * The do_csum routine we're testing requires the kernel
+	 * implementation of cache_line_size(), which relies on data
+	 * parsed from the cpuid instruction, do that computation here
+	 */
+	asm("mov $0x1, %%eax\n\t"
+	    "cpuid\n\t"
+	    "mov %%ebx, %[size]\n"
+	    : : [size] "m" (clflush_size));
+
+	/*
+	 * The size of a cache line evicted by a clflush operation is
+	 * contained in bits 15:8 of ebx when cpuid 0x1 is issued
+	 * and is reported in 8 byte words, hence the multiplcation below
+	 */
+	clflush_size = (clflush_size >> 8) & 0x0000000f;
+	clflush_size *= 8;
+}
diff --git a/tools/perf/bench/mem-csum.c b/tools/perf/bench/mem-csum.c
new file mode 100644
index 0000000..3676f6e
--- /dev/null
+++ b/tools/perf/bench/mem-csum.c
@@ -0,0 +1,266 @@
+/*
+ * mem-csum.c
+ *
+ * csum: checksum speed tests
+ *
+ */
+
+#include "../perf.h"
+#include "../util/util.h"
+#include "../util/parse-options.h"
+#include "../util/header.h"
+#include "bench.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <errno.h>
+
+#define K 1024
+
+static const char	*length_str	= "1500B";
+static const char	*size_str	= "64MB";
+static const char	*routine	= "default";
+static int		iterations	= 1;
+static bool		use_cycle;
+static int		cycle_fd;
+
+static const struct option options[] = {
+	OPT_STRING('l', "length", &length_str, "1MB",
+		    "Specify length of memory to checksum. "
+		    "Available units: B, KB, MB, GB and TB (upper and lower)"),
+	OPT_STRING('s', "size", &size_str, "64MB",
+		   "Size of working set to draw csumed buffer from."
+		   "Available units: B, KB, MB, GB and TB"),
+	OPT_STRING('r', "routine", &routine, "default",
+		    "Specify routine to set"),
+	OPT_INTEGER('i', "iterations", &iterations,
+		    "repeat csum() invocation this number of times"),
+	OPT_BOOLEAN('c', "cycle", &use_cycle,
+		    "Use cycles event instead of gettimeofday() for measuring"),
+	OPT_END()
+};
+
+
+extern u32 generic_do_csum(unsigned char *buff, unsigned int len);
+
+#ifdef HAVE_ARCH_X86_64_SUPPORT
+extern u32 x86_do_csum(unsigned char *buff, unsigned int len);
+extern void x86_do_csum_init(void);
+#endif
+
+typedef u32 (*csum_t)(unsigned char *, unsigned int);
+typedef void (*csum_init_t)(void);
+
+struct routine {
+	const char *name;
+	const char *desc;
+	csum_t fn;
+	csum_init_t initfn;
+};
+
+static const struct routine routines[] = {
+	{ "default",
+	  "Default arch-independent csum",
+	  generic_do_csum,
+	  NULL },
+#ifdef HAVE_ARCH_X86_64_SUPPORT
+#define CSUM_FN(fn, init, name, desc) { name, desc, fn, init },
+#include "mem-csum-x86-64-def.h"
+#undef CSUM_FN
+
+#endif
+
+	{ NULL,
+	  NULL,
+	  NULL,
+	  NULL }
+};
+
+static const char * const bench_mem_csum_usage[] = {
+	"perf bench mem csum <options>",
+	NULL
+};
+
+static struct perf_event_attr cycle_attr = {
+	.type		= PERF_TYPE_HARDWARE,
+	.config		= PERF_COUNT_HW_CPU_CYCLES
+};
+
+static void init_cycle(void)
+{
+	cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+
+	if (cycle_fd < 0 && errno == ENOSYS)
+		die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
+	else
+		BUG_ON(cycle_fd < 0);
+}
+
+static u64 get_cycle(void)
+{
+	int ret;
+	u64 clk;
+
+	ret = read(cycle_fd, &clk, sizeof(u64));
+	BUG_ON(ret != sizeof(u64));
+
+	return clk;
+}
+
+static double timeval2double(struct timeval *ts)
+{
+	return (double)ts->tv_sec +
+		(double)ts->tv_usec / (double)1000000;
+}
+
+static void alloc_mem(void **dst, size_t length)
+{
+	*dst = malloc(length);
+	if (!*dst)
+		die("memory allocation failed - maybe length is too large?\n");
+}
+
+
+static u64 do_csum_cycle(csum_t fn, size_t size, size_t len)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	void *dst = NULL;
+	void *pool = NULL;
+	unsigned int segments;
+	u64 total_cycles = 0;
+	int i;
+
+	alloc_mem(&pool, size);
+
+	segments = (size / len) - 1;
+	for (i = 0; i < iterations; ++i) {
+		dst = pool + ((random() % segments) * len);
+		cycle_start = get_cycle();
+		fn(dst, len);
+		cycle_end = get_cycle();
+		total_cycles += (cycle_end - cycle_start);
+	}
+
+	free(pool);
+	return total_cycles;
+}
+
+static double do_csum_gettimeofday(csum_t fn, size_t size, size_t len)
+{
+	struct timeval tv_start, tv_end, tv_diff, tv_total;
+	void *dst = NULL;
+	void *pool = NULL;
+	unsigned int segments;
+	int i;
+
+	alloc_mem(&pool, size);
+	timerclear(&tv_total);
+	segments = (size / len) - 1;
+
+	for (i = 0; i < iterations; ++i) {
+		dst = pool + ((random() % segments) * len);
+		BUG_ON(gettimeofday(&tv_start, NULL));
+		fn(dst, len);
+		BUG_ON(gettimeofday(&tv_end, NULL));
+		timersub(&tv_end, &tv_start, &tv_diff);
+		timeradd(&tv_total, &tv_diff, &tv_total);
+	}
+
+
+	free(pool);
+	return (double)((double)(len*iterations) / timeval2double(&tv_total));
+}
+
+#define print_bps(x) do {					\
+		if (x < K)					\
+			printf(" %14lf B/Sec\n", x);		\
+		else if (x < K * K)				\
+			printf(" %14lfd KB/Sec\n", x / K);	\
+		else if (x < K * K * K)				\
+			printf(" %14lf MB/Sec\n", x / K / K);	\
+		else						\
+			printf(" %14lf GB/Sec\n", x / K / K / K); \
+	} while (0)
+
+int bench_mem_csum(int argc, const char **argv,
+		   const char *prefix __maybe_unused)
+{
+	int i;
+	size_t len;
+	size_t setsize;
+	double result_bps;
+	u64 result_cycle;
+
+	argc = parse_options(argc, argv, options,
+			     bench_mem_csum_usage, 0);
+
+	if (use_cycle)
+		init_cycle();
+
+	len = (size_t)perf_atoll((char *)length_str);
+	setsize = (size_t)perf_atoll((char *)size_str);
+
+	result_cycle = 0ULL;
+	result_bps = 0.0;
+
+	if ((s64)len <= 0) {
+		fprintf(stderr, "Invalid length:%s\n", length_str);
+		return 1;
+	}
+
+	for (i = 0; routines[i].name; i++) {
+		if (!strcmp(routines[i].name, routine))
+			break;
+	}
+	if (!routines[i].name) {
+		printf("Unknown routine:%s\n", routine);
+		printf("Available routines...\n");
+		for (i = 0; routines[i].name; i++) {
+			printf("\t%s ... %s\n",
+			       routines[i].name, routines[i].desc);
+		}
+		return 1;
+	}
+
+	if (routines[i].initfn)
+		routines[i].initfn();
+
+	if (bench_format == BENCH_FORMAT_DEFAULT)
+		printf("# Copying %s Bytes ...\n\n", length_str);
+
+	if (use_cycle) {
+		result_cycle =
+			do_csum_cycle(routines[i].fn, setsize, len);
+	} else {
+		result_bps =
+			do_csum_gettimeofday(routines[i].fn, setsize, len);
+	}
+
+	switch (bench_format) {
+	case BENCH_FORMAT_DEFAULT:
+		if (use_cycle) {
+			printf(" %14lf Cycle/Byte\n",
+				(double)result_cycle
+				/ (double)(len*iterations));
+		} else
+			print_bps(result_bps);
+
+
+		break;
+	case BENCH_FORMAT_SIMPLE:
+		if (use_cycle) {
+			printf("%lf\n", (double)result_cycle
+				/ (double)(len*iterations));
+		} else
+			printf("%lf\n", result_bps);
+		break;
+	default:
+		/* reaching this means there's some disaster: */
+		die("unknown format: %d\n", bench_format);
+		break;
+	}
+
+	return 0;
+}
diff --git a/tools/perf/bench/mem-csum.h b/tools/perf/bench/mem-csum.h
new file mode 100644
index 0000000..cca9a77
--- /dev/null
+++ b/tools/perf/bench/mem-csum.h
@@ -0,0 +1,46 @@
+/*
+ * Header for mem-csum
+ * mostly trickery to get the kernel code to compile
+ * in user space
+ */
+
+#include "../util/util.h"
+
+#include <linux/types.h>
+
+
+typedef __u16 __le16;
+typedef __u16 __be16;
+typedef __u32 __le32;
+typedef __u32 __be32;
+typedef __u64 __le64;
+typedef __u64 __be64;
+
+typedef __u16 __sum16;
+typedef __u32 __wsum;
+
+/*
+ * __visible isn't defined in userspace, so make it dissappear
+ */
+#define __visible
+
+/*
+ * These get multiple definitions in the kernel with a common inline version
+ * We're not testing them so just move them to another name
+ */
+#define ip_fast_csum ip_fast_csum_backup
+#define csum_tcpudp_nofold csum_tcpudp_nofold_backup
+
+/*
+ * Most csum implementations need this defined, for the copy_and_csum variants.
+ * Since we're building in userspace, this can be voided out
+ */
+static inline int __copy_from_user(void *dst, const void *src, size_t len)
+{
+	(void)dst;
+	(void)src;
+	(void)len;
+	return 0;
+}
+
+
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index e47f90c..44199e0 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -50,6 +50,7 @@ static struct bench sched_benchmarks[] = {
 static struct bench mem_benchmarks[] = {
 	{ "memcpy",	"Benchmark for memcpy()",			bench_mem_memcpy	},
 	{ "memset",	"Benchmark for memset() tests",			bench_mem_memset	},
+	{ "csum",	"Simple csum timing for various arches",	bench_mem_csum		},
 	{ "all",	"Test all memory benchmarks",			NULL			},
 	{ NULL,		NULL,						NULL			}
 };
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
  2013-11-06 15:23   ` [PATCH v2 1/2] perf: Add csum benchmark tests to perf Neil Horman
@ 2013-11-06 15:23   ` Neil Horman
  2013-11-06 15:34     ` Dave Jones
  2013-11-06 20:19     ` Andi Kleen
  1 sibling, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-06 15:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

do_csum was identified via perf recently as a hot spot when doing
receive on ip over infiniband workloads.  After alot of testing and
ideas, we found the best optimization available to us currently is to
prefetch the entire data buffer prior to doing the checksum

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 arch/x86/lib/csum-partial_64.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..9f2d3ee 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,8 +29,15 @@ static inline unsigned short from32to16(unsigned a)
  * Things tried and found to not make it faster:
  * Manual Prefetching
  * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
  */
+
+static inline void prefetch_lines(void *addr, size_t len)
+{
+	void *end = addr + len;
+	for (; addr < end; addr += cache_line_size())
+		asm("prefetch 0(%[buf])\n\t" : : [buf] "r" (addr));
+}
+
 static unsigned do_csum(const unsigned char *buff, unsigned len)
 {
 	unsigned odd, count;
@@ -67,7 +74,9 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			/* main loop using 64byte blocks */
 			zero = 0;
 			count64 = count >> 3;
-			while (count64) { 
+
+			prefetch_lines((void *)buff, len);
+			while (count64) {
 				asm("addq 0*8(%[src]),%[res]\n\t"
 				    "adcq 1*8(%[src]),%[res]\n\t"
 				    "adcq 2*8(%[src]),%[res]\n\t"
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
@ 2013-11-06 15:34     ` Dave Jones
  2013-11-06 15:54       ` Neil Horman
  2013-11-06 20:19     ` Andi Kleen
  1 sibling, 1 reply; 132+ messages in thread
From: Dave Jones @ 2013-11-06 15:34 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
 > do_csum was identified via perf recently as a hot spot when doing
 > receive on ip over infiniband workloads.  After alot of testing and
 > ideas, we found the best optimization available to us currently is to
 > prefetch the entire data buffer prior to doing the checksum
 > 
 > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
 > index 9845371..9f2d3ee 100644
 > --- a/arch/x86/lib/csum-partial_64.c
 > +++ b/arch/x86/lib/csum-partial_64.c
 > @@ -29,8 +29,15 @@ static inline unsigned short from32to16(unsigned a)
 >   * Things tried and found to not make it faster:
 >   * Manual Prefetching
 >   * Unrolling to an 128 bytes inner loop.
 > - * Using interleaving with more registers to break the carry chains.
 
Did you mean perhaps to remove the "Manual Prefetching" line instead ?
(Curious, what was tried before that made it not worthwhile?)
 
	Dave
 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:34     ` Dave Jones
@ 2013-11-06 15:54       ` Neil Horman
  2013-11-06 17:19         ` Joe Perches
  2013-11-06 18:23         ` Eric Dumazet
  0 siblings, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-06 15:54 UTC (permalink / raw)
  To: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
>  > do_csum was identified via perf recently as a hot spot when doing
>  > receive on ip over infiniband workloads.  After alot of testing and
>  > ideas, we found the best optimization available to us currently is to
>  > prefetch the entire data buffer prior to doing the checksum
>  > 
>  > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>  > index 9845371..9f2d3ee 100644
>  > --- a/arch/x86/lib/csum-partial_64.c
>  > +++ b/arch/x86/lib/csum-partial_64.c
>  > @@ -29,8 +29,15 @@ static inline unsigned short from32to16(unsigned a)
>  >   * Things tried and found to not make it faster:
>  >   * Manual Prefetching
>  >   * Unrolling to an 128 bytes inner loop.
>  > - * Using interleaving with more registers to break the carry chains.
>  
> Did you mean perhaps to remove the "Manual Prefetching" line instead ?
> (Curious, what was tried before that made it not worthwhile?)
>  
Crap, I didn't notice that previously, thanks Dave.

My guess was that the whole comment was made in reference to the fact that
checksum offload negated all these advantages.  Thats not so true anymore, since
infiniband needs csum in software for ipoib.

I'll fix this up and send a v3, but I'll give it a day in case there are more
comments first.

Thanks
Neil

> 	Dave
>  
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:54       ` Neil Horman
@ 2013-11-06 17:19         ` Joe Perches
  2013-11-06 18:11           ` Neil Horman
                             ` (2 more replies)
  2013-11-06 18:23         ` Eric Dumazet
  1 sibling, 3 replies; 132+ messages in thread
From: Joe Perches @ 2013-11-06 17:19 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> >  > do_csum was identified via perf recently as a hot spot when doing
> >  > receive on ip over infiniband workloads.  After alot of testing and
> >  > ideas, we found the best optimization available to us currently is to
> >  > prefetch the entire data buffer prior to doing the checksum
[]
> I'll fix this up and send a v3, but I'll give it a day in case there are more
> comments first.

Perhaps a reduction in prefetch loop count helps.

Was capping the amount prefetched and letting the
hardware prefetch also tested?

	prefetch_lines(buff, min(len, cache_line_size() * 8u));

Also pedantry/trivial comments:

__always_inline instead of inline
static __always_inline void prefetch_lines(const void *addr, size_t len)
{
	const void *end = addr + len;
...

buff doesn't need a void * cast in prefetch_lines

Beside the commit message, the comment above prefetch_lines
also needs updating to remove the "Manual Prefetching" line.

/*
 * Do a 64-bit checksum on an arbitrary memory area.
 * Returns a 32bit checksum.
 *
 * This isn't as time critical as it used to be because many NICs
 * do hardware checksumming these days.
 * 
 * Things tried and found to not make it faster:
 * Manual Prefetching
 * Unrolling to an 128 bytes inner loop.
 */



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 17:19         ` Joe Perches
@ 2013-11-06 18:11           ` Neil Horman
  2013-11-06 20:02           ` Neil Horman
  2013-11-08 19:01           ` Neil Horman
  2 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-06 18:11 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > >  > do_csum was identified via perf recently as a hot spot when doing
> > >  > receive on ip over infiniband workloads.  After alot of testing and
> > >  > ideas, we found the best optimization available to us currently is to
> > >  > prefetch the entire data buffer prior to doing the checksum
> []
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Perhaps a reduction in prefetch loop count helps.
> 
> Was capping the amount prefetched and letting the
> hardware prefetch also tested?
> 
> 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> 
It was not, but I did not bother to try since accurate branch predicion in the
loop and prefetch issuing should be very fast.  Id also be worried that capping
prefetch would be relatively hardware specific, so what worked well on some
hardware, wouldn't be enough on other hardware.  I'd rather just issue the
prefetch for the whole buffer, as that should produce consistent results.

> Also pedantry/trivial comments:
> 
> __always_inline instead of inline
> static __always_inline void prefetch_lines(const void *addr, size_t len)
> {
> 	const void *end = addr + len;
> ...
> 
> buff doesn't need a void * cast in prefetch_lines
> 
ACK

> Beside the commit message, the comment above prefetch_lines
> also needs updating to remove the "Manual Prefetching" line.
> 
Yup, Dave noted the Manual Prefetch issue, and I'll move the whole comment as
part of that.

> /*
>  * Do a 64-bit checksum on an arbitrary memory area.
>  * Returns a 32bit checksum.
>  *
>  * This isn't as time critical as it used to be because many NICs
>  * do hardware checksumming these days.
>  * 
>  * Things tried and found to not make it faster:
>  * Manual Prefetching
>  * Unrolling to an 128 bytes inner loop.
>  */
> 
> 
> 

Regards
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:54       ` Neil Horman
  2013-11-06 17:19         ` Joe Perches
@ 2013-11-06 18:23         ` Eric Dumazet
  2013-11-06 18:59           ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: Eric Dumazet @ 2013-11-06 18:23 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:

> My guess was that the whole comment was made in reference to the fact that
> checksum offload negated all these advantages.  Thats not so true anymore, since
> infiniband needs csum in software for ipoib.
> 
> I'll fix this up and send a v3, but I'll give it a day in case there are more
> comments first.

Also please include netdev, I think people there are interested.

I caught this message, but I usually cannot read lkml traffic.

I wonder why you do not use (and/or change/tune) prefetch_range()
instead of a local definition.




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 18:23         ` Eric Dumazet
@ 2013-11-06 18:59           ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-06 18:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 10:23:10AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> 
> > My guess was that the whole comment was made in reference to the fact that
> > checksum offload negated all these advantages.  Thats not so true anymore, since
> > infiniband needs csum in software for ipoib.
> > 
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Also please include netdev, I think people there are interested.
> 
Sure, will do in the updated version

> I caught this message, but I usually cannot read lkml traffic.
> 
> I wonder why you do not use (and/or change/tune) prefetch_range()
> instead of a local definition.
> 
I wanted to look into this further, because I wasn't (yet) sure if it was  a bug
or not, but from what I can see x86_64 doesn't define ARCH_HAS_PREFECTH.  That
makes prefetch_range() a nop (I confirmed this via objdump).  It seems like we
should either define ARCH_HAS_PREFETCH on x86_64, or we should remove the
#ifdef from prefetch_range
> 
> 
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 17:19         ` Joe Perches
  2013-11-06 18:11           ` Neil Horman
@ 2013-11-06 20:02           ` Neil Horman
  2013-11-06 20:07             ` Joe Perches
  2013-11-08 19:01           ` Neil Horman
  2 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-11-06 20:02 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > >  > do_csum was identified via perf recently as a hot spot when doing
> > >  > receive on ip over infiniband workloads.  After alot of testing and
> > >  > ideas, we found the best optimization available to us currently is to
> > >  > prefetch the entire data buffer prior to doing the checksum
> []
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Perhaps a reduction in prefetch loop count helps.
> 
> Was capping the amount prefetched and letting the
> hardware prefetch also tested?
> 
> 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> 
> Also pedantry/trivial comments:
> 
> __always_inline instead of inline
> static __always_inline void prefetch_lines(const void *addr, size_t len)
> {
> 	const void *end = addr + len;
> ...
> 
> buff doesn't need a void * cast in prefetch_lines
> 
Actually I take back what I said here, we do need the cast, not for a conversion
from unsigned char * to void *, but rather to discard the const qualifier
without making the compiler complain.

Neil

> Beside the commit message, the comment above prefetch_lines
> also needs updating to remove the "Manual Prefetching" line.
> 
> /*
>  * Do a 64-bit checksum on an arbitrary memory area.
>  * Returns a 32bit checksum.
>  *
>  * This isn't as time critical as it used to be because many NICs
>  * do hardware checksumming these days.
>  * 
>  * Things tried and found to not make it faster:
>  * Manual Prefetching
>  * Unrolling to an 128 bytes inner loop.
>  */
> 
> 
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 20:02           ` Neil Horman
@ 2013-11-06 20:07             ` Joe Perches
  2013-11-08 16:25               ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-11-06 20:07 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
[]
> > __always_inline instead of inline
> > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > {
> > 	const void *end = addr + len;
> > ...
> > 
> > buff doesn't need a void * cast in prefetch_lines
> > 
> Actually I take back what I said here, we do need the cast, not for a conversion
> from unsigned char * to void *, but rather to discard the const qualifier
> without making the compiler complain.

Not if the function is changed to const void *
and end is also const void * as shown.



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
  2013-11-06 15:34     ` Dave Jones
@ 2013-11-06 20:19     ` Andi Kleen
  2013-11-07 21:23       ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: Andi Kleen @ 2013-11-06 20:19 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Neil Horman <nhorman@tuxdriver.com> writes:

> do_csum was identified via perf recently as a hot spot when doing
> receive on ip over infiniband workloads.  After alot of testing and
> ideas, we found the best optimization available to us currently is to
> prefetch the entire data buffer prior to doing the checksum

On what CPU? Most modern CPUs should not have any trouble at all
prefetching a linear access.

Also for large buffers it is unlikely that all the prefetches
are actually executed, there is usually some limit.

As a minimum you would need:
- run it with a range of buffer sizes
- run this on a range of different CPUs and show no major regressions
- describe all of this actually in the description

But I find at least this patch very dubious.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 20:19     ` Andi Kleen
@ 2013-11-07 21:23       ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-07 21:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, netdev

On Wed, Nov 06, 2013 at 12:19:52PM -0800, Andi Kleen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > do_csum was identified via perf recently as a hot spot when doing
> > receive on ip over infiniband workloads.  After alot of testing and
> > ideas, we found the best optimization available to us currently is to
> > prefetch the entire data buffer prior to doing the checksum
> 
> On what CPU? Most modern CPUs should not have any trouble at all
> prefetching a linear access.
> 
> Also for large buffers it is unlikely that all the prefetches
> are actually executed, there is usually some limit.
> 
> As a minimum you would need:
> - run it with a range of buffer sizes
> - run this on a range of different CPUs and show no major regressions
> - describe all of this actually in the description
> 
> But I find at least this patch very dubious.
> 
> -Andi
> 
Well, if you look back in the thread, you can see several tests done with
various forms of prefetching, that show performance improvements, but if you
want them all collected, heres what I have, using the perf bench from patch 1.

As you can see, you're right, on newer hardware theres negligible advantage (but
no regression that I can see).  On older hardware however, we see a definate
improvement (up to 3%).  I'm afraid I don't have a wide variety of hardware
handy at the moment to do any large scale testing on multiple cpu's.  But if you
have them available, please share your results


Regards
Neil



vendor_id       : AuthenticAMD
cpu family      : 16
model           : 8
model name      : AMD Opteron(tm) Processor 4130
stepping        : 0
microcode       : 0x10000da
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 11
initial apicid  : 11
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor
cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate npt lbrv svm_lock
nrip_save pausefilter
bogomips        : 5200.49
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Without prefecth:
length	| Set Sz| iterations	| cycles/byte
1500B	| 64MB  | 1000000       | 1.432338
1500B   | 128MB | 1000000       | 1.426212
1500B   | 256MB | 1000000       | 1.425988
1500B   | 512MB | 1000000       | 1.517873
9000B   | 64MB  | 1000000       | 0.897998
9000B   | 128MB | 1000000       | 0.884120
9000B   | 256MB | 1000000       | 0.881770
9000B   | 512MB | 1000000       | 0.883644
64KB    | 64MB  | 1000000       | 0.813054
64KB    | 128MB | 1000000       | 0.801859
64KB    | 256MB | 1000000       | 0.796415
64KB    | 512MB | 1000000       | 0.793869

With prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B	| 64MB	| 1000000	| 1.442855
1500B	| 128MB	| 1000000	| 1.438841
1500B	| 256MB	| 1000000	| 1.427324
1500B	| 512MB	| 1000000	| 1.462715 
9000B	| 64MB	| 1000000	| 0.894097 
9000B	| 128MB	| 1000000	| 0.884738 
9000B	| 256MB	| 1000000	| 0.881370  
9000B	| 512MB	| 1000000	| 0.884799 
64KB	| 64MB	| 1000000	| 0.813512 
64KB	| 128MB	| 1000000	| 0.801596 
64KB	| 256MB	| 1000000	| 0.795575  
64KB	| 512MB	| 1000000	| 0.793927 


==========================================================================================

vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping        : 7
microcode       : 0x29
cpu MHz         : 2754.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx
lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept
vpid
bogomips        : 6784.46
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual


Without prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 1.343645
1500B   | 128MB | 1000000       | 1.345782
1500B   | 256MB | 1000000       | 1.353145
1500B   | 512MB | 1000000       | 1.354844
9000B   | 64MB  | 1000000       | 0.856552
9000B   | 128MB | 1000000       | 0.852786
9000B   | 256MB | 1000000       | 0.854705
9000B   | 512MB | 1000000       | 0.863308
64KB    | 64MB  | 1000000       | 0.771888
64KB    | 128MB | 1000000       | 0.773453
64KB    | 256MB | 1000000       | 0.771728
64KB    | 512MB | 1000000       | 0.771390

With prefetching:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 1.344733
1500B   | 128MB | 1000000       | 1.342285
1500B   | 256MB | 1000000       | 1.344818
1500B   | 512MB | 1000000       | 1.342632
9000B   | 64MB  | 1000000       | 0.851043
9000B   | 128MB | 1000000       | 0.850629
9000B   | 256MB | 1000000       | 0.852207
9000B   | 512MB | 1000000       | 0.851927
64KB    | 64MB  | 1000000       | 0.768549
64KB    | 128MB | 1000000       | 0.768623
64KB    | 256MB | 1000000       | 0.768938
64KB    | 512MB | 1000000       | 0.768824

==========================================================================================
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 9
model name      : AMD Opteron(tm) Processor 6172
stepping        : 1
microcode       : 0x10000d9
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 1
siblings        : 12
core id         : 5
cpu cores       : 12
apicid          : 43
initial apicid  : 27
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid amd_dcm pni
monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate npt lbrv
svm_lock nrip_save pausefilter
bogomips        : 4189.63
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Without prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 1.415370
1500B   | 128MB | 1000000       | 1.437025
1500B   | 256MB | 1000000       | 1.424822
1500B   | 512MB | 1000000       | 1.442021
9000B   | 64MB  | 1000000       | 0.891699
9000B   | 128MB | 1000000       | 0.884261
9000B   | 256MB | 1000000       | 0.880179
9000B   | 512MB | 1000000       | 0.882190
64KB    | 64MB  | 1000000       | 0.813047
64KB    | 128MB | 1000000       | 0.800755
64KB    | 256MB | 1000000       | 0.795207
64KB    | 512MB | 1000000       | 0.792065

With prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 1.424003
1500B   | 128MB | 1000000       | 1.435567
1500B   | 256MB | 1000000       | 1.446858
1500B   | 512MB | 1000000       | 1.459407
9000B   | 64MB  | 1000000       | 0.899858
9000B   | 128MB | 1000000       | 0.885170
9000B   | 256MB | 1000000       | 0.883936
9000B   | 512MB | 1000000       | 0.886158
64KB    | 64MB  | 1000000       | 0.814136
64KB    | 128MB | 1000000       | 0.802202
64KB    | 256MB | 1000000       | 0.796140
64KB    | 512MB | 1000000       | 0.793792



==========================================================================================
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 10
model name      : AMD Athlon(tm) XP 2800+
stepping        : 0
cpu MHz         : 2079.461
cache size      : 512 KB
fdiv_bug        : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 4158.92
clflush size    : 32
cache_alignment : 32
address sizes   : 34 bits physical, 32 bits virtual
power management: ts

Without prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 3.335217
1500B   | 128MB | 1000000       | 3.403103
1500B   | 256MB | 1000000       | 3.445059
1500B   | 512MB | 1000000       | 3.742008
9000B   | 64MB  | 1000000       | 47.466255
9000B   | 128MB | 1000000       | 47.742751
9000B   | 256MB | 1000000       | 47.965001
9000B   | 512MB | 1000000       | 48.589349
64KB    | 64MB  | 1000000       | 118.088638
64KB    | 128MB | 1000000       | 118.261744
64KB    | 256MB | 1000000       | 118.349641
64KB    | 512MB | 1000000       | 118.695321

With prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 3.231086
1500B   | 128MB | 1000000       | 3.423485
1500B   | 256MB | 1000000       | 3.278899
1500B   | 512MB | 1000000       | 3.545504
9000B   | 64MB  | 1000000       | 46.907795
9000B   | 128MB | 1000000       | 47.321743
9000B   | 256MB | 1000000       | 47.306189
9000B   | 512MB | 1000000       | 48.144320
64KB    | 64MB  | 1000000       | 117.897735
64KB    | 128MB | 1000000       | 118.122266
64KB    | 256MB | 1000000       | 118.126397
64KB    | 512MB | 1000000       | 118.546901



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 20:07             ` Joe Perches
@ 2013-11-08 16:25               ` Neil Horman
  2013-11-08 16:51                 ` Joe Perches
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-11-08 16:25 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> []
> > > __always_inline instead of inline
> > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > {
> > > 	const void *end = addr + len;
> > > ...
> > > 
> > > buff doesn't need a void * cast in prefetch_lines
> > > 
> > Actually I take back what I said here, we do need the cast, not for a conversion
> > from unsigned char * to void *, but rather to discard the const qualifier
> > without making the compiler complain.
> 
> Not if the function is changed to const void *
> and end is also const void * as shown.
> 
Addr is incremented in the for loop, so it can't be const.  I could add a loop
counter variable on the stack, but that doesn't seem like it would help anything

Neil

> 
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 16:25               ` Neil Horman
@ 2013-11-08 16:51                 ` Joe Perches
  2013-11-08 19:07                   ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-11-08 16:51 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
> On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> > On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > []
> > > > __always_inline instead of inline
> > > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > > {
> > > > 	const void *end = addr + len;
> > > > ...
> > > > 
> > > > buff doesn't need a void * cast in prefetch_lines
> > > > 
> > > Actually I take back what I said here, we do need the cast, not for a conversion
> > > from unsigned char * to void *, but rather to discard the const qualifier
> > > without making the compiler complain.
> > 
> > Not if the function is changed to const void *
> > and end is also const void * as shown.
> > 
> Addr is incremented in the for loop, so it can't be const.  I could add a loop
> counter variable on the stack, but that doesn't seem like it would help anything

Perhaps you meant
	void * const addr;
but that's not what I wrote.

Let me know if this doesn't compile.
It does here...
---
 arch/x86/lib/csum-partial_64.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..891194a 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,8 +29,15 @@ static inline unsigned short from32to16(unsigned a)
  * Things tried and found to not make it faster:
  * Manual Prefetching
  * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
  */
+
+static __always_inline void prefetch_lines(const void * addr, size_t len)
+{
+	const void *end = addr + len;
+	for (; addr < end; addr += cache_line_size())
+		asm("prefetch 0(%[buf])\n\t" : : [buf] "r" (addr));
+}
+
 static unsigned do_csum(const unsigned char *buff, unsigned len)
 {
 	unsigned odd, count;
@@ -67,7 +74,9 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			/* main loop using 64byte blocks */
 			zero = 0;
 			count64 = count >> 3;
-			while (count64) { 
+
+			prefetch_lines(buff, min(len, cache_line_size() * 4u));
+			while (count64) {
 				asm("addq 0*8(%[src]),%[res]\n\t"
 				    "adcq 1*8(%[src]),%[res]\n\t"
 				    "adcq 2*8(%[src]),%[res]\n\t"



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 17:19         ` Joe Perches
  2013-11-06 18:11           ` Neil Horman
  2013-11-06 20:02           ` Neil Horman
@ 2013-11-08 19:01           ` Neil Horman
  2013-11-08 19:33             ` Joe Perches
  2 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-11-08 19:01 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > >  > do_csum was identified via perf recently as a hot spot when doing
> > >  > receive on ip over infiniband workloads.  After alot of testing and
> > >  > ideas, we found the best optimization available to us currently is to
> > >  > prefetch the entire data buffer prior to doing the checksum
> []
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Perhaps a reduction in prefetch loop count helps.
> 
> Was capping the amount prefetched and letting the
> hardware prefetch also tested?
> 
> 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> 

Just tested this out:

With limiting:
1500B   | 64MB  | 1000000       | 1.344167
1500B   | 128MB | 1000000       | 1.340970
1500B   | 256MB | 1000000       | 1.353562
1500B   | 512MB | 1000000       | 1.346349
9000B   | 64MB  | 1000000       | 0.852174
9000B   | 128MB | 1000000       | 0.852765
9000B   | 256MB | 1000000       | 0.853153
9000B   | 512MB | 1000000       | 0.852661
64KB    | 64MB  | 1000000       | 0.768585
64KB    | 128MB | 1000000       | 0.769465
64KB    | 256MB | 1000000       | 0.769909
64KB    | 512MB | 1000000       | 0.779895


Without limiting

1500B   | 64MB  | 1000000       | 1.360525
1500B   | 128MB | 1000000       | 1.354220
1500B   | 256MB | 1000000       | 1.371037
1500B   | 512MB | 1000000       | 1.353557
9000B   | 64MB  | 1000000       | 0.850415
9000B   | 128MB | 1000000       | 0.853642
9000B   | 256MB | 1000000       | 0.852048
9000B   | 512MB | 1000000       | 0.852484
64KB    | 64MB  | 1000000       | 0.768261
64KB    | 128MB | 1000000       | 0.768566
64KB    | 256MB | 1000000       | 0.770822
64KB    | 512MB | 1000000       | 0.769391

Doesn't look like much consistent improvement.

Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 16:51                 ` Joe Perches
@ 2013-11-08 19:07                   ` Neil Horman
  2013-11-08 19:17                     ` Joe Perches
  2013-11-08 19:17                     ` H. Peter Anvin
  0 siblings, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-08 19:07 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Nov 08, 2013 at 08:51:07AM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> > > On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > []
> > > > > __always_inline instead of inline
> > > > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > > > {
> > > > > 	const void *end = addr + len;
> > > > > ...
> > > > > 
> > > > > buff doesn't need a void * cast in prefetch_lines
> > > > > 
> > > > Actually I take back what I said here, we do need the cast, not for a conversion
> > > > from unsigned char * to void *, but rather to discard the const qualifier
> > > > without making the compiler complain.
> > > 
> > > Not if the function is changed to const void *
> > > and end is also const void * as shown.
> > > 
> > Addr is incremented in the for loop, so it can't be const.  I could add a loop
> > counter variable on the stack, but that doesn't seem like it would help anything
> 
> Perhaps you meant
> 	void * const addr;
> but that's not what I wrote.
> 
No, I meant smoething like:
static __always_inline void prefetch_lines(const void * addr, size_t len)
{
	const void *tmp = (void *)addr;
	...
	for(;tmp<end; tmp+=cache_line_size())
	...
}

> Let me know if this doesn't compile.
> It does here...
Huh, it does.  But that makes very little sense to me.  by qualifying addr as
const, how is the compiler not throwing a warning in the for loop about us
incrementing that same variable?



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:07                   ` Neil Horman
@ 2013-11-08 19:17                     ` Joe Perches
  2013-11-08 20:08                       ` Neil Horman
  2013-11-08 19:17                     ` H. Peter Anvin
  1 sibling, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-11-08 19:17 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-11-08 at 14:07 -0500, Neil Horman wrote:
> On Fri, Nov 08, 2013 at 08:51:07AM -0800, Joe Perches wrote:
> > On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
> > > On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> > > > On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > []
> > > > > > __always_inline instead of inline
> > > > > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > > > > {
> > > > > > 	const void *end = addr + len;
> > > > > > ...
> > > > > > 
> > > > > > buff doesn't need a void * cast in prefetch_lines
> > > > > > 
> > > > > Actually I take back what I said here, we do need the cast, not for a conversion
> > > > > from unsigned char * to void *, but rather to discard the const qualifier
> > > > > without making the compiler complain.
> > > > 
> > > > Not if the function is changed to const void *
> > > > and end is also const void * as shown.
> > > > 
> > > Addr is incremented in the for loop, so it can't be const.  I could add a loop
> > > counter variable on the stack, but that doesn't seem like it would help anything
> > 
> > Perhaps you meant
> > 	void * const addr;
> > but that's not what I wrote.
> > 
> No, I meant smoething like:
> static __always_inline void prefetch_lines(const void * addr, size_t len)
> {
> 	const void *tmp = (void *)addr;
> 	...
> 	for(;tmp<end; tmp+=cache_line_size())
> 	...
> }
> 
> > Let me know if this doesn't compile.
> > It does here...
> Huh, it does.  But that makes very little sense to me.  by qualifying addr as
> const, how is the compiler not throwing a warning in the for loop about us
> incrementing that same variable?

Because it points to const data but is not const itself.

void * const foo;	/* value of foo can't change */
const void *bar;	/* data pointed to by bar can't change */
const void * const baz; /* Neither baz nor data pointed to by baz can change */




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:07                   ` Neil Horman
  2013-11-08 19:17                     ` Joe Perches
@ 2013-11-08 19:17                     ` H. Peter Anvin
  1 sibling, 0 replies; 132+ messages in thread
From: H. Peter Anvin @ 2013-11-08 19:17 UTC (permalink / raw)
  To: Neil Horman, Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, x86

On 11/08/2013 11:07 AM, Neil Horman wrote:
> On Fri, Nov 08, 2013 at 08:51:07AM -0800, Joe Perches wrote:
>> On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
>>> On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
>>>> On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
>>>>> On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
>>>> []
>>>>>> __always_inline instead of inline
>>>>>> static __always_inline void prefetch_lines(const void *addr, size_t len)
>>>>>> {
>>>>>> 	const void *end = addr + len;
>>>>>> ...
>>>>>>
>>>>>> buff doesn't need a void * cast in prefetch_lines
>>>>>>
>>>>> Actually I take back what I said here, we do need the cast, not for a conversion
>>>>> from unsigned char * to void *, but rather to discard the const qualifier
>>>>> without making the compiler complain.
>>>>
>>>> Not if the function is changed to const void *
>>>> and end is also const void * as shown.
>>>>
>>> Addr is incremented in the for loop, so it can't be const.  I could add a loop
>>> counter variable on the stack, but that doesn't seem like it would help anything
>>
>> Perhaps you meant
>> 	void * const addr;
>> but that's not what I wrote.
>>
> No, I meant smoething like:
> static __always_inline void prefetch_lines(const void * addr, size_t len)
> {
> 	const void *tmp = (void *)addr;
> 	...
> 	for(;tmp<end; tmp+=cache_line_size())
> 	...
> }
> 
>> Let me know if this doesn't compile.
>> It does here...
> Huh, it does.  But that makes very little sense to me.  by qualifying addr as
> const, how is the compiler not throwing a warning in the for loop about us
> incrementing that same variable?
> 

As Joe is pointing out, you are confusing "const foo *tmp" with "foo *
const tmp".  The former means: "tmp is a variable pointing to type const
foo".  The latter means: "tmp is a constant pointing to type foo".

There is no problem modifying tmp in the former case; it prohibits
modifying *tmp.  In the latter case modifying tmp is prohibited, but
modifying *tmp is just fine.

Now, "const char *" would arguably be more correct here since arithmetic
on void is a gcc extension, but the same argument applies there.

	-hpa



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:01           ` Neil Horman
@ 2013-11-08 19:33             ` Joe Perches
  2013-11-08 20:14               ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-11-08 19:33 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > >  > receive on ip over infiniband workloads.  After alot of testing and
> > > >  > ideas, we found the best optimization available to us currently is to
> > > >  > prefetch the entire data buffer prior to doing the checksum
> > []
> > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > comments first.
> > 
> > Perhaps a reduction in prefetch loop count helps.
> > 
> > Was capping the amount prefetched and letting the
> > hardware prefetch also tested?
> > 
> > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > 
> 
> Just tested this out:

Thanks.

Reformatting the table so it's a bit more
readable/comparable for me:

len	SetSz	Loops	cycles/byte
			limited	unlimited
1500B	64MB	1M	1.3442	1.3605
1500B	128MB	1M	1.3410	1.3542
1500B	256MB	1M	1.3536	1.3710
1500B	512MB	1M	1.3463	1.3536
9000B	64MB	1M	0.8522	0.8504
9000B	128MB	1M	0.8528	0.8536
9000B	256MB	1M	0.8532	0.8520
9000B	512MB	1M	0.8527	0.8525
64KB	64MB	1M	0.7686	0.7683
64KB	128MB	1M	0.7695	0.7686
64KB	256MB	1M	0.7699	0.7708
64KB	512MB	1M	0.7799	0.7694

This data appears to show some value
in capping for 1500b lengths and noise
for shorter and longer lengths.

Any idea what the actual distribution of
do_csum lengths is under various loads?



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:17                     ` Joe Perches
@ 2013-11-08 20:08                       ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-08 20:08 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Nov 08, 2013 at 11:17:39AM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 14:07 -0500, Neil Horman wrote:
> > On Fri, Nov 08, 2013 at 08:51:07AM -0800, Joe Perches wrote:
> > > On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> > > > > On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > > > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > > []
> > > > > > > __always_inline instead of inline
> > > > > > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > > > > > {
> > > > > > > 	const void *end = addr + len;
> > > > > > > ...
> > > > > > > 
> > > > > > > buff doesn't need a void * cast in prefetch_lines
> > > > > > > 
> > > > > > Actually I take back what I said here, we do need the cast, not for a conversion
> > > > > > from unsigned char * to void *, but rather to discard the const qualifier
> > > > > > without making the compiler complain.
> > > > > 
> > > > > Not if the function is changed to const void *
> > > > > and end is also const void * as shown.
> > > > > 
> > > > Addr is incremented in the for loop, so it can't be const.  I could add a loop
> > > > counter variable on the stack, but that doesn't seem like it would help anything
> > > 
> > > Perhaps you meant
> > > 	void * const addr;
> > > but that's not what I wrote.
> > > 
> > No, I meant smoething like:
> > static __always_inline void prefetch_lines(const void * addr, size_t len)
> > {
> > 	const void *tmp = (void *)addr;
> > 	...
> > 	for(;tmp<end; tmp+=cache_line_size())
> > 	...
> > }
> > 
> > > Let me know if this doesn't compile.
> > > It does here...
> > Huh, it does.  But that makes very little sense to me.  by qualifying addr as
> > const, how is the compiler not throwing a warning in the for loop about us
> > incrementing that same variable?
> 
> Because it points to const data but is not const itself.
> 
> void * const foo;	/* value of foo can't change */
> const void *bar;	/* data pointed to by bar can't change */
> const void * const baz; /* Neither baz nor data pointed to by baz can change */
> 
Doh!  Wow, that was just staring me in the face and I missed it :)

Thanks for pointing it out.  I'll make that adjustment
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:33             ` Joe Perches
@ 2013-11-08 20:14               ` Neil Horman
  2013-11-08 20:29                 ` Joe Perches
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-11-08 20:14 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > > >  > receive on ip over infiniband workloads.  After alot of testing and
> > > > >  > ideas, we found the best optimization available to us currently is to
> > > > >  > prefetch the entire data buffer prior to doing the checksum
> > > []
> > > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > > comments first.
> > > 
> > > Perhaps a reduction in prefetch loop count helps.
> > > 
> > > Was capping the amount prefetched and letting the
> > > hardware prefetch also tested?
> > > 
> > > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > 
> > 
> > Just tested this out:
> 
> Thanks.
> 
> Reformatting the table so it's a bit more
> readable/comparable for me:
> 
> len	SetSz	Loops	cycles/byte
> 			limited	unlimited
> 1500B	64MB	1M	1.3442	1.3605
> 1500B	128MB	1M	1.3410	1.3542
> 1500B	256MB	1M	1.3536	1.3710
> 1500B	512MB	1M	1.3463	1.3536
> 9000B	64MB	1M	0.8522	0.8504
> 9000B	128MB	1M	0.8528	0.8536
> 9000B	256MB	1M	0.8532	0.8520
> 9000B	512MB	1M	0.8527	0.8525
> 64KB	64MB	1M	0.7686	0.7683
> 64KB	128MB	1M	0.7695	0.7686
> 64KB	256MB	1M	0.7699	0.7708
> 64KB	512MB	1M	0.7799	0.7694
> 
> This data appears to show some value
> in capping for 1500b lengths and noise
> for shorter and longer lengths.
> 
> Any idea what the actual distribution of
> do_csum lengths is under various loads?
> 

I don't have any hard data no, sorry. I chose the above values for length based
on typical mtus for ethernet, jumbo frame ethernet and ipoib (which Doug tells
me commonly has a 64k mtu).  I expect we anecdotally say 1500 bytes is going to
be the most common case.  I'll cap the prefetch at 1500B for now, since it
doesn't seem to hurt or help beyond that
Neil



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 20:14               ` Neil Horman
@ 2013-11-08 20:29                 ` Joe Perches
  2013-11-11 19:40                   ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-11-08 20:29 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-11-08 at 15:14 -0500, Neil Horman wrote:
> On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> > On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > > > >  > receive on ip over infiniband workloads.  After alot of testing and
> > > > > >  > ideas, we found the best optimization available to us currently is to
> > > > > >  > prefetch the entire data buffer prior to doing the checksum
> > > > []
> > > > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > > > comments first.
> > > > 
> > > > Perhaps a reduction in prefetch loop count helps.
> > > > 
> > > > Was capping the amount prefetched and letting the
> > > > hardware prefetch also tested?
> > > > 
> > > > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > > 
> > > 
> > > Just tested this out:
> > 
> > Thanks.
> > 
> > Reformatting the table so it's a bit more
> > readable/comparable for me:
> > 
> > len	SetSz	Loops	cycles/byte
> > 			limited	unlimited
> > 1500B	64MB	1M	1.3442	1.3605
> > 1500B	128MB	1M	1.3410	1.3542
> > 1500B	256MB	1M	1.3536	1.3710
> > 1500B	512MB	1M	1.3463	1.3536
> > 9000B	64MB	1M	0.8522	0.8504
> > 9000B	128MB	1M	0.8528	0.8536
> > 9000B	256MB	1M	0.8532	0.8520
> > 9000B	512MB	1M	0.8527	0.8525
> > 64KB	64MB	1M	0.7686	0.7683
> > 64KB	128MB	1M	0.7695	0.7686
> > 64KB	256MB	1M	0.7699	0.7708
> > 64KB	512MB	1M	0.7799	0.7694
> > 
> > This data appears to show some value
> > in capping for 1500b lengths and noise
> > for shorter and longer lengths.
> > 
> > Any idea what the actual distribution of
> > do_csum lengths is under various loads?
> > 
> I don't have any hard data no, sorry.

I think you should before you implement this.
You might find extremely short lengths.

> I'll cap the prefetch at 1500B for now, since it
> doesn't seem to hurt or help beyond that

The table data has a max prefetch of
8 * boot_cpu_data.x86_cache_alignment so
I believe it's always less than 1500 but
perhaps 4 might be slightly better still.



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 20:29                 ` Joe Perches
@ 2013-11-11 19:40                   ` Neil Horman
  2013-11-11 21:18                     ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-11-11 19:40 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Nov 08, 2013 at 12:29:07PM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 15:14 -0500, Neil Horman wrote:
> > On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> > > On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > > > > >  > receive on ip over infiniband workloads.  After alot of testing and
> > > > > > >  > ideas, we found the best optimization available to us currently is to
> > > > > > >  > prefetch the entire data buffer prior to doing the checksum
> > > > > []
> > > > > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > > > > comments first.
> > > > > 
> > > > > Perhaps a reduction in prefetch loop count helps.
> > > > > 
> > > > > Was capping the amount prefetched and letting the
> > > > > hardware prefetch also tested?
> > > > > 
> > > > > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > > > 
> > > > 
> > > > Just tested this out:
> > > 
> > > Thanks.
> > > 
> > > Reformatting the table so it's a bit more
> > > readable/comparable for me:
> > > 
> > > len	SetSz	Loops	cycles/byte
> > > 			limited	unlimited
> > > 1500B	64MB	1M	1.3442	1.3605
> > > 1500B	128MB	1M	1.3410	1.3542
> > > 1500B	256MB	1M	1.3536	1.3710
> > > 1500B	512MB	1M	1.3463	1.3536
> > > 9000B	64MB	1M	0.8522	0.8504
> > > 9000B	128MB	1M	0.8528	0.8536
> > > 9000B	256MB	1M	0.8532	0.8520
> > > 9000B	512MB	1M	0.8527	0.8525
> > > 64KB	64MB	1M	0.7686	0.7683
> > > 64KB	128MB	1M	0.7695	0.7686
> > > 64KB	256MB	1M	0.7699	0.7708
> > > 64KB	512MB	1M	0.7799	0.7694
> > > 
> > > This data appears to show some value
> > > in capping for 1500b lengths and noise
> > > for shorter and longer lengths.
> > > 
> > > Any idea what the actual distribution of
> > > do_csum lengths is under various loads?
> > > 
> > I don't have any hard data no, sorry.
> 
> I think you should before you implement this.
> You might find extremely short lengths.
> 
> > I'll cap the prefetch at 1500B for now, since it
> > doesn't seem to hurt or help beyond that
> 
> The table data has a max prefetch of
> 8 * boot_cpu_data.x86_cache_alignment so
> I believe it's always less than 1500 but
> perhaps 4 might be slightly better still.
> 


So, you appear to be correct, I reran my test set with different prefetch
ceilings and got the results below.  There are some cases in which there is a
performance gain, but the gain is small, and occurs at different spots depending
on the input buffer size (though most peak gains appear around 2 cache lines).
I'm guessing it takes about 2 prefetches before hardware prefetching catches up,
at which point we're just spending time issuing instructions that get discarded.
Given the small prefetch limit, and the limited gains (which may also change on
different hardware), I think we should probably just drop the prefetch idea
entirely, and perhaps just take the perf patch so that we can revisit this area
when hardware that supports the avx extensions and/or adcx/adox becomes
available.

Ingo, does that seem reasonable to you?
Neil



1 cache line:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.434190
1500B   | 128MB | 1000000       | 1.431216
1500B   | 256MB | 1000000       | 1.430888
1500B   | 512MB | 1000000       | 1.453422
9000B   | 64MB  | 1000000       | 0.892055
9000B   | 128MB | 1000000       | 0.884050
9000B   | 256MB | 1000000       | 0.880551
9000B   | 512MB | 1000000       | 0.883848
64KB    | 64MB  | 1000000       | 0.813187
64KB    | 128MB | 1000000       | 0.801326
64KB    | 256MB | 1000000       | 0.795643
64KB    | 512MB | 1000000       | 0.793400


2 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.430030
1500B   | 128MB | 1000000       | 1.434589
1500B   | 256MB | 1000000       | 1.425430
1500B   | 512MB | 1000000       | 1.451570
9000B   | 64MB  | 1000000       | 0.892369
9000B   | 128MB | 1000000       | 0.885577
9000B   | 256MB | 1000000       | 0.882091
9000B   | 512MB | 1000000       | 0.885201
64KB    | 64MB  | 1000000       | 0.813629
64KB    | 128MB | 1000000       | 0.801377
64KB    | 256MB | 1000000       | 0.795861
64KB    | 512MB | 1000000       | 0.793242

3 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.435048
1500B   | 128MB | 1000000       | 1.427103
1500B   | 256MB | 1000000       | 1.431558
1500B   | 512MB | 1000000       | 1.452250
9000B   | 64MB  | 1000000       | 0.893162
9000B   | 128MB | 1000000       | 0.884488
9000B   | 256MB | 1000000       | 0.881314
9000B   | 512MB | 1000000       | 0.884060
64KB    | 64MB  | 1000000       | 0.813185
64KB    | 128MB | 1000000       | 0.801280
64KB    | 256MB | 1000000       | 0.795554
64KB    | 512MB | 1000000       | 0.793670

4 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.435013
1500B   | 128MB | 1000000       | 1.428434
1500B   | 256MB | 1000000       | 1.430780
1500B   | 512MB | 1000000       | 1.456285
9000B   | 64MB  | 1000000       | 0.894877
9000B   | 128MB | 1000000       | 0.885387
9000B   | 256MB | 1000000       | 0.883293
9000B   | 512MB | 1000000       | 0.886462
64KB    | 64MB  | 1000000       | 0.815036
64KB    | 128MB | 1000000       | 0.801962
64KB    | 256MB | 1000000       | 0.797618
64KB    | 512MB | 1000000       | 0.795138

6 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.439609
1500B   | 128MB | 1000000       | 1.437569
1500B   | 256MB | 1000000       | 1.441776
1500B   | 512MB | 1000000       | 1.455362
9000B   | 64MB  | 1000000       | 0.895242
9000B   | 128MB | 1000000       | 0.886149
9000B   | 256MB | 1000000       | 0.881375
9000B   | 512MB | 1000000       | 0.884610
64KB    | 64MB  | 1000000       | 0.814658
64KB    | 128MB | 1000000       | 0.804124
64KB    | 256MB | 1000000       | 0.798143
64KB    | 512MB | 1000000       | 0.795377

10 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.431512
1500B   | 128MB | 1000000       | 1.431805
1500B   | 256MB | 1000000       | 1.430388
1500B   | 512MB | 1000000       | 1.464370
9000B   | 64MB  | 1000000       | 0.893922
9000B   | 128MB | 1000000       | 0.887852
9000B   | 256MB | 1000000       | 0.882711
9000B   | 512MB | 1000000       | 0.890067
64KB    | 64MB  | 1000000       | 0.814890
64KB    | 128MB | 1000000       | 0.801470
64KB    | 256MB | 1000000       | 0.796658
64KB    | 512MB | 1000000       | 0.794266

20 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.455539
1500B   | 128MB | 1000000       | 1.443117
1500B   | 256MB | 1000000       | 1.436739
1500B   | 512MB | 1000000       | 1.458973
9000B   | 64MB  | 1000000       | 0.898470
9000B   | 128MB | 1000000       | 0.886110
9000B   | 256MB | 1000000       | 0.889549
9000B   | 512MB | 1000000       | 0.886547
64KB    | 64MB  | 1000000       | 0.814665
64KB    | 128MB | 1000000       | 0.803252
64KB    | 256MB | 1000000       | 0.797268
64KB    | 512MB | 1000000       | 0.794830


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-11 19:40                   ` Neil Horman
@ 2013-11-11 21:18                     ` Ingo Molnar
  0 siblings, 0 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-11-11 21:18 UTC (permalink / raw)
  To: Neil Horman
  Cc: Joe Perches, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Ingo, does that seem reasonable to you?

FYI, in the past few days I've been busy due to the merge window, but 
everything I've seen so far in this portion of the thread gave me warm 
fuzzy feelings, so I definitely like the direction.

(More once I get around to looking at the code in detail.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 17:37             ` Neil Horman
  2013-11-01 19:45               ` Joe Perches
@ 2013-11-04  9:47               ` David Laight
  1 sibling, 0 replies; 132+ messages in thread
From: David Laight @ 2013-11-04  9:47 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ben Hutchings, Doug Ledford, Ingo Molnar, Eric Dumazet,
	linux-kernel, netdev

> > I think you need 3 instructions, move a 0, conditionally move a 1
> > then add. I suspect it won't be a win!

Or, with an appropriately unrolled loop, for each word:
	zero %eax, cmove a 1 to %al
	cmove a 1 to %ah
	shift %eax left, cmove a 1 to %al
	cmove a 1 to %ah, add %eax onto somewhere.
However the 2nd instruction stream would have to use a different
register (IIRC 8bit updates depend on the entire register).

> I agree, that sounds interesting, but very cpu dependent.  Thanks for the
> suggestion, Ben, but I think it would be better if we just did the prefetch here
> and re-addressed this area when AVX (or addcx/addox) instructions were available
> for testing on hardware.

I didn't look too closely at the original figures.
With a simple loop you need 4 instructions per iteration (load, adc, inc, branch).
How close to one iteration per clock do you get?
I thought x86 hardware prefetch would load the cache lines for sequential
accesses - so any prefetch instructions are rather pointless.
However reading the value in the previous loop iteration should help.

I've just realised that there is a problem with the loop termination
condition also needing the flags register:-(
I don't remember the 'loop' instruction ever being added to any of the
fast path instruction decodes - so it won't help.

So I suspect the best you'll get is an interleaved sequence of load and adc
with an lea and inc (both to adjust the index) and a bne back to the top.
(the lea wants to be in the middle somewhere).
That might manage 1 clock per word + 1 clock per loop iteration (if the inc
and bne can be 'fused').

	David





^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 20:26                   ` Joe Perches
@ 2013-11-02  2:07                     ` Neil Horman
  0 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-02  2:07 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, Nov 01, 2013 at 01:26:52PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > > 
> > > > I think it would be better if we just did the prefetch here
> > > > and re-addressed this area when AVX (or addcx/addox) instructions were available
> > > > for testing on hardware.
> > > 
> > > Could there be a difference if only a single software
> > > prefetch was done at the beginning of transfer before
> > > the while loop and hardware prefetches did the rest?
> > > 
> > I wouldn't think so.  If hardware was going to do any prefetching based on
> > memory access patterns it will do so regardless of the leading prefetch, and
> > that first prefetch isn't helpful because we still wind up stalling on the adds
> > while its completing
> 
> I imagine one benefit to be helping prevent
> prefetching beyond the actual data required.
> 
> Maybe some hardware optimizes prefetch stride
> better than 5*64.
> 
> I wonder also if using
> 
> 	if (count > some_length)
> 		prefetch
> 	while (...)
> 
> helps small lengths more than the test/jump cost.
> 
We've already done this and it is in fact the best performing.  I'll be posting
that patch along with ingos request to add do_csum to the perf bench code when I
have that done
Best
Neil

> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 19:58                 ` Neil Horman
@ 2013-11-01 20:26                   ` Joe Perches
  2013-11-02  2:07                     ` Neil Horman
  0 siblings, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-11-01 20:26 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > 
> > > I think it would be better if we just did the prefetch here
> > > and re-addressed this area when AVX (or addcx/addox) instructions were available
> > > for testing on hardware.
> > 
> > Could there be a difference if only a single software
> > prefetch was done at the beginning of transfer before
> > the while loop and hardware prefetches did the rest?
> > 
> I wouldn't think so.  If hardware was going to do any prefetching based on
> memory access patterns it will do so regardless of the leading prefetch, and
> that first prefetch isn't helpful because we still wind up stalling on the adds
> while its completing

I imagine one benefit to be helping prevent
prefetching beyond the actual data required.

Maybe some hardware optimizes prefetch stride
better than 5*64.

I wonder also if using

	if (count > some_length)
		prefetch
	while (...)

helps small lengths more than the test/jump cost.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 19:45               ` Joe Perches
@ 2013-11-01 19:58                 ` Neil Horman
  2013-11-01 20:26                   ` Joe Perches
  0 siblings, 1 reply; 132+ messages in thread
From: Neil Horman @ 2013-11-01 19:58 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> 
> > I think it would be better if we just did the prefetch here
> > and re-addressed this area when AVX (or addcx/addox) instructions were available
> > for testing on hardware.
> 
> Could there be a difference if only a single software
> prefetch was done at the beginning of transfer before
> the while loop and hardware prefetches did the rest?
> 

I wouldn't think so.  If hardware was going to do any prefetching based on
memory access patterns it will do so regardless of the leading prefetch, and
that first prefetch isn't helpful because we still wind up stalling on the adds
while its completing
Neil

> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 17:37             ` Neil Horman
@ 2013-11-01 19:45               ` Joe Perches
  2013-11-01 19:58                 ` Neil Horman
  2013-11-04  9:47               ` David Laight
  1 sibling, 1 reply; 132+ messages in thread
From: Joe Perches @ 2013-11-01 19:45 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:

> I think it would be better if we just did the prefetch here
> and re-addressed this area when AVX (or addcx/addox) instructions were available
> for testing on hardware.

Could there be a difference if only a single software
prefetch was done at the beginning of transfer before
the while loop and hardware prefetches did the rest?




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:18           ` David Laight
@ 2013-11-01 17:37             ` Neil Horman
  2013-11-01 19:45               ` Joe Perches
  2013-11-04  9:47               ` David Laight
  0 siblings, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-01 17:37 UTC (permalink / raw)
  To: David Laight
  Cc: Ben Hutchings, Doug Ledford, Ingo Molnar, Eric Dumazet,
	linux-kernel, netdev

On Fri, Nov 01, 2013 at 04:18:50PM -0000, David Laight wrote:
> > How would you suggest replacing the jumps in this case?  I agree it would be
> > faster here, but I'm not sure how I would implement an increment using a single
> > conditional move.
> 
> I think you need 3 instructions, move a 0, conditionally move a 1
> then add. I suspect it won't be a win!
> 
> If you do 'win' it is probably very dependent on how the instructions
> get scheduled onto the execution units - which will probably make
> it very cpu type dependant.
> 
> 	David
> 
I agree, that sounds interesting, but very cpu dependent.  Thanks for the
suggestion, Ben, but I think it would be better if we just did the prefetch here
and re-addressed this area when AVX (or addcx/addox) instructions were available
for testing on hardware.

Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:08         ` Neil Horman
  2013-11-01 16:16           ` Ben Hutchings
@ 2013-11-01 16:18           ` David Laight
  2013-11-01 17:37             ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: David Laight @ 2013-11-01 16:18 UTC (permalink / raw)
  To: Neil Horman, Ben Hutchings
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> How would you suggest replacing the jumps in this case?  I agree it would be
> faster here, but I'm not sure how I would implement an increment using a single
> conditional move.

I think you need 3 instructions, move a 0, conditionally move a 1
then add. I suspect it won't be a win!

If you do 'win' it is probably very dependent on how the instructions
get scheduled onto the execution units - which will probably make
it very cpu type dependant.

	David





^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:08         ` Neil Horman
@ 2013-11-01 16:16           ` Ben Hutchings
  2013-11-01 16:18           ` David Laight
  1 sibling, 0 replies; 132+ messages in thread
From: Ben Hutchings @ 2013-11-01 16:16 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Fri, 2013-11-01 at 12:08 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 03:42:46PM +0000, Ben Hutchings wrote:
> > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> > [...]
> > > It
> > > functions, but unfortunately the performance lost to the completely broken
> > > branch prediction that this inflicts makes it a non starter:
> > [...]
> > 
> > Conditional branches are no good but conditional moves might be worth a shot.
> > 
> > Ben.
> > 
> How would you suggest replacing the jumps in this case?  I agree it would be
> faster here, but I'm not sure how I would implement an increment using a single
> conditional move.

You can't, but it lets you use additional registers as carry flags.
Whether there are enough registers and enough parallelism to cancel out
the extra additions required, I don't know.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 15:42       ` Ben Hutchings
@ 2013-11-01 16:08         ` Neil Horman
  2013-11-01 16:16           ` Ben Hutchings
  2013-11-01 16:18           ` David Laight
  0 siblings, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-11-01 16:08 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Fri, Nov 01, 2013 at 03:42:46PM +0000, Ben Hutchings wrote:
> On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> [...]
> > It
> > functions, but unfortunately the performance lost to the completely broken
> > branch prediction that this inflicts makes it a non starter:
> [...]
> 
> Conditional branches are no good but conditional moves might be worth a shot.
> 
> Ben.
> 
How would you suggest replacing the jumps in this case?  I agree it would be
faster here, but I'm not sure how I would implement an increment using a single
conditional move.
Neil

> -- 
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 18:30     ` Neil Horman
  2013-11-01  9:21       ` Ingo Molnar
@ 2013-11-01 15:42       ` Ben Hutchings
  2013-11-01 16:08         ` Neil Horman
  1 sibling, 1 reply; 132+ messages in thread
From: Ben Hutchings @ 2013-11-01 15:42 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
[...]
> It
> functions, but unfortunately the performance lost to the completely broken
> branch prediction that this inflicts makes it a non starter:
[...]

Conditional branches are no good but conditional moves might be worth a shot.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 18:30     ` Neil Horman
@ 2013-11-01  9:21       ` Ingo Molnar
  2013-11-01 15:42       ` Ben Hutchings
  1 sibling, 0 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-11-01  9:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Eric Dumazet, linux-kernel, netdev, David Laight


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Prefetch and simluated adcx/adox from above:
>  Performance counter stats for './test.sh' (20 runs):
> 
>         35,704,331 L1-dcache-load-misses                                         ( +-  0.07% ) [75.00%]
>                  0 L1-dcache-prefetches                                         [75.00%]
>     19,751,409,264 cycles                    #    0.000 GHz                      ( +-  0.59% ) [75.00%]
>         34,850,056 branch-misses                                                 ( +-  1.29% ) [75.00%]
> 
>        7.768602160 seconds time elapsed                                          ( +-  1.38% )

btw., you might also want to try measuring only the basics:

   -e cycles -e instructions -e branches -e branch-misses

that should give you 100% in the last column and should also allow 
you to double check whether all the PMU counts are correct: is it 
the expected number of instructions, expected number of branches, 
expected number of branch-misses, etc.

Then you can remove branch stats and add just L1-dcache stats - and 
still be 100% covered:

   -e cycles -e instructions -e L1-dcache-loads -e L1-dcache-load-misses

etc.

Just so that you can trust what the PMU tells you. Prefetch counts 
are sometimes off, they might include speculative activities, etc.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
  2013-10-30 14:52     ` Neil Horman
@ 2013-10-31 18:30     ` Neil Horman
  2013-11-01  9:21       ` Ingo Molnar
  2013-11-01 15:42       ` Ben Hutchings
  2 siblings, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-31 18:30 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
> 
> >That does makes sense, but it then begs the question, whats the advantage of
> >having multiple alu's at all?
> 
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
> 
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple alu's at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls.  That would be interesting data, and (I think)
> >support your theory, Doug.  I'll give that a try
> 
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
> 
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
> 
> And I've done all of the above with simple prefetch and smart prefetch.
> 
> 


So, above and beyond this I spent yesterday trying this pattern, something Doug
and I discussed together offline:
 
asm("prefetch 5*64(%[src])\n\t"
    "addq 0*8(%[src]),%[res1]\n\t"
    "jo 2f\n\t"
    "incq %[cry]\n\t"
    "2:addq 1*8(%[src]),%[res2]\n\t"
    "jc 3f\n\t"
    "incq %[cry]\n\t"
    "3:addq 2*8(%[src]),%[res1]\n\t"
    ...

The hope being that by using the add instead instead of the adc instruction, and
alternatively testing the overflow and carry flags, I could break the
serialization on the flags register between subeuqent adds and start doing
things in parallel (creating a poor mans adcx/adox instruction in effect).  It
functions, but unfortunately the performance lost to the completely broken
branch prediction that this inflicts makes it a non starter:


Base Performance:
 Performance counter stats for './test.sh' (20 runs):

        48,143,372 L1-dcache-load-misses                                         ( +-  0.03% ) [74.99%]
                 0 L1-dcache-prefetches                                         [75.00%]
    13,913,339,911 cycles                    #    0.000 GHz                      ( +-  0.06% ) [75.01%]
        28,878,999 branch-misses                                                 ( +-  0.05% ) [75.00%]

       5.367618727 seconds time elapsed                                          ( +-  0.06% )


Prefetch and simluated adcx/adox from above:
 Performance counter stats for './test.sh' (20 runs):

        35,704,331 L1-dcache-load-misses                                         ( +-  0.07% ) [75.00%]
                 0 L1-dcache-prefetches                                         [75.00%]
    19,751,409,264 cycles                    #    0.000 GHz                      ( +-  0.59% ) [75.00%]
        34,850,056 branch-misses                                                 ( +-  1.29% ) [75.00%]

       7.768602160 seconds time elapsed                                          ( +-  1.38% )


With the above instruction changes the prefetching lowers our dcache miss rate
significantly, but greatly raises our branch miss rate, and absolutely kills our
cycle count and run time.

At this point I feel like this is dead in the water.  I apologize for wasting
everyones time.  The best thing to do here would seem to be:

1) Add in some prefetching (from what I've seen a simple prefetch is as
performant as smart prefetching), so we may as well do it exactly as
csum_copy_from_user does it, and save ourselves the extra while loop.

2) Revisit this when the AVX extensions, or the adcx/adox instructions are
available and we can really preform parallel alu ops here.

Does that sound reasonable?
Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
@ 2013-10-30 14:52     ` Neil Horman
  2013-10-31 18:30     ` Neil Horman
  2 siblings, 0 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-30 14:52 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
> 
> >That does makes sense, but it then begs the question, whats the advantage of
> >having multiple alu's at all?
> 
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
> 
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple alu's at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls.  That would be interesting data, and (I think)
> >support your theory, Doug.  I'll give that a try
> 
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
> 
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
> 
> And I've done all of the above with simple prefetch and smart prefetch.
> 
Yup, I just tried the 2 chains doing interleaved access and came up with the
same results for both prefetch cases.

> In all cases, the result is basically that the add method doesn't
> matter much in the grand scheme of things, but the prefetch does,
> and smart prefetch always beat simple prefetch.
> 
> My simple prefetch was to just go into the main while() loop for the
> csum operation and always prefetch 5*64 into the future.
> 
> My smart prefetch looks like this:
> 
> static inline void prefetch_line(unsigned long *cur_line,
>                                  unsigned long *end_line,
>                                  size_t size)
> {
>         size_t fetched = 0;
> 
>         while (*cur_line <= *end_line && fetched < size) {
>                 prefetch((void *)*cur_line);
>                 *cur_line += cache_line_size();
>                 fetched += cache_line_size();
>         }
> }
> 
I've done this too, but I've come up with results that are very close to simple
prefetch.

> I was going to tinker today and tomorrow with this function once I
> get a toolchain that will compile it (I reinstalled all my rhel6
> hosts as f20 and I'm hoping that does the trick, if not I need to do
> more work):
> 
> #define ADCXQ_64                                        \
>         asm("xorq %[res1],%[res1]\n\t"                  \
>             "adcxq 0*8(%[src]),%[res1]\n\t"             \
>             "adoxq 1*8(%[src]),%[res2]\n\t"             \
>             "adcxq 2*8(%[src]),%[res1]\n\t"             \
>             "adoxq 3*8(%[src]),%[res2]\n\t"             \
>             "adcxq 4*8(%[src]),%[res1]\n\t"             \
>             "adoxq 5*8(%[src]),%[res2]\n\t"             \
>             "adcxq 6*8(%[src]),%[res1]\n\t"             \
>             "adoxq 7*8(%[src]),%[res2]\n\t"             \
>             "adcxq %[zero],%[res1]\n\t"                 \
>             "adoxq %[zero],%[res2]\n\t"                 \
>             : [res1] "=r" (result1),                    \
>               [res2] "=r" (result2)                     \
>             : [src] "r" (buff), [zero] "r" (zero),      \
>               "[res1]" (result1), "[res2]" (result2))
> 
I've tried using this method also (HPA suggested it early in the thread, but its
not going to be usefull for awhile.  The compiler supports it already, but
theres not hardware available with support for these instructions yet (at least
not that I have available).

Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
@ 2013-10-30 14:04     ` David Laight
  2013-10-30 14:52     ` Neil Horman
  2013-10-31 18:30     ` Neil Horman
  2 siblings, 0 replies; 132+ messages in thread
From: David Laight @ 2013-10-30 14:04 UTC (permalink / raw)
  To: Doug Ledford, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

...
> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallel nicely.  David, you mentioned you've tried this, how did
> your experiment turn out and what was your method?  I was planning on
> doing regular full size loads into one xmm/ymm register, then using
> pshufd/vshufd to move the data into two different registers, then
> summing into a fourth register, and possible running two of those
> pipelines in parallel.

It was a long time ago, and IIRC the code was just SSE so the
register length just wasn't going to give the required benefit.
I know I wrote the code, but I can't even remember whether I
actually got it working!
With the longer AVX words it might make enough difference.
Of course, this assumes that you have the fpu registers
available. If you have to do a fpu context switch it will
be a lot slower.

About the same time I did manage to an open coded copy
loop to run as fast as 'rep movs' - and without any unrolling
or any prefetch instructions.

Thinking about AVX you should be able to do (without looking up the
actual mnemonics):
	load
	add 32bit chunks to sum
	compare sum with read value (equiv of carry)
	add/subtract compare result (0 or ~0) to a carry-sum register
That is 4 instructions for 256 bits, so you can aim for 4 clocks.
You'd need to check the cpu book to see if any of those can
be scheduled at the same time (if not dependant).
(and also whether there is any result delay - don't think so.)

I'd try running two copies of the above - probably skewed so that
the memory accesses are separated, do the memory read for the
next iteration, and use the 3rd instruction unit for loop control.

	David




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 11:02 ` Neil Horman
  2013-10-30 12:18   ` David Laight
@ 2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
                       ` (2 more replies)
  1 sibling, 3 replies; 132+ messages in thread
From: Doug Ledford @ 2013-10-30 13:35 UTC (permalink / raw)
  To: Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On 10/30/2013 07:02 AM, Neil Horman wrote:

> That does makes sense, but it then begs the question, whats the advantage of
> having multiple alu's at all?

There's lots of ALU operations that don't operate on the flags or other 
entities that can be run in parallel.

> If they're just going to serialize on the
> updating of the condition register, there doesn't seem to be much advantage in
> having multiple alu's at all, especially if a common use case (parallelizing an
> operation on a large linear dataset) resulted in lower performance.
>
> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1
>
> would prevent pipeline stalls.  That would be interesting data, and (I think)
> support your theory, Doug.  I'll give that a try

Just to avoid spending too much time on various combinations, here are 
the methods I've tried:

Original code
2 chains doing interleaved memory accesses
2 chains doing serial memory accesses (as above)
4 chains doing serial memory accesses
4 chains using 32bit values in 64bit registers so you can always use add 
instead of adc and never need the carry flag

And I've done all of the above with simple prefetch and smart prefetch.

In all cases, the result is basically that the add method doesn't matter 
much in the grand scheme of things, but the prefetch does, and smart 
prefetch always beat simple prefetch.

My simple prefetch was to just go into the main while() loop for the 
csum operation and always prefetch 5*64 into the future.

My smart prefetch looks like this:

static inline void prefetch_line(unsigned long *cur_line,
                                  unsigned long *end_line,
                                  size_t size)
{
         size_t fetched = 0;

         while (*cur_line <= *end_line && fetched < size) {
                 prefetch((void *)*cur_line);
                 *cur_line += cache_line_size();
                 fetched += cache_line_size();
         }
}

static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	...
         unsigned long cur_line = (unsigned long)buff & 
~(cache_line_size() - 1);
         unsigned long end_line = ((unsigned long)buff + len) & 
~(cache_line_size() - 1);

	...
         /* Don't bother to prefetch the first line, we'll end up 
stalling on
          * it anyway, but go ahead and start the prefetch on the next 3 */
         cur_line += cache_line_size();
         prefetch_line(&cur_line, &end_line, cache_line_size() * 3);
         odd = 1 & (unsigned long) buff;
         if (unlikely(odd)) {
                 result = *buff << 8;
	...
                 count >>= 1;            /* nr of 32-bit words.. */

                 /* prefetch line #4 ahead of main loop */
                 prefetch_line(&cur_line, &end_line, cache_line_size());

                 if (count) {
		...
                         while (count64) {
                                 /* we are now prefetching line #5 ahead of
                                  * where we are starting, and will stay 5
                                  * ahead throughout the loop, at least 
until
                                  * we get to the end line and then 
we'll stop
                                  * prefetching */
                                 prefetch_line(&cur_line, &end_line, 64);
                                 ADDL_64;
                                 buff += 64;
                                 count64--;
                         }

                         ADDL_64_FINISH;


I was going to tinker today and tomorrow with this function once I get a 
toolchain that will compile it (I reinstalled all my rhel6 hosts as f20 
and I'm hoping that does the trick, if not I need to do more work):

#define ADCXQ_64                                        \
         asm("xorq %[res1],%[res1]\n\t"                  \
             "adcxq 0*8(%[src]),%[res1]\n\t"             \
             "adoxq 1*8(%[src]),%[res2]\n\t"             \
             "adcxq 2*8(%[src]),%[res1]\n\t"             \
             "adoxq 3*8(%[src]),%[res2]\n\t"             \
             "adcxq 4*8(%[src]),%[res1]\n\t"             \
             "adoxq 5*8(%[src]),%[res2]\n\t"             \
             "adcxq 6*8(%[src]),%[res1]\n\t"             \
             "adoxq 7*8(%[src]),%[res2]\n\t"             \
             "adcxq %[zero],%[res1]\n\t"                 \
             "adoxq %[zero],%[res2]\n\t"                 \
             : [res1] "=r" (result1),                    \
               [res2] "=r" (result2)                     \
             : [src] "r" (buff), [zero] "r" (zero),      \
               "[res1]" (result1), "[res2]" (result2))

and then I also wanted to try using both xmm and ymm registers and doing 
64bit adds with 32bit numbers across multiple xmm/ymm registers as that 
should parallel nicely.  David, you mentioned you've tried this, how did 
your experiment turn out and what was your method?  I was planning on 
doing regular full size loads into one xmm/ymm register, then using 
pshufd/vshufd to move the data into two different registers, then 
summing into a fourth register, and possible running two of those 
pipelines in parallel.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 12:18   ` David Laight
@ 2013-10-30 13:22     ` Doug Ledford
  0 siblings, 0 replies; 132+ messages in thread
From: Doug Ledford @ 2013-10-30 13:22 UTC (permalink / raw)
  To: David Laight, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

On 10/30/2013 08:18 AM, David Laight wrote:
>> /me wonders if rearranging the instructions into this order:
>> adcq 0*8(src), res1
>> adcq 1*8(src), res2
>> adcq 2*8(src), res1
>
> Those have to be sequenced.
>
> Using a 64bit lea to add 32bit quantities should avoid the
> dependencies on the flags register.
> However you'd need to get 3 of those active to beat a 64bit adc.
>
> 	David
>
>
>

Already done (well, something similar to what you mention above anyway), 
doesn't help (although doesn't hurt either, even though it doubles the 
number of adds needed to complete the same work).  This is the code I 
tested:

#define ADDL_64                                         \
         asm("xorq  %%r8,%%r8\n\t"                       \
             "xorq  %%r9,%%r9\n\t"                       \
             "xorq  %%r10,%%r10\n\t"                     \
             "xorq  %%r11,%%r11\n\t"                     \
             "movl  0*4(%[src]),%%r8d\n\t"               \
             "movl  1*4(%[src]),%%r9d\n\t"               \
             "movl  2*4(%[src]),%%r10d\n\t"              \
             "movl  3*4(%[src]),%%r11d\n\t"              \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  4*4(%[src]),%%r8d\n\t"               \
             "movl  5*4(%[src]),%%r9d\n\t"               \
             "movl  6*4(%[src]),%%r10d\n\t"              \
             "movl  7*4(%[src]),%%r11d\n\t"              \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  8*4(%[src]),%%r8d\n\t"               \
             "movl  9*4(%[src]),%%r9d\n\t"               \
             "movl  10*4(%[src]),%%r10d\n\t"             \
             "movl  11*4(%[src]),%%r11d\n\t"             \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  12*4(%[src]),%%r8d\n\t"              \
             "movl  13*4(%[src]),%%r9d\n\t"              \
             "movl  14*4(%[src]),%%r10d\n\t"             \
             "movl  15*4(%[src]),%%r11d\n\t"             \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]"                       \
             : [res1] "=r" (result1),                    \
               [res2] "=r" (result2),                    \
               [res3] "=r" (result3),                    \
               [res4] "=r" (result4)                     \
             : [src] "r" (buff),                         \
               "[res1]" (result1), "[res2]" (result2),   \
               "[res3]" (result3), "[res4]" (result4)    \
             : "r8", "r9", "r10", "r11" )


^ permalink raw reply	[flat|nested] 132+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 11:02 ` Neil Horman
@ 2013-10-30 12:18   ` David Laight
  2013-10-30 13:22     ` Doug Ledford
  2013-10-30 13:35   ` Doug Ledford
  1 sibling, 1 reply; 132+ messages in thread
From: David Laight @ 2013-10-30 12:18 UTC (permalink / raw)
  To: Neil Horman, Doug Ledford; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1

Those have to be sequenced.

Using a 64bit lea to add 32bit quantities should avoid the
dependencies on the flags register.
However you'd need to get 3 of those active to beat a 64bit adc.

	David




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30  5:25 Doug Ledford
  2013-10-30 10:27 ` David Laight
@ 2013-10-30 11:02 ` Neil Horman
  2013-10-30 12:18   ` David Laight
  2013-10-30 13:35   ` Doug Ledford
  1 sibling, 2 replies; 132+ messages in thread
From: Neil Horman @ 2013-10-30 11:02 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

On Wed, Oct 30, 2013 at 01:25:39AM -0400, Doug Ledford wrote:
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 3) The run times are proportionally larger, but still indicate that Parallel ALU
> > execution is hurting rather than helping, which is counter-intuitive.  I'm
> > looking into it, but thought you might want to see these results in case
> > something jumped out at you
> 
> So here's my theory about all of this.
> 
> I think that the original observation some years back was a fluke caused by
> either a buggy CPU or a CPU design that is no longer used.
> 
> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
> some CPU in the past had a set of overflow/carry flags per ALU and did some
> sort of magic to make sure that the last state of those flags across multiple
> ALUs that might have been used in parallelizing work were always in the CPU's
> logical EFLAGS register, or the CPU has a buggy microcode that allowed two
> ALUs to operate on data at the same time in situations where they would
> potentially stomp on the carry/overflow flags of the other ALUs operations.
> 
> It's my theory that all modern CPUs have this behavior fixed, probably via a
> microcode update, and so trying to do parallel ALU operations like this simply
> has no effect because the CPU (rightly so) serializes the operations to keep
> them from clobbering the overflow/carry flags of the other ALUs operations.
> 
> My additional theory then is that the reason you see a slowdown from this
> patch is because the attempt to parallelize the ALU operation has caused
> us to write a series of instructions that, once serialized, are non-optimal
> and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
> 3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
> order, and since we aren't getting the parallel operation we want, this
> is the net result of the patch).
> 
> It would explain things anyway.
> 

That does makes sense, but it then begs the question, whats the advantage of
having multiple alu's at all?  If they're just going to serialize on the
updating of the condition register, there doesn't seem to be much advantage in
having multiple alu's at all, especially if a common use case (parallelizing an
operation on a large linear dataset) resulted in lower performance.

/me wonders if rearranging the instructions into this order:
adcq 0*8(src), res1
adcq 1*8(src), res2
adcq 2*8(src), res1

would prevent pipeline stalls.  That would be interesting data, and (I think)
support your theory, Doug.  I'll give that a try

Neil


^ permalink raw reply	[flat|nested] 132+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30  5:25 Doug Ledford
@ 2013-10-30 10:27 ` David Laight
  2013-10-30 11:02 ` Neil Horman
  1 sibling, 0 replies; 132+ messages in thread
From: David Laight @ 2013-10-30 10:27 UTC (permalink / raw)
  To: Doug Ledford, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
> some CPU in the past had a set of overflow/carry flags per ALU and did some
> sort of magic to make sure that the last state of those flags across multiple
> ALUs that might have been used in parallelizing work were always in the CPU's
> logical EFLAGS register, or the CPU has a buggy microcode that allowed two
> ALUs to operate on data at the same time in situations where they would
> potentially stomp on the carry/overflow flags of the other ALUs operations.

IIRC x86 cpu treat the (arithmetic) flags register as a single entity.
So an instruction that only changes some of the flags is dependant
on any previous instruction that changes any flags.
OTOH it the instruction writes all of the flags then it doesn't
have to wait for the earlier instruction to complete.

This is problematic for the ADC chain in the IP checksum.
I did once try to use the SSE instructions to sum 16bit
fields into multiple 32bit registers.

	David




^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-30  5:25 Doug Ledford
  2013-10-30 10:27 ` David Laight
  2013-10-30 11:02 ` Neil Horman
  0 siblings, 2 replies; 132+ messages in thread
From: Doug Ledford @ 2013-10-30  5:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, Doug Ledford, linux-kernel, netdev

* Neil Horman <nhorman@tuxdriver.com> wrote:
> 3) The run times are proportionally larger, but still indicate that Parallel ALU
> execution is hurting rather than helping, which is counter-intuitive.  I'm
> looking into it, but thought you might want to see these results in case
> something jumped out at you

So here's my theory about all of this.

I think that the original observation some years back was a fluke caused by
either a buggy CPU or a CPU design that is no longer used.

The parallel ALU design of this patch seems OK at first glance, but it means
that two parallel operations are both trying to set/clear both the overflow
and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
some CPU in the past had a set of overflow/carry flags per ALU and did some
sort of magic to make sure that the last state of those flags across multiple
ALUs that might have been used in parallelizing work were always in the CPU's
logical EFLAGS register, or the CPU has a buggy microcode that allowed two
ALUs to operate on data at the same time in situations where they would
potentially stomp on the carry/overflow flags of the other ALUs operations.

It's my theory that all modern CPUs have this behavior fixed, probably via a
microcode update, and so trying to do parallel ALU operations like this simply
has no effect because the CPU (rightly so) serializes the operations to keep
them from clobbering the overflow/carry flags of the other ALUs operations.

My additional theory then is that the reason you see a slowdown from this
patch is because the attempt to parallelize the ALU operation has caused
us to write a series of instructions that, once serialized, are non-optimal
and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
order, and since we aren't getting the parallel operation we want, this
is the net result of the patch).

It would explain things anyway.


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 17:02       ` Doug Ledford
@ 2013-10-29  8:38         ` Ingo Molnar
  0 siblings, 0 replies; 132+ messages in thread
From: Ingo Molnar @ 2013-10-29  8:38 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Eric Dumazet, Neil Horman, linux-kernel, H. Peter Anvin,
	Andi Kleen, Sebastien Dugue


* Doug Ledford <dledford@redhat.com> wrote:

> [ Snipped a couple of really nice real-life bandwidth tests. ]

> Some of my preliminary results:
> 
> 1) Regarding the initial claim that changing the code to have two 
> addition chains, allowing the use of two ALUs, doubling 
> performance: I'm just not seeing it.  I have a number of theories 
> about this, but they are dependent on point #2 below:
> 
> 2) Prefetch definitely helped, although how much depends on which 
> of the test setups I was using above.  The biggest gainer was B) 
> the E3-1240 V2 @ 3.40GHz based machines.
> 
> So, my theories about #1 are that, with modern CPUs, it's more our 
> load/store speed that is killing us than the ALU speed.  I tried 
> at least 5 distinctly different ALU algorithms, including one that 
> eliminated the use of the carry chain entirely, and none of them 
> had a noticeable effect.  On the other hand, prefetch always had a 
> noticeable effect.  I suspect the original patch worked and had a 
> performance benefit some time ago due to a quirk on some CPU 
> common back then, but modern CPUs are capable of optimizing the 
> routine well enough that the benefit of the patch is already in 
> our original csum routine due to CPU optimizations. [...]

That definitely sounds plausible.

> [...] Or maybe there is another explanation, but I'm not really 
> looking too hard for it.
> 
> I also tried two different prefetch methods on the theory that 
> memory access cycles are more important than CPU access cycles, 
> and there appears to be a minor benefit to wasting CPU cycles to 
> prevent unnecessary prefetches, even with 65520 as our MTU where a 
> 320 byte excess prefetch at the end of the operation only caused 
> us to load a few % points of extra memory.  I suspect that if I 
> dropped the MTU down to 9K (to mimic jumbo frames on a device 
> without tx/rx checksum offloads), the smart version of prefetch 
> would be a much bigger winner.  The fact that there is any 
> apparent difference at all on such a large copy tells me that 
> prefetch should probably always be smart and never dumb (and here 
> by smart versus dumb I mean prefetch should check to make sure you 
> aren't prefetching beyond the end of data you care about before 
> executing the prefetch instruction).

That looks like an important result and it should matter even more 
to ~1.5k MTU sizes where the prefetch window will be even larger 
relative to the IP packet size.

> What strikes me as important here is that these 8 core Intel CPUs 
> actually got *slower* with the ALU patch + prefetch.  This 
> warrants more investigation to find out if it's the prefetch or 
> the ALU patch that did the damage to the speed.  It's also worth 
> noting that these 8 core CPUs have such high variability that I 
> don't trust these measurements yet.

It might make sense to have a good look at the PMU counts for these 
cases to see what's going on.

Also, once the packet is copied to user-space, we might want to do a 
CLFLUSH on the originating buffer, to zap the cacheline from the CPU 
caches. (This might or might not matter, depending on how good the 
CPU is at keeping its true working set in the cache.)

> > I'm a bit sceptical - I think 'looking 1-2 cachelines in 
> > advance' is something that might work reasonably well on a wide 
> > range of systems, while trying to find a bus capacity/latency 
> > dependent sweet spot would be difficult.
> 
> I think 1-2 cachelines is probably way too short. [...]

The 4-5 cachelines result you seem to be converging on looks very 
plausible to me too.

What I think we should try to avoid is to make the actual window per 
system variable: that would be really hard to tune right.

But the 'don't prefetch past the buffer' "smart prefetch" logic you 
mentioned is system-agnostic and might make sense to introduce.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-26 11:55     ` Ingo Molnar
@ 2013-10-28 17:02       ` Doug Ledford
  2013-10-29  8:38         ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Doug Ledford @ 2013-10-28 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Neil Horman, linux-kernel, H. Peter Anvin,
	Andi Kleen, Sebastien Dugue

[-- Attachment #1: Type: text/plain, Size: 13247 bytes --]

On 10/26/2013 07:55 AM, Ingo Molnar wrote:
>
> * Doug Ledford <dledford@redhat.com> wrote:
>
>>> What I was objecting to strongly here was to measure the _wrong_
>>> thing, i.e. the cache-hot case. The cache-cold case should be
>>> measured in a low noise fashion, so that results are
>>> representative. It's closer to the real usecase than any other
>>> microbenchmark. That will give us a usable speedup figure and
>>> will tell us which technique helped how much and which parameter
>>> should be how large.
>>
>> Cold cache, yes.  Low noise, yes.  But you need DMA traffic at the
>> same time to be truly representative.
>
> Well, but in most usecases network DMA traffic is an order of
> magnitude smaller than system bus capacity. 100 gigabit network
> traffic is possible but not very common.

That's not necessarily true.  For gigabit, that's true.  For something
faster, even just 10GigE, it's not.  At least not when you consider that
network traffic usually involves hitting the bus at least two times, and
up to four times, depending on how it's processed on receive and whether
it goes cold from cache between accesses (once for the DMA from card to
memory, once for csum_partial so we know if the packet was good, and a
third time in copy_to_user so the user application can do something with
it, and possibly a fourth time if the user space application does
something with it).

> So I'd say that _if_ prefetching helps in the typical case we should
> tune it for that - not for the bus-contended case...

Well, I've been running a lot of tests here on various optimizations.
Some have helped, some not so much.  But I haven't been doing
micro-benchmarks like Neil.  I've been focused on running netperf over
IPoIB interfaces.  That should at least mimic real use somewhat and be
likely more indicitave of what the change will do to the system as a
whole than a micro-benchmark will.

I have a number of test systems, and they have a matrix of three
combinations of InfiniBand link speed and PCI-e bus speed that change
the theoretical max for each system.

For the 40GBit/s InfiniBand, the theoretical max throughput is 4GByte/s
(10/8bit wire encoding, not bothering to account for headers and such).

For the 56GBit/s InfiniBand, the theoretical max throughput is ~7GByte/s
(66/64 bit wire encoding).

For the PCI-e gen2 system, the PCI-e theoretical limit is 40GBit/s, for
the PCI-e gen3 systems the PCI-e theoretical limit is 64GBit/s. However,
with a max PCI-e payload of 128 bytes, the PCI-e gen2 bus will
definitely be a bottleneck before the 56GBit/s InfiniBand link.  The
PCI-e gen3 busses are probably right on par with a 56GBit/s InfiniBand
link in terms of max possible throughput.

Here are my test systems:

A - 2 Dell PowerEdge R415 AMD based servers, dual quad core processors
at 2.6GHz, 2MB L2, 5MB L3 cache, 32GB DDR3 1333 RAM, 56GBit/s InfiniBand
link speed on a card in a PCI-e Gen2 slot.  Results of base performance
bandwidth test:

[root@rdma-dev-00 ~]# qperf -t 15 ib0-dev-01 rc_bw rc_bi_bw
rc_bw:
    bw  =  2.93 GB/sec
rc_bi_bw:
    bw  =  5.5 GB/sec


B - 2 HP DL320e Gen8 servers, single Intel quad core Intel(R) Xeon(R)
CPU E3-1240 V2 @ 3.40GHz, 8GB DDR3 1600 RAM, card in PCI-e Gen3 slot
(8GT/s x8 active config).  Results of base performance bandwidth test
(40GBit/s link):

[root@rdma-qe-10 ~]# qperf -t 15 ib1-qe-11 rc_bw rc_bi_bw
rc_bw:
    bw  =  3.55 GB/sec
rc_bi_bw:
    bw  =  6.75 GB/sec


C - 2 HP DL360p Gen8 servers, dual Intel 8-core Intel(R) Xeon(R) CPU
E5-2660 0 @ 2.20GHz, 32GB DDR3 1333 RAM, card in PCI-e Gen3 slot (8GT/s
x8 active config).  Results of base performance bandwidth test (56GBit/s
link):

[root@rdma-perf-00 ~]# qperf -t 15 ib0-perf-01 rc_bw rc_bi_bw
rc_bw:
    bw  =  5.87 GB/sec
rc_bi_bw:
    bw  =  12.3 GB/sec


Some of my preliminary results:

1) Regarding the initial claim that changing the code to have two
addition chains, allowing the use of two ALUs, doubling performance: I'm
just not seeing it.  I have a number of theories about this, but they
are dependent on point #2 below:

2) Prefetch definitely helped, although how much depends on which of the
test setups I was using above.  The biggest gainer was B) the E3-1240 V2
@ 3.40GHz based machines.

So, my theories about #1 are that, with modern CPUs, it's more our
load/store speed that is killing us than the ALU speed.  I tried at
least 5 distinctly different ALU algorithms, including one that
eliminated the use of the carry chain entirely, and none of them had a
noticeable effect.  On the other hand, prefetch always had a noticeable
effect.  I suspect the original patch worked and had a performance
benefit some time ago due to a quirk on some CPU common back then, but
modern CPUs are capable of optimizing the routine well enough that the
benefit of the patch is already in our original csum routine due to CPU
optimizations.  Or maybe there is another explanation, but I'm not
really looking too hard for it.

I also tried two different prefetch methods on the theory that memory
access cycles are more important than CPU access cycles, and there
appears to be a minor benefit to wasting CPU cycles to prevent
unnecessary prefetches, even with 65520 as our MTU where a 320 byte
excess prefetch at the end of the operation only caused us to load a few
% points of extra memory.  I suspect that if I dropped the MTU down to
9K (to mimic jumbo frames on a device without tx/rx checksum offloads),
the smart version of prefetch would be a much bigger winner.  The fact
that there is any apparent difference at all on such a large copy tells
me that prefetch should probably always be smart and never dumb (and
here by smart versus dumb I mean prefetch should check to make sure you
aren't prefetching beyond the end of data you care about before
executing the prefetch instruction).

What I've found probably warrants more experimentation on the optimum
prefetch methods.  I also have another idea on speeding up the ALU
operations that I want to try.  So I'm not ready to send off everything
I have yet (and people wouldn't want that anyway, my collected data set
is megabytes in size).  But just to demonstrate some of what I'm seeing
here (notes: Recv CPU% of 12.5% is one CPU core pegged to 100% usage for
the A and B systems, for the C systems 3.125% is 100% usage for one CPU
core.  Also, although not so apparent on the AMD CPUs, the odd runs are
all with perf record, the even runs are with perf stat, and perf record
causes the odd runs to generally have a lower throughput (and this
effect is *huge* on the Intel 8 core CPUs, fully cutting throughput in
half on those systems)):

For the A systems:
Stock kernel:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1082.47   3.69     12.55    0.266   0.906
  1087.64   3.46     12.52    0.249   0.899
  1104.43   3.52     12.53    0.249   0.886
  1090.37   3.68     12.51    0.264   0.897
  1078.73   3.13     12.56    0.227   0.910
  1091.88   3.63     12.52    0.259   0.896

With ALU patch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1075.01   3.70     12.53    0.269   0.911
  1116.90   3.86     12.53    0.270   0.876
  1073.40   3.67     12.54    0.267   0.913
  1092.79   3.83     12.52    0.274   0.895
  1108.69   2.98     12.56    0.210   0.885
  1116.76   2.66     12.51    0.186   0.875

With ALU patch + 5*64 smart prefetch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1243.05   4.63     12.60    0.291   0.792
  1194.70   5.80     12.58    0.380   0.822
  1149.15   4.09     12.57    0.278   0.854
  1207.21   5.69     12.53    0.368   0.811
  1204.07   4.27     12.57    0.277   0.816
  1191.04   4.78     12.60    0.313   0.826


For the B systems:
Stock kernel:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  2778.98   7.75     12.34    0.218   0.347
  2819.14   7.31     12.52    0.203   0.347
  2721.43   8.43     12.19    0.242   0.350
  2832.93   7.38     12.58    0.203   0.347
  2770.07   8.01     12.27    0.226   0.346
  2829.17   7.27     12.51    0.201   0.345

With ALU patch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  2801.36   8.18     11.97    0.228   0.334
  2927.81   7.52     12.51    0.201   0.334
  2808.32   8.62     11.98    0.240   0.333
  2918.12   7.20     12.54    0.193   0.336
  2730.00   8.85     11.60    0.253   0.332
  2932.17   7.37     12.51    0.196   0.333

With ALU patch + 5*64 smart prefetch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  3029.53   9.34     10.67    0.241   0.275
  3229.36   7.81     11.65    0.189   0.282  <- this is a saturated
                                           40GBit/s InfiniBand link,
                                           and the recv CPU is no longer
                                           pegged at 100%, so the gains
                                           here are higher than just the
                                           throughput gains suggest
  3161.14   8.24     11.10    0.204   0.274
  3171.78   7.80     11.89    0.192   0.293
  3134.01   8.35     10.99    0.208   0.274
  3235.50   7.75     11.57    0.187   0.279  <- ditto here

For the C systems:
Stock kernel:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1091.03   1.59     3.14     0.454   0.900
  2299.34   2.57     3.07     0.350   0.417
  1177.07   1.71     3.15     0.455   0.838
  2312.59   2.54     3.02     0.344   0.408
  1273.94   2.03     3.15     0.499   0.772
  2591.50   2.76     3.19     0.332   0.385

With ALU patch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  Data for this series is missing (these machines were added to
  the matrix late and this kernel had already been rebuilt to
  something else and was no longer installable...I could recreate
  this if people really care).

With ALU patch + 5*64 smart prefetch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1377.03   2.05     3.13     0.466   0.711
  2002.30   2.40     3.04     0.374   0.474
  1470.18   2.25     3.13     0.479   0.666
  1994.96   2.44     3.08     0.382   0.482
  1167.82   1.72     3.14     0.461   0.840
  2004.49   2.46     3.06     0.384   0.477

What strikes me as important here is that these 8 core Intel CPUs
actually got *slower* with the ALU patch + prefetch.  This warrants more
investigation to find out if it's the prefetch or the ALU patch that did
the damage to the speed.  It's also worth noting that these 8 core CPUs
have such high variability that I don't trust these measurements yet.

>>> More importantly, the 'maximally adversarial' case is very hard
>>> to generate, validate, and it's highly system dependent!
>>
>> This I agree with 100%, which is why I tend to think we should
>> scrap the static prefetch optimizations entirely and have a boot
>> up test that allows us to find our optimum prefetch distance for
>> our given hardware.
>
> Would be interesting to see.
>
> I'm a bit sceptical - I think 'looking 1-2 cachelines in advance' is
> something that might work reasonably well on a wide range of
> systems, while trying to find a bus capacity/latency dependent sweet
> spot would be difficult.

I think 1-2 cachelines is probably way too short.  Measuring the length
of time that we stall when accessing memory for the first time and then
comparing that to operation cycles for typical instruction chains would
give us more insight I think.  That or just tinkering with numbers and
seeing where things work best (but not just on static tests, under a
variety of workloads).

> We had pretty bad experience from boot-time measurements, and it's
> not for lack of trying: I implemented the raid algorithm
> benchmarking thing and also the scheduler's boot time cache-size
> probing, both were problematic and have hurt reproducability and
> debuggability.

OK, that's it from me for now, off to run more tests and try more things...


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 901 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 17:54   ` Doug Ledford
@ 2013-10-26 11:55     ` Ingo Molnar
  2013-10-28 17:02       ` Doug Ledford
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-26 11:55 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Eric Dumazet, Neil Horman, linux-kernel


* Doug Ledford <dledford@redhat.com> wrote:

> > What I was objecting to strongly here was to measure the _wrong_ 
> > thing, i.e. the cache-hot case. The cache-cold case should be 
> > measured in a low noise fashion, so that results are 
> > representative. It's closer to the real usecase than any other 
> > microbenchmark. That will give us a usable speedup figure and 
> > will tell us which technique helped how much and which parameter 
> > should be how large.
> 
> Cold cache, yes.  Low noise, yes.  But you need DMA traffic at the 
> same time to be truly representative.

Well, but in most usecases network DMA traffic is an order of 
magnitude smaller than system bus capacity. 100 gigabit network 
traffic is possible but not very common.

So I'd say that _if_ prefetching helps in the typical case we should 
tune it for that - not for the bus-contended case...

> >> [...]  This distance should be far enough out that it can 
> >> withstand other memory pressure, yet not so far as to 
> >> constantly be prefetching, tossing the result out of cache due 
> >> to pressure, then fetching/stalling that same memory on load.  
> >> And it may not benchmark as well on a quiescent system running 
> >> only the micro-benchmark, but it should end up performing 
> >> better in actual real world usage.
> > 
> > The 'fully adversarial' case where all resources are maximally 
> > competed for by all other cores is actually pretty rare in 
> > practice. I don't say it does not happen or that it does not 
> > matter, but I do say there are many other important usecases as 
> > well.
> > 
> > More importantly, the 'maximally adversarial' case is very hard 
> > to generate, validate, and it's highly system dependent!
> 
> This I agree with 100%, which is why I tend to think we should 
> scrap the static prefetch optimizations entirely and have a boot 
> up test that allows us to find our optimum prefetch distance for 
> our given hardware.

Would be interesting to see.

I'm a bit sceptical - I think 'looking 1-2 cachelines in advance' is 
something that might work reasonably well on a wide range of 
systems, while trying to find a bus capacity/latency dependent sweet 
spot would be difficult.

We had pretty bad experience from boot-time measurements, and it's 
not for lack of trying: I implemented the raid algorithm 
benchmarking thing and also the scheduler's boot time cache-size 
probing, both were problematic and have hurt reproducability and 
debuggability.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-19  8:23 ` Ingo Molnar
@ 2013-10-21 17:54   ` Doug Ledford
  2013-10-26 11:55     ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Doug Ledford @ 2013-10-21 17:54 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Eric Dumazet, Neil Horman, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6563 bytes --]

On 10/19/2013 04:23 AM, Ingo Molnar wrote:
> 
> * Doug Ledford <dledford@redhat.com> wrote:
>> All prefetch operations get sent to an access queue in the memory 
>> controller where they compete with both other reads and writes for the 
>> available memory bandwidth.  The optimal prefetch window is not a factor 
>> of memory bandwidth and latency, it's a factor of memory bandwidth, 
>> memory latency, current memory access queue depth at time prefetch is 
>> issued, and memory bank switch time * number of queued memory operations 
>> that will require a bank switch.  In other words, it's much more complex 
>> and also much more fluid than any static optimization can pull out. 
>> [...]
> 
> But this is generally true of _any_ static operation - CPUs are complex, 
> workloads are complex, other threads, CPUs, sockets, devices might 
> interact, etc.
> 
> Yet it does not make it invalid to optimize for the isolated, static 
> usecase that was offered, because 'dynamism' and parallelism in a real 
> system will rarely make that optimization completely invalid, it will 
> typically only diminish its fruits to a certain degree (for example by 
> causing prefetches to be discarded).

So, prefetches are a bit of a special beast in that, if they are done
incorrectly, they can actually make the overall system slower than if we
didn't do anything at all.  If you are talking about anything other than
prefetch I would agree with you.  With prefetch, as much as possible,
you need to mimic the environment for which you are optimizing.  Neil's
test kernel module just called csum_partial on a bunch of memory pages.
 The actual usage pattern of csum_partial though is that it will be used
while, most likely, there is ongoing DMA of data packets across the
network interface.  It is very unlikely that we could care less about
the case of optimizing csum_partial for no network activity since
csum_partial is always going to be the result of network activity and
unlikely to happen in isolation.  As such, my suggestion about a kernel
compile was to create activity across the PCI bus to hard drives,
mimicking network interface DMA traffic.  You could also run a netperf
instance instead.  I just don't agree with optimizing it without
simultaneous DMA traffic as that particular case if likely to be a
rarity, not the norm.

> What I was objecting to strongly here was to measure the _wrong_ thing, 
> i.e. the cache-hot case. The cache-cold case should be measured in a low 
> noise fashion, so that results are representative. It's closer to the real 
> usecase than any other microbenchmark. That will give us a usable speedup 
> figure and will tell us which technique helped how much and which 
> parameter should be how large.

Cold cache, yes.  Low noise, yes.  But you need DMA traffic at the same
time to be truly representative.

>> [...]  So every time I see someone run a series of micro- benchmarks 
>> like you just did, where the system was only doing the micro- benchmark 
>> and not a real workload, and we draw conclusions about optimal prefetch 
>> distances from that test, I cringe inside and I think I even die... just 
>> a little.
> 
> So the thing is, microbenchmarks can indeed be misleading - and as in this 
> case the cache-hot claims can be outright dangerously misleading.

So can non-DMA cases.

> But yet, if done correctly and interpreted correctly they tell us a little 
> bit of the truth and are often correlated to real performance.
> 
> Do microbenchmarks show us everything that a 'real' workload inhibits? Not 
> at all, they are way too simple for that. They are a shortcut, an 
> indicator, which is often helpful as long as not taken as 'the' 
> performance of the system.
> 
>> A better test for this, IMO, would be to start a local kernel compile 
>> with at least twice as many gcc instances allowed as you have CPUs, 
>> *then* run your benchmark kernel module and see what prefetch distance 
>> works well. [...]
> 
> I don't agree that this represents our optimization target. It may 
> represent _one_ optimization target. But many other important usecases 
> such as a dedicated file server, or a computation node that is 
> cache-optimized, would unlikely to show such high parallel memory pressure 
> as a GCC compilation.

But they will *all* show network DMA load, not quiescent DMA load.

>> [...]  This distance should be far enough out that it can withstand 
>> other memory pressure, yet not so far as to constantly be prefetching, 
>> tossing the result out of cache due to pressure, then fetching/stalling 
>> that same memory on load.  And it may not benchmark as well on a 
>> quiescent system running only the micro-benchmark, but it should end up 
>> performing better in actual real world usage.
> 
> The 'fully adversarial' case where all resources are maximally competed 
> for by all other cores is actually pretty rare in practice. I don't say it 
> does not happen or that it does not matter, but I do say there are many 
> other important usecases as well.
> 
> More importantly, the 'maximally adversarial' case is very hard to 
> generate, validate, and it's highly system dependent!

This I agree with 100%, which is why I tend to think we should scrap the
static prefetch optimizations entirely and have a boot up test that
allows us to find our optimum prefetch distance for our given hardware.

> Cache-cold (and cache hot) microbenchmarks on the other hand tend to be 
> more stable, because they typically reflect current physical (mostly 
> latency) limits of CPU and system technology, _not_ highly system 
> dependent resource sizing (mostly bandwidth) limits which are very hard to 
> optimize for in a generic fashion.
> 
> Cache-cold and cache-hot measurements are, in a way, important physical 
> 'eigenvalues' of a complex system. If they both show speedups then it's 
> likely that a more dynamic, contended for, mixed workload will show 
> speedups as well. And these 'eigenvalues' are statistically much more 
> stable across systems, and that's something we care for when we implement 
> various lowlevel assembly routines in arch/x86/ which cover many different 
> systems with different bandwidth characteristics.
> 
> I hope I managed to explain my views clearly enough on this.
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD
	      http://people.redhat.com/dledford



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 901 bytes --]

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 17:42 Doug Ledford
@ 2013-10-19  8:23 ` Ingo Molnar
  2013-10-21 17:54   ` Doug Ledford
  0 siblings, 1 reply; 132+ messages in thread
From: Ingo Molnar @ 2013-10-19  8:23 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Eric Dumazet, Neil Horman, linux-kernel


* Doug Ledford <dledford@redhat.com> wrote:

> >> Based on these, prefetching is obviously a a good improvement, but 
> >> not as good as parallel execution, and the winner by far is doing 
> >> both.
> 
> OK, this is where I have to chime in that these tests can *not* be used 
> to say anything about prefetch, and not just for the reasons Ingo lists 
> in his various emails to this thread.  In fact I would argue that Ingo's 
> methodology on this is wrong as well.

Well, I didn't go into as many details as you - but I agree with your full 
list obviously:

> All prefetch operations get sent to an access queue in the memory 
> controller where they compete with both other reads and writes for the 
> available memory bandwidth.  The optimal prefetch window is not a factor 
> of memory bandwidth and latency, it's a factor of memory bandwidth, 
> memory latency, current memory access queue depth at time prefetch is 
> issued, and memory bank switch time * number of queued memory operations 
> that will require a bank switch.  In other words, it's much more complex 
> and also much more fluid than any static optimization can pull out. 
> [...]

But this is generally true of _any_ static operation - CPUs are complex, 
workloads are complex, other threads, CPUs, sockets, devices might 
interact, etc.

Yet it does not make it invalid to optimize for the isolated, static 
usecase that was offered, because 'dynamism' and parallelism in a real 
system will rarely make that optimization completely invalid, it will 
typically only diminish its fruits to a certain degree (for example by 
causing prefetches to be discarded).

What I was objecting to strongly here was to measure the _wrong_ thing, 
i.e. the cache-hot case. The cache-cold case should be measured in a low 
noise fashion, so that results are representative. It's closer to the real 
usecase than any other microbenchmark. That will give us a usable speedup 
figure and will tell us which technique helped how much and which 
parameter should be how large.

> [...]  So every time I see someone run a series of micro- benchmarks 
> like you just did, where the system was only doing the micro- benchmark 
> and not a real workload, and we draw conclusions about optimal prefetch 
> distances from that test, I cringe inside and I think I even die... just 
> a little.

So the thing is, microbenchmarks can indeed be misleading - and as in this 
case the cache-hot claims can be outright dangerously misleading.

But yet, if done correctly and interpreted correctly they tell us a little 
bit of the truth and are often correlated to real performance.

Do microbenchmarks show us everything that a 'real' workload inhibits? Not 
at all, they are way too simple for that. They are a shortcut, an 
indicator, which is often helpful as long as not taken as 'the' 
performance of the system.

> A better test for this, IMO, would be to start a local kernel compile 
> with at least twice as many gcc instances allowed as you have CPUs, 
> *then* run your benchmark kernel module and see what prefetch distance 
> works well. [...]

I don't agree that this represents our optimization target. It may 
represent _one_ optimization target. But many other important usecases 
such as a dedicated file server, or a computation node that is 
cache-optimized, would unlikely to show such high parallel memory pressure 
as a GCC compilation.

> [...]  This distance should be far enough out that it can withstand 
> other memory pressure, yet not so far as to constantly be prefetching, 
> tossing the result out of cache due to pressure, then fetching/stalling 
> that same memory on load.  And it may not benchmark as well on a 
> quiescent system running only the micro-benchmark, but it should end up 
> performing better in actual real world usage.

The 'fully adversarial' case where all resources are maximally competed 
for by all other cores is actually pretty rare in practice. I don't say it 
does not happen or that it does not matter, but I do say there are many 
other important usecases as well.

More importantly, the 'maximally adversarial' case is very hard to 
generate, validate, and it's highly system dependent!

Cache-cold (and cache hot) microbenchmarks on the other hand tend to be 
more stable, because they typically reflect current physical (mostly 
latency) limits of CPU and system technology, _not_ highly system 
dependent resource sizing (mostly bandwidth) limits which are very hard to 
optimize for in a generic fashion.

Cache-cold and cache-hot measurements are, in a way, important physical 
'eigenvalues' of a complex system. If they both show speedups then it's 
likely that a more dynamic, contended for, mixed workload will show 
speedups as well. And these 'eigenvalues' are statistically much more 
stable across systems, and that's something we care for when we implement 
various lowlevel assembly routines in arch/x86/ which cover many different 
systems with different bandwidth characteristics.

I hope I managed to explain my views clearly enough on this.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-18 17:42 Doug Ledford
  2013-10-19  8:23 ` Ingo Molnar
  0 siblings, 1 reply; 132+ messages in thread
From: Doug Ledford @ 2013-10-18 17:42 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Eric Dumazet, Doug Ledford, Neil Horman, linux-kernel

On 2013-10-17, Ingo wrote:
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
>> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
>> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
>> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>> > > 
>> > > > So, early testing results today.  I wrote a test module that, allocated a 4k
>> > > > buffer, initalized it with random data, and called csum_partial on it 100000
>> > > > times, recording the time at the start and end of that loop.  Results on a 2.4
>> > > > GHz Intel Xeon processor:
>> > > > 
>> > > > Without patch: Average execute time for csum_partial was 808 ns
>> > > > With patch: Average execute time for csum_partial was 438 ns
>> > > 
>> > > Impressive, but could you try again with data out of cache ?
>> > 
>> > So I tried your patch on a GRE tunnel and got following results on a
>> > single TCP flow. (short result : no visible difference)

[ to Eric ]

You didn't show profile data from before and after the patch, only after.  And it
showed csum_partial at 19.9% IIRC.  That's a much better than I get on my test
machines (even though this is on a rhel6.5-beta kernel, understand that the entire
IB stack in rhel6.5-beta is up to a 3.10 level, with parts closer to 3.11+):

For IPoIB in connected mode, where there is no rx csum offload:

::::::::::::::
rhel6.5-beta-cm-no-offload-oprofile-run1
::::::::::::::
CPU: Intel Architectural Perfmon, speed 3392.17 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
Samples on CPU 0
Samples on CPU 4
Samples on CPU 6 (edited out as it was only a few samples and ruined
                  line wrapping)
samples  %        samples  %        image name               symbol name
98588    59.1431  215      57.9515  vmlinux    csum_partial_copy_generic
3003      1.8015  8         2.1563  vmlinux                  tcp_sendmsg
2219      1.3312  0              0  vmlinux            irq_entries_start
2076      1.2454  4         1.0782  vmlinux         avc_has_perm_noaudit
1815      1.0888  0              0  mlx4_ib.ko           mlx4_ib_poll_cq

So, here anyway, it's 60%.  At that level of showing, there is a lot more to be
gained from an improvement to that function.  And here's the measured performance
from those runs:

[root@rdma-master rhel6.5-beta-client]# more rhel6.5-beta-cm-no-offload-netperf.output 
Recv   Send    Send                          Utilization
Socket Socket  Message  Elapsed              Send     Recv
Size   Size    Size     Time     Throughput  local    remote
bytes  bytes   bytes    secs.    MBytes  /s  % S      % S
 87380  16384  16384    20.00      2815.29   7.92     12.80  
 87380  16384  16384    20.00      2798.22   7.88     12.87  
 87380  16384  16384    20.00      2786.74   7.79     12.84

The test machine has 8 logical CPUs, so 12.5% is 100% of a single CPU.  That
said, the receive side is obviously the bottleneck here, and 60% of that
bottleneck is csum_partial.

[ snip a bunch of Neil's measurements ]

>> Based on these, prefetching is obviously a a good improvement, but not 
>> as good as parallel execution, and the winner by far is doing both.

OK, this is where I have to chime in that these tests can *not* be used
to say anything about prefetch, and not just for the reasons Ingo lists
in his various emails to this thread.  In fact I would argue that Ingo's
methodology on this is wrong as well.

All prefetch operations get sent to an access queue in the memory controller
where they compete with both other reads and writes for the available memory
bandwidth.  The optimal prefetch window is not a factor of memory bandwidth
and latency, it's a factor of memory bandwidth, memory latency, current memory
access queue depth at time prefetch is issued, and memory bank switch time *
number of queued memory operations that will require a bank switch.  In other
words, it's much more complex and also much more fluid than any static
optimization can pull out.  So every time I see someone run a series of micro-
benchmarks like you just did, where the system was only doing the micro-
benchmark and not a real workload, and we draw conclusions about optimal
prefetch distances from that test, I cringe inside and I think I even die...
just a little.

A better test for this, IMO, would be to start a local kernel compile with at
least twice as many gcc instances allowed as you have CPUs, *then* run your
benchmark kernel module and see what prefetch distance works well.  This
distance should be far enough out that it can withstand other memory pressure,
yet not so far as to constantly be prefetching, tossing the result out of cache
due to pressure, then fetching/stalling that same memory on load.  And it may
not benchmark as well on a quiescent system running only the micro-benchmark,
but it should end up performing better in actual real world usage.

> Also, it would be nice to see standard deviation noise numbers when two 
> averages are close to each other, to be able to tell whether differences 
> are statistically significant or not.
> 
> For example 'perf stat --repeat' will output stddev for you:
> 
>   comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'
> 
>    Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):
> 
>        0.189084480 seconds time elapsed                                          ( +- 11.95% )

[ snip perf usage tips ]

I ran my original tests with oprofile.  I'll rerun the last one plus some new
tests with the various incarnations of this patch using perf and report the
results back here.

However, the machines I ran these tests on were limited by a 40GBit/s line
speed, with a theoretical max of 4GBytes/s due to bit encoding on the wire,
and I think limited even a bit lower by theoretical limit of useful data
across a PCI-e gen2 x8 bus.  So I wouldn't expect the throughput to go
much higher even if this helps, it should mainly reduce CPU usage.  I can
try the same tests on a 56GBit/s link and with cards that have PCI-e
gen3 and see how those machines do by comparison (the hosts are identical,
just the cards are different).

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-18 15:46 Doug Ledford
  0 siblings, 0 replies; 132+ messages in thread
From: Doug Ledford @ 2013-10-18 15:46 UTC (permalink / raw)
  To: Joe Perches; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel

On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
>> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
>> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
>> > > attached patch brings much better results
>> > > 
>> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
>> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
>> > > Recv   Send    Send                          Utilization       Service Demand
>> > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
>> > > Size   Size    Size     Time     Throughput  local    remote   local   remote
>> > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>> > > 
>> > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
>> > > 
>> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>> > []
>> > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>> > >  			zero = 0;
>> > >  			count64 = count >> 3;
>> > >  			while (count64) { 
>> > > -				asm("addq 0*8(%[src]),%[res]\n\t"
>> > > +				asm("prefetch 5*64(%[src])\n\t"
>> > 
>> > Might the prefetch size be too big here?
>> 
>> To be effective, you need to prefetch well ahead of time.
> 
> No doubt.
> 
>> 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> 
> 5 cachelines for some processors seems like a lot.
> 
> Given you've got a test rig, maybe you could experiment
> with 2 and increase it until it doesn't get better.

You have a fundamental misunderstanding of the prefetch operation.  The 5*64
in the above asm statment does not mean a size, it is an index, with %[src]
as the base pointer.  So it is saying to go to address %[src] + 5*64 and
prefetch there.  The prefetch size itself is always a cache line.  Once the
address is known, whatever cacheline holds that address is the cacheline we
will prefetch.  Your size concerns have no meaning.


^ permalink raw reply	[flat|nested] 132+ messages in thread

end of thread, other threads:[~2013-11-11 21:18 UTC | newest]

Thread overview: 132+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
2013-10-12 17:21 ` Ingo Molnar
2013-10-13 12:53   ` Neil Horman
2013-10-14 20:28   ` Neil Horman
2013-10-14 21:19     ` Eric Dumazet
2013-10-14 22:18       ` Eric Dumazet
2013-10-14 22:37         ` Joe Perches
2013-10-14 22:44           ` Eric Dumazet
2013-10-14 22:49             ` Joe Perches
2013-10-15  7:41               ` Ingo Molnar
2013-10-15 10:51                 ` Borislav Petkov
2013-10-15 12:04                   ` Ingo Molnar
2013-10-15 16:21                 ` Joe Perches
2013-10-16  0:34                   ` Eric Dumazet
2013-10-16  6:25                   ` Ingo Molnar
2013-10-16 16:55                     ` Joe Perches
2013-10-17  0:34         ` Neil Horman
2013-10-17  1:42           ` Eric Dumazet
2013-10-18 16:50             ` Neil Horman
2013-10-18 17:20               ` Eric Dumazet
2013-10-18 20:11                 ` Neil Horman
2013-10-18 21:15                   ` Eric Dumazet
2013-10-20 21:29                     ` Neil Horman
2013-10-21 17:31                       ` Eric Dumazet
2013-10-21 17:46                         ` Neil Horman
2013-10-21 19:21                     ` Neil Horman
2013-10-21 19:44                       ` Eric Dumazet
2013-10-21 20:19                         ` Neil Horman
2013-10-26 12:01                           ` Ingo Molnar
2013-10-26 13:58                             ` Neil Horman
2013-10-27  7:26                               ` Ingo Molnar
2013-10-27 17:05                                 ` Neil Horman
2013-10-17  8:41           ` Ingo Molnar
2013-10-17 18:19             ` H. Peter Anvin
2013-10-17 18:48               ` Eric Dumazet
2013-10-18  6:43               ` Ingo Molnar
2013-10-28 16:01             ` Neil Horman
2013-10-28 16:20               ` Ingo Molnar
2013-10-28 17:49                 ` Neil Horman
2013-10-28 16:24               ` Ingo Molnar
2013-10-28 16:49                 ` David Ahern
2013-10-28 17:46                 ` Neil Horman
2013-10-28 18:29                   ` Neil Horman
2013-10-29  8:25                     ` Ingo Molnar
2013-10-29 11:20                       ` Neil Horman
2013-10-29 11:30                         ` Ingo Molnar
2013-10-29 11:49                           ` Neil Horman
2013-10-29 12:52                             ` Ingo Molnar
2013-10-29 13:07                               ` Neil Horman
2013-10-29 13:11                                 ` Ingo Molnar
2013-10-29 13:20                                   ` Neil Horman
2013-10-29 14:17                                   ` Neil Horman
2013-10-29 14:27                                     ` Ingo Molnar
2013-10-29 20:26                                       ` Neil Horman
2013-10-31 10:22                                         ` Ingo Molnar
2013-10-31 14:33                                           ` Neil Horman
2013-11-01  9:13                                             ` Ingo Molnar
2013-11-01 14:06                                               ` Neil Horman
2013-10-29 14:12                               ` David Ahern
2013-10-15  7:32     ` Ingo Molnar
2013-10-15 13:14       ` Neil Horman
2013-10-12 22:29 ` H. Peter Anvin
2013-10-13 12:53   ` Neil Horman
2013-10-18 16:42   ` Neil Horman
2013-10-18 17:09     ` H. Peter Anvin
2013-10-25 13:06       ` Neil Horman
2013-10-14  4:38 ` Andi Kleen
2013-10-14  7:49   ` Ingo Molnar
2013-10-14 21:07     ` Eric Dumazet
2013-10-15 13:17       ` Neil Horman
2013-10-14 20:25   ` Neil Horman
2013-10-15  7:12     ` Sébastien Dugué
2013-10-15 13:33       ` Andi Kleen
2013-10-15 13:56         ` Sébastien Dugué
2013-10-15 14:06           ` Eric Dumazet
2013-10-15 14:15             ` Sébastien Dugué
2013-10-15 14:26               ` Eric Dumazet
2013-10-15 14:52                 ` Eric Dumazet
2013-10-15 16:02                   ` Andi Kleen
2013-10-16  0:28                     ` Eric Dumazet
2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
2013-11-06 15:23   ` [PATCH v2 1/2] perf: Add csum benchmark tests to perf Neil Horman
2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
2013-11-06 15:34     ` Dave Jones
2013-11-06 15:54       ` Neil Horman
2013-11-06 17:19         ` Joe Perches
2013-11-06 18:11           ` Neil Horman
2013-11-06 20:02           ` Neil Horman
2013-11-06 20:07             ` Joe Perches
2013-11-08 16:25               ` Neil Horman
2013-11-08 16:51                 ` Joe Perches
2013-11-08 19:07                   ` Neil Horman
2013-11-08 19:17                     ` Joe Perches
2013-11-08 20:08                       ` Neil Horman
2013-11-08 19:17                     ` H. Peter Anvin
2013-11-08 19:01           ` Neil Horman
2013-11-08 19:33             ` Joe Perches
2013-11-08 20:14               ` Neil Horman
2013-11-08 20:29                 ` Joe Perches
2013-11-11 19:40                   ` Neil Horman
2013-11-11 21:18                     ` Ingo Molnar
2013-11-06 18:23         ` Eric Dumazet
2013-11-06 18:59           ` Neil Horman
2013-11-06 20:19     ` Andi Kleen
2013-11-07 21:23       ` Neil Horman
2013-10-18 15:46 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
2013-10-18 17:42 Doug Ledford
2013-10-19  8:23 ` Ingo Molnar
2013-10-21 17:54   ` Doug Ledford
2013-10-26 11:55     ` Ingo Molnar
2013-10-28 17:02       ` Doug Ledford
2013-10-29  8:38         ` Ingo Molnar
2013-10-30  5:25 Doug Ledford
2013-10-30 10:27 ` David Laight
2013-10-30 11:02 ` Neil Horman
2013-10-30 12:18   ` David Laight
2013-10-30 13:22     ` Doug Ledford
2013-10-30 13:35   ` Doug Ledford
2013-10-30 14:04     ` David Laight
2013-10-30 14:52     ` Neil Horman
2013-10-31 18:30     ` Neil Horman
2013-11-01  9:21       ` Ingo Molnar
2013-11-01 15:42       ` Ben Hutchings
2013-11-01 16:08         ` Neil Horman
2013-11-01 16:16           ` Ben Hutchings
2013-11-01 16:18           ` David Laight
2013-11-01 17:37             ` Neil Horman
2013-11-01 19:45               ` Joe Perches
2013-11-01 19:58                 ` Neil Horman
2013-11-01 20:26                   ` Joe Perches
2013-11-02  2:07                     ` Neil Horman
2013-11-04  9:47               ` David Laight

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.