* [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-11 16:51 Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
                   ` (3 more replies)
  0 siblings, 4 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-11 16:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Sébastien Dugué reported to me that devices implementing ipoib (which don't have
checksum offload hardware) were spending a significant amount of time computing
checksums.  We found that by splitting the checksum computation into two
separate streams, each skipping successive elements of the buffer being summed,
we could parallelize the checksum operation across multiple ALUs.  Since neither
chain is dependent on the result of the other, we get a speedup in execution (on
hardware that has multiple ALUs available, which is almost ubiquitous on x86),
and only a negligible decrease on hardware that has only a single ALU (an extra
addition is introduced).  Since addition is commutative, the result is the same,
only faster.
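
In plain C the idea looks roughly like this (illustrative sketch only, with a
hypothetical csum_two_chains() helper; the end-around carry that the adcq
instructions fold back in is omitted here for brevity):

	static u64 csum_two_chains(const u64 *p, unsigned int nblocks)
	{
		u64 sum1 = 0, sum2 = 0;	/* two independent dependency chains */

		while (nblocks--) {	/* one 64-byte block per iteration */
			sum1 += p[0] + p[2] + p[4] + p[6];
			sum2 += p[1] + p[3] + p[5] + p[7];
			p += 8;
		}
		/* addition is commutative, so folding the two partial sums
		 * yields the same checksum as a single serial chain
		 */
		return sum1 + sum2;
	}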

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 arch/x86/lib/csum-partial_64.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..2c7bc50 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,11 +29,12 @@ static inline unsigned short from32to16(unsigned a)
  * Things tried and found to not make it faster:
  * Manual Prefetching
  * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
  */
 static unsigned do_csum(const unsigned char *buff, unsigned len)
 {
 	unsigned odd, count;
+	unsigned long result1 = 0;
+	unsigned long result2 = 0;
 	unsigned long result = 0;
 
 	if (unlikely(len == 0))
@@ -68,22 +69,34 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			zero = 0;
 			count64 = count >> 3;
 			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
+				asm("addq 0*8(%[src]),%[res1]\n\t"
+				    "adcq 2*8(%[src]),%[res1]\n\t"
+				    "adcq 4*8(%[src]),%[res1]\n\t"
+				    "adcq 6*8(%[src]),%[res1]\n\t"
+				    "adcq %[zero],%[res1]\n\t"
+
+				    "addq 1*8(%[src]),%[res2]\n\t"
+				    "adcq 3*8(%[src]),%[res2]\n\t"
+				    "adcq 5*8(%[src]),%[res2]\n\t"
+				    "adcq 7*8(%[src]),%[res2]\n\t"
+				    "adcq %[zero],%[res2]"
+				    : [res1] "=r" (result1),
+				      [res2] "=r" (result2)
 				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
+				      "[res1]" (result1), "[res2]" (result2));
 				buff += 64;
 				count64--;
 			}
 
+			asm("addq %[res1],%[res]\n\t"
+			    "adcq %[res2],%[res]\n\t"
+			    "adcq %[zero],%[res]"
+			    : [res] "=r" (result)
+			    : [res1] "r" (result1),
+			      [res2] "r" (result2),
+			      [zero] "r" (zero),
+			      "0" (result));
+
 			/* last up to 7 8byte blocks */
 			count %= 8; 
 			while (count) { 
-- 
1.8.3.1



* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
@ 2013-10-12 17:21 ` Ingo Molnar
  2013-10-13 12:53   ` Neil Horman
  2013-10-14 20:28   ` Neil Horman
  2013-10-12 22:29 ` H. Peter Anvin
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-10-12 17:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Sébastien Dugué reported to me that devices implementing ipoib (which 
> don't have checksum offload hardware were spending a significant amount 
> of time computing checksums.  We found that by splitting the checksum 
> computation into two separate streams, each skipping successive elements 
> of the buffer being summed, we could parallelize the checksum operation 
> accros multiple alus.  Since neither chain is dependent on the result of 
> the other, we get a speedup in execution (on hardware that has multiple 
> alu's available, which is almost ubiquitous on x86), and only a 
> negligible decrease on hardware that has only a single alu (an extra 
> addition is introduced).  Since addition in commutative, the result is 
> the same, only faster

This patch should really come with measurement numbers: what performance 
increase (and drop) did you get on what CPUs.

Thanks,

	Ingo


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
@ 2013-10-12 22:29 ` H. Peter Anvin
  2013-10-13 12:53   ` Neil Horman
  2013-10-18 16:42   ` Neil Horman
  2013-10-14  4:38 ` Andi Kleen
  2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
  3 siblings, 2 replies; 105+ messages in thread
From: H. Peter Anvin @ 2013-10-12 22:29 UTC (permalink / raw)
  To: Neil Horman, linux-kernel
  Cc: sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On 10/11/2013 09:51 AM, Neil Horman wrote:
> Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> checksum offload hardware were spending a significant amount of time computing
> checksums.  We found that by splitting the checksum computation into two
> separate streams, each skipping successive elements of the buffer being summed,
> we could parallelize the checksum operation accros multiple alus.  Since neither
> chain is dependent on the result of the other, we get a speedup in execution (on
> hardware that has multiple alu's available, which is almost ubiquitous on x86),
> and only a negligible decrease on hardware that has only a single alu (an extra
> addition is introduced).  Since addition in commutative, the result is the same,
> only faster

On hardware that implements ADCX/ADOX you should also be able to have
additional streams interleaved, since those instructions allow for
dual carry chains.
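
Roughly along these lines, as an untested sketch only (hypothetical
csum_64b_adx() helper, assumes ADX-capable hardware, not part of the
posted patch):

	/*
	 * Two independent carry chains in one loop body: adcx only
	 * reads/writes CF and adox only reads/writes OF, so the two
	 * accumulations do not serialize on a single carry flag.
	 */
	static inline void csum_64b_adx(const void *buff,
					unsigned long *sum1,
					unsigned long *sum2)
	{
		unsigned long s1 = *sum1, s2 = *sum2;

		asm("xorl %%eax,%%eax\n\t"		/* clear CF and OF */
		    "adcxq 0*8(%[src]),%[res1]\n\t"
		    "adoxq 1*8(%[src]),%[res2]\n\t"
		    "adcxq 2*8(%[src]),%[res1]\n\t"
		    "adoxq 3*8(%[src]),%[res2]\n\t"
		    "adcxq 4*8(%[src]),%[res1]\n\t"
		    "adoxq 5*8(%[src]),%[res2]\n\t"
		    "adcxq 6*8(%[src]),%[res1]\n\t"
		    "adoxq 7*8(%[src]),%[res2]\n\t"
		    "adcxq %[zero],%[res1]\n\t"		/* fold pending CF */
		    "adoxq %[zero],%[res2]"		/* fold pending OF */
		    : [res1] "+r" (s1), [res2] "+r" (s2)
		    : [src] "r" (buff), [zero] "r" (0UL)
		    : "rax", "cc");

		*sum1 = s1;
		*sum2 = s2;
	}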

	-hpa





* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 17:21 ` Ingo Molnar
@ 2013-10-13 12:53   ` Neil Horman
  2013-10-14 20:28   ` Neil Horman
  1 sibling, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-13 12:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant amount 
> > of time computing checksums.  We found that by splitting the checksum 
> > computation into two separate streams, each skipping successive elements 
> > of the buffer being summed, we could parallelize the checksum operation 
> > accros multiple alus.  Since neither chain is dependent on the result of 
> > the other, we get a speedup in execution (on hardware that has multiple 
> > alu's available, which is almost ubiquitous on x86), and only a 
> > negligible decrease on hardware that has only a single alu (an extra 
> > addition is introduced).  Since addition in commutative, the result is 
> > the same, only faster
> 
> This patch should really come with measurement numbers: what performance 
> increase (and drop) did you get on what CPUs.
> 
> Thanks,
> 
Sure, I can gather some stats for you.  I'll post them later this week
Neil

> 	Ingo
> 


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 22:29 ` H. Peter Anvin
@ 2013-10-13 12:53   ` Neil Horman
  2013-10-18 16:42   ` Neil Horman
  1 sibling, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-13 12:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> > checksums.  We found that by splitting the checksum computation into two
> > separate streams, each skipping successive elements of the buffer being summed,
> > we could parallelize the checksum operation accros multiple alus.  Since neither
> > chain is dependent on the result of the other, we get a speedup in execution (on
> > hardware that has multiple alu's available, which is almost ubiquitous on x86),
> > and only a negligible decrease on hardware that has only a single alu (an extra
> > addition is introduced).  Since addition in commutative, the result is the same,
> > only faster
> 
> On hardware that implement ADCX/ADOX then you should also be able to
> have additional streams interleaved since those instructions allow for
> dual carry chains.
> 
Ok, that's a good idea, I'll look into those instructions this week.
Neil

> 	-hpa
> 
> 
> 
> 


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
  2013-10-12 22:29 ` H. Peter Anvin
@ 2013-10-14  4:38 ` Andi Kleen
  2013-10-14  7:49   ` Ingo Molnar
  2013-10-14 20:25   ` Neil Horman
  2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
  3 siblings, 2 replies; 105+ messages in thread
From: Andi Kleen @ 2013-10-14  4:38 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Neil Horman <nhorman@tuxdriver.com> writes:

> Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> checksum offload hardware were spending a significant amount of time computing

Must be an odd workload, most TCP/UDP workloads do copy-checksum
anyways. I would rather investigate why that doesn't work.

That said the change looks reasonable, but may not fix the root cause.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  4:38 ` Andi Kleen
@ 2013-10-14  7:49   ` Ingo Molnar
  2013-10-14 21:07     ` Eric Dumazet
  2013-10-14 20:25   ` Neil Horman
  1 sibling, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-14  7:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Neil Horman, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Andi Kleen <andi@firstfloor.org> wrote:

> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant 
> > amount of time computing
> 
> Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> anyways. I would rather investigate why that doesn't work.

There's a fair amount of csum_partial()-only workloads, a packet does not 
need to hit user-space to be a significant portion of the system's 
workload.

That said, it would indeed be nice to hear which particular code path was 
hit in this case, if nothing else then for education purposes.

Thanks,

	Ingo


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  4:38 ` Andi Kleen
  2013-10-14  7:49   ` Ingo Molnar
@ 2013-10-14 20:25   ` Neil Horman
  2013-10-15  7:12     ` Sébastien Dugué
  1 sibling, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-14 20:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> 
> Must be an odd workload, most TCP/UDP workloads do copy-checksum
> anyways. I would rather investigate why that doesn't work.
> 
FWIW, the reporter was reporting this using an IP over Infiniband network.
Neil

> That said the change looks reasonable, but may not fix the root cause.
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only
> 


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 17:21 ` Ingo Molnar
  2013-10-13 12:53   ` Neil Horman
@ 2013-10-14 20:28   ` Neil Horman
  2013-10-14 21:19     ` Eric Dumazet
  2013-10-15  7:32     ` Ingo Molnar
  1 sibling, 2 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-14 20:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant amount 
> > of time computing checksums.  We found that by splitting the checksum 
> > computation into two separate streams, each skipping successive elements 
> > of the buffer being summed, we could parallelize the checksum operation 
> > accros multiple alus.  Since neither chain is dependent on the result of 
> > the other, we get a speedup in execution (on hardware that has multiple 
> > alu's available, which is almost ubiquitous on x86), and only a 
> > negligible decrease on hardware that has only a single alu (an extra 
> > addition is introduced).  Since addition in commutative, the result is 
> > the same, only faster
> 
> This patch should really come with measurement numbers: what performance 
> increase (and drop) did you get on what CPUs.
> 
> Thanks,
> 
> 	Ingo
> 


So, early testing results today.  I wrote a test module that allocated a 4k
buffer, initialized it with random data, and called csum_partial on it 100000
times, recording the time at the start and end of that loop.  Results on a 2.4
GHz Intel Xeon processor:

Without patch: Average execute time for csum_partial was 808 ns
With patch: Average execute time for csum_partial was 438 ns
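
A minimal sketch of that kind of test module (hypothetical reconstruction,
not the exact code that produced the numbers above):

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/random.h>
#include <linux/ktime.h>
#include <net/checksum.h>

#define CSUM_BENCH_BUF_LEN	4096
#define CSUM_BENCH_LOOPS	100000

static int __init csum_bench_init(void)
{
	void *buf;
	ktime_t start, end;
	__wsum sum = 0;
	int i;

	buf = kmalloc(CSUM_BENCH_BUF_LEN, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	get_random_bytes(buf, CSUM_BENCH_BUF_LEN);

	start = ktime_get();
	for (i = 0; i < CSUM_BENCH_LOOPS; i++)
		sum = csum_partial(buf, CSUM_BENCH_BUF_LEN, 0);
	end = ktime_get();

	pr_info("csum_partial: %lld ns/call (sum=%x)\n",
		ktime_to_ns(ktime_sub(end, start)) / CSUM_BENCH_LOOPS,
		(__force u32)sum);

	kfree(buf);
	return 0;
}

static void __exit csum_bench_exit(void)
{
}

module_init(csum_bench_init);
module_exit(csum_bench_exit);
MODULE_LICENSE("GPL");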


I'm looking into hpa's suggestion to use alternate instructions where available
right now.  I'll have more soon
Neil



* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  7:49   ` Ingo Molnar
@ 2013-10-14 21:07     ` Eric Dumazet
  2013-10-15 13:17       ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-14 21:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Neil Horman <nhorman@tuxdriver.com> writes:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > don't have checksum offload hardware were spending a significant 
> > > amount of time computing
> > 
> > Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> > anyways. I would rather investigate why that doesn't work.
> 
> There's a fair amount of csum_partial()-only workloads, a packet does not 
> need to hit user-space to be a significant portion of the system's 
> workload.
> 
> That said, it would indeed be nice to hear which particular code path was 
> hit in this case, if nothing else then for education purposes.

Many NICs do not provide CHECKSUM_COMPLETE information for encapsulated
frames, meaning we have to fall back to software csum to validate
TCP frames once the tunnel header is pulled.

So to reproduce the issue, all you need is to set up a GRE tunnel between
two hosts and use any TCP stream workload.

Then receiver profile looks like :

11.45%	[kernel]	 [k] csum_partial
 3.08%	[kernel]	 [k] _raw_spin_lock
 3.04%	[kernel]	 [k] intel_idle
 2.73%	[kernel]	 [k] ipt_do_table
 2.57%	[kernel]	 [k] __netif_receive_skb_core
 2.15%	[kernel]	 [k] copy_user_generic_string
 2.05%	[kernel]	 [k] __hrtimer_start_range_ns
 1.42%	[kernel]	 [k] ip_rcv
 1.39%	[kernel]	 [k] kmem_cache_free
 1.36%	[kernel]	 [k] _raw_spin_unlock_irqrestore
 1.24%	[kernel]	 [k] __schedule
 1.13%	[bnx2x] 	 [k] bnx2x_rx_int
 1.12%	[bnx2x] 	 [k] bnx2x_start_xmit
 1.11%	[kernel]	 [k] fib_table_lookup
 0.99%	[ip_tunnel]  [k] ip_tunnel_lookup
 0.91%	[ip_tunnel]  [k] ip_tunnel_rcv
 0.90%	[kernel]	 [k] check_leaf.isra.7
 0.89%	[kernel]	 [k] nf_iterate




* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:28   ` Neil Horman
@ 2013-10-14 21:19     ` Eric Dumazet
  2013-10-14 22:18       ` Eric Dumazet
  2013-10-15  7:32     ` Ingo Molnar
  1 sibling, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-14 21:19 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:

> So, early testing results today.  I wrote a test module that, allocated a 4k
> buffer, initalized it with random data, and called csum_partial on it 100000
> times, recording the time at the start and end of that loop.  Results on a 2.4
> GHz Intel Xeon processor:
> 
> Without patch: Average execute time for csum_partial was 808 ns
> With patch: Average execute time for csum_partial was 438 ns

Impressive, but could you try again with data out of cache?





* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 21:19     ` Eric Dumazet
@ 2013-10-14 22:18       ` Eric Dumazet
  2013-10-14 22:37         ` Joe Perches
  2013-10-17  0:34         ` Neil Horman
  0 siblings, 2 replies; 105+ messages in thread
From: Eric Dumazet @ 2013-10-14 22:18 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> 
> > So, early testing results today.  I wrote a test module that, allocated a 4k
> > buffer, initalized it with random data, and called csum_partial on it 100000
> > times, recording the time at the start and end of that loop.  Results on a 2.4
> > GHz Intel Xeon processor:
> > 
> > Without patch: Average execute time for csum_partial was 808 ns
> > With patch: Average execute time for csum_partial was 438 ns
> 
> Impressive, but could you try again with data out of cache ?

So I tried your patch on a GRE tunnel and got the following results on a
single TCP flow. (short result: no visible difference)


Using a prefetch of 5*64(%[src]) helps more (see at the end)

cpus : model name : Intel Xeon(R) CPU X5660 @ 2.80GHz


Before patch :

lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7651.61   2.51     5.45     0.645   1.399  


After patch :

lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7239.78   2.09     5.19     0.569   1.408  

Profile on receiver

   PerfTop:    1358 irqs/sec  kernel:98.5%  exact:  0.0% [1000Hz cycles],  (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------

    19.99%  [kernel]     [k] csum_partial                
     7.04%  [kernel]     [k] copy_user_generic_string    
     4.92%  [bnx2x]      [k] bnx2x_rx_int                
     3.50%  [kernel]     [k] ipt_do_table                
     2.86%  [kernel]     [k] __netif_receive_skb_core    
     2.35%  [kernel]     [k] fib_table_lookup            
     2.19%  [kernel]     [k] netif_receive_skb           
     1.87%  [kernel]     [k] intel_idle                  
     1.65%  [kernel]     [k] kmem_cache_alloc            
     1.64%  [kernel]     [k] ip_rcv                      
     1.51%  [kernel]     [k] kmem_cache_free             


And attached patch brings much better results

lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..f0e10fc 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			zero = 0;
 			count64 = count >> 3;
 			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
+				asm("prefetch 5*64(%[src])\n\t"
+				    "addq 0*8(%[src]),%[res]\n\t"
 				    "adcq 1*8(%[src]),%[res]\n\t"
 				    "adcq 2*8(%[src]),%[res]\n\t"
 				    "adcq 3*8(%[src]),%[res]\n\t"




* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:18       ` Eric Dumazet
@ 2013-10-14 22:37         ` Joe Perches
  2013-10-14 22:44           ` Eric Dumazet
  2013-10-17  0:34         ` Neil Horman
  1 sibling, 1 reply; 105+ messages in thread
From: Joe Perches @ 2013-10-14 22:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> attached patch brings much better results
> 
> lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> 
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
[]
> @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>  			zero = 0;
>  			count64 = count >> 3;
>  			while (count64) { 
> -				asm("addq 0*8(%[src]),%[res]\n\t"
> +				asm("prefetch 5*64(%[src])\n\t"

Might the prefetch size be too big here?

0x140 is pretty big and is always multiple cachelines no?

> +				    "addq 0*8(%[src]),%[res]\n\t"
>  				    "adcq 1*8(%[src]),%[res]\n\t"
>  				    "adcq 2*8(%[src]),%[res]\n\t"
>  				    "adcq 3*8(%[src]),%[res]\n\t"





* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:37         ` Joe Perches
@ 2013-10-14 22:44           ` Eric Dumazet
  2013-10-14 22:49             ` Joe Perches
  0 siblings, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-14 22:44 UTC (permalink / raw)
  To: Joe Perches
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > attached patch brings much better results
> > 
> > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > Recv   Send    Send                          Utilization       Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > 
> >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > 
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> []
> > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> >  			zero = 0;
> >  			count64 = count >> 3;
> >  			while (count64) { 
> > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > +				asm("prefetch 5*64(%[src])\n\t"
> 
> Might the prefetch size be too big here?

To be effective, you need to prefetch well ahead of time.

5*64 seems common practice (check arch/x86/lib/copy_page_64.S)





* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:44           ` Eric Dumazet
@ 2013-10-14 22:49             ` Joe Perches
  2013-10-15  7:41               ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Joe Perches @ 2013-10-14 22:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > attached patch brings much better results
> > > 
> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > Recv   Send    Send                          Utilization       Service Demand
> > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > 
> > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > 
> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > []
> > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > >  			zero = 0;
> > >  			count64 = count >> 3;
> > >  			while (count64) { 
> > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > +				asm("prefetch 5*64(%[src])\n\t"
> > 
> > Might the prefetch size be too big here?
> 
> To be effective, you need to prefetch well ahead of time.

No doubt.

> 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)

5 cachelines for some processors seems like a lot.

Given you've got a test rig, maybe you could experiment
with 2 and increase it until it doesn't get better.
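
For instance, something like the following userspace loop (hypothetical, plain
C with __builtin_prefetch standing in for the kernel's prefetch; just a quick
way to compare strides on a given CPU before touching csum-partial_64.c):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_LEN	(16 * 1024 * 1024)
#define LOOPS	64

/* Sum 64-bit words, prefetching 'lines' cache lines ahead once per
 * 64-byte block; the plain add stands in for the adcq chain in do_csum().
 */
static uint64_t sum_with_prefetch(const uint64_t *p, size_t words, int lines)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < words; i++) {
		if ((i % 8) == 0)
			__builtin_prefetch((const char *)(p + i) + lines * 64);
		sum += p[i];
	}
	return sum;
}

int main(void)
{
	uint64_t *buf = malloc(BUF_LEN);
	struct timespec t1, t2;
	volatile uint64_t sink;
	int lines, l;

	if (!buf)
		return 1;
	memset(buf, 0xa5, BUF_LEN);

	for (lines = 2; lines <= 8; lines++) {
		clock_gettime(CLOCK_MONOTONIC, &t1);
		for (l = 0; l < LOOPS; l++)
			sink = sum_with_prefetch(buf, BUF_LEN / 8, lines);
		clock_gettime(CLOCK_MONOTONIC, &t2);
		printf("%d lines ahead: %.2f ms/pass\n", lines,
		       ((t2.tv_sec - t1.tv_sec) * 1e9 +
			(t2.tv_nsec - t1.tv_nsec)) / LOOPS / 1e6);
	}
	(void)sink;
	free(buf);
	return 0;
}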



* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:25   ` Neil Horman
@ 2013-10-15  7:12     ` Sébastien Dugué
  2013-10-15 13:33       ` Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Sébastien Dugué @ 2013-10-15  7:12 UTC (permalink / raw)
  To: Neil Horman
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


  Hi Neil, Andi,

On Mon, 14 Oct 2013 16:25:28 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:

> On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> > Neil Horman <nhorman@tuxdriver.com> writes:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > > checksum offload hardware were spending a significant amount of time computing
> > 
> > Must be an odd workload, most TCP/UDP workloads do copy-checksum
> > anyways. I would rather investigate why that doesn't work.
> > 
> FWIW, the reporter was reporting this using an IP over Infiniband network.
> Neil

  indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
where one cannot benefit from hardware offloads.

  For a bit of background on the issue:

  It all started nearly 3 years ago when trying to understand why IPoIB BW was
so low in our setups and why ksoftirqd used 100% of one CPU. A kernel profile
trace showed that the CPU spent most of its time in checksum computation (from
the only old trace I managed to unearth):

  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                              1730    629976998 us     364148.5 us
  csum_partial                      10813465    20944414 us     1.936 us
  mwait_idle_with_hints                 1451    9858861 us     6794.529 us
  get_page_from_freelist            10110434    8120524 us     0.803 us
  alloc_pages_current               10093675    5180650 us     0.513 us
  __phys_addr                       35554783    4471387 us     0.125 us
  zone_statistics                   10110434    4360871 us     0.431 us
  ipoib_cm_alloc_rx_skb               673899    4343949 us     6.445 us

  After having recoded the checksum to use 2 ALUs, csum_partial() disappeared
from the tracer radar. IPoIB BW went from ~12Gb/s to ~20Gb/s and the ksoftirqd
load dropped drastically. Sorry, I could not manage to locate my old traces and
results; those seem to have been lost in the mists of time.

  I did some micro-benchmarking (dirty hack code below) of different solutions.
It looks like processing 128-byte blocks in 4 chains gives the best performance,
but there are plenty of other possibilities.

  FWIW, this code has been running as is at our customers' sites for 3 years now.

  Sébastien.

> 
> > That said the change looks reasonable, but may not fix the root cause.
> > 
> > -Andi
> > 
> > -- 
> > ak@linux.intel.com -- Speaking for myself only
> > 

8<----------------------------------------------------------------------


/*
 * gcc -Wall -O3 -o csum_test csum_test.c -lrt
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#include <errno.h>

#define __force
#define unlikely(x)	(x)

typedef uint32_t u32;
typedef uint16_t u16;

typedef u16 __sum16;
typedef u32 __wsum;

#define NUM_LOOPS	100000
#define BUF_LEN		65536
unsigned char buf[BUF_LEN];


/*
 * csum_fold - Fold and invert a 32bit checksum.
 * sum: 32bit unfolded sum
 *
 * Fold a 32bit running checksum to 16bit and invert it. This is usually
 * the last step before putting a checksum into a packet.
 * Make sure not to mix with 64bit checksums.
 */
static inline __sum16 csum_fold(__wsum sum)
{
	asm("  addl %1,%0\n"
	    "  adcl $0xffff,%0"
	    : "=r" (sum)
	    : "r" ((__force u32)sum << 16),
	      "0" ((__force u32)sum & 0xffff0000));
	return (__force __sum16)(~(__force u32)sum >> 16);
}

static inline unsigned short from32to16(unsigned a)
{
	unsigned short b = a >> 16;
	asm("addw %w2,%w0\n\t"
	    "adcw $0,%w0\n"
	    : "=r" (b)
	    : "0" (b), "r" (a));
	return b;
}

static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
	asm("addl %2,%0\n\t"
	    "adcl $0,%0"
	    : "=r" (a)
	    : "0" (a), "r" (b));
	return a;
}

/*
 * Do a 64-bit checksum on an arbitrary memory area.
 * Returns a 32bit checksum.
 *
 * This isn't as time critical as it used to be because many NICs
 * do hardware checksumming these days.
 *
 * Things tried and found to not make it faster:
 * Manual Prefetching
 * Unrolling to an 128 bytes inner loop.
 * Using interleaving with more registers to break the carry chains.
 */
static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res]\n\t"
				    "adcq 1*8(%[src]),%[res]\n\t"
				    "adcq 2*8(%[src]),%[res]\n\t"
				    "adcq 3*8(%[src]),%[res]\n\t"
				    "adcq 4*8(%[src]),%[res]\n\t"
				    "adcq 5*8(%[src]),%[res]\n\t"
				    "adcq 6*8(%[src]),%[res]\n\t"
				    "adcq 7*8(%[src]),%[res]\n\t"
				    "adcq %[zero],%[res]"
				    : [res] "=r" (result)
				    : [src] "r" (buff), [zero] "r" (zero),
				    "[res]" (result));
				buff += 64;
				count64--;
			}
			/* printf("csum %lx\n", result); */

			/* last up to 7 8-byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

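/* Variant of do_csum(): 64-byte blocks summed in 2 interleaved chains ("64B Split2" in main()) */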
static unsigned do_csum1(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res1]\n\t"
				    "adcq 2*8(%[src]),%[res1]\n\t"
				    "adcq 4*8(%[src]),%[res1]\n\t"
				    "adcq 6*8(%[src]),%[res1]\n\t"
				    "adcq %[zero],%[res1]\n\t"

				    "addq 1*8(%[src]),%[res2]\n\t"
				    "adcq 3*8(%[src]),%[res2]\n\t"
				    "adcq 5*8(%[src]),%[res2]\n\t"
				    "adcq 7*8(%[src]),%[res2]\n\t"
				    "adcq %[zero],%[res2]"
				    : [res1] "=r" (result1),
				      [res2] "=r" (result2)
				    : [src] "r" (buff), [zero] "r" (zero),
				      "[res1]" (result1), "[res2]" (result2));
				buff += 64;
				count64--;
			}

			asm("addq %[res1],%[res]\n\t"
			    "adcq %[res2],%[res]\n\t"
			    "adcq %[zero],%[res]"
			    : [res] "=r" (result)
			    : [res1] "r" (result1),
			      [res2] "r" (result2),
			      [zero] "r" (zero),
			      "0" (result));

			/* last up to 7 8-byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

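/* Variant of do_csum(): 128-byte blocks summed in 4 interleaved chains ("128B Split4" in main()) */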
static unsigned do_csum2(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result3 = 0;
	unsigned long result4 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;

	odd = 1 & (unsigned long) buff;

	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}

	count = len >> 1;		/* nr of 16-bit words.. */

	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}

		count >>= 1;		/* nr of 32-bit words.. */

		if (count) {

			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}

			count >>= 1;	/* nr of 64-bit words.. */

			if (count) {
				unsigned long zero = 0;
				unsigned count128;

				if (8 & (unsigned long) buff) {
					asm("addq %1,%0\n\t"
					    "adcq %2,%0\n"
					    : "=r" (result)
					    : "m" (*(unsigned long *)buff),
					      "r" (zero),  "0" (result));
					count--;
					buff += 8;
				}

				/* main loop using 128 byte blocks */
				count128 = count >> 4;

				while (count128) {
					asm("addq 0*8(%[src]),%[res1]\n\t"
					    "adcq 4*8(%[src]),%[res1]\n\t"
					    "adcq 8*8(%[src]),%[res1]\n\t"
					    "adcq 12*8(%[src]),%[res1]\n\t"
					    "adcq %[zero],%[res1]\n\t"

					    "addq 1*8(%[src]),%[res2]\n\t"
					    "adcq 5*8(%[src]),%[res2]\n\t"
					    "adcq 9*8(%[src]),%[res2]\n\t"
					    "adcq 13*8(%[src]),%[res2]\n\t"
					    "adcq %[zero],%[res2]\n\t"

					    "addq 2*8(%[src]),%[res3]\n\t"
					    "adcq 6*8(%[src]),%[res3]\n\t"
					    "adcq 10*8(%[src]),%[res3]\n\t"
					    "adcq 14*8(%[src]),%[res3]\n\t"
					    "adcq %[zero],%[res3]\n\t"

					    "addq 3*8(%[src]),%[res4]\n\t"
					    "adcq 7*8(%[src]),%[res4]\n\t"
					    "adcq 11*8(%[src]),%[res4]\n\t"
					    "adcq 15*8(%[src]),%[res4]\n\t"
					    "adcq %[zero],%[res4]"

					    : [res1] "=r" (result1),
					      [res2] "=r" (result2),
					      [res3] "=r" (result3),
					      [res4] "=r" (result4)

					    : [src] "r" (buff),
					      [zero] "r" (zero),
					      "[res1]" (result1),
					      "[res2]" (result2),
					      "[res3]" (result3),
					      "[res4]" (result4));
					buff += 128;
					count128--;
				}

				asm("addq %[res1],%[res]\n\t"
				    "adcq %[res2],%[res]\n\t"
				    "adcq %[res3],%[res]\n\t"
				    "adcq %[res4],%[res]\n\t"
				    "adcq %[zero],%[res]"
				    : [res] "=r" (result)
				    : [res1] "r" (result1),
				      [res2] "r" (result2),
				      [res3] "r" (result3),
				      [res4] "r" (result4),
				      [zero] "r" (zero),
				      "0" (result));

				/* last up to 15 8-byte blocks */
				count %= 16;
				while (count) {
					asm("addq %1,%0\n\t"
					    "adcq %2,%0\n"
					    : "=r" (result)
					    : "m" (*(unsigned long *)buff),
					      "r" (zero),  "0" (result));
					--count;
					buff += 8;
				}
				result = add32_with_carry(result>>32,
							  result&0xffffffff);

				if (len & 8) {
					asm("addq %1,%0\n\t"
					    "adcq %2,%0\n"
					    : "=r" (result)
					    : "m" (*(unsigned long *)buff),
					      "r" (zero),  "0" (result));
					buff += 8;
				}
			}

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}


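/* Variant of do_csum(): 64-byte blocks summed in 4 interleaved chains ("64B Split4" in main()) */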
static unsigned do_csum3(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result3 = 0;
	unsigned long result4 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res1]\n\t"
				    "adcq 4*8(%[src]),%[res1]\n\t"
				    "adcq %[zero],%[res1]\n\t"

				    "addq 1*8(%[src]),%[res2]\n\t"
				    "adcq 5*8(%[src]),%[res2]\n\t"
				    "adcq %[zero],%[res2]\n\t"

				    "addq 2*8(%[src]),%[res3]\n\t"
				    "adcq 6*8(%[src]),%[res3]\n\t"
				    "adcq %[zero],%[res3]\n\t"

				    "addq 3*8(%[src]),%[res4]\n\t"
				    "adcq 7*8(%[src]),%[res4]\n\t"
				    "adcq %[zero],%[res4]\n\t"

				    : [res1] "=r" (result1),
				      [res2] "=r" (result2),
				      [res3] "=r" (result3),
				      [res4] "=r" (result4)
				    : [src] "r" (buff),
				      [zero] "r" (zero),
				      "[res1]" (result1),
				      "[res2]" (result2),
				      "[res3]" (result3),
				      "[res4]" (result4));
				buff += 64;
				count64--;
			}

			asm("addq %[res1],%[res]\n\t"
			    "adcq %[res2],%[res]\n\t"
			    "adcq %[res3],%[res]\n\t"
			    "adcq %[res4],%[res]\n\t"
			    "adcq %[zero],%[res]"
			    : [res] "=r" (result)
			    : [res1] "r" (result1),
			      [res2] "r" (result2),
			      [res3] "r" (result3),
			      [res4] "r" (result4),
			      [zero] "r" (zero),
			      "0" (result));

			/* printf("csum1 %lx\n", result); */

			/* last up to 7 8-byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

long long delta_ns(struct timespec *t1, struct timespec *t2)
{
	long long tt1, tt2, delta;

	tt1 = t1->tv_sec * 1000000000 + t1->tv_nsec;
	tt2 = t2->tv_sec * 1000000000 + t2->tv_nsec;
	delta = tt2 - tt1;

	return delta;
}

int main(int argc, char **argv)
{
	FILE *f;
	unsigned csum1, csum2, csum3, csum4;
	struct timespec t1;
	struct timespec t2;
	double delta;
	int i;
	unsigned int offset = 0;
	unsigned char *ptr;
	unsigned int size;

	if ((f = fopen("data.bin", "r")) == NULL) {
		printf("Failed to open input file data.bin: %s\n",
		       strerror(errno));
		return -1;
	}

	if (fread(buf, 1, BUF_LEN, f) != BUF_LEN) {
		printf("Failed to read data.bin: %s\n",
		       strerror(errno));
		fclose(f);
		return -1;
	}

	fclose(f);

	if (argc > 1)
		offset = atoi(argv[1]);

	printf("Using offset=%d\n", offset);

	ptr = &buf[offset];
	size = BUF_LEN - offset;

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum1 = do_csum((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("Original:    %.8x %f us\n",
	       csum1, (double)delta/(double)NUM_LOOPS);

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum2 = do_csum1((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("64B Split2:  %.8x %f us\n",
	       csum2, (double)delta/(double)NUM_LOOPS);


	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum3 = do_csum2((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("128B Split4: %.8x %f us\n",
	       csum3, (double)delta/(double)NUM_LOOPS);

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum4 = do_csum3((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("64B Split4:  %.8x %f us\n",
	       csum4, (double)delta/(double)NUM_LOOPS);

	if ((csum1 != csum2) || (csum1 != csum3) || (csum1 != csum4))
		printf("Wrong checksum\n");

	return 0;
}




* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:28   ` Neil Horman
  2013-10-14 21:19     ` Eric Dumazet
@ 2013-10-15  7:32     ` Ingo Molnar
  2013-10-15 13:14       ` Neil Horman
  1 sibling, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-15  7:32 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > don't have checksum offload hardware were spending a significant amount 
> > > of time computing checksums.  We found that by splitting the checksum 
> > > computation into two separate streams, each skipping successive elements 
> > > of the buffer being summed, we could parallelize the checksum operation 
> > > accros multiple alus.  Since neither chain is dependent on the result of 
> > > the other, we get a speedup in execution (on hardware that has multiple 
> > > alu's available, which is almost ubiquitous on x86), and only a 
> > > negligible decrease on hardware that has only a single alu (an extra 
> > > addition is introduced).  Since addition in commutative, the result is 
> > > the same, only faster
> > 
> > This patch should really come with measurement numbers: what performance 
> > increase (and drop) did you get on what CPUs.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> 
> So, early testing results today.  I wrote a test module that, allocated 
> a 4k buffer, initalized it with random data, and called csum_partial on 
> it 100000 times, recording the time at the start and end of that loop.  

It would be nice to stick that testcase into tools/perf/bench/, see how we 
are able to benchmark the kernel's memcpy and memset implementation there:

 $ perf bench mem memcpy -r help
 # Running 'mem/memcpy' benchmark:
 Unknown routine:help
 Available routines...
        default ... Default memcpy() provided by glibc
        x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
        x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
        x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S

In a similar fashion we could build the csum_partial() code as well and do 
measurements. (We could change arch/x86/ code as well to make such 
embedding/including easier, as long as it does not change performance.)

Thanks,

	Ingo


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:49             ` Joe Perches
@ 2013-10-15  7:41               ` Ingo Molnar
  2013-10-15 10:51                 ` Borislav Petkov
  2013-10-15 16:21                 ` Joe Perches
  0 siblings, 2 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-10-15  7:41 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Joe Perches <joe@perches.com> wrote:

> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > attached patch brings much better results
> > > > 
> > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > Recv   Send    Send                          Utilization       Service Demand
> > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > 
> > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > 
> > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > []
> > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > >  			zero = 0;
> > > >  			count64 = count >> 3;
> > > >  			while (count64) { 
> > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > 
> > > Might the prefetch size be too big here?
> > 
> > To be effective, you need to prefetch well ahead of time.
> 
> No doubt.

So why did you ask then?

> > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> 
> 5 cachelines for some processors seems like a lot.

What processors would that be?

Most processors have hundreds of cachelines even in their L1 cache. 
Thousands in the L2 cache, up to hundreds of thousands.

Thanks,

	Ingo


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:41               ` Ingo Molnar
@ 2013-10-15 10:51                 ` Borislav Petkov
  2013-10-15 12:04                   ` Ingo Molnar
  2013-10-15 16:21                 ` Joe Perches
  1 sibling, 1 reply; 105+ messages in thread
From: Borislav Petkov @ 2013-10-15 10:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Joe Perches, Eric Dumazet, Neil Horman, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86

On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> Most processors have hundreds of cachelines even in their L1 cache.
> Thousands in the L2 cache, up to hundreds of thousands.

Also, I have this hazy memory of prefetch hints being harmful in some
situations: https://lwn.net/Articles/444344/

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 10:51                 ` Borislav Petkov
@ 2013-10-15 12:04                   ` Ingo Molnar
  0 siblings, 0 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-10-15 12:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Joe Perches, Eric Dumazet, Neil Horman, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86


* Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> > Most processors have hundreds of cachelines even in their L1 cache.
> > Thousands in the L2 cache, up to hundreds of thousands.
> 
> Also, I have this hazy memory of prefetch hints being harmful in some
> situations: https://lwn.net/Articles/444344/

Yes, for things like random list walks they tend to be harmful - the 
hardware is smarter.

For something like a controlled packet stream they might be helpful.

Thanks,

	Ingo


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:32     ` Ingo Molnar
@ 2013-10-15 13:14       ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-15 13:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > > don't have checksum offload hardware were spending a significant amount 
> > > > of time computing checksums.  We found that by splitting the checksum 
> > > > computation into two separate streams, each skipping successive elements 
> > > > of the buffer being summed, we could parallelize the checksum operation 
> > > > accros multiple alus.  Since neither chain is dependent on the result of 
> > > > the other, we get a speedup in execution (on hardware that has multiple 
> > > > alu's available, which is almost ubiquitous on x86), and only a 
> > > > negligible decrease on hardware that has only a single alu (an extra 
> > > > addition is introduced).  Since addition in commutative, the result is 
> > > > the same, only faster
> > > 
> > > This patch should really come with measurement numbers: what performance 
> > > increase (and drop) did you get on what CPUs.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > 
> > So, early testing results today.  I wrote a test module that, allocated 
> > a 4k buffer, initalized it with random data, and called csum_partial on 
> > it 100000 times, recording the time at the start and end of that loop.  
> 
> It would be nice to stick that testcase into tools/perf/bench/, see how we 
> are able to benchmark the kernel's mempcy and memset implementation there:
> 
Sure, my module is a mess currently.  But as soon as I investigate the use of
ADCX/ADOX that Anvin suggested, I'll see about integrating that.
Neil

>  $ perf bench mem memcpy -r help
>  # Running 'mem/memcpy' benchmark:
>  Unknown routine:help
>  Available routines...
>         default ... Default memcpy() provided by glibc
>         x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
>         x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
>         x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S
> 
> In a similar fashion we could build the csum_partial() code as well and do 
> measurements. (We could change arch/x86/ code as well to make such 
> embedding/including easier, as long as it does not change performance.)
> 
> Thanks,
> 
> 	Ingo
> 


* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 21:07     ` Eric Dumazet
@ 2013-10-15 13:17       ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-15 13:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Andi Kleen, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> > * Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > Neil Horman <nhorman@tuxdriver.com> writes:
> > > 
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > > don't have checksum offload hardware were spending a significant 
> > > > amount of time computing
> > > 
> > > Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> > > anyways. I would rather investigate why that doesn't work.
> > 
> > There's a fair amount of csum_partial()-only workloads, a packet does not 
> > need to hit user-space to be a significant portion of the system's 
> > workload.
> > 
> > That said, it would indeed be nice to hear which particular code path was 
> > hit in this case, if nothing else then for education purposes.
> 
> Many NIC do not provide a CHECKSUM_COMPLETE information for encapsulated
> frames, meaning we have to fallback to software csum to validate
> TCP frames, once tunnel header is pulled.
> 
> So to reproduce the issue, all you need is to setup a GRE tunnel between
> two hosts, and use any tcp stream workload.
> 
> Then receiver profile looks like :
> 
> 11.45%	[kernel]	 [k] csum_partial
>  3.08%	[kernel]	 [k] _raw_spin_lock
>  3.04%	[kernel]	 [k] intel_idle
>  2.73%	[kernel]	 [k] ipt_do_table
>  2.57%	[kernel]	 [k] __netif_receive_skb_core
>  2.15%	[kernel]	 [k] copy_user_generic_string
>  2.05%	[kernel]	 [k] __hrtimer_start_range_ns
>  1.42%	[kernel]	 [k] ip_rcv
>  1.39%	[kernel]	 [k] kmem_cache_free
>  1.36%	[kernel]	 [k] _raw_spin_unlock_irqrestore
>  1.24%	[kernel]	 [k] __schedule
>  1.13%	[bnx2x] 	 [k] bnx2x_rx_int
>  1.12%	[bnx2x] 	 [k] bnx2x_start_xmit
>  1.11%	[kernel]	 [k] fib_table_lookup
>  0.99%	[ip_tunnel]  [k] ip_tunnel_lookup
>  0.91%	[ip_tunnel]  [k] ip_tunnel_rcv
>  0.90%	[kernel]	 [k] check_leaf.isra.7
>  0.89%	[kernel]	 [k] nf_iterate
> 
As I noted previously, the workload that this was reported on was ipoib, which
has a similar profile, since infiniband cards tend not to be able to do
checksum offload for ip frames.

Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:12     ` Sébastien Dugué
@ 2013-10-15 13:33       ` Andi Kleen
  2013-10-15 13:56         ` Sébastien Dugué
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2013-10-15 13:33 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Neil Horman, Andi Kleen, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

>   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> where one cannot benefit from hardware offloads.

Is this with sendfile? 

For normal send() the checksum is done in the user copy, and for receiving it
can also be done during the copy in most cases.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 13:33       ` Andi Kleen
@ 2013-10-15 13:56         ` Sébastien Dugué
  2013-10-15 14:06           ` Eric Dumazet
  0 siblings, 1 reply; 105+ messages in thread
From: Sébastien Dugué @ 2013-10-15 13:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Neil Horman, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, 15 Oct 2013 15:33:36 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > where one cannot benefit from hardware offloads.
> 
> Is this with sendfile?

  Tests were done with iperf at the time without any extra funky options, and
looking at the code it looks like it does plain write() / recv() on the socket.

  Sébastien.

> 
> For normal send() the checksum is done in the user copy and for receiving it
> can be also done during the copy in most cases
> 
> -Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 13:56         ` Sébastien Dugué
@ 2013-10-15 14:06           ` Eric Dumazet
  2013-10-15 14:15             ` Sébastien Dugué
  0 siblings, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:06 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> On Tue, 15 Oct 2013 15:33:36 +0200
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > > where one cannot benefit from hardware offloads.
> > 
> > Is this with sendfile?
> 
>   Tests were done with iperf at the time without any extra funky options, and
> looking at the code it looks like it does plain write() / recv() on the socket.
> 

But the csum cost is there for both sender and receiver?

Please post the following:

perf record -g "your iperf session"

perf report | head -n 200




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:06           ` Eric Dumazet
@ 2013-10-15 14:15             ` Sébastien Dugué
  2013-10-15 14:26               ` Eric Dumazet
  0 siblings, 1 reply; 105+ messages in thread
From: Sébastien Dugué @ 2013-10-15 14:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

Hi Eric,

On Tue, 15 Oct 2013 07:06:25 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> > On Tue, 15 Oct 2013 15:33:36 +0200
> > Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > > > where one cannot benefit from hardware offloads.
> > > 
> > > Is this with sendfile?
> > 
> >   Tests were done with iperf at the time without any extra funky options, and
> > looking at the code it looks like it does plain write() / recv() on the socket.
> > 
> 
> But the csum cost is both for sender and receiver ?

  No, it was only on the receiver side that I noticed it.

> 
> Please post the following :
> 
> perf record -g "your iperf session"
> 
> perf report | head -n 200

  Sorry, but this is 3-year-old stuff and I no longer have the
setup to reproduce it.

  Sébastien.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:15             ` Sébastien Dugué
@ 2013-10-15 14:26               ` Eric Dumazet
  2013-10-15 14:52                 ` Eric Dumazet
  0 siblings, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:26 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote:
> Hi Eric,
> 
> On Tue, 15 Oct 2013 07:06:25 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > But the csum cost is both for sender and receiver ?
> 
>   No, it was only on the receiver side that I noticed it.
> 

Yes, as Andi said, we do the csum while copying the data on the sender
side (I disabled hardware-assisted tx checksumming using 'ethtool -K eth0
tx off'):

    17.21%  netperf  [kernel.kallsyms]       [k] csum_partial_copy_generic
            |
            --- csum_partial_copy_generic
               |          
               |--97.39%-- __libc_send
               |          
                --2.61%-- tcp_sendmsg
                          inet_sendmsg
                          sock_sendmsg
                          _sys_sendto
                          sys_sendto
                          system_call_fastpath
                          __libc_send



>   Sorry, but this is 3 years old stuff and I do not have the
> setup anymore to reproduce.

And the receiver should also do the same (ethtool -K eth0 rx off):

    10.55%    netserver  [kernel.kallsyms]  [k] csum_partial_copy_generic
              |
              --- csum_partial_copy_generic
                 |          
                 |--98.24%-- __libc_recv
                 |          
                  --1.76%-- skb_copy_and_csum_datagram
                            skb_copy_and_csum_datagram_iovec
                            tcp_rcv_established
                            tcp_v4_do_rcv
                            |          
                            |--73.05%-- tcp_prequeue_process
                            |          tcp_recvmsg
                            |          inet_recvmsg
                            |          sock_recvmsg
                            |          SYSC_recvfrom
                            |          SyS_recvfrom
                            |          system_call_fastpath
                            |          __libc_recv
                            |          

So I suspect something is wrong with IPoIB.





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:26               ` Eric Dumazet
@ 2013-10-15 14:52                 ` Eric Dumazet
  2013-10-15 16:02                   ` Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:52 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote:

> And the receiver should also do the same : (ethtool -K eth0 rx off)
> 
>     10.55%    netserver  [kernel.kallsyms]  [k]
> csum_partial_copy_generic            

I get the csum_partial() hit if I disable prequeue.

echo 1 >/proc/sys/net/ipv4/tcp_low_latency

    24.49%      swapper  [kernel.kallsyms]  [k] csum_partial
                |
                --- csum_partial
                    skb_checksum
                    __skb_checksum_complete_head
                    __skb_checksum_complete
                    tcp_rcv_established
                    tcp_v4_do_rcv
                    tcp_v4_rcv
                    ip_local_deliver_finish
                    ip_local_deliver
                    ip_rcv_finish
                    ip_rcv

So yes, we can end up calling csum_partial() in the receive path in this case.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:52                 ` Eric Dumazet
@ 2013-10-15 16:02                   ` Andi Kleen
  2013-10-16  0:28                     ` Eric Dumazet
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2013-10-15 16:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Sébastien Dugué,
	Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

> I get the csum_partial() if disabling prequeue.

At least in the ipoib case I would consider that a misconfiguration.

"don't do this if it hurts"

There may be more such problems.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:41               ` Ingo Molnar
  2013-10-15 10:51                 ` Borislav Petkov
@ 2013-10-15 16:21                 ` Joe Perches
  2013-10-16  0:34                   ` Eric Dumazet
  2013-10-16  6:25                   ` Ingo Molnar
  1 sibling, 2 replies; 105+ messages in thread
From: Joe Perches @ 2013-10-15 16:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> * Joe Perches <joe@perches.com> wrote:
> 
> > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > attached patch brings much better results
> > > > > 
> > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > Recv   Send    Send                          Utilization       Service Demand
> > > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > > 
> > > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > > 
> > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > []
> > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > >  			zero = 0;
> > > > >  			count64 = count >> 3;
> > > > >  			while (count64) { 
> > > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > > 
> > > > Might the prefetch size be too big here?
> > > 
> > > To be effective, you need to prefetch well ahead of time.
> > 
> > No doubt.
> 
> So why did you ask then?
> 
> > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> > 
> > 5 cachelines for some processors seems like a lot.
> 
> What processors would that be?

The ones where conservatism in L1 cache use is good
because there are multiple threads running concurrently.

> Most processors have hundreds of cachelines even in their L1 cache. 

And sometimes that many executable processes too.

> Thousands in the L2 cache, up to hundreds of thousands.

Irrelevant because prefetch doesn't apply there.

Ingo, Eric _showed_ that the prefetch is good here.
How about looking at trimming it down to the minimal
prefetch distance that still gives that level of performance?

You could argue that prefetching PAGE_SIZE or larger
would be better still otherwise.

I suspect that using a smaller multiple of
L1_CACHE_BYTES like 2 or 3 would perform the same.

The last time it was looked at for copy_page_64.S was
quite a while ago.  It looks like maybe 2003.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:02                   ` Andi Kleen
@ 2013-10-16  0:28                     ` Eric Dumazet
  0 siblings, 0 replies; 105+ messages in thread
From: Eric Dumazet @ 2013-10-16  0:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Sébastien Dugué,
	Neil Horman, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote:
> > I get the csum_partial() if disabling prequeue.
> 
> At least in the ipoib case i would consider that a misconfiguration.

There is nothing you can do: if the application is not blocked in recv()
but is using poll()/epoll()/select(), prequeue is not used at all.

In this case, we need to csum_partial() the frame before sending an ACK,
don't you think? ;)
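
For reference, this is (simplified) what the receive path has to do in that
case.  The wrapper name below is made up, but __skb_checksum_complete() is the
real helper that ends up in csum_partial() when the NIC gave us no checksum,
exactly as in the profile above:

#include <linux/skbuff.h>

/*
 * Rough sketch, not the actual kernel code: if the device did not
 * validate the checksum (no CHECKSUM_UNNECESSARY), we have to walk
 * the whole payload in software before the segment can be ACKed.
 */
static bool segment_csum_ok(struct sk_buff *skb)
{
	if (skb->ip_summed == CHECKSUM_UNNECESSARY)
		return true;		/* hardware already validated it */

	/* __skb_checksum_complete() -> skb_checksum() -> csum_partial() */
	return __skb_checksum_complete(skb) == 0;
}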




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:21                 ` Joe Perches
@ 2013-10-16  0:34                   ` Eric Dumazet
  2013-10-16  6:25                   ` Ingo Molnar
  1 sibling, 0 replies; 105+ messages in thread
From: Eric Dumazet @ 2013-10-16  0:34 UTC (permalink / raw)
  To: Joe Perches
  Cc: Ingo Molnar, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote:

> Ingo, Eric _showed_ that the prefetch is good here.
> How about looking at a little optimization to the minimal
> prefetch that gives that level of performance.

Wait a minute, my point was to remind everyone that the main cost is the
memory fetching.

It's nice to optimize cpu cycles if we are short of them,
but in the csum_partial() case, the bottleneck is the memory.

Also, I was wondering about the implications of changing the order of the
reads, as it might fool the cpu's prefetch predictions.

I do not particularly care about finding the right prefetch stride,
I think the Intel guys know that better than I do.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:21                 ` Joe Perches
  2013-10-16  0:34                   ` Eric Dumazet
@ 2013-10-16  6:25                   ` Ingo Molnar
  2013-10-16 16:55                     ` Joe Perches
  1 sibling, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-16  6:25 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Joe Perches <joe@perches.com> wrote:

> On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> > * Joe Perches <joe@perches.com> wrote:
> > 
> > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > > attached patch brings much better results
> > > > > > 
> > > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > > Recv   Send    Send                          Utilization       Service Demand
> > > > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > > > 
> > > > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > > > 
> > > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > > []
> > > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > > >  			zero = 0;
> > > > > >  			count64 = count >> 3;
> > > > > >  			while (count64) { 
> > > > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > > > 
> > > > > Might the prefetch size be too big here?
> > > > 
> > > > To be effective, you need to prefetch well ahead of time.
> > > 
> > > No doubt.
> > 
> > So why did you ask then?
> > 
> > > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> > > 
> > > 5 cachelines for some processors seems like a lot.
> > 
> > What processors would that be?
> 
> The ones where conservatism in L1 cache use is good because there are 
> multiple threads running concurrently.

What specific processor models would that be?

> > Most processors have hundreds of cachelines even in their L1 cache.
>
> And sometimes that many executable processes too.

Nonsense, this is an unrolled loop running in softirq context most of the 
time that does not get preempted.

> > Thousands in the L2 cache, up to hundreds of thousands.
> 
> Irrelevant because prefetch doesn't apply there.

What planet are you living on? Prefetch moves data from L2 to L1
just as much as it moves cachelines from memory to the L2 cache.

Especially in the usecase cited here there will be a second use of the 
data (when it's finally copied over into user-space), so the L2 cache size 
very much matters.

The prefetches here matter mostly to the packet being processed: the ideal 
size of the look-ahead window in csum_partial() is dictated by typical 
memory latencies and bandwidth. The amount of parallelism is limited by 
the number of carry bits we can maintain independently.

> Ingo, Eric _showed_ that the prefetch is good here. How about looking at 
> a little optimization to the minimal prefetch that gives that level of 
> performance.

Joe, instead of using a condescending tone in matters you clearly have 
very little clue about you might want to start doing some real kernel 
hacking in more serious kernel areas, beyond trivial matters such as 
printk strings, to gain a bit of experience and respect ...

Every word you uttered in this thread made it more likely for me to 
redirect you to /dev/null, permanently.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-16  6:25                   ` Ingo Molnar
@ 2013-10-16 16:55                     ` Joe Perches
  0 siblings, 0 replies; 105+ messages in thread
From: Joe Perches @ 2013-10-16 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote:
>  Prefetch takes memory from L2->L1 memory 
> just as much as it moves it cachelines from memory to the L2 cache. 

Yup, mea culpa.
I thought the prefetch was still to L1, like on the Pentium.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:18       ` Eric Dumazet
  2013-10-14 22:37         ` Joe Perches
@ 2013-10-17  0:34         ` Neil Horman
  2013-10-17  1:42           ` Eric Dumazet
  2013-10-17  8:41           ` Ingo Molnar
  1 sibling, 2 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-17  0:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > 
> > > So, early testing results today.  I wrote a test module that, allocated a 4k
> > > buffer, initalized it with random data, and called csum_partial on it 100000
> > > times, recording the time at the start and end of that loop.  Results on a 2.4
> > > GHz Intel Xeon processor:
> > > 
> > > Without patch: Average execute time for csum_partial was 808 ns
> > > With patch: Average execute time for csum_partial was 438 ns
> > 
> > Impressive, but could you try again with data out of cache ?
> 
> So I tried your patch on a GRE tunnel and got following results on a
> single TCP flow. (short result : no visible difference)
> 
> 

So I went to reproduce these results, but was unable to (due to the fact that I
only have a pretty jittery network to do testing across at the moment with
these devices).  So instead I figured that I would go back to just doing
measurements with the module that I cobbled together, operating under the
assumption that it would give me accurate, relatively jitter-free results (I've
attached the module code for reference below).  My results show slightly
different behavior:

Base results runs:
89417240
85170397
85208407
89422794
91645494
103655144
86063791
75647774
83502921
85847372
AVG = 875 ns

Prefetch only runs:
70962849
77555099
81898170
68249290
72636538
83039294
78561494
83393369
85317556
79570951
AVG = 781 ns

Parallel addition only runs:
42024233
44313064
48304416
64762297
42994259
41811628
55654282
64892958
55125582
42456403
AVG = 510 ns


Both prefetch and parallel addition:
41329930
40689195
61106622
46332422
49398117
52525171
49517101
61311153
43691814
49043084
AVG = 494 ns


For reference, each of the above large numbers is the number of nanoseconds
taken to compute the checksum of a 4KB buffer 100000 times.  To get my average
per-call results, I ran the test in a loop 10 times, averaged the totals, and
divided by 100000.


Based on these, prefetching is obviously a good improvement, but not as good
as parallel execution, and the winner by far is doing both.

Thoughts?

Neil



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;

	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

	if (!buf) {
		printk(KERN_CRIT "UNABLE TO ALLOCATE A BUFFER OF %lu bytes\n", PAGE_SIZE);
		return -ENOMEM;
	}

	printk(KERN_CRIT "INITALIZING BUFFER\n");
	get_random_bytes(buf, PAGE_SIZE);

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);

	for(i=0;i<100000;i++)
		sum = csum_partial(buf, PAGE_SIZE, sum);
	getnstimeofday(&end);
	preempt_enable();
	/* use the full timespec delta so a tv_nsec rollover can't corrupt the result */
	time = timespec_to_ns(&end) - timespec_to_ns(&start);

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	kfree(buf);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  0:34         ` Neil Horman
@ 2013-10-17  1:42           ` Eric Dumazet
  2013-10-18 16:50             ` Neil Horman
  2013-10-17  8:41           ` Ingo Molnar
  1 sibling, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-17  1:42 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote:

> > 
> 
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing accross at the moment with
> these devices).  So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together (operating under the
> assumption that it would give me accurate, relatively jitter free results (I've
> attached the module code for reference below).  My results show slightly
> different behavior:
> 
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
> 
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
> 
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
> 
> 
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
> 
> 
> For reference, each of the above large numbers is the number of nanoseconds
> taken to compute the checksum of a 4kb buffer 100000 times.  To get my average
> results, I ran the test in a loop 10 times, averaged them, and divided by
> 100000.
> 
> 
> Based on these, prefetching is obviously a a good improvement, but not as good
> as parallel execution, and the winner by far is doing both.
> 
> Thoughts?
> 
> Neil
> 


Your benchmark uses a single 4K page, so the data is _super_ hot in cpu
caches.
(prefetch should give no speedup; I am surprised it makes any
difference)

Try now with 32 huge pages, to get 64 MBytes of working set.

Because in reality we never csum_partial() data that is already in the cpu
cache.  (Unless the NIC preloaded the data into the cpu cache before raising
the interrupt.)

Really, if Sébastien got a speedup, it means that something fishy was
going on, like:

- A copy of data into some area of memory, prefilling cpu caches
- csum_partial() done while data is hot in cache.

This is exactly a "should not happen" scenario, because the csum in this
case should happen _while_ doing the copy, for essentially 0 extra ns.
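
To illustrate (the two wrappers below are made-up names;
csum_partial_copy_nocheck() is the existing helper that does the copy and the
sum in a single pass):

#include <linux/string.h>
#include <net/checksum.h>

/* One pass over the data: the bytes are summed while they are copied,
 * so the checksum is essentially free. */
static __wsum copy_and_csum(void *dst, const void *src, int len, __wsum sum)
{
	return csum_partial_copy_nocheck(src, dst, len, sum);
}

/* Two passes over the data: this is what you end up paying for when the
 * copy and the checksum get separated, as in the profiles above. */
static __wsum copy_then_csum(void *dst, const void *src, int len, __wsum sum)
{
	memcpy(dst, src, len);
	return csum_partial(dst, len, sum);
}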




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  0:34         ` Neil Horman
  2013-10-17  1:42           ` Eric Dumazet
@ 2013-10-17  8:41           ` Ingo Molnar
  2013-10-17 18:19             ` H. Peter Anvin
  2013-10-28 16:01             ` Neil Horman
  1 sibling, 2 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-10-17  8:41 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > > 
> > > > So, early testing results today.  I wrote a test module that, allocated a 4k
> > > > buffer, initalized it with random data, and called csum_partial on it 100000
> > > > times, recording the time at the start and end of that loop.  Results on a 2.4
> > > > GHz Intel Xeon processor:
> > > > 
> > > > Without patch: Average execute time for csum_partial was 808 ns
> > > > With patch: Average execute time for csum_partial was 438 ns
> > > 
> > > Impressive, but could you try again with data out of cache ?
> > 
> > So I tried your patch on a GRE tunnel and got following results on a
> > single TCP flow. (short result : no visible difference)
> > 
> > 
> 
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing accross at the moment with
> these devices).  So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together (operating under the
> assumption that it would give me accurate, relatively jitter free results (I've
> attached the module code for reference below).  My results show slightly
> different behavior:
> 
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
> 
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
> 
> 
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
> 
> 
> For reference, each of the above large numbers is the number of 
> nanoseconds taken to compute the checksum of a 4kb buffer 100000 times.  
> To get my average results, I ran the test in a loop 10 times, averaged 
> them, and divided by 100000.
> 
> Based on these, prefetching is obviously a a good improvement, but not 
> as good as parallel execution, and the winner by far is doing both.

But in the actual usecase mentioned the packet data was likely cache-cold:
it had just arrived from the NIC and an IRQ got sent. Your testcase uses a
super-hot 4K buffer that fits into the L1 cache. So it's apples to
oranges.

To correctly simulate the workload you'd have to:

 - allocate a buffer larger than your L2 cache.

 - to measure the effects of the prefetches you'd also have to randomize
   the individual buffer positions. See how 'perf bench numa' implements a
   random walk via --data_rand_walk, in tools/perf/bench/numa.c.
   Otherwise the CPU might learn your simplistic stream direction and the
   L2 cache might hw-prefetch your data, interfering with any explicit 
   prefetches the code does. In many real-life usecases packet buffers are
   scattered.

Also, it would be nice to see standard deviation noise numbers when two 
averages are close to each other, to be able to tell whether differences 
are statistically significant or not.

For example 'perf stat --repeat' will output stddev for you:

  comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'

   Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):

       0.189084480 seconds time elapsed                                          ( +- 11.95% )

The last '+-' percentage is the noise of the measurement.

Also note that you can inspect many cache behavior details of your 
algorithm via perf stat - the -ddd option will give you a laundry list:

  aldebaran:~> perf stat --repeat 20 -ddd perf bench sched messaging
  ...

     Total time: 0.095 [sec]

 Performance counter stats for 'perf bench sched messaging' (20 runs):

       1519.128721 task-clock (msec)         #   12.305 CPUs utilized            ( +-  0.34% )
            22,882 context-switches          #    0.015 M/sec                    ( +-  2.84% )
             3,927 cpu-migrations            #    0.003 M/sec                    ( +-  2.74% )
            16,616 page-faults               #    0.011 M/sec                    ( +-  0.17% )
     2,327,978,366 cycles                    #    1.532 GHz                      ( +-  1.61% ) [36.43%]
     1,715,561,189 stalled-cycles-frontend   #   73.69% frontend cycles idle     ( +-  1.76% ) [38.05%]
       715,715,454 stalled-cycles-backend    #   30.74% backend  cycles idle     ( +-  2.25% ) [39.85%]
     1,253,106,346 instructions              #    0.54  insns per cycle        
                                             #    1.37  stalled cycles per insn  ( +-  1.71% ) [49.68%]
       241,181,126 branches                  #  158.763 M/sec                    ( +-  1.43% ) [47.83%]
         4,232,053 branch-misses             #    1.75% of all branches          ( +-  1.23% ) [48.63%]
       431,907,354 L1-dcache-loads           #  284.313 M/sec                    ( +-  1.00% ) [48.37%]
        20,550,528 L1-dcache-load-misses     #    4.76% of all L1-dcache hits    ( +-  0.82% ) [47.61%]
         7,435,847 LLC-loads                 #    4.895 M/sec                    ( +-  0.94% ) [36.11%]
         2,419,201 LLC-load-misses           #   32.53% of all LL-cache hits     ( +-  2.93% ) [ 7.33%]
       448,638,547 L1-icache-loads           #  295.326 M/sec                    ( +-  2.43% ) [21.75%]
        22,066,490 L1-icache-load-misses     #    4.92% of all L1-icache hits    ( +-  2.54% ) [30.66%]
       475,557,948 dTLB-loads                #  313.047 M/sec                    ( +-  1.96% ) [37.96%]
         6,741,523 dTLB-load-misses          #    1.42% of all dTLB cache hits   ( +-  2.38% ) [37.05%]
     1,268,628,660 iTLB-loads                #  835.103 M/sec                    ( +-  1.75% ) [36.45%]
            74,192 iTLB-load-misses          #    0.01% of all iTLB cache hits   ( +-  2.88% ) [36.19%]
         4,466,526 L1-dcache-prefetches      #    2.940 M/sec                    ( +-  1.61% ) [36.17%]
         2,396,311 L1-dcache-prefetch-misses #    1.577 M/sec                    ( +-  1.55% ) [35.71%]

       0.123459566 seconds time elapsed                                          ( +-  0.58% )

There's also a number of prefetch counters that might be useful:

 aldebaran:~> perf list | grep prefetch
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  node-prefetches                                    [Hardware cache event]
  node-prefetch-misses                               [Hardware cache event]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  8:41           ` Ingo Molnar
@ 2013-10-17 18:19             ` H. Peter Anvin
  2013-10-17 18:48               ` Eric Dumazet
  2013-10-18  6:43               ` Ingo Molnar
  2013-10-28 16:01             ` Neil Horman
  1 sibling, 2 replies; 105+ messages in thread
From: H. Peter Anvin @ 2013-10-17 18:19 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, x86, netdev

On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> 
> To correctly simulate the workload you'd have to:
> 
>  - allocate a buffer larger than your L2 cache.
> 
>  - to measure the effects of the prefetches you'd also have to randomize
>    the individual buffer positions. See how 'perf bench numa' implements a
>    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
>    Otherwise the CPU might learn your simplistic stream direction and the
>    L2 cache might hw-prefetch your data, interfering with any explicit 
>    prefetches the code does. In many real-life usecases packet buffers are
>    scattered.
> 
> Also, it would be nice to see standard deviation noise numbers when two 
> averages are close to each other, to be able to tell whether differences 
> are statistically significant or not.
> 

Seriously, though, how much does it matter?  All the above seems likely
to do is drown the signal by adding noise.

If the parallel (threaded) checksumming is faster, which theory says it
should and microbenchmarking confirms, how important are the
macrobenchmarks?

	-hpa



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17 18:19             ` H. Peter Anvin
@ 2013-10-17 18:48               ` Eric Dumazet
  2013-10-18  6:43               ` Ingo Molnar
  1 sibling, 0 replies; 105+ messages in thread
From: Eric Dumazet @ 2013-10-17 18:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, x86, netdev

On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote:

> Seriously, though, how much does it matter?  All the above seems likely
> to do is to drown the signal by adding noise.

I don't think so.

> 
> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenchmarking confirms, how important are the
> macrobenchmarks?

Seriously, microbenchmarks are very misleading.

I spent time on this patch, and found no changes on real workloads.

I was excited at first, then disappointed.

I hope we will find the real issue, as I really don't care about
microbenchmarks.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17 18:19             ` H. Peter Anvin
  2013-10-17 18:48               ` Eric Dumazet
@ 2013-10-18  6:43               ` Ingo Molnar
  1 sibling, 0 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-10-18  6:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Neil Horman, Eric Dumazet, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, x86, netdev


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> > 
> > To correctly simulate the workload you'd have to:
> > 
> >  - allocate a buffer larger than your L2 cache.
> > 
> >  - to measure the effects of the prefetches you'd also have to randomize
> >    the individual buffer positions. See how 'perf bench numa' implements a
> >    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> >    Otherwise the CPU might learn your simplistic stream direction and the
> >    L2 cache might hw-prefetch your data, interfering with any explicit 
> >    prefetches the code does. In many real-life usecases packet buffers are
> >    scattered.
> > 
> > Also, it would be nice to see standard deviation noise numbers when two 
> > averages are close to each other, to be able to tell whether differences 
> > are statistically significant or not.
> 
> 
> Seriously, though, how much does it matter?  All the above seems likely 
> to do is to drown the signal by adding noise.

I think it matters a lot and I don't think it 'adds' noise - it measures 
something else (cache cold behavior - which is the common case for 
first-time csum_partial() use for network packets), which was not measured 
before, and which by its nature has different noise patterns.

I've done many cache-cold measurements myself and had no trouble achieving 
statistically significant results and high precision.

> If the parallel (threaded) checksumming is faster, which theory says it 
> should and microbenchmarking confirms, how important are the 
> macrobenchmarks?

Microbenchmarks can be totally blind to things like the ideal prefetch 
window size (or whether a prefetch should be done at all: some CPUs will 
throw away prefetches if enough regular fetches arrive).

Also, 'naive' single-threaded algorithms can occasionally be better in the 
cache-cold case because a linear, predictable stream of memory accesses 
might saturate the memory bus better than a somewhat random looking, 
interleaved web of accesses that might not harmonize with buffer depths.

I _think_ if correctly tuned then the parallel algorithm should be better 
in the cache cold case, I just don't know with what parameters (and the 
algorithm has at least one free parameter: the prefetch window size), and 
I don't know how significant the effect is.

Also, more fundamentally, I absolutely detest doing no measurements or 
measuring the wrong thing - IMHO there are too many 'blind' optimization 
commits in the kernel with little to no observational data attached.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 22:29 ` H. Peter Anvin
  2013-10-13 12:53   ` Neil Horman
@ 2013-10-18 16:42   ` Neil Horman
  2013-10-18 17:09     ` H. Peter Anvin
  1 sibling, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-18 16:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> > checksums.  We found that by splitting the checksum computation into two
> > separate streams, each skipping successive elements of the buffer being summed,
> > we could parallelize the checksum operation accros multiple alus.  Since neither
> > chain is dependent on the result of the other, we get a speedup in execution (on
> > hardware that has multiple alu's available, which is almost ubiquitous on x86),
> > and only a negligible decrease on hardware that has only a single alu (an extra
> > addition is introduced).  Since addition in commutative, the result is the same,
> > only faster
> 
> On hardware that implement ADCX/ADOX then you should also be able to
> have additional streams interleaved since those instructions allow for
> dual carry chains.
> 
> 	-hpa
> 
I've been looking into this a bit more, and I'm a bit confused.  According to
this:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html

by my read, this pair of instructions simply supports 2 carry bit chains,
allowing for two parallel execution paths through the cpu that won't block on
one another.  It's exactly the same as what's being done with the universally
available addq/adcq instructions, so there's no real speedup (that I can see).
Since we'd either have to use the alternatives macro to support adcx/adox here
or fall back to the old instruction set, it seems not overly worth the effort
to support the extension.
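
For reference, here is the sort of inner loop I am picturing (untested sketch;
csum_64bytes_adx() is a made-up name, and it assumes a binutils that knows the
ADX mnemonics and a cpu with the ADX feature):

/*
 * Two independent carry chains in a single instruction stream: the
 * adcx chain carries through CF, the adox chain through OF, so
 * neither serializes on the other's flag.
 */
static inline void csum_64bytes_adx(const void *buff,
				    unsigned long *res1,
				    unsigned long *res2)
{
	unsigned long r1 = *res1, r2 = *res2;

	asm("xorl %%r9d,%%r9d\n\t"		/* zero r9, clear CF and OF */
	    "adcx 0*8(%[src]),%[r1]\n\t"	/* CF chain */
	    "adox 1*8(%[src]),%[r2]\n\t"	/* OF chain */
	    "adcx 2*8(%[src]),%[r1]\n\t"
	    "adox 3*8(%[src]),%[r2]\n\t"
	    "adcx 4*8(%[src]),%[r1]\n\t"
	    "adox 5*8(%[src]),%[r2]\n\t"
	    "adcx 6*8(%[src]),%[r1]\n\t"
	    "adox 7*8(%[src]),%[r2]\n\t"
	    "adcx %%r9,%[r1]\n\t"		/* fold the final CF back in */
	    "adox %%r9,%[r2]"			/* fold the final OF back in */
	    : [r1] "+r" (r1), [r2] "+r" (r2)
	    : [src] "r" (buff)
	    : "r9", "memory", "cc");

	*res1 = r1;
	*res2 = r2;
}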

Or am I missing something?

Neil

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  1:42           ` Eric Dumazet
@ 2013-10-18 16:50             ` Neil Horman
  2013-10-18 17:20               ` Eric Dumazet
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-18 16:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

> 
> Your benchmark uses a single 4K page, so data is _super_ hot in cpu
> caches.
> ( prefetch should give no speedups, I am surprised it makes any
> difference)
> 
> Try now with 32 huges pages, to get 64 MBytes of working set.
> 
> Because in reality we never csum_partial() data in cpu cache.
> (Unless the NIC preloaded the data into cpu cache before sending the
> interrupt)
> 
> Really, if Sebastien got a speed up, it means that something fishy was
> going on, like :
> 
> - A copy of data into some area of memory, prefilling cpu caches
> - csum_partial() done while data is hot in cache.
> 
> This is exactly a "should not happen" scenario, because the csum in this
> case should happen _while_ doing the copy, for 0 ns.
> 
> 
> 
> 


So, I took your suggestion, and modified my test module to allocate 32 huge
pages instead of a single 4k page.  I've attached the module changes and the
results below.  Contrary to your assertion above, results came out the same as
in my first run.  See below:

base results:
80381491
85279536
99537729
80398029
121385411
109478429
85369632
99242786
80250395
98170542

AVG=939 ns

prefetch only results:
86803812
101891541
85762713
95866956
102316712
93529111
90473728
79374183
93744053
90075501

AVG=919 ns

parallel only results:
68994797
63503221
64298412
63784256
75350022
66398821
77776050
79158271
91006098
67822318

AVG=718 ns

both prefetch and parallel results:
68852213
77536525
63963560
67255913
76169867
80418081
63485088
62386262
75533808
57731705

AVG=693 ns


So based on these, it seems that your assertion that prefetching is the key to
the speedup here isn't quite correct.  Either that or the testing continues to
be invalid.  I'm going to try some of Ingo's microbenchmarking suggestions just
to see if that provides any further details.  But any other thoughts about what
might be going awry are appreciated.

My module code:



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

#define BUFSIZ_ORDER 4
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;
	struct page *page;
	u32 offset = 0;

	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
	if (!page) {
		printk(KERN_CRIT "NO MEMORY FOR ALLOCATION");
		return -ENOMEM;
	}
	buf = page_address(page); 

	
	printk(KERN_CRIT "INITALIZING BUFFER\n");

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);
	
	for(i=0;i<100000;i++) {
		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
	}
	getnstimeofday(&end);
	preempt_enable();
	/* use the full timespec delta so a tv_nsec rollover can't corrupt the result */
	time = timespec_to_ns(&end) - timespec_to_ns(&start);

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	__free_pages(page, BUFSIZ_ORDER);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 16:42   ` Neil Horman
@ 2013-10-18 17:09     ` H. Peter Anvin
  2013-10-25 13:06       ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: H. Peter Anvin @ 2013-10-18 17:09 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.

Neil Horman <nhorman@tuxdriver.com> wrote:
>On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
>> On 10/11/2013 09:51 AM, Neil Horman wrote:
>> > Sébastien Dugué reported to me that devices implementing ipoib
>(which don't have
>> > checksum offload hardware were spending a significant amount of
>time computing
>> > checksums.  We found that by splitting the checksum computation
>into two
>> > separate streams, each skipping successive elements of the buffer
>being summed,
>> > we could parallelize the checksum operation accros multiple alus. 
>Since neither
>> > chain is dependent on the result of the other, we get a speedup in
>execution (on
>> > hardware that has multiple alu's available, which is almost
>ubiquitous on x86),
>> > and only a negligible decrease on hardware that has only a single
>alu (an extra
>> > addition is introduced).  Since addition in commutative, the result
>is the same,
>> > only faster
>> 
>> On hardware that implement ADCX/ADOX then you should also be able to
>> have additional streams interleaved since those instructions allow
>for
>> dual carry chains.
>> 
>> 	-hpa
>> 
>I've been looking into this a bit more, and I'm a bit confused. 
>According to
>this:
>http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
>
>by my read, this pair of instructions simply supports 2 carry bit
>chains,
>allowing for two parallel execution paths through the cpu that won't
>block on
>one another.  Its exactly the same as whats being done with the
>universally
>available addcq instruction, so theres no real speedup (that I can
>see).  Since
>we'd either have to use the alternatives macro to support adcx/adox
>here or the
>old instruction set, it seems not overly worth the effort to support
>the
>extension.  
>
>Or am I missing something?
>
>Neil

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 16:50             ` Neil Horman
@ 2013-10-18 17:20               ` Eric Dumazet
  2013-10-18 20:11                 ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-18 17:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > 

> 	for(i=0;i<100000;i++) {
> 		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> 		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
> 	}

Please replace this with random accesses, and use the more standard 1500 byte
length.

offset = prandom_u32() % (BUFSIZ - 1500);
offset &= ~1U;

sum = csum_partial(buf + offset, 1500, sum);

You are basically doing sequential accesses, so the prefetching should
be done automatically by the cpu itself.

Thanks !



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 17:20               ` Eric Dumazet
@ 2013-10-18 20:11                 ` Neil Horman
  2013-10-18 21:15                   ` Eric Dumazet
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-18 20:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > > 
> 
> > 	for(i=0;i<100000;i++) {
> > 		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> > 		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
> > 	}
> 
> Please replace this by random accesses, and use the more standard 1500
> length.
> 
> offset = prandom_u32() % (BUFSIZ - 1500);
> offset &= ~1U;
> 
> sum = csum_partial(buf + offset, 1500, sum);
> 
> You are basically doing sequential accesses, so prefetch should
> be automatically done by cpu itself.
> 
> Thanks !
> 
> 
> 

Sure, you got it!  Results below.  However, they continue to bear out that
parallel execution beats prefetch-only execution, and that doing both is
better than either one alone.

base results:
53156647
59670931
62839770
44842780
39297190
44905905
53300688
53287805
39436951
43021730

AVG=493 ns

prefetch-only results:
40337434
51986404
43509199
53128857
52973171
53520649
53536338
50325466
44864664
47908398

AVG=492 ns


parallel-only results:
52157183
44496511
36180011
38298368
36258099
43263531
45365519
54116344
62529241
63118224

AVG = 475 ns


both prefetch and parallel:
44317078
44526464
45761272
44477906
34868814
44637904
49478309
49718417
58681403
58304972

AVG = 474 ns


Heres the code I was using



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

#define BUFSIZ_ORDER 4
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;
	struct page *page;
	u32 offset = 0;

	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
	if (!page) {
		printk(KERN_CRIT "NO MEMORY FOR ALLOCATION");
		return -ENOMEM;
	}
	buf = page_address(page); 

	
	printk(KERN_CRIT "INITALIZING BUFFER\n");

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);
	
	for(i=0;i<100000;i++) {
		sum = csum_partial(buf+offset, 1500, sum);
		offset = prandom_u32() % (BUFSIZ - 1500);
		offset &= ~1U;
	}
	getnstimeofday(&end);
	preempt_enable();
	/* use the full timespec delta so a tv_nsec rollover can't corrupt the result */
	time = timespec_to_ns(&end) - timespec_to_ns(&start);

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	__free_pages(page, BUFSIZ_ORDER);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 20:11                 ` Neil Horman
@ 2013-10-18 21:15                   ` Eric Dumazet
  2013-10-20 21:29                     ` Neil Horman
  2013-10-21 19:21                     ` Neil Horman
  0 siblings, 2 replies; 105+ messages in thread
From: Eric Dumazet @ 2013-10-18 21:15 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:

> #define BUFSIZ_ORDER 4
> #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> static int __init csum_init_module(void)
> {
> 	int i;
> 	__wsum sum = 0;
> 	struct timespec start, end;
> 	u64 time;
> 	struct page *page;
> 	u32 offset = 0;
> 
> 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);

Not sure what you are doing here, but it's not correct.

You have a lot of variation in your results; I suspect a NUMA affinity
problem.

You can try the following code, and use taskset to make sure you run
this on a cpu on node 0.

#define BUFSIZ 2*1024*1024
#define NBPAGES 16

static int __init csum_init_module(void)
{
        int i;
        __wsum sum = 0;
        u64 start, end;
	void *base, *addrs[NBPAGES];
        u32 rnd, offset;

	memset(addrs, 0, sizeof(addrs));
	for (i = 0; i < NBPAGES; i++) {
		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
		if (!addrs[i])
			goto out;
	}

        local_bh_disable();
        pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
        start = ktime_to_ns(ktime_get());
        
        for (i = 0; i < 100000; i++) {
		rnd = prandom_u32();
		base = addrs[rnd % NBPAGES];
		rnd /= NBPAGES;
		offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial_opt(base + offset, 1500, sum);
        }
        end = ktime_to_ns(ktime_get());
        local_bh_enable();

        pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);

out:
	for (i = 0; i < NBPAGES; i++)
		kfree(addrs[i]);

        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 21:15                   ` Eric Dumazet
@ 2013-10-20 21:29                     ` Neil Horman
  2013-10-21 17:31                       ` Eric Dumazet
  2013-10-21 19:21                     ` Neil Horman
  1 sibling, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-20 21:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> 
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > 	int i;
> > 	__wsum sum = 0;
> > 	struct timespec start, end;
> > 	u64 time;
> > 	struct page *page;
> > 	u32 offset = 0;
> > 
> > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> 
> Not sure what you are doing here, but its not correct.
> 
Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.

> You have a lot of variations in your results, I suspect a NUMA affinity
> problem.
> 
I do have some variation, you're correct, but I don't think it's a NUMA issue

> You can try the following code, and use taskset to make sure you run
> this on a cpu on node 0
> 
I did run this with taskset to do exactly that (hence my comment above).  I'll
be glad to run your variant on monday morning though and provide results.


Best
Neil


> #define BUFSIZ 2*1024*1024
> #define NBPAGES 16
> 
> static int __init csum_init_module(void)
> {
>         int i;
>         __wsum sum = 0;
>         u64 start, end;
> 	void *base, *addrs[NBPAGES];
>         u32 rnd, offset;
> 
> 	memset(addrs, 0, sizeof(addrs));
> 	for (i = 0; i < NBPAGES; i++) {
> 		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
> 		if (!addrs[i])
> 			goto out;
> 	}
> 
>         local_bh_disable();
>         pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
>         start = ktime_to_ns(ktime_get());
>         
>         for (i = 0; i < 100000; i++) {
> 		rnd = prandom_u32();
> 		base = addrs[rnd % NBPAGES];
> 		rnd /= NBPAGES;
> 		offset = rnd % (BUFSIZ - 1500);
>                 offset &= ~1U;
>                 sum = csum_partial_opt(base + offset, 1500, sum);
>         }
>         end = ktime_to_ns(ktime_get());
>         local_bh_enable();
> 
>         pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);
> 
> out:
> 	for (i = 0; i < NBPAGES; i++)
> 		kfree(addrs[i]);
> 
>         return 0;
> }
> 
> static void __exit csum_cleanup_module(void)
> {
>         return;
> }
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-20 21:29                     ` Neil Horman
@ 2013-10-21 17:31                       ` Eric Dumazet
  2013-10-21 17:46                         ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-21 17:31 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > 
> > > #define BUFSIZ_ORDER 4
> > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > static int __init csum_init_module(void)
> > > {
> > > 	int i;
> > > 	__wsum sum = 0;
> > > 	struct timespec start, end;
> > > 	u64 time;
> > > 	struct page *page;
> > > 	u32 offset = 0;
> > > 
> > > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> > 
> > Not sure what you are doing here, but its not correct.
> > 
> Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.

Not really. We cannot allocate 64 Mbytes in a single alloc_pages() call
on x86. (MAX_ORDER = 11)

You noticed nothing because you did not
write anything to the 64 Mbytes area (which would have corrupted memory),
and you did not use CONFIG_DEBUG_PAGEALLOC=y.

Your code read data out of bounds and was lucky, that's all...

You in fact allocated a block of only (4096<<4) bytes.
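
For reference, a minimal sketch, an illustration rather than code from this
thread, of what the order argument actually buys and of one MAX_ORDER-safe way
to get the intended 64 MB:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * alloc_pages(gfp, order) returns 2^order base pages, i.e. PAGE_SIZE << order
 * bytes: order 4 is 64 KB, not 32 huge pages.  A single 64 MB block would need
 * order 14, well past the MAX_ORDER limit, so the intended buffer has to come
 * in smaller chunks, for example:
 */
#define NR_CHUNKS	32
#define CHUNK_ORDER	9		/* 2 MB per chunk with 4 KB pages */

static void *chunks[NR_CHUNKS];

static int alloc_test_chunks(void)
{
	int i;

	for (i = 0; i < NR_CHUNKS; i++) {
		struct page *p = alloc_pages(GFP_KERNEL, CHUNK_ORDER);

		if (!p)
			return -ENOMEM;	/* caller frees whatever was allocated */
		chunks[i] = page_address(p);
	}
	return 0;
}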




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 17:31                       ` Eric Dumazet
@ 2013-10-21 17:46                         ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-21 17:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote:
> On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > > 
> > > > #define BUFSIZ_ORDER 4
> > > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > > static int __init csum_init_module(void)
> > > > {
> > > > 	int i;
> > > > 	__wsum sum = 0;
> > > > 	struct timespec start, end;
> > > > 	u64 time;
> > > > 	struct page *page;
> > > > 	u32 offset = 0;
> > > > 
> > > > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> > > 
> > > Not sure what you are doing here, but its not correct.
> > > 
> > Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.
> 
> Not really. We cannot allocate 64 Mbytes in a single alloc_pages() call
> on x86. (MAX_ORDER = 11)
> 
> You noticed nothing because you did not 
> write anything on the 64Mbytes area (and corrupt memory) or
> use CONFIG_DEBUG_PAGEALLOC=y.
> 
> Your code read data out of bounds and was lucky, thats all...
> 
> You in fact allocated a page of (4096<<4) bytes
> 
Gahh!  I see what I did: the order passed to alloc_pages is an order of
normally sized pages, not huge pages, so I allocated far less memory than I
thought and then treated it as huge.  Stupid of me...

I'll have results on your version of the test case in just a bit here
Neil

> 
> 
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 21:15                   ` Eric Dumazet
  2013-10-20 21:29                     ` Neil Horman
@ 2013-10-21 19:21                     ` Neil Horman
  2013-10-21 19:44                       ` Eric Dumazet
  1 sibling, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-21 19:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> 
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > 	int i;
> > 	__wsum sum = 0;
> > 	struct timespec start, end;
> > 	u64 time;
> > 	struct page *page;
> > 	u32 offset = 0;
> > 
> > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> 
> Not sure what you are doing here, but its not correct.
> 
> You have a lot of variations in your results, I suspect a NUMA affinity
> problem.
> 
> You can try the following code, and use taskset to make sure you run
> this on a cpu on node 0
> 
> #define BUFSIZ 2*1024*1024
> #define NBPAGES 16
> 
> static int __init csum_init_module(void)
> {
>         int i;
>         __wsum sum = 0;
>         u64 start, end;
> 	void *base, *addrs[NBPAGES];
>         u32 rnd, offset;
> 
> 	memset(addrs, 0, sizeof(addrs));
> 	for (i = 0; i < NBPAGES; i++) {
> 		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
> 		if (!addrs[i])
> 			goto out;
> 	}
> 
>         local_bh_disable();
>         pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
>         start = ktime_to_ns(ktime_get());
>         
>         for (i = 0; i < 100000; i++) {
> 		rnd = prandom_u32();
> 		base = addrs[rnd % NBPAGES];
> 		rnd /= NBPAGES;
> 		offset = rnd % (BUFSIZ - 1500);
>                 offset &= ~1U;
>                 sum = csum_partial_opt(base + offset, 1500, sum);
>         }
>         end = ktime_to_ns(ktime_get());
>         local_bh_enable();
> 
>         pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);
> 
> out:
> 	for (i = 0; i < NBPAGES; i++)
> 		kfree(addrs[i]);
> 
>         return 0;
> }
> 
> static void __exit csum_cleanup_module(void)
> {
>         return;
> }
> 
> 
> 
> 


Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
such that no interrupts (save for local ones), would occur on that cpu.  Note
that I had to convert csum_partial_opt to csum_partial, as the _opt variant
doesn't exist in my tree, nor do I see it in any upstream tree or in the history
anywhere.

base results:
53569916
43506025
43476542
44048436
45048042
48550429
53925556
53927374
53489708
53003915

AVG = 492 ns

prefetching only:
53279213
45518140
49585388
53176179
44071822
43588822
44086546
47507065
53646812
54469118

AVG = 488 ns


parallel alu's only:
46226844
44458101
46803498
45060002
46187624
37542946
45632866
46275249
45031141
46281204

AVG = 449 ns


both optimizations:
45708837
45631124
45697135
45647011
45036679
39418544
44481577
46820868
44496471
35523928

AVG = 438 ns


We continue to see a small savings in execution time with prefetching (4 ns, or
about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
the best savings with both optimizations (54 ns, or 10.9%).  

These results, while they've changed as we've modified the test case slightly,
have remained consistent in their speedup ordering.  Prefetching helps, but
not as much as using multiple ALUs, and neither is as good as doing both
together.

Unless you see something else that I'm doing wrong here, it seems like a win to
do both.

Regards
Neil
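
As a side note on the spread in those per-run totals, here is a minimal
userspace sketch, separate from the test module, that turns the ten "base"
runs quoted above into a per-call mean and sample standard deviation
(build with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* the ten "base" run totals, in nanoseconds per 100000 calls */
	const double ns[] = { 53569916, 43506025, 43476542, 44048436,
			      45048042, 48550429, 53925556, 53927374,
			      53489708, 53003915 };
	const int n = sizeof(ns) / sizeof(ns[0]);
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += ns[i];
	mean /= n;
	for (i = 0; i < n; i++)
		var += (ns[i] - mean) * (ns[i] - mean);
	var /= n - 1;
	printf("mean %.0f ns/call, stddev %.0f ns/call\n",
	       mean / 100000, sqrt(var) / 100000);
	return 0;
}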




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 19:21                     ` Neil Horman
@ 2013-10-21 19:44                       ` Eric Dumazet
  2013-10-21 20:19                         ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-10-21 19:44 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:

> 
> Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> such that no interrupts (save for local ones), would occur on that cpu.  Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> anywhere.

This csum_partial_opt() was a private implementation of csum_partial()
so that I could load the module without rebooting the kernel ;)

> 
> base results:
> 53569916
> 43506025
> 43476542
> 44048436
> 45048042
> 48550429
> 53925556
> 53927374
> 53489708
> 53003915
> 
> AVG = 492 ns
> 
> prefetching only:
> 53279213
> 45518140
> 49585388
> 53176179
> 44071822
> 43588822
> 44086546
> 47507065
> 53646812
> 54469118
> 
> AVG = 488 ns
> 
> 
> parallel alu's only:
> 46226844
> 44458101
> 46803498
> 45060002
> 46187624
> 37542946
> 45632866
> 46275249
> 45031141
> 46281204
> 
> AVG = 449 ns
> 
> 
> both optimizations:
> 45708837
> 45631124
> 45697135
> 45647011
> 45036679
> 39418544
> 44481577
> 46820868
> 44496471
> 35523928
> 
> AVG = 438 ns
> 
> 
> We continue to see a small savings in execution time with prefetching (4 ns, or
> about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> the best savings with both optimizations (54 ns, or 10.9%).  
> 
> These results, while they've changed as we've modified the test case slightly
> have remained consistent in their sppedup ordinality.  Prefetching helps, but
> not as much as using multiple alu's, and neither is as good as doing both
> together.
> 
> Unless you see something else that I'm doing wrong here.  It seems like a win to
> do both.
> 

Well, I only said (or maybe I forgot), that on my machines, I got no
improvements at all with the multiple alu or the prefetch. (I tried
different strides)

Only noise in the results.

It seems it depends on cpus and/or multiple factors.

Last machine I used for the tests had :

processor	: 23
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping	: 2
microcode	: 0x13
cpu MHz		: 2800.256
cache size	: 12288 KB
physical id	: 1
siblings	: 12
core id		: 10
cpu cores	: 6




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 19:44                       ` Eric Dumazet
@ 2013-10-21 20:19                         ` Neil Horman
  2013-10-26 12:01                           ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-21 20:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> 
> > 
> > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > anywhere.
> 
> This csum_partial_opt() was a private implementation of csum_partial()
> so that I could load the module without rebooting the kernel ;)
> 
> > 
> > base results:
> > 53569916
> > 43506025
> > 43476542
> > 44048436
> > 45048042
> > 48550429
> > 53925556
> > 53927374
> > 53489708
> > 53003915
> > 
> > AVG = 492 ns
> > 
> > prefetching only:
> > 53279213
> > 45518140
> > 49585388
> > 53176179
> > 44071822
> > 43588822
> > 44086546
> > 47507065
> > 53646812
> > 54469118
> > 
> > AVG = 488 ns
> > 
> > 
> > parallel alu's only:
> > 46226844
> > 44458101
> > 46803498
> > 45060002
> > 46187624
> > 37542946
> > 45632866
> > 46275249
> > 45031141
> > 46281204
> > 
> > AVG = 449 ns
> > 
> > 
> > both optimizations:
> > 45708837
> > 45631124
> > 45697135
> > 45647011
> > 45036679
> > 39418544
> > 44481577
> > 46820868
> > 44496471
> > 35523928
> > 
> > AVG = 438 ns
> > 
> > 
> > We continue to see a small savings in execution time with prefetching (4 ns, or
> > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > the best savings with both optimizations (54 ns, or 10.9%).  
> > 
> > These results, while they've changed as we've modified the test case slightly
> > have remained consistent in their sppedup ordinality.  Prefetching helps, but
> > not as much as using multiple alu's, and neither is as good as doing both
> > together.
> > 
> > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > do both.
> > 
> 
> Well, I only said (or maybe I forgot), that on my machines, I got no
> improvements at all with the multiple alu or the prefetch. (I tried
> different strides)
> 
> Only noises in the results.
> 
I thought you previously said that running netperf gave you a statistically
significant performance boost when you added prefetching:
http://marc.info/?l=linux-kernel&m=138178914124863&w=2

But perhaps I missed a note somewhere.

> It seems it depends on cpus and/or multiple factors.
> 
> Last machine I used for the tests had :
> 
> processor	: 23
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 44
> model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> stepping	: 2
> microcode	: 0x13
> cpu MHz		: 2800.256
> cache size	: 12288 KB
> physical id	: 1
> siblings	: 12
> core id		: 10
> cpu cores	: 6
> 
> 
> 
> 

Thats about what I'm running with:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping        : 2
microcode       : 0x13
cpu MHz         : 1600.000
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4


I can't imagine what would cause the discrepancy in our results (a 10% savings
in execution time seems significant to me). My only thought would be that
possibly the ALUs on your cpu are faster than mine, and reduce the speedup
obtained by performing the operations in parallel, though that seems unlikely
with these processors being so closely matched.

Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 17:09     ` H. Peter Anvin
@ 2013-10-25 13:06       ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-25 13:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote:
> If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.
> 
Ok, fair enough.  Unfortunately, I'm not going to be able to get my hands on a
stepping of this CPU to test any code using these instructions for some time, so
I'll back-burner their use and revisit them later.  I'm still working on the
parallel ALU/prefetch angle though.

Neil
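
For anyone following along, a minimal sketch of the adcx/adox idea hpa is
referring to, an illustration rather than the posted patch: adcx carries only
through CF and adox only through OF, so two summation chains can be
interleaved in a single instruction stream.  It assumes a CPU and an assembler
with ADX support.

static unsigned long sum32_adx(const unsigned long *buff,
			       unsigned long result1, unsigned long result2)
{
	unsigned long zero;

	asm("xor %[z], %[z]\n\t"		/* clears both CF and OF */
	    "adcx 0*8(%[src]), %[res1]\n\t"
	    "adox 1*8(%[src]), %[res2]\n\t"
	    "adcx 2*8(%[src]), %[res1]\n\t"
	    "adox 3*8(%[src]), %[res2]\n\t"
	    "adcx %[z], %[res1]\n\t"		/* fold any leftover CF */
	    "adox %[z], %[res2]"		/* fold any leftover OF */
	    : [res1] "+r" (result1), [res2] "+r" (result2), [z] "=&r" (zero)
	    : [src] "r" (buff)
	    : "memory");
	/* a real implementation would also fold the carry out of this final
	 * add; it is left out to keep the sketch short */
	return result1 + result2;
}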

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 20:19                         ` Neil Horman
@ 2013-10-26 12:01                           ` Ingo Molnar
  2013-10-26 13:58                             ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-26 12:01 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > 
> > > 
> > > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > anywhere.
> > 
> > This csum_partial_opt() was a private implementation of csum_partial()
> > so that I could load the module without rebooting the kernel ;)
> > 
> > > 
> > > base results:
> > > 53569916
> > > 43506025
> > > 43476542
> > > 44048436
> > > 45048042
> > > 48550429
> > > 53925556
> > > 53927374
> > > 53489708
> > > 53003915
> > > 
> > > AVG = 492 ns
> > > 
> > > prefetching only:
> > > 53279213
> > > 45518140
> > > 49585388
> > > 53176179
> > > 44071822
> > > 43588822
> > > 44086546
> > > 47507065
> > > 53646812
> > > 54469118
> > > 
> > > AVG = 488 ns
> > > 
> > > 
> > > parallel alu's only:
> > > 46226844
> > > 44458101
> > > 46803498
> > > 45060002
> > > 46187624
> > > 37542946
> > > 45632866
> > > 46275249
> > > 45031141
> > > 46281204
> > > 
> > > AVG = 449 ns
> > > 
> > > 
> > > both optimizations:
> > > 45708837
> > > 45631124
> > > 45697135
> > > 45647011
> > > 45036679
> > > 39418544
> > > 44481577
> > > 46820868
> > > 44496471
> > > 35523928
> > > 
> > > AVG = 438 ns
> > > 
> > > 
> > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > > the best savings with both optimizations (54 ns, or 10.9%).  
> > > 
> > > These results, while they've changed as we've modified the test case slightly
> > > have remained consistent in their sppedup ordinality.  Prefetching helps, but
> > > not as much as using multiple alu's, and neither is as good as doing both
> > > together.
> > > 
> > > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > > do both.
> > > 
> > 
> > Well, I only said (or maybe I forgot), that on my machines, I got no
> > improvements at all with the multiple alu or the prefetch. (I tried
> > different strides)
> > 
> > Only noises in the results.
> > 
> I thought you previously said that running netperf gave you a stastically
> significant performance boost when you added prefetching:
> http://marc.info/?l=linux-kernel&m=138178914124863&w=2
> 
> But perhaps I missed a note somewhere.
> 
> > It seems it depends on cpus and/or multiple factors.
> > 
> > Last machine I used for the tests had :
> > 
> > processor	: 23
> > vendor_id	: GenuineIntel
> > cpu family	: 6
> > model		: 44
> > model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> > stepping	: 2
> > microcode	: 0x13
> > cpu MHz		: 2800.256
> > cache size	: 12288 KB
> > physical id	: 1
> > siblings	: 12
> > core id		: 10
> > cpu cores	: 6
> > 
> > 
> > 
> > 
> 
> Thats about what I'm running with:
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 44
> model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
> stepping        : 2
> microcode       : 0x13
> cpu MHz         : 1600.000
> cache size      : 12288 KB
> physical id     : 0
> siblings        : 8
> core id         : 0
> cpu cores       : 4
> 
> 
> I can't imagine what would cause the discrepancy in our results (a 
> 10% savings in execution time seems significant to me). My only 
> thought would be that possibly the alu's on your cpu are faster 
> than mine, and reduce the speedup obtained by preforming operation 
> in parallel, though I can't imagine thats the case with these 
> processors being so closely matched.

You keep ignoring my request to calculate and account for noise of 
the measurement.

For example you are talking about a 0.8% prefetch effect while the 
noise in the results is obviously much larger than that, with a 
min/max distance of around 5%:

> > > 43476542
> > > 53927374

so the noise of 10 measurements would be around 5-10%. (back of the 
envelope calculation)

So you might be right in the end, but the posted data does not 
support your claims, statistically.

It's your responsibility to come up with convincing measurements and 
results, not of those who review your work.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-26 12:01                           ` Ingo Molnar
@ 2013-10-26 13:58                             ` Neil Horman
  2013-10-27  7:26                               ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-26 13:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > > 
> > > > 
> > > > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > > > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > > anywhere.
> > > 
> > > This csum_partial_opt() was a private implementation of csum_partial()
> > > so that I could load the module without rebooting the kernel ;)
> > > 
> > > > 
> > > > base results:
> > > > 53569916
> > > > 43506025
> > > > 43476542
> > > > 44048436
> > > > 45048042
> > > > 48550429
> > > > 53925556
> > > > 53927374
> > > > 53489708
> > > > 53003915
> > > > 
> > > > AVG = 492 ns
> > > > 
> > > > prefetching only:
> > > > 53279213
> > > > 45518140
> > > > 49585388
> > > > 53176179
> > > > 44071822
> > > > 43588822
> > > > 44086546
> > > > 47507065
> > > > 53646812
> > > > 54469118
> > > > 
> > > > AVG = 488 ns
> > > > 
> > > > 
> > > > parallel alu's only:
> > > > 46226844
> > > > 44458101
> > > > 46803498
> > > > 45060002
> > > > 46187624
> > > > 37542946
> > > > 45632866
> > > > 46275249
> > > > 45031141
> > > > 46281204
> > > > 
> > > > AVG = 449 ns
> > > > 
> > > > 
> > > > both optimizations:
> > > > 45708837
> > > > 45631124
> > > > 45697135
> > > > 45647011
> > > > 45036679
> > > > 39418544
> > > > 44481577
> > > > 46820868
> > > > 44496471
> > > > 35523928
> > > > 
> > > > AVG = 438 ns
> > > > 
> > > > 
> > > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > > > the best savings with both optimizations (54 ns, or 10.9%).  
> > > > 
> > > > These results, while they've changed as we've modified the test case slightly
> > > > have remained consistent in their sppedup ordinality.  Prefetching helps, but
> > > > not as much as using multiple alu's, and neither is as good as doing both
> > > > together.
> > > > 
> > > > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > > > do both.
> > > > 
> > > 
> > > Well, I only said (or maybe I forgot), that on my machines, I got no
> > > improvements at all with the multiple alu or the prefetch. (I tried
> > > different strides)
> > > 
> > > Only noises in the results.
> > > 
> > I thought you previously said that running netperf gave you a stastically
> > significant performance boost when you added prefetching:
> > http://marc.info/?l=linux-kernel&m=138178914124863&w=2
> > 
> > But perhaps I missed a note somewhere.
> > 
> > > It seems it depends on cpus and/or multiple factors.
> > > 
> > > Last machine I used for the tests had :
> > > 
> > > processor	: 23
> > > vendor_id	: GenuineIntel
> > > cpu family	: 6
> > > model		: 44
> > > model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> > > stepping	: 2
> > > microcode	: 0x13
> > > cpu MHz		: 2800.256
> > > cache size	: 12288 KB
> > > physical id	: 1
> > > siblings	: 12
> > > core id		: 10
> > > cpu cores	: 6
> > > 
> > > 
> > > 
> > > 
> > 
> > Thats about what I'm running with:
> > processor       : 0
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 44
> > model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
> > stepping        : 2
> > microcode       : 0x13
> > cpu MHz         : 1600.000
> > cache size      : 12288 KB
> > physical id     : 0
> > siblings        : 8
> > core id         : 0
> > cpu cores       : 4
> > 
> > 
> > I can't imagine what would cause the discrepancy in our results (a 
> > 10% savings in execution time seems significant to me). My only 
> > thought would be that possibly the alu's on your cpu are faster 
> > than mine, and reduce the speedup obtained by preforming operation 
> > in parallel, though I can't imagine thats the case with these 
> > processors being so closely matched.
> 
> You keep ignoring my request to calculate and account for noise of 
> the measurement.
> 
Don't confuse "ignoring" with "haven't gotten there yet".  Sometimes we all have
to wait, Ingo.  I'm working on it now, but I hit a snag on the machine I'm
working with and am trying to figure it out now.

> For example you are talking about a 0.8% prefetch effect while the 
> noise in the results is obviously much larger than that, with a 
> min/max distance of around 5%:
> 
> > > > 43476542
> > > > 53927374
> 
> so the noise of 10 measurements would be around 5-10%. (back of the 
> envelope calculation)
> 
> So you might be right in the end, but the posted data does not 
> support your claims, statistically.
> 
> It's your responsibility to come up with convincing measurements and 
> results, not of those who review your work.
> 
Be patient, I'm getting there

Thanks
Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-26 13:58                             ` Neil Horman
@ 2013-10-27  7:26                               ` Ingo Molnar
  2013-10-27 17:05                                 ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-27  7:26 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > You keep ignoring my request to calculate and account for noise of the 
> > measurement.
> 
> Don't confuse "ignoring" with "haven't gotten there yet".  [...]

So, instead of replying to my repeated feedback with a single line mail 
that you plan to address it, you repeated the same measurement mistakes 
again and again, posting invalid results, and forced me to spend time to 
repeat this same argument 2-3 times?

> [...] Sometimes we all have to wait, Ingo. [...]

I'm making bog standard technical requests to which you've not replied in 
substance, there's no need for the patronizing tone really.

Anyway, to simplify the workflow I'm NAK-ing it all until it's done 
convincingly.

  NAKed-by: Ingo Molnar <mingo@kernel.org>

I'll lift the NAK once my technical concerns and questions are resolved.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-27  7:26                               ` Ingo Molnar
@ 2013-10-27 17:05                                 ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-27 17:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > > You keep ignoring my request to calculate and account for noise of the 
> > > measurement.
> > 
> > Don't confuse "ignoring" with "haven't gotten there yet".  [...]
> 
> So, instead of replying to my repeated feedback with a single line mail 
> that you plan to address it, you repeated the same measurement mistakes 
> again and again, posting invalid results, and forced me to spend time to 
> repeat this same argument 2-3 times?
> 
No one forced you to do anything, Ingo.  I was finishing a valid line of
discussion with Eric prior to addressing your questions, while handling several
other unrelated issues (and related issues with my test system) that cropped up
in parallel.

> > [...] Sometimes we all have to wait, Ingo. [...]
> 
> I'm making bog standard technical requests to which you've not replied in 
> substance, there's no need for the patronizing tone really.
> 
No one said they weren't easy to do, Ingo, I said I was getting to your request.
And now I am.  I'll be running the tests tomorrow.

> Anyway, to simplify the workflow I'm NAK-ing it all until it's done 
> convincingly.
> 
>   NAKed-by: Ingo Molnar <mingo@kernel.org>
> 
> I'll lift the NAK once my technical concerns and questions are resolved.
> 
Ok, if that helps you wait.  I'll have your test results in the next day or so.

Thanks
Neil

> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  8:41           ` Ingo Molnar
  2013-10-17 18:19             ` H. Peter Anvin
@ 2013-10-28 16:01             ` Neil Horman
  2013-10-28 16:20               ` Ingo Molnar
  2013-10-28 16:24               ` Ingo Molnar
  1 sibling, 2 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-28 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev



Ingo, et al.-
	Ok, sorry for the delay, here are the test results you've been asking
for.


First, some information about what I did.  I attached the module that I ran this
test with at the bottom of this email.  You'll note that I started using a
module parameter write path to trigger the csum rather than the module load
path.  The latter seemed to be giving me lots of variance in my run times, which
I wanted to eliminate.  I attributed it to the module load mechanism itself, and
by using the parameter write path, I was able to get more consistent results.

First, the run time tests:

I ran this command:
for i in `seq 0 1 3`
do
	 echo $i > /sys/module/csum_test/parameters/module_test_mode
	 perf stat --repeat 20 --null bash -c "echo 1 > /sys/module/csum_test/parameters/test_fire"
done

The for loop allows me to change the module_test_mode, which is tied to a switch
statement in do_csum that selects which checksumming method we use
(base/prefetch/parallel ALU/both).  The results are:


Base:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.093269042 seconds time elapsed                                          ( +-  2.24% )

Prefetch (5x64):
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.079440009 seconds time elapsed                                          ( +-  2.29% )

Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.087666677 seconds time elapsed                                          ( +-  4.01% )

Prefetch + Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.080758702 seconds time elapsed                                          ( +-  2.34% )

So we can see here that we get about a 13% reduction in runtime between the base
and the both (Prefetch + Parallel ALU) case, with prefetch accounting for most of
that reduction.

Looking at the specific cpu counters we get this:


Base:
     Total time: 0.179 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
            14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
             2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
            75,402 page-faults               #    0.048 M/sec                    ( +-  0.07% )
     1,597,349,326 cycles                    #    1.017 GHz                      ( +-  1.74% ) [40.51%]
       104,882,858 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.25% ) [40.33%]
     1,043,429,984 stalled-cycles-backend    #   65.32% backend  cycles idle     ( +-  1.25% ) [39.73%]
       868,372,132 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.43% ) [39.88%]
       161,143,820 branches                  #  102.554 M/sec                    ( +-  1.49% ) [39.76%]
         4,348,075 branch-misses             #    2.70% of all branches          ( +-  1.43% ) [39.99%]
       457,042,576 L1-dcache-loads           #  290.868 M/sec                    ( +-  1.25% ) [40.63%]
         8,928,240 L1-dcache-load-misses     #    1.95% of all L1-dcache hits    ( +-  1.26% ) [41.17%]
        15,821,051 LLC-loads                 #   10.069 M/sec                    ( +-  1.56% ) [41.20%]
         4,902,576 LLC-load-misses           #   30.99% of all LL-cache hits     ( +-  1.51% ) [41.36%]
       235,775,688 L1-icache-loads           #  150.051 M/sec                    ( +-  1.39% ) [41.10%]
         3,116,106 L1-icache-load-misses     #    1.32% of all L1-icache hits    ( +-  3.43% ) [40.96%]
       461,315,416 dTLB-loads                #  293.588 M/sec                    ( +-  1.43% ) [41.18%]
           140,280 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.30% ) [40.96%]
       236,127,031 iTLB-loads                #  150.275 M/sec                    ( +-  1.63% ) [41.43%]
            46,173 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.40% ) [41.11%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.82%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.37%]

       0.301414024 seconds time elapsed                                          ( +-  0.47% )

Prefetch (5x64):
     Total time: 0.172 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1565.797128 task-clock                #    5.238 CPUs utilized            ( +-  0.46% )
            13,845 context-switches          #    0.009 M/sec                    ( +-  4.20% )
             2,624 cpu-migrations            #    0.002 M/sec                    ( +-  2.72% )
            75,452 page-faults               #    0.048 M/sec                    ( +-  0.08% )
     1,642,106,355 cycles                    #    1.049 GHz                      ( +-  1.33% ) [40.17%]
       107,786,666 stalled-cycles-frontend   #    6.56% frontend cycles idle     ( +-  1.37% ) [39.90%]
     1,065,286,880 stalled-cycles-backend    #   64.87% backend  cycles idle     ( +-  1.59% ) [39.14%]
       888,815,001 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.29% ) [38.92%]
       163,106,907 branches                  #  104.169 M/sec                    ( +-  1.32% ) [38.93%]
         4,333,456 branch-misses             #    2.66% of all branches          ( +-  1.94% ) [39.77%]
       459,779,806 L1-dcache-loads           #  293.639 M/sec                    ( +-  1.60% ) [40.23%]
         8,827,680 L1-dcache-load-misses     #    1.92% of all L1-dcache hits    ( +-  1.77% ) [41.38%]
        15,556,816 LLC-loads                 #    9.935 M/sec                    ( +-  1.76% ) [41.16%]
         4,885,618 LLC-load-misses           #   31.40% of all LL-cache hits     ( +-  1.40% ) [40.84%]
       236,131,778 L1-icache-loads           #  150.806 M/sec                    ( +-  1.32% ) [40.59%]
         3,037,537 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.23% ) [41.13%]
       454,835,028 dTLB-loads                #  290.481 M/sec                    ( +-  1.23% ) [41.34%]
           139,907 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.18% ) [41.21%]
       236,357,655 iTLB-loads                #  150.950 M/sec                    ( +-  1.31% ) [41.29%]
            46,633 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.74% ) [40.67%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.16%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.09%]

       0.298948767 seconds time elapsed                                          ( +-  0.36% )

Here it appears everything between the two runs is about the same.  We reduced
the number of dcache misses by a small amount (0.03 percentage points), which is
nice, but I'm not sure that would account for the speedup we see in the run time.

Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1553.544876 task-clock                #    5.217 CPUs utilized            ( +-  0.42% )
            14,066 context-switches          #    0.009 M/sec                    ( +-  6.24% )
             2,831 cpu-migrations            #    0.002 M/sec                    ( +-  3.33% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.08% )
     1,659,509,743 cycles                    #    1.068 GHz                      ( +-  1.27% ) [40.10%]
       106,466,680 stalled-cycles-frontend   #    6.42% frontend cycles idle     ( +-  1.50% ) [39.98%]
     1,035,481,957 stalled-cycles-backend    #   62.40% backend  cycles idle     ( +-  1.23% ) [39.38%]
       875,104,201 instructions              #    0.53  insns per cycle        
                                             #    1.18  stalled cycles per insn  ( +-  1.30% ) [38.66%]
       160,553,275 branches                  #  103.346 M/sec                    ( +-  1.32% ) [38.85%]
         4,329,119 branch-misses             #    2.70% of all branches          ( +-  1.39% ) [39.59%]
       448,195,116 L1-dcache-loads           #  288.498 M/sec                    ( +-  1.91% ) [41.07%]
         8,632,347 L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  1.90% ) [41.56%]
        15,143,145 LLC-loads                 #    9.747 M/sec                    ( +-  1.89% ) [41.05%]
         4,698,204 LLC-load-misses           #   31.03% of all LL-cache hits     ( +-  1.03% ) [41.23%]
       224,316,468 L1-icache-loads           #  144.390 M/sec                    ( +-  1.27% ) [41.39%]
         2,902,842 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.65% ) [42.60%]
       433,914,588 dTLB-loads                #  279.306 M/sec                    ( +-  1.75% ) [43.07%]
           132,090 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.15% ) [43.12%]
       230,701,361 iTLB-loads                #  148.500 M/sec                    ( +-  1.77% ) [43.47%]
            45,562 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.76% ) [42.88%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [42.29%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [41.32%]

       0.297758185 seconds time elapsed                                          ( +-  0.40% )

Here it seems the major advantage was backend stall cycles saved (which makes
sense to me).  Since we split the instruction path into two chains that could run
independently of each other, we spent less time waiting for prior instructions to
retire.  As a result we dropped nearly three percentage points in our backend
stall number.

Prefetch + Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1549.171283 task-clock                #    5.231 CPUs utilized            ( +-  0.50% )
            13,717 context-switches          #    0.009 M/sec                    ( +-  4.32% )
             2,721 cpu-migrations            #    0.002 M/sec                    ( +-  2.47% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.07% )
     1,579,140,244 cycles                    #    1.019 GHz                      ( +-  1.71% ) [40.06%]
       103,803,034 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.74% ) [39.60%]
     1,016,582,613 stalled-cycles-backend    #   64.38% backend  cycles idle     ( +-  1.79% ) [39.57%]
       881,036,653 instructions              #    0.56  insns per cycle        
                                             #    1.15  stalled cycles per insn  ( +-  1.61% ) [39.29%]
       164,333,010 branches                  #  106.078 M/sec                    ( +-  1.51% ) [39.38%]
         4,385,459 branch-misses             #    2.67% of all branches          ( +-  1.62% ) [40.29%]
       463,987,526 L1-dcache-loads           #  299.507 M/sec                    ( +-  1.52% ) [40.20%]
         8,739,535 L1-dcache-load-misses     #    1.88% of all L1-dcache hits    ( +-  1.95% ) [40.37%]
        15,318,497 LLC-loads                 #    9.888 M/sec                    ( +-  1.80% ) [40.43%]
         4,846,148 LLC-load-misses           #   31.64% of all LL-cache hits     ( +-  1.68% ) [40.59%]
       231,982,874 L1-icache-loads           #  149.746 M/sec                    ( +-  1.43% ) [41.25%]
         3,141,106 L1-icache-load-misses     #    1.35% of all L1-icache hits    ( +-  2.32% ) [41.76%]
       459,688,615 dTLB-loads                #  296.732 M/sec                    ( +-  1.75% ) [41.87%]
           138,667 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  1.97% ) [42.31%]
       235,629,204 iTLB-loads                #  152.100 M/sec                    ( +-  1.40% ) [42.04%]
            46,038 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.75% ) [41.20%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.77%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.27%]

       0.296173305 seconds time elapsed                                          ( +-  0.44% )
Here, with both optimizations, we've reduced both our backend stall cycles and
our dcache miss rate (though our load misses are higher than they were when we
were just doing parallel ALU execution).  I wonder if the separation of the adcx
path is leading to multiple load requests before the prefetch completes.  I'll
try varying the prefetch stride a bit more to see if I can get some more insight
there.
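
A minimal sketch of what a tunable stride could look like, an assumption
rather than the code used for these runs: prefetch a fixed number of bytes
ahead of the 64-byte summing block so the distance can be changed in one
place.

#include <linux/prefetch.h>
#include <linux/types.h>

#define CSUM_PREFETCH_STRIDE	(5 * 64)	/* the "5x64" case above; tunable */

static u64 sum64_prefetch(const unsigned char *buff, unsigned long count64,
			  u64 result)
{
	int i;

	while (count64) {
		prefetch(buff + CSUM_PREFETCH_STRIDE);
		/* stand-in for the existing 8x add/adc block over buff[0..63];
		 * plain additions here, so carries are not folded */
		for (i = 0; i < 8; i++)
			result += ((const u64 *)buff)[i];
		buff += 64;
		count64--;
	}
	return result;
}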

So there you have it.  I think, looking at this, I can say that it's not as big a
win as my initial measurements were indicating, but still a win.

Thoughts?

Regards
Neil

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

#define BUFSIZ 2*1024*1024
#define NBPAGES 16

extern int csum_mode;
int module_test_mode = 0;
int test_fire = 0;

static int __init csum_init_module(void)
{
        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}

static int set_param_str(const char *val, const struct kernel_param *kp)
{
        int i;
        __wsum sum = 0;
        /*u64 start, end;*/
        void *base, *addrs[NBPAGES];
        u32 rnd, offset;

	
        memset(addrs, 0, sizeof(addrs));
        for (i = 0; i < NBPAGES; i++) {
                addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
                if (!addrs[i])
                        goto out;
        }

	csum_mode = module_test_mode;

	local_bh_disable();
        /*pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());*/
        /*start = ktime_to_ns(ktime_get());*/

        for (i = 0; i < 100000; i++) {
                rnd = prandom_u32();
                base = addrs[rnd % NBPAGES];
                rnd /= NBPAGES;
                offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial(base + offset, 1500, sum);
        }
        /*end = ktime_to_ns(ktime_get());*/
        local_bh_enable();

	/*pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);*/

	csum_mode = 0;
out:
        for (i = 0; i < NBPAGES; i++)
                kfree(addrs[i]);

        return 0;
}

static int get_param_str(char *buffer, const struct kernel_param *kp)
{
	return sprintf(buffer, "%d\n", test_fire);
}

static struct kernel_param_ops param_ops_str = {
	.set = set_param_str,
	.get = get_param_str,
};

module_param_named(module_test_mode, module_test_mode, int, 0644);
MODULE_PARM_DESC(module_test_mode, "csum test mode");
module_param_cb(test_fire, &param_ops_str, &test_fire, 0644);
module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");
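
For context, the csum_mode variable the module pokes is assumed to hook into
the patched do_csum() roughly as below; the helper names are hypothetical
placeholders for the four variants being compared, not the posted patch.

#include <linux/export.h>

unsigned do_csum_base(const unsigned char *buff, unsigned len);
unsigned do_csum_prefetch(const unsigned char *buff, unsigned len);
unsigned do_csum_parallel_alu(const unsigned char *buff, unsigned len);
unsigned do_csum_prefetch_parallel_alu(const unsigned char *buff, unsigned len);

int csum_mode;			/* 0=base, 1=prefetch, 2=parallel ALU, 3=both */
EXPORT_SYMBOL(csum_mode);

static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	switch (csum_mode) {
	case 1:
		return do_csum_prefetch(buff, len);
	case 2:
		return do_csum_parallel_alu(buff, len);
	case 3:
		return do_csum_prefetch_parallel_alu(buff, len);
	default:
		return do_csum_base(buff, len);
	}
}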

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:01             ` Neil Horman
@ 2013-10-28 16:20               ` Ingo Molnar
  2013-10-28 17:49                 ` Neil Horman
  2013-10-28 16:24               ` Ingo Molnar
  1 sibling, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-28 16:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Base:
>        0.093269042 seconds time elapsed                                          ( +-  2.24% )
> Prefetch (5x64):
>        0.079440009 seconds time elapsed                                          ( +-  2.29% )
> Parallel ALU:
>        0.087666677 seconds time elapsed                                          ( +-  4.01% )
> Prefetch + Parallel ALU:
>        0.080758702 seconds time elapsed                                          ( +-  2.34% )
> 
> So we can see here that we get about a 13% reduction in runtime between 
> the base and the both (Prefetch + Parallel ALU) case, with prefetch 
> accounting for most of that reduction.

Hm, there's still something strange about these results. So the 
range of the results is 790-930 nsecs. The noise of the measurements 
is 2%-4%, i.e. 20-40 nsecs.

The prefetch-only result itself is the fastest of all - 
statistically equivalent to the prefetch+parallel-ALU result, within 
the noise range.

So if prefetch is enabled, turning on parallel-ALU has no measurable 
effect - which is counter-intuitive. Do you have a 
theory/explanation for that?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:01             ` Neil Horman
  2013-10-28 16:20               ` Ingo Molnar
@ 2013-10-28 16:24               ` Ingo Molnar
  2013-10-28 16:49                 ` David Ahern
  2013-10-28 17:46                 ` Neil Horman
  1 sibling, 2 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-10-28 16:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Looking at the specific cpu counters we get this:
> 
> Base:
>      Total time: 0.179 [sec]
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
>        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
>             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
>              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )

Hm, for this second round of measurements were you using 'perf stat 
-a -C ...'?

The most accurate method of measurement for such single-threaded 
workloads is something like:

	taskset 0x1 perf stat -a -C 1 --repeat 20 ...

this will bind your workload to CPU#0, and will do PMU measurements 
only there - without mixing in other CPUs or workloads.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:24               ` Ingo Molnar
@ 2013-10-28 16:49                 ` David Ahern
  2013-10-28 17:46                 ` Neil Horman
  1 sibling, 0 replies; 105+ messages in thread
From: David Ahern @ 2013-10-28 16:49 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On 10/28/13 10:24 AM, Ingo Molnar wrote:

> The most accurate method of measurement for such single-threaded
> workloads is something like:
>
> 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
>
> this will bind your workload to CPU#0, and will do PMU measurements
> only there - without mixing in other CPUs or workloads.

You can drop the -a if you only want a specific CPU (-C arg). And -C in 
perf takes a cpu number starting at 0, so in your example above -C 1 means 
cpu1, not cpu0.

David



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:24               ` Ingo Molnar
  2013-10-28 16:49                 ` David Ahern
@ 2013-10-28 17:46                 ` Neil Horman
  2013-10-28 18:29                   ` Neil Horman
  1 sibling, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-28 17:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Looking at the specific cpu counters we get this:
> > 
> > Base:
> >      Total time: 0.179 [sec]
> > 
> >  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > 
> >        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
> >             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
> >              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
> 
> Hm, for these second round of measurements were you using 'perf stat 
> -a -C ...'?
> 
> The most accurate method of measurement for such single-threaded 
> workloads is something like:
> 
> 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> 
> this will bind your workload to CPU#0, and will do PMU measurements 
> only there - without mixing in other CPUs or workloads.
> 
> Thanks,
> 
> 	Ingo
I wasn't, but I will...
Neil

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:20               ` Ingo Molnar
@ 2013-10-28 17:49                 ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-28 17:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Base:
> >        0.093269042 seconds time elapsed                                          ( +-  2.24% )
> > Prefetch (5x64):
> >        0.079440009 seconds time elapsed                                          ( +-  2.29% )
> > Parallel ALU:
> >        0.087666677 seconds time elapsed                                          ( +-  4.01% )
> > Prefetch + Parallel ALU:
> >        0.080758702 seconds time elapsed                                          ( +-  2.34% )
> > 
> > So we can see here that we get about a 13% reduction in runtime between 
> > the base and the both (Prefetch + Parallel ALU) case, with prefetch 
> > accounting for most of that reduction.
> 
> Hm, there's still something strange about these results. So the 
> range of the results is 790-930 nsecs. The noise of the measurements 
> is 2%-4%, i.e. 20-40 nsecs.
> 
> The prefetch-only result itself is the fastest of all - 
> statistically equivalent to the prefetch+parallel-ALU result, within 
> the noise range.
> 
> So if prefetch is enabled, turning on parallel-ALU has no measurable 
> effect - which is counter-intuitive. Do you have an 
> theory/explanation for that?
> 
> Thanks,
I mentioned it farther down, loosely theorizing that running with parallel
ALUs in conjunction with a prefetch puts more pressure on the load/store unit,
causing stalls while both ALUs wait for the L1 cache to fill.  Not sure if that
makes sense, but I did note that in the both (prefetch+ALU) case our data cache
hit rate was somewhat degraded, so I was going to play with the prefetch stride
to see if that fixed the situation.  Regardless, I agree the lack of improvement
in the both case is definitely counter-intuitive.

Neil

> 
> 	Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 17:46                 ` Neil Horman
@ 2013-10-28 18:29                   ` Neil Horman
  2013-10-29  8:25                     ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-28 18:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote:
> On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Looking at the specific cpu counters we get this:
> > > 
> > > Base:
> > >      Total time: 0.179 [sec]
> > > 
> > >  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > > 
> > >        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
> > >             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
> > >              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
> > 
> > Hm, for this second round of measurements were you using 'perf stat 
> > -a -C ...'?
> > 
> > The most accurate method of measurement for such single-threaded 
> > workloads is something like:
> > 
> > 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> > 
> > this will bind your workload to CPU#0, and will do PMU measurements 
> > only there - without mixing in other CPUs or workloads.
> > 
> > Thanks,
> > 
> > 	Ingo
> I wasn't, but I will...
> Neil
> 
> > --

Here's my data for running the same test with taskset restricting execution to
only cpu0.  I'm not quite sure what's going on here, but doing so resulted in a
10x slowdown of the runtime of each iteration which I can't explain.  As before,
however, both the parallel ALU run and the prefetch run resulted in speedups,
but the two together were not in any way additive.  I'm going to keep playing
with the prefetch stride, unless you have an alternate theory.

Regards
Neil


Base:
     Total time: 1.013 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1140.286043 task-clock                #    1.001 CPUs utilized            ( +-  0.65% ) [100.00%]
            48,779 context-switches          #    0.043 M/sec                    ( +- 10.08% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,398 page-faults               #    0.066 M/sec                    ( +-  0.05% )
     2,950,225,491 cycles                    #    2.587 GHz                      ( +-  0.65% ) [16.63%]
       263,349,439 stalled-cycles-frontend   #    8.93% frontend cycles idle     ( +-  1.87% ) [16.70%]
     1,615,723,017 stalled-cycles-backend    #   54.77% backend  cycles idle     ( +-  0.64% ) [16.76%]
     2,168,440,946 instructions              #    0.74  insns per cycle        
                                             #    0.75  stalled cycles per insn  ( +-  0.52% ) [16.76%]
       406,885,149 branches                  #  356.827 M/sec                    ( +-  0.61% ) [16.74%]
        10,099,789 branch-misses             #    2.48% of all branches          ( +-  0.73% ) [16.73%]
     1,138,829,982 L1-dcache-loads           #  998.723 M/sec                    ( +-  0.57% ) [16.71%]
        21,341,094 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  1.22% ) [16.69%]
        38,453,870 LLC-loads                 #   33.723 M/sec                    ( +-  1.46% ) [16.67%]
         9,587,987 LLC-load-misses           #   24.93% of all LL-cache hits     ( +-  0.48% ) [16.66%]
       566,241,820 L1-icache-loads           #  496.579 M/sec                    ( +-  0.70% ) [16.65%]
         9,061,979 L1-icache-load-misses     #    1.60% of all L1-icache hits    ( +-  3.39% ) [16.65%]
     1,130,620,555 dTLB-loads                #  991.524 M/sec                    ( +-  0.64% ) [16.64%]
           423,302 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.89% ) [16.63%]
       563,371,089 iTLB-loads                #  494.061 M/sec                    ( +-  0.62% ) [16.62%]
           215,406 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.97% ) [16.60%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.59%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.58%]

       1.139598762 seconds time elapsed                                          ( +-  0.65% )

Prefetch:
     Total time: 0.981 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1128.603117 task-clock                #    1.001 CPUs utilized            ( +-  0.66% ) [100.00%]
            45,992 context-switches          #    0.041 M/sec                    ( +-  9.47% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,428 page-faults               #    0.067 M/sec                    ( +-  0.06% )
     2,920,666,228 cycles                    #    2.588 GHz                      ( +-  0.66% ) [16.59%]
       255,998,006 stalled-cycles-frontend   #    8.77% frontend cycles idle     ( +-  1.78% ) [16.67%]
     1,601,090,475 stalled-cycles-backend    #   54.82% backend  cycles idle     ( +-  0.69% ) [16.75%]
     2,164,301,312 instructions              #    0.74  insns per cycle        
                                             #    0.74  stalled cycles per insn  ( +-  0.59% ) [16.78%]
       404,920,928 branches                  #  358.781 M/sec                    ( +-  0.54% ) [16.77%]
        10,025,146 branch-misses             #    2.48% of all branches          ( +-  0.66% ) [16.75%]
     1,133,764,674 L1-dcache-loads           # 1004.573 M/sec                    ( +-  0.47% ) [16.74%]
        21,251,432 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  1.01% ) [16.72%]
        38,006,432 LLC-loads                 #   33.676 M/sec                    ( +-  1.56% ) [16.70%]
         9,625,034 LLC-load-misses           #   25.32% of all LL-cache hits     ( +-  0.40% ) [16.68%]
       565,712,289 L1-icache-loads           #  501.250 M/sec                    ( +-  0.57% ) [16.66%]
         8,726,826 L1-icache-load-misses     #    1.54% of all L1-icache hits    ( +-  3.40% ) [16.64%]
     1,130,140,463 dTLB-loads                # 1001.362 M/sec                    ( +-  0.53% ) [16.63%]
           419,645 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.44% ) [16.62%]
       560,199,307 iTLB-loads                #  496.365 M/sec                    ( +-  0.51% ) [16.61%]
           213,413 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.65% ) [16.59%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.56%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.54%]

       1.127934534 seconds time elapsed                                          ( +-  0.66% )


Parallel ALU:
     Total time: 0.986 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1131.914738 task-clock                #    1.001 CPUs utilized            ( +-  0.49% ) [100.00%]
            40,807 context-switches          #    0.036 M/sec                    ( +- 10.72% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% ) [100.00%]
            75,329 page-faults               #    0.067 M/sec                    ( +-  0.04% )
     2,929,149,996 cycles                    #    2.588 GHz                      ( +-  0.49% ) [16.58%]
       250,428,558 stalled-cycles-frontend   #    8.55% frontend cycles idle     ( +-  1.75% ) [16.66%]
     1,621,074,968 stalled-cycles-backend    #   55.34% backend  cycles idle     ( +-  0.46% ) [16.73%]
     2,147,405,781 instructions              #    0.73  insns per cycle        
                                             #    0.75  stalled cycles per insn  ( +-  0.56% ) [16.77%]
       401,196,771 branches                  #  354.441 M/sec                    ( +-  0.58% ) [16.76%]
         9,941,701 branch-misses             #    2.48% of all branches          ( +-  0.67% ) [16.74%]
     1,126,651,774 L1-dcache-loads           #  995.350 M/sec                    ( +-  0.50% ) [16.73%]
        21,075,294 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  0.96% ) [16.72%]
        37,885,850 LLC-loads                 #   33.471 M/sec                    ( +-  1.10% ) [16.71%]
         9,729,116 LLC-load-misses           #   25.68% of all LL-cache hits     ( +-  0.62% ) [16.69%]
       562,058,495 L1-icache-loads           #  496.556 M/sec                    ( +-  0.54% ) [16.67%]
         8,617,450 L1-icache-load-misses     #    1.53% of all L1-icache hits    ( +-  3.06% ) [16.65%]
     1,121,765,737 dTLB-loads                #  991.034 M/sec                    ( +-  0.57% ) [16.63%]
           388,875 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  4.27% ) [16.62%]
       556,029,393 iTLB-loads                #  491.229 M/sec                    ( +-  0.64% ) [16.61%]
           189,181 iTLB-load-misses          #    0.03% of all iTLB cache hits   ( +-  6.98% ) [16.60%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.58%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.56%]

       1.131247174 seconds time elapsed                                          ( +-  0.49% )


Both:
     Total time: 0.993 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1130.912197 task-clock                #    1.001 CPUs utilized            ( +-  0.60% ) [100.00%]
            45,859 context-switches          #    0.041 M/sec                    ( +-  9.00% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,398 page-faults               #    0.067 M/sec                    ( +-  0.07% )
     2,926,527,048 cycles                    #    2.588 GHz                      ( +-  0.60% ) [16.60%]
       255,482,254 stalled-cycles-frontend   #    8.73% frontend cycles idle     ( +-  1.62% ) [16.67%]
     1,608,247,364 stalled-cycles-backend    #   54.95% backend  cycles idle     ( +-  0.73% ) [16.74%]
     2,162,135,903 instructions              #    0.74  insns per cycle        
                                             #    0.74  stalled cycles per insn  ( +-  0.46% ) [16.77%]
       403,436,790 branches                  #  356.736 M/sec                    ( +-  0.44% ) [16.76%]
        10,062,572 branch-misses             #    2.49% of all branches          ( +-  0.85% ) [16.75%]
     1,133,889,264 L1-dcache-loads           # 1002.632 M/sec                    ( +-  0.56% ) [16.74%]
        21,460,116 L1-dcache-load-misses     #    1.89% of all L1-dcache hits    ( +-  1.31% ) [16.73%]
        38,070,119 LLC-loads                 #   33.663 M/sec                    ( +-  1.63% ) [16.72%]
         9,593,162 LLC-load-misses           #   25.20% of all LL-cache hits     ( +-  0.42% ) [16.71%]
       562,867,188 L1-icache-loads           #  497.711 M/sec                    ( +-  0.59% ) [16.68%]
         8,472,343 L1-icache-load-misses     #    1.51% of all L1-icache hits    ( +-  3.02% ) [16.64%]
     1,126,997,403 dTLB-loads                #  996.538 M/sec                    ( +-  0.53% ) [16.61%]
           414,900 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.12% ) [16.60%]
       561,156,032 iTLB-loads                #  496.198 M/sec                    ( +-  0.56% ) [16.59%]
           212,482 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.10% ) [16.58%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.57%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.56%]

       1.130242195 seconds time elapsed                                          ( +-  0.60% )


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 18:29                   ` Neil Horman
@ 2013-10-29  8:25                     ` Ingo Molnar
  2013-10-29 11:20                       ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-29  8:25 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Here's my data for running the same test with taskset restricting 
> execution to only cpu0.  I'm not quite sure what's going on here, 
> but doing so resulted in a 10x slowdown of the runtime of each 
> iteration which I can't explain.  As before, however, both the 
> parallel ALU run and the prefetch run resulted in speedups, but 
> the two together were not in any way additive.  I'm going to keep 
> playing with the prefetch stride, unless you have an alternate 
> theory.

Could you please cite the exact command-line you used for running 
the test?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29  8:25                     ` Ingo Molnar
@ 2013-10-29 11:20                       ` Neil Horman
  2013-10-29 11:30                         ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-29 11:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Here's my data for running the same test with taskset restricting 
> > execution to only cpu0.  I'm not quite sure what's going on here, 
> > but doing so resulted in a 10x slowdown of the runtime of each 
> > iteration which I can't explain.  As before, however, both the 
> > parallel ALU run and the prefetch run resulted in speedups, but 
> > the two together were not in any way additive.  I'm going to keep 
> > playing with the prefetch stride, unless you have an alternate 
> > theory.
> 
> Could you please cite the exact command-line you used for running 
> the test?
> 
> Thanks,
> 
> 	Ingo
> 

Sure it was this:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
done >> counters.txt 2>&1

where test.sh is:
#!/bin/sh
echo 1 > /sys/module/csum_test/parameters/test_fire


As before, module_test_mode selects a case in a switch statement I added in
do_csum to test one of the 4 csum variants we've been discussing (base, prefetch,
parallel ALU or both), and test_fire is a callback trigger I use in the test
module to run 100000 iterations of a checksum operation.  As you requested, I
ran the above on cpu 0 (-C 0 on perf and -c 0 on taskset), and I removed all IRQ
affinity to cpu 0.
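
In case it helps to picture the harness, here is roughly the shape of that
hook -- a stripped-down sketch only: the parameter names match the sysfs paths
above, but the buffer size and the plain csum_partial() call are stand-ins for
the switch over the four variants, not the actual code that produced these
numbers.

/*
 * Sketch of a csum test module: module_test_mode picks the variant,
 * writing test_fire runs the iterations.  Buffer size and the direct
 * csum_partial() call are illustrative assumptions.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/vmalloc.h>
#include <linux/string.h>
#include <net/checksum.h>

static int module_test_mode;	/* 0=base 1=prefetch 2=parallel ALU 3=both */
module_param(module_test_mode, int, 0644);

#define TEST_BUF_LEN	(64 * 1024)	/* assumed buffer size */
#define TEST_ITERS	100000

static void *test_buf;

static int test_fire_set(const char *val, const struct kernel_param *kp)
{
	__wsum sum = 0;
	int i;

	/* the real do_csum() would switch on module_test_mode internally */
	for (i = 0; i < TEST_ITERS; i++)
		sum = csum_partial(test_buf, TEST_BUF_LEN, sum);

	pr_info("csum_test: mode %d, sum %08x\n",
		module_test_mode, (__force u32)sum);
	return 0;
}

static const struct kernel_param_ops test_fire_ops = {
	.set = test_fire_set,
};
module_param_cb(test_fire, &test_fire_ops, NULL, 0200);

static int __init csum_test_init(void)
{
	test_buf = vmalloc(TEST_BUF_LEN);
	if (!test_buf)
		return -ENOMEM;
	memset(test_buf, 0x5a, TEST_BUF_LEN);
	return 0;
}

static void __exit csum_test_exit(void)
{
	vfree(test_buf);
}

module_init(csum_test_init);
module_exit(csum_test_exit);
MODULE_LICENSE("GPL");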

Regards
Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:20                       ` Neil Horman
@ 2013-10-29 11:30                         ` Ingo Molnar
  2013-10-29 11:49                           ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-29 11:30 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Sure it was this:
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> done >> counters.txt 2>&1
> 
> where test.sh is:
> #!/bin/sh
> echo 1 > /sys/module/csum_test/parameters/test_fire

What does '-- /root/test.sh' do?

Unless I'm missing something, the line above will run:

  perf bench sched messaging -- /root/test.sh

which should be equivalent to:

  perf bench sched messaging

i.e. /root/test.sh won't be run.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:30                         ` Ingo Molnar
@ 2013-10-29 11:49                           ` Neil Horman
  2013-10-29 12:52                             ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-29 11:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sure it was this:
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > done >> counters.txt 2>&1
> > 
> > where test.sh is:
> > #!/bin/sh
> > echo 1 > /sys/module/csum_test/parameters/test_fire
> 
> What does '-- /root/test.sh' do?
> 
> Unless I'm missing something, the line above will run:
> 
>   perf bench sched messaging -- /root/test.sh
> 
> which should be equivalent to:
> 
>   perf bench sched messaging
> 
> i.e. /root/test.sh won't be run.
> 
According to the perf man page, I'm supposed to be able to use -- to separate
perf command line parameters from the command I want to run.  And it definitely
executed test.sh, I added an echo to stdout in there as a test run and observed
them get captured in counters.txt

Neil

> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:49                           ` Neil Horman
@ 2013-10-29 12:52                             ` Ingo Molnar
  2013-10-29 13:07                               ` Neil Horman
  2013-10-29 14:12                               ` David Ahern
  0 siblings, 2 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-10-29 12:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Sure it was this:
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > done >> counters.txt 2>&1
> > > 
> > > where test.sh is:
> > > #!/bin/sh
> > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > 
> > What does '-- /root/test.sh' do?
> > 
> > Unless I'm missing something, the line above will run:
> > 
> >   perf bench sched messaging -- /root/test.sh
> > 
> > which should be equivalent to:
> > 
> >   perf bench sched messaging
> > 
> > i.e. /root/test.sh won't be run.
> 
> According to the perf man page, I'm supposed to be able to use -- 
> to separate perf command line parameters from the command I want 
> to run.  And it definitely executed test.sh, I added an echo to 
> stdout in there as a test run and observed them get captured in 
> counters.txt

Well, '--' can be used to delineate the command portion for cases 
where it's ambiguous.

Here it's unambiguous though. This:

  perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh

stops parsing a valid option after the -ddd option, so in theory it 
should execute 'perf bench sched messaging -- /root/test.sh' where 
'-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
ignored.

The message output you provided seems to suggest that to be the 
case:

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

See how the command executed by perf stat was 'perf bench ...'.

Did you want to run:

  perf stat --repeat 20 -C 0 -ddd /root/test.sh

?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 12:52                             ` Ingo Molnar
@ 2013-10-29 13:07                               ` Neil Horman
  2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 14:12                               ` David Ahern
  1 sibling, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-29 13:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > Sure it was this:
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > > done >> counters.txt 2>&1
> > > > 
> > > > where test.sh is:
> > > > #!/bin/sh
> > > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > > 
> > > What does '-- /root/test.sh' do?
> > > 
> > > Unless I'm missing something, the line above will run:
> > > 
> > >   perf bench sched messaging -- /root/test.sh
> > > 
> > > which should be equivalent to:
> > > 
> > >   perf bench sched messaging
> > > 
> > > i.e. /root/test.sh won't be run.
> > 
> > According to the perf man page, I'm supposed to be able to use -- 
> > to separate perf command line parameters from the command I want 
> > to run.  And it definitely executed test.sh, I added an echo to 
> > stdout in there as a test run and observed them get captured in 
> > counters.txt
> 
> Well, '--' can be used to delineate the command portion for cases 
> where it's ambiguous.
> 
> Here it's unambiguous though. This:
> 
>   perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> 
> stops parsing a valid option after the -ddd option, so in theory it 
> should execute 'perf bench sched messaging -- /root/test.sh' where 
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
> ignored.
> 
> The message output you provided seems to suggest that to be the 
> case:
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
> See how the command executed by perf stat was 'perf bench ...'.
> 
> Did you want to run:
> 
>   perf stat --repeat 20 -C 0 -ddd /root/test.sh
> 
I'm sure it worked properly on my system here, I specifically checked it, but
I'll gladly run it again.  You have to give me an hour as I have a meeting to
run to, but I'll have results shortly.
Neil

> ?
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:07                               ` Neil Horman
@ 2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 13:20                                   ` Neil Horman
  2013-10-29 14:17                                   ` Neil Horman
  0 siblings, 2 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-10-29 13:11 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> I'm sure it worked properly on my system here, I specifically 
> checked it, but I'll gladly run it again.  You have to give me an 
> hour as I have a meeting to run to, but I'll have results shortly.

So what I tried to react to was this observation of yours:

> > > Here's my data for running the same test with taskset 
> > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > going on here, but doing so resulted in a 10x slowdown of the 
> > > runtime of each iteration which I can't explain. [...]

A 10x slowdown would be consistent with not running your testcase 
but 'perf bench sched messaging' by accident, or so.

But I was really just guessing wildly here.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:11                                 ` Ingo Molnar
@ 2013-10-29 13:20                                   ` Neil Horman
  2013-10-29 14:17                                   ` Neil Horman
  1 sibling, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-10-29 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specifically 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > > Here's my data for running the same test with taskset 
> > > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 
Ok, well, I'll run it again in just a bit here.
Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 12:52                             ` Ingo Molnar
  2013-10-29 13:07                               ` Neil Horman
@ 2013-10-29 14:12                               ` David Ahern
  1 sibling, 0 replies; 105+ messages in thread
From: David Ahern @ 2013-10-29 14:12 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On 10/29/13 6:52 AM, Ingo Molnar wrote:
>> According to the perf man page, I'm supposed to be able to use --
>> to separate perf command line parameters from the command I want
>> to run.  And it definitely executed test.sh, I added an echo to
>> stdout in there as a test run and observed them get captured in
>> counters.txt
>
> Well, '--' can be used to delineate the command portion for cases
> where it's ambiguous.
>
> Here it's unambiguous though. This:
>
>    perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
>
> stops parsing a valid option after the -ddd option, so in theory it
> should execute 'perf bench sched messaging -- /root/test.sh' where
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus
> ignored.

Normally with perf commands a workload can be specified to state how 
long to collect perf data. That is not the case for perf-bench.

David

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 13:20                                   ` Neil Horman
@ 2013-10-29 14:17                                   ` Neil Horman
  2013-10-29 14:27                                     ` Ingo Molnar
  1 sibling, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-29 14:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specifically 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > > Here's my data for running the same test with taskset 
> > > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 


So, I apologize, you were right.  I was running the test.sh script but perf was
measuring itself.  Using this command line:

for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode; taskset -c 0 perf stat --repeat 20 -C 0 -ddd /root/test.sh
done >> counters.txt 2>&1

with test.sh unchanged I get these results:


Base:
 Performance counter stats for '/root/test.sh' (20 runs):

         56.069737 task-clock                #    1.005 CPUs utilized            ( +-  0.13% ) [100.00%]
                 5 context-switches          #    0.091 K/sec                    ( +-  5.11% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               366 page-faults               #    0.007 M/sec                    ( +-  0.08% )
       144,264,737 cycles                    #    2.573 GHz                      ( +-  0.23% ) [17.49%]
         9,239,760 stalled-cycles-frontend   #    6.40% frontend cycles idle     ( +-  3.77% ) [19.19%]
       110,635,829 stalled-cycles-backend    #   76.69% backend  cycles idle     ( +-  0.14% ) [19.68%]
        54,291,496 instructions              #    0.38  insns per cycle        
                                             #    2.04  stalled cycles per insn  ( +-  0.14% ) [18.30%]
         5,844,933 branches                  #  104.244 M/sec                    ( +-  2.81% ) [16.58%]
           301,523 branch-misses             #    5.16% of all branches          ( +-  0.12% ) [16.09%]
        23,645,797 L1-dcache-loads           #  421.721 M/sec                    ( +-  0.05% ) [16.06%]
           494,467 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.06% ) [16.06%]
         2,907,250 LLC-loads                 #   51.851 M/sec                    ( +-  0.08% ) [16.06%]
           486,329 LLC-load-misses           #   16.73% of all LL-cache hits     ( +-  0.11% ) [16.06%]
        11,113,848 L1-icache-loads           #  198.215 M/sec                    ( +-  0.07% ) [16.06%]
             5,378 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  1.34% ) [16.06%]
        23,742,876 dTLB-loads                #  423.453 M/sec                    ( +-  0.06% ) [16.06%]
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits  [16.06%]
        11,108,538 iTLB-loads                #  198.120 M/sec                    ( +-  0.06% ) [16.06%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [16.07%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.07%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.07%]

       0.055817066 seconds time elapsed                                          ( +-  0.10% )

Prefetch(5*64):
 Performance counter stats for '/root/test.sh' (20 runs):

         47.423853 task-clock                #    1.005 CPUs utilized            ( +-  0.62% ) [100.00%]
                 6 context-switches          #    0.116 K/sec                    ( +-  4.27% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               368 page-faults               #    0.008 M/sec                    ( +-  0.07% )
       120,423,860 cycles                    #    2.539 GHz                      ( +-  0.85% ) [14.23%]
         8,555,632 stalled-cycles-frontend   #    7.10% frontend cycles idle     ( +-  0.56% ) [16.23%]
        87,438,794 stalled-cycles-backend    #   72.61% backend  cycles idle     ( +-  1.13% ) [18.33%]
        55,039,308 instructions              #    0.46  insns per cycle        
                                             #    1.59  stalled cycles per insn  ( +-  0.05% ) [18.98%]
         5,619,298 branches                  #  118.491 M/sec                    ( +-  2.32% ) [18.98%]
           303,686 branch-misses             #    5.40% of all branches          ( +-  0.08% ) [18.98%]
        26,577,868 L1-dcache-loads           #  560.432 M/sec                    ( +-  0.05% ) [18.98%]
         1,323,630 L1-dcache-load-misses     #    4.98% of all L1-dcache hits    ( +-  0.14% ) [18.98%]
         3,426,016 LLC-loads                 #   72.242 M/sec                    ( +-  0.05% ) [18.98%]
         1,304,201 LLC-load-misses           #   38.07% of all LL-cache hits     ( +-  0.13% ) [18.98%]
        13,190,316 L1-icache-loads           #  278.137 M/sec                    ( +-  0.21% ) [18.98%]
            33,881 L1-icache-load-misses     #    0.26% of all L1-icache hits    ( +-  4.63% ) [17.93%]
        25,366,685 dTLB-loads                #  534.893 M/sec                    ( +-  0.24% ) [15.93%]
               734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
        13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [12.98%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [12.98%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [12.87%]

       0.047194407 seconds time elapsed                                          ( +-  0.62% )

Parallel ALU:
 Performance counter stats for '/root/test.sh' (20 runs):

         57.395070 task-clock                #    1.004 CPUs utilized            ( +-  1.71% ) [100.00%]
                 5 context-switches          #    0.092 K/sec                    ( +-  3.90% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               367 page-faults               #    0.006 M/sec                    ( +-  0.10% )
       143,232,396 cycles                    #    2.496 GHz                      ( +-  1.68% ) [16.73%]
         7,299,843 stalled-cycles-frontend   #    5.10% frontend cycles idle     ( +-  2.69% ) [18.47%]
       109,485,845 stalled-cycles-backend    #   76.44% backend  cycles idle     ( +-  2.01% ) [19.99%]
        56,867,669 instructions              #    0.40  insns per cycle        
                                             #    1.93  stalled cycles per insn  ( +-  0.22% ) [19.49%]
         6,646,323 branches                  #  115.800 M/sec                    ( +-  2.15% ) [17.75%]
           304,671 branch-misses             #    4.58% of all branches          ( +-  0.37% ) [16.23%]
        23,612,428 L1-dcache-loads           #  411.402 M/sec                    ( +-  0.05% ) [15.95%]
           518,988 L1-dcache-load-misses     #    2.20% of all L1-dcache hits    ( +-  0.11% ) [15.95%]
         2,934,119 LLC-loads                 #   51.121 M/sec                    ( +-  0.06% ) [15.95%]
           509,027 LLC-load-misses           #   17.35% of all LL-cache hits     ( +-  0.15% ) [15.95%]
        11,103,819 L1-icache-loads           #  193.463 M/sec                    ( +-  0.08% ) [15.95%]
             5,381 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  2.45% ) [15.95%]
        23,727,164 dTLB-loads                #  413.401 M/sec                    ( +-  0.06% ) [15.95%]
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits  [15.95%]
        11,104,205 iTLB-loads                #  193.470 M/sec                    ( +-  0.06% ) [15.95%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [15.95%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [15.95%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [15.96%]

       0.057151644 seconds time elapsed                                          ( +-  1.69% )

Both:
 Performance counter stats for '/root/test.sh' (20 runs):

         48.377833 task-clock                #    1.005 CPUs utilized            ( +-  0.67% ) [100.00%]
                 5 context-switches          #    0.113 K/sec                    ( +-  3.88% ) [100.00%]
                 0 cpu-migrations            #    0.001 K/sec                    ( +-100.00% ) [100.00%]
               367 page-faults               #    0.008 M/sec                    ( +-  0.08% )
       122,529,490 cycles                    #    2.533 GHz                      ( +-  1.05% ) [14.24%]
         8,796,729 stalled-cycles-frontend   #    7.18% frontend cycles idle     ( +-  0.56% ) [16.20%]
        88,936,550 stalled-cycles-backend    #   72.58% backend  cycles idle     ( +-  1.48% ) [18.16%]
        58,405,660 instructions              #    0.48  insns per cycle        
                                             #    1.52  stalled cycles per insn  ( +-  0.07% ) [18.61%]
         5,742,738 branches                  #  118.706 M/sec                    ( +-  1.54% ) [18.61%]
           303,555 branch-misses             #    5.29% of all branches          ( +-  0.09% ) [18.61%]
        26,321,789 L1-dcache-loads           #  544.088 M/sec                    ( +-  0.07% ) [18.61%]
         1,236,101 L1-dcache-load-misses     #    4.70% of all L1-dcache hits    ( +-  0.08% ) [18.61%]
         3,409,768 LLC-loads                 #   70.482 M/sec                    ( +-  0.05% ) [18.61%]
         1,212,511 LLC-load-misses           #   35.56% of all LL-cache hits     ( +-  0.08% ) [18.61%]
        10,579,372 L1-icache-loads           #  218.682 M/sec                    ( +-  0.05% ) [18.61%]
            19,426 L1-icache-load-misses     #    0.18% of all L1-icache hits    ( +- 14.70% ) [18.61%]
        25,329,963 dTLB-loads                #  523.586 M/sec                    ( +-  0.27% ) [17.29%]
               802 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  5.43% ) [15.33%]
        10,635,524 iTLB-loads                #  219.843 M/sec                    ( +-  0.09% ) [13.38%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [12.72%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [12.72%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [12.72%]

       0.048140073 seconds time elapsed                                          ( +-  0.67% )


Which overall looks a lot more like I expect, save for the parallel ALU cases.
It seems here that the parallel ALU changes actually hurt performance, which
really seems counter-intuitive.  I don't yet have any explanation for that.  I
do note that we seem to have more stalls in the both case, so perhaps the
parallel chains call for a more aggressive prefetch.  Do you have any thoughts?
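
For reference, this is the kind of stride experiment I mean -- a userspace
sketch only, not the kernel patch; 5*64 is just the stride we've been testing,
and the checksum's carry folding is left out to keep it short:

#include <stdint.h>
#include <stddef.h>

#define PREFETCH_STRIDE	(5 * 64)	/* bytes ahead of the block being summed */

/*
 * Plain 64-bit sum over 64-byte blocks, issuing a software prefetch
 * PREFETCH_STRIDE bytes ahead of the current block.  nwords must be a
 * multiple of 8; carry handling is deliberately omitted here.
 */
uint64_t sum64_prefetch(const uint64_t *p, size_t nwords)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i += 8) {
		/* read prefetch, high temporal locality (prefetcht0-like) */
		__builtin_prefetch((const char *)&p[i] + PREFETCH_STRIDE, 0, 3);

		sum += p[i]     + p[i + 1] + p[i + 2] + p[i + 3]
		     + p[i + 4] + p[i + 5] + p[i + 6] + p[i + 7];
	}
	return sum;
}

Varying PREFETCH_STRIDE in something shaped like that is all I mean by playing
with the stride.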

Regards
Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 14:17                                   ` Neil Horman
@ 2013-10-29 14:27                                     ` Ingo Molnar
  2013-10-29 20:26                                       ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-29 14:27 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> So, I apologize, you were right.  I was running the test.sh script 
> but perf was measuring itself. [...]

Ok, cool - one mystery less!

> Which overall looks a lot more like I expect, save for the parallel 
> ALU cases. It seems here that the parallel ALU changes actually 
> hurt performance, which really seems counter-intuitive.  I don't 
> yet have any explanation for that.  I do note that we seem to have 
> more stalls in the both case, so perhaps the parallel chains call 
> for a more aggressive prefetch.  Do you have any thoughts?

Note that with -ddd you 'overload' the PMU with more counters than 
can be run at once, which introduces extra noise. Since you are 
running the tests for 0.150 secs or so, the results are not very 
representative:

               734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
        13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]

with such low runtimes those results are very hard to trust.

So -ddd is typically used to pick up the most interesting PMU events 
you want to see measured, and then use them like this:

   -e dTLB-load-misses -e iTLB-loads

etc. For such short runtimes make sure the last column displays 
close to 100%, so that the PMU results become trustable.

A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
plus generics like 'cycles', 'instructions' can be added 'for free' 
because they get counted in a separate (fixed purpose) PMU register.

The last column tells you what percentage of the runtime that 
particular event was actually active. 100% (or empty last column) 
means it was active all the time.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 14:27                                     ` Ingo Molnar
@ 2013-10-29 20:26                                       ` Neil Horman
  2013-10-31 10:22                                         ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-29 20:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > So, I apologize, you were right.  I was running the test.sh script 
> > but perf was measuring itself. [...]
> 
> Ok, cool - one mystery less!
> 
> > Which overall looks a lot more like I expect, save for the parallel 
> > ALU cases. It seems here that the parallel ALU changes actually 
> > hurt performance, which really seems counter-intuitive.  I don't 
> > yet have any explanation for that.  I do note that we seem to have 
> > more stalls in the both case, so perhaps the parallel chains call 
> > for a more aggressive prefetch.  Do you have any thoughts?
> 
> Note that with -ddd you 'overload' the PMU with more counters than 
> can be run at once, which introduces extra noise. Since you are 
> running the tests for 0.150 secs or so, the results are not very 
> representative:
> 
>                734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
>         13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]
> 
> with such low runtimes those results are very hard to trust.
> 
> So -ddd is typically used to pick up the most interesting PMU events 
> you want to see measured, and then use them like this:
> 
>    -e dTLB-load-misses -e iTLB-loads
> 
> etc. For such short runtimes make sure the last column displays 
> close to 100%, so that the PMU results become trustable.
> 
> A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> plus generics like 'cycles', 'instructions' can be added 'for free' 
> because they get counted in a separate (fixed purpose) PMU register.
> 
> The last column tells you what percentage of the runtime that 
> particular event was actually active. 100% (or empty last column) 
> means it was active all the time.
> 
> Thanks,
> 
> 	Ingo
> 

Hmm, 

I ran this test:

for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
done

And I updated the test module to run for a million iterations rather than 100000 to increase the sample size and got this:


Base:
 Performance counter stats for './test.sh' (20 runs):

        47,305,064 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.04% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.75%]
    13,906,212,348 cycles                    #    0.000 GHz                      ( +-  0.05% ) [18.76%]
     4,426,395,949 instructions              #    0.32  insns per cycle          ( +-  0.01% ) [18.77%]
     2,261,551,278 L1-dcache-loads                                               ( +-  0.02% ) [18.76%]
        47,287,226 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.04% ) [18.76%]
       276,842,685 LLC-loads                                                     ( +-  0.01% ) [18.76%]
        46,454,114 LLC-load-misses           #   16.78% of all LL-cache hits     ( +-  0.05% ) [18.76%]
     1,048,894,486 L1-icache-loads                                               ( +-  0.07% ) [18.76%]
           472,205 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  1.19% ) [18.76%]
     2,260,639,613 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
               172 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 35.14% ) [18.74%]
     1,048,732,481 iTLB-loads                                                    ( +-  0.07% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.75% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       5.370546698 seconds time elapsed                                          ( +-  0.05% )


Prefetch:
 Performance counter stats for './test.sh' (20 runs):

       124,885,469 L1-dcache-load-misses     #    4.96% of all L1-dcache hits    ( +-  0.09% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.75%]
    11,434,328,889 cycles                    #    0.000 GHz                      ( +-  1.11% ) [18.77%]
     4,601,831,553 instructions              #    0.40  insns per cycle          ( +-  0.01% ) [18.77%]
     2,515,483,814 L1-dcache-loads                                               ( +-  0.01% ) [18.77%]
       124,928,127 L1-dcache-load-misses     #    4.97% of all L1-dcache hits    ( +-  0.09% ) [18.76%]
       323,355,145 LLC-loads                                                     ( +-  0.02% ) [18.76%]
       123,008,548 LLC-load-misses           #   38.04% of all LL-cache hits     ( +-  0.10% ) [18.75%]
     1,256,391,060 L1-icache-loads                                               ( +-  0.01% ) [18.75%]
           374,691 L1-icache-load-misses     #    0.03% of all L1-icache hits    ( +-  1.41% ) [18.75%]
     2,514,984,046 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
                67 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 51.81% ) [18.74%]
     1,256,333,548 iTLB-loads                                                    ( +-  0.01% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.74% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       4.496839773 seconds time elapsed                                          ( +-  0.64% )


Parallel ALU:
 Performance counter stats for './test.sh' (20 runs):

        49,489,518 L1-dcache-load-misses     #    2.19% of all L1-dcache hits    ( +-  0.09% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.76%]
    13,777,501,365 cycles                    #    0.000 GHz                      ( +-  1.73% ) [18.78%]
     4,707,160,703 instructions              #    0.34  insns per cycle          ( +-  0.01% ) [18.78%]
     2,261,693,074 L1-dcache-loads                                               ( +-  0.02% ) [18.78%]
        49,468,878 L1-dcache-load-misses     #    2.19% of all L1-dcache hits    ( +-  0.09% ) [18.77%]
       279,524,254 LLC-loads                                                     ( +-  0.01% ) [18.76%]
        48,491,934 LLC-load-misses           #   17.35% of all LL-cache hits     ( +-  0.12% ) [18.75%]
     1,057,877,680 L1-icache-loads                                               ( +-  0.02% ) [18.74%]
           461,784 L1-icache-load-misses     #    0.04% of all L1-icache hits    ( +-  1.87% ) [18.74%]
     2,260,978,836 dTLB-loads                                                    ( +-  0.02% ) [18.74%]
                27 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 89.96% ) [18.74%]
     1,057,886,632 iTLB-loads                                                    ( +-  0.02% ) [18.74%]
                 4 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +-100.00% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       5.500417234 seconds time elapsed                                          ( +-  1.60% )


Both:
 Performance counter stats for './test.sh' (20 runs):

       116,621,570 L1-dcache-load-misses     #    4.68% of all L1-dcache hits    ( +-  0.04% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.75%]
    11,597,067,510 cycles                    #    0.000 GHz                      ( +-  1.73% ) [18.77%]
     4,952,251,361 instructions              #    0.43  insns per cycle          ( +-  0.01% ) [18.77%]
     2,493,003,710 L1-dcache-loads                                               ( +-  0.02% ) [18.77%]
       116,640,333 L1-dcache-load-misses     #    4.68% of all L1-dcache hits    ( +-  0.04% ) [18.77%]
       322,246,216 LLC-loads                                                     ( +-  0.03% ) [18.76%]
       114,528,956 LLC-load-misses           #   35.54% of all LL-cache hits     ( +-  0.04% ) [18.76%]
       999,371,469 L1-icache-loads                                               ( +-  0.02% ) [18.76%]
           406,679 L1-icache-load-misses     #    0.04% of all L1-icache hits    ( +-  1.97% ) [18.75%]
     2,492,708,710 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
               140 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 38.46% ) [18.74%]
       999,320,389 iTLB-loads                                                    ( +-  0.01% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.90% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.72%]

       4.634419247 seconds time elapsed                                          ( +-  1.60% )


I note a few oddities here:

1) We seem to be getting more counter results than I specified, not sure why.
2) The % active column adds up to way more than 100 (which from my read of
the man page makes sense, given that multiple counters might increment in
response to a single instruction execution).
3) The run times are proportionally larger, but still indicate that parallel ALU
execution is hurting rather than helping, which is counter-intuitive.  I'm
looking into it, but thought you might want to see these results in case
something jumps out at you.

Regards
Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 20:26                                       ` Neil Horman
@ 2013-10-31 10:22                                         ` Ingo Molnar
  2013-10-31 14:33                                           ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-10-31 10:22 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > etc. For such short runtimes make sure the last column displays 
> > close to 100%, so that the PMU results become trustable.
> > 
> > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > because they get counted in a separate (fixed purpose) PMU register.
> > 
> > The last column tells you what percentage of the runtime that 
> > particular event was actually active. 100% (or empty last column) 
> > means it was active all the time.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> Hmm, 
> 
> I ran this test:
> 
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> done

You need to remove '-ddd' which is a shortcut for a ton of useful 
events, but here you want to use fewer events, to increase the 
precision of the measurement.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 10:22                                         ` Ingo Molnar
@ 2013-10-31 14:33                                           ` Neil Horman
  2013-11-01  9:13                                             ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-10-31 14:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > > etc. For such short runtimes make sure the last column displays 
> > > close to 100%, so that the PMU results become trustable.
> > > 
> > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > because they get counted in a separate (fixed purpose) PMU register.
> > > 
> > > The last column tells you what percentage of the runtime that 
> > > particular event was actually active. 100% (or empty last column) 
> > > means it was active all the time.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Hmm, 
> > 
> > I ran this test:
> > 
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > done
> 
> You need to remove '-ddd' which is a shortcut for a ton of useful 
> events, but here you want to use fewer events, to increase the 
> precision of the measurement.
> 
> Thanks,
> 
> 	Ingo
> 

Thank you Ingo, that fixed it.  I'm trying some other variants of the csum
algorithm that Doug and I discussed last night, but FWIW, the relative
performance of the 4 test cases (base/prefetch/parallel/both) remains unchanged.
I'm starting to feel that, at this point, there's very little to be gained from
parallel ALU operations (unless we can find a way to break the dependency on the
carry flag, which is what I'm tinkering with now).
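
To make the idea concrete, here is a rough userspace sketch of one way the
carry chain could be broken: accumulate 16-bit words into independent 64-bit
registers so plain additions never need adc.  This is illustrative only (not
the code I'm actually testing); it assumes an even-length, aligned buffer,
ignores the byte-order and odd-length handling that do_csum has to do, and
the names are made up:

#include <stdint.h>
#include <stddef.h>

/* Fold a 64-bit accumulator down to a 16-bit ones-complement sum */
static uint32_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint32_t)sum;
}

/* Hypothetical carry-free checksum core: 16-bit loads added into two
 * independent 64-bit accumulators cannot overflow for any realistic
 * buffer size, so no adc (and no carry-flag dependency) is needed. */
static uint32_t csum_nocarry(const uint16_t *p, size_t nwords)
{
	uint64_t s0 = 0, s1 = 0;
	size_t i;

	for (i = 0; i + 1 < nwords; i += 2) {
		s0 += p[i];
		s1 += p[i + 1];
	}
	if (i < nwords)
		s0 += p[i];

	return fold64(s0 + s1);
}

The trade-off is that each addition only consumes 16 bits of data instead of
64, so it isn't obviously a win; it just removes the serializing flag
dependency.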
Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 14:33                                           ` Neil Horman
@ 2013-11-01  9:13                                             ` Ingo Molnar
  2013-11-01 14:06                                               ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Ingo Molnar @ 2013-11-01  9:13 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > > etc. For such short runtimes make sure the last column displays 
> > > > close to 100%, so that the PMU results become trustable.
> > > > 
> > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > 
> > > > The last colum tells you what percentage of the runtime that 
> > > > particular event was actually active. 100% (or empty last column) 
> > > > means it was active all the time.
> > > > 
> > > > Thanks,
> > > > 
> > > > 	Ingo
> > > > 
> > > 
> > > Hmm, 
> > > 
> > > I ran this test:
> > > 
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > done
> > 
> > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > events, but here you want to use fewer events, to increase the 
> > precision of the measurement.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> Thank you ingo, that fixed it.  I'm trying some other variants of 
> the csum algorithm that Doug and I discussed last night, but FWIW, 
> the relative performance of the 4 test cases 
> (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> feel like at this point, theres very little point in doing 
> parallel alu operations (unless we can find a way to break the 
> dependency on the carry flag, which is what I'm tinkering with 
> now).

I would still like to encourage you to pick up the improvements that 
Doug measured (mostly via prefetch tweaking?) - that looked like 
some significant speedups that we don't want to lose!

Also, trying to stick the in-kernel implementation into 'perf bench' 
would be a useful first step as well, for this and future efforts.

See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
up the in-kernel assembly memcpy implementations:

#define memcpy MEMCPY /* don't hide glibc's memcpy() */
#define altinstr_replacement text
#define globl p2align 4; .globl
#define Lmemcpy_c globl memcpy_c; memcpy_c
#define Lmemcpy_c_e globl memcpy_c_e; memcpy_c_e

#include "../../../arch/x86/lib/memcpy_64.S"

So it needed a bit of trickery/wrappery for 'perf bench mem memcpy', 
but that is a one-time effort - once it's done then the current 
in-kernel csum_partial() implementation would be easily measurable 
(and any performance regression in it bisectable, etc.) from that 
point on.

In user-space it would also be easier to add various parameters and 
experimental implementations and background cache-stressing 
workloads automatically.

Something similar might be possible for csum_partial(), 
csum_partial_copy*(), etc.

Note, if any of you ventures to add checksum-benchmarking to perf 
bench, please base any patches on top of tip:perf/core:

  git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core

as there are a couple of perf bench enhancements in the pipeline 
already for v3.13.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01  9:13                                             ` Ingo Molnar
@ 2013-11-01 14:06                                               ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-01 14:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > > etc. For such short runtimes make sure the last column displays 
> > > > > close to 100%, so that the PMU results become trustable.
> > > > > 
> > > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > > 
> > > > > The last colum tells you what percentage of the runtime that 
> > > > > particular event was actually active. 100% (or empty last column) 
> > > > > means it was active all the time.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > 	Ingo
> > > > > 
> > > > 
> > > > Hmm, 
> > > > 
> > > > I ran this test:
> > > > 
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > > done
> > > 
> > > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > > events, but here you want to use fewer events, to increase the 
> > > precision of the measurement.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Thank you ingo, that fixed it.  I'm trying some other variants of 
> > the csum algorithm that Doug and I discussed last night, but FWIW, 
> > the relative performance of the 4 test cases 
> > (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> > feel like at this point, theres very little point in doing 
> > parallel alu operations (unless we can find a way to break the 
> > dependency on the carry flag, which is what I'm tinkering with 
> > now).
> 
> I would still like to encourage you to pick up the improvements that 
> Doug measured (mostly via prefetch tweaking?) - that looked like 
> some significant speedups that we don't want to lose!
> 
Well, yes, I made a line item of that in my subsequent note below.  I'm going to
repost that shortly, and in it I suggested that we revisit this when the AVX
instruction extensions are available.

> Also, trying to stick the in-kernel implementation into 'perf bench' 
> would be a useful first step as well, for this and future efforts.
> 
> See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
> up the in-kernel assembly memcpy implementations:
> 
Yes, I'll look into adding this as well
Regards
Neil



^ permalink raw reply	[flat|nested] 105+ messages in thread

* x86: Enhance perf checksum profiling and x86 implementation
  2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
                   ` (2 preceding siblings ...)
  2013-10-14  4:38 ` Andi Kleen
@ 2013-11-06 15:23 ` Neil Horman
  2013-11-06 15:23   ` [PATCH v2 1/2] perf: Add csum benchmark tests to perf Neil Horman
  2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
  3 siblings, 2 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-06 15:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Hey all-
	Sorry for the delay here, but it took me a bit to get the perf bits
working to my satisfaction.  As Ingo requested, I added do_csum to the perf
benchmarking utility (as part of the mem suite, since it didn't seem right to
give it its own suite).  I've also revamped the do_csum routine to do some smart
prefetching, as it yielded slightly better performance than simple prefetching
at a fixed stride:

Without prefetch:
[root@rdma-dev-02 perf]# ./perf bench mem csum -r x86-64-csum -l 1500B -s 512MB -i 1000000 -c
# Running mem/csum benchmark...
# Copying 1500B Bytes ...

       0.955977 Cycle/Byte

With prefetch:
[root@rdma-dev-02 perf]# ./perf bench mem csum -r x86-64-csum -l 1500B -s 512MB -i 1000000 -c
# Running mem/csum benchmark...
# Copying 1500B Bytes ...

       0.922540 Cycle/Byte


About a 3% improvement.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org


^ permalink raw reply	[flat|nested] 105+ messages in thread

* [PATCH v2 1/2] perf: Add csum benchmark tests to perf
  2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
@ 2013-11-06 15:23   ` Neil Horman
  2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
  1 sibling, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-06 15:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Adding perf benchmarks to test the arch-independent and x86[64] versions of
do_csum in the perf suite.  Other arches can be added as needed.  To avoid
creating a new suite instance (as I didn't think it was warranted), the csum
benchmarks have been added to the mem suite.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 tools/perf/Makefile.perf               |   3 +
 tools/perf/bench/bench.h               |   2 +
 tools/perf/bench/mem-csum-generic.c    |  21 +++
 tools/perf/bench/mem-csum-x86-64-def.h |   8 +
 tools/perf/bench/mem-csum-x86-64.c     |  51 +++++++
 tools/perf/bench/mem-csum.c            | 266 +++++++++++++++++++++++++++++++++
 tools/perf/bench/mem-csum.h            |  46 ++++++
 tools/perf/builtin-bench.c             |   1 +
 8 files changed, 398 insertions(+)
 create mode 100644 tools/perf/bench/mem-csum-generic.c
 create mode 100644 tools/perf/bench/mem-csum-x86-64-def.h
 create mode 100644 tools/perf/bench/mem-csum-x86-64.c
 create mode 100644 tools/perf/bench/mem-csum.c
 create mode 100644 tools/perf/bench/mem-csum.h

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 5b86390..d0ac05b 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -413,9 +413,12 @@ BUILTIN_OBJS += $(OUTPUT)bench/sched-pipe.o
 ifeq ($(RAW_ARCH),x86_64)
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy-x86-64-asm.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset-x86-64-asm.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-csum-x86-64.o
 endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-csum.o
+BUILTIN_OBJS += $(OUTPUT)bench/mem-csum-generic.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 0fdc852..3bbe43e 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -32,6 +32,8 @@ extern int bench_mem_memcpy(int argc, const char **argv,
 			    const char *prefix __maybe_unused);
 extern int bench_mem_memset(int argc, const char **argv, const char *prefix);
 
+extern int bench_mem_csum(int argc, const char **argv, const char *prefix);
+
 #define BENCH_FORMAT_DEFAULT_STR	"default"
 #define BENCH_FORMAT_DEFAULT		0
 #define BENCH_FORMAT_SIMPLE_STR		"simple"
diff --git a/tools/perf/bench/mem-csum-generic.c b/tools/perf/bench/mem-csum-generic.c
new file mode 100644
index 0000000..3e77b0d
--- /dev/null
+++ b/tools/perf/bench/mem-csum-generic.c
@@ -0,0 +1,21 @@
+#include "mem-csum.h"
+
+u32 generic_do_csum(unsigned char *buff, unsigned int len);
+
+__wsum csum_partial_copy(const void *src, void *dst, int len, __wsum sum);
+
+/*
+ * Each arch-specific implementation file exports these functions,
+ * so we would get link-time conflicts.  Since we're not testing these
+ * paths right now, just rename them to something generic here.
+ */
+#define csum_partial(x, y, z) csum_partial_generic(x, y, z)
+#define ip_compute_csum(x, y) ip_complete_csum_generic(x, y)
+
+#include "../../../lib/checksum.c"
+
+u32 generic_do_csum(unsigned char *buff, unsigned int len)
+{
+	return do_csum(buff, len);
+}
+
diff --git a/tools/perf/bench/mem-csum-x86-64-def.h b/tools/perf/bench/mem-csum-x86-64-def.h
new file mode 100644
index 0000000..6698193
--- /dev/null
+++ b/tools/perf/bench/mem-csum-x86-64-def.h
@@ -0,0 +1,8 @@
+/*
+ * Arch specific bench tests for x86[_64]
+ */
+
+CSUM_FN(x86_do_csum, x86_do_csum_init,
+	"x86-64-csum",
+	"x86 unrolled optimized csum() from kernel")
+
diff --git a/tools/perf/bench/mem-csum-x86-64.c b/tools/perf/bench/mem-csum-x86-64.c
new file mode 100644
index 0000000..72bc855
--- /dev/null
+++ b/tools/perf/bench/mem-csum-x86-64.c
@@ -0,0 +1,51 @@
+#include "mem-csum.h"
+
+static int clflush_size;
+
+/*
+ * This overrides the cache_line_size() function from the kernel.
+ * The kernel version returns the size of the processor cache line,
+ * so we emulate that here.
+ */
+static inline int cache_line_size(void)
+{
+	return clflush_size;
+}
+
+/*
+ * userspace has no idea what these macros do, and since we don't 
+ * need them to do anything for perf, just make them go away
+ */
+#define unlikely(x) x
+#define EXPORT_SYMBOL(x)
+
+u32 x86_do_csum(unsigned char *buff, unsigned int len);
+void x86_do_csum_init(void);
+
+#include "../../../arch/x86/lib/csum-partial_64.c"
+
+u32 x86_do_csum(unsigned char *buff, unsigned int len)
+{
+	return do_csum(buff, len);
+}
+
+void x86_do_csum_init(void)
+{
+	/*
+	 * The do_csum routine we're testing requires the kernel
+	 * implementation of cache_line_size(), which relies on data
+	 * parsed from the cpuid instruction, so do that computation here
+	 */
+	asm("mov $0x1, %%eax\n\t"
+	    "cpuid\n\t"
+	    "mov %%ebx, %[size]\n\t"
+	    : [size] "=m" (clflush_size)
+	    : : "eax", "ebx", "ecx", "edx");
+
+	/*
+	 * The size of a cache line evicted by a clflush operation is
+	 * contained in bits 15:8 of ebx when cpuid 0x1 is issued
+	 * and is reported in 8-byte words, hence the mask and
+	 * multiplication below
+	 */
+	clflush_size = (clflush_size >> 8) & 0xff;
+	clflush_size *= 8;
+}
diff --git a/tools/perf/bench/mem-csum.c b/tools/perf/bench/mem-csum.c
new file mode 100644
index 0000000..3676f6e
--- /dev/null
+++ b/tools/perf/bench/mem-csum.c
@@ -0,0 +1,266 @@
+/*
+ * mem-csum.c
+ *
+ * csum: checksum speed tests
+ *
+ */
+
+#include "../perf.h"
+#include "../util/util.h"
+#include "../util/parse-options.h"
+#include "../util/header.h"
+#include "bench.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <errno.h>
+
+#define K 1024
+
+static const char	*length_str	= "1500B";
+static const char	*size_str	= "64MB";
+static const char	*routine	= "default";
+static int		iterations	= 1;
+static bool		use_cycle;
+static int		cycle_fd;
+
+static const struct option options[] = {
+	OPT_STRING('l', "length", &length_str, "1MB",
+		    "Specify length of memory to checksum. "
+		    "Available units: B, KB, MB, GB and TB (upper and lower)"),
+	OPT_STRING('s', "size", &size_str, "64MB",
+		   "Size of working set to draw csumed buffer from."
+		   "Available units: B, KB, MB, GB and TB"),
+	OPT_STRING('r', "routine", &routine, "default",
+		    "Specify routine to set"),
+	OPT_INTEGER('i', "iterations", &iterations,
+		    "repeat csum() invocation this number of times"),
+	OPT_BOOLEAN('c', "cycle", &use_cycle,
+		    "Use cycles event instead of gettimeofday() for measuring"),
+	OPT_END()
+};
+
+
+extern u32 generic_do_csum(unsigned char *buff, unsigned int len);
+
+#ifdef HAVE_ARCH_X86_64_SUPPORT
+extern u32 x86_do_csum(unsigned char *buff, unsigned int len);
+extern void x86_do_csum_init(void);
+#endif
+
+typedef u32 (*csum_t)(unsigned char *, unsigned int);
+typedef void (*csum_init_t)(void);
+
+struct routine {
+	const char *name;
+	const char *desc;
+	csum_t fn;
+	csum_init_t initfn;
+};
+
+static const struct routine routines[] = {
+	{ "default",
+	  "Default arch-independent csum",
+	  generic_do_csum,
+	  NULL },
+#ifdef HAVE_ARCH_X86_64_SUPPORT
+#define CSUM_FN(fn, init, name, desc) { name, desc, fn, init },
+#include "mem-csum-x86-64-def.h"
+#undef CSUM_FN
+
+#endif
+
+	{ NULL,
+	  NULL,
+	  NULL,
+	  NULL }
+};
+
+static const char * const bench_mem_csum_usage[] = {
+	"perf bench mem csum <options>",
+	NULL
+};
+
+static struct perf_event_attr cycle_attr = {
+	.type		= PERF_TYPE_HARDWARE,
+	.config		= PERF_COUNT_HW_CPU_CYCLES
+};
+
+static void init_cycle(void)
+{
+	cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+
+	if (cycle_fd < 0 && errno == ENOSYS)
+		die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
+	else
+		BUG_ON(cycle_fd < 0);
+}
+
+static u64 get_cycle(void)
+{
+	int ret;
+	u64 clk;
+
+	ret = read(cycle_fd, &clk, sizeof(u64));
+	BUG_ON(ret != sizeof(u64));
+
+	return clk;
+}
+
+static double timeval2double(struct timeval *ts)
+{
+	return (double)ts->tv_sec +
+		(double)ts->tv_usec / (double)1000000;
+}
+
+static void alloc_mem(void **dst, size_t length)
+{
+	*dst = malloc(length);
+	if (!*dst)
+		die("memory allocation failed - maybe length is too large?\n");
+}
+
+
+static u64 do_csum_cycle(csum_t fn, size_t size, size_t len)
+{
+	u64 cycle_start = 0ULL, cycle_end = 0ULL;
+	void *dst = NULL;
+	void *pool = NULL;
+	unsigned int segments;
+	u64 total_cycles = 0;
+	int i;
+
+	alloc_mem(&pool, size);
+
+	segments = (size / len) - 1;
+	for (i = 0; i < iterations; ++i) {
+		dst = pool + ((random() % segments) * len);
+		cycle_start = get_cycle();
+		fn(dst, len);
+		cycle_end = get_cycle();
+		total_cycles += (cycle_end - cycle_start);
+	}
+
+	free(pool);
+	return total_cycles;
+}
+
+static double do_csum_gettimeofday(csum_t fn, size_t size, size_t len)
+{
+	struct timeval tv_start, tv_end, tv_diff, tv_total;
+	void *dst = NULL;
+	void *pool = NULL;
+	unsigned int segments;
+	int i;
+
+	alloc_mem(&pool, size);
+	timerclear(&tv_total);
+	segments = (size / len) - 1;
+
+	for (i = 0; i < iterations; ++i) {
+		dst = pool + ((random() % segments) * len);
+		BUG_ON(gettimeofday(&tv_start, NULL));
+		fn(dst, len);
+		BUG_ON(gettimeofday(&tv_end, NULL));
+		timersub(&tv_end, &tv_start, &tv_diff);
+		timeradd(&tv_total, &tv_diff, &tv_total);
+	}
+
+
+	free(pool);
+	return (double)((double)(len*iterations) / timeval2double(&tv_total));
+}
+
+#define print_bps(x) do {					\
+		if (x < K)					\
+			printf(" %14lf B/Sec\n", x);		\
+		else if (x < K * K)				\
+			printf(" %14lf KB/Sec\n", x / K);	\
+		else if (x < K * K * K)				\
+			printf(" %14lf MB/Sec\n", x / K / K);	\
+		else						\
+			printf(" %14lf GB/Sec\n", x / K / K / K); \
+	} while (0)
+
+int bench_mem_csum(int argc, const char **argv,
+		   const char *prefix __maybe_unused)
+{
+	int i;
+	size_t len;
+	size_t setsize;
+	double result_bps;
+	u64 result_cycle;
+
+	argc = parse_options(argc, argv, options,
+			     bench_mem_csum_usage, 0);
+
+	if (use_cycle)
+		init_cycle();
+
+	len = (size_t)perf_atoll((char *)length_str);
+	setsize = (size_t)perf_atoll((char *)size_str);
+
+	result_cycle = 0ULL;
+	result_bps = 0.0;
+
+	if ((s64)len <= 0) {
+		fprintf(stderr, "Invalid length:%s\n", length_str);
+		return 1;
+	}
+
+	for (i = 0; routines[i].name; i++) {
+		if (!strcmp(routines[i].name, routine))
+			break;
+	}
+	if (!routines[i].name) {
+		printf("Unknown routine:%s\n", routine);
+		printf("Available routines...\n");
+		for (i = 0; routines[i].name; i++) {
+			printf("\t%s ... %s\n",
+			       routines[i].name, routines[i].desc);
+		}
+		return 1;
+	}
+
+	if (routines[i].initfn)
+		routines[i].initfn();
+
+	if (bench_format == BENCH_FORMAT_DEFAULT)
+		printf("# Copying %s Bytes ...\n\n", length_str);
+
+	if (use_cycle) {
+		result_cycle =
+			do_csum_cycle(routines[i].fn, setsize, len);
+	} else {
+		result_bps =
+			do_csum_gettimeofday(routines[i].fn, setsize, len);
+	}
+
+	switch (bench_format) {
+	case BENCH_FORMAT_DEFAULT:
+		if (use_cycle) {
+			printf(" %14lf Cycle/Byte\n",
+				(double)result_cycle
+				/ (double)(len*iterations));
+		} else
+			print_bps(result_bps);
+
+
+		break;
+	case BENCH_FORMAT_SIMPLE:
+		if (use_cycle) {
+			printf("%lf\n", (double)result_cycle
+				/ (double)(len*iterations));
+		} else
+			printf("%lf\n", result_bps);
+		break;
+	default:
+		/* reaching this means there's some disaster: */
+		die("unknown format: %d\n", bench_format);
+		break;
+	}
+
+	return 0;
+}
diff --git a/tools/perf/bench/mem-csum.h b/tools/perf/bench/mem-csum.h
new file mode 100644
index 0000000..cca9a77
--- /dev/null
+++ b/tools/perf/bench/mem-csum.h
@@ -0,0 +1,46 @@
+/*
+ * Header for mem-csum
+ * mostly trickery to get the kernel code to compile
+ * in user space
+ */
+
+#include "../util/util.h"
+
+#include <linux/types.h>
+
+
+typedef __u16 __le16;
+typedef __u16 __be16;
+typedef __u32 __le32;
+typedef __u32 __be32;
+typedef __u64 __le64;
+typedef __u64 __be64;
+
+typedef __u16 __sum16;
+typedef __u32 __wsum;
+
+/*
+ * __visible isn't defined in userspace, so make it dissappear
+ */
+#define __visible
+
+/*
+ * These get multiple definitions in the kernel with a common inline version
+ * We're not testing them so just move them to another name
+ */
+#define ip_fast_csum ip_fast_csum_backup
+#define csum_tcpudp_nofold csum_tcpudp_nofold_backup
+
+/*
+ * Most csum implementations need this defined, for the copy_and_csum variants.
+ * Since we're building in userspace, this can be voided out
+ */
+static inline int __copy_from_user(void *dst, const void *src, size_t len)
+{
+	(void)dst;
+	(void)src;
+	(void)len;
+	return 0;
+}
+
+
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index e47f90c..44199e0 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -50,6 +50,7 @@ static struct bench sched_benchmarks[] = {
 static struct bench mem_benchmarks[] = {
 	{ "memcpy",	"Benchmark for memcpy()",			bench_mem_memcpy	},
 	{ "memset",	"Benchmark for memset() tests",			bench_mem_memset	},
+	{ "csum",	"Simple csum timing for various arches",	bench_mem_csum		},
 	{ "all",	"Test all memory benchmarks",			NULL			},
 	{ NULL,		NULL,						NULL			}
 };
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
  2013-11-06 15:23   ` [PATCH v2 1/2] perf: Add csum benchmark tests to perf Neil Horman
@ 2013-11-06 15:23   ` Neil Horman
  2013-11-06 15:34     ` Dave Jones
  2013-11-06 20:19     ` Andi Kleen
  1 sibling, 2 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-06 15:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

do_csum was identified via perf recently as a hot spot when doing
receive on IP-over-InfiniBand workloads.  After a lot of testing and
experimenting, we found that the best optimization currently available to us
is to prefetch the entire data buffer prior to doing the checksum.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 arch/x86/lib/csum-partial_64.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..9f2d3ee 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,8 +29,15 @@ static inline unsigned short from32to16(unsigned a)
  * Things tried and found to not make it faster:
  * Manual Prefetching
  * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
  */
+
+static inline void prefetch_lines(void *addr, size_t len)
+{
+	void *end = addr + len;
+	for (; addr < end; addr += cache_line_size())
+		asm("prefetch 0(%[buf])\n\t" : : [buf] "r" (addr));
+}
+
 static unsigned do_csum(const unsigned char *buff, unsigned len)
 {
 	unsigned odd, count;
@@ -67,7 +74,9 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			/* main loop using 64byte blocks */
 			zero = 0;
 			count64 = count >> 3;
-			while (count64) { 
+
+			prefetch_lines((void *)buff, len);
+			while (count64) {
 				asm("addq 0*8(%[src]),%[res]\n\t"
 				    "adcq 1*8(%[src]),%[res]\n\t"
 				    "adcq 2*8(%[src]),%[res]\n\t"
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
@ 2013-11-06 15:34     ` Dave Jones
  2013-11-06 15:54       ` Neil Horman
  2013-11-06 20:19     ` Andi Kleen
  1 sibling, 1 reply; 105+ messages in thread
From: Dave Jones @ 2013-11-06 15:34 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
 > do_csum was identified via perf recently as a hot spot when doing
 > receive on ip over infiniband workloads.  After alot of testing and
 > ideas, we found the best optimization available to us currently is to
 > prefetch the entire data buffer prior to doing the checksum
 > 
 > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
 > index 9845371..9f2d3ee 100644
 > --- a/arch/x86/lib/csum-partial_64.c
 > +++ b/arch/x86/lib/csum-partial_64.c
 > @@ -29,8 +29,15 @@ static inline unsigned short from32to16(unsigned a)
 >   * Things tried and found to not make it faster:
 >   * Manual Prefetching
 >   * Unrolling to an 128 bytes inner loop.
 > - * Using interleaving with more registers to break the carry chains.
 
Did you mean perhaps to remove the "Manual Prefetching" line instead?
(Curious, what was tried before that made it not worthwhile?)
 
	Dave
 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:34     ` Dave Jones
@ 2013-11-06 15:54       ` Neil Horman
  2013-11-06 17:19         ` Joe Perches
  2013-11-06 18:23         ` Eric Dumazet
  0 siblings, 2 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-06 15:54 UTC (permalink / raw)
  To: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
>  > do_csum was identified via perf recently as a hot spot when doing
>  > receive on ip over infiniband workloads.  After alot of testing and
>  > ideas, we found the best optimization available to us currently is to
>  > prefetch the entire data buffer prior to doing the checksum
>  > 
>  > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>  > index 9845371..9f2d3ee 100644
>  > --- a/arch/x86/lib/csum-partial_64.c
>  > +++ b/arch/x86/lib/csum-partial_64.c
>  > @@ -29,8 +29,15 @@ static inline unsigned short from32to16(unsigned a)
>  >   * Things tried and found to not make it faster:
>  >   * Manual Prefetching
>  >   * Unrolling to an 128 bytes inner loop.
>  > - * Using interleaving with more registers to break the carry chains.
>  
> Did you mean perhaps to remove the "Manual Prefetching" line instead ?
> (Curious, what was tried before that made it not worthwhile?)
>  
Crap, I didn't notice that previously, thanks Dave.

My guess was that the whole comment was made in reference to the fact that
checksum offload negated all these advantages.  That's not so true anymore, since
InfiniBand needs csum in software for IPoIB.

I'll fix this up and send a v3, but I'll give it a day in case there are more
comments first.

Thanks
Neil

> 	Dave
>  
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:54       ` Neil Horman
@ 2013-11-06 17:19         ` Joe Perches
  2013-11-06 18:11           ` Neil Horman
                             ` (2 more replies)
  2013-11-06 18:23         ` Eric Dumazet
  1 sibling, 3 replies; 105+ messages in thread
From: Joe Perches @ 2013-11-06 17:19 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> >  > do_csum was identified via perf recently as a hot spot when doing
> >  > receive on ip over infiniband workloads.  After alot of testing and
> >  > ideas, we found the best optimization available to us currently is to
> >  > prefetch the entire data buffer prior to doing the checksum
[]
> I'll fix this up and send a v3, but I'll give it a day in case there are more
> comments first.

Perhaps a reduction in prefetch loop count helps.

Was capping the amount prefetched and letting the
hardware prefetch also tested?

	prefetch_lines(buff, min(len, cache_line_size() * 8u));

Also pedantry/trivial comments:

__always_inline instead of inline
static __always_inline void prefetch_lines(const void *addr, size_t len)
{
	const void *end = addr + len;
...

buff doesn't need a void * cast in prefetch_lines

Beside the commit message, the comment above prefetch_lines
also needs updating to remove the "Manual Prefetching" line.

/*
 * Do a 64-bit checksum on an arbitrary memory area.
 * Returns a 32bit checksum.
 *
 * This isn't as time critical as it used to be because many NICs
 * do hardware checksumming these days.
 * 
 * Things tried and found to not make it faster:
 * Manual Prefetching
 * Unrolling to an 128 bytes inner loop.
 */



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 17:19         ` Joe Perches
@ 2013-11-06 18:11           ` Neil Horman
  2013-11-06 20:02           ` Neil Horman
  2013-11-08 19:01           ` Neil Horman
  2 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-06 18:11 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > >  > do_csum was identified via perf recently as a hot spot when doing
> > >  > receive on ip over infiniband workloads.  After alot of testing and
> > >  > ideas, we found the best optimization available to us currently is to
> > >  > prefetch the entire data buffer prior to doing the checksum
> []
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Perhaps a reduction in prefetch loop count helps.
> 
> Was capping the amount prefetched and letting the
> hardware prefetch also tested?
> 
> 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> 
It was not, but I did not bother to try it, since accurate branch prediction in
the loop and prefetch issuing should be very fast.  I'd also be worried that
capping prefetch would be relatively hardware-specific, so what worked well on
some hardware wouldn't be enough on other hardware.  I'd rather just issue the
prefetch for the whole buffer, as that should produce consistent results.

> Also pedantry/trivial comments:
> 
> __always_inline instead of inline
> static __always_inline void prefetch_lines(const void *addr, size_t len)
> {
> 	const void *end = addr + len;
> ...
> 
> buff doesn't need a void * cast in prefetch_lines
> 
ACK

> Beside the commit message, the comment above prefetch_lines
> also needs updating to remove the "Manual Prefetching" line.
> 
Yup, Dave noted the Manual Prefetch issue, and I'll move the whole comment as
part of that.

> /*
>  * Do a 64-bit checksum on an arbitrary memory area.
>  * Returns a 32bit checksum.
>  *
>  * This isn't as time critical as it used to be because many NICs
>  * do hardware checksumming these days.
>  * 
>  * Things tried and found to not make it faster:
>  * Manual Prefetching
>  * Unrolling to an 128 bytes inner loop.
>  */
> 
> 
> 

Regards
Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:54       ` Neil Horman
  2013-11-06 17:19         ` Joe Perches
@ 2013-11-06 18:23         ` Eric Dumazet
  2013-11-06 18:59           ` Neil Horman
  1 sibling, 1 reply; 105+ messages in thread
From: Eric Dumazet @ 2013-11-06 18:23 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:

> My guess was that the whole comment was made in reference to the fact that
> checksum offload negated all these advantages.  Thats not so true anymore, since
> infiniband needs csum in software for ipoib.
> 
> I'll fix this up and send a v3, but I'll give it a day in case there are more
> comments first.

Also please include netdev, I think people there are interested.

I caught this message, but I usually cannot read lkml traffic.

I wonder why you do not use (and/or change/tune) prefetch_range()
instead of a local definition.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 18:23         ` Eric Dumazet
@ 2013-11-06 18:59           ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-06 18:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 10:23:10AM -0800, Eric Dumazet wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> 
> > My guess was that the whole comment was made in reference to the fact that
> > checksum offload negated all these advantages.  Thats not so true anymore, since
> > infiniband needs csum in software for ipoib.
> > 
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Also please include netdev, I think people there are interested.
> 
Sure, will do in the updated version

> I caught this message, but I usually cannot read lkml traffic.
> 
> I wonder why you do not use (and/or change/tune) prefetch_range()
> instead of a local definition.
> 
I wanted to look into this further, because I wasn't (yet) sure if it was a bug
or not, but from what I can see x86_64 doesn't define ARCH_HAS_PREFETCH.  That
makes prefetch_range() a nop (I confirmed this via objdump).  It seems like we
should either define ARCH_HAS_PREFETCH on x86_64, or we should remove the
#ifdef from prefetch_range.
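
For reference, the generic helper is roughly the following (paraphrased from
include/linux/prefetch.h as I read it, so take the details with a grain of
salt); without ARCH_HAS_PREFETCH the loop is compiled out entirely, which is
why objdump shows nothing:

/* Rough paraphrase of the generic prefetch_range() helper: the whole
 * body is guarded by ARCH_HAS_PREFETCH, so on arches that don't define
 * it the function reduces to an empty inline. */
static inline void prefetch_range(void *addr, size_t len)
{
#ifdef ARCH_HAS_PREFETCH
	char *cp;
	char *end = addr + len;

	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
		prefetch(cp);
#endif
}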
> 
> 
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 17:19         ` Joe Perches
  2013-11-06 18:11           ` Neil Horman
@ 2013-11-06 20:02           ` Neil Horman
  2013-11-06 20:07             ` Joe Perches
  2013-11-08 19:01           ` Neil Horman
  2 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-11-06 20:02 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > >  > do_csum was identified via perf recently as a hot spot when doing
> > >  > receive on ip over infiniband workloads.  After alot of testing and
> > >  > ideas, we found the best optimization available to us currently is to
> > >  > prefetch the entire data buffer prior to doing the checksum
> []
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Perhaps a reduction in prefetch loop count helps.
> 
> Was capping the amount prefetched and letting the
> hardware prefetch also tested?
> 
> 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> 
> Also pedantry/trivial comments:
> 
> __always_inline instead of inline
> static __always_inline void prefetch_lines(const void *addr, size_t len)
> {
> 	const void *end = addr + len;
> ...
> 
> buff doesn't need a void * cast in prefetch_lines
> 
Actually I take back what I said here, we do need the cast, not for a conversion
from unsigned char * to void *, but rather to discard the const qualifier
without making the compiler complain.

Neil

> Beside the commit message, the comment above prefetch_lines
> also needs updating to remove the "Manual Prefetching" line.
> 
> /*
>  * Do a 64-bit checksum on an arbitrary memory area.
>  * Returns a 32bit checksum.
>  *
>  * This isn't as time critical as it used to be because many NICs
>  * do hardware checksumming these days.
>  * 
>  * Things tried and found to not make it faster:
>  * Manual Prefetching
>  * Unrolling to an 128 bytes inner loop.
>  */
> 
> 
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 20:02           ` Neil Horman
@ 2013-11-06 20:07             ` Joe Perches
  2013-11-08 16:25               ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Joe Perches @ 2013-11-06 20:07 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
[]
> > __always_inline instead of inline
> > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > {
> > 	const void *end = addr + len;
> > ...
> > 
> > buff doesn't need a void * cast in prefetch_lines
> > 
> Actually I take back what I said here, we do need the cast, not for a conversion
> from unsigned char * to void *, but rather to discard the const qualifier
> without making the compiler complain.

Not if the function is changed to const void *
and end is also const void * as shown.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
  2013-11-06 15:34     ` Dave Jones
@ 2013-11-06 20:19     ` Andi Kleen
  2013-11-07 21:23       ` Neil Horman
  1 sibling, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2013-11-06 20:19 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Neil Horman <nhorman@tuxdriver.com> writes:

> do_csum was identified via perf recently as a hot spot when doing
> receive on ip over infiniband workloads.  After alot of testing and
> ideas, we found the best optimization available to us currently is to
> prefetch the entire data buffer prior to doing the checksum

On what CPU? Most modern CPUs should not have any trouble at all
prefetching a linear access.

Also, for large buffers it is unlikely that all the prefetches
are actually executed; there is usually some limit.

As a minimum you would need:
- run it with a range of buffer sizes
- run this on a range of different CPUs and show no major regressions
- describe all of this actually in the description

But I find at least this patch very dubious.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 20:19     ` Andi Kleen
@ 2013-11-07 21:23       ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-07 21:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, netdev

On Wed, Nov 06, 2013 at 12:19:52PM -0800, Andi Kleen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > do_csum was identified via perf recently as a hot spot when doing
> > receive on ip over infiniband workloads.  After alot of testing and
> > ideas, we found the best optimization available to us currently is to
> > prefetch the entire data buffer prior to doing the checksum
> 
> On what CPU? Most modern CPUs should not have any trouble at all
> prefetching a linear access.
> 
> Also for large buffers it is unlikely that all the prefetches
> are actually executed, there is usually some limit.
> 
> As a minimum you would need:
> - run it with a range of buffer sizes
> - run this on a range of different CPUs and show no major regressions
> - describe all of this actually in the description
> 
> But I find at least this patch very dubious.
> 
> -Andi
> 
Well, if you look back in the thread, you can see several tests done with
various forms of prefetching that show performance improvements, but if you
want them all collected, here's what I have, using the perf bench from patch 1.

As you can see, you're right: on newer hardware there's negligible advantage (but
no regression that I can see).  On older hardware, however, we see a definite
improvement (up to 3%).  I'm afraid I don't have a wide variety of hardware
handy at the moment to do any large-scale testing on multiple CPUs.  But if you
have them available, please share your results.


Regards
Neil



vendor_id       : AuthenticAMD
cpu family      : 16
model           : 8
model name      : AMD Opteron(tm) Processor 4130
stepping        : 0
microcode       : 0x10000da
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 1
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 11
initial apicid  : 11
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor
cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate npt lbrv svm_lock
nrip_save pausefilter
bogomips        : 5200.49
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Without prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B	| 64MB  | 1000000       | 1.432338
1500B   | 128MB | 1000000       | 1.426212
1500B   | 256MB | 1000000       | 1.425988
1500B   | 512MB | 1000000       | 1.517873
9000B   | 64MB  | 1000000       | 0.897998
9000B   | 128MB | 1000000       | 0.884120
9000B   | 256MB | 1000000       | 0.881770
9000B   | 512MB | 1000000       | 0.883644
64KB    | 64MB  | 1000000       | 0.813054
64KB    | 128MB | 1000000       | 0.801859
64KB    | 256MB | 1000000       | 0.796415
64KB    | 512MB | 1000000       | 0.793869

With prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B	| 64MB	| 1000000	| 1.442855
1500B	| 128MB	| 1000000	| 1.438841
1500B	| 256MB	| 1000000	| 1.427324
1500B	| 512MB	| 1000000	| 1.462715 
9000B	| 64MB	| 1000000	| 0.894097 
9000B	| 128MB	| 1000000	| 0.884738 
9000B	| 256MB	| 1000000	| 0.881370  
9000B	| 512MB	| 1000000	| 0.884799 
64KB	| 64MB	| 1000000	| 0.813512 
64KB	| 128MB	| 1000000	| 0.801596 
64KB	| 256MB	| 1000000	| 0.795575  
64KB	| 512MB	| 1000000	| 0.793927 


==========================================================================================

vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping        : 7
microcode       : 0x29
cpu MHz         : 2754.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx
lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept
vpid
bogomips        : 6784.46
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual


Without prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 1.343645
1500B   | 128MB | 1000000       | 1.345782
1500B   | 256MB | 1000000       | 1.353145
1500B   | 512MB | 1000000       | 1.354844
9000B   | 64MB  | 1000000       | 0.856552
9000B   | 128MB | 1000000       | 0.852786
9000B   | 256MB | 1000000       | 0.854705
9000B   | 512MB | 1000000       | 0.863308
64KB    | 64MB  | 1000000       | 0.771888
64KB    | 128MB | 1000000       | 0.773453
64KB    | 256MB | 1000000       | 0.771728
64KB    | 512MB | 1000000       | 0.771390

With prefetching:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 1.344733
1500B   | 128MB | 1000000       | 1.342285
1500B   | 256MB | 1000000       | 1.344818
1500B   | 512MB | 1000000       | 1.342632
9000B   | 64MB  | 1000000       | 0.851043
9000B   | 128MB | 1000000       | 0.850629
9000B   | 256MB | 1000000       | 0.852207
9000B   | 512MB | 1000000       | 0.851927
64KB    | 64MB  | 1000000       | 0.768549
64KB    | 128MB | 1000000       | 0.768623
64KB    | 256MB | 1000000       | 0.768938
64KB    | 512MB | 1000000       | 0.768824

==========================================================================================
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 9
model name      : AMD Opteron(tm) Processor 6172
stepping        : 1
microcode       : 0x10000d9
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 1
siblings        : 12
core id         : 5
cpu cores       : 12
apicid          : 43
initial apicid  : 27
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid amd_dcm pni
monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr hw_pstate npt lbrv
svm_lock nrip_save pausefilter
bogomips        : 4189.63
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Without prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 1.415370
1500B   | 128MB | 1000000       | 1.437025
1500B   | 256MB | 1000000       | 1.424822
1500B   | 512MB | 1000000       | 1.442021
9000B   | 64MB  | 1000000       | 0.891699
9000B   | 128MB | 1000000       | 0.884261
9000B   | 256MB | 1000000       | 0.880179
9000B   | 512MB | 1000000       | 0.882190
64KB    | 64MB  | 1000000       | 0.813047
64KB    | 128MB | 1000000       | 0.800755
64KB    | 256MB | 1000000       | 0.795207
64KB    | 512MB | 1000000       | 0.792065

With prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 1.424003
1500B   | 128MB | 1000000       | 1.435567
1500B   | 256MB | 1000000       | 1.446858
1500B   | 512MB | 1000000       | 1.459407
9000B   | 64MB  | 1000000       | 0.899858
9000B   | 128MB | 1000000       | 0.885170
9000B   | 256MB | 1000000       | 0.883936
9000B   | 512MB | 1000000       | 0.886158
64KB    | 64MB  | 1000000       | 0.814136
64KB    | 128MB | 1000000       | 0.802202
64KB    | 256MB | 1000000       | 0.796140
64KB    | 512MB | 1000000       | 0.793792



==========================================================================================
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 10
model name      : AMD Athlon(tm) XP 2800+
stepping        : 0
cpu MHz         : 2079.461
cache size      : 512 KB
fdiv_bug        : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
bogomips        : 4158.92
clflush size    : 32
cache_alignment : 32
address sizes   : 34 bits physical, 32 bits virtual
power management: ts

Without prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 3.335217
1500B   | 128MB | 1000000       | 3.403103
1500B   | 256MB | 1000000       | 3.445059
1500B   | 512MB | 1000000       | 3.742008
9000B   | 64MB  | 1000000       | 47.466255
9000B   | 128MB | 1000000       | 47.742751
9000B   | 256MB | 1000000       | 47.965001
9000B   | 512MB | 1000000       | 48.589349
64KB    | 64MB  | 1000000       | 118.088638
64KB    | 128MB | 1000000       | 118.261744
64KB    | 256MB | 1000000       | 118.349641
64KB    | 512MB | 1000000       | 118.695321

With prefetch:
length	| Set Sz| iterations	| cycles/byte
1500B   | 64MB  | 1000000       | 3.231086
1500B   | 128MB | 1000000       | 3.423485
1500B   | 256MB | 1000000       | 3.278899
1500B   | 512MB | 1000000       | 3.545504
9000B   | 64MB  | 1000000       | 46.907795
9000B   | 128MB | 1000000       | 47.321743
9000B   | 256MB | 1000000       | 47.306189
9000B   | 512MB | 1000000       | 48.144320
64KB    | 64MB  | 1000000       | 117.897735
64KB    | 128MB | 1000000       | 118.122266
64KB    | 256MB | 1000000       | 118.126397
64KB    | 512MB | 1000000       | 118.546901



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 20:07             ` Joe Perches
@ 2013-11-08 16:25               ` Neil Horman
  2013-11-08 16:51                 ` Joe Perches
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-11-08 16:25 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> []
> > > __always_inline instead of inline
> > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > {
> > > 	const void *end = addr + len;
> > > ...
> > > 
> > > buff doesn't need a void * cast in prefetch_lines
> > > 
> > Actually I take back what I said here, we do need the cast, not for a conversion
> > from unsigned char * to void *, but rather to discard the const qualifier
> > without making the compiler complain.
> 
> Not if the function is changed to const void *
> and end is also const void * as shown.
> 
Addr is incremented in the for loop, so it can't be const.  I could add a loop
counter variable on the stack, but that doesn't seem like it would help anything

Neil

> 
> 

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 16:25               ` Neil Horman
@ 2013-11-08 16:51                 ` Joe Perches
  2013-11-08 19:07                   ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Joe Perches @ 2013-11-08 16:51 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
> On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> > On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > []
> > > > __always_inline instead of inline
> > > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > > {
> > > > 	const void *end = addr + len;
> > > > ...
> > > > 
> > > > buff doesn't need a void * cast in prefetch_lines
> > > > 
> > > Actually I take back what I said here, we do need the cast, not for a conversion
> > > from unsigned char * to void *, but rather to discard the const qualifier
> > > without making the compiler complain.
> > 
> > Not if the function is changed to const void *
> > and end is also const void * as shown.
> > 
> Addr is incremented in the for loop, so it can't be const.  I could add a loop
> counter variable on the stack, but that doesn't seem like it would help anything

Perhaps you meant
	void * const addr;
but that's not what I wrote.
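
The distinction, as a tiny standalone example (illustrative only, nothing to
do with the patch itself): 'const void *' makes the pointed-to data const,
not the pointer, so the pointer itself can still be advanced.

#include <stdio.h>
#include <stddef.h>

/* 'const unsigned char *addr' means *addr is read-only;
 * addr itself is an ordinary, modifiable pointer. */
static unsigned long sum_bytes(const unsigned char *addr, size_t len)
{
	const unsigned char *end = addr + len;
	unsigned long sum = 0;

	for (; addr < end; addr++)	/* advancing addr is fine */
		sum += *addr;		/* writing *addr would not be */
	return sum;
}

int main(void)
{
	unsigned char buf[4] = { 1, 2, 3, 4 };

	printf("%lu\n", sum_bytes(buf, sizeof(buf)));
	return 0;
}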

Let me know if this doesn't compile.
It does here...
---
 arch/x86/lib/csum-partial_64.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..891194a 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,8 +29,15 @@ static inline unsigned short from32to16(unsigned a)
  * Things tried and found to not make it faster:
  * Manual Prefetching
  * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
  */
+
+static __always_inline void prefetch_lines(const void * addr, size_t len)
+{
+	const void *end = addr + len;
+	for (; addr < end; addr += cache_line_size())
+		asm("prefetch 0(%[buf])\n\t" : : [buf] "r" (addr));
+}
+
 static unsigned do_csum(const unsigned char *buff, unsigned len)
 {
 	unsigned odd, count;
@@ -67,7 +74,9 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			/* main loop using 64byte blocks */
 			zero = 0;
 			count64 = count >> 3;
-			while (count64) { 
+
+			prefetch_lines(buff, min(len, cache_line_size() * 4u));
+			while (count64) {
 				asm("addq 0*8(%[src]),%[res]\n\t"
 				    "adcq 1*8(%[src]),%[res]\n\t"
 				    "adcq 2*8(%[src]),%[res]\n\t"



^ permalink raw reply related	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-06 17:19         ` Joe Perches
  2013-11-06 18:11           ` Neil Horman
  2013-11-06 20:02           ` Neil Horman
@ 2013-11-08 19:01           ` Neil Horman
  2013-11-08 19:33             ` Joe Perches
  2 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-11-08 19:01 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > >  > do_csum was identified via perf recently as a hot spot when doing
> > >  > receive on ip over infiniband workloads.  After alot of testing and
> > >  > ideas, we found the best optimization available to us currently is to
> > >  > prefetch the entire data buffer prior to doing the checksum
> []
> > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > comments first.
> 
> Perhaps a reduction in prefetch loop count helps.
> 
> Was capping the amount prefetched and letting the
> hardware prefetch also tested?
> 
> 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> 

Just tested this out:

With limiting:
1500B   | 64MB  | 1000000       | 1.344167
1500B   | 128MB | 1000000       | 1.340970
1500B   | 256MB | 1000000       | 1.353562
1500B   | 512MB | 1000000       | 1.346349
9000B   | 64MB  | 1000000       | 0.852174
9000B   | 128MB | 1000000       | 0.852765
9000B   | 256MB | 1000000       | 0.853153
9000B   | 512MB | 1000000       | 0.852661
64KB    | 64MB  | 1000000       | 0.768585
64KB    | 128MB | 1000000       | 0.769465
64KB    | 256MB | 1000000       | 0.769909
64KB    | 512MB | 1000000       | 0.779895


Without limiting:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.360525
1500B   | 128MB | 1000000       | 1.354220
1500B   | 256MB | 1000000       | 1.371037
1500B   | 512MB | 1000000       | 1.353557
9000B   | 64MB  | 1000000       | 0.850415
9000B   | 128MB | 1000000       | 0.853642
9000B   | 256MB | 1000000       | 0.852048
9000B   | 512MB | 1000000       | 0.852484
64KB    | 64MB  | 1000000       | 0.768261
64KB    | 128MB | 1000000       | 0.768566
64KB    | 256MB | 1000000       | 0.770822
64KB    | 512MB | 1000000       | 0.769391

Doesn't look like much consistent improvement.

Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 16:51                 ` Joe Perches
@ 2013-11-08 19:07                   ` Neil Horman
  2013-11-08 19:17                     ` Joe Perches
  2013-11-08 19:17                     ` H. Peter Anvin
  0 siblings, 2 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-08 19:07 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Nov 08, 2013 at 08:51:07AM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> > > On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > []
> > > > > __always_inline instead of inline
> > > > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > > > {
> > > > > 	const void *end = addr + len;
> > > > > ...
> > > > > 
> > > > > buff doesn't need a void * cast in prefetch_lines
> > > > > 
> > > > Actually I take back what I said here, we do need the cast, not for a conversion
> > > > from unsigned char * to void *, but rather to discard the const qualifier
> > > > without making the compiler complain.
> > > 
> > > Not if the function is changed to const void *
> > > and end is also const void * as shown.
> > > 
> > Addr is incremented in the for loop, so it can't be const.  I could add a loop
> > counter variable on the stack, but that doesn't seem like it would help anything
> 
> Perhaps you meant
> 	void * const addr;
> but that's not what I wrote.
> 
No, I meant something like:
static __always_inline void prefetch_lines(const void * addr, size_t len)
{
	const void *tmp = (void *)addr;
	...
	for(;tmp<end; tmp+=cache_line_size())
	...
}

> Let me know if this doesn't compile.
> It does here...
Huh, it does.  But that makes very little sense to me.  By qualifying addr as
const, how is the compiler not throwing a warning in the for loop about us
incrementing that same variable?



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:07                   ` Neil Horman
@ 2013-11-08 19:17                     ` Joe Perches
  2013-11-08 20:08                       ` Neil Horman
  2013-11-08 19:17                     ` H. Peter Anvin
  1 sibling, 1 reply; 105+ messages in thread
From: Joe Perches @ 2013-11-08 19:17 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-11-08 at 14:07 -0500, Neil Horman wrote:
> On Fri, Nov 08, 2013 at 08:51:07AM -0800, Joe Perches wrote:
> > On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
> > > On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> > > > On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > []
> > > > > > __always_inline instead of inline
> > > > > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > > > > {
> > > > > > 	const void *end = addr + len;
> > > > > > ...
> > > > > > 
> > > > > > buff doesn't need a void * cast in prefetch_lines
> > > > > > 
> > > > > Actually I take back what I said here, we do need the cast, not for a conversion
> > > > > from unsigned char * to void *, but rather to discard the const qualifier
> > > > > without making the compiler complain.
> > > > 
> > > > Not if the function is changed to const void *
> > > > and end is also const void * as shown.
> > > > 
> > > Addr is incremented in the for loop, so it can't be const.  I could add a loop
> > > counter variable on the stack, but that doesn't seem like it would help anything
> > 
> > Perhaps you meant
> > 	void * const addr;
> > but that's not what I wrote.
> > 
> No, I meant smoething like:
> static __always_inline void prefetch_lines(const void * addr, size_t len)
> {
> 	const void *tmp = (void *)addr;
> 	...
> 	for(;tmp<end; tmp+=cache_line_size())
> 	...
> }
> 
> > Let me know if this doesn't compile.
> > It does here...
> Huh, it does.  But that makes very little sense to me.  by qualifying addr as
> const, how is the compiler not throwing a warning in the for loop about us
> incrementing that same variable?

Because it points to const data but is not const itself.

void * const foo;	/* value of foo can't change */
const void *bar;	/* data pointed to by bar can't change */
const void * const baz; /* Neither baz nor data pointed to by baz can change */
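A standalone sketch (not from the patch, plain userspace C) showing both
forms in practice:

#include <stdio.h>

int main(void)
{
	char buf[4] = { 1, 2, 3, 4 };
	const char *p = buf;	/* data is read-only, the pointer is not */
	char *const q = buf;	/* pointer is fixed, the data is writable */

	p++;		/* ok: p is an ordinary variable */
	q[0] = 9;	/* ok: *q may be modified */
	/* *p = 9;	   error: assignment of read-only location */
	/* q++;		   error: increment of read-only variable 'q' */

	printf("%d %d\n", *p, q[0]);	/* prints "2 9" */
	return 0;
}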




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:07                   ` Neil Horman
  2013-11-08 19:17                     ` Joe Perches
@ 2013-11-08 19:17                     ` H. Peter Anvin
  1 sibling, 0 replies; 105+ messages in thread
From: H. Peter Anvin @ 2013-11-08 19:17 UTC (permalink / raw)
  To: Neil Horman, Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, x86

On 11/08/2013 11:07 AM, Neil Horman wrote:
> On Fri, Nov 08, 2013 at 08:51:07AM -0800, Joe Perches wrote:
>> On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
>>> On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
>>>> On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
>>>>> On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
>>>> []
>>>>>> __always_inline instead of inline
>>>>>> static __always_inline void prefetch_lines(const void *addr, size_t len)
>>>>>> {
>>>>>> 	const void *end = addr + len;
>>>>>> ...
>>>>>>
>>>>>> buff doesn't need a void * cast in prefetch_lines
>>>>>>
>>>>> Actually I take back what I said here, we do need the cast, not for a conversion
>>>>> from unsigned char * to void *, but rather to discard the const qualifier
>>>>> without making the compiler complain.
>>>>
>>>> Not if the function is changed to const void *
>>>> and end is also const void * as shown.
>>>>
>>> Addr is incremented in the for loop, so it can't be const.  I could add a loop
>>> counter variable on the stack, but that doesn't seem like it would help anything
>>
>> Perhaps you meant
>> 	void * const addr;
>> but that's not what I wrote.
>>
> No, I meant smoething like:
> static __always_inline void prefetch_lines(const void * addr, size_t len)
> {
> 	const void *tmp = (void *)addr;
> 	...
> 	for(;tmp<end; tmp+=cache_line_size())
> 	...
> }
> 
>> Let me know if this doesn't compile.
>> It does here...
> Huh, it does.  But that makes very little sense to me.  by qualifying addr as
> const, how is the compiler not throwing a warning in the for loop about us
> incrementing that same variable?
> 

As Joe is pointing out, you are confusing "const foo *tmp" with "foo *
const tmp".  The former means: "tmp is a variable pointing to type const
foo".  The latter means: "tmp is a constant pointer to type foo".

There is no problem modifying tmp in the former case; it prohibits
modifying *tmp.  In the latter case modifying tmp is prohibited, but
modifying *tmp is just fine.

Now, "const char *" would arguably be more correct here since arithmetic
on void is a gcc extension, but the same argument applies there.
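A sketch of that "const char *" variant (same behaviour as the helper in
the patch above; cache_line_size() and the prefetch asm are reused from it):

static __always_inline void prefetch_lines(const void *addr, size_t len)
{
	const char *p = addr;	/* may be advanced; the data stays const */
	const char *end = p + len;

	for (; p < end; p += cache_line_size())
		asm("prefetch 0(%[buf])\n\t" : : [buf] "r" (p));
}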

	-hpa



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:01           ` Neil Horman
@ 2013-11-08 19:33             ` Joe Perches
  2013-11-08 20:14               ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Joe Perches @ 2013-11-08 19:33 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > >  > receive on ip over infiniband workloads.  After alot of testing and
> > > >  > ideas, we found the best optimization available to us currently is to
> > > >  > prefetch the entire data buffer prior to doing the checksum
> > []
> > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > comments first.
> > 
> > Perhaps a reduction in prefetch loop count helps.
> > 
> > Was capping the amount prefetched and letting the
> > hardware prefetch also tested?
> > 
> > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > 
> 
> Just tested this out:

Thanks.

Reformatting the table so it's a bit more
readable/comparable for me:

len	SetSz	Loops	cycles/byte
			limited	unlimited
1500B	64MB	1M	1.3442	1.3605
1500B	128MB	1M	1.3410	1.3542
1500B	256MB	1M	1.3536	1.3710
1500B	512MB	1M	1.3463	1.3536
9000B	64MB	1M	0.8522	0.8504
9000B	128MB	1M	0.8528	0.8536
9000B	256MB	1M	0.8532	0.8520
9000B	512MB	1M	0.8527	0.8525
64KB	64MB	1M	0.7686	0.7683
64KB	128MB	1M	0.7695	0.7686
64KB	256MB	1M	0.7699	0.7708
64KB	512MB	1M	0.7799	0.7694

This data appears to show some value
in capping for 1500b lengths and noise
for shorter and longer lengths.

Any idea what the actual distribution of
do_csum lengths is under various loads?



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:17                     ` Joe Perches
@ 2013-11-08 20:08                       ` Neil Horman
  0 siblings, 0 replies; 105+ messages in thread
From: Neil Horman @ 2013-11-08 20:08 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Nov 08, 2013 at 11:17:39AM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 14:07 -0500, Neil Horman wrote:
> > On Fri, Nov 08, 2013 at 08:51:07AM -0800, Joe Perches wrote:
> > > On Fri, 2013-11-08 at 11:25 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 12:07:38PM -0800, Joe Perches wrote:
> > > > > On Wed, 2013-11-06 at 15:02 -0500, Neil Horman wrote:
> > > > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > > []
> > > > > > > __always_inline instead of inline
> > > > > > > static __always_inline void prefetch_lines(const void *addr, size_t len)
> > > > > > > {
> > > > > > > 	const void *end = addr + len;
> > > > > > > ...
> > > > > > > 
> > > > > > > buff doesn't need a void * cast in prefetch_lines
> > > > > > > 
> > > > > > Actually I take back what I said here, we do need the cast, not for a conversion
> > > > > > from unsigned char * to void *, but rather to discard the const qualifier
> > > > > > without making the compiler complain.
> > > > > 
> > > > > Not if the function is changed to const void *
> > > > > and end is also const void * as shown.
> > > > > 
> > > > Addr is incremented in the for loop, so it can't be const.  I could add a loop
> > > > counter variable on the stack, but that doesn't seem like it would help anything
> > > 
> > > Perhaps you meant
> > > 	void * const addr;
> > > but that's not what I wrote.
> > > 
> > No, I meant smoething like:
> > static __always_inline void prefetch_lines(const void * addr, size_t len)
> > {
> > 	const void *tmp = (void *)addr;
> > 	...
> > 	for(;tmp<end; tmp+=cache_line_size())
> > 	...
> > }
> > 
> > > Let me know if this doesn't compile.
> > > It does here...
> > Huh, it does.  But that makes very little sense to me.  by qualifying addr as
> > const, how is the compiler not throwing a warning in the for loop about us
> > incrementing that same variable?
> 
> Because it points to const data but is not const itself.
> 
> void * const foo;	/* value of foo can't change */
> const void *bar;	/* data pointed to by bar can't change */
> const void * const baz; /* Neither baz nor data pointed to by baz can change */
> 
Doh!  Wow, that was just staring me in the face and I missed it :)

Thanks for pointing it out.  I'll make that adjustment.
Neil


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 19:33             ` Joe Perches
@ 2013-11-08 20:14               ` Neil Horman
  2013-11-08 20:29                 ` Joe Perches
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-11-08 20:14 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > > >  > receive on ip over infiniband workloads.  After alot of testing and
> > > > >  > ideas, we found the best optimization available to us currently is to
> > > > >  > prefetch the entire data buffer prior to doing the checksum
> > > []
> > > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > > comments first.
> > > 
> > > Perhaps a reduction in prefetch loop count helps.
> > > 
> > > Was capping the amount prefetched and letting the
> > > hardware prefetch also tested?
> > > 
> > > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > 
> > 
> > Just tested this out:
> 
> Thanks.
> 
> Reformatting the table so it's a bit more
> readable/comparable for me:
> 
> len	SetSz	Loops	cycles/byte
> 			limited	unlimited
> 1500B	64MB	1M	1.3442	1.3605
> 1500B	128MB	1M	1.3410	1.3542
> 1500B	256MB	1M	1.3536	1.3710
> 1500B	512MB	1M	1.3463	1.3536
> 9000B	64MB	1M	0.8522	0.8504
> 9000B	128MB	1M	0.8528	0.8536
> 9000B	256MB	1M	0.8532	0.8520
> 9000B	512MB	1M	0.8527	0.8525
> 64KB	64MB	1M	0.7686	0.7683
> 64KB	128MB	1M	0.7695	0.7686
> 64KB	256MB	1M	0.7699	0.7708
> 64KB	512MB	1M	0.7799	0.7694
> 
> This data appears to show some value
> in capping for 1500b lengths and noise
> for shorter and longer lengths.
> 
> Any idea what the actual distribution of
> do_csum lengths is under various loads?
> 

I don't have any hard data, no, sorry. I chose the above lengths based on
typical MTUs for ethernet, jumbo frame ethernet, and ipoib (which Doug tells
me commonly has a 64k MTU).  Anecdotally, I expect 1500 bytes to be the most
common case.  I'll cap the prefetch at 1500B for now, since it doesn't seem
to hurt or help beyond that.
Neil



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 20:14               ` Neil Horman
@ 2013-11-08 20:29                 ` Joe Perches
  2013-11-11 19:40                   ` Neil Horman
  0 siblings, 1 reply; 105+ messages in thread
From: Joe Perches @ 2013-11-08 20:29 UTC (permalink / raw)
  To: Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-11-08 at 15:14 -0500, Neil Horman wrote:
> On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> > On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > > > >  > receive on ip over infiniband workloads.  After alot of testing and
> > > > > >  > ideas, we found the best optimization available to us currently is to
> > > > > >  > prefetch the entire data buffer prior to doing the checksum
> > > > []
> > > > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > > > comments first.
> > > > 
> > > > Perhaps a reduction in prefetch loop count helps.
> > > > 
> > > > Was capping the amount prefetched and letting the
> > > > hardware prefetch also tested?
> > > > 
> > > > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > > 
> > > 
> > > Just tested this out:
> > 
> > Thanks.
> > 
> > Reformatting the table so it's a bit more
> > readable/comparable for me:
> > 
> > len	SetSz	Loops	cycles/byte
> > 			limited	unlimited
> > 1500B	64MB	1M	1.3442	1.3605
> > 1500B	128MB	1M	1.3410	1.3542
> > 1500B	256MB	1M	1.3536	1.3710
> > 1500B	512MB	1M	1.3463	1.3536
> > 9000B	64MB	1M	0.8522	0.8504
> > 9000B	128MB	1M	0.8528	0.8536
> > 9000B	256MB	1M	0.8532	0.8520
> > 9000B	512MB	1M	0.8527	0.8525
> > 64KB	64MB	1M	0.7686	0.7683
> > 64KB	128MB	1M	0.7695	0.7686
> > 64KB	256MB	1M	0.7699	0.7708
> > 64KB	512MB	1M	0.7799	0.7694
> > 
> > This data appears to show some value
> > in capping for 1500b lengths and noise
> > for shorter and longer lengths.
> > 
> > Any idea what the actual distribution of
> > do_csum lengths is under various loads?
> > 
> I don't have any hard data no, sorry.

I think you should gather some before you implement this.
You might find extremely short lengths.

> I'll cap the prefetch at 1500B for now, since it
> doesn't seem to hurt or help beyond that

The table data has a max prefetch of
8 * boot_cpu_data.x86_cache_alignment, so
I believe it's always less than 1500, but
perhaps 4 lines might be slightly better still.
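As a worked example (a sketch that simply reuses the prefetch_lines()
helper from the earlier patch): with a 64-byte cache line a cap of 4 lines
prefetches at most 4 * 64 = 256 bytes, so for a 1500-byte frame the cap,
not the length, bounds the loop:

	/* prefetch at most 4 cache lines, then let the hardware
	 * prefetcher take over; 4 is the suggestion above, not a
	 * measured optimum */
	prefetch_lines(buff, min(len, cache_line_size() * 4u));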



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-08 20:29                 ` Joe Perches
@ 2013-11-11 19:40                   ` Neil Horman
  2013-11-11 21:18                     ` Ingo Molnar
  0 siblings, 1 reply; 105+ messages in thread
From: Neil Horman @ 2013-11-11 19:40 UTC (permalink / raw)
  To: Joe Perches
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Nov 08, 2013 at 12:29:07PM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 15:14 -0500, Neil Horman wrote:
> > On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> > > On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > > > > >  > receive on ip over infiniband workloads.  After alot of testing and
> > > > > > >  > ideas, we found the best optimization available to us currently is to
> > > > > > >  > prefetch the entire data buffer prior to doing the checksum
> > > > > []
> > > > > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > > > > comments first.
> > > > > 
> > > > > Perhaps a reduction in prefetch loop count helps.
> > > > > 
> > > > > Was capping the amount prefetched and letting the
> > > > > hardware prefetch also tested?
> > > > > 
> > > > > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > > > 
> > > > 
> > > > Just tested this out:
> > > 
> > > Thanks.
> > > 
> > > Reformatting the table so it's a bit more
> > > readable/comparable for me:
> > > 
> > > len	SetSz	Loops	cycles/byte
> > > 			limited	unlimited
> > > 1500B	64MB	1M	1.3442	1.3605
> > > 1500B	128MB	1M	1.3410	1.3542
> > > 1500B	256MB	1M	1.3536	1.3710
> > > 1500B	512MB	1M	1.3463	1.3536
> > > 9000B	64MB	1M	0.8522	0.8504
> > > 9000B	128MB	1M	0.8528	0.8536
> > > 9000B	256MB	1M	0.8532	0.8520
> > > 9000B	512MB	1M	0.8527	0.8525
> > > 64KB	64MB	1M	0.7686	0.7683
> > > 64KB	128MB	1M	0.7695	0.7686
> > > 64KB	256MB	1M	0.7699	0.7708
> > > 64KB	512MB	1M	0.7799	0.7694
> > > 
> > > This data appears to show some value
> > > in capping for 1500b lengths and noise
> > > for shorter and longer lengths.
> > > 
> > > Any idea what the actual distribution of
> > > do_csum lengths is under various loads?
> > > 
> > I don't have any hard data no, sorry.
> 
> I think you should before you implement this.
> You might find extremely short lengths.
> 
> > I'll cap the prefetch at 1500B for now, since it
> > doesn't seem to hurt or help beyond that
> 
> The table data has a max prefetch of
> 8 * boot_cpu_data.x86_cache_alignment so
> I believe it's always less than 1500 but
> perhaps 4 might be slightly better still.
> 


So, you appear to be correct.  I reran my test set with different prefetch
ceilings and got the results below.  There are some cases in which there is a
performance gain, but the gain is small and occurs at different spots depending
on the input buffer size (though most peak gains appear around 2 cache lines).
I'm guessing it takes about 2 prefetches before hardware prefetching catches up,
at which point we're just spending time issuing instructions that get discarded.
Given the small prefetch limit, and the limited gains (which may also change on
different hardware), I think we should probably just drop the prefetch idea
entirely, and perhaps just take the perf patch so that we can revisit this area
when hardware that supports the AVX extensions and/or ADCX/ADOX becomes
available.

Ingo, does that seem reasonable to you?
Neil



1 cache line:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.434190
1500B   | 128MB | 1000000       | 1.431216
1500B   | 256MB | 1000000       | 1.430888
1500B   | 512MB | 1000000       | 1.453422
9000B   | 64MB  | 1000000       | 0.892055
9000B   | 128MB | 1000000       | 0.884050
9000B   | 256MB | 1000000       | 0.880551
9000B   | 512MB | 1000000       | 0.883848
64KB    | 64MB  | 1000000       | 0.813187
64KB    | 128MB | 1000000       | 0.801326
64KB    | 256MB | 1000000       | 0.795643
64KB    | 512MB | 1000000       | 0.793400


2 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.430030
1500B   | 128MB | 1000000       | 1.434589
1500B   | 256MB | 1000000       | 1.425430
1500B   | 512MB | 1000000       | 1.451570
9000B   | 64MB  | 1000000       | 0.892369
9000B   | 128MB | 1000000       | 0.885577
9000B   | 256MB | 1000000       | 0.882091
9000B   | 512MB | 1000000       | 0.885201
64KB    | 64MB  | 1000000       | 0.813629
64KB    | 128MB | 1000000       | 0.801377
64KB    | 256MB | 1000000       | 0.795861
64KB    | 512MB | 1000000       | 0.793242

3 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.435048
1500B   | 128MB | 1000000       | 1.427103
1500B   | 256MB | 1000000       | 1.431558
1500B   | 512MB | 1000000       | 1.452250
9000B   | 64MB  | 1000000       | 0.893162
9000B   | 128MB | 1000000       | 0.884488
9000B   | 256MB | 1000000       | 0.881314
9000B   | 512MB | 1000000       | 0.884060
64KB    | 64MB  | 1000000       | 0.813185
64KB    | 128MB | 1000000       | 0.801280
64KB    | 256MB | 1000000       | 0.795554
64KB    | 512MB | 1000000       | 0.793670

4 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.435013
1500B   | 128MB | 1000000       | 1.428434
1500B   | 256MB | 1000000       | 1.430780
1500B   | 512MB | 1000000       | 1.456285
9000B   | 64MB  | 1000000       | 0.894877
9000B   | 128MB | 1000000       | 0.885387
9000B   | 256MB | 1000000       | 0.883293
9000B   | 512MB | 1000000       | 0.886462
64KB    | 64MB  | 1000000       | 0.815036
64KB    | 128MB | 1000000       | 0.801962
64KB    | 256MB | 1000000       | 0.797618
64KB    | 512MB | 1000000       | 0.795138

6 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.439609
1500B   | 128MB | 1000000       | 1.437569
1500B   | 256MB | 1000000       | 1.441776
1500B   | 512MB | 1000000       | 1.455362
9000B   | 64MB  | 1000000       | 0.895242
9000B   | 128MB | 1000000       | 0.886149
9000B   | 256MB | 1000000       | 0.881375
9000B   | 512MB | 1000000       | 0.884610
64KB    | 64MB  | 1000000       | 0.814658
64KB    | 128MB | 1000000       | 0.804124
64KB    | 256MB | 1000000       | 0.798143
64KB    | 512MB | 1000000       | 0.795377

10 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.431512
1500B   | 128MB | 1000000       | 1.431805
1500B   | 256MB | 1000000       | 1.430388
1500B   | 512MB | 1000000       | 1.464370
9000B   | 64MB  | 1000000       | 0.893922
9000B   | 128MB | 1000000       | 0.887852
9000B   | 256MB | 1000000       | 0.882711
9000B   | 512MB | 1000000       | 0.890067
64KB    | 64MB  | 1000000       | 0.814890
64KB    | 128MB | 1000000       | 0.801470
64KB    | 256MB | 1000000       | 0.796658
64KB    | 512MB | 1000000       | 0.794266

20 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.455539
1500B   | 128MB | 1000000       | 1.443117
1500B   | 256MB | 1000000       | 1.436739
1500B   | 512MB | 1000000       | 1.458973
9000B   | 64MB  | 1000000       | 0.898470
9000B   | 128MB | 1000000       | 0.886110
9000B   | 256MB | 1000000       | 0.889549
9000B   | 512MB | 1000000       | 0.886547
64KB    | 64MB  | 1000000       | 0.814665
64KB    | 128MB | 1000000       | 0.803252
64KB    | 256MB | 1000000       | 0.797268
64KB    | 512MB | 1000000       | 0.794830
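
The sweep above corresponds to varying a single ceiling in a helper along
these lines (a sketch only: prefetch_lines_capped and its max_lines argument
are hypothetical, while cache_line_size() and the prefetch asm are as in the
earlier patch):

static __always_inline void prefetch_lines_capped(const void *addr, size_t len,
						   unsigned int max_lines)
{
	/* never prefetch more than max_lines cache lines of the buffer */
	size_t cap = (size_t)max_lines * cache_line_size();
	const void *end = addr + (len < cap ? len : cap);

	for (; addr < end; addr += cache_line_size())
		asm("prefetch 0(%[buf])\n\t" : : [buf] "r" (addr));
}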


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v2 2/2] x86: add prefetching to do_csum
  2013-11-11 19:40                   ` Neil Horman
@ 2013-11-11 21:18                     ` Ingo Molnar
  0 siblings, 0 replies; 105+ messages in thread
From: Ingo Molnar @ 2013-11-11 21:18 UTC (permalink / raw)
  To: Neil Horman
  Cc: Joe Perches, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Ingo, does that seem reasonable to you?

FYI, in the past few days I've been busy due to the merge window, but 
everything I've seen so far in this portion of the thread gave me warm 
fuzzy feelings, so I definitely like the direction.

(More once I get around to looking at the code in detail.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~2013-11-11 21:18 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-11 16:51 [PATCH] x86: Run checksumming in parallel accross multiple alu's Neil Horman
2013-10-12 17:21 ` Ingo Molnar
2013-10-13 12:53   ` Neil Horman
2013-10-14 20:28   ` Neil Horman
2013-10-14 21:19     ` Eric Dumazet
2013-10-14 22:18       ` Eric Dumazet
2013-10-14 22:37         ` Joe Perches
2013-10-14 22:44           ` Eric Dumazet
2013-10-14 22:49             ` Joe Perches
2013-10-15  7:41               ` Ingo Molnar
2013-10-15 10:51                 ` Borislav Petkov
2013-10-15 12:04                   ` Ingo Molnar
2013-10-15 16:21                 ` Joe Perches
2013-10-16  0:34                   ` Eric Dumazet
2013-10-16  6:25                   ` Ingo Molnar
2013-10-16 16:55                     ` Joe Perches
2013-10-17  0:34         ` Neil Horman
2013-10-17  1:42           ` Eric Dumazet
2013-10-18 16:50             ` Neil Horman
2013-10-18 17:20               ` Eric Dumazet
2013-10-18 20:11                 ` Neil Horman
2013-10-18 21:15                   ` Eric Dumazet
2013-10-20 21:29                     ` Neil Horman
2013-10-21 17:31                       ` Eric Dumazet
2013-10-21 17:46                         ` Neil Horman
2013-10-21 19:21                     ` Neil Horman
2013-10-21 19:44                       ` Eric Dumazet
2013-10-21 20:19                         ` Neil Horman
2013-10-26 12:01                           ` Ingo Molnar
2013-10-26 13:58                             ` Neil Horman
2013-10-27  7:26                               ` Ingo Molnar
2013-10-27 17:05                                 ` Neil Horman
2013-10-17  8:41           ` Ingo Molnar
2013-10-17 18:19             ` H. Peter Anvin
2013-10-17 18:48               ` Eric Dumazet
2013-10-18  6:43               ` Ingo Molnar
2013-10-28 16:01             ` Neil Horman
2013-10-28 16:20               ` Ingo Molnar
2013-10-28 17:49                 ` Neil Horman
2013-10-28 16:24               ` Ingo Molnar
2013-10-28 16:49                 ` David Ahern
2013-10-28 17:46                 ` Neil Horman
2013-10-28 18:29                   ` Neil Horman
2013-10-29  8:25                     ` Ingo Molnar
2013-10-29 11:20                       ` Neil Horman
2013-10-29 11:30                         ` Ingo Molnar
2013-10-29 11:49                           ` Neil Horman
2013-10-29 12:52                             ` Ingo Molnar
2013-10-29 13:07                               ` Neil Horman
2013-10-29 13:11                                 ` Ingo Molnar
2013-10-29 13:20                                   ` Neil Horman
2013-10-29 14:17                                   ` Neil Horman
2013-10-29 14:27                                     ` Ingo Molnar
2013-10-29 20:26                                       ` Neil Horman
2013-10-31 10:22                                         ` Ingo Molnar
2013-10-31 14:33                                           ` Neil Horman
2013-11-01  9:13                                             ` Ingo Molnar
2013-11-01 14:06                                               ` Neil Horman
2013-10-29 14:12                               ` David Ahern
2013-10-15  7:32     ` Ingo Molnar
2013-10-15 13:14       ` Neil Horman
2013-10-12 22:29 ` H. Peter Anvin
2013-10-13 12:53   ` Neil Horman
2013-10-18 16:42   ` Neil Horman
2013-10-18 17:09     ` H. Peter Anvin
2013-10-25 13:06       ` Neil Horman
2013-10-14  4:38 ` Andi Kleen
2013-10-14  7:49   ` Ingo Molnar
2013-10-14 21:07     ` Eric Dumazet
2013-10-15 13:17       ` Neil Horman
2013-10-14 20:25   ` Neil Horman
2013-10-15  7:12     ` Sébastien Dugué
2013-10-15 13:33       ` Andi Kleen
2013-10-15 13:56         ` Sébastien Dugué
2013-10-15 14:06           ` Eric Dumazet
2013-10-15 14:15             ` Sébastien Dugué
2013-10-15 14:26               ` Eric Dumazet
2013-10-15 14:52                 ` Eric Dumazet
2013-10-15 16:02                   ` Andi Kleen
2013-10-16  0:28                     ` Eric Dumazet
2013-11-06 15:23 ` x86: Enhance perf checksum profiling and x86 implementation Neil Horman
2013-11-06 15:23   ` [PATCH v2 1/2] perf: Add csum benchmark tests to perf Neil Horman
2013-11-06 15:23   ` [PATCH v2 2/2] x86: add prefetching to do_csum Neil Horman
2013-11-06 15:34     ` Dave Jones
2013-11-06 15:54       ` Neil Horman
2013-11-06 17:19         ` Joe Perches
2013-11-06 18:11           ` Neil Horman
2013-11-06 20:02           ` Neil Horman
2013-11-06 20:07             ` Joe Perches
2013-11-08 16:25               ` Neil Horman
2013-11-08 16:51                 ` Joe Perches
2013-11-08 19:07                   ` Neil Horman
2013-11-08 19:17                     ` Joe Perches
2013-11-08 20:08                       ` Neil Horman
2013-11-08 19:17                     ` H. Peter Anvin
2013-11-08 19:01           ` Neil Horman
2013-11-08 19:33             ` Joe Perches
2013-11-08 20:14               ` Neil Horman
2013-11-08 20:29                 ` Joe Perches
2013-11-11 19:40                   ` Neil Horman
2013-11-11 21:18                     ` Ingo Molnar
2013-11-06 18:23         ` Eric Dumazet
2013-11-06 18:59           ` Neil Horman
2013-11-06 20:19     ` Andi Kleen
2013-11-07 21:23       ` Neil Horman
