linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-30  5:25 Doug Ledford
  2013-10-30 10:27 ` David Laight
  2013-10-30 11:02 ` Neil Horman
  0 siblings, 2 replies; 107+ messages in thread
From: Doug Ledford @ 2013-10-30  5:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, Doug Ledford, linux-kernel, netdev

* Neil Horman <nhorman@tuxdriver.com> wrote:
> 3) The run times are proportionally larger, but still indicate that Parallel ALU
> execution is hurting rather than helping, which is counter-intuitive.  I'm
> looking into it, but thought you might want to see these results in case
> something jumped out at you

So here's my theory about all of this.

I think that the original observation some years back was a fluke caused by
either a buggy CPU or a CPU design that is no longer used.

The parallel ALU design of this patch seems OK at first glance, but it means
that two parallel operations are both trying to set/clear both the overflow
and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
some CPU in the past had a set of overflow/carry flags per ALU and did some
sort of magic to make sure that the last state of those flags across multiple
ALUs that might have been used in parallelizing work were always in the CPU's
logical EFLAGS register, or the CPU has a buggy microcode that allowed two
ALUs to operate on data at the same time in situations where they would
potentially stomp on the carry/overflow flags of the other ALUs operations.

It's my theory that all modern CPUs have this behavior fixed, probably via a
microcode update, and so trying to do parallel ALU operations like this simply
has no effect because the CPU (rightly so) serializes the operations to keep
them from clobbering the overflow/carry flags of the other ALUs operations.

My additional theory then is that the reason you see a slowdown from this
patch is because the attempt to parallelize the ALU operation has caused
us to write a series of instructions that, once serialized, are non-optimal
and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
order, and since we aren't getting the parallel operation we want, this
is the net result of the patch).

It would explain things anyway.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30  5:25 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
@ 2013-10-30 10:27 ` David Laight
  2013-10-30 11:02 ` Neil Horman
  1 sibling, 0 replies; 107+ messages in thread
From: David Laight @ 2013-10-30 10:27 UTC (permalink / raw)
  To: Doug Ledford, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
> some CPU in the past had a set of overflow/carry flags per ALU and did some
> sort of magic to make sure that the last state of those flags across multiple
> ALUs that might have been used in parallelizing work were always in the CPU's
> logical EFLAGS register, or the CPU has a buggy microcode that allowed two
> ALUs to operate on data at the same time in situations where they would
> potentially stomp on the carry/overflow flags of the other ALUs operations.

IIRC x86 cpus treat the (arithmetic) flags register as a single entity.
So an instruction that only changes some of the flags is dependent
on any previous instruction that changes any flags.
OTOH if the instruction writes all of the flags then it doesn't
have to wait for the earlier instruction to complete.

This is problematic for the ADC chain in the IP checksum.
I did once try to use the SSE instructions to sum 16bit
fields into multiple 32bit registers.

	David




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30  5:25 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
  2013-10-30 10:27 ` David Laight
@ 2013-10-30 11:02 ` Neil Horman
  2013-10-30 12:18   ` David Laight
  2013-10-30 13:35   ` Doug Ledford
  1 sibling, 2 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-30 11:02 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

On Wed, Oct 30, 2013 at 01:25:39AM -0400, Doug Ledford wrote:
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 3) The run times are proportionally larger, but still indicate that Parallel ALU
> > execution is hurting rather than helping, which is counter-intuitive.  I'm
> > looking into it, but thought you might want to see these results in case
> > something jumped out at you
> 
> So here's my theory about all of this.
> 
> I think that the original observation some years back was a fluke caused by
> either a buggy CPU or a CPU design that is no longer used.
> 
> The parallel ALU design of this patch seems OK at first glance, but it means
> that two parallel operations are both trying to set/clear both the overflow
> and carry flags of the EFLAGS register of the *CPU* (not the ALU).  So, either
> some CPU in the past had a set of overflow/carry flags per ALU and did some
> sort of magic to make sure that the last state of those flags across multiple
> ALUs that might have been used in parallelizing work were always in the CPU's
> logical EFLAGS register, or the CPU has a buggy microcode that allowed two
> ALUs to operate on data at the same time in situations where they would
> potentially stomp on the carry/overflow flags of the other ALUs operations.
> 
> It's my theory that all modern CPUs have this behavior fixed, probably via a
> microcode update, and so trying to do parallel ALU operations like this simply
> has no effect because the CPU (rightly so) serializes the operations to keep
> them from clobbering the overflow/carry flags of the other ALUs operations.
> 
> My additional theory then is that the reason you see a slowdown from this
> patch is because the attempt to parallelize the ALU operation has caused
> us to write a series of instructions that, once serialized, are non-optimal
> and hinder smooth pipelining of the data (aka going 0*8, 2*8, 4*8, 6*8, 1*8,
> 3*8, 5*8, and 7*8 in terms of memory accesses is worse than doing them in
> order, and since we aren't getting the parallel operation we want, this
> is the net result of the patch).
> 
> It would explain things anyway.
> 

That does make sense, but it then begs the question: what's the advantage of
having multiple alu's at all?  If they're just going to serialize on the
updating of the condition register, there doesn't seem to be much advantage in
having multiple alu's at all, especially if a common use case (parallelizing an
operation on a large linear dataset) resulted in lower performance.

/me wonders if rearranging the instructions into this order:
adcq 0*8(src), res1
adcq 1*8(src), res2
adcq 2*8(src), res1

would prevent pipeline stalls.  That would be interesting data, and (I think)
support your theory, Doug.  I'll give that a try

Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 11:02 ` Neil Horman
@ 2013-10-30 12:18   ` David Laight
  2013-10-30 13:22     ` Doug Ledford
  2013-10-30 13:35   ` Doug Ledford
  1 sibling, 1 reply; 107+ messages in thread
From: David Laight @ 2013-10-30 12:18 UTC (permalink / raw)
  To: Neil Horman, Doug Ledford; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1

Those have to be sequenced.

Using a 64bit lea to add 32bit quantities should avoid the
dependencies on the flags register.
However you'd need to get 3 of those active to beat a 64bit adc.
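
Something along these lines (untested, just to sketch the idea - the movl
loads zero-extend the 32bit values into 64bit registers, and lea never
reads or writes the flags, so the three chains stay independent; result1-3
and buff stand for whatever accumulators and source pointer you use):

        asm("movl  0*4(%[src]),%%eax\n\t"
            "movl  1*4(%[src]),%%ecx\n\t"
            "movl  2*4(%[src]),%%edx\n\t"
            "leaq  (%[res1],%%rax),%[res1]\n\t"
            "leaq  (%[res2],%%rcx),%[res2]\n\t"
            "leaq  (%[res3],%%rdx),%[res3]\n\t"
            : [res1] "+r" (result1), [res2] "+r" (result2),
              [res3] "+r" (result3)
            : [src] "r" (buff)
            : "rax", "rcx", "rdx");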

	David




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 12:18   ` David Laight
@ 2013-10-30 13:22     ` Doug Ledford
  0 siblings, 0 replies; 107+ messages in thread
From: Doug Ledford @ 2013-10-30 13:22 UTC (permalink / raw)
  To: David Laight, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

On 10/30/2013 08:18 AM, David Laight wrote:
>> /me wonders if rearranging the instructions into this order:
>> adcq 0*8(src), res1
>> adcq 1*8(src), res2
>> adcq 2*8(src), res1
>
> Those have to be sequenced.
>
> Using a 64bit lea to add 32bit quantities should avoid the
> dependencies on the flags register.
> However you'd need to get 3 of those active to beat a 64bit adc.
>
> 	David
>
>
>

Already done (well, something similar to what you mention above anyway), 
doesn't help (although doesn't hurt either, even though it doubles the 
number of adds needed to complete the same work).  This is the code I 
tested:

#define ADDL_64                                         \
         asm("xorq  %%r8,%%r8\n\t"                       \
             "xorq  %%r9,%%r9\n\t"                       \
             "xorq  %%r10,%%r10\n\t"                     \
             "xorq  %%r11,%%r11\n\t"                     \
             "movl  0*4(%[src]),%%r8d\n\t"               \
             "movl  1*4(%[src]),%%r9d\n\t"               \
             "movl  2*4(%[src]),%%r10d\n\t"              \
             "movl  3*4(%[src]),%%r11d\n\t"              \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  4*4(%[src]),%%r8d\n\t"               \
             "movl  5*4(%[src]),%%r9d\n\t"               \
             "movl  6*4(%[src]),%%r10d\n\t"              \
             "movl  7*4(%[src]),%%r11d\n\t"              \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  8*4(%[src]),%%r8d\n\t"               \
             "movl  9*4(%[src]),%%r9d\n\t"               \
             "movl  10*4(%[src]),%%r10d\n\t"             \
             "movl  11*4(%[src]),%%r11d\n\t"             \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]\n\t"                   \
             "movl  12*4(%[src]),%%r8d\n\t"              \
             "movl  13*4(%[src]),%%r9d\n\t"              \
             "movl  14*4(%[src]),%%r10d\n\t"             \
             "movl  15*4(%[src]),%%r11d\n\t"             \
             "addq  %%r8,%[res1]\n\t"                    \
             "addq  %%r9,%[res2]\n\t"                    \
             "addq  %%r10,%[res3]\n\t"                   \
             "addq  %%r11,%[res4]"                       \
             : [res1] "=r" (result1),                    \
               [res2] "=r" (result2),                    \
               [res3] "=r" (result3),                    \
               [res4] "=r" (result4)                     \
             : [src] "r" (buff),                         \
               "[res1]" (result1), "[res2]" (result2),   \
               "[res3]" (result3), "[res4]" (result4)    \
             : "r8", "r9", "r10", "r11" )


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 11:02 ` Neil Horman
  2013-10-30 12:18   ` David Laight
@ 2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
                       ` (2 more replies)
  1 sibling, 3 replies; 107+ messages in thread
From: Doug Ledford @ 2013-10-30 13:35 UTC (permalink / raw)
  To: Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On 10/30/2013 07:02 AM, Neil Horman wrote:

> That does make sense, but it then begs the question: what's the advantage of
> having multiple alu's at all?

There's lots of ALU operations that don't operate on the flags or other 
entities that can be run in parallel.

> If they're just going to serialize on the
> updating of the condition register, there doesn't seem to be much advantage in
> having multiple alu's at all, especially if a common use case (parallelizing an
> operation on a large linear dataset) resulted in lower performance.
>
> /me wonders if rearranging the instructions into this order:
> adcq 0*8(src), res1
> adcq 1*8(src), res2
> adcq 2*8(src), res1
>
> would prevent pipeline stalls.  That would be interesting data, and (I think)
> support your theory, Doug.  I'll give that a try

Just to avoid spending too much time on various combinations, here are 
the methods I've tried:

Original code
2 chains doing interleaved memory accesses
2 chains doing serial memory accesses (as above)
4 chains doing serial memory accesses
4 chains using 32bit values in 64bit registers so you can always use add 
instead of adc and never need the carry flag

And I've done all of the above with simple prefetch and smart prefetch.

In all cases, the result is basically that the add method doesn't matter 
much in the grand scheme of things, but the prefetch does, and smart 
prefetch always beat simple prefetch.

My simple prefetch was to just go into the main while() loop for the 
csum operation and always prefetch 5*64 into the future.
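
That is, nothing fancier than this in the main loop (sketch, using the same
ADDL_64 helper and kernel prefetch() as before):

         while (count64) {
                 prefetch(buff + 5*64);
                 ADDL_64;
                 buff += 64;
                 count64--;
         }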

My smart prefetch looks like this:

static inline void prefetch_line(unsigned long *cur_line,
                                  unsigned long *end_line,
                                  size_t size)
{
         size_t fetched = 0;

         while (*cur_line <= *end_line && fetched < size) {
                 prefetch((void *)*cur_line);
                 *cur_line += cache_line_size();
                 fetched += cache_line_size();
         }
}

static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	...
         unsigned long cur_line = (unsigned long)buff & ~(cache_line_size() - 1);
         unsigned long end_line = ((unsigned long)buff + len) & ~(cache_line_size() - 1);

	...
         /* Don't bother to prefetch the first line, we'll end up stalling on
          * it anyway, but go ahead and start the prefetch on the next 3 */
         cur_line += cache_line_size();
         prefetch_line(&cur_line, &end_line, cache_line_size() * 3);
         odd = 1 & (unsigned long) buff;
         if (unlikely(odd)) {
                 result = *buff << 8;
	...
                 count >>= 1;            /* nr of 32-bit words.. */

                 /* prefetch line #4 ahead of main loop */
                 prefetch_line(&cur_line, &end_line, cache_line_size());

                 if (count) {
		...
                         while (count64) {
                                 /* we are now prefetching line #5 ahead of
                                  * where we are starting, and will stay 5
                                  * ahead throughout the loop, at least until
                                  * we get to the end line and then we'll stop
                                  * prefetching */
                                 prefetch_line(&cur_line, &end_line, 64);
                                 ADDL_64;
                                 buff += 64;
                                 count64--;
                         }

                         ADDL_64_FINISH;


I was going to tinker today and tomorrow with this function once I get a 
toolchain that will compile it (I reinstalled all my rhel6 hosts as f20 
and I'm hoping that does the trick, if not I need to do more work):

#define ADCXQ_64                                        \
         asm("xorq %[res1],%[res1]\n\t"                  \
             "adcxq 0*8(%[src]),%[res1]\n\t"             \
             "adoxq 1*8(%[src]),%[res2]\n\t"             \
             "adcxq 2*8(%[src]),%[res1]\n\t"             \
             "adoxq 3*8(%[src]),%[res2]\n\t"             \
             "adcxq 4*8(%[src]),%[res1]\n\t"             \
             "adoxq 5*8(%[src]),%[res2]\n\t"             \
             "adcxq 6*8(%[src]),%[res1]\n\t"             \
             "adoxq 7*8(%[src]),%[res2]\n\t"             \
             "adcxq %[zero],%[res1]\n\t"                 \
             "adoxq %[zero],%[res2]\n\t"                 \
             : [res1] "=r" (result1),                    \
               [res2] "=r" (result2)                     \
             : [src] "r" (buff), [zero] "r" (zero),      \
               "[res1]" (result1), "[res2]" (result2))

and then I also wanted to try using both xmm and ymm registers and doing 
64bit adds with 32bit numbers across multiple xmm/ymm registers as that 
should parallel nicely.  David, you mentioned you've tried this, how did 
your experiment turn out and what was your method?  I was planning on 
doing regular full size loads into one xmm/ymm register, then using 
pshufd/vshufd to move the data into two different registers, then 
summing into a fourth register, and possibly running two of those 
pipelines in parallel.
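
Something like this is the shape I have in mind for the SSE version
(untested sketch, and using punpck rather than pshufd to do the widening;
sum_lo/sum_hi are the two accumulators with 64bit lanes):

#include <emmintrin.h>

static inline void sse_accum_16bytes(const void *p, __m128i *sum_lo,
                                     __m128i *sum_hi)
{
        const __m128i zero = _mm_setzero_si128();
        __m128i v = _mm_loadu_si128((const __m128i *)p);  /* 4 x 32bit words */

        /* zero-extend the 32bit words into 64bit lanes and accumulate with
         * 64bit adds, so no carry flag is ever needed */
        *sum_lo = _mm_add_epi64(*sum_lo, _mm_unpacklo_epi32(v, zero));
        *sum_hi = _mm_add_epi64(*sum_hi, _mm_unpackhi_epi32(v, zero));
}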


^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
@ 2013-10-30 14:04     ` David Laight
  2013-10-30 14:52     ` Neil Horman
  2013-10-31 18:30     ` Neil Horman
  2 siblings, 0 replies; 107+ messages in thread
From: David Laight @ 2013-10-30 14:04 UTC (permalink / raw)
  To: Doug Ledford, Neil Horman; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev

...
> and then I also wanted to try using both xmm and ymm registers and doing
> 64bit adds with 32bit numbers across multiple xmm/ymm registers as that
> should parallel nicely.  David, you mentioned you've tried this, how did
> your experiment turn out and what was your method?  I was planning on
> doing regular full size loads into one xmm/ymm register, then using
> pshufd/vshufd to move the data into two different registers, then
> summing into a fourth register, and possibly running two of those
> pipelines in parallel.

It was a long time ago, and IIRC the code was just SSE so the
register length just wasn't going to give the required benefit.
I know I wrote the code, but I can't even remember whether I
actually got it working!
With the longer AVX words it might make enough difference.
Of course, this assumes that you have the fpu registers
available. If you have to do a fpu context switch it will
be a lot slower.

About the same time I did manage to get an open coded copy
loop to run as fast as 'rep movs' - and without any unrolling
or any prefetch instructions.

Thinking about AVX you should be able to do (without looking up the
actual mnemonics):
	load
	add 32bit chunks to sum
	compare sum with read value (equiv of carry)
	add/subtract compare result (0 or ~0) to a carry-sum register
That is 4 instructions for 256 bits, so you can aim for 4 clocks.
You'd need to check the cpu book to see if any of those can
be scheduled at the same time (if not dependent).
(and also whether there is any result delay - don't think so.)

I'd try running two copies of the above - probably skewed so that
the memory accesses are separated, do the memory read for the
next iteration, and use the 3rd instruction unit for loop control.
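
In intrinsics the per-iteration step would be something like this
(completely untested sketch - it needs AVX2 for the 256bit integer ops,
and the unsigned compare has to be faked by biasing both operands
because there is only a signed compare):

#include <immintrin.h>

static inline void avx2_csum_step(__m256i data, __m256i *sum, __m256i *carries)
{
        const __m256i bias = _mm256_set1_epi32((int)0x80000000u);
        __m256i carried;

        *sum = _mm256_add_epi32(*sum, data);
        /* a lane carried iff the new sum is (unsigned) below the value
         * just added; the compare yields 0 or ~0 per lane */
        carried = _mm256_cmpgt_epi32(_mm256_xor_si256(data, bias),
                                     _mm256_xor_si256(*sum, bias));
        /* subtracting ~0 adds 1 to the carry-sum register */
        *carries = _mm256_sub_epi32(*carries, carried);
}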

	David




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
@ 2013-10-30 14:52     ` Neil Horman
  2013-10-31 18:30     ` Neil Horman
  2 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-30 14:52 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
> 
> >That does make sense, but it then begs the question: what's the advantage of
> >having multiple alu's at all?
> 
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
> 
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple alu's at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls.  That would be interesting data, and (I think)
> >support your theory, Doug.  I'll give that a try
> 
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
> 
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
> 
> And I've done all of the above with simple prefetch and smart prefetch.
> 
Yup, I just tried the 2 chains doing interleaved access and came up with the
same results for both prefetch cases.

> In all cases, the result is basically that the add method doesn't
> matter much in the grand scheme of things, but the prefetch does,
> and smart prefetch always beat simple prefetch.
> 
> My simple prefetch was to just go into the main while() loop for the
> csum operation and always prefetch 5*64 into the future.
> 
> My smart prefetch looks like this:
> 
> static inline void prefetch_line(unsigned long *cur_line,
>                                  unsigned long *end_line,
>                                  size_t size)
> {
>         size_t fetched = 0;
> 
>         while (*cur_line <= *end_line && fetched < size) {
>                 prefetch((void *)*cur_line);
>                 *cur_line += cache_line_size();
>                 fetched += cache_line_size();
>         }
> }
> 
I've done this too, but I've come up with results that are very close to simple
prefetch.

> I was going to tinker today and tomorrow with this function once I
> get a toolchain that will compile it (I reinstalled all my rhel6
> hosts as f20 and I'm hoping that does the trick, if not I need to do
> more work):
> 
> #define ADCXQ_64                                        \
>         asm("xorq %[res1],%[res1]\n\t"                  \
>             "adcxq 0*8(%[src]),%[res1]\n\t"             \
>             "adoxq 1*8(%[src]),%[res2]\n\t"             \
>             "adcxq 2*8(%[src]),%[res1]\n\t"             \
>             "adoxq 3*8(%[src]),%[res2]\n\t"             \
>             "adcxq 4*8(%[src]),%[res1]\n\t"             \
>             "adoxq 5*8(%[src]),%[res2]\n\t"             \
>             "adcxq 6*8(%[src]),%[res1]\n\t"             \
>             "adoxq 7*8(%[src]),%[res2]\n\t"             \
>             "adcxq %[zero],%[res1]\n\t"                 \
>             "adoxq %[zero],%[res2]\n\t"                 \
>             : [res1] "=r" (result1),                    \
>               [res2] "=r" (result2)                     \
>             : [src] "r" (buff), [zero] "r" (zero),      \
>               "[res1]" (result1), "[res2]" (result2))
> 
I've tried using this method also (HPA suggested it early in the thread), but it's
not going to be useful for a while.  The compiler supports it already, but
there's no hardware available with support for these instructions yet (at least
not that I have available).

Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-30 13:35   ` Doug Ledford
  2013-10-30 14:04     ` David Laight
  2013-10-30 14:52     ` Neil Horman
@ 2013-10-31 18:30     ` Neil Horman
  2013-11-01  9:21       ` Ingo Molnar
  2013-11-01 15:42       ` Ben Hutchings
  2 siblings, 2 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-31 18:30 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Ingo Molnar, Eric Dumazet, linux-kernel, netdev, David Laight

On Wed, Oct 30, 2013 at 09:35:05AM -0400, Doug Ledford wrote:
> On 10/30/2013 07:02 AM, Neil Horman wrote:
> 
> >That does make sense, but it then begs the question: what's the advantage of
> >having multiple alu's at all?
> 
> There's lots of ALU operations that don't operate on the flags or
> other entities that can be run in parallel.
> 
> >If they're just going to serialize on the
> >updating of the condition register, there doesn't seem to be much advantage in
> >having multiple alu's at all, especially if a common use case (parallelizing an
> >operation on a large linear dataset) resulted in lower performance.
> >
> >/me wonders if rearranging the instructions into this order:
> >adcq 0*8(src), res1
> >adcq 1*8(src), res2
> >adcq 2*8(src), res1
> >
> >would prevent pipeline stalls.  That would be interesting data, and (I think)
> >support your theory, Doug.  I'll give that a try
> 
> Just to avoid spending too much time on various combinations, here
> are the methods I've tried:
> 
> Original code
> 2 chains doing interleaved memory accesses
> 2 chains doing serial memory accesses (as above)
> 4 chains doing serial memory accesses
> 4 chains using 32bit values in 64bit registers so you can always use
> add instead of adc and never need the carry flag
> 
> And I've done all of the above with simple prefetch and smart prefetch.
> 
> 


So, above and beyond this I spent yesterday trying this pattern, something Doug
and I discussed together offline:
 
asm("prefetch 5*64(%[src])\n\t"
    "addq 0*8(%[src]),%[res1]\n\t"
    "jo 2f\n\t"
    "incq %[cry]\n\t"
    "2:addq 1*8(%[src]),%[res2]\n\t"
    "jc 3f\n\t"
    "incq %[cry]\n\t"
    "3:addq 2*8(%[src]),%[res1]\n\t"
    ...

The hope being that by using the add instruction instead of the adc instruction, and
alternately testing the overflow and carry flags, I could break the
serialization on the flags register between subsequent adds and start doing
things in parallel (creating a poor man's adcx/adox instruction in effect).  It
functions, but unfortunately the performance lost to the completely broken
branch prediction that this inflicts makes it a non starter:


Base Performance:
 Performance counter stats for './test.sh' (20 runs):

        48,143,372 L1-dcache-load-misses                                         ( +-  0.03% ) [74.99%]
                 0 L1-dcache-prefetches                                         [75.00%]
    13,913,339,911 cycles                    #    0.000 GHz                      ( +-  0.06% ) [75.01%]
        28,878,999 branch-misses                                                 ( +-  0.05% ) [75.00%]

       5.367618727 seconds time elapsed                                          ( +-  0.06% )


Prefetch and simulated adcx/adox from above:
 Performance counter stats for './test.sh' (20 runs):

        35,704,331 L1-dcache-load-misses                                         ( +-  0.07% ) [75.00%]
                 0 L1-dcache-prefetches                                         [75.00%]
    19,751,409,264 cycles                    #    0.000 GHz                      ( +-  0.59% ) [75.00%]
        34,850,056 branch-misses                                                 ( +-  1.29% ) [75.00%]

       7.768602160 seconds time elapsed                                          ( +-  1.38% )


With the above instruction changes the prefetching lowers our dcache miss rate
significantly, but greatly raises our branch miss rate, and absolutely kills our
cycle count and run time.

At this point I feel like this is dead in the water.  I apologize for wasting
everyones time.  The best thing to do here would seem to be:

1) Add in some prefetching (from what I've seen a simple prefetch is as
performant as smart prefetching), so we may as well do it exactly as
csum_copy_from_user does it, and save ourselves the extra while loop.

2) Revisit this when the AVX extensions or the adcx/adox instructions are
available and we can really perform parallel ALU ops here.

Does that sound reasonable?
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 18:30     ` Neil Horman
@ 2013-11-01  9:21       ` Ingo Molnar
  2013-11-01 15:42       ` Ben Hutchings
  1 sibling, 0 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-11-01  9:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Eric Dumazet, linux-kernel, netdev, David Laight


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Prefetch and simulated adcx/adox from above:
>  Performance counter stats for './test.sh' (20 runs):
> 
>         35,704,331 L1-dcache-load-misses                                         ( +-  0.07% ) [75.00%]
>                  0 L1-dcache-prefetches                                         [75.00%]
>     19,751,409,264 cycles                    #    0.000 GHz                      ( +-  0.59% ) [75.00%]
>         34,850,056 branch-misses                                                 ( +-  1.29% ) [75.00%]
> 
>        7.768602160 seconds time elapsed                                          ( +-  1.38% )

btw., you might also want to try measuring only the basics:

   -e cycles -e instructions -e branches -e branch-misses

that should give you 100% in the last column and should also allow 
you to double check whether all the PMU counts are correct: is it 
the expected number of instructions, expected number of branches, 
expected number of branch-misses, etc.

Then you can remove branch stats and add just L1-dcache stats - and 
still be 100% covered:

   -e cycles -e instructions -e L1-dcache-loads -e L1-dcache-load-misses

etc.

Just so that you can trust what the PMU tells you. Prefetch counts 
are sometimes off, they might include speculative activities, etc.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 18:30     ` Neil Horman
  2013-11-01  9:21       ` Ingo Molnar
@ 2013-11-01 15:42       ` Ben Hutchings
  2013-11-01 16:08         ` Neil Horman
  1 sibling, 1 reply; 107+ messages in thread
From: Ben Hutchings @ 2013-11-01 15:42 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
[...]
> It
> functions, but unfortunately the performance lost to the completely broken
> branch prediction that this inflicts makes it a non starter:
[...]

Conditional branches are no good but conditional moves might be worth a shot.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 15:42       ` Ben Hutchings
@ 2013-11-01 16:08         ` Neil Horman
  2013-11-01 16:16           ` Ben Hutchings
  2013-11-01 16:18           ` David Laight
  0 siblings, 2 replies; 107+ messages in thread
From: Neil Horman @ 2013-11-01 16:08 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Fri, Nov 01, 2013 at 03:42:46PM +0000, Ben Hutchings wrote:
> On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> [...]
> > It
> > functions, but unfortunately the performance lost to the completely broken
> > branch prediction that this inflicts makes it a non starter:
> [...]
> 
> Conditional branches are no good but conditional moves might be worth a shot.
> 
> Ben.
> 
How would you suggest replacing the jumps in this case?  I agree it would be
faster here, but I'm not sure how I would implement an increment using a single
conditional move.
Neil

> -- 
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:08         ` Neil Horman
@ 2013-11-01 16:16           ` Ben Hutchings
  2013-11-01 16:18           ` David Laight
  1 sibling, 0 replies; 107+ messages in thread
From: Ben Hutchings @ 2013-11-01 16:16 UTC (permalink / raw)
  To: Neil Horman
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev,
	David Laight

On Fri, 2013-11-01 at 12:08 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 03:42:46PM +0000, Ben Hutchings wrote:
> > On Thu, 2013-10-31 at 14:30 -0400, Neil Horman wrote:
> > [...]
> > > It
> > > functions, but unfortunately the performance lost to the completely broken
> > > branch prediction that this inflicts makes it a non starter:
> > [...]
> > 
> > Conditional branches are no good but conditional moves might be worth a shot.
> > 
> > Ben.
> > 
> How would you suggest replacing the jumps in this case?  I agree it would be
> faster here, but I'm not sure how I would implement an increment using a single
> conditional move.

You can't, but it lets you use additional registers as carry flags.
Whether there are enough registers and enough parallelism to cancel out
the extra additions required, I don't know.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:08         ` Neil Horman
  2013-11-01 16:16           ` Ben Hutchings
@ 2013-11-01 16:18           ` David Laight
  2013-11-01 17:37             ` Neil Horman
  1 sibling, 1 reply; 107+ messages in thread
From: David Laight @ 2013-11-01 16:18 UTC (permalink / raw)
  To: Neil Horman, Ben Hutchings
  Cc: Doug Ledford, Ingo Molnar, Eric Dumazet, linux-kernel, netdev

> How would you suggest replacing the jumps in this case?  I agree it would be
> faster here, but I'm not sure how I would implement an increment using a single
> conditional move.

I think you need 3 instructions, move a 0, conditionally move a 1
then add. I suspect it won't be a win!

If you do 'win' it is probably very dependent on how the instructions
get scheduled onto the execution units - which will probably make
it very cpu type dependent.
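
Per word it would be something like this (untested, and it assumes a
spare register already holding the constant 1 since cmov can't take an
immediate; the mov of 0 doesn't touch the flags, so the carry from the
add is still there for the cmov to test - result/carry_sum here are just
the two accumulators):

        asm("addq   0*8(%[src]),%[res]\n\t"
            "movl   $0,%%r8d\n\t"
            "cmovc  %[one],%%r8\n\t"
            "addq   %%r8,%[cry]\n\t"
            : [res] "+r" (result), [cry] "+r" (carry_sum)
            : [src] "r" (buff), [one] "r" (1UL)
            : "r8");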

	David





^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 16:18           ` David Laight
@ 2013-11-01 17:37             ` Neil Horman
  2013-11-01 19:45               ` Joe Perches
  2013-11-04  9:47               ` David Laight
  0 siblings, 2 replies; 107+ messages in thread
From: Neil Horman @ 2013-11-01 17:37 UTC (permalink / raw)
  To: David Laight
  Cc: Ben Hutchings, Doug Ledford, Ingo Molnar, Eric Dumazet,
	linux-kernel, netdev

On Fri, Nov 01, 2013 at 04:18:50PM -0000, David Laight wrote:
> > How would you suggest replacing the jumps in this case?  I agree it would be
> > faster here, but I'm not sure how I would implement an increment using a single
> > conditional move.
> 
> I think you need 3 instructions, move a 0, conditionally move a 1
> then add. I suspect it won't be a win!
> 
> If you do 'win' it is probably very dependent on how the instructions
> get scheduled onto the execution units - which will probably make
> it very cpu type dependent.
> 
> 	David
> 
I agree, that sounds interesting, but very cpu dependent.  Thanks for the
suggestion, Ben, but I think it would be better if we just did the prefetch here
and re-addressed this area when AVX (or adcx/adox) instructions were available
for testing on hardware.

Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 17:37             ` Neil Horman
@ 2013-11-01 19:45               ` Joe Perches
  2013-11-01 19:58                 ` Neil Horman
  2013-11-04  9:47               ` David Laight
  1 sibling, 1 reply; 107+ messages in thread
From: Joe Perches @ 2013-11-01 19:45 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:

> I think it would be better if we just did the prefetch here
> and re-addressed this area when AVX (or adcx/adox) instructions were available
> for testing on hardware.

Could there be a difference if only a single software
prefetch was done at the beginning of transfer before
the while loop and hardware prefetches did the rest?




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 19:45               ` Joe Perches
@ 2013-11-01 19:58                 ` Neil Horman
  2013-11-01 20:26                   ` Joe Perches
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-11-01 19:58 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> 
> > I think it would be better if we just did the prefetch here
> > and re-addressed this area when AVX (or adcx/adox) instructions were available
> > for testing on hardware.
> 
> Could there be a difference if only a single software
> prefetch was done at the beginning of transfer before
> the while loop and hardware prefetches did the rest?
> 

I wouldn't think so.  If hardware was going to do any prefetching based on
memory access patterns it will do so regardless of the leading prefetch, and
that first prefetch isn't helpful because we still wind up stalling on the adds
while its completing
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 19:58                 ` Neil Horman
@ 2013-11-01 20:26                   ` Joe Perches
  2013-11-02  2:07                     ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Joe Perches @ 2013-11-01 20:26 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > 
> > > I think it would be better if we just did the prefetch here
> > > and re-addressed this area when AVX (or adcx/adox) instructions were available
> > > for testing on hardware.
> > 
> > Could there be a difference if only a single software
> > prefetch was done at the beginning of transfer before
> > the while loop and hardware prefetches did the rest?
> > 
> I wouldn't think so.  If hardware was going to do any prefetching based on
> memory access patterns it will do so regardless of the leading prefetch, and
> that first prefetch isn't helpful because we still wind up stalling on the adds
> while its completing

I imagine one benefit to be helping prevent
prefetching beyond the actual data required.

Maybe some hardware optimizes prefetch stride
better than 5*64.

I wonder also if using

	if (count > some_length)
		prefetch
	while (...)

helps small lengths more than the test/jump cost.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 20:26                   ` Joe Perches
@ 2013-11-02  2:07                     ` Neil Horman
  0 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-11-02  2:07 UTC (permalink / raw)
  To: Joe Perches
  Cc: David Laight, Ben Hutchings, Doug Ledford, Ingo Molnar,
	Eric Dumazet, linux-kernel, netdev

On Fri, Nov 01, 2013 at 01:26:52PM -0700, Joe Perches wrote:
> On Fri, 2013-11-01 at 15:58 -0400, Neil Horman wrote:
> > On Fri, Nov 01, 2013 at 12:45:29PM -0700, Joe Perches wrote:
> > > On Fri, 2013-11-01 at 13:37 -0400, Neil Horman wrote:
> > > 
> > > > I think it would be better if we just did the prefetch here
> > > > and re-addressed this area when AVX (or adcx/adox) instructions were available
> > > > for testing on hardware.
> > > 
> > > Could there be a difference if only a single software
> > > prefetch was done at the beginning of transfer before
> > > the while loop and hardware prefetches did the rest?
> > > 
> > I wouldn't think so.  If hardware was going to do any prefetching based on
> > memory access patterns it will do so regardless of the leading prefetch, and
> > that first prefetch isn't helpful because we still wind up stalling on the adds
> > while its completing
> 
> I imagine one benefit to be helping prevent
> prefetching beyond the actual data required.
> 
> Maybe some hardware optimizes prefetch stride
> better than 5*64.
> 
> I wonder also if using
> 
> 	if (count > some_length)
> 		prefetch
> 	while (...)
> 
> helps small lengths more than the test/jump cost.
> 
We've already done this and it is in fact the best performing.  I'll be posting
that patch along with Ingo's request to add do_csum to the perf bench code when I
have that done.
Best
Neil

> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* RE: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01 17:37             ` Neil Horman
  2013-11-01 19:45               ` Joe Perches
@ 2013-11-04  9:47               ` David Laight
  1 sibling, 0 replies; 107+ messages in thread
From: David Laight @ 2013-11-04  9:47 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ben Hutchings, Doug Ledford, Ingo Molnar, Eric Dumazet,
	linux-kernel, netdev

> > I think you need 3 instructions, move a 0, conditionally move a 1
> > then add. I suspect it won't be a win!

Or, with an appropriately unrolled loop, for each word:
	zero %eax, cmove a 1 to %al
	cmove a 1 to %ah
	shift %eax left, cmove a 1 to %al
	cmove a 1 to %ah, add %eax onto somewhere.
However the 2nd instruction stream would have to use a different
register (IIRC 8bit updates depend on the entire register).

> I agree, that sounds interesting, but very cpu dependent.  Thanks for the
> suggestion, Ben, but I think it would be better if we just did the prefetch here
> and re-addressed this area when AVX (or adcx/adox) instructions were available
> for testing on hardware.

I didn't look too closely at the original figures.
With a simple loop you need 4 instructions per iteration (load, adc, inc, branch).
How close to one iteration per clock do you get?
I thought x86 hardware prefetch would load the cache lines for sequential
accesses - so any prefetch instructions are rather pointless.
However reading the value in the previous loop iteration should help.

I've just realised that there is a problem with the loop termination
condition also needing the flags register:-(
I don't remember the 'loop' instruction ever being added to any of the
fast path instruction decodes - so it won't help.

So I suspect the best you'll get is an interleaved sequence of load and adc
with an lea and inc (both to adjust the index) and a bne back to the top.
(the lea wants to be in the middle somewhere).
That might manage 1 clock per word + 1 clock per loop iteration (if the inc
and bne can be 'fused').
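
Roughly this shape, I mean (untested sketch - inc/dec leave CF alone and
lea never touches the flags, so the adc carry chain survives the loop
control; sum/buff/count are the usual accumulator, source pointer and
word count):

        asm("clc\n\t"
            "1:\n\t"
            "movq   (%[src]),%%r8\n\t"
            "adcq   %%r8,%[sum]\n\t"
            "leaq   8(%[src]),%[src]\n\t"
            "decq   %[cnt]\n\t"
            "jnz    1b\n\t"
            "adcq   $0,%[sum]"
            : [sum] "+r" (sum), [src] "+r" (buff), [cnt] "+r" (count)
            :
            : "r8", "memory");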

	David





^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-11-01  9:13                                             ` Ingo Molnar
@ 2013-11-01 14:06                                               ` Neil Horman
  0 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-11-01 14:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Fri, Nov 01, 2013 at 10:13:37AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > > etc. For such short runtimes make sure the last column displays 
> > > > > close to 100%, so that the PMU results become trustable.
> > > > > 
> > > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > > 
> > > > > The last colum tells you what percentage of the runtime that 
> > > > > particular event was actually active. 100% (or empty last column) 
> > > > > means it was active all the time.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > 	Ingo
> > > > > 
> > > > 
> > > > Hmm, 
> > > > 
> > > > I ran this test:
> > > > 
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > > done
> > > 
> > > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > > events, but here you want to use fewer events, to increase the 
> > > precision of the measurement.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Thank you ingo, that fixed it.  I'm trying some other variants of 
> > the csum algorithm that Doug and I discussed last night, but FWIW, 
> > the relative performance of the 4 test cases 
> > (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> > feel like at this point, theres very little point in doing 
> > parallel alu operations (unless we can find a way to break the 
> > dependency on the carry flag, which is what I'm tinkering with 
> > now).
> 
> I would still like to encourage you to pick up the improvements that 
> Doug measured (mostly via prefetch tweaking?) - that looked like 
> some significant speedups that we don't want to lose!
> 
Well, yes, I made a line item of that in my subsequent note below.  I'm going to
repost that shortly, and I suggested that we revisit this when the AVX
instruction extensions are available.

> Also, trying to stick the in-kernel implementation into 'perf bench' 
> would be a useful first step as well, for this and future efforts.
> 
> See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
> up the in-kernel assembly memcpy implementations:
> 
Yes, I'll look into adding this as well
Regards
Neil



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 14:33                                           ` Neil Horman
@ 2013-11-01  9:13                                             ` Ingo Molnar
  2013-11-01 14:06                                               ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-11-01  9:13 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > > etc. For such short runtimes make sure the last column displays 
> > > > close to 100%, so that the PMU results become trustable.
> > > > 
> > > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > > because they get counted in a separate (fixed purpose) PMU register.
> > > > 
> > > > The last colum tells you what percentage of the runtime that 
> > > > particular event was actually active. 100% (or empty last column) 
> > > > means it was active all the time.
> > > > 
> > > > Thanks,
> > > > 
> > > > 	Ingo
> > > > 
> > > 
> > > Hmm, 
> > > 
> > > I ran this test:
> > > 
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > > done
> > 
> > You need to remove '-ddd' which is a shortcut for a ton of useful 
> > events, but here you want to use fewer events, to increase the 
> > precision of the measurement.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> Thank you ingo, that fixed it.  I'm trying some other variants of 
> the csum algorithm that Doug and I discussed last night, but FWIW, 
> the relative performance of the 4 test cases 
> (base/prefetch/parallel/both) remains unchanged. I'm starting to 
> feel like at this point, theres very little point in doing 
> parallel alu operations (unless we can find a way to break the 
> dependency on the carry flag, which is what I'm tinkering with 
> now).

I would still like to encourage you to pick up the improvements that 
Doug measured (mostly via prefetch tweaking?) - that looked like 
some significant speedups that we don't want to lose!

Also, trying to stick the in-kernel implementation into 'perf bench' 
would be a useful first step as well, for this and future efforts.

See what we do in tools/perf/bench/mem-memcpy-x86-64-asm.S to pick 
up the in-kernel assembly memcpy implementations:

#define memcpy MEMCPY /* don't hide glibc's memcpy() */
#define altinstr_replacement text
#define globl p2align 4; .globl
#define Lmemcpy_c globl memcpy_c; memcpy_c
#define Lmemcpy_c_e globl memcpy_c_e; memcpy_c_e

#include "../../../arch/x86/lib/memcpy_64.S"

So it needed a bit of trickery/wrappery for 'perf bench mem memcpy', 
but that is a one-time effort - once it's done then the current 
in-kernel csum_partial() implementation would be easily measurable 
(and any performance regression in it bisectable, etc.) from that 
point on.

In user-space it would also be easier to add various parameters and 
experimental implementations and background cache-stressing 
workloads automatically.

Something similar might be possible for csum_partial(), 
csum_partial_copy*(), etc.

Note, if any of you ventures to add checksum-benchmarking to perf 
bench, please base any patches on top of tip:perf/core:

  git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core

as there are a couple of perf bench enhancements in the pipeline 
already for v3.13.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-31 10:22                                         ` Ingo Molnar
@ 2013-10-31 14:33                                           ` Neil Horman
  2013-11-01  9:13                                             ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-31 14:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Thu, Oct 31, 2013 at 11:22:00AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > > etc. For such short runtimes make sure the last column displays 
> > > close to 100%, so that the PMU results become trustable.
> > > 
> > > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > > because they get counted in a separate (fixed purpose) PMU register.
> > > 
> > > The last colum tells you what percentage of the runtime that 
> > > particular event was actually active. 100% (or empty last column) 
> > > means it was active all the time.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > Hmm, 
> > 
> > I ran this test:
> > 
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> > done
> 
> You need to remove '-ddd' which is a shortcut for a ton of useful 
> events, but here you want to use fewer events, to increase the 
> precision of the measurement.
> 
> Thanks,
> 
> 	Ingo
> 

Thank you ingo, that fixed it.  I'm trying some other variants of the csum
algorithm that Doug and I discussed last night, but FWIW, the relative
performance of the 4 test cases (base/prefetch/parallel/both) remains unchanged.
I'm starting to feel like at this point, theres very little point in doing
parallel alu operations (unless we can find a way to break the dependency on the
carry flag, which is what I'm tinkering with now).
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 20:26                                       ` Neil Horman
@ 2013-10-31 10:22                                         ` Ingo Molnar
  2013-10-31 14:33                                           ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-31 10:22 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > etc. For such short runtimes make sure the last column displays 
> > close to 100%, so that the PMU results become trustable.
> > 
> > A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> > plus generics like 'cycles', 'instructions' can be added 'for free' 
> > because they get counted in a separate (fixed purpose) PMU register.
> > 
> > The last colum tells you what percentage of the runtime that 
> > particular event was actually active. 100% (or empty last column) 
> > means it was active all the time.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> Hmm, 
> 
> I ran this test:
> 
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
> done

You need to remove '-ddd' which is a shortcut for a ton of useful 
events, but here you want to use fewer events, to increase the 
precision of the measurement.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 14:27                                     ` Ingo Molnar
@ 2013-10-29 20:26                                       ` Neil Horman
  2013-10-31 10:22                                         ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-29 20:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 03:27:16PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > So, I apologize, you were right.  I was running the test.sh script 
> > but perf was measuring itself. [...]
> 
> Ok, cool - one mystery less!
> 
> > Which overall looks alot more like I expect, save for the parallel 
> > ALU cases. It seems here that the parallel ALU changes actually 
> > hurt performance, which really seems counter-intuitive.  I don't 
> > yet have any explanation for that.  I do note that we seem to have
> > more stalls in the both case so perhaps the parallel chains call 
> > for a more agressive prefetch.  Do you have any thoughts?
> 
> Note that with -ddd you 'overload' the PMU with more counters than 
> can be run at once, which introduces extra noise. Since you are 
> running the tests for 0.150 secs or so, the results are not very 
> representative:
> 
>                734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
>         13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]
> 
> with such low runtimes those results are very hard to trust.
> 
> So -ddd is typically used to pick up the most interesting PMU events 
> you want to see measured, and then use them like this:
> 
>    -e dTLB-load-misses -e iTLB-loads
> 
> etc. For such short runtimes make sure the last column displays 
> close to 100%, so that the PMU results become trustable.
> 
> A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
> plus generics like 'cycles', 'instructions' can be added 'for free' 
> because they get counted in a separate (fixed purpose) PMU register.
> 
> The last column tells you what percentage of the runtime that 
> particular event was actually active. 100% (or empty last column) 
> means it was active all the time.
> 
> Thanks,
> 
> 	Ingo
> 

Hmm, 

I ran this test:

for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -e L1-dcache-load-misses -e L1-dcache-prefetches -e cycles -e instructions -ddd ./test.sh
done

And I updated the test module to run for a million iterations rather than 100000 to increase the sample size and got this:


Base:
 Performance counter stats for './test.sh' (20 runs):

        47,305,064 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.04% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.75%]
    13,906,212,348 cycles                    #    0.000 GHz                      ( +-  0.05% ) [18.76%]
     4,426,395,949 instructions              #    0.32  insns per cycle          ( +-  0.01% ) [18.77%]
     2,261,551,278 L1-dcache-loads                                               ( +-  0.02% ) [18.76%]
        47,287,226 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.04% ) [18.76%]
       276,842,685 LLC-loads                                                     ( +-  0.01% ) [18.76%]
        46,454,114 LLC-load-misses           #   16.78% of all LL-cache hits     ( +-  0.05% ) [18.76%]
     1,048,894,486 L1-icache-loads                                               ( +-  0.07% ) [18.76%]
           472,205 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  1.19% ) [18.76%]
     2,260,639,613 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
               172 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 35.14% ) [18.74%]
     1,048,732,481 iTLB-loads                                                    ( +-  0.07% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.75% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       5.370546698 seconds time elapsed                                          ( +-  0.05% )


Prefetch:
 Performance counter stats for './test.sh' (20 runs):

       124,885,469 L1-dcache-load-misses     #    4.96% of all L1-dcache hits    ( +-  0.09% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.75%]
    11,434,328,889 cycles                    #    0.000 GHz                      ( +-  1.11% ) [18.77%]
     4,601,831,553 instructions              #    0.40  insns per cycle          ( +-  0.01% ) [18.77%]
     2,515,483,814 L1-dcache-loads                                               ( +-  0.01% ) [18.77%]
       124,928,127 L1-dcache-load-misses     #    4.97% of all L1-dcache hits    ( +-  0.09% ) [18.76%]
       323,355,145 LLC-loads                                                     ( +-  0.02% ) [18.76%]
       123,008,548 LLC-load-misses           #   38.04% of all LL-cache hits     ( +-  0.10% ) [18.75%]
     1,256,391,060 L1-icache-loads                                               ( +-  0.01% ) [18.75%]
           374,691 L1-icache-load-misses     #    0.03% of all L1-icache hits    ( +-  1.41% ) [18.75%]
     2,514,984,046 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
                67 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 51.81% ) [18.74%]
     1,256,333,548 iTLB-loads                                                    ( +-  0.01% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.74% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       4.496839773 seconds time elapsed                                          ( +-  0.64% )


Parallel ALU:
 Performance counter stats for './test.sh' (20 runs):

        49,489,518 L1-dcache-load-misses     #    2.19% of all L1-dcache hits    ( +-  0.09% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.76%]
    13,777,501,365 cycles                    #    0.000 GHz                      ( +-  1.73% ) [18.78%]
     4,707,160,703 instructions              #    0.34  insns per cycle          ( +-  0.01% ) [18.78%]
     2,261,693,074 L1-dcache-loads                                               ( +-  0.02% ) [18.78%]
        49,468,878 L1-dcache-load-misses     #    2.19% of all L1-dcache hits    ( +-  0.09% ) [18.77%]
       279,524,254 LLC-loads                                                     ( +-  0.01% ) [18.76%]
        48,491,934 LLC-load-misses           #   17.35% of all LL-cache hits     ( +-  0.12% ) [18.75%]
     1,057,877,680 L1-icache-loads                                               ( +-  0.02% ) [18.74%]
           461,784 L1-icache-load-misses     #    0.04% of all L1-icache hits    ( +-  1.87% ) [18.74%]
     2,260,978,836 dTLB-loads                                                    ( +-  0.02% ) [18.74%]
                27 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 89.96% ) [18.74%]
     1,057,886,632 iTLB-loads                                                    ( +-  0.02% ) [18.74%]
                 4 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +-100.00% ) [18.74%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.73%]

       5.500417234 seconds time elapsed                                          ( +-  1.60% )


Both:
 Performance counter stats for './test.sh' (20 runs):

       116,621,570 L1-dcache-load-misses     #    4.68% of all L1-dcache hits    ( +-  0.04% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.75%]
    11,597,067,510 cycles                    #    0.000 GHz                      ( +-  1.73% ) [18.77%]
     4,952,251,361 instructions              #    0.43  insns per cycle          ( +-  0.01% ) [18.77%]
     2,493,003,710 L1-dcache-loads                                               ( +-  0.02% ) [18.77%]
       116,640,333 L1-dcache-load-misses     #    4.68% of all L1-dcache hits    ( +-  0.04% ) [18.77%]
       322,246,216 LLC-loads                                                     ( +-  0.03% ) [18.76%]
       114,528,956 LLC-load-misses           #   35.54% of all LL-cache hits     ( +-  0.04% ) [18.76%]
       999,371,469 L1-icache-loads                                               ( +-  0.02% ) [18.76%]
           406,679 L1-icache-load-misses     #    0.04% of all L1-icache hits    ( +-  1.97% ) [18.75%]
     2,492,708,710 dTLB-loads                                                    ( +-  0.01% ) [18.75%]
               140 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 38.46% ) [18.74%]
       999,320,389 iTLB-loads                                                    ( +-  0.01% ) [18.74%]
                19 iTLB-load-misses          #    0.00% of all iTLB cache hits   ( +- 39.90% ) [18.73%]
                 0 L1-dcache-prefetches                                         [18.73%]
                 0 L1-dcache-prefetch-misses                                    [18.72%]

       4.634419247 seconds time elapsed                                          ( +-  1.60% )


I note a few oddities here:

1) We seem to be getting more counter results than I specified; I'm not sure why.
2) The % active column adds up to well over 100 (which, from my read of the man
page, makes sense, given that multiple counters might increment in response to a
single instruction execution).
3) The run times are proportionally larger, but still indicate that Parallel ALU
execution is hurting rather than helping, which is counter-intuitive.  I'm
looking into it, but thought you might want to see these results in case
something jumped out at you

Regards
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 14:17                                   ` Neil Horman
@ 2013-10-29 14:27                                     ` Ingo Molnar
  2013-10-29 20:26                                       ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-29 14:27 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> So, I apologize, you were right.  I was running the test.sh script 
> but perf was measuring itself. [...]

Ok, cool - one mystery less!

> Which overall looks a lot more like I expect, save for the parallel 
> ALU cases. It seems here that the parallel ALU changes actually 
> hurt performance, which really seems counter-intuitive.  I don't 
> yet have any explanation for that.  I do note that we seem to have 
> more stalls in the both case so perhaps the parallel chains call 
> for a more aggressive prefetch.  Do you have any thoughts?

Note that with -ddd you 'overload' the PMU with more counters than 
can be run at once, which introduces extra noise. Since you are 
running the tests for 0.150 secs or so, the results are not very 
representative:

               734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
        13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]

with such low runtimes those results are very hard to trust.

So -ddd is typically used to pick up the most interesting PMU events 
you want to see measured, and then use them like this:

   -e dTLB-load-misses -e iTLB-loads

etc. For such short runtimes make sure the last column displays 
close to 100%, so that the PMU results become trustable.

A nehalem+ PMU will allow 2-4 events to be measured in parallel, 
plus generics like 'cycles', 'instructions' can be added 'for free' 
because they get counted in a separate (fixed purpose) PMU register.

The last column tells you what percentage of the runtime that 
particular event was actually active. 100% (or empty last column) 
means it was active all the time.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 13:20                                   ` Neil Horman
@ 2013-10-29 14:17                                   ` Neil Horman
  2013-10-29 14:27                                     ` Ingo Molnar
  1 sibling, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-29 14:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specifically 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > Here's my data for running the same test with taskset 
> > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 


So, I apologize, you were right.  I was running the test.sh script but perf was
measuring itself.  Using this command line:

for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode; taskset -c 0 perf stat --repeat 20 -C 0 -ddd /root/test.sh
done >> counters.txt 2>&1

with test.sh unchanged I get these results:


Base:
 Performance counter stats for '/root/test.sh' (20 runs):

         56.069737 task-clock                #    1.005 CPUs utilized            ( +-  0.13% ) [100.00%]
                 5 context-switches          #    0.091 K/sec                    ( +-  5.11% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               366 page-faults               #    0.007 M/sec                    ( +-  0.08% )
       144,264,737 cycles                    #    2.573 GHz                      ( +-  0.23% ) [17.49%]
         9,239,760 stalled-cycles-frontend   #    6.40% frontend cycles idle     ( +-  3.77% ) [19.19%]
       110,635,829 stalled-cycles-backend    #   76.69% backend  cycles idle     ( +-  0.14% ) [19.68%]
        54,291,496 instructions              #    0.38  insns per cycle        
                                             #    2.04  stalled cycles per insn  ( +-  0.14% ) [18.30%]
         5,844,933 branches                  #  104.244 M/sec                    ( +-  2.81% ) [16.58%]
           301,523 branch-misses             #    5.16% of all branches          ( +-  0.12% ) [16.09%]
        23,645,797 L1-dcache-loads           #  421.721 M/sec                    ( +-  0.05% ) [16.06%]
           494,467 L1-dcache-load-misses     #    2.09% of all L1-dcache hits    ( +-  0.06% ) [16.06%]
         2,907,250 LLC-loads                 #   51.851 M/sec                    ( +-  0.08% ) [16.06%]
           486,329 LLC-load-misses           #   16.73% of all LL-cache hits     ( +-  0.11% ) [16.06%]
        11,113,848 L1-icache-loads           #  198.215 M/sec                    ( +-  0.07% ) [16.06%]
             5,378 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  1.34% ) [16.06%]
        23,742,876 dTLB-loads                #  423.453 M/sec                    ( +-  0.06% ) [16.06%]
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits  [16.06%]
        11,108,538 iTLB-loads                #  198.120 M/sec                    ( +-  0.06% ) [16.06%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [16.07%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.07%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.07%]

       0.055817066 seconds time elapsed                                          ( +-  0.10% )

Prefetch(5*64):
 Performance counter stats for '/root/test.sh' (20 runs):

         47.423853 task-clock                #    1.005 CPUs utilized            ( +-  0.62% ) [100.00%]
                 6 context-switches          #    0.116 K/sec                    ( +-  4.27% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               368 page-faults               #    0.008 M/sec                    ( +-  0.07% )
       120,423,860 cycles                    #    2.539 GHz                      ( +-  0.85% ) [14.23%]
         8,555,632 stalled-cycles-frontend   #    7.10% frontend cycles idle     ( +-  0.56% ) [16.23%]
        87,438,794 stalled-cycles-backend    #   72.61% backend  cycles idle     ( +-  1.13% ) [18.33%]
        55,039,308 instructions              #    0.46  insns per cycle        
                                             #    1.59  stalled cycles per insn  ( +-  0.05% ) [18.98%]
         5,619,298 branches                  #  118.491 M/sec                    ( +-  2.32% ) [18.98%]
           303,686 branch-misses             #    5.40% of all branches          ( +-  0.08% ) [18.98%]
        26,577,868 L1-dcache-loads           #  560.432 M/sec                    ( +-  0.05% ) [18.98%]
         1,323,630 L1-dcache-load-misses     #    4.98% of all L1-dcache hits    ( +-  0.14% ) [18.98%]
         3,426,016 LLC-loads                 #   72.242 M/sec                    ( +-  0.05% ) [18.98%]
         1,304,201 LLC-load-misses           #   38.07% of all LL-cache hits     ( +-  0.13% ) [18.98%]
        13,190,316 L1-icache-loads           #  278.137 M/sec                    ( +-  0.21% ) [18.98%]
            33,881 L1-icache-load-misses     #    0.26% of all L1-icache hits    ( +-  4.63% ) [17.93%]
        25,366,685 dTLB-loads                #  534.893 M/sec                    ( +-  0.24% ) [15.93%]
               734 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  8.40% ) [13.94%]
        13,314,660 iTLB-loads                #  280.759 M/sec                    ( +-  0.05% ) [12.97%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [12.98%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [12.98%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [12.87%]

       0.047194407 seconds time elapsed                                          ( +-  0.62% )

Parallel ALU:
 Performance counter stats for '/root/test.sh' (20 runs):

         57.395070 task-clock                #    1.004 CPUs utilized            ( +-  1.71% ) [100.00%]
                 5 context-switches          #    0.092 K/sec                    ( +-  3.90% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
               367 page-faults               #    0.006 M/sec                    ( +-  0.10% )
       143,232,396 cycles                    #    2.496 GHz                      ( +-  1.68% ) [16.73%]
         7,299,843 stalled-cycles-frontend   #    5.10% frontend cycles idle     ( +-  2.69% ) [18.47%]
       109,485,845 stalled-cycles-backend    #   76.44% backend  cycles idle     ( +-  2.01% ) [19.99%]
        56,867,669 instructions              #    0.40  insns per cycle        
                                             #    1.93  stalled cycles per insn  ( +-  0.22% ) [19.49%]
         6,646,323 branches                  #  115.800 M/sec                    ( +-  2.15% ) [17.75%]
           304,671 branch-misses             #    4.58% of all branches          ( +-  0.37% ) [16.23%]
        23,612,428 L1-dcache-loads           #  411.402 M/sec                    ( +-  0.05% ) [15.95%]
           518,988 L1-dcache-load-misses     #    2.20% of all L1-dcache hits    ( +-  0.11% ) [15.95%]
         2,934,119 LLC-loads                 #   51.121 M/sec                    ( +-  0.06% ) [15.95%]
           509,027 LLC-load-misses           #   17.35% of all LL-cache hits     ( +-  0.15% ) [15.95%]
        11,103,819 L1-icache-loads           #  193.463 M/sec                    ( +-  0.08% ) [15.95%]
             5,381 L1-icache-load-misses     #    0.05% of all L1-icache hits    ( +-  2.45% ) [15.95%]
        23,727,164 dTLB-loads                #  413.401 M/sec                    ( +-  0.06% ) [15.95%]
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits  [15.95%]
        11,104,205 iTLB-loads                #  193.470 M/sec                    ( +-  0.06% ) [15.95%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [15.95%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [15.95%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [15.96%]

       0.057151644 seconds time elapsed                                          ( +-  1.69% )

Both:
 Performance counter stats for '/root/test.sh' (20 runs):

         48.377833 task-clock                #    1.005 CPUs utilized            ( +-  0.67% ) [100.00%]
                 5 context-switches          #    0.113 K/sec                    ( +-  3.88% ) [100.00%]
                 0 cpu-migrations            #    0.001 K/sec                    ( +-100.00% ) [100.00%]
               367 page-faults               #    0.008 M/sec                    ( +-  0.08% )
       122,529,490 cycles                    #    2.533 GHz                      ( +-  1.05% ) [14.24%]
         8,796,729 stalled-cycles-frontend   #    7.18% frontend cycles idle     ( +-  0.56% ) [16.20%]
        88,936,550 stalled-cycles-backend    #   72.58% backend  cycles idle     ( +-  1.48% ) [18.16%]
        58,405,660 instructions              #    0.48  insns per cycle        
                                             #    1.52  stalled cycles per insn  ( +-  0.07% ) [18.61%]
         5,742,738 branches                  #  118.706 M/sec                    ( +-  1.54% ) [18.61%]
           303,555 branch-misses             #    5.29% of all branches          ( +-  0.09% ) [18.61%]
        26,321,789 L1-dcache-loads           #  544.088 M/sec                    ( +-  0.07% ) [18.61%]
         1,236,101 L1-dcache-load-misses     #    4.70% of all L1-dcache hits    ( +-  0.08% ) [18.61%]
         3,409,768 LLC-loads                 #   70.482 M/sec                    ( +-  0.05% ) [18.61%]
         1,212,511 LLC-load-misses           #   35.56% of all LL-cache hits     ( +-  0.08% ) [18.61%]
        10,579,372 L1-icache-loads           #  218.682 M/sec                    ( +-  0.05% ) [18.61%]
            19,426 L1-icache-load-misses     #    0.18% of all L1-icache hits    ( +- 14.70% ) [18.61%]
        25,329,963 dTLB-loads                #  523.586 M/sec                    ( +-  0.27% ) [17.29%]
               802 dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  5.43% ) [15.33%]
        10,635,524 iTLB-loads                #  219.843 M/sec                    ( +-  0.09% ) [13.38%]
                 0 iTLB-load-misses          #    0.00% of all iTLB cache hits  [12.72%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [12.72%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [12.72%]

       0.048140073 seconds time elapsed                                          ( +-  0.67% )


Which overall looks a lot more like I expect, save for the parallel ALU cases.
It seems here that the parallel ALU changes actually hurt performance, which
really seems counter-intuitive.  I don't yet have any explanation for that.  I
do note that we seem to have more stalls in the both case so perhaps the
parallel chains call for a more aggressive prefetch.  Do you have any thoughts?

Regards
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 12:52                             ` Ingo Molnar
  2013-10-29 13:07                               ` Neil Horman
@ 2013-10-29 14:12                               ` David Ahern
  1 sibling, 0 replies; 107+ messages in thread
From: David Ahern @ 2013-10-29 14:12 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On 10/29/13 6:52 AM, Ingo Molnar wrote:
>> According to the perf man page, I'm supposed to be able to use --
>> to separate perf command line parameters from the command I want
>> to run.  And it definately executed test.sh, I added an echo to
>> stdout in there as a test run and observed them get captured in
>> counters.txt
>
> Well, '--' can be used to delineate the command portion for cases
> where it's ambiguous.
>
> Here's it's unambiguous though. This:
>
>    perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
>
> stops parsing a valid option after the -ddd option, so in theory it
> should execute 'perf bench sched messaging -- /root/test.sh' where
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus
> ignored.

Normally with perf commands a workload can be specified to state how 
long to collect perf data. That is not the case for perf-bench.

David

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:11                                 ` Ingo Molnar
@ 2013-10-29 13:20                                   ` Neil Horman
  2013-10-29 14:17                                   ` Neil Horman
  1 sibling, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-29 13:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 02:11:49PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > I'm sure it worked properly on my system here, I specifically 
> > checked it, but I'll gladly run it again.  You have to give me an 
> > hour as I have a meeting to run to, but I'll have results shortly.
> 
> So what I tried to react to was this observation of yours:
> 
> > > Here's my data for running the same test with taskset 
> > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > > going on here, but doing so resulted in a 10x slowdown of the 
> > > > runtime of each iteration which I can't explain. [...]
> 
> A 10x slowdown would be consistent with not running your testcase 
> but 'perf bench sched messaging' by accident, or so.
> 
> But I was really just guessing wildly here.
> 
> Thanks,
> 
> 	Ingo
> 
Ok, well, I'll run it again in just a bit here.
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 13:07                               ` Neil Horman
@ 2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 13:20                                   ` Neil Horman
  2013-10-29 14:17                                   ` Neil Horman
  0 siblings, 2 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-29 13:11 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> I'm sure it worked properly on my system here, I specifically 
> checked it, but I'll gladly run it again.  You have to give me an 
> hour as I have a meeting to run to, but I'll have results shortly.

So what I tried to react to was this observation of yours:

> > > Here's my data for running the same test with taskset 
> > > restricting execution to only cpu0.  I'm not quite sure what's 
> > > going on here, but doing so resulted in a 10x slowdown of the 
> > > runtime of each iteration which I can't explain. [...]

A 10x slowdown would be consistent with not running your testcase 
but 'perf bench sched messaging' by accident, or so.

But I was really just guessing wildly here.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 12:52                             ` Ingo Molnar
@ 2013-10-29 13:07                               ` Neil Horman
  2013-10-29 13:11                                 ` Ingo Molnar
  2013-10-29 14:12                               ` David Ahern
  1 sibling, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-29 13:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 01:52:33PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > Sure it was this:
> > > > for i in `seq 0 1 3`
> > > > do
> > > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > > done >> counters.txt 2>&1
> > > > 
> > > > where test.sh is:
> > > > #!/bin/sh
> > > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > > 
> > > What does '-- /root/test.sh' do?
> > > 
> > > Unless I'm missing something, the line above will run:
> > > 
> > >   perf bench sched messaging -- /root/test.sh
> > > 
> > > which should be equivalent to:
> > > 
> > >   perf bench sched messaging
> > > 
> > > i.e. /root/test.sh won't be run.
> > 
> > According to the perf man page, I'm supposed to be able to use -- 
> > to separate perf command line parameters from the command I want 
> > to run.  And it definitely executed test.sh, I added an echo to 
> > stdout in there as a test run and observed them get captured in 
> > counters.txt
> 
> Well, '--' can be used to delineate the command portion for cases 
> where it's ambiguous.
> 
> Here's it's unambiguous though. This:
> 
>   perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> 
> stops parsing a valid option after the -ddd option, so in theory it 
> should execute 'perf bench sched messaging -- /root/test.sh' where 
> '-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
> ignored.
> 
> The message output you provided seems to suggest that to be the 
> case:
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
> See how the command executed by perf stat was 'perf bench ...'.
> 
> Did you want to run:
> 
>   perf stat --repeat 20 -C 0 -ddd /root/test.sh
> 
I'm sure it worked properly on my system here, I specifically checked it, but
I'll gladly run it again.  You have to give me an hour as I have a meeting to
run to, but I'll have results shortly.
Neil

> ?
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:49                           ` Neil Horman
@ 2013-10-29 12:52                             ` Ingo Molnar
  2013-10-29 13:07                               ` Neil Horman
  2013-10-29 14:12                               ` David Ahern
  0 siblings, 2 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-29 12:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Sure it was this:
> > > for i in `seq 0 1 3`
> > > do
> > > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > > done >> counters.txt 2>&1
> > > 
> > > where test.sh is:
> > > #!/bin/sh
> > > echo 1 > /sys/module/csum_test/parameters/test_fire
> > 
> > What does '-- /root/test.sh' do?
> > 
> > Unless I'm missing something, the line above will run:
> > 
> >   perf bench sched messaging -- /root/test.sh
> > 
> > which should be equivalent to:
> > 
> >   perf bench sched messaging
> > 
> > i.e. /root/test.sh won't be run.
> 
> According to the perf man page, I'm supposed to be able to use -- 
> to separate perf command line parameters from the command I want 
> to run.  And it definitely executed test.sh, I added an echo to 
> stdout in there as a test run and observed them get captured in 
> counters.txt

Well, '--' can be used to delineate the command portion for cases 
where it's ambiguous.

Here's it's unambiguous though. This:

  perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh

stops parsing a valid option after the -ddd option, so in theory it 
should execute 'perf bench sched messaging -- /root/test.sh' where 
'-- /root/test.sh' is simply a parameter to 'perf bench' and is thus 
ignored.

The message output you provided seems to suggest that to be the 
case:

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

See how the command executed by perf stat was 'perf bench ...'.

Did you want to run:

  perf stat --repeat 20 -C 0 -ddd /root/test.sh

?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:30                         ` Ingo Molnar
@ 2013-10-29 11:49                           ` Neil Horman
  2013-10-29 12:52                             ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-29 11:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 12:30:31PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sure it was this:
> > for i in `seq 0 1 3`
> > do
> > echo $i > /sys/module/csum_test/parameters/module_test_mode
> > taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> > done >> counters.txt 2>&1
> > 
> > where test.sh is:
> > #!/bin/sh
> > echo 1 > /sys/module/csum_test/parameters/test_fire
> 
> What does '-- /root/test.sh' do?
> 
> Unless I'm missing something, the line above will run:
> 
>   perf bench sched messaging -- /root/test.sh
> 
> which should be equivalent to:
> 
>   perf bench sched messaging
> 
> i.e. /root/test.sh won't be run.
> 
According to the perf man page, I'm supposed to be able to use -- to separate
perf command line parameters from the command I want to run.  And it definitely
executed test.sh, I added an echo to stdout in there as a test run and observed
them get captured in counters.txt

Neil

> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29 11:20                       ` Neil Horman
@ 2013-10-29 11:30                         ` Ingo Molnar
  2013-10-29 11:49                           ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-29 11:30 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Sure it was this:
> for i in `seq 0 1 3`
> do
> echo $i > /sys/module/csum_test/parameters/module_test_mode
> taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
> done >> counters.txt 2>&1
> 
> where test.sh is:
> #!/bin/sh
> echo 1 > /sys/module/csum_test/parameters/test_fire

What does '-- /root/test.sh' do?

Unless I'm missing something, the line above will run:

  perf bench sched messaging -- /root/test.sh

which should be equivalent to:

  perf bench sched messaging

i.e. /root/test.sh won't be run.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-29  8:25                     ` Ingo Molnar
@ 2013-10-29 11:20                       ` Neil Horman
  2013-10-29 11:30                         ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-29 11:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Tue, Oct 29, 2013 at 09:25:42AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Here's my data for running the same test with taskset restricting 
> > execution to only cpu0.  I'm not quite sure what's going on here, 
> > but doing so resulted in a 10x slowdown of the runtime of each 
> > iteration which I can't explain.  As before however, both the 
> > parallel ALU run and the prefetch run resulted in speedups, but 
> > the two together were not in any way additive.  I'm going to keep 
> > playing with the prefetch stride, unless you have an alternate 
> > theory.
> 
> Could you please cite the exact command-line you used for running 
> the test?
> 
> Thanks,
> 
> 	Ingo
> 

Sure it was this:
for i in `seq 0 1 3`
do
echo $i > /sys/module/csum_test/parameters/module_test_mode
taskset -c 0 perf stat --repeat 20 -C 0 -ddd perf bench sched messaging -- /root/test.sh
done >> counters.txt 2>&1

where test.sh is:
#!/bin/sh
echo 1 > /sys/module/csum_test/parameters/test_fire


As before, module_test_mode selects a case in a switch statement I added in
do_csum to test one of the 4 csum variants we've been discussing (base, prefetch,
parallel ALU or both), and test_fire is a callback trigger I use in the test
module to run 100000 iterations of a checksum operation.  As you requested, I
ran the above on cpu 0 (-C 0 on perf and -c 0 on taskset), and I removed all IRQ
affinity from cpu 0.
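
The module plumbing is roughly of this shape (a simplified sketch, not the
actual test module; csum_under_test(), test_buf and TEST_LEN stand in for the
real pieces):

#include <linux/module.h>
#include <linux/moduleparam.h>

static char test_buf[64 * 1024];	/* hypothetical stand-in buffer */
#define TEST_LEN sizeof(test_buf)

static int module_test_mode;		/* 0=base, 1=prefetch, 2=parallel ALU, 3=both */
module_param(module_test_mode, int, 0644);

static void csum_under_test(const void *buf, int len, int mode)
{
	/* dispatch to the do_csum() variant selected by 'mode' */
}

static int test_fire_set(const char *val, const struct kernel_param *kp)
{
	int i;

	/* 'echo 1 > .../test_fire' runs the benchmark loop */
	for (i = 0; i < 100000; i++)
		csum_under_test(test_buf, TEST_LEN, module_test_mode);
	return 0;
}

static const struct kernel_param_ops test_fire_ops = {
	.set	= test_fire_set,
};
module_param_cb(test_fire, &test_fire_ops, NULL, 0200);

MODULE_LICENSE("GPL");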

Regards
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 17:02       ` Doug Ledford
@ 2013-10-29  8:38         ` Ingo Molnar
  0 siblings, 0 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-29  8:38 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Eric Dumazet, Neil Horman, linux-kernel, H. Peter Anvin,
	Andi Kleen, Sebastien Dugue


* Doug Ledford <dledford@redhat.com> wrote:

> [ Snipped a couple of really nice real-life bandwidth tests. ]

> Some of my preliminary results:
> 
> 1) Regarding the initial claim that changing the code to have two 
> addition chains, allowing the use of two ALUs, doubling 
> performance: I'm just not seeing it.  I have a number of theories 
> about this, but they are dependent on point #2 below:
> 
> 2) Prefetch definitely helped, although how much depends on which 
> of the test setups I was using above.  The biggest gainer was B) 
> the E3-1240 V2 @ 3.40GHz based machines.
> 
> So, my theories about #1 are that, with modern CPUs, it's more our 
> load/store speed that is killing us than the ALU speed.  I tried 
> at least 5 distinctly different ALU algorithms, including one that 
> eliminated the use of the carry chain entirely, and none of them 
> had a noticeable effect.  On the other hand, prefetch always had a 
> noticeable effect.  I suspect the original patch worked and had a 
> performance benefit some time ago due to a quirk on some CPU 
> common back then, but modern CPUs are capable of optimizing the 
> routine well enough that the benefit of the patch is already in 
> our original csum routine due to CPU optimizations. [...]

That definitely sounds plausible.

> [...] Or maybe there is another explanation, but I'm not really 
> looking too hard for it.
> 
> I also tried two different prefetch methods on the theory that 
> memory access cycles are more important than CPU access cycles, 
> and there appears to be a minor benefit to wasting CPU cycles to 
> prevent unnecessary prefetches, even with 65520 as our MTU where a 
> 320 byte excess prefetch at the end of the operation only caused 
> us to load a few % points of extra memory.  I suspect that if I 
> dropped the MTU down to 9K (to mimic jumbo frames on a device 
> without tx/rx checksum offloads), the smart version of prefetch 
> would be a much bigger winner.  The fact that there is any 
> apparent difference at all on such a large copy tells me that 
> prefetch should probably always be smart and never dumb (and here 
> by smart versus dumb I mean prefetch should check to make sure you 
> aren't prefetching beyond the end of data you care about before 
> executing the prefetch instruction).

That looks like an important result and it should matter even more 
to ~1.5k MTU sizes where the prefetch window will be even larger 
relative to the IP packet size.

> What strikes me as important here is that these 8 core Intel CPUs 
> actually got *slower* with the ALU patch + prefetch.  This 
> warrants more investigation to find out if it's the prefetch or 
> the ALU patch that did the damage to the speed.  It's also worth 
> noting that these 8 core CPUs have such high variability that I 
> don't trust these measurements yet.

It might make sense to have a good look at the PMU counts for these 
cases to see what's going on.

Also, once the packet is copied to user-space, we might want to do a 
CLFLUSH on the originating buffer, to zap the cacheline from the CPU 
caches. (This might or might not matter, depending on how good the 
CPU is at keeping its true working set in the cache.)
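
Something along these lines, purely illustrative (clflush_cache_range() is, as
far as I know, the existing x86 helper for this; whether the buffer really is
dead at that point is exactly the open question):

#include <asm/cacheflush.h>

/*
 * After the packet has been copied (and checksummed) to user-space, drop
 * the source buffer's cachelines so they stop competing with the real
 * working set.  May well be a wash or a loss if the buffer is touched again.
 */
static inline void drop_source_cachelines(void *buf, unsigned int len)
{
	clflush_cache_range(buf, len);
}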

> > I'm a bit sceptical - I think 'looking 1-2 cachelines in 
> > advance' is something that might work reasonably well on a wide 
> > range of systems, while trying to find a bus capacity/latency 
> > dependent sweet spot would be difficult.
> 
> I think 1-2 cachelines is probably way too short. [...]

The 4-5 cachelines result you seem to be converging on looks very 
plausible to me too.

What I think we should try to avoid is to make the actual window per 
system variable: that would be really hard to tune right.

But the 'don't prefetch past the buffer' "smart prefetch" logic you 
mentioned is system-agnostic and might make sense to introduce.
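
In csum-loop terms the bounds check amounts to something like this (a minimal
sketch assuming a simple 8-byte-at-a-time loop rather than the real do_csum(),
using the 5*64 stride Neil has been testing):

#include <linux/prefetch.h>

#define CSUM_PREFETCH_STRIDE	(5 * 64)	/* ~5 cachelines ahead */

static u64 csum_words(const u64 *p, const u64 *end)
{
	u64 sum = 0;

	while (p < end) {
		const void *pf = (const char *)p + CSUM_PREFETCH_STRIDE;

		/* "smart" prefetch: never touch lines past the end of the data */
		if (pf < (const void *)end)
			prefetch(pf);

		sum += *p++;	/* ones-complement carry handling omitted for brevity */
	}
	return sum;
}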

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 18:29                   ` Neil Horman
@ 2013-10-29  8:25                     ` Ingo Molnar
  2013-10-29 11:20                       ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-29  8:25 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Here's my data for running the same test with taskset restricting 
> execution to only cpu0.  I'm not quite sure what's going on here, 
> but doing so resulted in a 10x slowdown of the runtime of each 
> iteration which I can't explain.  As before however, both the 
> parallel ALU run and the prefetch run resulted in speedups, but 
> the two together were not in any way additive.  I'm going to keep 
> playing with the prefetch stride, unless you have an alternate 
> theory.

Could you please cite the exact command-line you used for running 
the test?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 17:46                 ` Neil Horman
@ 2013-10-28 18:29                   ` Neil Horman
  2013-10-29  8:25                     ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-28 18:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 01:46:30PM -0400, Neil Horman wrote:
> On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Looking at the specific cpu counters we get this:
> > > 
> > > Base:
> > >      Total time: 0.179 [sec]
> > > 
> > >  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > > 
> > >        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
> > >             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
> > >              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
> > 
> > Hm, for these second round of measurements were you using 'perf stat 
> > -a -C ...'?
> > 
> > The most accurate method of measurement for such single-threaded 
> > workloads is something like:
> > 
> > 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> > 
> > this will bind your workload to CPU#0, and will do PMU measurements 
> > only there - without mixing in other CPUs or workloads.
> > 
> > Thanks,
> > 
> > 	Ingo
> I wasn't, but I will...
> Neil
> 
> > --

Here's my data for running the same test with taskset restricting execution to
only cpu0.  I'm not quite sure what's going on here, but doing so resulted in a
10x slowdown of the runtime of each iteration which I can't explain.  As before
however, both the parallel ALU run and the prefetch run resulted in speedups,
but the two together were not in any way additive.  I'm going to keep playing
with the prefetch stride, unless you have an alternate theory.

Regards
Neil


Base:
     Total time: 1.013 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1140.286043 task-clock                #    1.001 CPUs utilized            ( +-  0.65% ) [100.00%]
            48,779 context-switches          #    0.043 M/sec                    ( +- 10.08% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,398 page-faults               #    0.066 M/sec                    ( +-  0.05% )
     2,950,225,491 cycles                    #    2.587 GHz                      ( +-  0.65% ) [16.63%]
       263,349,439 stalled-cycles-frontend   #    8.93% frontend cycles idle     ( +-  1.87% ) [16.70%]
     1,615,723,017 stalled-cycles-backend    #   54.77% backend  cycles idle     ( +-  0.64% ) [16.76%]
     2,168,440,946 instructions              #    0.74  insns per cycle        
                                             #    0.75  stalled cycles per insn  ( +-  0.52% ) [16.76%]
       406,885,149 branches                  #  356.827 M/sec                    ( +-  0.61% ) [16.74%]
        10,099,789 branch-misses             #    2.48% of all branches          ( +-  0.73% ) [16.73%]
     1,138,829,982 L1-dcache-loads           #  998.723 M/sec                    ( +-  0.57% ) [16.71%]
        21,341,094 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  1.22% ) [16.69%]
        38,453,870 LLC-loads                 #   33.723 M/sec                    ( +-  1.46% ) [16.67%]
         9,587,987 LLC-load-misses           #   24.93% of all LL-cache hits     ( +-  0.48% ) [16.66%]
       566,241,820 L1-icache-loads           #  496.579 M/sec                    ( +-  0.70% ) [16.65%]
         9,061,979 L1-icache-load-misses     #    1.60% of all L1-icache hits    ( +-  3.39% ) [16.65%]
     1,130,620,555 dTLB-loads                #  991.524 M/sec                    ( +-  0.64% ) [16.64%]
           423,302 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.89% ) [16.63%]
       563,371,089 iTLB-loads                #  494.061 M/sec                    ( +-  0.62% ) [16.62%]
           215,406 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.97% ) [16.60%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.59%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.58%]

       1.139598762 seconds time elapsed                                          ( +-  0.65% )

Prefetch:
     Total time: 0.981 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1128.603117 task-clock                #    1.001 CPUs utilized            ( +-  0.66% ) [100.00%]
            45,992 context-switches          #    0.041 M/sec                    ( +-  9.47% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,428 page-faults               #    0.067 M/sec                    ( +-  0.06% )
     2,920,666,228 cycles                    #    2.588 GHz                      ( +-  0.66% ) [16.59%]
       255,998,006 stalled-cycles-frontend   #    8.77% frontend cycles idle     ( +-  1.78% ) [16.67%]
     1,601,090,475 stalled-cycles-backend    #   54.82% backend  cycles idle     ( +-  0.69% ) [16.75%]
     2,164,301,312 instructions              #    0.74  insns per cycle        
                                             #    0.74  stalled cycles per insn  ( +-  0.59% ) [16.78%]
       404,920,928 branches                  #  358.781 M/sec                    ( +-  0.54% ) [16.77%]
        10,025,146 branch-misses             #    2.48% of all branches          ( +-  0.66% ) [16.75%]
     1,133,764,674 L1-dcache-loads           # 1004.573 M/sec                    ( +-  0.47% ) [16.74%]
        21,251,432 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  1.01% ) [16.72%]
        38,006,432 LLC-loads                 #   33.676 M/sec                    ( +-  1.56% ) [16.70%]
         9,625,034 LLC-load-misses           #   25.32% of all LL-cache hits     ( +-  0.40% ) [16.68%]
       565,712,289 L1-icache-loads           #  501.250 M/sec                    ( +-  0.57% ) [16.66%]
         8,726,826 L1-icache-load-misses     #    1.54% of all L1-icache hits    ( +-  3.40% ) [16.64%]
     1,130,140,463 dTLB-loads                # 1001.362 M/sec                    ( +-  0.53% ) [16.63%]
           419,645 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.44% ) [16.62%]
       560,199,307 iTLB-loads                #  496.365 M/sec                    ( +-  0.51% ) [16.61%]
           213,413 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.65% ) [16.59%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.56%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.54%]

       1.127934534 seconds time elapsed                                          ( +-  0.66% )


Parallel ALU:
     Total time: 0.986 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1131.914738 task-clock                #    1.001 CPUs utilized            ( +-  0.49% ) [100.00%]
            40,807 context-switches          #    0.036 M/sec                    ( +- 10.72% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                    ( +-100.00% ) [100.00%]
            75,329 page-faults               #    0.067 M/sec                    ( +-  0.04% )
     2,929,149,996 cycles                    #    2.588 GHz                      ( +-  0.49% ) [16.58%]
       250,428,558 stalled-cycles-frontend   #    8.55% frontend cycles idle     ( +-  1.75% ) [16.66%]
     1,621,074,968 stalled-cycles-backend    #   55.34% backend  cycles idle     ( +-  0.46% ) [16.73%]
     2,147,405,781 instructions              #    0.73  insns per cycle        
                                             #    0.75  stalled cycles per insn  ( +-  0.56% ) [16.77%]
       401,196,771 branches                  #  354.441 M/sec                    ( +-  0.58% ) [16.76%]
         9,941,701 branch-misses             #    2.48% of all branches          ( +-  0.67% ) [16.74%]
     1,126,651,774 L1-dcache-loads           #  995.350 M/sec                    ( +-  0.50% ) [16.73%]
        21,075,294 L1-dcache-load-misses     #    1.87% of all L1-dcache hits    ( +-  0.96% ) [16.72%]
        37,885,850 LLC-loads                 #   33.471 M/sec                    ( +-  1.10% ) [16.71%]
         9,729,116 LLC-load-misses           #   25.68% of all LL-cache hits     ( +-  0.62% ) [16.69%]
       562,058,495 L1-icache-loads           #  496.556 M/sec                    ( +-  0.54% ) [16.67%]
         8,617,450 L1-icache-load-misses     #    1.53% of all L1-icache hits    ( +-  3.06% ) [16.65%]
     1,121,765,737 dTLB-loads                #  991.034 M/sec                    ( +-  0.57% ) [16.63%]
           388,875 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  4.27% ) [16.62%]
       556,029,393 iTLB-loads                #  491.229 M/sec                    ( +-  0.64% ) [16.61%]
           189,181 iTLB-load-misses          #    0.03% of all iTLB cache hits   ( +-  6.98% ) [16.60%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.58%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.56%]

       1.131247174 seconds time elapsed                                          ( +-  0.49% )


Both:
     Total time: 0.993 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1130.912197 task-clock                #    1.001 CPUs utilized            ( +-  0.60% ) [100.00%]
            45,859 context-switches          #    0.041 M/sec                    ( +-  9.00% ) [100.00%]
                 0 cpu-migrations            #    0.000 K/sec                   [100.00%]
            75,398 page-faults               #    0.067 M/sec                    ( +-  0.07% )
     2,926,527,048 cycles                    #    2.588 GHz                      ( +-  0.60% ) [16.60%]
       255,482,254 stalled-cycles-frontend   #    8.73% frontend cycles idle     ( +-  1.62% ) [16.67%]
     1,608,247,364 stalled-cycles-backend    #   54.95% backend  cycles idle     ( +-  0.73% ) [16.74%]
     2,162,135,903 instructions              #    0.74  insns per cycle        
                                             #    0.74  stalled cycles per insn  ( +-  0.46% ) [16.77%]
       403,436,790 branches                  #  356.736 M/sec                    ( +-  0.44% ) [16.76%]
        10,062,572 branch-misses             #    2.49% of all branches          ( +-  0.85% ) [16.75%]
     1,133,889,264 L1-dcache-loads           # 1002.632 M/sec                    ( +-  0.56% ) [16.74%]
        21,460,116 L1-dcache-load-misses     #    1.89% of all L1-dcache hits    ( +-  1.31% ) [16.73%]
        38,070,119 LLC-loads                 #   33.663 M/sec                    ( +-  1.63% ) [16.72%]
         9,593,162 LLC-load-misses           #   25.20% of all LL-cache hits     ( +-  0.42% ) [16.71%]
       562,867,188 L1-icache-loads           #  497.711 M/sec                    ( +-  0.59% ) [16.68%]
         8,472,343 L1-icache-load-misses     #    1.51% of all L1-icache hits    ( +-  3.02% ) [16.64%]
     1,126,997,403 dTLB-loads                #  996.538 M/sec                    ( +-  0.53% ) [16.61%]
           414,900 dTLB-load-misses          #    0.04% of all dTLB cache hits   ( +-  4.12% ) [16.60%]
       561,156,032 iTLB-loads                #  496.198 M/sec                    ( +-  0.56% ) [16.59%]
           212,482 iTLB-load-misses          #    0.04% of all iTLB cache hits   ( +-  6.10% ) [16.58%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [16.57%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [16.56%]

       1.130242195 seconds time elapsed                                          ( +-  0.60% )


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:20               ` Ingo Molnar
@ 2013-10-28 17:49                 ` Neil Horman
  0 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-28 17:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 05:20:45PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Base:
> >        0.093269042 seconds time elapsed                                          ( +-  2.24% )
> > Prefetch (5x64):
> >        0.079440009 seconds time elapsed                                          ( +-  2.29% )
> > Parallel ALU:
> >        0.087666677 seconds time elapsed                                          ( +-  4.01% )
> > Prefetch + Parallel ALU:
> >        0.080758702 seconds time elapsed                                          ( +-  2.34% )
> > 
> > So we can see here that we get about a 1% speedup between the base 
> > and the both (Prefetch + Parallel ALU) case, with prefetch 
> > accounting for most of that speedup.
> 
> Hm, there's still something strange about these results. So the 
> range of the results is 790-930 nsecs. The noise of the measurements 
> is 2%-4%, i.e. 20-40 nsecs.
> 
> The prefetch-only result itself is the fastest of all - 
> statistically equivalent to the prefetch+parallel-ALU result, within 
> the noise range.
> 
> So if prefetch is enabled, turning on parallel-ALU has no measurable 
> effect - which is counter-intuitive. Do you have a 
> theory/explanation for that?
> 
> Thanks,
 I mentioned it farther down, loosely theorizing that running with parallel
alu's in conjunction with a prefetch puts more pressure on the load/store unit,
causing stalls while both alu's wait for the L1 cache to fill.  Not sure if that
makes sense, but I did note that in the both (prefetch + alu) case our data cache
hit rate was somewhat degraded, so I was going to play with the prefetch stride
to see if that fixed the situation.  Regardless, I agree the lack of improvement
in the both case is definitely counter-intuitive.

Neil

> 
> 	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:24               ` Ingo Molnar
  2013-10-28 16:49                 ` David Ahern
@ 2013-10-28 17:46                 ` Neil Horman
  2013-10-28 18:29                   ` Neil Horman
  1 sibling, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-28 17:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On Mon, Oct 28, 2013 at 05:24:38PM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Looking at the specific cpu counters we get this:
> > 
> > Base:
> >      Total time: 0.179 [sec]
> > 
> >  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> > 
> >        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
> >             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
> >              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
> 
> Hm, for these second round of measurements were you using 'perf stat 
> -a -C ...'?
> 
> The most accurate method of measurement for such single-threaded 
> workloads is something like:
> 
> 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
> 
> this will bind your workload to CPU#0, and will do PMU measurements 
> only there - without mixing in other CPUs or workloads.
> 
> Thanks,
> 
> 	Ingo
I wasn't, but I will...
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-26 11:55     ` Ingo Molnar
@ 2013-10-28 17:02       ` Doug Ledford
  2013-10-29  8:38         ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Doug Ledford @ 2013-10-28 17:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Neil Horman, linux-kernel, H. Peter Anvin,
	Andi Kleen, Sebastien Dugue


On 10/26/2013 07:55 AM, Ingo Molnar wrote:
>
> * Doug Ledford <dledford@redhat.com> wrote:
>
>>> What I was objecting to strongly here was to measure the _wrong_
>>> thing, i.e. the cache-hot case. The cache-cold case should be
>>> measured in a low noise fashion, so that results are
>>> representative. It's closer to the real usecase than any other
>>> microbenchmark. That will give us a usable speedup figure and
>>> will tell us which technique helped how much and which parameter
>>> should be how large.
>>
>> Cold cache, yes.  Low noise, yes.  But you need DMA traffic at the
>> same time to be truly representative.
>
> Well, but in most usecases network DMA traffic is an order of
> magnitude smaller than system bus capacity. 100 gigabit network
> traffic is possible but not very common.

That's not necessarily true.  For gigabit, that's true.  For something
faster, even just 10GigE, it's not.  At least not when you consider that
network traffic usually involves hitting the bus at least two times, and
up to four times, depending on how it's processed on receive and whether
it goes cold from cache between accesses (once for the DMA from card to
memory, once for csum_partial so we know if the packet was good, and a
third time in copy_to_user so the user application can do something with
it, and possibly a fourth time if the user space application does
something with it).

> So I'd say that _if_ prefetching helps in the typical case we should
> tune it for that - not for the bus-contended case...

Well, I've been running a lot of tests here on various optimizations.
Some have helped, some not so much.  But I haven't been doing
micro-benchmarks like Neil.  I've been focused on running netperf over
IPoIB interfaces.  That should at least mimic real use somewhat and likely
be more indicative of what the change will do to the system as a
whole than a micro-benchmark will.

I have a number of test systems, and they have a matrix of three
combinations of InfiniBand link speed and PCI-e bus speed that change
the theoretical max for each system.

For the 40GBit/s InfiniBand, the theoretical max throughput is 4GByte/s
(8b/10b wire encoding, not bothering to account for headers and such).

For the 56GBit/s InfiniBand, the theoretical max throughput is ~7GByte/s
(64b/66b wire encoding).

For the PCI-e gen2 system, the PCI-e theoretical limit is 40GBit/s, for
the PCI-e gen3 systems the PCI-e theoretical limit is 64GBit/s. However,
with a max PCI-e payload of 128 bytes, the PCI-e gen2 bus will
definitely be a bottleneck before the 56GBit/s InfiniBand link.  The
PCI-e gen3 busses are probably right on par with a 56GBit/s InfiniBand
link in terms of max possible throughput.

Here are my test systems:

A - 2 Dell PowerEdge R415 AMD based servers, dual quad core processors
at 2.6GHz, 2MB L2, 5MB L3 cache, 32GB DDR3 1333 RAM, 56GBit/s InfiniBand
link speed on a card in a PCI-e Gen2 slot.  Results of base performance
bandwidth test:

[root@rdma-dev-00 ~]# qperf -t 15 ib0-dev-01 rc_bw rc_bi_bw
rc_bw:
    bw  =  2.93 GB/sec
rc_bi_bw:
    bw  =  5.5 GB/sec


B - 2 HP DL320e Gen8 servers, single Intel quad core Intel(R) Xeon(R)
CPU E3-1240 V2 @ 3.40GHz, 8GB DDR3 1600 RAM, card in PCI-e Gen3 slot
(8GT/s x8 active config).  Results of base performance bandwidth test
(40GBit/s link):

[root@rdma-qe-10 ~]# qperf -t 15 ib1-qe-11 rc_bw rc_bi_bw
rc_bw:
    bw  =  3.55 GB/sec
rc_bi_bw:
    bw  =  6.75 GB/sec


C - 2 HP DL360p Gen8 servers, dual Intel 8-core Intel(R) Xeon(R) CPU
E5-2660 0 @ 2.20GHz, 32GB DDR3 1333 RAM, card in PCI-e Gen3 slot (8GT/s
x8 active config).  Results of base performance bandwidth test (56GBit/s
link):

[root@rdma-perf-00 ~]# qperf -t 15 ib0-perf-01 rc_bw rc_bi_bw
rc_bw:
    bw  =  5.87 GB/sec
rc_bi_bw:
    bw  =  12.3 GB/sec


Some of my preliminary results:

1) Regarding the initial claim that changing the code to have two
addition chains, allowing the use of two ALUs, would double performance: I'm
just not seeing it.  I have a number of theories about this, but they
are dependent on point #2 below:

2) Prefetch definitely helped, although how much depends on which of the
test setups I was using above.  The biggest gainer was B) the E3-1240 V2
@ 3.40GHz based machines.

So, my theories about #1 are that, with modern CPUs, it's more our
load/store speed that is killing us than the ALU speed.  I tried at
least 5 distinctly different ALU algorithms, including one that
eliminated the use of the carry chain entirely, and none of them had a
noticeable effect.  On the other hand, prefetch always had a noticeable
effect.  I suspect the original patch worked and had a performance
benefit some time ago due to a quirk on some CPU common back then, but
modern CPUs are capable of optimizing the routine well enough that the
benefit of the patch is already in our original csum routine due to CPU
optimizations.  Or maybe there is another explanation, but I'm not
really looking too hard for it.
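
For a feel of what "eliminated the use of the carry chain entirely" looks
like, one such variant is sketched below.  This is illustrative only, not the
exact code I ran; it assumes the length is a multiple of 4 bytes and uses
kernel-style u32/u64 types:

static u32 csum_no_carry_chain(const u32 *buf, int words)
{
        u64 sum = 0;
        int i;

        /* 32-bit loads accumulated into a 64-bit register: the upper half
         * absorbs the carries, so no adc (and no flags dependency) is needed
         * until the final fold.  The adds still serialize on the register. */
        for (i = 0; i < words; i++)
                sum += buf[i];

        /* fold 64 -> 32 -> 16 bits with end-around carry */
        sum = (sum & 0xffffffff) + (sum >> 32);
        sum = (sum & 0xffffffff) + (sum >> 32);
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (u32)sum;
}

The loads themselves are unchanged, which fits the theory above that
load/store speed, not the ALU chain, is the limiting factor.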

I also tried two different prefetch methods on the theory that memory
access cycles are more important than CPU access cycles, and there
appears to be a minor benefit to wasting CPU cycles to prevent
unnecessary prefetches, even with 65520 as our MTU where a 320 byte
excess prefetch at the end of the operation only caused us to load a few
% points of extra memory.  I suspect that if I dropped the MTU down to
9K (to mimic jumbo frames on a device without tx/rx checksum offloads),
the smart version of prefetch would be a much bigger winner.  The fact
that there is any apparent difference at all on such a large copy tells
me that prefetch should probably always be smart and never dumb (and
here by smart versus dumb I mean prefetch should check to make sure you
aren't prefetching beyond the end of data you care about before
executing the prefetch instruction).
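
In code, the smart versus dumb distinction is roughly the following (again
just an illustrative sketch, not the code I benchmarked; the 5*64 byte
distance mirrors the 5x64 stride used elsewhere in this thread, and
prefetch() is the usual <linux/prefetch.h> helper):

#define PREFETCH_DIST   (5 * 64)        /* 5 cachelines ahead */

/* "dumb": always prefetch a fixed distance ahead, even past the buffer */
static inline void dumb_prefetch(const u8 *cur)
{
        prefetch(cur + PREFETCH_DIST);
}

/* "smart": spend a compare to skip prefetches that would land beyond the
 * end of the data we actually care about */
static inline void smart_prefetch(const u8 *cur, const u8 *end)
{
        if (cur + PREFETCH_DIST < end)
                prefetch(cur + PREFETCH_DIST);
}

The compare costs a couple of cycles per chunk, but it avoids pulling in
cachelines we will never checksum; that excess is the extra memory traffic
mentioned above.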

What I've found probably warrants more experimentation on the optimum
prefetch methods.  I also have another idea on speeding up the ALU
operations that I want to try.  So I'm not ready to send off everything
I have yet (and people wouldn't want that anyway, my collected data set
is megabytes in size).  But just to demonstrate some of what I'm seeing
here (notes: Recv CPU% of 12.5% is one CPU core pegged to 100% usage for
the A and B systems; for the C systems, 3.125% is 100% usage for one CPU
core.  Also, although not so apparent on the AMD CPUs, the odd runs are
all with perf record and the even runs are with perf stat, and perf record
causes the odd runs to generally have a lower throughput (this
effect is *huge* on the Intel 8 core CPUs, fully cutting throughput in
half on those systems)):

For the A systems:
Stock kernel:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1082.47   3.69     12.55    0.266   0.906
  1087.64   3.46     12.52    0.249   0.899
  1104.43   3.52     12.53    0.249   0.886
  1090.37   3.68     12.51    0.264   0.897
  1078.73   3.13     12.56    0.227   0.910
  1091.88   3.63     12.52    0.259   0.896

With ALU patch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1075.01   3.70     12.53    0.269   0.911
  1116.90   3.86     12.53    0.270   0.876
  1073.40   3.67     12.54    0.267   0.913
  1092.79   3.83     12.52    0.274   0.895
  1108.69   2.98     12.56    0.210   0.885
  1116.76   2.66     12.51    0.186   0.875

With ALU patch + 5*64 smart prefetch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1243.05   4.63     12.60    0.291   0.792
  1194.70   5.80     12.58    0.380   0.822
  1149.15   4.09     12.57    0.278   0.854
  1207.21   5.69     12.53    0.368   0.811
  1204.07   4.27     12.57    0.277   0.816
  1191.04   4.78     12.60    0.313   0.826


For the B systems:
Stock kernel:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  2778.98   7.75     12.34    0.218   0.347
  2819.14   7.31     12.52    0.203   0.347
  2721.43   8.43     12.19    0.242   0.350
  2832.93   7.38     12.58    0.203   0.347
  2770.07   8.01     12.27    0.226   0.346
  2829.17   7.27     12.51    0.201   0.345

With ALU patch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  2801.36   8.18     11.97    0.228   0.334
  2927.81   7.52     12.51    0.201   0.334
  2808.32   8.62     11.98    0.240   0.333
  2918.12   7.20     12.54    0.193   0.336
  2730.00   8.85     11.60    0.253   0.332
  2932.17   7.37     12.51    0.196   0.333

With ALU patch + 5*64 smart prefetch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  3029.53   9.34     10.67    0.241   0.275
  3229.36   7.81     11.65    0.189   0.282  <- this is a saturated
                                           40GBit/s InfiniBand link,
                                           and the recv CPU is no longer
                                           pegged at 100%, so the gains
                                           here are higher than just the
                                           throughput gains suggest
  3161.14   8.24     11.10    0.204   0.274
  3171.78   7.80     11.89    0.192   0.293
  3134.01   8.35     10.99    0.208   0.274
  3235.50   7.75     11.57    0.187   0.279  <- ditto here

For the C systems:
Stock kernel:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1091.03   1.59     3.14     0.454   0.900
  2299.34   2.57     3.07     0.350   0.417
  1177.07   1.71     3.15     0.455   0.838
  2312.59   2.54     3.02     0.344   0.408
  1273.94   2.03     3.15     0.499   0.772
  2591.50   2.76     3.19     0.332   0.385

With ALU patch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  Data for this series is missing (these machines were added to
  the matrix late and this kernel had already been rebuilt to
  something else and was no longer installable...I could recreate
  this if people really care).

With ALU patch + 5*64 smart prefetch:
            Utilization       Service Demand
            Send     Recv     Send    Recv
Throughput  local    remote   local   remote
MBytes  /s  % S      % S      us/KB   us/KB
  1377.03   2.05     3.13     0.466   0.711
  2002.30   2.40     3.04     0.374   0.474
  1470.18   2.25     3.13     0.479   0.666
  1994.96   2.44     3.08     0.382   0.482
  1167.82   1.72     3.14     0.461   0.840
  2004.49   2.46     3.06     0.384   0.477

What strikes me as important here is that these 8 core Intel CPUs
actually got *slower* with the ALU patch + prefetch.  This warrants more
investigation to find out if it's the prefetch or the ALU patch that did
the damage to the speed.  It's also worth noting that these 8 core CPUs
have such high variability that I don't trust these measurements yet.

>>> More importantly, the 'maximally adversarial' case is very hard
>>> to generate, validate, and it's highly system dependent!
>>
>> This I agree with 100%, which is why I tend to think we should
>> scrap the static prefetch optimizations entirely and have a boot
>> up test that allows us to find our optimum prefetch distance for
>> our given hardware.
>
> Would be interesting to see.
>
> I'm a bit sceptical - I think 'looking 1-2 cachelines in advance' is
> something that might work reasonably well on a wide range of
> systems, while trying to find a bus capacity/latency dependent sweet
> spot would be difficult.

I think 1-2 cachelines is probably way too short.  Measuring the length
of time that we stall when accessing memory for the first time and then
comparing that to operation cycles for typical instruction chains would
give us more insight I think.  That or just tinkering with numbers and
seeing where things work best (but not just on static tests, under a
variety of workloads).

> We had pretty bad experience from boot-time measurements, and it's
> not for lack of trying: I implemented the raid algorithm
> benchmarking thing and also the scheduler's boot time cache-size
> probing, both were problematic and have hurt reproducability and
> debuggability.

OK, that's it from me for now, off to run more tests and try more things...



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:24               ` Ingo Molnar
@ 2013-10-28 16:49                 ` David Ahern
  2013-10-28 17:46                 ` Neil Horman
  1 sibling, 0 replies; 107+ messages in thread
From: David Ahern @ 2013-10-28 16:49 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev

On 10/28/13 10:24 AM, Ingo Molnar wrote:

> The most accurate method of measurement for such single-threaded
> workloads is something like:
>
> 	taskset 0x1 perf stat -a -C 1 --repeat 20 ...
>
> this will bind your workload to CPU#0, and will do PMU measurements
> only there - without mixing in other CPUs or workloads.

you can drop the -a if you only want a specific CPU (-C arg). And -C in 
perf is cpu number starting with 0, so in your example above -C 1 means 
cpu1, not cpu0.

David



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:01             ` Neil Horman
  2013-10-28 16:20               ` Ingo Molnar
@ 2013-10-28 16:24               ` Ingo Molnar
  2013-10-28 16:49                 ` David Ahern
  2013-10-28 17:46                 ` Neil Horman
  1 sibling, 2 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-28 16:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Looking at the specific cpu counters we get this:
> 
> Base:
>      Total time: 0.179 [sec]
> 
>  Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):
> 
>        1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
>             14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
>              2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )

Hm, for these second round of measurements were you using 'perf stat 
-a -C ...'?

The most accurate method of measurement for such single-threaded 
workloads is something like:

	taskset 0x1 perf stat -a -C 1 --repeat 20 ...

this will bind your workload to CPU#0, and will do PMU measurements 
only there - without mixing in other CPUs or workloads.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-28 16:01             ` Neil Horman
@ 2013-10-28 16:20               ` Ingo Molnar
  2013-10-28 17:49                 ` Neil Horman
  2013-10-28 16:24               ` Ingo Molnar
  1 sibling, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-28 16:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Base:
>        0.093269042 seconds time elapsed                                          ( +-  2.24% )
> Prefetch (5x64):
>        0.079440009 seconds time elapsed                                          ( +-  2.29% )
> Parallel ALU:
>        0.087666677 seconds time elapsed                                          ( +-  4.01% )
> Prefetch + Parallel ALU:
>        0.080758702 seconds time elapsed                                          ( +-  2.34% )
> 
> So we can see here that we get about a 1% speedup between the base 
> and the both (Prefetch + Parallel ALU) case, with prefetch 
> accounting for most of that speedup.

Hm, there's still something strange about these results. So the 
range of the results is 790-930 nsecs. The noise of the measurements 
is 2%-4%, i.e. 20-40 nsecs.

The prefetch-only result itself is the fastest of all - 
statistically equivalent to the prefetch+parallel-ALU result, within 
the noise range.

So if prefetch is enabled, turning on parallel-ALU has no measurable 
effect - which is counter-intuitive. Do you have a 
theory/explanation for that?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  8:41           ` Ingo Molnar
  2013-10-17 18:19             ` H. Peter Anvin
@ 2013-10-28 16:01             ` Neil Horman
  2013-10-28 16:20               ` Ingo Molnar
  2013-10-28 16:24               ` Ingo Molnar
  1 sibling, 2 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-28 16:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev



Ingo, et al.-
	Ok, sorry for the delay, here are the test results you've been asking
for.


First, some information about what I did.  I attached the module that I ran this
test with at the bottom of this email.  You'll note that I started using a
module parameter write path to trigger the csum rather than the module load
path.  The latter seemed to be giving me lots of variance in my run times, which
I wanted to eliminate.  I attributed it to the module load mechanism itself, and
by using the parameter write path, I was able to get more consistent results.

First, the run time tests:

I ran this command:
for i in `seq 0 1 3`
do
	 echo $i > /sys/module/csum_test/parameters/module_test_mode
	 perf stat --repeat 20 --null bash -c "echo 1 > /sys/module/csum_test/parameters/test_fire"
done

The for loop allows me to change the module_test_mode, which is tied to a switch
statement in do_csum that selects which checksumming method we use
(base/prefetch/parallel alu/both).  The results are:


Base:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.093269042 seconds time elapsed                                          ( +-  2.24% )

Prefetch (5x64):
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.079440009 seconds time elapsed                                          ( +-  2.29% )

Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.087666677 seconds time elapsed                                          ( +-  4.01% )

Prefetch + Parallel ALU:
 Performance counter stats for 'bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       0.080758702 seconds time elapsed                                          ( +-  2.34% )

So we can see here that we get about a 1% speedup between the base and the both
(Prefetch + Parallel ALU) case, with prefetch accounting for most of that
speedup.

Looking at the specific cpu counters we get this:


Base:
     Total time: 0.179 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1571.304618 task-clock                #    5.213 CPUs utilized            ( +-  0.45% )
            14,423 context-switches          #    0.009 M/sec                    ( +-  4.28% )
             2,710 cpu-migrations            #    0.002 M/sec                    ( +-  2.83% )
            75,402 page-faults               #    0.048 M/sec                    ( +-  0.07% )
     1,597,349,326 cycles                    #    1.017 GHz                      ( +-  1.74% ) [40.51%]
       104,882,858 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.25% ) [40.33%]
     1,043,429,984 stalled-cycles-backend    #   65.32% backend  cycles idle     ( +-  1.25% ) [39.73%]
       868,372,132 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.43% ) [39.88%]
       161,143,820 branches                  #  102.554 M/sec                    ( +-  1.49% ) [39.76%]
         4,348,075 branch-misses             #    2.70% of all branches          ( +-  1.43% ) [39.99%]
       457,042,576 L1-dcache-loads           #  290.868 M/sec                    ( +-  1.25% ) [40.63%]
         8,928,240 L1-dcache-load-misses     #    1.95% of all L1-dcache hits    ( +-  1.26% ) [41.17%]
        15,821,051 LLC-loads                 #   10.069 M/sec                    ( +-  1.56% ) [41.20%]
         4,902,576 LLC-load-misses           #   30.99% of all LL-cache hits     ( +-  1.51% ) [41.36%]
       235,775,688 L1-icache-loads           #  150.051 M/sec                    ( +-  1.39% ) [41.10%]
         3,116,106 L1-icache-load-misses     #    1.32% of all L1-icache hits    ( +-  3.43% ) [40.96%]
       461,315,416 dTLB-loads                #  293.588 M/sec                    ( +-  1.43% ) [41.18%]
           140,280 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.30% ) [40.96%]
       236,127,031 iTLB-loads                #  150.275 M/sec                    ( +-  1.63% ) [41.43%]
            46,173 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.40% ) [41.11%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.82%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.37%]

       0.301414024 seconds time elapsed                                          ( +-  0.47% )

Prefetch (5x64):
     Total time: 0.172 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1565.797128 task-clock                #    5.238 CPUs utilized            ( +-  0.46% )
            13,845 context-switches          #    0.009 M/sec                    ( +-  4.20% )
             2,624 cpu-migrations            #    0.002 M/sec                    ( +-  2.72% )
            75,452 page-faults               #    0.048 M/sec                    ( +-  0.08% )
     1,642,106,355 cycles                    #    1.049 GHz                      ( +-  1.33% ) [40.17%]
       107,786,666 stalled-cycles-frontend   #    6.56% frontend cycles idle     ( +-  1.37% ) [39.90%]
     1,065,286,880 stalled-cycles-backend    #   64.87% backend  cycles idle     ( +-  1.59% ) [39.14%]
       888,815,001 instructions              #    0.54  insns per cycle        
                                             #    1.20  stalled cycles per insn  ( +-  1.29% ) [38.92%]
       163,106,907 branches                  #  104.169 M/sec                    ( +-  1.32% ) [38.93%]
         4,333,456 branch-misses             #    2.66% of all branches          ( +-  1.94% ) [39.77%]
       459,779,806 L1-dcache-loads           #  293.639 M/sec                    ( +-  1.60% ) [40.23%]
         8,827,680 L1-dcache-load-misses     #    1.92% of all L1-dcache hits    ( +-  1.77% ) [41.38%]
        15,556,816 LLC-loads                 #    9.935 M/sec                    ( +-  1.76% ) [41.16%]
         4,885,618 LLC-load-misses           #   31.40% of all LL-cache hits     ( +-  1.40% ) [40.84%]
       236,131,778 L1-icache-loads           #  150.806 M/sec                    ( +-  1.32% ) [40.59%]
         3,037,537 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.23% ) [41.13%]
       454,835,028 dTLB-loads                #  290.481 M/sec                    ( +-  1.23% ) [41.34%]
           139,907 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.18% ) [41.21%]
       236,357,655 iTLB-loads                #  150.950 M/sec                    ( +-  1.31% ) [41.29%]
            46,633 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.74% ) [40.67%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.16%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.09%]

       0.298948767 seconds time elapsed                                          ( +-  0.36% )

Here it appears everything between the two runs is about the same.  We reduced
the number of dcache misses by a small amount (0.03 percentage points), which is
nice, but I'm not sure that would account for the speedup we see in the run time.

Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1553.544876 task-clock                #    5.217 CPUs utilized            ( +-  0.42% )
            14,066 context-switches          #    0.009 M/sec                    ( +-  6.24% )
             2,831 cpu-migrations            #    0.002 M/sec                    ( +-  3.33% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.08% )
     1,659,509,743 cycles                    #    1.068 GHz                      ( +-  1.27% ) [40.10%]
       106,466,680 stalled-cycles-frontend   #    6.42% frontend cycles idle     ( +-  1.50% ) [39.98%]
     1,035,481,957 stalled-cycles-backend    #   62.40% backend  cycles idle     ( +-  1.23% ) [39.38%]
       875,104,201 instructions              #    0.53  insns per cycle        
                                             #    1.18  stalled cycles per insn  ( +-  1.30% ) [38.66%]
       160,553,275 branches                  #  103.346 M/sec                    ( +-  1.32% ) [38.85%]
         4,329,119 branch-misses             #    2.70% of all branches          ( +-  1.39% ) [39.59%]
       448,195,116 L1-dcache-loads           #  288.498 M/sec                    ( +-  1.91% ) [41.07%]
         8,632,347 L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  1.90% ) [41.56%]
        15,143,145 LLC-loads                 #    9.747 M/sec                    ( +-  1.89% ) [41.05%]
         4,698,204 LLC-load-misses           #   31.03% of all LL-cache hits     ( +-  1.03% ) [41.23%]
       224,316,468 L1-icache-loads           #  144.390 M/sec                    ( +-  1.27% ) [41.39%]
         2,902,842 L1-icache-load-misses     #    1.29% of all L1-icache hits    ( +-  2.65% ) [42.60%]
       433,914,588 dTLB-loads                #  279.306 M/sec                    ( +-  1.75% ) [43.07%]
           132,090 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  2.15% ) [43.12%]
       230,701,361 iTLB-loads                #  148.500 M/sec                    ( +-  1.77% ) [43.47%]
            45,562 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  3.76% ) [42.88%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [42.29%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [41.32%]

       0.297758185 seconds time elapsed                                          ( +-  0.40% )

Here it seems the major advantage was backend stall cycles saved (which makes
sense to me).  Since we split the instruction path into two units that could run
independently of each other, we spent less time waiting for prior instructions to
retire.  As a result we dropped two percentage points in our stall number.
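
In C terms, the split looks roughly like the sketch below (a simplified
illustration of the two-chain idea, not the actual code in the patch, and it
assumes an even number of 8-byte words):

static u64 csum_two_chains(const u64 *buf, int words)
{
        u64 sum0 = 0, sum1 = 0;
        int i;

        /* two independent accumulators, each with its own emulated
         * end-around carry, so neither chain waits on the other's result
         * (or, in the asm version, on a shared carry flag) */
        for (i = 0; i < words; i += 2) {
                sum0 += buf[i];
                sum0 += (sum0 < buf[i]);        /* carry of chain 0 */
                sum1 += buf[i + 1];
                sum1 += (sum1 < buf[i + 1]);    /* carry of chain 1 */
        }

        /* fold the two chains together, keeping the final carry */
        sum0 += sum1;
        sum0 += (sum0 < sum1);
        return sum0;
}

Everything above the final fold can, in principle, issue on two ALUs at once;
the fold is the only point where the chains meet.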

Prefetch + Parallel ALU:
     Total time: 0.182 [sec]

 Performance counter stats for 'perf bench sched messaging -- bash -c echo 1 > /sys/module/csum_test/parameters/test_fire' (20 runs):

       1549.171283 task-clock                #    5.231 CPUs utilized            ( +-  0.50% )
            13,717 context-switches          #    0.009 M/sec                    ( +-  4.32% )
             2,721 cpu-migrations            #    0.002 M/sec                    ( +-  2.47% )
            75,432 page-faults               #    0.049 M/sec                    ( +-  0.07% )
     1,579,140,244 cycles                    #    1.019 GHz                      ( +-  1.71% ) [40.06%]
       103,803,034 stalled-cycles-frontend   #    6.57% frontend cycles idle     ( +-  1.74% ) [39.60%]
     1,016,582,613 stalled-cycles-backend    #   64.38% backend  cycles idle     ( +-  1.79% ) [39.57%]
       881,036,653 instructions              #    0.56  insns per cycle        
                                             #    1.15  stalled cycles per insn  ( +-  1.61% ) [39.29%]
       164,333,010 branches                  #  106.078 M/sec                    ( +-  1.51% ) [39.38%]
         4,385,459 branch-misses             #    2.67% of all branches          ( +-  1.62% ) [40.29%]
       463,987,526 L1-dcache-loads           #  299.507 M/sec                    ( +-  1.52% ) [40.20%]
         8,739,535 L1-dcache-load-misses     #    1.88% of all L1-dcache hits    ( +-  1.95% ) [40.37%]
        15,318,497 LLC-loads                 #    9.888 M/sec                    ( +-  1.80% ) [40.43%]
         4,846,148 LLC-load-misses           #   31.64% of all LL-cache hits     ( +-  1.68% ) [40.59%]
       231,982,874 L1-icache-loads           #  149.746 M/sec                    ( +-  1.43% ) [41.25%]
         3,141,106 L1-icache-load-misses     #    1.35% of all L1-icache hits    ( +-  2.32% ) [41.76%]
       459,688,615 dTLB-loads                #  296.732 M/sec                    ( +-  1.75% ) [41.87%]
           138,667 dTLB-load-misses          #    0.03% of all dTLB cache hits   ( +-  1.97% ) [42.31%]
       235,629,204 iTLB-loads                #  152.100 M/sec                    ( +-  1.40% ) [42.04%]
            46,038 iTLB-load-misses          #    0.02% of all iTLB cache hits   ( +-  2.75% ) [41.20%]
                 0 L1-dcache-prefetches      #    0.000 K/sec                   [40.77%]
                 0 L1-dcache-prefetch-misses #    0.000 K/sec                   [40.27%]

       0.296173305 seconds time elapsed                                          ( +-  0.44% )
Here, with both optimizations, we've reduced both our backend stall cycles and
our dcache miss rate (though our load misses here are higher than they were when
we were just doing parallel ALU execution).  I wonder if the separation of the adcx
path is leading to multiple load requests before the prefetch completes.  I'll
try messing with the stride a bit more to see if I can get some more insight
there.
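
For reference, the stride I'm talking about is just the lookahead distance in
a loop shaped roughly like this (illustrative only; csum_64byte_block() is a
made-up stand-in for the real unrolled 64-byte body):

static u64 csum_with_prefetch(const u8 *buff, int len, int stride, u64 sum)
{
        while (len >= 64) {
                /* only prefetch if the target is still inside the buffer */
                if (stride < len)
                        prefetch(buff + stride);
                sum = csum_64byte_block(buff, sum);     /* hypothetical helper */
                buff += 64;
                len -= 64;
        }
        return sum;     /* tail handling omitted in this sketch */
}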

So there you have it.  I think, looking at this, I can say that it's not as big a
win as my initial measurements were indicating, but still a win.

Thoughts?

Regards
Neil

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

#define BUFSIZ (2*1024*1024)
#define NBPAGES 16

extern int csum_mode;
int module_test_mode = 0;
int test_fire = 0;

static int __init csum_init_module(void)
{
        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}

static int set_param_str(const char *val, const struct kernel_param *kp)
{
        int i;
        __wsum sum = 0;
        /*u64 start, end;*/
        void *base, *addrs[NBPAGES];
        u32 rnd, offset;

	
        memset(addrs, 0, sizeof(addrs));
        for (i = 0; i < NBPAGES; i++) {
                addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
                if (!addrs[i])
                        goto out;
        }

	csum_mode = module_test_mode;	/* select which do_csum() variant runs: base/prefetch/parallel alu/both */

	local_bh_disable();
        /*pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());*/
        /*start = ktime_to_ns(ktime_get());*/
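        /* 100000 iterations: each picks a random buffer and a random, even,
         * in-bounds offset, then runs csum_partial() over a 1500 byte window */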

        for (i = 0; i < 100000; i++) {
                rnd = prandom_u32();
                base = addrs[rnd % NBPAGES];
                rnd /= NBPAGES;
                offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial(base + offset, 1500, sum);
        }
        /*end = ktime_to_ns(ktime_get());*/
        local_bh_enable();

	/*pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);*/

	csum_mode = 0;
out:
        for (i = 0; i < NBPAGES; i++)
                kfree(addrs[i]);

        return 0;
}

static int get_param_str(char *buffer, const struct kernel_param *kp)
{
	return sprintf(buffer, "%d\n", test_fire);
}

static struct kernel_param_ops param_ops_str = {
	.set = set_param_str,
	.get = get_param_str,
};

module_param_named(module_test_mode, module_test_mode, int, 0644);
MODULE_PARM_DESC(module_test_mode, "csum test mode");
module_param_cb(test_fire, &param_ops_str, &test_fire, 0644);
module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-27  7:26                               ` Ingo Molnar
@ 2013-10-27 17:05                                 ` Neil Horman
  0 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-27 17:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sun, Oct 27, 2013 at 08:26:32AM +0100, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > > You keep ignoring my request to calculate and account for noise of the 
> > > measurement.
> > 
> > Don't confuse "ignoring" with "haven't gotten there yet".  [...]
> 
> So, instead of replying to my repeated feedback with a single line mail 
> that you plan to address it, you repeated the same measurement mistakes 
> again and again, posting invalid results, and forced me to spend time to 
> repeat this same argument 2-3 times?
> 
No one forced you to do anything, Ingo.  I was finishing a valid line of
discussion with Eric prior to addressing your questions, while handling several
other unrelated issues (and related issues with my test system) that cropped up
in parallel.

> > [...] Sometimes we all have to wait, Ingo. [...]
> 
> I'm making bog standard technical requests to which you've not replied in 
> substance, there's no need for the patronizing tone really.
> 
No one said they weren't easy to do, Ingo; I said I was getting to your request.
And now I am.  I'll be running the tests tomorrow.

> Anyway, to simplify the workflow I'm NAK-ing it all until it's done 
> convincingly.
> 
>   NAKed-by: Ingo Molnar <mingo@kernel.org>
> 
> I'll lift the NAK once my technical concerns and questions are resolved.
> 
Ok, if that helps you wait.  I'll have your test results in the next day or so.

Thanks
Neil

> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-26 13:58                             ` Neil Horman
@ 2013-10-27  7:26                               ` Ingo Molnar
  2013-10-27 17:05                                 ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-27  7:26 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> > You keep ignoring my request to calculate and account for noise of the 
> > measurement.
> 
> Don't confuse "ignoring" with "haven't gotten there yet".  [...]

So, instead of replying to my repeated feedback with a single line mail 
that you plan to address it, you repeated the same measurement mistakes 
again and again, posting invalid results, and forced me to spend time to 
repeat this same argument 2-3 times?

> [...] Sometimes we all have to wait, Ingo. [...]

I'm making bog standard technical requests to which you've not replied in 
substance, there's no need for the patronizing tone really.

Anyway, to simplify the workflow I'm NAK-ing it all until it's done 
convincingly.

  NAKed-by: Ingo Molnar <mingo@kernel.org>

I'll lift the NAK once my technical concerns and questions are resolved.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-26 12:01                           ` Ingo Molnar
@ 2013-10-26 13:58                             ` Neil Horman
  2013-10-27  7:26                               ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-26 13:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sat, Oct 26, 2013 at 02:01:08PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > > 
> > > > 
> > > > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > > > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > > anywhere.
> > > 
> > > This csum_partial_opt() was a private implementation of csum_partial()
> > > so that I could load the module without rebooting the kernel ;)
> > > 
> > > > 
> > > > base results:
> > > > 53569916
> > > > 43506025
> > > > 43476542
> > > > 44048436
> > > > 45048042
> > > > 48550429
> > > > 53925556
> > > > 53927374
> > > > 53489708
> > > > 53003915
> > > > 
> > > > AVG = 492 ns
> > > > 
> > > > prefetching only:
> > > > 53279213
> > > > 45518140
> > > > 49585388
> > > > 53176179
> > > > 44071822
> > > > 43588822
> > > > 44086546
> > > > 47507065
> > > > 53646812
> > > > 54469118
> > > > 
> > > > AVG = 488 ns
> > > > 
> > > > 
> > > > parallel alu's only:
> > > > 46226844
> > > > 44458101
> > > > 46803498
> > > > 45060002
> > > > 46187624
> > > > 37542946
> > > > 45632866
> > > > 46275249
> > > > 45031141
> > > > 46281204
> > > > 
> > > > AVG = 449 ns
> > > > 
> > > > 
> > > > both optimizations:
> > > > 45708837
> > > > 45631124
> > > > 45697135
> > > > 45647011
> > > > 45036679
> > > > 39418544
> > > > 44481577
> > > > 46820868
> > > > 44496471
> > > > 35523928
> > > > 
> > > > AVG = 438 ns
> > > > 
> > > > 
> > > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > > > the best savings with both optimizations (54 ns, or 10.9%).  
> > > > 
> > > > These results, while they've changed as we've modified the test case slightly
> > > > have remained consistent in their speedup ordinality.  Prefetching helps, but
> > > > not as much as using multiple alu's, and neither is as good as doing both
> > > > together.
> > > > 
> > > > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > > > do both.
> > > > 
> > > 
> > > Well, I only said (or maybe I forgot), that on my machines, I got no
> > > improvements at all with the multiple alu or the prefetch. (I tried
> > > different strides)
> > > 
> > > Only noises in the results.
> > > 
> > I thought you previously said that running netperf gave you a statistically
> > significant performance boost when you added prefetching:
> > http://marc.info/?l=linux-kernel&m=138178914124863&w=2
> > 
> > But perhaps I missed a note somewhere.
> > 
> > > It seems it depends on cpus and/or multiple factors.
> > > 
> > > Last machine I used for the tests had :
> > > 
> > > processor	: 23
> > > vendor_id	: GenuineIntel
> > > cpu family	: 6
> > > model		: 44
> > > model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> > > stepping	: 2
> > > microcode	: 0x13
> > > cpu MHz		: 2800.256
> > > cache size	: 12288 KB
> > > physical id	: 1
> > > siblings	: 12
> > > core id		: 10
> > > cpu cores	: 6
> > > 
> > > 
> > > 
> > > 
> > 
> > Thats about what I'm running with:
> > processor       : 0
> > vendor_id       : GenuineIntel
> > cpu family      : 6
> > model           : 44
> > model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
> > stepping        : 2
> > microcode       : 0x13
> > cpu MHz         : 1600.000
> > cache size      : 12288 KB
> > physical id     : 0
> > siblings        : 8
> > core id         : 0
> > cpu cores       : 4
> > 
> > 
> > I can't imagine what would cause the discrepancy in our results (a 
> > 10% savings in execution time seems significant to me). My only 
> > thought would be that possibly the alu's on your cpu are faster 
> > than mine, and reduce the speedup obtained by performing operations 
> > in parallel, though I can't imagine that's the case with these 
> > processors being so closely matched.
> 
> You keep ignoring my request to calculate and account for noise of 
> the measurement.
> 
Don't confuse "ignoring" with "haven't gotten there yet".  Sometimes we all have
to wait, Ingo.  I'm working on it now, but I hit a snag on the machine I'm
working with and am trying to figure it out now.

> For example you are talking about a 0.8% prefetch effect while the 
> noise in the results is obviously much larger than that, with a 
> min/max distance of around 5%:
> 
> > > > 43476542
> > > > 53927374
> 
> so the noise of 10 measurements would be around 5-10%. (back of the 
> envelope calculation)
> 
> So you might be right in the end, but the posted data does not 
> support your claims, statistically.
> 
> It's your responsibility to come up with convincing measurements and 
> results, not of those who review your work.
> 
Be patient, I'm getting there

Thanks
Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 20:19                         ` Neil Horman
@ 2013-10-26 12:01                           ` Ingo Molnar
  2013-10-26 13:58                             ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-26 12:01 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> > 
> > > 
> > > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > > anywhere.
> > 
> > This csum_partial_opt() was a private implementation of csum_partial()
> > so that I could load the module without rebooting the kernel ;)
> > 
> > > 
> > > base results:
> > > 53569916
> > > 43506025
> > > 43476542
> > > 44048436
> > > 45048042
> > > 48550429
> > > 53925556
> > > 53927374
> > > 53489708
> > > 53003915
> > > 
> > > AVG = 492 ns
> > > 
> > > prefetching only:
> > > 53279213
> > > 45518140
> > > 49585388
> > > 53176179
> > > 44071822
> > > 43588822
> > > 44086546
> > > 47507065
> > > 53646812
> > > 54469118
> > > 
> > > AVG = 488 ns
> > > 
> > > 
> > > parallel alu's only:
> > > 46226844
> > > 44458101
> > > 46803498
> > > 45060002
> > > 46187624
> > > 37542946
> > > 45632866
> > > 46275249
> > > 45031141
> > > 46281204
> > > 
> > > AVG = 449 ns
> > > 
> > > 
> > > both optimizations:
> > > 45708837
> > > 45631124
> > > 45697135
> > > 45647011
> > > 45036679
> > > 39418544
> > > 44481577
> > > 46820868
> > > 44496471
> > > 35523928
> > > 
> > > AVG = 438 ns
> > > 
> > > 
> > > We continue to see a small savings in execution time with prefetching (4 ns, or
> > > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > > the best savings with both optimizations (54 ns, or 10.9%).  
> > > 
> > > These results, while they've changed as we've modified the test case slightly
> > > have remained consistent in their speedup ordinality.  Prefetching helps, but
> > > not as much as using multiple alu's, and neither is as good as doing both
> > > together.
> > > 
> > > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > > do both.
> > > 
> > 
> > Well, I only said (or maybe I forgot), that on my machines, I got no
> > improvements at all with the multiple alu or the prefetch. (I tried
> > different strides)
> > 
> > Only noises in the results.
> > 
> I thought you previously said that running netperf gave you a statistically
> significant performance boost when you added prefetching:
> http://marc.info/?l=linux-kernel&m=138178914124863&w=2
> 
> But perhaps I missed a note somewhere.
> 
> > It seems it depends on cpus and/or multiple factors.
> > 
> > Last machine I used for the tests had :
> > 
> > processor	: 23
> > vendor_id	: GenuineIntel
> > cpu family	: 6
> > model		: 44
> > model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> > stepping	: 2
> > microcode	: 0x13
> > cpu MHz		: 2800.256
> > cache size	: 12288 KB
> > physical id	: 1
> > siblings	: 12
> > core id		: 10
> > cpu cores	: 6
> > 
> > 
> > 
> > 
> 
> Thats about what I'm running with:
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 44
> model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
> stepping        : 2
> microcode       : 0x13
> cpu MHz         : 1600.000
> cache size      : 12288 KB
> physical id     : 0
> siblings        : 8
> core id         : 0
> cpu cores       : 4
> 
> 
> I can't imagine what would cause the discrepancy in our results (a 
> 10% savings in execution time seems significant to me). My only 
> thought would be that possibly the alu's on your cpu are faster 
> than mine, and reduce the speedup obtained by performing operations 
> in parallel, though I can't imagine that's the case with these 
> processors being so closely matched.

You keep ignoring my request to calculate and account for noise of 
the measurement.

For example you are talking about a 0.8% prefetch effect while the 
noise in the results is obviously much larger than that, with a 
min/max distance of around 5%:

> > > 43476542
> > > 53927374

so the noise of 10 measurements would be around 5-10%. (back of the 
envelope calculation)

So you might be right in the end, but the posted data does not 
support your claims, statistically.

It's your responsibility to come up with convincing measurements and 
results, not of those who review your work.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 17:54   ` Doug Ledford
@ 2013-10-26 11:55     ` Ingo Molnar
  2013-10-28 17:02       ` Doug Ledford
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-26 11:55 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Eric Dumazet, Neil Horman, linux-kernel


* Doug Ledford <dledford@redhat.com> wrote:

> > What I was objecting to strongly here was to measure the _wrong_ 
> > thing, i.e. the cache-hot case. The cache-cold case should be 
> > measured in a low noise fashion, so that results are 
> > representative. It's closer to the real usecase than any other 
> > microbenchmark. That will give us a usable speedup figure and 
> > will tell us which technique helped how much and which parameter 
> > should be how large.
> 
> Cold cache, yes.  Low noise, yes.  But you need DMA traffic at the 
> same time to be truly representative.

Well, but in most usecases network DMA traffic is an order of 
magnitude smaller than system bus capacity. 100 gigabit network 
traffic is possible but not very common.

So I'd say that _if_ prefetching helps in the typical case we should 
tune it for that - not for the bus-contended case...

> >> [...]  This distance should be far enough out that it can 
> >> withstand other memory pressure, yet not so far as to 
> >> constantly be prefetching, tossing the result out of cache due 
> >> to pressure, then fetching/stalling that same memory on load.  
> >> And it may not benchmark as well on a quiescent system running 
> >> only the micro-benchmark, but it should end up performing 
> >> better in actual real world usage.
> > 
> > The 'fully adversarial' case where all resources are maximally 
> > competed for by all other cores is actually pretty rare in 
> > practice. I don't say it does not happen or that it does not 
> > matter, but I do say there are many other important usecases as 
> > well.
> > 
> > More importantly, the 'maximally adversarial' case is very hard 
> > to generate, validate, and it's highly system dependent!
> 
> This I agree with 100%, which is why I tend to think we should 
> scrap the static prefetch optimizations entirely and have a boot 
> up test that allows us to find our optimum prefetch distance for 
> our given hardware.

Would be interesting to see.

I'm a bit sceptical - I think 'looking 1-2 cachelines in advance' is 
something that might work reasonably well on a wide range of 
systems, while trying to find a bus capacity/latency dependent sweet 
spot would be difficult.

We had pretty bad experience from boot-time measurements, and it's 
not for lack of trying: I implemented the raid algorithm 
benchmarking thing and also the scheduler's boot time cache-size 
probing, both were problematic and have hurt reproducability and 
debuggability.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 17:09     ` H. Peter Anvin
@ 2013-10-25 13:06       ` Neil Horman
  0 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-25 13:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote:
> If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.
> 
Ok, fair enough.  Unfortunately, I'm not going to be able to get my hands on a
stepping of this CPU to test any code using these instructions for some time, so
I'll back burner their use and revisit them later.  I'm still working on the
parallel alu/prefetch angle though.

Neil

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 19:44                       ` Eric Dumazet
@ 2013-10-21 20:19                         ` Neil Horman
  2013-10-26 12:01                           ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-21 20:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
> 
> > 
> > Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> > such that no interrupts (save for local ones), would occur on that cpu.  Note
> > that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> > doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> > anywhere.
> 
> This csum_partial_opt() was a private implementation of csum_partial()
> so that I could load the module without rebooting the kernel ;)
> 
> > 
> > base results:
> > 53569916
> > 43506025
> > 43476542
> > 44048436
> > 45048042
> > 48550429
> > 53925556
> > 53927374
> > 53489708
> > 53003915
> > 
> > AVG = 492 ns
> > 
> > prefetching only:
> > 53279213
> > 45518140
> > 49585388
> > 53176179
> > 44071822
> > 43588822
> > 44086546
> > 47507065
> > 53646812
> > 54469118
> > 
> > AVG = 488 ns
> > 
> > 
> > parallel alu's only:
> > 46226844
> > 44458101
> > 46803498
> > 45060002
> > 46187624
> > 37542946
> > 45632866
> > 46275249
> > 45031141
> > 46281204
> > 
> > AVG = 449 ns
> > 
> > 
> > both optimizations:
> > 45708837
> > 45631124
> > 45697135
> > 45647011
> > 45036679
> > 39418544
> > 44481577
> > 46820868
> > 44496471
> > 35523928
> > 
> > AVG = 438 ns
> > 
> > 
> > We continue to see a small savings in execution time with prefetching (4 ns, or
> > about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> > the best savings with both optimizations (54 ns, or 10.9%).  
> > 
> > These results, while they've changed as we've modified the test case slightly,
> > have remained consistent in their speedup ordinality.  Prefetching helps, but
> > not as much as using multiple alu's, and neither is as good as doing both
> > together.
> > 
> > Unless you see something else that I'm doing wrong here.  It seems like a win to
> > do both.
> > 
> 
> Well, I only said (or maybe I forgot), that on my machines, I got no
> improvements at all with the multiple alu or the prefetch. (I tried
> different strides)
> 
> Only noises in the results.
> 
I thought you previously said that running netperf gave you a statistically
significant performance boost when you added prefetching:
http://marc.info/?l=linux-kernel&m=138178914124863&w=2

But perhaps I missed a note somewhere.

> It seems it depends on cpus and/or multiple factors.
> 
> Last machine I used for the tests had :
> 
> processor	: 23
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 44
> model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
> stepping	: 2
> microcode	: 0x13
> cpu MHz		: 2800.256
> cache size	: 12288 KB
> physical id	: 1
> siblings	: 12
> core id		: 10
> cpu cores	: 6
> 
> 
> 
> 

That's about what I'm running with:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 44
model name      : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping        : 2
microcode       : 0x13
cpu MHz         : 1600.000
cache size      : 12288 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4


I can't imagine what would cause the discrepancy in our results (a 10% savings
in execution time seems significant to me). My only thought would be that
possibly the alu's on your cpu are faster than mine, and reduce the speedup
obtained by performing operations in parallel, though I can't imagine that's the
case with these processors being so closely matched.

Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 19:21                     ` Neil Horman
@ 2013-10-21 19:44                       ` Eric Dumazet
  2013-10-21 20:19                         ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-21 19:44 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:

> 
> Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
> such that no interrupts (save for local ones), would occur on that cpu.  Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
> doesn't exist in my tree, nor do I see it in any upstream tree or in the history
> anywhere.

This csum_partial_opt() was a private implementation of csum_partial()
so that I could load the module without rebooting the kernel ;)

> 
> base results:
> 53569916
> 43506025
> 43476542
> 44048436
> 45048042
> 48550429
> 53925556
> 53927374
> 53489708
> 53003915
> 
> AVG = 492 ns
> 
> prefetching only:
> 53279213
> 45518140
> 49585388
> 53176179
> 44071822
> 43588822
> 44086546
> 47507065
> 53646812
> 54469118
> 
> AVG = 488 ns
> 
> 
> parallel alu's only:
> 46226844
> 44458101
> 46803498
> 45060002
> 46187624
> 37542946
> 45632866
> 46275249
> 45031141
> 46281204
> 
> AVG = 449 ns
> 
> 
> both optimizations:
> 45708837
> 45631124
> 45697135
> 45647011
> 45036679
> 39418544
> 44481577
> 46820868
> 44496471
> 35523928
> 
> AVG = 438 ns
> 
> 
> We continue to see a small savings in execution time with prefetching (4 ns, or
> about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
> the best savings with both optimizations (54 ns, or 10.9%).  
> 
> These results, while they've changed as we've modified the test case slightly,
> have remained consistent in their speedup ordinality.  Prefetching helps, but
> not as much as using multiple alu's, and neither is as good as doing both
> together.
> 
> Unless you see something else that I'm doing wrong here.  It seems like a win to
> do both.
> 

Well, I only said (or maybe I forgot), that on my machines, I got no
improvements at all with the multiple alu or the prefetch. (I tried
different strides)

Only noises in the results.

It seems it depends on cpus and/or multiple factors.

Last machine I used for the tests had :

processor	: 23
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz
stepping	: 2
microcode	: 0x13
cpu MHz		: 2800.256
cache size	: 12288 KB
physical id	: 1
siblings	: 12
core id		: 10
cpu cores	: 6




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 21:15                   ` Eric Dumazet
  2013-10-20 21:29                     ` Neil Horman
@ 2013-10-21 19:21                     ` Neil Horman
  2013-10-21 19:44                       ` Eric Dumazet
  1 sibling, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-21 19:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> 
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > 	int i;
> > 	__wsum sum = 0;
> > 	struct timespec start, end;
> > 	u64 time;
> > 	struct page *page;
> > 	u32 offset = 0;
> > 
> > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> 
> Not sure what you are doing here, but its not correct.
> 
> You have a lot of variations in your results, I suspect a NUMA affinity
> problem.
> 
> You can try the following code, and use taskset to make sure you run
> this on a cpu on node 0
> 
> #define BUFSIZ 2*1024*1024
> #define NBPAGES 16
> 
> static int __init csum_init_module(void)
> {
>         int i;
>         __wsum sum = 0;
>         u64 start, end;
> 	void *base, *addrs[NBPAGES];
>         u32 rnd, offset;
> 
> 	memset(addrs, 0, sizeof(addrs));
> 	for (i = 0; i < NBPAGES; i++) {
> 		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
> 		if (!addrs[i])
> 			goto out;
> 	}
> 
>         local_bh_disable();
>         pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
>         start = ktime_to_ns(ktime_get());
>         
>         for (i = 0; i < 100000; i++) {
> 		rnd = prandom_u32();
> 		base = addrs[rnd % NBPAGES];
> 		rnd /= NBPAGES;
> 		offset = rnd % (BUFSIZ - 1500);
>                 offset &= ~1U;
>                 sum = csum_partial_opt(base + offset, 1500, sum);
>         }
>         end = ktime_to_ns(ktime_get());
>         local_bh_enable();
> 
>         pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);
> 
> out:
> 	for (i = 0; i < NBPAGES; i++)
> 		kfree(addrs[i]);
> 
>         return 0;
> }
> 
> static void __exit csum_cleanup_module(void)
> {
>         return;
> }
> 
> 
> 
> 


Ok, so I ran the above code on a single cpu using taskset, and set irq affinity
such that no interrupts (save for local ones), would occur on that cpu.  Note
that I had to convert csum_partial_opt to csum_partial, as the _opt variant
doesn't exist in my tree, nor do I see it in any upstream tree or in the history
anywhere.

base results:
53569916
43506025
43476542
44048436
45048042
48550429
53925556
53927374
53489708
53003915

AVG = 492 ns

prefetching only:
53279213
45518140
49585388
53176179
44071822
43588822
44086546
47507065
53646812
54469118

AVG = 488 ns


parallel alu's only:
46226844
44458101
46803498
45060002
46187624
37542946
45632866
46275249
45031141
46281204

AVG = 449 ns


both optimizations:
45708837
45631124
45697135
45647011
45036679
39418544
44481577
46820868
44496471
35523928

AVG = 438 ns


We continue to see a small savings in execution time with prefetching (4 ns, or
about 0.8%), a better savings with parallel alu execution (43 ns, or 8.7%), and
the best savings with both optimizations (54 ns, or 10.9%).  

These results, while they've changed as we've modified the test case slightly,
have remained consistent in their speedup ordinality.  Prefetching helps, but
not as much as using multiple alu's, and neither is as good as doing both
together.

Unless you see something else that I'm doing wrong here.  It seems like a win to
do both.

Regards
Neil




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-19  8:23 ` Ingo Molnar
@ 2013-10-21 17:54   ` Doug Ledford
  2013-10-26 11:55     ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Doug Ledford @ 2013-10-21 17:54 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Eric Dumazet, Neil Horman, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6563 bytes --]

On 10/19/2013 04:23 AM, Ingo Molnar wrote:
> 
> * Doug Ledford <dledford@redhat.com> wrote:
>> All prefetch operations get sent to an access queue in the memory 
>> controller where they compete with both other reads and writes for the 
>> available memory bandwidth.  The optimal prefetch window is not a factor 
>> of memory bandwidth and latency, it's a factor of memory bandwidth, 
>> memory latency, current memory access queue depth at time prefetch is 
>> issued, and memory bank switch time * number of queued memory operations 
>> that will require a bank switch.  In other words, it's much more complex 
>> and also much more fluid than any static optimization can pull out. 
>> [...]
> 
> But this is generally true of _any_ static operation - CPUs are complex, 
> workloads are complex, other threads, CPUs, sockets, devices might 
> interact, etc.
> 
> Yet it does not make it invalid to optimize for the isolated, static 
> usecase that was offered, because 'dynamism' and parallelism in a real 
> system will rarely make that optimization completely invalid, it will 
> typically only diminish its fruits to a certain degree (for example by 
> causing prefetches to be discarded).

So, prefetches are a bit of a special beast in that, if they are done
incorrectly, they can actually make the overall system slower than if we
didn't do anything at all.  If you are talking about anything other than
prefetch I would agree with you.  With prefetch, as much as possible,
you need to mimic the environment for which you are optimizing.  Neil's
test kernel module just called csum_partial on a bunch of memory pages.
 The actual usage pattern of csum_partial though is that it will be used
while, most likely, there is ongoing DMA of data packets across the
network interface.  It is very unlikely that we could care less about
the case of optimizing csum_partial for no network activity since
csum_partial is always going to be the result of network activity and
unlikely to happen in isolation.  As such, my suggestion about a kernel
compile was to create activity across the PCI bus to hard drives,
mimicking network interface DMA traffic.  You could also run a netperf
instance instead.  I just don't agree with optimizing it without
simultaneous DMA traffic, as that particular case is likely to be a
rarity, not the norm.

> What I was objecting to strongly here was to measure the _wrong_ thing, 
> i.e. the cache-hot case. The cache-cold case should be measured in a low 
> noise fashion, so that results are representative. It's closer to the real 
> usecase than any other microbenchmark. That will give us a usable speedup 
> figure and will tell us which technique helped how much and which 
> parameter should be how large.

Cold cache, yes.  Low noise, yes.  But you need DMA traffic at the same
time to be truly representative.

>> [...]  So every time I see someone run a series of micro- benchmarks 
>> like you just did, where the system was only doing the micro- benchmark 
>> and not a real workload, and we draw conclusions about optimal prefetch 
>> distances from that test, I cringe inside and I think I even die... just 
>> a little.
> 
> So the thing is, microbenchmarks can indeed be misleading - and as in this 
> case the cache-hot claims can be outright dangerously misleading.

So can non-DMA cases.

> But yet, if done correctly and interpreted correctly they tell us a little 
> bit of the truth and are often correlated to real performance.
> 
> Do microbenchmarks show us everything that a 'real' workload exhibits? Not 
> at all, they are way too simple for that. They are a shortcut, an 
> indicator, which is often helpful as long as not taken as 'the' 
> performance of the system.
> 
>> A better test for this, IMO, would be to start a local kernel compile 
>> with at least twice as many gcc instances allowed as you have CPUs, 
>> *then* run your benchmark kernel module and see what prefetch distance 
>> works well. [...]
> 
> I don't agree that this represents our optimization target. It may 
> represent _one_ optimization target. But many other important usecases 
> such as a dedicated file server, or a computation node that is 
> cache-optimized, would be unlikely to show such high parallel memory pressure 
> as a GCC compilation.

But they will *all* show network DMA load, not quiescent DMA load.

>> [...]  This distance should be far enough out that it can withstand 
>> other memory pressure, yet not so far as to constantly be prefetching, 
>> tossing the result out of cache due to pressure, then fetching/stalling 
>> that same memory on load.  And it may not benchmark as well on a 
>> quiescent system running only the micro-benchmark, but it should end up 
>> performing better in actual real world usage.
> 
> The 'fully adversarial' case where all resources are maximally competed 
> for by all other cores is actually pretty rare in practice. I don't say it 
> does not happen or that it does not matter, but I do say there are many 
> other important usecases as well.
> 
> More importantly, the 'maximally adversarial' case is very hard to 
> generate, validate, and it's highly system dependent!

This I agree with 100%, which is why I tend to think we should scrap the
static prefetch optimizations entirely and have a boot up test that
allows us to find our optimum prefetch distance for our given hardware.
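
Something like this is the rough shape I have in mind -- purely a sketch, with
made-up names (a hypothetical csum_partial_dist() that takes the prefetch
distance, in cache lines, as an extra argument, and csum_prefetch_distance as
the global the real csum_partial() would read); nothing of the sort exists
today:

#include <linux/ktime.h>
#include <linux/types.h>

extern __wsum csum_partial_dist(const void *buff, int len, __wsum sum,
				int dist);
extern int csum_prefetch_distance;

static void __init probe_csum_prefetch_distance(const char *buf, size_t buflen)
{
	u64 t0, t1, best = ~0ULL;
	int dist, best_dist = 0, i;
	__wsum sum = 0;

	/* time a fixed csum workload at each candidate distance, keep the best */
	for (dist = 0; dist <= 8; dist++) {
		t0 = ktime_to_ns(ktime_get());
		for (i = 0; i < 10000; i++)
			sum = csum_partial_dist(buf + (i * 1500) % (buflen - 1500),
						1500, sum, dist);
		t1 = ktime_to_ns(ktime_get());
		if (t1 - t0 < best) {
			best = t1 - t0;
			best_dist = dist;
		}
	}
	csum_prefetch_distance = best_dist;
}

It would still need to run with some representative DMA load going on, per the
above, but at least the distance would be chosen on the actual hardware rather
than hard-coded at compile time.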

> Cache-cold (and cache hot) microbenchmarks on the other hand tend to be 
> more stable, because they typically reflect current physical (mostly 
> latency) limits of CPU and system technology, _not_ highly system 
> dependent resource sizing (mostly bandwidth) limits which are very hard to 
> optimize for in a generic fashion.
> 
> Cache-cold and cache-hot measurements are, in a way, important physical 
> 'eigenvalues' of a complex system. If they both show speedups then it's 
> likely that a more dynamic, contended for, mixed workload will show 
> speedups as well. And these 'eigenvalues' are statistically much more 
> stable across systems, and that's something we care for when we implement 
> various lowlevel assembly routines in arch/x86/ which cover many different 
> systems with different bandwidth characteristics.
> 
> I hope I managed to explain my views clearly enough on this.
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD
	      http://people.redhat.com/dledford



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 901 bytes --]

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-21 17:31                       ` Eric Dumazet
@ 2013-10-21 17:46                         ` Neil Horman
  0 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-21 17:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote:
> On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > > 
> > > > #define BUFSIZ_ORDER 4
> > > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > > static int __init csum_init_module(void)
> > > > {
> > > > 	int i;
> > > > 	__wsum sum = 0;
> > > > 	struct timespec start, end;
> > > > 	u64 time;
> > > > 	struct page *page;
> > > > 	u32 offset = 0;
> > > > 
> > > > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> > > 
> > > Not sure what you are doing here, but its not correct.
> > > 
> > Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.
> 
> Not really. We cannot allocate 64 Mbytes in a single alloc_pages() call
> on x86. (MAX_ORDER = 11)
> 
> You noticed nothing because you did not 
> write anything on the 64Mbytes area (and corrupt memory) or
> use CONFIG_DEBUG_PAGEALLOC=y.
> 
> Your code read data out of bounds and was lucky, thats all...
> 
> You in fact allocated a page of (4096<<4) bytes
> 
Gahh!  I see what I did: I treated the order in the alloc_pages call as an
order of hugepages, but it still allocates that order in normally sized pages,
which I then treated as huge.  Stupid of me...

I'll have results on your version of the test case in just a bit here
Neil

> 
> 
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-20 21:29                     ` Neil Horman
@ 2013-10-21 17:31                       ` Eric Dumazet
  2013-10-21 17:46                         ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-21 17:31 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > 
> > > #define BUFSIZ_ORDER 4
> > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > static int __init csum_init_module(void)
> > > {
> > > 	int i;
> > > 	__wsum sum = 0;
> > > 	struct timespec start, end;
> > > 	u64 time;
> > > 	struct page *page;
> > > 	u32 offset = 0;
> > > 
> > > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> > 
> > Not sure what you are doing here, but its not correct.
> > 
> Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.

Not really. We cannot allocate 64 Mbytes in a single alloc_pages() call
on x86. (MAX_ORDER = 11)

You noticed nothing because you did not 
write anything on the 64Mbytes area (and corrupt memory) or
use CONFIG_DEBUG_PAGEALLOC=y.

Your code read data out of bounds and was lucky, thats all...

You in fact allocated a page of (4096<<4) bytes
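
Put differently (illustration only, plugging in the numbers from your module,
and assuming 4K pages):

#define BUFSIZ_ORDER	4
#define ALLOCATED	(PAGE_SIZE << BUFSIZ_ORDER)		/* 4096 << 4 = 64 KB  */
#define BUFSIZ		((2 << BUFSIZ_ORDER) * (1024*1024*2))	/* 32 * 2 MB = 64 MB  */

so roughly 1000x more memory was being indexed than was actually allocated.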




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 21:15                   ` Eric Dumazet
@ 2013-10-20 21:29                     ` Neil Horman
  2013-10-21 17:31                       ` Eric Dumazet
  2013-10-21 19:21                     ` Neil Horman
  1 sibling, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-20 21:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> 
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > 	int i;
> > 	__wsum sum = 0;
> > 	struct timespec start, end;
> > 	u64 time;
> > 	struct page *page;
> > 	u32 offset = 0;
> > 
> > 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
> 
> Not sure what you are doing here, but its not correct.
> 
Why not?  You asked for a test with 32 hugepages, so I allocated 32 hugepages.

> You have a lot of variations in your results, I suspect a NUMA affinity
> problem.
> 
I do have some variation, you're correct, but I don't think it's a NUMA issue

> You can try the following code, and use taskset to make sure you run
> this on a cpu on node 0
> 
I did run this with taskset to do exactly that (hence my comment above).  I'll
be glad to run your variant on Monday morning though and provide results.


Best
Neil


> #define BUFSIZ 2*1024*1024
> #define NBPAGES 16
> 
> static int __init csum_init_module(void)
> {
>         int i;
>         __wsum sum = 0;
>         u64 start, end;
> 	void *base, *addrs[NBPAGES];
>         u32 rnd, offset;
> 
> 	memset(addrs, 0, sizeof(addrs));
> 	for (i = 0; i < NBPAGES; i++) {
> 		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
> 		if (!addrs[i])
> 			goto out;
> 	}
> 
>         local_bh_disable();
>         pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
>         start = ktime_to_ns(ktime_get());
>         
>         for (i = 0; i < 100000; i++) {
> 		rnd = prandom_u32();
> 		base = addrs[rnd % NBPAGES];
> 		rnd /= NBPAGES;
> 		offset = rnd % (BUFSIZ - 1500);
>                 offset &= ~1U;
>                 sum = csum_partial_opt(base + offset, 1500, sum);
>         }
>         end = ktime_to_ns(ktime_get());
>         local_bh_enable();
> 
>         pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);
> 
> out:
> 	for (i = 0; i < NBPAGES; i++)
> 		kfree(addrs[i]);
> 
>         return 0;
> }
> 
> static void __exit csum_cleanup_module(void)
> {
>         return;
> }
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 17:42 Doug Ledford
@ 2013-10-19  8:23 ` Ingo Molnar
  2013-10-21 17:54   ` Doug Ledford
  0 siblings, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-19  8:23 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Eric Dumazet, Neil Horman, linux-kernel


* Doug Ledford <dledford@redhat.com> wrote:

> >> Based on these, prefetching is obviously a good improvement, but 
> >> not as good as parallel execution, and the winner by far is doing 
> >> both.
> 
> OK, this is where I have to chime in that these tests can *not* be used 
> to say anything about prefetch, and not just for the reasons Ingo lists 
> in his various emails to this thread.  In fact I would argue that Ingo's 
> methodology on this is wrong as well.

Well, I didn't go into as many details as you - but I agree with your full 
list obviously:

> All prefetch operations get sent to an access queue in the memory 
> controller where they compete with both other reads and writes for the 
> available memory bandwidth.  The optimal prefetch window is not a factor 
> of memory bandwidth and latency, it's a factor of memory bandwidth, 
> memory latency, current memory access queue depth at time prefetch is 
> issued, and memory bank switch time * number of queued memory operations 
> that will require a bank switch.  In other words, it's much more complex 
> and also much more fluid than any static optimization can pull out. 
> [...]

But this is generally true of _any_ static operation - CPUs are complex, 
workloads are complex, other threads, CPUs, sockets, devices might 
interact, etc.

Yet it does not make it invalid to optimize for the isolated, static 
usecase that was offered, because 'dynamism' and parallelism in a real 
system will rarely make that optimization completely invalid, it will 
typically only diminish its fruits to a certain degree (for example by 
causing prefetches to be discarded).

What I was objecting to strongly here was to measure the _wrong_ thing, 
i.e. the cache-hot case. The cache-cold case should be measured in a low 
noise fashion, so that results are representative. It's closer to the real 
usecase than any other microbenchmark. That will give us a usable speedup 
figure and will tell us which technique helped how much and which 
parameter should be how large.

> [...]  So every time I see someone run a series of micro- benchmarks 
> like you just did, where the system was only doing the micro- benchmark 
> and not a real workload, and we draw conclusions about optimal prefetch 
> distances from that test, I cringe inside and I think I even die... just 
> a little.

So the thing is, microbenchmarks can indeed be misleading - and as in this 
case the cache-hot claims can be outright dangerously misleading.

But yet, if done correctly and interpreted correctly they tell us a little 
bit of the truth and are often correlated to real performance.

Do microbenchmarks show us everything that a 'real' workload exhibits? Not 
at all, they are way too simple for that. They are a shortcut, an 
indicator, which is often helpful as long as not taken as 'the' 
performance of the system.

> A better test for this, IMO, would be to start a local kernel compile 
> with at least twice as many gcc instances allowed as you have CPUs, 
> *then* run your benchmark kernel module and see what prefetch distance 
> works well. [...]

I don't agree that this represents our optimization target. It may 
represent _one_ optimization target. But many other important usecases 
such as a dedicated file server, or a computation node that is 
cache-optimized, would be unlikely to show such high parallel memory pressure 
as a GCC compilation.

> [...]  This distance should be far enough out that it can withstand 
> other memory pressure, yet not so far as to constantly be prefetching, 
> tossing the result out of cache due to pressure, then fetching/stalling 
> that same memory on load.  And it may not benchmark as well on a 
> quiescent system running only the micro-benchmark, but it should end up 
> performing better in actual real world usage.

The 'fully adversarial' case where all resources are maximally competed 
for by all other cores is actually pretty rare in practice. I don't say it 
does not happen or that it does not matter, but I do say there are many 
other important usecases as well.

More importantly, the 'maximally adversarial' case is very hard to 
generate, validate, and it's highly system dependent!

Cache-cold (and cache hot) microbenchmarks on the other hand tend to be 
more stable, because they typically reflect current physical (mostly 
latency) limits of CPU and system technology, _not_ highly system 
dependent resource sizing (mostly bandwidth) limits which are very hard to 
optimize for in a generic fashion.

Cache-cold and cache-hot measurements are, in a way, important physical 
'eigenvalues' of a complex system. If they both show speedups then it's 
likely that a more dynamic, contended for, mixed workload will show 
speedups as well. And these 'eigenvalues' are statistically much more 
stable across systems, and that's something we care for when we implement 
various lowlevel assembly routines in arch/x86/ which cover many different 
systems with different bandwidth characteristics.

I hope I managed to explain my views clearly enough on this.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 20:11                 ` Neil Horman
@ 2013-10-18 21:15                   ` Eric Dumazet
  2013-10-20 21:29                     ` Neil Horman
  2013-10-21 19:21                     ` Neil Horman
  0 siblings, 2 replies; 107+ messages in thread
From: Eric Dumazet @ 2013-10-18 21:15 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:

> #define BUFSIZ_ORDER 4
> #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> static int __init csum_init_module(void)
> {
> 	int i;
> 	__wsum sum = 0;
> 	struct timespec start, end;
> 	u64 time;
> 	struct page *page;
> 	u32 offset = 0;
> 
> 	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);

Not sure what you are doing here, but its not correct.

You have a lot of variations in your results, I suspect a NUMA affinity
problem.

You can try the following code, and use taskset to make sure you run
this on a cpu on node 0

#define BUFSIZ 2*1024*1024
#define NBPAGES 16

static int __init csum_init_module(void)
{
        int i;
        __wsum sum = 0;
        u64 start, end;
	void *base, *addrs[NBPAGES];
        u32 rnd, offset;

	memset(addrs, 0, sizeof(addrs));
	for (i = 0; i < NBPAGES; i++) {
		addrs[i] = kmalloc_node(BUFSIZ, GFP_KERNEL, 0);
		if (!addrs[i])
			goto out;
	}

        local_bh_disable();
        pr_err("STARTING ITERATIONS on cpu %d\n", smp_processor_id());
        start = ktime_to_ns(ktime_get());
        
        for (i = 0; i < 100000; i++) {
		rnd = prandom_u32();
		base = addrs[rnd % NBPAGES];
		rnd /= NBPAGES;
		offset = rnd % (BUFSIZ - 1500);
                offset &= ~1U;
                sum = csum_partial_opt(base + offset, 1500, sum);
        }
        end = ktime_to_ns(ktime_get());
        local_bh_enable();

        pr_err("COMPLETED 100000 iterations of csum %x in %llu nanosec\n", sum, end - start);

out:
	for (i = 0; i < NBPAGES; i++)
		kfree(addrs[i]);

        return 0;
}

static void __exit csum_cleanup_module(void)
{
        return;
}




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 17:20               ` Eric Dumazet
@ 2013-10-18 20:11                 ` Neil Horman
  2013-10-18 21:15                   ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-18 20:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > > 
> 
> > 	for(i=0;i<100000;i++) {
> > 		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> > 		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
> > 	}
> 
> Please replace this by random accesses, and use the more standard 1500
> length.
> 
> offset = prandom_u32() % (BUFSIZ - 1500);
> offset &= ~1U;
> 
> sum = csum_partial(buf + offset, 1500, sum);
> 
> You are basically doing sequential accesses, so prefetch should
> be automatically done by cpu itself.
> 
> Thanks !
> 
> 
> 

Sure, you got it!  Results below.  However, they continue to bear out that
parallel execution beats prefetch-only execution, and doing both is better than
either one alone.

base results:
53156647
59670931
62839770
44842780
39297190
44905905
53300688
53287805
39436951
43021730

AVG=493 ns

prefetch-only results:
40337434
51986404
43509199
53128857
52973171
53520649
53536338
50325466
44864664
47908398

AVG=492 ns


parallel-only results:
52157183
44496511
36180011
38298368
36258099
43263531
45365519
54116344
62529241
63118224

AVG = 475 ns


both prefetch and parallel:
44317078
44526464
45761272
44477906
34868814
44637904
49478309
49718417
58681403
58304972

AVG = 474 ns


Heres the code I was using



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

#define BUFSIZ_ORDER 4
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;
	struct page *page;
	u32 offset = 0;

	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
	if (!page) {
		printk(KERN_CRIT "NO MEMORY FOR ALLOCATION");
		return -ENOMEM;
	}
	buf = page_address(page); 

	
	printk(KERN_CRIT "INITALIZING BUFFER\n");

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);
	
	for(i=0;i<100000;i++) {
		sum = csum_partial(buf+offset, 1500, sum);
		offset = prandom_u32() % (BUFSIZ - 1500);
		offset &= ~1U;
	}
	getnstimeofday(&end);
	preempt_enable();
	if ((unsigned long)start.tv_nsec > (unsigned long)end.tv_nsec)
		time = (ULONG_MAX - (unsigned long)end.tv_nsec) + (unsigned long)start.tv_nsec;
	else 
		time = (unsigned long)end.tv_nsec - (unsigned long)start.tv_nsec;

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	__free_pages(page, BUFSIZ_ORDER);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-18 17:42 Doug Ledford
  2013-10-19  8:23 ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Doug Ledford @ 2013-10-18 17:42 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Eric Dumazet, Doug Ledford, Neil Horman, linux-kernel

On 2013-10-17, Ingo wrote:
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
>> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
>> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
>> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>> > > 
>> > > > So, early testing results today.  I wrote a test module that, allocated a 4k
>> > > > buffer, initalized it with random data, and called csum_partial on it 100000
>> > > > times, recording the time at the start and end of that loop.  Results on a 2.4
>> > > > GHz Intel Xeon processor:
>> > > > 
>> > > > Without patch: Average execute time for csum_partial was 808 ns
>> > > > With patch: Average execute time for csum_partial was 438 ns
>> > > 
>> > > Impressive, but could you try again with data out of cache ?
>> > 
>> > So I tried your patch on a GRE tunnel and got following results on a
>> > single TCP flow. (short result : no visible difference)

[ to Eric ]

You didn't show profile data from before and after the patch, only after.  And it
showed csum_partial at 19.9% IIRC.  That's a much better than I get on my test
machines (even though this is on a rhel6.5-beta kernel, understand that the entire
IB stack in rhel6.5-beta is up to a 3.10 level, with parts closer to 3.11+):

For IPoIB in connected mode, where there is no rx csum offload:

::::::::::::::
rhel6.5-beta-cm-no-offload-oprofile-run1
::::::::::::::
CPU: Intel Architectural Perfmon, speed 3392.17 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
Samples on CPU 0
Samples on CPU 4
Samples on CPU 6 (edited out as it was only a few samples and ruined
                  line wrapping)
samples  %        samples  %        image name               symbol name
98588    59.1431  215      57.9515  vmlinux    csum_partial_copy_generic
3003      1.8015  8         2.1563  vmlinux                  tcp_sendmsg
2219      1.3312  0              0  vmlinux            irq_entries_start
2076      1.2454  4         1.0782  vmlinux         avc_has_perm_noaudit
1815      1.0888  0              0  mlx4_ib.ko           mlx4_ib_poll_cq

So, here anyway, it's 60%.  At that level of showing, there is a lot more to be
gained from an improvement to that function.  And here's the measured performance
from those runs:

[root@rdma-master rhel6.5-beta-client]# more rhel6.5-beta-cm-no-offload-netperf.output 
Recv   Send    Send                          Utilization
Socket Socket  Message  Elapsed              Send     Recv
Size   Size    Size     Time     Throughput  local    remote
bytes  bytes   bytes    secs.    MBytes  /s  % S      % S
 87380  16384  16384    20.00      2815.29   7.92     12.80  
 87380  16384  16384    20.00      2798.22   7.88     12.87  
 87380  16384  16384    20.00      2786.74   7.79     12.84

The test machine has 8 logical CPUs, so 12.5% is 100% of a single CPU.  That
said, the receive side is obviously the bottleneck here, and 60% of that
bottleneck is csum_partial.

[ snip a bunch of Neil's measurements ]

>> Based on these, prefetching is obviously a good improvement, but not 
>> as good as parallel execution, and the winner by far is doing both.

OK, this is where I have to chime in that these tests can *not* be used
to say anything about prefetch, and not just for the reasons Ingo lists
in his various emails to this thread.  In fact I would argue that Ingo's
methodology on this is wrong as well.

All prefetch operations get sent to an access queue in the memory controller
where they compete with both other reads and writes for the available memory
bandwidth.  The optimal prefetch window is not a factor of memory bandwidth
and latency, it's a factor of memory bandwidth, memory latency, current memory
access queue depth at time prefetch is issued, and memory bank switch time *
number of queued memory operations that will require a bank switch.  In other
words, it's much more complex and also much more fluid than any static
optimization can pull out.  So every time I see someone run a series of micro-
benchmarks like you just did, where the system was only doing the micro-
benchmark and not a real workload, and we draw conclusions about optimal
prefetch distances from that test, I cringe inside and I think I even die...
just a little.

A better test for this, IMO, would be to start a local kernel compile with at
least twice as many gcc instances allowed as you have CPUs, *then* run your
benchmark kernel module and see what prefetch distance works well.  This
distance should be far enough out that it can withstand other memory pressure,
yet not so far as to constantly be prefetching, tossing the result out of cache
due to pressure, then fetching/stalling that same memory on load.  And it may
not benchmark as well on a quiescent system running only the micro-benchmark,
but it should end up performing better in actual real world usage.

> Also, it would be nice to see standard deviation noise numbers when two 
> averages are close to each other, to be able to tell whether differences 
> are statistically significant or not.
> 
> For example 'perf stat --repeat' will output stddev for you:
> 
>   comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'
> 
>    Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):
> 
>        0.189084480 seconds time elapsed                                          ( +- 11.95% )

[ snip perf usage tips ]

I ran my original tests with oprofile.  I'll rerun the last one plus some new
tests with the various incarnations of this patch using perf and report the
results back here.

However, the machines I ran these tests on were limited by a 40GBit/s line
speed, with a theoretical max of 4GBytes/s due to bit encoding on the wire,
and I think limited even a bit lower by the theoretical limit of useful data
across a PCI-e gen2 x8 bus.  So I wouldn't expect the throughput to go
much higher even if this helps, it should mainly reduce CPU usage.  I can
try the same tests on a 56GBit/s link and with cards that have PCI-e
gen3 and see how those machines do by comparison (the hosts are identical,
just the cards are different).

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 16:50             ` Neil Horman
@ 2013-10-18 17:20               ` Eric Dumazet
  2013-10-18 20:11                 ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-18 17:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > 

> 	for(i=0;i<100000;i++) {
> 		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> 		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
> 	}

Please replace this by random accesses, and use the more standard 1500
length.

offset = prandom_u32() % (BUFSIZ - 1500);
offset &= ~1U;

sum = csum_partial(buf + offset, 1500, sum);

You are basically doing sequential accesses, so prefetch should
be automatically done by cpu itself.

Thanks !



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-18 16:42   ` Neil Horman
@ 2013-10-18 17:09     ` H. Peter Anvin
  2013-10-25 13:06       ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: H. Peter Anvin @ 2013-10-18 17:09 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

If implemented properly adcx/adox should give additional speedup... that is the whole reason for their existence.

Neil Horman <nhorman@tuxdriver.com> wrote:
>On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
>> On 10/11/2013 09:51 AM, Neil Horman wrote:
>> > Sébastien Dugué reported to me that devices implementing ipoib
>(which don't have
>> > checksum offload hardware) were spending a significant amount of
>time computing
>> > checksums.  We found that by splitting the checksum computation
>into two
>> > separate streams, each skipping successive elements of the buffer
>being summed,
>> > we could parallelize the checksum operation across multiple alus. 
>Since neither
>> > chain is dependent on the result of the other, we get a speedup in
>execution (on
>> > hardware that has multiple alu's available, which is almost
>ubiquitous on x86),
>> > and only a negligible decrease on hardware that has only a single
>alu (an extra
>addition is introduced).  Since addition is commutative, the result
>is the same,
>> > only faster
>> 
>> On hardware that implement ADCX/ADOX then you should also be able to
>> have additional streams interleaved since those instructions allow
>for
>> dual carry chains.
>> 
>> 	-hpa
>> 
>I've been looking into this a bit more, and I'm a bit confused. 
>According to
>this:
>http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html
>
>by my read, this pair of instructions simply supports 2 carry bit
>chains,
>allowing for two parallel execution paths through the cpu that won't
>block on
>one another.  It's exactly the same as what's being done with the
>universally
>available adcq instruction, so there's no real speedup (that I can
>see).  Since
>we'd either have to use the alternatives macro to support adcx/adox
>here or the
>old instruction set, it seems not overly worth the effort to support
>the
>extension.  
>
>Or am I missing something?
>
>Neil

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  1:42           ` Eric Dumazet
@ 2013-10-18 16:50             ` Neil Horman
  2013-10-18 17:20               ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-18 16:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

> 
> Your benchmark uses a single 4K page, so data is _super_ hot in cpu
> caches.
> ( prefetch should give no speedups, I am surprised it makes any
> difference)
> 
> Try now with 32 huges pages, to get 64 MBytes of working set.
> 
> Because in reality we never csum_partial() data in cpu cache.
> (Unless the NIC preloaded the data into cpu cache before sending the
> interrupt)
> 
> Really, if Sebastien got a speed up, it means that something fishy was
> going on, like :
> 
> - A copy of data into some area of memory, prefilling cpu caches
> - csum_partial() done while data is hot in cache.
> 
> This is exactly a "should not happen" scenario, because the csum in this
> case should happen _while_ doing the copy, for 0 ns.
> 
> 
> 
> 


So, I took your suggestion, and modified my test module to allocate 32 huge
pages instead of a single 4k page.  I've attached the module changes and the
results below.  Contrary to your assertion above, results came out the same as
in my first run.  See below:

base results:
80381491
85279536
99537729
80398029
121385411
109478429
85369632
99242786
80250395
98170542

AVG=939 ns

prefetch only results:
86803812
101891541
85762713
95866956
102316712
93529111
90473728
79374183
93744053
90075501

AVG=919 ns

parallel only results:
68994797
63503221
64298412
63784256
75350022
66398821
77776050
79158271
91006098
67822318

AVG=718 ns

both prefetch and parallel results:
68852213
77536525
63963560
67255913
76169867
80418081
63485088
62386262
75533808
57731705

AVG=693 ns


So based on these, it seems that your assertion that prefetching is the key to
speedup here isn't quite correct.  Either that or the testing continues to be
invalid.  I'm going to try to do some of Ingo's microbenchmarking just to see if
that provides any further details.  But any other thoughts about what might be
going awry are appreciated.

My module code:



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>

static char *buf;

#define BUFSIZ_ORDER 4
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;
	struct page *page;
	u32 offset = 0;

	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
	if (!page) {
		printk(KERN_CRIT "NO MEMORY FOR ALLOCATION");
		return -ENOMEM;
	}
	buf = page_address(page); 

	
	printk(KERN_CRIT "INITALIZING BUFFER\n");

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);
	
	for(i=0;i<100000;i++) {
		sum = csum_partial(buf+offset, PAGE_SIZE, sum);
		offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE  : 0;
	}
	getnstimeofday(&end);
	preempt_enable();
	if ((unsigned long)start.tv_nsec > (unsigned long)end.tv_nsec)
		time = (ULONG_MAX - (unsigned long)end.tv_nsec) + (unsigned long)start.tv_nsec;
	else 
		time = (unsigned long)end.tv_nsec - (unsigned long)start.tv_nsec;

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	__free_pages(page, BUFSIZ_ORDER);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 22:29 ` H. Peter Anvin
  2013-10-13 12:53   ` Neil Horman
@ 2013-10-18 16:42   ` Neil Horman
  2013-10-18 17:09     ` H. Peter Anvin
  1 sibling, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-18 16:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware) were spending a significant amount of time computing
> > checksums.  We found that by splitting the checksum computation into two
> > separate streams, each skipping successive elements of the buffer being summed,
> > we could parallelize the checksum operation across multiple alus.  Since neither
> > chain is dependent on the result of the other, we get a speedup in execution (on
> > hardware that has multiple alu's available, which is almost ubiquitous on x86),
> > and only a negligible decrease on hardware that has only a single alu (an extra
> > addition is introduced).  Since addition is commutative, the result is the same,
> > only faster
> 
> On hardware that implement ADCX/ADOX then you should also be able to
> have additional streams interleaved since those instructions allow for
> dual carry chains.
> 
> 	-hpa
> 
I've been looking into this a bit more, and I'm a bit confused.  According to
this:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html

by my read, this pair of instructions simply supports 2 carry bit chains,
allowing for two parallel execution paths through the cpu that won't block on
one another.  It's exactly the same as what's being done with the universally
available adcq instruction, so there's no real speedup (that I can see).  Since
we'd either have to use the alternatives macro to support adcx/adox here or the
old instruction set, it seems not overly worth the effort to support the
extension.  

Or am I missing something?

Neil
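
PS: for reference, the shape of the idea being discussed, in plain C -- a
sketch only, with a made-up name, ignoring alignment and the odd-length tail
handling the real inline-asm code has:

#include <linux/compiler.h>
#include <linux/types.h>

static __wsum csum_two_streams(const void *buff, int len)
{
	const u32 *p = buff;
	u64 a = 0, b = 0;
	int i, n = len / 4;

	/* two independent dependency chains over alternating 32-bit words */
	for (i = 0; i + 1 < n; i += 2) {
		a += p[i];
		b += p[i + 1];
	}
	if (i < n)
		a += p[i];
	a += b;
	/* fold 64 -> 32 bits with end-around carry, twice to absorb the carry-out */
	a = (a & 0xffffffff) + (a >> 32);
	a = (a & 0xffffffff) + (a >> 32);
	return (__force __wsum)a;
}

Since neither accumulator depends on the other, the two chains of adds can
retire on different ALUs; the single fold at the end makes the result identical
to the one-chain version.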

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-18 15:46 Doug Ledford
  0 siblings, 0 replies; 107+ messages in thread
From: Doug Ledford @ 2013-10-18 15:46 UTC (permalink / raw)
  To: Joe Perches; +Cc: Ingo Molnar, Eric Dumazet, linux-kernel

On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
>> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
>> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
>> > > attached patch brings much better results
>> > > 
>> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
>> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
>> > > Recv   Send    Send                          Utilization       Service Demand
>> > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
>> > > Size   Size    Size     Time     Throughput  local    remote   local   remote
>> > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>> > > 
>> > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
>> > > 
>> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>> > []
>> > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>> > >  			zero = 0;
>> > >  			count64 = count >> 3;
>> > >  			while (count64) { 
>> > > -				asm("addq 0*8(%[src]),%[res]\n\t"
>> > > +				asm("prefetch 5*64(%[src])\n\t"
>> > 
>> > Might the prefetch size be too big here?
>> 
>> To be effective, you need to prefetch well ahead of time.
> 
> No doubt.
> 
>> 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> 
> 5 cachelines for some processors seems like a lot.
> 
> Given you've got a test rig, maybe you could experiment
> with 2 and increase it until it doesn't get better.

You have a fundamental misunderstanding of the prefetch operation.  The 5*64
in the above asm statement does not mean a size, it is an index, with %[src]
as the base pointer.  So it is saying to go to address %[src] + 5*64 and
prefetch there.  The prefetch size itself is always a cache line.  Once the
address is known, whatever cacheline holds that address is the cacheline we
will prefetch.  Your size concerns have no meaning.
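
Or, said in C terms (illustrative only, using the kernel's prefetch() helper):

#include <linux/prefetch.h>

static inline void prefetch_example(const unsigned char *src)
{
	/* "prefetch 5*64(%[src])" in the asm is the same thing as this:
	 * start pulling in the cache line that sits 5*64 = 320 bytes ahead
	 * of src.  One cache line is fetched regardless of the offset used
	 * to address it. */
	prefetch(src + 5 * 64);
}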


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17 18:19             ` H. Peter Anvin
  2013-10-17 18:48               ` Eric Dumazet
@ 2013-10-18  6:43               ` Ingo Molnar
  1 sibling, 0 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-18  6:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Neil Horman, Eric Dumazet, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, x86, netdev


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> > 
> > To correctly simulate the workload you'd have to:
> > 
> >  - allocate a buffer larger than your L2 cache.
> > 
> >  - to measure the effects of the prefetches you'd also have to randomize
> >    the individual buffer positions. See how 'perf bench numa' implements a
> >    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
> >    Otherwise the CPU might learn your simplistic stream direction and the
> >    L2 cache might hw-prefetch your data, interfering with any explicit 
> >    prefetches the code does. In many real-life usecases packet buffers are
> >    scattered.
> > 
> > Also, it would be nice to see standard deviation noise numbers when two 
> > averages are close to each other, to be able to tell whether differences 
> > are statistically significant or not.
> 
> 
> Seriously, though, how much does it matter?  All the above seems likely 
> to do is to drown the signal by adding noise.

I think it matters a lot and I don't think it 'adds' noise - it measures 
something else (cache cold behavior - which is the common case for 
first-time csum_partial() use for network packets), which was not measured 
before, and that that is by its nature has different noise patterns.

I've done many cache-cold measurements myself and had no trouble achieving 
statistically significant results and high precision.

> If the parallel (threaded) checksumming is faster, which theory says it 
> should and microbenchmarking confirms, how important are the 
> macrobenchmarks?

Microbenchmarks can be totally blind to things like the ideal prefetch 
window size. (or whether a prefetch should be done at all: some CPUs will 
throw away prefetches if enough regular fetches arrive.)

Also, 'naive' single-threaded algorithms can occasionally be better in the 
cache-cold case because a linear, predictable stream of memory accesses 
might saturate the memory bus better than a somewhat random looking, 
interleaved web of accesses that might not harmonize with buffer depths.

I _think_ if correctly tuned then the parallel algorithm should be better 
in the cache cold case, I just don't know with what parameters (and the 
algorithm has at least one free parameter: the prefetch window size), and 
I don't know how significant the effect is.

Also, more fundamentally, I absolutely detest doing no measurements or 
measuring the wrong thing - IMHO there are too many 'blind' optimization 
commits in the kernel with little to no observational data attached.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17 18:19             ` H. Peter Anvin
@ 2013-10-17 18:48               ` Eric Dumazet
  2013-10-18  6:43               ` Ingo Molnar
  1 sibling, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2013-10-17 18:48 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, x86, netdev

On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote:

> Seriously, though, how much does it matter?  All the above seems likely
> to do is to drown the signal by adding noise.

I don't think so.

> 
> If the parallel (threaded) checksumming is faster, which theory says it
> should and microbenchmarking confirms, how important are the
> macrobenchmarks?

Seriously, micro benchmarks are very misleading.

I spent time on this patch, and found no changes on real workloads.

I was excited first, then disappointed.

I hope we will find the real issue, as I really don't care about micro
benchmarks.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  8:41           ` Ingo Molnar
@ 2013-10-17 18:19             ` H. Peter Anvin
  2013-10-17 18:48               ` Eric Dumazet
  2013-10-18  6:43               ` Ingo Molnar
  2013-10-28 16:01             ` Neil Horman
  1 sibling, 2 replies; 107+ messages in thread
From: H. Peter Anvin @ 2013-10-17 18:19 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, x86, netdev

On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> 
> To correctly simulate the workload you'd have to:
> 
>  - allocate a buffer larger than your L2 cache.
> 
>  - to measure the effects of the prefetches you'd also have to randomize
>    the individual buffer positions. See how 'perf bench numa' implements a
>    random walk via --data_rand_walk, in tools/perf/bench/numa.c.
>    Otherwise the CPU might learn your simplistic stream direction and the
>    L2 cache might hw-prefetch your data, interfering with any explicit 
>    prefetches the code does. In many real-life usecases packet buffers are
>    scattered.
> 
> Also, it would be nice to see standard deviation noise numbers when two 
> averages are close to each other, to be able to tell whether differences 
> are statistically significant or not.
> 

Seriously, though, how much does it matter?  All the above seems likely
to do is to drown the signal by adding noise.

If the parallel (threaded) checksumming is faster, which theory says it
should and microbenchmarking confirms, how important are the
macrobenchmarks?

	-hpa



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  0:34         ` Neil Horman
  2013-10-17  1:42           ` Eric Dumazet
@ 2013-10-17  8:41           ` Ingo Molnar
  2013-10-17 18:19             ` H. Peter Anvin
  2013-10-28 16:01             ` Neil Horman
  1 sibling, 2 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-17  8:41 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, netdev


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > > 
> > > > So, early testing results today.  I wrote a test module that allocated a 4k
> > > > buffer, initialized it with random data, and called csum_partial on it 100000
> > > > times, recording the time at the start and end of that loop.  Results on a 2.4
> > > > GHz Intel Xeon processor:
> > > > 
> > > > Without patch: Average execute time for csum_partial was 808 ns
> > > > With patch: Average execute time for csum_partial was 438 ns
> > > 
> > > Impressive, but could you try again with data out of cache ?
> > 
> > So I tried your patch on a GRE tunnel and got following results on a
> > single TCP flow. (short result : no visible difference)
> > 
> > 
> 
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing across at the moment with
> these devices).  So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together, operating under the
> assumption that it would give me accurate, relatively jitter-free results (I've
> attached the module code for reference below).  My results show slightly
> different behavior:
> 
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
> 
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
> 
> 
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
> 
> 
> For reference, each of the above large numbers is the number of 
> nanoseconds taken to compute the checksum of a 4kb buffer 100000 times.  
> To get my average results, I ran the test in a loop 10 times, averaged 
> them, and divided by 100000.
> 
> Based on these, prefetching is obviously a good improvement, but not 
> as good as parallel execution, and the winner by far is doing both.

But in the actual usecase mentioned the packet data was likely cache-cold, 
it just arrived in the NIC and an IRQ got sent. Your testcase uses a 
super-hot 4K buffer that fits into the L1 cache. So it's apples to 
oranges.

To correctly simulate the workload you'd have to:

 - allocate a buffer larger than your L2 cache.

 - to measure the effects of the prefetches you'd also have to randomize
   the individual buffer positions. See how 'perf bench numa' implements a
   random walk via --data_rand_walk, in tools/perf/bench/numa.c.
   Otherwise the CPU might learn your simplistic stream direction and the
   L2 cache might hw-prefetch your data, interfering with any explicit 
   prefetches the code does. In many real-life usecases packet buffers are
   scattered.
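
To make that concrete, here is a rough (untested, userspace-only) sketch 
of such a cache-cold measurement loop; the 64 MB size and the names are 
arbitrary, and do_csum() stands for whatever routine is under test:

#include <stdlib.h>
#include <string.h>

#define WSET	(64UL << 20)		/* well beyond a typical L2 */
#define BLOCK	4096UL
#define NBLK	(WSET / BLOCK)

unsigned do_csum(const unsigned char *buff, unsigned len);	/* routine under test */

static unsigned bench_cold(void)
{
	unsigned char *buf = malloc(WSET);
	unsigned long *order = malloc(NBLK * sizeof(*order));
	unsigned long i, j, tmp;
	unsigned sum = 0;

	memset(buf, 0xa5, WSET);

	/* shuffle the block order so the hw prefetcher cannot learn a stride */
	for (i = 0; i < NBLK; i++)
		order[i] = i;
	for (i = NBLK - 1; i > 0; i--) {
		j = rand() % (i + 1);
		tmp = order[i]; order[i] = order[j]; order[j] = tmp;
	}

	for (i = 0; i < NBLK; i++)
		sum += do_csum(buf + order[i] * BLOCK, BLOCK);

	free(order);
	free(buf);
	return sum;
}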

Also, it would be nice to see standard deviation noise numbers when two 
averages are close to each other, to be able to tell whether differences 
are statistically significant or not.

For example 'perf stat --repeat' will output stddev for you:

  comet:~/tip> perf stat --repeat 20 --null bash -c 'usleep $((RANDOM*10))'

   Performance counter stats for 'bash -c usleep $((RANDOM*10))' (20 runs):

       0.189084480 seconds time elapsed                                          ( +- 11.95% )

The last '+-' percentage is the noise of the measurement.

Also note that you can inspect many cache behavior details of your 
algorithm via perf stat - the -ddd option will give you a laundry list:

  aldebaran:~> perf stat --repeat 20 -ddd perf bench sched messaging
  ...

     Total time: 0.095 [sec]

 Performance counter stats for 'perf bench sched messaging' (20 runs):

       1519.128721 task-clock (msec)         #   12.305 CPUs utilized            ( +-  0.34% )
            22,882 context-switches          #    0.015 M/sec                    ( +-  2.84% )
             3,927 cpu-migrations            #    0.003 M/sec                    ( +-  2.74% )
            16,616 page-faults               #    0.011 M/sec                    ( +-  0.17% )
     2,327,978,366 cycles                    #    1.532 GHz                      ( +-  1.61% ) [36.43%]
     1,715,561,189 stalled-cycles-frontend   #   73.69% frontend cycles idle     ( +-  1.76% ) [38.05%]
       715,715,454 stalled-cycles-backend    #   30.74% backend  cycles idle     ( +-  2.25% ) [39.85%]
     1,253,106,346 instructions              #    0.54  insns per cycle        
                                             #    1.37  stalled cycles per insn  ( +-  1.71% ) [49.68%]
       241,181,126 branches                  #  158.763 M/sec                    ( +-  1.43% ) [47.83%]
         4,232,053 branch-misses             #    1.75% of all branches          ( +-  1.23% ) [48.63%]
       431,907,354 L1-dcache-loads           #  284.313 M/sec                    ( +-  1.00% ) [48.37%]
        20,550,528 L1-dcache-load-misses     #    4.76% of all L1-dcache hits    ( +-  0.82% ) [47.61%]
         7,435,847 LLC-loads                 #    4.895 M/sec                    ( +-  0.94% ) [36.11%]
         2,419,201 LLC-load-misses           #   32.53% of all LL-cache hits     ( +-  2.93% ) [ 7.33%]
       448,638,547 L1-icache-loads           #  295.326 M/sec                    ( +-  2.43% ) [21.75%]
        22,066,490 L1-icache-load-misses     #    4.92% of all L1-icache hits    ( +-  2.54% ) [30.66%]
       475,557,948 dTLB-loads                #  313.047 M/sec                    ( +-  1.96% ) [37.96%]
         6,741,523 dTLB-load-misses          #    1.42% of all dTLB cache hits   ( +-  2.38% ) [37.05%]
     1,268,628,660 iTLB-loads                #  835.103 M/sec                    ( +-  1.75% ) [36.45%]
            74,192 iTLB-load-misses          #    0.01% of all iTLB cache hits   ( +-  2.88% ) [36.19%]
         4,466,526 L1-dcache-prefetches      #    2.940 M/sec                    ( +-  1.61% ) [36.17%]
         2,396,311 L1-dcache-prefetch-misses #    1.577 M/sec                    ( +-  1.55% ) [35.71%]

       0.123459566 seconds time elapsed                                          ( +-  0.58% )

There's also a number of prefetch counters that might be useful:

 aldebaran:~> perf list | grep prefetch
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  node-prefetches                                    [Hardware cache event]
  node-prefetch-misses                               [Hardware cache event]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-17  0:34         ` Neil Horman
@ 2013-10-17  1:42           ` Eric Dumazet
  2013-10-18 16:50             ` Neil Horman
  2013-10-17  8:41           ` Ingo Molnar
  1 sibling, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-17  1:42 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote:

> > 
> 
> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing across at the moment with
> these devices).  So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together, operating under the
> assumption that it would give me accurate, relatively jitter-free results (I've
> attached the module code for reference below).  My results show slightly
> different behavior:
> 
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
> 
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
> 
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
> 
> 
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
> 
> 
> For reference, each of the above large numbers is the number of nanoseconds
> taken to compute the checksum of a 4kb buffer 100000 times.  To get my average
> results, I ran the test in a loop 10 times, averaged them, and divided by
> 100000.
> 
> 
> Based on these, prefetching is obviously a good improvement, but not as good
> as parallel execution, and the winner by far is doing both.
> 
> Thoughts?
> 
> Neil
> 


Your benchmark uses a single 4K page, so data is _super_ hot in cpu
caches.
(prefetch should give no speedups, I am surprised it makes any
difference)

Try now with 32 huge pages, to get 64 MBytes of working set.

Because in reality we never csum_partial() data in cpu cache.
(Unless the NIC preloaded the data into cpu cache before sending the
interrupt)
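
Something along these lines in your module would do it (untested sketch;
vmalloc() is just a stand-in for real huge pages, and the constants are
arbitrary):

#include <linux/vmalloc.h>

/* Hypothetical tweak: replace the single 4K buffer with a 64 MB working
 * set and stride over it, so each csum_partial() call sees mostly
 * cache-cold data. */
#define WSET_SIZE	(64UL << 20)
#define WSET_PAGES	(WSET_SIZE / PAGE_SIZE)

	buf = vmalloc(WSET_SIZE);
	if (!buf)
		return -ENOMEM;
	get_random_bytes(buf, WSET_SIZE);

	for (i = 0; i < 100000; i++)
		sum = csum_partial(buf + (i % WSET_PAGES) * PAGE_SIZE,
				   PAGE_SIZE, sum);

	vfree(buf);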

Really, if Sebastien got a speed up, it means that something fishy was
going on, like :

- A copy of data into some area of memory, prefilling cpu caches
- csum_partial() done while data is hot in cache.

This is exactly a "should not happen" scenario, because the csum in this
case should happen _while_ doing the copy, for 0 ns.
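
To illustrate what I mean by csum-during-copy, here is a simplified C
sketch of the idea only - the kernel's real csum_partial_copy_generic()
does this in assembly, and copy_and_csum() is a made-up name:

/* Accumulate the 16-bit one's complement sum while the data is copied,
 * so the checksum costs (almost) nothing on top of the copy that has to
 * happen anyway. */
static unsigned int copy_and_csum(unsigned char *dst,
				  const unsigned char *src,
				  unsigned int len, unsigned int sum)
{
	unsigned long acc = sum;

	while (len >= 2) {
		unsigned short w = *(const unsigned short *)src;

		*(unsigned short *)dst = w;
		acc += w;			/* csum piggybacks on the copy */
		src += 2;
		dst += 2;
		len -= 2;
	}
	if (len) {				/* trailing odd byte */
		*dst = *src;
		acc += *src;
	}
	/* fold the 64-bit accumulator back down to 32 bits */
	acc = (acc & 0xffffffff) + (acc >> 32);
	acc = (acc & 0xffffffff) + (acc >> 32);
	return (unsigned int)acc;
}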




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:18       ` Eric Dumazet
  2013-10-14 22:37         ` Joe Perches
@ 2013-10-17  0:34         ` Neil Horman
  2013-10-17  1:42           ` Eric Dumazet
  2013-10-17  8:41           ` Ingo Molnar
  1 sibling, 2 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-17  0:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > 
> > > So, early testing results today.  I wrote a test module that allocated a 4k
> > > buffer, initialized it with random data, and called csum_partial on it 100000
> > > times, recording the time at the start and end of that loop.  Results on a 2.4
> > > GHz Intel Xeon processor:
> > > 
> > > Without patch: Average execute time for csum_partial was 808 ns
> > > With patch: Average execute time for csum_partial was 438 ns
> > 
> > Impressive, but could you try again with data out of cache ?
> 
> So I tried your patch on a GRE tunnel and got following results on a
> single TCP flow. (short result : no visible difference)
> 
> 

So I went to reproduce these results, but was unable to (due to the fact that I
only have a pretty jittery network to do testing across at the moment with
these devices).  So instead I figured that I would go back to just doing
measurements with the module that I cobbled together, operating under the
assumption that it would give me accurate, relatively jitter-free results (I've
attached the module code for reference below).  My results show slightly
different behavior:

Base results runs:
89417240
85170397
85208407
89422794
91645494
103655144
86063791
75647774
83502921
85847372
AVG = 875 ns

Prefetch only runs:
70962849
77555099
81898170
68249290
72636538
83039294
78561494
83393369
85317556
79570951
AVG = 781 ns

Parallel addition only runs:
42024233
44313064
48304416
64762297
42994259
41811628
55654282
64892958
55125582
42456403
AVG = 510 ns


Both prefetch and parallel addition:
41329930
40689195
61106622
46332422
49398117
52525171
49517101
61311153
43691814
49043084
AVG = 494 ns


For reference, each of the above large numbers is the number of nanoseconds
taken to compute the checksum of a 4kb buffer 100000 times.  To get my average
results, I ran the test in a loop 10 times, averaged them, and divided by
100000.


Based on these, prefetching is obviously a good improvement, but not as good
as parallel execution, and the winner by far is doing both.

Thoughts?

Neil



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/random.h>
#include <linux/time.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/rtnetlink.h>
#include <net/rtnetlink.h>
#include <linux/u64_stats_sync.h>
#include <net/checksum.h>

static char *buf;

static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end;
	u64 time;

	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

	if (!buf) {
		printk(KERN_CRIT "UNABLE TO ALLOCATE A BUFFER OF %lu bytes\n", PAGE_SIZE);
		return -ENOMEM;
	}

	printk(KERN_CRIT "INITALIZING BUFFER\n");
	get_random_bytes(buf, PAGE_SIZE);

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);

	for(i=0;i<100000;i++)
		sum = csum_partial(buf, PAGE_SIZE, sum);
	getnstimeofday(&end);
	preempt_enable();
	/* use both tv_sec and tv_nsec so crossing a second boundary
	 * doesn't produce a bogus wrapped value */
	time = (u64)(end.tv_sec - start.tv_sec) * NSEC_PER_SEC +
	       (end.tv_nsec - start.tv_nsec);

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n", time);
	kfree(buf);
	return 0;


}

static void __exit csum_cleanup_module(void)
{
	return;
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-16  6:25                   ` Ingo Molnar
@ 2013-10-16 16:55                     ` Joe Perches
  0 siblings, 0 replies; 107+ messages in thread
From: Joe Perches @ 2013-10-16 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote:
>  Prefetch moves memory from L2->L1 
> just as much as it moves cachelines from memory to the L2 cache. 

Yup, mea culpa.
I thought the prefetch was still to L1 like the Pentium.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:21                 ` Joe Perches
  2013-10-16  0:34                   ` Eric Dumazet
@ 2013-10-16  6:25                   ` Ingo Molnar
  2013-10-16 16:55                     ` Joe Perches
  1 sibling, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-16  6:25 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Joe Perches <joe@perches.com> wrote:

> On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> > * Joe Perches <joe@perches.com> wrote:
> > 
> > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > > attached patch brings much better results
> > > > > > 
> > > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > > Recv   Send    Send                          Utilization       Service Demand
> > > > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > > > 
> > > > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > > > 
> > > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > > []
> > > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > > >  			zero = 0;
> > > > > >  			count64 = count >> 3;
> > > > > >  			while (count64) { 
> > > > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > > > 
> > > > > Might the prefetch size be too big here?
> > > > 
> > > > To be effective, you need to prefetch well ahead of time.
> > > 
> > > No doubt.
> > 
> > So why did you ask then?
> > 
> > > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> > > 
> > > 5 cachelines for some processors seems like a lot.
> > 
> > What processors would that be?
> 
> The ones where conservatism in L1 cache use is good because there are 
> multiple threads running concurrently.

What specific processor models would that be?

> > Most processors have hundreds of cachelines even in their L1 cache.
>
> And sometimes that many executable processes too.

Nonsense, this is an unrolled loop running in softirq context most of the 
time that does not get preempted.

> > Thousands in the L2 cache, up to hundreds of thousands.
> 
> Irrelevant because prefetch doesn't apply there.

What planet are you living on? Prefetch moves memory from L2->L1 
just as much as it moves cachelines from memory to the L2 cache. 

Especially in the usecase cited here there will be a second use of the 
data (when it's finally copied over into user-space), so the L2 cache size 
very much matters.

The prefetches here matter mostly to the packet being processed: the ideal 
size of the look-ahead window in csum_partial() is dictated by typical 
memory latencies and bandwidth. The amount of parallelism is limited by 
the number of carry bits we can maintain independently.

> Ingo, Eric _showed_ that the prefetch is good here. How about looking at 
> a little optimization to the minimal prefetch that gives that level of 
> performance.

Joe, instead of using a condescending tone in matters you clearly have 
very little clue about you might want to start doing some real kernel 
hacking in more serious kernel areas, beyond trivial matters such as 
printk strings, to gain a bit of experience and respect ...

Every word you uttered in this thread made it more likely for me to 
redirect you to /dev/null, permanently.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:21                 ` Joe Perches
@ 2013-10-16  0:34                   ` Eric Dumazet
  2013-10-16  6:25                   ` Ingo Molnar
  1 sibling, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2013-10-16  0:34 UTC (permalink / raw)
  To: Joe Perches
  Cc: Ingo Molnar, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote:

> Ingo, Eric _showed_ that the prefetch is good here.
> How about looking at a little optimization to the minimal
> prefetch that gives that level of performance.

Wait a minute, my point was to point out that the main cost is the
memory fetching.

It's nice to optimize cpu cycles if we are short of them,
but in the csum_partial() case, the bottleneck is the memory.

Also I was wondering about the implications of changing the read order,
as it might fool cpu prediction.

I do not particularly care about finding the right prefetch stride,
I think Intel guys know better than me.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 16:02                   ` Andi Kleen
@ 2013-10-16  0:28                     ` Eric Dumazet
  0 siblings, 0 replies; 107+ messages in thread
From: Eric Dumazet @ 2013-10-16  0:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Sébastien Dugué,
	Neil Horman, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote:
> > I get the csum_partial() if disabling prequeue.
> 
> At least in the ipoib case I would consider that a misconfiguration.

There is nothing you can do: if the application is not blocked on recv()
but is using poll()/epoll()/select(), prequeue is not used at all.

In this case, we need to csum_partial() frame before sending an ACK,
don't you think ? ;)




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:41               ` Ingo Molnar
  2013-10-15 10:51                 ` Borislav Petkov
@ 2013-10-15 16:21                 ` Joe Perches
  2013-10-16  0:34                   ` Eric Dumazet
  2013-10-16  6:25                   ` Ingo Molnar
  1 sibling, 2 replies; 107+ messages in thread
From: Joe Perches @ 2013-10-15 16:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> * Joe Perches <joe@perches.com> wrote:
> 
> > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > attached patch brings much better results
> > > > > 
> > > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > > Recv   Send    Send                          Utilization       Service Demand
> > > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > > 
> > > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > > 
> > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > > []
> > > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > > >  			zero = 0;
> > > > >  			count64 = count >> 3;
> > > > >  			while (count64) { 
> > > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > > 
> > > > Might the prefetch size be too big here?
> > > 
> > > To be effective, you need to prefetch well ahead of time.
> > 
> > No doubt.
> 
> So why did you ask then?
> 
> > > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> > 
> > 5 cachelines for some processors seems like a lot.
> 
> What processors would that be?

The ones where conservatism in L1 cache use is good
because there are multiple threads running concurrently.

> Most processors have hundreds of cachelines even in their L1 cache. 

And sometimes that many executable processes too.

> Thousands in the L2 cache, up to hundreds of thousands.

Irrelevant because prefetch doesn't apply there.

Ingo, Eric _showed_ that the prefetch is good here.
How about looking at a little optimization to the minimal
prefetch that gives that level of performance.

You could argue that prefetching PAGE_SIZE or larger
would be better still otherwise.

I suspect that using a smaller multiple of
L1_CACHE_BYTES like 2 or 3 would perform the same.

The last time it was looked at for copy_page_64.S was
quite awhile ago.  It looks like maybe 2003.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:52                 ` Eric Dumazet
@ 2013-10-15 16:02                   ` Andi Kleen
  2013-10-16  0:28                     ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Andi Kleen @ 2013-10-15 16:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Sébastien Dugué,
	Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

> I get the csum_partial() if disabling prequeue.

At least in the ipoib case I would consider that a misconfiguration.

"don't do this if it hurts"

There may be more such problems.

-Andi

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:26               ` Eric Dumazet
@ 2013-10-15 14:52                 ` Eric Dumazet
  2013-10-15 16:02                   ` Andi Kleen
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:52 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote:

> And the receiver should also do the same : (ethtool -K eth0 rx off)
> 
>     10.55%    netserver  [kernel.kallsyms]  [k]
> csum_partial_copy_generic            

I get the csum_partial() if disabling prequeue.

echo 1 >/proc/sys/net/ipv4/tcp_low_latency

    24.49%      swapper  [kernel.kallsyms]  [k]
csum_partial                        
                |
                --- csum_partial
                    skb_checksum
                    __skb_checksum_complete_head
                    __skb_checksum_complete
                    tcp_rcv_established
                    tcp_v4_do_rcv
                    tcp_v4_rcv
                    ip_local_deliver_finish
                    ip_local_deliver
                    ip_rcv_finish
                    ip_rcv

So yes, we can call csum_partial() in receive path in this case.




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:15             ` Sébastien Dugué
@ 2013-10-15 14:26               ` Eric Dumazet
  2013-10-15 14:52                 ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:26 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote:
> Hi Eric,
> 
> On Tue, 15 Oct 2013 07:06:25 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > But the csum cost is both for sender and receiver ?
> 
>   No, it was only on the receiver side that I noticed it.
> 

Yes, as Andi said, we do the csum while copying the data for the
sender : (I disabled hardware assist tx checksum using 'ethtool -K eth0
tx off')

    17.21%  netperf  [kernel.kallsyms]       [k]
csum_partial_copy_generic                 
            |
            --- csum_partial_copy_generic
               |          
               |--97.39%-- __libc_send
               |          
                --2.61%-- tcp_sendmsg
                          inet_sendmsg
                          sock_sendmsg
                          _sys_sendto
                          sys_sendto
                          system_call_fastpath
                          __libc_send



>   Sorry, but this is 3 years old stuff and I do not have the
> setup anymore to reproduce.

And the receiver should also do the same : (ethtool -K eth0 rx off)

    10.55%    netserver  [kernel.kallsyms]  [k]
csum_partial_copy_generic            
              |
              --- csum_partial_copy_generic
                 |          
                 |--98.24%-- __libc_recv
                 |          
                  --1.76%-- skb_copy_and_csum_datagram
                            skb_copy_and_csum_datagram_iovec
                            tcp_rcv_established
                            tcp_v4_do_rcv
                            |          
                            |--73.05%-- tcp_prequeue_process
                            |          tcp_recvmsg
                            |          inet_recvmsg
                            |          sock_recvmsg
                            |          SYSC_recvfrom
                            |          SyS_recvfrom
                            |          system_call_fastpath
                            |          __libc_recv
                            |          

So I suspect something is wrong with IPoIB.





^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 14:06           ` Eric Dumazet
@ 2013-10-15 14:15             ` Sébastien Dugué
  2013-10-15 14:26               ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Sébastien Dugué @ 2013-10-15 14:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

Hi Eric,

On Tue, 15 Oct 2013 07:06:25 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> > On Tue, 15 Oct 2013 15:33:36 +0200
> > Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > > > where one cannot benefit from hardware offloads.
> > > 
> > > Is this with sendfile?
> > 
> >   Tests were done with iperf at the time without any extra funky options, and
> > looking at the code it looks like it does plain write() / recv() on the socket.
> > 
> 
> But the csum cost is both for sender and receiver ?

  No, it was only on the receiver side that I noticed it.

> 
> Please post the following :
> 
> perf record -g "your iperf session"
> 
> perf report | head -n 200

  Sorry, but this is 3 years old stuff and I do not have the
setup anymore to reproduce.

  Sébastien.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 13:56         ` Sébastien Dugué
@ 2013-10-15 14:06           ` Eric Dumazet
  2013-10-15 14:15             ` Sébastien Dugué
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-15 14:06 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Andi Kleen, Neil Horman, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> On Tue, 15 Oct 2013 15:33:36 +0200
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > > where one cannot benefit from hardware offloads.
> > 
> > Is this with sendfile?
> 
>   Tests were done with iperf at the time without any extra funky options, and
> looking at the code it looks like it does plain write() / recv() on the socket.
> 

But the csum cost is both for sender and receiver ?

Please post the following :

perf record -g "your iperf session"

perf report | head -n 200




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 13:33       ` Andi Kleen
@ 2013-10-15 13:56         ` Sébastien Dugué
  2013-10-15 14:06           ` Eric Dumazet
  0 siblings, 1 reply; 107+ messages in thread
From: Sébastien Dugué @ 2013-10-15 13:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Neil Horman, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, 15 Oct 2013 15:33:36 +0200
Andi Kleen <andi@firstfloor.org> wrote:

> >   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > where one cannot benefit from hardware offloads.
> 
> Is this with sendfile?

  Tests were done with iperf at the time without any extra funky options, and
looking at the code it looks like it does plain write() / recv() on the socket.

  Sébastien.

> 
> For normal send() the checksum is done in the user copy and for receiving it
> can be also done during the copy in most cases
> 
> -Andi

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:12     ` Sébastien Dugué
@ 2013-10-15 13:33       ` Andi Kleen
  2013-10-15 13:56         ` Sébastien Dugué
  0 siblings, 1 reply; 107+ messages in thread
From: Andi Kleen @ 2013-10-15 13:33 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Neil Horman, Andi Kleen, linux-kernel, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

>   indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> where one cannot benefit from hardware offloads.

Is this with sendfile? 

For normal send() the checksum is done in the user copy and for receiving it
can be also done during the copy in most cases

-Andi

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 21:07     ` Eric Dumazet
@ 2013-10-15 13:17       ` Neil Horman
  0 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-15 13:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Andi Kleen, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> > * Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > Neil Horman <nhorman@tuxdriver.com> writes:
> > > 
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > > don't have checksum offload hardware) were spending a significant 
> > > > amount of time computing
> > > 
> > > Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> > > anyways. I would rather investigate why that doesn't work.
> > 
> > There's a fair amount of csum_partial()-only workloads, a packet does not 
> > need to hit user-space to be a significant portion of the system's 
> > workload.
> > 
> > That said, it would indeed be nice to hear which particular code path was 
> > hit in this case, if nothing else then for education purposes.
> 
Many NICs do not provide CHECKSUM_COMPLETE information for encapsulated
frames, meaning we have to fall back to software csum to validate
TCP frames once the tunnel header is pulled.
> 
> So to reproduce the issue, all you need is to setup a GRE tunnel between
> two hosts, and use any tcp stream workload.
> 
> Then receiver profile looks like :
> 
> 11.45%	[kernel]	 [k] csum_partial
>  3.08%	[kernel]	 [k] _raw_spin_lock
>  3.04%	[kernel]	 [k] intel_idle
>  2.73%	[kernel]	 [k] ipt_do_table
>  2.57%	[kernel]	 [k] __netif_receive_skb_core
>  2.15%	[kernel]	 [k] copy_user_generic_string
>  2.05%	[kernel]	 [k] __hrtimer_start_range_ns
>  1.42%	[kernel]	 [k] ip_rcv
>  1.39%	[kernel]	 [k] kmem_cache_free
>  1.36%	[kernel]	 [k] _raw_spin_unlock_irqrestore
>  1.24%	[kernel]	 [k] __schedule
>  1.13%	[bnx2x] 	 [k] bnx2x_rx_int
>  1.12%	[bnx2x] 	 [k] bnx2x_start_xmit
>  1.11%	[kernel]	 [k] fib_table_lookup
>  0.99%	[ip_tunnel]  [k] ip_tunnel_lookup
>  0.91%	[ip_tunnel]  [k] ip_tunnel_rcv
>  0.90%	[kernel]	 [k] check_leaf.isra.7
>  0.89%	[kernel]	 [k] nf_iterate
> 
As I noted previously, the workload that this got reported on was ipoib, which
has a similar profile, since infiniband cards tend to not be able to do
checksum offload for ip frames.

Neil


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:32     ` Ingo Molnar
@ 2013-10-15 13:14       ` Neil Horman
  0 siblings, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-15 13:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > > 
> > > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > > don't have checksum offload hardware) were spending a significant amount 
> > > > of time computing checksums.  We found that by splitting the checksum 
> > > > computation into two separate streams, each skipping successive elements 
> > > > of the buffer being summed, we could parallelize the checksum operation 
> > > > across multiple alus.  Since neither chain is dependent on the result of 
> > > > the other, we get a speedup in execution (on hardware that has multiple 
> > > > alu's available, which is almost ubiquitous on x86), and only a 
> > > > negligible decrease on hardware that has only a single alu (an extra 
> > > > addition is introduced).  Since addition is commutative, the result is 
> > > > the same, only faster
> > > 
> > > This patch should really come with measurement numbers: what performance 
> > > increase (and drop) did you get on what CPUs.
> > > 
> > > Thanks,
> > > 
> > > 	Ingo
> > > 
> > 
> > 
> > So, early testing results today.  I wrote a test module that allocated 
> > a 4k buffer, initialized it with random data, and called csum_partial on 
> > it 100000 times, recording the time at the start and end of that loop.  
> 
> It would be nice to stick that testcase into tools/perf/bench/, see how we 
> are able to benchmark the kernel's memcpy and memset implementation there:
> 
Sure, my module is a mess currently.  But as soon as I investigate the use of
ADCX/ADOX that Anvin suggested, I'll see about integrating that.
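
Roughly what I have in mind is something like this (completely untested
sketch, only the 64-byte inner loop, and it would obviously need to be
gated on the CPU actually supporting the ADX extension):

/* Two independent carry chains: ADCX only reads/writes CF and ADOX only
 * reads/writes OF, so the two chains of adds don't serialize on a single
 * carry flag. */
static unsigned long csum_adx_block(const unsigned char *buff,
				    unsigned long count64)
{
	unsigned long res1 = 0, res2 = 0;

	while (count64) {
		asm("xorq  %%r8,%%r8\n\t"		/* clears CF and OF */
		    "adcx  0*8(%[src]),%[res1]\n\t"
		    "adox  1*8(%[src]),%[res2]\n\t"
		    "adcx  2*8(%[src]),%[res1]\n\t"
		    "adox  3*8(%[src]),%[res2]\n\t"
		    "adcx  4*8(%[src]),%[res1]\n\t"
		    "adox  5*8(%[src]),%[res2]\n\t"
		    "adcx  6*8(%[src]),%[res1]\n\t"
		    "adox  7*8(%[src]),%[res2]\n\t"
		    "adcx  %%r8,%[res1]\n\t"		/* fold trailing CF */
		    "adox  %%r8,%[res2]"		/* fold trailing OF */
		    : [res1] "+r" (res1), [res2] "+r" (res2)
		    : [src] "r" (buff)
		    : "r8", "memory", "cc");
		buff += 64;
		count64--;
	}

	/* merge the two chains, keeping the end-around carry */
	asm("addq %[res2],%[res1]\n\t"
	    "adcq $0,%[res1]"
	    : [res1] "+r" (res1)
	    : [res2] "r" (res2));
	return res1;
}

But I want to actually measure it on ADX-capable hardware before claiming
anything.
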
Neil

>  $ perf bench mem memcpy -r help
>  # Running 'mem/memcpy' benchmark:
>  Unknown routine:help
>  Available routines...
>         default ... Default memcpy() provided by glibc
>         x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
>         x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
>         x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S
> 
> In a similar fashion we could build the csum_partial() code as well and do 
> measurements. (We could change arch/x86/ code as well to make such 
> embedding/including easier, as long as it does not change performance.)
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15 10:51                 ` Borislav Petkov
@ 2013-10-15 12:04                   ` Ingo Molnar
  0 siblings, 0 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-15 12:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Joe Perches, Eric Dumazet, Neil Horman, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86


* Borislav Petkov <bp@alien8.de> wrote:

> On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> > Most processors have hundreds of cachelines even in their L1 cache.
> > Thousands in the L2 cache, up to hundreds of thousands.
> 
> Also, I have this hazy memory of prefetch hints being harmful in some
> situations: https://lwn.net/Articles/444344/

Yes, for things like random list walks they tend to be harmful - the 
hardware is smarter.

For something like a controlled packet stream they might be helpful.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-15  7:41               ` Ingo Molnar
@ 2013-10-15 10:51                 ` Borislav Petkov
  2013-10-15 12:04                   ` Ingo Molnar
  2013-10-15 16:21                 ` Joe Perches
  1 sibling, 1 reply; 107+ messages in thread
From: Borislav Petkov @ 2013-10-15 10:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Joe Perches, Eric Dumazet, Neil Horman, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86

On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> Most processors have hundreds of cachelines even in their L1 cache.
> Thousands in the L2 cache, up to hundreds of thousands.

Also, I have this hazy memory of prefetch hints being harmful in some
situations: https://lwn.net/Articles/444344/

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:49             ` Joe Perches
@ 2013-10-15  7:41               ` Ingo Molnar
  2013-10-15 10:51                 ` Borislav Petkov
  2013-10-15 16:21                 ` Joe Perches
  0 siblings, 2 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-15  7:41 UTC (permalink / raw)
  To: Joe Perches
  Cc: Eric Dumazet, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86


* Joe Perches <joe@perches.com> wrote:

> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > attached patch brings much better results
> > > > 
> > > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > > Recv   Send    Send                          Utilization       Service Demand
> > > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > > 
> > > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > > 
> > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > > []
> > > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > > >  			zero = 0;
> > > >  			count64 = count >> 3;
> > > >  			while (count64) { 
> > > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > > +				asm("prefetch 5*64(%[src])\n\t"
> > > 
> > > Might the prefetch size be too big here?
> > 
> > To be effective, you need to prefetch well ahead of time.
> 
> No doubt.

So why did you ask then?

> > 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)
> 
> 5 cachelines for some processors seems like a lot.

What processors would that be?

Most processors have hundreds of cachelines even in their L1 cache. 
Thousands in the L2 cache, up to hundreds of thousands.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:28   ` Neil Horman
  2013-10-14 21:19     ` Eric Dumazet
@ 2013-10-15  7:32     ` Ingo Molnar
  2013-10-15 13:14       ` Neil Horman
  1 sibling, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-15  7:32 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > 
> > * Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > don't have checksum offload hardware) were spending a significant amount 
> > > of time computing checksums.  We found that by splitting the checksum 
> > > computation into two separate streams, each skipping successive elements 
> > > of the buffer being summed, we could parallelize the checksum operation 
> > > across multiple alus.  Since neither chain is dependent on the result of 
> > > the other, we get a speedup in execution (on hardware that has multiple 
> > > alu's available, which is almost ubiquitous on x86), and only a 
> > > negligible decrease on hardware that has only a single alu (an extra 
> > > addition is introduced).  Since addition is commutative, the result is 
> > > the same, only faster
> > 
> > This patch should really come with measurement numbers: what performance 
> > increase (and drop) did you get on what CPUs.
> > 
> > Thanks,
> > 
> > 	Ingo
> > 
> 
> 
> So, early testing results today.  I wrote a test module that allocated 
> a 4k buffer, initialized it with random data, and called csum_partial on 
> it 100000 times, recording the time at the start and end of that loop.  

It would be nice to stick that testcase into tools/perf/bench/, see how we 
 are able to benchmark the kernel's memcpy and memset implementation there:

 $ perf bench mem memcpy -r help
 # Running 'mem/memcpy' benchmark:
 Unknown routine:help
 Available routines...
        default ... Default memcpy() provided by glibc
        x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
        x86-64-movsq ... movsq-based memcpy() in arch/x86/lib/memcpy_64.S
        x86-64-movsb ... movsb-based memcpy() in arch/x86/lib/memcpy_64.S

In a similar fashion we could build the csum_partial() code as well and do 
measurements. (We could change arch/x86/ code as well to make such 
embedding/including easier, as long as it does not change performance.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:25   ` Neil Horman
@ 2013-10-15  7:12     ` Sébastien Dugué
  2013-10-15 13:33       ` Andi Kleen
  0 siblings, 1 reply; 107+ messages in thread
From: Sébastien Dugué @ 2013-10-15  7:12 UTC (permalink / raw)
  To: Neil Horman
  Cc: Andi Kleen, linux-kernel, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


  Hi Neil, Andi,

On Mon, 14 Oct 2013 16:25:28 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:

> On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> > Neil Horman <nhorman@tuxdriver.com> writes:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > > checksum offload hardware) were spending a significant amount of time computing
> > 
> > Must be an odd workload, most TCP/UDP workloads do copy-checksum
> > anyways. I would rather investigate why that doesn't work.
> > 
> FWIW, the reporter was reporting this using an IP over Infiniband network.
> Neil

  indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
where one cannot benefit from hardware offloads.

  For a bit of background on the issue:

  It all started nearly 3 years ago when trying to understand why IPoIB BW was
so low in our setups and why ksoftirqd used 100% of one CPU. A kernel profile
trace showed that the CPU spent most of its time in checksum computation (from
the only old trace I managed to unearth):

  Function                               Hit    Time            Avg
  --------                               ---    ----            ---
  schedule                              1730    629976998 us     364148.5 us
  csum_partial                      10813465    20944414 us     1.936 us
  mwait_idle_with_hints                 1451    9858861 us     6794.529 us
  get_page_from_freelist            10110434    8120524 us     0.803 us
  alloc_pages_current               10093675    5180650 us     0.513 us
  __phys_addr                       35554783    4471387 us     0.125 us
  zone_statistics                   10110434    4360871 us     0.431 us
  ipoib_cm_alloc_rx_skb               673899    4343949 us     6.445 us

  After having recoded the checksum to use 2 ALUs, csum_partial() disappeared
from the tracer radar. IPoIB BW went from ~12Gb/s to ~20Gb/s and ksoftirqd load
dropped drastically. Sorry, I could not manage to locate my old traces and
results; those seem to have been lost in the mists of time.

  I did some micro benchmarks (dirty hack code below) of different solutions.
It looks like processing 128-byte blocks in 4 chains gives the best performance,
but there are plenty of other possibilities.

  FWIW, this code has been running as is at our customers sites for 3 years now.

  Sébastien.

> 
> > That said the change looks reasonable, but may not fix the root cause.
> > 
> > -Andi
> > 
> > -- 
> > ak@linux.intel.com -- Speaking for myself only
> > 

8<----------------------------------------------------------------------


/*
 * gcc -Wall -O3 -o csum_test csum_test.c -lrt
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <string.h>
#include <errno.h>

#define __force
#define unlikely(x)	(x)

typedef uint32_t u32;
typedef uint16_t u16;

typedef u16 __sum16;
typedef u32 __wsum;

#define NUM_LOOPS	100000
#define BUF_LEN		65536
unsigned char buf[BUF_LEN];


/*
 * csum_fold - Fold and invert a 32bit checksum.
 * sum: 32bit unfolded sum
 *
 * Fold a 32bit running checksum to 16bit and invert it. This is usually
 * the last step before putting a checksum into a packet.
 * Make sure not to mix with 64bit checksums.
 */
static inline __sum16 csum_fold(__wsum sum)
{
	asm("  addl %1,%0\n"
	    "  adcl $0xffff,%0"
	    : "=r" (sum)
	    : "r" ((__force u32)sum << 16),
	      "0" ((__force u32)sum & 0xffff0000));
	return (__force __sum16)(~(__force u32)sum >> 16);
}

static inline unsigned short from32to16(unsigned a)
{
	unsigned short b = a >> 16;
	asm("addw %w2,%w0\n\t"
	    "adcw $0,%w0\n"
	    : "=r" (b)
	    : "0" (b), "r" (a));
	return b;
}

static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
	asm("addl %2,%0\n\t"
	    "adcl $0,%0"
	    : "=r" (a)
	    : "0" (a), "r" (b));
	return a;
}

/*
 * Do a 64-bit checksum on an arbitrary memory area.
 * Returns a 32bit checksum.
 *
 * This isn't as time critical as it used to be because many NICs
 * do hardware checksumming these days.
 *
 * Things tried and found to not make it faster:
 * Manual Prefetching
 * Unrolling to an 128 bytes inner loop.
 * Using interleaving with more registers to break the carry chains.
 */
static unsigned do_csum(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res]\n\t"
				    "adcq 1*8(%[src]),%[res]\n\t"
				    "adcq 2*8(%[src]),%[res]\n\t"
				    "adcq 3*8(%[src]),%[res]\n\t"
				    "adcq 4*8(%[src]),%[res]\n\t"
				    "adcq 5*8(%[src]),%[res]\n\t"
				    "adcq 6*8(%[src]),%[res]\n\t"
				    "adcq 7*8(%[src]),%[res]\n\t"
				    "adcq %[zero],%[res]"
				    : [res] "=r" (result)
				    : [src] "r" (buff), [zero] "r" (zero),
				    "[res]" (result));
				buff += 64;
				count64--;
			}
			/* printf("csum %lx\n", result); */

			/* last up to 7 8byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

static unsigned do_csum1(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res1]\n\t"
				    "adcq 2*8(%[src]),%[res1]\n\t"
				    "adcq 4*8(%[src]),%[res1]\n\t"
				    "adcq 6*8(%[src]),%[res1]\n\t"
				    "adcq %[zero],%[res1]\n\t"

				    "addq 1*8(%[src]),%[res2]\n\t"
				    "adcq 3*8(%[src]),%[res2]\n\t"
				    "adcq 5*8(%[src]),%[res2]\n\t"
				    "adcq 7*8(%[src]),%[res2]\n\t"
				    "adcq %[zero],%[res2]"
				    : [res1] "=r" (result1),
				      [res2] "=r" (result2)
				    : [src] "r" (buff), [zero] "r" (zero),
				      "[res1]" (result1), "[res2]" (result2));
				buff += 64;
				count64--;
			}

			asm("addq %[res1],%[res]\n\t"
			    "adcq %[res2],%[res]\n\t"
			    "adcq %[zero],%[res]"
			    : [res] "=r" (result)
			    : [res1] "r" (result1),
			      [res2] "r" (result2),
			      [zero] "r" (zero),
			      "0" (result));

			/* last up to 7 8byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

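/*
 * "128B Split4" variant: aligns the pointer to 8 bytes first, then sums
 * 128-byte blocks across four independent carry chains before folding
 * them into a single result.
 */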
static unsigned do_csum2(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result3 = 0;
	unsigned long result4 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;

	odd = 1 & (unsigned long) buff;

	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}

	count = len >> 1;		/* nr of 16-bit words.. */

	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}

		count >>= 1;		/* nr of 32-bit words.. */

		if (count) {

			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}

			count >>= 1;	/* nr of 64-bit words.. */

			if (count) {
				unsigned long zero = 0;
				unsigned count128;

				if (8 & (unsigned long) buff) {
					asm("addq %1,%0\n\t"
					    "adcq %2,%0\n"
					    : "=r" (result)
					    : "m" (*(unsigned long *)buff),
					      "r" (zero),  "0" (result));
					count--;
					buff += 8;
				}

				/* main loop using 128 byte blocks */
				count128 = count >> 4;

				while (count128) {
					asm("addq 0*8(%[src]),%[res1]\n\t"
					    "adcq 4*8(%[src]),%[res1]\n\t"
					    "adcq 8*8(%[src]),%[res1]\n\t"
					    "adcq 12*8(%[src]),%[res1]\n\t"
					    "adcq %[zero],%[res1]\n\t"

					    "addq 1*8(%[src]),%[res2]\n\t"
					    "adcq 5*8(%[src]),%[res2]\n\t"
					    "adcq 9*8(%[src]),%[res2]\n\t"
					    "adcq 13*8(%[src]),%[res2]\n\t"
					    "adcq %[zero],%[res2]\n\t"

					    "addq 2*8(%[src]),%[res3]\n\t"
					    "adcq 6*8(%[src]),%[res3]\n\t"
					    "adcq 10*8(%[src]),%[res3]\n\t"
					    "adcq 14*8(%[src]),%[res3]\n\t"
					    "adcq %[zero],%[res3]\n\t"

					    "addq 3*8(%[src]),%[res4]\n\t"
					    "adcq 7*8(%[src]),%[res4]\n\t"
					    "adcq 11*8(%[src]),%[res4]\n\t"
					    "adcq 15*8(%[src]),%[res4]\n\t"
					    "adcq %[zero],%[res4]"

					    : [res1] "=r" (result1),
					      [res2] "=r" (result2),
					      [res3] "=r" (result3),
					      [res4] "=r" (result4)

					    : [src] "r" (buff),
					      [zero] "r" (zero),
					      "[res1]" (result1),
					      "[res2]" (result2),
					      "[res3]" (result3),
					      "[res4]" (result4));
					buff += 128;
					count128--;
				}

				asm("addq %[res1],%[res]\n\t"
				    "adcq %[res2],%[res]\n\t"
				    "adcq %[res3],%[res]\n\t"
				    "adcq %[res4],%[res]\n\t"
				    "adcq %[zero],%[res]"
				    : [res] "=r" (result)
				    : [res1] "r" (result1),
				      [res2] "r" (result2),
				      [res3] "r" (result3),
				      [res4] "r" (result4),
				      [zero] "r" (zero),
				      "0" (result));

				/* last up to 15 8byte blocks */
				count %= 16;
				while (count) {
					asm("addq %1,%0\n\t"
					    "adcq %2,%0\n"
					    : "=r" (result)
					    : "m" (*(unsigned long *)buff),
					      "r" (zero),  "0" (result));
					--count;
					buff += 8;
				}
				result = add32_with_carry(result>>32,
							  result&0xffffffff);

			}

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}


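/*
 * "64B Split4" variant: sums 64-byte blocks across four independent
 * carry chains, two quadwords per chain, folded together after the loop.
 */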
static unsigned do_csum3(const unsigned char *buff, unsigned len)
{
	unsigned odd, count;
	unsigned long result1 = 0;
	unsigned long result2 = 0;
	unsigned long result3 = 0;
	unsigned long result4 = 0;
	unsigned long result = 0;

	if (unlikely(len == 0))
		return result;
	odd = 1 & (unsigned long) buff;
	if (unlikely(odd)) {
		result = *buff << 8;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *)buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
			unsigned long zero;
			unsigned count64;
			if (4 & (unsigned long) buff) {
				result += *(unsigned int *) buff;
				count--;
				len -= 4;
				buff += 4;
			}
			count >>= 1;	/* nr of 64-bit words.. */

			/* main loop using 64byte blocks */
			zero = 0;
			count64 = count >> 3;
			while (count64) {
				asm("addq 0*8(%[src]),%[res1]\n\t"
				    "adcq 4*8(%[src]),%[res1]\n\t"
				    "adcq %[zero],%[res1]\n\t"

				    "addq 1*8(%[src]),%[res2]\n\t"
				    "adcq 5*8(%[src]),%[res2]\n\t"
				    "adcq %[zero],%[res2]\n\t"

				    "addq 2*8(%[src]),%[res3]\n\t"
				    "adcq 6*8(%[src]),%[res3]\n\t"
				    "adcq %[zero],%[res3]\n\t"

				    "addq 3*8(%[src]),%[res4]\n\t"
				    "adcq 7*8(%[src]),%[res4]\n\t"
				    "adcq %[zero],%[res4]\n\t"

				    : [res1] "=r" (result1),
				      [res2] "=r" (result2),
				      [res3] "=r" (result3),
				      [res4] "=r" (result4)
				    : [src] "r" (buff),
				      [zero] "r" (zero),
				      "[res1]" (result1),
				      "[res2]" (result2),
				      "[res3]" (result3),
				      "[res4]" (result4));
				buff += 64;
				count64--;
			}

			asm("addq %[res1],%[res]\n\t"
			    "adcq %[res2],%[res]\n\t"
			    "adcq %[res3],%[res]\n\t"
			    "adcq %[res4],%[res]\n\t"
			    "adcq %[zero],%[res]"
			    : [res] "=r" (result)
			    : [res1] "r" (result1),
			      [res2] "r" (result2),
			      [res3] "r" (result3),
			      [res4] "r" (result4),
			      [zero] "r" (zero),
			      "0" (result));

			/* printf("csum1 %lx\n", result); */

			/* last up to 7 8byte blocks */
			count %= 8;
			while (count) {
				asm("addq %1,%0\n\t"
				    "adcq %2,%0\n"
					    : "=r" (result)
				    : "m" (*(unsigned long *)buff),
				    "r" (zero),  "0" (result));
				--count;
				buff += 8;
			}
			result = add32_with_carry(result>>32,
						  result&0xffffffff);

			if (len & 4) {
				result += *(unsigned int *) buff;
				buff += 4;
			}
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += *buff;
	result = add32_with_carry(result>>32, result & 0xffffffff);
	if (unlikely(odd)) {
		result = from32to16(result);
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
	}
	return result;
}

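/* Nanosecond delta between two CLOCK_MONOTONIC timestamps. */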
long long delta_ns(struct timespec *t1, struct timespec *t2)
{
	long long tt1, tt2, delta;

	tt1 = t1->tv_sec * 1000000000LL + t1->tv_nsec;
	tt2 = t2->tv_sec * 1000000000LL + t2->tv_nsec;
	delta = tt2 - tt1;

	return delta;
}

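/*
 * Reads data.bin into buf, then times each do_csum variant over
 * NUM_LOOPS iterations starting at the requested byte offset.
 */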
int main(int argc, char **argv)
{
	FILE *f;
	unsigned csum1, csum2, csum3, csum4;
	struct timespec t1;
	struct timespec t2;
	double delta;
	int i;
	unsigned int offset = 0;
	unsigned char *ptr;
	unsigned int size;

	if ((f = fopen("data.bin", "r")) == NULL) {
		printf("Failed to open input file data.bin: %s\n",
		       strerror(errno));
		return -1;
	}

	if (fread(buf, 1, BUF_LEN, f) != BUF_LEN) {
		printf("Failed to read data.bin: %s\n",
		       strerror(errno));
		fclose(f);
		return -1;
	}

	fclose(f);

	if (argc > 1)
		offset = atoi(argv[1]);

	printf("Using offset=%d\n", offset);

	ptr = &buf[offset];
	size = BUF_LEN - offset;

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum1 = do_csum((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("Original:    %.8x %f us\n",
	       csum1, (double)delta/(double)NUM_LOOPS);

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum2 = do_csum1((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("64B Split2:  %.8x %f us\n",
	       csum2, (double)delta/(double)NUM_LOOPS);


	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum3 = do_csum2((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("128B Split4: %.8x %f us\n",
	       csum3, (double)delta/(double)NUM_LOOPS);

	clock_gettime(CLOCK_MONOTONIC, &t1);

	for (i = 0; i < NUM_LOOPS; i++)
		csum4 = do_csum3((const unsigned char *)ptr, size);

	clock_gettime(CLOCK_MONOTONIC, &t2);
	delta = (double)delta_ns(&t1, &t2)/1000.0;
	printf("64B Split4:  %.8x %f us\n",
	       csum4, (double)delta/(double)NUM_LOOPS);

	if ((csum1 != csum2) || (csum1 != csum3) || (csum1 != csum4))
		printf("Wrong checksum\n");

	return 0;
}
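
For completeness, a hypothetical helper (not part of the original posting)
for producing the data.bin input the harness reads; the BUF_LEN value below
is only an assumption and has to match whatever the benchmark above was
built with.  It can be built with a plain gcc -O2 and run once before
benchmarking.

/* gen_data.c - hypothetical helper, not from the original posting */
#include <stdio.h>
#include <stdlib.h>

#define BUF_LEN 65536		/* assumption: must match the harness */

int main(void)
{
	FILE *f = fopen("data.bin", "wb");
	unsigned int i;

	if (f == NULL) {
		perror("data.bin");
		return -1;
	}
	srand(1);		/* fixed seed, so runs are repeatable */
	for (i = 0; i < BUF_LEN; i++)
		fputc(rand() & 0xff, f);
	fclose(f);
	return 0;
}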



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:44           ` Eric Dumazet
@ 2013-10-14 22:49             ` Joe Perches
  2013-10-15  7:41               ` Ingo Molnar
  0 siblings, 1 reply; 107+ messages in thread
From: Joe Perches @ 2013-10-14 22:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > attached patch brings much better results
> > > 
> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > > Recv   Send    Send                          Utilization       Service Demand
> > > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > > 
> > >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > > 
> > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > []
> > > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> > >  			zero = 0;
> > >  			count64 = count >> 3;
> > >  			while (count64) { 
> > > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > > +				asm("prefetch 5*64(%[src])\n\t"
> > 
> > Might the prefetch size be too big here?
> 
> To be effective, you need to prefetch well ahead of time.

No doubt.

> 5*64 seems common practice (check arch/x86/lib/copy_page_64.S)

5 cachelines for some processors seems like a lot.

Given you've got a test rig, maybe you could experiment
with 2 and increase it until it doesn't get better.
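
A minimal sketch of what that experiment might look like (an illustration,
not code from this thread): the 64-byte adcq loop from the patch, with the
prefetch distance pulled out into a compile-time constant via GCC's
__builtin_prefetch, so only PREFETCH_LINES has to change between runs.

#define PREFETCH_LINES 2	/* cache lines ahead; sweep 2, 3, 4, ... */

static unsigned long sum64_prefetch(const unsigned char *buff,
				    unsigned count64)
{
	unsigned long result = 0, zero = 0;

	while (count64) {
		/* read-prefetch PREFETCH_LINES lines beyond the current block */
		__builtin_prefetch(buff + PREFETCH_LINES * 64, 0, 3);
		asm("addq 0*8(%[src]),%[res]\n\t"
		    "adcq 1*8(%[src]),%[res]\n\t"
		    "adcq 2*8(%[src]),%[res]\n\t"
		    "adcq 3*8(%[src]),%[res]\n\t"
		    "adcq 4*8(%[src]),%[res]\n\t"
		    "adcq 5*8(%[src]),%[res]\n\t"
		    "adcq 6*8(%[src]),%[res]\n\t"
		    "adcq 7*8(%[src]),%[res]\n\t"
		    "adcq %[zero],%[res]"
		    : [res] "=r" (result)
		    : [src] "r" (buff), [zero] "r" (zero),
		      "[res]" (result));
		buff += 64;
		count64--;
	}
	return result;
}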


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:37         ` Joe Perches
@ 2013-10-14 22:44           ` Eric Dumazet
  2013-10-14 22:49             ` Joe Perches
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-14 22:44 UTC (permalink / raw)
  To: Joe Perches
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > attached patch brings much better results
> > 
> > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> > Recv   Send    Send                          Utilization       Service Demand
> > Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> > Size   Size    Size     Time     Throughput  local    remote   local   remote
> > bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> > 
> >  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> > 
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> []
> > @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
> >  			zero = 0;
> >  			count64 = count >> 3;
> >  			while (count64) { 
> > -				asm("addq 0*8(%[src]),%[res]\n\t"
> > +				asm("prefetch 5*64(%[src])\n\t"
> 
> Might the prefetch size be too big here?

To be effective, you need to prefetch well ahead of time.

5*64 seems common practice (check arch/x86/lib/copy_page_64.S)




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 22:18       ` Eric Dumazet
@ 2013-10-14 22:37         ` Joe Perches
  2013-10-14 22:44           ` Eric Dumazet
  2013-10-17  0:34         ` Neil Horman
  1 sibling, 1 reply; 107+ messages in thread
From: Joe Perches @ 2013-10-14 22:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neil Horman, Ingo Molnar, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> attached patch brings much better results
> 
> lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
> 
>  87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  
> 
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
[]
> @@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>  			zero = 0;
>  			count64 = count >> 3;
>  			while (count64) { 
> -				asm("addq 0*8(%[src]),%[res]\n\t"
> +				asm("prefetch 5*64(%[src])\n\t"

Might the prefetch size be too big here?

0x140 is pretty big and is always multiple cache lines, no?

> +				    "addq 0*8(%[src]),%[res]\n\t"
>  				    "adcq 1*8(%[src]),%[res]\n\t"
>  				    "adcq 2*8(%[src]),%[res]\n\t"
>  				    "adcq 3*8(%[src]),%[res]\n\t"




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 21:19     ` Eric Dumazet
@ 2013-10-14 22:18       ` Eric Dumazet
  2013-10-14 22:37         ` Joe Perches
  2013-10-17  0:34         ` Neil Horman
  0 siblings, 2 replies; 107+ messages in thread
From: Eric Dumazet @ 2013-10-14 22:18 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> 
> > So, early testing results today.  I wrote a test module that, allocated a 4k
> > buffer, initalized it with random data, and called csum_partial on it 100000
> > times, recording the time at the start and end of that loop.  Results on a 2.4
> > GHz Intel Xeon processor:
> > 
> > Without patch: Average execute time for csum_partial was 808 ns
> > With patch: Average execute time for csum_partial was 438 ns
> 
> Impressive, but could you try again with data out of cache ?

So I tried your patch on a GRE tunnel and got following results on a
single TCP flow. (short result : no visible difference)


Using a prefetch 5*64([%src]) helps more (see at the end)

cpus : model name : Intel Xeon(R) CPU X5660 @ 2.80GHz


Before patch :

lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7651.61   2.51     5.45     0.645   1.399  


After patch :

lpq83:~# ./netperf -H 7.7.8.84 -l 20 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    20.00      7239.78   2.09     5.19     0.569   1.408  

Profile on receiver

   PerfTop:    1358 irqs/sec  kernel:98.5%  exact:  0.0% [1000Hz cycles],  (all, 24 CPUs)
------------------------------------------------------------------------------------------------------------------------------------------------------------

    19.99%  [kernel]     [k] csum_partial                
     7.04%  [kernel]     [k] copy_user_generic_string    
     4.92%  [bnx2x]      [k] bnx2x_rx_int                
     3.50%  [kernel]     [k] ipt_do_table                
     2.86%  [kernel]     [k] __netif_receive_skb_core    
     2.35%  [kernel]     [k] fib_table_lookup            
     2.19%  [kernel]     [k] netif_receive_skb           
     1.87%  [kernel]     [k] intel_idle                  
     1.65%  [kernel]     [k] kmem_cache_alloc            
     1.64%  [kernel]     [k] ip_rcv                      
     1.51%  [kernel]     [k] kmem_cache_free             


And attached patch brings much better results

lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      8043.82   2.32     5.34     0.566   1.304  

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..f0e10fc 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -68,7 +68,8 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			zero = 0;
 			count64 = count >> 3;
 			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
+				asm("prefetch 5*64(%[src])\n\t"
+				    "addq 0*8(%[src]),%[res]\n\t"
 				    "adcq 1*8(%[src]),%[res]\n\t"
 				    "adcq 2*8(%[src]),%[res]\n\t"
 				    "adcq 3*8(%[src]),%[res]\n\t"



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14 20:28   ` Neil Horman
@ 2013-10-14 21:19     ` Eric Dumazet
  2013-10-14 22:18       ` Eric Dumazet
  2013-10-15  7:32     ` Ingo Molnar
  1 sibling, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-14 21:19 UTC (permalink / raw)
  To: Neil Horman
  Cc: Ingo Molnar, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:

> So, early testing results today.  I wrote a test module that, allocated a 4k
> buffer, initalized it with random data, and called csum_partial on it 100000
> times, recording the time at the start and end of that loop.  Results on a 2.4
> GHz Intel Xeon processor:
> 
> Without patch: Average execute time for csum_partial was 808 ns
> With patch: Average execute time for csum_partial was 438 ns

Impressive, but could you try again with data out of cache?




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  7:49   ` Ingo Molnar
@ 2013-10-14 21:07     ` Eric Dumazet
  2013-10-15 13:17       ` Neil Horman
  0 siblings, 1 reply; 107+ messages in thread
From: Eric Dumazet @ 2013-10-14 21:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Neil Horman, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> * Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Neil Horman <nhorman@tuxdriver.com> writes:
> > 
> > > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > > don't have checksum offload hardware were spending a significant 
> > > amount of time computing
> > 
> > Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> > anyways. I would rather investigate why that doesn't work.
> 
> There's a fair amount of csum_partial()-only workloads, a packet does not 
> need to hit user-space to be a significant portion of the system's 
> workload.
> 
> That said, it would indeed be nice to hear which particular code path was 
> hit in this case, if nothing else then for education purposes.

Many NICs do not provide CHECKSUM_COMPLETE information for encapsulated
frames, meaning we have to fall back to software csum to validate
TCP frames once the tunnel header is pulled.

So to reproduce the issue, all you need is to set up a GRE tunnel between
two hosts, and use any tcp stream workload.

Then receiver profile looks like :

11.45%	[kernel]	 [k] csum_partial
 3.08%	[kernel]	 [k] _raw_spin_lock
 3.04%	[kernel]	 [k] intel_idle
 2.73%	[kernel]	 [k] ipt_do_table
 2.57%	[kernel]	 [k] __netif_receive_skb_core
 2.15%	[kernel]	 [k] copy_user_generic_string
 2.05%	[kernel]	 [k] __hrtimer_start_range_ns
 1.42%	[kernel]	 [k] ip_rcv
 1.39%	[kernel]	 [k] kmem_cache_free
 1.36%	[kernel]	 [k] _raw_spin_unlock_irqrestore
 1.24%	[kernel]	 [k] __schedule
 1.13%	[bnx2x] 	 [k] bnx2x_rx_int
 1.12%	[bnx2x] 	 [k] bnx2x_start_xmit
 1.11%	[kernel]	 [k] fib_table_lookup
 0.99%	[ip_tunnel]  [k] ip_tunnel_lookup
 0.91%	[ip_tunnel]  [k] ip_tunnel_rcv
 0.90%	[kernel]	 [k] check_leaf.isra.7
 0.89%	[kernel]	 [k] nf_iterate



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 17:21 ` Ingo Molnar
  2013-10-13 12:53   ` Neil Horman
@ 2013-10-14 20:28   ` Neil Horman
  2013-10-14 21:19     ` Eric Dumazet
  2013-10-15  7:32     ` Ingo Molnar
  1 sibling, 2 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-14 20:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant amount 
> > of time computing checksums.  We found that by splitting the checksum 
> > computation into two separate streams, each skipping successive elements 
> > of the buffer being summed, we could parallelize the checksum operation 
> > accros multiple alus.  Since neither chain is dependent on the result of 
> > the other, we get a speedup in execution (on hardware that has multiple 
> > alu's available, which is almost ubiquitous on x86), and only a 
> > negligible decrease on hardware that has only a single alu (an extra 
> > addition is introduced).  Since addition in commutative, the result is 
> > the same, only faster
> 
> This patch should really come with measurement numbers: what performance 
> increase (and drop) did you get on what CPUs.
> 
> Thanks,
> 
> 	Ingo
> 


So, early testing results today.  I wrote a test module that allocated a 4k
buffer, initialized it with random data, and called csum_partial on it 100000
times, recording the time at the start and end of that loop.  Results on a 2.4
GHz Intel Xeon processor:

Without patch: Average execute time for csum_partial was 808 ns
With patch: Average execute time for csum_partial was 438 ns


I'm looking into hpa's suggestion to use alternate instructions where available
right now.  I'll have more soon
Neil
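
For reference, a minimal sketch of what such a test module might look like
(a reconstruction under stated assumptions, not the module actually used;
the names, buffer handling and reporting format are guesses).  Loaded with
insmod, it would print the average csum_partial() time to the kernel log.

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/random.h>
#include <linux/ktime.h>
#include <asm/checksum.h>

#define CSUM_TEST_LEN	4096
#define CSUM_TEST_LOOPS	100000

static int __init csum_test_init(void)
{
	unsigned char *buf;
	ktime_t start, end;
	__wsum sum = 0;
	int i;

	buf = kmalloc(CSUM_TEST_LEN, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	get_random_bytes(buf, CSUM_TEST_LEN);

	start = ktime_get();
	for (i = 0; i < CSUM_TEST_LOOPS; i++)
		sum = csum_partial(buf, CSUM_TEST_LEN, 0);
	end = ktime_get();

	pr_info("csum_test: sum=%x avg=%lld ns\n", (unsigned int)sum,
		ktime_to_ns(ktime_sub(end, start)) / CSUM_TEST_LOOPS);

	kfree(buf);
	return 0;
}

static void __exit csum_test_exit(void)
{
}

module_init(csum_test_init);
module_exit(csum_test_exit);
MODULE_LICENSE("GPL");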


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  4:38 ` Andi Kleen
  2013-10-14  7:49   ` Ingo Molnar
@ 2013-10-14 20:25   ` Neil Horman
  2013-10-15  7:12     ` Sébastien Dugué
  1 sibling, 1 reply; 107+ messages in thread
From: Neil Horman @ 2013-10-14 20:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> 
> Must be an odd workload, most TCP/UDP workloads do copy-checksum
> anyways. I would rather investigate why that doesn't work.
> 
FWIW, the reporter was reporting this using an IP over Infiniband network.
Neil

> That said the change looks reasonable, but may not fix the root cause.
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-14  4:38 ` Andi Kleen
@ 2013-10-14  7:49   ` Ingo Molnar
  2013-10-14 21:07     ` Eric Dumazet
  2013-10-14 20:25   ` Neil Horman
  1 sibling, 1 reply; 107+ messages in thread
From: Ingo Molnar @ 2013-10-14  7:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Neil Horman, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86


* Andi Kleen <andi@firstfloor.org> wrote:

> Neil Horman <nhorman@tuxdriver.com> writes:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant 
> > amount of time computing
> 
> Must be an odd workload, most TCP/UDP workloads do copy-checksum 
> anyways. I would rather investigate why that doesn't work.

There's a fair amount of csum_partial()-only workloads, a packet does not 
need to hit user-space to be a significant portion of the system's 
workload.

That said, it would indeed be nice to hear which particular code path was 
hit in this case, if nothing else then for education purposes.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
  2013-10-12 22:29 ` H. Peter Anvin
@ 2013-10-14  4:38 ` Andi Kleen
  2013-10-14  7:49   ` Ingo Molnar
  2013-10-14 20:25   ` Neil Horman
  2 siblings, 2 replies; 107+ messages in thread
From: Andi Kleen @ 2013-10-14  4:38 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Neil Horman <nhorman@tuxdriver.com> writes:

> Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> checksum offload hardware were spending a significant amount of time computing

Must be an odd workload, most TCP/UDP workloads do copy-checksum
anyways. I would rather investigate why that doesn't work.

That said the change looks reasonable, but may not fix the root cause.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 22:29 ` H. Peter Anvin
@ 2013-10-13 12:53   ` Neil Horman
  2013-10-18 16:42   ` Neil Horman
  1 sibling, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-13 12:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> > checksum offload hardware were spending a significant amount of time computing
> > checksums.  We found that by splitting the checksum computation into two
> > separate streams, each skipping successive elements of the buffer being summed,
> > we could parallelize the checksum operation accros multiple alus.  Since neither
> > chain is dependent on the result of the other, we get a speedup in execution (on
> > hardware that has multiple alu's available, which is almost ubiquitous on x86),
> > and only a negligible decrease on hardware that has only a single alu (an extra
> > addition is introduced).  Since addition in commutative, the result is the same,
> > only faster
> 
> On hardware that implement ADCX/ADOX then you should also be able to
> have additional streams interleaved since those instructions allow for
> dual carry chains.
> 
Ok, that's a good idea, I'll look into those instructions this week
Neil

> 	-hpa
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-12 17:21 ` Ingo Molnar
@ 2013-10-13 12:53   ` Neil Horman
  2013-10-14 20:28   ` Neil Horman
  1 sibling, 0 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-13 12:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> 
> * Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> > Sébastien Dugué reported to me that devices implementing ipoib (which 
> > don't have checksum offload hardware were spending a significant amount 
> > of time computing checksums.  We found that by splitting the checksum 
> > computation into two separate streams, each skipping successive elements 
> > of the buffer being summed, we could parallelize the checksum operation 
> > accros multiple alus.  Since neither chain is dependent on the result of 
> > the other, we get a speedup in execution (on hardware that has multiple 
> > alu's available, which is almost ubiquitous on x86), and only a 
> > negligible decrease on hardware that has only a single alu (an extra 
> > addition is introduced).  Since addition in commutative, the result is 
> > the same, only faster
> 
> This patch should really come with measurement numbers: what performance 
> increase (and drop) did you get on what CPUs.
> 
> Thanks,
> 
Sure, I can gather some stats for you.  I'll post them later this week
Neil

> 	Ingo
> 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
@ 2013-10-12 22:29 ` H. Peter Anvin
  2013-10-13 12:53   ` Neil Horman
  2013-10-18 16:42   ` Neil Horman
  2013-10-14  4:38 ` Andi Kleen
  2 siblings, 2 replies; 107+ messages in thread
From: H. Peter Anvin @ 2013-10-12 22:29 UTC (permalink / raw)
  To: Neil Horman, linux-kernel
  Cc: sebastien.dugue, Thomas Gleixner, Ingo Molnar, x86

On 10/11/2013 09:51 AM, Neil Horman wrote:
> Sébastien Dugué reported to me that devices implementing ipoib (which don't have
> checksum offload hardware were spending a significant amount of time computing
> checksums.  We found that by splitting the checksum computation into two
> separate streams, each skipping successive elements of the buffer being summed,
> we could parallelize the checksum operation accros multiple alus.  Since neither
> chain is dependent on the result of the other, we get a speedup in execution (on
> hardware that has multiple alu's available, which is almost ubiquitous on x86),
> and only a negligible decrease on hardware that has only a single alu (an extra
> addition is introduced).  Since addition in commutative, the result is the same,
> only faster

On hardware that implements ADCX/ADOX you should also be able to
have additional streams interleaved, since those instructions allow for
dual carry chains.

	-hpa
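
A rough sketch of that idea (an illustration only, not code from this
thread): ADCX carries only through CF and ADOX only through OF, so two
running sums can be interleaved in one instruction stream on ADX-capable
hardware, with the two chains folded together at the end.

static unsigned long sum64_adx(const unsigned char *buff, unsigned count64)
{
	unsigned long res1 = 0, res2 = 0, zero = 0;

	while (count64) {
		asm("xorl %k[zero],%k[zero]\n\t"   /* clears both CF and OF */
		    "adcx 0*8(%[src]),%[res1]\n\t" /* chain 1 carries via CF */
		    "adox 1*8(%[src]),%[res2]\n\t" /* chain 2 carries via OF */
		    "adcx 2*8(%[src]),%[res1]\n\t"
		    "adox 3*8(%[src]),%[res2]\n\t"
		    "adcx 4*8(%[src]),%[res1]\n\t"
		    "adox 5*8(%[src]),%[res2]\n\t"
		    "adcx 6*8(%[src]),%[res1]\n\t"
		    "adox 7*8(%[src]),%[res2]\n\t"
		    "adcx %[zero],%[res1]\n\t"     /* fold trailing CF */
		    "adox %[zero],%[res2]"         /* fold trailing OF */
		    : [res1] "+r" (res1), [res2] "+r" (res2),
		      [zero] "+r" (zero)
		    : [src] "r" (buff)
		    : "memory");
		buff += 64;
		count64--;
	}
	/* fold the two chains into a single sum */
	asm("addq %[res2],%[res1]\n\t"
	    "adcq $0,%[res1]"
	    : [res1] "+r" (res1)
	    : [res2] "r" (res2));
	return res1;
}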




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's
  2013-10-11 16:51 Neil Horman
@ 2013-10-12 17:21 ` Ingo Molnar
  2013-10-13 12:53   ` Neil Horman
  2013-10-14 20:28   ` Neil Horman
  2013-10-12 22:29 ` H. Peter Anvin
  2013-10-14  4:38 ` Andi Kleen
  2 siblings, 2 replies; 107+ messages in thread
From: Ingo Molnar @ 2013-10-12 17:21 UTC (permalink / raw)
  To: Neil Horman
  Cc: linux-kernel, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86


* Neil Horman <nhorman@tuxdriver.com> wrote:

> Sébastien Dugué reported to me that devices implementing ipoib (which 
> don't have checksum offload hardware were spending a significant amount 
> of time computing checksums.  We found that by splitting the checksum 
> computation into two separate streams, each skipping successive elements 
> of the buffer being summed, we could parallelize the checksum operation 
> accros multiple alus.  Since neither chain is dependent on the result of 
> the other, we get a speedup in execution (on hardware that has multiple 
> alu's available, which is almost ubiquitous on x86), and only a 
> negligible decrease on hardware that has only a single alu (an extra 
> addition is introduced).  Since addition in commutative, the result is 
> the same, only faster

This patch should really come with measurement numbers: what performance 
increase (and drop) did you get on what CPUs.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH] x86: Run checksumming in parallel accross multiple alu's
@ 2013-10-11 16:51 Neil Horman
  2013-10-12 17:21 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Neil Horman @ 2013-10-11 16:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Neil Horman, sebastien.dugue, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86

Sébastien Dugué reported to me that devices implementing ipoib (which don't have
checksum offload hardware) were spending a significant amount of time computing
checksums.  We found that by splitting the checksum computation into two
separate streams, each skipping successive elements of the buffer being summed,
we could parallelize the checksum operation across multiple ALUs.  Since neither
chain is dependent on the result of the other, we get a speedup in execution (on
hardware that has multiple ALUs available, which is almost ubiquitous on x86),
and only a negligible decrease on hardware that has only a single ALU (an extra
addition is introduced).  Since addition is commutative, the result is the same,
only faster.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: sebastien.dugue@bull.net
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 arch/x86/lib/csum-partial_64.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 9845371..2c7bc50 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -29,11 +29,12 @@ static inline unsigned short from32to16(unsigned a)
  * Things tried and found to not make it faster:
  * Manual Prefetching
  * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
  */
 static unsigned do_csum(const unsigned char *buff, unsigned len)
 {
 	unsigned odd, count;
+	unsigned long result1 = 0;
+	unsigned long result2 = 0;
 	unsigned long result = 0;
 
 	if (unlikely(len == 0))
@@ -68,22 +69,34 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
 			zero = 0;
 			count64 = count >> 3;
 			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
+				asm("addq 0*8(%[src]),%[res1]\n\t"
+				    "adcq 2*8(%[src]),%[res1]\n\t"
+				    "adcq 4*8(%[src]),%[res1]\n\t"
+				    "adcq 6*8(%[src]),%[res1]\n\t"
+				    "adcq %[zero],%[res1]\n\t"
+
+				    "addq 1*8(%[src]),%[res2]\n\t"
+				    "adcq 3*8(%[src]),%[res2]\n\t"
+				    "adcq 5*8(%[src]),%[res2]\n\t"
+				    "adcq 7*8(%[src]),%[res2]\n\t"
+				    "adcq %[zero],%[res2]"
+				    : [res1] "=r" (result1),
+				      [res2] "=r" (result2)
 				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
+				      "[res1]" (result1), "[res2]" (result2));
 				buff += 64;
 				count64--;
 			}
 
+			asm("addq %[res1],%[res]\n\t"
+			    "adcq %[res2],%[res]\n\t"
+			    "adcq %[zero],%[res]"
+			    : [res] "=r" (result)
+			    : [res1] "r" (result1),
+			      [res2] "r" (result2),
+			      [zero] "r" (zero),
+			      "0" (result));
+
 			/* last up to 7 8byte blocks */
 			count %= 8; 
 			while (count) { 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 107+ messages in thread

end of thread

Thread overview: 107+ messages
2013-10-30  5:25 [PATCH] x86: Run checksumming in parallel accross multiple alu's Doug Ledford
2013-10-30 10:27 ` David Laight
2013-10-30 11:02 ` Neil Horman
2013-10-30 12:18   ` David Laight
2013-10-30 13:22     ` Doug Ledford
2013-10-30 13:35   ` Doug Ledford
2013-10-30 14:04     ` David Laight
2013-10-30 14:52     ` Neil Horman
2013-10-31 18:30     ` Neil Horman
2013-11-01  9:21       ` Ingo Molnar
2013-11-01 15:42       ` Ben Hutchings
2013-11-01 16:08         ` Neil Horman
2013-11-01 16:16           ` Ben Hutchings
2013-11-01 16:18           ` David Laight
2013-11-01 17:37             ` Neil Horman
2013-11-01 19:45               ` Joe Perches
2013-11-01 19:58                 ` Neil Horman
2013-11-01 20:26                   ` Joe Perches
2013-11-02  2:07                     ` Neil Horman
2013-11-04  9:47               ` David Laight
  -- strict thread matches above, loose matches on Subject: below --
2013-10-18 17:42 Doug Ledford
2013-10-19  8:23 ` Ingo Molnar
2013-10-21 17:54   ` Doug Ledford
2013-10-26 11:55     ` Ingo Molnar
2013-10-28 17:02       ` Doug Ledford
2013-10-29  8:38         ` Ingo Molnar
2013-10-18 15:46 Doug Ledford
2013-10-11 16:51 Neil Horman
2013-10-12 17:21 ` Ingo Molnar
2013-10-13 12:53   ` Neil Horman
2013-10-14 20:28   ` Neil Horman
2013-10-14 21:19     ` Eric Dumazet
2013-10-14 22:18       ` Eric Dumazet
2013-10-14 22:37         ` Joe Perches
2013-10-14 22:44           ` Eric Dumazet
2013-10-14 22:49             ` Joe Perches
2013-10-15  7:41               ` Ingo Molnar
2013-10-15 10:51                 ` Borislav Petkov
2013-10-15 12:04                   ` Ingo Molnar
2013-10-15 16:21                 ` Joe Perches
2013-10-16  0:34                   ` Eric Dumazet
2013-10-16  6:25                   ` Ingo Molnar
2013-10-16 16:55                     ` Joe Perches
2013-10-17  0:34         ` Neil Horman
2013-10-17  1:42           ` Eric Dumazet
2013-10-18 16:50             ` Neil Horman
2013-10-18 17:20               ` Eric Dumazet
2013-10-18 20:11                 ` Neil Horman
2013-10-18 21:15                   ` Eric Dumazet
2013-10-20 21:29                     ` Neil Horman
2013-10-21 17:31                       ` Eric Dumazet
2013-10-21 17:46                         ` Neil Horman
2013-10-21 19:21                     ` Neil Horman
2013-10-21 19:44                       ` Eric Dumazet
2013-10-21 20:19                         ` Neil Horman
2013-10-26 12:01                           ` Ingo Molnar
2013-10-26 13:58                             ` Neil Horman
2013-10-27  7:26                               ` Ingo Molnar
2013-10-27 17:05                                 ` Neil Horman
2013-10-17  8:41           ` Ingo Molnar
2013-10-17 18:19             ` H. Peter Anvin
2013-10-17 18:48               ` Eric Dumazet
2013-10-18  6:43               ` Ingo Molnar
2013-10-28 16:01             ` Neil Horman
2013-10-28 16:20               ` Ingo Molnar
2013-10-28 17:49                 ` Neil Horman
2013-10-28 16:24               ` Ingo Molnar
2013-10-28 16:49                 ` David Ahern
2013-10-28 17:46                 ` Neil Horman
2013-10-28 18:29                   ` Neil Horman
2013-10-29  8:25                     ` Ingo Molnar
2013-10-29 11:20                       ` Neil Horman
2013-10-29 11:30                         ` Ingo Molnar
2013-10-29 11:49                           ` Neil Horman
2013-10-29 12:52                             ` Ingo Molnar
2013-10-29 13:07                               ` Neil Horman
2013-10-29 13:11                                 ` Ingo Molnar
2013-10-29 13:20                                   ` Neil Horman
2013-10-29 14:17                                   ` Neil Horman
2013-10-29 14:27                                     ` Ingo Molnar
2013-10-29 20:26                                       ` Neil Horman
2013-10-31 10:22                                         ` Ingo Molnar
2013-10-31 14:33                                           ` Neil Horman
2013-11-01  9:13                                             ` Ingo Molnar
2013-11-01 14:06                                               ` Neil Horman
2013-10-29 14:12                               ` David Ahern
2013-10-15  7:32     ` Ingo Molnar
2013-10-15 13:14       ` Neil Horman
2013-10-12 22:29 ` H. Peter Anvin
2013-10-13 12:53   ` Neil Horman
2013-10-18 16:42   ` Neil Horman
2013-10-18 17:09     ` H. Peter Anvin
2013-10-25 13:06       ` Neil Horman
2013-10-14  4:38 ` Andi Kleen
2013-10-14  7:49   ` Ingo Molnar
2013-10-14 21:07     ` Eric Dumazet
2013-10-15 13:17       ` Neil Horman
2013-10-14 20:25   ` Neil Horman
2013-10-15  7:12     ` Sébastien Dugué
2013-10-15 13:33       ` Andi Kleen
2013-10-15 13:56         ` Sébastien Dugué
2013-10-15 14:06           ` Eric Dumazet
2013-10-15 14:15             ` Sébastien Dugué
2013-10-15 14:26               ` Eric Dumazet
2013-10-15 14:52                 ` Eric Dumazet
2013-10-15 16:02                   ` Andi Kleen
2013-10-16  0:28                     ` Eric Dumazet
