* [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
@ 2013-11-12  1:42 Joe Perches
  2013-11-12 13:59 ` Neil Horman
  2013-11-12 17:12 ` Neil Horman
  0 siblings, 2 replies; 13+ messages in thread
From: Joe Perches @ 2013-11-12  1:42 UTC (permalink / raw)
  To: netdev, Neil Horman
  Cc: Dave Jones, linux-kernel, sebastien.dugue, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

Hi again Neil.

Forwarding on to netdev with a concern as to how often
do_csum is used via csum_partial for very short headers
and what impact any prefetch would have there.

Also, what changed in your test environment?

Why are the new values 5+% higher cycles/byte than the
previous values?

And here is the new table reformatted:

len	set	iterations	Readahead cachelines vs cycles/byte
			1	2	3	4	6	10	20
1500B	64MB	1000000	1.4342	1.4300	1.4350	1.4350	1.4396	1.4315	1.4555
1500B	128MB	1000000	1.4312	1.4346	1.4271	1.4284	1.4376	1.4318	1.4431
1500B	256MB	1000000	1.4309	1.4254	1.4316	1.4308	1.4418	1.4304	1.4367
1500B	512MB	1000000	1.4534	1.4516	1.4523	1.4563	1.4554	1.4644	1.4590
9000B	64MB	1000000	0.8921	0.8924	0.8932	0.8949	0.8952	0.8939	0.8985
9000B	128MB	1000000	0.8841	0.8856	0.8845	0.8854	0.8861	0.8879	0.8861
9000B	256MB	1000000	0.8806	0.8821	0.8813	0.8833	0.8814	0.8827	0.8895
9000B	512MB	1000000	0.8838	0.8852	0.8841	0.8865	0.8846	0.8901	0.8865
64KB	64MB	1000000	0.8132	0.8136	0.8132	0.8150	0.8147	0.8149	0.8147
64KB	128MB	1000000	0.8013	0.8014	0.8013	0.8020	0.8041	0.8015	0.8033
64KB	256MB	1000000	0.7956	0.7959	0.7956	0.7976	0.7981	0.7967	0.7973
64KB	512MB	1000000	0.7934	0.7932	0.7937	0.7951	0.7954	0.7943	0.7948

-------- Forwarded Message --------
From: Neil Horman <nhorman@tuxdriver.com>
To: Joe Perches <joe@perches.com>
Cc: Dave Jones <davej@redhat.com>, linux-kernel@vger.kernel.org,
sebastien.dugue@bull.net, Thomas Gleixner <tglx@linutronix.de>, Ingo
Molnar <mingo@redhat.com>, H. Peter Anvin <hpa@zytor.com>,
x86@kernel.org
Subject: Re: [PATCH v2 2/2] x86: add prefetching to do_csum

On Fri, Nov 08, 2013 at 12:29:07PM -0800, Joe Perches wrote:
> On Fri, 2013-11-08 at 15:14 -0500, Neil Horman wrote:
> > On Fri, Nov 08, 2013 at 11:33:13AM -0800, Joe Perches wrote:
> > > On Fri, 2013-11-08 at 14:01 -0500, Neil Horman wrote:
> > > > On Wed, Nov 06, 2013 at 09:19:23AM -0800, Joe Perches wrote:
> > > > > On Wed, 2013-11-06 at 10:54 -0500, Neil Horman wrote:
> > > > > > On Wed, Nov 06, 2013 at 10:34:29AM -0500, Dave Jones wrote:
> > > > > > > On Wed, Nov 06, 2013 at 10:23:19AM -0500, Neil Horman wrote:
> > > > > > >  > do_csum was identified via perf recently as a hot spot when doing
> > > > > > >  > receive on ip over infiniband workloads.  After a lot of testing and
> > > > > > >  > ideas, we found the best optimization available to us currently is to
> > > > > > >  > prefetch the entire data buffer prior to doing the checksum
> > > > > []
> > > > > > I'll fix this up and send a v3, but I'll give it a day in case there are more
> > > > > > comments first.
> > > > > 
> > > > > Perhaps a reduction in prefetch loop count helps.
> > > > > 
> > > > > Was capping the amount prefetched and letting the
> > > > > hardware prefetch also tested?
> > > > > 
> > > > > 	prefetch_lines(buff, min(len, cache_line_size() * 8u));
> > > > > 
> > > > 
> > > > Just tested this out:
> > > 
> > > Thanks.
> > > 
> > > Reformatting the table so it's a bit more
> > > readable/comparable for me:
> > > 
> > > len	SetSz	Loops	cycles/byte
> > > 			limited	unlimited
> > > 1500B	64MB	1M	1.3442	1.3605
> > > 1500B	128MB	1M	1.3410	1.3542
> > > 1500B	256MB	1M	1.3536	1.3710
> > > 1500B	512MB	1M	1.3463	1.3536
> > > 9000B	64MB	1M	0.8522	0.8504
> > > 9000B	128MB	1M	0.8528	0.8536
> > > 9000B	256MB	1M	0.8532	0.8520
> > > 9000B	512MB	1M	0.8527	0.8525
> > > 64KB	64MB	1M	0.7686	0.7683
> > > 64KB	128MB	1M	0.7695	0.7686
> > > 64KB	256MB	1M	0.7699	0.7708
> > > 64KB	512MB	1M	0.7799	0.7694
> > > 
> > > This data appears to show some value
> > > in capping for 1500b lengths and noise
> > > for shorter and longer lengths.
> > > 
> > > Any idea what the actual distribution of
> > > do_csum lengths is under various loads?
> > > 
> > I don't have any hard data, no, sorry.
> 
> I think you should before you implement this.
> You might find extremely short lengths.
> 
> > I'll cap the prefetch at 1500B for now, since it
> > doesn't seem to hurt or help beyond that
> 
> The table data has a max prefetch of
> 8 * boot_cpu_data.x86_cache_alignment so
> I believe it's always less than 1500 but
> perhaps 4 might be slightly better still.
> 


So, you appear to be correct, I reran my test set with different prefetch
ceilings and got the results below.  There are some cases in which there is a
performance gain, but the gain is small, and occurs at different spots depending
on the input buffer size (though most peak gains appear around 2 cache lines).
I'm guessing it takes about 2 prefetches before hardware prefetching catches up,
at which point we're just spending time issuing instructions that get discarded.
Given the small prefetch limit, and the limited gains (which may also change on
different hardware), I think we should probably just drop the prefetch idea
entirely, and perhaps just take the perf patch so that we can revisit this area
when hardware that supports the AVX extensions and/or ADCX/ADOX becomes
available.

Ingo, does that seem reasonable to you?
Neil
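
For reference, the capped variant being swept below looks roughly like this.
A sketch only: prefetch_lines() is assumed to behave like the helper in the
v2 patch, and `ceiling` is an illustrative stand-in for the 1..20 cache-line
limit, not the real test-harness variable:

	static inline void prefetch_lines(const void *addr, unsigned int len)
	{
		const char *cur = addr;
		const char *end = cur + len;

		/* One prefetch per cache line; past the cap, the hardware
		 * prefetcher is left to do the work. */
		for (; cur < end; cur += cache_line_size())
			prefetch(cur);
	}

	...
	prefetch_lines(buff, min(len, cache_line_size() * ceiling));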



1 cache line:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.434190
1500B   | 128MB | 1000000       | 1.431216
1500B   | 256MB | 1000000       | 1.430888
1500B   | 512MB | 1000000       | 1.453422
9000B   | 64MB  | 1000000       | 0.892055
9000B   | 128MB | 1000000       | 0.884050
9000B   | 256MB | 1000000       | 0.880551
9000B   | 512MB | 1000000       | 0.883848
64KB    | 64MB  | 1000000       | 0.813187
64KB    | 128MB | 1000000       | 0.801326
64KB    | 256MB | 1000000       | 0.795643
64KB    | 512MB | 1000000       | 0.793400


2 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.430030
1500B   | 128MB | 1000000       | 1.434589
1500B   | 256MB | 1000000       | 1.425430
1500B   | 512MB | 1000000       | 1.451570
9000B   | 64MB  | 1000000       | 0.892369
9000B   | 128MB | 1000000       | 0.885577
9000B   | 256MB | 1000000       | 0.882091
9000B   | 512MB | 1000000       | 0.885201
64KB    | 64MB  | 1000000       | 0.813629
64KB    | 128MB | 1000000       | 0.801377
64KB    | 256MB | 1000000       | 0.795861
64KB    | 512MB | 1000000       | 0.793242

3 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.435048
1500B   | 128MB | 1000000       | 1.427103
1500B   | 256MB | 1000000       | 1.431558
1500B   | 512MB | 1000000       | 1.452250
9000B   | 64MB  | 1000000       | 0.893162
9000B   | 128MB | 1000000       | 0.884488
9000B   | 256MB | 1000000       | 0.881314
9000B   | 512MB | 1000000       | 0.884060
64KB    | 64MB  | 1000000       | 0.813185
64KB    | 128MB | 1000000       | 0.801280
64KB    | 256MB | 1000000       | 0.795554
64KB    | 512MB | 1000000       | 0.793670

4 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.435013
1500B   | 128MB | 1000000       | 1.428434
1500B   | 256MB | 1000000       | 1.430780
1500B   | 512MB | 1000000       | 1.456285
9000B   | 64MB  | 1000000       | 0.894877
9000B   | 128MB | 1000000       | 0.885387
9000B   | 256MB | 1000000       | 0.883293
9000B   | 512MB | 1000000       | 0.886462
64KB    | 64MB  | 1000000       | 0.815036
64KB    | 128MB | 1000000       | 0.801962
64KB    | 256MB | 1000000       | 0.797618
64KB    | 512MB | 1000000       | 0.795138

6 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.439609
1500B   | 128MB | 1000000       | 1.437569
1500B   | 256MB | 1000000       | 1.441776
1500B   | 512MB | 1000000       | 1.455362
9000B   | 64MB  | 1000000       | 0.895242
9000B   | 128MB | 1000000       | 0.886149
9000B   | 256MB | 1000000       | 0.881375
9000B   | 512MB | 1000000       | 0.884610
64KB    | 64MB  | 1000000       | 0.814658
64KB    | 128MB | 1000000       | 0.804124
64KB    | 256MB | 1000000       | 0.798143
64KB    | 512MB | 1000000       | 0.795377

10 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.431512
1500B   | 128MB | 1000000       | 1.431805
1500B   | 256MB | 1000000       | 1.430388
1500B   | 512MB | 1000000       | 1.464370
9000B   | 64MB  | 1000000       | 0.893922
9000B   | 128MB | 1000000       | 0.887852
9000B   | 256MB | 1000000       | 0.882711
9000B   | 512MB | 1000000       | 0.890067
64KB    | 64MB  | 1000000       | 0.814890
64KB    | 128MB | 1000000       | 0.801470
64KB    | 256MB | 1000000       | 0.796658
64KB    | 512MB | 1000000       | 0.794266

20 cache lines:
len	| set	| iterations	| cycles/byte
========|=======|===============|=============
1500B   | 64MB  | 1000000       | 1.455539
1500B   | 128MB | 1000000       | 1.443117
1500B   | 256MB | 1000000       | 1.436739
1500B   | 512MB | 1000000       | 1.458973
9000B   | 64MB  | 1000000       | 0.898470
9000B   | 128MB | 1000000       | 0.886110
9000B   | 256MB | 1000000       | 0.889549
9000B   | 512MB | 1000000       | 0.886547
64KB    | 64MB  | 1000000       | 0.814665
64KB    | 128MB | 1000000       | 0.803252
64KB    | 256MB | 1000000       | 0.797268
64KB    | 512MB | 1000000       | 0.794830





* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-12  1:42 [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum] Joe Perches
@ 2013-11-12 13:59 ` Neil Horman
  2013-11-12 17:12 ` Neil Horman
  1 sibling, 0 replies; 13+ messages in thread
From: Neil Horman @ 2013-11-12 13:59 UTC (permalink / raw)
  To: Joe Perches
  Cc: netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> Hi again Neil.
> 
> Forwarding on to netdev with a concern as to how often
> do_csum is used via csum_partial for very short headers
> and what impact any prefetch would have there.
> 
> Also, what changed in your test environment?
> 
> Why are the new values 5+% higher cycles/byte than the
> previous values?
> 
Hmm, thank you, I didn't notice the increase.  I think I rebooted my system and
failed to reset my irq affinity to avoid the processor I was testing on.  Let me
rerun.
Neil

> []


* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-12  1:42 [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum] Joe Perches
  2013-11-12 13:59 ` Neil Horman
@ 2013-11-12 17:12 ` Neil Horman
  2013-11-12 17:33   ` Joe Perches
  1 sibling, 1 reply; 13+ messages in thread
From: Neil Horman @ 2013-11-12 17:12 UTC (permalink / raw)
  To: Joe Perches
  Cc: netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> Hi again Neil.
> 
> Forwarding on to netdev with a concern as to how often
> do_csum is used via csum_partial for very short headers
> and what impact any prefetch would have there.
> 
> Also, what changed in your test environment?
> 
> Why are the new values 5+% higher cycles/byte than the
> previous values?
> 
> And here is the new table reformatted:
> 
> len	set	iterations	Readahead cachelines vs cycles/byte
> 			1	2	3	4	6	10	20
> 1500B	64MB	1000000	1.4342	1.4300	1.4350	1.4350	1.4396	1.4315	1.4555
> 1500B	128MB	1000000	1.4312	1.4346	1.4271	1.4284	1.4376	1.4318	1.4431
> 1500B	256MB	1000000	1.4309	1.4254	1.4316	1.4308	1.4418	1.4304	1.4367
> 1500B	512MB	1000000	1.4534	1.4516	1.4523	1.4563	1.4554	1.4644	1.4590
> 9000B	64MB	1000000	0.8921	0.8924	0.8932	0.8949	0.8952	0.8939	0.8985
> 9000B	128MB	1000000	0.8841	0.8856	0.8845	0.8854	0.8861	0.8879	0.8861
> 9000B	256MB	1000000	0.8806	0.8821	0.8813	0.8833	0.8814	0.8827	0.8895
> 9000B	512MB	1000000	0.8838	0.8852	0.8841	0.8865	0.8846	0.8901	0.8865
> 64KB	64MB	1000000	0.8132	0.8136	0.8132	0.8150	0.8147	0.8149	0.8147
> 64KB	128MB	1000000	0.8013	0.8014	0.8013	0.8020	0.8041	0.8015	0.8033
> 64KB	256MB	1000000	0.7956	0.7959	0.7956	0.7976	0.7981	0.7967	0.7973
> 64KB	512MB	1000000	0.7934	0.7932	0.7937	0.7951	0.7954	0.7943	0.7948
> 


There we go, that's better:
len   set     iterations      Readahead cachelines vs cycles/byte
			1	2	3	4	5	10	20
1500B 64MB	1000000	1.3638	1.3288	1.3464	1.3505	1.3586	1.3527	1.3408
1500B 128MB	1000000	1.3394	1.3357	1.3625	1.3456	1.3536	1.3400	1.3410
1500B 256MB	1000000 1.3773	1.3362	1.3419	1.3548	1.3543	1.3442	1.4163
1500B 512MB	1000000 1.3442	1.3390	1.3434	1.3505	1.3767	1.3513	1.3820
9000B 64MB	1000000 0.8505	0.8492	0.8521	0.8593	0.8566	0.8577	0.8547
9000B 128MB	1000000 0.8507	0.8507	0.8523	0.8627	0.8593	0.8670	0.8570
9000B 256MB	1000000 0.8516	0.8515	0.8568	0.8546	0.8549	0.8609	0.8596
9000B 512MB	1000000 0.8517	0.8526	0.8552	0.8675	0.8547	0.8526	0.8621
64KB  64MB	1000000 0.7679	0.7689	0.7688	0.7716	0.7714	0.7722	0.7716
64KB  128MB	1000000 0.7683	0.7687	0.7710	0.7690	0.7717	0.7694	0.7703
64KB  256MB	1000000 0.7680	0.7703	0.7688	0.7689	0.7726	0.7717	0.7713
64KB  512MB	1000000 0.7692	0.7690	0.7701	0.7705	0.7698	0.7693	0.7735


So, the numbers are correct now that I returned my hardware to its previous
interrupt affinity state, but the trend seems to be the same (namely that there
isn't a clear one).  We seem to find peak performance around a readahead of 2
cachelines, but it's very small (about 3%), and it's inconsistent (larger set
sizes fall to either side of that stride).  So I don't see it as a clear win.  I
still think we should probably scrap the readahead for now, just take the perf
bits, and revisit this when we can use the vector instructions or the
independent carry-chain instructions to improve this more consistently.

Thoughts?
Neil





* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-12 17:12 ` Neil Horman
@ 2013-11-12 17:33   ` Joe Perches
  2013-11-12 19:50     ` Neil Horman
  0 siblings, 1 reply; 13+ messages in thread
From: Joe Perches @ 2013-11-12 17:33 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> On Mon, Nov 11, 2013 at 05:42:22PM -0800, Joe Perches wrote:
> > []
> 
> 
> There we go, that's better:
> len   set     iterations      Readahead cachelines vs cycles/byte
> 			1	2	3	4	5	10	20
> 1500B 64MB	1000000	1.3638	1.3288	1.3464	1.3505	1.3586	1.3527	1.3408
> 1500B 128MB	1000000	1.3394	1.3357	1.3625	1.3456	1.3536	1.3400	1.3410
> 1500B 256MB	1000000 1.3773	1.3362	1.3419	1.3548	1.3543	1.3442	1.4163
> 1500B 512MB	1000000 1.3442	1.3390	1.3434	1.3505	1.3767	1.3513	1.3820
> 9000B 64MB	1000000 0.8505	0.8492	0.8521	0.8593	0.8566	0.8577	0.8547
> 9000B 128MB	1000000 0.8507	0.8507	0.8523	0.8627	0.8593	0.8670	0.8570
> 9000B 256MB	1000000 0.8516	0.8515	0.8568	0.8546	0.8549	0.8609	0.8596
> 9000B 512MB	1000000 0.8517	0.8526	0.8552	0.8675	0.8547	0.8526	0.8621
> 64KB  64MB	1000000 0.7679	0.7689	0.7688	0.7716	0.7714	0.7722	0.7716
> 64KB  128MB	1000000 0.7683	0.7687	0.7710	0.7690	0.7717	0.7694	0.7703
> 64KB  256MB	1000000 0.7680	0.7703	0.7688	0.7689	0.7726	0.7717	0.7713
> 64KB  512MB	1000000 0.7692	0.7690	0.7701	0.7705	0.7698	0.7693	0.7735
> 
> 
> So, the numbers are correct now that I returned my hardware to its previous
> interrupt affinity state, but the trend seems to be the same (namely that there
> isn't a clear one).  We seem to find peak performance around a readahead of 2
> cachelines, but it's very small (about 3%), and it's inconsistent (larger set
> sizes fall to either side of that stride).  So I don't see it as a clear win.  I
> still think we should probably scrap the readahead for now, just take the perf
> bits, and revisit this when we can use the vector instructions or the
> independent carry-chain instructions to improve this more consistently.
> 
> Thoughts?

Perhaps a single prefetch, not of the first addr but of
the addr after PREFETCH_STRIDE would work best but only
if length is > PREFETCH_STRIDE.

I'd try:

	if (len > PREFETCH_STRIDE)
		prefetch(buf + PREFETCH_STRIDE);
	while (count64) {
		etc...
	}
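
A slightly fuller sketch of the shape I mean; sum_words() is a made-up
stand-in for the existing asm carry-chain loop, which stays unchanged, so
only the guarded prefetch is new:

	static unsigned int csum_with_one_prefetch(const void *buff,
						   unsigned int len)
	{
		/* One prefetch, one stride ahead; skipped entirely for
		 * buffers shorter than the stride. */
		if (len > PREFETCH_STRIDE)
			prefetch(buff + PREFETCH_STRIDE);

		return sum_words(buff, len);	/* existing summing loop */
	}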

I still don't know how much that impacts very short lengths.

Can you please add a 20 byte length to your tests?



* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-12 17:33   ` Joe Perches
@ 2013-11-12 19:50     ` Neil Horman
  2013-11-12 20:38       ` Joe Perches
  2013-11-13 10:09       ` David Laight
  0 siblings, 2 replies; 13+ messages in thread
From: Neil Horman @ 2013-11-12 19:50 UTC (permalink / raw)
  To: Joe Perches
  Cc: netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> > []
> > So, the numbers are correct now that I returned my hardware to its previous
> > interrupt affinity state, but the trend seems to be the same (namely that there
> > isn't a clear one).  We seem to find peak performance around a readahead of 2
> > cachelines, but it's very small (about 3%), and it's inconsistent (larger set
> > sizes fall to either side of that stride).  So I don't see it as a clear win.  I
> > still think we should probably scrap the readahead for now, just take the perf
> > bits, and revisit this when we can use the vector instructions or the
> > independent carry-chain instructions to improve this more consistently.
> > 
> > Thoughts?
> 
> Perhaps a single prefetch, not of the first addr but of
> the addr after PREFETCH_STRIDE would work best but only
> if length is > PREFETCH_STRIDE.
> 
> I'd try:
> 
> 	if (len > PREFETCH_STRIDE)
> 		prefetch(buf + PREFETCH_STRIDE);
> 	while (count64) {
> 		etc...
> 	}
> 
> I still don't know how much that impacts very short lengths.
> 
> Can you please add a 20 byte length to your tests?
> 
> 


Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
only if the overall length of the input buffer is more than 2 cache lines.
Below are the results (all counts are the average of 1000000 iterations of the
csum operation, as previous tests were, I just omitted that column).

len	set	cycles/byte	cycles/byte	improvement
		no prefetch	prefetch
===========================================================
20B	64MB	45.014989	44.402432	1.3%
20B	128MB	44.900317	46.146447	-2.7%
20B	256MB	45.303223	48.193623	-6.3%
20B	512MB	45.615301	44.486872	2.2%
1500B	64MB	1.364365	1.332285	1.9%
1500B	128MB	1.373945	1.335907	1.4%
1500B	256MB	1.356971	1.339084	1.2%
1500B	512MB	1.351091	1.341431	0.7%
9000B	64MB	0.850966	0.851077	-0.1%
9000B	128MB	0.851013	0.850366	0.1%
9000B	256MB	0.854212	0.851033	0.3%
9000B	512MB	0.857346	0.851744	0.7%
64KB	64MB	0.768224	0.768450	~0%
64KB	128MB	0.768706	0.768884	~0%
64KB	256MB	0.768459	0.768445	~0%
64KB	512MB	0.768808	0.769404	-0.1%

The 20 byte results seem to have a few outliers.  I'm guessing the improvement
came from good fortune, in that the random selection happened to hit on the same
range of numbers a few times over, so we hit already-cached data.  I would
expect them to run more slowly (as the 2nd and 3rd rows illustrate), since 20B is
less than the 128 bytes in 2 cachelines on my test system, and so all we're doing
is adding an additional comparison and jump per iteration.  Our sweet spot is
the 1500 byte range, giving us a small performance boost, but that quickly gets
lost in the noise as the buffer size grows beyond that.

I'm still left thinking we should just abandon the prefetch at this point and
keep the perf code until we have new instructions to help us with this further,
unless you see something I don't.

Neil



* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-12 19:50     ` Neil Horman
@ 2013-11-12 20:38       ` Joe Perches
  2013-11-12 20:59         ` Neil Horman
  2013-11-13 10:09       ` David Laight
  1 sibling, 1 reply; 13+ messages in thread
From: Joe Perches @ 2013-11-12 20:38 UTC (permalink / raw)
  To: Neil Horman
  Cc: netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
> On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> > On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
[]
> > > So, the numbers are correct now that I returned my hardware to its previous
> > > interrupt affinity state, but the trend seems to be the same (namely that there
> > > isn't a clear one).  We seem to find peak performance around a readahead of 2
> > > cachelines, but it's very small (about 3%), and it's inconsistent (larger set
> > > sizes fall to either side of that stride).  So I don't see it as a clear win.  I
> > > still think we should probably scrap the readahead for now, just take the perf
> > > bits, and revisit this when we can use the vector instructions or the
> > > independent carry-chain instructions to improve this more consistently.
> > > 
> > > Thoughts?
> > 
> > Perhaps a single prefetch, not of the first addr but of
> > the addr after PREFETCH_STRIDE would work best but only
> > if length is > PREFETCH_STRIDE.
> > 
> > I'd try:
> > 
> > 	if (len > PREFETCH_STRIDE)
> > 		prefetch(buf + PREFETCH_STRIDE);
> > 	while (count64) {
> > 		etc...
> > 	}
> > 
> > I still don't know how much that impacts very short lengths.
> > Can you please add a 20 byte length to your tests?
> Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
> only if the overall length of the input buffer is more than 2 cache lines.
> Below are the results (all counts are the average of 1000000 iterations of the
> csum operation, as previous tests were, I just omitted that column).
> 
> len	set	cycles/byte	cycles/byte	improvement
> 		no prefetch	prefetch
> ===========================================================
> 20B	64MB	45.014989	44.402432	1.3%
> 20B	128MB	44.900317	46.146447	-2.7%
> 20B	256MB	45.303223	48.193623	-6.3%
> 20B	512MB	45.615301	44.486872	2.2%
[]
> I'm still left thinking we should just abandon the prefetch at this point and
> keep the perf code until we have new instructions to help us with this further,
> > unless you see something I don't.

I tend to agree but perhaps the 3% performance
increase with a prefetch for longer lengths is
actually significant and desirable.

It doesn't seem you've done the test I suggested
where prefetch is done only for
"len > PREFETCH_STRIDE".

Is it ever useful to do a prefetch of the
address/data being accessed by the next
instruction?

Anyway, thanks for doing all the work.

Joe



* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-12 20:38       ` Joe Perches
@ 2013-11-12 20:59         ` Neil Horman
  0 siblings, 0 replies; 13+ messages in thread
From: Neil Horman @ 2013-11-12 20:59 UTC (permalink / raw)
  To: Joe Perches
  Cc: netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

On Tue, Nov 12, 2013 at 12:38:01PM -0800, Joe Perches wrote:
> On Tue, 2013-11-12 at 14:50 -0500, Neil Horman wrote:
> > On Tue, Nov 12, 2013 at 09:33:35AM -0800, Joe Perches wrote:
> > > On Tue, 2013-11-12 at 12:12 -0500, Neil Horman wrote:
> []
> > > > So, the numbers are correct now that I returned my hardware to its previous
> > > > interrupt affinity state, but the trend seems to be the same (namely that there
> > > > isn't a clear one).  We seem to find peak performance around a readahead of 2
> > > > cachelines, but it's very small (about 3%), and it's inconsistent (larger set
> > > > sizes fall to either side of that stride).  So I don't see it as a clear win.  I
> > > > still think we should probably scrap the readahead for now, just take the perf
> > > > bits, and revisit this when we can use the vector instructions or the
> > > > independent carry-chain instructions to improve this more consistently.
> > > > 
> > > > Thoughts?
> > > 
> > > Perhaps a single prefetch, not of the first addr but of
> > > the addr after PREFETCH_STRIDE would work best but only
> > > if length is > PREFETCH_STRIDE.
> > > 
> > > I'd try:
> > > 
> > > 	if (len > PREFETCH_STRIDE)
> > > 		prefetch(buf + PREFETCH_STRIDE);
> > > 	while (count64) {
> > > 		etc...
> > > 	}
> > > 
> > > I still don't know how much that impacts very short lengths.
> > > Can you please add a 20 byte length to your tests?
> > Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
> > only if the overall length of the input buffer is more than 2 cache lines.
> > Below are the results (all counts are the average of 1000000 iterations of the
> > csum operation, as previous tests were, I just omitted that column).
> > 
> > len	set	cycles/byte	cycles/byte	improvement
> > 		no prefetch	prefetch
> > ===========================================================
> > 20B	64MB	45.014989	44.402432	1.3%
> > 20B	128MB	44.900317	46.146447	-2.7%
> > 20B	256MB	45.303223	48.193623	-6.3%
> > 20B	512MB	45.615301	44.486872	2.2%
> []
> > I'm still left thinking we should just abandon the prefetch at this point and
> > keep the perf code until we have new instructions to help us with this further,
> > unless you see something I don't.
> 
> I tend to agree but perhaps the 3% performance
> increase with a prefetch for longer lengths is
> actually significant and desirable.
> 
Maybe, but I worry that it's not going to be consistent.  At least not with the
cost of the extra comparison and jump.

> It doesn't seem you've done the test I suggested
> where prefetch is done only for
> "len > PREFETCH_STRIDE".
> 
No, that's exactly what I did, I did this:

#define PREFETCH_STRIDE (cache_line_size() * 2)

...

if (len > PREFETCH_STRIDE)
	prefetch(buf + PREFETCH_STRIDE);

while (count64) {
	...

> Is it ever useful to do a prefetch of the
> address/data being accessed by the next
> instruction?
> 
Doubtful; you need to prefetch the data far enough in advance that it's loaded by
the time you need to reference it.  Otherwise you wind up stalling the data
pipeline while the load completes.  So unless you have really fast memory, the
prefetch is effectively a no-op for the next access.  But the next cacheline
ahead is good, as it prevents the stall there.  Any more than that, though (from
this testing), seems to again be a no-op, as modern hardware automatically issues
the prefetch once it notices the linear data access pattern.
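
(As a back-of-envelope illustration, with round numbers that are mine and
not measurements from this thread: at ~0.85 cycles/byte the loop reaches
the next 64-byte line after only ~54 cycles, while an uncached memory load
costs on the order of 100-300 cycles, so a prefetch of the data the very
next instruction touches can't hide anything; the line one stride ahead is
the first one worth asking for.)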

> Anyway, thanks for doing all the work.
> 
No worries, glad to do it.  Thanks for the review.

Ingo, what do you think: shall I submit the perf bits as a separate thread, or
do you not want those any more?

Regards
Neil

> Joe
> 
> 


* RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-12 19:50     ` Neil Horman
  2013-11-12 20:38       ` Joe Perches
@ 2013-11-13 10:09       ` David Laight
  2013-11-13 12:30         ` Neil Horman
  1 sibling, 1 reply; 13+ messages in thread
From: David Laight @ 2013-11-13 10:09 UTC (permalink / raw)
  To: Neil Horman, Joe Perches
  Cc: netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

> Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
> only if the overall length of the input buffer is more than 2 cache lines.
> Below are the results (all counts are the average of 1000000 iterations of the
> csum operation, as previous tests were, I just omitted that column).

Hmmm.... averaging over 1000000 iterations means that all the code
is in the i-cache and the branch predictor will be correctly primed.

For short checksum requests I'd guess that the relevant data
has just been written and is already in the cpu cache (unless
there has been a process and cpu switch).
So prefetch is likely to be unnecessary.

If you assume that the checksum code isn't in the i-cache then
small requests are likely to be dominated by the code size.

	David





* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-13 10:09       ` David Laight
@ 2013-11-13 12:30         ` Neil Horman
  2013-11-13 13:08           ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Neil Horman @ 2013-11-13 12:30 UTC (permalink / raw)
  To: David Laight
  Cc: Joe Perches, netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet

On Wed, Nov 13, 2013 at 10:09:51AM -0000, David Laight wrote:
> > Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
> > only if the overall length of the input buffer is more than 2 cache lines.
> > Below are the results (all counts are the average of 1000000 iterations of the
> > csum operation, as previous tests were, I just omitted that column).
> 
> Hmmm.... averaging over 1000000 iterations means that all the code
> is in the i-cache and the branch predictor will be correctly primed.
> 
> For short checksum requests I'd guess that the relevant data
> has just been written and is already in the cpu cache (unless
> there has been a process and cpu switch).
> So prefetch is likely to be unnecessary.
> 
> If you assume that the checksum code isn't in the i-cache then
> small requests are likely to be dominated by the code size.
> 
I'm not sure; what's the typical capacity of a branch predictor's ability to
remember code paths?  I ask because the most likely use of do_csum will be in
the receive path of the networking stack (specifically in the softirq handler).
So if we run do_csum once, we're likely to run it many more times, as we clean
out an adapter's receive queue.

Neil

> 	David
> 
> 
> 
> 


* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-13 12:30         ` Neil Horman
@ 2013-11-13 13:08           ` Ingo Molnar
  2013-11-13 13:32             ` David Laight
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2013-11-13 13:08 UTC (permalink / raw)
  To: Neil Horman
  Cc: David Laight, Joe Perches, netdev, Dave Jones, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86, Eric Dumazet, Peter Zijlstra


* Neil Horman <nhorman@tuxdriver.com> wrote:

> On Wed, Nov 13, 2013 at 10:09:51AM -0000, David Laight wrote:
> > > Sure, I modified the code so that we only prefetched 2 cache lines ahead, but
> > > only if the overall length of the input buffer is more than 2 cache lines.
> > > Below are the results (all counts are the average of 1000000 iterations of the
> > > csum operation, as previous tests were, I just omitted that column).
> > 
> > Hmmm.... averaging over 1000000 iterations means that all the code
> > is in the i-cache and the branch predictor will be correctly primed.
> > 
> > For short checksum requests I'd guess that the relevant data
> > has just been written and is already in the cpu cache (unless
> > there has been a process and cpu switch).
> > So prefetch is likely to be unnecessary.
> > 
> > If you assume that the checksum code isn't in the i-cache then
> > small requests are likely to be dominated by the code size.
> 
> I'm not sure; what's the typical capacity of a branch predictor's 
> ability to remember code paths?  I ask because the most likely use of 
> do_csum will be in the receive path of the networking stack 
> (specifically in the softirq handler). So if we run do_csum once, we're 
> likely to run it many more times, as we clean out an adapter's receive 
> queue.

For such simple single-target branches it goes near or over a thousand for 
recent Intel and AMD microarchitectures. Thousands for really recent CPUs.

Note that branch prediction caches are hierarchical and are typically 
attached to cache hierarchies (where the uops are fetched from), so the 
first-level BTB is typically shared between SMT CPUs that share an icache, 
while the L2 BTB (which is larger and more associative) is shared by all 
cores in a package.

So it's possible for some other task on another (sibling) CPU to keep 
pressure on your BTB, but I'd say it's relatively rare; it's hard to do it 
at a really high rate that blows away all the cache all the time. (PeterZ 
has written some artificial pseudorandom branching monster just to be able 
to generate cache misses and validate perf's branch stats - but even if you 
deliberately want to, it's pretty hard to beat that cache.)

I'd definitely not worry about the prediction accuracy of repetitive loops 
like the csum routines; they'll be cached well.

Thanks,

	Ingo


* RE: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-13 13:08           ` Ingo Molnar
@ 2013-11-13 13:32             ` David Laight
  2013-11-13 13:53               ` Ingo Molnar
  2013-11-13 16:01               ` Neil Horman
  0 siblings, 2 replies; 13+ messages in thread
From: David Laight @ 2013-11-13 13:32 UTC (permalink / raw)
  To: Ingo Molnar, Neil Horman
  Cc: Joe Perches, netdev, Dave Jones, linux-kernel, sebastien.dugue,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, Eric Dumazet,
	Peter Zijlstra

> > I'm not sure; what's the typical capacity of a branch predictor's
> > ability to remember code paths?
...
> 
> For such simple single-target branches it goes near or over a thousand for
> recent Intel and AMD microarchitectures. Thousands for really recent CPUs.

IIRC the x86 can also correctly predict simple sequences - like a branch
in a loop that is taken every other iteration, or only after a previous
branch is taken.

Much simpler CPUs may use a much simpler strategy.
I think one I've used (an FPGA soft-core CPU) just uses the low
bits of the instruction address to index a single-bit table.
This means that branches alias each other.
To get the consistent cycle counts needed to minimise the worst-case
code path, we had to disable the dynamic prediction.

For the checksum code the loop branch isn't a problem.
Tests on entry to the function might get mispredicted.

So if you have a conditional prefetch when the buffer is long,
then time a short buffer after 100 long ones, you'll almost
certainly see the misprediction penalty.

FWIW I remember speeding up a copy (I think) loop on a StrongARM by
adding an extra instruction to fetch a word from later in the buffer
into a register I never otherwise used.
(That was an unpaged system, so I knew it couldn't fault.)
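
Something like the following C sketch illustrates the trick; the names and
shape are illustrative, not the original code:

	/* A dummy read a few words ahead pulls the next line into cache
	 * on a core with no prefetch instruction; `sink` is volatile so
	 * the otherwise-unused load is not optimised away. */
	static void copy_words(unsigned long *dst, const unsigned long *src,
			       unsigned int n)
	{
		volatile unsigned long sink;
		unsigned int i;

		for (i = 0; i < n; i++) {
			if (i + 8 < n)
				sink = src[i + 8];
			dst[i] = src[i];
		}
	}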

	David






* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-13 13:32             ` David Laight
@ 2013-11-13 13:53               ` Ingo Molnar
  2013-11-13 16:01               ` Neil Horman
  1 sibling, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2013-11-13 13:53 UTC (permalink / raw)
  To: David Laight
  Cc: Neil Horman, Joe Perches, netdev, Dave Jones, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86, Eric Dumazet, Peter Zijlstra


* David Laight <David.Laight@ACULAB.COM> wrote:

> > > I'm not sure; what's the typical capacity of a branch predictor's 
> > > ability to remember code paths?
> ...
> > 
> > For such simple single-target branches it goes near or over a thousand 
> > for recent Intel and AMD microarchitectures. Thousands for really 
> > recent CPUs.
> 
> IIRC the x86 can also correctly predict simple sequences - like a branch 
> in a loop that is taken every other iteration, or only after a previous 
> branch is taken.

They tend to be rather capable but not very well documented :)  With a 
large out-of-order execution design and 20+ pipeline stages, x86 branch 
prediction accuracy is perhaps the most important design aspect for good 
CPU performance.

> Much simpler cpus may use a much simpler strategy.

Yeah. The patches in this thread are about the x86 assembly implementation 
of the csum routines, and for 'typical' x86 CPUs the branch prediction 
units and caches are certainly sophisticated enough.

Also note that here, for real use cases, the csum routines are (or should 
be) memory bandwidth limited, missing the data cache most of the time, 
with a partially idling pipeline, while branch prediction accuracy matters 
most when the pipeline is well fed and there are a lot of instructions in 
flight.

Thanks,

	Ingo


* Re: [Fwd: Re: [PATCH v2 2/2] x86: add prefetching to do_csum]
  2013-11-13 13:32             ` David Laight
  2013-11-13 13:53               ` Ingo Molnar
@ 2013-11-13 16:01               ` Neil Horman
  1 sibling, 0 replies; 13+ messages in thread
From: Neil Horman @ 2013-11-13 16:01 UTC (permalink / raw)
  To: David Laight
  Cc: Ingo Molnar, Joe Perches, netdev, Dave Jones, linux-kernel,
	sebastien.dugue, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86, Eric Dumazet, Peter Zijlstra

On Wed, Nov 13, 2013 at 01:32:50PM -0000, David Laight wrote:
> > > I'm not sure; what's the typical capacity of a branch predictor's
> > > ability to remember code paths?
> ...
> > 
> > For such simple single-target branches it goes near or over a thousand for
> > recent Intel and AMD microarchitectures. Thousands for really recent CPUs.
> 
> IIRC the x86 can also correctly predict simple sequences - like a branch
> in a loop that is taken every other iteration, or only after a previous
> branch is taken.
> 
> Much simpler CPUs may use a much simpler strategy.
> I think one I've used (an FPGA soft-core CPU) just uses the low
> bits of the instruction address to index a single-bit table.
> This means that branches alias each other.
> To get the consistent cycle counts needed to minimise the worst-case
> code path, we had to disable the dynamic prediction.
> 
> For the checksum code the loop branch isn't a problem.
> Tests on entry to the function might get mispredicted.
> 
> So if you have a conditional prefetch when the buffer is long,
> then time a short buffer after 100 long ones, you'll almost
> certainly see the misprediction penalty.
> 
> FWIW I remember speeding up a copy (I think) loop on a StrongARM by
> adding an extra instruction to fetch a word from later in the buffer
> into a register I never otherwise used.
> (That was an unpaged system, so I knew it couldn't fault.)
> 
Fair enough, but the code we're looking at here is arch-specific.  If
StrongARMs benefit from different coding patterns, we can handle that in that
arch.  This x86 implementation can still avoid worrying about branch
prediction, since its hardware handles it well.
Neil

> 	David
> 
> 
> 
> 
> 

