linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: [patch] e1000 TSO parameter
@ 2003-07-15  5:11 Feldman, Scott
  0 siblings, 0 replies; 13+ messages in thread
From: Feldman, Scott @ 2003-07-15  5:11 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, linux-kernel, netdev

> Extend ethtool please :-)

Even better.  This should be no problem.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-15  5:38     ` David S. Miller
  2003-07-15 23:01       ` David Mosberger
@ 2003-07-29  6:53       ` Anton Blanchard
  1 sibling, 0 replies; 13+ messages in thread
From: Anton Blanchard @ 2003-07-29  6:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, scott.feldman, linux-kernel, netdev


Hi,

> > So we get almost 15% of throughput drop.  This was with plain "netkit
> > fptd".  AFAIK, it does a simple read/write loop (not sendfile()).

We've been seeing rather variable results for TSO as well. With TSO off
netperf TCP_STREAM will hit line speed and stay there. With TSO on
some runs will hit line speed and others will be about 100Mbit/sec
slower.

> When we use TSO for non-sendfile() applications it really
> stresses memory allocations.  We do these 64K+ kmalloc()'s
> for each packet we construct.

Yep we definitely noticed much more higher allocations when watching
/proc/slab. Playing around with slab tuning didnt seem to help.

Anton

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-16  1:39         ` David S. Miller
@ 2003-07-16  6:32           ` David Mosberger
  2003-07-16  6:30             ` David S. Miller
  0 siblings, 1 reply; 13+ messages in thread
From: David Mosberger @ 2003-07-16  6:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, scott.feldman, linux-kernel, netdev

>>>>> On Tue, 15 Jul 2003 18:39:11 -0700, "David S. Miller" <davem@redhat.com> said:

  >>  We could, but would it always be a win?  Especially for
  >> copy_from_user().  Most of the time, that data remains cached, so
  >> I don't think we'd want to use non-temporal stores on those (in
  >> general).  csum_and_copy_from_user() isn't well optimized yet.
  >> Let's see if I can find a volunteer... ;-)

  DaveM> No, I mean "bypass L2 cache on miss" for stores.  Don't tell
  DaveM> me IA64 doesn't have that? 8) I certainly didn't mean "always
  DaveM> bypass L2 cache" for stores :-)

What I'm saying is that I almost always want copy_user() to put the
destination data in the cache, even if it isn't cached yet.  Many
copy_user() calls are for for data structures that easily fit in the
cache and the data is usually used quickly afterwards.

As for cache-hints supported by IA64: the architecture supports
various non-temporal hints (non-temporal in 1st, 2nd, or all
cache-levels).  How these hints are implemented depends on the chip.
On McKinley, non-temporal hints are generally implemented by storing
the data in the cache without updating the LRU info.  So if the data
is already there, it will stay cached (until a victim is needed).

	--david

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-16  6:32           ` David Mosberger
@ 2003-07-16  6:30             ` David S. Miller
  0 siblings, 0 replies; 13+ messages in thread
From: David S. Miller @ 2003-07-16  6:30 UTC (permalink / raw)
  To: davidm; +Cc: davidm, scott.feldman, linux-kernel, netdev

On Tue, 15 Jul 2003 23:32:48 -0700
David Mosberger <davidm@napali.hpl.hp.com> wrote:

>   DaveM> No, I mean "bypass L2 cache on miss" for stores.  Don't tell
>   DaveM> me IA64 doesn't have that? 8) I certainly didn't mean "always
>   DaveM> bypass L2 cache" for stores :-)
> 
> What I'm saying is that I almost always want copy_user() to put the
> destination data in the cache, even if it isn't cached yet.

No you don't :-)

If you miss, you do a bypass to main memory.  Then when the
app asks for the data (if it even does at all, consider that)
it get's a clean copy in it's L2 cache.

Overall it's more efficient this way.

> Many copy_user() calls are for for data structures that
> easily fit in the cache and the data is usually used quickly afterwards.

Absolutely correct.  We can't use the cache bypass-on-miss stores on
sparc64 unless the copy is at least a couple of cachelines in size.

It all works out, don't worry :-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-15 23:01       ` David Mosberger
@ 2003-07-16  1:39         ` David S. Miller
  2003-07-16  6:32           ` David Mosberger
  0 siblings, 1 reply; 13+ messages in thread
From: David S. Miller @ 2003-07-16  1:39 UTC (permalink / raw)
  To: davidm; +Cc: davidm, scott.feldman, linux-kernel, netdev

On Tue, 15 Jul 2003 16:01:55 -0700
David Mosberger <davidm@napali.hpl.hp.com> wrote:

> >>>>> On Mon, 14 Jul 2003 22:38:22 -0700, "David S. Miller" <davem@redhat.com> said:
> 
>   DaveM> But I don't think that's what is happening here, rather the
>   DaveM> PCI controller is "talking" to the CPU's L2 cache with
>   DaveM> coherency transactions on all the data of every packet going
>   DaveM> to the chip.
> 
> That's true.  But shouldn't it be true for both the TSO and non-TSO
> case?

The transfers are each longer in the TSO case, so need more
to transfer more data from the bus just to get _one_ of
the sub-packets of the large TSO frame out.  It thus makes it
more likely they'll be a delay.

>   DaveM> I know how this can be fixed, can you use L2-bypassing stores
>   DaveM> in your csum_and_copy_from_user() and copy_from_user()
>   DaveM> implementations like we do on sparc64?  That would exactly
>   DaveM> eliminate this situation where the card is talking to the
>   DaveM> cpu's L2 cache for all the data during the PCI DMA transation
>   DaveM> on the send side.
> 
> We could, but would it always be a win?  Especially for
> copy_from_user().  Most of the time, that data remains cached, so I
> don't think we'd want to use non-temporal stores on those (in
> general).  csum_and_copy_from_user() isn't well optimized yet.  Let's
> see if I can find a volunteer... ;-)

No, I mean "bypass L2 cache on miss" for stores.  Don't
tell me IA64 doesn't have that? 8) I certainly didn't mean
"always bypass L2 cache" for stores :-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [patch] e1000 TSO parameter
  2003-07-16  0:27 Feldman, Scott
@ 2003-07-16  0:41 ` David Mosberger
  0 siblings, 0 replies; 13+ messages in thread
From: David Mosberger @ 2003-07-16  0:41 UTC (permalink / raw)
  To: Feldman, Scott; +Cc: davidm, linux-kernel, netdev

>>>>> On Tue, 15 Jul 2003 17:27:56 -0700, "Feldman, Scott" <scott.feldman@intel.com> said:

  >> TSO disabled:

  >> $ modprobe InterruptThrottleRate=0,0,0,0 TSO=0,0,0,0

  Scott> If you're trying to remove all interrupt moderation, you'll
  Scott> also want to add these:

  Scott> RxIntDelay=0,0,0,0 RxAbsIntDelay=0,0,0,0 TxIntDelay=0,0,0,0
  Scott> TxAbsIntDelay=0,0,0,0

  Scott> See the app note here for more info:
  Scott> http://www.intel.com/design/network/applnots/8254x_ap450.htm

I wasn't aware of that note.  Thanks for the pointer!

	--david

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [patch] e1000 TSO parameter
@ 2003-07-16  0:27 Feldman, Scott
  2003-07-16  0:41 ` David Mosberger
  0 siblings, 1 reply; 13+ messages in thread
From: Feldman, Scott @ 2003-07-16  0:27 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel, netdev


> TSO disabled:
> 
>  $ modprobe InterruptThrottleRate=0,0,0,0 TSO=0,0,0,0

If you're trying to remove all interrupt moderation, you'll also want to
add these:

RxIntDelay=0,0,0,0 RxAbsIntDelay=0,0,0,0 TxIntDelay=0,0,0,0
TxAbsIntDelay=0,0,0,0

See the app note here for more info:
http://www.intel.com/design/network/applnots/8254x_ap450.htm

-scott


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-15  5:38     ` David S. Miller
@ 2003-07-15 23:01       ` David Mosberger
  2003-07-16  1:39         ` David S. Miller
  2003-07-29  6:53       ` Anton Blanchard
  1 sibling, 1 reply; 13+ messages in thread
From: David Mosberger @ 2003-07-15 23:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, scott.feldman, linux-kernel, netdev

>>>>> On Mon, 14 Jul 2003 22:38:22 -0700, "David S. Miller" <davem@redhat.com> said:

  DaveM> But I don't think that's what is happening here, rather the
  DaveM> PCI controller is "talking" to the CPU's L2 cache with
  DaveM> coherency transactions on all the data of every packet going
  DaveM> to the chip.

That's true.  But shouldn't it be true for both the TSO and non-TSO
case?

  DaveM> Whereas with a sendfile() type setup, the PCI controller is
  DaveM> going straight to main memory for the data part of the
  DaveM> packets since the CPU is unlikely to have each page cache
  DaveM> page in it's L2 caches.

But sendfile() was _not_ used in any of the tests.  The ftp server
installed no the machine doesn't use it (not to my knowledge, at
least) and netperf only uses it for the SENDFILE test.

  DaveM> I know how this can be fixed, can you use L2-bypassing stores
  DaveM> in your csum_and_copy_from_user() and copy_from_user()
  DaveM> implementations like we do on sparc64?  That would exactly
  DaveM> eliminate this situation where the card is talking to the
  DaveM> cpu's L2 cache for all the data during the PCI DMA transation
  DaveM> on the send side.

We could, but would it always be a win?  Especially for
copy_from_user().  Most of the time, that data remains cached, so I
don't think we'd want to use non-temporal stores on those (in
general).  csum_and_copy_from_user() isn't well optimized yet.  Let's
see if I can find a volunteer... ;-)

	--david

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-15  5:31   ` David Mosberger
@ 2003-07-15  5:38     ` David S. Miller
  2003-07-15 23:01       ` David Mosberger
  2003-07-29  6:53       ` Anton Blanchard
  0 siblings, 2 replies; 13+ messages in thread
From: David S. Miller @ 2003-07-15  5:38 UTC (permalink / raw)
  To: davidm; +Cc: davidm, scott.feldman, linux-kernel, netdev

On Mon, 14 Jul 2003 22:31:00 -0700
David Mosberger <davidm@napali.hpl.hp.com> wrote:

> With TSO enabled:
> 
>  ftp> get big.iso /dev/null
>  local: /dev/null remote: big.iso
>  200 PORT command successful.
>  150 Opening BINARY mode data connection for 'big.iso' (2038628352 bytes).
>  226 Transfer complete.
>  2038628352 bytes received in 21.16 secs (94070.2 kB/s)
> 
>  ftp server CPU utilization: ~ 15%
> 
> So we get almost 15% of throughput drop.  This was with plain "netkit
> fptd".  AFAIK, it does a simple read/write loop (not sendfile()).

When we use TSO for non-sendfile() applications it really
stresses memory allocations.  We do these 64K+ kmalloc()'s
for each packet we construct.

But I don't think that's what is happening here, rather the PCI
controller is "talking" to the CPU's L2 cache with coherency
transactions on all the data of every packet going to the chip.

Whereas with a sendfile() type setup, the PCI controller is going
straight to main memory for the data part of the packets since the
CPU is unlikely to have each page cache page in it's L2 caches.  In
the sendmsg() case, it's virtually guarenteed that the cpu will have
all the packet data in it's L2 cache in an unshared-modified state.

I know how this can be fixed, can you use L2-bypassing stores in
your csum_and_copy_from_user() and copy_from_user() implementations
like we do on sparc64?  That would exactly eliminate this situation
where the card is talking to the cpu's L2 cache for all the data
during the PCI DMA transation on the send side.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-15  4:45 ` David S. Miller
  2003-07-15  4:57   ` David Mosberger
@ 2003-07-15  5:31   ` David Mosberger
  2003-07-15  5:38     ` David S. Miller
  1 sibling, 1 reply; 13+ messages in thread
From: David Mosberger @ 2003-07-15  5:31 UTC (permalink / raw)
  To: Feldman, Scott; +Cc: davidm, linux-kernel, netdev

[Scott, somehow I never received your original response so I'm
 replying based on what I saw in the linux.kernel newgroup...]

 Scott> Do you have any data to share?

Sure, I don't see why not.  Here are the number I got:

TSO disabled:

 $ modprobe InterruptThrottleRate=0,0,0,0 TSO=0,0,0,0
 $ netperf -l 30 -c -C -H foobar -- -s64K -S64K
 TCP STREAM TEST to foobar
 Recv   Send    Send                          Utilization       Service Demand
 Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
 Size   Size    Size     Time     Throughput  local    remote   local   remote
 bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 131070 131072 131072    30.00       897.16   34.07    35.00    3.111   3.196

TSO enabled:

 $ modprobe InterruptThrottleRate=0,0,0,0 TSO=1,1,1,1
 $ netperf -l 30 -c -C -H foobar -- -s64K -S64K
 TCP STREAM TEST to foobar
 Recv   Send    Send                          Utilization       Service Demand
 Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
 Size   Size    Size     Time     Throughput  local    remote   local   remote
 bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 131070 131072 131072    30.00       894.09   11.65    34.48    1.068   3.159

This looks roughly like you'd expect: with TSO, slightly lower
throughput but much less CPU overhead.

With ftp, things get stranger: fetching a 2GByte file via ftp get
(from the remote end):

TSO disabled:

 ftp> get big.iso /dev/null
 local: /dev/null remote: big.iso
 200 PORT command successful.
 150 Opening BINARY mode data connection for 'big.iso' (2038628352 bytes).
 226 Transfer complete.
 2038628352 bytes received in 18.17 secs (109554.5 kB/s)

 ftp server CPU utilization: ~ 40%

With TSO enabled:

 ftp> get big.iso /dev/null
 local: /dev/null remote: big.iso
 200 PORT command successful.
 150 Opening BINARY mode data connection for 'big.iso' (2038628352 bytes).
 226 Transfer complete.
 2038628352 bytes received in 21.16 secs (94070.2 kB/s)

 ftp server CPU utilization: ~ 15%

So we get almost 15% of throughput drop.  This was with plain "netkit
fptd".  AFAIK, it does a simple read/write loop (not sendfile()).

	--david

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-15  4:45 ` David S. Miller
@ 2003-07-15  4:57   ` David Mosberger
  2003-07-15  5:31   ` David Mosberger
  1 sibling, 0 replies; 13+ messages in thread
From: David Mosberger @ 2003-07-15  4:57 UTC (permalink / raw)
  To: David S. Miller; +Cc: Feldman, Scott, davidm, linux-kernel, netdev

>>>>> On Mon, 14 Jul 2003 21:45:10 -0700, "David S. Miller" <davem@redhat.com> said:

  DaveM> On Mon, 14 Jul 2003 21:42:40 -0700 "Feldman, Scott"
  DaveM> <scott.feldman@intel.com> wrote:

  >> > Note that I had to move the e1000_check_options() call to a >
  >> slighly earlier place.  You may want to double-check that > it's
  >> really OK.

  >> I'm not too keen on adding another module parameter.  Maybe a
  >> CONFIG_E1000_TSO option?

  DaveM> Extend ethtool please :-)

ethtool would be ideal, agreed.

I absolutely think that this should be a runtime option, not a
compile-time option.

	--david

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch] e1000 TSO parameter
  2003-07-15  4:42 Feldman, Scott
@ 2003-07-15  4:45 ` David S. Miller
  2003-07-15  4:57   ` David Mosberger
  2003-07-15  5:31   ` David Mosberger
  0 siblings, 2 replies; 13+ messages in thread
From: David S. Miller @ 2003-07-15  4:45 UTC (permalink / raw)
  To: Feldman, Scott; +Cc: davidm, linux-kernel, netdev

On Mon, 14 Jul 2003 21:42:40 -0700
"Feldman, Scott" <scott.feldman@intel.com> wrote:

> > Note that I had to move the e1000_check_options() call to a 
> > slighly earlier place.  You may want to double-check that 
> > it's really OK.
> 
> I'm not too keen on adding another module parameter.  Maybe a
> CONFIG_E1000_TSO option?

Extend ethtool please :-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [patch] e1000 TSO parameter
@ 2003-07-15  4:42 Feldman, Scott
  2003-07-15  4:45 ` David S. Miller
  0 siblings, 1 reply; 13+ messages in thread
From: Feldman, Scott @ 2003-07-15  4:42 UTC (permalink / raw)
  To: davidm; +Cc: linux-kernel, netdev

[Moving over to netdev]

> Hi Scott,
> 
> Would you mind applying the attached patch for the e1000 
> driver?  The patch adds a "TSO" option which lets you disable 
> TSO at boot-time/module-insertion time.  This is useful for 
> experimentation and, on fast machines, you can actually get 
> better netperf numbers with TSO disabled. ;-)

This is interesting.  I assume you're trying to get the best throughput
regardless of CPU utilization.  Why are we getting lower throughput
rates?  It's an open question for netdev.  Do you have any data to
share?
 
> Note that I had to move the e1000_check_options() call to a 
> slighly earlier place.  You may want to double-check that 
> it's really OK.

I'm not too keen on adding another module parameter.  Maybe a
CONFIG_E1000_TSO option?

> The patch is relative to 2.5.75.
> 
> Thanks,
> 
> 	--david
> 
> # This is a BitKeeper generated patch for the following 
> project: # Project Name: Linux kernel tree # This patch 
> format is intended for GNU patch command version 2.5 or 
> higher. # This patch includes the following deltas:
> #	           ChangeSet	1.1379  -> 1.1380 
> #	drivers/net/e1000/e1000.h	1.33    -> 1.34   
> #	drivers/net/e1000/e1000_main.c	1.77    -> 1.78   
> #	drivers/net/e1000/e1000_param.c	1.21    -> 1.22   
> #
> # The following is the BitKeeper ChangeSet Log
> # --------------------------------------------
> # 03/07/14	davidm@tiger.hpl.hp.com	1.1380
> # Make it possible to disable TSO at module-load time (or 
> boot-time). # This is controlled via the TSO parameter (e.g., 
> modprobe e1000 TSO=0,0,0,0 # will disable TSO on the first 4 
> e1000 NICs). # --------------------------------------------
> #
> diff -Nru a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
> --- a/drivers/net/e1000/e1000.h	Mon Jul 14 17:29:30 2003
> +++ b/drivers/net/e1000/e1000.h	Mon Jul 14 17:29:30 2003
> @@ -194,6 +194,7 @@
>  	uint32_t tx_head_addr;
>  	uint32_t tx_fifo_size;
>  	atomic_t tx_fifo_stall;
> +	boolean_t tso;
>  
>  	/* RX */
>  	struct e1000_desc_ring rx_ring;
> diff -Nru a/drivers/net/e1000/e1000_main.c 
> b/drivers/net/e1000/e1000_main.c
> --- a/drivers/net/e1000/e1000_main.c	Mon Jul 14 17:29:30 2003
> +++ b/drivers/net/e1000/e1000_main.c	Mon Jul 14 17:29:30 2003
> @@ -417,9 +417,12 @@
>  		netdev->features = NETIF_F_SG;
>  	}
>  
> +	e1000_check_options(adapter);
> +
>  #ifdef NETIF_F_TSO
>  	if((adapter->hw.mac_type >= e1000_82544) &&
> -	   (adapter->hw.mac_type != e1000_82547))
> +	   (adapter->hw.mac_type != e1000_82547) &&
> +	   adapter->tso)
>  		netdev->features |= NETIF_F_TSO;
>  #endif
>  
> @@ -469,7 +472,6 @@
>  
>  	printk(KERN_INFO "%s: Intel(R) PRO/1000 Network Connection\n",
>  	       netdev->name);
> -	e1000_check_options(adapter);
>  
>  	/* Initial Wake on LAN setting
>  	 * If APM wake is enabled in the EEPROM,
> diff -Nru a/drivers/net/e1000/e1000_param.c 
> b/drivers/net/e1000/e1000_param.c
> --- a/drivers/net/e1000/e1000_param.c	Mon Jul 14 17:29:30 2003
> +++ b/drivers/net/e1000/e1000_param.c	Mon Jul 14 17:29:30 2003
> @@ -192,6 +192,16 @@
>  
>  E1000_PARAM(InterruptThrottleRate, "Interrupt Throttling Rate");
>  
> +/* Enable TSO - TCP Segmentation Offload Enable/Disable
> + *
> + * Valid Range: 0, 1
> + *  - 0 - disables TSO
> + *  - 1 - enables TSO on NICs that are TSO capable
> + *
> + * Default Value: 1
> + */
> +E1000_PARAM(TSO, "Disable or enable TSO");
> +
>  #define AUTONEG_ADV_DEFAULT  0x2F
>  #define AUTONEG_ADV_MASK     0x2F
>  #define FLOW_CONTROL_DEFAULT FLOW_CONTROL_FULL
> @@ -454,6 +464,18 @@
>  		} else {
>  			e1000_validate_option(&adapter->itr, &opt);
>  		}
> +	}
> +	{ /* TSO Enable/Disable */
> +		struct e1000_option opt = {
> +			.type = enable_option,
> +			.name = "TSO",
> +			.err  = "defaulting to Enabled",
> +			.def  = OPTION_ENABLED
> +		};
> +
> +		int tso = TSO[bd];
> +		e1000_validate_option(&tso, &opt);
> +		adapter->tso = tso;
>  	}
>  
>  	switch(adapter->hw.media_type) {
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2003-07-29  6:56 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-07-15  5:11 [patch] e1000 TSO parameter Feldman, Scott
  -- strict thread matches above, loose matches on Subject: below --
2003-07-16  0:27 Feldman, Scott
2003-07-16  0:41 ` David Mosberger
2003-07-15  4:42 Feldman, Scott
2003-07-15  4:45 ` David S. Miller
2003-07-15  4:57   ` David Mosberger
2003-07-15  5:31   ` David Mosberger
2003-07-15  5:38     ` David S. Miller
2003-07-15 23:01       ` David Mosberger
2003-07-16  1:39         ` David S. Miller
2003-07-16  6:32           ` David Mosberger
2003-07-16  6:30             ` David S. Miller
2003-07-29  6:53       ` Anton Blanchard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).