* Early SPECWeb99 results on 2.5.33 with TSO on e1000
@ 2002-09-05 18:30 Troy Wilson
  2002-09-05 20:59 ` jamal
  0 siblings, 1 reply; 102+ messages in thread
From: Troy Wilson @ 2002-09-05 18:30 UTC (permalink / raw)
  To: linux-kernel, netdev

 
  I've got some early SPECWeb [*] results with 2.5.33 and TSO on e1000.  I
get 2906 simultaneous connections, 99.2% conforming (i.e. faster than the
320 kbps cutoff), at 0% idle with TSO on.  For comparison, with 2.5.25, I 
got 2656, and with 2.5.29 I got 2662 (both 99+% conformance and 0% idle), so
TSO and 2.5.33 look like a Big Win.
 
  I'm having trouble testing with TSO off (I changed the #define NETIF_F_TSO
to "0" in include/linux/netdevice.h to turn it off).  I am getting errors.

     NETDEV WATCHDOG: eth1: transmit timed out
     e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex

  Those adapter resets pushed my SPECWeb results with TSO off to below
2500 connections (it is only that one adapter, BTW), so those results
shouldn't be considered valid.
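
  For reference, the hack works because NETIF_F_TSO is just one bit in
the device feature mask; defining it as 0 makes every feature test fail.
A rough sketch of the mechanism (illustrative flag value and struct, not
the literal 2.5.33 header):

        /* Sketch only -- illustrative bit value and struct, not kernel source. */
        #define NETIF_F_SG   0x0001   /* scatter/gather I/O                     */
        #define NETIF_F_TSO  0x0800   /* TCP segmentation offload (example bit) */

        struct netdev_like { unsigned long features; };

        static int stack_may_use_tso(const struct netdev_like *dev)
        {
                /* With "#define NETIF_F_TSO 0" this test is always false,
                 * so no TSO super-packets are ever built. */
                return (dev->features & NETIF_F_TSO) != 0;
        }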

  eth1 is the only adapter with errors, and they all look like RX overruns.
For comparison:

eth1      Link encap:Ethernet  HWaddr 00:02:B3:9C:F5:9E  
          inet addr:192.168.4.1  Bcast:192.168.4.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:48621378 errors:8890 dropped:8890 overruns:8890 frame:0
          TX packets:64342993 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:3637004554 (3468.5 Mb)  TX bytes:1377740556 (1313.9 Mb)
          Interrupt:61 Base address:0x1200 Memory:fc020000-0 

eth3      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:E7  
          inet addr:192.168.3.1  Bcast:192.168.3.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:37130540 errors:0 dropped:0 overruns:0 frame:0
          TX packets:49061277 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:2774988658 (2646.4 Mb)  TX bytes:3290541711 (3138.1 Mb)
          Interrupt:44 Base address:0x2040 Memory:fe120000-0 

  I'm still working on getting a clean run with TSO off.  If anyone has any
ideas for me about the timeout errors, I'd appreciate the clue.

Thanks,

- Troy


*  SPEC(tm) and the benchmark name SPECweb(tm) are registered
trademarks of the Standard Performance Evaluation Corporation.
This benchmarking was performed for research purposes only,
and is non-compliant, with the following deviations from the
rules -

  1 - It was run on hardware that does not meet the SPEC
      availability-to-the-public criteria.  The machine is
      an engineering sample.

  2 - access_log wasn't kept for full accounting.  It was
      being written, but deleted every 200 seconds.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-05 18:30 Early SPECWeb99 results on 2.5.33 with TSO on e1000 Troy Wilson
@ 2002-09-05 20:59 ` jamal
  2002-09-05 22:11   ` Troy Wilson
                     ` (3 more replies)
  0 siblings, 4 replies; 102+ messages in thread
From: jamal @ 2002-09-05 20:59 UTC (permalink / raw)
  To: Troy Wilson; +Cc: linux-kernel, netdev


Hey, thanks for crossposting to netdev

So if I understood correctly (looking at the Intel site), the main value
add of this feature is probably in having the CPU avoid reassembling and
retransmitting.  I am willing to bet that the real value in your results
is in saving on retransmits; I would think that shoving the data down to
the NIC and avoiding the fragmentation shouldn't give you that much of a
CPU saving.  Do you have any stats from the hardware that could show
retransmits etc.?  Have you tested this with zero copy (sendfile) as
well?  Again, if I am right you shouldn't see much benefit from that
either.

I would think it probably works well with things like partial ACKs too?
(I am almost sure it does, or someone needs to be spanked, so just
checking.)

cheers,
jamal



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-05 20:59 ` jamal
@ 2002-09-05 22:11   ` Troy Wilson
  2002-09-05 22:39     ` Nivedita Singhvi
  2002-09-05 22:48   ` Nivedita Singhvi
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 102+ messages in thread
From: Troy Wilson @ 2002-09-05 22:11 UTC (permalink / raw)
  To: jamal; +Cc: linux-kernel, netdev

> So if i understood correctly (looking at the intel site) the main value
> add of this feature is probably in having the CPU avoid reassembling and
> retransmitting.

Quoting David S. Miller:

dsm> The performance improvement comes from the fact that the card
dsm> is given huge 64K packets, then the card (using the given ip/tcp
dsm> headers as a template) spits out 1500 byte mtu sized packets.
dsm> 
dsm> Less data DMA'd to the device per normal-mtu packet and less
dsm> per-packet data structure work by the cpu is where the improvement
dsm> comes from.
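
  To make that concrete, the templating step amounts to roughly the
following (purely illustrative C, not e1000 or stack code): the host
hands down one big payload plus one prototype header, and the hardware
cuts the payload into MSS-sized frames, patching sequence number, IP ID
and lengths into each copy of the template.

        struct hdr_template {
                unsigned int   seq;    /* starting TCP sequence number */
                unsigned short ip_id;  /* starting IP identification   */
                unsigned short mss;    /* cut payload at this size     */
        };

        static int tso_frames(const struct hdr_template *t, unsigned int payload_len)
        {
                unsigned int off = 0, seq = t->seq;
                unsigned short id = t->ip_id;
                int frames = 0;

                while (off < payload_len) {
                        unsigned int seg = payload_len - off;

                        if (seg > t->mss)
                                seg = t->mss;
                        /* hardware builds: template with seq/id/len patched in,
                         * followed by payload bytes [off, off + seg) */
                        seq += seg;
                        id++;
                        off += seg;
                        frames++;
                }
                return frames;   /* 64K at a 1460-byte MSS -> 45 wire frames */
        }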


> Do you have any stats from the hardware that could show
> retransmits etc;

  I'll gather netstat -s after runs with and without TSO enabled.
Anything else you'd like to see?


> have you tested this with zero copy as well (sendfile)

  Yes.  My web server is Apache 2.0.36, which uses sendfile for anything
over 8k in size.  But, IIRC, Apache sends the HTTP headers using writev.
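
  Roughly, that split looks like the sketch below (not Apache source;
the 8k threshold is just the one mentioned above):

        /* Sketch of the pattern: HTTP headers via writev(), and the file
         * body via sendfile() once it is big enough to be worth the
         * zero-copy path. */
        #include <sys/sendfile.h>
        #include <sys/types.h>
        #include <sys/uio.h>

        static int send_response(int sock, int filefd, off_t filesize,
                                 struct iovec *hdrs, int hdrcnt)
        {
                off_t off = 0;

                if (writev(sock, hdrs, hdrcnt) < 0)   /* response headers */
                        return -1;

                if (filesize <= 8 * 1024)             /* small: ordinary write path */
                        return 0;                     /* (read()/write() loop elided) */

                while (off < filesize)                /* large: zero-copy body */
                        if (sendfile(sock, filefd, &off, filesize - off) < 0)
                                return -1;
                return 0;
        }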

Thanks,

- Troy



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-05 22:11   ` Troy Wilson
@ 2002-09-05 22:39     ` Nivedita Singhvi
  2002-09-05 23:01       ` Dave Hansen
  0 siblings, 1 reply; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-05 22:39 UTC (permalink / raw)
  To: Troy Wilson; +Cc: jamal, linux-kernel, netdev

Quoting Troy Wilson <tcw@tempest.prismnet.com>:

> > Do you have any stats from the hardware that could show
> > retransmits etc;
> 
>   I'll gather netstat -s after runs with and without TSO enabled.
> Anything else you'd like to see?

Troy, this is pointing out the obvious, but make sure
you have the before stats as well :)...

> > have you tested this with zero copy as well (sendfile)
> 
>   Yes.  My webserver is Apache 2.0.36, which uses sendfile for
> anything
> over 8k in size.  But, iirc, Apache sends the http headers using
> writev.

SpecWeb99 doesn't exercise the path that might benefit the
most from this patch - sendmsg() of large files, with large
writes going down the stack.

thanks,
Nivedita



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-05 20:59 ` jamal
  2002-09-05 22:11   ` Troy Wilson
@ 2002-09-05 22:48   ` Nivedita Singhvi
  2002-09-06  1:47     ` jamal
  2002-09-06  3:47   ` David S. Miller
  2002-09-06 23:56   ` Troy Wilson
  3 siblings, 1 reply; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-05 22:48 UTC (permalink / raw)
  To: jamal; +Cc: Troy Wilson, linux-kernel, netdev

Quoting jamal <hadi@cyberus.ca>:

> So if i understood correctly (looking at the intel site) the main
> value add of this feature is probably in having the CPU avoid
> reassembling and retransmitting. I am willing to bet that the real

Er, even just assembling and transmitting?  I'm thinking of the
reduction in things like separate memory allocation calls, looking
up the route, etc.?

> value in your results is in saving on retransmits; I would think
> shoving the data down the NIC and avoid the fragmentation shouldnt
> give you that much significant CPU savings. Do you have any stats

Why do you say that?  Wouldn't the fact that you're now reducing the
number of calls down the stack by a significant number provide
a significant saving?

> from the hardware that could show retransmits etc; have you tested
> this with zero copy as well (sendfile) again, if i am right you
> shouldnt see much benefit from that either?

thanks,
Nivedita





^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-05 22:39     ` Nivedita Singhvi
@ 2002-09-05 23:01       ` Dave Hansen
  0 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2002-09-05 23:01 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: Troy Wilson, jamal, linux-kernel, netdev

Nivedita Singhvi wrote:
> SpecWeb99 doesnt execute the path that might benefit the 
> most from this patch - sendmsg() of large files - large writes
> going down..

For those of you who don't know Specweb well, the average size of a request
is about 14.5 kB.  The largest files are ~5 MB, but most top out at
just under a meg.

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-05 22:48   ` Nivedita Singhvi
@ 2002-09-06  1:47     ` jamal
  2002-09-06  3:38       ` Nivedita Singhvi
  2002-09-06  3:56       ` David S. Miller
  0 siblings, 2 replies; 102+ messages in thread
From: jamal @ 2002-09-06  1:47 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: Troy Wilson, linux-kernel, netdev



On Thu, 5 Sep 2002, Nivedita Singhvi wrote:

>
> > value in your results is in saving on retransmits; I would think
> > shoving the data down the NIC and avoid the fragmentation shouldnt
> > give you that much significant CPU savings. Do you have any stats
>
> Why do say that? Wouldnt the fact that youre now reducing the
> number of calls down the stack by a significant number provide
> a significant saving?

I am not sure; if he gets a busy system on a congested network, I can
see the offloading savings, i.e. I am not sure the amortization of the
calls away from the CPU is a sufficient saving if it doesn't involve a
lot of retransmits.  I am also wondering how smart this NIC is in doing
the retransmits; for example, I have doubts whether this idea is
brilliant to begin with; does it handle SACKs, for example?  What about
the algorithm du jour: would you have to upgrade the NIC, or can it be
taught some new tricks, etc.?
[Also, I can see why it makes sense to use this feature only with
sendfile; it's pretty much useless for interactive apps.]

Troy, I am not interested in the netstat -s data, but rather in the TCP
stats this NIC has exposed.  Unless those somehow show up magically in
netstat.

cheers,
jamal


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  1:47     ` jamal
@ 2002-09-06  3:38       ` Nivedita Singhvi
  2002-09-06  3:58         ` David S. Miller
  2002-09-07  0:05         ` Troy Wilson
  2002-09-06  3:56       ` David S. Miller
  1 sibling, 2 replies; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-06  3:38 UTC (permalink / raw)
  To: jamal; +Cc: Troy Wilson, linux-kernel, netdev

Quoting jamal <hadi@cyberus.ca>:

> I am not sure; if he gets a busy system in a congested network, i
> can see the offloading savings i.e i am not sure if the amortization
> of the calls away from the CPU is sufficient enough savings if it
> doesnt involve a lot of retransmits. I am also wondering how smart
> this NIC in doing the retransmits; example i have doubts if this
> idea is briliant to begin with; does it handle SACKs for example?

Do you mean SACK data being sent as a TCP option?
I don't know; lots of other questions arise (like, would the timestamp
on all the segments be the same?).

 
> Troy, i am not interested in the nestat -s data rather the TCP
> stats this NIC  has exposed. Unless those somehow show up magically
> in netstat.

Most recent versions of netstat (don't know how far back) display
/proc/net/snmp and /proc/net/netstat (with the Linux TCP MIB), so
netstat -s should show you most of what's interesting.  Or were you
referring to something else?

ifconfig -a and netstat -rn would also be nice to have.

thanks,
Nivedita





^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-05 20:59 ` jamal
  2002-09-05 22:11   ` Troy Wilson
  2002-09-05 22:48   ` Nivedita Singhvi
@ 2002-09-06  3:47   ` David S. Miller
  2002-09-06  6:48     ` Martin J. Bligh
  2002-09-12  7:28     ` Todd Underwood
  2002-09-06 23:56   ` Troy Wilson
  3 siblings, 2 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-06  3:47 UTC (permalink / raw)
  To: hadi; +Cc: tcw, linux-kernel, netdev

   From: jamal <hadi@cyberus.ca>
   Date: Thu, 5 Sep 2002 16:59:47 -0400 (EDT)
   
   I would think shoving the data down the NIC
   and avoid the fragmentation shouldnt give you that much significant
   CPU savings.

It's the DMA bandwidth saved; most of the specweb runs on x86 hardware
are limited by the DMA throughput of the PCI host controller.  In
particular, some controllers are limited to smaller DMA bursts to
work around hardware bugs.

I.e., the headers that don't need to go across the bus are the critical
resource saved by TSO.
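
Back-of-the-envelope on what stays off the bus per 64K TSO send,
assuming a 1460-byte MSS and 54 bytes of Ethernet+IP+TCP header per wire
frame (my assumptions, not figures from this thread):

        #include <stdio.h>

        int main(void)
        {
                const unsigned int send = 64 * 1024;  /* one TSO super-packet       */
                const unsigned int mss  = 1460;       /* payload per wire frame     */
                const unsigned int hdr  = 54;         /* eth + IP + TCP, no options */
                unsigned int frames = (send + mss - 1) / mss;

                printf("wire frames per 64K send  : %u\n", frames);
                printf("header bytes DMA'd, no TSO: %u\n", frames * hdr);
                printf("header bytes DMA'd, TSO   : %u (one template)\n", hdr);
                printf("per-frame DMA setups saved: %u\n", frames - 1);
                return 0;
        }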

I think I've said this a million times, perhaps the next person who
tries to figure out where the gains come from can just reply with
a pointer to a URL of this email I'm typing right now :-)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  1:47     ` jamal
  2002-09-06  3:38       ` Nivedita Singhvi
@ 2002-09-06  3:56       ` David S. Miller
  1 sibling, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-06  3:56 UTC (permalink / raw)
  To: hadi; +Cc: niv, tcw, linux-kernel, netdev

   From: jamal <hadi@cyberus.ca>
   Date: Thu, 5 Sep 2002 21:47:35 -0400 (EDT)
   
   I am not sure; if he gets a busy system in a congested network, i can
   see the offloading savings i.e i am not sure if the amortization of the
   calls away from the CPU is sufficient enough savings if it doesnt
   involve a lot of retransmits. I am also wondering how smart this NIC
   in doing the retransmits; example i have doubts if this idea is briliant
   to begin with; does it handle SACKs for example? What about
   the du-jour algorithm, would you have to upgrade the NIC or can it be
   taught some new trickes etc etc.
   [also i can see why it makes sense to use this feature only with sendfile;
   its pretty much useless for interactive apps]
   
   Troy, i am not interested in the nestat -s data rather the TCP stats
   this NIC  has exposed. Unless those somehow show up magically in netstat.
   
There are no retransmits happening in the card; it does not analyze
activity on the TCP connection to retransmit things itself.  It is
just a simple header-templating facility.

Read my other emails about where the benefits come from.

In fact, when a connection is sick (i.e. retransmits and SACKs occur),
we disable TSO completely for that socket.
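
A minimal sketch of that policy (illustrative only; the field and
function names are invented, this is not the 2.5.33 tcp_output.c):

        /* Use TSO only while the connection is healthy; any sign of loss
         * (retransmits, SACK blocks) drops the socket back to ordinary
         * one-MSS-at-a-time sends. */
        #define NETIF_F_TSO  0x0800   /* illustrative bit value */

        struct conn_state {
                unsigned int retrans_out;   /* segments being retransmitted      */
                unsigned int sacked_out;    /* segments SACKed by the peer       */
                unsigned int dev_features;  /* NETIF_F_* bits of the route's dev */
        };

        static int use_tso(const struct conn_state *c)
        {
                if (!(c->dev_features & NETIF_F_TSO))
                        return 0;           /* hardware can't do it */
                if (c->retrans_out || c->sacked_out)
                        return 0;           /* "sick" connection    */
                return 1;                   /* healthy: offload segmentation */
        }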

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  3:38       ` Nivedita Singhvi
@ 2002-09-06  3:58         ` David S. Miller
  2002-09-06  4:20           ` Nivedita Singhvi
  2002-09-07  0:05         ` Troy Wilson
  1 sibling, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06  3:58 UTC (permalink / raw)
  To: niv; +Cc: hadi, tcw, linux-kernel, netdev

   From: Nivedita Singhvi <niv@us.ibm.com>
   Date: Thu,  5 Sep 2002 20:38:10 -0700

   most recent (dont know how far back) versions of netstat
   display /proc/net/snmp and /proc/net/netstat (with the 
   Linux TCP MIB), so netstat -s should show you most of 
   whats interesting. Or were you referring to something else?
   
   ifconfig -a and netstat -rn would also be nice to have..
   
TSO gets turned off during retransmits/SACK and the card does not do
retransmits.

Can we move on in this conversation now? :-)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  4:20           ` Nivedita Singhvi
@ 2002-09-06  4:17             ` David S. Miller
  0 siblings, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-06  4:17 UTC (permalink / raw)
  To: niv; +Cc: hadi, tcw, linux-kernel, netdev

   From: Nivedita Singhvi <niv@us.ibm.com>
   Date: Thu,  5 Sep 2002 21:20:47 -0700
   
   Sure :). The motivation for seeing the stats though would
   be to get an idea of how much retransmission/SACK etc 
   activity _is_ occurring during Troy's SpecWeb runs, which
   would give us an idea of how often we're actually doing
   segmentation offload, and better idea of how much gain
   its possible to further get from this(ahem) DMA coalescing :).
   Some of Troy's early runs had a very large number of
   packets dropped by the card.

One thing to do is make absolutely sure that flow control is
enabled and supported by all devices on the link from the
client to the test specweb server.

Troy, can you do that for us along with the statistics dumps?

Thanks.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  3:58         ` David S. Miller
@ 2002-09-06  4:20           ` Nivedita Singhvi
  2002-09-06  4:17             ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-06  4:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, tcw, linux-kernel, netdev

Quoting "David S. Miller" <davem@redhat.com>:

> >   ifconfig -a and netstat -rn would also be nice to have..
>    
> TSO gets turned off during retransmits/SACK and the card does not
> do
> retransmits.
> 
> Can we move on in this conversation now? :-)

Sure :). The motivation for seeing the stats though would
be to get an idea of how much retransmission/SACK etc 
activity _is_ occurring during Troy's SpecWeb runs, which
would give us an idea of how often we're actually doing
segmentation offload, and a better idea of how much further gain
it's possible to get from this (ahem) DMA coalescing :).
Some of Troy's early runs had a very large number of
packets dropped by the card.

thanks,
Nivedita



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  3:47   ` David S. Miller
@ 2002-09-06  6:48     ` Martin J. Bligh
  2002-09-06  6:51       ` David S. Miller
                         ` (4 more replies)
  2002-09-12  7:28     ` Todd Underwood
  1 sibling, 5 replies; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06  6:48 UTC (permalink / raw)
  To: David S. Miller, hadi; +Cc: tcw, linux-kernel, netdev, Nivedita Singhvi

>    I would think shoving the data down the NIC
>    and avoid the fragmentation shouldnt give you that much significant
>    CPU savings.
> 
> It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
> is limited by the DMA throughput of the PCI host controller.  In
> particular some controllers are limited to smaller DMA bursts to
> work around hardware bugs.
> 
> Ie. the headers that don't need to go across the bus are the critical
> resource saved by TSO.

I'm not sure that's entirely true in this case - the Netfinity
8500R is slightly unusual in that it has 3 or 4 PCI buses, and
there are 4-8 gigabit Ethernet cards in this beast, spread around
different buses (Troy - are we still just using 4? ... and what's
the raw bandwidth of data we're pushing? ... it's not huge).
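
Back-of-the-envelope, using only numbers already in this thread (2906
connections at the 320 kbps conformance floor; the true per-connection
average is higher, so this is a lower bound):

        #include <stdio.h>

        int main(void)
        {
                const double conns = 2906;   /* Troy's TSO-on result         */
                const double kbps  = 320;    /* SPECweb99 conformance cutoff */
                double mbits = conns * kbps / 1000.0;

                printf("aggregate >= %.0f Mbit/s (~%.0f MB/s)\n", mbits, mbits / 8.0);
                return 0;
        }

Call it roughly a gigabit in aggregate, spread over several NICs and
buses, which backs up the "it's not huge" point.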

I think we're CPU limited (there's no idle time on this machine), 
which is odd for an 8-CPU 900 MHz P3 Xeon, but still, this is Apache,
not Tux. You mentioned CPU load as another advantage of TSO ... 
anything we've done to reduce CPU load enables us to run more and 
more connections (I think we started at about 260 or something, so 
2900 ain't too bad ;-)).

Just to throw another firework into the fire whilst people are 
awake, NAPI does not seem to scale to this sort of load, which
was disappointing, as we were hoping it would solve some of 
our interrupt load problems ... seems that half the machine goes
idle, the number of simultaneous connections drops way down, and
everything's blocked on ... something ... not sure what ;-)
Any guesses at why, or ways to debug this?

M.

PS. Anyone else running NAPI on SMP? (ideally at least 4-way?)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  6:48     ` Martin J. Bligh
@ 2002-09-06  6:51       ` David S. Miller
  2002-09-06  7:36         ` Andrew Morton
  2002-09-06 14:29         ` Martin J. Bligh
  2002-09-06 15:29       ` Dave Hansen
                         ` (3 subsequent siblings)
  4 siblings, 2 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-06  6:51 UTC (permalink / raw)
  To: Martin.Bligh; +Cc: hadi, tcw, linux-kernel, netdev, niv

   From: "Martin J. Bligh" <Martin.Bligh@us.ibm.com>
   Date: Thu, 05 Sep 2002 23:48:42 -0700
   
   Just to throw another firework into the fire whilst people are 
   awake, NAPI does not seem to scale to this sort of load, which
   was disappointing, as we were hoping it would solve some of 
   our interrupt load problems ...

Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?

NAPI is also not the panacea to all problems in the world.

I bet your greatest gain would be obtained from going to Tux,
using appropriate IRQ affinity settings, and making sure the
Tux threads bind to the same CPU as the device where they accept
connections.

It is the standard method for obtaining peak specweb performance.
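
A rough sketch of the process side of that binding, using
sched_setaffinity() (the cpu_set_t glibc wrapper shown here is newer
than this thread; the IRQ side is a CPU mask written to
/proc/irq/<n>/smp_affinity):

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        /* Pin the calling server thread to one CPU so it stays on the same
         * processor that services its NIC's interrupts. */
        static int bind_to_cpu(int cpu)
        {
                cpu_set_t mask;

                CPU_ZERO(&mask);
                CPU_SET(cpu, &mask);
                return sched_setaffinity(0, sizeof(mask), &mask);  /* 0 = self */
        }

        int main(void)
        {
                if (bind_to_cpu(0))
                        perror("sched_setaffinity");
                return 0;
        }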

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  7:36         ` Andrew Morton
@ 2002-09-06  7:22           ` David S. Miller
  2002-09-06  9:54             ` jamal
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06  7:22 UTC (permalink / raw)
  To: akpm; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv, Robert.Olsson

   From: Andrew Morton <akpm@zip.com.au>
   Date: Fri, 06 Sep 2002 00:36:04 -0700

   "David S. Miller" wrote:
   > NAPI is also not the panacea to all problems in the world.
   
   Mala did some testing on this a couple of weeks back.  It appears that
   NAPI damaged performance significantly.

   http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm

Unfortunately it is not listed which e1000 and core NAPI
patches were used.  Also not listed are the RX/TX mitigation
settings and ring sizes given to the kernel module upon loading.

Robert can comment on optimal settings.

Robert and Jamal can make a more detailed analysis of Mala's
graphs than I.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  6:51       ` David S. Miller
@ 2002-09-06  7:36         ` Andrew Morton
  2002-09-06  7:22           ` David S. Miller
  2002-09-06 14:29         ` Martin J. Bligh
  1 sibling, 1 reply; 102+ messages in thread
From: Andrew Morton @ 2002-09-06  7:36 UTC (permalink / raw)
  To: David S. Miller; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

"David S. Miller" wrote:
> 
> ...
> 
> NAPI is also not the panacea to all problems in the world.
> 

Mala did some testing on this a couple of weeks back.  It appears that
NAPI damaged performance significantly.

http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  7:22           ` David S. Miller
@ 2002-09-06  9:54             ` jamal
  0 siblings, 0 replies; 102+ messages in thread
From: jamal @ 2002-09-06  9:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: akpm, Martin.Bligh, tcw, linux-kernel, netdev, niv, Robert Olsson



On Fri, 6 Sep 2002, David S. Miller wrote:

>    Mala did some testing on this a couple of weeks back.  It appears that
>    NAPI damaged performance significantly.
>
>    http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
>
> Unfortunately it is not listed what e1000 and core NAPI
> patch was used.  Also, not listed, are the RX/TX mitigation
> and ring sizes given to the kernel module upon loading.
>
> Robert can comment on optimal settings
>
> Robert and Jamal can make a more detailed analysis of Mala's
> graphs than I.


I looked at those graphs, but the lack of information makes them useless.
For example, there are too many variables in the tests: what is the
effect of the message size?  And then look at the socket buffer size;
would you set it to 64K if you were trying to show performance numbers?
What other TCP settings are there?
Manfred Spraul complained about a year back about some performance issues
in low-load setups (which is what this IBM setup seems to be if you count
the pps to the server); it's one of those things that have been low in
the TODO deck.
The issue may be legit, not because NAPI is bad but because it is too good.
I don't have an e1000, but I have some D-Link GigE cards still in boxes
and a two-CPU SMP machine; I'll set up the testing this weekend.
In Manfred's case we couldn't reproduce the tests because he had this
odd, weird NIC; in this case at least access to an e1000 doesn't require
a visit to the museum.

cheers,
jamal


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  6:51       ` David S. Miller
  2002-09-06  7:36         ` Andrew Morton
@ 2002-09-06 14:29         ` Martin J. Bligh
  2002-09-06 15:38           ` Dave Hansen
  1 sibling, 1 reply; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06 14:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, tcw, linux-kernel, netdev, niv

> Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?
> 
> NAPI is also not the panacea to all problems in the world.

No, but I didn't expect throughput to drop by 40% or so either,
which is (very roughly) what happened. Interrupts are a pain to
manage and do affinity with, so NAPI should (at least in theory)
be better for this kind of setup ... I think.
 
> I bet your greatest gain would be obtained from going to Tux
> and using appropriate IRQ affinity settings and making sure
> Tux threads bind to same cpu as device where they accept
> connections.
> 
> It is standard method to obtain peak specweb performance.

Ah, but that's not really our goal - what we're trying to do is
use specweb as a tool to simulate a semi-realistic customer
workload, and improve the Linux kernel performance, using that
as our yardstick for measuring ourselves. For that I like the
setup we have reasonably well, even though it won't get us the
best numbers.

To get the best benchmark numbers, you're absolutely right though.

M.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  6:48     ` Martin J. Bligh
  2002-09-06  6:51       ` David S. Miller
@ 2002-09-06 15:29       ` Dave Hansen
  2002-09-06 16:29         ` Martin J. Bligh
  2002-09-06 17:26       ` Gerrit Huizenga
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2002-09-06 15:29 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, Nivedita Singhvi

Martin J. Bligh wrote:
> Just to throw another firework into the fire whilst people are 
> awake, NAPI does not seem to scale to this sort of load, which
> was disappointing, as we were hoping it would solve some of 
> our interrupt load problems ... seems that half the machine goes
> idle, the number of simultaneous connections drop way down, and
> everything's blocked on ... something ... not sure what ;-)
> Any guesses at why, or ways to debug this?

I thought that I already tried to explain this to you.  (although it could 
have been on one of those too-much-coffee-days :)

Something strange happens to the Specweb clients when NAPI is enabled.
Somehow they start using a lot more CPU.  The increased 
idle time on the server is because the _clients_ are CPU maxed.  I have 
some preliminary oprofile data for the clients, but it appears that this is 
another case of Specweb code just really sucking.

The real question is why NAPI causes so much more work for the client.  I'm 
not convinced that it is much, much greater, because I believe that I was 
already at the edge of the cliff with my clients and NAPI just gave them a 
little shove :).  Specweb also takes a while to ramp up (even during the 
real run), so sometimes it takes a few minutes to see the clients get 
saturated.
-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 14:29         ` Martin J. Bligh
@ 2002-09-06 15:38           ` Dave Hansen
  2002-09-06 16:11             ` Martin J. Bligh
  2002-09-06 16:21             ` Nivedita Singhvi
  0 siblings, 2 replies; 102+ messages in thread
From: Dave Hansen @ 2002-09-06 15:38 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, niv

Martin J. Bligh wrote:
>>Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?
>>
>>NAPI is also not the panacea to all problems in the world.
> 
> No, but I didn't expect throughput to drop by 40% or so either,
> which is (very roughly) what happened. Interrupts are a pain to
> manage and do affinity with, so NAPI should (at least in theory)
> be better for this kind of setup ... I think.

No, no.  Bad Martin!  Throughput didn't drop, "Specweb compliance" dropped. 
  Those are two very, very different things.  I've found that the server 
can produce a lot more throughput, although it doesn't have the 
characteristics that Specweb considers compliant.  Just have Troy enable 
mod-status and look at the throughput that Apache tells you that it is 
giving during a run.  _That_ is real throughput, not number of compliant 
connections.

_And_ NAPI is for receive only, right?  Also, my compliance drop occurs 
with the NAPI checkbox disabled.  There is something else in the new driver 
that causes our problems.

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 15:38           ` Dave Hansen
@ 2002-09-06 16:11             ` Martin J. Bligh
  2002-09-06 16:21             ` Nivedita Singhvi
  1 sibling, 0 replies; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06 16:11 UTC (permalink / raw)
  To: Dave Hansen; +Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, niv

> No, no.  Bad Martin!  Throughput didn't drop, "Specweb compliance" 
> dropped.   Those are two very, very different things.  I've found 
> that the server can produce a lot more throughput, although it 
> doesn't have the characteristics that Specweb considers compliant.  
> Just have Troy enable mod-status and look at the throughput that 
> Apache tells you that it is giving during a run.  _That_ is real
> throughput, not number of compliant connections.

By throughput I meant number of compliant connections, not bandwidth.
It may well be latency that's going out the window, rather than
bandwidth. Yes, I should use more precise terms ...

> _And_ NAPI is for receive only, right?  Also, my compliance drop 
> occurs with the NAPI checkbox disabled.  There is something else 
> in the new driver that causes our problems.

Not sure about that - I was told once that there were transmission
completion interrupts as well? What happens to those? Or am I 
confused again ...

M.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 15:38           ` Dave Hansen
  2002-09-06 16:11             ` Martin J. Bligh
@ 2002-09-06 16:21             ` Nivedita Singhvi
  1 sibling, 0 replies; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-06 16:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, David S. Miller, hadi, tcw, linux-kernel, netdev

Quoting Dave Hansen <haveblue@us.ibm.com>:

> No, no.  Bad Martin!  Throughput didn't drop, "Specweb compliance"
> dropped. Those are two very, very different things.  I've found that
> the server can produce a lot more throughput, although it doesn't
> have the characteristics that Specweb considers compliant.  
> Just have Troy enable mod-status and look at the throughput that
> Apache tells you that it is giving during a run.  
> _That_ is real throughput, not number of compliant connections.

> _And_ NAPI is for receive only, right?  Also, my compliance drop
> occurs with the NAPI checkbox disabled.  There is something else in
> the new driver that causes our problems.

Thanks, Dave, you saved me a bunch of typing...

Just looking at a networking benchmark result is worse than
useless. You really need to look at the stats, settings,
and the profiles; e.g., for most of the networking stuff:

ifconfig -a 
netstat -s
netstat -rn
/proc/sys/net/ipv4/
/proc/sys/net/core/

before and after the run. 

Dave, although in your setup the clients are maxed out, I'm
not sure that's the case for Mala's and Troy's clients.  (Don't
know, of course.)  But I'm fairly sure they aren't using
single quad NUMAs, and they may not be seeing the same
effects.

thanks,
Nivedita


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 15:29       ` Dave Hansen
@ 2002-09-06 16:29         ` Martin J. Bligh
  2002-09-06 17:36           ` Dave Hansen
  0 siblings, 1 reply; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06 16:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, Nivedita Singhvi

> I thought that I already tried to explain this to you.  (although 
> it could have been on one of those too-much-coffee-days :)

You told me, but I'm far from convinced this is the problem. I think
it's more likely this is a side-effect of a server issue - something
like a lot of dropped packets and retransmits, though not necessarily
that.

> Something strange happens to the clients when NAPI is enabled on 
> the Specweb clients.  Somehow the start using a lot more CPU.  
> The increased idle time on the server is because the _clients_ are 
> CPU maxed.  I have some preliminary oprofile data for the clients, 
> but it appears that this is another case of Specweb code just 
> really sucking.

Hmmm ... if you change something on the server, and all the clients
go wild, I'm suspicious of whatever you did to the server. You need
to have a lot more data before leaping to the conclusion that it's
because the specweb client code is crap.

Troy - I think your UP clients weren't anywhere near maxed out on 
CPU power, right? Can you take a peek at the clients under NAPI load?

Dave - did you ever try running 4 specweb clients bound to each of
the 4 CPUs in an attempt to make the clients scale better? I'm 
suspicious that you're maxing out 4 4-way machines, and Troy's
16 UPs are cruising along just fine.

M.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  6:48     ` Martin J. Bligh
  2002-09-06  6:51       ` David S. Miller
  2002-09-06 15:29       ` Dave Hansen
@ 2002-09-06 17:26       ` Gerrit Huizenga
  2002-09-06 17:37         ` David S. Miller
  2002-09-06 23:48       ` Troy Wilson
  2002-09-11  9:11       ` Eric W. Biederman
  4 siblings, 1 reply; 102+ messages in thread
From: Gerrit Huizenga @ 2002-09-06 17:26 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, Nivedita Singhvi

In message <18563262.1031269721@[10.10.2.3]>, "Martin J. Bligh" writes:
> >    I would think shoving the data down the NIC
> >    and avoid the fragmentation shouldnt give you that much significant
> >    CPU savings.
> > 
> > It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
> > is limited by the DMA throughput of the PCI host controller.  In
> > particular some controllers are limited to smaller DMA bursts to
> > work around hardware bugs.
> 
> I think we're CPU limited (there's no idle time on this machine), 
> which is odd for an 8 CPU 900MHz P3 Xeon, but still, this is Apache,
> not Tux. You mentioned CPU load as another advantage of TSO ... 
> anything we've done to reduce CPU load enables us to run more and 
> more connections (I think we started at about 260 or something, so 
> 2900 ain't too bad ;-)).

Troy, is there any chance you could post an oprofile from any sort
of reasonably conformant run?  I think that might help enlighten
people a bit as to what we are fighting with.  The last numbers I
remember seemed to indicate that we were spending about 1.25 CPUs
in network/e1000 code with 100% CPU utilization and crappy SpecWeb
throughput.

One of our goals is to actually take the next generation of the most
common "large system" web server and get it to scale along the lines
of Tux or some of the other servers which are more common on the
small machines.  For some reason, big corporate customers want lots
of features that are in a web server like apache and would also like
the performance on their 8-CPU or 16-CPU machine to not suck at the
same time.  High ideals, I know, wanting all features *and* performance
from the same tool...  Next thing you know they'll want reliability
or some such thing.

gerrit

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 16:29         ` Martin J. Bligh
@ 2002-09-06 17:36           ` Dave Hansen
  2002-09-06 18:26             ` Andi Kleen
  0 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2002-09-06 17:36 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, Nivedita Singhvi

Martin J. Bligh wrote:
>>Something strange happens to the clients when NAPI is enabled on 
>>the Specweb clients.  Somehow the start using a lot more CPU.  
>>The increased idle time on the server is because the _clients_ are 
>>CPU maxed.  I have some preliminary oprofile data for the clients, 
>>but it appears that this is another case of Specweb code just 
>>really sucking.
> 
> Hmmm ... if you change something on the server, and all the clients
> go wild, I'm suspicious of whatever you did to the server.

Me too :)  All that was changed was adding the new e1000 driver.  NAPI was 
disabled.

> You need
> to have a lot more data before leaping to the conclusion that it's
> because the specweb client code is crap.

I'll let the profile speak for itself...

oprofile summary (op_time -d):

1          0.0000 0.0000 /bin/sleep
2          0.0001 0.0000 /lib/ld-2.2.5.so.dpkg-new (deleted)
2          0.0001 0.0000 /lib/libpthread-0.9.so
2          0.0001 0.0000 /usr/bin/expr
3          0.0001 0.0000 /sbin/init
4          0.0001 0.0000 /lib/libproc.so.2.0.7
12         0.0004 0.0000 /lib/libc-2.2.5.so.dpkg-new (deleted)
17         0.0005 0.0000 /usr/lib/libcrypto.so.0.9.6.dpkg-new (deleted)
20         0.0006 0.0000 /bin/bash
30         0.0010 0.0000 /usr/sbin/sshd
151        0.0048 0.0000 /usr/bin/vmstat
169        0.0054 0.0000 /lib/ld-2.2.5.so
300        0.0095 0.0000 /lib/modules/2.4.18+O1/oprofile/oprofile.o
1115       0.0354 0.0000 /usr/local/bin/oprofiled
3738       0.1186 0.0000 /lib/libnss_files-2.2.5.so
58181      1.8458 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
249186     7.9056 0.0000 /home/dave/specweb99/build/client
582281    18.4733 0.0000 /lib/libc-2.2.5.so
2256792   71.5986 0.0000 /usr/src/linux/vmlinux

top of oprofile from the client:
08051b3c 2260     0.948938    check_for_timeliness
08051cfc 2716     1.14041     ascii_cat
08050f24 4547     1.90921     HTTPGetReply
0804f138 4682     1.9659      workload_op
08050890 6111     2.56591     HTTPDoConnect
08049a30 7330     3.07775     SHMmalloc
08052244 7433     3.121       HTParse
08052628 8482     3.56146     HTSACopy
08051d88 10288    4.31977     get_some_line
08052150 13070    5.48788     scan
08051a10 65314    27.4243     assign_port_number
0804bd30 83789    35.1817     LOG
Redefine it as
#define LOG(x) do {} while(0)
and voila! 35% more CPU!
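
Spelled out, that is just a compile-time switch; only the LOG name comes
from the profile above, the rest is an illustrative sketch:

        #include <stdio.h>

        #ifdef QUIET_CLIENT
        #define LOG(msg) do { } while (0)               /* call sites compile away */
        #else
        #define LOG(msg) fprintf(stderr, "%s\n", (msg))
        #endif

        int main(void)
        {
                LOG("this call vanishes when built with -DQUIET_CLIENT");
                return 0;
        }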

Top of Kernel profile:
c022c850 33085    1.46602     number
c0106e59 42693    1.89176     restore_all
c01dfe68 42787    1.89592     sys_socketcall
c01df39c 54185    2.40097     sys_bind
c01de698 62740    2.78005     sockfd_lookup
c01372c8 97886    4.3374      fput
c022c110 125306   5.55239     __generic_copy_to_user
c01373b0 181922   8.06109     fget
c020958c 199054   8.82022     tcp_v4_get_port
c0106e10 199934   8.85921     system_call
c022c158 214014   9.48311     __generic_copy_from_user
c0216ecc 257768   11.4219     inet_bind

"oprofpp -k -dl -i /lib/libc-2.2.5.so"
just gives:
vma      samples %-age symbol name  linenr info                 image name
00000000 582281  100   (no symbol)  (no location information) 
/lib/libc-2.2.5.so

I've never really tried to profile anything but the kernel before.  Any ideas?

> Troy - I think your UP clients weren't anywhere near maxed out on 
> CPU power, right? Can you take a peek at the clients under NAPI load?

Make sure you wait a minute or two.  The client tends to ramp up.

"vmstat 2" after the client has told the master that it is running:
  U   S   I   (%user  %system  %idle)
----------
  4  15  81
  5  17  79
  7  16  77
  7  17  76
  7  21  72
11  25  64
  3  16  82
  2  14  84
  7  23  70
16  50  34
24  75   0
27  73   0
28  72   0
24  76   0
...

> Dave - did you ever try running 4 specweb clients bound to each of
> the 4 CPUs in an attempt to make the clients scale better? I'm 
> suspicious that you're maxing out 4 4-way machines, and Troy's
> 16 UPs are cruising along just fine.

No, but I'm not sure it will do any good.  They don't run often enough and 
I have the feeling that there are very few cache locality benefits to be had.

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 17:26       ` Gerrit Huizenga
@ 2002-09-06 17:37         ` David S. Miller
  2002-09-06 18:19           ` Gerrit Huizenga
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 17:37 UTC (permalink / raw)
  To: gh; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

   From: Gerrit Huizenga <gh@us.ibm.com>
   Date: Fri, 06 Sep 2002 10:26:04 -0700
   
   One of our goals is to actually take the next generation of the most
   common "large system" web server and get it to scale along the lines
   of Tux or some of the other servers which are more common on the
   small machines.  For some reasons, big corporate customers want lots
   of features that are in a web server like apache and would also like
   the performance on their 8-CPU or 16-CPU machine to not suck at the
   same time.  High ideals, I know, wanting all features *and* performance
   from the same tool...  Next thing you know they'll want reliability
   or some such thing.

Why does Tux keep you from taking advantage of all the
features of Apache?  Anything Tux doesn't handle in its
fast path is simply fed up to Apache.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 17:37         ` David S. Miller
@ 2002-09-06 18:19           ` Gerrit Huizenga
  2002-09-06 18:26             ` Martin J. Bligh
  2002-09-06 18:34             ` David S. Miller
  0 siblings, 2 replies; 102+ messages in thread
From: Gerrit Huizenga @ 2002-09-06 18:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

In message <20020906.103717.82432404.davem@redhat.com>, "David S. Miller" writes:
>    From: Gerrit Huizenga <gh@us.ibm.com>
>    Date: Fri, 06 Sep 2002 10:26:04 -0700
>    
>    One of our goals is to actually take the next generation of the most
>    common "large system" web server and get it to scale along the lines
>    of Tux or some of the other servers which are more common on the
>    small machines.  For some reasons, big corporate customers want lots
>    of features that are in a web server like apache and would also like
>    the performance on their 8-CPU or 16-CPU machine to not suck at the
>    same time.  High ideals, I know, wanting all features *and* performance
>    from the same tool...  Next thing you know they'll want reliability
>    or some such thing.
> 
> Why does Tux keep you from taking advantage of all the
> feature of Apache?  Anything Tux doesn't handle in it's
> fast path is simple fed up to Apache.

You have to ask the hard questions...   Some of this is rooted in
the past, when Tux was an emerging technology rather than ubiquitously
available.  And, combined with the fact that most customers tend to
lag the technology curve, Apache 1.X or, in our case, IBM HTTPD was
simply a customer drop-in with standard configuration support that
roughly matched that on all other platforms, e.g. AIX, Solaris, HPUX,
Linux, etc.  So doing a one-off for Linux at a very heterogeneous
large customer adds pain, and that pain becomes cost for the customer
in terms of consulting, training, sys admin, system management, etc.

We also had some bad starts with using Tux in terms of performance
and scalability on 4-CPU and 8-CPU machines, especially when combining
it with things like squid or other caching products from various third
parties.

Then there is the problem that 90%+ of our customers seem to have
dynamic-only web servers.  Static content is limited to a couple of
banners and images that need to be tied into some kind of caching
content server.  So Tux's benefits for static serving turned out to
be only additional overhead, because there were no static pages to be
served up.

And, honestly, I'm a kernel guy much more than an applications guy, so
I'll admit that I'm not up to speed on what Tux2 can do with dynamic
content.  The last I knew was that it could pass it off to another server.
So we are focused on making the most common case for our customer situations
scale well.  As you are probably aware, there are no specweb results
posted using Apache, but web crawler stats suggest that Apache is the
most common server.  The problem is that performance on Apache sucks
but people like the features.  Hence we are working to make Apache
suck less, and finding that part of the problem is the way it uses the
kernel.  Other parts are the interface for specweb in particular which
we have done a bunch of work on with Greg Ames.  And we are feeding
data back to the Apache 2.0 team which should help Apache in general.

gerrit

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 17:36           ` Dave Hansen
@ 2002-09-06 18:26             ` Andi Kleen
  2002-09-06 18:31               ` John Levon
                                 ` (2 more replies)
  0 siblings, 3 replies; 102+ messages in thread
From: Andi Kleen @ 2002-09-06 18:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Martin J. Bligh, David S. Miller, hadi, tcw, linux-kernel,
	netdev, Nivedita Singhvi

> c0106e59 42693    1.89176     restore_all
> c01dfe68 42787    1.89592     sys_socketcall
> c01df39c 54185    2.40097     sys_bind
> c01de698 62740    2.78005     sockfd_lookup
> c01372c8 97886    4.3374      fput
> c022c110 125306   5.55239     __generic_copy_to_user
> c01373b0 181922   8.06109     fget
> c020958c 199054   8.82022     tcp_v4_get_port
> c0106e10 199934   8.85921     system_call
> c022c158 214014   9.48311     __generic_copy_from_user
> c0216ecc 257768   11.4219     inet_bind

The profile looks bogus.  The NIC driver is nowhere in sight; normally
its memory-mapped I/O for interrupts and device registers should show up.
I would double-check it (e.g. with the normal profiler).

In case it is not bogus:
Most of these are either atomic_inc/dec of reference counters or some
form of lock.  system_call could be the int 0x80 entry (using the SYSENTER
patches would help), which also does atomic operations implicitly.
restore_all is IRET, which could also likely be sped up by using SYSEXIT.
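
For readers wondering why fget/fput rank so high: every socket syscall
looks the fd up and bumps/drops the struct file reference count with
atomic operations, roughly this pattern (a simplified sketch, not the
2.4 source):

        struct file_like { int refcount; /* really an atomic_t */ };

        static struct file_like *sketch_fget(struct file_like **fd_table,
                                             unsigned int fd)
        {
                struct file_like *f;

                /* read_lock(&files->file_lock) in the real kernel */
                f = fd_table[fd];
                if (f)
                        __sync_fetch_and_add(&f->refcount, 1);  /* atomic_inc() */
                /* read_unlock(...) */
                return f;
        }

        static void sketch_fput(struct file_like *f)
        {
                if (__sync_sub_and_fetch(&f->refcount, 1) == 0) /* atomic_dec_and_test() */
                        ;  /* last reference gone: release the file */
        }

On an 8-way box those atomics and the fd-table lock bounce cache lines
between CPUs, which is why they dominate a syscall-heavy profile.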

If NAPI hurts here then it is surely not because of eating CPU time.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:19           ` Gerrit Huizenga
@ 2002-09-06 18:26             ` Martin J. Bligh
  2002-09-06 18:36               ` David S. Miller
  2002-09-06 18:34             ` David S. Miller
  1 sibling, 1 reply; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06 18:26 UTC (permalink / raw)
  To: Gerrit Huizenga, David S. Miller; +Cc: hadi, tcw, linux-kernel, netdev, niv

>>    One of our goals is to actually take the next generation of the most
>>    common "large system" web server and get it to scale along the lines
>>    of Tux or some of the other servers which are more common on the
>>    small machines.  For some reasons, big corporate customers want lots
>>    of features that are in a web server like apache and would also like
>>    the performance on their 8-CPU or 16-CPU machine to not suck at the
>>    same time.  High ideals, I know, wanting all features *and* performance
>>    from the same tool...  Next thing you know they'll want reliability
>>    or some such thing.
>> 
>> Why does Tux keep you from taking advantage of all the
>> feature of Apache?  Anything Tux doesn't handle in it's
>> fast path is simple fed up to Apache.
> 
> You have to ask the hard questions...   

Ultimately, to me at least, the server doesn't really matter, and
neither do the absolute benchmark numbers. Linux should scale under 
any reasonable workload. The point of this is to look at the Linux
kernel, not the webserver, or specweb ... they're just hammers to
beat on the kernel with.

The fact that we're doing something different from everyone else
and turning up a different set of kernel issues is a good thing, 
to my mind. You're right, we could use Tux if we wanted to ... but
that doesn't stop Apache being interesting ;-)

M.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:26             ` Andi Kleen
@ 2002-09-06 18:31               ` John Levon
  2002-09-06 18:33               ` Dave Hansen
  2002-09-06 19:19               ` Nivedita Singhvi
  2 siblings, 0 replies; 102+ messages in thread
From: John Levon @ 2002-09-06 18:31 UTC (permalink / raw)
  To: linux-kernel

On Fri, Sep 06, 2002 at 08:26:46PM +0200, Andi Kleen wrote:

> > c0216ecc 257768   11.4219     inet_bind
> 
> The profile looks bogus. The NIC driver is nowhere in sight. Normally
> its mmap IO for interrupts and device registers should show. I would
> double check it (e.g. with normal profile) 

The system summary shows :

58181      1.8458 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o

so it won't show up in the monolithic kernel profile. You can probably
get a combined comparison with

op_time -dnl | grep -E 'vmlinux|acenic'

regards
john

-- 
 "Are you willing to go out there and save the lives of our children, even if it means losing your own life ?
 Yes I am.
 I believe you, Jeru... you're ready."

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:26             ` Andi Kleen
  2002-09-06 18:31               ` John Levon
@ 2002-09-06 18:33               ` Dave Hansen
  2002-09-06 18:36                 ` David S. Miller
  2002-09-06 19:19               ` Nivedita Singhvi
  2 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2002-09-06 18:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Martin J. Bligh, David S. Miller, hadi, tcw, linux-kernel,
	netdev, Nivedita Singhvi

Andi Kleen wrote:
>>c0106e59 42693    1.89176     restore_all
>>c01dfe68 42787    1.89592     sys_socketcall
>>c01df39c 54185    2.40097     sys_bind
>>c01de698 62740    2.78005     sockfd_lookup
>>c01372c8 97886    4.3374      fput
>>c022c110 125306   5.55239     __generic_copy_to_user
>>c01373b0 181922   8.06109     fget
>>c020958c 199054   8.82022     tcp_v4_get_port
>>c0106e10 199934   8.85921     system_call
>>c022c158 214014   9.48311     __generic_copy_from_user
>>c0216ecc 257768   11.4219     inet_bind
> 
> The profile looks bogus. The NIC driver is nowhere in sight. Normally
> its mmap IO for interrupts and device registers should show. I would
> double check it (e.g. with normal profile) 

Actually, oprofile separated out the acenic module from the rest of the 
kernel.  I should have included that breakout as well, but it was only 1.38%
of the CPU:
1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o


-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:19           ` Gerrit Huizenga
  2002-09-06 18:26             ` Martin J. Bligh
@ 2002-09-06 18:34             ` David S. Miller
  2002-09-06 18:57               ` Gerrit Huizenga
  1 sibling, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 18:34 UTC (permalink / raw)
  To: gh; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

   From: Gerrit Huizenga <gh@us.ibm.com>
   Date: Fri, 06 Sep 2002 11:19:11 -0700
   
   And, honestly, I'm a kernel guy much more than an applications guy, so
   I'll admit that I'm not up to speed on what Tux2 can do with dynamic
   content.

TUX can optimize dynamic content just fine.

   The last I knew was that it could pass it off to another server.

Not true.

   The problem is that performance on Apache sucks
   but people like the features.

Tux's design allows it to be a drop-in acceleration method
which does not require you to relinquish Apache's feature set.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:26             ` Martin J. Bligh
@ 2002-09-06 18:36               ` David S. Miller
  2002-09-06 18:51                 ` Martin J. Bligh
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 18:36 UTC (permalink / raw)
  To: Martin.Bligh; +Cc: gh, hadi, tcw, linux-kernel, netdev, niv

   From: "Martin J. Bligh" <Martin.Bligh@us.ibm.com>
   Date: Fri, 06 Sep 2002 11:26:49 -0700
   
   The fact that we're doing something different from everyone else
   and turning up a different set of kernel issues is a good thing, 
   to my mind. You're right, we could use Tux if we wanted to ... but
   that doesn't stop Apache being interesting ;-)

Tux does not obviate Apache from the equation.
See my other emails.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:33               ` Dave Hansen
@ 2002-09-06 18:36                 ` David S. Miller
  2002-09-06 18:45                   ` Martin J. Bligh
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 18:36 UTC (permalink / raw)
  To: haveblue; +Cc: ak, Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

   From: Dave Hansen <haveblue@us.ibm.com>
   Date: Fri, 06 Sep 2002 11:33:10 -0700
   
   Actually, oprofile separated out the acenic module from the rest of the 
   kernel.  I should have included that breakout as well. but it was only 1.3 
   of CPU:
   1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o

We thought you were using e1000 in these tests?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:45                   ` Martin J. Bligh
@ 2002-09-06 18:43                     ` David S. Miller
  0 siblings, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-06 18:43 UTC (permalink / raw)
  To: Martin.Bligh; +Cc: haveblue, ak, hadi, tcw, linux-kernel, netdev, niv

   From: "Martin J. Bligh" <Martin.Bligh@us.ibm.com>
   Date: Fri, 06 Sep 2002 11:45:17 -0700

   >    Actually, oprofile separated out the acenic module from the rest of the 
   >    kernel.  I should have included that breakout as well. but it was only 1.3 
   >    of CPU:
   >    1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
   > 
   > We thought you were using e1000 in these tests?
   
   e1000 on the server, those profiles were client side.
   
Ok.  BTW acenic is packet rate limited by the speed of the
MIPS cpus on the card.

It might be instructive to disable HW checksumming in the
acenic driver and see what this does to your results.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:36                 ` David S. Miller
@ 2002-09-06 18:45                   ` Martin J. Bligh
  2002-09-06 18:43                     ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06 18:45 UTC (permalink / raw)
  To: David S. Miller, haveblue; +Cc: ak, hadi, tcw, linux-kernel, netdev, niv

>    Actually, oprofile separated out the acenic module from the rest of the 
>    kernel.  I should have included that breakout as well. but it was only 1.3 
>    of CPU:
>    1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
> 
> We thought you were using e1000 in these tests?

e1000 on the server, those profiles were client side.

M.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:51                 ` Martin J. Bligh
@ 2002-09-06 18:48                   ` David S. Miller
  2002-09-06 19:05                     ` Gerrit Huizenga
  2002-09-06 20:29                   ` Alan Cox
  1 sibling, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 18:48 UTC (permalink / raw)
  To: Martin.Bligh; +Cc: gh, hadi, tcw, linux-kernel, netdev, niv

   From: "Martin J. Bligh" <Martin.Bligh@us.ibm.com>
   Date: Fri, 06 Sep 2002 11:51:29 -0700
   
   I see no reason why turning on NAPI should make the Apache setup
   we have perform worse ... quite the opposite. Yes, we could use
   Tux, yes we'd get better results. But that's not the point ;-)

Of course.

I just don't want propaganda being spread that using Tux means you
lose any sort of web server functionality whatsoever.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:36               ` David S. Miller
@ 2002-09-06 18:51                 ` Martin J. Bligh
  2002-09-06 18:48                   ` David S. Miller
  2002-09-06 20:29                   ` Alan Cox
  0 siblings, 2 replies; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06 18:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: gh, hadi, tcw, linux-kernel, netdev, niv

>    The fact that we're doing something different from everyone else
>    and turning up a different set of kernel issues is a good thing, 
>    to my mind. You're right, we could use Tux if we wanted to ... but
>    that doesn't stop Apache being interesting ;-)
> 
> Tux does not obviate Apache from the equation.
> See my other emails.

That's not the point ... we're getting sidetracked here. The
point is: "is this a realistic-ish stick to beat the kernel
with and expect it to behave" ... I feel the answer is yes.

The secondary point is "what are customers doing in the field?"
(not what *should* they be doing ;-)). Moreover, I think the
Apache + Tux combination has been fairly well beaten on already
by other people in the past, though I'm sure it could be done
again.

I see no reason why turning on NAPI should make the Apache setup
we have perform worse ... quite the opposite. Yes, we could use
Tux, yes we'd get better results. But that's not the point ;-)

M.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:34             ` David S. Miller
@ 2002-09-06 18:57               ` Gerrit Huizenga
  2002-09-06 18:58                 ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Gerrit Huizenga @ 2002-09-06 18:57 UTC (permalink / raw)
  To: David S. Miller; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

In message <20020906.113448.07697441.davem@redhat.com>, "David S. Miller" writes:
>    From: Gerrit Huizenga <gh@us.ibm.com>
>    Date: Fri, 06 Sep 2002 11:19:11 -0700
>
> TUX can optimize dynamic content just fine.
> 
>    The last I knew was that it could pass it off to another server.

Out of curiosity, and primarily for my own edification, what kind
of optimization does it do when everything is generated by a java/
perl/python/homebrew script and pasted together by something which
consults a content manager.  In a few of the cases that I know of,
there isn't really any static content to cache...  And why is this
something that Apache couldn't/shouldn't be doing?

gerrit

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:57               ` Gerrit Huizenga
@ 2002-09-06 18:58                 ` David S. Miller
  2002-09-06 19:52                   ` Gerrit Huizenga
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 18:58 UTC (permalink / raw)
  To: gh; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

   From: Gerrit Huizenga <gh@us.ibm.com>
   Date: Fri, 06 Sep 2002 11:57:39 -0700

   Out of curiosity, and primarily for my own edification, what kind
   of optimization does it do when everything is generated by a java/
   perl/python/homebrew script and pasted together by something which
   consults a content manager.  In a few of the cases that I know of,
   there isn't really any static content to cache...  And why is this
   something that Apache couldn't/shouldn't be doing?

The kernel exec's the CGI process from the TUX server and pipes the
output directly into a networking socket.

Because it is cheaper to create a fresh user thread from within
the kernel (i.e. we don't have to fork() apache and thus dup its
address space), it is faster.
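
Roughly, a userspace analogy of that path would look like the sketch below
(illustrative only -- TUX does this from kernel context, and the function
and variable names here are made up):

#include <sys/socket.h>
#include <unistd.h>

/* Accept a connection and run a CGI binary with its stdout wired
 * straight to the socket, so the response bytes never pass through
 * the web server process itself. */
static void serve_cgi(int listen_fd, const char *cgi_path)
{
        int conn = accept(listen_fd, NULL, NULL);

        if (conn < 0)
                return;
        if (fork() == 0) {
                dup2(conn, STDOUT_FILENO);      /* CGI writes to the peer */
                close(conn);
                execl(cgi_path, cgi_path, (char *)NULL);
                _exit(1);                       /* exec failed */
        }
        close(conn);                            /* parent keeps accepting */
}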

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:05                     ` Gerrit Huizenga
@ 2002-09-06 19:01                       ` David S. Miller
  0 siblings, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-06 19:01 UTC (permalink / raw)
  To: gh; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

   From: Gerrit Huizenga <gh@us.ibm.com>
   Date: Fri, 06 Sep 2002 12:05:27 -0700
   
   So, any comments I made which might have implied that Tux/Tux2 made things
   worse have no substantiated data to prove that and it is quite possible
   that there is no such problem.  Also, this was run nearly a year ago and
   the state of Tux/Tux2 might have been a bit different at the time.

Thanks for clearing things up.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:48                   ` David S. Miller
@ 2002-09-06 19:05                     ` Gerrit Huizenga
  2002-09-06 19:01                       ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Gerrit Huizenga @ 2002-09-06 19:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

In message <20020906.114815.127906065.davem@redhat.com>, "David S. Miller" writes:
>    From: "Martin J. Bligh" <Martin.Bligh@us.ibm.com>
>    Date: Fri, 06 Sep 2002 11:51:29 -0700
>    
>    I see no reason why turning on NAPI should make the Apache setup
>    we have perform worse ... quite the opposite. Yes, we could use
>    Tux, yes we'd get better results. But that's not the point ;-)
> 
> Of course.
> 
> I just don't want propaganda being spread that using Tux means you
> lose any sort of web server functionality whatsoever.

Ah sorry - I never meant to imply that Tux was detrimental, other
than one case where it seemed to have no benefit and the performance
numbers while tuning for TPC-W *seemed* worse but were never analyzed
completely.  That was the actual event that I meant when I said:

	We also had some bad starts with using Tux in terms of performance
	and scalability on 4-CPU and 8-CPU machines, especially when
	combining with things like squid or other cacheing products from
	various third parties.

Those results were never quantified but for various reasons we had a
team that decided to take Tux out of the picture.  I think the problem
was more likely lack of knowledge and lack of time to do analysis on
the particular problems.  Another combination of solutions was used.

So, any comments I made which might have implied that Tux/Tux2 made things
worse have no substantiated data to prove that and it is quite possible
that there is no such problem.  Also, this was run nearly a year ago and
the state of Tux/Tux2 might have been a bit different at the time.

gerrit

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:26             ` Andi Kleen
  2002-09-06 18:31               ` John Levon
  2002-09-06 18:33               ` Dave Hansen
@ 2002-09-06 19:19               ` Nivedita Singhvi
  2002-09-06 19:21                 ` David S. Miller
                                   ` (2 more replies)
  2 siblings, 3 replies; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-06 19:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, netdev

Quoting Andi Kleen <ak@suse.de>:

> > c0106e59 42693    1.89176     restore_all
> > c01dfe68 42787    1.89592     sys_socketcall
> > c01df39c 54185    2.40097     sys_bind
> > c01de698 62740    2.78005     sockfd_lookup
> > c01372c8 97886    4.3374      fput
> > c022c110 125306   5.55239     __generic_copy_to_user
> > c01373b0 181922   8.06109     fget
> > c020958c 199054   8.82022     tcp_v4_get_port
> > c0106e10 199934   8.85921     system_call
> > c022c158 214014   9.48311     __generic_copy_from_user
> > c0216ecc 257768   11.4219     inet_bind
> 
> The profile looks bogus. The NIC driver is nowhere in sight.
> Normally its mmap IO for interrupts and device registers 
> should show. I would double check it (e.g. with normal profile)

Separately compiled acenic..

I'm surprised by this profile a bit too - on the client side,
since the requests are small, and the client is receiving
all those files, I would have thought that __generic_copy_to_user
would have been way higher than *from_user.

inet_bind() and tcp_v4_get_port() are up there because
we have to grab the socket lock, the tcp_portalloc_lock,
then the head chain lock and traverse the hash table
which now has many hundred entries. Also, because
of the varied length of the connections, the clients
get freed not in the same order they are allocated
a port, hence the fragmentation of the port space..
There is some cacheline thrashing hurting the NUMA
more than other systems here too..

If you just wanted to speed things up, you could get the
clients to specify ports instead of letting the kernel
cycle through for a free port..:)
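
For example, a client could bind an explicit local port itself before
connecting.  A minimal sketch (error handling and address setup mostly
elided; the function name is made up):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to 'server' using a caller-chosen source port instead of
 * letting the kernel's port rover pick one. */
static int connect_from_port(uint16_t local_port, const struct sockaddr_in *server)
{
        struct sockaddr_in local;
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
                return -1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(local_port);     /* explicit source port */
        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0 ||
            connect(fd, (const struct sockaddr *)server, sizeof(*server)) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}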

thanks,
Nivedita

> In case it is no bogus: 
> Most of these are either atomic_inc/dec of reference counters or
> some form of lock. The system_call could be the int 0x80 (using the
> SYSENTER patches would help), which also does atomic operations
> implicitely. restore_all is IRET, could also likely be speed up by
> using SYSEXIT.
> 
> If NAPI hurts here then it surely not because of eating CPU time.
> 
> -Andi
> 
> 





^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:19               ` Nivedita Singhvi
@ 2002-09-06 19:21                 ` David S. Miller
  2002-09-06 19:45                   ` Nivedita Singhvi
  2002-09-06 19:26                 ` Andi Kleen
  2002-09-06 19:45                 ` Martin J. Bligh
  2 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 19:21 UTC (permalink / raw)
  To: niv; +Cc: ak, linux-kernel, netdev

   From: Nivedita Singhvi <niv@us.ibm.com>
   Date: Fri,  6 Sep 2002 12:19:14 -0700
   
   inet_bind() and tcp_v4_get_port() are up there because
   we have to grab the socket lock, the tcp_portalloc_lock,
   then the head chain lock and traverse the hash table
   which has now many hundred entries. Also, because
   of the varied length of the connections, the clients
   get freed not in the same order they are allocated
   a port, hence the fragmentation of the port space..
   Tthere is some cacheline thrashing hurting the NUMA 
   more than other systems here too..
   
There are methods to eliminate the centrality of the
port allocation locking.

Basically, kill tcp_portalloc_lock and make the port rover be per-cpu.

The only tricky case is the "out of ports" situation.  Because there
is no centralized locking being used to serialize port allocation,
it is difficult to be sure that the port space is truly exhausted.

Another idea, which doesn't eliminate the tcp_portalloc_lock but
has other good SMP properties, is to apply a "cpu salt" to the
port rover value.  For example, shift the local cpu number into
the upper parts of a 'u16', then 'xor' that with tcp_port_rover.

Alexey and I have discussed this several times but never became
bored enough to experiment :-)
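
A minimal sketch of the salt idea (illustrative only, not actual kernel
code; the function name is invented):

/* Fold the local CPU number into the upper bits of the 16-bit rover so
 * that concurrent allocators on different CPUs probe different parts of
 * the port space (and therefore different hash chains). */
static unsigned short salted_port(unsigned short rover, int cpu)
{
        unsigned short salt = (unsigned short)(cpu << 12);

        return rover ^ salt;
}

/* e.g. candidate = salted_port(tcp_port_rover++, smp_processor_id()); */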

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:26                 ` Andi Kleen
@ 2002-09-06 19:24                   ` David S. Miller
  0 siblings, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-06 19:24 UTC (permalink / raw)
  To: ak; +Cc: niv, linux-kernel, netdev

   From: Andi Kleen <ak@suse.de>
   Date: Fri, 6 Sep 2002 21:26:19 +0200
   
   I'm not entirely sure it is worth it in this case. The locks are
   probably the majority of the cost.

You can more localize the lock accesses (since we use per-chain
locks) by applying a cpu salt to the port numbers you allocate.

See my other email.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:19               ` Nivedita Singhvi
  2002-09-06 19:21                 ` David S. Miller
@ 2002-09-06 19:26                 ` Andi Kleen
  2002-09-06 19:24                   ` David S. Miller
  2002-09-06 19:45                 ` Martin J. Bligh
  2 siblings, 1 reply; 102+ messages in thread
From: Andi Kleen @ 2002-09-06 19:26 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: Andi Kleen, linux-kernel, netdev

> If you just wanted to speed things up, you could get the
> clients to specify ports instead of letting the kernel
> cycle through for a free port..:)

Better would probably be to change the kernel to keep a limited
list of free ports in a free list. Then grabbing a free port would
be an O(1) operation.
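
A sketch of such a free list (purely illustrative; no locking shown, which
is of course the hard part):

#define PORT_POOL_SIZE 4096

static unsigned short free_ports[PORT_POOL_SIZE];   /* pre-filled with unused ports */
static int free_top;                                 /* entries currently on the stack */

static int get_free_port(void)
{
        if (free_top == 0)
                return -1;                   /* fall back to the slow rover scan */
        return free_ports[--free_top];       /* O(1) pop */
}

static void put_free_port(unsigned short port)
{
        if (free_top < PORT_POOL_SIZE)
                free_ports[free_top++] = port;       /* O(1) push on release */
}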

I'm not entirely sure it is worth it in this case. The locks are
probably the majority of the cost.

-Andi

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:21                 ` David S. Miller
@ 2002-09-06 19:45                   ` Nivedita Singhvi
  0 siblings, 0 replies; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-06 19:45 UTC (permalink / raw)
  To: David S. Miller; +Cc: ak, linux-kernel, netdev

Quoting "David S. Miller" <davem@redhat.com>:

> There are methods to eliminate the centrality of the
> port allocation locking.
> 
> Basically, kill tcp_portalloc_lock and make the port rover be
> per-cpu.

Aha! Exactly what I started to do quite a while ago..

> The only tricky case is the "out of ports" situation.  Because
> there is no centralized locking being used to serialize port
> allocation, it is difficult to be sure that the port space is truly
> exhausted.

I decided to use a stupid global flag to signal this.. It did become
messy and I didn't finalize everything. Then my day job
intervened :). Still hoping for spare time*5 to complete
this if no one comes up with something before then..

> Another idea, which doesn't eliminate the tcp_portalloc_lock but
> has other good SMP properties, is to apply a "cpu salt" to the
> port rover value.  For example, shift the local cpu number into
> the upper parts of a 'u16', then 'xor' that with tcp_port_rover.

nice..any patch extant? :)

thanks,
Nivedita





^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:19               ` Nivedita Singhvi
  2002-09-06 19:21                 ` David S. Miller
  2002-09-06 19:26                 ` Andi Kleen
@ 2002-09-06 19:45                 ` Martin J. Bligh
  2 siblings, 0 replies; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06 19:45 UTC (permalink / raw)
  To: Nivedita Singhvi, Andi Kleen; +Cc: linux-kernel, netdev

> Tthere is some cacheline thrashing hurting the NUMA 
> more than other systems here too..

There is no NUMA here ... the clients are 4 single node SMP 
systems. We're using the old quads to make them, but they're 
all split up, not linked together into one system.
Sorry if we didn't make that clear.

M.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:52                   ` Gerrit Huizenga
@ 2002-09-06 19:49                     ` David S. Miller
  2002-09-06 20:03                       ` Gerrit Huizenga
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 19:49 UTC (permalink / raw)
  To: gh; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

   From: Gerrit Huizenga <gh@us.ibm.com>
   Date: Fri, 06 Sep 2002 12:52:15 -0700
   
   So if apache were using a listen()/clone()/accept()/exec() combo rather than a
   full listen()/fork()/exec() model it would see most of the same benefits?

Apache would need to do some more, such as do something about
cpu affinity and do the non-blocking VFS tricks Tux does too.
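
For the affinity half of that, each worker would just need something along
these lines (a sketch using today's glibc interface; not what Tux or Apache
actually do):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process to a single CPU. */
static int pin_worker_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);   /* 0 == self */
}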

To be honest, I'm not going to sit here all day long and explain how
Tux works.  I'm not even too knowledgeable about the precise details of
its implementation.  Besides, the code is freely available and not
too complex, so you can go have a look for yourself :-)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:58                 ` David S. Miller
@ 2002-09-06 19:52                   ` Gerrit Huizenga
  2002-09-06 19:49                     ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Gerrit Huizenga @ 2002-09-06 19:52 UTC (permalink / raw)
  To: David S. Miller; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

In message <20020906.115804.109349169.davem@redhat.com>, "David S. Miller" writes:
>    From: Gerrit Huizenga <gh@us.ibm.com>
>    Date: Fri, 06 Sep 2002 11:57:39 -0700
> 
>    Out of curiosity, and primarily for my own edification, what kind
>    of optimization does it do when everything is generated by a java/
>    perl/python/homebrew script and pasted together by something which
>    consults a content manager.  In a few of the cases that I know of,
>    there isn't really any static content to cache...  And why is this
>    something that Apache couldn't/shouldn't be doing?
> 
> The kernel exec's the CGI process from the TUX server and pipes the
> output directly into a networking socket.
>
> Because it is cheaper to create a new fresh user thread from within
> the kernel (ie. we don't have to fork() apache and thus dup it's
> address space), it is faster.

So if apache were using a listen()/clone()/accept()/exec() combo rather than a
full listen()/fork()/exec() model it would see most of the same benefits?
Some additional overhead for the user/kernel syscall path but probably
pretty minor, right?

Or did I miss a piece of data, like the time to call clone() as a function
from in kernel is 2x or 10x more than the same syscall?
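
As a rough illustration of the distinction being asked about, a clone(2)
worker can share the parent's address space instead of duplicating it the
way fork() does (sketch only; stack handling and flags are simplified, and
the names are invented):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>

static int worker(void *conn)
{
        /* handle one connection here, then return */
        return 0;
}

static int spawn_worker(void *conn)
{
        const long stack_size = 64 * 1024;
        char *stack = malloc(stack_size);

        if (!stack)
                return -1;
        /* CLONE_VM shares the address space, so the parent's page tables
         * are not duplicated (not even COW-style) as they would be on fork(). */
        return clone(worker, stack + stack_size,
                     CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, conn);
}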

gerrit

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:49                     ` David S. Miller
@ 2002-09-06 20:03                       ` Gerrit Huizenga
  0 siblings, 0 replies; 102+ messages in thread
From: Gerrit Huizenga @ 2002-09-06 20:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv

In message <20020906.124936.34476547.davem@redhat.com>, "David S. Miller" writes:
>    From: Gerrit Huizenga <gh@us.ibm.com>
>    Date: Fri, 06 Sep 2002 12:52:15 -0700
>    
>    So if apache were using a listen()/clone()/accept()/exec() combo rather than a
>    full listen()/fork()/exec() model it would see most of the same benefits?
> 
> Apache would need to do some more, such as do something about
> cpu affinity and do the non-blocking VFS tricks Tux does too.
> 
> To be honest, I'm not going to sit here all day long and explain how
> Tux works.  I'm not even too knowledgable about the precise details of
> it's implementation.  Besides, the code is freely available and not
> too complex, so you can go have a look for yourself :-)

Aw, and you are such a good tutor, too.  :-)  But thanks - my particular
goal isn't to fix apache since there is already a group of folks working
on that, but as we look at kernel traces, this should give us a good
idea if we are at the bottleneck of the apache architecture or if we
have other kernel bottlenecks.  At the moment, the latter seems to be
true, and I think we have some good data from Troy and Dave to validate
that.  I think we have already seen the affinity problem or at least
talked about it as that was somewhat visible and Apache 2.0 does seem
to have some solutions for helping with that.  And when the kernel does
the best it can with Apache's architecture, we have more data to convince
them to fix the architecture problems.

thanks again!

gerrit

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:51                 ` Martin J. Bligh
  2002-09-06 18:48                   ` David S. Miller
@ 2002-09-06 20:29                   ` Alan Cox
  1 sibling, 0 replies; 102+ messages in thread
From: Alan Cox @ 2002-09-06 20:29 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: David S. Miller, gh, hadi, tcw, linux-kernel, netdev, niv

On Fri, 2002-09-06 at 19:51, Martin J. Bligh wrote:
> The secondary point is "what are customers doing in the field?"
> (not what *should* they be doing ;-)). Moreover, I think the
> Apache + Tux combination has been fairly well beaten on already
> by other people in the past, though I'm sure it could be done
> again.

Tux has been proven in the field. A glance at some of the interesting
porn domain names using it would show that 8)


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  6:48     ` Martin J. Bligh
                         ` (2 preceding siblings ...)
  2002-09-06 17:26       ` Gerrit Huizenga
@ 2002-09-06 23:48       ` Troy Wilson
  2002-09-11  9:11       ` Eric W. Biederman
  4 siblings, 0 replies; 102+ messages in thread
From: Troy Wilson @ 2002-09-06 23:48 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: David S. Miller, hadi, linux-kernel, netdev, Nivedita Singhvi

> > It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
> > is limited by the DMA throughput of the PCI host controller.  In
> > particular some controllers are limited to smaller DMA bursts to
> > work around hardware bugs.
> 
> I'm not sure that's entirely true in this case - the Netfinity
> 8500R is slightly unusual in that it has 3 or 4 PCI buses, and
> there's 4 - 8 gigabit ethernet cards in this beast spread around
> different buses (Troy - are we still just using 4? 


  My machine is not exactly an 8500r.  It's an Intel pre-release
engineering sample (8-way 900MHz PIII) box that is similar to an 
8500r... there are some differences when going across the coherency
filter (the bus that ties the two 4-way "halves" of the machine 
together).  Bill Hartner has a test program that illustrates the 
differences-- but more on that later.

  I've got 4 PCI busses, two 33 MHz, and two 66MHz, all 64-bit.
I'm configured as follows:

  PCI Bus 0     eth1 ---  3 clients
   33 MHz       eth2 ---  Not in use


  PCI Bus 1     eth3 ---  2 clients
   33 MHz       eth4 ---  Not in use


  PCI Bus 3     eth5 ---  6 clients
   66 MHz       eth6 ---  Not in use


  PCI Bus 4     eth7 ---  6 clients
   66 MHz       eth8 ---  Not in use


> ... and what's
> the raw bandwidth of data we're pushing? ... it's not huge). 

  2900 simultaneous connections, each at ~320 kbps translates to
928000 kbps, which is slightly less than the full bandwidth of a 
single e1000.  We're spreading that over 4 adapters, and 4 busses.

- Troy



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 23:56   ` Troy Wilson
@ 2002-09-06 23:52     ` David S. Miller
  2002-09-07  0:18     ` Nivedita Singhvi
  1 sibling, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-06 23:52 UTC (permalink / raw)
  To: tcw; +Cc: hadi, linux-kernel, netdev

   From: Troy Wilson <tcw@tempest.prismnet.com>
   Date: Fri, 6 Sep 2002 18:56:04 -0500 (CDT)

       4241408 segments retransmited

Is hw flow control being negotiated and enabled properly on the
gigabit interfaces?

There should be no reason for these kinds of retransmits to
happen.
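
One quick way to check what the driver negotiated, assuming it implements
the ethtool pause-parameter ioctl (the ethtool utility's "-a" option reports
the same thing); this is only a sketch with minimal error handling:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Print the negotiated pause (flow control) settings for one interface. */
static void show_pause(const char *ifname)
{
        struct ethtool_pauseparam pp = { .cmd = ETHTOOL_GPAUSEPARAM };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
                return;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&pp;
        if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
                printf("%s: autoneg=%u rx_pause=%u tx_pause=%u\n",
                       ifname, pp.autoneg, pp.rx_pause, pp.tx_pause);
        close(fd);
}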

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-05 20:59 ` jamal
                     ` (2 preceding siblings ...)
  2002-09-06  3:47   ` David S. Miller
@ 2002-09-06 23:56   ` Troy Wilson
  2002-09-06 23:52     ` David S. Miller
  2002-09-07  0:18     ` Nivedita Singhvi
  3 siblings, 2 replies; 102+ messages in thread
From: Troy Wilson @ 2002-09-06 23:56 UTC (permalink / raw)
  To: jamal; +Cc: linux-kernel, netdev

> Do you have any stats from the hardware that could show
> retransmits etc;

**********************************
* netstat -s before the workload *
**********************************

Ip:
    433 total packets received
    0 forwarded
    0 incoming packets discarded
    409 incoming packets delivered
    239 requests sent out
Icmp:
    24 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 24
    24 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 24
Tcp:
    0 active connections openings
    2 passive connection openings
    0 failed connection attempts
    0 connection resets received
    2 connections established
    300 segments received
    183 segments send out
    0 segments retransmited
    0 bad segments received.
    2 resets sent
Udp:
    8 packets received
    24 packets to unknown port received.
    0 packet receive errors
    32 packets sent
TcpExt:
    ArpFilter: 0
    5 delayed acks sent
    4 packets directly queued to recvmsg prequeue.
    35 packets header predicted
    TCPPureAcks: 5
    TCPHPAcks: 160
    TCPRenoRecovery: 0
    TCPSackRecovery: 0
    TCPSACKReneging: 0
    TCPFACKReorder: 0
    TCPSACKReorder: 0
    TCPRenoReorder: 0
    TCPTSReorder: 0
    TCPFullUndo: 0
    TCPPartialUndo: 0
    TCPDSACKUndo: 0
    TCPLossUndo: 0
    TCPLoss: 0
    TCPLostRetransmit: 0
    TCPRenoFailures: 0
    TCPSackFailures: 0
    TCPLossFailures: 0
    TCPFastRetrans: 0
    TCPForwardRetrans: 0
    TCPSlowStartRetrans: 0
    TCPTimeouts: 0
    TCPRenoRecoveryFail: 0
    TCPSackRecoveryFail: 0
    TCPSchedulerFailed: 0
    TCPRcvCollapsed: 0
    TCPDSACKOldSent: 0
    TCPDSACKOfoSent: 0
    TCPDSACKRecv: 0
    TCPDSACKOfoRecv: 0
    TCPAbortOnSyn: 0
    TCPAbortOnData: 0
    TCPAbortOnClose: 0
    TCPAbortOnMemory: 0
    TCPAbortOnTimeout: 0
    TCPAbortOnLinger: 0
    TCPAbortFailed: 0
    TCPMemoryPressures: 0

*********************************
* netstat -s after the workload *
*********************************

Ip:
    425317106 total packets received
    3648 forwarded
    0 incoming packets discarded
    425313332 incoming packets delivered
    203629600 requests sent out
Icmp:
    58 ICMP messages received
    12 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 58
    58 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 58
Tcp:
    64 active connections openings
    16690445 passive connection openings
    56552 failed connection attempts
    0 connection resets received
    3 connections established
    425311551 segments received
    203629500 segments send out
    4241408 segments retransmited
    0 bad segments received.
    298883 resets sent
Udp:
    8 packets received
    34 packets to unknown port received.
    0 packet receive errors
    42 packets sent
TcpExt:
    ArpFilter: 0
    8884840 TCP sockets finished time wait in fast timer
    12913162 delayed acks sent
    17292 delayed acks further delayed because of locked socket
    Quick ack mode was activated 102351 times
    54977 times the listen queue of a socket overflowed
    54977 SYNs to LISTEN sockets ignored
    157 packets directly queued to recvmsg prequeue.
    51 packets directly received from prequeue
    16925947 packets header predicted
    51 packets header predicted and directly queued to user
    TCPPureAcks: 169071816
    TCPHPAcks: 176510836
    TCPRenoRecovery: 30090
    TCPSackRecovery: 0
    TCPSACKReneging: 0
    TCPFACKReorder: 0
    TCPSACKReorder: 0
    TCPRenoReorder: 464
    TCPTSReorder: 5
    TCPFullUndo: 6
    TCPPartialUndo: 29
    TCPDSACKUndo: 0
    TCPLossUndo: 1
    TCPLoss: 0
    TCPLostRetransmit: 0
    TCPRenoFailures: 218884
    TCPSackFailures: 0
    TCPLossFailures: 35561
    TCPFastRetrans: 145529
    TCPForwardRetrans: 0
    TCPSlowStartRetrans: 3463096
    TCPTimeouts: 373473
    TCPRenoRecoveryFail: 1221
    TCPSackRecoveryFail: 0
    TCPSchedulerFailed: 0
    TCPRcvCollapsed: 0
    TCPDSACKOldSent: 0
    TCPDSACKOfoSent: 0
    TCPDSACKRecv: 1
    TCPDSACKOfoRecv: 0
    TCPAbortOnSyn: 0
    TCPAbortOnData: 0
    TCPAbortOnClose: 0
    TCPAbortOnMemory: 0
    TCPAbortOnTimeout: 0
    TCPAbortOnLinger: 0
    TCPAbortFailed: 0
    TCPMemoryPressures: 0






^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  3:38       ` Nivedita Singhvi
  2002-09-06  3:58         ` David S. Miller
@ 2002-09-07  0:05         ` Troy Wilson
  1 sibling, 0 replies; 102+ messages in thread
From: Troy Wilson @ 2002-09-07  0:05 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: jamal, linux-kernel, netdev


> ifconfig -a and netstat -rn would also be nice to have..

  These counters may have wrapped over the course of the full-length
( 3 x 20 minute runs + 20 minute warmup + rampup + rampdown) SPECWeb run.


*******************************
* ifconfig -a before workload *
*******************************

eth0      Link encap:Ethernet  HWaddr 00:04:AC:23:5E:99  
          inet addr:9.3.192.209  Bcast:9.3.192.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:208 errors:0 dropped:0 overruns:0 frame:0
          TX packets:104 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:22562 (22.0 Kb)  TX bytes:14356 (14.0 Kb)
          Interrupt:50 Base address:0x2000 Memory:fe180000-fe180038 

eth1      Link encap:Ethernet  HWaddr 00:02:B3:9C:F5:9E  
          inet addr:192.168.4.1  Bcast:192.168.4.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:10 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:5940 (5.8 Kb)  TX bytes:256 (256.0 b)
          Interrupt:61 Base address:0x1200 Memory:fc020000-0 

eth2      Link encap:Ethernet  HWaddr 00:02:B3:A8:35:C1  
          inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:54 Base address:0x1220 Memory:fc060000-0 

eth3      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:E7  
          inet addr:192.168.3.1  Bcast:192.168.3.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:44 Base address:0x2040 Memory:fe120000-0 

eth4      Link encap:Ethernet  HWaddr 00:02:B3:A3:46:F9  
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:784 (784.0 b)  TX bytes:256 (256.0 b)
          Interrupt:36 Base address:0x2060 Memory:fe160000-0 

eth5      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:88  
          inet addr:192.168.5.1  Bcast:192.168.5.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:32 Base address:0x3000 Memory:fe420000-0 

eth6      Link encap:Ethernet  HWaddr 00:02:B3:9C:F5:A0  
          inet addr:192.168.6.1  Bcast:192.168.6.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:64 (64.0 b)  TX bytes:256 (256.0 b)
          Interrupt:28 Base address:0x3020 Memory:fe460000-0 

eth7      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:39  
          inet addr:192.168.7.1  Bcast:192.168.7.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:24 Base address:0x4000 Memory:fe820000-0 

eth8      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:87  
          inet addr:192.168.8.1  Bcast:192.168.8.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:20 Base address:0x4020 Memory:fe860000-0 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:56 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:5100 (4.9 Kb)  TX bytes:5100 (4.9 Kb)


******************************
* ifconfig -a after workload *
******************************

eth0      Link encap:Ethernet  HWaddr 00:04:AC:23:5E:99  
          inet addr:9.3.192.209  Bcast:9.3.192.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3434 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1408 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:336578 (328.6 Kb)  TX bytes:290474 (283.6 Kb)
          Interrupt:50 Base address:0x2000 Memory:fe180000-fe180038 

eth1      Link encap:Ethernet  HWaddr 00:02:B3:9C:F5:9E  
          inet addr:192.168.4.1  Bcast:192.168.4.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:74893662 errors:3 dropped:3 overruns:0 frame:0
          TX packets:100464074 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:1286843881 (1227.2 Mb)  TX bytes:2106085286 (2008.5 Mb)
          Interrupt:61 Base address:0x1200 Memory:fc020000-0 

eth2      Link encap:Ethernet  HWaddr 00:02:B3:A8:35:C1  
          inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:54 Base address:0x1220 Memory:fc060000-0 

eth3      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:E7  
          inet addr:192.168.3.1  Bcast:192.168.3.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:50054881 errors:0 dropped:0 overruns:0 frame:0
          TX packets:67122955 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:3730406436 (3557.5 Mb)  TX bytes:3034087396 (2893.5 Mb)
          Interrupt:44 Base address:0x2040 Memory:fe120000-0 

eth4      Link encap:Ethernet  HWaddr 00:02:B3:A3:46:F9  
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:48 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:7342 (7.1 Kb)  TX bytes:256 (256.0 b)
          Interrupt:36 Base address:0x2060 Memory:fe160000-0 

eth5      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:88  
          inet addr:192.168.5.1  Bcast:192.168.5.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:149206960 errors:2861 dropped:2861 overruns:0 frame:0
          TX packets:200247016 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:2530107402 (2412.8 Mb)  TX bytes:3331495154 (3177.1 Mb)
          Interrupt:32 Base address:0x3000 Memory:fe420000-0 

eth6      Link encap:Ethernet  HWaddr 00:02:B3:9C:F5:A0  
          inet addr:192.168.6.1  Bcast:192.168.6.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:13 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:832 (832.0 b)  TX bytes:640 (640.0 b)
          Interrupt:28 Base address:0x3020 Memory:fe460000-0 

eth7      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:39  
          inet addr:192.168.7.1  Bcast:192.168.7.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:151162569 errors:2993 dropped:2993 overruns:0 frame:0
          TX packets:202895482 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:2673954954 (2550.0 Mb)  TX bytes:2456469394 (2342.6 Mb)
          Interrupt:24 Base address:0x4000 Memory:fe820000-0 

eth8      Link encap:Ethernet  HWaddr 00:02:B3:A3:47:87  
          inet addr:192.168.8.1  Bcast:192.168.8.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:0 (0.0 b)  TX bytes:256 (256.0 b)
          Interrupt:20 Base address:0x4020 Memory:fe860000-0 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:100 errors:0 dropped:0 overruns:0 frame:0
          TX packets:100 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:8696 (8.4 Kb)  TX bytes:8696 (8.4 Kb)


***************
* netstat -rn *
***************

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.7.0     0.0.0.0         255.255.255.0   U        40 0          0 eth7
192.168.6.0     0.0.0.0         255.255.255.0   U        40 0          0 eth6
192.168.5.0     0.0.0.0         255.255.255.0   U        40 0          0 eth5
192.168.4.0     0.0.0.0         255.255.255.0   U        40 0          0 eth1
192.168.3.0     0.0.0.0         255.255.255.0   U        40 0          0 eth3
192.168.2.0     0.0.0.0         255.255.255.0   U        40 0          0 eth2
192.168.1.0     0.0.0.0         255.255.255.0   U        40 0          0 eth4
9.3.192.0       0.0.0.0         255.255.255.0   U        40 0          0 eth0
192.168.8.0     0.0.0.0         255.255.255.0   U        40 0          0 eth8
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo
0.0.0.0         9.3.192.1       0.0.0.0         UG       40 0          0 eth0







^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 23:56   ` Troy Wilson
  2002-09-06 23:52     ` David S. Miller
@ 2002-09-07  0:18     ` Nivedita Singhvi
  2002-09-07  0:27       ` Troy Wilson
  1 sibling, 1 reply; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-07  0:18 UTC (permalink / raw)
  To: Troy Wilson; +Cc: jamal, linux-kernel, netdev

Quoting Troy Wilson <tcw@tempest.prismnet.com>:

> > Do you have any stats from the hardware that could show
> > retransmits etc;

Troy,

Are tcp_sack, tcp_fack, tcp_dsack turned on?

thanks,
Nivedita



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-07  0:18     ` Nivedita Singhvi
@ 2002-09-07  0:27       ` Troy Wilson
  0 siblings, 0 replies; 102+ messages in thread
From: Troy Wilson @ 2002-09-07  0:27 UTC (permalink / raw)
  To: Nivedita Singhvi; +Cc: jamal, linux-kernel, netdev


> Are tcp_sack, tcp_fack, tcp_dsack turned on?

  tcp_fack and tcp_dsack are on, tcp_sack is off.

- Troy



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  6:48     ` Martin J. Bligh
                         ` (3 preceding siblings ...)
  2002-09-06 23:48       ` Troy Wilson
@ 2002-09-11  9:11       ` Eric W. Biederman
  2002-09-11 14:10         ` Martin J. Bligh
  4 siblings, 1 reply; 102+ messages in thread
From: Eric W. Biederman @ 2002-09-11  9:11 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, Nivedita Singhvi

"Martin J. Bligh" <Martin.Bligh@us.ibm.com> writes:

> > Ie. the headers that don't need to go across the bus are the critical
> > resource saved by TSO.
> 
> I'm not sure that's entirely true in this case - the Netfinity
> 8500R is slightly unusual in that it has 3 or 4 PCI buses, and
> there's 4 - 8 gigabit ethernet cards in this beast spread around
> different buses (Troy - are we still just using 4? ... and what's
> the raw bandwidth of data we're pushing? ... it's not huge). 
> 
> I think we're CPU limited (there's no idle time on this machine), 
> which is odd for an 8 CPU 900MHz P3 Xeon,

Quite possibly.  The P3 has roughly 800MB/s of FSB bandwidth, which must
be used for both I/O and memory accesses.  So just driving a gige card at
wire speed takes a considerable portion of the cpu's capacity.

On analyzing this kind of thing I usually find it quite helpful to
compute what the hardware can theoretically do, to get a feel for where
the bottlenecks should be.
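
As a rough back-of-envelope version of that exercise (theoretical peaks
only): a 64-bit/33MHz PCI bus tops out around 266MB/s and a 64-bit/66MHz
bus around 533MB/s, a 100-133MHz P3 front-side bus around 800-1066MB/s,
while ~2900 SPECWeb99 connections at ~320kbps amount to only about 116MB/s
of payload before headers.  On paper the buses have headroom, which is why
the CPUs end up being the interesting suspect.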

Eric

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-11  9:11       ` Eric W. Biederman
@ 2002-09-11 14:10         ` Martin J. Bligh
  2002-09-11 15:06           ` Eric W. Biederman
  0 siblings, 1 reply; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-11 14:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, Nivedita Singhvi

>> > Ie. the headers that don't need to go across the bus are the critical
>> > resource saved by TSO.
>> 
>> I'm not sure that's entirely true in this case - the Netfinity
>> 8500R is slightly unusual in that it has 3 or 4 PCI buses, and
>> there's 4 - 8 gigabit ethernet cards in this beast spread around
>> different buses (Troy - are we still just using 4? ... and what's
>> the raw bandwidth of data we're pushing? ... it's not huge). 
>> 
>> I think we're CPU limited (there's no idle time on this machine), 
>> which is odd for an 8 CPU 900MHz P3 Xeon,
> 
> Quite possibly.  The P3 has roughly an 800MB/s FSB bandwidth, that must
> be used for both I/O and memory accesses.  So just driving a gige card at
> wire speed takes a considerable portion of the cpus capacity.  
> 
> On analyzing this kind of thing I usually find it quite helpful to
> compute what the hardware can theoretically to get a feel where the
> bottlenecks should be.

We can push about 420MB/s of IO out of this thing (out of that 
theoretical 800Mb/s). Specweb is only pushing about 120MB/s of
total data through it, so it's not bus limited in this case.
Of course, I should have given you that data to start with, 
but ... ;-)

M.

PS. This thing actually has 3 system buses, 1 for each of the two
sets of 4 CPUs, and 1 for all the PCI buses, and the three buses
are joined by an interconnect in the middle. But all the IO goes
through 1 of those buses, so for the purposes of this discussion,
it makes no difference whatsoever ;-)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-11 14:10         ` Martin J. Bligh
@ 2002-09-11 15:06           ` Eric W. Biederman
  2002-09-11 15:15             ` David S. Miller
  2002-09-11 15:27             ` Martin J. Bligh
  0 siblings, 2 replies; 102+ messages in thread
From: Eric W. Biederman @ 2002-09-11 15:06 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, Nivedita Singhvi

"Martin J. Bligh" <mbligh@aracnet.com> writes:

> >> > Ie. the headers that don't need to go across the bus are the critical
> >> > resource saved by TSO.
> >> 
> >> I'm not sure that's entirely true in this case - the Netfinity
> >> 8500R is slightly unusual in that it has 3 or 4 PCI buses, and
> >> there's 4 - 8 gigabit ethernet cards in this beast spread around
> >> different buses (Troy - are we still just using 4? ... and what's
> >> the raw bandwidth of data we're pushing? ... it's not huge). 
> >> 
> >> I think we're CPU limited (there's no idle time on this machine), 
> >> which is odd for an 8 CPU 900MHz P3 Xeon,
> > 
> > Quite possibly.  The P3 has roughly an 800MB/s FSB bandwidth, that must
> > be used for both I/O and memory accesses.  So just driving a gige card at
> > wire speed takes a considerable portion of the cpus capacity.  
> > 
> > On analyzing this kind of thing I usually find it quite helpful to
> > compute what the hardware can theoretically to get a feel where the
> > bottlenecks should be.
> 
> We can push about 420MB/s of IO out of this thing (out of that 
> theoretical 800Mb/s). 

Sounds about average for a P3.  I have pushed the full 800MiB/s out of
a P3 processor to memory but it was a very optimized loop.  Is
that 420MB/sec of IO on this test?
 
> Specweb is only pushing about 120MB/s of
> total data through it, so it's not bus limited in this case.

Not quite.  But you suck at least 240MB/s of your memory bandwidth with
DMA from disk, and then DMA to the nic.  Unless there is a highly
cached component.  So I doubt you can effectively use more than 1 gige
card, maybe 2.  And you have 8?

> Of course, I should have given you that data to start with, 
> but ... ;-)
>
> PS. This thing actually has 3 system buses, 1 for each of the two
> sets of 4 CPUs, and 1 for all the PCI buses, and the three buses
> are joined by an interconnect in the middle. But all the IO goes
> through 1 of those buses, so for the purposes of this discussion,
> it makes no difference whatsoever ;-)

Wow, the hardware designers really believed in over-subscription.
If the busses are just running 64bit/33MHz you are oversubscribed.
And at 64bit/66MHz the pci busses can easily swamp the system:
533*4 ~= 2128MB/s.

What kind of memory bandwidth does the system have, and on which
bus are the memory controllers?  I'm just curious.

Eric

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-11 15:06           ` Eric W. Biederman
@ 2002-09-11 15:15             ` David S. Miller
  2002-09-11 15:31               ` Eric W. Biederman
  2002-09-11 15:27             ` Martin J. Bligh
  1 sibling, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-11 15:15 UTC (permalink / raw)
  To: ebiederm; +Cc: mbligh, hadi, tcw, linux-kernel, netdev, niv

   From: ebiederm@xmission.com (Eric W. Biederman)
   Date: 11 Sep 2002 09:06:36 -0600

   "Martin J. Bligh" <mbligh@aracnet.com> writes:
   
   > We can push about 420MB/s of IO out of this thing (out of that 
   > theoretical 800Mb/s). 
   
   Sounds about average for a P3.  I have pushed the full 800MiB/s out of
   a P3 processor to memory but it was a very optimized loop.

You pushed that over the PCI bus of your P3?  Just to RAM
doesn't count, lots of cpu's can do that.

That's what makes his number interesting.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-11 15:06           ` Eric W. Biederman
  2002-09-11 15:15             ` David S. Miller
@ 2002-09-11 15:27             ` Martin J. Bligh
  1 sibling, 0 replies; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-11 15:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David S. Miller, hadi, tcw, linux-kernel, netdev, Nivedita Singhvi

> Sounds about average for a P3.  I have pushed the full 800MiB/s out of
> a P3 processor to memory but it was a very optimized loop.  Is
> that 420MB/sec of IO on this test?

Yup, Fibre channel disks. So we know we can push at least that.

> Note quite.  But you suck at least 240MB/s of your memory bandwidth with
> DMA from disk, and then DMA to the nic.  Unless there is a highly
> cached component.  So I doubt you can effectively use more than 1 gige
> card, maybe 2.  And you have 8?

Nope, it's operating totally out of pagecache, there's no real disk 
IO to speak of.

> Wow the hardware designers really believed in over-subscription.
> If the busses are just running 64bit/33Mhz you are oversubscribed.
> And at 64bit/66Mhz the pci busses can easily swamp the system 
> 533*4 ~= 2128MB/s. 

Two 32bit buses (or maybe it was just one) and two 64bit buses,
all at 66MHz. Yes, the PCI buses can push more than the backplane,
but things are never perfectly balanced in reality, so I'd prefer
it that way around ... it's not a perfect system, but hey, it's
Intel hardware - this is high volume market, not real high end ;-)

> What kind of memory bandwidth does the system have, and on which
> bus are the memory controllers?  I'm just curious  

Memory controllers are hung off the interconnect, slightly difficult
to describe. Look for docs on the Intel profusion chipset, or I can
send you a powerpoint (yeah, yeah) presentation when I get into work
later today if you can't find it. Theoretical mem bandwidth should
be 1600MB/s if you're balanced across the CPUs, in practice I'd
expect to be able to push somewhat over 800Mb/s.

M.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-11 15:15             ` David S. Miller
@ 2002-09-11 15:31               ` Eric W. Biederman
  0 siblings, 0 replies; 102+ messages in thread
From: Eric W. Biederman @ 2002-09-11 15:31 UTC (permalink / raw)
  To: David S. Miller; +Cc: mbligh, hadi, tcw, linux-kernel, netdev, niv

"David S. Miller" <davem@redhat.com> writes:

>    From: ebiederm@xmission.com (Eric W. Biederman)
>    Date: 11 Sep 2002 09:06:36 -0600
> 
>    "Martin J. Bligh" <mbligh@aracnet.com> writes:
>    
>    > We can push about 420MB/s of IO out of this thing (out of that 
>    > theoretical 800Mb/s). 
>    
>    Sounds about average for a P3.  I have pushed the full 800MiB/s out of
>    a P3 processor to memory but it was a very optimized loop.
> 
> You pushed that over the PCI bus of your P3?  Just to RAM
> doesn't count, lots of cpu's can do that.
> 
> That's what makes his number interesting.

I agree. Getting 420MB/s to the pci bus is nice, especially with a P3.  
The 800MB/s to memory was just the test I happened to conduct about 2 years
ago when I was still messing with slow P3 systems.  It was a proof of
concept test to see if we could plug in an I/O card into a memory
slot.  

On a current P4 system with the E7500 chipset this kind of thing is
easy.  I have gotten roughly 450MB/s to a single myrinet card.  And there 
is enough theoretical bandwidth to do 4 times that.  I haven't had a
chance to get it working in practice.  When I attempted to run two gige
cards simultaneously I had some weird problem (probably interrupt
related) where adding additional pci cards did not deliver any extra
performance.  

On a P3 to get writes from the cpu to hit 800MB/s you use the special
cpu instructions that bypass the cache.  
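
A sketch of that kind of loop, using the SSE streaming stores a P3 supports
(illustrative only; it assumes 16-byte-aligned buffers and a length that is
a multiple of 64 bytes):

#include <stddef.h>
#include <xmmintrin.h>

/* Copy with non-temporal stores so the destination never pollutes the
 * cache, which is how a P3 can approach its theoretical FSB bandwidth. */
static void stream_copy(float *dst, const float *src, size_t n_floats)
{
        size_t i;

        for (i = 0; i + 16 <= n_floats; i += 16) {
                __m128 a = _mm_load_ps(src + i);
                __m128 b = _mm_load_ps(src + i + 4);
                __m128 c = _mm_load_ps(src + i + 8);
                __m128 d = _mm_load_ps(src + i + 12);

                _mm_stream_ps(dst + i,      a);
                _mm_stream_ps(dst + i + 4,  b);
                _mm_stream_ps(dst + i + 8,  c);
                _mm_stream_ps(dst + i + 12, d);
        }
        _mm_sfence();   /* make the streamed stores globally visible */
}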

My point was that I have tested the P3 bus in question and I achieved
a real world 800MB/s over it.  So I expect that on the system in
question unless another bottleneck is hit, it should be possible to
achieve a real world 800MB/s of I/O.  There are enough pci busses
to support that kind of traffic.

Unless the memory controller is carefully placed on the system, though,
doing 400+MB/s could easily eat up most of the available memory
bandwidth and reduce the system to doing some very slow cache line fills.

Eric

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06  3:47   ` David S. Miller
  2002-09-06  6:48     ` Martin J. Bligh
@ 2002-09-12  7:28     ` Todd Underwood
  2002-09-12 12:30       ` jamal
  2002-09-12 17:18       ` Nivedita Singhvi
  1 sibling, 2 replies; 102+ messages in thread
From: Todd Underwood @ 2002-09-12  7:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, tcw, linux-kernel, netdev

folx,

sorry for the late reply.  catching up on kernel mail.

so all this TSO stuff looks v. v. similar to the IP-only fragmentation 
that patricia gilfeather and i implemented on alteon acenics a couple of 
years ago (see http://www.cs.unm.edu/~maccabe/SSL/frag/FragPaper1/ for a 
general overview).  it's exciting to see someone else take a stab on
different hardware and approach some of the tcp-specific issues.

the main difference, though, is that general purpose kernel development
is still focused on the improvements in *sending* speed.  for real high
performance networking, the improvements are necessary in *receiving* cpu 
utilization, in our estimation. (see our analysis of interrupt overhead 
and the effect on receivers at gigabit speeds--i hope that this has become 
common understanding by now)

i guess i can't disagree with david miller that the improvements in TSO are
due entirely to header retransmission for sending, but that's only because 
sending wasn't CPU-intensive in the first place.  we were able to get a 
significant reduction in receiver cpu-utilization by reassembling IP 
fragments on the receiver side (sort of a standards-based interrupt 
mitigation strategy that has the benefit of not increasing latency the way 
interrupt coalescing does).

anyway, nice work, 

t.

On Thu, 5 Sep 2002, David S. Miller wrote:

> It's the DMA bandwidth saved, most of the specweb runs on x86 hardware
> is limited by the DMA throughput of the PCI host controller.  In
> particular some controllers are limited to smaller DMA bursts to
> work around hardware bugs.
> 
> Ie. the headers that don't need to go across the bus are the critical
> resource saved by TSO.
> 
> I think I've said this a million times, perhaps the next person who
> tries to figure out where the gains come from can just reply with
> a pointer to a URL of this email I'm typing right now :-)

-- 
todd underwood, vp & cto
oso grande technologies, inc.
todd@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-12  7:28     ` Todd Underwood
@ 2002-09-12 12:30       ` jamal
  2002-09-12 13:57         ` Todd Underwood
  2002-09-12 23:12         ` David S. Miller
  2002-09-12 17:18       ` Nivedita Singhvi
  1 sibling, 2 replies; 102+ messages in thread
From: jamal @ 2002-09-12 12:30 UTC (permalink / raw)
  To: Todd Underwood; +Cc: David S. Miller, tcw, linux-kernel, netdev



Good work. This is the first time i have seen someone say Linux's way of
reverse-order fragmentation is a GoodThing(tm). It was also great to see
the de-mything of some of the old assumptions of the world.

BTW, TSO is not as intelligent as what you are suggesting.
If i am not mistaken, you are not only suggesting fragmentation and
reassembly at that level, you are also suggesting retransmits at the NIC.
This could be dangerous for practical reasons (changes in TCP congestion
control algorithms etc). TSO, as was pointed out in earlier emails, is just a
dumb sender of packets. I think even fragmentation is a misnomer.
Essentially you shove a huge buffer at the NIC and it breaks it into MTU
sized packets for you and sends them.
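
To make that concrete, here is roughly what it looks like from a driver's
hard_start_xmit(), as a sketch only: mydev_setup_tso_context() and
mydev_queue_tx() are made-up placeholders, and the tso_size field name is
from my memory of the 2.5 tree, so treat it as an assumption:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical per-driver helpers */
static void mydev_setup_tso_context(struct net_device *dev,
				    struct sk_buff *skb, unsigned int mss);
static int mydev_queue_tx(struct net_device *dev, struct sk_buff *skb);

/* one oversized TCP segment comes down; if the stack marked it with a
 * non-zero MSS, tell the hardware to cut it into MTU sized frames and
 * replicate/fix up the headers itself */
static int mydev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	unsigned int mss = skb_shinfo(skb)->tso_size;

	if (mss)
		mydev_setup_tso_context(dev, skb, mss);

	return mydev_queue_tx(dev, skb);
}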

In regards to the receive side CPU utilization improvements: I think
that NAPI does a good job at least in getting rid of the biggest
offender -- interrupt overload. Also, with NAPI having got rid of
intermediate queues up to the socket level, facilitating zero-copy receive
should be relatively easy to add, but there are no capable NICs in
existence (well, ok, not counting the TIGON II/acenic that you can hack,
and the fact that the tigon 2 is EOL doesn't help other than just for
experiments). I don't think there's any NIC that can offload reassembly;
that might not be such a bad idea.

Are you still continuing work on this?

cheers,
jamal


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-12 12:30       ` jamal
@ 2002-09-12 13:57         ` Todd Underwood
  2002-09-12 14:11           ` Alan Cox
  2002-09-12 23:12         ` David S. Miller
  1 sibling, 1 reply; 102+ messages in thread
From: Todd Underwood @ 2002-09-12 13:57 UTC (permalink / raw)
  To: jamal; +Cc: David S. Miller, tcw, linux-kernel, netdev, patricia gilfeather

jamal,

> Good work. The first time i have seen someone say Linux's way of
> reverse order is a GoodThing(tm). It was also great to see de-mything
> some of the old assumption of the world.

thanks.  although i'd love to take credit, i don't think that the 
reverse-order fragmentation appreciation is all that original:  who 
wouldn't want their data structure size determined up-front? :-)  (not to 
mention getting header-overwriting for free as part of the single copy.)

> BTW, TSO is not a intelligent as what you are suggesting.
> If i am not mistaken you are not only suggesting fragmentation and
> assembly at that level you are also suggesting retransmits at the NIC.
> This could be dangerous for practical reasons (changes in TCP congestion
> control algorithms etc). TSO as was pointed in earlier emails is just a
> dumb sender of packets. I think even fragmentation is a misnomer.
> Essentially you shove a huge buffer to the NIC and it breaks it into MTU
> sized packets for you and sends them.

the biggest problem with our approach is that it is extremely difficult to
mix two very different kinds of workloads together:  the regular
server-on-the-internet workload (SOI) and the large-cluster-member
workload (LCM).  in the former case, SOI, you get dropped packets,
fragments, no fragments, out of order fragments, etc.  in the LCM case you 
basically never get any of that stuff--you're on a closed network with 
1000-10000 of your closest cluster friends and that's just what you're 
doing.  no fragments (unless you put them there), no out of order 
fragments (unless you send them) and basically no dropped packets ever.  
obviously, if you can assume conditions like that, you can do things like:  
only reassemble fragments in reverse order since you know you'll only send 
them that way, e.g.

> In regards to the receive side CPU utilization improvements: I think
> that NAPI does a good job at least in getting ridding of the biggest
> offender -- interupt overload. Also with NAPI also having got rid of
> intermidiate queues to the socket level, facilitating of zero copy receive
> should be relatively easy to add but there are no capable NICs in
> existence (well, ok not counting the TIGONII/acenic that you can hack
> and the fact that the tigon 2 is EOL doesnt help other than just for
> experiments). I dont think theres any NIC that can offload reassembly;
> that might not be such a bad idea.

i've done some reading about NAPI just recently (somehow i missed the 
splash when it came out).  the two things i like about it are the hardware 
independent interrupt mitigation technique and using the DMA buffers as a 
receive backlog.  i'm concerned about the numbers posted by ibm folx 
recently showing a slowdown under some conditions using NAPI and need to 
read the rest of that discussion.

we are definitely aware of the fact that the more you want to put on the 
NIC, the more the NIC will have to do (and the more expensive it will have 
to be).  right now the NICs that people are developing on are the 
TigonII/III and, even more closed/proprietary, the Myrinet NICs.  i would 
love to have a <$200 NIC with open firmware and a CPU/memory so that we 
could offload some more of this functionality (where it makes sense).  
> 
> Are you still continuing work on this?
> 

definitely!  we were just talking about some of these issues yesterday
(and trying to find hardware spec info on the web for the e1000 platform
to see what else they might be able to do). patricia gilfeather is working
on finding parts of TCP that are separable from the rest of TCP, but the 
problems you raise are serious:  it would have to be on an 
application-specific and socket-specific basis, so that the app would 
*know* that functionality (like acks for synchronization packets or 
whatever) was being offloaded. 

the biggest difference in our perspective, versus the common kernel
developers, is that we're still looking for ways to get the OS out of the
way of the applications.  if we can do large data transfers (with
pre-posted receives and pre-posted memory allocation, obviously) directly
from the nic into application memory and have a clean, relatively simple
and standard api to do that, we avoid all of the interrupt mitigation 
techniques and save hugely on context switching overhead.

this may now be off-topic for linux-kernel and i'd be happy to chat 
further in private email if others are getting bored :-).

> cheers,
> jamal

t.
-- 
todd underwood, vp & cto
oso grande technologies, inc.
todd@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-12 13:57         ` Todd Underwood
@ 2002-09-12 14:11           ` Alan Cox
  2002-09-12 14:41             ` todd-lkml
  0 siblings, 1 reply; 102+ messages in thread
From: Alan Cox @ 2002-09-12 14:11 UTC (permalink / raw)
  To: Todd Underwood
  Cc: jamal, David S. Miller, tcw, linux-kernel, netdev, patricia gilfeather

On Thu, 2002-09-12 at 14:57, Todd Underwood wrote:
> thanks.  although i'd love to take credit, i don't think that the 
> reverse-order fragmentation appreciation is all that original:  who 
> wouldn't want their data sctructure size determined up-front? :-) (not to 
> mention getting header-overwriting for-free as part of the single copy.

As far as I am aware it was original when Linux first did it (and we
broke cisco pix, some boot proms, some sco in the process). Credit goes
to Arnt Gulbrandsen, probably better known nowadays for his work on Qt.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-12 14:11           ` Alan Cox
@ 2002-09-12 14:41             ` todd-lkml
  0 siblings, 0 replies; 102+ messages in thread
From: todd-lkml @ 2002-09-12 14:41 UTC (permalink / raw)
  To: Alan Cox
  Cc: jamal, David S. Miller, tcw, linux-kernel, netdev, patricia gilfeather

alan,

good to know.  it's a nice piece of engineering.  it's useful to note that 
linux has such a long and rich history of breaking de-facto standards in 
order to make things work better.

t.

On 12 Sep 2002, Alan Cox wrote:

> On Thu, 2002-09-12 at 14:57, Todd Underwood wrote:
> > thanks.  although i'd love to take credit, i don't think that the 
> > reverse-order fragmentation appreciation is all that original:  who 
> > wouldn't want their data sctructure size determined up-front? :-) (not to 
> > mention getting header-overwriting for-free as part of the single copy.
> 
> As far as I am aware it was original when Linux first did it (and we
> broke cisco pix, some boot proms, some sco in the process). Credit goes
> to Arnt Gulbrandsen probably better known nowdays for his work on Qt
> 

-- 
todd underwood, vp & cto
oso grande technologies, inc.
todd@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-12  7:28     ` Todd Underwood
  2002-09-12 12:30       ` jamal
@ 2002-09-12 17:18       ` Nivedita Singhvi
  1 sibling, 0 replies; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-12 17:18 UTC (permalink / raw)
  To: Todd Underwood; +Cc: David S. Miller, hadi, tcw, linux-kernel, netdev

Quoting Todd Underwood <todd@osogrande.com>:

> sorry for the late reply.  catching up on kernel mail.

> the main different, though, is that general purpose kernel
> development still focussed on the improvements in *sending* speed.
> for real high performance networking, the improvements are necessary
> in *receiving* cpu utilization, in our estimation. 
> (see our analysis of interrupt overhead and the effect on receivers
> at gigabit speeds--i hope that this has become common understanding
> by now)

Some of that may be a byproduct of the "all the world's a webserver"
mindset - we are primarily focussed on the server side (aka the
money side ;)), and there is some amount of automatic thinking that
this means we're going to be sending data and receiving small packets
(mostly acks) in return. There is much less emphasis given to solving
the problems on the other side (active connection scalability, for
instance), or other issues that manifest themselves as
client side bottlenecks for most applications.

thanks,
Nivedita


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-12 12:30       ` jamal
  2002-09-12 13:57         ` Todd Underwood
@ 2002-09-12 23:12         ` David S. Miller
  2002-09-13 21:59           ` todd-lkml
  1 sibling, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-12 23:12 UTC (permalink / raw)
  To: hadi; +Cc: todd, tcw, linux-kernel, netdev

   From: jamal <hadi@cyberus.ca>
   Date: Thu, 12 Sep 2002 08:30:44 -0400 (EDT)
   
   In regards to the receive side CPU utilization improvements: I think
   that NAPI does a good job at least in getting ridding of the biggest
   offender -- interupt overload.

I disagree, at least for bulk receivers.  We have no way currently to
get rid of the data copy.  We desperately need sys_receivefile() and
appropriate ops all the way into the networking, then the necessary
driver level support to handle the cards that can do this.

Once 10gbit cards start hitting the shelves this will convert from a
nice perf improvement into a must have.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-12 23:12         ` David S. Miller
@ 2002-09-13 21:59           ` todd-lkml
  2002-09-13 22:04             ` David S. Miller
  2002-09-13 22:12             ` Nivedita Singhvi
  0 siblings, 2 replies; 102+ messages in thread
From: todd-lkml @ 2002-09-13 21:59 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, tcw, linux-kernel, netdev, patricia gilfeather

dave, all,

On Thu, 12 Sep 2002, David S. Miller wrote:

> I disagree, at least for bulk receivers.  We have no way currently to
> get rid of the data copy.  We desperately need sys_receivefile() and
> appropriate ops all the way into the networking, then the necessary
> driver level support to handle the cards that can do this.

not sure i understand what you're proposing, but while we're at it, why
not also make the api for apps to allocate a buffer in userland that (for
nics that support it) the nic can dma directly into?  it seems likely
notification that the buffer was used would have to travel through the
kernel, but it would be nice to save the interrupts altogether.

this may be exactly what you were saying.

> 
> Once 10gbit cards start hitting the shelves this will convert from a
> nice perf improvement into a must have.

totally agreed.  this is a must for high-performance computing now (since 
who wants to waste 80-100% of their CPU just running the network?)

t.

-- 
todd underwood, vp & cto
oso grande technologies, inc.
todd@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-13 21:59           ` todd-lkml
@ 2002-09-13 22:04             ` David S. Miller
  2002-09-15 20:16               ` jamal
  2002-09-16 14:16               ` todd-lkml
  2002-09-13 22:12             ` Nivedita Singhvi
  1 sibling, 2 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-13 22:04 UTC (permalink / raw)
  To: linux-kernel, todd-lkml; +Cc: hadi, tcw, netdev, pfeather

   From: todd-lkml@osogrande.com
   Date: Fri, 13 Sep 2002 15:59:15 -0600 (MDT)
   
   not sure i understand what you're proposing

Cards in the future at 10gbit and faster are going to provide
facilities by which:

1) You register an IPV4 src_addr/dst_addr TCP src_port/dst_port cookie
   with the hardware when TCP connections are opened.

2) The card scans arriving TCP packets; if the cookie matches, it
   accumulates received data to fill full pages and wakes up the
   networking when either:

   a) a full page has accumulated for a connection
   b) connection cookie mismatch
   c) configurable timer has expired

3) TCP ends up getting receive packets with an skb_shinfo() fraglist
   containing the data portion in full struct page *'s
   This can be placed directly into the page cache via sys_receivefile
   generic code in mm/filemap.c or f.e. NFSD/NFS receive side
   processing.
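
For illustration only, the per-connection registration in (1) amounts to
handing the NIC the 4-tuple; something like the following, where every
name is hypothetical since no such interface exists yet:

#include <linux/types.h>
#include <linux/netdevice.h>

/* purely hypothetical flow cookie, sketching the idea */
struct rx_flow_cookie {
	u32	saddr;	/* IPv4 source address */
	u32	daddr;	/* IPv4 destination address */
	u16	sport;	/* TCP source port */
	u16	dport;	/* TCP destination port */
};

/* hypothetical driver hook: tell the card about a new connection so it
 * can accumulate in-sequence payload into full pages for that flow */
int mydev_register_rx_flow(struct net_device *dev,
			   const struct rx_flow_cookie *cookie);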

   not also make the api for apps to allocate a buffer in userland that (for
   nics that support it) the nic can dma directly into?  it seems likely
   notification that the buffer was used would have to travel through the
   kernel, but it would be nice to save the interrupts altogether.
   
This is already doable with sys_sendfile() for send today.  The user
just does the following:

1) mmap()'s a file with MAP_SHARED to write the data
2) uses sys_sendfile() to send the data over the socket from that file
3) uses socket write space monitoring to determine if the portions of
   the shared area are reclaimable for new writes

BTW Apache could make use of this; I doubt it does currently.
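
For anyone who wants to try that recipe, a minimal userspace sketch (error
handling trimmed, and the connected socket "sock" is assumed to already
exist):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* write data through a MAP_SHARED mapping, then push the same page cache
 * pages out with sendfile(2), so there is no user->kernel data copy */
static int send_shared_file(int sock, const char *path, size_t len)
{
	off_t off = 0;
	char *p;
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	if (fd < 0 || ftruncate(fd, len) < 0)
		return -1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 'x', len);			/* the application's data */
	if (sendfile(sock, fd, &off, len) < 0)
		return -1;
	munmap(p, len);
	close(fd);
	return 0;
}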

The corollary with sys_receivefile would be that the user:

1) mmap()'s a file with MAP_SHARED to write the data
2) uses sys_receivefile() to pull in the data from the socket to that file

There is no need to poll the receive socket space as the successful
return from sys_receivefile() is the "data got received successfully"
event.
   
   totally agreed.  this is a must for high-performance computing now (since 
   who wants to waste 80-100% of their CPU just running the network)?
   
If send side is your bottleneck and you think zerocopy sends of
user anonymous data might help, see the above since we can do it
today and you are free to experiment.

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-13 21:59           ` todd-lkml
  2002-09-13 22:04             ` David S. Miller
@ 2002-09-13 22:12             ` Nivedita Singhvi
  1 sibling, 0 replies; 102+ messages in thread
From: Nivedita Singhvi @ 2002-09-13 22:12 UTC (permalink / raw)
  To: linux-kernel, todd-lkml; +Cc: tcw, pfeather, netdev

Quoting todd-lkml@osogrande.com:

> dave, all,
> 
> not sure i understand what you're proposing, but while we're at it,
> why not also make the api for apps to allocate a buffer in userland
> that (for nics that support it) the nic can dma directly into?  it

I believe thats exactly what David was referring to - reverse
direction sendfile() so to speak.. 

> seems likely notification that the buffer was used would have to
> travel through the kernel, but it would be nice to save the
> interrupts altogether. 

However, I dont think what youre saving are interrupts as 
much as the extra copy, but I could be wrong..

thanks,
Nivedita




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-13 22:04             ` David S. Miller
@ 2002-09-15 20:16               ` jamal
  2002-09-16  4:23                 ` David S. Miller
  2002-09-16 14:16               ` todd-lkml
  1 sibling, 1 reply; 102+ messages in thread
From: jamal @ 2002-09-15 20:16 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, todd-lkml, tcw, netdev, pfeather



10 gige becomes more of an interesting beast. Not sure if we would see
servers with 10gige real soon now. Your proposal does make sense although
compute power would still be a player. I think the key would be
parallelization.
Now if it wasn't for the stupid way TCP options were designed
you could easily do remote DMA instead. Would be relatively easy to add
NIC support for that. Maybe SCTP would save us ;-> however, if history
could be used to predict the future, i think TCP will continue to be
"hacked" to fit the throughput requirements, so no chance for SCTP to be
a big player i am afraid.

cheers,
jamal

On Fri, 13 Sep 2002, David S. Miller wrote:

>    From: todd-lkml@osogrande.com
>    Date: Fri, 13 Sep 2002 15:59:15 -0600 (MDT)
>
>    not sure i understand what you're proposing
>
> Cards in the future at 10gbit and faster are going to provide
> facilities by which:
>
> 1) You register a IPV4 src_addr/dst_addr TCP src_port/dst_port cookie
>    with the hardware when TCP connections are openned.
>

[..]




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-15 20:16               ` jamal
@ 2002-09-16  4:23                 ` David S. Miller
  0 siblings, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-16  4:23 UTC (permalink / raw)
  To: hadi; +Cc: linux-kernel, todd-lkml, tcw, netdev, pfeather

   From: jamal <hadi@cyberus.ca>
   Date: Sun, 15 Sep 2002 16:16:13 -0400 (EDT)

   Your proposal does make sense although compute power would still be
   a player. I think the key would be parallelization;

Oh I forgot to mention that some of these cards also compute a cookie
for you on receive packets, and you're meant to point the input
processing for that packet to a cpu whose number is derived from that
cookie it gives you.

Lockless per-cpu packet input queues make this sort of hard for us
to implement currently.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-13 22:04             ` David S. Miller
  2002-09-15 20:16               ` jamal
@ 2002-09-16 14:16               ` todd-lkml
  2002-09-16 19:52                 ` David S. Miller
  1 sibling, 1 reply; 102+ messages in thread
From: todd-lkml @ 2002-09-16 14:16 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, hadi, tcw, netdev, pfeather

david,

comments/questions below...

On Fri, 13 Sep 2002, David S. Miller wrote:

> 1) You register a IPV4 src_addr/dst_addr TCP src_port/dst_port cookie
>    with the hardware when TCP connections are openned.

intriguing architecture.  are there any standards in progress to support 
this?  basically, people doing high performance computing have been 
customizing non-commodity nics (acenic, myrinet, quadrics, etc.) to do 
some of this cookie registration/scanning.  it would be nice if there were 
a standard API/hardware capability that took care of at least this piece.

(frankly, it would also be nice if customizable, almost-commodity nics 
based on processor/memory/firmware architecture rather than just asics
(like the acenic) continued to exist).

>    not also make the api for apps to allocate a buffer in userland that (for
>    nics that support it) the nic can dma directly into?  it seems likely
>    notification that the buffer was used would have to travel through the
>    kernel, but it would be nice to save the interrupts altogether.
>    
> This is already doable with sys_sendfile() for send today.  The user
> just does the following:
> 
> 1) mmap()'s a file with MAP_SHARED to write the data
> 2) uses sys_sendfile() to send the data over the socket from that file
> 3) uses socket write space monitoring to determine if the portions of
>    the shared area are reclaimable for new writes
> 
> BTW Apache could make this, I doubt it does currently.
> 
> The corrolary with sys_receivefile would be that the use:
> 
> 1) mmap()'s a file with MAP_SHARED to write the data
> 2) uses sys_receivefile() to pull in the data from the socket to that file
> 
> There is no need to poll the receive socket space as the successful
> return from sys_receivefile() is the "data got received successfully"
> event.

the send case has been well described and seems to work well for the people 
for whom that is the bottleneck.  that has not been the case in HPC, since 
sends are relatively cheaper (in terms of cpu) than receives.  

who is working on this architecture for receives?  i know quite a few 
people who would be interested in working on it and willing to prototype 
as well.

>    totally agreed.  this is a must for high-performance computing now (since 
>    who wants to waste 80-100% of their CPU just running the network)?
>    
> If send side is your bottleneck and you think zerocopy sends of
> user anonymous data might help, see the above since we can do it
> today and you are free to experiment.

for many of the applications that i care about, receive is the bottleneck, 
so zerocopy sends are somewhat of a non-issue (not that they're not nice, 
they just don't solve the primary waste of processor resources).

is there a beginning implementation yet of zerocopy receives as you
describe above, or would you be interested in entertaining implementations
that work on existing (1Gig-e) cards?

what i'm thinking is something that prototypes the api to the nic that you 
are proposing and implements the NIC-side functionality in firmware on the 
acenic-2's (which have available firmware in at least two 
implementations--the alteon version and pete wyckoff's version (which may 
be less license-encumbered)).

this is obviously only feasible if there already exists some consensus on 
what the os-to-hardware API should look like (or there is willingness to 
try to build a consensus around that now).

t.

-- 
todd underwood, vp & cto
oso grande technologies, inc.
todd@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 14:16               ` todd-lkml
@ 2002-09-16 19:52                 ` David S. Miller
  2002-09-16 21:32                   ` todd-lkml
  2002-09-17 10:31                   ` jamal
  0 siblings, 2 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-16 19:52 UTC (permalink / raw)
  To: linux-kernel, todd-lkml; +Cc: hadi, tcw, netdev, pfeather

   From: todd-lkml@osogrande.com
   Date: Mon, 16 Sep 2002 08:16:47 -0600 (MDT)

   are there any standards in progress to support this.

Your question makes no sense, it is a hardware optimization
of an existing standard.  The chip merely is told what flows
exist and it concatenates TCP data from consecutive packets
for that flow if they arrive in sequence.
   
   who is working on this architecture for receives?

Once cards with the feature exist, probably Alexey and myself
will work on it.

Basically, whoever isn't busy with something else once the technology
appears.
   
   is there a beginning implementation yet of zerocopy receives

No.
   
Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 21:32                   ` todd-lkml
@ 2002-09-16 21:29                     ` David S. Miller
  2002-09-16 22:53                     ` David Woodhouse
  1 sibling, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-16 21:29 UTC (permalink / raw)
  To: linux-kernel, todd-lkml; +Cc: hadi, tcw, netdev, pfeather

   From: todd-lkml@osogrande.com
   Date: Mon, 16 Sep 2002 15:32:56 -0600 (MDT)
   
   new system calls into the networking code

The system calls would go into the VFS, sys_receivefile is not
networking specific in any way shape or form.

And to answer your question, if I had the time I'd work on it yes.

Right now the answer to "well do you have the time" is no, I am
working on something much more important wrt. Linux networking.  I've
hinted at what this is in previous postings, and if people can't
figure out what it is I'm not going to mention this explicitly :-)

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 19:52                 ` David S. Miller
@ 2002-09-16 21:32                   ` todd-lkml
  2002-09-16 21:29                     ` David S. Miller
  2002-09-16 22:53                     ` David Woodhouse
  2002-09-17 10:31                   ` jamal
  1 sibling, 2 replies; 102+ messages in thread
From: todd-lkml @ 2002-09-16 21:32 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather

folx,

perhaps i was insufficiently clear.

On Mon, 16 Sep 2002, David S. Miller wrote:

>    are there any standards in progress to support this.
> 
> Your question makes no sense, it is a hardware optimization
> of an existing standard.  The chip merely is told what flows
> exist and it concatenates TCP data from consequetive packets
> for that flow if they arrive in sequence.

hardware optimizations can be standardized.  in fact, when they are, it is 
substantially easier to implement to them.

my assumption (perhaps incorrect) is that some core set of functionality 
is necessary for a card to support zero-copy receives (in particular, the 
ability to register cookies of expected data flows and the memory location 
to which they are to be sent).  what 'existing standard' is this 
kernel<->nic api a standardization of?

>    who is working on this architecture for receives?
> 
> Once cards with the feature exist, probably Alexey and myself
> will work on it.
> 
> Basically, who ever isn't busy with something else once the technology
> appears.

so if we wrote and distributed firmware for alteon acenics that supported
this today, you would be willing to incorporate the new system calls into
the networking code (along with the new firmware for the card, provided we
could talk jes into accepting the changes, assuming he's still the 
maintainer of the driver)?  that's great.

>    
>    is there a beginning implementation yet of zerocopy receives
> 
> No.


thanks for your feedback.

t.


-- 
todd underwood, vp & cto
oso grande technologies, inc.
todd@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 22:53                     ` David Woodhouse
@ 2002-09-16 22:46                       ` David S. Miller
  2002-09-16 23:03                       ` David Woodhouse
  1 sibling, 0 replies; 102+ messages in thread
From: David S. Miller @ 2002-09-16 22:46 UTC (permalink / raw)
  To: dwmw2; +Cc: linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather

   From: David Woodhouse <dwmw2@infradead.org>
   Date: Mon, 16 Sep 2002 23:53:00 +0100
   
   Er, surely the same goes for sys_sendfile? Why have a new system call 
   rather than just swapping the 'in' and 'out' fds?

There is an assumption that one is a linear stream of output (in this
case a socket) and the other one is a page cache based file.

It would be nice to extend sys_sendfile to work properly in both
directions in a manner that Linus would accept; want to work on that?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 21:32                   ` todd-lkml
  2002-09-16 21:29                     ` David S. Miller
@ 2002-09-16 22:53                     ` David Woodhouse
  2002-09-16 22:46                       ` David S. Miller
  2002-09-16 23:03                       ` David Woodhouse
  1 sibling, 2 replies; 102+ messages in thread
From: David Woodhouse @ 2002-09-16 22:53 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather


davem@redhat.com said:
>    new system calls into the networking code
> The system calls would go into the VFS, sys_receivefile is not
> networking specific in any way shape or form. 

Er, surely the same goes for sys_sendfile? Why have a new system call 
rather than just swapping the 'in' and 'out' fds?

--
dwmw2



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 23:08                         ` Jeff Garzik
@ 2002-09-16 23:02                           ` David S. Miller
  2002-09-16 23:48                             ` Jeff Garzik
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-16 23:02 UTC (permalink / raw)
  To: jgarzik; +Cc: dwmw2, linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather

   From: Jeff Garzik <jgarzik@mandrakesoft.com>
   Date: Mon, 16 Sep 2002 19:08:15 -0400
   
   I was rather disappointed when file->file sendfile was [purposefully?] 
   broken in 2.5.x...

What change made this happen?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 22:53                     ` David Woodhouse
  2002-09-16 22:46                       ` David S. Miller
@ 2002-09-16 23:03                       ` David Woodhouse
  2002-09-16 23:08                         ` Jeff Garzik
  1 sibling, 1 reply; 102+ messages in thread
From: David Woodhouse @ 2002-09-16 23:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather


davem@redhat.com said:
> >   Er, surely the same goes for sys_sendfile? Why have a new system
> >   call rather than just swapping the 'in' and 'out' fds?

> There is an assumption that one is a linear stream of output (in this
> case a socket) and the other one is a page cache based file.

That's an implementation detail and it's not clear we should be exposing it 
to the user. It's not entirely insane to contemplate socket->socket or 
file->file sendfile either -- would we invent new system calls for those 
too? File descriptors are file descriptors.

> It would be nice to extend sys_sendfile to work properly in both ways
> in a manner that Linus would accept, want to work on that? 

Yeah -- I'll add it to the TODO list. Scheduled for some time in 2007 :)

More seriously though, I'd hope that whoever implemented what you call 
'sys_receivefile' would solve this issue, as 'sys_receivefile' isn't really 
useful as anything more than a handy nomenclature for describing the 
process in question.

--
dwmw2



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 23:03                       ` David Woodhouse
@ 2002-09-16 23:08                         ` Jeff Garzik
  2002-09-16 23:02                           ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Jeff Garzik @ 2002-09-16 23:08 UTC (permalink / raw)
  To: David Woodhouse
  Cc: David S. Miller, linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather

David Woodhouse wrote:
> davem@redhat.com said:
> 
>>>  Er, surely the same goes for sys_sendfile? Why have a new system
>>>  call rather than just swapping the 'in' and 'out' fds?
>>
> 
>>There is an assumption that one is a linear stream of output (in this
>>case a socket) and the other one is a page cache based file.
> 
> 
> That's an implementation detail and it's not clear we should be exposing it 
> to the user. It's not entirely insane to contemplate socket->socket or 
> file->file sendfile either -- would we invent new system calls for those 
> too? File descriptors are file descriptors.

I was rather disappointed when file->file sendfile was [purposefully?] 
broken in 2.5.x...

	Jeff




^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 23:48                             ` Jeff Garzik
@ 2002-09-16 23:43                               ` David S. Miller
  2002-09-17  0:01                                 ` Jeff Garzik
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-16 23:43 UTC (permalink / raw)
  To: jgarzik; +Cc: dwmw2, linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather

   From: Jeff Garzik <jgarzik@mandrakesoft.com>
   Date: Mon, 16 Sep 2002 19:48:37 -0400

   I dunno when it happened, but 2.5.x now returns EINVAL for all 
   file->file cases.
   
   In 2.4.x, if sendpage is NULL, file_send_actor in mm/filemap.c faked a 
   call to fops->write().
   In 2.5.x, if sendpage is NULL, EINVAL is unconditionally returned.
   

What if source and destination file and offsets match?
Sounds like 2.4.x might deadlock.

In fact it sounds similar to the "read() with buf pointed to same
page in MAP_WRITE mmap()'d area" deadlock we had ages ago.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 23:02                           ` David S. Miller
@ 2002-09-16 23:48                             ` Jeff Garzik
  2002-09-16 23:43                               ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Jeff Garzik @ 2002-09-16 23:48 UTC (permalink / raw)
  To: David S. Miller
  Cc: dwmw2, linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather

David S. Miller wrote:
>    From: Jeff Garzik <jgarzik@mandrakesoft.com>
>    Date: Mon, 16 Sep 2002 19:08:15 -0400
>    
>    I was rather disappointed when file->file sendfile was [purposefully?] 
>    broken in 2.5.x...
> 
> What change made this happen?


I dunno when it happened, but 2.5.x now returns EINVAL for all 
file->file cases.

In 2.4.x, if sendpage is NULL, file_send_actor in mm/filemap.c faked a 
call to fops->write().
In 2.5.x, if sendpage is NULL, EINVAL is unconditionally returned.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 23:43                               ` David S. Miller
@ 2002-09-17  0:01                                 ` Jeff Garzik
  0 siblings, 0 replies; 102+ messages in thread
From: Jeff Garzik @ 2002-09-17  0:01 UTC (permalink / raw)
  To: David S. Miller
  Cc: dwmw2, linux-kernel, todd-lkml, hadi, tcw, netdev, pfeather

[-- Attachment #1: Type: text/plain, Size: 551 bytes --]

David S. Miller wrote:
>    From: Jeff Garzik <jgarzik@mandrakesoft.com>
>    Date: Mon, 16 Sep 2002 19:48:37 -0400
> 
>    I dunno when it happened, but 2.5.x now returns EINVAL for all 
>    file->file cases.
>    
>    In 2.4.x, if sendpage is NULL, file_send_actor in mm/filemap.c faked a 
>    call to fops->write().
>    In 2.5.x, if sendpage is NULL, EINVAL is unconditionally returned.
>    
> 
> What if source and destination file and offsets match?


The same data is written out.  No deadlock.
(unless the attached test is wrong)

	Jeff



[-- Attachment #2: sendfile-test-2.c --]
[-- Type: text/plain, Size: 635 bytes --]

#include <sys/sendfile.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
	int in, out;
	struct stat st;
	off_t off = 0;
	ssize_t rc;

	/* open the same file twice: once as the sendfile source... */
	in = open("test.data", O_RDONLY);
	if (in < 0) {
		perror("test.data read");
		return 1;
	}

	fstat(in, &st);

	/* ...and once as the destination, starting at the same offset */
	out = open("test.data", O_WRONLY);
	if (out < 0) {
		perror("test.data write");
		return 1;
	}

	/* source and destination are the same file at the same offsets,
	   which is the potential-deadlock case being tested */
	rc = sendfile(out, in, &off, st.st_size);
	if (rc < 0) {
		perror("sendfile");
		close(in);
		close(out);
		return 1;
	}

	close(in);
	close(out);
	return 0;
}


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-16 19:52                 ` David S. Miller
  2002-09-16 21:32                   ` todd-lkml
@ 2002-09-17 10:31                   ` jamal
  1 sibling, 0 replies; 102+ messages in thread
From: jamal @ 2002-09-17 10:31 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, todd-lkml, tcw, netdev, pfeather



On Mon, 16 Sep 2002, David S. Miller wrote:

>    From: todd-lkml@osogrande.com
>    Date: Mon, 16 Sep 2002 08:16:47 -0600 (MDT)
>
>    are there any standards in progress to support this.
>
> Your question makes no sense, it is a hardware optimization
> of an existing standard.  The chip merely is told what flows
> exist and it concatenates TCP data from consequetive packets
> for that flow if they arrive in sequence.
>

Hrm. Again, the big Q:
How "thmart" is this NIC going to be (think congestion control and
the du-jour flavor).

cheers,
jamal


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-10 16:55         ` Manfred Spraul
@ 2002-09-11  7:46           ` Robert Olsson
  0 siblings, 0 replies; 102+ messages in thread
From: Robert Olsson @ 2002-09-11  7:46 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Robert Olsson, David S. Miller, haveblue, hadi, netdev, linux-kernel



 >  > Load   "Mode"
 >  > -------------------
 >  > Lo  1) RxIntDelay=0
 >  > Mid 2) RxIntDelay=fix (When we had X pkts on the RX ring)
 >  > Hi  3) Consecutive polling. No RX interrupts.

Manfred Spraul writes:

 > Sounds good.
 > 
 > The difficult part is when to go from Lo to Mid. Unfortunately my tulip 
 > card is braindead (LC82C168), but I'll try to find something usable for 
 > benchmarking

 21143 for tulips. Well, any NIC with "RxIntDelay" should do.

 > In my tests with the winbond card, I've switched at a fixed packet rate:
 > 
 > < 2000 packets/sec: no delay
 >  > 2000 packets/sec: poll rx at 0.5 ms

 I was experimenting with all sorts of moving averages but never got a good 
 correlation with bursty network traffic at this level of resolution. The 
 only measure I found fast and simple enough for this was the number of 
 packets on the RX ring as I mentioned.


 Cheers.
						--ro

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-10 12:02       ` Robert Olsson
@ 2002-09-10 16:55         ` Manfred Spraul
  2002-09-11  7:46           ` Robert Olsson
  0 siblings, 1 reply; 102+ messages in thread
From: Manfred Spraul @ 2002-09-10 16:55 UTC (permalink / raw)
  To: Robert Olsson; +Cc: David S. Miller, haveblue, hadi, netdev, linux-kernel

Robert Olsson wrote:
 >
 > Anyway. A tulip NAPI variant added mitigation when we reached "some
 > load" to  avoid the static interrupt delay. (Still keeping things
 > pretty simple):
 >
 > Load   "Mode"
 > -------------------
 > Lo  1) RxIntDelay=0
 > Mid 2) RxIntDelay=fix (When we had X pkts on the RX ring)
 > Hi  3) Consecutive polling. No RX interrupts.
 >
Sounds good.

The difficult part is when to go from Lo to Mid. Unfortunately my tulip 
card is braindead (LC82C168), but I'll try to find something usable for 
benchmarking.

In my tests with the winbond card, I've switched at a fixed packet rate:

< 2000 packets/sec: no delay
 > 2000 packets/sec: poll rx at 0.5 ms



--
	Manfred


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Early SPECWeb99 results on 2.5.33 with TSO on e1000
@ 2002-09-10 14:59 Mala Anand
  0 siblings, 0 replies; 102+ messages in thread
From: Mala Anand @ 2002-09-10 14:59 UTC (permalink / raw)
  To: inux-net, linux-kernel, netdev

I am resending this note with the subject heading, so that
it can be viewed through the subject category.

 > "David S. Miller" wrote:
>> NAPI is also not the panacea to all problems in the world.

   >Mala did some testing on this a couple of weeks back. It appears that
   >NAPI damaged performance significantly.



>http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm



>Unfortunately it is not listed what e1000 and core NAPI
>patch was used. Also, not listed, are the RX/TX mitigation
>and ring sizes given to the kernel module upon loading.
The default driver included in the 2.5.25 kernel for the Intel
gigabit adapter was used for the baseline test, and the NAPI driver
was downloaded from Robert Olsson's website. I have updated my web
page to include Robert's patch; however, it is given there for reference
purposes only. Except for the ones mentioned explicitly, the rest of
the configurable values used are the defaults. The default for RX/TX mitigation
is 64 microseconds and the default ring size is 80.

I have added statistics collected during the test to my web site. I do
want to analyze and understand how NAPI can be improved in my tcp_stream
test. Last year around November, when I first tested NAPI, I did find NAPI
results better than the baseline using udp_stream. However I am
concentrating on tcp_stream since that is where NAPI can be improved in
my setup. I will update the website as I do more work on this.


>Robert can comment on optimal settings
I saw Robert's postings. Looks like he may have a more recent version of
the NAPI driver than the one I used. I also see 2.5.33 has NAPI; I will
move to 2.5.33 and continue my work on that.


>Robert and Jamal can make a more detailed analysis of Mala's
>graphs than I.
Jamal has asked about the socket buffer size that I used; I have tried a
132k socket buffer size in the past and I didn't see much difference in my
tests. I will add that to my list again.


Regards,
    Mala


   Mala Anand
   IBM Linux Technology Center - Kernel Performance
   E-mail:manand@us.ibm.com
   http://www-124.ibm.com/developerworks/opensource/linuxperf
   http://www-124.ibm.com/developerworks/projects/linuxperf
   Phone:838-8088; Tie-line:678-8088






^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:34     ` David S. Miller
@ 2002-09-10 12:02       ` Robert Olsson
  2002-09-10 16:55         ` Manfred Spraul
  0 siblings, 1 reply; 102+ messages in thread
From: Robert Olsson @ 2002-09-10 12:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: manfred, haveblue, hadi, netdev, linux-kernel


Manfred Spraul:

>> But what if the backlog queue is empty all the time? Then NAPI thinks 
>> that the system is idle, and reenables the interrupts after each packet :-(

Yes, and this happens even without NAPI. Just set RxIntDelay=X and send 
pkts at an interval >= X+1.

>>   Dave, do you have interrupt rates from the clients with and without NAPI?

DaveM:
> Robert does.


Yes we get into this interesting discussion now... Since with NAPI we can 
safely use RxIntDelay=0 (e1000 terminology). With the classical IRQ scheme we simply 
had to add latency (an RxIntDelay of 64-128 us is common for GIGE) just to 
survive at higher speeds (GIGE max is 1.48 Mpps), and with the interrupt latency 
also comes higher network latencies... IMO this delay was a "work-around" 
for the old interrupt scheme.

So we now have the option of removing it... But we are trading less latency 
for more interrupts. So yes, Manfred is correct...

So is there a decent setting/compromise? 

Well, the first approximation is to do just what DaveM suggested:
RxIntDelay=0. This solved many problems with buggy hardware and complicated 
tuning; RxIntDelay used to be combined with other mitigation parameters to 
compensate for different packet sizes etc. This led to very "fragile" 
performance where a NIC could perform excellently w. a single TCP stream but 
be seriously broken in many other tests. So tuning to just one "test" 
can cause a lot of mis-tuning as well. 

Anyway. A tulip NAPI variant added mitigation when we reached "some load" to 
avoid the static interrupt delay. (Still keeping things pretty simple):
 
Load   "Mode"
-------------------
Lo  1) RxIntDelay=0
Mid 2) RxIntDelay=fix (When we had X pkts on the RX ring)
Hi  3) Consecutive polling. No RX interrupts.

Is it worth the effort? 

For SMP w/o affinity the delay could eventually reduce the cache bouncing 
since the packets become more "batched", at the cost of latency of course. 
We use RxIntDelay=0 for production use (IP-forwarding on UP).

Cheers.

						--ro



^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:38 ` David S. Miller
@ 2002-09-06 19:40   ` Manfred Spraul
  2002-09-06 19:34     ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Manfred Spraul @ 2002-09-06 19:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: haveblue, hadi, netdev, linux-kernel

David S. Miller wrote:
>    From: Manfred Spraul <manfred@colorfullife.com>
>    Date: Fri, 06 Sep 2002 20:35:08 +0200
> 
>    The second point was that interrupt mitigation must remain enabled, even 
>    with NAPI: the automatic mitigation doesn't work with process space 
>    limited loads (e.g. TCP: backlog queue is drained quickly, but the 
>    system is busy processing the prequeue or receive queue)
> 
> Not true.  NAPI is in fact a %100 replacement for hw interrupt
> mitigation strategies.  The cpu usage elimination afforded by
> hw interrupt mitigation is also afforded by NAPI and even more
> so by NAPI.
>    
> See Jamal's paper.
> 
I've read his paper: it's about MLFFR. There is no alternative to NAPI 
if packets arrive faster than they are processed by the backlog queue.

But what if the backlog queue is empty all the time? Then NAPI thinks 
that the system is idle, and reenables the interrupts after each packet :-(

In my tests, I've used a pentium class system (I have no GigE cards - 
that was the only system where I could saturate the cpu with 100MBit 
ethernet). IIRC 30% cpu time was needed for the copy_to_user(). The 
receive queue was filled, the backlog queue empty. With NAPI, I got 1 
interrupt for each packet; with hw interrupt mitigation the throughput 
was 30% higher for MTU 600.

Dave, do you have interrupt rates from the clients with and without NAPI?

--
	Manfred


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 19:40   ` Manfred Spraul
@ 2002-09-06 19:34     ` David S. Miller
  2002-09-10 12:02       ` Robert Olsson
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 19:34 UTC (permalink / raw)
  To: manfred; +Cc: haveblue, hadi, netdev, linux-kernel

   From: Manfred Spraul <manfred@colorfullife.com>
   Date: Fri, 06 Sep 2002 21:40:09 +0200
   
   Dave, do you have interrupt rates from the clients with and without NAPI?

Robert does.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 18:35 Manfred Spraul
@ 2002-09-06 18:38 ` David S. Miller
  2002-09-06 19:40   ` Manfred Spraul
  0 siblings, 1 reply; 102+ messages in thread
From: David S. Miller @ 2002-09-06 18:38 UTC (permalink / raw)
  To: manfred; +Cc: haveblue, hadi, netdev, linux-kernel

   From: Manfred Spraul <manfred@colorfullife.com>
   Date: Fri, 06 Sep 2002 20:35:08 +0200

   The second point was that interrupt mitigation must remain enabled, even 
   with NAPI: the automatic mitigation doesn't work with process space 
   limited loads (e.g. TCP: backlog queue is drained quickly, but the 
   system is busy processing the prequeue or receive queue)

Not true.  NAPI is in fact a 100% replacement for hw interrupt
mitigation strategies.  The cpu usage elimination afforded by
hw interrupt mitigation is also afforded by NAPI, and even more
so.
   
See Jamal's paper.

Franks a lot,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
@ 2002-09-06 18:35 Manfred Spraul
  2002-09-06 18:38 ` David S. Miller
  0 siblings, 1 reply; 102+ messages in thread
From: Manfred Spraul @ 2002-09-06 18:35 UTC (permalink / raw)
  To: Dave Hansen, jamal, netdev, linux-kernel

 >
> The real question is why NAPI causes so much more work for the client.
 >
[Just a summary from my results from last year. All testing with a 
simple NIC without hw interrupt mitigation, on a Cyrix P150]

My assumption was that NAPI increases the cost of receiving a single 
packet: instead of one hw interrupt with one device access (ack 
interrupt) and the softirq processing, the hw interrupt must ack & 
disable the interrupt, then the processing occurs in softirq context, 
and the interrupts are reenabled in softirq context.
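
In code, the path I am describing looks roughly like this; the shape of
->poll() and netif_rx_complete() is from my reading of the 2.5 interface,
and the mydev_* helpers are made up:

#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/netdevice.h>

/* hypothetical per-driver helpers */
static void mydev_ack_and_mask_rx_irq(struct net_device *dev);
static void mydev_unmask_rx_irq(struct net_device *dev);
static int mydev_clean_rx_ring(struct net_device *dev, int limit);

/* hard irq: one device access to ack & mask, then defer to the softirq */
static void mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = dev_id;

	mydev_ack_and_mask_rx_irq(dev);
	netif_rx_schedule(dev);
}

/* softirq context: drain the ring, re-enable the interrupt when idle */
static int mydev_poll(struct net_device *dev, int *budget)
{
	int limit = min(*budget, dev->quota);
	int done = mydev_clean_rx_ring(dev, limit);

	*budget -= done;
	dev->quota -= done;

	if (done < limit) {
		netif_rx_complete(dev);
		mydev_unmask_rx_irq(dev);	/* interrupts back on */
		return 0;
	}
	return 1;				/* more work, stay on poll list */
}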

The second point was that interrupt mitigation must remain enabled, even 
with NAPI: the automatic mitigation doesn't work with process space 
limited loads (e.g. TCP: backlog queue is drained quickly, but the 
system is busy processing the prequeue or receive queue)

jamal, is it possible for a driver to use both napi and the normal 
interface, or would that break fairness?
Use netif_rx until it returns dropping. If that happens, disable the 
interrupt and call netif_rx_schedule().
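
Something like this is what I have in mind; netif_rx(), NET_RX_DROP and
netif_rx_schedule() are the in-tree interfaces as I understand them, and
mydev_ack_and_mask_rx_irq() is the same made-up helper as above:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static void mydev_ack_and_mask_rx_irq(struct net_device *dev);

/* feed packets through the classical backlog until it overflows, then
 * mask the RX interrupt and fall back to polled (NAPI) mode */
static void mydev_rx_packet(struct net_device *dev, struct sk_buff *skb)
{
	if (netif_rx(skb) == NET_RX_DROP) {
		mydev_ack_and_mask_rx_irq(dev);
		netif_rx_schedule(dev);
	}
}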

Is it possible to determine the average number of packets that are 
processed for each netif_rx_schedule()?

--
	Manfred


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 14:37 ` Martin J. Bligh
@ 2002-09-06 15:38   ` Robert Olsson
  0 siblings, 0 replies; 102+ messages in thread
From: Robert Olsson @ 2002-09-06 15:38 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Robert Olsson, David S. Miller, akpm, hadi, tcw, linux-kernel,
	netdev, niv


Martin J. Bligh writes:


 > We are running from 2.5.latest ... any updates needed for NAPI for the
 > driver in the current 2.5 tree, or is that OK?

 Should be OK. Get the latest kernel e1000 to get Intel's and the maintainer's latest 
 work and apply the e1000 NAPI patch. RH includes this patch?

 And yes, there is plenty of room for improvements...


 Cheers.
						--ro


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
  2002-09-06 11:44 Robert Olsson
@ 2002-09-06 14:37 ` Martin J. Bligh
  2002-09-06 15:38   ` Robert Olsson
  0 siblings, 1 reply; 102+ messages in thread
From: Martin J. Bligh @ 2002-09-06 14:37 UTC (permalink / raw)
  To: Robert Olsson, David S. Miller; +Cc: akpm, hadi, tcw, linux-kernel, netdev, niv

> And NAPI scheme behaves different since we can not assume that all network 
> traffic is well-behaved like TCP. System has to be manageable and to "perform"
> under any network load not only for well-behaved TCP. So of course we will 
> see some differences -- there are no free lunch. Simply we can not blindly
> look at one test. IMO NAPI is the best overall performer. The number speaks 
> for themselves.

I don't doubt it's a win for most cases, we just want to reap the benefit
for the large SMP systems as well ... the fundamental mechanism seems
very scalable to me, we probably just need to do a little tuning?

> NAPI kernel path is included in 2.4.20-pre4 the comparison below is mainly 
> between e1000 driver w and w/o NAPI and the NAPI port to e1000 is still 
> evolving. 

We are running from 2.5.latest ... any updates needed for NAPI for the
driver in the current 2.5 tree, or is that OK?

Thanks,

Martin.


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: Early SPECWeb99 results on 2.5.33 with TSO on e1000
@ 2002-09-06 11:44 Robert Olsson
  2002-09-06 14:37 ` Martin J. Bligh
  0 siblings, 1 reply; 102+ messages in thread
From: Robert Olsson @ 2002-09-06 11:44 UTC (permalink / raw)
  To: David S. Miller
  Cc: akpm, Martin.Bligh, hadi, tcw, linux-kernel, netdev, niv, Robert.Olsson


David S. Miller writes:
 >    Mala did some testing on this a couple of weeks back.  It appears that
 >    NAPI damaged performance significantly.
 > 
 >    http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
 > 

 > Robert can comment on optimal settings

Hopefully yes...

I see other numbers so we have to sort out the differences. Andrew Morton
pinged me about this test last week. So I've had a chance to run some tests. 

Some comments:
Scaling to CPU can be a dangerous measure w. NAPI due to its adaptive behaviour 
where RX interrupts decrease in favour of successive polls.

And the NAPI scheme behaves differently since we cannot assume that all network 
traffic is well-behaved like TCP. The system has to be manageable and to "perform"
under any network load, not only for well-behaved TCP. So of course we will 
see some differences -- there is no free lunch. Simply, we cannot blindly
look at one test. IMO NAPI is the best overall performer. The numbers speak 
for themselves.

Here is the most recent test...

The NAPI kernel path is included in 2.4.20-pre4. The comparison below is mainly 
between the e1000 driver with and without NAPI, and the NAPI port to e1000 is 
still evolving. 

Linux 2.4.20-pre4/UP PIII @ 933 MHz w. Intel's e1000 2-port GIGE adapter.
e1000 4.3.2-k1 (current kernel version) and current NAPI patch. For NAPI the 
e1000 driver uses RxIntDelay=1; RxIntDelay=0 caused problems. The non-NAPI
driver uses RxIntDelay=64 (default).

Three tests: TCP, UDP, packet forwarding.

Netperf. TCP socket size 131070, Single TCP stream. Test length 30 s.

M-size   e1000    NAPI-e1000
============================
      4   20.74    20.69  Mbit/s data received.
    128  458.14   465.26 
    512  836.40   846.71 
   1024  936.11   937.93 
   2048  940.65   939.92 
   4096  940.86   937.59
   8192  940.87   939.95 
  16384  940.88   937.61
  32768  940.89   939.92
  65536  940.90   939.48
 131070  940.84   939.74

Netperf. UDP_STREAM. 1440 pkts. Single UDP stream. Test length 30 s.
         e1000    NAPI-e1000
====================================
         955.7    955.7   Mbit/s data received.

Forwarding test. 1 Mpkts at 970 kpps injected.
          e1000   NAPI-e1000
=============================================
T-put       305   298580    pkts routed.

NOTE! 
With the non-NAPI driver this system is "dead" and performs nothing.


Cheers.
						--ro



^ permalink raw reply	[flat|nested] 102+ messages in thread

* RE: Early SPECWeb99 results on 2.5.33 with TSO on e1000
@ 2002-09-05 20:47 Feldman, Scott
  0 siblings, 0 replies; 102+ messages in thread
From: Feldman, Scott @ 2002-09-05 20:47 UTC (permalink / raw)
  To: 'Troy Wilson'; +Cc: linux-kernel, netdev

Troy Wilson wrote:
>   I've got some early SPECWeb [*] results with 2.5.33 and TSO 
> on e1000.  I get 2906 simultaneous connections, 99.2% 
> conforming (i.e. faster than the 320 kbps cutoff), at 0% idle 
> with TSO on.  For comparison, with 2.5.25, I 
> got 2656, and with 2.5.29 I got 2662, (both 99+% conformance 
> and 0% idle) so TSO and 2.5.33 look like a Big Win.

A 10% bump is good.  Thanks for running the numbers.
  
>   I'm having trouble testing with TSO off (I changed the 
> #define NETIF_F_TSO to "0" in include/linux/netdevice.h to 
> turn it off).  I am getting errors.

Sorry, I should have made a CONFIG switch.  Just hack the driver for now to
turn it off:

--- linux-2.5/drivers/net/e1000/e1000_main.c	Fri Aug 30 19:26:57 2002
+++ linux-2.5-no_tso/drivers/net/e1000/e1000_main.c	Thu Sep  5 13:38:44 2002
@@ -428,9 +428,11 @@ e1000_probe(struct pci_dev *pdev,
 	}
 
 #ifdef NETIF_F_TSO
+#if 0
 	if(adapter->hw.mac_type >= e1000_82544)
 		netdev->features |= NETIF_F_TSO;
 #endif
+#endif
  
 	if(pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;
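
As a less invasive alternative to the #if 0 hack, a module parameter could
gate the same feature bit.  This is only a sketch of the idea, not code that
exists in the driver; it uses the module_param() interface from later
kernels, and the "enable_tso" name is made up:

#include <linux/module.h>
#include <linux/netdevice.h>

/* Hypothetical knob: "modprobe e1000 enable_tso=0" would disable TSO. */
static int enable_tso = 1;
module_param(enable_tso, int, 0444);
MODULE_PARM_DESC(enable_tso, "Advertise NETIF_F_TSO (1 = on, 0 = off)");

/* ... and in e1000_probe(), instead of the unconditional assignment: */
#ifdef NETIF_F_TSO
	if (enable_tso && adapter->hw.mac_type >= e1000_82544)
		netdev->features |= NETIF_F_TSO;
#endif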

-scott

^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, other threads:[~2002-09-17 10:34 UTC | newest]

Thread overview: 102+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-05 18:30 Early SPECWeb99 results on 2.5.33 with TSO on e1000 Troy Wilson
2002-09-05 20:59 ` jamal
2002-09-05 22:11   ` Troy Wilson
2002-09-05 22:39     ` Nivedita Singhvi
2002-09-05 23:01       ` Dave Hansen
2002-09-05 22:48   ` Nivedita Singhvi
2002-09-06  1:47     ` jamal
2002-09-06  3:38       ` Nivedita Singhvi
2002-09-06  3:58         ` David S. Miller
2002-09-06  4:20           ` Nivedita Singhvi
2002-09-06  4:17             ` David S. Miller
2002-09-07  0:05         ` Troy Wilson
2002-09-06  3:56       ` David S. Miller
2002-09-06  3:47   ` David S. Miller
2002-09-06  6:48     ` Martin J. Bligh
2002-09-06  6:51       ` David S. Miller
2002-09-06  7:36         ` Andrew Morton
2002-09-06  7:22           ` David S. Miller
2002-09-06  9:54             ` jamal
2002-09-06 14:29         ` Martin J. Bligh
2002-09-06 15:38           ` Dave Hansen
2002-09-06 16:11             ` Martin J. Bligh
2002-09-06 16:21             ` Nivedita Singhvi
2002-09-06 15:29       ` Dave Hansen
2002-09-06 16:29         ` Martin J. Bligh
2002-09-06 17:36           ` Dave Hansen
2002-09-06 18:26             ` Andi Kleen
2002-09-06 18:31               ` John Levon
2002-09-06 18:33               ` Dave Hansen
2002-09-06 18:36                 ` David S. Miller
2002-09-06 18:45                   ` Martin J. Bligh
2002-09-06 18:43                     ` David S. Miller
2002-09-06 19:19               ` Nivedita Singhvi
2002-09-06 19:21                 ` David S. Miller
2002-09-06 19:45                   ` Nivedita Singhvi
2002-09-06 19:26                 ` Andi Kleen
2002-09-06 19:24                   ` David S. Miller
2002-09-06 19:45                 ` Martin J. Bligh
2002-09-06 17:26       ` Gerrit Huizenga
2002-09-06 17:37         ` David S. Miller
2002-09-06 18:19           ` Gerrit Huizenga
2002-09-06 18:26             ` Martin J. Bligh
2002-09-06 18:36               ` David S. Miller
2002-09-06 18:51                 ` Martin J. Bligh
2002-09-06 18:48                   ` David S. Miller
2002-09-06 19:05                     ` Gerrit Huizenga
2002-09-06 19:01                       ` David S. Miller
2002-09-06 20:29                   ` Alan Cox
2002-09-06 18:34             ` David S. Miller
2002-09-06 18:57               ` Gerrit Huizenga
2002-09-06 18:58                 ` David S. Miller
2002-09-06 19:52                   ` Gerrit Huizenga
2002-09-06 19:49                     ` David S. Miller
2002-09-06 20:03                       ` Gerrit Huizenga
2002-09-06 23:48       ` Troy Wilson
2002-09-11  9:11       ` Eric W. Biederman
2002-09-11 14:10         ` Martin J. Bligh
2002-09-11 15:06           ` Eric W. Biederman
2002-09-11 15:15             ` David S. Miller
2002-09-11 15:31               ` Eric W. Biederman
2002-09-11 15:27             ` Martin J. Bligh
2002-09-12  7:28     ` Todd Underwood
2002-09-12 12:30       ` jamal
2002-09-12 13:57         ` Todd Underwood
2002-09-12 14:11           ` Alan Cox
2002-09-12 14:41             ` todd-lkml
2002-09-12 23:12         ` David S. Miller
2002-09-13 21:59           ` todd-lkml
2002-09-13 22:04             ` David S. Miller
2002-09-15 20:16               ` jamal
2002-09-16  4:23                 ` David S. Miller
2002-09-16 14:16               ` todd-lkml
2002-09-16 19:52                 ` David S. Miller
2002-09-16 21:32                   ` todd-lkml
2002-09-16 21:29                     ` David S. Miller
2002-09-16 22:53                     ` David Woodhouse
2002-09-16 22:46                       ` David S. Miller
2002-09-16 23:03                       ` David Woodhouse
2002-09-16 23:08                         ` Jeff Garzik
2002-09-16 23:02                           ` David S. Miller
2002-09-16 23:48                             ` Jeff Garzik
2002-09-16 23:43                               ` David S. Miller
2002-09-17  0:01                                 ` Jeff Garzik
2002-09-17 10:31                   ` jamal
2002-09-13 22:12             ` Nivedita Singhvi
2002-09-12 17:18       ` Nivedita Singhvi
2002-09-06 23:56   ` Troy Wilson
2002-09-06 23:52     ` David S. Miller
2002-09-07  0:18     ` Nivedita Singhvi
2002-09-07  0:27       ` Troy Wilson
2002-09-05 20:47 Feldman, Scott
2002-09-06 11:44 Robert Olsson
2002-09-06 14:37 ` Martin J. Bligh
2002-09-06 15:38   ` Robert Olsson
2002-09-06 18:35 Manfred Spraul
2002-09-06 18:38 ` David S. Miller
2002-09-06 19:40   ` Manfred Spraul
2002-09-06 19:34     ` David S. Miller
2002-09-10 12:02       ` Robert Olsson
2002-09-10 16:55         ` Manfred Spraul
2002-09-11  7:46           ` Robert Olsson
2002-09-10 14:59 Mala Anand
