* Re: [E1000-devel] Transmission limit
       [not found] <1101467291.24742.70.camel@mellia.lipar.polito.it>
@ 2004-11-26 14:05 ` P
  2004-11-26 15:31   ` Marco Mellia
                     ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: P @ 2004-11-26 14:05 UTC (permalink / raw)
  To: mellia; +Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

I'm forwarding this to netdev, as these are very interesting
results (even if I don't believe them).

If you point us at the code/versions we will be better able to answer.

Marco Mellia wrote:
> We are trying to stress the e1000 hardware/driver under linux and Click
> to see what is the maximum number of packets per second that can be
> received/transmitted by a single NIC.
> 
> We found something which is counterintuitive:
> 
> - in reception, we can receive ALL the traffic, regardless of the
> packet size (or if you prefer, we can receive ALL the minimum sized
> packets at gigabit speed)

I questioned whether you actually did receive at that rate to
which you responded:

 > - using Click, we can receive 100% of (small) packets at gigabit
 >   speed with TWO cards (2gigabit/s ~ 2.8Mpps)
 > - using linux and standard e1000 driver, we can receive up to about
 >   80% of traffic from a single nic (~1.1Mpps)
 > - using linux and a modified (simplified) version of the driver, we
 >   can receive 100% on a single nic, but not 100% using two nics (up
 >   to ~1.5Mpps).
 >
 > Reception means: receiving the packet up to the rx ring at the
 > kernel level, and then IMMEDIATELY drop it (no packet processing,
 > no forwarding, nothing more...)
 >
 > Using NAPI or IRQ has little impact (as we are not processing the
 > packets, the livelock due to the hardIRQ preemption versus the
 > softIRQ managers is not entered...)
 >
 > But the limit in TRANSMISSION seems to be 700Kpps. Regardless of
 > - the traffic generator,
 > - the driver version,
 > - the O.S. (linux/click),
 > - the hardware (broadcom cards have the same limit).

> 
> - in transmission we CAN ONLY transmit about 700,000 pkt/s when the
> minimum sized packets are considered (64 bytes long, the ethernet minimum
> frame size). That is about HALF the maximum number of pkt/s considering
> a gigabit link.
> 
> What is weird, is that if we artificially "preload" the NIC tx-fifo with
> packets, and then instruct it to start sending them, those are actually
> transmitted AT WIRE SPEED!!
> 
> These results have been obtained considering different software
> generators (namely, UDPGEN, PACKETGEN, Application level generators)
> under LINUX (2.4.x, 2.6.x), and under CLICK (using a modified version of
> UDPGEN).
> 
> The hardware setup considers 
> - a 2.8GHz Xeon hardware
> - PCI-X bus (133MHz/64bit)
> - 1G of Ram
> - Intel PRO 1000 MT single, double, and quad cards, integrated or on a
> PCI slot.
> 
> Different driver versions have been used, and while there are (small)
> differences when receiving packets, ALL of them present the same
> transmission limits.
> 
> Moreover, the same happens considering other vendors' cards (broadcom
> based chipset).
> 
> Is there any limit on the PCI-X (or PCI) that can be the bottleneck?
> Or a limit on the number of packets per second that can be stored in the
> NIC tx-fifo?
> Might the length of the tx-fifo have an impact on this?
> 
> Any hints will be really appreciated.
> Thanks in advance

cheers,
Pádraig.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 14:05 ` [E1000-devel] Transmission limit P
@ 2004-11-26 15:31   ` Marco Mellia
  2004-11-26 19:56     ` jamal
                       ` (3 more replies)
  2004-11-26 15:40   ` Robert Olsson
  2004-11-27 20:00   ` Lennert Buytenhek
  2 siblings, 4 replies; 85+ messages in thread
From: Marco Mellia @ 2004-11-26 15:31 UTC (permalink / raw)
  To: P; +Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

If you don't trust us, please, ignore this email.
Sorry.

Those are the numbers we have, and they are actually very similar to what
other colleagues of ours got.

The point is:
while a PCI-X linux (or click) box can receive (receive just up to
the netif_receive_skb() level and then discard the skb) at more than
wire speed using off-the-shelf gigabit ethernet hardware, there is no
way to transmit more than about half that speed. This is true
considering minimum sized ethernet frames.

This holds true with 
- linux 2.4.x and 2.6.x and click-linux 2.4.x
- intel e1000 or broadcom drivers (modified to drop packets after the 
netif_receive_skb())
- whichever driver version you like (with minor modifications).

The only modification we made to the driver consists in carefully
prefetching the data into the CPU's internal cache.

Some details and results can be retrieved from

http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf

Part of these results is presented in this paper:
A. Bianco, J.M. Finochietto, G. Galante, M. Mellia, F. Neri
Open-Source PC-Based Software Routers: a Viable Approach to High-Performance Packet Switching
Third International Workshop on QoS in Multiservice IP Networks
Catania, Feb 2005
http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf

Hope this helps.

> I'm forwarding this to netdev, as these are very interesting
> results (even if I don't beleive them).
> 
> If you point us at the code/versions we will be better able to answer.
> 
> Marco Mellia wrote:
> > We are trying to stress the e1000 hardware/driver under linux and Click
> > to see what is the maximum number of packets per second that can be
> > received/transmitted by a single NIC.
> > 
> > We found something which is counterintuitive:
> > 
> > - in reception, we can receive ALL the traffic, regardeless of the
> > packet size (or if you prefer, we can receive ALL the minimum sized
> > packet at gigabit speed)
> 
> I questioned whether you actually did receive at that rate to
> which you responded:
> 
>  > - using Click, we can receive 100% of (small) packets at gigabit
>  >   speed with TWO cards (2gigabit/s ~ 2.8Mpps)
>  > - using linux and standard e1000 driver, we can receive up to about
>  >   80% of traffic from a single nic (~1.1Mpps)
>  > - using linux and a modified (simplified) version of the driver, we
>  >   can receive 100% on a single nic, but not 100% using two nics (up
>  >   to ~1.5Mpps).
>  >
>  > Reception means: receiving the packet up to the rx ring at the
>  > kernel level, and then IMMEDIATELY drop it (no packet processing,
>  > no forwarding, nothing more...)
>  >
>  > Using NAPI or IRQ has littel impact (as we are not processing the
>  > packets, the livelock due to the hardIRQ preemption versus the
>  > softIRQ managers is not entered...)
>  >
>  > But the limit in TRANSMISSION seems to be 700Kpps. Regardless of
>  > - the traffic generator,
>  > - the driver version,
>  > - the O.S. (linux/click),
>  > - the hardware (broadcom card have the same limit).
> 
> > 
> > - in transmission we CAN ONLY trasmit about 700.000 pkt/s when the
> > minimum sized packets are considered (64bytes long ethernet minumum
> > frame size). That is about HALF the maximum number of pkt/s considering
> > a gigabit link.
> > 
> > What is weird, is that if we artificially "preload" the NIC tx-fifo with
> > packets, and then instruct it to start sending them, those are actually
> > transmitted AT WIRE SPEED!!
> > 
> > These results have been obtained considering different software
> > generators (namely, UDPGEN, PACKETGEN, Application level generators)
> > under LINUX (2.4.x, 2.6.x), and under CLICK (using a modified version of
> > UDPGEN).
> > 
> > The hardware setup considers 
> > - a 2.8GHz Xeon hardware
> > - PCI-X bus (133MHz/64bit)
> > - 1G of Ram
> > - Intel PRO 1000 MT single, double, and quad cards, integrated or on a
> > PCI slot.
> > 
> > Different driver versions have been used, and while there are (small)
> > differencies when receiving packets, ALL of them present the same
> > trasmission limits.
> > 
> > Moreover, the same happen considering other vendors cards (broadcom
> > based chipset).
> > 
> > Is there any limit on the PCI-X (or PCI) that can be the bottleneck?
> > Or Limit on the number of packets per second that can be stored in the
> > NIC tx-fifo?
> > May the lenght of the tx-fifo impact on this?
> > 
> > Any hints will be really appreciated.
> > Thanks in advance
> 
> cheers,
> Pádraig.
-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 14:05 ` [E1000-devel] Transmission limit P
  2004-11-26 15:31   ` Marco Mellia
@ 2004-11-26 15:40   ` Robert Olsson
  2004-11-26 15:59     ` Marco Mellia
  2004-11-27 20:00   ` Lennert Buytenhek
  2 siblings, 1 reply; 85+ messages in thread
From: Robert Olsson @ 2004-11-26 15:40 UTC (permalink / raw)
  To: P; +Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev


P@draigBrady.com writes:
 > I'm forwarding this to netdev, as these are very interesting
 > results (even if I don't beleive them).

 > I questioned whether you actually did receive at that rate to
 > which you responded:
 > 
 >  > - using Click, we can receive 100% of (small) packets at gigabit
 >  >   speed with TWO cards (2gigabit/s ~ 2.8Mpps)
 >  > - using linux and standard e1000 driver, we can receive up to about
 >  >   80% of traffic from a single nic (~1.1Mpps)
 >  > - using linux and a modified (simplified) version of the driver, we
 >  >   can receive 100% on a single nic, but not 100% using two nics (up
 >  >   to ~1.5Mpps).
 >  >
 >  > Reception means: receiving the packet up to the rx ring at the
 >  > kernel level, and then IMMEDIATELY drop it (no packet processing,
 >  > no forwarding, nothing more...)
 
 In more detail please... The RX ring must be refilled? And the HW DMAs
 the packets to a memory buffer? But I assume the data is not touched otherwise.

 Touching the packet data gives a major impact. See eth_type_trans
 in all profiles.

 So what forwarding numbers are seen?
  
 >  > But the limit in TRANSMISSION seems to be 700Kpps. Regardless of
 >  > - the traffic generator,
 >  > - the driver version,
 >  > - the O.S. (linux/click),
 >  > - the hardware (broadcom card have the same limit).
 > 
 > > 
 > > - in transmission we CAN ONLY trasmit about 700.000 pkt/s when the
 > > minimum sized packets are considered (64bytes long ethernet minumum
 > > frame size). That is about HALF the maximum number of pkt/s considering
 > > a gigabit link.
 > > 
 > > What is weird, is that if we artificially "preload" the NIC tx-fifo with
 > > packets, and then instruct it to start sending them, those are actually
 > > transmitted AT WIRE SPEED!!

 OK. Good to know about e1000. Networking is mostly DMAs, and the CPU is used
 administrating them; this is the challenge.

 > > These results have been obtained considering different software
 > > generators (namely, UDPGEN, PACKETGEN, Application level generators)
 > > under LINUX (2.4.x, 2.6.x), and under CLICK (using a modified version of
 > > UDPGEN).

 We get a hundred kpps more... Turn off all interrupt mitigation so interrupts
 are undelayed and the TX ring can be filled as quickly as possible.

 You could even try to refill TX as soon as the HW says there are available
 buffers. This could even be done from the TX interrupt.
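
 A minimal sketch of that idea (this would live in e1000_main.c;
 e1000_xmit_frame() and E1000_DESC_UNUSED() are names from the stock driver,
 while next_test_skb() is a made-up hook for whatever generator supplies the
 frames):

 static void refill_tx_from_irq(struct e1000_adapter *adapter)
 {
 	/* called from the TX completion interrupt, right after the cleaned
 	 * descriptors have been returned to the ring */
 	while (E1000_DESC_UNUSED(&adapter->tx_ring) > 0) {
 		struct sk_buff *skb = next_test_skb();	/* made-up generator hook */

 		if (!skb)
 			break;
 		/* queue the frame immediately so the ring never runs dry */
 		e1000_xmit_frame(skb, adapter->netdev);
 	}
 }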
 
 > > The hardware setup considers 
 > > - a 2.8GHz Xeon hardware
 > > - PCI-X bus (133MHz/64bit)
 > > - 1G of Ram
 > > - Intel PRO 1000 MT single, double, and quad cards, integrated or on a
 > > PCI slot.

 
 > > Is there any limit on the PCI-X (or PCI) that can be the bottleneck?
 > > Or Limit on the number of packets per second that can be stored in the
 > > NIC tx-fifo?
 > > May the lenght of the tx-fifo impact on this?

 Small packet performance is dependent on low latency. Higher bus speed
 gives shorter latency, but on higher speed buses there also tend to be
 bridges that add latency.

 For packet generation we still use 866 MHz PIIIs and 82543GCs on a ServerWorks
 64-bit board, which are faster than most other systems. So for testing routing
 performance in pps we have to use several flows. This has the advantage of
 testing SMP/NUMA as well.

					      --ro
 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 15:40   ` Robert Olsson
@ 2004-11-26 15:59     ` Marco Mellia
  2004-11-26 16:57       ` P
  2004-11-26 17:58       ` Robert Olsson
  0 siblings, 2 replies; 85+ messages in thread
From: Marco Mellia @ 2004-11-26 15:59 UTC (permalink / raw)
  To: Robert Olsson
  Cc: P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

Robert,
It's a pleasure to hear from you.

>  > I questioned whether you actually did receive at that rate to
>  > which you responded:
>  > 
>  >  > - using Click, we can receive 100% of (small) packets at gigabit
>  >  >   speed with TWO cards (2gigabit/s ~ 2.8Mpps)
>  >  > - using linux and standard e1000 driver, we can receive up to about
>  >  >   80% of traffic from a single nic (~1.1Mpps)
>  >  > - using linux and a modified (simplified) version of the driver, we
>  >  >   can receive 100% on a single nic, but not 100% using two nics (up
>  >  >   to ~1.5Mpps).
>  >  >
>  >  > Reception means: receiving the packet up to the rx ring at the
>  >  > kernel level, and then IMMEDIATELY drop it (no packet processing,
>  >  > no forwarding, nothing more...)
>  
>  In more detail please... The RX ring must be refilled? And HW DMA's
>  the to memory-buffer? But I assume data it not touched otherwise.
> 
>  Touching the packet-data givs a major impact. See eth_type_trans
>  in all profiles.

That's exactly what we removed from the driver code: touching the packet
limits the reception rate to about 1.1Mpps, while skipping the
eth_type_trans check actually allows us to receive 100% of packets.

skbs are de/allocated using standard kernel memory management. Still,
without touching the packet, we can receive 100% of them.

>  So what forwarding numbers is seen?

Forwarding is another issue. It seems to us that the bottleneck is in
the transmission of packets. Indeed, considering only reception and
transmission _separately_:
- all packets can be received
- no more than ~700kpps can be transmitted

When IP forwarding is considered, once more we hit the transmission limit
(using NAPI and your buffer recycling patch, as mentioned in the paper
and in the slides... If no buffer recycling is adopted, performance drops
a bit).
So it seemed to us that the major bottleneck is due to the transmission
limit.

Again, you can get numbers and more details from

http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf

>  >  > But the limit in TRANSMISSION seems to be 700Kpps. Regardless of
>  >  > - the traffic generator,
>  >  > - the driver version,
>  >  > - the O.S. (linux/click),
>  >  > - the hardware (broadcom card have the same limit).
>  > 
>  > > 
>  > > - in transmission we CAN ONLY trasmit about 700.000 pkt/s when the
>  > > minimum sized packets are considered (64bytes long ethernet minumum
>  > > frame size). That is about HALF the maximum number of pkt/s considering
>  > > a gigabit link.
>  > > 
>  > > What is weird, is that if we artificially "preload" the NIC tx-fifo with
>  > > packets, and then instruct it to start sending them, those are actually
>  > > transmitted AT WIRE SPEED!!
> 
>  OK. Good to know about e1000. Networking is most DMA's and CPU is used 
>  adminstating it this is the challange.

That's true. There is still the chance that the limit is due to hardware
CRC calculation (the CRC must be added to the ethernet frame by the
nic...). But we're quite confident that that is not the limit, since
in the reception path the same operation must be performed...

>  > > These results have been obtained considering different software
>  > > generators (namely, UDPGEN, PACKETGEN, Application level generators)
>  > > under LINUX (2.4.x, 2.6.x), and under CLICK (using a modified version of
>  > > UDPGEN).
> 
>  We get a hundred kpps more...Turn off all mitigation so interrupts are 
>  undelayed so TX ring can be filled as quick as possible.
> 
>  Even you could try to fill TX as soon as the HW says there are available
>  buffers. This could even be done from TX-interrupt.

Are you suggesting that we modify packetgen to be more aggressive?
 
>  > > The hardware setup considers 
>  > > - a 2.8GHz Xeon hardware
>  > > - PCI-X bus (133MHz/64bit)
>  > > - 1G of Ram
>  > > - Intel PRO 1000 MT single, double, and quad cards, integrated or on a
>  > > PCI slot.
>  > > Is there any limit on the PCI-X (or PCI) that can be the bottleneck?
>  > > Or Limit on the number of packets per second that can be stored in the
>  > > NIC tx-fifo?
>  > > May the lenght of the tx-fifo impact on this?
> 
>  Small packet performance is dependent on low latency. Higher bus speed
>  gives shorter latency but also on higher speed buses there use to be  
>  bridges that adds latency.

That's true. We suspect that the limit is due to bus latency. But still,
we are surprised, since the bus allows us to receive 100%, but to transmit
only up to ~50%. Moreover, the raw aggregate bandwidth of the bus is _far_
larger (133MHz * 64bit ~ 8gbit/s).

>  For packet generation we use still 866 MHz PIII:s and 82543GC on serverworks
>  64-bit board which are faster than most other systems. So for testing routing 
>  performance in pps we have to use several flows. This gives the advantage to 
>  test SMP/NUMA as well.

We use a hardware generator (Agilent router tester)... which can
saturate a gigabit link with no problem (and costs much more than a
PC...). So our forwarding tests are not generator-limited...
 
-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 15:59     ` Marco Mellia
@ 2004-11-26 16:57       ` P
  2004-11-26 20:01         ` jamal
  2004-11-26 17:58       ` Robert Olsson
  1 sibling, 1 reply; 85+ messages in thread
From: P @ 2004-11-26 16:57 UTC (permalink / raw)
  To: mellia
  Cc: Robert Olsson, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

I forgot a smiley on my previous post
about not believing you. So here's two: :-) :-)

Comments below:

Marco Mellia wrote:
> Robert,
> It a pleasure to hear from you.
> 
>> Touching the packet-data givs a major impact. See eth_type_trans
>> in all profiles.

Notice the e1000 sets up the alignment for IP by default.

> skb are de/allocated using standard kernel memory management. Still,
> without touching the packet, we can receive 100% of them.

I was doing some playing in this area this week.
I changed the alloc per packet to a "realloc" per packet,
i.e. the e1000 driver owns the packets. I noticed a
very nice speedup from this. In summary, a userspace
app was able to receive 2x250Kpps without this patch,
and 2x490Kpps with it. The patch is here:
http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff
Note 99% of that patch is just upgrading from
e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect
the performance).
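
A rough sketch of the "driver owns the packets" idea (the real patch is at
the URL above; struct/field names like buffer_info->skb follow the e1000
driver, while the function name here is made up for illustration):

static void e1000_reuse_rx_skb(struct e1000_adapter *adapter,
			       struct e1000_buffer *buffer_info)
{
	struct sk_buff *skb = buffer_info->skb;

	/* instead of dev_alloc_skb() per received packet, reset the old
	 * skb so it looks freshly allocated... */
	skb->data = skb->head;
	skb->tail = skb->head;
	skb->len  = 0;
	skb_reserve(skb, 2);		/* keep the IP header word-aligned */

	/* ...and hand the same buffer straight back to the NIC */
	buffer_info->dma = pci_map_single(adapter->pdev, skb->data,
					  adapter->rx_buffer_len,
					  PCI_DMA_FROMDEVICE);
}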

Wow, I just read your excellent paper, and noticed
you used this approach also :-)

>> Small packet performance is dependent on low latency. Higher bus speed
>> gives shorter latency but also on higher speed buses there use to be  
>> bridges that adds latency.
> 
> That's true. We suspect that the limit is due to bus latency. But still,
> we are surprised, since the bus allows to receive 100%, but to transmit
> up to ~50%. Moreover the raw aggerate bandwidth of the buffer is _far_
> larger (133MHz*64bit ~ 8gbit/s

Well, there definitely could be an asymmetry wrt bus latency.
Having said that though, in my tests with much the same hardware
as you, I could only get 800Kpps into the driver. I'll
check this again when I have time. Note also that, as I understand
it, the PCI control bus is running at a much lower rate,
and it is used to arbitrate the bus for each packet,
i.e. the 8Gb/s number above is not the bottleneck.

An lspci -vvv for your ethernet devices would be useful.
Also, to view the burst size: setpci -d 8086:1010 e6.b
(where 8086:1010 is the ethernet device's PCI id).

cheers,
Pádraig.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 15:59     ` Marco Mellia
  2004-11-26 16:57       ` P
@ 2004-11-26 17:58       ` Robert Olsson
  1 sibling, 0 replies; 85+ messages in thread
From: Robert Olsson @ 2004-11-26 17:58 UTC (permalink / raw)
  To: mellia
  Cc: Robert Olsson, P, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev


Marco Mellia writes:

 > >  Touching the packet-data givs a major impact. See eth_type_trans
 > >  in all profiles.
 > 
 > That's exactly what we removed from the driver code: touching the packet
 > limit the reception rate at about 1.1Mpps, while avoiding to check the
 > eth_type_trans actually allows to receive 100% of packets.
 > 
 > skb are de/allocated using standard kernel memory management. Still,
 > without touching the packet, we can receive 100% of them.

 Right. I recall I tried something similar, but as I only have pktgen
 as a sender I could only verify this up to pktgen's TX speed of about 860 kpps
 on the PIII box I mentioned. This was with UP and one NIC.

 > When IP-forwarding is considered, no more we hit the transmission limit
 > (using NAPI, and your buffer recycling patch, as mentioned on the paper
 > and on the slides... If no buffer recycling is adopted, performance drop
 > a bit)
 > So it seemd to us that the major bottleneck is due to the transmission
 > limit.
 > 
 > Again, you can get numbers and more details from
 > 
 > http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
 > http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf

 Nice. Seems we are getting close to click with NAPI and recycling. The skb
 recycling patch is outdated, as it adds too much complexity to the kernel. I have
 some ideas on how to make a much more lightweight variant... If you feel like
 hacking, I can outline the idea so you can try it.

 > >  OK. Good to know about e1000. Networking is most DMA's and CPU is used 
 > >  adminstating it this is the challange.
 > 
 > That's true. There is still the chance that the limit is due to hardware
 > CRC calculation (which must be added to the ethernet frame by the
 > nic...). But we're quite confortable that that is not the limit, since
 > in the reception path the same operation must be performed...

 OK!

 > >  Even you could try to fill TX as soon as the HW says there are available
 > >  buffers. This could even be done from TX-interrupt.
 > 
 > Are you suggesting to modify packetgen to be more aggressive?

 Well it could be useful at least as an experiment. Our lab would be 
 happy...

 > >  Small packet performance is dependent on low latency. Higher bus speed
 > >  gives shorter latency but also on higher speed buses there use to be  
 > >  bridges that adds latency.
 > 
 > That's true. We suspect that the limit is due to bus latency. But still,
 > we are surprised, since the bus allows to receive 100%, but to transmit
 > up to ~50%. Moreover the raw aggerate bandwidth of the buffer is _far_
 > larger (133MHz*64bit ~ 8gbit/s
 
 Have a look at the graph in the pktgen paper presented at Linux-Kongress in
 Erlangen 2004. It seems like even at 8gbit/s this is limiting small
 packet TX performance.

 ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf 

						--ro

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 15:31   ` Marco Mellia
@ 2004-11-26 19:56     ` jamal
  2004-11-29 14:21       ` Marco Mellia
  2004-11-26 20:06     ` jamal
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-11-26 19:56 UTC (permalink / raw)
  To: mellia; +Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> If you don't trust us, please, ignore this email.
> Sorry.

Don't take it the wrong way please - nobody else has been able to produce the
results you have. So that's why you may be getting that comment.
The fact that you have been able to do this is a good thing.

> That's the number we have. And are actually very similar from what other
> colleagues of us got.
> 
> The point is:
> while a PCI-X linux or (or click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) up to more than
> wire speed using off-the-shelf gigabit ethernet hardware, there is no
> way to transmit more than about half that speed. This is true
> considering minimum sized ethernet frames.
> 

Hrm. I could not get more than 800-900Kpps on receive-and-drop in the driver
on a super fast xeon. Can you post the diff for your driver?
My tests were with e1000.
What kind of hardware is this? Do you have a block diagram of how the
NIC is connected in the system? A lot of issues are dependent on what your
hardware hookup is.

> This holds true with 
> - linux 2.4.x and 2.6.x and click-linux 2.4.x
> - intel e1000 or broadcom drivers (modified to drop packets after the 
> netif_receive_skb())
> - whichever driver version you like (with minor modifications).
> 
> The only modification to the driver we did consists in carefully
> prefecting the data in the CPU internal cache.
> 

Prefetching as in the use of prefetch()?
What were you prefetching if you end up dropping the packet?

> Some details and results can be retreived from
> 
> http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
> 
> Part of this results are presented in this paper
> A. Bianco, J.M. Finochietto, G. Galante, M. Mellia, F. Neri
> Open-Source PC-Based Software Routers: a Viable Approach to High-Performance Packet Switching
> Third Internation Workshop on QoS in Multiservice IP Networks
> Catania, Feb 2005
> http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf
> 
> Hope this helps.
> 

Thanks, I will read these papers.
Take a look at the presentation I made at SUCON:
www.suug.ch/sucon/04/slides/pkt_cls.pdf
I have solved the problem which is identified in the first slides
(just before the "why me momma?" slide) - I could describe the solution and
even provide patches which may address (perhaps) some of the transmit
issues you are seeing.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 16:57       ` P
@ 2004-11-26 20:01         ` jamal
  2004-11-29 10:19           ` P
  2004-11-29 13:09           ` Robert Olsson
  0 siblings, 2 replies; 85+ messages in thread
From: jamal @ 2004-11-26 20:01 UTC (permalink / raw)
  To: P
  Cc: mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Fri, 2004-11-26 at 11:57, P@draigBrady.com wrote:

> > skb are de/allocated using standard kernel memory management. Still,
> > without touching the packet, we can receive 100% of them.
> 
> I was doing some playing in this area this week.
> I changed the alloc per packet to a "realloc" per packet.
> I.E. the e1000 driver owns the packets. I noticed a
> very nice speedup from this. In summary a userspace
> app was able to receive 2x250Kpps without this patch,
> and 2x490Kpps with it. The patch is here:
> http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff

A very angry gorilla on that url ;->

> Note 99% of that patch is just upgrading from
> e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect
> the performance).
> 
> Wow I just read you're excellent paper, and noticed
> you used this approach also :-)
> 

Have to read the paper - when Robert was last visiting here, we did some
tests, and packet recycling is not very valuable as far as SMP is
concerned (given that packets can be alloced on one CPU and freed on
another). There is a clear win on single CPU machines.

> >> Small packet performance is dependent on low latency. Higher bus speed
> >> gives shorter latency but also on higher speed buses there use to be  
> >> bridges that adds latency.
> > 
> > That's true. We suspect that the limit is due to bus latency. But still,
> > we are surprised, since the bus allows to receive 100%, but to transmit
> > up to ~50%. Moreover the raw aggerate bandwidth of the buffer is _far_
> > larger (133MHz*64bit ~ 8gbit/s
> 
> Well there definitely could be an asymmetry wrt bus latency.
> Saying that though, in my tests with much the same hardware
> as you, I could only get 800Kpps into the driver.

Yep, that's about the number I was seeing as well in both pieces of
hardware I used in the tests in my SUCON presentation.

>  I'll
> check this again when I have time. Note also that as I understand
> it the PCI control bus is running at a much lower rate,
> and that is used to arbitrate the bus for each packet.
> I.E. the 8Gb/s number above is not the bottleneck.
> 
> An lspci -vvv for your ethernet devices would be useful
> Also to view the burst size: setpci -d 8086:1010 e6.b
> (where 8086:1010 is the ethernet device PCI id).
> 

Can you talk a little about this PCI control bus? I have heard you
mention it before... I am trying to visualize where it fits in the PCI
system.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 15:31   ` Marco Mellia
  2004-11-26 19:56     ` jamal
@ 2004-11-26 20:06     ` jamal
  2004-11-26 20:56     ` Lennert Buytenhek
  2004-11-27  9:25     ` Harald Welte
  3 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2004-11-26 20:06 UTC (permalink / raw)
  To: mellia; +Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> If you don't trust us, please, ignore this email.

BTW, you have to be telling the truth, especially since you
have S. Giordano on your team ;-> We just need to figure out what you
are saying. Off to read your paper.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 15:31   ` Marco Mellia
  2004-11-26 19:56     ` jamal
  2004-11-26 20:06     ` jamal
@ 2004-11-26 20:56     ` Lennert Buytenhek
  2004-11-26 21:02       ` Lennert Buytenhek
  2004-11-27  9:25     ` Harald Welte
  3 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-26 20:56 UTC (permalink / raw)
  To: Marco Mellia
  Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:

> The point is:
> while a PCI-X linux or (or click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) up to more than
> wire speed using off-the-shelf gigabit ethernet hardware, there is no
> way to transmit more than about half that speed. This is true
> considering minimum sized ethernet frames.

That's more-or-less what I'm seeing.

Theoretically, the maximum #pps you can send on gigabit is p=125000000/(s+24),
where s is the packet size, and the constant 24 consists of the 8B preamble,
4B FCS and 12B inter-frame gap.

On an e1000 in a 32b 66MHz PCI slot (Intel server mainboard, e1000 'desktop'
NIC) I'm seeing that exact curve for packet sizes > ~350 bytes, but for
smaller packets than that, the curve goes like p=264000000/(s+335)  (which
is accurate to +/- 100pps.)  The 2.64e8 component is exactly the theoretical
max. bandwidth of the PCI slot the card is in, and the 335 is a constant
that accounts for latency.  On a different mobo I get a curve following
the same formula but with a different value in place of the 335.
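
For the minimum frame size, plugging s=64 into those two curves gives roughly:

  125000000 / (64 + 24)  ~= 1,420,000 pps   (theoretical wire rate, by the formula above)
  264000000 / (64 + 335) ~=   660,000 pps   (the measured 32b/66MHz curve)

i.e. a bit under half of wire rate for minimum-sized frames, which matches
the "about half" transmit ceiling described at the start of the thread.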

The same card in a 32b 33MHz PCI slot in a cheap Asus desktop board gives
something a bit stranger:
- p=132000000/(s+260) for s<128
- p=132000000/(s+390) for 128<=s<256
- p=132000000/(s+520) for 256<=s<384
- ...

Again, the 132000000 corresponds with the theoretical max. bandwidth of
the 32/33 bus.  I'm not all that sure yet why things show this behavior.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 20:56     ` Lennert Buytenhek
@ 2004-11-26 21:02       ` Lennert Buytenhek
  0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-26 21:02 UTC (permalink / raw)
  To: Marco Mellia
  Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, Nov 26, 2004 at 09:56:59PM +0100, Lennert Buytenhek wrote:

> On an e1000 in a 32b 66MHz PCI slot (Intel server mainboard, e1000 'desktop'
> NIC) I'm seeing that exact curve for packet sizes > ~350 bytes, but for
> smaller packets than that, the curve goes like p=264000000/(s+335)  (which
> is accurate to +/- 100pps.)  The 2.64e8 component is exactly the theoretical
> max. bandwidth of the PCI slot the card is in, the 335 a random constant
> that accounts for latency.  On a different mobo I get a curve following
> the same formula but different value for 335.
> 
> The same card in a 32b 33MHz PCI slot in a cheap Asus desktop board gives
> something a bit stranger:
> - p=132000000/(s+260) for s<128
> - p=132000000/(s+390) for 128<=s<256
> - p=132000000/(s+520) for 256<=s<384
> - ...

This could be explained by observing that on the Intel mobo, the NIC sits
on a dedicated PCI bus, while on the cheap Asus board, all PCI slots plus
all onboard devices share the same PCI bus.  Probably after pulling in a
single burst of packet (32 clocks here, sounds about right), the NIC has
to relinquish the bus to other bus masters and wait for 128 byte times
until it gets to pull packet data from RAM again.

Would be interesting to find out where the latency is coming from.  Find
a way to reduce/work around that and the 64b packet case will benefit as
well.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 15:31   ` Marco Mellia
                       ` (2 preceding siblings ...)
  2004-11-26 20:56     ` Lennert Buytenhek
@ 2004-11-27  9:25     ` Harald Welte
       [not found]       ` <20041127111101.GC23139@xi.wantstofly.org>
                         ` (2 more replies)
  3 siblings, 3 replies; 85+ messages in thread
From: Harald Welte @ 2004-11-27  9:25 UTC (permalink / raw)
  To: Marco Mellia
  Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev


On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
> If you don't trust us, please, ignore this email.
> Sorry.
> 
> That's the number we have. And are actually very similar from what other
> colleagues of us got.
> 
> The point is:
> while a PCI-X linux or (or click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) up to more than
> wire speed using off-the-shelf gigabit ethernet hardware, there is no
> way to transmit more than about half that speed. This is true
> considering minimum sized ethernet frames.

Yes, I've seen this, too.

I even rewrote the linux e1000 driver in order to re-fill the tx queue
from the hardirq handler, and it didn't help.  760kpps is the most I could
ever get (133MHz 64bit PCI-X on a Sun Fire v20z, Dual Opteron 1.8GHz)

I've posted this result to netdev at some earlier point; I also Cc'ed
Intel but never got a reply
(http://oss.sgi.com/archives/netdev/2004-09/msg00540.html)

My guess is that Intel always knew this and they want to sell their CSA
chips rather than improving the PCI e1000.

We are hitting a hard limit here, either PCI-X wise or e1000 wise.  You
cannot refill the tx queue faster than from hardirq, and still you don't
get any better numbers.

It was suggested that the problem is PCI DMA arbitration latency, since
the hardware needs to arbitrate the bus for every packet.

Interestingly, if you use a four-port e1000, the numbers get even worse
(580kpps) because the additional pcix bridge on the card introduces
further latency.

-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
       [not found]       ` <20041127111101.GC23139@xi.wantstofly.org>
@ 2004-11-27 11:31         ` Harald Welte
  0 siblings, 0 replies; 85+ messages in thread
From: Harald Welte @ 2004-11-27 11:31 UTC (permalink / raw)
  To: Lennert Buytenhek; +Cc: Linux Netdev List


On Sat, Nov 27, 2004 at 12:11:01PM +0100, Lennert Buytenhek wrote:
> On Sat, Nov 27, 2004 at 10:25:03AM +0100, Harald Welte wrote:
> 
> > I even rewrote the linux e1000 driver [...]
> 
> This is very interesting.  You have chipset docs then?

Once again, please excuse my bad english.  I seem to have translated
'umgeschrieben' into 'rewrote' which is absolutely not applicable here.

Please do s/rewrote/modified/, i.e. I modified/altered/changed the driver

And no, I don't have any docs.

> cheers,
> Lennert

-- 
- Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 14:05 ` [E1000-devel] Transmission limit P
  2004-11-26 15:31   ` Marco Mellia
  2004-11-26 15:40   ` Robert Olsson
@ 2004-11-27 20:00   ` Lennert Buytenhek
  2004-11-29 12:44     ` Marco Mellia
  2 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-27 20:00 UTC (permalink / raw)
  To: mellia; +Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, Nov 26, 2004 at 02:05:26PM +0000, P@draigBrady.com wrote:

> >What is weird, is that if we artificially "preload" the NIC tx-fifo with
> >packets, and then instruct it to start sending them, those are actually
> >transmitted AT WIRE SPEED!!

I'm very interested in exactly what it is you're doing here.  What
do you mean by 'preload'?


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-27  9:25     ` Harald Welte
       [not found]       ` <20041127111101.GC23139@xi.wantstofly.org>
@ 2004-11-27 20:12       ` Cesar Marcondes
  2004-11-29  8:53       ` Marco Mellia
  2 siblings, 0 replies; 85+ messages in thread
From: Cesar Marcondes @ 2004-11-27 20:12 UTC (permalink / raw)
  To: Harald Welte
  Cc: Marco Mellia, P, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

STOP !!!!

On Sat, 27 Nov 2004, Harald Welte wrote:

> On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
> > If you don't trust us, please, ignore this email.
> > Sorry.
> >
> > That's the number we have. And are actually very similar from what other
> > colleagues of us got.
> >
> > The point is:
> > while a PCI-X linux or (or click) box can receive (receive just up to
> > the netif_receive_skb() level and then discard the skb) up to more than
> > wire speed using off-the-shelf gigabit ethernet hardware, there is no
> > way to transmit more than about half that speed. This is true
> > considering minimum sized ethernet frames.
>
> Yes, I've seen this, too.
>
> I even rewrote the linux e1000 driver in order to re-fill the tx queue
> from hardirq handler, and it didn't help.  760kpps is the most I could
> ever get (133MHz 64bit PCI-X on a Sun Fire v20z, Dual Opteron 1.8GHz)
>
> I've posted this result to netdev at some earlier point, I also Cc'ed
> intel but never got a reply
> (http://oss.sgi.com/archives/netdev/2004-09/msg00540.html)
>
> My guess is that Intel always knew this and they want to sell their CSA
> chips rather than improving the PCI e1000.
>
> We are hitting a hard limit here, either PCI-X wise or e1000 wise.  You
> cannot refill the tx queue faster than from hardirq, and still you don't
> get any better numbers.
>
> It was suggested that the problem is PCI DMA arbitration latency, since
> the hardware needs to arbitrate the bus for every packet.
>
> Interestingly, if you use a four-port e1000, the numbers get even worse
> (580kpps) because the additional pcix bridge on the card introduces
> further latency.
>
> --
> - Harald Welte <laforge@gnumonks.org>               http://www.gnumonks.org/
> ============================================================================
> Programming is like sex: One mistake and you have to support it your lifetime
>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-27  9:25     ` Harald Welte
       [not found]       ` <20041127111101.GC23139@xi.wantstofly.org>
  2004-11-27 20:12       ` Cesar Marcondes
@ 2004-11-29  8:53       ` Marco Mellia
  2004-11-29 14:50         ` Lennert Buytenhek
  2 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-29  8:53 UTC (permalink / raw)
  To: Harald Welte
  Cc: Marco Mellia, P, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Sat, 2004-11-27 at 10:25, Harald Welte wrote:
> On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
> > If you don't trust us, please, ignore this email.
> > Sorry.
> > 
> > That's the number we have. And are actually very similar from what other
> > colleagues of us got.
> > 
> > The point is:
> > while a PCI-X linux or (or click) box can receive (receive just up to
> > the netif_receive_skb() level and then discard the skb) up to more than
> > wire speed using off-the-shelf gigabit ethernet hardware, there is no
> > way to transmit more than about half that speed. This is true
> > considering minimum sized ethernet frames.
> 
> Yes, I've seen this, too.
> 
> I even rewrote the linux e1000 driver in order to re-fill the tx queue
> from hardirq handler, and it didn't help.  760kpps is the most I could
> ever get (133MHz 64bit PCI-X on a Sun Fire v20z, Dual Opteron 1.8GHz)
> 
> I've posted this result to netdev at some earlier point, I also Cc'ed
> intel but never got a reply
> (http://oss.sgi.com/archives/netdev/2004-09/msg00540.html)
> 
> My guess is that Intel always knew this and they want to sell their CSA
> chips rather than improving the PCI e1000.
> 
> We are hitting a hard limit here, either PCI-X wise or e1000 wise.  You
> cannot refill the tx queue faster than from hardirq, and still you don't
> get any better numbers.
> 
> It was suggested that the problem is PCI DMA arbitration latency, since
> the hardware needs to arbitrate the bus for every packet.

That's our intuition too.
Notice that we get the same results with 3com (broadcom based) gigabit
cards.
We are thinking of sending packets in "bursts" instead of single
transfers. The only problem is letting the NIC know that there is more
than one packet in a burst...
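
One way to sketch the burst idea on the e1000 (purely illustrative; TDT and
E1000_WRITE_REG() are names from the stock driver, tx_pending is a made-up
per-adapter counter) is to set up several descriptors and only bump the TX
tail register once per burst, so the NIC sees the whole batch at once:

#define TX_BURST 16

	/* in the transmit path, after one more descriptor has been set up: */
	adapter->tx_pending++;
	if (adapter->tx_pending >= TX_BURST) {
		/* one tail-register write covers the whole burst */
		E1000_WRITE_REG(&adapter->hw, TDT, adapter->tx_ring.next_to_use);
		adapter->tx_pending = 0;
	}
	/* a real implementation also needs a timer or flush-on-idle so the
	 * last few packets of a run are not left stranded in the ring */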

-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 20:01         ` jamal
@ 2004-11-29 10:19           ` P
  2004-11-29 13:09           ` Robert Olsson
  1 sibling, 0 replies; 85+ messages in thread
From: P @ 2004-11-29 10:19 UTC (permalink / raw)
  To: hadi
  Cc: mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

jamal wrote:
> On Fri, 2004-11-26 at 11:57, P@draigBrady.com wrote:
> 
> 
>>>skb are de/allocated using standard kernel memory management. Still,
>>>without touching the packet, we can receive 100% of them.
>>
>>I was doing some playing in this area this week.
>>I changed the alloc per packet to a "realloc" per packet.
>>I.E. the e1000 driver owns the packets. I noticed a
>>very nice speedup from this. In summary a userspace
>>app was able to receive 2x250Kpps without this patch,
>>and 2x490Kpps with it. The patch is here:
>>http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff
> 
> 
> A very angry gorilla on that url ;->

feck. Add a .gz
http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff.gz

>>Note 99% of that patch is just upgrading from
>>e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect
>>the performance).
>>
>>Wow I just read you're excellent paper, and noticed
>>you used this approach also :-)
>>
> 
> 
> Have to read the paper - When Robert was last visiting here; we did some
> tests and packet recycling is not very valuable as far as SMP is
> concerned (given that packets can be alloced on one CPU and freed on
> another). There a clear win on single CPU machines.

Well for my app, I am just monitoring, so I use
IRQ and process affinity. You could split the
skb heads across CPUs also I guess.

>>>>Small packet performance is dependent on low latency. Higher bus speed
>>>>gives shorter latency but also on higher speed buses there use to be  
>>>>bridges that adds latency.
>>>
>>>That's true. We suspect that the limit is due to bus latency. But still,
>>>we are surprised, since the bus allows to receive 100%, but to transmit
>>>up to ~50%. Moreover the raw aggerate bandwidth of the buffer is _far_
>>>larger (133MHz*64bit ~ 8gbit/s
>>
>>Well there definitely could be an asymmetry wrt bus latency.
>>Saying that though, in my tests with much the same hardware
>>as you, I could only get 800Kpps into the driver.
> 
> 
> Yep, thats about the number i was seeing as well in both pieces of
> hardware i used in the tests in my SUCON presentation.
> 
> 
>> I'll
>>check this again when I have time. Note also that as I understand
>>it the PCI control bus is running at a much lower rate,
>>and that is used to arbitrate the bus for each packet.
>>I.E. the 8Gb/s number above is not the bottleneck.
>>
>>An lspci -vvv for your ethernet devices would be useful
>>Also to view the burst size: setpci -d 8086:1010 e6.b
>>(where 8086:1010 is the ethernet device PCI id).
>>
> 
> Can you talk a little about this PCI control bus? I have heard you
> mention it before ... I am trying to visualize where it fits in PCI
> system.

Basically the bus is arbitrated per packet. See section 3.5 in:
http://www.intel.com/design/network/applnots/ap453.pdf
This also has lots of nice PCI info:
http://www.hep.man.ac.uk/u/rich/PFLDnet2004/Rich_PFLDNet_10GE_v7.ppt

-- 
Pádraig Brady - http://www.pixelbeat.org
--

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-27 20:00   ` Lennert Buytenhek
@ 2004-11-29 12:44     ` Marco Mellia
  2004-11-29 15:19       ` Lennert Buytenhek
  0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-29 12:44 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

> > >What is weird, is that if we artificially "preload" the NIC tx-fifo with
> > >packets, and then instruct it to start sending them, those are actually
> > >transmitted AT WIRE SPEED!!
> 
> I've very interested in exactly what it is you're doing here.  What
> do you mean by 'preload'?

Here is a brief description of the trick we used.
The modified driver code can be grabbed from 
http://www.tlc-networks.polito.it/~mellia/e1000_modified.tar.gz

So: with "preloaded" we mean that we put the packets to be transmitted 
previously in the TX fifo of the nic without actually updating the
register which counts the number of packets in the fifo queue.
To do this a student modified the network 
driver adding an entry in /proc/net/e1000/eth#.
If you read from there you will get the values of the internal registers
of the NIC regarding the internal fifo.

By writing a number to it, you can set the TDFPC register, which contains
the number of pkts in the TX queue of the internal FIFO.

To get the above result you have to:
Compile this version of the driver (I don't remember which version it
was based on).
Load it.
After that you can take a look at the internal registers with:

cat /proc/net/e1000/eth# (# replace it with the correct number)

Then we start placing something in the TX fifo.
To do this I simply used:

ping -c 10 x.x.x.x

This has placed and also transmitted 10 ping pkts. But they aren't
deleted from the internal FIFO; only the pointers have been updated. Take a look
at the registers again with:

cat /proc/net/e1000/eth#

Now use:

echo 10 > /proc/net/e1000/eth#

Naturally, 10 is the number we used above.
This "resets" the registers and records in the TDFPC that there are 10
pkts in the TX queue.
Now when we do:

ping -c 1 x.x.x.x

You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the
new one).
If you try to measure the TX speed you will see that it is ~ the wire
speed.

Notes:
- if you don't have static arp tables there will also be some arp
pkts (should be two more pkts)
- if you write too many pkts it probably won't work, because the FIFO is
organized like a circular buffer and you will begin to overwrite the
first pkts
- the normal ping pkts aren't minimum size, but you can reduce them with the -s
option
- the code modifications have been written with "quick and dirty"
in mind; certainly it is possible to write them better
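
For reference, the /proc hook described above boils down to something like
this sketch against the 2.4-era proc interface (the handler name is made up;
it assumes a TDFPC register definition is available in e1000_hw.h):

static int e1000_proc_write(struct file *file, const char *buffer,
			    unsigned long count, void *data)
{
	struct e1000_adapter *adapter = data;
	char buf[16];
	unsigned long npkts;

	if (count >= sizeof(buf))
		return -EINVAL;
	if (copy_from_user(buf, buffer, count))
		return -EFAULT;
	buf[count] = '\0';
	npkts = simple_strtoul(buf, NULL, 10);

	/* tell the NIC its TX data FIFO already holds 'npkts' frames */
	E1000_WRITE_REG(&adapter->hw, TDFPC, npkts);
	return count;
}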
-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 20:01         ` jamal
  2004-11-29 10:19           ` P
@ 2004-11-29 13:09           ` Robert Olsson
  2004-11-29 20:16             ` David S. Miller
  2004-11-30 13:31             ` jamal
  1 sibling, 2 replies; 85+ messages in thread
From: Robert Olsson @ 2004-11-29 13:09 UTC (permalink / raw)
  To: hadi
  Cc: P, mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev


jamal writes:

 > Have to read the paper - When Robert was last visiting here; we did some
 > tests and packet recycling is not very valuable as far as SMP is
 > concerned (given that packets can be alloced on one CPU and freed on
 > another). There a clear win on single CPU machines.


 Correct, yes, at your lab about 2 1/2 years ago. I see those experiments in a
 different light today, as we never got any packet budget contribution
 from SMP with a shared-memory arch whatsoever. I spent a week with Alexey in the lab
 to understand what's going on. Two flows with total affinity (one for each CPU);
 we even removed all locks and part of the IP stack. We were still confused...

 Opteron/NUMA, on the other hand, gave a good contribution in those setups. We started thinking
 it must be latency and memory controllers that make the difference, as
 each CPU has its own memory and memory controller in the Opteron case.

 So from that aspect we were expecting the impossible from the recycling patch;
 maybe it will do better on boxes with local memory.

 But I think we should give up skb recycling in its current form. If we extend
 it to deal with cache bouncing etc., we end up having something like slab in
 every driver. slab has improved and is not so dominant in profiles now.

 Also, from what I understand, new HW and MSI can help in the case where we
 pass objects between CPUs. Did I dream it, or did someone tell me that S2IO
 could have several TX rings that could be routed via MSI to the proper cpu?
 
 slab packet-objects have been discussed. They would make some contribution,
 but is the complexity worth it?
 
 Also I think it could be possible to do a more lightweight variant of skb
 recycling in case we need to recycle the PCI-mapping etc.

					 --ro

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-26 19:56     ` jamal
@ 2004-11-29 14:21       ` Marco Mellia
  2004-11-30 13:46         ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-29 14:21 UTC (permalink / raw)
  To: hadi
  Cc: mellia, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, 2004-11-26 at 20:56, jamal wrote:
> On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> > If you don't trust us, please, ignore this email.
> > Sorry.
> 
> Dont take it the wrong way please - nobody has been able to produce the
> results you have. So thats why you may be getting that comment.
> The fact you have been able to do this is a good thing.

No problem from this side. I also forgot a couple of 8-! I guess...

[...]

> prefetching as in the use of prefetch()?
> What were you prefetching if you end up dropping packet?
> 

Sorry, I used the wrong terms there.
What we discovered is that the CPU caching mechanisms have a HUGE impact,
and that you have very little control over them. Prefetching may help, but
it is difficult to predict its impact...

Indeed, if you access the packet struct, the CPU has to fetch data
from main memory, where the packet was stored after being transferred via DMA from
the NIC. The memory access penalty is huge, and you have little
control over it.

In our experiments, we modified the kernel to drop packets just after
receiving them. skbs are just deallocated (using standard kernel
routines, i.e., no recycling is used). Logically, that happens when
netif_rx() is called.

Now, we have three cases:
1) just modify netif_rx() to drop packets.
2) as in 1, plus remove the protocol check in the driver
(i.e., comment out the line
      skb->protocol = eth_type_trans(skb, netdev);
) to avoid accessing the real packet data.
3) as in 2, but deallocation is performed at the driver level, instead of
calling netif_rx().

In the first case, we can receive about 1.1Mpps (~80% of packets).
In the second case, we can receive 100% of packets, as we removed the
penalty of looking at the packet headers to discover the protocol type.

In the third case, we can NOT receive 100% of packets!
The only difference is that we actually _REMOVED_ a function call. This
reduces the overhead, but then the compiler/cpu/whatever cannot optimize the
data path that accesses the skb which must be freed.
Our guess is that freeing the skb in the netif_rx() function
actually allows the compiler/cpu to prefetch the skb itself, and
therefore keep the pipeline working...

My guess is that if you change the compiler, cpu, or memory subsystem, you may
get very counterintuitive results...
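
A minimal sketch of the difference between cases 2 and 3 (this would sit in
the driver's RX clean routine, e.g. e1000_clean_rx_irq(); drop_in_stack is a
made-up switch, purely for illustration):

static void rx_drop_test(struct sk_buff *skb, int drop_in_stack)
{
	if (drop_in_stack) {
		/* case 2: eth_type_trans() is commented out, so the packet
		 * data is never touched; netif_rx() has been modified to
		 * free the skb immediately instead of queueing it */
		netif_rx(skb);
	} else {
		/* case 3: free the skb right in the driver, saving the call
		 * into the stack -- yet this measured slower in these tests */
		dev_kfree_skb_irq(skb);
	}
}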

-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29  8:53       ` Marco Mellia
@ 2004-11-29 14:50         ` Lennert Buytenhek
  2004-11-30  8:42           ` Marco Mellia
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-29 14:50 UTC (permalink / raw)
  To: Marco Mellia
  Cc: Harald Welte, P, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Mon, Nov 29, 2004 at 09:53:33AM +0100, Marco Mellia wrote:

> Th's our intuition too.
> Notice that we get the same results with 3com (broadcom based) gigabit
> cards.
> We are thinking of sending packet in "bursts" instead of single
> transfers. The only problem is to let the NIC know that there are more
> than a packet in a burst...

Jamal implemented exactly this for e1000 already, he might be persuaded
into posting his patch here.  Jamal? :)


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 12:44     ` Marco Mellia
@ 2004-11-29 15:19       ` Lennert Buytenhek
  2004-11-29 17:32         ` Marco Mellia
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-29 15:19 UTC (permalink / raw)
  To: Marco Mellia
  Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, Nov 29, 2004 at 01:44:27PM +0100, Marco Mellia wrote:

> This "resets" the registers and writes in the TDFPC that there are 10
> pkts in the TX queue.
> Now when we do:
> 
> ping -c 1 x.x.x.x
> 
> You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the
> new one).
> If you try to measure the TX speed you will see that it is ~ the wire
> speed.

How are you measuring this?


cheers,
Lennert

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 15:19       ` Lennert Buytenhek
@ 2004-11-29 17:32         ` Marco Mellia
  2004-11-29 19:08           ` Lennert Buytenhek
  0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-29 17:32 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Marco Mellia, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

Using the Agilent Router tester as receiver...
:-(

> On Mon, Nov 29, 2004 at 01:44:27PM +0100, Marco Mellia wrote:
> 
> > This "resets" the registers and writes in the TDFPC that there are 10
> > pkts in the TX queue.
> > Now when we do:
> > 
> > ping -c 1 x.x.x.x
> > 
> > You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the
> > new one).
> > If you try to measure the TX speed you will see that it is ~ the wire
> > speed.
> 
> How are you measuring this?
> 
> 
> cheers,
> Lennert
-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 17:32         ` Marco Mellia
@ 2004-11-29 19:08           ` Lennert Buytenhek
  2004-11-29 19:09             ` Lennert Buytenhek
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-29 19:08 UTC (permalink / raw)
  To: Marco Mellia
  Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, Nov 29, 2004 at 06:32:13PM +0100, Marco Mellia wrote:

> Using the Agilent Router tester as receiver...
> :-(

OK, so you're measuring the inter-packet gap and in that burst of 11
(or whatever many) packets it's 96 bit times between every packet, yes?

Interesting.  Can you also try 'pre-loading' the TX ring with a bunch of
packets and then writing the values 0, 1, 2, 3 ... n-1 to the TXD register
with back-to-back MMIO writes (instead of doing a single write of the
value 'n'), and check what inter-packet gap you get then?


cheers,
Lennert

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 19:08           ` Lennert Buytenhek
@ 2004-11-29 19:09             ` Lennert Buytenhek
  0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-29 19:09 UTC (permalink / raw)
  To: Marco Mellia
  Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, Nov 29, 2004 at 08:08:08PM +0100, Lennert Buytenhek wrote:

> packets and then writing the values 0, 1, 2, 3 ... n-1 to the TXD register
								^^^
That should be TDT.
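
FWIW, a rough sketch of the two variants (illustrative only; 'n', 'i' and
'adapter' are assumed from the surrounding driver context):

	/* single MMIO write: hand all n preloaded descriptors to the NIC */
	E1000_WRITE_REG(&adapter->hw, TDT, n);

	/* vs. back-to-back MMIO writes: advance the tail one descriptor
	 * at a time (tail points one past the last ready descriptor) */
	for (i = 1; i <= n; i++)
		E1000_WRITE_REG(&adapter->hw, TDT, i);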


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 13:09           ` Robert Olsson
@ 2004-11-29 20:16             ` David S. Miller
  2004-12-01 16:47               ` Robert Olsson
  2004-11-30 13:31             ` jamal
  1 sibling, 1 reply; 85+ messages in thread
From: David S. Miller @ 2004-11-29 20:16 UTC (permalink / raw)
  To: Robert Olsson
  Cc: hadi, P, mellia, Robert.Olsson, e1000-devel, jorge.finochietto,
	galante, netdev

On Mon, 29 Nov 2004 14:09:08 +0100
Robert Olsson <Robert.Olsson@data.slu.se> wrote:

>  Did I dream or did someone tell me that S2IO 
>  could have several TX ring that could via MSI be routed to proper cpu?

One of Sun's gigabit chips can do this too, except it isn't
via MSI, the driver has to read the descriptor to figure out
which cpu gets the software interrupt to process the packet.

SGI had hardware which allowed you to do this kind of stuff too.

Obviously the MSI version works much better.

It is important, the cpu selection process.  First of all, it must
be calculated such that flows always go through the same cpu.
Otherwise TCP sockets bounce between the cpus for a streaming
transfer.
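
Purely as an illustration (the hash choice here is arbitrary, and this is
not necessarily what any of those chips or drivers actually do), such a
scheme amounts to something like:

	#include <linux/jhash.h>

	/* steer every packet of a flow to the same cpu */
	static inline int flow_to_cpu(u32 saddr, u32 daddr,
				      u16 sport, u16 dport, int ncpus)
	{
		u32 h = jhash_3words(saddr, daddr,
				     ((u32) sport << 16) | dport, 0);
		return h % ncpus;
	}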

And even this doesn't avoid all such problems, TCP LISTEN state
sockets will still thrash between the cpus with such a "pick
a cpu based upon" flow scheme.

Anyways, just some thoughts.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 14:50         ` Lennert Buytenhek
@ 2004-11-30  8:42           ` Marco Mellia
  2004-12-01 12:25             ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-30  8:42 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Marco Mellia, Harald Welte, P, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, 2004-11-29 at 15:50, Lennert Buytenhek wrote:
> On Mon, Nov 29, 2004 at 09:53:33AM +0100, Marco Mellia wrote:
> 
> > Th's our intuition too.
> > Notice that we get the same results with 3com (broadcom based) gigabit
> > cards.
> > We are thinking of sending packet in "bursts" instead of single
> > transfers. The only problem is to let the NIC know that there are more
> > than a packet in a burst...
> 
> Jamal implemented exactly this for e1000 already, he might be persuaded
> into posting his patch here.  Jamal? :)

I guess that saying that we are _very_ interested in this might help.
:-)
We can offer as "beta-testers" as well...

-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 13:09           ` Robert Olsson
  2004-11-29 20:16             ` David S. Miller
@ 2004-11-30 13:31             ` jamal
  2004-11-30 13:46               ` Lennert Buytenhek
  1 sibling, 1 reply; 85+ messages in thread
From: jamal @ 2004-11-30 13:31 UTC (permalink / raw)
  To: Robert Olsson
  Cc: P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, 2004-11-29 at 08:09, Robert Olsson wrote:
> jamal writes:
> 
>  > Have to read the paper - When Robert was last visiting here; we did some
>  > tests and packet recycling is not very valuable as far as SMP is
>  > concerned (given that packets can be alloced on one CPU and freed on
>  > another). There a clear win on single CPU machines.
> 
> 
>  Correct yes at you lab about 2 1/2 years ago. 

How time flies when you are having fun ;->

> I see those experiments in a 
>  different light today as we never got any packet budget contribution
>  from SMP with shared mem arch whatsoever. Spent a week w. Alexey in the lab 
>  to understand whats going on. Two flows with total affinity (for each CPU)
>  even removed all locks and part of the IP stack. We were still confused...
> 
>  When Opteron/NUMA gave good contribution in those setups. We start thinking
>  it must be latency and memory controllers that makes the difference. As w. 
>  each CPU has it's own memory and memory controller in Opteron case.
> 
>  So from that aspect we expecting the impossible from recycling patch
>  maybe it will do better on boxes w. local memory.
> 

Interesting thought. Not using a lot of my brain cells to compute, I
would say that it would get worse. But I suppose the real reason this
gets nasty on x86-style SMP is because cache misses are more expensive
there, maybe?

>  But I think we should give it up in current form skb recycling. If extend 
>  it to deal cache bouncing etc. We end up having something like slab in 
>  every driver. slab has improved is not so dominant in profiles now.
> 

nod.

>  Also from what I understand new HW and MSI can help in the case where
>  pass objects between CPU. Did I dream or did someone tell me that S2IO 
>  could have several TX ring that could via MSI be routed to proper cpu?

I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
like more hell to maintain.
 
>  slab packet-objects have been discussed. It would do some contribution
>  but is the complexity worth it?

May not be worth it.

>  
>  Also I think it could possible to do more lightweight variant of skb
>  recycling in case we need to recycle PCI-mapping etc.
>

I think it's valuable to have it for people with UP; it's not worth the
complexity for SMP IMO.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-30 13:31             ` jamal
@ 2004-11-30 13:46               ` Lennert Buytenhek
  2004-11-30 14:25                 ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-30 13:46 UTC (permalink / raw)
  To: jamal
  Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Tue, Nov 30, 2004 at 08:31:41AM -0500, jamal wrote:

> >  Also from what I understand new HW and MSI can help in the case where
> >  pass objects between CPU. Did I dream or did someone tell me that S2IO 
> >  could have several TX ring that could via MSI be routed to proper cpu?
> 
> I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
> like more hell to maintain.

On the TX path you'd have qdiscs to deal with as well, no?


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 14:21       ` Marco Mellia
@ 2004-11-30 13:46         ` jamal
  2004-12-02 17:24           ` Marco Mellia
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-11-30 13:46 UTC (permalink / raw)
  To: mellia; +Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, 2004-11-29 at 09:21, Marco Mellia wrote:
> On Fri, 2004-11-26 at 20:56, jamal wrote:
> > On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> > > If you don't trust us, please, ignore this email.
> > > Sorry.
> > 
> > Dont take it the wrong way please - nobody has been able to produce the
> > results you have. So thats why you may be getting that comment.
> > The fact you have been able to do this is a good thing.
> 
> No problem from this side. I also forgot a couple of 8-! I guess...
> 
> [...]
> 
> > prefetching as in the use of prefetch()?
> > What were you prefetching if you end up dropping packet?
> > 
> 

I read your paper over the weekend - there's one thing which I don't think
has been written about before on NAPI, and which you covered, unfortunately
with no melodrama ;-> This is the min-max fairness issue. If you actually mix
and match different speeds then it becomes a really interesting problem.
Example: try congesting a 100Mbps port with 2x1Gbps. What quotas to use, etc.
Could this be done cleverly at runtime with dynamic adjustments, etc.?
Next time you want to put students to work on something, talk to us
- I've got plenty of things you could try out to keep them busy forever ;->
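
Just to make the quota question concrete, the kind of knob I have in mind is
along these lines (numbers completely made up, this is not a proposal):

	/* scale the NAPI weight with link speed so a 100Mbps port is not
	 * starved by two gig ports sharing the same softirq budget */
	static void set_napi_weight(struct net_device *dev, int speed_mbps)
	{
		dev->weight = (64 * speed_mbps) / 1000;
		if (dev->weight < 16)
			dev->weight = 16;
	}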

> Sorry I used the wrong terms there.
> What we discovered, is that the CPU caching mechanisms as a HUGE impact.
> And that you have very little control on it. Prefetching may help, but
> it is difficult to tredict its impacts...

Prefetching is hard. The only evidence I have seen of what actually
"appears" to be working prefetching is some code from David Morsberger
at HP. Other architectures are known to be more friendly - my experiences
with MIPS are far more pleasant. BTW, that's another topic to get those
students to investigate ;->

> Indeed, if you access to the packet struct, the CPU has to fetch data
> from the main memory, which stored the packet transfered using DMA from
> the NIC. The penalty in the memory access is huge, and you have little
> control on it.
> 
> In our experiments, we modified the kernel to drop packets just after
> receiving them. skb are just deallocated (using standerd kernel
> routines, i.e., no recycling is used). Logically, that happen when the
> netif_rx() is called.
> 
> Now, we have three cases
> 1) just mofify the netif_rx() to drop packets.
> 2) as in one, plus remove the protocol check in the driver
> (i.e., comment the line
>       skb->protocol = eth_type_trans(skb, netdev);
> ) to avoid to access the real packet data.
> 3) as in 2, but dealloc is performed at the driver level, instead of
> calling the netif_rx()
> 
> In the first case, we can receive about 1.1Mpps (~80% of packets)

Possible. I was able to receive 900Kpps or so in my experiments with
gact drop which is slightly above this with a 2.4 Ghz machine with IRQ
affinity.

> In the second case, we can receive 100% of packets, as we removed the
> penalty of looking at the packet headers to discover its protocol type.
> 

This is the one people found hard to believe. I will go and retest this.
It is possible. 

> In the third case, we can NOT receive 100% of packets!
> The only difference is that we actually _REMOVED_ a funcion call. This
> reduces the overhead, and the compiler/cpu/whatever can not optimize the
> data path to access to the skb which must be freed.

It doesn't seem like you were running NAPI if you depended on calling
netif_rx.
In that case, #3 would be freeing in hard IRQ context while #2 is in
softIRQ.

> Our guess is that by freeing up the skb in the netif_rx() function
> actually allows the compiler/cpu to prefetch the skb itself, and
> therefore keep the pipeline working...
> 
> My guess is that if you change compiler, cpu, memory subsystem, you may
> get very counterintuitive results...

Refer to my comment above.
Repeat tests with NAPI and see if you get same results.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-30 13:46               ` Lennert Buytenhek
@ 2004-11-30 14:25                 ` jamal
  2004-12-01  0:11                   ` Lennert Buytenhek
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-11-30 14:25 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Tue, 2004-11-30 at 08:46, Lennert Buytenhek wrote:
> On Tue, Nov 30, 2004 at 08:31:41AM -0500, jamal wrote:
> 
> > >  Also from what I understand new HW and MSI can help in the case where
> > >  pass objects between CPU. Did I dream or did someone tell me that S2IO 
> > >  could have several TX ring that could via MSI be routed to proper cpu?
> > 
> > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
> > like more hell to maintain.
> 
> On the TX path you'd have qdiscs to deal with as well, no?

I think management of it would be non-trivial in SMP. You'd have to start
playing stupid load-balancing tricks, which would reduce the value of the
tx irqs' existence to begin with.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-30 14:25                 ` jamal
@ 2004-12-01  0:11                   ` Lennert Buytenhek
  2004-12-01  1:09                     ` Scott Feldman
  2004-12-01 12:08                     ` jamal
  0 siblings, 2 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-01  0:11 UTC (permalink / raw)
  To: jamal
  Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Tue, Nov 30, 2004 at 09:25:54AM -0500, jamal wrote:

> > > >  Also from what I understand new HW and MSI can help in the case where
> > > >  pass objects between CPU. Did I dream or did someone tell me that S2IO 
> > > >  could have several TX ring that could via MSI be routed to proper cpu?
> > > 
> > > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
> > > like more hell to maintain.
> > 
> > On the TX path you'd have qdiscs to deal with as well, no?
> 
> I think management of it would be non-trivial in SMP. Youd have to start
> playing stupid loadbalancing tricks which would reduce the value of
> existence of tx irqs to begin with. 

You mean the management of qdiscs would be non-trivial?

Probably the idea of these kinds of tricks is to skip the qdisc step
altogether.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01  0:11                   ` Lennert Buytenhek
@ 2004-12-01  1:09                     ` Scott Feldman
  2004-12-01 15:34                       ` Robert Olsson
                                         ` (3 more replies)
  2004-12-01 12:08                     ` jamal
  1 sibling, 4 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-01  1:09 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

Hey, turns out, I know some e1000 tricks that might help get the kpps
numbers up.  

My problem is I only have a P4 desktop system with an 82544 NIC running
at PCI 32-bit/33MHz, so I can't play with the big boys.  But, attached is a
rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
descriptor write-backs.  For me, I see a nice jump in kpps, but I'd like
others to try with their setups.  We should be able to get to wire speed
with 60-byte packets.

I'm using pktgen in linux-2.6.9, count = 1000000.

System: Intel 865 (HT 2.6Ghz)
Nic: 82544 PCI 32-bit/33Mhz
Driver: linux-2.6.9 e1000 (5.3.19-k2-NAPI), no Interrupt Delays

BEFORE

256 descs
  pkt_size = 60:   253432pps 129Mb/sec errors: 0
  pkt_size = 1500: 56356pps  678Mb/sec errors: 499791
4096 descs
  pkt_size = 60:   254222pps 130Mb/sec errors: 0
  pkt_size = 1500: 52693pps  634Mb/sec errors: 497556
                                                                                
AFTER

Modified driver to turn off Tx interrupts and descriptor write-backs.
Uses a timer to schedule Tx cleanup.  The timer runs at 1ms.  This would
work poorly where HZ=100.  Needed to bump Tx descriptors up to 4096
because 1ms is a lot of time with 60-byte packets at 1GbE.  Every time
the timer expires, there is only one PIO read to get HW head pointer. 
This wouldn't work at lower media speeds like 10Mbps or 100Mbps because
the ring isn't large enough (or we would need a higher resolution
timer).  This also gets Tx cleanup out of the NAPI path.

4096 descs
  pkt_size = 60:   541618pps 277Mb/sec errors: 914
  pkt_size = 1500: 76198pps  916Mb/sec errors: 12419
                                                                               
This doubles the kpps numbers for 60-byte packets.  I'd like to see what
happens on higher bus bandwidth systems.  Anyone?

-scott

diff -Naurp linux-2.6.9/drivers/net/e1000/e1000.h linux-2.6.9/drivers/net/e1000.mod/e1000.h
--- linux-2.6.9/drivers/net/e1000/e1000.h	2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/drivers/net/e1000.mod/e1000.h	2004-11-30 14:41:07.045391488 -0800
@@ -103,7 +103,7 @@ struct e1000_adapter;
 #define E1000_MAX_INTR 10
 
 /* TX/RX descriptor defines */
-#define E1000_DEFAULT_TXD                  256
+#define E1000_DEFAULT_TXD                 4096
 #define E1000_MAX_TXD                      256
 #define E1000_MIN_TXD                       80
 #define E1000_MAX_82544_TXD               4096
@@ -189,6 +189,7 @@ struct e1000_desc_ring {
 /* board specific private data structure */
 
 struct e1000_adapter {
+	struct timer_list tx_cleanup_timer;
 	struct timer_list tx_fifo_stall_timer;
 	struct timer_list watchdog_timer;
 	struct timer_list phy_info_timer;
@@ -224,6 +225,7 @@ struct e1000_adapter {
 	uint32_t tx_fifo_size;
 	atomic_t tx_fifo_stall;
 	boolean_t pcix_82544;
+	boolean_t tx_cleanup_scheduled;
 
 	/* RX */
 	struct e1000_desc_ring rx_ring;
diff -Naurp linux-2.6.9/drivers/net/e1000/e1000_hw.h linux-2.6.9/drivers/net/e1000.mod/e1000_hw.h
--- linux-2.6.9/drivers/net/e1000/e1000_hw.h	2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/drivers/net/e1000.mod/e1000_hw.h	2004-11-30 13:48:07.983682328 -0800
@@ -417,14 +417,12 @@ int32_t e1000_set_d3_lplu_state(struct e
 /* This defines the bits that are set in the Interrupt Mask
  * Set/Read Register.  Each bit is documented below:
  *   o RXT0   = Receiver Timer Interrupt (ring 0)
- *   o TXDW   = Transmit Descriptor Written Back
  *   o RXDMT0 = Receive Descriptor Minimum Threshold hit (ring 0)
  *   o RXSEQ  = Receive Sequence Error
  *   o LSC    = Link Status Change
  */
 #define IMS_ENABLE_MASK ( \
     E1000_IMS_RXT0   |    \
-    E1000_IMS_TXDW   |    \
     E1000_IMS_RXDMT0 |    \
     E1000_IMS_RXSEQ  |    \
     E1000_IMS_LSC)
diff -Naurp linux-2.6.9/drivers/net/e1000/e1000_main.c linux-2.6.9/drivers/net/e1000.mod/e1000_main.c
--- linux-2.6.9/drivers/net/e1000/e1000_main.c	2004-10-18 14:53:50.000000000 -0700
+++ linux-2.6.9/drivers/net/e1000.mod/e1000_main.c	2004-11-30 16:15:13.777957656 -0800
@@ -131,7 +131,7 @@ static int e1000_set_mac(struct net_devi
 static void e1000_irq_disable(struct e1000_adapter *adapter);
 static void e1000_irq_enable(struct e1000_adapter *adapter);
 static irqreturn_t e1000_intr(int irq, void *data, struct pt_regs *regs);
-static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter);
+static void e1000_clean_tx(unsigned long data);
 #ifdef CONFIG_E1000_NAPI
 static int e1000_clean(struct net_device *netdev, int *budget);
 static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter,
@@ -286,6 +286,7 @@ e1000_down(struct e1000_adapter *adapter
 
 	e1000_irq_disable(adapter);
 	free_irq(adapter->pdev->irq, netdev);
+	del_timer_sync(&adapter->tx_cleanup_timer);
 	del_timer_sync(&adapter->tx_fifo_stall_timer);
 	del_timer_sync(&adapter->watchdog_timer);
 	del_timer_sync(&adapter->phy_info_timer);
@@ -533,6 +534,10 @@ e1000_probe(struct pci_dev *pdev,
 
 	e1000_get_bus_info(&adapter->hw);
 
+	init_timer(&adapter->tx_cleanup_timer);
+	adapter->tx_cleanup_timer.function = &e1000_clean_tx;
+	adapter->tx_cleanup_timer.data = (unsigned long) adapter;
+
 	init_timer(&adapter->tx_fifo_stall_timer);
 	adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall;
 	adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;
@@ -893,14 +898,9 @@ e1000_configure_tx(struct e1000_adapter 
 	e1000_config_collision_dist(&adapter->hw);
 
 	/* Setup Transmit Descriptor Settings for eop descriptor */
-	adapter->txd_cmd = E1000_TXD_CMD_IDE | E1000_TXD_CMD_EOP |
+	adapter->txd_cmd = E1000_TXD_CMD_EOP |
 		E1000_TXD_CMD_IFCS;
 
-	if(adapter->hw.mac_type < e1000_82543)
-		adapter->txd_cmd |= E1000_TXD_CMD_RPS;
-	else
-		adapter->txd_cmd |= E1000_TXD_CMD_RS;
-
 	/* Cache if we're 82544 running in PCI-X because we'll
 	 * need this to apply a workaround later in the send path. */
 	if(adapter->hw.mac_type == e1000_82544 &&
@@ -1820,6 +1820,11 @@ e1000_xmit_frame(struct sk_buff *skb, st
  		return NETDEV_TX_LOCKED; 
  	} 
 
+	if(!adapter->tx_cleanup_scheduled) {
+		adapter->tx_cleanup_scheduled = TRUE;
+		mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+	}
+
 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
 	if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {
@@ -1856,6 +1861,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
 	netdev->trans_start = jiffies;
 
 	spin_unlock_irqrestore(&adapter->tx_lock, flags);
+
 	return NETDEV_TX_OK;
 }
 
@@ -2151,8 +2157,7 @@ e1000_intr(int irq, void *data, struct p
 	}
 #else
 	for(i = 0; i < E1000_MAX_INTR; i++)
-		if(unlikely(!e1000_clean_rx_irq(adapter) &
-		   !e1000_clean_tx_irq(adapter)))
+		if(unlikely(!e1000_clean_rx_irq(adapter)))
 			break;
 #endif
 
@@ -2170,18 +2175,15 @@ e1000_clean(struct net_device *netdev, i
 {
 	struct e1000_adapter *adapter = netdev->priv;
 	int work_to_do = min(*budget, netdev->quota);
-	int tx_cleaned;
 	int work_done = 0;
 	
-	tx_cleaned = e1000_clean_tx_irq(adapter);
 	e1000_clean_rx_irq(adapter, &work_done, work_to_do);
 
 	*budget -= work_done;
 	netdev->quota -= work_done;
 	
-	/* if no Rx and Tx cleanup work was done, exit the polling mode */
-	if(!tx_cleaned || (work_done < work_to_do) || 
-				!netif_running(netdev)) {
+	/* if no Rx cleanup work was done, exit the polling mode */
+	if((work_done < work_to_do) || !netif_running(netdev)) {
 		netif_rx_complete(netdev);
 		e1000_irq_enable(adapter);
 		return 0;
@@ -2192,66 +2194,74 @@ e1000_clean(struct net_device *netdev, i
 
 #endif
 /**
- * e1000_clean_tx_irq - Reclaim resources after transmit completes
- * @adapter: board private structure
+ * e1000_clean_tx - Reclaim resources after transmit completes
+ * @data: timer callback data (board private structure)
  **/
 
-static boolean_t
-e1000_clean_tx_irq(struct e1000_adapter *adapter)
+static void
+e1000_clean_tx(unsigned long data)
 {
+	struct e1000_adapter *adapter = (struct e1000_adapter *)data;
 	struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
-	struct e1000_tx_desc *tx_desc, *eop_desc;
 	struct e1000_buffer *buffer_info;
-	unsigned int i, eop;
-	boolean_t cleaned = FALSE;
+	unsigned int i, next;
+	int size = 0, count = 0;
+	uint32_t tx_head;
 
-	i = tx_ring->next_to_clean;
-	eop = tx_ring->buffer_info[i].next_to_watch;
-	eop_desc = E1000_TX_DESC(*tx_ring, eop);
+	spin_lock(&adapter->tx_lock);
 
-	while(eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) {
-		for(cleaned = FALSE; !cleaned; ) {
-			tx_desc = E1000_TX_DESC(*tx_ring, i);
-			buffer_info = &tx_ring->buffer_info[i];
+	tx_head = E1000_READ_REG(&adapter->hw, TDH);
 
-			if(likely(buffer_info->dma)) {
-				pci_unmap_page(pdev,
-					       buffer_info->dma,
-					       buffer_info->length,
-					       PCI_DMA_TODEVICE);
-				buffer_info->dma = 0;
-			}
+	i = next = tx_ring->next_to_clean;
 
-			if(buffer_info->skb) {
-				dev_kfree_skb_any(buffer_info->skb);
-				buffer_info->skb = NULL;
-			}
+	while(i != tx_head) {
+		size++;
+		if(i == tx_ring->buffer_info[next].next_to_watch) {
+			count += size;
+			size = 0;
+			if(unlikely(++i == tx_ring->count))
+				i = 0;
+			next = i;
+		} else {
+			if(unlikely(++i == tx_ring->count))
+				i = 0;
+		}
+	}
 
-			tx_desc->buffer_addr = 0;
-			tx_desc->lower.data = 0;
-			tx_desc->upper.data = 0;
+	i = tx_ring->next_to_clean;
+	while(count--) {
+		buffer_info = &tx_ring->buffer_info[i];
 
-			cleaned = (i == eop);
-			if(unlikely(++i == tx_ring->count)) i = 0;
+		if(likely(buffer_info->dma)) {
+			pci_unmap_page(pdev,
+				       buffer_info->dma,
+				       buffer_info->length,
+				       PCI_DMA_TODEVICE);
+			buffer_info->dma = 0;
 		}
-		
-		eop = tx_ring->buffer_info[i].next_to_watch;
-		eop_desc = E1000_TX_DESC(*tx_ring, eop);
+
+		if(buffer_info->skb) {
+			dev_kfree_skb_any(buffer_info->skb);
+			buffer_info->skb = NULL;
+		}
+
+		if(unlikely(++i == tx_ring->count))
+			i = 0;
 	}
 
 	tx_ring->next_to_clean = i;
 
-	spin_lock(&adapter->tx_lock);
+	if(E1000_DESC_UNUSED(tx_ring) != tx_ring->count)
+		mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+	else
+		adapter->tx_cleanup_scheduled = FALSE;
 
-	if(unlikely(cleaned && netif_queue_stopped(netdev) &&
-		    netif_carrier_ok(netdev)))
+	if(unlikely(netif_queue_stopped(netdev) && netif_carrier_ok(netdev)))
 		netif_wake_queue(netdev);
 
 	spin_unlock(&adapter->tx_lock);
-
-	return cleaned;
 }
 
 /**

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01  0:11                   ` Lennert Buytenhek
  2004-12-01  1:09                     ` Scott Feldman
@ 2004-12-01 12:08                     ` jamal
  2004-12-01 15:24                       ` Lennert Buytenhek
  1 sibling, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-01 12:08 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Tue, 2004-11-30 at 19:11, Lennert Buytenhek wrote:
> On Tue, Nov 30, 2004 at 09:25:54AM -0500, jamal wrote:
> 
> > > > >  Also from what I understand new HW and MSI can help in the case where
> > > > >  pass objects between CPU. Did I dream or did someone tell me that S2IO 
> > > > >  could have several TX ring that could via MSI be routed to proper cpu?
> > > > 
> > > > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
> > > > like more hell to maintain.
> > > 
> > > On the TX path you'd have qdiscs to deal with as well, no?
> > 
> > I think management of it would be non-trivial in SMP. Youd have to start
> > playing stupid loadbalancing tricks which would reduce the value of
> > existence of tx irqs to begin with. 
> 
> You mean the management of qdiscs would be non-trivial?

I mean it is useful in only the most ideal cases, and if you want to
actually do something useful in most cases with it you will have to
muck around.
Take the case of forwarding (maybe with a little or almost no localhost
generated traffic) - then you end up allocating on CPU A, processing and
queueing on egress. The Tx softirq, which is what stashes the packet on the
tx DMA ring eventually, is not guaranteed to run on the same CPU. Now add a
little latency between ingress and egress ..
The ideal case is where you end up processing to completion from ingress
to egress (which is known to happen in Linux when there's no congestion).

> Probably the idea of these kinds of tricks is to skip the qdisc step
> altogether.
> 

Which is preached by the BSD folks - bogus in my opinion. If you want to
do something as bland/boring as that you can probably afford a $500
DLINK router which can do it at wire rate (with the cost being that you are
locked into whatever features they have).

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-30  8:42           ` Marco Mellia
@ 2004-12-01 12:25             ` jamal
  2004-12-02 13:39               ` Marco Mellia
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-01 12:25 UTC (permalink / raw)
  To: mellia
  Cc: Lennert Buytenhek, Harald Welte, P, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Tue, 2004-11-30 at 03:42, Marco Mellia wrote:
> On Mon, 2004-11-29 at 15:50, Lennert Buytenhek wrote:
> > On Mon, Nov 29, 2004 at 09:53:33AM +0100, Marco Mellia wrote:
> > 
> > > Th's our intuition too.
> > > Notice that we get the same results with 3com (broadcom based) gigabit
> > > cards.
> > > We are thinking of sending packet in "bursts" instead of single
> > > transfers. The only problem is to let the NIC know that there are more
> > > than a packet in a burst...
> > 
> > Jamal implemented exactly this for e1000 already, he might be persuaded
> > into posting his patch here.  Jamal? :)
> 
> I guess that saying that we are _very_ interested in this might help.
> :-)
> We can offer as "beta-testers" as well...

Sorry, I missed this (I wasn't CCed, so it went to a low priority queue which
I read on a best effort basis).
Let me clean up the patches a little bit this weekend. The patch is at
least 4 months old; the latest reincarnation was due to issue1 on my SUCON
presentation. Would a patch against the latest 2.6.x bitkeeper (whatever it
is this weekend) be fine? If you are in a rush and don't mind a little
ugliness then I will pass them along as is.

BTW, Scott posted an interesting patch yesterday, you may wanna give that
a shot as well.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01 12:08                     ` jamal
@ 2004-12-01 15:24                       ` Lennert Buytenhek
  0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-01 15:24 UTC (permalink / raw)
  To: jamal
  Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Wed, Dec 01, 2004 at 07:08:20AM -0500, jamal wrote:

[ per-CPU TX/RX rings ]
> > You mean the management of qdiscs would be non-trivial?
> 
> I mean it is useful in only the most ideal cases and if you want to
> actually do something useful in most cases with it you will have to
> muck around.
> Take the case of forwarding (maybe with a little or almost no localhost
> generated traffic) - then you end allocating in CPUA, processing and
> queueing on egress. Tx softirq, which is what stashes the packet on tx
> DMA eventually, is not guaranteed to run on the same CPU. Now add a
> little latency between ingress and egress ..
> The ideal case is where you end up processing to completion from ingress
> to egress (which is known to happen in Linux when theres no congestion).

We disagreed on this topic at SUCON and I'm afraid we'll be disagreeing
on it forever :)  IMHO, on 10GbE any kind of qdisc is a waste of cycles.

I don't think it's very likely that you'll be using that single 10GbE NIC
for forwarding packets, doing that with a PC at this point in the history
of PCs is just silly.  If you do use it for forwarding, how likely is it
that you'll be able to process an incoming burst of packets fast enough
to require queueing on the egress interface?  You have to be able to send
a burst of packets bigger than the NIC's TX FIFO at >10GbE in the first
place for queueing to be effective/useful at all.

(Leaving the question of whether or not there'll be some room in the TX
FIFO at TX time unanswered, what you're doing with per-CPU TX rings is
basically just simulating the "N individual NICs each bound to its own
CPU" case with a single NIC.)


> > Probably the idea of these kinds of tricks is to skip the qdisc step
> > altogether.
> 
> Which is preached by the BSD folks - bogus in my opinion. If you want to
> do something as bland/boring as that you can probably afford a $500
> DLINK router which can do it at wire rate with (with cost you being
> locked in whatever features they have).

That's an unfair comparison.  Just because I don't need CBQ doesn't mean
my $500 DLINK router does everything I'd want it to -- advanced firewalling
is one thing that comes to mind.  Last time I looked I couldn't load my
own kernel modules on my DLINK router either.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01  1:09                     ` Scott Feldman
@ 2004-12-01 15:34                       ` Robert Olsson
  2004-12-01 16:49                         ` Scott Feldman
  2004-12-01 18:29                       ` Lennert Buytenhek
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 85+ messages in thread
From: Robert Olsson @ 2004-12-01 15:34 UTC (permalink / raw)
  To: sfeldma
  Cc: Lennert Buytenhek, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev


Scott Feldman writes:
 > Hey, turns out, I know some e1000 tricks that might help get the kpps
 > numbers up.  
 > 
 > My problem is I only have a P4 desktop system with a 82544 nic running
 > at PCI 32/33Mhz, so I can't play with the big boys.  But, attached is a
 > rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
 > descriptor write-backs.  For me, I see a nice jump in kpps, but I'd like
 > others to try with their setups.  We should be able to get to wire speed
 > with 60-byte packets.
 > 
 > System: Intel 865 (HT 2.6Ghz)
 > Nic: 82544 PCI 32-bit/33Mhz
 > Driver: linux-2.6.9 e1000 (5.3.19-k2-NAPI), no Interrupt Delays

 > 4096 descs
 >   pkt_size = 60:   541618pps 277Mb/sec errors: 914

Hello!

Nice, but I see no improvements w. 82546GB @ 133 MHz on a 1.6 GHz Opteron it seems.
SMP kernel linux-2.6.9-rc2

Vanilla. 
  801077pps 410Mb/sec (410151424bps) errors: 95596

Patch TXD=4096
  608690pps 311Mb/sec (311649280bps) errors: 0

Patch TXD=2048
  624103pps 319Mb/sec (319540736bps) errors: 0

Patch TXD=1024
  551289pps 282Mb/sec (282259968bps) errors: 4506

Error count is a bit confusing...

						--ro

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-29 20:16             ` David S. Miller
@ 2004-12-01 16:47               ` Robert Olsson
  0 siblings, 0 replies; 85+ messages in thread
From: Robert Olsson @ 2004-12-01 16:47 UTC (permalink / raw)
  To: David S. Miller
  Cc: Robert Olsson, hadi, P, mellia, e1000-devel, jorge.finochietto,
	galante, netdev


David S. Miller writes:

 > >  Did I dream or did someone tell me that S2IO 
 > >  could have several TX ring that could via MSI be routed to proper cpu?
 > 
 > One of Sun's gigabit chips can do this too, except it isn't
 > via MSI, the driver has to read the descriptor to figure out
 > which cpu gets the software interrupt to process the packet.
 > 
 > SGI had hardware which allowed you to do this kind of stuff too.
 > 
 > Obviously the MSI version works much better.
 > 
 > It is important, the cpu selection process.  First of all, it must
 > be calculated such that flows always go through the same cpu.
 > Otherwise TCP sockets bounce between the cpus for a streaming
 > transfer.
 > 
 > And even this doesn't avoid all such problems, TCP LISTEN state
 > sockets will still thrash between the cpus with such a "pick
 > a cpu based upon" flow scheme.
 > 
 > Anyways, just some thoughts.

 Thanks for the info. Well, we'll be forced to get into those problems when
 the HW is capable. I guess it will be w. the 10 GIGE cards.

					     --ro

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01 15:34                       ` Robert Olsson
@ 2004-12-01 16:49                         ` Scott Feldman
  2004-12-01 17:37                           ` Robert Olsson
                                             ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-01 16:49 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Lennert Buytenhek, jamal, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Wed, 2004-12-01 at 07:34, Robert Olsson wrote:
> Nice but I no improvements w. 82546GB @ 133 MHz on 1.6 GHz Opteron it seems.
> SMP kernel linux-2.6.9-rc2
> 
> Vanilla. 
>   801077pps 410Mb/sec (410151424bps) errors: 95596
> 
> Patch TXD=4096
>   608690pps 311Mb/sec (311649280bps) errors: 0

Thank you Robert for trying it out.

Well those results are counter-intuitive!  We remove Tx interrupts and
Tx descriptor DMA write-backs and get no re-tries, and performance
drops?  The only bus activities left are the DMA of buffers to device
and the register writes to increment tail.  I'm stumped.  I'll need to
get my hands on a faster system.  Maybe there is a bus analyzer under
the tree.  :-)

-scott

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01 16:49                         ` Scott Feldman
@ 2004-12-01 17:37                           ` Robert Olsson
  2004-12-02 17:54                           ` Robert Olsson
  2004-12-02 18:23                           ` Robert Olsson
  2 siblings, 0 replies; 85+ messages in thread
From: Robert Olsson @ 2004-12-01 17:37 UTC (permalink / raw)
  To: sfeldma
  Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev


Scott Feldman writes:

 > Thank you Robert for trying it out.
 > 
 > Well those results are counter-intuitive!  We remove Tx interrupts and
 > Tx descriptor DMA write-backs and get no re-tries, and performance
 > drops?  The only bus activities left are the DMA of buffers to device
 > and the register writes to increment tail.  I'm stumped.  I'll need to
 > get my hands on a faster system.  Maybe there is a bus analyzer under
 > the tree.  :-)

 Huh. I've got a deja-vu feeling. What will happen if we remove almost all
 events (interrupts) and just have the timer waking up once-in-a-while?

						    --ro

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01  1:09                     ` Scott Feldman
  2004-12-01 15:34                       ` Robert Olsson
@ 2004-12-01 18:29                       ` Lennert Buytenhek
  2004-12-01 21:35                         ` Lennert Buytenhek
  2004-12-02 17:31                       ` [E1000-devel] Transmission limit Marco Mellia
  2004-12-03 20:57                       ` Lennert Buytenhek
  3 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-01 18:29 UTC (permalink / raw)
  To: Scott Feldman
  Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Tue, Nov 30, 2004 at 05:09:59PM -0800, Scott Feldman wrote:

> This doubles the kpps numbers for 60-byte packets.  I'd like to see what
> happens on higher bus bandwidth systems.  Anyone?

Dual Xeon 2.4GHz, a 82540EM and a 82541GI both on 32/66 on separate
PCI buses.

BEFORE performance is approx the same for both, ~620kpps.
AFTER performance is ~730kpps, also approx the same for both.

(Note: only sending with one NIC at a time.)

Once or twice it went into a state where it started spitting out these
kinds of messages and never recovered:

	Dec  1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
	[...]
	Dec  1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
	[...]
	Dec  1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out

But overall, looks good.  Strange thing that Robert's numbers didn't
improve.  Doing some more measurements right now.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01 18:29                       ` Lennert Buytenhek
@ 2004-12-01 21:35                         ` Lennert Buytenhek
  2004-12-02  6:13                           ` Scott Feldman
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-01 21:35 UTC (permalink / raw)
  To: Scott Feldman
  Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

[-- Attachment #1: Type: text/plain, Size: 1209 bytes --]

On Wed, Dec 01, 2004 at 07:29:43PM +0100, Lennert Buytenhek wrote:

> > This doubles the kpps numbers for 60-byte packets.  I'd like to see what
> > happens on higher bus bandwidth systems.  Anyone?
> 
> Dual Xeon 2.4GHz, a 82540EM and a 82541GI both on 32/66 on separate
> PCI buses.
> 
> BEFORE performance is approx the same for both, ~620kpps.
> AFTER performance is ~730kpps, also approx the same for both.

Pretty graph attached.  From ~220B packets or so it does wire speed, but
there's still an odd drop in performance around 256B packets (which is
also there without your patch.)  From 350B packets or so, performance is
identical with or without your patch (wire speed.)

So.  Do you have any other good plans perhaps? :)


> Once or twice it went into a state where it started spitting out these
> kinds of messages and never recovered:
> 
> 	Dec  1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> 	[...]
> 	Dec  1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> 	[...]
> 	Dec  1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out

Didn't see this happen anymore.  (ifconfig down and then up recovered it
both times I saw it happen.)


thanks,
Lennert

[-- Attachment #2: feldman.png --]
[-- Type: image/png, Size: 7959 bytes --]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01 21:35                         ` Lennert Buytenhek
@ 2004-12-02  6:13                           ` Scott Feldman
  2004-12-03 13:24                             ` jamal
  2004-12-05 14:50                             ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
  0 siblings, 2 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-02  6:13 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Wed, 2004-12-01 at 13:35, Lennert Buytenhek wrote: 
> Pretty graph attached.  From ~220B packets or so it does wire speed, but
> there's still an odd drop in performance around 256B packets (which is
> also there without your patch.)  From 350B packets or so, performance is
> identical with or without your patch (wire speed.)

Seems this is helping PCI NICs but not PCI-X.  I was using PCI 32/33.
Can't explain the dip around 256B.

> So.  Do you have any other good plans perhaps? :)

Idea#1

Is the write of TDT causing interference with DMA transactions?

In addition to my patch, what happens if you bump the Tx tail every n
packets, where n is like 16 or 32 or 64?  

if((i % 16) == 0)
	E1000_WRITE_REG(&adapter->hw, TDT, i);

This might piss the NETDEV timer off if the send count isn't a multiple
of n, so you might want to disable netdev->tx_timeout.

Idea#2

The Ultimate: queue up 4096 packets and then write TDT once to send all
4096 in one shot.  Well, maybe a few less than 4096 so we don't wrap the
ring.  How about pkt_size = 4000?

Take my patch and change the timer call in e1000_xmit_frame from 

	jiffies + 1

to

	jiffies + HZ

This will schedule the cleanup of the skbs 1 second after the first
queue, so we shouldn't be doing any cleanup while the 4000 packets are
DMA'ed.

Oh, and change the tail write to

if((i % 4000) == 0)
	E1000_WRITE_REG(&adapter->hw, TDT, i);

Of course you'll need to close/open the driver after each run.

Idea#3

http://www.mail-archive.com/freebsd-net@freebsd.org/msg10826.html

Set TXDMAC to 0 in e1000_configure_tx.
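
I.e., something like this in e1000_configure_tx() (just a sketch; it assumes
the TXDMAC register define from e1000_hw.h):

	E1000_WRITE_REG(&adapter->hw, TXDMAC, 0);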

> > Once or twice it went into a state where it started spitting out these
> > kinds of messages and never recovered:
> > 
> > 	Dec  1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> > 	[...]
> > 	Dec  1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> > 	[...]
> > 	Dec  1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> 
> Didn't see this happen anymore.  (ifconfig down and then up recovered it
> both times I saw it happen.)

Well, it's probably not a HW bug that's causing the reset; it's probably
some bug with my patch.

-scott

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01 12:25             ` jamal
@ 2004-12-02 13:39               ` Marco Mellia
  2004-12-03 13:07                 ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-12-02 13:39 UTC (permalink / raw)
  To: hadi
  Cc: mellia, Lennert Buytenhek, Harald Welte, P, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

> > > > We are thinking of sending packet in "bursts" instead of single
> > > > transfers. The only problem is to let the NIC know that there are more
> > > > than a packet in a burst...
> > > 
> > > Jamal implemented exactly this for e1000 already, he might be persuaded
> > > into posting his patch here.  Jamal? :)
> > 
> > I guess that saying that we are _very_ interested in this might help.
> > :-)
> > We can offer as "beta-testers" as well...
> 
> Sorry missed this (I wasnt CCed so it went to a low priority queue which
> i read on a best effort basis).
> Let me clean up the patches a little bit this weekend. The patch is at
> least 4 months old; latest reincarnation was due to issue1 on my SUCON
> presentation. Would a patch against latest 2.6.x bitkeeper (whatever it
> is this weekend) be fine? If you are in a rush and dont mind a little
> ugliness then i will pass them as is.
> 
We'll be glad to spend some time trying this out. However, we are not
very comfortable with the Linux bitkeeper maintenance method. Can we ask
you to provide us a patch against a standard kernel/driver (whatever you
prefer...)? Also a complete source sub-tree would be ok ;-)

> BTW, Scott posted a interesting patch yesterday, you may wanna give that
> a shot as well. 

We're trying that out right now... (which means, that in a couple of
days, we'll try it ;-))

Thanks a lot.

-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-11-30 13:46         ` jamal
@ 2004-12-02 17:24           ` Marco Mellia
  0 siblings, 0 replies; 85+ messages in thread
From: Marco Mellia @ 2004-12-02 17:24 UTC (permalink / raw)
  To: hadi
  Cc: mellia, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

> > In our experiments, we modified the kernel to drop packets just after
> > receiving them. skb are just deallocated (using standerd kernel
> > routines, i.e., no recycling is used). Logically, that happen when the
> > netif_rx() is called.
> > 
> > Now, we have three cases
> > 1) just mofify the netif_rx() to drop packets.
> > 2) as in one, plus remove the protocol check in the driver
> > (i.e., comment the line
> >       skb->protocol = eth_type_trans(skb, netdev);
> > ) to avoid to access the real packet data.
> > 3) as in 2, but dealloc is performed at the driver level, instead of
> > calling the netif_rx()
> > 
> > In the first case, we can receive about 1.1Mpps (~80% of packets)
> 
> Possible. I was able to receive 900Kpps or so in my experiments with
> gact drop which is slightly above this with a 2.4 Ghz machine with IRQ
> affinity.

I double checked with the people that actually did the job. They indeed
tested both cases, i.e., dropping packets either using IRQ (therefore
using netif_rx()) or using NAPI (therefore using netif_receive_skb()).

In both cases, disabling the eth_type_trans() check, we receive 100% of
packets...

> > In the third case, we can NOT receive 100% of packets!
> > The only difference is that we actually _REMOVED_ a funcion call. This
> > reduces the overhead, and the compiler/cpu/whatever can not optimize the
> > data path to access to the skb which must be freed.
> 
> It doesnt seem like you were runing NAPI if you depended on calling
> netif_rx
> In that case, #3 would be freeing in hard IRQ context while #2 is
> softIRQ.

Again, it was my mistake. Case #3 was performed using the NAPI stack,
i.e., freeing up the skb instead of calling netif_receive_skb().
Doing that, we observed a performance drop, which we attribute to some caching
issues. Indeed, investigating with Oprofile, in case #3 we register
about twice the number of cache misses compared to case #2.
Again, we do not have a plain explanation, but our intuition is that
adding a function call with a pointer as argument might allow the
compiler/cpu to prefetch the skb and speed up the memory release...


> > Our guess is that by freeing up the skb in the netif_rx() function
> > actually allows the compiler/cpu to prefetch the skb itself, and
> > therefore keep the pipeline working...
> > 
> > My guess is that if you change compiler, cpu, memory subsystem, you may
> > get very counterintuitive results...
> 
> Refer to my comment above.
> Repeat tests with NAPI and see if you get same results.

We were using NAPI. Sorry for the misunderstanding.
Hope this helps.



-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01  1:09                     ` Scott Feldman
  2004-12-01 15:34                       ` Robert Olsson
  2004-12-01 18:29                       ` Lennert Buytenhek
@ 2004-12-02 17:31                       ` Marco Mellia
  2004-12-03 20:57                       ` Lennert Buytenhek
  3 siblings, 0 replies; 85+ messages in thread
From: Marco Mellia @ 2004-12-02 17:31 UTC (permalink / raw)
  To: sfeldma
  Cc: birke, Lennert Buytenhek, jamal, Robert Olsson, P, mellia,
	e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Wed, 2004-12-01 at 02:09, Scott Feldman wrote:
> Hey, turns out, I know some e1000 tricks that might help get the kpps
> numbers up.  
> 
> My problem is I only have a P4 desktop system with a 82544 nic running
> at PCI 32/33Mhz, so I can't play with the big boys.  But, attached is a
> rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
> descriptor write-backs.  For me, I see a nice jump in kpps, but I'd like
> others to try with their setups.  We should be able to get to wire speed
> with 60-byte packets.
> 

Here are the numbers in our setup:

vanilla kernel [2.4.20 + packetgen + driver e1000 5.4.11]
4096 Descr => 356 Mbps (60 bytes long frames)
           => 941 Mbps (1500 bytes long frames)

256 Descr  => 354 Mbps (60 bytes long frames)
           => 941 Mbps (1500 bytes long frames)

Patched driver [2.4.20 + packetgen + driver e1000 5.4.11 patched]
4096 Descr => 357 Mbps (60 bytes long frames)
           => 941 Mbps (1500 bytes long frames)

I guess that was _not_ the bottleneck, sigh... at least with a PCI-X bus.
Again, a latency issue of the DMA transfer from RAM to the NIC?

-- 
Ciao,                    /\/\/\rco

+-----------------------------------+  
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608              |
| Tel: 39-011-564-4173              |
| Cel: 39-340-9674888               |   /"\  .. . . . . . . . . . . . .
| Politecnico di Torino             |   \ /  . ASCII Ribbon Campaign  .
| Corso Duca degli Abruzzi 24       |    X   .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy            |   / \  .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia  |        .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01 16:49                         ` Scott Feldman
  2004-12-01 17:37                           ` Robert Olsson
@ 2004-12-02 17:54                           ` Robert Olsson
  2004-12-02 18:23                           ` Robert Olsson
  2 siblings, 0 replies; 85+ messages in thread
From: Robert Olsson @ 2004-12-02 17:54 UTC (permalink / raw)
  To: sfeldma
  Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev


Scott Feldman writes:
 > Thank you Robert for trying it out.

 Scott! 

 I've rerun some of the tests. I've set maxcpus=1 to make sure everything
 happens on one CPU. Same HW as yesterday.

 I now see a lot of variation in the results from your patch.

 vanilla 
   804353pps 411Mb/sec (411828736bps) errors: 98877

 patch TXD=4096
 Sometimes:   882362pps 451Mb/sec (451769344bps) errors: 0

 patch TXD=2048
 Sometimes:   943007pps 482Mb/sec (482819584bps) errors: 0

 But it very often runs around 500 kpps with the patch. This smells like
 scheduling to me, as smaller rings usually mean higher performance, but the
 ring needs to be big enough to hide latencies.

 See also my next mail...

						--ro

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01 16:49                         ` Scott Feldman
  2004-12-01 17:37                           ` Robert Olsson
  2004-12-02 17:54                           ` Robert Olsson
@ 2004-12-02 18:23                           ` Robert Olsson
  2004-12-02 23:25                             ` Lennert Buytenhek
                                               ` (2 more replies)
  2 siblings, 3 replies; 85+ messages in thread
From: Robert Olsson @ 2004-12-02 18:23 UTC (permalink / raw)
  To: sfeldma
  Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev


Hello!

Below is a little patch to clean skbs at xmit. It's an old jungle trick Jamal
and I used w. the tulip. Note we can now even decrease the size of the TX ring.

It can increase TX performance from 800 kpps to
  1125128pps 576Mb/sec (576065536bps) errors: 0
  1124946pps 575Mb/sec (575972352bps) errors: 0

But it suffers from the same scheduling problems as the previous patch. Often we just get
  582108pps 298Mb/sec (298039296bps) errors: 0

When the sending CPU frees (its) skbs, we might get some "TX free affinity",
which is unrelated to irq affinity and of course not 100% perfect.
 
And some of Scott's changes may still be used. 

--- drivers/net/e1000/e1000.h.orig	2004-12-01 13:59:36.000000000 +0100
+++ drivers/net/e1000/e1000.h	2004-12-02 20:11:31.000000000 +0100
@@ -103,7 +103,7 @@
 #define E1000_MAX_INTR 10
 
 /* TX/RX descriptor defines */
-#define E1000_DEFAULT_TXD                  256
+#define E1000_DEFAULT_TXD                  128
 #define E1000_MAX_TXD                      256
 #define E1000_MIN_TXD                       80
 #define E1000_MAX_82544_TXD               4096
--- drivers/net/e1000/e1000_main.c.orig	2004-12-01 13:59:36.000000000 +0100
+++ drivers/net/e1000/e1000_main.c	2004-12-02 20:37:40.000000000 +0100
@@ -1820,6 +1820,10 @@
  		return NETDEV_TX_LOCKED; 
  	} 
 
+
+	if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 )
+		e1000_clean_tx_ring(adapter);
+
 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
 	if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {


						  --ro

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-02 18:23                           ` Robert Olsson
@ 2004-12-02 23:25                             ` Lennert Buytenhek
  2004-12-03  5:23                             ` Scott Feldman
  2004-12-10 16:24                             ` Martin Josefsson
  2 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-02 23:25 UTC (permalink / raw)
  To: Robert Olsson
  Cc: sfeldma, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto,
	Giulio Galante, netdev

On Thu, Dec 02, 2004 at 07:23:24PM +0100, Robert Olsson wrote:

> Below is a little patch to clean skbs at xmit. It's an old jungle trick Jamal
> and I used w. the tulip. Note we can now even decrease the size of the TX ring.
> 
> It can increase TX performance from 800 kpps to
>   1125128pps 576Mb/sec (576065536bps) errors: 0
>   1124946pps 575Mb/sec (575972352bps) errors: 0
> 
> But it suffers from the same scheduling problems as the previous patch. Often we just get
>   582108pps 298Mb/sec (298039296bps) errors: 0

Robert, there is something weird with your setup with packet sizes under
160 bytes.  Can you check if you also get wildly variable numbers on a
baseline kernel perhaps?  The numbers you sent me of packet size vs. pps
were very jumpy as well, even at 10M pkts per run.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-02 18:23                           ` Robert Olsson
  2004-12-02 23:25                             ` Lennert Buytenhek
@ 2004-12-03  5:23                             ` Scott Feldman
  2004-12-10 16:24                             ` Martin Josefsson
  2 siblings, 0 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-03  5:23 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Lennert Buytenhek, jamal, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Thu, 2004-12-02 at 10:23, Robert Olsson wrote: 
> It can increase TX performance from 800 kpps to
>   1125128pps 576Mb/sec (576065536bps) errors: 0
>   1124946pps 575Mb/sec (575972352bps) errors: 0

These are the best numbers reported so far, right?

> And some of Scott's changes may still be used. 

Did you try combining the two?

> +
> +	if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 )
> +		e1000_clean_tx_ring(adapter);
> +

You want to use E1000_DESC_UNUSED here because of the ring wrap. ;-)
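Roughly, the macro in e1000.h handles the wrap for you (a sketch from
memory; check the exact form in your driver version):

/* Free descriptors in the ring, valid on both sides of the wrap point. */
#define E1000_DESC_UNUSED(R) \
	((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \
	 (R)->next_to_clean - (R)->next_to_use - 1)

A plain next_to_use - next_to_clean goes negative once next_to_use wraps
back past next_to_clean, so the "> 80" test above can give the wrong answer.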

-scott

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-02 13:39               ` Marco Mellia
@ 2004-12-03 13:07                 ` jamal
  0 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2004-12-03 13:07 UTC (permalink / raw)
  To: mellia
  Cc: Lennert Buytenhek, Harald Welte, P, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Thu, 2004-12-02 at 08:39, Marco Mellia wrote:

> We'll be glad to spend some time trying this out. Please, we are not
> very comfortable with the linux bitkeeper maintenance method. Can we ask
> you to provide us a patch to a standard kernel/driver (whatever you
> prefer...)? Also a complete source sub-tree would be ok ;-)

Would a -rcX patch be fine for you?
2.6.10-rc2; which means you will take 2.6.9, patch it with
patch-2.6.10-rc2.gz from the kernel.org/v2.6/testing directory, then
patch one more time with the patch i give you. 
Let me know if you are uncomfortable with that as well.
[Sorry, I am disk poor and my stupid ISP still charges $1/MB/month even
in this age if i put it up at cyberus].

In the patch i give you i will include rx path improvement code that I
got from David Morsberger; I "think" i have seen some improvements with
it but i am not 100% sure. If you repeat the test where you drop the
packet right after eth_type_trans() with this patch on, I would be very
interested if you see any improvements.    
 
In any case, expect something from me this weekend or monday (big party
this weekend ;->).

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-02  6:13                           ` Scott Feldman
@ 2004-12-03 13:24                             ` jamal
  2004-12-05 14:50                             ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
  1 sibling, 0 replies; 85+ messages in thread
From: jamal @ 2004-12-03 13:24 UTC (permalink / raw)
  To: sfeldma
  Cc: Lennert Buytenhek, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Thu, 2004-12-02 at 01:13, Scott Feldman wrote:
> On Wed, 2004-12-01 at 13:35, Lennert Buytenhek wrote: 
> > Pretty graph attached.  From ~220B packets or so it does wire speed, but
> > there's still an odd drop in performance around 256B packets (which is
> > also there without your patch.)  From 350B packets or so, performance is
> > identical with or without your patch (wire speed.)
>
> Seems this is helping PCI nics but not PCI-X.  I was using PCI 32/33. 
> Can't explain the dip around 256B.
> 

Interesting thought. I also saw improvements with my batching patch for
PCI 32/33 but nothing noticeable in PCI-X 64/66.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-01  1:09                     ` Scott Feldman
                                         ` (2 preceding siblings ...)
  2004-12-02 17:31                       ` [E1000-devel] Transmission limit Marco Mellia
@ 2004-12-03 20:57                       ` Lennert Buytenhek
  2004-12-04 10:36                         ` Lennert Buytenhek
  3 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-03 20:57 UTC (permalink / raw)
  To: Scott Feldman
  Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

[-- Attachment #1: Type: text/plain, Size: 2459 bytes --]

On Tue, Nov 30, 2004 at 05:09:59PM -0800, Scott Feldman wrote:

> Hey, turns out, I know some e1000 tricks that might help get the kpps
> numbers up.  
> 
> My problem is I only have a P4 desktop system with a 82544 nic running
> at PCI 32/33Mhz, so I can't play with the big boys.  But, attached is a
> rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
> descriptor write-backs.  For me, I see a nice jump in kpps, but I'd like
> others to try with their setups.  We should be able to get to wire speed
> with 60-byte packets.

Attached is a graph of my numbers with and without your patch for:
- An 82540 at PCI 32/33, idle 33MHz card on the same bus forcing it to 33MHz.
- An 82541 at PCI 32/66.
- An 82546 at PCI-X 64/100, NIC can do 133MHz but mobo only does 100MHz.

All 'phi' tests were done on my box phi, a dual 2.4GHz Xeon on an Intel
SE7505VB2 board (http://www.intel.com/design/servers/se7505vb2/).  I've
included Robert's 64/133 numbers ('sourcemage') on his dual 866MHz P3 for
comparison.  I didn't test all packet sizes up to 1500, just the first few
hundred bytes for each.

As before, the max # pps at 60B packets is strongly influenced by the per-
packet overhead (which seems to be reduced by your patch for my machine
quite a bit, also on 64/100, even though Robert sees no improvement on
64/133) while the slope of each curve appears to depend only on the speed
of the bus the NIC is in.  I.e. the 60B kpps number more-or-less determines
the shape of the rest of the graph in each case.

Bus speed is most likely also the reason why the 64/100 setup w/o your patch
starts off slower than the 64/66 with your patch, but then eventually beats
the 64/66 (around 140B packets) just before they both hit the GigE saturation
point.

There's no drop at 256B for the 64/100 setup like with the 32/* setups.
Perhaps the drop at 256B is because of the PCI latency timer being set
to 64 by default, which causes transfers on the 32-bit bus to be broken
up into 256-byte chunks?
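
(Back-of-the-envelope sketch, my numbers: 64 latency-timer clocks at 4
bytes per data phase on a 32-bit bus is roughly 256 bytes per burst, which
would fit the dip.  The timer is easy to dump from the driver, or via
lspci -v; the dump_latency_timer() name below is just illustrative:)

#include <linux/kernel.h>
#include <linux/pci.h>

/* Sketch only: print the PCI latency timer of the NIC. */
static void dump_latency_timer(struct pci_dev *pdev)
{
	u8 lat;

	pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &lat);
	printk(KERN_INFO "e1000: PCI latency timer = %u clocks\n", lat);
}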

I'm not able to saturate gigabit on 32/33 with 1500B packets, while Jamal
does.  Another thing to look into.

Also note that the 64/100 NIC has rather wobbly performance between 60B and
~160B bytes.  This 'square wave pattern' is there both with and without your
patch, perhaps something particular to the NIC.  Its period appears to be 16
bytes, dropping down where packet_size mod 16 = 0, and then jumping up again
a bit when packet_size mod 16 = 6.  Odd.


--L

[-- Attachment #2: perf.png --]
[-- Type: image/png, Size: 31312 bytes --]

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-03 20:57                       ` Lennert Buytenhek
@ 2004-12-04 10:36                         ` Lennert Buytenhek
  0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-04 10:36 UTC (permalink / raw)
  To: Scott Feldman
  Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Fri, Dec 03, 2004 at 09:57:06PM +0100, Lennert Buytenhek wrote:

> > My problem is I only have a P4 desktop system with a 82544 nic running
> > at PCI 32/33Mhz, so I can't play with the big boys.  But, attached is a
> > rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
> > descriptor write-backs.  For me, I see a nice jump in kpps, but I'd like
> > others to try with their setups.  We should be able to get to wire speed
> > with 60-byte packets.
> 
> Attached is a graph of my numbers with and without your patch for:
> - An 82540 at PCI 32/33, idle 33MHz card on the same bus forcing it to 33MHz.
> - An 82541 at PCI 32/66.
> - An 82546 at PCI-X 64/100, NIC can do 133MHz but mobo only does 100MHz.

When extrapolating these numbers to the 0-byte packet case (which then
tells you the per-packet overhead), I get the following approximate numbers:

case				overhead 

phi-32-33-82540-2.6.9		1.86 us
phi-32-66-82541-2.6.9		1.41 us
phi-64-100-82546-2.6.9		1.45 us

phi-32-33-82540-2.6.9-feldman	1.48 us
phi-32-66-82541-2.6.9-feldman	1.13 us
phi-64-100-82546-2.6.9-feldman	1.25 us

Note that this figure doesn't differ all that much between the different
bus widths/speeds.

In any case, if I ever want to get more than ~880kpps on this hardware,
there's no other way than to make this overhead go down.  For saturating
1Gb/s with 60B packets on 64/100, the overhead can't be more than ~0.59 us
per packet or you lose.
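
(Where that budget comes from, as a quick sketch -- my arithmetic, not
measured: a minimum frame occupies 60 + 4 CRC + 8 preamble + 12 IFG = 84
bytes on the wire, i.e. 672 ns per packet at 1 Gb/s, ~1.49 Mpps; the bus
transfer time then eats part of those 672 ns.)

#include <stdio.h>

int main(void)
{
	double wire_bytes = 60 + 4 + 8 + 12;  /* frame + CRC + preamble + IFG */
	double ns_per_pkt = wire_bytes * 8.0; /* 1 bit per ns at 1 Gb/s */

	printf("budget: %.0f ns/packet, %.2f Mpps\n",
	       ns_per_pkt, 1e3 / ns_per_pkt);
	return 0;
}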


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-02  6:13                           ` Scott Feldman
  2004-12-03 13:24                             ` jamal
@ 2004-12-05 14:50                             ` Lennert Buytenhek
  2004-12-05 15:03                               ` Martin Josefsson
  1 sibling, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 14:50 UTC (permalink / raw)
  To: Scott Feldman
  Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Wed, Dec 01, 2004 at 10:13:33PM -0800, Scott Feldman wrote:

> Idea#3
> 
> http://www.mail-archive.com/freebsd-net@freebsd.org/msg10826.html
> 
> Set TXDMAC to 0 in e1000_configure_tx.

Enabling 'DMA packet prefetching' gives me an impressive boost in performance.
Combined with your TX clean rework, I now get 1.03Mpps TX performance at 60B
packets.  Transmitting from both of the 82546 ports at the same time gives me
close to 2 Mpps.

The freebsd post hints that (some) e1000 hardware might be buggy w.r.t. this
prefetching though.

I'll play some more with the other ideas you suggested as well.

60      1036488
61      1037413
62      1036429
63      990239
64      993218
65      993233
66      993201
67      993234
68      993219
69      993208
70      992225
71      980560


--L


diff -ur e1000.orig/e1000_main.c e1000/e1000_main.c
--- e1000.orig/e1000_main.c	2004-12-04 11:43:12.000000000 +0100
+++ e1000/e1000_main.c	2004-12-05 15:40:49.284946897 +0100
@@ -879,6 +894,8 @@
 
 	E1000_WRITE_REG(&adapter->hw, TCTL, tctl);
 
+	E1000_WRITE_REG(&adapter->hw, TXDMAC, 0);
+
 	e1000_config_collision_dist(&adapter->hw);
 
 	/* Setup Transmit Descriptor Settings for eop descriptor */

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 14:50                             ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
@ 2004-12-05 15:03                               ` Martin Josefsson
  2004-12-05 15:15                                 ` Lennert Buytenhek
                                                   ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 15:03 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Lennert Buytenhek wrote:

> Enabling 'DMA packet prefetching' gives me an impressive boost in performance.
> Combined with your TX clean rework, I now get 1.03Mpps TX performance at 60B
> packets.  Transmitting from both of the 82546 ports at the same time gives me
> close to 2 Mpps.
>
> The freebsd post hints that (some) e1000 hardware might be buggy w.r.t. this
> prefetching though.
>
> I'll play some more with the other ideas you suggested as well.
>
> 60      1036488

I was just playing with prefetching when you sent your mail :)

I get that number with Scott's patch but without prefetching.
If I move the TDT update to the tx cleaning I get a few extra kpps but not
much.

BUT if I use the above + prefetching I get this:

60      1483890
64      1418568
68      1356992
72      1300523
76      1248568
80      1142989
84      1140909
88      1114951
92      1076546
96      960732
100     949801
104     972876
108     945314
112     918380
116     891393
120     865923
124     843288
128     696465

Which is pretty nice :)

This is on one port of a 82546GB

The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
the nic is located in a 64/66 slot.

I won't post any patch until I've tested some more and cleaned up a few
things.

BTW, I also get some transmit timeouts with Scott's patch sometimes, not
often but it does happen.

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:03                               ` Martin Josefsson
@ 2004-12-05 15:15                                 ` Lennert Buytenhek
  2004-12-05 15:19                                   ` Martin Josefsson
  2004-12-05 15:42                                 ` Martin Josefsson
  2004-12-05 21:12                                 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
  2 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 15:15 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, Dec 05, 2004 at 04:03:36PM +0100, Martin Josefsson wrote:

> BUT if I use the above + prefetching I get this:
> 
> 60      1483890
> [snip]
> 
> Which is pretty nice :)

Not just that, it's also wire speed GigE.  Damn.  Now we all have to go
and upgrade to 10GbE cards, and I don't think my girlfriend would give me
one of those for christmas.


> This is on one port of a 82546GB
> 
> The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> the nic is located in a 64/66 slot.

Hmmm.  Funny you get this number even on 64/66.  How many PCI bridges
between the CPUs and the NIC?  Any idea how many cycles an MMIO read on
your hardware is?


cheers,
Lennert

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:15                                 ` Lennert Buytenhek
@ 2004-12-05 15:19                                   ` Martin Josefsson
  2004-12-05 15:30                                     ` Martin Josefsson
  0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 15:19 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Lennert Buytenhek wrote:

> > 60      1483890
> > [snip]
> >
> > Which is pretty nice :)
>
> Not just that, it's also wire speed GigE.  Damn.  Now we all have to go
> and upgrade to 10GbE cards, and I don't think my girlfriend would give me
> one of those for christmas.

Yes it is, and it's lovely to see.
You have to nerdify her so she sees the need for geeky hardware enough to
give you what you need :)

> > This is on one port of a 82546GB
> >
> > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> > the nic is located in a 64/66 slot.
>
> Hmmm.  Funny you get this number even on 64/66.  How many PCI bridges
> between the CPUs and the NIC?  Any idea how many cycles an MMIO read on
> your hardware is?

I verified that I get the same results on a small whimpy 82540EM that runs
at 32/66 as well. Just about to see what I get at 32/33 with that card.

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:19                                   ` Martin Josefsson
@ 2004-12-05 15:30                                     ` Martin Josefsson
  2004-12-05 17:00                                       ` Lennert Buytenhek
  0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 15:30 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Martin Josefsson wrote:

> > > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> > > the nic is located in a 64/66 slot.
> >
> > Hmmm.  Funny you get this number even on 64/66.  How many PCI bridges
> > between the CPUs and the NIC?  Any idea how many cycles an MMIO read on
> > your hardware is?
>
> I verified that I get the same results on a small whimpy 82540EM that runs
> at 32/66 as well. Just about to see what I get at 32/33 with that card.

Just tested the 82540EM at 32/33 and it's a big difference.

60      350229
64      247037
68      219643
72      218205
76      216786
80      215386
84      214003
88      212638
92      211291
96      210004
100     208647
104     182461
108     181468
112     180453
116     179482
120     185472
124     188336
128     153743


Sorry, forgot to answer your other questions, I'm a bit excited at the
moment :)

The 64/66 bus on this motherboard is directly connected to the
northbridge. Here's the lspci output with the 82546GB nic attached
to the 64/66 bus and 82540EM nic connected to the 32/33 bus that hangs
off the southbridge:

00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
00:08.0 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03)
00:08.1 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03)
00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
01:05.0 VGA compatible controller: Silicon Integrated Systems [SiS] 86C326 5598/6326 (rev 0b)
02:05.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
02:06.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02)
02:08.0 Ethernet controller: Intel Corp. 82540EM Gigabit Ethernet Controller (rev 02)

And lspci -t

-[00]-+-00.0
      +-01.0-[01]----05.0
      +-07.0
      +-07.1
      +-07.3
      +-08.0
      +-08.1
      \-10.0-[02]--+-05.0
                   +-06.0
                   \-08.0

I have no idea how expensive an MMIO read is on this machine, do you have
a relatively easy way to find out?

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:03                               ` Martin Josefsson
  2004-12-05 15:15                                 ` Lennert Buytenhek
@ 2004-12-05 15:42                                 ` Martin Josefsson
  2004-12-05 16:48                                   ` Martin Josefsson
                                                     ` (2 more replies)
  2004-12-05 21:12                                 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
  2 siblings, 3 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 15:42 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Martin Josefsson wrote:

[snip]
> BUT if I use the above + prefetching I get this:
>
> 60      1483890
[snip]
> This is on one port of a 82546GB
>
> The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> the nic is located in a 64/66 slot.
>
> I won't post any patch until I've tested some more and cleaned up a few
> things.
>
> BTW, I also get some transmit timeouts with Scott's patch sometimes, not
> often but it does happen.

Here's the patch, not much more tested (it still gives some transmit
timeouts since it's Scott's patch + prefetching and delayed TDT updating).
And it's not cleaned up, but hey, that's development :)

The delayed TDT updating was a test and currently it delays the first tx'd
packet after a timer run by 1ms.

Would be interesting to see what other people get with this thing.
Lennert?

diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000.h linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000.h
--- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000.h	2004-12-04 18:16:53.000000000 +0100
+++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000.h	2004-12-05 15:12:25.000000000 +0100
@@ -101,7 +101,7 @@ struct e1000_adapter;
 #define E1000_MAX_INTR 10

 /* TX/RX descriptor defines */
-#define E1000_DEFAULT_TXD                  256
+#define E1000_DEFAULT_TXD                 4096
 #define E1000_MAX_TXD                      256
 #define E1000_MIN_TXD                       80
 #define E1000_MAX_82544_TXD               4096
@@ -187,6 +187,7 @@ struct e1000_desc_ring {
 /* board specific private data structure */

 struct e1000_adapter {
+	struct timer_list tx_cleanup_timer;
 	struct timer_list tx_fifo_stall_timer;
 	struct timer_list watchdog_timer;
 	struct timer_list phy_info_timer;
@@ -222,6 +223,7 @@ struct e1000_adapter {
 	uint32_t tx_fifo_size;
 	atomic_t tx_fifo_stall;
 	boolean_t pcix_82544;
+	boolean_t tx_cleanup_scheduled;

 	/* RX */
 	struct e1000_desc_ring rx_ring;
diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_hw.h linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_hw.h
--- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_hw.h	2004-12-04 18:16:53.000000000 +0100
+++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_hw.h	2004-12-05 15:37:50.000000000 +0100
@@ -417,14 +417,12 @@ int32_t e1000_set_d3_lplu_state(struct e
 /* This defines the bits that are set in the Interrupt Mask
  * Set/Read Register.  Each bit is documented below:
  *   o RXT0   = Receiver Timer Interrupt (ring 0)
- *   o TXDW   = Transmit Descriptor Written Back
  *   o RXDMT0 = Receive Descriptor Minimum Threshold hit (ring 0)
  *   o RXSEQ  = Receive Sequence Error
  *   o LSC    = Link Status Change
  */
 #define IMS_ENABLE_MASK ( \
     E1000_IMS_RXT0   |    \
-    E1000_IMS_TXDW   |    \
     E1000_IMS_RXDMT0 |    \
     E1000_IMS_RXSEQ  |    \
     E1000_IMS_LSC)
diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_main.c linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_main.c
--- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_main.c	2004-12-05 14:59:19.000000000 +0100
+++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_main.c	2004-12-05 15:40:11.000000000 +0100
@@ -131,7 +131,7 @@ static int e1000_set_mac(struct net_devi
 static void e1000_irq_disable(struct e1000_adapter *adapter);
 static void e1000_irq_enable(struct e1000_adapter *adapter);
 static irqreturn_t e1000_intr(int irq, void *data, struct pt_regs *regs);
-static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter);
+static void e1000_clean_tx(unsigned long data);
 #ifdef CONFIG_E1000_NAPI
 static int e1000_clean(struct net_device *netdev, int *budget);
 static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter,
@@ -286,6 +286,7 @@ e1000_down(struct e1000_adapter *adapter

 	e1000_irq_disable(adapter);
 	free_irq(adapter->pdev->irq, netdev);
+	del_timer_sync(&adapter->tx_cleanup_timer);
 	del_timer_sync(&adapter->tx_fifo_stall_timer);
 	del_timer_sync(&adapter->watchdog_timer);
 	del_timer_sync(&adapter->phy_info_timer);
@@ -522,6 +523,10 @@ e1000_probe(struct pci_dev *pdev,

 	e1000_get_bus_info(&adapter->hw);

+	init_timer(&adapter->tx_cleanup_timer);
+	adapter->tx_cleanup_timer.function = &e1000_clean_tx;
+	adapter->tx_cleanup_timer.data = (unsigned long) adapter;
+
 	init_timer(&adapter->tx_fifo_stall_timer);
 	adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall;
 	adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;
@@ -882,19 +887,16 @@ e1000_configure_tx(struct e1000_adapter
 	e1000_config_collision_dist(&adapter->hw);

 	/* Setup Transmit Descriptor Settings for eop descriptor */
-	adapter->txd_cmd = E1000_TXD_CMD_IDE | E1000_TXD_CMD_EOP |
+	adapter->txd_cmd = E1000_TXD_CMD_EOP |
 		E1000_TXD_CMD_IFCS;

-	if(adapter->hw.mac_type < e1000_82543)
-		adapter->txd_cmd |= E1000_TXD_CMD_RPS;
-	else
-		adapter->txd_cmd |= E1000_TXD_CMD_RS;
-
 	/* Cache if we're 82544 running in PCI-X because we'll
 	 * need this to apply a workaround later in the send path. */
 	if(adapter->hw.mac_type == e1000_82544 &&
 	   adapter->hw.bus_type == e1000_bus_type_pcix)
 		adapter->pcix_82544 = 1;
+
+	E1000_WRITE_REG(&adapter->hw, TXDMAC, 0);
 }

 /**
@@ -1707,7 +1709,7 @@ e1000_tx_queue(struct e1000_adapter *ada
 	wmb();

 	tx_ring->next_to_use = i;
-	E1000_WRITE_REG(&adapter->hw, TDT, i);
+	/* E1000_WRITE_REG(&adapter->hw, TDT, i); */
 }

 /**
@@ -1809,6 +1811,11 @@ e1000_xmit_frame(struct sk_buff *skb, st
  		return NETDEV_TX_LOCKED;
  	}

+	if(!adapter->tx_cleanup_scheduled) {
+		adapter->tx_cleanup_scheduled = TRUE;
+		mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+	}
+
 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
 	if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {
@@ -1845,6 +1852,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
 	netdev->trans_start = jiffies;

 	spin_unlock_irqrestore(&adapter->tx_lock, flags);
+
 	return NETDEV_TX_OK;
 }

@@ -2140,8 +2148,7 @@ e1000_intr(int irq, void *data, struct p
 	}
 #else
 	for(i = 0; i < E1000_MAX_INTR; i++)
-		if(unlikely(!e1000_clean_rx_irq(adapter) &
-		   !e1000_clean_tx_irq(adapter)))
+		if(unlikely(!e1000_clean_rx_irq(adapter)))
 			break;
 #endif

@@ -2159,18 +2166,15 @@ e1000_clean(struct net_device *netdev, i
 {
 	struct e1000_adapter *adapter = netdev->priv;
 	int work_to_do = min(*budget, netdev->quota);
-	int tx_cleaned;
 	int work_done = 0;

-	tx_cleaned = e1000_clean_tx_irq(adapter);
 	e1000_clean_rx_irq(adapter, &work_done, work_to_do);

 	*budget -= work_done;
 	netdev->quota -= work_done;

-	/* if no Rx and Tx cleanup work was done, exit the polling mode */
-	if(!tx_cleaned || (work_done < work_to_do) ||
-				!netif_running(netdev)) {
+	/* if no Rx cleanup work was done, exit the polling mode */
+	if((work_done < work_to_do) || !netif_running(netdev)) {
 		netif_rx_complete(netdev);
 		e1000_irq_enable(adapter);
 		return 0;
@@ -2181,66 +2185,76 @@ e1000_clean(struct net_device *netdev, i

 #endif
 /**
- * e1000_clean_tx_irq - Reclaim resources after transmit completes
- * @adapter: board private structure
+ * e1000_clean_tx - Reclaim resources after transmit completes
+ * @data: timer callback data (board private structure)
  **/

-static boolean_t
-e1000_clean_tx_irq(struct e1000_adapter *adapter)
+static void
+e1000_clean_tx(unsigned long data)
 {
+	struct e1000_adapter *adapter = (struct e1000_adapter *)data;
 	struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
-	struct e1000_tx_desc *tx_desc, *eop_desc;
 	struct e1000_buffer *buffer_info;
-	unsigned int i, eop;
-	boolean_t cleaned = FALSE;
+	unsigned int i, next;
+	int size = 0, count = 0;
+	uint32_t tx_head;

-	i = tx_ring->next_to_clean;
-	eop = tx_ring->buffer_info[i].next_to_watch;
-	eop_desc = E1000_TX_DESC(*tx_ring, eop);
+	spin_lock(&adapter->tx_lock);

-	while(eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) {
-		for(cleaned = FALSE; !cleaned; ) {
-			tx_desc = E1000_TX_DESC(*tx_ring, i);
-			buffer_info = &tx_ring->buffer_info[i];
+	E1000_WRITE_REG(&adapter->hw, TDT, tx_ring->next_to_use);

-			if(likely(buffer_info->dma)) {
-				pci_unmap_page(pdev,
-					       buffer_info->dma,
-					       buffer_info->length,
-					       PCI_DMA_TODEVICE);
-				buffer_info->dma = 0;
-			}
+	tx_head = E1000_READ_REG(&adapter->hw, TDH);

-			if(buffer_info->skb) {
-				dev_kfree_skb_any(buffer_info->skb);
-				buffer_info->skb = NULL;
-			}
+	i = next = tx_ring->next_to_clean;

-			tx_desc->buffer_addr = 0;
-			tx_desc->lower.data = 0;
-			tx_desc->upper.data = 0;
+	while(i != tx_head) {
+		size++;
+		if(i == tx_ring->buffer_info[next].next_to_watch) {
+			count += size;
+			size = 0;
+			if(unlikely(++i == tx_ring->count))
+				i = 0;
+			next = i;
+		} else {
+			if(unlikely(++i == tx_ring->count))
+				i = 0;
+		}
+	}

-			cleaned = (i == eop);
-			if(unlikely(++i == tx_ring->count)) i = 0;
+	i = tx_ring->next_to_clean;
+	while(count--) {
+		buffer_info = &tx_ring->buffer_info[i];
+
+		if(likely(buffer_info->dma)) {
+			pci_unmap_page(pdev,
+				       buffer_info->dma,
+				       buffer_info->length,
+				       PCI_DMA_TODEVICE);
+			buffer_info->dma = 0;
 		}
-
-		eop = tx_ring->buffer_info[i].next_to_watch;
-		eop_desc = E1000_TX_DESC(*tx_ring, eop);
+
+		if(buffer_info->skb) {
+			dev_kfree_skb_any(buffer_info->skb);
+			buffer_info->skb = NULL;
+		}
+
+		if(unlikely(++i == tx_ring->count))
+			i = 0;
 	}

 	tx_ring->next_to_clean = i;

-	spin_lock(&adapter->tx_lock);
+	if(E1000_DESC_UNUSED(tx_ring) != tx_ring->count)
+		mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+	else
+		adapter->tx_cleanup_scheduled = FALSE;

-	if(unlikely(cleaned && netif_queue_stopped(netdev) &&
-		    netif_carrier_ok(netdev)))
+	if(unlikely(netif_queue_stopped(netdev) && netif_carrier_ok(netdev)))
 		netif_wake_queue(netdev);

 	spin_unlock(&adapter->tx_lock);
-
-	return cleaned;
 }

 /**

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:42                                 ` Martin Josefsson
@ 2004-12-05 16:48                                   ` Martin Josefsson
  2004-12-05 17:01                                     ` Martin Josefsson
  2004-12-05 17:58                                     ` Lennert Buytenhek
  2004-12-05 17:44                                   ` Lennert Buytenhek
  2004-12-08 23:36                                   ` Ray Lehtiniemi
  2 siblings, 2 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 16:48 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Martin Josefsson wrote:

> The delayed TDT updating was a test and currently it delays the first tx'd
> packet after a timer run by 1ms.

I removed the delayed TDT updating and gave it a go again (this is scott +
prefetching):

60      1486193
64      1267639
68      1259682
72      1243997
76      1243989
80      1153608
84      1123813
88      1115047
92      1076636
96      1040792
100     1007252
104     975806
108     946263
112     918456
116     892227
120     867477
124     844052
128     821858

It gives slightly different results: 60 bytes is ok but then it falls a lot
at 64 bytes, and the curve seems a bit flatter.

This should be the same driver that Lennert got 1.03Mpps with.
I get 1.03Mpps without prefetching.

I tried using both ports on the 82546GB nic.

        delay        nodelay
1CPU    1.95 Mpps    1.76 Mpps
2CPU    1.60 Mpps    1.44 Mpps

All tests were performed on an SMP kernel; the above mention of 1CPU vs 2CPU
just means how the two nics were bound to the CPUs. And there are no
tx interrupts at all due to Scott's patch.

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:30                                     ` Martin Josefsson
@ 2004-12-05 17:00                                       ` Lennert Buytenhek
  2004-12-05 17:11                                         ` Martin Josefsson
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 17:00 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, Dec 05, 2004 at 04:30:47PM +0100, Martin Josefsson wrote:

> > I verified that I get the same results on a small whimpy 82540EM
> > that runs at 32/66 as well. Just about to see what I get at 32/33
> > with that card.
> 
> Just tested the 82540EM at 32/33 and it's a big difference.
> 
> 60      350229
> 64      247037
> 68      219643
> 72      218205
> 76      216786
> 80      215386
> 84      214003
> 88      212638
> 92      211291
> 96      210004
> 100     208647
> 104     182461
> 108     181468
> 112     180453
> 116     179482
> 120     185472
> 124     188336
> 128     153743

With or without prefetching? My 82540 in 32/33 mode gets on baseline
2.6.9:

60      431967
61      431311
62      431927
63      427827
64      427482

And with Scott's notxints patch:

60      514496
61      514493
62      514754
63      504629
64      504123


> Sorry, forgot to answer your other questions, I'm a bit excited at the
> moment :)

Makes sense :)


> The 64/66 bus on this motherboard is directly connected to the
> northbridge.

Your lspci output seems to suggest there is another PCI bridge in
between (00:10.0)

Basically on my box, it's CPU - MCH - P64H2 - e1000, where MCH is the
'Memory Controller Hub' and P64H2 the PCI-X bridge chip.


> I have no idea how expensive an MMIO read is on this machine, do you have
> a relatively easy way to find out?

A dirty way, yes ;-)  Open up e1000_osdep.h and do:

-#define E1000_READ_REG(a, reg) ( \
-    readl((a)->hw_addr + \
-        (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
+#define E1000_READ_REG(a, reg) ({ \
+    unsigned long s, e, d, v; \
+\
+    (a)->mmio_reads++; \
+    rdtsc(s, d); \
+    v = readl((a)->hw_addr + \
+        (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
+    rdtsc(e, d); \
+    e -= s; \
+    printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
+    printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
+    dump_stack(); \
+    v; \
+})

You might want to disable the stack dump of course.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 16:48                                   ` Martin Josefsson
@ 2004-12-05 17:01                                     ` Martin Josefsson
  2004-12-05 17:58                                     ` Lennert Buytenhek
  1 sibling, 0 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 17:01 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Martin Josefsson wrote:

> I removed the delayed TDT updating and gave it a go again (this is scott +
> prefetching):
>
> 60      1486193
> 64      1267639
> 68      1259682

Yet another mail, I hope you are using a NAPI-enabled MUA :)

This time I tried vanilla + prefetch and it gave pretty nice performance
as well:

60      1308047
64      1076044
68      1079377
72      1058993
76      1055708
80      1025659
84      1024692
88      1024236
92      1024510
96      1012853
100     1007925
104     976500
108     947061
112     919169
116     892804
120     868084
124     844609
128     822381

Large gap between 60 and 64 bytes; maybe the prefetching only prefetches
32 bytes at a time?

As a reference: here's a completely vanilla e1000 driver:

60      860931
64      772949
68      754738
72      754200
76      756093
80      756398
84      742111
88      738120
92      740426
96      739720
100     722322
104     729287
108     719312
112     723171
116     705551
120     704843
124     704622
128     665863

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 17:00                                       ` Lennert Buytenhek
@ 2004-12-05 17:11                                         ` Martin Josefsson
  2004-12-05 17:38                                           ` Martin Josefsson
  0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 17:11 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Lennert Buytenhek wrote:

> > Just tested the 82540EM at 32/33 and it's a big difference.
> >
> > 60      350229
> > 64      247037
> > 68      219643

[snip]

> With or without prefetching? My 82540 in 32/33 mode gets on baseline
> 2.6.9:

With, will test without. I've always suspected that the 32bit bus on this
motherboard is a bit slow.

> Your lspci output seems to suggest there is another PCI bridge in
> between (00:10.0)

Yes it sits between the 32bit and the 64bit bus.

> Basically on my box, it's CPU - MCH - P64H2 - e1000, where MCH is the
> 'Memory Controller Hub' and P64H2 the PCI-X bridge chip.

I don't have PCI-X (unless 64/66 counts as PCI-X, which I highly doubt)

> > I have no idea how expensive an MMIO read is on this machine, do you have
> > a relatively easy way to find out?
>
> A dirty way, yes ;-)  Open up e1000_osdep.h and do:
>
> -#define E1000_READ_REG(a, reg) ( \
> -    readl((a)->hw_addr + \
> -        (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
> +#define E1000_READ_REG(a, reg) ({ \
> +    unsigned long s, e, d, v; \
> +\
> +    (a)->mmio_reads++; \
> +    rdtsc(s, d); \
> +    v = readl((a)->hw_addr + \
> +        (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
> +    rdtsc(e, d); \
> +    e -= s; \
> +    printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
> +    printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
> +    dump_stack(); \
> +    v; \
> +})
>
> You might want to disable the stack dump of course.

Will test this in a while.

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 17:11                                         ` Martin Josefsson
@ 2004-12-05 17:38                                           ` Martin Josefsson
  2004-12-05 18:14                                             ` Lennert Buytenhek
  0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 17:38 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Martin Josefsson wrote:

> > -#define E1000_READ_REG(a, reg) ( \
> > -    readl((a)->hw_addr + \
> > -        (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
> > +#define E1000_READ_REG(a, reg) ({ \
> > +    unsigned long s, e, d, v; \
> > +\
> > +    (a)->mmio_reads++; \
> > +    rdtsc(s, d); \
> > +    v = readl((a)->hw_addr + \
> > +        (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
> > +    rdtsc(e, d); \
> > +    e -= s; \
> > +    printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
> > +    printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
> > +    dump_stack(); \
> > +    v; \
> > +})
> >
> > You might want to disable the stack dump of course.
>
> Will test this in a while.

It gives pretty varied results.
This is during a pktgen run.

The machine is an Athlon MP 2000+, which operates at 1667 MHz.

e1000: MMIO read took 481 clocks
e1000: MMIO read took 369 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 477 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 332 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 372 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 388 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 485 clocks
e1000: MMIO read took 317 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 337 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 409 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 334 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 505 clocks
e1000: MMIO read took 359 clocks
e1000: MMIO read took 484 clocks
e1000: MMIO read took 337 clocks
e1000: MMIO read took 464 clocks
e1000: MMIO read took 504 clocks

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:42                                 ` Martin Josefsson
  2004-12-05 16:48                                   ` Martin Josefsson
@ 2004-12-05 17:44                                   ` Lennert Buytenhek
  2004-12-05 17:51                                     ` Lennert Buytenhek
  2004-12-08 23:36                                   ` Ray Lehtiniemi
  2 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 17:44 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:

> The delayed TDT updating was a test and currently it delays the first tx'd
> packet after a timer run by 1ms.
> 
> Would be interesting to see what other people get with this thing.
> Lennert?

I took Scott's notxints patch, added the prefetch bits and moved the
TDT updating to e1000_clean_tx as you did.

Slightly better than before, but not much:

60      1070157
61      1066610
62      1062088
63      991447
64      991546
65      991537
66      991449
67      990857
68      989882
69      991347

Regular TDT updating:

60      1037469
61      1038425
62      1037393
63      993143
64      992156
65      993137
66      992203
67      992165
68      992185
69      988249


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 17:44                                   ` Lennert Buytenhek
@ 2004-12-05 17:51                                     ` Lennert Buytenhek
  2004-12-05 17:54                                       ` Martin Josefsson
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 17:51 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, Dec 05, 2004 at 06:44:01PM +0100, Lennert Buytenhek wrote:
> On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:
> 
> > The delayed TDT updating was a test and currently it delays the first tx'd
> > packet after a timer run by 1ms.
> > 
> > Would be interesting to see what other people get with this thing.
> > Lennert?
> 
> I took Scott's notxints patch, added the prefetch bits and moved the
> TDT updating to e1000_clean_tx as you did.
> 
> Slightly better than before, but not much:

I've tested all packet sizes now, and delayed TDT updating once per jiffy
(instead of once per packet) indeed gives about 25kpps more on 60,61,62
byte packets, and is hardly worth it for bigger packets.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 17:51                                     ` Lennert Buytenhek
@ 2004-12-05 17:54                                       ` Martin Josefsson
  2004-12-06 11:32                                         ` 1.03Mpps on e1000 (was: " jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 17:54 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 5 Dec 2004, Lennert Buytenhek wrote:

> I've tested all packet sizes now, and delayed TDT updating once per jiffy
> (instead of once per packet) indeed gives about 25kpps more on 60,61,62
> byte packets, and is hardly worth it for bigger packets.

Maybe we can't see any real gains here now, I wonder if it has any effect
if you have lots of nics on the same bus. I mean, in theory it saves a
whole lot of traffic on the bus.

/Martin

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 16:48                                   ` Martin Josefsson
  2004-12-05 17:01                                     ` Martin Josefsson
@ 2004-12-05 17:58                                     ` Lennert Buytenhek
  1 sibling, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 17:58 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, Dec 05, 2004 at 05:48:34PM +0100, Martin Josefsson wrote:

> I tried using both ports on the 82546GB nic.
> 
>         delay        nodelay
> 1CPU    1.95 Mpps    1.76 Mpps
> 2CPU    1.60 Mpps    1.44 Mpps

I get:

	delay		nodelay
1CPU	1837356		1837330
2CPU	2035060		1947424

So in your case using 2 CPUs degrades performance, in my case it
increases it.  And TDT delaying/coalescing only improves performance
when using 2 CPUs, and even then only slightly (and only for <= 62B
packets.)


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 17:38                                           ` Martin Josefsson
@ 2004-12-05 18:14                                             ` Lennert Buytenhek
  0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 18:14 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, Dec 05, 2004 at 06:38:05PM +0100, Martin Josefsson wrote:

> e1000: MMIO read took 481 clocks
> e1000: MMIO read took 369 clocks
> e1000: MMIO read took 481 clocks
> e1000: MMIO read took 11 clocks
> e1000: MMIO read took 477 clocks
> e1000: MMIO read took 316 clocks

Interesting.  On a 1667MHz CPU, this is around ~0.28us per MMIO read
in the worst case.  On my hardware (dual Xeon 2.4GHz), the best case
I've ever seen was ~0.83us.

This alone can make a hell of a difference, esp. for 60B packets.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:03                               ` Martin Josefsson
  2004-12-05 15:15                                 ` Lennert Buytenhek
  2004-12-05 15:42                                 ` Martin Josefsson
@ 2004-12-05 21:12                                 ` Scott Feldman
  2004-12-05 21:25                                   ` Lennert Buytenhek
  2 siblings, 1 reply; 85+ messages in thread
From: Scott Feldman @ 2004-12-05 21:12 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Lennert Buytenhek, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 2004-12-05 at 07:03, Martin Josefsson wrote:
> BUT if I use the above + prefetching I get this:
> 
> 60      1483890

Ok, proof that we can get to 1.4Mpps!  

That's the good news.

The bad news is prefetching is potentially buggy as pointed out in the
freebsd note.  Buggy as in the controller may hang.  Sorry, I don't have
details on what conditions are necessary to cause a hang.

Would Martin or Lennert run these tests for a longer duration so we can
get some data, maybe adding in Rx?  It could be that with the Tx
interrupts and descriptor write-backs removed, prefetching may be ok.  I
don't know.  Intel?

Also, wouldn't it be great if someone wrote a document capturing all of
the accumulated knowledge for future generations?

-scott

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 21:12                                 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
@ 2004-12-05 21:25                                   ` Lennert Buytenhek
  2004-12-06  1:23                                     ` 1.03Mpps on e1000 (was: " Scott Feldman
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 21:25 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Martin Josefsson, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, Dec 05, 2004 at 01:12:22PM -0800, Scott Feldman wrote:

> Would Martin or Lennert run these tests for a longer duration so we can
> get some data, maybe adding in Rx?  It could be that with the Tx
> interrupts and descriptor write-backs removed, prefetching may be ok.  I
> don't know.  Intel?

What your patch does is (correct me if I'm wrong):
- Masking TXDW, effectively preventing it from delivering TXdone ints.
- Not setting E1000_TXD_CMD_IDE in the TXD command field, which causes
  the chip to 'ignore the TIDV' register, which is the 'TX Interrupt
  Delay Value'.  What exactly does this?
- Not setting the "Report Packet Sent"/"Report Status" bits in the TXD
  command field.  Is this the equivalent of the TXdone interrupt?

Just exactly which bit avoids the descriptor writeback?

I'm also a bit worried that only freeing packets 1ms later will mess up
socket accounting and such.  Any ideas on that?


> Also, wouldn't it be great if someone wrote a document capturing all of
> the accumulated knowledge for future generations?

I'll volunteer for that.


--L

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
  2004-12-05 21:25                                   ` Lennert Buytenhek
@ 2004-12-06  1:23                                     ` Scott Feldman
  0 siblings, 0 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-06  1:23 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Martin Josefsson, jamal, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 2004-12-05 at 13:25, Lennert Buytenhek wrote:
> What your patch does is (correct me if I'm wrong):
> - Masking TXDW, effectively preventing it from delivering TXdone ints.
> - Not setting E1000_TXD_CMD_IDE in the TXD command field, which causes
>   the chip to 'ignore the TIDV' register, which is the 'TX Interrupt
>   Delay Value'.  What exactly does this?

A descriptor with IDE set, when written back, starts the Tx delay timer
countdown.  Never setting IDE means the Tx delay timers never expire.

> - Not setting the "Report Packet Sent"/"Report Status" bits in the TXD
>   command field.  Is this the equivalent of the TXdone interrupt?
> 
> Just exactly which bit avoids the descriptor writeback?

As the name implies, Report Status (RS) instructs the controller to
indicate the status of the descriptor by doing a write-back (DMA) to the
descriptor memory.  The only status we care about is the "done"
indicator.  By reading TDH (Tx head), we can figure out where hardware
is without reading the status of each descriptor.  Since we don't need
status, we can turn off RS.
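
Purely as an illustration (not the actual e1000 code; the struct fields
and helper names below are made-up stand-ins), the clean path described
above would look roughly like this:

/*
 * Illustrative sketch only.  Descriptors are queued with neither RS nor
 * IDE set, so the controller never writes status back; instead the
 * driver reads TDH to learn how far the hardware has progressed and
 * frees everything behind it.
 */
struct tx_buffer {
	void *skb;			/* packet to free once sent */
};

struct tx_ring {
	struct tx_buffer *buffer_info;	/* one entry per descriptor */
	unsigned int count;		/* number of descriptors */
	unsigned int next_to_use;	/* next free slot (send path) */
	unsigned int next_to_clean;	/* oldest un-reclaimed slot */
};

/* Hypothetical helpers standing in for the MMIO read and skb teardown. */
extern unsigned int read_tdh(struct tx_ring *ring);
extern void free_tx_buffer(struct tx_buffer *buf);

static void tx_clean_by_tdh(struct tx_ring *ring)
{
	unsigned int hw_head = read_tdh(ring);	/* one register read */

	/* Everything before TDH is done; no per-descriptor DD bit needed. */
	while (ring->next_to_clean != hw_head) {
		free_tx_buffer(&ring->buffer_info[ring->next_to_clean]);
		ring->next_to_clean = (ring->next_to_clean + 1) % ring->count;
	}
}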

> I'm also a bit worried that only freeing packets 1ms later will mess up
> socket accounting and such.  Any ideas on that?

Well the timer solution is less than ideal, and any protocols that are
sensitive to getting Tx resources returned by the driver as quickly as
possible are not going to be happy.  I don't know if 1ms is quick enough.

You could eliminate the timer by doing the cleanup first thing in
xmit_frame, but then you have two problems: 1) you might end up reading
TDH for each send, and that's going to be expensive; 2) calls to
xmit_frame might stop, leaving uncleaned work until xmit_frame is called
again.
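
One way to soften problem 1, sketched below with the same made-up types
as above and not tested, is to pay for the TDH read only when enough
descriptors are outstanding; problem 2 would still need a slack timer as
a fallback:

#define CLEAN_THRESHOLD 64	/* arbitrary: reclaim once this many are queued */

extern int queue_tx_descriptor(struct tx_ring *ring, void *skb);	/* hypothetical */

static unsigned int descs_outstanding(struct tx_ring *ring)
{
	return (ring->next_to_use - ring->next_to_clean + ring->count)
		% ring->count;
}

static int xmit_frame_with_lazy_clean(struct tx_ring *ring, void *skb)
{
	/* Only do the (expensive) TDH read when the ring is filling up,
	 * not on every single send. */
	if (descs_outstanding(ring) >= CLEAN_THRESHOLD)
		tx_clean_by_tdh(ring);

	return queue_tx_descriptor(ring, skb);
}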

-scott




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
  2004-12-05 17:54                                       ` Martin Josefsson
@ 2004-12-06 11:32                                         ` jamal
  2004-12-06 12:11                                           ` Lennert Buytenhek
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-06 11:32 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Lennert Buytenhek, Scott Feldman, Robert Olsson, P, mellia,
	e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Sun, 2004-12-05 at 12:54, Martin Josefsson wrote:
> On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
> 
> > I've tested all packet sizes now, and delayed TDT updating once per jiffy
> > (instead of once per packet) indeed gives about 25kpps more on 60,61,62
> > byte packets, and is hardly worth it for bigger packets.
> 
> Maybe we can't see any real gains here now; I wonder if it has any effect
> if you have lots of NICs on the same bus. I mean, in theory it saves a
> whole lot of traffic on the bus.
> 

This sounds like really exciting stuff happening here over the weekend.
Scott, you had to leave Intel before giving us this tip? ;-> 

Someone correct me if i am wrong - but does it appear as if all these
changes are only useful on PCI but not PCI-X?

cheers,
jamal




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
  2004-12-06 11:32                                         ` 1.03Mpps on e1000 (was: " jamal
@ 2004-12-06 12:11                                           ` Lennert Buytenhek
  2004-12-06 12:20                                             ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-06 12:11 UTC (permalink / raw)
  To: jamal
  Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia,
	e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, Dec 06, 2004 at 06:32:37AM -0500, jamal wrote:

> Someone correct me if i am wrong - but does it appear as if all these
> changes are only useful on PCI but not PCI-X?

They are useful on PCI-X as well as regular PCI.  On my 64/100 NIC I
get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.

Martin gets the ~1Mpps number with just the tx rework, and even more
with TXDMAC=0 added in as well.


--L



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
  2004-12-06 12:11                                           ` Lennert Buytenhek
@ 2004-12-06 12:20                                             ` jamal
  2004-12-06 12:23                                               ` Lennert Buytenhek
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-06 12:20 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia,
	e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, 2004-12-06 at 07:11, Lennert Buytenhek wrote:
> On Mon, Dec 06, 2004 at 06:32:37AM -0500, jamal wrote:
> 
> > Someone correct me if i am wrong - but does it appear as if all these
> > changes are only useful on PCI but not PCI-X?
> 
> They are useful on PCI-X as well as regular PCI.  On my 64/100 NIC I
> get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.
> 
> Martin gets the ~1Mpps number with just the tx rework, and even more
> with TXDMAC=0 added in as well.

Right, but so far when i scan the results all i see is PCI, not PCI-X.
Which of your (or Martin's) boards has PCI-X?

cheers,
jamal




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
  2004-12-06 12:20                                             ` jamal
@ 2004-12-06 12:23                                               ` Lennert Buytenhek
  2004-12-06 12:30                                                 ` Martin Josefsson
  0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-06 12:23 UTC (permalink / raw)
  To: jamal
  Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia,
	e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, Dec 06, 2004 at 07:20:43AM -0500, jamal wrote:

> > > Someone correct me if i am wrong - but does it appear as if all these
> > > changes are only useful on PCI but not PCI-X?
> > 
> > They are useful on PCI-X as well as regular PCI.  On my 64/100 NIC I
> > get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.
> > 
> > Martin gets the ~1Mpps number with just the tx rework, and even more
> > with TXDMAC=0 added in as well.
> 
> Right, but so far when i scan the results all i see is PCI not PCI-X.
> Which of your (or Martins) boards has PCI-X?

I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X.  I _think_ Martin
was running at 64/133 PCI-X.


--L



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
  2004-12-06 12:23                                               ` Lennert Buytenhek
@ 2004-12-06 12:30                                                 ` Martin Josefsson
  2004-12-06 13:11                                                   ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-06 12:30 UTC (permalink / raw)
  To: Lennert Buytenhek
  Cc: jamal, Scott Feldman, Robert Olsson, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

On Mon, 6 Dec 2004, Lennert Buytenhek wrote:

> > Right, but so far when i scan the results all i see is PCI not PCI-X.
> > Which of your (or Martins) boards has PCI-X?
>
> I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X.  I _think_ Martin
> was running at 64/133 PCI-X.

I don't have any motherboards with PCI-X, so no :)
I'm running the 82546GB (dual-port) at 64/66 and the 82540EM (desktop
adapter) at 32/66; both are able to send at wire speed.

/Martin



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
  2004-12-06 12:30                                                 ` Martin Josefsson
@ 2004-12-06 13:11                                                   ` jamal
       [not found]                                                     ` <20041206132907.GA13411@xi.wantstofly.org>
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-06 13:11 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Lennert Buytenhek, Scott Feldman, Robert Olsson, P, mellia,
	e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev

Hopefully someone will beat me to testing to see if our forwarding
capacity now goes up with this new recipe.

cheers,
jamal

On Mon, 2004-12-06 at 07:30, Martin Josefsson wrote:
> On Mon, 6 Dec 2004, Lennert Buytenhek wrote:
> 
> > > Right, but so far when i scan the results all i see is PCI not PCI-X.
> > > Which of your (or Martins) boards has PCI-X?
> >
> > I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X.  I _think_ Martin
> > was running at 64/133 PCI-X.
> 
> I don't have any motherboards with PCI-X so no :)
> I'm running the 82546GB (dualport) at 64/66 and the 82540EM (desktop
> adapter) at 32/66, both are able to send at wirespeed.
> 
> /Martin
> 
> 




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
       [not found]                                                       ` <16820.37049.396306.295878@robur.slu.se>
@ 2004-12-06 17:32                                                         ` P
  0 siblings, 0 replies; 85+ messages in thread
From: P @ 2004-12-06 17:32 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Lennert Buytenhek, jamal, Martin Josefsson, Scott Feldman,
	mellia, Jorge Manuel Finochietto, Giulio Galante, netdev

Robert Olsson wrote:
> Lennert Buytenhek writes:
>  > On Mon, Dec 06, 2004 at 08:11:02AM -0500, jamal wrote:
>  > 
>  > > Hopefully someone will beat me to testing to see if our forwarding
>  > > capacity now goes up with this new recipe.
> 
> 
> A breakthrough: we can now send small packets at wire speed. It will make
> development and testing much easier...

It surely will!!

Just to recap, 2 people have been able to tx @ wire speed.
The original poster was able to receive at wire speed,
but could only TX at about 50% wire speed.

It would be really cool if we could combine this
to bridge @ wire speed.

-- 
Pádraig Brady - http://www.pixelbeat.org
--

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
  2004-12-05 15:42                                 ` Martin Josefsson
  2004-12-05 16:48                                   ` Martin Josefsson
  2004-12-05 17:44                                   ` Lennert Buytenhek
@ 2004-12-08 23:36                                   ` Ray Lehtiniemi
       [not found]                                     ` <41B825A5.2000009@draigBrady.com>
  2 siblings, 1 reply; 85+ messages in thread
From: Ray Lehtiniemi @ 2004-12-08 23:36 UTC (permalink / raw)
  To: Martin Josefsson
  Cc: Lennert Buytenhek, Scott Feldman, jamal, Robert Olsson, P,
	mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante,
	netdev


hello martin


On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:
> 
> Here's the patch, not much more tested (it still gives some transmit
> timeouts since it's Scott's patch + prefetching and delayed TDT updating).
> And it's not cleaned up, but hey, that's development :)
> 
> The delayed TDT updating was a test, and currently it delays the first
> tx'd packet after a timer run by 1ms.
> 
> Would be interesting to see what other people get with this thing.
> Lennert?
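
For context, "delayed TDT updating" here means batching the tail-register
write (the doorbell that tells the NIC about new descriptors) behind a
timer instead of issuing it per packet, roughly as in the sketch below.
The names are made up, and the descriptor-prefetch part of the patch is a
controller setting that is not shown:

/* Sketch of delayed TDT updating: the send path only fills descriptors;
 * a periodic timer issues the single MMIO tail write, so the bus sees
 * one TDT update per interval instead of one per packet.  The cost is
 * that a packet queued just after a timer run can sit for up to the
 * timer period (here ~1ms) before the NIC is told about it. */
struct txq {
	unsigned int next_to_use;	/* software tail (next free slot) */
	unsigned int last_tdt;		/* value last written to TDT */
};

extern void write_tdt(struct txq *q, unsigned int tail);	/* MMIO doorbell */
extern int fill_tx_descriptor(struct txq *q, void *skb);	/* advances next_to_use */

static int xmit_frame_delayed_tdt(struct txq *q, void *skb)
{
	/* Queue the packet but do NOT touch TDT here. */
	return fill_tx_descriptor(q, skb);
}

/* Called once per jiffy / 1ms from a timer. */
static void tdt_timer(struct txq *q)
{
	if (q->next_to_use != q->last_tdt) {
		write_tdt(q, q->next_to_use);
		q->last_tdt = q->next_to_use;
	}
}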

well, i'm brand new to gig ethernet, but i have access to some nice
hardware right now, so i decided to give your patch a try.

this is the average tx pps of 10 pktgen runs for each packet size:
	
60	1187589.1
64	 601805.4
68	1115029.3
72	 593096.4
76	1097761.1
80	 587125.4
84	1098045.2
88	 588159.1
92	1072124.8
96	 582510.3
100	1008056.8
104	 577898.0
108	 946974.0
112	 573719.2
116	 892871.0
120	 573072.5
124	 844608.3
128	 563685.7


any idea why the packet rates are cut in half for every other line?

pktgen is running with eth0 bound to CPU0 on this box:

  NexGate NSA 2040G
  Dual Xeon 3.06 GHz, HT enabled
  1 GB PC3200 DDR SDRAM
  Dual 82544EI
  - on PCI-X 64 bit 133 MHz bus
  - behind P64H2 bridge
  - on hub channel D of E7501 chipset



thanks

-- 
----------------------------------------------------------------------
     Ray L   <rayl@mail.com>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000
       [not found]                                       ` <20041209161825.GA32454@mail.com>
@ 2004-12-09 17:12                                         ` P
       [not found]                                         ` <20041209164820.GB32454@mail.com>
  1 sibling, 0 replies; 85+ messages in thread
From: P @ 2004-12-09 17:12 UTC (permalink / raw)
  To: Ray Lehtiniemi; +Cc: netdev

Ray Lehtiniemi wrote:
> On Thu, Dec 09, 2004 at 10:15:01AM +0000, P@draigBrady.com wrote:
> 
>>That is very interesting!
>>I'm guessing it's due to some alignment bug?
>>
>>Can you repeat for 60-68 ?
> 
> certainly.  here are the raw results, and a summary oprofile for
> 60-68.
> 
> looking at the disassembly, it seems that the 'rdtsc' opcode
> at 0x46f3 is causing the problem?

Well, that wasn't obvious to me :-)
I did some manipulation with sort/join and came up with the following
percentage changes.  Note the % diff column adds to 37%.

address  % @ 60b % @ 64b % diff
000046f5 14.6006 22.3856 7.785000 #instruction after rdtsc
00004737 15.0990 20.2242 5.125200 #instruction after rdtsc
0000474b 11.3857 12.9496 1.563900
00004726 1.5419 2.5867 1.044800
000046f7 0.6258 1.1922 0.566400
00004751 4.9377 5.4016 0.463900
000047a1 0.0118 0.4675 0.455700
00004739 1.2614 1.6962 0.434800
000044f7 1.0592 1.4506 0.391400
00004749 0.5467 0.9253 0.378600
0000475d 0.0879 0.1769 0.089000
0000445f 0.3785 0.4599 0.081400
000047c3 0.1003 0.1652 0.064900
000045cf 0.0804 0.1316 0.051200
000047aa 0.0048 0.0194 0.014600
000047bd 5.5e-04 0.0142 0.013650
000047b3 0.0106 0.0200 0.009400
00004598 0.0061 0.0147 0.008600
000045e9 0.0026 0.0103 0.007700
00004640 0.0692 0.0701 0.000900
00004465 0.0014 0.0020 0.000600
0000481b 4.3e-04 7.3e-04 0.000300
0000470e 6.1e-05 2.4e-04 0.000179
0000458d 1.2e-04 2.7e-04 0.000150
00004a47 1.8e-04 3.0e-04 0.000120
00004735 0.0085 0.0086 0.000100
00004745 1.2e-04 2.2e-04 0.000100
000047dd 0.0032 0.0033 0.000100
00004a49 0.0037 0.0038 0.000100
00004663 1.8e-04 2.7e-04 0.000090
0000489a 8.0e-04 8.9e-04 0.000090
00004514 9.2e-04 0.0010 0.000080
00004a61 6.1e-05 1.4e-04 0.000079
000046d4 6.1e-05 1.1e-04 0.000049
00004789 6.1e-05 1.1e-04 0.000049
00004683 1.2e-04 1.6e-04 0.000040
00004a51 1.8e-04 2.2e-04 0.000040
000047cc 9.2e-04 9.5e-04 0.000030
000045ba 6.8e-04 7.0e-04 0.000020
00004a36 6.1e-05 8.1e-05 0.000020
00004620 1.8e-04 1.9e-04 0.000010
0000474f 0.0042 0.0042 0.000000
0000466d 6.1e-05 5.4e-05 -0.000007
00004817 1.2e-04 1.1e-04 -0.000010
0000470c 4.9e-04 4.6e-04 -0.000030
000045eb 6.1e-05 2.7e-05 -0.000034
00004616 6.1e-05 2.7e-05 -0.000034
00004a1e 6.1e-05 2.7e-05 -0.000034
00004652 1.2e-04 8.1e-05 -0.000039
000047ee 1.2e-04 8.1e-05 -0.000039
00004685 1.2e-04 5.4e-05 -0.000066
00004894 3.1e-04 2.4e-04 -0.000070
00004714 6.1e-04 5.2e-04 -0.000090
00004524 1.2e-04 2.7e-05 -0.000093
0000467b 1.2e-04 2.7e-05 -0.000093
000046bb 1.2e-04 2.7e-05 -0.000093
00004446 0.0010 8.9e-04 -0.000110
0000488b 2.5e-04 1.4e-04 -0.000110
00004522 4.3e-04 2.7e-04 -0.000160
00004508 3.1e-04 1.4e-04 -0.000170
00004634 6.1e-04 4.3e-04 -0.000180
00004587 8.0e-04 6.0e-04 -0.000200
000047ae 0.0032 0.0030 -0.000200
00004440 5.5e-04 3.3e-04 -0.000220
00004459 0.0012 9.8e-04 -0.000220
00004506 9.2e-04 6.5e-04 -0.000270
000049ff 0.0021 0.0018 -0.000300
0000451c 0.0013 9.8e-04 -0.000320
000046c7 3.7e-04 2.7e-05 -0.000343
00004673 4.9e-04 1.1e-04 -0.000380
0000478f 4.9e-04 1.1e-04 -0.000380
00004450 0.0012 8.1e-04 -0.000390
00004541 6.1e-04 2.2e-04 -0.000390
000045a9 7.4e-04 3.5e-04 -0.000390
00004777 5.5e-04 1.6e-04 -0.000390
000047d0 6.8e-04 2.7e-04 -0.000410
00004457 0.0084 0.0079 -0.000500
000047ba 0.0018 0.0013 -0.000500
00004a6b 0.0031 0.0026 -0.000500
00004612 5.5e-04 2.7e-05 -0.000523
00004681 6.8e-04 1.4e-04 -0.000540
0000477b 7.4e-04 1.9e-04 -0.000550
00004503 0.0017 0.0011 -0.000600
000047df 0.0020 0.0014 -0.000600
000045b6 0.0010 3.8e-04 -0.000620
00004781 0.0010 3.8e-04 -0.000620
00004667 0.0012 5.2e-04 -0.000680
00004885 0.0015 8.1e-04 -0.000690
000045a3 0.0017 0.0010 -0.000700
000047da 0.0014 7.0e-04 -0.000700
00004747 8.6e-04 8.1e-05 -0.000779
0000446f 0.0151 0.0143 -0.000800
00004702 0.0019 0.0011 -0.000800
00004718 0.0157 0.0149 -0.000800
000047b6 0.0022 0.0014 -0.000800
00004a25 0.0054 0.0046 -0.000800
00004a65 0.0026 0.0018 -0.000800
0000477e 9.8e-04 1.4e-04 -0.000840
000045c8 0.0015 5.7e-04 -0.000930
00004543 0.0049 0.0039 -0.001000
00004604 0.0013 3.0e-04 -0.001000
00004787 0.0026 0.0016 -0.001000
00004a02 0.0018 7.6e-04 -0.001040
0000450e 0.0063 0.0052 -0.001100
0000465d 0.0022 0.0011 -0.001100
0000459d 0.0014 1.9e-04 -0.001210
0000464a 0.0017 3.8e-04 -0.001320
000047cf 0.0020 6.8e-04 -0.001320
00004a13 0.0016 1.1e-04 -0.001490
0000461e 0.0017 1.6e-04 -0.001540
000044ff 0.0040 0.0024 -0.001600
00004628 0.0020 3.5e-04 -0.001650
000045d5 0.0076 0.0055 -0.002100
00004638 0.0049 0.0027 -0.002200
00004650 0.0045 0.0021 -0.002400
00004632 0.0052 0.0026 -0.002600
00004769 0.0059 0.0033 -0.002600
00004444 0.0957 0.0930 -0.002700
00004610 0.0034 6.5e-04 -0.002750
000046fb 0.0097 0.0069 -0.002800
0000487f 0.0175 0.0146 -0.002900
000044f4 0.0071 0.0039 -0.003200
00004757 0.0068 0.0032 -0.003600
00004583 0.0176 0.0136 -0.004000
0000472d 0.0178 0.0138 -0.004000
00004624 0.0049 6.5e-04 -0.004250
00004700 0.0074 0.0029 -0.004500
00004763 0.0110 0.0059 -0.005100
00004755 0.0091 0.0037 -0.005400
000047b0 0.0201 0.0138 -0.006300
0000459b 0.0102 0.0035 -0.006700
000046fd 0.0146 0.0078 -0.006800
00004797 0.0253 0.0181 -0.007200
0000473f 0.0226 0.0153 -0.007300
0000476d 0.0253 0.0180 -0.007300
0000474d 0.0236 0.0152 -0.008400
000044f0 0.0191 0.0094 -0.009700
00004471 0.0332 0.0222 -0.011000
000046f3 0.0224 0.0112 -0.011200
0000472f 0.0221 0.0105 -0.011600
00004743 0.0146 0.0025 -0.012100
00004753 0.0311 0.0185 -0.012600
000044f9 0.0232 0.0100 -0.013200
000045f2 0.0781 0.0638 -0.014300
000045c0 0.0796 0.0632 -0.016400
000047a4 0.1020 0.0851 -0.016900
00004455 0.0468 0.0282 -0.018600
0000472a 0.0331 0.0140 -0.019100
00004720 0.0420 0.0228 -0.019200
00004741 0.0520 0.0255 -0.026500
0000460a 0.0296 6.8e-04 -0.028920
00004469 0.0696 0.0391 -0.030500
000047b8 0.0485 0.0164 -0.032100
00004771 0.0479 0.0151 -0.032800
000047d6 0.0634 0.0270 -0.036400
000045c2 0.1763 0.0500 -0.126300
0000488e 0.2228 0.0961 -0.126700
0000458f 0.2212 0.0932 -0.128000
00004709 0.8817 0.7529 -0.128800
0000479b 0.2469 0.1158 -0.131100
000047c6 0.2489 0.1103 -0.138600
00004775 0.2514 0.1124 -0.139000
00004657 0.2502 0.1105 -0.139700
0000444c 0.2555 0.1107 -0.144800
000045df 0.1822 0.0357 -0.146500
00004608 0.2596 0.1117 -0.147900
00004618 0.2635 0.1153 -0.148200
00004679 0.2580 0.1094 -0.148600
0000462c 0.2630 0.1134 -0.149600
00004594 0.2494 0.0958 -0.153600
000045f8 0.1934 0.0369 -0.156500
0000471a 0.8706 0.6718 -0.198800
000045e6 0.4986 0.2189 -0.279700
00004644 0.4393 0.1515 -0.287800
0000463c 0.5214 0.2247 -0.296700
000045fe 0.5160 0.2022 -0.313800
00004622 3.5942 1.5668 -2.027400
0000461c 3.6298 1.5695 -2.060300
00004716 19.2425 16.4027 -2.839800
00004600 5.2128 2.2837 -2.929100
000045b0 7.8500 3.3027 -4.547300

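To make the rdtsc connection concrete: the hot addresses sit just after
the rdtsc/divide sequences visible in the disassembly further down, i.e.
the TSC-to-time conversion pktgen does several times per packet.  A
rough C equivalent (the divisor name is made up) is:

/* Read the TSC and convert it to microseconds with a 64-by-32-bit
 * divide (the paired "div %ebx" in the disassembly).  rdtsc is not
 * cheap on a P4, and its cost tends to be attributed to the instruction
 * right after it in the profile. */
static unsigned long long read_tsc(void)
{
	unsigned int lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

extern unsigned int cycles_per_us;	/* the 32-bit divisor loaded from memory */

static unsigned long long current_us(void)
{
	return read_tsc() / cycles_per_us;
}
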
> 
> 
> it is worth noting that my box has become quite unstable since
> i started to use oprofile and pktgen together.  sshd stops responding,
> and the network seems to go down.  not sure what is happening there...
> this instability seems to be persisting across reboots, unfortunately...
> 
> 
> 
> 
> 
> 
> 60 bytes
> --------
> 
> 60 1195259
> 60 1206652
> 60 1139822
> 60 1206650
> 60 1206654
> 60 1136447
> 60 1206651
> 60 1148050
> 60 1206504
> 60 1206653
> 
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma      samples  %        image name               app name                 symbol name
> 00004337 1626886  57.5170  pktgen.ko                pktgen                   pktgen_thread_worker
> c02f389d 282974   10.0043  vmlinux                  vmlinux                  _spin_lock
> c021adc0 219795    7.7706  vmlinux                  vmlinux                  e1000_clean_tx
> c02f3904 164371    5.8112  vmlinux                  vmlinux                  _spin_lock_bh
> c0219c74 160383    5.6702  vmlinux                  vmlinux                  e1000_xmit_frame
> c02f3870 124564    4.4038  vmlinux                  vmlinux                  _spin_trylock
> 000041d1 48511     1.7151  pktgen.ko                pktgen                   next_to_run
> c02f399a 46205     1.6335  vmlinux                  vmlinux                  _spin_unlock_irqrestore
> c010c7d9 20876     0.7381  vmlinux                  vmlinux                  mark_offset_tsc
> c011fdb2 13116     0.4637  vmlinux                  vmlinux                  local_bh_enable
> c0107248 8166      0.2887  vmlinux                  vmlinux                  timer_interrupt
> c0103970 5607      0.1982  vmlinux                  vmlinux                  apic_timer_interrupt
> c010123a 5368      0.1898  vmlinux                  vmlinux                  default_idle
> c02f39a5 4256      0.1505  vmlinux                  vmlinux                  _spin_unlock_bh
> c0103c08 4042      0.1429  vmlinux                  vmlinux                  page_fault
> 0804ae00 3930      0.1389  oprofiled                oprofiled                sfile_find
> 0804aa10 3573      0.1263  oprofiled                oprofiled                get_file
> 
> 
> 
> 64 bytes
> --------
> 
> 64 606104
> 64 597737
> 64 594927
> 64 595531
> 64 606876
> 64 594751
> 64 595709
> 64 595070
> 64 606876
> 64 595600
> 
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma      samples  %        image name               app name                 symbol name
> 00004337 3688998  68.9133  pktgen.ko                pktgen                   pktgen_thread_worker
> c02f389d 519536    9.7053  vmlinux                  vmlinux                  _spin_lock
> c021adc0 271791    5.0773  vmlinux                  vmlinux                  e1000_clean_tx
> c0219c74 214428    4.0057  vmlinux                  vmlinux                  e1000_xmit_frame
> c02f3904 166334    3.1072  vmlinux                  vmlinux                  _spin_lock_bh
> c02f3870 127623    2.3841  vmlinux                  vmlinux                  _spin_trylock
> 000041d1 111650    2.0857  pktgen.ko                pktgen                   next_to_run
> c02f399a 47428     0.8860  vmlinux                  vmlinux                  _spin_unlock_irqrestore
> c010c7d9 39586     0.7395  vmlinux                  vmlinux                  mark_offset_tsc
> c0107248 14671     0.2741  vmlinux                  vmlinux                  timer_interrupt
> c011fdb2 12926     0.2415  vmlinux                  vmlinux                  local_bh_enable
> c0103970 11778     0.2200  vmlinux                  vmlinux                  apic_timer_interrupt
> c010123a 9282      0.1734  vmlinux                  vmlinux                  default_idle
> 0804ae00 7449      0.1392  oprofiled                oprofiled                sfile_find
> 0804aa10 6387      0.1193  oprofiled                oprofiled                get_file
> 0804ac30 6234      0.1165  oprofiled                oprofiled                sfile_log_sample
> 0804f4b0 5852      0.1093  oprofiled                oprofiled                odb_insert
> 
> 
> 
> 68 bytes
> --------
> 
> 68 1124822
> 68 1124805
> 68 1090006
> 68 1124822
> 68 1089775
> 68 1124812
> 68 1123305
> 68 1091796
> 68 1124820
> 68 1087043
> 
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma      samples  %        image name               app name                 symbol name
> 00004337 1753028  58.4510  pktgen.ko                pktgen                   pktgen_thread_worker
> c02f389d 301835   10.0641  vmlinux                  vmlinux                  _spin_lock
> c021adc0 223405    7.4490  vmlinux                  vmlinux                  e1000_clean_tx
> c02f3904 167118    5.5722  vmlinux                  vmlinux                  _spin_lock_bh
> c0219c74 166016    5.5355  vmlinux                  vmlinux                  e1000_xmit_frame
> c02f3870 131516    4.3851  vmlinux                  vmlinux                  _spin_trylock
> 000041d1 56334     1.8783  pktgen.ko                pktgen                   next_to_run
> c02f399a 46860     1.5624  vmlinux                  vmlinux                  _spin_unlock_irqrestore
> c010c7d9 26188     0.8732  vmlinux                  vmlinux                  mark_offset_tsc
> c011fdb2 12199     0.4068  vmlinux                  vmlinux                  local_bh_enable
> c0107248 10399     0.3467  vmlinux                  vmlinux                  timer_interrupt
> c010123a 8799      0.2934  vmlinux                  vmlinux                  default_idle
> c0103970 8194      0.2732  vmlinux                  vmlinux                  apic_timer_interrupt
> c0117346 4822      0.1608  vmlinux                  vmlinux                  find_busiest_group
> 0804ae00 4214      0.1405  oprofiled                oprofiled                sfile_find
> c02f39a5 3955      0.1319  vmlinux                  vmlinux                  _spin_unlock_bh
> 0804aa10 3745      0.1249  oprofiled                oprofiled                get_file
> 
> 
> 
> here is the detailed breakdown for the 60 byte pktgen:
> 
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma      samples  %        image name               app name                 symbol name
> 00004337 1626886  57.5170  pktgen.ko                pktgen                   pktgen_thread_worker
>   00004440 9        5.5e-04
>   00004444 1557      0.0957
>   00004446 17        0.0010
>   0000444c 4156      0.2555
>   00004450 19        0.0012
>   00004455 762       0.0468
>   00004457 136       0.0084
>   00004459 20        0.0012
>   0000445f 6157      0.3785
>   00004465 23        0.0014
>   00004469 1133      0.0696
>   0000446f 246       0.0151
>   00004471 540       0.0332
>   000044f0 310       0.0191
>   000044f4 115       0.0071
>   000044f7 17232     1.0592
>   000044f9 377       0.0232
>   000044ff 65        0.0040
>   00004503 28        0.0017
>   00004506 15       9.2e-04
>   00004508 5        3.1e-04
>   0000450e 102       0.0063
>   00004514 15       9.2e-04
>   0000451c 21        0.0013
>   00004522 7        4.3e-04
>   00004524 2        1.2e-04
>   00004541 10       6.1e-04
>   00004543 79        0.0049
>   00004583 287       0.0176
>   00004587 13       8.0e-04
>   0000458d 2        1.2e-04
>   0000458f 3598      0.2212
>   00004594 4057      0.2494
>   00004598 100       0.0061
>   0000459b 166       0.0102
>   0000459d 22        0.0014
>   000045a3 28        0.0017
>   000045a9 12       7.4e-04
>   000045b0 127711    7.8500
>   000045b6 17        0.0010
>   000045ba 11       6.8e-04
>   000045c0 1295      0.0796
>   000045c2 2869      0.1763
>   000045c8 24        0.0015
>   000045cf 1308      0.0804
>   000045d5 123       0.0076
>   000045df 2964      0.1822
>   000045e6 8111      0.4986
>   000045e9 42        0.0026
>   000045eb 1        6.1e-05
>   000045f2 1271      0.0781
>   000045f8 3146      0.1934
>   000045fe 8395      0.5160
>   00004600 84807     5.2128
>   00004604 21        0.0013
>   00004608 4223      0.2596
>   0000460a 481       0.0296
>   00004610 55        0.0034
>   00004612 9        5.5e-04
>   00004616 1        6.1e-05
>   00004618 4287      0.2635
>   0000461a 3        1.8e-04
>   0000461c 59052     3.6298
>   0000461e 28        0.0017
>   00004620 3        1.8e-04
>   00004622 58473     3.5942
>   00004624 79        0.0049
>   00004628 33        0.0020
>   0000462c 4279      0.2630
>   00004632 84        0.0052
>   00004634 10       6.1e-04
>   00004638 80        0.0049
>   0000463c 8483      0.5214
>   00004640 1126      0.0692
>   00004644 7147      0.4393
>   0000464a 27        0.0017
>   00004650 73        0.0045
>   00004652 2        1.2e-04
>   00004657 4070      0.2502
>   0000465d 36        0.0022
>   00004663 3        1.8e-04
>   00004665 2        1.2e-04
>   00004667 20        0.0012
>   0000466d 1        6.1e-05
>   00004673 8        4.9e-04
>   00004679 4197      0.2580
>   0000467b 2        1.2e-04
>   00004681 11       6.8e-04
>   00004683 2        1.2e-04
>   00004685 2        1.2e-04
>   000046bb 2        1.2e-04
>   000046c1 2        1.2e-04
>   000046c7 6        3.7e-04
>   000046d4 1        6.1e-05
>   000046f3 365       0.0224
>   000046f5 237535   14.6006
>   000046f7 10181     0.6258
>   000046fb 157       0.0097
>   000046fd 238       0.0146
>   00004700 120       0.0074
>   00004702 31        0.0019
>   00004709 14344     0.8817
>   0000470c 8        4.9e-04
>   0000470e 1        6.1e-05
>   00004714 10       6.1e-04
>   00004716 313053   19.2425
>   00004718 255       0.0157
>   0000471a 14164     0.8706
>   00004720 683       0.0420
>   00004726 25085     1.5419
>   0000472a 538       0.0331
>   0000472d 290       0.0178
>   0000472f 359       0.0221
>   00004735 139       0.0085
>   00004737 245644   15.0990
>   00004739 20521     1.2614
>   0000473f 368       0.0226
>   00004741 846       0.0520
>   00004743 237       0.0146
>   00004745 2        1.2e-04
>   00004747 14       8.6e-04
>   00004749 8894      0.5467
>   0000474b 185233   11.3857
>   0000474d 384       0.0236
>   0000474f 69        0.0042
>   00004751 80331     4.9377
>   00004753 506       0.0311
>   00004755 148       0.0091
>   00004757 111       0.0068
>   0000475d 1430      0.0879
>   00004763 179       0.0110
>   00004769 96        0.0059
>   0000476d 411       0.0253
>   00004771 780       0.0479
>   00004775 4090      0.2514
>   00004777 9        5.5e-04
>   0000477b 12       7.4e-04
>   0000477e 16       9.8e-04
>   00004781 17        0.0010
>   00004787 43        0.0026
>   00004789 1        6.1e-05
>   0000478f 8        4.9e-04
>   00004797 412       0.0253
>   0000479b 4016      0.2469
>   000047a1 192       0.0118
>   000047a4 1660      0.1020
>   000047aa 78        0.0048
>   000047ae 52        0.0032
>   000047b0 327       0.0201
>   000047b3 173       0.0106
>   000047b6 35        0.0022
>   000047b8 789       0.0485
>   000047ba 29        0.0018
>   000047bd 9        5.5e-04
>   000047c3 1632      0.1003
>   000047c6 4049      0.2489
>   000047cc 15       9.2e-04
>   000047cf 33        0.0020
>   000047d0 11       6.8e-04
>   000047d6 1032      0.0634
>   000047da 22        0.0014
>   000047dd 52        0.0032
>   000047df 33        0.0020
>   000047ea 1        6.1e-05
>   000047ee 2        1.2e-04
>   000047f6 1        6.1e-05
>   000047ff 1        6.1e-05
>   00004809 1        6.1e-05
>   0000480e 1        6.1e-05
>   00004817 2        1.2e-04
>   0000481b 7        4.3e-04
>   0000487f 284       0.0175
>   00004885 24        0.0015
>   0000488b 4        2.5e-04
>   0000488e 3625      0.2228
>   00004894 5        3.1e-04
>   0000489a 13       8.0e-04
>   000049ff 34        0.0021
>   00004a02 30        0.0018
>   00004a04 4        2.5e-04
>   00004a0f 3        1.8e-04
>   00004a13 26        0.0016
>   00004a1e 1        6.1e-05
>   00004a25 88        0.0054
>   00004a36 1        6.1e-05
>   00004a47 3        1.8e-04
>   00004a49 60        0.0037
>   00004a51 3        1.8e-04
>   00004a61 1        6.1e-05
>   00004a65 42        0.0026
>   00004a6b 50        0.0031
> 
> 
> 
> here is the detailed breakdown for the 64 byte pktgen:
> 
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma      samples  %        image name               app name                 symbol name
> 00004337 3688998  68.9133  pktgen.ko                pktgen                   pktgen_thread_worker
>   00004440 12       3.3e-04
>   00004444 3431      0.0930
>   00004446 33       8.9e-04
>   0000444c 4082      0.1107
>   00004450 30       8.1e-04
>   00004455 1041      0.0282
>   00004457 292       0.0079
>   00004459 36       9.8e-04
>   0000445f 16964     0.4599
>   00004465 73        0.0020
>   00004469 1442      0.0391
>   0000446f 528       0.0143
>   00004471 818       0.0222
>   000044f0 347       0.0094
>   000044f4 145       0.0039
>   000044f7 53514     1.4506
>   000044f9 369       0.0100
>   000044ff 90        0.0024
>   00004503 41        0.0011
>   00004506 24       6.5e-04
>   00004508 5        1.4e-04
>   0000450e 192       0.0052
>   00004514 37        0.0010
>   00004516 5        1.4e-04
>   0000451c 36       9.8e-04
>   00004522 10       2.7e-04
>   00004524 1        2.7e-05
>   00004541 8        2.2e-04
>   00004543 144       0.0039
>   00004583 503       0.0136
>   00004587 22       6.0e-04
>   0000458d 10       2.7e-04
>   0000458f 3437      0.0932
>   00004594 3533      0.0958
>   00004598 541       0.0147
>   0000459b 129       0.0035
>   0000459d 7        1.9e-04
>   000045a3 38        0.0010
>   000045a9 13       3.5e-04
>   000045b0 121838    3.3027
>   000045b6 14       3.8e-04
>   000045ba 26       7.0e-04
>   000045c0 2330      0.0632
>   000045c2 1843      0.0500
>   000045c8 21       5.7e-04
>   000045cf 4855      0.1316
>   000045d5 203       0.0055
>   000045df 1317      0.0357
>   000045e6 8076      0.2189
>   000045e9 381       0.0103
>   000045eb 1        2.7e-05
>   000045f2 2355      0.0638
>   000045f8 1362      0.0369
>   000045fe 7460      0.2022
>   00004600 84246     2.2837
>   00004604 11       3.0e-04
>   00004608 4122      0.1117
>   0000460a 25       6.8e-04
>   00004610 24       6.5e-04
>   00004612 1        2.7e-05
>   00004614 1        2.7e-05
>   00004616 1        2.7e-05
>   00004618 4254      0.1153
>   0000461c 57898     1.5695
>   0000461e 6        1.6e-04
>   00004620 7        1.9e-04
>   00004622 57801     1.5668
>   00004624 24       6.5e-04
>   00004628 13       3.5e-04
>   0000462c 4185      0.1134
>   00004632 97        0.0026
>   00004634 16       4.3e-04
>   00004638 99        0.0027
>   0000463c 8288      0.2247
>   00004640 2585      0.0701
>   00004644 5590      0.1515
>   0000464a 14       3.8e-04
>   00004650 77        0.0021
>   00004652 3        8.1e-05
>   00004657 4077      0.1105
>   0000465d 41        0.0011
>   00004663 10       2.7e-04
>   00004667 19       5.2e-04
>   0000466d 2        5.4e-05
>   00004673 4        1.1e-04
>   00004679 4035      0.1094
>   0000467b 1        2.7e-05
>   00004681 5        1.4e-04
>   00004683 6        1.6e-04
>   00004685 2        5.4e-05
>   000046bb 1        2.7e-05
>   000046c7 1        2.7e-05
>   000046d4 4        1.1e-04
>   000046f3 415       0.0112
>   000046f5 825806   22.3856
>   000046f7 43980     1.1922
>   000046fb 256       0.0069
>   000046fd 286       0.0078
>   00004700 108       0.0029
>   00004702 41        0.0011
>   00004705 5        1.4e-04
>   00004709 27774     0.7529
>   0000470c 17       4.6e-04
>   0000470e 9        2.4e-04
>   00004714 19       5.2e-04
>   00004716 605096   16.4027
>   00004718 548       0.0149
>   0000471a 24782     0.6718
>   00004720 842       0.0228
>   00004726 95423     2.5867
>   0000472a 516       0.0140
>   0000472d 510       0.0138
>   0000472f 389       0.0105
>   00004735 316       0.0086
>   00004737 746069   20.2242
>   00004739 62574     1.6962
>   0000473f 565       0.0153
>   00004741 941       0.0255
>   00004743 91        0.0025
>   00004745 8        2.2e-04
>   00004747 3        8.1e-05
>   00004749 34135     0.9253
>   0000474b 477712   12.9496
>   0000474d 561       0.0152
>   0000474f 155       0.0042
>   00004751 199265    5.4016
>   00004753 684       0.0185
>   00004755 137       0.0037
>   00004757 119       0.0032
>   0000475d 6527      0.1769
>   00004763 217       0.0059
>   00004769 120       0.0033
>   0000476d 665       0.0180
>   00004771 558       0.0151
>   00004775 4148      0.1124
>   00004777 6        1.6e-04
>   0000477b 7        1.9e-04
>   0000477e 5        1.4e-04
>   00004781 14       3.8e-04
>   00004787 60        0.0016
>   00004789 4        1.1e-04
>   0000478f 4        1.1e-04
>   00004797 669       0.0181
>   0000479b 4271      0.1158
>   000047a1 17245     0.4675
>   000047a4 3138      0.0851
>   000047aa 716       0.0194
>   000047ae 112       0.0030
>   000047b0 508       0.0138
>   000047b3 736       0.0200
>   000047b6 53        0.0014
>   000047b8 604       0.0164
>   000047ba 47        0.0013
>   000047bd 525       0.0142
>   000047c3 6094      0.1652
>   000047c6 4068      0.1103
>   000047cc 35       9.5e-04
>   000047cf 25       6.8e-04
>   000047d0 10       2.7e-04
>   000047d6 995       0.0270
>   000047da 26       7.0e-04
>   000047dd 120       0.0033
>   000047df 50        0.0014
>   000047ee 3        8.1e-05
>   000047fa 1        2.7e-05
>   00004817 4        1.1e-04
>   0000481b 27       7.3e-04
>   0000487f 539       0.0146
>   00004885 30       8.1e-04
>   0000488b 5        1.4e-04
>   0000488e 3544      0.0961
>   00004894 9        2.4e-04
>   0000489a 33       8.9e-04
>   000049ff 67        0.0018
>   00004a02 28       7.6e-04
>   00004a11 1        2.7e-05
>   00004a13 4        1.1e-04
>   00004a18 3        8.1e-05
>   00004a1e 1        2.7e-05
>   00004a25 168       0.0046
>   00004a36 3        8.1e-05
>   00004a47 11       3.0e-04
>   00004a49 139       0.0038
>   00004a51 8        2.2e-04
>   00004a59 1        2.7e-05
>   00004a61 5        1.4e-04
>   00004a65 67        0.0018
>   00004a6b 97        0.0026
> 
> 
> and finally, here's the disasm of the threadworker function:
> 
> 
> 00004337 <pktgen_thread_worker>:
>     4337:	55                   	push   %ebp
>     4338:	57                   	push   %edi
>     4339:	56                   	push   %esi
>     433a:	53                   	push   %ebx
>     433b:	bb 00 e0 ff ff       	mov    $0xffffe000,%ebx
>     4340:	21 e3                	and    %esp,%ebx
>     4342:	83 ec 2c             	sub    $0x2c,%esp
>     4345:	89 44 24 28          	mov    %eax,0x28(%esp)
>     4349:	8b b0 bc 02 00 00    	mov    0x2bc(%eax),%esi
>     434f:	c7 44 24 20 00 00 00 	movl   $0x0,0x20(%esp)
>     4356:	00 
>     4357:	c7 04 24 c2 06 00 00 	movl   $0x6c2,(%esp)
>     435e:	89 74 24 04          	mov    %esi,0x4(%esp)
>     4362:	e8 fc ff ff ff       	call   4363 <pktgen_thread_worker+0x2c>
>     4367:	8b 03                	mov    (%ebx),%eax
>     4369:	8b 80 90 04 00 00    	mov    0x490(%eax),%eax
>     436f:	05 04 05 00 00       	add    $0x504,%eax
>     4374:	e8 fc ff ff ff       	call   4375 <pktgen_thread_worker+0x3e>
>     4379:	8b 03                	mov    (%ebx),%eax
>     437b:	c7 80 94 04 00 00 ff 	movl   $0xfffbbeff,0x494(%eax)
>     4382:	be fb ff 
>     4385:	c7 80 98 04 00 00 ff 	movl   $0xffffffff,0x498(%eax)
>     438c:	ff ff ff 
>     438f:	e8 fc ff ff ff       	call   4390 <pktgen_thread_worker+0x59>
>     4394:	8b 03                	mov    (%ebx),%eax
>     4396:	8b 80 90 04 00 00    	mov    0x490(%eax),%eax
>     439c:	05 04 05 00 00       	add    $0x504,%eax
>     43a1:	e8 fc ff ff ff       	call   43a2 <pktgen_thread_worker+0x6b>
>     43a6:	89 f1                	mov    %esi,%ecx
>     43a8:	ba 01 00 00 00       	mov    $0x1,%edx
>     43ad:	d3 e2                	shl    %cl,%edx
>     43af:	8b 03                	mov    (%ebx),%eax
>     43b1:	e8 fc ff ff ff       	call   43b2 <pktgen_thread_worker+0x7b>
>     43b6:	39 73 10             	cmp    %esi,0x10(%ebx)
>     43b9:	74 08                	je     43c3 <pktgen_thread_worker+0x8c>
>     43bb:	0f 0b                	ud2a   
>     43bd:	27                   	daa    
>     43be:	0b b0 06 00 00 8b    	or     0x8b000006(%eax),%esi
>     43c4:	44                   	inc    %esp
>     43c5:	24 28                	and    $0x28,%al
>     43c7:	8b 54 24 28          	mov    0x28(%esp),%edx
>     43cb:	05 c0 02 00 00       	add    $0x2c0,%eax
>     43d0:	89 44 24 1c          	mov    %eax,0x1c(%esp)
>     43d4:	c7 82 c0 02 00 00 01 	movl   $0x1,0x2c0(%edx)
>     43db:	00 00 00 
>     43de:	89 d0                	mov    %edx,%eax
>     43e0:	8b 4c 24 1c          	mov    0x1c(%esp),%ecx
>     43e4:	05 c4 02 00 00       	add    $0x2c4,%eax
>     43e9:	89 41 04             	mov    %eax,0x4(%ecx)
>     43ec:	89 41 08             	mov    %eax,0x8(%ecx)
>     43ef:	83 a2 b4 02 00 00 f0 	andl   $0xfffffff0,0x2b4(%edx)
>     43f6:	8b 03                	mov    (%ebx),%eax
>     43f8:	8b 80 a8 00 00 00    	mov    0xa8(%eax),%eax
>     43fe:	89 82 b8 02 00 00    	mov    %eax,0x2b8(%edx)
>     4404:	8b 03                	mov    (%ebx),%eax
>     4406:	8b 80 a8 00 00 00    	mov    0xa8(%eax),%eax
>     440c:	89 74 24 04          	mov    %esi,0x4(%esp)
>     4410:	c7 04 24 a0 06 00 00 	movl   $0x6a0,(%esp)
>     4417:	89 44 24 08          	mov    %eax,0x8(%esp)
>     441b:	e8 fc ff ff ff       	call   441c <pktgen_thread_worker+0xe5>
>     4420:	8b 44 24 28          	mov    0x28(%esp),%eax
>     4424:	8b 80 b0 02 00 00    	mov    0x2b0(%eax),%eax
>     442a:	89 44 24 24          	mov    %eax,0x24(%esp)
>     442e:	8b 03                	mov    (%ebx),%eax
>     4430:	c7 00 01 00 00 00    	movl   $0x1,(%eax)
>     4436:	f0 83 44 24 00 00    	lock addl $0x0,0x0(%esp)
>     443c:	89 5c 24 18          	mov    %ebx,0x18(%esp)
>     4440:	8b 54 24 18          	mov    0x18(%esp),%edx
>     4444:	8b 02                	mov    (%edx),%eax
>     4446:	c7 00 00 00 00 00    	movl   $0x0,(%eax)
>     444c:	8b 44 24 28          	mov    0x28(%esp),%eax
>     4450:	e8 7c fd ff ff       	call   41d1 <next_to_run>
>     4455:	85 c0                	test   %eax,%eax
>     4457:	89 c6                	mov    %eax,%esi
>     4459:	0f 84 aa 03 00 00    	je     4809 <pktgen_thread_worker+0x4d2>
>     445f:	8b 88 44 04 00 00    	mov    0x444(%eax),%ecx
>     4465:	89 4c 24 14          	mov    %ecx,0x14(%esp)
>     4469:	8b b8 90 02 00 00    	mov    0x290(%eax),%edi
>     446f:	85 ff                	test   %edi,%edi
>     4471:	74 7d                	je     44f0 <pktgen_thread_worker+0x1b9>
>     4473:	0f 31                	rdtsc  
>     4475:	89 44 24 0c          	mov    %eax,0xc(%esp)
>     4479:	89 54 24 10          	mov    %edx,0x10(%esp)
>     447d:	85 d2                	test   %edx,%edx
>     447f:	8b 1d 1c 00 00 00    	mov    0x1c,%ebx
>     4485:	89 d1                	mov    %edx,%ecx
>     4487:	89 c5                	mov    %eax,%ebp
>     4489:	74 08                	je     4493 <pktgen_thread_worker+0x15c>
>     448b:	89 d0                	mov    %edx,%eax
>     448d:	31 d2                	xor    %edx,%edx
>     448f:	f7 f3                	div    %ebx
>     4491:	89 c1                	mov    %eax,%ecx
>     4493:	89 e8                	mov    %ebp,%eax
>     4495:	f7 f3                	div    %ebx
>     4497:	89 ca                	mov    %ecx,%edx
>     4499:	89 d3                	mov    %edx,%ebx
>     449b:	8b 96 b8 02 00 00    	mov    0x2b8(%esi),%edx
>     44a1:	39 d3                	cmp    %edx,%ebx
>     44a3:	89 c1                	mov    %eax,%ecx
>     44a5:	8b 86 b4 02 00 00    	mov    0x2b4(%esi),%eax
>     44ab:	77 37                	ja     44e4 <pktgen_thread_worker+0x1ad>
>     44ad:	72 04                	jb     44b3 <pktgen_thread_worker+0x17c>
>     44af:	39 c1                	cmp    %eax,%ecx
>     44b1:	73 31                	jae    44e4 <pktgen_thread_worker+0x1ad>
>     44b3:	8b be b4 02 00 00    	mov    0x2b4(%esi),%edi
>     44b9:	29 cf                	sub    %ecx,%edi
>     44bb:	81 ff 0f 27 00 00    	cmp    $0x270f,%edi
>     44c1:	0f 86 ee 04 00 00    	jbe    49b5 <pktgen_thread_worker+0x67e>
>     44c7:	b9 d3 4d 62 10       	mov    $0x10624dd3,%ecx
>     44cc:	89 f8                	mov    %edi,%eax
>     44ce:	f7 e1                	mul    %ecx
>     44d0:	89 d1                	mov    %edx,%ecx
>     44d2:	89 f2                	mov    %esi,%edx
>     44d4:	c1 e9 06             	shr    $0x6,%ecx
>     44d7:	89 c8                	mov    %ecx,%eax
>     44d9:	e8 10 e4 ff ff       	call   28ee <pg_udelay>
>     44de:	8b be 90 02 00 00    	mov    0x290(%esi),%edi
>     44e4:	81 ff ff ff ff 7f    	cmp    $0x7fffffff,%edi
>     44ea:	0f 84 d5 04 00 00    	je     49c5 <pktgen_thread_worker+0x68e>
>     44f0:	8b 54 24 14          	mov    0x14(%esp),%edx
>     44f4:	8b 42 24             	mov    0x24(%edx),%eax
>     44f7:	a8 01                	test   $0x1,%al
>     44f9:	0f 85 f4 01 00 00    	jne    46f3 <pktgen_thread_worker+0x3bc>
>     44ff:	8b 4c 24 18          	mov    0x18(%esp),%ecx
>     4503:	8b 41 08             	mov    0x8(%ecx),%eax
>     4506:	a8 08                	test   $0x8,%al
>     4508:	0f 85 e5 01 00 00    	jne    46f3 <pktgen_thread_worker+0x3bc>
>     450e:	8b 86 c8 02 00 00    	mov    0x2c8(%esi),%eax
>     4514:	85 c0                	test   %eax,%eax
>     4516:	0f 85 63 03 00 00    	jne    487f <pktgen_thread_worker+0x548>
>     451c:	8b 96 40 04 00 00    	mov    0x440(%esi),%edx
>     4522:	85 d2                	test   %edx,%edx
>     4524:	75 5d                	jne    4583 <pktgen_thread_worker+0x24c>
>     4526:	8b 86 c4 02 00 00    	mov    0x2c4(%esi),%eax
>     452c:	83 c0 01             	add    $0x1,%eax
>     452f:	3b 86 e8 02 00 00    	cmp    0x2e8(%esi),%eax
>     4535:	89 86 c4 02 00 00    	mov    %eax,0x2c4(%esi)
>     453b:	0f 83 5f 03 00 00    	jae    48a0 <pktgen_thread_worker+0x569>
>     4541:	85 d2                	test   %edx,%edx
>     4543:	75 3e                	jne    4583 <pktgen_thread_worker+0x24c>
>     4545:	f6 86 81 02 00 00 02 	testb  $0x2,0x281(%esi)
>     454c:	0f 84 87 03 00 00    	je     48d9 <pktgen_thread_worker+0x5a2>
>     4552:	89 f2                	mov    %esi,%edx
>     4554:	8b 44 24 14          	mov    0x14(%esp),%eax
>     4558:	e8 fc f1 ff ff       	call   3759 <fill_packet_ipv6>
>     455d:	85 c0                	test   %eax,%eax
>     455f:	89 86 40 04 00 00    	mov    %eax,0x440(%esi)
>     4565:	0f 84 87 03 00 00    	je     48f2 <pktgen_thread_worker+0x5bb>
>     456b:	83 86 bc 02 00 00 01 	addl   $0x1,0x2bc(%esi)
>     4572:	c7 86 c4 02 00 00 00 	movl   $0x0,0x2c4(%esi)
>     4579:	00 00 00 
>     457c:	83 96 c0 02 00 00 00 	adcl   $0x0,0x2c0(%esi)
>     4583:	8b 7c 24 14          	mov    0x14(%esp),%edi
>     4587:	81 c7 2c 01 00 00    	add    $0x12c,%edi
>     458d:	89 f8                	mov    %edi,%eax
>     458f:	e8 fc ff ff ff       	call   4590 <pktgen_thread_worker+0x259>
>     4594:	8b 54 24 14          	mov    0x14(%esp),%edx
>     4598:	8b 42 24             	mov    0x24(%edx),%eax
>     459b:	a8 01                	test   $0x1,%al
>     459d:	0f 85 6c 03 00 00    	jne    490f <pktgen_thread_worker+0x5d8>
>     45a3:	8b 86 40 04 00 00    	mov    0x440(%esi),%eax
>     45a9:	f0 ff 80 94 00 00 00 	lock incl 0x94(%eax)
>     45b0:	8b 86 40 04 00 00    	mov    0x440(%esi),%eax
>     45b6:	8b 54 24 14          	mov    0x14(%esp),%edx
>     45ba:	ff 92 6c 01 00 00    	call   *0x16c(%edx)
>     45c0:	85 c0                	test   %eax,%eax
>     45c2:	0f 85 37 04 00 00    	jne    49ff <pktgen_thread_worker+0x6c8>
>     45c8:	83 86 9c 02 00 00 01 	addl   $0x1,0x29c(%esi)
>     45cf:	8b 86 2c 04 00 00    	mov    0x42c(%esi),%eax
>     45d5:	c7 86 c8 02 00 00 01 	movl   $0x1,0x2c8(%esi)
>     45dc:	00 00 00 
>     45df:	83 96 a0 02 00 00 00 	adcl   $0x0,0x2a0(%esi)
>     45e6:	83 c0 04             	add    $0x4,%eax
>     45e9:	31 d2                	xor    %edx,%edx
>     45eb:	83 86 e4 02 00 00 01 	addl   $0x1,0x2e4(%esi)
>     45f2:	01 86 a4 02 00 00    	add    %eax,0x2a4(%esi)
>     45f8:	11 96 a8 02 00 00    	adc    %edx,0x2a8(%esi)
>     45fe:	0f 31                	rdtsc  
>     4600:	89 44 24 0c          	mov    %eax,0xc(%esp)
>     4604:	89 54 24 10          	mov    %edx,0x10(%esp)
>     4608:	85 d2                	test   %edx,%edx
>     460a:	8b 1d 1c 00 00 00    	mov    0x1c,%ebx
>     4610:	89 d1                	mov    %edx,%ecx
>     4612:	89 c5                	mov    %eax,%ebp
>     4614:	74 08                	je     461e <pktgen_thread_worker+0x2e7>
>     4616:	89 d0                	mov    %edx,%eax
>     4618:	31 d2                	xor    %edx,%edx
>     461a:	f7 f3                	div    %ebx
>     461c:	89 c1                	mov    %eax,%ecx
>     461e:	89 e8                	mov    %ebp,%eax
>     4620:	f7 f3                	div    %ebx
>     4622:	89 ca                	mov    %ecx,%edx
>     4624:	89 54 24 10          	mov    %edx,0x10(%esp)
>     4628:	89 44 24 0c          	mov    %eax,0xc(%esp)
>     462c:	8b 8e 90 02 00 00    	mov    0x290(%esi),%ecx
>     4632:	31 db                	xor    %ebx,%ebx
>     4634:	01 4c 24 0c          	add    %ecx,0xc(%esp)
>     4638:	11 5c 24 10          	adc    %ebx,0x10(%esp)
>     463c:	8b 54 24 0c          	mov    0xc(%esp),%edx
>     4640:	8b 4c 24 10          	mov    0x10(%esp),%ecx
>     4644:	89 96 b4 02 00 00    	mov    %edx,0x2b4(%esi)
>     464a:	89 8e b8 02 00 00    	mov    %ecx,0x2b8(%esi)
>     4650:	89 f8                	mov    %edi,%eax
>     4652:	e8 fc ff ff ff       	call   4653 <pktgen_thread_worker+0x31c>
>     4657:	8b 96 98 02 00 00    	mov    0x298(%esi),%edx
>     465d:	8b 86 94 02 00 00    	mov    0x294(%esi),%eax
>     4663:	89 d1                	mov    %edx,%ecx
>     4665:	09 c1                	or     %eax,%ecx
>     4667:	0f 84 f6 00 00 00    	je     4763 <pktgen_thread_worker+0x42c>
>     466d:	8b 9e a0 02 00 00    	mov    0x2a0(%esi),%ebx
>     4673:	8b 8e 9c 02 00 00    	mov    0x29c(%esi),%ecx
>     4679:	39 d3                	cmp    %edx,%ebx
>     467b:	0f 82 e2 00 00 00    	jb     4763 <pktgen_thread_worker+0x42c>
>     4681:	77 08                	ja     468b <pktgen_thread_worker+0x354>
>     4683:	39 c1                	cmp    %eax,%ecx
>     4685:	0f 82 d8 00 00 00    	jb     4763 <pktgen_thread_worker+0x42c>
>     468b:	8b 8e 40 04 00 00    	mov    0x440(%esi),%ecx
>     4691:	8b 81 94 00 00 00    	mov    0x94(%ecx),%eax
>     4697:	83 f8 01             	cmp    $0x1,%eax
>     469a:	74 4e                	je     46ea <pktgen_thread_worker+0x3b3>
>     469c:	0f 31                	rdtsc  
>     469e:	89 c7                	mov    %eax,%edi
>     46a0:	8b 81 94 00 00 00    	mov    0x94(%ecx),%eax
>     46a6:	83 f8 01             	cmp    $0x1,%eax
>     46a9:	89 d5                	mov    %edx,%ebp
>     46ab:	74 2b                	je     46d8 <pktgen_thread_worker+0x3a1>
>     46ad:	bb 00 e0 ff ff       	mov    $0xffffe000,%ebx
>     46b2:	21 e3                	and    %esp,%ebx
>     46b4:	eb 16                	jmp    46cc <pktgen_thread_worker+0x395>
>     46b6:	e8 fc ff ff ff       	call   46b7 <pktgen_thread_worker+0x380>
>     46bb:	8b 86 40 04 00 00    	mov    0x440(%esi),%eax
>     46c1:	8b 80 94 00 00 00    	mov    0x94(%eax),%eax
>     46c7:	83 f8 01             	cmp    $0x1,%eax
>     46ca:	74 0c                	je     46d8 <pktgen_thread_worker+0x3a1>
>     46cc:	8b 03                	mov    (%ebx),%eax
>     46ce:	8b 40 04             	mov    0x4(%eax),%eax
>     46d1:	8b 40 08             	mov    0x8(%eax),%eax
>     46d4:	a8 04                	test   $0x4,%al
>     46d6:	74 de                	je     46b6 <pktgen_thread_worker+0x37f>
>     46d8:	0f 31                	rdtsc  
>     46da:	29 f8                	sub    %edi,%eax
>     46dc:	19 ea                	sbb    %ebp,%edx
>     46de:	01 86 dc 02 00 00    	add    %eax,0x2dc(%esi)
>     46e4:	11 96 e0 02 00 00    	adc    %edx,0x2e0(%esi)
>     46ea:	89 f0                	mov    %esi,%eax
>     46ec:	e8 2f fa ff ff       	call   4120 <pktgen_stop_device>
>     46f1:	eb 70                	jmp    4763 <pktgen_thread_worker+0x42c>
>     46f3:	0f 31                	rdtsc  
>     46f5:	89 d5                	mov    %edx,%ebp
>     46f7:	8b 54 24 14          	mov    0x14(%esp),%edx
>     46fb:	89 c7                	mov    %eax,%edi
>     46fd:	8b 42 24             	mov    0x24(%edx),%eax
>     4700:	a8 02                	test   $0x2,%al
>     4702:	2e 74 e5             	je,pn  46ea <pktgen_thread_worker+0x3b3>
>     4705:	8b 4c 24 18          	mov    0x18(%esp),%ecx
>     4709:	8b 41 08             	mov    0x8(%ecx),%eax
>     470c:	a8 08                	test   $0x8,%al
>     470e:	0f 85 e1 02 00 00    	jne    49f5 <pktgen_thread_worker+0x6be>
>     4714:	0f 31                	rdtsc  
>     4716:	29 f8                	sub    %edi,%eax
>     4718:	19 ea                	sbb    %ebp,%edx
>     471a:	01 86 dc 02 00 00    	add    %eax,0x2dc(%esi)
>     4720:	11 96 e0 02 00 00    	adc    %edx,0x2e0(%esi)
>     4726:	8b 54 24 14          	mov    0x14(%esp),%edx
>     472a:	8b 42 24             	mov    0x24(%edx),%eax
>     472d:	a8 01                	test   $0x1,%al
>     472f:	0f 84 d9 fd ff ff    	je     450e <pktgen_thread_worker+0x1d7>
>     4735:	0f 31                	rdtsc  
>     4737:	85 d2                	test   %edx,%edx
>     4739:	8b 1d 1c 00 00 00    	mov    0x1c,%ebx
>     473f:	89 d1                	mov    %edx,%ecx
>     4741:	89 c7                	mov    %eax,%edi
>     4743:	74 08                	je     474d <pktgen_thread_worker+0x416>
>     4745:	89 d0                	mov    %edx,%eax
>     4747:	31 d2                	xor    %edx,%edx
>     4749:	f7 f3                	div    %ebx
>     474b:	89 c1                	mov    %eax,%ecx
>     474d:	89 f8                	mov    %edi,%eax
>     474f:	f7 f3                	div    %ebx
>     4751:	89 ca                	mov    %ecx,%edx
>     4753:	89 c1                	mov    %eax,%ecx
>     4755:	89 d3                	mov    %edx,%ebx
>     4757:	89 8e b4 02 00 00    	mov    %ecx,0x2b4(%esi)
>     475d:	89 9e b8 02 00 00    	mov    %ebx,0x2b8(%esi)
>     4763:	8b 96 c8 02 00 00    	mov    0x2c8(%esi),%edx
>     4769:	8b 4c 24 24          	mov    0x24(%esp),%ecx
>     476d:	01 54 24 20          	add    %edx,0x20(%esp)
>     4771:	39 4c 24 20          	cmp    %ecx,0x20(%esp)
>     4775:	76 20                	jbe    4797 <pktgen_thread_worker+0x460>
>     4777:	8b 54 24 18          	mov    0x18(%esp),%edx
>     477b:	8b 42 10             	mov    0x10(%edx),%eax
>     477e:	c1 e0 07             	shl    $0x7,%eax
>     4781:	8b b8 00 00 00 00    	mov    0x0(%eax),%edi
>     4787:	85 ff                	test   %edi,%edi
>     4789:	0f 85 1c 02 00 00    	jne    49ab <pktgen_thread_worker+0x674>
>     478f:	c7 44 24 20 00 00 00 	movl   $0x0,0x20(%esp)
>     4796:	00 
>     4797:	8b 4c 24 28          	mov    0x28(%esp),%ecx
>     479b:	8b 91 b4 02 00 00    	mov    0x2b4(%ecx),%edx
>     47a1:	f6 c2 01             	test   $0x1,%dl
>     47a4:	0f 85 7c 00 00 00    	jne    4826 <pktgen_thread_worker+0x4ef>
>     47aa:	8b 4c 24 18          	mov    0x18(%esp),%ecx
>     47ae:	8b 01                	mov    (%ecx),%eax
>     47b0:	8b 40 04             	mov    0x4(%eax),%eax
>     47b3:	8b 40 08             	mov    0x8(%eax),%eax
>     47b6:	a8 04                	test   $0x4,%al
>     47b8:	75 6c                	jne    4826 <pktgen_thread_worker+0x4ef>
>     47ba:	f6 c2 02             	test   $0x2,%dl
>     47bd:	0f 85 c7 01 00 00    	jne    498a <pktgen_thread_worker+0x653>
>     47c3:	f6 c2 04             	test   $0x4,%dl
>     47c6:	0f 85 9d 01 00 00    	jne    4969 <pktgen_thread_worker+0x632>
>     47cc:	80 e2 08             	and    $0x8,%dl
>     47cf:	90                   	nop    
>     47d0:	0f 85 7a 01 00 00    	jne    4950 <pktgen_thread_worker+0x619>
>     47d6:	8b 54 24 18          	mov    0x18(%esp),%edx
>     47da:	8b 42 08             	mov    0x8(%edx),%eax
>     47dd:	a8 08                	test   $0x8,%al
>     47df:	0f 84 5b fc ff ff    	je     4440 <pktgen_thread_worker+0x109>
>     47e5:	e8 fc ff ff ff       	call   47e6 <pktgen_thread_worker+0x4af>
>     47ea:	8b 54 24 18          	mov    0x18(%esp),%edx
>     47ee:	8b 02                	mov    (%edx),%eax
>     47f0:	c7 00 00 00 00 00    	movl   $0x0,(%eax)
>     47f6:	8b 44 24 28          	mov    0x28(%esp),%eax
>     47fa:	e8 d2 f9 ff ff       	call   41d1 <next_to_run>
>     47ff:	85 c0                	test   %eax,%eax
>     4801:	89 c6                	mov    %eax,%esi
>     4803:	0f 85 56 fc ff ff    	jne    445f <pktgen_thread_worker+0x128>
>     4809:	ba 64 00 00 00       	mov    $0x64,%edx
>     480e:	8b 44 24 1c          	mov    0x1c(%esp),%eax
>     4812:	e8 fc ff ff ff       	call   4813 <pktgen_thread_worker+0x4dc>
>     4817:	8b 4c 24 28          	mov    0x28(%esp),%ecx
>     481b:	8b 91 b4 02 00 00    	mov    0x2b4(%ecx),%edx
>     4821:	f6 c2 01             	test   $0x1,%dl
>     4824:	74 84                	je     47aa <pktgen_thread_worker+0x473>
>     4826:	8b 5c 24 28          	mov    0x28(%esp),%ebx
>     482a:	c7 04 24 c8 06 00 00 	movl   $0x6c8,(%esp)
>     4831:	83 c3 0c             	add    $0xc,%ebx
>     4834:	89 5c 24 04          	mov    %ebx,0x4(%esp)
>     4838:	e8 fc ff ff ff       	call   4839 <pktgen_thread_worker+0x502>
>     483d:	8b 44 24 28          	mov    0x28(%esp),%eax
>     4841:	e8 f6 f9 ff ff       	call   423c <pktgen_stop>
>     4846:	89 5c 24 04          	mov    %ebx,0x4(%esp)
>     484a:	c7 04 24 e8 06 00 00 	movl   $0x6e8,(%esp)
>     4851:	e8 fc ff ff ff       	call   4852 <pktgen_thread_worker+0x51b>
>     4856:	8b 44 24 28          	mov    0x28(%esp),%eax
>     485a:	e8 1b fa ff ff       	call   427a <pktgen_rem_all_ifs>
>     485f:	89 5c 24 04          	mov    %ebx,0x4(%esp)
>     4863:	c7 04 24 cc 06 00 00 	movl   $0x6cc,(%esp)
>     486a:	e8 fc ff ff ff       	call   486b <pktgen_thread_worker+0x534>
>     486f:	8b 44 24 28          	mov    0x28(%esp),%eax
>     4873:	83 c4 2c             	add    $0x2c,%esp
>     4876:	5b                   	pop    %ebx
>     4877:	5e                   	pop    %esi
>     4878:	5f                   	pop    %edi
>     4879:	5d                   	pop    %ebp
>     487a:	e9 27 fa ff ff       	jmp    42a6 <pktgen_rem_thread>
>     487f:	8b 96 40 04 00 00    	mov    0x440(%esi),%edx
>     4885:	8b 86 c4 02 00 00    	mov    0x2c4(%esi),%eax
>     488b:	83 c0 01             	add    $0x1,%eax
>     488e:	3b 86 e8 02 00 00    	cmp    0x2e8(%esi),%eax
>     4894:	89 86 c4 02 00 00    	mov    %eax,0x2c4(%esi)
>     489a:	0f 82 a1 fc ff ff    	jb     4541 <pktgen_thread_worker+0x20a>
>     48a0:	85 d2                	test   %edx,%edx
>     48a2:	0f 84 9d fc ff ff    	je     4545 <pktgen_thread_worker+0x20e>
>     48a8:	8b 82 94 00 00 00    	mov    0x94(%edx),%eax
>     48ae:	83 f8 01             	cmp    $0x1,%eax
>     48b1:	74 12                	je     48c5 <pktgen_thread_worker+0x58e>
>     48b3:	f0 ff 8a 94 00 00 00 	lock decl 0x94(%edx)
>     48ba:	0f 94 c0             	sete   %al
>     48bd:	84 c0                	test   %al,%al
>     48bf:	0f 84 80 fc ff ff    	je     4545 <pktgen_thread_worker+0x20e>
>     48c5:	89 d0                	mov    %edx,%eax
>     48c7:	e8 fc ff ff ff       	call   48c8 <pktgen_thread_worker+0x591>
>     48cc:	f6 86 81 02 00 00 02 	testb  $0x2,0x281(%esi)
>     48d3:	0f 85 79 fc ff ff    	jne    4552 <pktgen_thread_worker+0x21b>
>     48d9:	89 f2                	mov    %esi,%edx
>     48db:	8b 44 24 14          	mov    0x14(%esp),%eax
>     48df:	e8 22 e7 ff ff       	call   3006 <fill_packet_ipv4>
>     48e4:	85 c0                	test   %eax,%eax
>     48e6:	89 86 40 04 00 00    	mov    %eax,0x440(%esi)
>     48ec:	0f 85 79 fc ff ff    	jne    456b <pktgen_thread_worker+0x234>
>     48f2:	c7 04 24 08 07 00 00 	movl   $0x708,(%esp)
>     48f9:	e8 fc ff ff ff       	call   48fa <pktgen_thread_worker+0x5c3>
>     48fe:	e8 fc ff ff ff       	call   48ff <pktgen_thread_worker+0x5c8>
>     4903:	83 ae c4 02 00 00 01 	subl   $0x1,0x2c4(%esi)
>     490a:	e9 54 fe ff ff       	jmp    4763 <pktgen_thread_worker+0x42c>
>     490f:	c7 86 c8 02 00 00 00 	movl   $0x0,0x2c8(%esi)
>     4916:	00 00 00 
>     4919:	0f 31                	rdtsc  
>     491b:	89 44 24 0c          	mov    %eax,0xc(%esp)
>     491f:	89 54 24 10          	mov    %edx,0x10(%esp)
>     4923:	85 d2                	test   %edx,%edx
>     4925:	8b 1d 1c 00 00 00    	mov    0x1c,%ebx
>     492b:	89 d1                	mov    %edx,%ecx
>     492d:	89 c5                	mov    %eax,%ebp
>     492f:	74 08                	je     4939 <pktgen_thread_worker+0x602>
>     4931:	89 d0                	mov    %edx,%eax
>     4933:	31 d2                	xor    %edx,%edx
>     4935:	f7 f3                	div    %ebx
>     4937:	89 c1                	mov    %eax,%ecx
>     4939:	89 e8                	mov    %ebp,%eax
>     493b:	f7 f3                	div    %ebx
>     493d:	89 ca                	mov    %ecx,%edx
>     493f:	89 86 b4 02 00 00    	mov    %eax,0x2b4(%esi)
>     4945:	89 96 b8 02 00 00    	mov    %edx,0x2b8(%esi)
>     494b:	e9 00 fd ff ff       	jmp    4650 <pktgen_thread_worker+0x319>
>     4950:	8b 44 24 28          	mov    0x28(%esp),%eax
>     4954:	e8 21 f9 ff ff       	call   427a <pktgen_rem_all_ifs>
>     4959:	8b 44 24 28          	mov    0x28(%esp),%eax
>     495d:	83 a0 b4 02 00 00 f7 	andl   $0xfffffff7,0x2b4(%eax)
>     4964:	e9 6d fe ff ff       	jmp    47d6 <pktgen_thread_worker+0x49f>
>     4969:	8b 44 24 28          	mov    0x28(%esp),%eax
>     496d:	e8 d7 f2 ff ff       	call   3c49 <pktgen_run>
>     4972:	8b 4c 24 28          	mov    0x28(%esp),%ecx
>     4976:	8b 91 b4 02 00 00    	mov    0x2b4(%ecx),%edx
>     497c:	83 e2 fb             	and    $0xfffffffb,%edx
>     497f:	89 91 b4 02 00 00    	mov    %edx,0x2b4(%ecx)
>     4985:	e9 42 fe ff ff       	jmp    47cc <pktgen_thread_worker+0x495>
>     498a:	8b 44 24 28          	mov    0x28(%esp),%eax
>     498e:	e8 a9 f8 ff ff       	call   423c <pktgen_stop>
>     4993:	8b 44 24 28          	mov    0x28(%esp),%eax
>     4997:	8b 90 b4 02 00 00    	mov    0x2b4(%eax),%edx
>     499d:	83 e2 fd             	and    $0xfffffffd,%edx
>     49a0:	89 90 b4 02 00 00    	mov    %edx,0x2b4(%eax)
>     49a6:	e9 18 fe ff ff       	jmp    47c3 <pktgen_thread_worker+0x48c>
>     49ab:	e8 fc ff ff ff       	call   49ac <pktgen_thread_worker+0x675>
>     49b0:	e9 da fd ff ff       	jmp    478f <pktgen_thread_worker+0x458>
>     49b5:	89 f2                	mov    %esi,%edx
>     49b7:	89 f8                	mov    %edi,%eax
>     49b9:	e8 cc de ff ff       	call   288a <nanospin>
>     49be:	89 f6                	mov    %esi,%esi
>     49c0:	e9 19 fb ff ff       	jmp    44de <pktgen_thread_worker+0x1a7>
>     49c5:	0f 31                	rdtsc  
>     49c7:	85 d2                	test   %edx,%edx
>     49c9:	8b 1d 1c 00 00 00    	mov    0x1c,%ebx
>     49cf:	89 d1                	mov    %edx,%ecx
>     49d1:	89 c7                	mov    %eax,%edi
>     49d3:	74 08                	je     49dd <pktgen_thread_worker+0x6a6>
>     49d5:	89 d0                	mov    %edx,%eax
>     49d7:	31 d2                	xor    %edx,%edx
>     49d9:	f7 f3                	div    %ebx
>     49db:	89 c1                	mov    %eax,%ecx
>     49dd:	89 f8                	mov    %edi,%eax
>     49df:	f7 f3                	div    %ebx
>     49e1:	89 ca                	mov    %ecx,%edx
>     49e3:	89 c1                	mov    %eax,%ecx
>     49e5:	89 d3                	mov    %edx,%ebx
>     49e7:	81 c1 ff ff ff 7f    	add    $0x7fffffff,%ecx
>     49ed:	83 d3 00             	adc    $0x0,%ebx
>     49f0:	e9 62 fd ff ff       	jmp    4757 <pktgen_thread_worker+0x420>
>     49f5:	e8 fc ff ff ff       	call   49f6 <pktgen_thread_worker+0x6bf>
>     49fa:	e9 15 fd ff ff       	jmp    4714 <pktgen_thread_worker+0x3dd>
>     49ff:	83 f8 ff             	cmp    $0xffffffff,%eax
>     4a02:	75 14                	jne    4a18 <pktgen_thread_worker+0x6e1>
>     4a04:	8b 4c 24 14          	mov    0x14(%esp),%ecx
>     4a08:	f6 81 59 01 00 00 10 	testb  $0x10,0x159(%ecx)
>     4a0f:	74 07                	je     4a18 <pktgen_thread_worker+0x6e1>
>     4a11:	f3 90                	pause  
>     4a13:	e9 98 fb ff ff       	jmp    45b0 <pktgen_thread_worker+0x279>
>     4a18:	8b 86 40 04 00 00    	mov    0x440(%esi),%eax
>     4a1e:	f0 ff 88 94 00 00 00 	lock decl 0x94(%eax)
>     4a25:	8b 2d 08 00 00 00    	mov    0x8,%ebp
>     4a2b:	85 ed                	test   %ebp,%ebp
>     4a2d:	75 4f                	jne    4a7e <pktgen_thread_worker+0x747>
>     4a2f:	83 86 ac 02 00 00 01 	addl   $0x1,0x2ac(%esi)
>     4a36:	c7 86 c8 02 00 00 00 	movl   $0x0,0x2c8(%esi)
>     4a3d:	00 00 00 
>     4a40:	83 96 b0 02 00 00 00 	adcl   $0x0,0x2b0(%esi)
>     4a47:	0f 31                	rdtsc  
>     4a49:	89 44 24 0c          	mov    %eax,0xc(%esp)
>     4a4d:	89 54 24 10          	mov    %edx,0x10(%esp)
>     4a51:	85 d2                	test   %edx,%edx
>     4a53:	8b 1d 1c 00 00 00    	mov    0x1c,%ebx
>     4a59:	89 d1                	mov    %edx,%ecx
>     4a5b:	89 c5                	mov    %eax,%ebp
>     4a5d:	74 08                	je     4a67 <pktgen_thread_worker+0x730>
>     4a5f:	89 d0                	mov    %edx,%eax
>     4a61:	31 d2                	xor    %edx,%edx
>     4a63:	f7 f3                	div    %ebx
>     4a65:	89 c1                	mov    %eax,%ecx
>     4a67:	89 e8                	mov    %ebp,%eax
>     4a69:	f7 f3                	div    %ebx
>     4a6b:	89 ca                	mov    %ecx,%edx
>     4a6d:	89 86 b4 02 00 00    	mov    %eax,0x2b4(%esi)
>     4a73:	89 96 b8 02 00 00    	mov    %edx,0x2b8(%esi)
>     4a79:	e9 80 fb ff ff       	jmp    45fe <pktgen_thread_worker+0x2c7>
>     4a7e:	e8 fc ff ff ff       	call   4a7f <pktgen_thread_worker+0x748>
>     4a83:	85 c0                	test   %eax,%eax
>     4a85:	74 a8                	je     4a2f <pktgen_thread_worker+0x6f8>
>     4a87:	c7 04 24 e9 06 00 00 	movl   $0x6e9,(%esp)
>     4a8e:	e8 fc ff ff ff       	call   4a8f <pktgen_thread_worker+0x758>
>     4a93:	eb 9a                	jmp    4a2f <pktgen_thread_worker+0x6f8>
> 

-- 
Pádraig Brady - http://www.pixelbeat.org
--

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000
       [not found]                                         ` <20041209164820.GB32454@mail.com>
@ 2004-12-09 17:19                                           ` P
  2004-12-09 23:25                                             ` Ray Lehtiniemi
  0 siblings, 1 reply; 85+ messages in thread
From: P @ 2004-12-09 17:19 UTC (permalink / raw)
  To: Ray Lehtiniemi

Ray Lehtiniemi wrote:
> On Thu, Dec 09, 2004 at 09:18:25AM -0700, Ray Lehtiniemi wrote:
> 
>>it is worth noting that my box has become quite unstable since
>>i started to use oprofile and pktgen together.  sshd stops responding,
>>and the network seems to go down.  not sure what is happening there...
>>this instability seems to be persisting across reboots, unfortunately...
> 
> 
> 
> ok, it seems that this is related to martin's e1000 patch, and i
> just hadn't noticed it before.  rolling back the 1.2 Mpps patch
> seems to cure the problem.
> 
> symptoms are a total freezeup of the e1000 interfaces. netstat
> -an shows a tcp connection for my ssh login to the box, with about
> 53K in the send-Q.  /proc/net/tcp is empty, however....  i can
> reproduce this at will by doing
> 
>  # objdump -d /lib/modules/2.6.10-rc3-mp-rayl/kernel/net/core/pktgen.ko
> 
> on that machine with the e1000-patched kernel running.
> 
> 
> if there's any diagnostic output i can generate that might tell
> me what's going wrong, let me know and i'll try to generate it.

can you send this again to netdev.

thanks.

-- 
Pádraig Brady - http://www.pixelbeat.org
--

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: 1.03Mpps on e1000
  2004-12-09 17:19                                           ` P
@ 2004-12-09 23:25                                             ` Ray Lehtiniemi
  0 siblings, 0 replies; 85+ messages in thread
From: Ray Lehtiniemi @ 2004-12-09 23:25 UTC (permalink / raw)
  To: netdev


hi all

My apologies if this gets received twice... I originally sent a copy
of this using mutt's 'bounce' function, but I don't think that's
what I wanted to do.

This is a bug report re: Martin's e1000 patch. I'm seeing some lockups
under normal traffic loads that seem to go away if I revert the
patch.  Details below.


thanks



On Thu, Dec 09, 2004 at 05:19:55PM +0000, P@draigBrady.com wrote:
> Ray Lehtiniemi wrote:
> >On Thu, Dec 09, 2004 at 09:18:25AM -0700, Ray Lehtiniemi wrote:
> >
> >>it is worth noting that my box has become quite unstable since
> >>i started to use oprofile and pktgen together.  sshd stops responding,
> >>and the network seems to go down.  not sure what is happening there...
> >>this instability seems to be persisting across reboots, unfortunately...
> >
> >
> >
> >ok, it seems that this is related to martin's e1000 patch, and i
> >just hadn't noticed it before.  rolling back the 1.2 Mpps patch
> >seems to cure the problem.
> >
> >symptoms are a total freezeup of the e1000 interfaces. netstat
> >-an shows a tcp connection for my ssh login to the box, with about
> >53K in the send-Q.  /proc/net/tcp is empty, however....  i can
> >reproduce this at will by doing
> >
> > # objdump -d /lib/modules/2.6.10-rc3-mp-rayl/kernel/net/core/pktgen.ko
> >
> >on that machine with the e1000-patched kernel running.
> >
> >
> >if there's any diagnostic output i can generate that might tell
> >me what's going wrong, let me know and i'll try to generate it.
> 
> can you send this again to netdev.
> 
> thanks.
> 
> -- 
> Pádraig Brady - http://www.pixelbeat.org
> --

-- 
----------------------------------------------------------------------
     Ray L   <rayl@mail.com>

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [E1000-devel] Transmission limit
  2004-12-02 18:23                           ` Robert Olsson
  2004-12-02 23:25                             ` Lennert Buytenhek
  2004-12-03  5:23                             ` Scott Feldman
@ 2004-12-10 16:24                             ` Martin Josefsson
  2 siblings, 0 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-10 16:24 UTC (permalink / raw)
  To: Robert Olsson
  Cc: sfeldma, Lennert Buytenhek, jamal, P, mellia, e1000-devel,
	Jorge Manuel Finochietto, Giulio Galante, netdev

[-- Attachment #1: Type: text/plain, Size: 1096 bytes --]

On Thu, 2004-12-02 at 19:23, Robert Olsson wrote:
> Hello!
> 
> Below is a little patch to clean skbs at xmit. It's an old jungle trick Jamal
> and I used with tulip. Note we can now even decrease the size of the TX ring.

Just a small unimportant note.

> --- drivers/net/e1000/e1000_main.c.orig	2004-12-01 13:59:36.000000000 +0100
> +++ drivers/net/e1000/e1000_main.c	2004-12-02 20:37:40.000000000 +0100
> @@ -1820,6 +1820,10 @@
>   		return NETDEV_TX_LOCKED; 
>   	} 
>  
> +
> +	if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 )
> +		e1000_clean_tx_ring(adapter);
> +
>  	/* need: count + 2 desc gap to keep tail from touching
>  	 * head, otherwise try next time */
>  	if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {

This patch is pretty broken: I doubt you want to call
e1000_clean_tx_ring(); I think you want some variant of
e1000_clean_tx_irq() :)

e1000_clean_tx_irq() takes adapter->tx_lock, which e1000_xmit_frame()
also does, so it will need some modification.

And it should use E1000_DESC_UNUSED, as Scott pointed out.
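
As a rough illustration of the clean-at-xmit idea being discussed (using
simplified, hypothetical ring bookkeeping and a stand-in reclaim helper,
not the real e1000 structures, locking, or E1000_DESC_UNUSED macro):

	/* Hypothetical, simplified TX ring -- not the real e1000 types. */
	struct toy_tx_ring {
		unsigned int count;          /* total descriptors in the ring */
		unsigned int next_to_use;    /* next descriptor the driver fills */
		unsigned int next_to_clean;  /* oldest descriptor not yet reclaimed */
	};

	/* Stand-in for an e1000_clean_tx_irq()-style routine: free buffers
	 * the NIC has finished with and advance next_to_clean.  In the real
	 * driver this would also have to cope with adapter->tx_lock. */
	static void toy_reclaim_completed(struct toy_tx_ring *ring)
	{
		/* ... free completed skbs, advance ring->next_to_clean ... */
	}

	/* Descriptors handed to the hardware but not yet reclaimed. */
	static unsigned int toy_tx_pending(const struct toy_tx_ring *ring)
	{
		unsigned int use = ring->next_to_use;
		unsigned int clean = ring->next_to_clean;

		return use >= clean ? use - clean : use + ring->count - clean;
	}

	/* Clean-at-xmit: opportunistically reclaim from the transmit path
	 * once the backlog passes a threshold, instead of waiting for the
	 * TX interrupt to do all the cleanup. */
	static void toy_xmit_reclaim(struct toy_tx_ring *ring, unsigned int threshold)
	{
		if (toy_tx_pending(ring) > threshold)
			toy_reclaim_completed(ring);
	}

Reclaiming from the transmit path is also what makes a smaller TX ring
workable, since cleanup no longer depends solely on how quickly TX
interrupts are serviced.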

-- 
/Martin

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2004-12-10 16:24 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1101467291.24742.70.camel@mellia.lipar.polito.it>
2004-11-26 14:05 ` [E1000-devel] Transmission limit P
2004-11-26 15:31   ` Marco Mellia
2004-11-26 19:56     ` jamal
2004-11-29 14:21       ` Marco Mellia
2004-11-30 13:46         ` jamal
2004-12-02 17:24           ` Marco Mellia
2004-11-26 20:06     ` jamal
2004-11-26 20:56     ` Lennert Buytenhek
2004-11-26 21:02       ` Lennert Buytenhek
2004-11-27  9:25     ` Harald Welte
     [not found]       ` <20041127111101.GC23139@xi.wantstofly.org>
2004-11-27 11:31         ` Harald Welte
2004-11-27 20:12       ` Cesar Marcondes
2004-11-29  8:53       ` Marco Mellia
2004-11-29 14:50         ` Lennert Buytenhek
2004-11-30  8:42           ` Marco Mellia
2004-12-01 12:25             ` jamal
2004-12-02 13:39               ` Marco Mellia
2004-12-03 13:07                 ` jamal
2004-11-26 15:40   ` Robert Olsson
2004-11-26 15:59     ` Marco Mellia
2004-11-26 16:57       ` P
2004-11-26 20:01         ` jamal
2004-11-29 10:19           ` P
2004-11-29 13:09           ` Robert Olsson
2004-11-29 20:16             ` David S. Miller
2004-12-01 16:47               ` Robert Olsson
2004-11-30 13:31             ` jamal
2004-11-30 13:46               ` Lennert Buytenhek
2004-11-30 14:25                 ` jamal
2004-12-01  0:11                   ` Lennert Buytenhek
2004-12-01  1:09                     ` Scott Feldman
2004-12-01 15:34                       ` Robert Olsson
2004-12-01 16:49                         ` Scott Feldman
2004-12-01 17:37                           ` Robert Olsson
2004-12-02 17:54                           ` Robert Olsson
2004-12-02 18:23                           ` Robert Olsson
2004-12-02 23:25                             ` Lennert Buytenhek
2004-12-03  5:23                             ` Scott Feldman
2004-12-10 16:24                             ` Martin Josefsson
2004-12-01 18:29                       ` Lennert Buytenhek
2004-12-01 21:35                         ` Lennert Buytenhek
2004-12-02  6:13                           ` Scott Feldman
2004-12-03 13:24                             ` jamal
2004-12-05 14:50                             ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
2004-12-05 15:03                               ` Martin Josefsson
2004-12-05 15:15                                 ` Lennert Buytenhek
2004-12-05 15:19                                   ` Martin Josefsson
2004-12-05 15:30                                     ` Martin Josefsson
2004-12-05 17:00                                       ` Lennert Buytenhek
2004-12-05 17:11                                         ` Martin Josefsson
2004-12-05 17:38                                           ` Martin Josefsson
2004-12-05 18:14                                             ` Lennert Buytenhek
2004-12-05 15:42                                 ` Martin Josefsson
2004-12-05 16:48                                   ` Martin Josefsson
2004-12-05 17:01                                     ` Martin Josefsson
2004-12-05 17:58                                     ` Lennert Buytenhek
2004-12-05 17:44                                   ` Lennert Buytenhek
2004-12-05 17:51                                     ` Lennert Buytenhek
2004-12-05 17:54                                       ` Martin Josefsson
2004-12-06 11:32                                         ` 1.03Mpps on e1000 (was: " jamal
2004-12-06 12:11                                           ` Lennert Buytenhek
2004-12-06 12:20                                             ` jamal
2004-12-06 12:23                                               ` Lennert Buytenhek
2004-12-06 12:30                                                 ` Martin Josefsson
2004-12-06 13:11                                                   ` jamal
     [not found]                                                     ` <20041206132907.GA13411@xi.wantstofly.org>
     [not found]                                                       ` <16820.37049.396306.295878@robur.slu.se>
2004-12-06 17:32                                                         ` 1.03Mpps on e1000 (was: Re: [E1000-devel] " P
2004-12-08 23:36                                   ` Ray Lehtiniemi
     [not found]                                     ` <41B825A5.2000009@draigBrady.com>
     [not found]                                       ` <20041209161825.GA32454@mail.com>
2004-12-09 17:12                                         ` 1.03Mpps on e1000 P
     [not found]                                         ` <20041209164820.GB32454@mail.com>
2004-12-09 17:19                                           ` P
2004-12-09 23:25                                             ` Ray Lehtiniemi
2004-12-05 21:12                                 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
2004-12-05 21:25                                   ` Lennert Buytenhek
2004-12-06  1:23                                     ` 1.03Mpps on e1000 (was: " Scott Feldman
2004-12-02 17:31                       ` [E1000-devel] Transmission limit Marco Mellia
2004-12-03 20:57                       ` Lennert Buytenhek
2004-12-04 10:36                         ` Lennert Buytenhek
2004-12-01 12:08                     ` jamal
2004-12-01 15:24                       ` Lennert Buytenhek
2004-11-26 17:58       ` Robert Olsson
2004-11-27 20:00   ` Lennert Buytenhek
2004-11-29 12:44     ` Marco Mellia
2004-11-29 15:19       ` Lennert Buytenhek
2004-11-29 17:32         ` Marco Mellia
2004-11-29 19:08           ` Lennert Buytenhek
2004-11-29 19:09             ` Lennert Buytenhek
