* Re: [E1000-devel] Transmission limit
[not found] <1101467291.24742.70.camel@mellia.lipar.polito.it>
@ 2004-11-26 14:05 ` P
2004-11-26 15:31 ` Marco Mellia
` (2 more replies)
0 siblings, 3 replies; 85+ messages in thread
From: P @ 2004-11-26 14:05 UTC (permalink / raw)
To: mellia; +Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
I'm forwarding this to netdev, as these are very interesting
results (even if I don't believe them).
If you point us at the code/versions we will be better able to answer.
Marco Mellia wrote:
> We are trying to stress the e1000 hardware/driver under linux and Click
> to see what is the maximum number of packets per second that can be
> received/transmitted by a single NIC.
>
> We found something which is counterintuitive:
>
> - in reception, we can receive ALL the traffic, regardless of the
> packet size (or, if you prefer, we can receive ALL the minimum sized
> packets at gigabit speed)
I questioned whether you actually did receive at that rate to
which you responded:
> - using Click, we can receive 100% of (small) packets at gigabit
> speed with TWO cards (2gigabit/s ~ 2.8Mpps)
> - using linux and standard e1000 driver, we can receive up to about
> 80% of traffic from a single nic (~1.1Mpps)
> - using linux and a modified (simplified) version of the driver, we
> can receive 100% on a single nic, but not 100% using two nics (up
> to ~1.5Mpps).
>
> Reception means: receiving the packet up to the rx ring at the
> kernel level, and then IMMEDIATELY drop it (no packet processing,
> no forwarding, nothing more...)
>
> Using NAPI or IRQ has little impact (as we are not processing the
> packets, the livelock due to the hardIRQ preemption versus the
> softIRQ managers is not entered...)
>
> But the limit in TRANSMISSION seems to be 700Kpps. Regardless of
> - the traffic generator,
> - the driver version,
> - the O.S. (linux/click),
> - the hardware (broadcom cards have the same limit).
>
> - in transmission we CAN ONLY transmit about 700,000 pkt/s when the
> minimum sized packets are considered (64-byte-long ethernet minimum
> frame size). That is about HALF the maximum number of pkt/s considering
> a gigabit link.
>
> What is weird, is that if we artificially "preload" the NIC tx-fifo with
> packets, and then instruct it to start sending them, those are actually
> transmitted AT WIRE SPEED!!
>
> These results have been obtained considering different software
> generators (namely, UDPGEN, PACKETGEN, Application level generators)
> under LINUX (2.4.x, 2.6.x), and under CLICK (using a modified version of
> UDPGEN).
>
> The hardware setup considers
> - a 2.8GHz Xeon hardware
> - PCI-X bus (133MHz/64bit)
> - 1G of Ram
> - Intel PRO 1000 MT single, double, and quad cards, integrated or on a
> PCI slot.
>
> Different driver versions have been used, and while there are (small)
> differences when receiving packets, ALL of them present the same
> transmission limits.
>
> Moreover, the same happens with other vendors' cards (broadcom
> based chipsets).
>
> Is there any limit on the PCI-X (or PCI) that can be the bottleneck?
> Or a limit on the number of packets per second that can be stored in the
> NIC tx-fifo?
> Could the length of the tx-fifo have an impact on this?
>
> Any hints will be really appreciated.
> Thanks in advance
cheers,
Pádraig.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-26 14:05 ` [E1000-devel] Transmission limit P
@ 2004-11-26 15:31 ` Marco Mellia
2004-11-26 19:56 ` jamal
` (3 more replies)
2004-11-26 15:40 ` Robert Olsson
2004-11-27 20:00 ` Lennert Buytenhek
2 siblings, 4 replies; 85+ messages in thread
From: Marco Mellia @ 2004-11-26 15:31 UTC (permalink / raw)
To: P; +Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
If you don't trust us, please, ignore this email.
Sorry.
Those are the numbers we have. And they are actually very similar to what
other colleagues of ours got.
The point is:
while a PCI-X linux (or click) box can receive (receive just up to
the netif_receive_skb() level and then discard the skb) at more than
wire speed using off-the-shelf gigabit ethernet hardware, there is no
way to transmit more than about half that speed. This is true
considering minimum sized ethernet frames.
This holds true with
- linux 2.4.x and 2.6.x and click-linux 2.4.x
- intel e1000 or broadcom drivers (modified to drop packets after the
netif_receive_skb())
- whichever driver version you like (with minor modifications).
The only modification we made to the driver consists of carefully
prefetching the data into the CPU's internal cache.
Some details and results can be retrieved from
http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
Part of these results is presented in this paper:
A. Bianco, J.M. Finochietto, G. Galante, M. Mellia, F. Neri
Open-Source PC-Based Software Routers: a Viable Approach to High-Performance Packet Switching
Third International Workshop on QoS in Multiservice IP Networks
Catania, Feb 2005
http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf
Hope this helps.
> I'm forwarding this to netdev, as these are very interesting
> results (even if I don't believe them).
>
> If you point us at the code/versions we will be better able to answer.
>
> Marco Mellia wrote:
> > We are trying to stress the e1000 hardware/driver under linux and Click
> > to see what is the maximum number of packets per second that can be
> > received/transmitted by a single NIC.
> >
> > We found something which is counterintuitive:
> >
> > - in reception, we can receive ALL the traffic, regardless of the
> > packet size (or, if you prefer, we can receive ALL the minimum sized
> > packets at gigabit speed)
>
> I questioned whether you actually did receive at that rate to
> which you responded:
>
> > - using Click, we can receive 100% of (small) packets at gigabit
> > speed with TWO cards (2gigabit/s ~ 2.8Mpps)
> > - using linux and standard e1000 driver, we can receive up to about
> > 80% of traffic from a single nic (~1.1Mpps)
> > - using linux and a modified (simplified) version of the driver, we
> > can receive 100% on a single nic, but not 100% using two nics (up
> > to ~1.5Mpps).
> >
> > Reception means: receiving the packet up to the rx ring at the
> > kernel level, and then IMMEDIATELY drop it (no packet processing,
> > no forwarding, nothing more...)
> >
> > Using NAPI or IRQ has little impact (as we are not processing the
> > packets, the livelock due to the hardIRQ preemption versus the
> > softIRQ managers is not entered...)
> >
> > But the limit in TRANSMISSION seems to be 700Kpps. Regardless of
> > - the traffic generator,
> > - the driver version,
> > - the O.S. (linux/click),
> > - the hardware (broadcom cards have the same limit).
>
> >
> > - in transmission we CAN ONLY transmit about 700,000 pkt/s when the
> > minimum sized packets are considered (64-byte-long ethernet minimum
> > frame size). That is about HALF the maximum number of pkt/s considering
> > a gigabit link.
> >
> > What is weird, is that if we artificially "preload" the NIC tx-fifo with
> > packets, and then instruct it to start sending them, those are actually
> > transmitted AT WIRE SPEED!!
> >
> > These results have been obtained considering different software
> > generators (namely, UDPGEN, PACKETGEN, Application level generators)
> > under LINUX (2.4.x, 2.6.x), and under CLICK (using a modified version of
> > UDPGEN).
> >
> > The hardware setup considers
> > - a 2.8GHz Xeon hardware
> > - PCI-X bus (133MHz/64bit)
> > - 1G of Ram
> > - Intel PRO 1000 MT single, double, and quad cards, integrated or on a
> > PCI slot.
> >
> > Different driver versions have been used, and while there are (small)
> > differences when receiving packets, ALL of them present the same
> > transmission limits.
> >
> > Moreover, the same happens with other vendors' cards (broadcom
> > based chipsets).
> >
> > Is there any limit on the PCI-X (or PCI) that can be the bottleneck?
> > Or a limit on the number of packets per second that can be stored in the
> > NIC tx-fifo?
> > Could the length of the tx-fifo have an impact on this?
> >
> > Any hints will be really appreciated.
> > Thanks in advance
>
> cheers,
> Pádraig.
--
Ciao, /\/\/\rco
+-----------------------------------+
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608 |
| Tel: 39-011-564-4173 |
| Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . .
| Politecnico di Torino | \ / . ASCII Ribbon Campaign .
| Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy | / \ .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.
* Re: [E1000-devel] Transmission limit
2004-11-26 14:05 ` [E1000-devel] Transmission limit P
2004-11-26 15:31 ` Marco Mellia
@ 2004-11-26 15:40 ` Robert Olsson
2004-11-26 15:59 ` Marco Mellia
2004-11-27 20:00 ` Lennert Buytenhek
2 siblings, 1 reply; 85+ messages in thread
From: Robert Olsson @ 2004-11-26 15:40 UTC (permalink / raw)
To: P; +Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
P@draigBrady.com writes:
> I'm forwarding this to netdev, as these are very interesting
> results (even if I don't believe them).
> I questioned whether you actually did receive at that rate to
> which you responded:
>
> > - using Click, we can receive 100% of (small) packets at gigabit
> > speed with TWO cards (2gigabit/s ~ 2.8Mpps)
> > - using linux and standard e1000 driver, we can receive up to about
> > 80% of traffic from a single nic (~1.1Mpps)
> > - using linux and a modified (simplified) version of the driver, we
> > can receive 100% on a single nic, but not 100% using two nics (up
> > to ~1.5Mpps).
> >
> > Reception means: receiving the packet up to the rx ring at the
> > kernel level, and then IMMEDIATELY drop it (no packet processing,
> > no forwarding, nothing more...)
In more detail please... The RX ring must be refilled? And the HW DMAs
to the memory buffer? But I assume the data is not touched otherwise.
Touching the packet data gives a major impact. See eth_type_trans
in all profiles.
So what forwarding numbers are seen?
> > But the limit in TRANSMISSION seems to be 700Kpps. Regardless of
> > - the traffic generator,
> > - the driver version,
> > - the O.S. (linux/click),
> > - the hardware (broadcom cards have the same limit).
>
> >
> > - in transmission we CAN ONLY transmit about 700,000 pkt/s when the
> > minimum sized packets are considered (64-byte-long ethernet minimum
> > frame size). That is about HALF the maximum number of pkt/s considering
> > a gigabit link.
> >
> > What is weird, is that if we artificially "preload" the NIC tx-fifo with
> > packets, and then instruct it to start sending them, those are actually
> > transmitted AT WIRE SPEED!!
OK. Good to know about the e1000. Networking is mostly DMAs, and the CPU
is used for administrating them; this is the challenge.
> > These results have been obtained considering different software
> > generators (namely, UDPGEN, PACKETGEN, Application level generators)
> > under LINUX (2.4.x, 2.6.x), and under CLICK (using a modified version of
> > UDPGEN).
We get a hundred kpps more... Turn off all mitigation so interrupts are
undelayed and the TX ring can be filled as quickly as possible.
You could even try to fill TX as soon as the HW says there are available
buffers. This could even be done from the TX interrupt.
> > The hardware setup considers
> > - a 2.8GHz Xeon hardware
> > - PCI-X bus (133MHz/64bit)
> > - 1G of Ram
> > - Intel PRO 1000 MT single, double, and quad cards, integrated or on a
> > PCI slot.
> > Is there any limit on the PCI-X (or PCI) that can be the bottleneck?
> > Or Limit on the number of packets per second that can be stored in the
> > NIC tx-fifo?
> > May the lenght of the tx-fifo impact on this?
Small packet performance is dependent on low latency. Higher bus speed
gives shorter latency, but on higher speed buses there also tend to be
bridges that add latency.
For packet generation we still use 866 MHz PIII:s and 82543GC on a serverworks
64-bit board, which are faster than most other systems. So for testing routing
performance in pps we have to use several flows. This gives the advantage of
testing SMP/NUMA as well.
--ro
* Re: [E1000-devel] Transmission limit
2004-11-26 15:40 ` Robert Olsson
@ 2004-11-26 15:59 ` Marco Mellia
2004-11-26 16:57 ` P
2004-11-26 17:58 ` Robert Olsson
0 siblings, 2 replies; 85+ messages in thread
From: Marco Mellia @ 2004-11-26 15:59 UTC (permalink / raw)
To: Robert Olsson
Cc: P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
Robert,
It's a pleasure to hear from you.
> > I questioned whether you actually did receive at that rate to
> > which you responded:
> >
> > > - using Click, we can receive 100% of (small) packets at gigabit
> > > speed with TWO cards (2gigabit/s ~ 2.8Mpps)
> > > - using linux and standard e1000 driver, we can receive up to about
> > > 80% of traffic from a single nic (~1.1Mpps)
> > > - using linux and a modified (simplified) version of the driver, we
> > > can receive 100% on a single nic, but not 100% using two nics (up
> > > to ~1.5Mpps).
> > >
> > > Reception means: receiving the packet up to the rx ring at the
> > > kernel level, and then IMMEDIATELY drop it (no packet processing,
> > > no forwarding, nothing more...)
>
> In more detail please... The RX ring must be refilled? And the HW DMAs
> to the memory buffer? But I assume the data is not touched otherwise.
>
> Touching the packet data gives a major impact. See eth_type_trans
> in all profiles.
That's exactly what we removed from the driver code: touching the packet
limits the reception rate to about 1.1Mpps, while avoiding the
eth_type_trans check actually allows us to receive 100% of packets.
skbs are de/allocated using standard kernel memory management. Still,
without touching the packet, we can receive 100% of them.
> So what forwarding numbers are seen?
Forwarding is another issue. It seems to us that the bottleneck is in
the transmission of packets. Indeed, considering reception and
transmission _separately_:
- all packets can be received
- no more than ~700kpps can be transmitted
When IP forwarding is considered, once more we hit the transmission limit
(using NAPI and your buffer recycling patch, as mentioned in the paper
and in the slides... If no buffer recycling is adopted, performance drops
a bit).
So it seems to us that the major bottleneck is due to the transmission
limit.
Again, you can get numbers and more details from
http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf
> > > But the limit in TRANSMISSION seems to be 700Kpps. Regardless of
> > > - the traffic generator,
> > > - the driver version,
> > > - the O.S. (linux/click),
> > > - the hardware (broadcom cards have the same limit).
> >
> > >
> > > - in transmission we CAN ONLY transmit about 700,000 pkt/s when the
> > > minimum sized packets are considered (64-byte-long ethernet minimum
> > > frame size). That is about HALF the maximum number of pkt/s considering
> > > a gigabit link.
> > >
> > > What is weird, is that if we artificially "preload" the NIC tx-fifo with
> > > packets, and then instruct it to start sending them, those are actually
> > > transmitted AT WIRE SPEED!!
>
> OK. Good to know about the e1000. Networking is mostly DMAs, and the CPU
> is used for administrating them; this is the challenge.
That's true. There is still the chance that the limit is due to the
hardware CRC calculation (which must be added to the ethernet frame by the
NIC...). But we're quite confident that that is not the limit, since
in the reception path the same operation must be performed...
> > > These results have been obtained considering different software
> > > generators (namely, UDPGEN, PACKETGEN, Application level generators)
> > > under LINUX (2.4.x, 2.6.x), and under CLICK (using a modified version of
> > > UDPGEN).
>
> We get a hundred kpps more... Turn off all mitigation so interrupts are
> undelayed and the TX ring can be filled as quickly as possible.
>
> You could even try to fill TX as soon as the HW says there are available
> buffers. This could even be done from the TX interrupt.
Are you suggesting we modify packetgen to be more aggressive?
> > > The hardware setup considers
> > > - a 2.8GHz Xeon hardware
> > > - PCI-X bus (133MHz/64bit)
> > > - 1G of Ram
> > > - Intel PRO 1000 MT single, double, and quad cards, integrated or on a
> > > PCI slot.
> > > Is there any limit on the PCI-X (or PCI) that can be the bottleneck?
> > > Or a limit on the number of packets per second that can be stored in the
> > > NIC tx-fifo?
> > > Could the length of the tx-fifo have an impact on this?
>
> Small packet performance is dependent on low latency. Higher bus speed
> gives shorter latency, but on higher speed buses there also tend to be
> bridges that add latency.
That's true. We suspect that the limit is due to bus latency. But still,
we are surprised, since the bus allows us to receive 100% but to transmit
only up to ~50%. Moreover, the raw aggregate bandwidth of the bus is _far_
larger (133MHz*64bit ~ 8gbit/s).
> For packet generation we still use 866 MHz PIII:s and 82543GC on a serverworks
> 64-bit board, which are faster than most other systems. So for testing routing
> performance in pps we have to use several flows. This gives the advantage of
> testing SMP/NUMA as well.
We use a hardware generator (an Agilent router tester)... which can
saturate a gigabit link with no problem (and costs much more than a
PC...). So our forwarding tests are not limited...
--
Ciao, /\/\/\rco
+-----------------------------------+
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608 |
| Tel: 39-011-564-4173 |
| Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . .
| Politecnico di Torino | \ / . ASCII Ribbon Campaign .
| Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy | / \ .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.
* Re: [E1000-devel] Transmission limit
2004-11-26 15:59 ` Marco Mellia
@ 2004-11-26 16:57 ` P
2004-11-26 20:01 ` jamal
2004-11-26 17:58 ` Robert Olsson
1 sibling, 1 reply; 85+ messages in thread
From: P @ 2004-11-26 16:57 UTC (permalink / raw)
To: mellia
Cc: Robert Olsson, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
I forgot a smiley on my previous post
about not believing you. So here are 2: :-) :-)
Comments below:
Marco Mellia wrote:
> Robert,
> It's a pleasure to hear from you.
>
>> Touching the packet data gives a major impact. See eth_type_trans
>> in all profiles.
Notice the e1000 sets up the alignment for IP by default.
> skbs are de/allocated using standard kernel memory management. Still,
> without touching the packet, we can receive 100% of them.
I was doing some playing in this area this week.
I changed the alloc per packet to a "realloc" per packet.
I.E. the e1000 driver owns the packets. I noticed a
very nice speedup from this. In summary a userspace
app was able to receive 2x250Kpps without this patch,
and 2x490Kpps with it. The patch is here:
http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff
Note 99% of that patch is just upgrading from
e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect
the performance).
Wow, I just read your excellent paper and noticed
you used this approach also :-)
>> Small packet performance is dependent on low latency. Higher bus speed
>> gives shorter latency, but on higher speed buses there also tend to be
>> bridges that add latency.
>
> That's true. We suspect that the limit is due to bus latency. But still,
> we are surprised, since the bus allows us to receive 100% but to transmit
> only up to ~50%. Moreover, the raw aggregate bandwidth of the bus is _far_
> larger (133MHz*64bit ~ 8gbit/s).
Well, there definitely could be an asymmetry wrt bus latency.
Saying that though, in my tests with much the same hardware
as you, I could only get 800Kpps into the driver. I'll
check this again when I have time. Note also that, as I understand
it, the PCI control bus runs at a much lower rate
and is used to arbitrate the bus for each packet.
I.e. the 8Gb/s number above is not the bottleneck.
An lspci -vvv for your ethernet devices would be useful.
Also, to view the burst size: setpci -d 8086:1010 e6.b
(where 8086:1010 is the ethernet device's PCI id).
cheers,
Pádraig.
* Re: [E1000-devel] Transmission limit
2004-11-26 15:59 ` Marco Mellia
2004-11-26 16:57 ` P
@ 2004-11-26 17:58 ` Robert Olsson
1 sibling, 0 replies; 85+ messages in thread
From: Robert Olsson @ 2004-11-26 17:58 UTC (permalink / raw)
To: mellia
Cc: Robert Olsson, P, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
Marco Mellia writes:
> > Touching the packet data gives a major impact. See eth_type_trans
> > in all profiles.
>
> That's exactly what we removed from the driver code: touching the packet
> limits the reception rate to about 1.1Mpps, while avoiding the
> eth_type_trans check actually allows us to receive 100% of packets.
>
> skbs are de/allocated using standard kernel memory management. Still,
> without touching the packet, we can receive 100% of them.
Right. I recall I tried something similar, but as I only have pktgen
as a sender I could only verify this up to pktgen's TX speed, about 860 kpps
for the PIII box I mentioned. This was with UP and one NIC.
> When IP forwarding is considered, once more we hit the transmission limit
> (using NAPI and your buffer recycling patch, as mentioned in the paper
> and in the slides... If no buffer recycling is adopted, performance drops
> a bit).
> So it seems to us that the major bottleneck is due to the transmission
> limit.
>
> Again, you can get numbers and more details from
>
> http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
> http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf
Nice. Seems we're getting close to click w. NAPI and recycling. The skb
recycling is outdated as it adds too much complexity to the kernel. I've got
some ideas on how to make a much more lightweight variant... If you feel like
hacking I can outline the idea so you can try it.
> > OK. Good to know about the e1000. Networking is mostly DMAs, and the CPU
> > is used for administrating them; this is the challenge.
>
> That's true. There is still the chance that the limit is due to the
> hardware CRC calculation (which must be added to the ethernet frame by the
> NIC...). But we're quite confident that that is not the limit, since
> in the reception path the same operation must be performed...
OK!
> > You could even try to fill TX as soon as the HW says there are available
> > buffers. This could even be done from the TX interrupt.
>
> Are you suggesting we modify packetgen to be more aggressive?
Well, it could be useful, at least as an experiment. Our lab would be
happy...
> > Small packet performance is dependent on low latency. Higher bus speed
> > gives shorter latency, but on higher speed buses there also tend to be
> > bridges that add latency.
>
> That's true. We suspect that the limit is due to bus latency. But still,
> we are surprised, since the bus allows us to receive 100% but to transmit
> only up to ~50%. Moreover, the raw aggregate bandwidth of the bus is _far_
> larger (133MHz*64bit ~ 8gbit/s).
Have a look at the graph in the pktgen paper presented at Linux-Kongress in
Erlangen 2004. It seems that even at 8gbit/s the bus is limiting small
packet TX performance.
ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
--ro
* Re: [E1000-devel] Transmission limit
2004-11-26 15:31 ` Marco Mellia
@ 2004-11-26 19:56 ` jamal
2004-11-29 14:21 ` Marco Mellia
2004-11-26 20:06 ` jamal
` (2 subsequent siblings)
3 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-11-26 19:56 UTC (permalink / raw)
To: mellia; +Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> If you don't trust us, please, ignore this email.
> Sorry.
Don't take it the wrong way, please - nobody has been able to produce the
results you have. So that's why you may be getting that comment.
The fact that you have been able to do this is a good thing.
> That's the number we have. And are actually very similar from what other
> colleagues of us got.
>
> The point is:
> while a PCI-X linux (or click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) at more than
> wire speed using off-the-shelf gigabit ethernet hardware, there is no
> way to transmit more than about half that speed. This is true
> considering minimum sized ethernet frames.
>
Hrm. I could not get more than 8-900Kpps on receive-and-drop in the driver
on a super fast xeon. Can you post the diff for your driver?
My tests were with the e1000.
What kind of hardware is this? Do you have a block diagram of how the
NIC is connected in the system? A lot of issues depend on how your
hardware hookup is.
> This holds true with
> - linux 2.4.x and 2.6.x and click-linux 2.4.x
> - intel e1000 or broadcom drivers (modified to drop packets after the
> netif_receive_skb())
> - whichever driver version you like (with minor modifications).
>
> The only modification we made to the driver consists of carefully
> prefetching the data into the CPU's internal cache.
>
prefetching as in the use of prefetch()?
What were you prefetching if you end up dropping the packet?
> Some details and results can be retrieved from
>
> http://www.tlc-networks.polito.it/~mellia/euroTLC.pdf
>
> Part of these results is presented in this paper:
> A. Bianco, J.M. Finochietto, G. Galante, M. Mellia, F. Neri
> Open-Source PC-Based Software Routers: a Viable Approach to High-Performance Packet Switching
> Third International Workshop on QoS in Multiservice IP Networks
> Catania, Feb 2005
> http://www.tlc-networks.polito.it/mellia/papers/Euro_qos_ip.pdf
>
> Hope this helps.
>
Thanks, I will read these papers.
Take a look at the presentation I made at SUCON:
www.suug.ch/sucon/04/slides/pkt_cls.pdf
I have solved the problem identified in the first of the slides
(just before the "why me momma?" slide) - I could describe the solution and
even provide patches which may address (perhaps) some of the transmit
issues you are seeing.
cheers,
jamal
* Re: [E1000-devel] Transmission limit
2004-11-26 16:57 ` P
@ 2004-11-26 20:01 ` jamal
2004-11-29 10:19 ` P
2004-11-29 13:09 ` Robert Olsson
0 siblings, 2 replies; 85+ messages in thread
From: jamal @ 2004-11-26 20:01 UTC (permalink / raw)
To: P
Cc: mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Fri, 2004-11-26 at 11:57, P@draigBrady.com wrote:
> > skbs are de/allocated using standard kernel memory management. Still,
> > without touching the packet, we can receive 100% of them.
>
> I was doing some playing in this area this week.
> I changed the alloc per packet to a "realloc" per packet.
> I.E. the e1000 driver owns the packets. I noticed a
> very nice speedup from this. In summary a userspace
> app was able to receive 2x250Kpps without this patch,
> and 2x490Kpps with it. The patch is here:
> http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff
A very angry gorilla on that url ;->
> Note 99% of that patch is just upgrading from
> e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect
> the performance).
>
> Wow, I just read your excellent paper and noticed
> you used this approach also :-)
>
Have to read the paper - when Robert was last visiting here, we did some
tests, and packet recycling is not very valuable as far as SMP is
concerned (given that packets can be alloced on one CPU and freed on
another). There's a clear win on single-CPU machines.
> >> Small packet performance is dependent on low latency. Higher bus speed
> >> gives shorter latency, but on higher speed buses there also tend to be
> >> bridges that add latency.
> >
> > That's true. We suspect that the limit is due to bus latency. But still,
> > we are surprised, since the bus allows us to receive 100% but to transmit
> > only up to ~50%. Moreover, the raw aggregate bandwidth of the bus is _far_
> > larger (133MHz*64bit ~ 8gbit/s).
>
> Well, there definitely could be an asymmetry wrt bus latency.
> Saying that though, in my tests with much the same hardware
> as you, I could only get 800Kpps into the driver.
Yep, that's about the number I was seeing as well on both pieces of
hardware I used in the tests in my SUCON presentation.
> I'll
> check this again when I have time. Note also that, as I understand
> it, the PCI control bus runs at a much lower rate
> and is used to arbitrate the bus for each packet.
> I.e. the 8Gb/s number above is not the bottleneck.
>
> An lspci -vvv for your ethernet devices would be useful.
> Also, to view the burst size: setpci -d 8086:1010 e6.b
> (where 8086:1010 is the ethernet device's PCI id).
>
Can you talk a little about this PCI control bus? I have heard you
mention it before... I am trying to visualize where it fits in the PCI
system.
cheers,
jamal
* Re: [E1000-devel] Transmission limit
2004-11-26 15:31 ` Marco Mellia
2004-11-26 19:56 ` jamal
@ 2004-11-26 20:06 ` jamal
2004-11-26 20:56 ` Lennert Buytenhek
2004-11-27 9:25 ` Harald Welte
3 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2004-11-26 20:06 UTC (permalink / raw)
To: mellia; +Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> If you don't trust us, please, ignore this email.
BTW, you have to be telling the truth, especially since you
have S. Giordano on your team ;-> We just need to figure out what you
are saying. Off to read your paper.
cheers,
jamal
* Re: [E1000-devel] Transmission limit
2004-11-26 15:31 ` Marco Mellia
2004-11-26 19:56 ` jamal
2004-11-26 20:06 ` jamal
@ 2004-11-26 20:56 ` Lennert Buytenhek
2004-11-26 21:02 ` Lennert Buytenhek
2004-11-27 9:25 ` Harald Welte
3 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-26 20:56 UTC (permalink / raw)
To: Marco Mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
> The point is:
> while a PCI-X linux (or click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) at more than
> wire speed using off-the-shelf gigabit ethernet hardware, there is no
> way to transmit more than about half that speed. This is true
> considering minimum sized ethernet frames.
That's more-or-less what I'm seeing.
Theoretically, the maximum #pps you can send on gigabit is p=125000000/(s+24),
where s is the packet size and the constant 24 consists of the 8B preamble,
4B FCS and 12B inter-frame gap.
On an e1000 in a 32b 66MHz PCI slot (Intel server mainboard, e1000 'desktop'
NIC) I'm seeing that exact curve for packet sizes > ~350 bytes, but for
smaller packets than that, the curve goes like p=264000000/(s+335) (which
is accurate to +/- 100pps.) The 2.64e8 component is exactly the theoretical
max. bandwidth of the PCI slot the card is in, the 335 a random constant
that accounts for latency. On a different mobo I get a curve following
the same formula but different value for 335.
The same card in a 32b 33MHz PCI slot in a cheap Asus desktop board gives
something a bit stranger:
- p=132000000/(s+260) for s<128
- p=132000000/(s+390) for 128<=s<256
- p=132000000/(s+520) for 256<=s<384
- ...
Again, the 132000000 corresponds with the theoretical max. bandwidth of
the 32/33 bus. I'm not all that sure yet why things show this behavior.
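The two regimes can be compared numerically; a minimal sketch using the constants above (the function names are mine):

```python
def wire_pps(s):
    """Theoretical gigabit wire rate: 125e6 B/s over s bytes of frame
    plus 24B of overhead (8B preamble + 4B FCS + 12B inter-frame gap)."""
    return 125_000_000 / (s + 24)

def pci_pps(s, bus_bw=264_000_000, latency=335):
    """Observed PCI-limited curve: bus bandwidth over frame size plus a
    per-packet latency constant (335 byte-equivalents on this slot)."""
    return bus_bw / (s + latency)

# For minimum-size frames the wire rate is ~1.49 Mpps, but the PCI
# curve caps the card far lower -- near the ~700 kpps TX limit
# reported at the start of this thread.
print(int(wire_pps(60)))   # ~1.49M
print(int(pci_pps(60)))    # ~668k
```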
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-26 20:56 ` Lennert Buytenhek
@ 2004-11-26 21:02 ` Lennert Buytenhek
0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-26 21:02 UTC (permalink / raw)
To: Marco Mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Fri, Nov 26, 2004 at 09:56:59PM +0100, Lennert Buytenhek wrote:
> On an e1000 in a 32b 66MHz PCI slot (Intel server mainboard, e1000 'desktop'
> NIC) I'm seeing that exact curve for packet sizes > ~350 bytes, but for
> smaller packets than that, the curve goes like p=264000000/(s+335) (which
> is accurate to +/- 100pps.) The 2.64e8 component is exactly the theoretical
> max. bandwidth of the PCI slot the card is in, the 335 a random constant
> that accounts for latency. On a different mobo I get a curve following
> the same formula but different value for 335.
>
> The same card in a 32b 33MHz PCI slot in a cheap Asus desktop board gives
> something a bit stranger:
> - p=132000000/(s+260) for s<128
> - p=132000000/(s+390) for 128<=s<256
> - p=132000000/(s+520) for 256<=s<384
> - ...
This could be explained by observing that on the Intel mobo, the NIC sits
on a dedicated PCI bus, while on the cheap Asus board, all PCI slots plus
all onboard devices share the same PCI bus. Probably after pulling in a
single burst of packet data (32 clocks here, sounds about right), the NIC has
to relinquish the bus to other bus masters and wait for 128 byte times
until it gets to pull packet data from RAM again.
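The staircase above fits a simple model in which each additional 128-byte burst the NIC must pull costs one more 130-byte-equivalent stall (the constants come from the measured curves quoted above; the model itself is only a guess):

```python
def pci33_pps(s):
    """Piecewise curve from the 32b/33MHz measurements: 132 MB/s bus,
    with one extra 130-'byte-time' stall for every 128-byte burst the
    NIC has to pull (so 260 for s < 128, 390 for 128 <= s < 256, ...)."""
    bursts = s // 128 + 1              # number of 128-byte bursts needed
    return 132_000_000 / (s + 130 * (bursts + 1))
```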
It would be interesting to find out where the latency is coming from. Find
a way to reduce or work around it, and the 64b packet case will benefit as
well.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-26 15:31 ` Marco Mellia
` (2 preceding siblings ...)
2004-11-26 20:56 ` Lennert Buytenhek
@ 2004-11-27 9:25 ` Harald Welte
[not found] ` <20041127111101.GC23139@xi.wantstofly.org>
` (2 more replies)
3 siblings, 3 replies; 85+ messages in thread
From: Harald Welte @ 2004-11-27 9:25 UTC (permalink / raw)
To: Marco Mellia
Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
[-- Attachment #1: Type: text/plain, Size: 1823 bytes --]
On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
> If you don't trust us, please, ignore this email.
> Sorry.
>
> That's the number we have. And are actually very similar from what other
> colleagues of us got.
>
> The point is:
> while a PCI-X linux or (or click) box can receive (receive just up to
> the netif_receive_skb() level and then discard the skb) up to more than
> wire speed using off-the-shelf gigabit ethernet hardware, there is no
> way to transmit more than about half that speed. This is true
> considering minimum sized ethernet frames.
Yes, I've seen this, too.
I even rewrote the linux e1000 driver in order to re-fill the tx queue
from hardirq handler, and it didn't help. 760kpps is the most I could
ever get (133MHz 64bit PCI-X on a Sun Fire v20z, Dual Opteron 1.8GHz)
I've posted this result to netdev at some earlier point, I also Cc'ed
intel but never got a reply
(http://oss.sgi.com/archives/netdev/2004-09/msg00540.html)
My guess is that Intel always knew this and they want to sell their CSA
chips rather than improving the PCI e1000.
We are hitting a hard limit here, either PCI-X wise or e1000 wise. You
cannot refill the tx queue faster than from hardirq, and still you don't
get any better numbers.
It was suggested that the problem is PCI DMA arbitration latency, since
the hardware needs to arbitrate the bus for every packet.
Interestingly, if you use a four-port e1000, the numbers get even worse
(580kpps) because the additional pcix bridge on the card introduces
further latency.
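A rough cycle budget supports the arbitration-latency explanation (back-of-envelope arithmetic of mine, not from the original mail):

```python
# At 760 kpps, each packet gets ~1.32 us of bus time.  On a 64-bit,
# 133 MHz PCI-X bus that is 175 clock cycles, while the data phase for
# a 64-byte frame needs only 8 cycles (8 bytes per cycle).  The other
# ~95% of cycles would be arbitration/latency overhead per packet.
pkt_time_us = 1e6 / 760_000
cycles_per_pkt = pkt_time_us * 133     # 133 cycles per microsecond
data_cycles = 64 / 8                   # 64B frame, 8B per clock
print(round(cycles_per_pkt), data_cycles)
```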
--
- Harald Welte <laforge@gnumonks.org> http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
[not found] ` <20041127111101.GC23139@xi.wantstofly.org>
@ 2004-11-27 11:31 ` Harald Welte
0 siblings, 0 replies; 85+ messages in thread
From: Harald Welte @ 2004-11-27 11:31 UTC (permalink / raw)
To: Lennert Buytenhek; +Cc: Linux Netdev List
[-- Attachment #1: Type: text/plain, Size: 770 bytes --]
On Sat, Nov 27, 2004 at 12:11:01PM +0100, Lennert Buytenhek wrote:
> On Sat, Nov 27, 2004 at 10:25:03AM +0100, Harald Welte wrote:
>
> > I even rewrote the linux e1000 driver [...]
>
> This is very interesting. You have chipset docs then?
Once again, please excuse my bad english. I seem to have translated
'umgeschrieben' into 'rewrote' which is absolutely not applicable here.
Please do s/rewrote/modified/, i.e. I modified/altered/changed the driver
And no, I don't have any docs.
> cheers,
> Lennert
--
- Harald Welte <laforge@gnumonks.org> http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-26 14:05 ` [E1000-devel] Transmission limit P
2004-11-26 15:31 ` Marco Mellia
2004-11-26 15:40 ` Robert Olsson
@ 2004-11-27 20:00 ` Lennert Buytenhek
2004-11-29 12:44 ` Marco Mellia
2 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-27 20:00 UTC (permalink / raw)
To: mellia; +Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Fri, Nov 26, 2004 at 02:05:26PM +0000, P@draigBrady.com wrote:
> >What is weird, is that if we artificially "preload" the NIC tx-fifo with
> >packets, and then instruct it to start sending them, those are actually
> >transmitted AT WIRE SPEED!!
I'm very interested in exactly what it is you're doing here. What
do you mean by 'preload'?
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-27 9:25 ` Harald Welte
[not found] ` <20041127111101.GC23139@xi.wantstofly.org>
@ 2004-11-27 20:12 ` Cesar Marcondes
2004-11-29 8:53 ` Marco Mellia
2 siblings, 0 replies; 85+ messages in thread
From: Cesar Marcondes @ 2004-11-27 20:12 UTC (permalink / raw)
To: Harald Welte
Cc: Marco Mellia, P, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
STOP !!!!
On Sat, 27 Nov 2004, Harald Welte wrote:
> On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
> > If you don't trust us, please, ignore this email.
> > Sorry.
> >
> > That's the number we have. And are actually very similar from what other
> > colleagues of us got.
> >
> > The point is:
> > while a PCI-X linux or (or click) box can receive (receive just up to
> > the netif_receive_skb() level and then discard the skb) up to more than
> > wire speed using off-the-shelf gigabit ethernet hardware, there is no
> > way to transmit more than about half that speed. This is true
> > considering minimum sized ethernet frames.
>
> Yes, I've seen this, too.
>
> I even rewrote the linux e1000 driver in order to re-fill the tx queue
> from hardirq handler, and it didn't help. 760kpps is the most I could
> ever get (133MHz 64bit PCI-X on a Sun Fire v20z, Dual Opteron 1.8GHz)
>
> I've posted this result to netdev at some earlier point, I also Cc'ed
> intel but never got a reply
> (http://oss.sgi.com/archives/netdev/2004-09/msg00540.html)
>
> My guess is that Intel always knew this and they want to sell their CSA
> chips rather than improving the PCI e1000.
>
> We are hitting a hard limit here, either PCI-X wise or e1000 wise. You
> cannot refill the tx queue faster than from hardirq, and still you don't
> get any better numbers.
>
> It was suggested that the problem is PCI DMA arbitration latency, since
> the hardware needs to arbitrate the bus for every packet.
>
> Interestingly, if you use a four-port e1000, the numbers get even worse
> (580kpps) because the additional pcix bridge on the card introduces
> further latency.
>
> --
> - Harald Welte <laforge@gnumonks.org> http://www.gnumonks.org/
> ============================================================================
> Programming is like sex: One mistake and you have to support it your lifetime
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-27 9:25 ` Harald Welte
[not found] ` <20041127111101.GC23139@xi.wantstofly.org>
2004-11-27 20:12 ` Cesar Marcondes
@ 2004-11-29 8:53 ` Marco Mellia
2004-11-29 14:50 ` Lennert Buytenhek
2 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-29 8:53 UTC (permalink / raw)
To: Harald Welte
Cc: Marco Mellia, P, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Sat, 2004-11-27 at 10:25, Harald Welte wrote:
> On Fri, Nov 26, 2004 at 04:31:21PM +0100, Marco Mellia wrote:
> > If you don't trust us, please, ignore this email.
> > Sorry.
> >
> > That's the number we have. And are actually very similar from what other
> > colleagues of us got.
> >
> > The point is:
> > while a PCI-X linux or (or click) box can receive (receive just up to
> > the netif_receive_skb() level and then discard the skb) up to more than
> > wire speed using off-the-shelf gigabit ethernet hardware, there is no
> > way to transmit more than about half that speed. This is true
> > considering minimum sized ethernet frames.
>
> Yes, I've seen this, too.
>
> I even rewrote the linux e1000 driver in order to re-fill the tx queue
> from hardirq handler, and it didn't help. 760kpps is the most I could
> ever get (133MHz 64bit PCI-X on a Sun Fire v20z, Dual Opteron 1.8GHz)
>
> I've posted this result to netdev at some earlier point, I also Cc'ed
> intel but never got a reply
> (http://oss.sgi.com/archives/netdev/2004-09/msg00540.html)
>
> My guess is that Intel always knew this and they want to sell their CSA
> chips rather than improving the PCI e1000.
>
> We are hitting a hard limit here, either PCI-X wise or e1000 wise. You
> cannot refill the tx queue faster than from hardirq, and still you don't
> get any better numbers.
>
> It was suggested that the problem is PCI DMA arbitration latency, since
> the hardware needs to arbitrate the bus for every packet.
That's our intuition too.
Notice that we get the same results with 3com (broadcom based) gigabit
cards.
We are thinking of sending packets in "bursts" instead of single
transfers. The only problem is letting the NIC know that there is more
than one packet in a burst...
--
Ciao, /\/\/\rco
+-----------------------------------+
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608 |
| Tel: 39-011-564-4173 |
| Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . .
| Politecnico di Torino | \ / . ASCII Ribbon Campaign .
| Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy | / \ .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-26 20:01 ` jamal
@ 2004-11-29 10:19 ` P
2004-11-29 13:09 ` Robert Olsson
1 sibling, 0 replies; 85+ messages in thread
From: P @ 2004-11-29 10:19 UTC (permalink / raw)
To: hadi
Cc: mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
jamal wrote:
> On Fri, 2004-11-26 at 11:57, P@draigBrady.com wrote:
>
>
>>>skb are de/allocated using standard kernel memory management. Still,
>>>without touching the packet, we can receive 100% of them.
>>
>>I was doing some playing in this area this week.
>>I changed the alloc per packet to a "realloc" per packet.
>>I.E. the e1000 driver owns the packets. I noticed a
>>very nice speedup from this. In summary a userspace
>>app was able to receive 2x250Kpps without this patch,
>>and 2x490Kpps with it. The patch is here:
>>http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff
>
>
> A very angry gorilla on that url ;->
feck. Add a .gz
http://www.pixelbeat.org/tmp/linux-2.4.20-pb.diff.gz
>>Note 99% of that patch is just upgrading from
>>e1000 V4.4.12-k1 to V5.2.52 (which doesn't affect
>>the performance).
>>
>>Wow I just read your excellent paper, and noticed
>>you used this approach also :-)
>>
>
>
> Have to read the paper - When Robert was last visiting here; we did some
> tests and packet recycling is not very valuable as far as SMP is
> concerned (given that packets can be alloced on one CPU and freed on
> another). There's a clear win on single-CPU machines.
Well for my app, I am just monitoring, so I use
IRQ and process affinity. You could split the
skb heads across CPUs also I guess.
>>>>Small packet performance is dependent on low latency. Higher bus speed
>>>>gives shorter latency but also on higher speed buses there use to be
>>>>bridges that adds latency.
>>>
>>>That's true. We suspect that the limit is due to bus latency. But still,
>>>we are surprised, since the bus allows to receive 100%, but to transmit
>>>up to ~50%. Moreover the raw aggregate bandwidth of the bus is _far_
>>>larger (133MHz*64bit ~ 8gbit/s)
>>
>>Well there definitely could be an asymmetry wrt bus latency.
>>Saying that though, in my tests with much the same hardware
>>as you, I could only get 800Kpps into the driver.
>
>
> Yep, thats about the number i was seeing as well in both pieces of
> hardware i used in the tests in my SUCON presentation.
>
>
>> I'll
>>check this again when I have time. Note also that as I understand
>>it the PCI control bus is running at a much lower rate,
>>and that is used to arbitrate the bus for each packet.
>>I.E. the 8Gb/s number above is not the bottleneck.
>>
>>An lspci -vvv for your ethernet devices would be useful
>>Also to view the burst size: setpci -d 8086:1010 e6.b
>>(where 8086:1010 is the ethernet device PCI id).
>>
>
> Can you talk a little about this PCI control bus? I have heard you
> mention it before ... I am trying to visualize where it fits in PCI
> system.
Basically the bus is arbitrated per packet. See section 3.5 in:
http://www.intel.com/design/network/applnots/ap453.pdf
This also has lots of nice PCI info:
http://www.hep.man.ac.uk/u/rich/PFLDnet2004/Rich_PFLDNet_10GE_v7.ppt
--
Pádraig Brady - http://www.pixelbeat.org
--
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-27 20:00 ` Lennert Buytenhek
@ 2004-11-29 12:44 ` Marco Mellia
2004-11-29 15:19 ` Lennert Buytenhek
0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-29 12:44 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
> > >What is weird, is that if we artificially "preload" the NIC tx-fifo with
> > >packets, and then instruct it to start sending them, those are actually
> > >transmitted AT WIRE SPEED!!
>
> I'm very interested in exactly what it is you're doing here. What
> do you mean by 'preload'?
Here is a brief description of the trick we used.
The modified driver code can be grabbed from
http://www.tlc-networks.polito.it/~mellia/e1000_modified.tar.gz
So: by "preloaded" we mean that we previously put the packets to be
transmitted in the TX FIFO of the NIC without actually updating the
register which counts the number of packets in the FIFO queue.
To do this a student modified the network
driver adding an entry in /proc/net/e1000/eth#.
If you read from there you will get the values of the internal registers
of the NIC regarding the internal fifo.
Writing a number to it, you can set the TDFPC register which contains
the number of pkts in the TX queue of the internal FIFO.
To get the above result you have to:
Compile this version of the driver (I don't remember which version it
was based on).
Load it.
After that you can take a look at the internal registers with:
cat /proc/net/e1000/eth# (# replace it with the correct number)
Then we start placing something in the TX FIFO.
To do this I simply used:
ping -c 10 x.x.x.x
This has placed and also transmitted 10 ping pkts. But they aren't
deleted from the internal FIFO; only the pointers have been updated.
Take a look at the registers again with:
cat /proc/net/e1000/eth#
Now use:
echo 10 > /proc/net/e1000/eth#
Naturally 10 is the number we used above.
This "resets" the registers and writes in the TDFPC that there are 10
pkts in the TX queue.
Now when we do:
ping -c 1 x.x.x.x
You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the
new one).
If you try to measure the TX speed you will see that it is ~ the wire
speed.
Note:
- note that if you don't have static ARP tables there will also be some
ARP pkts (should be two more pkts)
- probably if you write too many pkts it won't work because the FIFO is
organized like a circular buffer and you will begin to overwrite the
first pkts.
- the normal ping pkts aren't minimum size, but you can reduce them with
the -s option
- the code modifications were written with "quick and dirty" in mind;
certainly it is possible to write them better
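Condensing the steps above into one place (a sketch only: the /proc entry exists only with the modified driver, and the interface name and packet count are illustrative):

```python
# Sketch of the preload procedure described above, as the shell steps
# it performs.  /proc/net/e1000/ethN is the modified-driver entry;
# 'eth0', npkts=10 and the target address are example values.
def preload_steps(iface="eth0", npkts=10, target="x.x.x.x"):
    proc = f"/proc/net/e1000/{iface}"
    return [
        f"cat {proc}",                  # inspect the internal FIFO registers
        f"ping -c {npkts} {target}",    # place npkts in the TX FIFO (and send)
        f"echo {npkts} > {proc}",       # rewind TDFPC: npkts marked pending
        f"ping -c 1 {target}",          # kick: the NIC sends npkts + 1 frames
    ]

for step in preload_steps():
    print(step)
```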
--
Ciao, /\/\/\rco
+-----------------------------------+
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608 |
| Tel: 39-011-564-4173 |
| Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . .
| Politecnico di Torino | \ / . ASCII Ribbon Campaign .
| Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy | / \ .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-26 20:01 ` jamal
2004-11-29 10:19 ` P
@ 2004-11-29 13:09 ` Robert Olsson
2004-11-29 20:16 ` David S. Miller
2004-11-30 13:31 ` jamal
1 sibling, 2 replies; 85+ messages in thread
From: Robert Olsson @ 2004-11-29 13:09 UTC (permalink / raw)
To: hadi
Cc: P, mellia, Robert Olsson, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
jamal writes:
> Have to read the paper - When Robert was last visiting here; we did some
> tests and packet recycling is not very valuable as far as SMP is
> concerned (given that packets can be alloced on one CPU and freed on
> another). There's a clear win on single-CPU machines.
Correct, yes, at your lab about 2 1/2 years ago. I see those experiments in a
different light today, as we never got any packet budget contribution
from SMP with a shared-memory arch whatsoever. We spent a week w. Alexey in
the lab to understand what's going on. Two flows with total affinity (one for
each CPU); we even removed all locks and part of the IP stack. We were still
confused...
Then Opteron/NUMA gave a good contribution in those setups. We started
thinking it must be latency and memory controllers that make the difference,
as each CPU has its own memory and memory controller in the Opteron case.
So from that aspect we were expecting the impossible from the recycling
patch; maybe it will do better on boxes w. local memory.
But I think we should give up skb recycling in its current form. If we extend
it to deal with cache bouncing etc., we end up having something like slab in
every driver. slab has improved and is not so dominant in profiles now.
Also, from what I understand, new HW and MSI can help in the case where we
pass objects between CPUs. Did I dream it, or did someone tell me that S2IO
could have several TX rings that could be routed via MSI to the proper CPU?
slab packet-objects have been discussed. They would make some contribution,
but is the complexity worth it?
Also I think it would be possible to do a more lightweight variant of skb
recycling in case we need to recycle PCI mappings etc.
--ro
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-26 19:56 ` jamal
@ 2004-11-29 14:21 ` Marco Mellia
2004-11-30 13:46 ` jamal
0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-29 14:21 UTC (permalink / raw)
To: hadi
Cc: mellia, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Fri, 2004-11-26 at 20:56, jamal wrote:
> On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> > If you don't trust us, please, ignore this email.
> > Sorry.
>
> Dont take it the wrong way please - nobody has been able to produce the
> results you have. So thats why you may be getting that comment.
> The fact you have been able to do this is a good thing.
No problem from this side. I also forgot a couple of 8-! I guess...
[...]
> prefetching as in the use of prefetch()?
> What were you prefetching if you end up dropping packet?
>
Sorry I used the wrong terms there.
What we discovered is that the CPU caching mechanisms have a HUGE impact,
and that you have very little control over them. Prefetching may help, but
it is difficult to predict its impact...
Indeed, if you access the packet struct, the CPU has to fetch data
from main memory, where the packet was stored after being transferred
via DMA from the NIC. The memory-access penalty is huge, and you have
little control over it.
In our experiments, we modified the kernel to drop packets just after
receiving them. skbs are just deallocated (using standard kernel
routines, i.e., no recycling is used). Logically, that happens when
netif_rx() is called.
Now, we have three cases
1) just modify netif_rx() to drop packets.
2) as in 1, plus remove the protocol check in the driver
(i.e., comment out the line
skb->protocol = eth_type_trans(skb, netdev);
) to avoid accessing the real packet data.
3) as in 2, but the dealloc is performed at the driver level, instead of
calling netif_rx()
In the first case, we can receive about 1.1Mpps (~80% of packets)
In the second case, we can receive 100% of packets, as we removed the
penalty of looking at the packet headers to discover its protocol type.
In the third case, we can NOT receive 100% of packets!
The only difference is that we actually _REMOVED_ a function call. This
reduces the overhead, yet the compiler/cpu/whatever can no longer optimize
the data path that accesses the skb which must be freed.
Our guess is that freeing the skb in the netif_rx() function
actually allows the compiler/cpu to prefetch the skb itself, and
therefore keeps the pipeline working...
My guess is that if you change compiler, cpu, memory subsystem, you may
get very counterintuitive results...
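As a sanity check (my arithmetic, using the wire-rate formula given earlier in this thread), the 1.1 Mpps of case 1 comes out in the 74-77% range of the theoretical minimum-frame rate, depending on whether the 4B FCS is counted in the frame size, which is consistent with the ~80% figure:

```python
def wire_pps(s):
    # Gigabit wire rate: 125e6 B/s over frame size plus 24B overhead
    # (formula quoted earlier in the thread).
    return 125_000_000 / (s + 24)

case1 = 1.1e6                        # case 1: drop in netif_rx(), ~1.1 Mpps
print(case1 / wire_pps(60))          # FCS counted inside the +24 overhead
print(case1 / wire_pps(64))          # FCS counted in the frame size
```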
--
Ciao, /\/\/\rco
+-----------------------------------+
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608 |
| Tel: 39-011-564-4173 |
| Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . .
| Politecnico di Torino | \ / . ASCII Ribbon Campaign .
| Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy | / \ .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 8:53 ` Marco Mellia
@ 2004-11-29 14:50 ` Lennert Buytenhek
2004-11-30 8:42 ` Marco Mellia
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-29 14:50 UTC (permalink / raw)
To: Marco Mellia
Cc: Harald Welte, P, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Mon, Nov 29, 2004 at 09:53:33AM +0100, Marco Mellia wrote:
> That's our intuition too.
> Notice that we get the same results with 3com (broadcom based) gigabit
> cards.
> We are thinking of sending packets in "bursts" instead of single
> transfers. The only problem is letting the NIC know that there is more
> than one packet in a burst...
Jamal implemented exactly this for e1000 already, he might be persuaded
into posting his patch here. Jamal? :)
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 12:44 ` Marco Mellia
@ 2004-11-29 15:19 ` Lennert Buytenhek
2004-11-29 17:32 ` Marco Mellia
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-29 15:19 UTC (permalink / raw)
To: Marco Mellia
Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, Nov 29, 2004 at 01:44:27PM +0100, Marco Mellia wrote:
> This "resets" the registers and writes in the TDFPC that there are 10
> pkts in the TX queue.
> Now when we do:
>
> ping -c 1 x.x.x.x
>
> You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the
> new one).
> If you try to measure the TX speed you will see that it is ~ the wire
> speed.
How are you measuring this?
cheers,
Lennert
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 15:19 ` Lennert Buytenhek
@ 2004-11-29 17:32 ` Marco Mellia
2004-11-29 19:08 ` Lennert Buytenhek
0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-29 17:32 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Marco Mellia, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
Using the Agilent Router tester as receiver...
:-(
> On Mon, Nov 29, 2004 at 01:44:27PM +0100, Marco Mellia wrote:
>
> > This "resets" the registers and writes in the TDFPC that there are 10
> > pkts in the TX queue.
> > Now when we do:
> >
> > ping -c 1 x.x.x.x
> >
> > You will see that the NIC will transmit 11 pkts (10 we "preloaded" + the
> > new one).
> > If you try to measure the TX speed you will see that it is ~ the wire
> > speed.
>
> How are you measuring this?
>
>
> cheers,
> Lennert
--
Ciao, /\/\/\rco
+-----------------------------------+
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608 |
| Tel: 39-011-564-4173 |
| Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . .
| Politecnico di Torino | \ / . ASCII Ribbon Campaign .
| Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy | / \ .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 17:32 ` Marco Mellia
@ 2004-11-29 19:08 ` Lennert Buytenhek
2004-11-29 19:09 ` Lennert Buytenhek
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-29 19:08 UTC (permalink / raw)
To: Marco Mellia
Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, Nov 29, 2004 at 06:32:13PM +0100, Marco Mellia wrote:
> Using the Agilent Router tester as receiver...
> :-(
OK, so you're measuring the inter-packet gap and in that burst of 11
(or whatever many) packets it's 96 bit times between every packet, yes?
Interesting. Can you also try 'pre-loading' the TX ring with a bunch of
packets and then writing the values 0, 1, 2, 3 ... n-1 to the TXD register
with back-to-back MMIO writes (instead of doing a single write of the
value 'n'), and check what inter-packet gap you get then?
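For scale (arithmetic mine): the 96-bit-times gap is the standard Ethernet minimum inter-frame gap, and a wire-speed burst of this size is over in a few microseconds:

```python
# 96 bit times = 12 bytes, i.e. 96 ns at 1 Gb/s (one bit time = 1 ns).
# A back-to-back burst of 11 minimum-size frames at wire speed lasts:
frames = 11
bytes_on_wire = 60 + 24                  # frame + preamble/FCS/IFG overhead
burst_ns = frames * bytes_on_wire * 8    # 8 bit times per byte
print(burst_ns)                          # 7392 ns, about 7.4 us
```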
cheers,
Lennert
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 19:08 ` Lennert Buytenhek
@ 2004-11-29 19:09 ` Lennert Buytenhek
0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-29 19:09 UTC (permalink / raw)
To: Marco Mellia
Cc: e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, Nov 29, 2004 at 08:08:08PM +0100, Lennert Buytenhek wrote:
> packets and then writing the values 0, 1, 2, 3 ... n-1 to the TXD register
^^^
That should be TDT.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 13:09 ` Robert Olsson
@ 2004-11-29 20:16 ` David S. Miller
2004-12-01 16:47 ` Robert Olsson
2004-11-30 13:31 ` jamal
1 sibling, 1 reply; 85+ messages in thread
From: David S. Miller @ 2004-11-29 20:16 UTC (permalink / raw)
To: Robert Olsson
Cc: hadi, P, mellia, Robert.Olsson, e1000-devel, jorge.finochietto,
galante, netdev
On Mon, 29 Nov 2004 14:09:08 +0100
Robert Olsson <Robert.Olsson@data.slu.se> wrote:
> Did I dream it, or did someone tell me that S2IO
> could have several TX rings that could be routed via MSI to the proper CPU?
One of Sun's gigabit chips can do this too, except it isn't
via MSI, the driver has to read the descriptor to figure out
which cpu gets the software interrupt to process the packet.
SGI had hardware which allowed you to do this kind of stuff too.
Obviously the MSI version works much better.
It is important, the cpu selection process. First of all, it must
be calculated such that flows always go through the same cpu.
Otherwise TCP sockets bounce between the cpus for a streaming
transfer.
And even this doesn't avoid all such problems, TCP LISTEN state
sockets will still thrash between the cpus with such a "pick
a cpu based upon" flow scheme.
Anyways, just some thoughts.
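The flow-to-CPU constraint above can be sketched as a hash over the flow 5-tuple (a minimal illustration of the idea; the names and hash choice are mine, not how any of the mentioned hardware actually does it):

```python
# Flow-consistent CPU selection: hash the 5-tuple so every packet of a
# given flow always lands on the same CPU within a run, and a streaming
# TCP transfer never bounces between CPUs.
def cpu_for_flow(saddr, daddr, sport, dport, proto, ncpus=4):
    return hash((saddr, daddr, sport, dport, proto)) % ncpus

print(cpu_for_flow("10.0.0.1", "10.0.0.2", 1234, 80, 6))
```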
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 14:50 ` Lennert Buytenhek
@ 2004-11-30 8:42 ` Marco Mellia
2004-12-01 12:25 ` jamal
0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-11-30 8:42 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Marco Mellia, Harald Welte, P, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, 2004-11-29 at 15:50, Lennert Buytenhek wrote:
> On Mon, Nov 29, 2004 at 09:53:33AM +0100, Marco Mellia wrote:
>
> > That's our intuition too.
> > Notice that we get the same results with 3com (broadcom based) gigabit
> > cards.
> > We are thinking of sending packets in "bursts" instead of single
> > transfers. The only problem is letting the NIC know that there is more
> > than one packet in a burst...
>
> Jamal implemented exactly this for e1000 already, he might be persuaded
> into posting his patch here. Jamal? :)
I guess that saying that we are _very_ interested in this might help.
:-)
We can offer ourselves as "beta-testers" as well...
--
Ciao, /\/\/\rco
+-----------------------------------+
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608 |
| Tel: 39-011-564-4173 |
| Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . .
| Politecnico di Torino | \ / . ASCII Ribbon Campaign .
| Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy | / \ .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 13:09 ` Robert Olsson
2004-11-29 20:16 ` David S. Miller
@ 2004-11-30 13:31 ` jamal
2004-11-30 13:46 ` Lennert Buytenhek
1 sibling, 1 reply; 85+ messages in thread
From: jamal @ 2004-11-30 13:31 UTC (permalink / raw)
To: Robert Olsson
Cc: P, mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, 2004-11-29 at 08:09, Robert Olsson wrote:
> jamal writes:
>
> > Have to read the paper - When Robert was last visiting here; we did some
> > tests and packet recycling is not very valuable as far as SMP is
> > concerned (given that packets can be alloced on one CPU and freed on
> another). There's a clear win on single-CPU machines.
>
>
> Correct, yes, at your lab about 2 1/2 years ago.
How time flies when you are having fun ;->
> I see those experiments in a
> different light today, as we never got any packet budget contribution
> from SMP with a shared-memory arch whatsoever. We spent a week w. Alexey
> in the lab to understand what's going on. Two flows with total affinity
> (one for each CPU); we even removed all locks and part of the IP stack.
> We were still confused...
>
> Then Opteron/NUMA gave a good contribution in those setups. We started
> thinking it must be latency and memory controllers that make the
> difference, as each CPU has its own memory and memory controller in the
> Opteron case.
>
> So from that aspect we were expecting the impossible from the recycling
> patch; maybe it will do better on boxes w. local memory.
>
Interesting thought. Not using a lot of my brain cells to compute, I
would say that it would get worse. But I suppose the real reason this
gets nasty on x86-style SMP is because cache misses are more expensive
there, maybe?
> But I think we should give up skb recycling in its current form. If we extend
> it to deal with cache bouncing etc., we end up having something like slab in
> every driver. slab has improved and is not so dominant in profiles now.
>
nod.
> Also from what I understand, new HW and MSI can help in the case where we
> pass objects between CPUs. Did I dream or did someone tell me that S2IO
> could have several TX rings that could be routed via MSI to the proper CPU?
I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
like more hell to maintain.
> slab packet-objects have been discussed. It would make some contribution,
> but is the complexity worth it?
May not be worth it.
>
> Also I think it could be possible to do a more lightweight variant of skb
> recycling in case we need to recycle PCI mappings etc.
>
I think it's valuable to have it for people with UP; it's not worth the
complexity for SMP IMO.
cheers,
jamal
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-30 13:31 ` jamal
@ 2004-11-30 13:46 ` Lennert Buytenhek
2004-11-30 14:25 ` jamal
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-11-30 13:46 UTC (permalink / raw)
To: jamal
Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Tue, Nov 30, 2004 at 08:31:41AM -0500, jamal wrote:
> > Also from what I understand new HW and MSI can help in the case where
> > pass objects between CPU. Did I dream or did someone tell me that S2IO
> > could have several TX ring that could via MSI be routed to proper cpu?
>
> I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
> like more hell to maintain.
On the TX path you'd have qdiscs to deal with as well, no?
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 14:21 ` Marco Mellia
@ 2004-11-30 13:46 ` jamal
2004-12-02 17:24 ` Marco Mellia
0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-11-30 13:46 UTC (permalink / raw)
To: mellia; +Cc: P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, 2004-11-29 at 09:21, Marco Mellia wrote:
> On Fri, 2004-11-26 at 20:56, jamal wrote:
> > On Fri, 2004-11-26 at 10:31, Marco Mellia wrote:
> > > If you don't trust us, please, ignore this email.
> > > Sorry.
> >
> > Dont take it the wrong way please - nobody has been able to produce the
> > results you have. So thats why you may be getting that comment.
> > The fact you have been able to do this is a good thing.
>
> No problem from this side. I also forgot a couple of 8-! I guess...
>
> [...]
>
> > prefetching as in the use of prefetch()?
> > What were you prefetching if you end up dropping packet?
> >
>
I read your paper over the weekend - there's one thing which I don't think
has been written about before on NAPI that you covered, unfortunately with no
melodrama ;-> This is the min-max fairness issue. If you actually mix
and match different speeds then it becomes a really interesting problem.
Example: try congesting a 100Mbps port with 2x1Gbps. What quotas to use, etc.
Could this be done cleverly at runtime with dynamic adjustments, etc.?
Next time you want to put students to work, talk to us
- I've got plenty of things you could try out and keep them busy forever ;->
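One illustrative answer to the quota question is to weight each device's polling budget by its link speed. The sketch below is a hypothetical userspace calculation, not anything NAPI actually implements; all names and the budget constant are made up:

```c
#include <assert.h>

/* Hypothetical sketch: split a fixed softirq polling budget among NICs in
 * proportion to link speed, so a 100Mbps port congested by 2x1Gbps ports
 * still keeps a guaranteed minimum share.  Not actual NAPI code. */

#define TOTAL_BUDGET 300   /* packets per softirq round, illustrative */
#define MIN_WEIGHT   1     /* every device always gets to poll something */

/* mbps[]: link speeds; weight[]: computed per-device quotas */
static void compute_weights(const unsigned int *mbps,
                            unsigned int *weight, int n)
{
    unsigned long long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += mbps[i];

    for (i = 0; i < n; i++) {
        weight[i] = (unsigned int)((unsigned long long)TOTAL_BUDGET *
                                   mbps[i] / total);
        if (weight[i] < MIN_WEIGHT)
            weight[i] = MIN_WEIGHT;
    }
}
```

A static split like this is exactly what "dynamic adjustments at runtime" would replace, e.g. by recomputing the proportions from observed arrival rates instead of nominal link speeds.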
> Sorry, I used the wrong terms there.
> What we discovered is that the CPU caching mechanism has a HUGE impact,
> and that you have very little control over it. Prefetching may help, but
> it is difficult to predict its impact...
Prefetching is hard. The only evidence I have seen of what "appears" to
be working prefetching is some code from David Morsberger at HP. Other
architectures are known to be more friendly - my experiences with MIPS
are far more pleasant. BTW, that's another topic to get those students
to investigate ;->
> Indeed, if you access the packet struct, the CPU has to fetch data
> from main memory, which stores the packet transferred via DMA from
> the NIC. The penalty of the memory access is huge, and you have little
> control over it.
>
> In our experiments, we modified the kernel to drop packets just after
> receiving them. skbs are just deallocated (using standard kernel
> routines, i.e., no recycling is used). Logically, that happens when
> netif_rx() is called.
>
> Now, we have three cases:
> 1) just modify netif_rx() to drop packets.
> 2) as in 1, plus remove the protocol check in the driver
> (i.e., comment out the line
> skb->protocol = eth_type_trans(skb, netdev);
> ) to avoid accessing the real packet data.
> 3) as in 2, but dealloc is performed at the driver level, instead of
> calling netif_rx()
>
> In the first case, we can receive about 1.1Mpps (~80% of packets)
Possible. In my experiments with gact drop I was able to receive 900Kpps
or so, which is slightly below this, on a 2.4 GHz machine with IRQ
affinity.
> In the second case, we can receive 100% of packets, as we removed the
> penalty of looking at the packet headers to discover its protocol type.
>
This is the one people found hard to believe. I will go and retest this.
It is possible.
> In the third case, we can NOT receive 100% of packets!
> The only difference is that we actually _REMOVED_ a function call. This
> reduces the overhead, but the compiler/cpu/whatever cannot optimize the
> data path to access the skb which must be freed.
It doesn't seem like you were running NAPI if you depended on calling
netif_rx.
In that case, #3 would be freeing in hard-IRQ context while #2 is in
softIRQ.
> Our guess is that freeing the skb in the netif_rx() function
> actually allows the compiler/cpu to prefetch the skb itself, and
> therefore keeps the pipeline working...
>
> My guess is that if you change compiler, cpu, memory subsystem, you may
> get very counterintuitive results...
Refer to my comment above.
Repeat the tests with NAPI and see if you get the same results.
cheers,
jamal
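For readers following along, the three receive-drop variants discussed in this message can be sketched as a userspace mock. All structure and function names below are illustrative stand-ins; the real code lives in the e1000 driver and the kernel's netif_rx() path:

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace mock of the three receive-drop variants discussed above.
 * Not the actual kernel or e1000 code; names are illustrative. */

struct mock_skb {
    unsigned char *data;      /* DMA'd payload: cold in the CPU cache */
    unsigned short protocol;
};

static unsigned long freed;   /* how many skbs have been deallocated */

static struct mock_skb *alloc_mock_skb(void)
{
    struct mock_skb *skb = malloc(sizeof(*skb));
    skb->data = calloc(1, 64);
    skb->protocol = 0;
    return skb;
}

static void free_mock_skb(struct mock_skb *skb)
{
    free(skb->data);
    free(skb);
    freed++;
}

/* Case 1: netif_rx() drops; the driver still reads the packet data to
 * fill skb->protocol, paying the main-memory fetch. */
static void rx_case1(struct mock_skb *skb)
{
    skb->protocol = (skb->data[12] << 8) | skb->data[13]; /* eth_type_trans() stand-in */
    free_mock_skb(skb);   /* drop in netif_rx() */
}

/* Case 2: the eth_type_trans() line is commented out, so the payload is
 * never touched and the cache-miss penalty disappears. */
static void rx_case2(struct mock_skb *skb)
{
    free_mock_skb(skb);   /* drop in netif_rx(), data never read */
}

/* Case 3: netif_rx() is removed entirely; the driver frees the skb
 * itself (in hard-IRQ context in the real driver). */
static void rx_case3(struct mock_skb *skb)
{
    free_mock_skb(skb);
}
```

The only functional difference between cases 2 and 3 is *where* the free happens, which is what makes the observed throughput gap so counterintuitive.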
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-30 13:46 ` Lennert Buytenhek
@ 2004-11-30 14:25 ` jamal
2004-12-01 0:11 ` Lennert Buytenhek
0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-11-30 14:25 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Tue, 2004-11-30 at 08:46, Lennert Buytenhek wrote:
> On Tue, Nov 30, 2004 at 08:31:41AM -0500, jamal wrote:
>
> > > Also from what I understand new HW and MSI can help in the case where
> > > pass objects between CPU. Did I dream or did someone tell me that S2IO
> > > could have several TX ring that could via MSI be routed to proper cpu?
> >
> > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
> > like more hell to maintain.
>
> On the TX path you'd have qdiscs to deal with as well, no?
I think management of it would be non-trivial in SMP. You'd have to start
playing stupid load-balancing tricks, which would reduce the value of
having per-CPU tx IRQs to begin with.
cheers,
jamal
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-30 14:25 ` jamal
@ 2004-12-01 0:11 ` Lennert Buytenhek
2004-12-01 1:09 ` Scott Feldman
2004-12-01 12:08 ` jamal
0 siblings, 2 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-01 0:11 UTC (permalink / raw)
To: jamal
Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Tue, Nov 30, 2004 at 09:25:54AM -0500, jamal wrote:
> > > > Also from what I understand new HW and MSI can help in the case where
> > > > pass objects between CPU. Did I dream or did someone tell me that S2IO
> > > > could have several TX ring that could via MSI be routed to proper cpu?
> > >
> > > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
> > > like more hell to maintain.
> >
> > On the TX path you'd have qdiscs to deal with as well, no?
>
> I think management of it would be non-trivial in SMP. Youd have to start
> playing stupid loadbalancing tricks which would reduce the value of
> existence of tx irqs to begin with.
You mean the management of qdiscs would be non-trivial?
Probably the idea of these kinds of tricks is to skip the qdisc step
altogether.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 0:11 ` Lennert Buytenhek
@ 2004-12-01 1:09 ` Scott Feldman
2004-12-01 15:34 ` Robert Olsson
` (3 more replies)
2004-12-01 12:08 ` jamal
1 sibling, 4 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-01 1:09 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
Hey, turns out, I know some e1000 tricks that might help get the kpps
numbers up.
My problem is I only have a P4 desktop system with a 82544 nic running
at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a
rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
descriptor write-backs. For me, I see a nice jump in kpps, but I'd like
others to try with their setups. We should be able to get to wire speed
with 60-byte packets.
I'm using pktgen in linux-2.6.9, count = 1000000.
System: Intel 865 (HT 2.6Ghz)
Nic: 82544 PCI 32-bit/33Mhz
Driver: linux-2.6.9 e1000 (5.3.19-k2-NAPI), no Interrupt Delays
BEFORE
256 descs
pkt_size = 60: 253432pps 129Mb/sec errors: 0
pkt_size = 1500: 56356pps 678Mb/sec errors: 499791
4096 descs
pkt_size = 60: 254222pps 130Mb/sec errors: 0
pkt_size = 1500: 52693pps 634Mb/sec errors: 497556
AFTER
Modified the driver to turn off Tx interrupts and descriptor write-backs.
It uses a timer to schedule Tx cleanup; the timer runs at 1ms. This would
work poorly where HZ=100. Needed to bump Tx descriptors up to 4096
because 1ms is a lot of time with 60-byte packets at 1GbE. Every time
the timer expires, there is only one PIO read to get the HW head pointer.
This wouldn't work at lower media speeds like 10Mbps or 100Mbps because
the ring isn't large enough (or we would need a higher-resolution
timer). This also gets Tx cleanup out of the NAPI path.
4096 descs
pkt_size = 60: 541618pps 277Mb/sec errors: 914
pkt_size = 1500: 76198pps 916Mb/sec errors: 12419
This doubles the kpps numbers for 60-byte packets. I'd like to see what
happens on higher bus bandwidth systems. Anyone?
-scott
diff -Naurp linux-2.6.9/drivers/net/e1000/e1000.h linux-2.6.9/drivers/net/e1000.mod/e1000.h
--- linux-2.6.9/drivers/net/e1000/e1000.h 2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/drivers/net/e1000.mod/e1000.h 2004-11-30 14:41:07.045391488 -0800
@@ -103,7 +103,7 @@ struct e1000_adapter;
#define E1000_MAX_INTR 10
/* TX/RX descriptor defines */
-#define E1000_DEFAULT_TXD 256
+#define E1000_DEFAULT_TXD 4096
#define E1000_MAX_TXD 256
#define E1000_MIN_TXD 80
#define E1000_MAX_82544_TXD 4096
@@ -189,6 +189,7 @@ struct e1000_desc_ring {
/* board specific private data structure */
struct e1000_adapter {
+ struct timer_list tx_cleanup_timer;
struct timer_list tx_fifo_stall_timer;
struct timer_list watchdog_timer;
struct timer_list phy_info_timer;
@@ -224,6 +225,7 @@ struct e1000_adapter {
uint32_t tx_fifo_size;
atomic_t tx_fifo_stall;
boolean_t pcix_82544;
+ boolean_t tx_cleanup_scheduled;
/* RX */
struct e1000_desc_ring rx_ring;
diff -Naurp linux-2.6.9/drivers/net/e1000/e1000_hw.h linux-2.6.9/drivers/net/e1000.mod/e1000_hw.h
--- linux-2.6.9/drivers/net/e1000/e1000_hw.h 2004-10-18 14:55:06.000000000 -0700
+++ linux-2.6.9/drivers/net/e1000.mod/e1000_hw.h 2004-11-30 13:48:07.983682328 -0800
@@ -417,14 +417,12 @@ int32_t e1000_set_d3_lplu_state(struct e
/* This defines the bits that are set in the Interrupt Mask
* Set/Read Register. Each bit is documented below:
* o RXT0 = Receiver Timer Interrupt (ring 0)
- * o TXDW = Transmit Descriptor Written Back
* o RXDMT0 = Receive Descriptor Minimum Threshold hit (ring 0)
* o RXSEQ = Receive Sequence Error
* o LSC = Link Status Change
*/
#define IMS_ENABLE_MASK ( \
E1000_IMS_RXT0 | \
- E1000_IMS_TXDW | \
E1000_IMS_RXDMT0 | \
E1000_IMS_RXSEQ | \
E1000_IMS_LSC)
diff -Naurp linux-2.6.9/drivers/net/e1000/e1000_main.c linux-2.6.9/drivers/net/e1000.mod/e1000_main.c
--- linux-2.6.9/drivers/net/e1000/e1000_main.c 2004-10-18 14:53:50.000000000 -0700
+++ linux-2.6.9/drivers/net/e1000.mod/e1000_main.c 2004-11-30 16:15:13.777957656 -0800
@@ -131,7 +131,7 @@ static int e1000_set_mac(struct net_devi
static void e1000_irq_disable(struct e1000_adapter *adapter);
static void e1000_irq_enable(struct e1000_adapter *adapter);
static irqreturn_t e1000_intr(int irq, void *data, struct pt_regs *regs);
-static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter);
+static void e1000_clean_tx(unsigned long data);
#ifdef CONFIG_E1000_NAPI
static int e1000_clean(struct net_device *netdev, int *budget);
static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter,
@@ -286,6 +286,7 @@ e1000_down(struct e1000_adapter *adapter
e1000_irq_disable(adapter);
free_irq(adapter->pdev->irq, netdev);
+ del_timer_sync(&adapter->tx_cleanup_timer);
del_timer_sync(&adapter->tx_fifo_stall_timer);
del_timer_sync(&adapter->watchdog_timer);
del_timer_sync(&adapter->phy_info_timer);
@@ -533,6 +534,10 @@ e1000_probe(struct pci_dev *pdev,
e1000_get_bus_info(&adapter->hw);
+ init_timer(&adapter->tx_cleanup_timer);
+ adapter->tx_cleanup_timer.function = &e1000_clean_tx;
+ adapter->tx_cleanup_timer.data = (unsigned long) adapter;
+
init_timer(&adapter->tx_fifo_stall_timer);
adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall;
adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;
@@ -893,14 +898,9 @@ e1000_configure_tx(struct e1000_adapter
e1000_config_collision_dist(&adapter->hw);
/* Setup Transmit Descriptor Settings for eop descriptor */
- adapter->txd_cmd = E1000_TXD_CMD_IDE | E1000_TXD_CMD_EOP |
+ adapter->txd_cmd = E1000_TXD_CMD_EOP |
E1000_TXD_CMD_IFCS;
- if(adapter->hw.mac_type < e1000_82543)
- adapter->txd_cmd |= E1000_TXD_CMD_RPS;
- else
- adapter->txd_cmd |= E1000_TXD_CMD_RS;
-
/* Cache if we're 82544 running in PCI-X because we'll
* need this to apply a workaround later in the send path. */
if(adapter->hw.mac_type == e1000_82544 &&
@@ -1820,6 +1820,11 @@ e1000_xmit_frame(struct sk_buff *skb, st
return NETDEV_TX_LOCKED;
}
+ if(!adapter->tx_cleanup_scheduled) {
+ adapter->tx_cleanup_scheduled = TRUE;
+ mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+ }
+
/* need: count + 2 desc gap to keep tail from touching
* head, otherwise try next time */
if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {
@@ -1856,6 +1861,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
netdev->trans_start = jiffies;
spin_unlock_irqrestore(&adapter->tx_lock, flags);
+
return NETDEV_TX_OK;
}
@@ -2151,8 +2157,7 @@ e1000_intr(int irq, void *data, struct p
}
#else
for(i = 0; i < E1000_MAX_INTR; i++)
- if(unlikely(!e1000_clean_rx_irq(adapter) &
- !e1000_clean_tx_irq(adapter)))
+ if(unlikely(!e1000_clean_rx_irq(adapter)))
break;
#endif
@@ -2170,18 +2175,15 @@ e1000_clean(struct net_device *netdev, i
{
struct e1000_adapter *adapter = netdev->priv;
int work_to_do = min(*budget, netdev->quota);
- int tx_cleaned;
int work_done = 0;
- tx_cleaned = e1000_clean_tx_irq(adapter);
e1000_clean_rx_irq(adapter, &work_done, work_to_do);
*budget -= work_done;
netdev->quota -= work_done;
- /* if no Rx and Tx cleanup work was done, exit the polling mode */
- if(!tx_cleaned || (work_done < work_to_do) ||
- !netif_running(netdev)) {
+ /* if no Rx cleanup work was done, exit the polling mode */
+ if((work_done < work_to_do) || !netif_running(netdev)) {
netif_rx_complete(netdev);
e1000_irq_enable(adapter);
return 0;
@@ -2192,66 +2194,74 @@ e1000_clean(struct net_device *netdev, i
#endif
/**
- * e1000_clean_tx_irq - Reclaim resources after transmit completes
- * @adapter: board private structure
+ * e1000_clean_tx - Reclaim resources after transmit completes
+ * @data: timer callback data (board private structure)
**/
-static boolean_t
-e1000_clean_tx_irq(struct e1000_adapter *adapter)
+static void
+e1000_clean_tx(unsigned long data)
{
+ struct e1000_adapter *adapter = (struct e1000_adapter *)data;
struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
struct net_device *netdev = adapter->netdev;
struct pci_dev *pdev = adapter->pdev;
- struct e1000_tx_desc *tx_desc, *eop_desc;
struct e1000_buffer *buffer_info;
- unsigned int i, eop;
- boolean_t cleaned = FALSE;
+ unsigned int i, next;
+ int size = 0, count = 0;
+ uint32_t tx_head;
- i = tx_ring->next_to_clean;
- eop = tx_ring->buffer_info[i].next_to_watch;
- eop_desc = E1000_TX_DESC(*tx_ring, eop);
+ spin_lock(&adapter->tx_lock);
- while(eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) {
- for(cleaned = FALSE; !cleaned; ) {
- tx_desc = E1000_TX_DESC(*tx_ring, i);
- buffer_info = &tx_ring->buffer_info[i];
+ tx_head = E1000_READ_REG(&adapter->hw, TDH);
- if(likely(buffer_info->dma)) {
- pci_unmap_page(pdev,
- buffer_info->dma,
- buffer_info->length,
- PCI_DMA_TODEVICE);
- buffer_info->dma = 0;
- }
+ i = next = tx_ring->next_to_clean;
- if(buffer_info->skb) {
- dev_kfree_skb_any(buffer_info->skb);
- buffer_info->skb = NULL;
- }
+ while(i != tx_head) {
+ size++;
+ if(i == tx_ring->buffer_info[next].next_to_watch) {
+ count += size;
+ size = 0;
+ if(unlikely(++i == tx_ring->count))
+ i = 0;
+ next = i;
+ } else {
+ if(unlikely(++i == tx_ring->count))
+ i = 0;
+ }
+ }
- tx_desc->buffer_addr = 0;
- tx_desc->lower.data = 0;
- tx_desc->upper.data = 0;
+ i = tx_ring->next_to_clean;
+ while(count--) {
+ buffer_info = &tx_ring->buffer_info[i];
- cleaned = (i == eop);
- if(unlikely(++i == tx_ring->count)) i = 0;
+ if(likely(buffer_info->dma)) {
+ pci_unmap_page(pdev,
+ buffer_info->dma,
+ buffer_info->length,
+ PCI_DMA_TODEVICE);
+ buffer_info->dma = 0;
}
-
- eop = tx_ring->buffer_info[i].next_to_watch;
- eop_desc = E1000_TX_DESC(*tx_ring, eop);
+
+ if(buffer_info->skb) {
+ dev_kfree_skb_any(buffer_info->skb);
+ buffer_info->skb = NULL;
+ }
+
+ if(unlikely(++i == tx_ring->count))
+ i = 0;
}
tx_ring->next_to_clean = i;
- spin_lock(&adapter->tx_lock);
+ if(E1000_DESC_UNUSED(tx_ring) != tx_ring->count)
+ mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+ else
+ adapter->tx_cleanup_scheduled = FALSE;
- if(unlikely(cleaned && netif_queue_stopped(netdev) &&
- netif_carrier_ok(netdev)))
+ if(unlikely(netif_queue_stopped(netdev) && netif_carrier_ok(netdev)))
netif_wake_queue(netdev);
spin_unlock(&adapter->tx_lock);
-
- return cleaned;
}
/**
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 0:11 ` Lennert Buytenhek
2004-12-01 1:09 ` Scott Feldman
@ 2004-12-01 12:08 ` jamal
2004-12-01 15:24 ` Lennert Buytenhek
1 sibling, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-01 12:08 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Tue, 2004-11-30 at 19:11, Lennert Buytenhek wrote:
> On Tue, Nov 30, 2004 at 09:25:54AM -0500, jamal wrote:
>
> > > > > Also from what I understand new HW and MSI can help in the case where
> > > > > pass objects between CPU. Did I dream or did someone tell me that S2IO
> > > > > could have several TX ring that could via MSI be routed to proper cpu?
> > > >
> > > > I am wondering if the per CPU tx/rx irqs are valuable at all. They sound
> > > > like more hell to maintain.
> > >
> > > On the TX path you'd have qdiscs to deal with as well, no?
> >
> > I think management of it would be non-trivial in SMP. Youd have to start
> > playing stupid loadbalancing tricks which would reduce the value of
> > existence of tx irqs to begin with.
>
> You mean the management of qdiscs would be non-trivial?
I mean it is useful in only the most ideal cases and if you want to
actually do something useful in most cases with it you will have to
muck around.
Take the case of forwarding (maybe with little or almost no
localhost-generated traffic) - then you end up allocating on CPU A,
processing, and queueing on egress. The Tx softirq, which is what
eventually stashes the packet on the Tx DMA ring, is not guaranteed to
run on the same CPU. Now add a little latency between ingress and egress ..
The ideal case is where you end up processing to completion from ingress
to egress (which is known to happen in Linux when there's no congestion).
> Probably the idea of these kinds of tricks is to skip the qdisc step
> altogether.
>
Which is preached by the BSD folks - bogus in my opinion. If you want to
do something as bland/boring as that, you can probably afford a $500
DLINK router which can do it at wire rate (with the cost being that you
are locked into whatever features they have).
cheers,
jamal
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-30 8:42 ` Marco Mellia
@ 2004-12-01 12:25 ` jamal
2004-12-02 13:39 ` Marco Mellia
0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-01 12:25 UTC (permalink / raw)
To: mellia
Cc: Lennert Buytenhek, Harald Welte, P, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Tue, 2004-11-30 at 03:42, Marco Mellia wrote:
> On Mon, 2004-11-29 at 15:50, Lennert Buytenhek wrote:
> > On Mon, Nov 29, 2004 at 09:53:33AM +0100, Marco Mellia wrote:
> >
> > > Th's our intuition too.
> > > Notice that we get the same results with 3com (broadcom based) gigabit
> > > cards.
> > > We are thinking of sending packet in "bursts" instead of single
> > > transfers. The only problem is to let the NIC know that there are more
> > > than a packet in a burst...
> >
> > Jamal implemented exactly this for e1000 already, he might be persuaded
> > into posting his patch here. Jamal? :)
>
> I guess that saying that we are _very_ interested in this might help.
> :-)
> We can offer as "beta-testers" as well...
Sorry, missed this (I wasn't CCed, so it went to a low-priority queue which
I read on a best-effort basis).
Let me clean up the patches a little bit this weekend. The patch is at
least 4 months old; the latest reincarnation was due to issue 1 in my SUCON
presentation. Would a patch against the latest 2.6.x bitkeeper (whatever it
is this weekend) be fine? If you are in a rush and don't mind a little
ugliness then I will pass them along as is.
BTW, Scott posted an interesting patch yesterday; you may wanna give that
a shot as well.
cheers,
jamal
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 12:08 ` jamal
@ 2004-12-01 15:24 ` Lennert Buytenhek
0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-01 15:24 UTC (permalink / raw)
To: jamal
Cc: Robert Olsson, P, mellia, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Wed, Dec 01, 2004 at 07:08:20AM -0500, jamal wrote:
[ per-CPU TX/RX rings ]
> > You mean the management of qdiscs would be non-trivial?
>
> I mean it is useful in only the most ideal cases and if you want to
> actually do something useful in most cases with it you will have to
> muck around.
> Take the case of forwarding (maybe with a little or almost no localhost
> generated traffic) - then you end allocating in CPUA, processing and
> queueing on egress. Tx softirq, which is what stashes the packet on tx
> DMA eventually, is not guaranteed to run on the same CPU. Now add a
> little latency between ingress and egress ..
> The ideal case is where you end up processing to completion from ingress
> to egress (which is known to happen in Linux when theres no congestion).
We disagreed on this topic at SUCON and I'm afraid we'll be disagreeing
on it forever :) IMHO, on 10GbE any kind of qdisc is a waste of cycles.
I don't think it's very likely that you'll be using that single 10GbE NIC
for forwarding packets, doing that with a PC at this point in the history
of PCs is just silly. If you do use it for forwarding, how likely is it
that you'll be able to process an incoming burst of packets fast enough
to require queueing on the egress interface? You have to be able to send
a burst of packets bigger than the NIC's TX FIFO at >10GbE in the first
place for queueing to be effective/useful at all.
(Leaving the question of whether or not there'll be some room in the TX
FIFO at TX time unanswered, what you're doing with per-CPU TX rings is
basically just simulating the "N individual NICs each bound to its own
CPU" case with a single NIC.)
> > Probably the idea of these kinds of tricks is to skip the qdisc step
> > altogether.
>
> Which is preached by the BSD folks - bogus in my opinion. If you want to
> do something as bland/boring as that you can probably afford a $500
> DLINK router which can do it at wire rate with (with cost you being
> locked in whatever features they have).
That's an unfair comparison. Just because I don't need CBQ doesn't mean
my $500 DLINK router does everything I'd want it to -- advanced firewalling
is one thing that comes to mind. Last time I looked I couldn't load my
own kernel modules on my DLINK router either.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 1:09 ` Scott Feldman
@ 2004-12-01 15:34 ` Robert Olsson
2004-12-01 16:49 ` Scott Feldman
2004-12-01 18:29 ` Lennert Buytenhek
` (2 subsequent siblings)
3 siblings, 1 reply; 85+ messages in thread
From: Robert Olsson @ 2004-12-01 15:34 UTC (permalink / raw)
To: sfeldma
Cc: Lennert Buytenhek, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
Scott Feldman writes:
> Hey, turns out, I know some e1000 tricks that might help get the kpps
> numbers up.
>
> My problem is I only have a P4 desktop system with a 82544 nic running
> at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a
> rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
> descriptor write-backs. For me, I see a nice jump in kpps, but I'd like
> others to try with their setups. We should be able to get to wire speed
> with 60-byte packets.
>
> System: Intel 865 (HT 2.6Ghz)
> Nic: 82544 PCI 32-bit/33Mhz
> Driver: linux-2.6.9 e1000 (5.3.19-k2-NAPI), no Interrupt Delays
> 4096 descs
> pkt_size = 60: 541618pps 277Mb/sec errors: 914
Hello!
Nice, but I see no improvements w. 82546GB @ 133 MHz on a 1.6 GHz Opteron, it seems.
SMP kernel linux-2.6.9-rc2
Vanilla.
801077pps 410Mb/sec (410151424bps) errors: 95596
Patch TXD=4096
608690pps 311Mb/sec (311649280bps) errors: 0
Patch TXD=2048
624103pps 319Mb/sec (319540736bps) errors: 0
Patch TXD=1024
551289pps 282Mb/sec (282259968bps) errors: 4506
Error count is a bit confusing...
--ro
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-29 20:16 ` David S. Miller
@ 2004-12-01 16:47 ` Robert Olsson
0 siblings, 0 replies; 85+ messages in thread
From: Robert Olsson @ 2004-12-01 16:47 UTC (permalink / raw)
To: David S. Miller
Cc: Robert Olsson, hadi, P, mellia, e1000-devel, jorge.finochietto,
galante, netdev
David S. Miller writes:
> > Did I dream or did someone tell me that S2IO
> > could have several TX ring that could via MSI be routed to proper cpu?
>
> One of Sun's gigabit chips can do this too, except it isn't
> via MSI, the driver has to read the descriptor to figure out
> which cpu gets the software interrupt to process the packet.
>
> SGI had hardware which allowed you to do this kind of stuff too.
>
> Obviously the MSI version works much better.
>
> It is important, the cpu selection process. First of all, it must
> be calculated such that flows always go through the same cpu.
> Otherwise TCP sockets bounce between the cpus for a streaming
> transfer.
>
> And even this doesn't avoid all such problems, TCP LISTEN state
> sockets will still thrash between the cpus with such a "pick
> a cpu based upon" flow scheme.
>
> Anyways, just some thoughts.
Thanks for the info. Well, we'll be forced to get into those problems when
the HW is capable. I guess it will be w. the 10 GigE cards.
--ro
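A minimal sketch of the flow-to-CPU steering discussed above: hash the 5-tuple so every packet of a flow lands on the same CPU. The hash below is an illustrative mix function written for this example, not the one any real NIC or driver uses:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: hash the 5-tuple so every packet of a flow is steered to the
 * same CPU, avoiding TCP sockets bouncing between CPUs.  Illustrative
 * only; MSI-capable multi-ring hardware would do this selection in HW. */

struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

static uint32_t flow_hash(const struct flow_key *f)
{
    uint32_t h = f->saddr ^ f->daddr;

    h ^= ((uint32_t)f->sport << 16) | f->dport;
    h ^= f->proto;
    /* cheap avalanche so adjacent tuples spread over CPUs */
    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return h;
}

static unsigned int pick_cpu(const struct flow_key *f, unsigned int ncpus)
{
    return flow_hash(f) % ncpus;
}
```

Because the hash is a pure function of the tuple, a streaming transfer always maps to one CPU; as noted above, LISTEN-state sockets can still thrash, since their packets arrive with many different tuples.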
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 15:34 ` Robert Olsson
@ 2004-12-01 16:49 ` Scott Feldman
2004-12-01 17:37 ` Robert Olsson
` (2 more replies)
0 siblings, 3 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-01 16:49 UTC (permalink / raw)
To: Robert Olsson
Cc: Lennert Buytenhek, jamal, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Wed, 2004-12-01 at 07:34, Robert Olsson wrote:
> Nice but I no improvements w. 82546GB @ 133 MHz on 1.6 GHz Opteron it seems.
> SMP kernel linux-2.6.9-rc2
>
> Vanilla.
> 801077pps 410Mb/sec (410151424bps) errors: 95596
>
> Patch TXD=4096
> 608690pps 311Mb/sec (311649280bps) errors: 0
Thank you Robert for trying it out.
Well, those results are counter-intuitive! We remove Tx interrupts and
Tx descriptor DMA write-backs, get no retries, and performance
drops? The only bus activities left are the DMA of buffers to the device
and the register writes to increment tail. I'm stumped. I'll need to
get my hands on a faster system. Maybe there is a bus analyzer under
the tree. :-)
-scott
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 16:49 ` Scott Feldman
@ 2004-12-01 17:37 ` Robert Olsson
2004-12-02 17:54 ` Robert Olsson
2004-12-02 18:23 ` Robert Olsson
2 siblings, 0 replies; 85+ messages in thread
From: Robert Olsson @ 2004-12-01 17:37 UTC (permalink / raw)
To: sfeldma
Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
Scott Feldman writes:
> Thank you Robert for trying it out.
>
> Well those results are counter-intuitive! We remove Tx interrupts and
> Tx descriptor DMA write-backs and get no re-tries, and performance
> drops? The only bus activities left are the DMA of buffers to device
> and the register writes to increment tail. I'm stumped. I'll need to
> get my hands on a faster system. Maybe there is a bus analyzer under
> the tree. :-)
Huh. I've got a deja-vu feeling. What will happen if we remove almost all
events (interrupts) and just have the timer waking up once in a while?
--ro
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 1:09 ` Scott Feldman
2004-12-01 15:34 ` Robert Olsson
@ 2004-12-01 18:29 ` Lennert Buytenhek
2004-12-01 21:35 ` Lennert Buytenhek
2004-12-02 17:31 ` [E1000-devel] Transmission limit Marco Mellia
2004-12-03 20:57 ` Lennert Buytenhek
3 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-01 18:29 UTC (permalink / raw)
To: Scott Feldman
Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Tue, Nov 30, 2004 at 05:09:59PM -0800, Scott Feldman wrote:
> This doubles the kpps numbers for 60-byte packets. I'd like to see what
> happens on higher bus bandwidth systems. Anyone?
Dual Xeon 2.4GHz, a 82540EM and a 82541GI both on 32/66 on separate
PCI buses.
BEFORE performance is approx the same for both, ~620kpps.
AFTER performance is ~730kpps, also approx the same for both.
(Note: only sending with one NIC at a time.)
Once or twice it went into a state where it started spitting out these
kinds of messages and never recovered:
Dec 1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
[...]
Dec 1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
[...]
Dec 1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
But overall, looks good. Strange that Robert's numbers didn't improve.
Doing some more measurements right now.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 18:29 ` Lennert Buytenhek
@ 2004-12-01 21:35 ` Lennert Buytenhek
2004-12-02 6:13 ` Scott Feldman
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-01 21:35 UTC (permalink / raw)
To: Scott Feldman
Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
[-- Attachment #1: Type: text/plain, Size: 1209 bytes --]
On Wed, Dec 01, 2004 at 07:29:43PM +0100, Lennert Buytenhek wrote:
> > This doubles the kpps numbers for 60-byte packets. I'd like to see what
> > happens on higher bus bandwidth systems. Anyone?
>
> Dual Xeon 2.4GHz, a 82540EM and a 82541GI both on 32/66 on separate
> PCI buses.
>
> BEFORE performance is approx the same for both, ~620kpps.
> AFTER performance is ~730kpps, also approx the same for both.
Pretty graph attached. From ~220B packets or so it does wire speed, but
there's still an odd drop in performance around 256B packets (which is
also there without your patch.) From 350B packets or so, performance is
identical with or without your patch (wire speed.)
So. Do you have any other good plans perhaps? :)
> Once or twice it went into a state where it started spitting out these
> kinds of messages and never recovered:
>
> Dec 1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> [...]
> Dec 1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> [...]
> Dec 1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
Didn't see this happen anymore. (ifconfig down and then up recovered it
both times I saw it happen.)
thanks,
Lennert
[-- Attachment #2: feldman.png --]
[-- Type: image/png, Size: 7959 bytes --]
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 21:35 ` Lennert Buytenhek
@ 2004-12-02 6:13 ` Scott Feldman
2004-12-03 13:24 ` jamal
2004-12-05 14:50 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
0 siblings, 2 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-02 6:13 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Wed, 2004-12-01 at 13:35, Lennert Buytenhek wrote:
> Pretty graph attached. From ~220B packets or so it does wire speed, but
> there's still an odd drop in performance around 256B packets (which is
> also there without your patch.) From 350B packets or so, performance is
> identical with or without your patch (wire speed.)
Seems this is helping PCI NICs but not PCI-X. I was using PCI 32/33.
Can't explain the dip around 256B.
> So. Do you have any other good plans perhaps? :)
Idea#1
Is the write of TDT causing interference with DMA transactions?
In addition to my patch, what happens if you bump the Tx tail every n
packets, where n is like 16 or 32 or 64?
if((i % 16) == 0)
        E1000_REG_WRITE(&adapter->hw, TDT, i);
This might piss the NETDEV timer off if the send count isn't a multiple
of n, so you might want to disable netdev->tx_timeout.
Idea#2
The Ultimate: queue up 4096 packets and then write TDT once to send all
4096 in one shot. Well, maybe a few less than 4096 so we don't wrap the
ring. How about pkt_size = 4000?
Take my patch and change the timer call in e1000_xmit_frame from
jiffies + 1
to
jiffies + HZ
This will schedule the cleanup of the skbs 1 second after the first
queue, so we shouldn't be doing any cleanup while the 4000 packets are
DMA'ed.
Oh, and change the tail write to
if((i % 4000) == 0)
        E1000_REG_WRITE(&adapter->hw, TDT, i);
Of course you'll need to close/open the driver after each run.
Idea#3
http://www.mail-archive.com/freebsd-net@freebsd.org/msg10826.html
Set TXDMAC to 0 in e1000_configure_tx.
> > Once or twice it went into a state where it started spitting out these
> > kinds of messages and never recovered:
> >
> > Dec 1 19:13:18 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> > [...]
> > Dec 1 19:13:31 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
> > [...]
> > Dec 1 19:13:43 phi kernel: NETDEV WATCHDOG: eth1: transmit timed out
>
> Didn't see this happen anymore. (ifconfig down and then up recovered it
> both times I saw it happen.)
Well, it's probably not a HW bug that's causing the reset; it's probably
some bug with my patch.
-scott
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 12:25 ` jamal
@ 2004-12-02 13:39 ` Marco Mellia
2004-12-03 13:07 ` jamal
0 siblings, 1 reply; 85+ messages in thread
From: Marco Mellia @ 2004-12-02 13:39 UTC (permalink / raw)
To: hadi
Cc: mellia, Lennert Buytenhek, Harald Welte, P, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
> > > > We are thinking of sending packets in "bursts" instead of single
> > > > transfers. The only problem is to let the NIC know that there is more
> > > > than one packet in a burst...
> > >
> > > Jamal implemented exactly this for e1000 already, he might be persuaded
> > > into posting his patch here. Jamal? :)
> >
> > I guess that saying that we are _very_ interested in this might help.
> > :-)
> > We can offer as "beta-testers" as well...
>
> Sorry, missed this (I wasn't CCed, so it went to a low-priority queue
> which I read on a best-effort basis).
> Let me clean up the patches a little bit this weekend. The patch is at
> least 4 months old; latest reincarnation was due to issue1 on my SUCON
> presentation. Would a patch against latest 2.6.x bitkeeper (whatever it
> is this weekend) be fine? If you are in a rush and don't mind a little
> ugliness then I will pass them as is.
>
We'll be glad to spend some time trying this out. Please note, we are
not very comfortable with the Linux BitKeeper maintenance method. Can we
ask you to provide us a patch against a standard kernel/driver (whatever
you prefer...)? A complete source sub-tree would also be ok ;-)
> BTW, Scott posted a interesting patch yesterday, you may wanna give that
> a shot as well.
We're trying that out right now... (which means that in a couple of
days, we'll try it ;-))
Thanks a lot.
--
Ciao, /\/\/\rco
+-----------------------------------+
| Marco Mellia - Assistant Professor|
| Tel: 39-011-2276-608 |
| Tel: 39-011-564-4173 |
| Cel: 39-340-9674888 | /"\ .. . . . . . . . . . . . .
| Politecnico di Torino | \ / . ASCII Ribbon Campaign .
| Corso Duca degli Abruzzi 24 | X .- NO HTML/RTF in e-mail .
| Torino - 10129 - Italy | / \ .- NO Word docs in e-mail.
| http://www1.tlc.polito.it/mellia | .. . . . . . . . . . . . .
+-----------------------------------+
The box said "Requires Windows 95 or Better." So I installed Linux.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-11-30 13:46 ` jamal
@ 2004-12-02 17:24 ` Marco Mellia
0 siblings, 0 replies; 85+ messages in thread
From: Marco Mellia @ 2004-12-02 17:24 UTC (permalink / raw)
To: hadi
Cc: mellia, P, e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
> > In our experiments, we modified the kernel to drop packets just after
> > receiving them. skbs are just deallocated (using standard kernel
> > routines, i.e., no recycling is used). Logically, that happens when
> > netif_rx() is called.
> >
> > Now, we have three cases
> > 1) just modify netif_rx() to drop packets.
> > 2) as in one, plus remove the protocol check in the driver
> > (i.e., comment the line
> > skb->protocol = eth_type_trans(skb, netdev);
> > ) to avoid accessing the real packet data.
> > 3) as in 2, but dealloc is performed at the driver level, instead of
> > calling the netif_rx()
> >
> > In the first case, we can receive about 1.1Mpps (~80% of packets)
>
> Possible. I was able to receive 900Kpps or so in my experiments with
> gact drop which is slightly above this with a 2.4 Ghz machine with IRQ
> affinity.
I double checked with the people that actually did the job. They indeed
tested both cases, i.e., dropping packets either using IRQ (therefore
using netif_rx()) or using NAPI (therefore using netif_receive_skb()).
In both cases, disabling the eth_type_trans() check, we receive 100% of
packets...
> > In the third case, we can NOT receive 100% of packets!
> > The only difference is that we actually _REMOVED_ a function call. This
> > reduces the overhead, yet the compiler/cpu/whatever cannot optimize the
> > data path that accesses the skb which must be freed.
>
> It doesn't seem like you were running NAPI if you depended on calling
> netif_rx
> In that case, #3 would be freeing in hard IRQ context while #2 is
> softIRQ.
Again, it was my mistake. Case #3 was performed using the NAPI stack,
i.e., freeing up the skb instead of calling netif_receive_skb().
Doing that, we observed a performance drop, which we attribute to some
caching issues. Indeed, investigating with OProfile, case #3 registers
about twice as many cache misses as case #2.
Again, we do not have any clear explanation, but our intuition is that
adding a function call with a pointer as argument might allow the
compiler/cpu to prefetch the skb and speed up the memory release...
> > Our guess is that by freeing up the skb in the netif_rx() function
> > actually allows the compiler/cpu to prefetch the skb itself, and
> > therefore keep the pipeline working...
> >
> > My guess is that if you change compiler, cpu, memory subsystem, you may
> > get very counterintuitive results...
>
> Refer to my comment above.
> Repeat tests with NAPI and see if you get same results.
We were using NAPI. Sorry for the misunderstanding.
Hope this helps.
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 1:09 ` Scott Feldman
2004-12-01 15:34 ` Robert Olsson
2004-12-01 18:29 ` Lennert Buytenhek
@ 2004-12-02 17:31 ` Marco Mellia
2004-12-03 20:57 ` Lennert Buytenhek
3 siblings, 0 replies; 85+ messages in thread
From: Marco Mellia @ 2004-12-02 17:31 UTC (permalink / raw)
To: sfeldma
Cc: birke, Lennert Buytenhek, jamal, Robert Olsson, P, mellia,
e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Wed, 2004-12-01 at 02:09, Scott Feldman wrote:
> Hey, turns out, I know some e1000 tricks that might help get the kpps
> numbers up.
>
> My problem is I only have a P4 desktop system with a 82544 nic running
> at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a
> rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
> descriptor write-backs. For me, I see a nice jump in kpps, but I'd like
> others to try with their setups. We should be able to get to wire speed
> with 60-byte packets.
>
Here are the numbers in our setup:
vanilla kernel [2.4.20 + packetgen + driver e1000 5.4.11]
4096 Descr => 356 Mbps (60-byte frames)
           => 941 Mbps (1500-byte frames)
256 Descr  => 354 Mbps (60-byte frames)
           => 941 Mbps (1500-byte frames)
Patched driver [2.4.20 + packetgen + driver e1000 5.4.11 patched]
4096 Descr => 357 Mbps (60-byte frames)
           => 941 Mbps (1500-byte frames)
I guess that was _not_ the bottleneck, sigh... at least with a PCI-X bus.
Again, a latency issue of the DMA transfer from RAM to the NIC?
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 16:49 ` Scott Feldman
2004-12-01 17:37 ` Robert Olsson
@ 2004-12-02 17:54 ` Robert Olsson
2004-12-02 18:23 ` Robert Olsson
2 siblings, 0 replies; 85+ messages in thread
From: Robert Olsson @ 2004-12-02 17:54 UTC (permalink / raw)
To: sfeldma
Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
Scott Feldman writes:
> Thank you Robert for trying it out.
Scott!
I've rerun some of the tests. I've set maxcpus=1 to make sure everything
happens on one CPU. Same HW as yesterday.
I now see a lot of variation in the results with your patch.
vanilla
804353pps 411Mb/sec (411828736bps) errors: 98877
patch TXD=4096
Sometimes: 882362pps 451Mb/sec (451769344bps) errors: 0
patch TXD=2048
Sometimes: 943007pps 482Mb/sec (482819584bps) errors: 0
But very often runs are around 500 kpps with the patch. This smells like
scheduling to me, as smaller rings used to mean higher performance, but
the ring needs to be big enough to hide latencies.
See also my next mail...
--ro
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 16:49 ` Scott Feldman
2004-12-01 17:37 ` Robert Olsson
2004-12-02 17:54 ` Robert Olsson
@ 2004-12-02 18:23 ` Robert Olsson
2004-12-02 23:25 ` Lennert Buytenhek
` (2 more replies)
2 siblings, 3 replies; 85+ messages in thread
From: Robert Olsson @ 2004-12-02 18:23 UTC (permalink / raw)
To: sfeldma
Cc: Robert Olsson, Lennert Buytenhek, jamal, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
Hello!
Below is a little patch to clean skbs at xmit. It's an old jungle trick
Jamal and I used w. the tulip. Note we can now even decrease the size of
the TX ring.
It can increase TX performance from 800 kpps to
1125128pps 576Mb/sec (576065536bps) errors: 0
1124946pps 575Mb/sec (575972352bps) errors: 0
But it suffers from the same scheduling problems as the previous patch.
Often we just get
582108pps 298Mb/sec (298039296bps) errors: 0
When the sending CPU frees (its) skbs, we might get some "TX free
affinity", which is unrelated to irq affinity and of course not 100%
perfect.
And some of Scott's changes may still be used.
--- drivers/net/e1000/e1000.h.orig 2004-12-01 13:59:36.000000000 +0100
+++ drivers/net/e1000/e1000.h 2004-12-02 20:11:31.000000000 +0100
@@ -103,7 +103,7 @@
#define E1000_MAX_INTR 10
/* TX/RX descriptor defines */
-#define E1000_DEFAULT_TXD 256
+#define E1000_DEFAULT_TXD 128
#define E1000_MAX_TXD 256
#define E1000_MIN_TXD 80
#define E1000_MAX_82544_TXD 4096
--- drivers/net/e1000/e1000_main.c.orig 2004-12-01 13:59:36.000000000 +0100
+++ drivers/net/e1000/e1000_main.c 2004-12-02 20:37:40.000000000 +0100
@@ -1820,6 +1820,10 @@
return NETDEV_TX_LOCKED;
}
+
+ if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 )
+ e1000_clean_tx_ring(adapter);
+
/* need: count + 2 desc gap to keep tail from touching
* head, otherwise try next time */
if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {
--ro
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-02 18:23 ` Robert Olsson
@ 2004-12-02 23:25 ` Lennert Buytenhek
2004-12-03 5:23 ` Scott Feldman
2004-12-10 16:24 ` Martin Josefsson
2 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-02 23:25 UTC (permalink / raw)
To: Robert Olsson
Cc: sfeldma, jamal, P, mellia, e1000-devel, Jorge Manuel Finochietto,
Giulio Galante, netdev
On Thu, Dec 02, 2004 at 07:23:24PM +0100, Robert Olsson wrote:
> Below is little patch to clean skb at xmit. It's old jungle trick Jamal
> and I used w. tulip. Note we can now even decrease the size of TX ring.
>
> It can increase TX performance from 800 kpps to
> 1125128pps 576Mb/sec (576065536bps) errors: 0
> 1124946pps 575Mb/sec (575972352bps) errors: 0
>
> But suffers from scheduling problems as the previous patch. Often we just get
> 582108pps 298Mb/sec (298039296bps) errors: 0
Robert, there is something weird with your setup with packet sizes under
160 bytes. Can you check if you also get wildly variable numbers on a
baseline kernel perhaps? The numbers you sent me of packet size vs. pps
were very jumpy as well, even at 10M pkts per run.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-02 18:23 ` Robert Olsson
2004-12-02 23:25 ` Lennert Buytenhek
@ 2004-12-03 5:23 ` Scott Feldman
2004-12-10 16:24 ` Martin Josefsson
2 siblings, 0 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-03 5:23 UTC (permalink / raw)
To: Robert Olsson
Cc: Lennert Buytenhek, jamal, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Thu, 2004-12-02 at 10:23, Robert Olsson wrote:
> It can increase TX performance from 800 kpps to
> 1125128pps 576Mb/sec (576065536bps) errors: 0
> 1124946pps 575Mb/sec (575972352bps) errors: 0
These are the best numbers reported so far, right?
> And some of Scott's changes may still be used.
Did you try combining the two?
> +
> + if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 )
> + e1000_clean_tx_ring(adapter);
> +
You want to use E1000_DESC_UNUSED here because of the ring wrap. ;-)
-scott
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-02 13:39 ` Marco Mellia
@ 2004-12-03 13:07 ` jamal
0 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2004-12-03 13:07 UTC (permalink / raw)
To: mellia
Cc: Lennert Buytenhek, Harald Welte, P, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Thu, 2004-12-02 at 08:39, Marco Mellia wrote:
> We'll be glad to spend some time trying this out. Please note, we are
> not very comfortable with the Linux BitKeeper maintenance method. Can we
> ask you to provide us a patch against a standard kernel/driver (whatever
> you prefer...)? A complete source sub-tree would also be ok ;-)
Would a -rcX patch be fine for you?
2.6.10-rc2; which means you will take 2.6.9, patch it with
patch-2.6.10-rc2.gz from the kernel.org/v2.6/testing directory, then
patch one more time with the patch I give you.
Let me know if you are uncomfortable with that as well.
[Sorry, I am disk poor and my stupid ISP still charges $1/MB/month even
in this age if I put it up at cyberus].
In the patch I give you I will include rx path improvement code that I
got from David Morsberger; I "think" I have seen some improvements with
it but I am not 100% sure. If you repeat the test where you drop the
packet right after eth_type_trans() with this patch on, I would be very
interested if you see any improvements.
In any case, expect something from me this weekend or monday (big party
this weekend ;->).
cheers,
jamal
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-02 6:13 ` Scott Feldman
@ 2004-12-03 13:24 ` jamal
2004-12-05 14:50 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
1 sibling, 0 replies; 85+ messages in thread
From: jamal @ 2004-12-03 13:24 UTC (permalink / raw)
To: sfeldma
Cc: Lennert Buytenhek, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Thu, 2004-12-02 at 01:13, Scott Feldman wrote:
> On Wed, 2004-12-01 at 13:35, Lennert Buytenhek wrote:
> > Pretty graph attached. From ~220B packets or so it does wire speed, but
> > there's still an odd drop in performance around 256B packets (which is
> > also there without your patch.) From 350B packets or so, performance is
> > identical with or without your patch (wire speed.)
>
> Seems this is helping PCI nics but not PCI-X. I was using PCI 32/33.
> Can't explain the dip around 256B.
>
Interesting thought. I also saw improvements with my batching patch for
PCI 32/32 but nothing noticeable in PCI-X 64/66.
cheers,
jamal
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-01 1:09 ` Scott Feldman
` (2 preceding siblings ...)
2004-12-02 17:31 ` [E1000-devel] Transmission limit Marco Mellia
@ 2004-12-03 20:57 ` Lennert Buytenhek
2004-12-04 10:36 ` Lennert Buytenhek
3 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-03 20:57 UTC (permalink / raw)
To: Scott Feldman
Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
[-- Attachment #1: Type: text/plain, Size: 2459 bytes --]
On Tue, Nov 30, 2004 at 05:09:59PM -0800, Scott Feldman wrote:
> Hey, turns out, I know some e1000 tricks that might help get the kpps
> numbers up.
>
> My problem is I only have a P4 desktop system with a 82544 nic running
> at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a
> rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
> descriptor write-backs. For me, I see a nice jump in kpps, but I'd like
> others to try with their setups. We should be able to get to wire speed
> with 60-byte packets.
Attached is a graph of my numbers with and without your patch for:
- An 82540 at PCI 32/33, idle 33MHz card on the same bus forcing it to 33MHz.
- An 82541 at PCI 32/66.
- An 82546 at PCI-X 64/100, NIC can do 133MHz but mobo only does 100MHz.
All 'phi' tests were done on my box phi, a dual 2.4GHz Xeon on an Intel
SE7505VB2 board (http://www.intel.com/design/servers/se7505vb2/). I've
included Robert's 64/133 numbers ('sourcemage') on his dual 866MHz P3 for
comparison. I didn't test all packet sizes up to 1500, just the first few
hundred bytes for each.
As before, the max # pps at 60B packets is strongly influenced by the per-
packet overhead (which seems to be reduced by your patch for my machine
quite a bit, also on 64/100, even though Robert sees no improvement on
64/133) while the slope of each curve appears to depend only on the speed
of the bus the NIC is in. I.e. the 60B kpps number more-or-less determines
the shape of the rest of the graph in each case.
Bus speed is most likely also the reason why the 64/100 setup w/o your patch
starts off slower than the 64/66 with your patch, but then eventually beats
the 64/66 (around 140B packets) just before they both hit the GigE saturation
point.
There's no drop at 256B for the 64/100 setup like with the 32/* setups.
Perhaps the drop at 256B is because the PCI latency timer is set to 64
by default, which causes transfers on the 32-bit buses to be broken up
into 256-byte chunks?
I'm not able to saturate gigabit on 32/33 with 1500B packets, while Jamal
does. Another thing to look into.
Also note that the 64/100 NIC has rather wobbly performance between 60B and
~160B. This 'square wave pattern' is there both with and without your
patch, perhaps something particular to the NIC. Its period appears to be 16
bytes, dropping down where packet_size mod 16 = 0, and then jumping up again
a bit when packet_size mod 16 = 6. Odd.
--L
[-- Attachment #2: perf.png --]
[-- Type: image/png, Size: 31312 bytes --]
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-03 20:57 ` Lennert Buytenhek
@ 2004-12-04 10:36 ` Lennert Buytenhek
0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-04 10:36 UTC (permalink / raw)
To: Scott Feldman
Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Fri, Dec 03, 2004 at 09:57:06PM +0100, Lennert Buytenhek wrote:
> > My problem is I only have a P4 desktop system with a 82544 nic running
> > at PCI 32/33Mhz, so I can't play with the big boys. But, attached is a
> > rework of the Tx path to eliminate 1) Tx interrupts, and 2) Tx
> > descriptor write-backs. For me, I see a nice jump in kpps, but I'd like
> > others to try with their setups. We should be able to get to wire speed
> > with 60-byte packets.
>
> Attached is a graph of my numbers with and without your patch for:
> - An 82540 at PCI 32/33, idle 33MHz card on the same bus forcing it to 33MHz.
> - An 82541 at PCI 32/66.
> - An 82546 at PCI-X 64/100, NIC can do 133MHz but mobo only does 100MHz.
When extrapolating these numbers to the 0-byte packet case (which then
tells you the per-packet overhead), I get the following approximate numbers:
case overhead
phi-32-33-82540-2.6.9 1.86 us
phi-32-66-82541-2.6.9 1.41 us
phi-64-100-82546-2.6.9 1.45 us
phi-32-33-82540-2.6.9-feldman 1.48 us
phi-32-66-82541-2.6.9-feldman 1.13 us
phi-64-100-82546-2.6.9-feldman 1.25 us
Note that this figure doesn't differ all that much between the different
bus widths/speeds.
In any case, if I ever want to get more than ~880kpps on this hardware,
there's no other way than to make this overhead go down. For saturating
1Gb/s with 60B packets on 64/100, the overhead can't be more than ~0.59 us
per packet or you lose.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-02 6:13 ` Scott Feldman
2004-12-03 13:24 ` jamal
@ 2004-12-05 14:50 ` Lennert Buytenhek
2004-12-05 15:03 ` Martin Josefsson
1 sibling, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 14:50 UTC (permalink / raw)
To: Scott Feldman
Cc: jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Wed, Dec 01, 2004 at 10:13:33PM -0800, Scott Feldman wrote:
> Idea#3
>
> http://www.mail-archive.com/freebsd-net@freebsd.org/msg10826.html
>
> Set TXDMAC to 0 in e1000_configure_tx.
Enabling 'DMA packet prefetching' gives me an impressive boost in performance.
Combined with your TX clean rework, I now get 1.03Mpps TX performance at 60B
packets. Transmitting from both of the 82546 ports at the same time gives me
close to 2 Mpps.
The freebsd post hints that (some) e1000 hardware might be buggy w.r.t. this
prefetching though.
I'll play some more with the other ideas you suggested as well.
60 1036488
61 1037413
62 1036429
63 990239
64 993218
65 993233
66 993201
67 993234
68 993219
69 993208
70 992225
71 980560
--L
diff -ur e1000.orig/e1000_main.c e1000/e1000_main.c
--- e1000.orig/e1000_main.c 2004-12-04 11:43:12.000000000 +0100
+++ e1000/e1000_main.c 2004-12-05 15:40:49.284946897 +0100
@@ -879,6 +894,8 @@
E1000_WRITE_REG(&adapter->hw, TCTL, tctl);
+ E1000_WRITE_REG(&adapter->hw, TXDMAC, 0);
+
e1000_config_collision_dist(&adapter->hw);
/* Setup Transmit Descriptor Settings for eop descriptor */
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 14:50 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
@ 2004-12-05 15:03 ` Martin Josefsson
2004-12-05 15:15 ` Lennert Buytenhek
` (2 more replies)
0 siblings, 3 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 15:03 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
> Enabling 'DMA packet prefetching' gives me an impressive boost in performance.
> Combined with your TX clean rework, I now get 1.03Mpps TX performance at 60B
> packets. Transmitting from both of the 82546 ports at the same time gives me
> close to 2 Mpps.
>
> The freebsd post hints that (some) e1000 hardware might be buggy w.r.t. this
> prefetching though.
>
> I'll play some more with the other ideas you suggested as well.
>
> 60 1036488
I was just playing with prefetching when you sent your mail :)
I get that number with Scott's patch but without prefetching.
If I move the TDT update to the tx cleaning I get a few extra kpps, but
not much.
BUT if I use the above + prefetching I get this:
60 1483890
64 1418568
68 1356992
72 1300523
76 1248568
80 1142989
84 1140909
88 1114951
92 1076546
96 960732
100 949801
104 972876
108 945314
112 918380
116 891393
120 865923
124 843288
128 696465
Which is pretty nice :)
This is on one port of a 82546GB
The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
the nic is located in a 64/66 slot.
I won't post any patch until I've tested some more and cleaned up a few
things.
BTW, I also get some transmit timeouts with Scott's patch sometimes; not
often, but it does happen.
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:03 ` Martin Josefsson
@ 2004-12-05 15:15 ` Lennert Buytenhek
2004-12-05 15:19 ` Martin Josefsson
2004-12-05 15:42 ` Martin Josefsson
2004-12-05 21:12 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
2 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 15:15 UTC (permalink / raw)
To: Martin Josefsson
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, Dec 05, 2004 at 04:03:36PM +0100, Martin Josefsson wrote:
> BUT if I use the above + prefetching I get this:
>
> 60 1483890
> [snip]
>
> Which is pretty nice :)
Not just that, it's also wire speed GigE. Damn. Now we all have to go
and upgrade to 10GbE cards, and I don't think my girlfriend would give me
one of those for christmas.
> This is on one port of a 82546GB
>
> The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> the nic is located in a 64/66 slot.
Hmmm. Funny you get this number even on 64/66. How many PCI bridges
between the CPUs and the NIC? Any idea how many cycles an MMIO read on
your hardware is?
cheers,
Lennert
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:15 ` Lennert Buytenhek
@ 2004-12-05 15:19 ` Martin Josefsson
2004-12-05 15:30 ` Martin Josefsson
0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 15:19 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
> > 60 1483890
> > [snip]
> >
> > Which is pretty nice :)
>
> Not just that, it's also wire speed GigE. Damn. Now we all have to go
> and upgrade to 10GbE cards, and I don't think my girlfriend would give me
> one of those for christmas.
Yes it is, and it's lovely to see.
You have to nerdify her so she sees the need for geeky hardware enough to
give you what you need :)
> > This is on one port of a 82546GB
> >
> > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> > the nic is located in a 64/66 slot.
>
> Hmmm. Funny you get this number even on 64/66. How many PCI bridges
> between the CPUs and the NIC? Any idea how many cycles an MMIO read on
> your hardware is?
I verified that I get the same results on a small wimpy 82540EM that runs
at 32/66 as well. Just about to see what I get at 32/33 with that card.
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:19 ` Martin Josefsson
@ 2004-12-05 15:30 ` Martin Josefsson
2004-12-05 17:00 ` Lennert Buytenhek
0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 15:30 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Martin Josefsson wrote:
> > > The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> > > the nic is located in a 64/66 slot.
> >
> > Hmmm. Funny you get this number even on 64/66. How many PCI bridges
> > between the CPUs and the NIC? Any idea how many cycles an MMIO read on
> > your hardware is?
>
> I verified that I get the same results on a small wimpy 82540EM that runs
> at 32/66 as well. Just about to see what I get at 32/33 with that card.
Just tested the 82540EM at 32/33 and it's a big difference.
60 350229
64 247037
68 219643
72 218205
76 216786
80 215386
84 214003
88 212638
92 211291
96 210004
100 208647
104 182461
108 181468
112 180453
116 179482
120 185472
124 188336
128 153743
Sorry, forgot to answer your other questions, I'm a bit excited at the
moment :)
The 64/66 bus on this motherboard is directly connected to the
northbridge. Here's the lspci output with the 82546GB nic attached
to the 64/66 bus and 82540EM nic connected to the 32/33 bus that hangs
off the southbridge:
00:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] System Controller (rev 11)
00:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-760 MP [IGD4-2P] AGP Bridge
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ISA (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-768 [Opus] IDE (rev 04)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] ACPI (rev 03)
00:08.0 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03)
00:08.1 Ethernet controller: Intel Corp. 82546GB Gigabit Ethernet Controller (rev 03)
00:10.0 PCI bridge: Advanced Micro Devices [AMD] AMD-768 [Opus] PCI (rev 05)
01:05.0 VGA compatible controller: Silicon Integrated Systems [SiS] 86C326 5598/6326 (rev 0b)
02:05.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0c)
02:06.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02)
02:08.0 Ethernet controller: Intel Corp. 82540EM Gigabit Ethernet Controller (rev 02)
And lspci -t
-[00]-+-00.0
+-01.0-[01]----05.0
+-07.0
+-07.1
+-07.3
+-08.0
+-08.1
\-10.0-[02]--+-05.0
+-06.0
\-08.0
I have no idea how expensive an MMIO read is on this machine; do you have
a relatively easy way to find out?
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:03 ` Martin Josefsson
2004-12-05 15:15 ` Lennert Buytenhek
@ 2004-12-05 15:42 ` Martin Josefsson
2004-12-05 16:48 ` Martin Josefsson
` (2 more replies)
2004-12-05 21:12 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
2 siblings, 3 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 15:42 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Martin Josefsson wrote:
[snip]
> BUT if I use the above + prefetching I get this:
>
> 60 1483890
[snip]
> This is on one port of a 82546GB
>
> The hardware is a dual Athlon MP 2000+ in an Asus A7M266-D motherboard and
> the nic is located in a 64/66 slot.
>
> I won't post any patch until I've tested some more and cleaned up a few
> things.
>
> BTW, I also get some transmit timouts with Scotts patch sometimes, not
> often but it does happen.
Here's the patch, not much more tested (it still gives some transmit
timeouts since it's Scott's patch + prefetching and delayed TDT updating).
And it's not cleaned up, but hey, that's development :)
The delayed TDT updating was a test; currently it delays the first packet
transmitted after a timer run by 1ms.
Would be interesting to see what other people get with this thing.
Lennert?
diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000.h linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000.h
--- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000.h 2004-12-04 18:16:53.000000000 +0100
+++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000.h 2004-12-05 15:12:25.000000000 +0100
@@ -101,7 +101,7 @@ struct e1000_adapter;
#define E1000_MAX_INTR 10
/* TX/RX descriptor defines */
-#define E1000_DEFAULT_TXD 256
+#define E1000_DEFAULT_TXD 4096
#define E1000_MAX_TXD 256
#define E1000_MIN_TXD 80
#define E1000_MAX_82544_TXD 4096
@@ -187,6 +187,7 @@ struct e1000_desc_ring {
/* board specific private data structure */
struct e1000_adapter {
+ struct timer_list tx_cleanup_timer;
struct timer_list tx_fifo_stall_timer;
struct timer_list watchdog_timer;
struct timer_list phy_info_timer;
@@ -222,6 +223,7 @@ struct e1000_adapter {
uint32_t tx_fifo_size;
atomic_t tx_fifo_stall;
boolean_t pcix_82544;
+ boolean_t tx_cleanup_scheduled;
/* RX */
struct e1000_desc_ring rx_ring;
diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_hw.h linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_hw.h
--- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_hw.h 2004-12-04 18:16:53.000000000 +0100
+++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_hw.h 2004-12-05 15:37:50.000000000 +0100
@@ -417,14 +417,12 @@ int32_t e1000_set_d3_lplu_state(struct e
/* This defines the bits that are set in the Interrupt Mask
* Set/Read Register. Each bit is documented below:
* o RXT0 = Receiver Timer Interrupt (ring 0)
- * o TXDW = Transmit Descriptor Written Back
* o RXDMT0 = Receive Descriptor Minimum Threshold hit (ring 0)
* o RXSEQ = Receive Sequence Error
* o LSC = Link Status Change
*/
#define IMS_ENABLE_MASK ( \
E1000_IMS_RXT0 | \
- E1000_IMS_TXDW | \
E1000_IMS_RXDMT0 | \
E1000_IMS_RXSEQ | \
E1000_IMS_LSC)
diff -X /home/gandalf/dontdiff.ny -urNp linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_main.c linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_main.c
--- linux-2.6.10-rc3.orig/drivers/net/e1000/e1000_main.c 2004-12-05 14:59:19.000000000 +0100
+++ linux-2.6.10-rc3.labbrouter/drivers/net/e1000/e1000_main.c 2004-12-05 15:40:11.000000000 +0100
@@ -131,7 +131,7 @@ static int e1000_set_mac(struct net_devi
static void e1000_irq_disable(struct e1000_adapter *adapter);
static void e1000_irq_enable(struct e1000_adapter *adapter);
static irqreturn_t e1000_intr(int irq, void *data, struct pt_regs *regs);
-static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter);
+static void e1000_clean_tx(unsigned long data);
#ifdef CONFIG_E1000_NAPI
static int e1000_clean(struct net_device *netdev, int *budget);
static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter,
@@ -286,6 +286,7 @@ e1000_down(struct e1000_adapter *adapter
e1000_irq_disable(adapter);
free_irq(adapter->pdev->irq, netdev);
+ del_timer_sync(&adapter->tx_cleanup_timer);
del_timer_sync(&adapter->tx_fifo_stall_timer);
del_timer_sync(&adapter->watchdog_timer);
del_timer_sync(&adapter->phy_info_timer);
@@ -522,6 +523,10 @@ e1000_probe(struct pci_dev *pdev,
e1000_get_bus_info(&adapter->hw);
+ init_timer(&adapter->tx_cleanup_timer);
+ adapter->tx_cleanup_timer.function = &e1000_clean_tx;
+ adapter->tx_cleanup_timer.data = (unsigned long) adapter;
+
init_timer(&adapter->tx_fifo_stall_timer);
adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall;
adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;
@@ -882,19 +887,16 @@ e1000_configure_tx(struct e1000_adapter
e1000_config_collision_dist(&adapter->hw);
/* Setup Transmit Descriptor Settings for eop descriptor */
- adapter->txd_cmd = E1000_TXD_CMD_IDE | E1000_TXD_CMD_EOP |
+ adapter->txd_cmd = E1000_TXD_CMD_EOP |
E1000_TXD_CMD_IFCS;
- if(adapter->hw.mac_type < e1000_82543)
- adapter->txd_cmd |= E1000_TXD_CMD_RPS;
- else
- adapter->txd_cmd |= E1000_TXD_CMD_RS;
-
/* Cache if we're 82544 running in PCI-X because we'll
* need this to apply a workaround later in the send path. */
if(adapter->hw.mac_type == e1000_82544 &&
adapter->hw.bus_type == e1000_bus_type_pcix)
adapter->pcix_82544 = 1;
+
+ E1000_WRITE_REG(&adapter->hw, TXDMAC, 0);
}
/**
@@ -1707,7 +1709,7 @@ e1000_tx_queue(struct e1000_adapter *ada
wmb();
tx_ring->next_to_use = i;
- E1000_WRITE_REG(&adapter->hw, TDT, i);
+ /* E1000_WRITE_REG(&adapter->hw, TDT, i); */
}
/**
@@ -1809,6 +1811,11 @@ e1000_xmit_frame(struct sk_buff *skb, st
return NETDEV_TX_LOCKED;
}
+ if(!adapter->tx_cleanup_scheduled) {
+ adapter->tx_cleanup_scheduled = TRUE;
+ mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+ }
+
/* need: count + 2 desc gap to keep tail from touching
* head, otherwise try next time */
if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {
@@ -1845,6 +1852,7 @@ e1000_xmit_frame(struct sk_buff *skb, st
netdev->trans_start = jiffies;
spin_unlock_irqrestore(&adapter->tx_lock, flags);
+
return NETDEV_TX_OK;
}
@@ -2140,8 +2148,7 @@ e1000_intr(int irq, void *data, struct p
}
#else
for(i = 0; i < E1000_MAX_INTR; i++)
- if(unlikely(!e1000_clean_rx_irq(adapter) &
- !e1000_clean_tx_irq(adapter)))
+ if(unlikely(!e1000_clean_rx_irq(adapter)))
break;
#endif
@@ -2159,18 +2166,15 @@ e1000_clean(struct net_device *netdev, i
{
struct e1000_adapter *adapter = netdev->priv;
int work_to_do = min(*budget, netdev->quota);
- int tx_cleaned;
int work_done = 0;
- tx_cleaned = e1000_clean_tx_irq(adapter);
e1000_clean_rx_irq(adapter, &work_done, work_to_do);
*budget -= work_done;
netdev->quota -= work_done;
- /* if no Rx and Tx cleanup work was done, exit the polling mode */
- if(!tx_cleaned || (work_done < work_to_do) ||
- !netif_running(netdev)) {
+ /* if no Rx cleanup work was done, exit the polling mode */
+ if((work_done < work_to_do) || !netif_running(netdev)) {
netif_rx_complete(netdev);
e1000_irq_enable(adapter);
return 0;
@@ -2181,66 +2185,76 @@ e1000_clean(struct net_device *netdev, i
#endif
/**
- * e1000_clean_tx_irq - Reclaim resources after transmit completes
- * @adapter: board private structure
+ * e1000_clean_tx - Reclaim resources after transmit completes
+ * @data: timer callback data (board private structure)
**/
-static boolean_t
-e1000_clean_tx_irq(struct e1000_adapter *adapter)
+static void
+e1000_clean_tx(unsigned long data)
{
+ struct e1000_adapter *adapter = (struct e1000_adapter *)data;
struct e1000_desc_ring *tx_ring = &adapter->tx_ring;
struct net_device *netdev = adapter->netdev;
struct pci_dev *pdev = adapter->pdev;
- struct e1000_tx_desc *tx_desc, *eop_desc;
struct e1000_buffer *buffer_info;
- unsigned int i, eop;
- boolean_t cleaned = FALSE;
+ unsigned int i, next;
+ int size = 0, count = 0;
+ uint32_t tx_head;
- i = tx_ring->next_to_clean;
- eop = tx_ring->buffer_info[i].next_to_watch;
- eop_desc = E1000_TX_DESC(*tx_ring, eop);
+ spin_lock(&adapter->tx_lock);
- while(eop_desc->upper.data & cpu_to_le32(E1000_TXD_STAT_DD)) {
- for(cleaned = FALSE; !cleaned; ) {
- tx_desc = E1000_TX_DESC(*tx_ring, i);
- buffer_info = &tx_ring->buffer_info[i];
+ E1000_WRITE_REG(&adapter->hw, TDT, tx_ring->next_to_use);
- if(likely(buffer_info->dma)) {
- pci_unmap_page(pdev,
- buffer_info->dma,
- buffer_info->length,
- PCI_DMA_TODEVICE);
- buffer_info->dma = 0;
- }
+ tx_head = E1000_READ_REG(&adapter->hw, TDH);
- if(buffer_info->skb) {
- dev_kfree_skb_any(buffer_info->skb);
- buffer_info->skb = NULL;
- }
+ i = next = tx_ring->next_to_clean;
- tx_desc->buffer_addr = 0;
- tx_desc->lower.data = 0;
- tx_desc->upper.data = 0;
+ while(i != tx_head) {
+ size++;
+ if(i == tx_ring->buffer_info[next].next_to_watch) {
+ count += size;
+ size = 0;
+ if(unlikely(++i == tx_ring->count))
+ i = 0;
+ next = i;
+ } else {
+ if(unlikely(++i == tx_ring->count))
+ i = 0;
+ }
+ }
- cleaned = (i == eop);
- if(unlikely(++i == tx_ring->count)) i = 0;
+ i = tx_ring->next_to_clean;
+ while(count--) {
+ buffer_info = &tx_ring->buffer_info[i];
+
+ if(likely(buffer_info->dma)) {
+ pci_unmap_page(pdev,
+ buffer_info->dma,
+ buffer_info->length,
+ PCI_DMA_TODEVICE);
+ buffer_info->dma = 0;
}
-
- eop = tx_ring->buffer_info[i].next_to_watch;
- eop_desc = E1000_TX_DESC(*tx_ring, eop);
+
+ if(buffer_info->skb) {
+ dev_kfree_skb_any(buffer_info->skb);
+ buffer_info->skb = NULL;
+ }
+
+ if(unlikely(++i == tx_ring->count))
+ i = 0;
}
tx_ring->next_to_clean = i;
- spin_lock(&adapter->tx_lock);
+ if(E1000_DESC_UNUSED(tx_ring) != tx_ring->count)
+ mod_timer(&adapter->tx_cleanup_timer, jiffies + 1);
+ else
+ adapter->tx_cleanup_scheduled = FALSE;
- if(unlikely(cleaned && netif_queue_stopped(netdev) &&
- netif_carrier_ok(netdev)))
+ if(unlikely(netif_queue_stopped(netdev) && netif_carrier_ok(netdev)))
netif_wake_queue(netdev);
spin_unlock(&adapter->tx_lock);
-
- return cleaned;
}
/**
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:42 ` Martin Josefsson
@ 2004-12-05 16:48 ` Martin Josefsson
2004-12-05 17:01 ` Martin Josefsson
2004-12-05 17:58 ` Lennert Buytenhek
2004-12-05 17:44 ` Lennert Buytenhek
2004-12-08 23:36 ` Ray Lehtiniemi
2 siblings, 2 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 16:48 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Martin Josefsson wrote:
> The delayed TDT updating was a test; currently it delays the first packet
> transmitted after a timer run by 1ms.
I removed the delayed TDT updating and gave it a go again (this is scott +
prefetching):
60 1486193
64 1267639
68 1259682
72 1243997
76 1243989
80 1153608
84 1123813
88 1115047
92 1076636
96 1040792
100 1007252
104 975806
108 946263
112 918456
116 892227
120 867477
124 844052
128 821858
It gives slightly different results: 60 bytes is OK, but then it drops a lot
at 64 bytes, and the curve seems a bit flatter.
This should be the same driver that Lennert got 1.03Mpps with.
I get 1.03Mpps without prefetching.
I tried using both ports on the 82546GB nic.
delay nodelay
1CPU 1.95 Mpps 1.76 Mpps
2CPU 1.60 Mpps 1.44 Mpps
All tests were performed on an SMP kernel; the above mention of 1CPU vs 2CPU
just means how the two NICs were bound to the CPUs. And there are no
tx interrupts at all thanks to Scott's patch.
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:30 ` Martin Josefsson
@ 2004-12-05 17:00 ` Lennert Buytenhek
2004-12-05 17:11 ` Martin Josefsson
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 17:00 UTC (permalink / raw)
To: Martin Josefsson
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, Dec 05, 2004 at 04:30:47PM +0100, Martin Josefsson wrote:
> > I verified that I get the same results on a small whimpy 82540EM
> > that runs at 32/66 as well. Just about to see what I get at 32/33
> > with that card.
>
> Just tested the 82540EM at 32/33 and it's a big difference.
>
> 60 350229
> 64 247037
> 68 219643
> 72 218205
> 76 216786
> 80 215386
> 84 214003
> 88 212638
> 92 211291
> 96 210004
> 100 208647
> 104 182461
> 108 181468
> 112 180453
> 116 179482
> 120 185472
> 124 188336
> 128 153743
With or without prefetching? My 82540 in 32/33 mode gets on baseline
2.6.9:
60 431967
61 431311
62 431927
63 427827
64 427482
And with Scott's notxints patch:
60 514496
61 514493
62 514754
63 504629
64 504123
> Sorry, forgot to answer your other questions, I'm a bit excited at the
> moment :)
Makes sense :)
> The 64/66 bus on this motherboard is directly connected to the
> northbridge.
Your lspci output seems to suggest there is another PCI bridge in
between (00:10.0)
Basically on my box, it's CPU - MCH - P64H2 - e1000, where MCH is the
'Memory Controller Hub' and P64H2 the PCI-X bridge chip.
> I have no idea how expensive an MMIO read is on this machine; do you have
> a relatively easy way to find out?
A dirty way, yes ;-) Open up e1000_osdep.h and do:
-#define E1000_READ_REG(a, reg) ( \
- readl((a)->hw_addr + \
- (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
+#define E1000_READ_REG(a, reg) ({ \
+ unsigned long s, e, d, v; \
+\
+ (a)->mmio_reads++; \
+ rdtsc(s, d); \
+ v = readl((a)->hw_addr + \
+ (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
+ rdtsc(e, d); \
+ e -= s; \
+ printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
+ printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
+ dump_stack(); \
+ v; \
+})
You might want to disable the stack dump of course.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 16:48 ` Martin Josefsson
@ 2004-12-05 17:01 ` Martin Josefsson
2004-12-05 17:58 ` Lennert Buytenhek
1 sibling, 0 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 17:01 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Martin Josefsson wrote:
> I removed the delayed TDT updating and gave it a go again (this is scott +
> prefetching):
>
> 60 1486193
> 64 1267639
> 68 1259682
Yet another mail, I hope you are using a NAPI-enabled MUA :)
This time I tried vanilla + prefetch and it gave pretty nice performance
as well:
60 1308047
64 1076044
68 1079377
72 1058993
76 1055708
80 1025659
84 1024692
88 1024236
92 1024510
96 1012853
100 1007925
104 976500
108 947061
112 919169
116 892804
120 868084
124 844609
128 822381
Large gap between 60 and 64 bytes; maybe the prefetching only prefetches
32 bytes at a time?
As a reference: here's a completely vanilla e1000 driver:
60 860931
64 772949
68 754738
72 754200
76 756093
80 756398
84 742111
88 738120
92 740426
96 739720
100 722322
104 729287
108 719312
112 723171
116 705551
120 704843
124 704622
128 665863
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 17:00 ` Lennert Buytenhek
@ 2004-12-05 17:11 ` Martin Josefsson
2004-12-05 17:38 ` Martin Josefsson
0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 17:11 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
> > Just tested the 82540EM at 32/33 and it's a big difference.
> >
> > 60 350229
> > 64 247037
> > 68 219643
[snip]
> With or without prefetching? My 82540 in 32/33 mode gets on baseline
> 2.6.9:
With, will test without. I've always suspected that the 32bit bus on this
motherboard is a bit slow.
> Your lspci output seems to suggest there is another PCI bridge in
> between (00:10.0)
Yes, it sits between the 32-bit and the 64-bit buses.
> Basically on my box, it's CPU - MCH - P64H2 - e1000, where MCH is the
> 'Memory Controller Hub' and P64H2 the PCI-X bridge chip.
I don't have PCI-X (unless 64/66 counts as PCI-X, which I highly doubt).
> > I have no idea how expensive an MMIO read is on this machine; do you have
> > a relatively easy way to find out?
>
> A dirty way, yes ;-) Open up e1000_osdep.h and do:
>
> -#define E1000_READ_REG(a, reg) ( \
> - readl((a)->hw_addr + \
> - (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
> +#define E1000_READ_REG(a, reg) ({ \
> + unsigned long s, e, d, v; \
> +\
> + (a)->mmio_reads++; \
> + rdtsc(s, d); \
> + v = readl((a)->hw_addr + \
> + (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
> + rdtsc(e, d); \
> + e -= s; \
> + printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
> + printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
> + dump_stack(); \
> + v; \
> +})
>
> You might want to disable the stack dump of course.
Will test this in a while.
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 17:11 ` Martin Josefsson
@ 2004-12-05 17:38 ` Martin Josefsson
2004-12-05 18:14 ` Lennert Buytenhek
0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 17:38 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Martin Josefsson wrote:
> > -#define E1000_READ_REG(a, reg) ( \
> > - readl((a)->hw_addr + \
> > - (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)))
> > +#define E1000_READ_REG(a, reg) ({ \
> > + unsigned long s, e, d, v; \
> > +\
> > + (a)->mmio_reads++; \
> > + rdtsc(s, d); \
> > + v = readl((a)->hw_addr + \
> > + (((a)->mac_type >= e1000_82543) ? E1000_##reg : E1000_82542_##reg)); \
> > + rdtsc(e, d); \
> > + e -= s; \
> > + printk(KERN_INFO "e1000: MMIO read took %ld clocks\n", e); \
> > + printk(KERN_INFO "e1000: in process %d(%s)\n", current->pid, current->comm); \
> > + dump_stack(); \
> > + v; \
> > +})
> >
> > You might want to disable the stack dump of course.
>
> Will test this in a while.
It gives pretty varied results.
This is during a pktgen run.
The machine is an Athlon MP 2000+ operating at 1667 MHz:
e1000: MMIO read took 481 clocks
e1000: MMIO read took 369 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 477 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 332 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 372 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 388 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 485 clocks
e1000: MMIO read took 317 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 337 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 409 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 334 clocks
e1000: MMIO read took 481 clocks
e1000: MMIO read took 316 clocks
e1000: MMIO read took 480 clocks
e1000: MMIO read took 11 clocks
e1000: MMIO read took 505 clocks
e1000: MMIO read took 359 clocks
e1000: MMIO read took 484 clocks
e1000: MMIO read took 337 clocks
e1000: MMIO read took 464 clocks
e1000: MMIO read took 504 clocks
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:42 ` Martin Josefsson
2004-12-05 16:48 ` Martin Josefsson
@ 2004-12-05 17:44 ` Lennert Buytenhek
2004-12-05 17:51 ` Lennert Buytenhek
2004-12-08 23:36 ` Ray Lehtiniemi
2 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 17:44 UTC (permalink / raw)
To: Martin Josefsson
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:
> The delayed TDT updating was a test; currently it delays the first packet
> transmitted after a timer run by 1ms.
>
> Would be interesting to see what other people get with this thing.
> Lennert?
I took Scott's notxints patch, added the prefetch bits and moved the
TDT updating to e1000_clean_tx as you did.
Slightly better than before, but not much:
60 1070157
61 1066610
62 1062088
63 991447
64 991546
65 991537
66 991449
67 990857
68 989882
69 991347
Regular TDT updating:
60 1037469
61 1038425
62 1037393
63 993143
64 992156
65 993137
66 992203
67 992165
68 992185
69 988249
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 17:44 ` Lennert Buytenhek
@ 2004-12-05 17:51 ` Lennert Buytenhek
2004-12-05 17:54 ` Martin Josefsson
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 17:51 UTC (permalink / raw)
To: Martin Josefsson
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, Dec 05, 2004 at 06:44:01PM +0100, Lennert Buytenhek wrote:
> On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:
>
> > The delayed TDT updating was a test; currently it delays the first packet
> > transmitted after a timer run by 1ms.
> >
> > Would be interesting to see what other people get with this thing.
> > Lennert?
>
> I took Scott's notxints patch, added the prefetch bits and moved the
> TDT updating to e1000_clean_tx as you did.
>
> Slightly better than before, but not much:
I've tested all packet sizes now, and delayed TDT updating once per jiffy
(instead of once per packet) indeed gives about 25kpps more on 60,61,62
byte packets, and is hardly worth it for bigger packets.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 17:51 ` Lennert Buytenhek
@ 2004-12-05 17:54 ` Martin Josefsson
2004-12-06 11:32 ` 1.03Mpps on e1000 (was: " jamal
0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-05 17:54 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
> I've tested all packet sizes now, and delayed TDT updating once per jiffy
> (instead of once per packet) indeed gives about 25kpps more on 60,61,62
> byte packets, and is hardly worth it for bigger packets.
Maybe we can't see any real gains here now; I wonder if it has any effect
when you have lots of NICs on the same bus. I mean, in theory it saves a
whole lot of traffic on the bus.
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 16:48 ` Martin Josefsson
2004-12-05 17:01 ` Martin Josefsson
@ 2004-12-05 17:58 ` Lennert Buytenhek
1 sibling, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 17:58 UTC (permalink / raw)
To: Martin Josefsson
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, Dec 05, 2004 at 05:48:34PM +0100, Martin Josefsson wrote:
> I tried using both ports on the 82546GB nic.
>
> delay nodelay
> 1CPU 1.95 Mpps 1.76 Mpps
> 2CPU 1.60 Mpps 1.44 Mpps
I get:
delay nodelay
1CPU 1837356 1837330
2CPU 2035060 1947424
So in your case using 2 CPUs degrades performance, in my case it
increases it. And TDT delaying/coalescing only improves performance
when using 2 CPUs, and even then only slightly (and only for <= 62B
packets.)
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 17:38 ` Martin Josefsson
@ 2004-12-05 18:14 ` Lennert Buytenhek
0 siblings, 0 replies; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 18:14 UTC (permalink / raw)
To: Martin Josefsson
Cc: Scott Feldman, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, Dec 05, 2004 at 06:38:05PM +0100, Martin Josefsson wrote:
> e1000: MMIO read took 481 clocks
> e1000: MMIO read took 369 clocks
> e1000: MMIO read took 481 clocks
> e1000: MMIO read took 11 clocks
> e1000: MMIO read took 477 clocks
> e1000: MMIO read took 316 clocks
Interesting. On a 1667MHz CPU, this is around ~0.28us per MMIO read
in the worst case. On my hardware (dual Xeon 2.4GHz), the best case
I've ever seen was ~0.83us.
This alone can make a hell of a difference, esp. for 60B packets.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:03 ` Martin Josefsson
2004-12-05 15:15 ` Lennert Buytenhek
2004-12-05 15:42 ` Martin Josefsson
@ 2004-12-05 21:12 ` Scott Feldman
2004-12-05 21:25 ` Lennert Buytenhek
2 siblings, 1 reply; 85+ messages in thread
From: Scott Feldman @ 2004-12-05 21:12 UTC (permalink / raw)
To: Martin Josefsson
Cc: Lennert Buytenhek, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 2004-12-05 at 07:03, Martin Josefsson wrote:
> BUT if I use the above + prefetching I get this:
>
> 60 1483890
Ok, proof that we can get to 1.4Mpps!
That's the good news.
The bad news is prefetching is potentially buggy as pointed out in the
freebsd note. Buggy as in the controller may hang. Sorry, I don't have
details on what conditions are necessary to cause a hang.
Would Martin or Lennert run these tests for a longer duration so we can
get some data, maybe adding in Rx? It could be that with the Tx
interrupts and descriptor write-backs removed, prefetching is OK. I don't
know. Intel?
Also, wouldn't it be great if someone wrote a document capturing all of
the accumulated knowledge for future generations?
-scott
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 21:12 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
@ 2004-12-05 21:25 ` Lennert Buytenhek
2004-12-06 1:23 ` 1.03Mpps on e1000 (was: " Scott Feldman
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-05 21:25 UTC (permalink / raw)
To: Scott Feldman
Cc: Martin Josefsson, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, Dec 05, 2004 at 01:12:22PM -0800, Scott Feldman wrote:
> Would Martin or Lennert run these test for a longer duration so we can
> get some data, maybe adding in Rx. It could be that removing the Tx
> interrupts and descriptor write-backs, prefetching may be ok. I don't
> know. Intel?
What your patch does is (correct me if I'm wrong):
- Masking TXDW, effectively preventing it from delivering TXdone ints.
- Not setting E1000_TXD_CMD_IDE in the TXD command field, which causes
the chip to 'ignore the TIDV' register, which is the 'TX Interrupt
Delay Value'. What exactly does this?
- Not setting the "Report Packet Sent"/"Report Status" bits in the TXD
command field. Is this the equivalent of the TXdone interrupt?
Just exactly which bit avoids the descriptor writeback?
I'm also a bit worried that only freeing packets 1ms later will mess up
socket accounting and such. Any ideas on that?
> Also, wouldn't it be great if someone wrote a document capturing all of
> the accumulated knowledge for future generations?
I'll volunteer for that.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
2004-12-05 21:25 ` Lennert Buytenhek
@ 2004-12-06 1:23 ` Scott Feldman
0 siblings, 0 replies; 85+ messages in thread
From: Scott Feldman @ 2004-12-06 1:23 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Martin Josefsson, jamal, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 2004-12-05 at 13:25, Lennert Buytenhek wrote:
> What your patch does is (correct me if I'm wrong):
> - Masking TXDW, effectively preventing it from delivering TXdone ints.
> - Not setting E1000_TXD_CMD_IDE in the TXD command field, which causes
> the chip to 'ignore the TIDV' register, which is the 'TX Interrupt
> Delay Value'. What exactly does this?
A descriptor with IDE, when written back, starts the Tx delay timer
countdown. Never setting IDE means the Tx delay timers never expire.
> - Not setting the "Report Packet Sent"/"Report Status" bits in the TXD
> command field. Is this the equivalent of the TXdone interrupt?
>
> Just exactly which bit avoids the descriptor writeback?
As the name implies, Report Status (RS) instructs the controller to
indicate the status of the descriptor by doing a write-back (DMA) to the
descriptor memory. The only status we care about is the "done"
indicator. By reading TDH (Tx head), we can figure out where hardware
is without reading the status of each descriptor. Since we don't need
status, we can turn off RS.
> I'm also a bit worried that only freeing packets 1ms later will mess up
> socket accounting and such. Any ideas on that?
Well, the timer solution is less than ideal, and any protocol that is
sensitive to getting Tx resources returned by the driver as quickly as
possible is not going to be happy. I don't know if 1ms is quick enough.
You could eliminate the timer by doing the cleanup first thing in
xmit_frame, but then you have two problems: 1) you might end up reading
TDH for each send, and that's going to be expensive; 2) calls to
xmit_frame might stop, leaving uncleaned work until xmit_frame is called
again.
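[Editor's note: Scott's first concern — reading TDH on every send being expensive — could be mitigated by rate-limiting the register read, e.g. only consulting TDH when the cached free-descriptor count drops below a low-water mark. A hypothetical userspace sketch, with illustrative names and thresholds:]

```c
#include <stdbool.h>

#define TX_RING_SIZE    256
#define CLEAN_THRESHOLD  32

/* Free descriptors between the software tail (next_to_use) and the
 * reclaim pointer (next_to_clean); one slot stays empty so a full
 * ring is distinguishable from an empty one. */
unsigned int tx_free_desc(unsigned int next_to_use,
                          unsigned int next_to_clean)
{
    return (next_to_clean - next_to_use - 1 + TX_RING_SIZE) % TX_RING_SIZE;
}

/* Only pay for a TDH register read when we are actually short on space. */
bool tx_should_read_tdh(unsigned int next_to_use,
                        unsigned int next_to_clean)
{
    return tx_free_desc(next_to_use, next_to_clean) < CLEAN_THRESHOLD;
}
```

This does not address the second problem (work left uncleaned when xmit_frame stops being called), which still needs a timer or interrupt fallback.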
-scott
-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://productguide.itmanagersjournal.com/
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
2004-12-05 17:54 ` Martin Josefsson
@ 2004-12-06 11:32 ` jamal
2004-12-06 12:11 ` Lennert Buytenhek
0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-06 11:32 UTC (permalink / raw)
To: Martin Josefsson
Cc: Lennert Buytenhek, Scott Feldman, Robert Olsson, P, mellia,
e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Sun, 2004-12-05 at 12:54, Martin Josefsson wrote:
> On Sun, 5 Dec 2004, Lennert Buytenhek wrote:
>
> > I've tested all packet sizes now, and delayed TDT updating once per jiffy
> > (instead of once per packet) indeed gives about 25kpps more on 60,61,62
> > byte packets, and is hardly worth it for bigger packets.
>
> Maybe we can't see any real gains here now, I wonder if it has any effect
> if you have lots of nics on the same bus. I mean, in theory it saves a
> whole lot of traffic on the bus.
>
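[Editor's note: the delayed TDT updating under discussion amounts to batching tail-register writes — each TDT write is a PCI transaction, so flushing once per batch instead of once per packet saves bus traffic. A hypothetical sketch (the actual patch flushed once per jiffy from a timer; the batch size and names here are illustrative):]

```c
#define TX_BATCH 8

/* Simulate queueing n packets; returns how many simulated TDT register
 * writes a per-batch flush policy performs (versus n for per-packet).
 * Note a real driver needs a timer fallback to flush partial batches. */
unsigned int tdt_writes_for(unsigned int n)
{
    unsigned int tail = 0, hw_tail = 0, writes = 0;
    for (unsigned int i = 0; i < n; i++) {
        tail++;
        if (tail - hw_tail >= TX_BATCH) {   /* batch full: flush */
            hw_tail = tail;                 /* stands in for writel(tail, TDT) */
            writes++;
        }
    }
    return writes;
}
```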
This sounds like really exciting stuff happening here over the weekend.
Scott, you had to leave Intel before giving us this tip? ;->
Someone correct me if i am wrong - but does it appear as if all these
changes are only useful on PCI but not PCI-X?
cheers,
jamal
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
2004-12-06 11:32 ` 1.03Mpps on e1000 (was: " jamal
@ 2004-12-06 12:11 ` Lennert Buytenhek
2004-12-06 12:20 ` jamal
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-06 12:11 UTC (permalink / raw)
To: jamal
Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia,
e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, Dec 06, 2004 at 06:32:37AM -0500, jamal wrote:
> Someone correct me if i am wrong - but does it appear as if all these
> changes are only useful on PCI but not PCI-X?
They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I
get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.
Martin gets the ~1Mpps number with just the tx rework, and even more
with TXDMAC=0 added in as well.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
2004-12-06 12:11 ` Lennert Buytenhek
@ 2004-12-06 12:20 ` jamal
2004-12-06 12:23 ` Lennert Buytenhek
0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-06 12:20 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia,
e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, 2004-12-06 at 07:11, Lennert Buytenhek wrote:
> On Mon, Dec 06, 2004 at 06:32:37AM -0500, jamal wrote:
>
> > Someone correct me if i am wrong - but does it appear as if all these
> > changes are only useful on PCI but not PCI-X?
>
> They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I
> get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.
>
> Martin gets the ~1Mpps number with just the tx rework, and even more
> with TXDMAC=0 added in as well.
Right, but so far when i scan the results all i see is PCI, not PCI-X.
Which of your (or Martin's) boards has PCI-X?
cheers,
jamal
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
2004-12-06 12:20 ` jamal
@ 2004-12-06 12:23 ` Lennert Buytenhek
2004-12-06 12:30 ` Martin Josefsson
0 siblings, 1 reply; 85+ messages in thread
From: Lennert Buytenhek @ 2004-12-06 12:23 UTC (permalink / raw)
To: jamal
Cc: Martin Josefsson, Scott Feldman, Robert Olsson, P, mellia,
e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, Dec 06, 2004 at 07:20:43AM -0500, jamal wrote:
> > > Someone correct me if i am wrong - but does it appear as if all these
> > > changes are only useful on PCI but not PCI-X?
> >
> > They are useful on PCI-X as well as regular PCI. On my 64/100 NIC I
> > get ~620kpps on 2.6.9, ~1Mpps with 2.6.9 plus tx rework plus TXDMAC=0.
> >
> > Martin gets the ~1Mpps number with just the tx rework, and even more
> > with TXDMAC=0 added in as well.
>
> Right, but so far when i scan the results all i see is PCI not PCI-X.
> Which of your (or Martins) boards has PCI-X?
I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin
was running at 64/133 PCI-X.
--L
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
2004-12-06 12:23 ` Lennert Buytenhek
@ 2004-12-06 12:30 ` Martin Josefsson
2004-12-06 13:11 ` jamal
0 siblings, 1 reply; 85+ messages in thread
From: Martin Josefsson @ 2004-12-06 12:30 UTC (permalink / raw)
To: Lennert Buytenhek
Cc: jamal, Scott Feldman, Robert Olsson, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
On Mon, 6 Dec 2004, Lennert Buytenhek wrote:
> > Right, but so far when i scan the results all i see is PCI not PCI-X.
> > Which of your (or Martins) boards has PCI-X?
>
> I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin
> was running at 64/133 PCI-X.
I don't have any motherboards with PCI-X so no :)
I'm running the 82546GB (dualport) at 64/66 and the 82540EM (desktop
adapter) at 32/66, both are able to send at wirespeed.
/Martin
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: Transmission limit)
2004-12-06 12:30 ` Martin Josefsson
@ 2004-12-06 13:11 ` jamal
[not found] ` <20041206132907.GA13411@xi.wantstofly.org>
0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2004-12-06 13:11 UTC (permalink / raw)
To: Martin Josefsson
Cc: Lennert Buytenhek, Scott Feldman, Robert Olsson, P, mellia,
e1000-devel, Jorge Manuel Finochietto, Giulio Galante, netdev
Hopefully someone will beat me to testing to see if our forwarding
capacity now goes up with this new recipe.
cheers,
jamal
On Mon, 2004-12-06 at 07:30, Martin Josefsson wrote:
> On Mon, 6 Dec 2004, Lennert Buytenhek wrote:
>
> > > Right, but so far when i scan the results all i see is PCI not PCI-X.
> > > Which of your (or Martins) boards has PCI-X?
> >
> > I've tested 32/33 PCI, 32/66 PCI, and 64/100 PCI-X. I _think_ Martin
> > was running at 64/133 PCI-X.
>
> I don't have any motherboards with PCI-X so no :)
> I'm running the 82546GB (dualport) at 64/66 and the 82540EM (desktop
> adapter) at 32/66, both are able to send at wirespeed.
>
> /Martin
>
>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
[not found] ` <16820.37049.396306.295878@robur.slu.se>
@ 2004-12-06 17:32 ` P
0 siblings, 0 replies; 85+ messages in thread
From: P @ 2004-12-06 17:32 UTC (permalink / raw)
To: Robert Olsson
Cc: Lennert Buytenhek, jamal, Martin Josefsson, Scott Feldman,
mellia, Jorge Manuel Finochietto, Giulio Galante, netdev
Robert Olsson wrote:
> Lennert Buytenhek writes:
> > On Mon, Dec 06, 2004 at 08:11:02AM -0500, jamal wrote:
> >
> > > Hopefully someone will beat me to testing to see if our forwarding
> > > capacity now goes up with this new recipe.
>
>
> A breakthrough: we can now send small packets at wire speed; it will make
> development and testing much easier...
It surely will!!
Just to recap, 2 people have been able to tx @ wire speed.
The original poster was able to receive at wire speed,
but could only TX at about 50% of wire speed.
It would be really cool if we could combine this
to bridge @ wire speed.
--
Pádraig Brady - http://www.pixelbeat.org
--
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit)
2004-12-05 15:42 ` Martin Josefsson
2004-12-05 16:48 ` Martin Josefsson
2004-12-05 17:44 ` Lennert Buytenhek
@ 2004-12-08 23:36 ` Ray Lehtiniemi
[not found] ` <41B825A5.2000009@draigBrady.com>
2 siblings, 1 reply; 85+ messages in thread
From: Ray Lehtiniemi @ 2004-12-08 23:36 UTC (permalink / raw)
To: Martin Josefsson
Cc: Lennert Buytenhek, Scott Feldman, jamal, Robert Olsson, P,
mellia, e1000-devel, Jorge Manuel Finochietto, Giulio Galante,
netdev
hello martin
On Sun, Dec 05, 2004 at 04:42:34PM +0100, Martin Josefsson wrote:
>
> Here's the patch, not much more tested (it still gives some transmit
> timeouts since it's scotts patch + prefetching and delayed TDT updating).
> And it's not cleaned up, but hey, that's development :)
>
> The delayed TDT updating was a test; currently it delays the first tx'd
> packet after a timer run by 1ms.
>
> Would be interesting to see what other people get with this thing.
> Lennert?
well, i'm brand new to gig ethernet, but i have access to some nice
hardware right now, so i decided to give your patch a try.
this is the average tx pps of 10 pktgen runs for each packet size:
60 1187589.1
64 601805.4
68 1115029.3
72 593096.4
76 1097761.1
80 587125.4
84 1098045.2
88 588159.1
92 1072124.8
96 582510.3
100 1008056.8
104 577898.0
108 946974.0
112 573719.2
116 892871.0
120 573072.5
124 844608.3
128 563685.7
any idea why the packet rates are cut in half for every other line?
pktgen is running with eth0 bound to CPU0 on this box:
NexGate NSA 2040G
Dual Xeon 3.06 GHz, HT enabled
1 GB PC3200 DDR SDRAM
Dual 82544EI
- on PCI-X 64 bit 133 MHz bus
- behind P64H2 bridge
- on hub channel D of E7501 chipset
thanks
--
----------------------------------------------------------------------
Ray L <rayl@mail.com>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000
[not found] ` <20041209161825.GA32454@mail.com>
@ 2004-12-09 17:12 ` P
[not found] ` <20041209164820.GB32454@mail.com>
1 sibling, 0 replies; 85+ messages in thread
From: P @ 2004-12-09 17:12 UTC (permalink / raw)
To: Ray Lehtiniemi; +Cc: netdev
Ray Lehtiniemi wrote:
> On Thu, Dec 09, 2004 at 10:15:01AM +0000, P@draigBrady.com wrote:
>
>>That is very interesting!
>>I'm guessing it's due to some alignment bug?
>>
>>Can you repeat for 60-68 ?
>
> certainly. here are the raw results, and a summary oprofile for
> 60-68.
>
> looking at the disassembly, it seems that the 'rdtsc' opcode
> at 0x46f3 is causing the problem?
Well that wasn't obvious to me :-)
I did some manipulation with sort/join
and came up with the following percentage changes.
Note that the % diff column adds to 37%.
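[Editor's note: the % diff column in the table below is simply the per-address sample share at 64 bytes minus the share at 60 bytes; a trivial sketch checking the top row of the table:]

```c
/* Percentage-point change in an address's sample share between the
 * 60-byte and 64-byte pktgen runs. */
double pct_diff(double share_at_60b, double share_at_64b)
{
    return share_at_64b - share_at_60b;
}
```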
address % @ 60b % @ 64b % diff
000046f5 14.6006 22.3856 7.785000 #instruction after rdtsc
00004737 15.0990 20.2242 5.125200 #instruction after rdtsc
0000474b 11.3857 12.9496 1.563900
00004726 1.5419 2.5867 1.044800
000046f7 0.6258 1.1922 0.566400
00004751 4.9377 5.4016 0.463900
000047a1 0.0118 0.4675 0.455700
00004739 1.2614 1.6962 0.434800
000044f7 1.0592 1.4506 0.391400
00004749 0.5467 0.9253 0.378600
0000475d 0.0879 0.1769 0.089000
0000445f 0.3785 0.4599 0.081400
000047c3 0.1003 0.1652 0.064900
000045cf 0.0804 0.1316 0.051200
000047aa 0.0048 0.0194 0.014600
000047bd 5.5e-04 0.0142 0.013650
000047b3 0.0106 0.0200 0.009400
00004598 0.0061 0.0147 0.008600
000045e9 0.0026 0.0103 0.007700
00004640 0.0692 0.0701 0.000900
00004465 0.0014 0.0020 0.000600
0000481b 4.3e-04 7.3e-04 0.000300
0000470e 6.1e-05 2.4e-04 0.000179
0000458d 1.2e-04 2.7e-04 0.000150
00004a47 1.8e-04 3.0e-04 0.000120
00004735 0.0085 0.0086 0.000100
00004745 1.2e-04 2.2e-04 0.000100
000047dd 0.0032 0.0033 0.000100
00004a49 0.0037 0.0038 0.000100
00004663 1.8e-04 2.7e-04 0.000090
0000489a 8.0e-04 8.9e-04 0.000090
00004514 9.2e-04 0.0010 0.000080
00004a61 6.1e-05 1.4e-04 0.000079
000046d4 6.1e-05 1.1e-04 0.000049
00004789 6.1e-05 1.1e-04 0.000049
00004683 1.2e-04 1.6e-04 0.000040
00004a51 1.8e-04 2.2e-04 0.000040
000047cc 9.2e-04 9.5e-04 0.000030
000045ba 6.8e-04 7.0e-04 0.000020
00004a36 6.1e-05 8.1e-05 0.000020
00004620 1.8e-04 1.9e-04 0.000010
0000474f 0.0042 0.0042 0.000000
0000466d 6.1e-05 5.4e-05 -0.000007
00004817 1.2e-04 1.1e-04 -0.000010
0000470c 4.9e-04 4.6e-04 -0.000030
000045eb 6.1e-05 2.7e-05 -0.000034
00004616 6.1e-05 2.7e-05 -0.000034
00004a1e 6.1e-05 2.7e-05 -0.000034
00004652 1.2e-04 8.1e-05 -0.000039
000047ee 1.2e-04 8.1e-05 -0.000039
00004685 1.2e-04 5.4e-05 -0.000066
00004894 3.1e-04 2.4e-04 -0.000070
00004714 6.1e-04 5.2e-04 -0.000090
00004524 1.2e-04 2.7e-05 -0.000093
0000467b 1.2e-04 2.7e-05 -0.000093
000046bb 1.2e-04 2.7e-05 -0.000093
00004446 0.0010 8.9e-04 -0.000110
0000488b 2.5e-04 1.4e-04 -0.000110
00004522 4.3e-04 2.7e-04 -0.000160
00004508 3.1e-04 1.4e-04 -0.000170
00004634 6.1e-04 4.3e-04 -0.000180
00004587 8.0e-04 6.0e-04 -0.000200
000047ae 0.0032 0.0030 -0.000200
00004440 5.5e-04 3.3e-04 -0.000220
00004459 0.0012 9.8e-04 -0.000220
00004506 9.2e-04 6.5e-04 -0.000270
000049ff 0.0021 0.0018 -0.000300
0000451c 0.0013 9.8e-04 -0.000320
000046c7 3.7e-04 2.7e-05 -0.000343
00004673 4.9e-04 1.1e-04 -0.000380
0000478f 4.9e-04 1.1e-04 -0.000380
00004450 0.0012 8.1e-04 -0.000390
00004541 6.1e-04 2.2e-04 -0.000390
000045a9 7.4e-04 3.5e-04 -0.000390
00004777 5.5e-04 1.6e-04 -0.000390
000047d0 6.8e-04 2.7e-04 -0.000410
00004457 0.0084 0.0079 -0.000500
000047ba 0.0018 0.0013 -0.000500
00004a6b 0.0031 0.0026 -0.000500
00004612 5.5e-04 2.7e-05 -0.000523
00004681 6.8e-04 1.4e-04 -0.000540
0000477b 7.4e-04 1.9e-04 -0.000550
00004503 0.0017 0.0011 -0.000600
000047df 0.0020 0.0014 -0.000600
000045b6 0.0010 3.8e-04 -0.000620
00004781 0.0010 3.8e-04 -0.000620
00004667 0.0012 5.2e-04 -0.000680
00004885 0.0015 8.1e-04 -0.000690
000045a3 0.0017 0.0010 -0.000700
000047da 0.0014 7.0e-04 -0.000700
00004747 8.6e-04 8.1e-05 -0.000779
0000446f 0.0151 0.0143 -0.000800
00004702 0.0019 0.0011 -0.000800
00004718 0.0157 0.0149 -0.000800
000047b6 0.0022 0.0014 -0.000800
00004a25 0.0054 0.0046 -0.000800
00004a65 0.0026 0.0018 -0.000800
0000477e 9.8e-04 1.4e-04 -0.000840
000045c8 0.0015 5.7e-04 -0.000930
00004543 0.0049 0.0039 -0.001000
00004604 0.0013 3.0e-04 -0.001000
00004787 0.0026 0.0016 -0.001000
00004a02 0.0018 7.6e-04 -0.001040
0000450e 0.0063 0.0052 -0.001100
0000465d 0.0022 0.0011 -0.001100
0000459d 0.0014 1.9e-04 -0.001210
0000464a 0.0017 3.8e-04 -0.001320
000047cf 0.0020 6.8e-04 -0.001320
00004a13 0.0016 1.1e-04 -0.001490
0000461e 0.0017 1.6e-04 -0.001540
000044ff 0.0040 0.0024 -0.001600
00004628 0.0020 3.5e-04 -0.001650
000045d5 0.0076 0.0055 -0.002100
00004638 0.0049 0.0027 -0.002200
00004650 0.0045 0.0021 -0.002400
00004632 0.0052 0.0026 -0.002600
00004769 0.0059 0.0033 -0.002600
00004444 0.0957 0.0930 -0.002700
00004610 0.0034 6.5e-04 -0.002750
000046fb 0.0097 0.0069 -0.002800
0000487f 0.0175 0.0146 -0.002900
000044f4 0.0071 0.0039 -0.003200
00004757 0.0068 0.0032 -0.003600
00004583 0.0176 0.0136 -0.004000
0000472d 0.0178 0.0138 -0.004000
00004624 0.0049 6.5e-04 -0.004250
00004700 0.0074 0.0029 -0.004500
00004763 0.0110 0.0059 -0.005100
00004755 0.0091 0.0037 -0.005400
000047b0 0.0201 0.0138 -0.006300
0000459b 0.0102 0.0035 -0.006700
000046fd 0.0146 0.0078 -0.006800
00004797 0.0253 0.0181 -0.007200
0000473f 0.0226 0.0153 -0.007300
0000476d 0.0253 0.0180 -0.007300
0000474d 0.0236 0.0152 -0.008400
000044f0 0.0191 0.0094 -0.009700
00004471 0.0332 0.0222 -0.011000
000046f3 0.0224 0.0112 -0.011200
0000472f 0.0221 0.0105 -0.011600
00004743 0.0146 0.0025 -0.012100
00004753 0.0311 0.0185 -0.012600
000044f9 0.0232 0.0100 -0.013200
000045f2 0.0781 0.0638 -0.014300
000045c0 0.0796 0.0632 -0.016400
000047a4 0.1020 0.0851 -0.016900
00004455 0.0468 0.0282 -0.018600
0000472a 0.0331 0.0140 -0.019100
00004720 0.0420 0.0228 -0.019200
00004741 0.0520 0.0255 -0.026500
0000460a 0.0296 6.8e-04 -0.028920
00004469 0.0696 0.0391 -0.030500
000047b8 0.0485 0.0164 -0.032100
00004771 0.0479 0.0151 -0.032800
000047d6 0.0634 0.0270 -0.036400
000045c2 0.1763 0.0500 -0.126300
0000488e 0.2228 0.0961 -0.126700
0000458f 0.2212 0.0932 -0.128000
00004709 0.8817 0.7529 -0.128800
0000479b 0.2469 0.1158 -0.131100
000047c6 0.2489 0.1103 -0.138600
00004775 0.2514 0.1124 -0.139000
00004657 0.2502 0.1105 -0.139700
0000444c 0.2555 0.1107 -0.144800
000045df 0.1822 0.0357 -0.146500
00004608 0.2596 0.1117 -0.147900
00004618 0.2635 0.1153 -0.148200
00004679 0.2580 0.1094 -0.148600
0000462c 0.2630 0.1134 -0.149600
00004594 0.2494 0.0958 -0.153600
000045f8 0.1934 0.0369 -0.156500
0000471a 0.8706 0.6718 -0.198800
000045e6 0.4986 0.2189 -0.279700
00004644 0.4393 0.1515 -0.287800
0000463c 0.5214 0.2247 -0.296700
000045fe 0.5160 0.2022 -0.313800
00004622 3.5942 1.5668 -2.027400
0000461c 3.6298 1.5695 -2.060300
00004716 19.2425 16.4027 -2.839800
00004600 5.2128 2.2837 -2.929100
000045b0 7.8500 3.3027 -4.547300
>
>
> it is worth noting that my box has become quite unstable since
> i started to use oprofile and pktgen together. sshd stops responding,
> and the network seems to go down. not sure what is happening there...
> this instability seems to be persisting across reboots, unfortunately...
>
>
>
>
>
>
> 60 bytes
> --------
>
> 60 1195259
> 60 1206652
> 60 1139822
> 60 1206650
> 60 1206654
> 60 1136447
> 60 1206651
> 60 1148050
> 60 1206504
> 60 1206653
>
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma samples % image name app name symbol name
> 00004337 1626886 57.5170 pktgen.ko pktgen pktgen_thread_worker
> c02f389d 282974 10.0043 vmlinux vmlinux _spin_lock
> c021adc0 219795 7.7706 vmlinux vmlinux e1000_clean_tx
> c02f3904 164371 5.8112 vmlinux vmlinux _spin_lock_bh
> c0219c74 160383 5.6702 vmlinux vmlinux e1000_xmit_frame
> c02f3870 124564 4.4038 vmlinux vmlinux _spin_trylock
> 000041d1 48511 1.7151 pktgen.ko pktgen next_to_run
> c02f399a 46205 1.6335 vmlinux vmlinux _spin_unlock_irqrestore
> c010c7d9 20876 0.7381 vmlinux vmlinux mark_offset_tsc
> c011fdb2 13116 0.4637 vmlinux vmlinux local_bh_enable
> c0107248 8166 0.2887 vmlinux vmlinux timer_interrupt
> c0103970 5607 0.1982 vmlinux vmlinux apic_timer_interrupt
> c010123a 5368 0.1898 vmlinux vmlinux default_idle
> c02f39a5 4256 0.1505 vmlinux vmlinux _spin_unlock_bh
> c0103c08 4042 0.1429 vmlinux vmlinux page_fault
> 0804ae00 3930 0.1389 oprofiled oprofiled sfile_find
> 0804aa10 3573 0.1263 oprofiled oprofiled get_file
>
>
>
> 64 bytes
> --------
>
> 64 606104
> 64 597737
> 64 594927
> 64 595531
> 64 606876
> 64 594751
> 64 595709
> 64 595070
> 64 606876
> 64 595600
>
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma samples % image name app name symbol name
> 00004337 3688998 68.9133 pktgen.ko pktgen pktgen_thread_worker
> c02f389d 519536 9.7053 vmlinux vmlinux _spin_lock
> c021adc0 271791 5.0773 vmlinux vmlinux e1000_clean_tx
> c0219c74 214428 4.0057 vmlinux vmlinux e1000_xmit_frame
> c02f3904 166334 3.1072 vmlinux vmlinux _spin_lock_bh
> c02f3870 127623 2.3841 vmlinux vmlinux _spin_trylock
> 000041d1 111650 2.0857 pktgen.ko pktgen next_to_run
> c02f399a 47428 0.8860 vmlinux vmlinux _spin_unlock_irqrestore
> c010c7d9 39586 0.7395 vmlinux vmlinux mark_offset_tsc
> c0107248 14671 0.2741 vmlinux vmlinux timer_interrupt
> c011fdb2 12926 0.2415 vmlinux vmlinux local_bh_enable
> c0103970 11778 0.2200 vmlinux vmlinux apic_timer_interrupt
> c010123a 9282 0.1734 vmlinux vmlinux default_idle
> 0804ae00 7449 0.1392 oprofiled oprofiled sfile_find
> 0804aa10 6387 0.1193 oprofiled oprofiled get_file
> 0804ac30 6234 0.1165 oprofiled oprofiled sfile_log_sample
> 0804f4b0 5852 0.1093 oprofiled oprofiled odb_insert
>
>
>
> 68 bytes
> --------
>
> 68 1124822
> 68 1124805
> 68 1090006
> 68 1124822
> 68 1089775
> 68 1124812
> 68 1123305
> 68 1091796
> 68 1124820
> 68 1087043
>
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma samples % image name app name symbol name
> 00004337 1753028 58.4510 pktgen.ko pktgen pktgen_thread_worker
> c02f389d 301835 10.0641 vmlinux vmlinux _spin_lock
> c021adc0 223405 7.4490 vmlinux vmlinux e1000_clean_tx
> c02f3904 167118 5.5722 vmlinux vmlinux _spin_lock_bh
> c0219c74 166016 5.5355 vmlinux vmlinux e1000_xmit_frame
> c02f3870 131516 4.3851 vmlinux vmlinux _spin_trylock
> 000041d1 56334 1.8783 pktgen.ko pktgen next_to_run
> c02f399a 46860 1.5624 vmlinux vmlinux _spin_unlock_irqrestore
> c010c7d9 26188 0.8732 vmlinux vmlinux mark_offset_tsc
> c011fdb2 12199 0.4068 vmlinux vmlinux local_bh_enable
> c0107248 10399 0.3467 vmlinux vmlinux timer_interrupt
> c010123a 8799 0.2934 vmlinux vmlinux default_idle
> c0103970 8194 0.2732 vmlinux vmlinux apic_timer_interrupt
> c0117346 4822 0.1608 vmlinux vmlinux find_busiest_group
> 0804ae00 4214 0.1405 oprofiled oprofiled sfile_find
> c02f39a5 3955 0.1319 vmlinux vmlinux _spin_unlock_bh
> 0804aa10 3745 0.1249 oprofiled oprofiled get_file
>
>
>
> here is the detailed breakdown for the 60 byte pktgen:
>
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma samples % image name app name symbol name
> 00004337 1626886 57.5170 pktgen.ko pktgen pktgen_thread_worker
> 00004440 9 5.5e-04
> 00004444 1557 0.0957
> 00004446 17 0.0010
> 0000444c 4156 0.2555
> 00004450 19 0.0012
> 00004455 762 0.0468
> 00004457 136 0.0084
> 00004459 20 0.0012
> 0000445f 6157 0.3785
> 00004465 23 0.0014
> 00004469 1133 0.0696
> 0000446f 246 0.0151
> 00004471 540 0.0332
> 000044f0 310 0.0191
> 000044f4 115 0.0071
> 000044f7 17232 1.0592
> 000044f9 377 0.0232
> 000044ff 65 0.0040
> 00004503 28 0.0017
> 00004506 15 9.2e-04
> 00004508 5 3.1e-04
> 0000450e 102 0.0063
> 00004514 15 9.2e-04
> 0000451c 21 0.0013
> 00004522 7 4.3e-04
> 00004524 2 1.2e-04
> 00004541 10 6.1e-04
> 00004543 79 0.0049
> 00004583 287 0.0176
> 00004587 13 8.0e-04
> 0000458d 2 1.2e-04
> 0000458f 3598 0.2212
> 00004594 4057 0.2494
> 00004598 100 0.0061
> 0000459b 166 0.0102
> 0000459d 22 0.0014
> 000045a3 28 0.0017
> 000045a9 12 7.4e-04
> 000045b0 127711 7.8500
> 000045b6 17 0.0010
> 000045ba 11 6.8e-04
> 000045c0 1295 0.0796
> 000045c2 2869 0.1763
> 000045c8 24 0.0015
> 000045cf 1308 0.0804
> 000045d5 123 0.0076
> 000045df 2964 0.1822
> 000045e6 8111 0.4986
> 000045e9 42 0.0026
> 000045eb 1 6.1e-05
> 000045f2 1271 0.0781
> 000045f8 3146 0.1934
> 000045fe 8395 0.5160
> 00004600 84807 5.2128
> 00004604 21 0.0013
> 00004608 4223 0.2596
> 0000460a 481 0.0296
> 00004610 55 0.0034
> 00004612 9 5.5e-04
> 00004616 1 6.1e-05
> 00004618 4287 0.2635
> 0000461a 3 1.8e-04
> 0000461c 59052 3.6298
> 0000461e 28 0.0017
> 00004620 3 1.8e-04
> 00004622 58473 3.5942
> 00004624 79 0.0049
> 00004628 33 0.0020
> 0000462c 4279 0.2630
> 00004632 84 0.0052
> 00004634 10 6.1e-04
> 00004638 80 0.0049
> 0000463c 8483 0.5214
> 00004640 1126 0.0692
> 00004644 7147 0.4393
> 0000464a 27 0.0017
> 00004650 73 0.0045
> 00004652 2 1.2e-04
> 00004657 4070 0.2502
> 0000465d 36 0.0022
> 00004663 3 1.8e-04
> 00004665 2 1.2e-04
> 00004667 20 0.0012
> 0000466d 1 6.1e-05
> 00004673 8 4.9e-04
> 00004679 4197 0.2580
> 0000467b 2 1.2e-04
> 00004681 11 6.8e-04
> 00004683 2 1.2e-04
> 00004685 2 1.2e-04
> 000046bb 2 1.2e-04
> 000046c1 2 1.2e-04
> 000046c7 6 3.7e-04
> 000046d4 1 6.1e-05
> 000046f3 365 0.0224
> 000046f5 237535 14.6006
> 000046f7 10181 0.6258
> 000046fb 157 0.0097
> 000046fd 238 0.0146
> 00004700 120 0.0074
> 00004702 31 0.0019
> 00004709 14344 0.8817
> 0000470c 8 4.9e-04
> 0000470e 1 6.1e-05
> 00004714 10 6.1e-04
> 00004716 313053 19.2425
> 00004718 255 0.0157
> 0000471a 14164 0.8706
> 00004720 683 0.0420
> 00004726 25085 1.5419
> 0000472a 538 0.0331
> 0000472d 290 0.0178
> 0000472f 359 0.0221
> 00004735 139 0.0085
> 00004737 245644 15.0990
> 00004739 20521 1.2614
> 0000473f 368 0.0226
> 00004741 846 0.0520
> 00004743 237 0.0146
> 00004745 2 1.2e-04
> 00004747 14 8.6e-04
> 00004749 8894 0.5467
> 0000474b 185233 11.3857
> 0000474d 384 0.0236
> 0000474f 69 0.0042
> 00004751 80331 4.9377
> 00004753 506 0.0311
> 00004755 148 0.0091
> 00004757 111 0.0068
> 0000475d 1430 0.0879
> 00004763 179 0.0110
> 00004769 96 0.0059
> 0000476d 411 0.0253
> 00004771 780 0.0479
> 00004775 4090 0.2514
> 00004777 9 5.5e-04
> 0000477b 12 7.4e-04
> 0000477e 16 9.8e-04
> 00004781 17 0.0010
> 00004787 43 0.0026
> 00004789 1 6.1e-05
> 0000478f 8 4.9e-04
> 00004797 412 0.0253
> 0000479b 4016 0.2469
> 000047a1 192 0.0118
> 000047a4 1660 0.1020
> 000047aa 78 0.0048
> 000047ae 52 0.0032
> 000047b0 327 0.0201
> 000047b3 173 0.0106
> 000047b6 35 0.0022
> 000047b8 789 0.0485
> 000047ba 29 0.0018
> 000047bd 9 5.5e-04
> 000047c3 1632 0.1003
> 000047c6 4049 0.2489
> 000047cc 15 9.2e-04
> 000047cf 33 0.0020
> 000047d0 11 6.8e-04
> 000047d6 1032 0.0634
> 000047da 22 0.0014
> 000047dd 52 0.0032
> 000047df 33 0.0020
> 000047ea 1 6.1e-05
> 000047ee 2 1.2e-04
> 000047f6 1 6.1e-05
> 000047ff 1 6.1e-05
> 00004809 1 6.1e-05
> 0000480e 1 6.1e-05
> 00004817 2 1.2e-04
> 0000481b 7 4.3e-04
> 0000487f 284 0.0175
> 00004885 24 0.0015
> 0000488b 4 2.5e-04
> 0000488e 3625 0.2228
> 00004894 5 3.1e-04
> 0000489a 13 8.0e-04
> 000049ff 34 0.0021
> 00004a02 30 0.0018
> 00004a04 4 2.5e-04
> 00004a0f 3 1.8e-04
> 00004a13 26 0.0016
> 00004a1e 1 6.1e-05
> 00004a25 88 0.0054
> 00004a36 1 6.1e-05
> 00004a47 3 1.8e-04
> 00004a49 60 0.0037
> 00004a51 3 1.8e-04
> 00004a61 1 6.1e-05
> 00004a65 42 0.0026
> 00004a6b 50 0.0031
>
>
>
> here is the detailed breakdown for the 64 byte pktgen:
>
> CPU: P4 / Xeon with 2 hyper-threads, speed 3067.25 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
> vma samples % image name app name symbol name
> 00004337 3688998 68.9133 pktgen.ko pktgen pktgen_thread_worker
> 00004440 12 3.3e-04
> 00004444 3431 0.0930
> 00004446 33 8.9e-04
> 0000444c 4082 0.1107
> 00004450 30 8.1e-04
> 00004455 1041 0.0282
> 00004457 292 0.0079
> 00004459 36 9.8e-04
> 0000445f 16964 0.4599
> 00004465 73 0.0020
> 00004469 1442 0.0391
> 0000446f 528 0.0143
> 00004471 818 0.0222
> 000044f0 347 0.0094
> 000044f4 145 0.0039
> 000044f7 53514 1.4506
> 000044f9 369 0.0100
> 000044ff 90 0.0024
> 00004503 41 0.0011
> 00004506 24 6.5e-04
> 00004508 5 1.4e-04
> 0000450e 192 0.0052
> 00004514 37 0.0010
> 00004516 5 1.4e-04
> 0000451c 36 9.8e-04
> 00004522 10 2.7e-04
> 00004524 1 2.7e-05
> 00004541 8 2.2e-04
> 00004543 144 0.0039
> 00004583 503 0.0136
> 00004587 22 6.0e-04
> 0000458d 10 2.7e-04
> 0000458f 3437 0.0932
> 00004594 3533 0.0958
> 00004598 541 0.0147
> 0000459b 129 0.0035
> 0000459d 7 1.9e-04
> 000045a3 38 0.0010
> 000045a9 13 3.5e-04
> 000045b0 121838 3.3027
> 000045b6 14 3.8e-04
> 000045ba 26 7.0e-04
> 000045c0 2330 0.0632
> 000045c2 1843 0.0500
> 000045c8 21 5.7e-04
> 000045cf 4855 0.1316
> 000045d5 203 0.0055
> 000045df 1317 0.0357
> 000045e6 8076 0.2189
> 000045e9 381 0.0103
> 000045eb 1 2.7e-05
> 000045f2 2355 0.0638
> 000045f8 1362 0.0369
> 000045fe 7460 0.2022
> 00004600 84246 2.2837
> 00004604 11 3.0e-04
> 00004608 4122 0.1117
> 0000460a 25 6.8e-04
> 00004610 24 6.5e-04
> 00004612 1 2.7e-05
> 00004614 1 2.7e-05
> 00004616 1 2.7e-05
> 00004618 4254 0.1153
> 0000461c 57898 1.5695
> 0000461e 6 1.6e-04
> 00004620 7 1.9e-04
> 00004622 57801 1.5668
> 00004624 24 6.5e-04
> 00004628 13 3.5e-04
> 0000462c 4185 0.1134
> 00004632 97 0.0026
> 00004634 16 4.3e-04
> 00004638 99 0.0027
> 0000463c 8288 0.2247
> 00004640 2585 0.0701
> 00004644 5590 0.1515
> 0000464a 14 3.8e-04
> 00004650 77 0.0021
> 00004652 3 8.1e-05
> 00004657 4077 0.1105
> 0000465d 41 0.0011
> 00004663 10 2.7e-04
> 00004667 19 5.2e-04
> 0000466d 2 5.4e-05
> 00004673 4 1.1e-04
> 00004679 4035 0.1094
> 0000467b 1 2.7e-05
> 00004681 5 1.4e-04
> 00004683 6 1.6e-04
> 00004685 2 5.4e-05
> 000046bb 1 2.7e-05
> 000046c7 1 2.7e-05
> 000046d4 4 1.1e-04
> 000046f3 415 0.0112
> 000046f5 825806 22.3856
> 000046f7 43980 1.1922
> 000046fb 256 0.0069
> 000046fd 286 0.0078
> 00004700 108 0.0029
> 00004702 41 0.0011
> 00004705 5 1.4e-04
> 00004709 27774 0.7529
> 0000470c 17 4.6e-04
> 0000470e 9 2.4e-04
> 00004714 19 5.2e-04
> 00004716 605096 16.4027
> 00004718 548 0.0149
> 0000471a 24782 0.6718
> 00004720 842 0.0228
> 00004726 95423 2.5867
> 0000472a 516 0.0140
> 0000472d 510 0.0138
> 0000472f 389 0.0105
> 00004735 316 0.0086
> 00004737 746069 20.2242
> 00004739 62574 1.6962
> 0000473f 565 0.0153
> 00004741 941 0.0255
> 00004743 91 0.0025
> 00004745 8 2.2e-04
> 00004747 3 8.1e-05
> 00004749 34135 0.9253
> 0000474b 477712 12.9496
> 0000474d 561 0.0152
> 0000474f 155 0.0042
> 00004751 199265 5.4016
> 00004753 684 0.0185
> 00004755 137 0.0037
> 00004757 119 0.0032
> 0000475d 6527 0.1769
> 00004763 217 0.0059
> 00004769 120 0.0033
> 0000476d 665 0.0180
> 00004771 558 0.0151
> 00004775 4148 0.1124
> 00004777 6 1.6e-04
> 0000477b 7 1.9e-04
> 0000477e 5 1.4e-04
> 00004781 14 3.8e-04
> 00004787 60 0.0016
> 00004789 4 1.1e-04
> 0000478f 4 1.1e-04
> 00004797 669 0.0181
> 0000479b 4271 0.1158
> 000047a1 17245 0.4675
> 000047a4 3138 0.0851
> 000047aa 716 0.0194
> 000047ae 112 0.0030
> 000047b0 508 0.0138
> 000047b3 736 0.0200
> 000047b6 53 0.0014
> 000047b8 604 0.0164
> 000047ba 47 0.0013
> 000047bd 525 0.0142
> 000047c3 6094 0.1652
> 000047c6 4068 0.1103
> 000047cc 35 9.5e-04
> 000047cf 25 6.8e-04
> 000047d0 10 2.7e-04
> 000047d6 995 0.0270
> 000047da 26 7.0e-04
> 000047dd 120 0.0033
> 000047df 50 0.0014
> 000047ee 3 8.1e-05
> 000047fa 1 2.7e-05
> 00004817 4 1.1e-04
> 0000481b 27 7.3e-04
> 0000487f 539 0.0146
> 00004885 30 8.1e-04
> 0000488b 5 1.4e-04
> 0000488e 3544 0.0961
> 00004894 9 2.4e-04
> 0000489a 33 8.9e-04
> 000049ff 67 0.0018
> 00004a02 28 7.6e-04
> 00004a11 1 2.7e-05
> 00004a13 4 1.1e-04
> 00004a18 3 8.1e-05
> 00004a1e 1 2.7e-05
> 00004a25 168 0.0046
> 00004a36 3 8.1e-05
> 00004a47 11 3.0e-04
> 00004a49 139 0.0038
> 00004a51 8 2.2e-04
> 00004a59 1 2.7e-05
> 00004a61 5 1.4e-04
> 00004a65 67 0.0018
> 00004a6b 97 0.0026
>
>
> and finally, here's the disassembly of the pktgen_thread_worker function:
>
>
> 00004337 <pktgen_Thread_worker>:
> 4337: 55 push %ebp
> 4338: 57 push %edi
> 4339: 56 push %esi
> 433a: 53 push %ebx
> 433b: bb 00 e0 ff ff mov $0xffffe000,%ebx
> 4340: 21 e3 and %esp,%ebx
> 4342: 83 ec 2c sub $0x2c,%esp
> 4345: 89 44 24 28 mov %eax,0x28(%esp)
> 4349: 8b b0 bc 02 00 00 mov 0x2bc(%eax),%esi
> 434f: c7 44 24 20 00 00 00 movl $0x0,0x20(%esp)
> 4356: 00
> 4357: c7 04 24 c2 06 00 00 movl $0x6c2,(%esp)
> 435e: 89 74 24 04 mov %esi,0x4(%esp)
> 4362: e8 fc ff ff ff call 4363 <pktgen_thread_worker+0x2c>
> 4367: 8b 03 mov (%ebx),%eax
> 4369: 8b 80 90 04 00 00 mov 0x490(%eax),%eax
> 436f: 05 04 05 00 00 add $0x504,%eax
> 4374: e8 fc ff ff ff call 4375 <pktgen_thread_worker+0x3e>
> 4379: 8b 03 mov (%ebx),%eax
> 437b: c7 80 94 04 00 00 ff movl $0xfffbbeff,0x494(%eax)
> 4382: be fb ff
> 4385: c7 80 98 04 00 00 ff movl $0xffffffff,0x498(%eax)
> 438c: ff ff ff
> 438f: e8 fc ff ff ff call 4390 <pktgen_thread_worker+0x59>
> 4394: 8b 03 mov (%ebx),%eax
> 4396: 8b 80 90 04 00 00 mov 0x490(%eax),%eax
> 439c: 05 04 05 00 00 add $0x504,%eax
> 43a1: e8 fc ff ff ff call 43a2 <pktgen_thread_worker+0x6b>
> 43a6: 89 f1 mov %esi,%ecx
> 43a8: ba 01 00 00 00 mov $0x1,%edx
> 43ad: d3 e2 shl %cl,%edx
> 43af: 8b 03 mov (%ebx),%eax
> 43b1: e8 fc ff ff ff call 43b2 <pktgen_thread_worker+0x7b>
> 43b6: 39 73 10 cmp %esi,0x10(%ebx)
> 43b9: 74 08 je 43c3 <pktgen_thread_worker+0x8c>
> 43bb: 0f 0b ud2a
> 43bd: 27 daa
> 43be: 0b b0 06 00 00 8b or 0x8b000006(%eax),%esi
> 43c4: 44 inc %esp
> 43c5: 24 28 and $0x28,%al
> 43c7: 8b 54 24 28 mov 0x28(%esp),%edx
> 43cb: 05 c0 02 00 00 add $0x2c0,%eax
> 43d0: 89 44 24 1c mov %eax,0x1c(%esp)
> 43d4: c7 82 c0 02 00 00 01 movl $0x1,0x2c0(%edx)
> 43db: 00 00 00
> 43de: 89 d0 mov %edx,%eax
> 43e0: 8b 4c 24 1c mov 0x1c(%esp),%ecx
> 43e4: 05 c4 02 00 00 add $0x2c4,%eax
> 43e9: 89 41 04 mov %eax,0x4(%ecx)
> 43ec: 89 41 08 mov %eax,0x8(%ecx)
> 43ef: 83 a2 b4 02 00 00 f0 andl $0xfffffff0,0x2b4(%edx)
> 43f6: 8b 03 mov (%ebx),%eax
> 43f8: 8b 80 a8 00 00 00 mov 0xa8(%eax),%eax
> 43fe: 89 82 b8 02 00 00 mov %eax,0x2b8(%edx)
> 4404: 8b 03 mov (%ebx),%eax
> 4406: 8b 80 a8 00 00 00 mov 0xa8(%eax),%eax
> 440c: 89 74 24 04 mov %esi,0x4(%esp)
> 4410: c7 04 24 a0 06 00 00 movl $0x6a0,(%esp)
> 4417: 89 44 24 08 mov %eax,0x8(%esp)
> 441b: e8 fc ff ff ff call 441c <pktgen_thread_worker+0xe5>
> 4420: 8b 44 24 28 mov 0x28(%esp),%eax
> 4424: 8b 80 b0 02 00 00 mov 0x2b0(%eax),%eax
> 442a: 89 44 24 24 mov %eax,0x24(%esp)
> 442e: 8b 03 mov (%ebx),%eax
> 4430: c7 00 01 00 00 00 movl $0x1,(%eax)
> 4436: f0 83 44 24 00 00 lock addl $0x0,0x0(%esp)
> 443c: 89 5c 24 18 mov %ebx,0x18(%esp)
> 4440: 8b 54 24 18 mov 0x18(%esp),%edx
> 4444: 8b 02 mov (%edx),%eax
> 4446: c7 00 00 00 00 00 movl $0x0,(%eax)
> 444c: 8b 44 24 28 mov 0x28(%esp),%eax
> 4450: e8 7c fd ff ff call 41d1 <next_to_run>
> 4455: 85 c0 test %eax,%eax
> 4457: 89 c6 mov %eax,%esi
> 4459: 0f 84 aa 03 00 00 je 4809 <pktgen_thread_worker+0x4d2>
> 445f: 8b 88 44 04 00 00 mov 0x444(%eax),%ecx
> 4465: 89 4c 24 14 mov %ecx,0x14(%esp)
> 4469: 8b b8 90 02 00 00 mov 0x290(%eax),%edi
> 446f: 85 ff test %edi,%edi
> 4471: 74 7d je 44f0 <pktgen_thread_worker+0x1b9>
> 4473: 0f 31 rdtsc
> 4475: 89 44 24 0c mov %eax,0xc(%esp)
> 4479: 89 54 24 10 mov %edx,0x10(%esp)
> 447d: 85 d2 test %edx,%edx
> 447f: 8b 1d 1c 00 00 00 mov 0x1c,%ebx
> 4485: 89 d1 mov %edx,%ecx
> 4487: 89 c5 mov %eax,%ebp
> 4489: 74 08 je 4493 <pktgen_thread_worker+0x15c>
> 448b: 89 d0 mov %edx,%eax
> 448d: 31 d2 xor %edx,%edx
> 448f: f7 f3 div %ebx
> 4491: 89 c1 mov %eax,%ecx
> 4493: 89 e8 mov %ebp,%eax
> 4495: f7 f3 div %ebx
> 4497: 89 ca mov %ecx,%edx
> 4499: 89 d3 mov %edx,%ebx
> 449b: 8b 96 b8 02 00 00 mov 0x2b8(%esi),%edx
> 44a1: 39 d3 cmp %edx,%ebx
> 44a3: 89 c1 mov %eax,%ecx
> 44a5: 8b 86 b4 02 00 00 mov 0x2b4(%esi),%eax
> 44ab: 77 37 ja 44e4 <pktgen_thread_worker+0x1ad>
> 44ad: 72 04 jb 44b3 <pktgen_thread_worker+0x17c>
> 44af: 39 c1 cmp %eax,%ecx
> 44b1: 73 31 jae 44e4 <pktgen_thread_worker+0x1ad>
> 44b3: 8b be b4 02 00 00 mov 0x2b4(%esi),%edi
> 44b9: 29 cf sub %ecx,%edi
> 44bb: 81 ff 0f 27 00 00 cmp $0x270f,%edi
> 44c1: 0f 86 ee 04 00 00 jbe 49b5 <pktgen_thread_worker+0x67e>
> 44c7: b9 d3 4d 62 10 mov $0x10624dd3,%ecx
> 44cc: 89 f8 mov %edi,%eax
> 44ce: f7 e1 mul %ecx
> 44d0: 89 d1 mov %edx,%ecx
> 44d2: 89 f2 mov %esi,%edx
> 44d4: c1 e9 06 shr $0x6,%ecx
> 44d7: 89 c8 mov %ecx,%eax
> 44d9: e8 10 e4 ff ff call 28ee <pg_udelay>
> 44de: 8b be 90 02 00 00 mov 0x290(%esi),%edi
> 44e4: 81 ff ff ff ff 7f cmp $0x7fffffff,%edi
> 44ea: 0f 84 d5 04 00 00 je 49c5 <pktgen_thread_worker+0x68e>
> 44f0: 8b 54 24 14 mov 0x14(%esp),%edx
> 44f4: 8b 42 24 mov 0x24(%edx),%eax
> 44f7: a8 01 test $0x1,%al
> 44f9: 0f 85 f4 01 00 00 jne 46f3 <pktgen_thread_worker+0x3bc>
> 44ff: 8b 4c 24 18 mov 0x18(%esp),%ecx
> 4503: 8b 41 08 mov 0x8(%ecx),%eax
> 4506: a8 08 test $0x8,%al
> 4508: 0f 85 e5 01 00 00 jne 46f3 <pktgen_thread_worker+0x3bc>
> 450e: 8b 86 c8 02 00 00 mov 0x2c8(%esi),%eax
> 4514: 85 c0 test %eax,%eax
> 4516: 0f 85 63 03 00 00 jne 487f <pktgen_thread_worker+0x548>
> 451c: 8b 96 40 04 00 00 mov 0x440(%esi),%edx
> 4522: 85 d2 test %edx,%edx
> 4524: 75 5d jne 4583 <pktgen_thread_worker+0x24c>
> 4526: 8b 86 c4 02 00 00 mov 0x2c4(%esi),%eax
> 452c: 83 c0 01 add $0x1,%eax
> 452f: 3b 86 e8 02 00 00 cmp 0x2e8(%esi),%eax
> 4535: 89 86 c4 02 00 00 mov %eax,0x2c4(%esi)
> 453b: 0f 83 5f 03 00 00 jae 48a0 <pktgen_thread_worker+0x569>
> 4541: 85 d2 test %edx,%edx
> 4543: 75 3e jne 4583 <pktgen_thread_worker+0x24c>
> 4545: f6 86 81 02 00 00 02 testb $0x2,0x281(%esi)
> 454c: 0f 84 87 03 00 00 je 48d9 <pktgen_thread_worker+0x5a2>
> 4552: 89 f2 mov %esi,%edx
> 4554: 8b 44 24 14 mov 0x14(%esp),%eax
> 4558: e8 fc f1 ff ff call 3759 <fill_packet_ipv6>
> 455d: 85 c0 test %eax,%eax
> 455f: 89 86 40 04 00 00 mov %eax,0x440(%esi)
> 4565: 0f 84 87 03 00 00 je 48f2 <pktgen_thread_worker+0x5bb>
> 456b: 83 86 bc 02 00 00 01 addl $0x1,0x2bc(%esi)
> 4572: c7 86 c4 02 00 00 00 movl $0x0,0x2c4(%esi)
> 4579: 00 00 00
> 457c: 83 96 c0 02 00 00 00 adcl $0x0,0x2c0(%esi)
> 4583: 8b 7c 24 14 mov 0x14(%esp),%edi
> 4587: 81 c7 2c 01 00 00 add $0x12c,%edi
> 458d: 89 f8 mov %edi,%eax
> 458f: e8 fc ff ff ff call 4590 <pktgen_thread_worker+0x259>
> 4594: 8b 54 24 14 mov 0x14(%esp),%edx
> 4598: 8b 42 24 mov 0x24(%edx),%eax
> 459b: a8 01 test $0x1,%al
> 459d: 0f 85 6c 03 00 00 jne 490f <pktgen_thread_worker+0x5d8>
> 45a3: 8b 86 40 04 00 00 mov 0x440(%esi),%eax
> 45a9: f0 ff 80 94 00 00 00 lock incl 0x94(%eax)
> 45b0: 8b 86 40 04 00 00 mov 0x440(%esi),%eax
> 45b6: 8b 54 24 14 mov 0x14(%esp),%edx
> 45ba: ff 92 6c 01 00 00 call *0x16c(%edx)
> 45c0: 85 c0 test %eax,%eax
> 45c2: 0f 85 37 04 00 00 jne 49ff <pktgen_thread_worker+0x6c8>
> 45c8: 83 86 9c 02 00 00 01 addl $0x1,0x29c(%esi)
> 45cf: 8b 86 2c 04 00 00 mov 0x42c(%esi),%eax
> 45d5: c7 86 c8 02 00 00 01 movl $0x1,0x2c8(%esi)
> 45dc: 00 00 00
> 45df: 83 96 a0 02 00 00 00 adcl $0x0,0x2a0(%esi)
> 45e6: 83 c0 04 add $0x4,%eax
> 45e9: 31 d2 xor %edx,%edx
> 45eb: 83 86 e4 02 00 00 01 addl $0x1,0x2e4(%esi)
> 45f2: 01 86 a4 02 00 00 add %eax,0x2a4(%esi)
> 45f8: 11 96 a8 02 00 00 adc %edx,0x2a8(%esi)
> 45fe: 0f 31 rdtsc
> 4600: 89 44 24 0c mov %eax,0xc(%esp)
> 4604: 89 54 24 10 mov %edx,0x10(%esp)
> 4608: 85 d2 test %edx,%edx
> 460a: 8b 1d 1c 00 00 00 mov 0x1c,%ebx
> 4610: 89 d1 mov %edx,%ecx
> 4612: 89 c5 mov %eax,%ebp
> 4614: 74 08 je 461e <pktgen_thread_worker+0x2e7>
> 4616: 89 d0 mov %edx,%eax
> 4618: 31 d2 xor %edx,%edx
> 461a: f7 f3 div %ebx
> 461c: 89 c1 mov %eax,%ecx
> 461e: 89 e8 mov %ebp,%eax
> 4620: f7 f3 div %ebx
> 4622: 89 ca mov %ecx,%edx
> 4624: 89 54 24 10 mov %edx,0x10(%esp)
> 4628: 89 44 24 0c mov %eax,0xc(%esp)
> 462c: 8b 8e 90 02 00 00 mov 0x290(%esi),%ecx
> 4632: 31 db xor %ebx,%ebx
> 4634: 01 4c 24 0c add %ecx,0xc(%esp)
> 4638: 11 5c 24 10 adc %ebx,0x10(%esp)
> 463c: 8b 54 24 0c mov 0xc(%esp),%edx
> 4640: 8b 4c 24 10 mov 0x10(%esp),%ecx
> 4644: 89 96 b4 02 00 00 mov %edx,0x2b4(%esi)
> 464a: 89 8e b8 02 00 00 mov %ecx,0x2b8(%esi)
> 4650: 89 f8 mov %edi,%eax
> 4652: e8 fc ff ff ff call 4653 <pktgen_thread_worker+0x31c>
> 4657: 8b 96 98 02 00 00 mov 0x298(%esi),%edx
> 465d: 8b 86 94 02 00 00 mov 0x294(%esi),%eax
> 4663: 89 d1 mov %edx,%ecx
> 4665: 09 c1 or %eax,%ecx
> 4667: 0f 84 f6 00 00 00 je 4763 <pktgen_thread_worker+0x42c>
> 466d: 8b 9e a0 02 00 00 mov 0x2a0(%esi),%ebx
> 4673: 8b 8e 9c 02 00 00 mov 0x29c(%esi),%ecx
> 4679: 39 d3 cmp %edx,%ebx
> 467b: 0f 82 e2 00 00 00 jb 4763 <pktgen_thread_worker+0x42c>
> 4681: 77 08 ja 468b <pktgen_thread_worker+0x354>
> 4683: 39 c1 cmp %eax,%ecx
> 4685: 0f 82 d8 00 00 00 jb 4763 <pktgen_thread_worker+0x42c>
> 468b: 8b 8e 40 04 00 00 mov 0x440(%esi),%ecx
> 4691: 8b 81 94 00 00 00 mov 0x94(%ecx),%eax
> 4697: 83 f8 01 cmp $0x1,%eax
> 469a: 74 4e je 46ea <pktgen_thread_worker+0x3b3>
> 469c: 0f 31 rdtsc
> 469e: 89 c7 mov %eax,%edi
> 46a0: 8b 81 94 00 00 00 mov 0x94(%ecx),%eax
> 46a6: 83 f8 01 cmp $0x1,%eax
> 46a9: 89 d5 mov %edx,%ebp
> 46ab: 74 2b je 46d8 <pktgen_thread_worker+0x3a1>
> 46ad: bb 00 e0 ff ff mov $0xffffe000,%ebx
> 46b2: 21 e3 and %esp,%ebx
> 46b4: eb 16 jmp 46cc <pktgen_thread_worker+0x395>
> 46b6: e8 fc ff ff ff call 46b7 <pktgen_thread_worker+0x380>
> 46bb: 8b 86 40 04 00 00 mov 0x440(%esi),%eax
> 46c1: 8b 80 94 00 00 00 mov 0x94(%eax),%eax
> 46c7: 83 f8 01 cmp $0x1,%eax
> 46ca: 74 0c je 46d8 <pktgen_thread_worker+0x3a1>
> 46cc: 8b 03 mov (%ebx),%eax
> 46ce: 8b 40 04 mov 0x4(%eax),%eax
> 46d1: 8b 40 08 mov 0x8(%eax),%eax
> 46d4: a8 04 test $0x4,%al
> 46d6: 74 de je 46b6 <pktgen_thread_worker+0x37f>
> 46d8: 0f 31 rdtsc
> 46da: 29 f8 sub %edi,%eax
> 46dc: 19 ea sbb %ebp,%edx
> 46de: 01 86 dc 02 00 00 add %eax,0x2dc(%esi)
> 46e4: 11 96 e0 02 00 00 adc %edx,0x2e0(%esi)
> 46ea: 89 f0 mov %esi,%eax
> 46ec: e8 2f fa ff ff call 4120 <pktgen_stop_device>
> 46f1: eb 70 jmp 4763 <pktgen_thread_worker+0x42c>
> 46f3: 0f 31 rdtsc
> 46f5: 89 d5 mov %edx,%ebp
> 46f7: 8b 54 24 14 mov 0x14(%esp),%edx
> 46fb: 89 c7 mov %eax,%edi
> 46fd: 8b 42 24 mov 0x24(%edx),%eax
> 4700: a8 02 test $0x2,%al
> 4702: 2e 74 e5 je,pn 46ea <pktgen_thread_worker+0x3b3>
> 4705: 8b 4c 24 18 mov 0x18(%esp),%ecx
> 4709: 8b 41 08 mov 0x8(%ecx),%eax
> 470c: a8 08 test $0x8,%al
> 470e: 0f 85 e1 02 00 00 jne 49f5 <pktgen_thread_worker+0x6be>
> 4714: 0f 31 rdtsc
> 4716: 29 f8 sub %edi,%eax
> 4718: 19 ea sbb %ebp,%edx
> 471a: 01 86 dc 02 00 00 add %eax,0x2dc(%esi)
> 4720: 11 96 e0 02 00 00 adc %edx,0x2e0(%esi)
> 4726: 8b 54 24 14 mov 0x14(%esp),%edx
> 472a: 8b 42 24 mov 0x24(%edx),%eax
> 472d: a8 01 test $0x1,%al
> 472f: 0f 84 d9 fd ff ff je 450e <pktgen_thread_worker+0x1d7>
> 4735: 0f 31 rdtsc
> 4737: 85 d2 test %edx,%edx
> 4739: 8b 1d 1c 00 00 00 mov 0x1c,%ebx
> 473f: 89 d1 mov %edx,%ecx
> 4741: 89 c7 mov %eax,%edi
> 4743: 74 08 je 474d <pktgen_thread_worker+0x416>
> 4745: 89 d0 mov %edx,%eax
> 4747: 31 d2 xor %edx,%edx
> 4749: f7 f3 div %ebx
> 474b: 89 c1 mov %eax,%ecx
> 474d: 89 f8 mov %edi,%eax
> 474f: f7 f3 div %ebx
> 4751: 89 ca mov %ecx,%edx
> 4753: 89 c1 mov %eax,%ecx
> 4755: 89 d3 mov %edx,%ebx
> 4757: 89 8e b4 02 00 00 mov %ecx,0x2b4(%esi)
> 475d: 89 9e b8 02 00 00 mov %ebx,0x2b8(%esi)
> 4763: 8b 96 c8 02 00 00 mov 0x2c8(%esi),%edx
> 4769: 8b 4c 24 24 mov 0x24(%esp),%ecx
> 476d: 01 54 24 20 add %edx,0x20(%esp)
> 4771: 39 4c 24 20 cmp %ecx,0x20(%esp)
> 4775: 76 20 jbe 4797 <pktgen_thread_worker+0x460>
> 4777: 8b 54 24 18 mov 0x18(%esp),%edx
> 477b: 8b 42 10 mov 0x10(%edx),%eax
> 477e: c1 e0 07 shl $0x7,%eax
> 4781: 8b b8 00 00 00 00 mov 0x0(%eax),%edi
> 4787: 85 ff test %edi,%edi
> 4789: 0f 85 1c 02 00 00 jne 49ab <pktgen_thread_worker+0x674>
> 478f: c7 44 24 20 00 00 00 movl $0x0,0x20(%esp)
> 4796: 00
> 4797: 8b 4c 24 28 mov 0x28(%esp),%ecx
> 479b: 8b 91 b4 02 00 00 mov 0x2b4(%ecx),%edx
> 47a1: f6 c2 01 test $0x1,%dl
> 47a4: 0f 85 7c 00 00 00 jne 4826 <pktgen_thread_worker+0x4ef>
> 47aa: 8b 4c 24 18 mov 0x18(%esp),%ecx
> 47ae: 8b 01 mov (%ecx),%eax
> 47b0: 8b 40 04 mov 0x4(%eax),%eax
> 47b3: 8b 40 08 mov 0x8(%eax),%eax
> 47b6: a8 04 test $0x4,%al
> 47b8: 75 6c jne 4826 <pktgen_thread_worker+0x4ef>
> 47ba: f6 c2 02 test $0x2,%dl
> 47bd: 0f 85 c7 01 00 00 jne 498a <pktgen_thread_worker+0x653>
> 47c3: f6 c2 04 test $0x4,%dl
> 47c6: 0f 85 9d 01 00 00 jne 4969 <pktgen_thread_worker+0x632>
> 47cc: 80 e2 08 and $0x8,%dl
> 47cf: 90 nop
> 47d0: 0f 85 7a 01 00 00 jne 4950 <pktgen_thread_worker+0x619>
> 47d6: 8b 54 24 18 mov 0x18(%esp),%edx
> 47da: 8b 42 08 mov 0x8(%edx),%eax
> 47dd: a8 08 test $0x8,%al
> 47df: 0f 84 5b fc ff ff je 4440 <pktgen_thread_worker+0x109>
> 47e5: e8 fc ff ff ff call 47e6 <pktgen_thread_worker+0x4af>
> 47ea: 8b 54 24 18 mov 0x18(%esp),%edx
> 47ee: 8b 02 mov (%edx),%eax
> 47f0: c7 00 00 00 00 00 movl $0x0,(%eax)
> 47f6: 8b 44 24 28 mov 0x28(%esp),%eax
> 47fa: e8 d2 f9 ff ff call 41d1 <next_to_run>
> 47ff: 85 c0 test %eax,%eax
> 4801: 89 c6 mov %eax,%esi
> 4803: 0f 85 56 fc ff ff jne 445f <pktgen_thread_worker+0x128>
> 4809: ba 64 00 00 00 mov $0x64,%edx
> 480e: 8b 44 24 1c mov 0x1c(%esp),%eax
> 4812: e8 fc ff ff ff call 4813 <pktgen_thread_worker+0x4dc>
> 4817: 8b 4c 24 28 mov 0x28(%esp),%ecx
> 481b: 8b 91 b4 02 00 00 mov 0x2b4(%ecx),%edx
> 4821: f6 c2 01 test $0x1,%dl
> 4824: 74 84 je 47aa <pktgen_thread_worker+0x473>
> 4826: 8b 5c 24 28 mov 0x28(%esp),%ebx
> 482a: c7 04 24 c8 06 00 00 movl $0x6c8,(%esp)
> 4831: 83 c3 0c add $0xc,%ebx
> 4834: 89 5c 24 04 mov %ebx,0x4(%esp)
> 4838: e8 fc ff ff ff call 4839 <pktgen_thread_worker+0x502>
> 483d: 8b 44 24 28 mov 0x28(%esp),%eax
> 4841: e8 f6 f9 ff ff call 423c <pktgen_stop>
> 4846: 89 5c 24 04 mov %ebx,0x4(%esp)
> 484a: c7 04 24 e8 06 00 00 movl $0x6e8,(%esp)
> 4851: e8 fc ff ff ff call 4852 <pktgen_thread_worker+0x51b>
> 4856: 8b 44 24 28 mov 0x28(%esp),%eax
> 485a: e8 1b fa ff ff call 427a <pktgen_rem_all_ifs>
> 485f: 89 5c 24 04 mov %ebx,0x4(%esp)
> 4863: c7 04 24 cc 06 00 00 movl $0x6cc,(%esp)
> 486a: e8 fc ff ff ff call 486b <pktgen_thread_worker+0x534>
> 486f: 8b 44 24 28 mov 0x28(%esp),%eax
> 4873: 83 c4 2c add $0x2c,%esp
> 4876: 5b pop %ebx
> 4877: 5e pop %esi
> 4878: 5f pop %edi
> 4879: 5d pop %ebp
> 487a: e9 27 fa ff ff jmp 42a6 <pktgen_rem_thread>
> 487f: 8b 96 40 04 00 00 mov 0x440(%esi),%edx
> 4885: 8b 86 c4 02 00 00 mov 0x2c4(%esi),%eax
> 488b: 83 c0 01 add $0x1,%eax
> 488e: 3b 86 e8 02 00 00 cmp 0x2e8(%esi),%eax
> 4894: 89 86 c4 02 00 00 mov %eax,0x2c4(%esi)
> 489a: 0f 82 a1 fc ff ff jb 4541 <pktgen_thread_worker+0x20a>
> 48a0: 85 d2 test %edx,%edx
> 48a2: 0f 84 9d fc ff ff je 4545 <pktgen_thread_worker+0x20e>
> 48a8: 8b 82 94 00 00 00 mov 0x94(%edx),%eax
> 48ae: 83 f8 01 cmp $0x1,%eax
> 48b1: 74 12 je 48c5 <pktgen_thread_worker+0x58e>
> 48b3: f0 ff 8a 94 00 00 00 lock decl 0x94(%edx)
> 48ba: 0f 94 c0 sete %al
> 48bd: 84 c0 test %al,%al
> 48bf: 0f 84 80 fc ff ff je 4545 <pktgen_thread_worker+0x20e>
> 48c5: 89 d0 mov %edx,%eax
> 48c7: e8 fc ff ff ff call 48c8 <pktgen_thread_worker+0x591>
> 48cc: f6 86 81 02 00 00 02 testb $0x2,0x281(%esi)
> 48d3: 0f 85 79 fc ff ff jne 4552 <pktgen_thread_worker+0x21b>
> 48d9: 89 f2 mov %esi,%edx
> 48db: 8b 44 24 14 mov 0x14(%esp),%eax
> 48df: e8 22 e7 ff ff call 3006 <fill_packet_ipv4>
> 48e4: 85 c0 test %eax,%eax
> 48e6: 89 86 40 04 00 00 mov %eax,0x440(%esi)
> 48ec: 0f 85 79 fc ff ff jne 456b <pktgen_thread_worker+0x234>
> 48f2: c7 04 24 08 07 00 00 movl $0x708,(%esp)
> 48f9: e8 fc ff ff ff call 48fa <pktgen_thread_worker+0x5c3>
> 48fe: e8 fc ff ff ff call 48ff <pktgen_thread_worker+0x5c8>
> 4903: 83 ae c4 02 00 00 01 subl $0x1,0x2c4(%esi)
> 490a: e9 54 fe ff ff jmp 4763 <pktgen_thread_worker+0x42c>
> 490f: c7 86 c8 02 00 00 00 movl $0x0,0x2c8(%esi)
> 4916: 00 00 00
> 4919: 0f 31 rdtsc
> 491b: 89 44 24 0c mov %eax,0xc(%esp)
> 491f: 89 54 24 10 mov %edx,0x10(%esp)
> 4923: 85 d2 test %edx,%edx
> 4925: 8b 1d 1c 00 00 00 mov 0x1c,%ebx
> 492b: 89 d1 mov %edx,%ecx
> 492d: 89 c5 mov %eax,%ebp
> 492f: 74 08 je 4939 <pktgen_thread_worker+0x602>
> 4931: 89 d0 mov %edx,%eax
> 4933: 31 d2 xor %edx,%edx
> 4935: f7 f3 div %ebx
> 4937: 89 c1 mov %eax,%ecx
> 4939: 89 e8 mov %ebp,%eax
> 493b: f7 f3 div %ebx
> 493d: 89 ca mov %ecx,%edx
> 493f: 89 86 b4 02 00 00 mov %eax,0x2b4(%esi)
> 4945: 89 96 b8 02 00 00 mov %edx,0x2b8(%esi)
> 494b: e9 00 fd ff ff jmp 4650 <pktgen_thread_worker+0x319>
> 4950: 8b 44 24 28 mov 0x28(%esp),%eax
> 4954: e8 21 f9 ff ff call 427a <pktgen_rem_all_ifs>
> 4959: 8b 44 24 28 mov 0x28(%esp),%eax
> 495d: 83 a0 b4 02 00 00 f7 andl $0xfffffff7,0x2b4(%eax)
> 4964: e9 6d fe ff ff jmp 47d6 <pktgen_thread_worker+0x49f>
> 4969: 8b 44 24 28 mov 0x28(%esp),%eax
> 496d: e8 d7 f2 ff ff call 3c49 <pktgen_run>
> 4972: 8b 4c 24 28 mov 0x28(%esp),%ecx
> 4976: 8b 91 b4 02 00 00 mov 0x2b4(%ecx),%edx
> 497c: 83 e2 fb and $0xfffffffb,%edx
> 497f: 89 91 b4 02 00 00 mov %edx,0x2b4(%ecx)
> 4985: e9 42 fe ff ff jmp 47cc <pktgen_thread_worker+0x495>
> 498a: 8b 44 24 28 mov 0x28(%esp),%eax
> 498e: e8 a9 f8 ff ff call 423c <pktgen_stop>
> 4993: 8b 44 24 28 mov 0x28(%esp),%eax
> 4997: 8b 90 b4 02 00 00 mov 0x2b4(%eax),%edx
> 499d: 83 e2 fd and $0xfffffffd,%edx
> 49a0: 89 90 b4 02 00 00 mov %edx,0x2b4(%eax)
> 49a6: e9 18 fe ff ff jmp 47c3 <pktgen_thread_worker+0x48c>
> 49ab: e8 fc ff ff ff call 49ac <pktgen_thread_worker+0x675>
> 49b0: e9 da fd ff ff jmp 478f <pktgen_thread_worker+0x458>
> 49b5: 89 f2 mov %esi,%edx
> 49b7: 89 f8 mov %edi,%eax
> 49b9: e8 cc de ff ff call 288a <nanospin>
> 49be: 89 f6 mov %esi,%esi
> 49c0: e9 19 fb ff ff jmp 44de <pktgen_thread_worker+0x1a7>
> 49c5: 0f 31 rdtsc
> 49c7: 85 d2 test %edx,%edx
> 49c9: 8b 1d 1c 00 00 00 mov 0x1c,%ebx
> 49cf: 89 d1 mov %edx,%ecx
> 49d1: 89 c7 mov %eax,%edi
> 49d3: 74 08 je 49dd <pktgen_thread_worker+0x6a6>
> 49d5: 89 d0 mov %edx,%eax
> 49d7: 31 d2 xor %edx,%edx
> 49d9: f7 f3 div %ebx
> 49db: 89 c1 mov %eax,%ecx
> 49dd: 89 f8 mov %edi,%eax
> 49df: f7 f3 div %ebx
> 49e1: 89 ca mov %ecx,%edx
> 49e3: 89 c1 mov %eax,%ecx
> 49e5: 89 d3 mov %edx,%ebx
> 49e7: 81 c1 ff ff ff 7f add $0x7fffffff,%ecx
> 49ed: 83 d3 00 adc $0x0,%ebx
> 49f0: e9 62 fd ff ff jmp 4757 <pktgen_thread_worker+0x420>
> 49f5: e8 fc ff ff ff call 49f6 <pktgen_thread_worker+0x6bf>
> 49fa: e9 15 fd ff ff jmp 4714 <pktgen_thread_worker+0x3dd>
> 49ff: 83 f8 ff cmp $0xffffffff,%eax
> 4a02: 75 14 jne 4a18 <pktgen_thread_worker+0x6e1>
> 4a04: 8b 4c 24 14 mov 0x14(%esp),%ecx
> 4a08: f6 81 59 01 00 00 10 testb $0x10,0x159(%ecx)
> 4a0f: 74 07 je 4a18 <pktgen_thread_worker+0x6e1>
> 4a11: f3 90 pause
> 4a13: e9 98 fb ff ff jmp 45b0 <pktgen_thread_worker+0x279>
> 4a18: 8b 86 40 04 00 00 mov 0x440(%esi),%eax
> 4a1e: f0 ff 88 94 00 00 00 lock decl 0x94(%eax)
> 4a25: 8b 2d 08 00 00 00 mov 0x8,%ebp
> 4a2b: 85 ed test %ebp,%ebp
> 4a2d: 75 4f jne 4a7e <pktgen_thread_worker+0x747>
> 4a2f: 83 86 ac 02 00 00 01 addl $0x1,0x2ac(%esi)
> 4a36: c7 86 c8 02 00 00 00 movl $0x0,0x2c8(%esi)
> 4a3d: 00 00 00
> 4a40: 83 96 b0 02 00 00 00 adcl $0x0,0x2b0(%esi)
> 4a47: 0f 31 rdtsc
> 4a49: 89 44 24 0c mov %eax,0xc(%esp)
> 4a4d: 89 54 24 10 mov %edx,0x10(%esp)
> 4a51: 85 d2 test %edx,%edx
> 4a53: 8b 1d 1c 00 00 00 mov 0x1c,%ebx
> 4a59: 89 d1 mov %edx,%ecx
> 4a5b: 89 c5 mov %eax,%ebp
> 4a5d: 74 08 je 4a67 <pktgen_thread_worker+0x730>
> 4a5f: 89 d0 mov %edx,%eax
> 4a61: 31 d2 xor %edx,%edx
> 4a63: f7 f3 div %ebx
> 4a65: 89 c1 mov %eax,%ecx
> 4a67: 89 e8 mov %ebp,%eax
> 4a69: f7 f3 div %ebx
> 4a6b: 89 ca mov %ecx,%edx
> 4a6d: 89 86 b4 02 00 00 mov %eax,0x2b4(%esi)
> 4a73: 89 96 b8 02 00 00 mov %edx,0x2b8(%esi)
> 4a79: e9 80 fb ff ff jmp 45fe <pktgen_thread_worker+0x2c7>
> 4a7e: e8 fc ff ff ff call 4a7f <pktgen_thread_worker+0x748>
> 4a83: 85 c0 test %eax,%eax
> 4a85: 74 a8 je 4a2f <pktgen_thread_worker+0x6f8>
> 4a87: c7 04 24 e9 06 00 00 movl $0x6e9,(%esp)
> 4a8e: e8 fc ff ff ff call 4a8f <pktgen_thread_worker+0x758>
> 4a93: eb 9a jmp 4a2f <pktgen_thread_worker+0x6f8>
>
--
Pádraig Brady - http://www.pixelbeat.org
--
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000
[not found] ` <20041209164820.GB32454@mail.com>
@ 2004-12-09 17:19 ` P
2004-12-09 23:25 ` Ray Lehtiniemi
0 siblings, 1 reply; 85+ messages in thread
From: P @ 2004-12-09 17:19 UTC (permalink / raw)
To: Ray Lehtiniemi
Ray Lehtiniemi wrote:
> On Thu, Dec 09, 2004 at 09:18:25AM -0700, Ray Lehtiniemi wrote:
>
>>it is worth noting that my box has become quite unstable since
>>i started to use oprofile and pktgen together. sshd stops responding,
>>and the network seems to go down. not sure what is happening there...
>>this instability seems to be persisting across reboots, unfortunately...
>
>
>
> ok, it seems that this is related to martin's e1000 patch, and i
> just hadn't noticed it before. rolling back the 1.2 Mpps patch
> seems to cure the problem.
>
> symptoms are a total freezeup of the e1000 interfaces. netstat
> -an shows a tcp connection for my ssh login to the box, with about
> 53K in the send-Q. /proc/net/tcp is empty, however.... i can
> reproduce this at will by doing
>
> # objdump -d /lib/modules/2.6.10-rc3-mp-rayl/kernel/net/core/pktgen.ko
>
> on that machine with the e1000-patched kernel running.
>
>
> if there's any diagnostic output i can generate that might tell
> me what's going wrong, let me know and i'll try to generate it.
can you send this again to netdev.
thanks.
--
Pádraig Brady - http://www.pixelbeat.org
--
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: 1.03Mpps on e1000
2004-12-09 17:19 ` P
@ 2004-12-09 23:25 ` Ray Lehtiniemi
0 siblings, 0 replies; 85+ messages in thread
From: Ray Lehtiniemi @ 2004-12-09 23:25 UTC (permalink / raw)
To: netdev
hi all
my apologies if this gets received twice... i originally sent a copy
of this using mutt's 'bounce' function, but i don't think that's
what i wanted to do.....
this is a bug report re: martin's e1000 patch. i'm seeing some lockups
under normal traffic loads that seem to go away if i revert the
patch. details below..
thanks
On Thu, Dec 09, 2004 at 05:19:55PM +0000, P@draigBrady.com wrote:
> Ray Lehtiniemi wrote:
> >On Thu, Dec 09, 2004 at 09:18:25AM -0700, Ray Lehtiniemi wrote:
> >
> >>it is worth noting that my box has become quite unstable since
> >>i started to use oprofile and pktgen together. sshd stops responding,
> >>and the network seems to go down. not sure what is happening there...
> >>this instability seems to be persisting across reboots, unfortunately...
> >
> >
> >
> >ok, it seems that this is related to martin's e1000 patch, and i
> >just hadn't noticed it before. rolling back the 1.2 Mpps patch
> >seems to cure the problem.
> >
> >symptoms are a total freezeup of the e1000 interfaces. netstat
> >-an shows a tcp connection for my ssh login to the box, with about
> >53K in the send-Q. /proc/net/tcp is empty, however.... i can
> >reproduce this at will by doing
> >
> > # objdump -d /lib/modules/2.6.10-rc3-mp-rayl/kernel/net/core/pktgen.ko
> >
> >on that machine with the e1000-patched kernel running.
> >
> >
> >if there's any diagnostic output i can generate that might tell
> >me what's going wrong, let me know and i'll try to generate it.
>
> can you send this again to netdev.
>
> thanks.
>
> --
> Pádraig Brady - http://www.pixelbeat.org
> --
--
----------------------------------------------------------------------
Ray L <rayl@mail.com>
^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: [E1000-devel] Transmission limit
2004-12-02 18:23 ` Robert Olsson
2004-12-02 23:25 ` Lennert Buytenhek
2004-12-03 5:23 ` Scott Feldman
@ 2004-12-10 16:24 ` Martin Josefsson
2 siblings, 0 replies; 85+ messages in thread
From: Martin Josefsson @ 2004-12-10 16:24 UTC (permalink / raw)
To: Robert Olsson
Cc: sfeldma, Lennert Buytenhek, jamal, P, mellia, e1000-devel,
Jorge Manuel Finochietto, Giulio Galante, netdev
[-- Attachment #1: Type: text/plain, Size: 1096 bytes --]
On Thu, 2004-12-02 at 19:23, Robert Olsson wrote:
> Hello!
>
> Below is little patch to clean skb at xmit. It's old jungle trick Jamal
> and I used w. tulip. Note we can now even decrease the size of TX ring.
Just a small unimportant note.
> --- drivers/net/e1000/e1000_main.c.orig 2004-12-01 13:59:36.000000000 +0100
> +++ drivers/net/e1000/e1000_main.c 2004-12-02 20:37:40.000000000 +0100
> @@ -1820,6 +1820,10 @@
> return NETDEV_TX_LOCKED;
> }
>
> +
> + if( adapter->tx_ring.next_to_use - adapter->tx_ring.next_to_clean > 80 )
> + e1000_clean_tx_ring(adapter);
> +
> /* need: count + 2 desc gap to keep tail from touching
> * head, otherwise try next time */
> if(E1000_DESC_UNUSED(&adapter->tx_ring) < count + 2) {
This patch is pretty broken, I doubt you want to call
e1000_clean_tx_ring(), I think you want some variant of
e1000_clean_tx_irq() :)
e1000_clean_tx_irq() takes adapter->tx_lock which e1000_xmit_frame()
also does so it will need some modification.
And it should use E1000_DESC_UNUSED as Scott pointed out.
--
/Martin
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 85+ messages in thread
end of thread, other threads:[~2004-12-10 16:24 UTC | newest]
Thread overview: 85+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1101467291.24742.70.camel@mellia.lipar.polito.it>
2004-11-26 14:05 ` [E1000-devel] Transmission limit P
2004-11-26 15:31 ` Marco Mellia
2004-11-26 19:56 ` jamal
2004-11-29 14:21 ` Marco Mellia
2004-11-30 13:46 ` jamal
2004-12-02 17:24 ` Marco Mellia
2004-11-26 20:06 ` jamal
2004-11-26 20:56 ` Lennert Buytenhek
2004-11-26 21:02 ` Lennert Buytenhek
2004-11-27 9:25 ` Harald Welte
[not found] ` <20041127111101.GC23139@xi.wantstofly.org>
2004-11-27 11:31 ` Harald Welte
2004-11-27 20:12 ` Cesar Marcondes
2004-11-29 8:53 ` Marco Mellia
2004-11-29 14:50 ` Lennert Buytenhek
2004-11-30 8:42 ` Marco Mellia
2004-12-01 12:25 ` jamal
2004-12-02 13:39 ` Marco Mellia
2004-12-03 13:07 ` jamal
2004-11-26 15:40 ` Robert Olsson
2004-11-26 15:59 ` Marco Mellia
2004-11-26 16:57 ` P
2004-11-26 20:01 ` jamal
2004-11-29 10:19 ` P
2004-11-29 13:09 ` Robert Olsson
2004-11-29 20:16 ` David S. Miller
2004-12-01 16:47 ` Robert Olsson
2004-11-30 13:31 ` jamal
2004-11-30 13:46 ` Lennert Buytenhek
2004-11-30 14:25 ` jamal
2004-12-01 0:11 ` Lennert Buytenhek
2004-12-01 1:09 ` Scott Feldman
2004-12-01 15:34 ` Robert Olsson
2004-12-01 16:49 ` Scott Feldman
2004-12-01 17:37 ` Robert Olsson
2004-12-02 17:54 ` Robert Olsson
2004-12-02 18:23 ` Robert Olsson
2004-12-02 23:25 ` Lennert Buytenhek
2004-12-03 5:23 ` Scott Feldman
2004-12-10 16:24 ` Martin Josefsson
2004-12-01 18:29 ` Lennert Buytenhek
2004-12-01 21:35 ` Lennert Buytenhek
2004-12-02 6:13 ` Scott Feldman
2004-12-03 13:24 ` jamal
2004-12-05 14:50 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Lennert Buytenhek
2004-12-05 15:03 ` Martin Josefsson
2004-12-05 15:15 ` Lennert Buytenhek
2004-12-05 15:19 ` Martin Josefsson
2004-12-05 15:30 ` Martin Josefsson
2004-12-05 17:00 ` Lennert Buytenhek
2004-12-05 17:11 ` Martin Josefsson
2004-12-05 17:38 ` Martin Josefsson
2004-12-05 18:14 ` Lennert Buytenhek
2004-12-05 15:42 ` Martin Josefsson
2004-12-05 16:48 ` Martin Josefsson
2004-12-05 17:01 ` Martin Josefsson
2004-12-05 17:58 ` Lennert Buytenhek
2004-12-05 17:44 ` Lennert Buytenhek
2004-12-05 17:51 ` Lennert Buytenhek
2004-12-05 17:54 ` Martin Josefsson
2004-12-06 11:32 ` 1.03Mpps on e1000 (was: " jamal
2004-12-06 12:11 ` Lennert Buytenhek
2004-12-06 12:20 ` jamal
2004-12-06 12:23 ` Lennert Buytenhek
2004-12-06 12:30 ` Martin Josefsson
2004-12-06 13:11 ` jamal
[not found] ` <20041206132907.GA13411@xi.wantstofly.org>
[not found] ` <16820.37049.396306.295878@robur.slu.se>
2004-12-06 17:32 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] " P
2004-12-08 23:36 ` Ray Lehtiniemi
[not found] ` <41B825A5.2000009@draigBrady.com>
[not found] ` <20041209161825.GA32454@mail.com>
2004-12-09 17:12 ` 1.03Mpps on e1000 P
[not found] ` <20041209164820.GB32454@mail.com>
2004-12-09 17:19 ` P
2004-12-09 23:25 ` Ray Lehtiniemi
2004-12-05 21:12 ` 1.03Mpps on e1000 (was: Re: [E1000-devel] Transmission limit) Scott Feldman
2004-12-05 21:25 ` Lennert Buytenhek
2004-12-06 1:23 ` 1.03Mpps on e1000 (was: " Scott Feldman
2004-12-02 17:31 ` [E1000-devel] Transmission limit Marco Mellia
2004-12-03 20:57 ` Lennert Buytenhek
2004-12-04 10:36 ` Lennert Buytenhek
2004-12-01 12:08 ` jamal
2004-12-01 15:24 ` Lennert Buytenhek
2004-11-26 17:58 ` Robert Olsson
2004-11-27 20:00 ` Lennert Buytenhek
2004-11-29 12:44 ` Marco Mellia
2004-11-29 15:19 ` Lennert Buytenhek
2004-11-29 17:32 ` Marco Mellia
2004-11-29 19:08 ` Lennert Buytenhek
2004-11-29 19:09 ` Lennert Buytenhek