linux-kernel.vger.kernel.org archive mirror
* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13 15:17 Herman Dierks
  2003-06-13 16:21 ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: Herman Dierks @ 2003-06-13 15:17 UTC (permalink / raw)
  To: Feldman, Scott
  Cc: David Gibson, linux-kernel, Anton Blanchard, Nancy J Milliner,
	Ricardo C Gonzalez, Brian Twichell


Scott, we don't copy large blocks like large send.
The code is just supposed to copy any packets 2K or less.

Actually, David jumped the gun here a bit, as we were going to write up a
more complete proposal for this.
I expect that this would help other archs as well.
The issue is the alignment and the problems it causes on the PCI bus.
In a nutshell, it takes several more DMA's to get the transfer done.  Not
sure if this would hold true on all platforms (i.e. whether it is a
function of the adapter and/or PCI bus) or if it is due to the bridge
chips (PHB's) on pSeries (PPC).
I think the only issue here that is PPC specific is that we have several
bridge chips in the IO path, so LATENCY is longer than if you have a direct
attached bus.  However, many platforms that need more PCI slots have to add
bridge chips into the path (required by the PCI architecture due to loads
on the bus, etc).  Thus, the extra DMA's required due to misalignment
lead to longer bus times if bridge chips are in the IO path.
I copied below a note from 3/16/2003 that shows the issues.  Note that the
main issue here is on the PCI bus (not PCI-X).
The reason is that PCI-X with MMRBC = 2048 reads larger chunks and this
helps.  However, we have PCI-X bus traces and this alignment also helps
PCI-X, just to a lesser extent with respect to performance.

I don't think the TCE is an issue in this one case because the driver was
already DMA'ing a single "linear" buffer to the adapter.
It was just aligned in such a way that it takes more DMA's to finish.
In AIX, we did the copy to get to a single DMA buffer (which Linux already
does) as it has the TCP/IP headers in one buffer and data in another buffer
(one or two in some cases for MTU 1500).   Thus we took two TCE lookups in
the AIX case so the copy saved us more.
Because we only copy 2K or less, the cost is minimal as the data is already
in the cache (typically)  from being copied into the kernel (non-zero copy
case).

So in summary I think this is mainly an alignment issue and it is likely to
help other platforms as well, so it should be considered for the driver.

Here is the older note on the PCI bus trace.    The receive alignment was
addressed in a different defect and I think has been taken care of.

 To:  Ricardo C Gonzalez/Austin/IBM@ibmus, Nancy J
Milliner/Austin/IBM@IBMUS, Anton Blanchard/Australia/IBM@IBMAU, Michael
Perez/Austin/IBM@IBMUS, Pat Buckland/Austin/IBM@IBMUS
cc:    Brian Twichell/Austin/IBM@IBMUS, Bill Buros/Austin/IBM@IBMUS, Binh
       Hua/Austin/IBM@IBMUS, Jim Cunningham/Austin/IBM@IBMUS, Brian M
       Rzycki/Austin/IBM@IBMUS
From:  Herman Dierks/Austin/IBM@IBMUS
Subject:    Goliad performance issues on linux, results of PCI bus traces


This note is partially regarding bug1589.  I will have to add some of this
info into that defect.
Any time you go looking for problems, you often find more than you went
looking for.  Such is probably the case here, so we may end up with a few
other defects if we want to track them separately.  We just looked at Linux
2.4, so I assume the same issues are in 2.5, but that would need to be
verified.

I have talked with Anton about issues 1 and 2 below so he will see about
fixing them so we can prototype the change and measure it.

Binh Hua came over and we ran a number of PCI bus traces using Brian &
Jim's bus analyzer over several days.  (Thanks guys.)
As this is just a "PCI" analyzer (and not a PCI-X analyzer), we could only
trace on the p660 (M2+) system.

After the first day of tracing, we saw all 256-byte transfers on the bus
for Memory Read Multiple (MRM) commands.
I thought those should be 512 and investigated the system FW level.
I found the FW on the machine was quite old (from 2001) and predated some
Goliad perf changes.
So I installed a 020910-level FW on the M2+.
The performance on AIX then jumped up 200 Mbits/sec.  The PCI bus analyzer
trace then showed we are doing 512-byte transfers as we should be.
On AIX, this took the PCI commands from 10 down to 3 to send one packet.
These are not CPU cycles but DMA bus cycles.
Thus you can see the large effect that saving a few commands can have on
performance.
The above paragraph is just a side note, and I will not put it into the
defect.
Its point is that the FW sets up the bridge chips based on the adapter,
and it is very important to have the right level of FW on the system when
looking at any performance issues.

Now that the M2+ had the right FW, we did a few more sniff tests and took
more bus traces, first on AIX and then the following day on Linux.

Cutting to the chase,  based on what we see in the traces, we see a couple
of performance problems.
 1) Linux is not aligning the transmit buffer on a cache line.
       M2+ to LER performance for LINUX is 777 Mbits while for AIX it is
859 Mbits.  We have large send (TSO) disabled on AIX to make this apples to
apples.
       So there is an 82 Mbit difference (10%) here that we feel is due to
the cache alignment.  See details below.
       With large send enabled, AIX gets 943 Mbits with a single session.

      The solution here is to align the transmit buffer onto a 128-byte
cache line.  This is probably being controlled up in the TCP stack above
the driver.
                If we have to fix this in the driver, it would mean another
copy.

 2) Linux is not aligning the receive buffer on a cache line.
      From the viewpoint of the number of PCI commands, Linux's alignment
takes fewer PCI commands.
     However, the bad news about the way it does the alignment is that the
memory controller will have to do "read-modify-write" operations to RAM to
update partial cache lines.  That is very costly on the memory subsystem
and will hurt us as we scale up the number of adapters and the memory
bandwidth becomes taxed.
This may need to be a new defect on just the e1000 driver (at least for PPC
and PPC64).

3)   We need to take a second look at the number of TX and RX descriptors
fetched by the adapter on Linux.  These values may be set up way too large.
     Binh thought that on AIX we just move a cache line's worth at a time.
This is a secondary issue (it generates bus traffic) and we need to
investigate a bit more before we open a defect on it.


DETAILS:

I will show the PCI commands for sending (reading from memory) for AIX
first and then for Linux 2.4.  This is with the correct level of FW on the
M2+.  The PCI commands are:
                  MRM  is Memory Read Multiple (i.e. multiple cache lines)
                  MRL  is Memory Read Line (i.e. one cache line of 128 bytes)
                  MR   is Memory Read (i.e. reading fewer bytes than one
cache line)

AIX sending one 1514 byte packet   (859 Mbits/sec rate)

      Address     PCI comand  Bytes xfered
      4800        MRM         512
      4A00        MRM         512
      4C00        MRM         490
                              -------
                              1514 bytes

Linux sending one 1514 byte packet   (777 Mbits/sec rate)

      Address     PCI command Bytes xfered
      420A8       MRM         344
      42200       MRM         512
      42400       MRM         512
      42600       MRL         128
      42680       MR           18
                              -----
                              1514 bytes, total time of 14.108 microseconds

So, notice what happened.  On AIX we move the data in 3 PCI commands.
On Linux, due to the alignment, it took 5 operations.
The bus latency on these RIO based IO drawer machines is very long.  Each
of these commands takes at least 700 nanoseconds to issue.
It can take longer to issue the command than to transfer the bytes.
So, we want to change Linux to align the buffer on a 128-byte cache
boundary.  That should get us parity with AIX.

Now let's look at the receive side (writes to memory)


AIX receiving one 1514 byte packet

      Address     PCI command Bytes xfered
      E800        MWI         512
      EA00        MWI         512
      EC00        MWI         384
      ED80        MW          106          (partial cache line)
                              -----
                              1514 bytes

LINUX receiving one 1514 byte packet

      Address     PCI command Bytes xfered
      2012        MWI         494          (partial cache line)
      2200        MWI         512
      2400        MWI         508          (partial cache line)
                              -----
                              1514 bytes, total time 6.496 microseconds

Linux does the xfer in 3 PCI commands.  However, it appears that due to
the partial cache lines (one at the start and one at the end) this will
cause one more read-modify-write of a cache line at the memory controller
to update the line.
We need to verify this with Pat or Mike (so I am copying them on this
note).
If so, then we should probably align the RX buffer to start on a cache
line and take the hit of one more PCI command.  It just depends on what we
want to optimize for, esp. on a busy server.




"Feldman, Scott" <scott.feldman@intel.com> on 06/12/2003 08:16:11 PM

To:    "David Gibson" <dwg@au1.ibm.com>
cc:    <linux-kernel@vger.kernel.org>, "Anton Blanchard" <anton@samba.org>,
       Nancy J Milliner/Austin/IBM@IBMUS, Herman Dierks/Austin/IBM@IBMUS,
       Ricardo C Gonzalez/Austin/IBM@ibmus
Subject:    RE: e1000 performance hack for ppc64 (Power4)



David, arch-specific code in the driver concerns me.  I would really
like to avoid any arch-specific code in the driver if at all possible.
Can we find a way to move this work to the arch-dependent areas?  This
doesn't seem to be an issue unique to e1000, so moving this up one level
should benefit other devices as well.  More questions below.

> Peculiarities in the PCI bridges on Power4 based ppc64 machines mean
> that unaligned DMAs are horribly slow.  This hits us hard on gigabit
> transfers, since the packets (starting from the MAC header) are
> usually only 2-byte aligned.

2-byte alignment is bad for ppc64, so what is the minimum alignment that
is good?  (I hope it's not 2K!)  What could you do at a higher level to
present properly aligned buffers to the driver?

> The patch below addresses this by copying and realigning packets into
> nicely 2k aligned buffers.  As well as fixing the alignment, this
> minimises the number of TCE lookups, and because we allocate the
> buffers with pci_alloc_consistent(), we avoid setting up and tearing
> down the TCE mappings for each packet.

If I'm understanding the patch correctly, you're saying unaligned DMA +
TCE lookup is more expensive than a data copy?  If we copy the data, we
lose some of the benefits of TSO and Zerocopy and h/w checksum
offloading!  What is more expensive, unaligned DMA or TCE?

 -scott





* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-16 18:56 Feldman, Scott
  0 siblings, 0 replies; 34+ messages in thread
From: Feldman, Scott @ 2003-06-16 18:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Herman Dierks, David S. Miller, ltd, Anton Blanchard, dwg,
	Linux Kernel Mailing List, Nancy J Milliner, Ricardo C Gonzalez,
	Brian Twichell, netdev

> Scott, would you be pleased if something was implemented out 
> of the driver, in generic net code?  Something that all the 
> drivers could use, even if nothing but e1000 used it for now.

I suppose the driver could unconditionally call something like
skb_realign_for_broken_hw, which is a nop on non-broken archs, but would
it make more sense to not have the driver mess with the skb at all?

-scott

* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-16 18:21 Feldman, Scott
  2003-06-16 18:30 ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: Feldman, Scott @ 2003-06-16 18:21 UTC (permalink / raw)
  To: Herman Dierks, David S. Miller
  Cc: ltd, anton, haveblue, dwg, linux-kernel, Nancy J Milliner,
	Ricardo C Gonzalez, Brian Twichell, netdev

Herman wrote:
> It's only the MTU 1500 case with non-TSO that we are 
> discussing here, so copying a few bytes is really not a big 
> deal as the data is already in cache from being copied into 
> the kernel.  If it lets the adapter run at speed, that's what 
> customers want and what we need. Granted, if the HW could 
> deal with this we would not have to, but that's not the case 
> today, so I want to spend a few CPU cycles to get best 
> performance. Again, if this is not done on other platforms, I 
> don't understand why you care.

I care because adding the arch-specific hack creates a maintenance issue
for me.

-scott


* Re: e1000 performance hack for ppc64 (Power4)
@ 2003-06-15 14:40 Herman Dierks
  2003-06-15 14:44 ` David S. Miller
  2003-06-16 16:17 ` Nivedita Singhvi
  0 siblings, 2 replies; 34+ messages in thread
From: Herman Dierks @ 2003-06-15 14:40 UTC (permalink / raw)
  To: David S. Miller
  Cc: ltd, anton, haveblue, scott.feldman, dwg, linux-kernel,
	Nancy J Milliner, Ricardo C Gonzalez, Brian Twichell, netdev


Look folks, we run 40 to 48 GigE adapters in a p690 32-way on AIX and
they basically all run at full speed, so let me see you try that on most of
these other boxes you are talking about.  Same adapter, same hardware
logic.
I have also seen what many of these other boxes you talk about do when data
or structures are not aligned on 64-bit boundaries.
The PPC HW does not have those 64-bit alignment issues.  So, each machine
has some warts.  I have yet to see a perfect one.

If you want a lot of PCI adapters on a box, it takes a number of bridge
chips and other IO links to do that.
Memory controllers like to deal with cache lines.
For larger packets, like jumbo frames or large send (TSO), the few added
DMA's are not an issue, as the packets are so large the DMAs soon get
aligned.  With TSO being the default, the small packet case becomes less
important anyway.  It's more an issue on 2.4, where TSO is not provided.
We also want this to run well if someone does not want to use TSO.

It's only the MTU 1500 case with non-TSO that we are discussing here, so
copying a few bytes is really not a big deal as the data is already in
cache from being copied into the kernel.  If it lets the adapter run at
speed, that's what customers want and what we need.
Granted, if the HW could deal with this we would not have to, but that's
not the case today, so I want to spend a few CPU cycles to get best
performance.
Again, if this is not done on other platforms, I don't understand why you
care.

If we have to do this for the PPC port, fine.  I have not seen any of you
suggest a better solution that works and would not be a worse hack to TCP
or other code.  Anton tried various other ideas before we fell back to
doing it the same way we did it in AIX.  This code is very localized and is
only used by platforms that need it.  Thus I don't see the big issue here.

Herman


"David S. Miller" <davem@redhat.com> on 06/14/2003 01:08:50 AM

To:    ltd@cisco.com
cc:    anton@samba.org, haveblue@us.ltcfwd.linux.ibm.com, Herman
       Dierks/Austin/IBM@IBMUS, scott.feldman@intel.com, dwg@au1.ibm.com,
       linux-kernel@vger.kernel.org, Nancy J Milliner/Austin/IBM@IBMUS,
       Ricardo C Gonzalez/Austin/IBM@ibmus, Brian
       Twichell/Austin/IBM@IBMUS, netdev@oss.sgi.com
Subject:    Re: e1000 performance hack for ppc64 (Power4)



   From: Lincoln Dale <ltd@cisco.com>
   Date: Sat, 14 Jun 2003 15:52:35 +1000

   can we have the TCP retransmit side take a performance hit if it
   needs to realign buffers?

You don't understand, the person who mangles the packet
must make the copy, not the person not doing the packet
modifications.

   for a "high performance app" requiring gigabit-type speeds,

...we probably won't be using ppc64 and e1000 cards, yes, I agree
:-)

Anton, go to the local computer store and pick up some tg3
cards or a bunch of Taiwan specials :-)





* Re: e1000 performance hack for ppc64 (Power4)
@ 2003-06-15 14:32 Herman Dierks
  0 siblings, 0 replies; 34+ messages in thread
From: Herman Dierks @ 2003-06-15 14:32 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Feldman, Scott, David S. Miller, haveblue, dwg, linux-kernel,
	Nancy J Milliner, Ricardo C Gonzalez, Brian Twichell, netdev


Anton, I think the option described below is intended to cause the adapter
to "get off on a cache line boundary" so when it restarts the DMA it will
be aligned.  This is for cases when the adapter has to get off, for example
due to FIFO full, etc.
Some adapters would get off on any boundary, and that then causes perf
issues when the DMA is restarted.
This is a good option, but I don't think it addresses what we need here,
as the host needs to ensure a DMA starts on a cache line.
Different adapter anyway, but I am just pointing out that even if e1000
had this it would not be the solution.


Anton Blanchard <anton@samba.org> on 06/13/2003 07:03:42 PM

To:    "Feldman, Scott" <scott.feldman@intel.com>
cc:    "David S. Miller" <davem@redhat.com>,
       haveblue@us.ltcfwd.linux.ibm.com, Herman Dierks/Austin/IBM@IBMUS,
       dwg@au1.ibm.com, linux-kernel@vger.kernel.org, Nancy J
       Milliner/Austin/IBM@IBMUS, Ricardo C Gonzalez/Austin/IBM@ibmus,
       Brian Twichell/Austin/IBM@IBMUS, netdev@oss.sgi.com
Subject:    Re: e1000 performance hack for ppc64 (Power4)




> I thought the answer was no, so I double checked with a couple of
> hardware guys, and the answer is still no.

Hi Scott,

That's a pity; the e100 docs on SourceForge show it can do what we want.
It would be nice if e1000 had this feature too :)

4.2.2 Read Align

The Read Align feature is aimed to enhance performance in cache line
oriented systems. Starting a PCI transaction in these systems on a
non-cache line aligned address may result in low  performance. To solve
this performance problem, the controller can be configured to terminate
Transmit DMA cycles on a cache line boundary, and start the next
transaction on a cache line aligned address. This  feature is enabled
when the Read Align Enable bit is set in device Configure command
(Section 6.4.2.3, "Configure (010b)").

If this bit is set, the device operates as follows:

* When the device is close to running out of resources on the Transmit
  DMA (in other words, the Transmit FIFO is almost full), it attempts to
  terminate the read transaction on the nearest cache line boundary when
  possible.

* When the arbitration counters feature is enabled (maximum Transmit DMA
  byte count value is set in configuration space), the device switches
  to other pending DMAs on cache line boundary only.





* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13 23:52 Feldman, Scott
  2003-06-13 23:52 ` David S. Miller
  2003-06-14  0:03 ` Anton Blanchard
  0 siblings, 2 replies; 34+ messages in thread
From: Feldman, Scott @ 2003-06-13 23:52 UTC (permalink / raw)
  To: Anton Blanchard, David S. Miller
  Cc: haveblue, hdierks, dwg, linux-kernel, milliner, ricardoz,
	twichell, netdev

> > Why not instead find out if it's possible to have the e1000 
> > fetch the entire cache line where the first byte of the
> > packet resides?  Even ancient designs like SunHME do that.
> 
> Rusty and I were wondering why the e1000 didn't do that exact thing.
> 
> Scott: is it possible to enable such a thing?

I thought the answer was no, so I double checked with a couple of
hardware guys, and the answer is still no.

-scott

* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13 22:13 Feldman, Scott
  0 siblings, 0 replies; 34+ messages in thread
From: Feldman, Scott @ 2003-06-13 22:13 UTC (permalink / raw)
  To: Herman Dierks
  Cc: David Gibson, linux-kernel, Anton Blanchard, Nancy J Milliner,
	Ricardo C Gonzalez, Brian Twichell

> So in summary I think this is mainly an alignment issue and 
> it is likely to help other platforms as well, so it should be 
> considered for the driver.

But we're applying the alignment constraint in the wrong direction.  The
e1000 h/w doesn't have an alignment constraint, it's your arch.  Dave
Hansen's suggestion is in the right direction; let's see what Anton's
response is.

-scott

* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13 17:03 Herman Dierks
  0 siblings, 0 replies; 34+ messages in thread
From: Herman Dierks @ 2003-06-13 17:03 UTC (permalink / raw)
  To: haveblue
  Cc: Feldman, Scott, David Gibson, Linux Kernel Mailing List,
	Anton Blanchard, Nancy J Milliner, Ricardo C Gonzalez,
	Brian Twichell, netdev


I will let Anton respond to this.  I think he may have tried this some
time back in his early prototypes to fix this.
I think the problem was not where the buffer started but where the packet
ended up within the buffer.
Due to varying sizes of TCP and IP headers, the packet ended up at some
non-cache-aligned address.
What we need for the DMA to work well is to have the final packet (with
datalink headers) starting on a cache line, as it is the final packet that
must be DMA'd.  In fact it may need to be aligned to a higher level than
that (not sure).


haveblue@us.ltcfwd.linux.ibm.com on 06/13/2003 11:21:03 AM

To:    Herman Dierks/Austin/IBM@IBMUS
cc:    "Feldman, Scott" <scott.feldman@intel.com>, David Gibson
       <dwg@au1.ibm.com>, Linux Kernel Mailing List
       <linux-kernel@vger.kernel.org>, Anton Blanchard <anton@samba.org>,
       Nancy J Milliner/Austin/IBM@IBMUS, Ricardo C
       Gonzalez/Austin/IBM@ibmus, Brian Twichell/Austin/IBM@IBMUS,
       netdev@oss.sgi.com
Subject:    RE: e1000 performance hack for ppc64 (Power4)



Too long to quote:
http://marc.theaimsgroup.com/?t=105538879600001&r=1&w=2

Wouldn't you get most of the benefit from copying that stuff around in
the driver if you allocated the skb->data aligned in the first place?

There's already code to align them on CPU cache boundaries:
#define SKB_DATA_ALIGN(X)       (((X) + (SMP_CACHE_BYTES - 1)) & \
                                 ~(SMP_CACHE_BYTES - 1))

So, do something like this:
#ifdef ARCH_ALIGN_SKB_BYTES
#define SKB_ALIGN_BYTES ARCH_ALIGN_SKB_BYTES
#else
#define SKB_ALIGN_BYTES SMP_CACHE_BYTES
#endif
#define SKB_DATA_ALIGN(X)       (((X) + (SKB_ALIGN_BYTES - 1)) & \
                                 ~(SKB_ALIGN_BYTES - 1))

You could easily make this adaptive to not align on the arch size when the
request is bigger than that, just like in the e1000 patch you posted.
--
Dave Hansen
haveblue@us.ibm.com






* RE: e1000 performance hack for ppc64 (Power4)
@ 2003-06-13  1:16 Feldman, Scott
  2003-06-13 23:15 ` Anton Blanchard
  0 siblings, 1 reply; 34+ messages in thread
From: Feldman, Scott @ 2003-06-13  1:16 UTC (permalink / raw)
  To: David Gibson
  Cc: linux-kernel, Anton Blanchard, Nancy Milliner, Herman Dierks,
	Ricardo Gonzalez

David, arch-specific code in the driver concerns me.  I would really
like to avoid any arch-specific code in the driver if at all possible.
Can we find a way to move this work to the arch-dependent areas?  This
doesn't seem to be an issue unique to e1000, so moving this up one level
should benefit other devices as well.  More questions below.

> Peculiarities in the PCI bridges on Power4 based ppc64 machines mean
> that unaligned DMAs are horribly slow.  This hits us hard on gigabit
> transfers, since the packets (starting from the MAC header) are
> usually only 2-byte aligned.

2-byte alignment is bad for ppc64, so what is the minimum alignment that
is good?  (I hope it's not 2K!)  What could you do at a higher level to
present properly aligned buffers to the driver?

> The patch below addresses this by copying and realigning packets into
> nicely 2k aligned buffers.  As well as fixing the alignment, this
> minimises the number of TCE lookups, and because we allocate the
> buffers with pci_alloc_consistent(), we avoid setting up and tearing
> down the TCE mappings for each packet.

If I'm understanding the patch correctly, you're saying unaligned DMA +
TCE lookup is more expensive than a data copy?  If we copy the data, we
lose some of the benefits of TSO and Zerocopy and h/w checksum
offloading!  What is more expensive, unaligned DMA or TCE?

-scott

* e1000 performance hack for ppc64 (Power4)
@ 2003-06-12  3:32 David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2003-06-12  3:32 UTC (permalink / raw)
  To: Scott Feldman
  Cc: linux-kernel, Anton Blanchard, Nancy Milliner, Herman Dierks,
	Ricardo Gonzalez

Hi Scott,

Peculiarities in the PCI bridges on Power4 based ppc64 machines mean
that unaligned DMAs are horribly slow.  This hits us hard on gigabit
transfers, since the packets (starting from the MAC header) are
usually only 2-byte aligned.

The patch below addresses this by copying and realigning packets into
nicely 2k aligned buffers.  As well as fixing the alignment, this
minimises the number of TCE lookups, and because we allocate the buffers
with pci_alloc_consistent(), we avoid setting up and tearing down the TCE
mappings for each packet.

It's definitely a ppc64-specific hack, but I've tried to minimise the
patch's adverse impact on the rest of the code: it should cause no
change in behaviour or performance when the realignment is disabled,
which it is by default on everything except ppc64.  I've also
attempted to minimise the impact on the readability of the rest of the
code.

It would be very helpful if you could include this patch in the
mainline driver.

diff -urN /scratch/anton/export/drivers/net/e1000/e1000.h linux-congo/drivers/net/e1000/e1000.h
--- /scratch/anton/export/drivers/net/e1000/e1000.h	2003-05-30 01:28:06.000000000 +1000
+++ linux-congo/drivers/net/e1000/e1000.h	2003-06-12 12:39:10.000000000 +1000
@@ -163,6 +163,43 @@
 #define E1000_TX_DESC(R, i)		E1000_GET_DESC(R, i, e1000_tx_desc)
 #define E1000_CONTEXT_DESC(R, i)	E1000_GET_DESC(R, i, e1000_context_desc)
 
+#ifdef CONFIG_PPC64
+/* We have a POWER4 specific performance hack to deal with the
+ * slowness of transferring unaligned frames over the PCI bus */
+#define E1000_REALIGN_DATA_HACK	1
+#else
+#define E1000_REALIGN_DATA_HACK	0
+#endif
+
+#define E1000_REALIGN_BUFFER_SIZE	2048 /* importantly, >1514 */
+#define E1000_REALIGN_TARGET_ALIGNMENT	E1000_REALIGN_BUFFER_SIZE
+
+#if E1000_REALIGN_DATA_HACK
+/* We want each buffer to lie within one page, to minimise TCE
+ * lookups */
+#if (PAGE_SIZE % E1000_REALIGN_BUFFER_SIZE)
+#warning E1000_REALIGN_BUFFER_SIZE is not a factor of page size.
+#endif
+
+struct e1000_realign_ring {
+	unsigned char *vaddr;
+	dma_addr_t dma_handle;
+	size_t size;
+};
+
+#define E1000_REALIGN_BUFFER_OFFSET(i)	((i)*E1000_REALIGN_BUFFER_SIZE)
+#define E1000_REALIGN_BUFFER(a,i) ((a)->realign_ring.vaddr ? \
+	(a)->realign_ring.vaddr + E1000_REALIGN_BUFFER_OFFSET(i) : \
+	NULL)
+#define E1000_REALIGN_BUFFER_DMA(a,i) ((a)->realign_ring.dma_handle \
+	+ E1000_REALIGN_BUFFER_OFFSET(i))
+
+#else /* ! E1000_REALIGN_DATA_HACK */
+#define E1000_REALIGN_BUFFER(a,i)	NULL
+#define E1000_REALIGN_BUFFER_DMA(a,i)	0
+#endif /* ! E1000_REALIGN_DATA_HACK */
+
+
 /* board specific private data structure */
 
 struct e1000_adapter {
@@ -186,6 +223,9 @@
 
 	/* TX */
 	struct e1000_desc_ring tx_ring;
+#if E1000_REALIGN_DATA_HACK
+	struct e1000_realign_ring realign_ring;
+#endif /* E1000_REALIGN_DATA_HACK */
 	uint32_t txd_cmd;
 	uint32_t tx_int_delay;
 	uint32_t tx_abs_int_delay;
diff -urN /scratch/anton/export/drivers/net/e1000/e1000_main.c linux-congo/drivers/net/e1000/e1000_main.c
--- /scratch/anton/export/drivers/net/e1000/e1000_main.c	2003-05-30 01:23:39.000000000 +1000
+++ linux-congo/drivers/net/e1000/e1000_main.c	2003-06-12 12:52:58.000000000 +1000
@@ -702,6 +702,23 @@
 	return 0;
 }
 
+static inline void
+e1000_setup_realign_ring(struct e1000_adapter *adapter)
+{
+#if E1000_REALIGN_DATA_HACK
+	struct e1000_desc_ring *txdr = &adapter->tx_ring;
+	struct e1000_realign_ring *rr = &adapter->realign_ring;
+
+	rr->size = txdr->count * E1000_REALIGN_BUFFER_SIZE;
+	rr->vaddr = pci_alloc_consistent(adapter->pdev, rr->size,
+					 &rr->dma_handle);
+
+	if (! rr->vaddr)
+		printk(KERN_WARNING "%s: could not allocate realignment buffers.  Expect poor performance on Power4 hardware\n",
+		       adapter->netdev->name);
+#endif /* E1000_REALIGN_DATA_HACK */
+}
+
 /**
  * e1000_setup_tx_resources - allocate Tx resources (Descriptors)
  * @adapter: board private structure
@@ -735,6 +752,8 @@
 	}
 	memset(txdr->desc, 0, txdr->size);
 
+	e1000_setup_realign_ring(adapter);
+
 	txdr->next_to_use = 0;
 	txdr->next_to_clean = 0;
 
@@ -951,6 +970,19 @@
 	E1000_WRITE_REG(&adapter->hw, RCTL, rctl);
 }
 
+static inline void
+e1000_free_realign_ring(struct e1000_adapter *adapter)
+{
+#if E1000_REALIGN_DATA_HACK
+	struct e1000_realign_ring *rr = &adapter->realign_ring;
+
+	if (rr->vaddr)
+		pci_free_consistent(adapter->pdev, rr->size,
+				    rr->vaddr, rr->dma_handle);
+	rr->vaddr = NULL;
+#endif /* E1000_REALIGN_DATA_HACK */
+}
+
 /**
  * e1000_free_tx_resources - Free Tx Resources
  * @adapter: board private structure
@@ -972,6 +1004,8 @@
 	                    adapter->tx_ring.desc, adapter->tx_ring.dma);
 
 	adapter->tx_ring.desc = NULL;
+
+	e1000_free_realign_ring(adapter);
 }
 
 /**
@@ -990,11 +1024,11 @@
 
 	for(i = 0; i < adapter->tx_ring.count; i++) {
 		if(adapter->tx_ring.buffer_info[i].skb) {
-
-			pci_unmap_page(pdev,
-			               adapter->tx_ring.buffer_info[i].dma,
-			               adapter->tx_ring.buffer_info[i].length,
-			               PCI_DMA_TODEVICE);
+			if(adapter->tx_ring.buffer_info[i].dma)
+				pci_unmap_page(pdev,
+					       adapter->tx_ring.buffer_info[i].dma,
+					       adapter->tx_ring.buffer_info[i].length,
+					       PCI_DMA_TODEVICE);
 
 			dev_kfree_skb(adapter->tx_ring.buffer_info[i].skb);
 
@@ -1491,6 +1525,21 @@
 #define E1000_MAX_DATA_PER_TXD	(1<<E1000_MAX_TXD_PWR)
 
 static inline int
+e1000_realign_data(struct e1000_adapter *adapter, unsigned char *data,
+		     int size, int i)
+{
+	unsigned char *buf = E1000_REALIGN_BUFFER(adapter, i);
+
+	if(buf && (size <= E1000_REALIGN_BUFFER_SIZE)
+	   && ((unsigned long)(data) % E1000_REALIGN_TARGET_ALIGNMENT)) {
+
+		memcpy(buf, data, size);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int
 e1000_tx_map(struct e1000_adapter *adapter, struct sk_buff *skb,
 	unsigned int first)
 {
@@ -1515,11 +1564,13 @@
 			size -= 4;
 #endif
 		tx_ring->buffer_info[i].length = size;
-		tx_ring->buffer_info[i].dma =
-			pci_map_single(adapter->pdev,
-				skb->data + offset,
-				size,
-				PCI_DMA_TODEVICE);
+		if (e1000_realign_data(adapter, skb->data+offset, size, i))
+			tx_ring->buffer_info[i].dma = 0;
+		else
+			tx_ring->buffer_info[i].dma =
+				pci_map_single(adapter->pdev,
+					       skb->data + offset, size,
+					       PCI_DMA_TODEVICE);
 		tx_ring->buffer_info[i].time_stamp = jiffies;
 
 		len -= size;
@@ -1593,7 +1644,13 @@
 
 	while(count--) {
 		tx_desc = E1000_TX_DESC(*tx_ring, i);
-		tx_desc->buffer_addr = cpu_to_le64(tx_ring->buffer_info[i].dma);
+		if (E1000_REALIGN_DATA_HACK
+		    && !tx_ring->buffer_info[i].dma)
+			tx_desc->buffer_addr =
+				cpu_to_le64(E1000_REALIGN_BUFFER_DMA(adapter, i));
+		else
+			tx_desc->buffer_addr =
+				cpu_to_le64(tx_ring->buffer_info[i].dma);
 		tx_desc->lower.data =
 			cpu_to_le32(txd_lower | tx_ring->buffer_info[i].length);
 		tx_desc->upper.data = cpu_to_le32(txd_upper);


-- 
David Gibson			| For every complex problem there is a
david@gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

