* RE: Speed of plb_temac 3.00 on ML403
@ 2006-12-05 19:08 Rick Moleres
  2006-12-12 11:08 ` Ming Liu
  2007-02-09 14:16 ` Ming Liu
  0 siblings, 2 replies; 20+ messages in thread
From: Rick Moleres @ 2006-12-05 19:08 UTC (permalink / raw)
  To: Michael Galassi, Thomas Denzinger; +Cc: linuxppc-embedded


Thomas,

Yes, Michael points out the hardware parameters that are needed to
enable SGDMA along with DRE (to allow unaligned packets) and checksum
offload. Deep FIFOs in the hardware (Tx/Rx and IPIF) also help the
queuing at fast frame rates.  And finally, performance is better if
jumbo frames are enabled. Once SGDMA is tuned (e.g., number of buffer
descriptors, interrupt coalescing) and set up, the PPC is not involved
in the data transfers - only in the setup and interrupt handling.
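
For reference, jumbo frames also have to be enabled per interface on the
Linux side. A minimal sketch, assuming the TEMAC shows up as eth0 and the
core and driver were built with jumbo-frame support (8500 is only an
example value):

  # raise the MTU so the stack actually generates jumbo frames
  ifconfig eth0 mtu 8500

The MTU has to stay within whatever maximum the core and driver support.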

With a 300MHz system we saw about 730Mbps Tx with TCP on 2.4.20
(MontaVista Linux) and about 550Mbps Tx with TCP on 2.6.10 (MontaVista
again) - using netperf w/ the TCP_SENDFILE option. We didn't investigate
the difference between 2.4 and 2.6.
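
For comparison with the plain TCP_STREAM runs shown later in this thread,
a sendfile-based run looks roughly like this; the host address matches the
test setups discussed below, and the file path is just a placeholder for
any large file on the board:

  # assumes 192.168.0.3 runs netserver; /tmp/bigfile is any large local file
  # -F names the file whose contents netperf transmits via sendfile()
  netperf -H 192.168.0.3 -t TCP_SENDFILE -F /tmp/bigfile -- -m 8192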

-Rick

-----Original Message-----
From: linuxppc-embedded-bounces+moleres=xilinx.com@ozlabs.org
[mailto:linuxppc-embedded-bounces+moleres=xilinx.com@ozlabs.org] On
Behalf Of Michael Galassi
Sent: Tuesday, December 05, 2006 11:42 AM
To: Thomas Denzinger
Cc: linuxppc-embedded@ozlabs.org
Subject: Re: Speed of plb_temac 3.00 on ML403

>My question is now: Does anybody have deeper knowledge of how Ethernet and
>sgDMA work? How deeply is the PPC involved in the data transfer? Or does
>the TEMAC core handle the data transfer to DDR memory autonomously?

Thomas,

If you cut & pasted directly from my design you may be running without
DMA, which in turn implies running without checksum offload and DRE.
The plb_temac shrinks to about half its size this way, but if you're
performance bound you probably want to turn DMA back on in your mhs
file:

 PARAMETER C_DMA_TYPE = 3
 PARAMETER C_INCLUDE_RX_CSUM = 1
 PARAMETER C_INCLUDE_TX_CSUM = 1
 PARAMETER C_RX_DRE_TYPE = 1
 PARAMETER C_TX_DRE_TYPE = 1
 PARAMETER C_RXFIFO_DEPTH = 32768

You'll have to regenerate the xparameters file too if you make these
changes (in xps: Software -> Generate Libraries and BSPs).

There may also be issues with the IP stack in the 2.4 Linux kernels.
If you have the option, an experiment with a 2.6 stack would be
amusing.

-michael
_______________________________________________
Linuxppc-embedded mailing list
Linuxppc-embedded@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-embedded

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2006-12-05 19:08 Speed of plb_temac 3.00 on ML403 Rick Moleres
@ 2006-12-12 11:08 ` Ming Liu
  2007-02-09 14:16 ` Ming Liu
  1 sibling, 0 replies; 20+ messages in thread
From: Ming Liu @ 2006-12-12 11:08 UTC (permalink / raw)
  To: rick.moleres; +Cc: linuxppc-embedded

Dear Rick,
Now I am measuring the performance of my TEMAC on the ML403 using netperf.
However I cannot get a performance as high as yours (550Mbps for TX). My
data is listed here:

Board --> PC (tx)

# ./netperf -H 192.168.0.3 -C -t TCP_STREAM -- -m 8192 -s 253952 -S 253952
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.3 
(192.168.0.3) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % U      % S      us/KB   us/KB

262142 206848   8192    10.00        64.51   -1.00    2.59     -1.000  6.587

PC --> board (rx)

linux:/home/mingliu/netperf-2.4.1 # netperf -H 192.168.0.5 -C -t TCP_STREAM -- -m 14400 -s 253952 -S 253952
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.5 
(192.168.0.5) port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % U      % U      us/KB   us/KB

206848 262142  14400    10.02       169.09   -1.00    -1.00    -1.000  -0.484

This performance is much slower than what you have described. So what's
the problem? I am using the old TEMAC cores (plb_temac 2.00.a and
hard_temac 1.00.a, DMA type 3, Tx and Rx FIFO lengths both 131072 - large
enough?). My Linux is 2.6.16 from the mainline kernel with the TEMAC
driver patched in. The driver is from the patch at
http://source.mvista.com/~ank/paulus-powerpc/20060309/. Is this poor
performance because of the old cores, or the driver? Or is it because
MontaVista Linux is an RTOS and should therefore perform much better? You
are more experienced with this performance issue and your suggestions
would be extremely useful for me.

I look forward to your suggestions and explanation.

Regards
Ming

>From: "Rick Moleres" <rick.moleres@xilinx.com>
>To: "Michael Galassi" <mgalassi@c-cor.com>,"Thomas Denzinger" 
<t.denzinger@lesametric.de>
>CC: linuxppc-embedded@ozlabs.org
>Subject: RE: Speed of plb_temac 3.00 on ML403 
>Date: Tue, 5 Dec 2006 12:08:58 -0700
>
>
>Thomas,
>
>Yes, Michael points out the hardware parameters that are needed to
>enable SGDMA along with DRE (to allow unaligned packets) and checksum
>offload. It also helps the queuing if the FIFOs in the hardware (Tx/Rx
>and IPIF) are deep to handle fast frame rates.  And finally, better
>performance if jumbo frames are enabled. Once SGDMA is tuned (e.g.,
>number of buffer descriptors, interrupt coalescing) and set up, the PPC
>is not involved in the data transfers - only in the setup and interrupt
>handling.
>
>With a 300Mhz system we saw about 730Mbps Tx with TCP on 2.4.20
>(MontaVista Linux) and about 550Mbps Tx with TCP on 2.6.10 (MontaVista
>again) - using netperf w/ TCP_SENDFILE option. We didn't investigate the
>difference between 2.4 and 2.6.
>
>-Rick
>
>-----Original Message-----
>From: linuxppc-embedded-bounces+moleres=xilinx.com@ozlabs.org
>[mailto:linuxppc-embedded-bounces+moleres=xilinx.com@ozlabs.org] On
>Behalf Of Michael Galassi
>Sent: Tuesday, December 05, 2006 11:42 AM
>To: Thomas Denzinger
>Cc: linuxppc-embedded@ozlabs.org
>Subject: Re: Speed of plb_temac 3.00 on ML403
>
> >My question is now: Has anybody deeper knowledge how ethernet and sgDMA
> >works? How deep is the PPC involved in the data transfer? Or does the
> >Temac-core handle the datatransfer to DDR-memory autonomous?
>
>Thomas,
>
>If you cut & pasted directly from my design you may be running without
>DMA, which in turn implies running without checksum offload and DRE.
>The plb_temac shrinks to about half it's size this way, but if you're
>performance bound you probably want to turn DMA back on in your mhs
>file:
>
>  PARAMETER C_DMA_TYPE = 3
>  PARAMETER C_INCLUDE_RX_CSUM = 1
>  PARAMETER C_INCLUDE_TX_CSUM = 1
>  PARAMETER C_RX_DRE_TYPE = 1
>  PARAMETER C_TX_DRE_TYPE = 1
>  PARAMETER C_RXFIFO_DEPTH = 32768
>
>You'll have to regenerate the xparameters file too if you make these
>changes (in xps: Software -> Generate Libraries and BSPs).
>
>There may also be issues with the IP stack in the 2.4 linux kernels.
>If you have the option, an experiment with at 2.6 stack would be
>ammusing.
>
>-michael
>_______________________________________________
>Linuxppc-embedded mailing list
>Linuxppc-embedded@ozlabs.org
>https://ozlabs.org/mailman/listinfo/linuxppc-embedded
>
>
>_______________________________________________
>Linuxppc-embedded mailing list
>Linuxppc-embedded@ozlabs.org
>https://ozlabs.org/mailman/listinfo/linuxppc-embedded

_________________________________________________________________
To chat with friends online, use MSN Messenger:  http://messenger.msn.com/cn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2006-12-05 19:08 Speed of plb_temac 3.00 on ML403 Rick Moleres
  2006-12-12 11:08 ` Ming Liu
@ 2007-02-09 14:16 ` Ming Liu
  2007-02-09 14:57   ` jozsef imrek
                     ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: Ming Liu @ 2007-02-09 14:16 UTC (permalink / raw)
  To: rick.moleres; +Cc: linuxppc-embedded

Dear Rick,
Again about the problem of TEMAC speed. Hopefully you can give me some
suggestions on it.

>With a 300Mhz system we saw about 730Mbps Tx with TCP on 2.4.20
>(MontaVista Linux) and about 550Mbps Tx with TCP on 2.6.10 (MontaVista
>again) - using netperf w/ TCP_SENDFILE option. We didn't investigate the
>difference between 2.4 and 2.6.

Now with my system (plb_temac and hard_temac v3.00 with all the features
that improve performance enabled, Linux 2.6.10, 300MHz PPC, netperf), I
can achieve AT MOST 213.8Mbps for TCP TX and 277.4Mbps for TCP RX with an
8500-byte jumbo frame enabled. For UDP it is 350Mbps for TX, also with
the 8500-byte jumbo frame enabled.

So it looks like my results are still much lower than yours from Xilinx
(550Mbps TCP TX). So I am trying to find the bottleneck and improve the
performance.

When I use netperf to transfer data, I noticed that the CPU utilization
is almost 100%. So I suspect that the CPU is the bottleneck. However,
other people said the PLB structure is the bottleneck, because when the
CPU is lowered to 100MHz the performance does not change much, but when
the PLB frequency is lowered, it does. They conclude that with the PLB
structure the CPU has to wait a long time to load and store data from
DDR, so the PLB is the culprit.

Then come some questions. 1. Is your result from the GSRD structure or
just the normal PLB_TEMAC? Will GSRD achieve better performance than the
normal PLB_TEMAC? 2. Which exactly is the bottleneck for the network
performance, the CPU or the PLB structure? Is it possible for the PLB to
achieve a much higher throughput? 3. Your result is based on MontaVista
Linux. Is there any difference between MontaVista Linux and the general
open-source Linux kernel that could lead to different performance?

I know that many people, including me, are struggling to improve the
performance of PLB_TEMAC on the ML403. So please give us some hints and
suggestions from your experience and research. Thanks so much for your
work.

BR
Ming

_________________________________________________________________
To chat with friends online, use MSN Messenger:  http://messenger.msn.com/cn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 14:16 ` Ming Liu
@ 2007-02-09 14:57   ` jozsef imrek
  2007-02-11 15:25     ` Ming Liu
  2007-02-09 16:00   ` Rick Moleres
  2007-02-11  6:55   ` Linux " Leonid
  2 siblings, 1 reply; 20+ messages in thread
From: jozsef imrek @ 2007-02-09 14:57 UTC (permalink / raw)
  To: linuxppc-embedded; +Cc: rick.moleres

On Fri, 9 Feb 2007, Ming Liu wrote:

> Now with my system(plb_temac and hard_temac v3.00 with all features enabled 
> to improve the performance, Linux 2.6.10, 300Mhz ppc, netperf), I can achieve 
> AT MOST 213.8Mbps for TCP TX and 277.4Mbps for TCP RX, when jumbo-frame is 
> enabled as 8500. For UDP it is 350Mbps for TX, also 8500 jumbo-frame is 
> enabled. 
> So it looks that my results are still much less than yours from 
> Xilinx(550Mbps TCP TX). So I am trying to find the bottleneck and improve the 
> performance.


when testing network performance you might want to use the packet
generator included in the 2.6 linux kernel (in menuconfig go to
Networking -> Networking options -> Network testing -> Packet Generator).

with this tool you can bypass the ip stack, user space/kernel space 
barrier, etc, and measure the speed of the hardware itself using UDP-like
packets.

using pktgen i have seen data rates close to gigabit. (the hardware i'm 
working with is a memec minimodule with V4FX12. i'm using plb_temac with
s/g dma, plb running at 100MHz, and our custom core accessed via IPIF's
address range. sw is linux 2.6.19, xilinx tools are EDK 8.2i)



another hint: when transferring bulk amounts of data TCP is probably
overkill, especially on dedicated intranets and given the reliability
of the network devices available today. just use UDP if you can.
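
a minimal netperf run of that kind, assuming the same 192.168.0.3
netserver host used elsewhere in this thread (the message size is only an
example):

  # 192.168.0.3 runs netserver; -m sets the UDP message size in bytes
  netperf -H 192.168.0.3 -t UDP_STREAM -- -m 8192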


-- 
mazsi

----------------------------------------------------------------
strawberry fields forever!                       imrek@atomki.hu
----------------------------------------------------------------

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 14:16 ` Ming Liu
  2007-02-09 14:57   ` jozsef imrek
@ 2007-02-09 16:00   ` Rick Moleres
  2007-02-11  6:22     ` Leonid
  2007-02-11 13:37     ` Ming Liu
  2007-02-11  6:55   ` Linux " Leonid
  2 siblings, 2 replies; 20+ messages in thread
From: Rick Moleres @ 2007-02-09 16:00 UTC (permalink / raw)
  To: Ming Liu; +Cc: linuxppc-embedded

Ming,

Here's a quick summary of the systems we used:

Operating system:	MontaVista Linux 4.0
Benchmark tool:		NetPerf / NetServer
Kernel:			Linux ml403 2.6.10_mvl401-ml40x

IP Core:
Name & version: 		PLB TEMAC 3.00A
Operation Mode:		SGDMA mode
TX/RX DRE:		Yes / Yes
TX/RX CSUM offload:	Yes / Yes
TX Data FIFO depth:	131072 bits (i.e. 16K bytes)
RX Data FIFO depth:	131072 bits (i.e. 16K bytes)

Xilinx Platform Hardware:
Board:			ML403 / Virtex4 FX12
Processor:		PPC405 @ 300MHz
Memory type:		DDR
Memory burst:		Yes

PC-side Test Hardware:
Processor:		Intel(R) Pentium(R) 4 CPU 3.20GHz
OS:			Ubuntu Linux 6.06 LTS, kernel 2.6.15-26-386
Network adapters used:	D-Link DL2000-based Gigabit Ethernet (rev 0c)


- Are Checksum offload, SGDMA, and DRE enabled in the plb_temac?
- Are you using the TCP_SENDFILE option of netperf?  Your UDP numbers
are similar already to what we saw in Linux 2.6, and your TCP numbers
are similar to what we saw *without* the sendfile option.

I don't believe the PLB is the bottleneck here.  We had similar
platforms running with Treck and have achieved over 800Mbps TCP rates
(Tx and Rx) over the PLB.

To answer your questions:
1. Results are from PLB_TEMAC, not GSRD.  You would likely see similar
throughput rates with GSRD and Linux.
2. Assuming you have everything tuned for SGDMA based on previous
emails, I would suspect the bottleneck is the 300MHz CPU *when* running
Linux.  In Linux 2.6 we've not spent any time trying to tune the
TCP/Ethernet parameters on the target board or the host, so there could
be some optimizations that can be done at that level.  In the exact same
system we can achieve over 800Mbps using the Treck TCP/IP stack, and
with VxWorks it was over 600Mbps.  I'm not a Linux expert, so I don't
know what's tunable for network performance, and there is a possibility
the driver could be optimized as well.

Thanks,
-Rick


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 16:00   ` Rick Moleres
@ 2007-02-11  6:22     ` Leonid
  2007-02-11 13:37     ` Ming Liu
  1 sibling, 0 replies; 20+ messages in thread
From: Leonid @ 2007-02-11  6:22 UTC (permalink / raw)
  To: Rick Moleres, Ming Liu; +Cc: linuxppc-embedded

Does it mean that the ML403 and particularly the TEMAC need MontaVista
Linux? Will the standard kernel suffice?

Thanks,

Leonid.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Linux on ML403
  2007-02-09 14:16 ` Ming Liu
  2007-02-09 14:57   ` jozsef imrek
  2007-02-09 16:00   ` Rick Moleres
@ 2007-02-11  6:55   ` Leonid
  2007-02-11 13:10     ` Ming Liu
  2 siblings, 1 reply; 20+ messages in thread
From: Leonid @ 2007-02-11  6:55 UTC (permalink / raw)
  To: linuxppc-embedded

Folks, is everybody using MontaVista Linux on the ML403, or are there
those using the standard kernel from kernel.org, or any other place I can
get a 2.6 kernel for free?

I would also appreciate any instructions on how to patch it for the TEMAC
if necessary.

Thanks,

Leonid.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Linux on ML403
  2007-02-11  6:55   ` Linux " Leonid
@ 2007-02-11 13:10     ` Ming Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Liu @ 2007-02-11 13:10 UTC (permalink / raw)
  To: Leonid; +Cc: linuxppc-embedded

Hi,
From http://source.mvista.com/ you can get the linux-xilinx-26.git
kernel tree.

The TEMAC driver is provided along with the Xilinx EDK. So just overwrite
the corresponding files and try to compile. It would not be surprising if
there are some errors, so just debug them.

Good luck.

BR
Ming


>From: "Leonid" <Leonid@a-k-a.net>
>To: <linuxppc-embedded@ozlabs.org>
>Subject: Linux on ML403
>Date: Sat, 10 Feb 2007 22:55:27 -0800
>
>Folks, is everybody using Monta Vista Linux on ML403, or there are
>those, using standard kernel from kernel.org? Or any other place I can
>get 2.6 kernel for free.
>
>I would also appreciate any instructions how to patch it for TMAC if
>necessary.
>
>Thanks,
>
>Leonid.
>
>_______________________________________________
>Linuxppc-embedded mailing list
>Linuxppc-embedded@ozlabs.org
>https://ozlabs.org/mailman/listinfo/linuxppc-embedded

_________________________________________________________________
Download MSN Explorer for free:   http://explorer.msn.com/lccn/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 16:00   ` Rick Moleres
  2007-02-11  6:22     ` Leonid
@ 2007-02-11 13:37     ` Ming Liu
  2007-02-12 19:45       ` Rick Moleres
  1 sibling, 1 reply; 20+ messages in thread
From: Ming Liu @ 2007-02-11 13:37 UTC (permalink / raw)
  To: Rick.Moleres; +Cc: linuxppc-embedded

Dear Rick,
First, thank you SO SO SO MUCH for your kind explanations. They are
really useful for solving my problems.

From the test summary listed, I can see that we have similar systems
except that you are using MontaVista while I am using the general
open-source kernel. I also enabled all the features which could improve
the performance, including DRE, CSUM offload, SGDMA, etc.

>- Are Checksum offload, SGDMA, and DRE enabled in the plb_temac?
Yes. All features are enabled.

>- Are you using the TCP_SENDFILE option of netperf?  Your UDP numbers
>are similar already to what we saw in Linux 2.6, and your TCP numbers
>are similar to what we saw *without* the sendfile option.

Perhaps not. I only just learned that the sendfile option is so important
for improving performance. At first I thought it would achieve the same
performance as TCP_STREAM, until I saw the article explaining how to use
sendfile() to optimize data transfer. So I will try this soon.

After reading some articles on performance improvement, some other
questions have come up. I would appreciate it very much if you could
clarify them for me.

>1. Results are from PLB_TEMAC, not GSRD.  You would likely see similar
>throughput rates with GSRD and Linux.

Problem 1: From the GSRD website, I know that it uses a different
structure than the PLB, where a multi-port memory controller and DMA are
added to free the CPU from moving data between memory and the TEMAC. So
can GSRD achieve higher performance than PLB_TEMAC, or similar
performance as you said above? If their performance is similar, what is
the advantage of GSRD? Could you please explain some differences between
these two structures?

>2. Assuming you have everything tuned for SGDMA based on previous
>emails, I would suspect the bottleneck is the 300MHz CPU *when* running
>Linux.  In Linux 2.6 we've not spent any time trying to tune the
>TCP/Ethernet parameters on the target board or the host, so there could
>be some optimizations that can be done at that level.  In the exact same
>system we can achieve over 800Mbps using the Treck TCP/IP stack, and
>with VxWorks it was over 600Mbps.

Problem 2: I read XAPP546 on high-performance TCP/IP on Xilinx FPGA
devices using the Treck embedded TCP/IP stack. I notice that the features
of the Treck TCP/IP stack include zero-copy send and receive, jumbo-frame
support, CSUM offload, etc., which achieve much higher performance than
not using them. However, in the Xilinx TEMAC core v3.00 these features
are all supported: zero-copy is supported via sendfile() when using
netperf, jumbo frames are supported, and CSUM offload and DRE are
supported by the hardware. So does this mean I can achieve similarly high
performance with PLB_TEMAC v3.00 and without the Treck TCP/IP stack? I
mean, if all the features of the Treck stack are covered by the PLB_TEMAC
cores, what is the use of the Treck stack?

Maybe my questions are a little naive, but I am really confused about
them. Thank you so much if you can explain them to me. Thanks a lot.

BR
Ming

_________________________________________________________________
To chat with friends online, use MSN Messenger:  http://messenger.msn.com/cn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 14:57   ` jozsef imrek
@ 2007-02-11 15:25     ` Ming Liu
  2007-02-12 18:09       ` jozsef imrek
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Liu @ 2007-02-11 15:25 UTC (permalink / raw)
  To: imrek; +Cc: linuxppc-embedded

Dear Jozsef,
Thank you so much for your hints. Following your advice, I tried what you
told me. Here I list the script file and the result.

#! /bin/sh
insmod ./pktgen.ko

PGDEV=/proc/net/pktgen/pg0

pgset() {
    local result

    echo $1 > $PGDEV

    result=`cat $PGDEV | fgrep "Result: OK:"`
    if [ "$result" = "" ]; then
         cat $PGDEV | fgrep Result:
    fi
}

pg() {
    echo inject > $PGDEV
    cat $PGDEV
}

pgset "odev eth0"
pgset "dst 192.168.0.3"
pgset "pkt_size 8500"
pg
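
For longer or heavier runs, the same pgset helper can also set the other
parameters that appear in the "Params:" line of the report below; the
values here are only illustrative:

  pgset "count 1000000"    # number of packets to send per run
  pgset "clone_skb 100"    # reuse each skb several times to cut allocation overhead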

My board's IP is 192.168.0.5 and my PC's is 192.168.0.3. Also, in my
Linux a jumbo-frame size of 8500 is supported (the max is 8982), so I set
pkt_size to 8500. Here is the result:

pktgen.c: v1.4: Packet Generator for packet performance testing.
pktgen version 1.32
Params: count 100000  pkt_size: 8500  frags: 0  ipg: 0  clone_skb: 0 odev 
"eth0"
     dst_min: 192.168.0.3  dst_max:   src_min:   src_max:
     src_mac: 00:00:00:00:00:00  dst_mac: 00:00:00:00:00:00
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 100000  errors: 0
     started: 3555387ms  stopped: 3562242ms  now: 3562242ms  idle: 
1267914442ns
     seq_num: 100000  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0xc0a80005  cur_daddr: 0xc0a80003  cur_udp_dst: 9  
cur_udp_src: 9
Result: OK: 6824904(c2584388+d4240516) usec, 100000 (8504byte,0frags)
  14652pps 996Mb/sec (996804864bps) errors: 0
#

In the end it shows 996Mb/sec, which means the throughput is 996Mbps
(almost gigabit), right?

However, I don't think this result is very meaningful, because it
bypasses the processing of TCP/UDP packets. In a practical implementation
that's not possible - the TCP/UDP packets have to be processed, right? In
fact, this almost-gigabit speed just demonstrates the capability of the
Gigabit Ethernet itself, while the bottleneck of the system is elsewhere.
So I still need to solve the unsatisfying practical performance of my
design. :(

Anyway, thanks for your hints, and I welcome a deeper discussion.

BR
Ming


_________________________________________________________________
To chat with friends online, use MSN Messenger:  http://messenger.msn.com/cn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-11 15:25     ` Ming Liu
@ 2007-02-12 18:09       ` jozsef imrek
  2007-02-12 19:18         ` Ming Liu
  0 siblings, 1 reply; 20+ messages in thread
From: jozsef imrek @ 2007-02-12 18:09 UTC (permalink / raw)
  To: linuxppc-embedded

On Sun, 11 Feb 2007, Ming Liu wrote:

> In the end, it shows 996Mb/sec, which means the throughput is 996Mbps(almost 
> gigabit), right?

yes.. :)



> However I don't think this result is so meaningful because it bypass the 
> processing for TCP/UDP packets. In the practical implementation, that's not 
> possible. The TCP/UDP packets have to been processed, right?  In fact, this

on the contrary: our whole design is built on the principle that with UDP
you do not need too much handling/processing, and your data path can
bypass the IP stack, the CPU, and with some work even the main memory.

with TCP you (the os) have to take care of connection set up and tear
down, acknowledgements, packet retransmission (and for that you need to
save the packets until they are ack'ed!), etc. in return you get reliable
data transmission, which is a must in many applications.

however we don't need that. duplication or loss of a packet, or reordering
of packets (which could happen when using UDP) would not be a critical
problem for us. but these are mostly theoretical issues, they don't really
happen on our dedicated daq network (except for packet losses that we
deliberately use as poor man's flow control).


but all these might be off topic on this list..




> almost gigabit speed is just a representation of the Gigabit ethernet 
> capability while the bottleneck of the system is not there. So I still need 
> to solve my practical unsatisfying performance in my design. :(

at least it will probably help you narrow the problem down.. good
luck!



-- 
mazsi

----------------------------------------------------------------
strawberry fields forever!                       imrek@atomki.hu
----------------------------------------------------------------

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-12 18:09       ` jozsef imrek
@ 2007-02-12 19:18         ` Ming Liu
  2007-02-14  7:24           ` jozsef imrek
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Liu @ 2007-02-12 19:18 UTC (permalink / raw)
  To: imrek; +Cc: linuxppc-embedded

Dear Jozsef,

>on the contrary: our whole design is built on the principle that with UDP
>you do not need too much handling/processing, and your data path can
>bypass the IP stack, the CPU, and with some work even the main memory.
>
>with TCP you (the os) have to take care of connection set up and tear
>down, acknowledgements, packet retransmission (and for that you need to
>save the packets until they are ack'ed!), etc. in return you get reliable
>data transmission, which is a must in many applications.
>
>however we don't need that. duplication or loss of a packet, or reordering
>of packets (which could happen when using UDP) would not be a critical
>problem for us. but these are mostly theoretical issues, they don't realy
>happen on our dedicated daq network (except for packet losses that we
>deliberately use as poor man's flow control).

In fact, in our application we also use UDP. I know UDP needs less CPU
processing than TCP. However, as I measured, my UDP performance is around
350Mbps and my TCP performance is 270Mbps with jumbo frames enabled. In
my application I have to use the CPU to process the UDP or TCP packets,
and even this "not too much processing" results in the much lower
performance I listed above. :(

You said that in your application the data path can bypass the IP stack,
the CPU, and with some work even the main memory. How can you achieve
that? Then who will process the UDP packets? If you add the work of
processing packets back in, do you have an idea of how fast your network
can be? I believe it will be much lower, right? :)

Thanks for your discussion.

BR
Ming

_________________________________________________________________
To chat with friends online, use MSN Messenger:  http://messenger.msn.com/cn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-11 13:37     ` Ming Liu
@ 2007-02-12 19:45       ` Rick Moleres
  2007-02-12 20:39         ` Ming Liu
  0 siblings, 1 reply; 20+ messages in thread
From: Rick Moleres @ 2007-02-12 19:45 UTC (permalink / raw)
  To: Ming Liu; +Cc: linuxppc-embedded


Ming,

<snip>

>>1. Results are from PLB_TEMAC, not GSRD.  You would likely see similar
>>throughput rates with GSRD and Linux.
>
>Problem 1: From the GSRD website, I know that it uses a different
>structure than the PLB, where a multi-port memory controller and DMA are
>added to free the CPU from moving data between memory and the TEMAC. So
>can GSRD achieve higher performance than PLB_TEMAC, or similar
>performance as you said above? If their performance is similar, what is
>the advantage of GSRD? Could you please explain some differences between
>these two structures?

GSRD is a reference design intended to exhibit high-performance gigabit
rates.  It offloads the data path of the Ethernet traffic from the PLB
bus, under the assumption that the arbitrated bus is best used for other
things (control, other data, etc...).  With Linux, however, GSRD still
only achieves slightly more than 500Mbps TCP.  We see similar numbers
with PLB TEMAC, and with other stacks we see similar numbers as GSRD as
well (e.g., Treck).  The decision points for using GSRD would be a) what
else needs to happen on the PLB in your system, and b) Xilinx support.
GSRD is a reference design, so it's not officially supported through the
Xilinx support chain.  However, many of its architectural concepts are
being considered for future EDK IP (sorry, no timeframe).  For now, I
recommend PLB TEMAC because it's part of the EDK, supported, and gets as
good performance in most use cases.


>>2. Assuming you have everything tuned for SGDMA based on previous
>>emails, I would suspect the bottleneck is the 300MHz CPU *when* running
>>Linux.  In Linux 2.6 we've not spent any time trying to tune the
>>TCP/Ethernet parameters on the target board or the host, so there could
>>be some optimizations that can be done at that level.  In the exact same
>>system we can achieve over 800Mbps using the Treck TCP/IP stack, and
>>with VxWorks it was over 600Mbps.
>
>Problem 2: I read XAPP546 on high-performance TCP/IP on Xilinx FPGA
>devices using the Treck embedded TCP/IP stack. I notice that the features
>of the Treck TCP/IP stack include zero-copy send and receive, jumbo-frame
>support, CSUM offload, etc., which achieve much higher performance than
>not using them. However, in the Xilinx TEMAC core v3.00 these features
>are all supported: zero-copy is supported via sendfile() when using
>netperf, jumbo frames are supported, and CSUM offload and DRE are
>supported by the hardware. So does this mean I can achieve similarly high
>performance with PLB_TEMAC v3.00 and without the Treck TCP/IP stack? I
>mean, if all the features of the Treck stack are covered by the PLB_TEMAC
>cores, what is the use of the Treck stack?

Note that Linux only supports zero-copy on the transmit side (i.e.,
sendfile), not on the receive side.  I'm not going to recommend one RTOS
or network stack over another.  Treck is a general purpose TCP/IP stack
that can be used in a standalone environment or in various RTOS
environments (I think).  We've found that Treck, in the case where it is
used without an RTOS, is a higher performing stack than the Linux stack.
The VxWorks stack is also good, and Linux (of the three I've mentioned)
seems to be the slowest.  Again, it's possible that the Linux stack
could be tuned better, but we haven't taken the time to try this.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-12 19:45       ` Rick Moleres
@ 2007-02-12 20:39         ` Ming Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Liu @ 2007-02-12 20:39 UTC (permalink / raw)
  To: Rick.Moleres; +Cc: linuxppc-embedded

Dear Rick,
Thanks again for your kind explanations. :)

>GSRD is a reference design intended to exhibit high-performance gigabit
>rates.  It offloads the data path of the Ethernet traffic from the PLB
>bus, under the assumption that the arbitrated bus is best used for other
>things (control, other data, etc...).  With Linux, however, GSRD still
>only achieves slightly more than 500Mbps TCP.  We see similar numbers
>with PLB TEMAC, and with other stacks we see similar numbers as GSRD as
>well (e.g., Treck).  The decision points for using GSRD would be a) what
>else needs to happen on the PLB in your system, and b) Xilinx support.
>GSRD is a reference design, so it's not officially supported through the
>Xilinx support chain.  However, many of its architectural concepts are
>being considered for future EDK IP (sorry, no timeframe).  For now, I
>recommend PLB TEMAC because it's part of the EDK, supported, and gets as
>good performance in most use cases.

Well, this time I am totally clear on the concept of GSRD. That is, if I
have other tasks which use the PLB bus a lot, it will offload the network
traffic from the PLB and thus improve the network performance, right? So
I would like to agree with you: for my system, I choose PLB_TEMAC.


>Note that Linux only supports zero-copy on the transmit side (i.e.,
>sendfile), not on the receive side.  I'm not going to recommend one RTOS
>or network stack over another.  Treck is a general purpose TCP/IP stack
>that can be used in a standalone environment or in various RTOS
>environments (I think).  We've found that Treck, in the case where it is
>used without an RTOS, is a higher performing stack than the Linux stack.
>The VxWorks stack is also good, and Linux (of the three I've mentioned)
>seems to be the slowest.  Again, it's possible that the Linux stack
>could be tuned better, but we haven't taken the time to try this.

I just read some documents on sendfile() and now understand it somewhat.
So I tried TCP_SENDFILE in netperf this time. With the TCP_SENDFILE
option I can achieve a higher TX performance of 301.4Mbps for TCP
(without TCP_SENDFILE it's 213.8Mbps, so improved by almost 50%). For RX
there is no difference with or without TCP_SENDFILE (278Mbps), which
shows that Linux only supports zero-copy on the TX side, as you
mentioned.

So up to now my best performance for TCP is TX (301Mbps) and RX
(278Mbps). There is still a long way to your result (TX of 550Mbps). In
my system everything is the same as yours (PLB_TEMAC v3.00, SGDMA, TX/RX
DRE and CSUM offload, 16K TX/RX FIFOs, 300MHz CPU) except that I am using
the open-source Linux rather than MontaVista Linux 4.0. Would MontaVista
Linux lead to such a much higher performance? Or what is the real reason
why my performance is still not as high as yours? I would appreciate it a
lot if you can give me more hints.

BTW, for your result of 550Mbps TX, did you use just the Linux stack, or
the Treck one?

Thanks again for your time and kind help.

BR
Ming

_________________________________________________________________
To chat with friends online, use MSN Messenger:  http://messenger.msn.com/cn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2007-02-12 19:18         ` Ming Liu
@ 2007-02-14  7:24           ` jozsef imrek
  0 siblings, 0 replies; 20+ messages in thread
From: jozsef imrek @ 2007-02-14  7:24 UTC (permalink / raw)
  To: linuxppc-embedded

On Mon, 12 Feb 2007, Ming Liu wrote:

> You said in your application the data path can bypass the IP stack, the CPU, 
> and with some work even the main memory. How can you achieve that? Then who 
> will process the UDP packets? If you add the work of processing packets in,

a precondition for this to work is to have all your data processing
implemented in hardware, since the CPU will not see the payload of
the udp packets, so it will have no chance to modify it.

the trick (as i mentioned in my first reply) is to use an IPIF with
address range support (so it looks like a memory), and have the s/g dma
engine of the plb_temac read the udp payload from this "memory" instead
of your system ram.

the packet header (ethernet + ip + udp) in our application is assembled
by code based on pktgen. you could implement it in hw as well, but there
is no need for that since you can utilize the full gigabit bandwidth
(this you have seen.. :), and it is more convenient to have that
functionality in sw.


a quick hack to see this theory working:

1, create a new peripheral with address range support.
 	(start xps -> hardware -> create or import new peripheral,
 	plb bus, no sw reset/mir, no interrupt, no user regs, with
 	burst support, no fifo, with user address range, no dma,
 	no master iface)

 	you might want to replace the bram in the sample code
 	(pcores/*/hdl/vhdl/user_logic.vhd) with a fixed value
 	or a counter.

2, add this core to your design, create and download your new bitfile.

3, modify the source of the plb_temac linux driver. when a packet
 	is sent with fragments a buffer descriptor (bd) will be set
 	up for each fragment. the first bd will be used for the
 	packet header, and the rest of the bd's will point to the
 	udp payload. so you want to make sure that the physical
  	address of all but the first bd's is pointing to the physical
  	address of your IPIF's address range (you can find it in your
 	mhs file, search for C_AR_BASEADDR).

 	in adapter.c in xenet_SgSend_internal() search for the loop
 	where a bd is set up for each payload fragment (something like
 	for (i = 1; i < total_frags; i++, frag++)..), and set the
 	phy_addr to the address of your core (ie. phy_addr = 0x70090000;).

 	compile your new kernel, download it.

4, start pktgen with frags = 1 (use pgset "frags 1"). check the
 	payload of the packets sent on the wire (ie. with tcpdump).

 	if you have replaced the bram in step 1 with a fixed value
 	you should see that value. if you have replaced it with a
 	counter you should see the values rolling. if you did not
 	replace the bram, you should see the contents of the bram - it
 	is filled with all zeroes on reset, but you can fill it with
 	any test pattern.
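
A condensed sketch of the check in step 4, assuming the pgset/pg helpers
from the script posted earlier in this thread and that the PC captures on
eth0 (adjust the interface name as needed):

  # on the board: send each packet as header + one payload fragment
  pgset "frags 1"
  pg
  # on the PC: dump the UDP payload in hex to verify the test pattern
  # (pktgen sends to UDP port 9, as shown in the report earlier in the thread)
  tcpdump -i eth0 -X udp and dst port 9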



good luck, and let me know how it works!


-- 
mazsi

----------------------------------------------------------------
strawberry fields forever!                       imrek@atomki.hu
----------------------------------------------------------------

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2006-12-13  0:11 Speed of plb_temac 3.00 " Rick Moleres
@ 2006-12-17 15:05 ` Ming Liu
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Liu @ 2006-12-17 15:05 UTC (permalink / raw)
  To: rick.moleres; +Cc: linuxppc-embedded

Dear Rick,
Now I turn to the Temac v3 core. Also I have included all the features 
which could improve the network performance. However unfortunately the 
performance is also bad, even worse than the old cores. (The RX number is 
worse than before.) I noticed that the parameters in the new driver are 
much different with the ones in the old drivers. They are listed below.

>The numbers I quoted were using the TCP_SENDFILE option of netperf, and
>also using the plb_temac_v3 core, which has checksum offload and some
>other features that help performance.  Given the core you're using, your
>RX numbers are probably about right (assuming you're not using jumbo
>frames).  Your transmit number looks low, though.  Perhaps you can try
>tuning the packet threshold (e.g., less interrupts - try 8 instead of 1)
>and the waitbound (use 1) in adapter.c.  Also, how many buffer
>descriptors are being allocated in adapter.c?

In the new driver the threshold and waitbound are set as follows (the
defaults when the BSP is generated by EDK):
#define DFT_TX_THRESHOLD  16
#define DFT_TX_WAITBOUND  1
#define DFT_RX_THRESHOLD  2
#define DFT_RX_WAITBOUND  1
The buffer descriptor counts are
#define XTE_SEND_BD_CNT 256
#define XTE_RECV_BD_CNT 256
When booting, it reports the buffer descriptor number as 0x8000.

Sorry, I don't yet understand the mechanism of the network driver, so I
still cannot work out the physical meaning of each parameter and how a
parameter inside adapter.c will affect the performance. If possible,
could you please send me your adapter.c file that produces high
performance? I want to do some deeper research on these parameters.

Of course, your valuable suggestions are also appreciated. :-)

BR
Ming

_________________________________________________________________
Enjoy the world's largest e-mail system - MSN Hotmail.  http://www.hotmail.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
@ 2006-12-13  0:11 Rick Moleres
  2006-12-17 15:05 ` Ming Liu
  0 siblings, 1 reply; 20+ messages in thread
From: Rick Moleres @ 2006-12-13  0:11 UTC (permalink / raw)
  To: Ming Liu; +Cc: linuxppc-embedded


Ming,

The numbers I quoted were using the TCP_SENDFILE option of netperf, and
also using the plb_temac_v3 core, which has checksum offload and some
other features that help performance.  Given the core you're using, your
RX numbers are probably about right (assuming you're not using jumbo
frames).  Your transmit number looks low, though.  Perhaps you can try
tuning the packet threshold (e.g., less interrupts - try 8 instead of 1)
and the waitbound (use 1) in adapter.c.  Also, how many buffer
descriptors are being allocated in adapter.c?

I doubt MV Linux has anything to do with it, I would say it's a
combination of using the later core and its features (checksum offload,
DRE, jumbo frames) along with netperf's SENDFILE feature, and the
adapter/driver that takes advantage of both.  Plus tuning the interrupt
coalescing (threshold, waitbound) typically helps.

-Rick


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Speed of plb_temac 3.00 on ML403
  2006-12-05 16:18 Thomas Denzinger
  2006-12-05 16:49 ` Ming Liu
@ 2006-12-05 18:42 ` Michael Galassi
  1 sibling, 0 replies; 20+ messages in thread
From: Michael Galassi @ 2006-12-05 18:42 UTC (permalink / raw)
  To: Thomas Denzinger; +Cc: linuxppc-embedded

>My question is now: Does anybody have deeper knowledge of how Ethernet and
>sgDMA work? How deeply is the PPC involved in the data transfer? Or does
>the TEMAC core handle the data transfer to DDR memory autonomously?

Thomas,

If you cut & pasted directly from my design you may be running without
DMA, which in turn implies running without checksum offload and DRE.
The plb_temac shrinks to about half its size this way, but if you're
performance bound you probably want to turn DMA back on in your mhs
file:

 PARAMETER C_DMA_TYPE = 3
 PARAMETER C_INCLUDE_RX_CSUM = 1
 PARAMETER C_INCLUDE_TX_CSUM = 1
 PARAMETER C_RX_DRE_TYPE = 1
 PARAMETER C_TX_DRE_TYPE = 1
 PARAMETER C_RXFIFO_DEPTH = 32768

You'll have to regenerate the xparameters file too if you make these
changes (in xps: Software -> Generate Libraries and BSPs).

There may also be issues with the IP stack in the 2.4 Linux kernels.
If you have the option, an experiment with a 2.6 stack would be
amusing.

-michael

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Speed of plb_temac 3.00 on ML403
  2006-12-05 16:18 Thomas Denzinger
@ 2006-12-05 16:49 ` Ming Liu
  2006-12-05 18:42 ` Michael Galassi
  1 sibling, 0 replies; 20+ messages in thread
From: Ming Liu @ 2006-12-05 16:49 UTC (permalink / raw)
  To: t.denzinger; +Cc: linuxppc-embedded

Dear Thomas,
Now I am also fighting to improve the performance of the TEMAC on the
ML403 board. The difference is that I work with the old plb_temac and
hard_temac cores, and my Linux is 2.6.

I used some little programs I wrote myself to test the throughput of the
network between a PC and my board. (I should say I am not an experienced
programmer, so I cannot be sure my programs have no problems, although
they are simple. :-) ) According to my results, the largest throughput
between my board and a PC is around 4MB/s for a round trip. However, when
those programs are run on two servers interconnected by Gigabit Ethernet,
the largest throughput is around 12MB/s, three times that of my board.

I have not yet used the popular tools, such as netperf, to measure my
little network.

I also noticed that when transferring the data the CPU utilization is
very high (almost 100%). I am not sure if this is the reason. I think
that because there is SGDMA to handle the data transfer, the CPU
utilization should not be so high.

I am also waiting for suggestions from experts who are more experienced
in this field.

Have you ever measured the throughput of your network? If possible,
let's share some experience. :)

BR
Ming



_________________________________________________________________
Download MSN Explorer for free:   http://explorer.msn.com/lccn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Speed of plb_temac 3.00 on ML403
@ 2006-12-05 16:18 Thomas Denzinger
  2006-12-05 16:49 ` Ming Liu
  2006-12-05 18:42 ` Michael Galassi
  0 siblings, 2 replies; 20+ messages in thread
From: Thomas Denzinger @ 2006-12-05 16:18 UTC (permalink / raw)
  To: linuxppc-embedded

Hi all,

I have to interface to a camera with the GigE Vision protocol.

For that I set up a design on the ML403 with the PPC, TEMAC and sgDMA. I use MontaVista 2.4.20 Linux with the BSP from Xilinx EDK 8.2. The camera vendor supplied a library for GigE Vision, which works under Linux.
The results of some tests showed that I have to insert waiting time between frames sent by the camera, otherwise the lib signals errors.
This leads to only 1/10 of the needed transfer rate.

My question is now: Does anybody have deeper knowledge of how Ethernet and sgDMA work? How deeply is the PPC involved in the data transfer? Or does the TEMAC core handle the data transfer to DDR memory autonomously?

I learned from the camera vendor that on PCs with special Intel Ethernet chips it works autonomously, and so the high transfer rate can be achieved.

I'm very interested in getting in contact with people who have to interface with a GigE Vision camera.

It would also be interesting to know whether anybody has benchmarked the Gigabit Ethernet on the ML403 hardware. How fast is the gigabit interface really?

Thomas


-- 
Thomas Denzinger
LesaMetric GmbH 
Hauptstrasse 46
35649 Bischoffen

Tel.: 06444/931928
Fax : 06444/931912

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2007-02-14  7:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-05 19:08 Speed of plb_temac 3.00 on ML403 Rick Moleres
2006-12-12 11:08 ` Ming Liu
2007-02-09 14:16 ` Ming Liu
2007-02-09 14:57   ` jozsef imrek
2007-02-11 15:25     ` Ming Liu
2007-02-12 18:09       ` jozsef imrek
2007-02-12 19:18         ` Ming Liu
2007-02-14  7:24           ` jozsef imrek
2007-02-09 16:00   ` Rick Moleres
2007-02-11  6:22     ` Leonid
2007-02-11 13:37     ` Ming Liu
2007-02-12 19:45       ` Rick Moleres
2007-02-12 20:39         ` Ming Liu
2007-02-11  6:55   ` Linux " Leonid
2007-02-11 13:10     ` Ming Liu
  -- strict thread matches above, loose matches on Subject: below --
2006-12-13  0:11 Speed of plb_temac 3.00 " Rick Moleres
2006-12-17 15:05 ` Ming Liu
2006-12-05 16:18 Thomas Denzinger
2006-12-05 16:49 ` Ming Liu
2006-12-05 18:42 ` Michael Galassi
