* RE: Speed of plb_temac 3.00 on ML403
@ 2006-12-05 19:08 Rick Moleres
2006-12-12 11:08 ` Ming Liu
2007-02-09 14:16 ` Ming Liu
0 siblings, 2 replies; 15+ messages in thread
From: Rick Moleres @ 2006-12-05 19:08 UTC (permalink / raw)
To: Michael Galassi, Thomas Denzinger; +Cc: linuxppc-embedded
Thomas,
Yes, Michael points out the hardware parameters that are needed to
enable SGDMA along with DRE (to allow unaligned packets) and checksum
offload. Deep FIFOs in the hardware (Tx/Rx and IPIF) also help the
queuing keep up with fast frame rates. And finally, performance is
better if jumbo frames are enabled. Once SGDMA is tuned (e.g.,
number of buffer descriptors, interrupt coalescing) and set up, the PPC
is not involved in the data transfers - only in the setup and interrupt
handling.
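
A rough sketch of what that driver-side tuning typically looks like is
below. Every identifier here is a hypothetical placeholder, not the actual
Xilinx adapter API, so map the knobs onto whatever your driver version
exposes:

    /* Sketch only: names are stand-ins for the real coalescing/BD-ring
     * hooks in the TEMAC adapter driver. */
    #define TX_COALESCE_PKTS  16   /* one Tx interrupt per 16 completed BDs */
    #define TX_COALESCE_WAIT   1   /* ...or after one wait-bound tick */
    #define RX_COALESCE_PKTS   8
    #define RX_COALESCE_WAIT   1
    #define NUM_TX_BDS       256   /* enough descriptors to keep SGDMA streaming */
    #define NUM_RX_BDS       256

    static void temac_tune_sgdma(struct temac_priv *lp)  /* hypothetical type */
    {
        /* Fewer interrupts per frame leaves the 300MHz PPC more headroom
         * for the TCP/IP stack, at the cost of a little extra latency. */
        temac_set_coalesce(lp, TEMAC_DIR_TX, TX_COALESCE_PKTS, TX_COALESCE_WAIT);
        temac_set_coalesce(lp, TEMAC_DIR_RX, RX_COALESCE_PKTS, RX_COALESCE_WAIT);
        temac_alloc_bd_rings(lp, NUM_TX_BDS, NUM_RX_BDS);
    }
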
With a 300MHz system we saw about 730Mbps Tx with TCP on 2.4.20
(MontaVista Linux) and about 550Mbps Tx with TCP on 2.6.10 (MontaVista
again) - using netperf w/ the TCP_SENDFILE option. We didn't investigate
the difference between 2.4 and 2.6.
-Rick
-----Original Message-----
From: linuxppc-embedded-bounces+moleres=xilinx.com@ozlabs.org
[mailto:linuxppc-embedded-bounces+moleres=xilinx.com@ozlabs.org] On
Behalf Of Michael Galassi
Sent: Tuesday, December 05, 2006 11:42 AM
To: Thomas Denzinger
Cc: linuxppc-embedded@ozlabs.org
Subject: Re: Speed of plb_temac 3.00 on ML403
>My question is now: Does anybody have deeper knowledge of how Ethernet and
>sgDMA work? How deeply is the PPC involved in the data transfer? Or does
>the Temac core handle the data transfer to DDR memory autonomously?
Thomas,
If you cut & pasted directly from my design you may be running without
DMA, which in turn implies running without checksum offload and DRE.
The plb_temac shrinks to about half its size this way, but if you're
performance bound you probably want to turn DMA back on in your mhs
file:
PARAMETER C_DMA_TYPE = 3
PARAMETER C_INCLUDE_RX_CSUM = 1
PARAMETER C_INCLUDE_TX_CSUM = 1
PARAMETER C_RX_DRE_TYPE = 1
PARAMETER C_TX_DRE_TYPE = 1
PARAMETER C_RXFIFO_DEPTH = 32768
You'll have to regenerate the xparameters file too if you make these
changes (in xps: Software -> Generate Libraries and BSPs).
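
For what it's worth, the regeneration matters because the driver is
compiled against the constants in xparameters.h; a minimal sketch of how
such a constant might be consumed follows, where the XPAR_* names are
placeholders that depend on the instance name in your MHS:

    /* Placeholder macro names -- check the regenerated xparameters.h for
     * the ones that match your plb_temac instance. */
    #include "xparameters.h"

    #if defined(XPAR_PLB_TEMAC_0_C_DMA_TYPE) && (XPAR_PLB_TEMAC_0_C_DMA_TYPE == 3)
    #define TEMAC_HAS_SGDMA 1   /* scatter-gather DMA engine present */
    #else
    #define TEMAC_HAS_SGDMA 0   /* FIFO mode: the CPU copies every frame */
    #endif
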
There may also be issues with the IP stack in the 2.4 Linux kernels.
If you have the option, an experiment with a 2.6 stack would be
amusing.
-michael
* RE: Speed of plb_temac 3.00 on ML403
  2006-12-05 19:08 Speed of plb_temac 3.00 on ML403 Rick Moleres
@ 2006-12-12 11:08 ` Ming Liu
  2007-02-09 14:16 ` Ming Liu
  1 sibling, 0 replies; 15+ messages in thread
From: Ming Liu @ 2006-12-12 11:08 UTC (permalink / raw)
To: rick.moleres; +Cc: linuxppc-embedded
Dear Rick,
Now I am measuring the performance of my TEMAC on ML403 using netperf.
However I cannot get a performance as high as yours (550Mbps for TX). My
data is listed here:

Board --> PC (tx)
# ./netperf -H 192.168.0.3 -C -t TCP_STREAM -- -m 8192 -s 253952 -S 253952
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.3
(192.168.0.3) port 0 AF_INET
Recv   Send    Send                          Utilization      Service Demand
Socket Socket  Message  Elapsed              Send     Recv    Send    Recv
Size   Size    Size     Time     Throughput  local    remote  local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % U      % S     us/KB   us/KB
262142 206848    8192    10.00       64.51   -1.00    2.59    -1.000  6.587

PC --> board (rx)
linux:/home/mingliu/netperf-2.4.1 # netperf -H 192.168.0.5 -C -t TCP_STREAM -- -m 14400 -s 253952 -S 253952
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.0.5
(192.168.0.5) port 0 AF_INET
Recv   Send    Send                          Utilization      Service Demand
Socket Socket  Message  Elapsed              Send     Recv    Send    Recv
Size   Size    Size     Time     Throughput  local    remote  local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % U      % U     us/KB   us/KB
206848 262142   14400    10.02      169.09   -1.00    -1.00   -1.000  -0.484

I think this performance is much slower than what you have described. So
what's the problem? I am using the old TEMAC cores (plb_temac 2.00.a and
hard_temac 1.00.a, DMA type is 3, Tx and Rx FIFO lengths are both 131072 -
large enough?). My Linux is 2.6.16 from the general kernel with the TEMAC
driver patched. The driver is from the patch
http://source.mvista.com/~ank/paulus-powerpc/20060309/.
Is this bad performance because of the old cores, or the driver? Or is it
because MontaVista Linux is an RTOS and should therefore perform much
better? You must be more experienced on the performance issue and your
suggestions will be extremely useful for me. Anxious for your suggestion
and explanation.
Regards
Ming
* RE: Speed of plb_temac 3.00 on ML403
  2006-12-05 19:08 Speed of plb_temac 3.00 on ML403 Rick Moleres
  2006-12-12 11:08 ` Ming Liu
@ 2007-02-09 14:16 ` Ming Liu
  2007-02-09 14:57   ` jozsef imrek
                      ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Ming Liu @ 2007-02-09 14:16 UTC (permalink / raw)
To: rick.moleres; +Cc: linuxppc-embedded
Dear Rick,
Again the problem of TEMAC speed. Hopefully you can give me some
suggestions on that.

>With a 300MHz system we saw about 730Mbps Tx with TCP on 2.4.20
>(MontaVista Linux) and about 550Mbps Tx with TCP on 2.6.10 (MontaVista
>again) - using netperf w/ TCP_SENDFILE option. We didn't investigate the
>difference between 2.4 and 2.6.

Now with my system (plb_temac and hard_temac v3.00 with all features
enabled to improve the performance, Linux 2.6.10, 300MHz PPC, netperf), I
can achieve AT MOST 213.8Mbps for TCP TX and 277.4Mbps for TCP RX, when
jumbo frames are enabled at 8500. For UDP it is 350Mbps for TX, also with
8500 jumbo frames enabled.
So it looks like my results are still much lower than yours from Xilinx
(550Mbps TCP TX). So I am trying to find the bottleneck and improve the
performance.

When I use netperf to transfer data, I noticed that the CPU utilization is
almost 100%. So I suspect that the CPU is the bottleneck. However other
friends said the PLB structure is the bottleneck, because when the CPU is
lowered to 100MHz the performance does not change much, but if the PLB
frequency is lowered, it does. They conclude that with the PLB structure
the CPU has to wait a long time to load and store data from DDR, so the
PLB is the culprit.

Then come some questions.
1. Is your result from the GSRD structure or just the normal PLB_TEMAC?
Will GSRD achieve a better performance than the normal PLB_TEMAC?
2. Which one really is the bottleneck for the network performance, the CPU
or the PLB structure? Is it possible for the PLB to achieve a much higher
throughput?
3. Your result is based on MontaVista Linux. Is there any difference
between MontaVista Linux and the general open-source Linux kernel which
could lead to different performance?

I know that many people including me are struggling to improve the
performance of PLB_TEMAC on ML403. So please give us some hints and
suggestions from your experience and research. Thanks so much for your
work.
BR
Ming
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 14:16 ` Ming Liu
@ 2007-02-09 14:57   ` jozsef imrek
  2007-02-11 15:25     ` Ming Liu
  2007-02-09 16:00   ` Rick Moleres
  2007-02-11  6:55   ` Linux " Leonid
  2 siblings, 1 reply; 15+ messages in thread
From: jozsef imrek @ 2007-02-09 14:57 UTC (permalink / raw)
To: linuxppc-embedded; +Cc: rick.moleres
On Fri, 9 Feb 2007, Ming Liu wrote:
> Now with my system (plb_temac and hard_temac v3.00 with all features
> enabled to improve the performance, Linux 2.6.10, 300MHz PPC, netperf), I
> can achieve AT MOST 213.8Mbps for TCP TX and 277.4Mbps for TCP RX, when
> jumbo frames are enabled at 8500. For UDP it is 350Mbps for TX, also with
> 8500 jumbo frames enabled.
> So it looks like my results are still much lower than yours from Xilinx
> (550Mbps TCP TX). So I am trying to find the bottleneck and improve the
> performance.

when testing network performance you might want to use the packet
generator included in the 2.6 linux kernel (in menuconfig go to
Networking -> Networking options -> Network testing -> Packet Generator).

with this tool you can bypass the ip stack, user space/kernel space
barrier, etc, and measure the speed of the hardware itself using UDP-like
packets.

using pktgen i have seen data rates close to gigabit. (the hardware i'm
working with is a memec minimodule with V4FX12. i'm using plb_temac with
s/g dma, plb running at 100MHz, and our custom core accessed via IPIF's
address range. sw is linux 2.6.19, xilinx tools are EDK 8.2i)

another hint: when transferring bulk amounts of data TCP is probably
overkill, especially on dedicated intranets and given the reliability
of the network devices available today. just use UDP if you can.

--
mazsi

----------------------------------------------------------------
strawberry fields forever!                       imrek@atomki.hu
----------------------------------------------------------------
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 14:57 ` jozsef imrek
@ 2007-02-11 15:25   ` Ming Liu
  2007-02-12 18:09     ` jozsef imrek
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Liu @ 2007-02-11 15:25 UTC (permalink / raw)
To: imrek; +Cc: linuxppc-embedded
Dear Jozsef,
Thank you so much for your hints. Following your suggestion, I tried it
out. Here I list the script file and the result.

#! /bin/sh
insmod ./pktgen.ko
PGDEV=/proc/net/pktgen/pg0

pgset() {
    local result

    echo $1 > $PGDEV
    result=`cat $PGDEV | fgrep "Result: OK:"`
    if [ "$result" = "" ]; then
        cat $PGDEV | fgrep Result:
    fi
}

pg() {
    echo inject > $PGDEV
    cat $PGDEV
}

pgset "odev eth0"
pgset "dst 192.168.0.3"
pgset "pkt_size 8500"
pg

My board IP is 192.168.0.5 and my PC is 192.168.0.3. Also in my Linux, a
jumbo-frame size of 8500 is supported (max is 8982), so I set pkt_size to
8500. Here goes the result:

pktgen.c: v1.4: Packet Generator for packet performance testing.
pktgen version 1.32
Params: count 100000  pkt_size: 8500  frags: 0  ipg: 0  clone_skb: 0  odev "eth0"
    dst_min: 192.168.0.3  dst_max:
    src_min:   src_max:
    src_mac: 00:00:00:00:00:00  dst_mac: 00:00:00:00:00:00
    udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
    src_mac_count: 0  dst_mac_count: 0
    Flags:
Current:
    pkts-sofar: 100000  errors: 0
    started: 3555387ms  stopped: 3562242ms  now: 3562242ms  idle: 1267914442ns
    seq_num: 100000  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
    cur_saddr: 0xc0a80005  cur_daddr: 0xc0a80003
    cur_udp_dst: 9  cur_udp_src: 9
Result: OK: 6824904(c2584388+d4240516) usec, 100000 (8504byte,0frags)
  14652pps 996Mb/sec (996804864bps) errors: 0
#

In the end, it shows 996Mb/sec, which means the throughput is 996Mbps
(almost gigabit), right? However I don't think this result is so
meaningful, because it bypasses the processing of TCP/UDP packets. In a
practical implementation that's not possible - the TCP/UDP packets have to
be processed, right? In fact, this almost-gigabit speed is just a
representation of the Gigabit Ethernet capability, while the bottleneck of
the system is not there. So I still need to solve the unsatisfying
performance of my practical design. :(
Anyway thanks for your hints and welcome to a deeper discussion.
BR
Ming
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-11 15:25 ` Ming Liu
@ 2007-02-12 18:09   ` jozsef imrek
  2007-02-12 19:18     ` Ming Liu
  0 siblings, 1 reply; 15+ messages in thread
From: jozsef imrek @ 2007-02-12 18:09 UTC (permalink / raw)
To: linuxppc-embedded
On Sun, 11 Feb 2007, Ming Liu wrote:
> In the end, it shows 996Mb/sec, which means the throughput is 996Mbps
> (almost gigabit), right?

yes.. :)

> However I don't think this result is so meaningful, because it bypasses
> the processing of TCP/UDP packets. In a practical implementation that's
> not possible - the TCP/UDP packets have to be processed, right? In fact, this

on the contrary: our whole design is built on the principle that with UDP
you do not need too much handling/processing, and your data path can
bypass the IP stack, the CPU, and with some work even the main memory.

with TCP you (the os) have to take care of connection set up and tear
down, acknowledgements, packet retransmission (and for that you need to
save the packets until they are ack'ed!), etc. in return you get reliable
data transmission, which is a must in many applications.

however we don't need that. duplication or loss of a packet, or reordering
of packets (which could happen when using UDP) would not be a critical
problem for us. but these are mostly theoretical issues, they don't really
happen on our dedicated daq network (except for packet losses that we
deliberately use as poor man's flow control). but all these might be off
topic on this list..

> almost-gigabit speed is just a representation of the Gigabit Ethernet
> capability, while the bottleneck of the system is not there. So I still
> need to solve the unsatisfying performance of my practical design. :(

at least it will probably help you narrow the problem down.. good luck!

--
mazsi

----------------------------------------------------------------
strawberry fields forever!                       imrek@atomki.hu
----------------------------------------------------------------
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-12 18:09 ` jozsef imrek
@ 2007-02-12 19:18   ` Ming Liu
  2007-02-14  7:24     ` jozsef imrek
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Liu @ 2007-02-12 19:18 UTC (permalink / raw)
To: imrek; +Cc: linuxppc-embedded
Dear Jozsef,

>on the contrary: our whole design is built on the principle that with UDP
>you do not need too much handling/processing, and your data path can
>bypass the IP stack, the CPU, and with some work even the main memory.
>
>with TCP you (the os) have to take care of connection set up and tear
>down, acknowledgements, packet retransmission (and for that you need to
>save the packets until they are ack'ed!), etc. in return you get reliable
>data transmission, which is a must in many applications.
>
>however we don't need that. duplication or loss of a packet, or reordering
>of packets (which could happen when using UDP) would not be a critical
>problem for us. but these are mostly theoretical issues, they don't really
>happen on our dedicated daq network (except for packet losses that we
>deliberately use as poor man's flow control).

In fact, in our application we also use UDP. I know UDP needs less CPU
processing capability than TCP. However, just as I measured, my UDP
performance is around 350Mbps and my TCP performance is 270Mbps with jumbo
frames enabled. In my application I have to use the CPU to process the UDP
or TCP packets, and this "not too much processing" still results in the
much lower performance I listed above. :(

You said that in your application the data path can bypass the IP stack,
the CPU, and with some work even the main memory. How can you achieve
that? Then who will process the UDP packets? If you add the work of
processing the packets back in, do you have an idea of how fast your
network can go? I believe it will be much lower, right? :)

Thanks for your discussion.
BR
Ming
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-12 19:18 ` Ming Liu
@ 2007-02-14  7:24   ` jozsef imrek
  0 siblings, 0 replies; 15+ messages in thread
From: jozsef imrek @ 2007-02-14 7:24 UTC (permalink / raw)
To: linuxppc-embedded
On Mon, 12 Feb 2007, Ming Liu wrote:
> You said that in your application the data path can bypass the IP stack,
> the CPU, and with some work even the main memory. How can you achieve
> that? Then who will process the UDP packets? If you add the work of
> processing the packets back in,

a precondition for this to work is to have all your data processing
implemented in hardware, since the CPU will not see the payload of the udp
packets, so it will have no chance to modify it.

the trick (as i mentioned in my first reply) is to use an IPIF with
address range support (so it looks like a memory), and have the s/g dma
engine of the plb_temac read the udp payload from this "memory" instead of
your system ram.

the packet header (ethernet + ip + udp) in our application is assembled by
code based on pktgen. you could implement it in hw as well, but there is
no need for that since you can utilize the full gigabit bandwidth (this
you have seen.. :), and it is more convenient to have that functionality
in sw.

a quick hack to see this theory working:

1, create a new peripheral with address range support. (start xps ->
hardware -> create or import new peripheral, plb bus, no sw reset/mir, no
interrupt, no user regs, with burst support, no fifo, with user address
range, no dma, no master iface) you might want to replace the bram in the
sample code (pcores/*/hdl/vhdl/user_logic.vhd) with a fixed value or a
counter.

2, add this core to your design, create and download your new bitfile.

3, modify the source of the plb_temac linux driver. when a packet is sent
with fragments, a buffer descriptor (bd) will be set up for each fragment.
the first bd will be used for the packet header, and the rest of the bd's
will point to the udp payload. so you want to make sure that the physical
address of all but the first bd is pointing to the physical address of
your IPIF's address range (you can find it in your mhs file, search for
C_AR_BASEADDR). in adapter.c in xenet_SgSend_internal() search for the
loop where a bd is set up for each payload fragment (something like
for (i = 1; i < total_frags; i++, frag++)..), and set the phy_addr to the
address of your core (i.e. phy_addr = 0x70090000;). compile your new
kernel, download it.

4, start pktgen with frags = 1 (use pgset "frags 1"). check the payload of
the packets sent on the wire (i.e. with tcpdump). if you have replaced the
bram in step 1 with a fixed value you should see that value. if you have
replaced it with a counter you should see the values rolling. if you did
not replace the bram, you should see the contents of the bram - it is
filled with all zeroes on reset, but you can fill it with any test
pattern.

good luck, and let me know how it works!

--
mazsi

----------------------------------------------------------------
strawberry fields forever!                       imrek@atomki.hu
----------------------------------------------------------------
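
To make step 3 above concrete, here is a minimal sketch of the fragment
loop change jozsef describes. The buffer-descriptor helpers are
hypothetical stand-ins for whatever macros the adapter driver really uses,
and 0x70090000 is only the example C_AR_BASEADDR from his message:

    #define IPIF_AR_BASEADDR 0x70090000UL  /* C_AR_BASEADDR of the custom IPIF core */

    /* Inside xenet_SgSend_internal(): BD 0 still points at the real packet
     * header in RAM; every payload fragment is redirected to the IPIF
     * address range, so SGDMA streams the payload straight from the core. */
    for (i = 1; i < total_frags; i++, frag++) {
        phy_addr = IPIF_AR_BASEADDR;                   /* instead of the fragment's DMA address */
        bd_set_buf_addr(bd_ptr, phy_addr, frag->size); /* hypothetical BD helper */
        bd_ptr = bd_next(bd_ptr);                      /* hypothetical BD helper */
    }
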
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 14:16 ` Ming Liu
  2007-02-09 14:57   ` jozsef imrek
@ 2007-02-09 16:00   ` Rick Moleres
  2007-02-11  6:22     ` Leonid
  2007-02-11 13:37     ` Ming Liu
  2007-02-11  6:55   ` Linux " Leonid
  2 siblings, 2 replies; 15+ messages in thread
From: Rick Moleres @ 2007-02-09 16:00 UTC (permalink / raw)
To: Ming Liu; +Cc: linuxppc-embedded
Ming,

Here's a quick summary of the systems we used:

Operating system:       MontaVista Linux 4.0
Benchmark tool:         NetPerf / NetServer
Kernel:                 Linux ml403 2.6.10_mvl401-ml40x

IP Core:
  Name & version:       PLB TEMAC 3.00A
  Operation Mode:       SGDMA mode
  TX/RX DRE:            Yes / Yes
  TX/RX CSUM offload:   Yes / Yes
  TX Data FIFO depth:   131072 bits (i.e. 16K bytes)
  RX Data FIFO depth:   131072 bits (i.e. 16K bytes)

Xilinx Platform Hardware:
  Board:                ML403 / Virtex4 FX12
  Processor:            PPC405 @ 300MHz
  Memory type:          DDR
  Memory burst:         Yes

PC-side Test Hardware:
  Processor:            Intel(R) Pentium(R) 4 CPU 3.20GHz
  OS:                   Ubuntu Linux 6.06 LTS, kernel 2.6.15-26-386
  Network adapter:      D-Link DL2000-based Gigabit Ethernet (rev 0c)

- Are checksum offload, SGDMA, and DRE enabled in the plb_temac?
- Are you using the TCP_SENDFILE option of netperf? Your UDP numbers are
similar already to what we saw in Linux 2.6, and your TCP numbers are
similar to what we saw *without* the sendfile option.

I don't believe the PLB is the bottleneck here. We had similar platforms
running with Treck and have achieved over 800Mbps TCP rates (Tx and Rx)
over the PLB.

To answer your questions:
1. Results are from PLB_TEMAC, not GSRD. You would likely see similar
throughput rates with GSRD and Linux.
2. Assuming you have everything tuned for SGDMA based on previous emails,
I would suspect the bottleneck is the 300MHz CPU *when* running Linux. In
Linux 2.6 we've not spent any time trying to tune the TCP/Ethernet
parameters on the target board or the host, so there could be some
optimizations that can be done at that level. In the exact same system we
can achieve over 800Mbps using the Treck TCP/IP stack, and with VxWorks it
was over 600Mbps. I'm not a Linux expert, so I don't know what's tunable
for network performance, and there is a possibility the driver could be
optimized as well.

Thanks,
-Rick
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 16:00 ` Rick Moleres
@ 2007-02-11  6:22   ` Leonid
  0 siblings, 0 replies; 15+ messages in thread
From: Leonid @ 2007-02-11 6:22 UTC (permalink / raw)
To: Rick Moleres, Ming Liu; +Cc: linuxppc-embedded
Does it mean that the ML403, and particularly the TEMAC, needs MontaVista
Linux? Will the standard kernel suffice?

Thanks,

Leonid.
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-09 16:00 ` Rick Moleres
  2007-02-11  6:22   ` Leonid
@ 2007-02-11 13:37   ` Ming Liu
  2007-02-12 19:45     ` Rick Moleres
  1 sibling, 1 reply; 15+ messages in thread
From: Ming Liu @ 2007-02-11 13:37 UTC (permalink / raw)
To: Rick.Moleres; +Cc: linuxppc-embedded
Dear Rick,
First, thank you SO SO SO MUCH for your kind reply. It's really useful for
solving my problems.

From the test summary listed, I can see that we have similar systems,
except that you are using MontaVista while I am using the general
open-source kernel. Also, I enabled all the features which could improve
the performance, including DRE, CSUM offload and SGDMA, etc.

>- Are Checksum offload, SGDMA, and DRE enabled in the plb_temac?

Yes. All features are enabled.

>- Are you using the TCP_SENDFILE option of netperf? Your UDP numbers are
similar already to what we saw in Linux 2.6, and your TCP numbers are
similar to what we saw *without* the sendfile option.

Perhaps not. I have just understood that the sendfile option is so
important for improving performance. At first I thought it would achieve
the same performance as TCP_STREAM, until I saw the article that explains
how to use sendfile() to optimize data transfer. So I will try this soon.

After some reading of the articles on performance improvement, here come
some other problems. I will appreciate it very much if you can clarify
them for me.

>1. Results are from PLB_TEMAC, not GSRD. You would likely see similar
throughput rates with GSRD and Linux.

Problem 1: From the website for GSRD, I know that it uses a different
structure than PLB, where a multi-port memory controller and DMA are added
to relieve the CPU from moving data between memory and the TEMAC. So can
GSRD achieve a higher performance than PLB_TEMAC, or similar performance
like what you said above? If their performance is similar, what's the
advantage of GSRD? Could you please explain some differences between these
two structures?

>2. Assuming you have everything tuned for SGDMA based on previous emails,
I would suspect the bottleneck is the 300MHz CPU *when* running Linux. In
Linux 2.6 we've not spent any time trying to tune the TCP/Ethernet
parameters on the target board or the host, so there could be some
optimizations that can be done at that level. In the exact same system we
can achieve over 800Mbps using the Treck TCP/IP stack, and with VxWorks it
was over 600Mbps.

Problem 2: I read XAPP546 on high-performance TCP/IP on Xilinx FPGA
devices using the Treck embedded TCP/IP stack. I notice that the features
of the Treck TCP/IP stack include zero-copy send and receive, jumbo-frame
support, CSUM offload, etc., which achieve a much higher performance than
not using them. However, in the Xilinx TEMAC core v3.00 these features are
all supported: zero-copy is supported by sendfile() when using netperf;
jumbo frames are also supported; CSUM offload and DRE are also supported
by the hardware. So does this mean I can achieve a similarly high
performance with PLB_TEMAC v3.00 and without the Treck TCP/IP stack? I
mean, if all the features of the Treck stack have been included in the
PLB_TEMAC cores, what's the use of the Treck stack?

Maybe my questions are a little stupid. But I am really confused by them.
So thank you so much if you can explain them to me. Thanks a lot.
BR
Ming
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-11 13:37 ` Ming Liu
@ 2007-02-12 19:45   ` Rick Moleres
  2007-02-12 20:39     ` Ming Liu
  0 siblings, 1 reply; 15+ messages in thread
From: Rick Moleres @ 2007-02-12 19:45 UTC (permalink / raw)
To: Ming Liu; +Cc: linuxppc-embedded
Ming,

<snip>
>>1. Results are from PLB_TEMAC, not GSRD. You would likely see similar
>throughput rates with GSRD and Linux.
>
>Problem 1: From the website for GSRD, I know that it uses a different
>structure than PLB, where a multi-port memory controller and DMA are added
>to relieve the CPU from moving data between memory and the TEMAC. So can
>GSRD achieve a higher performance than PLB_TEMAC, or similar performance
>like what you said above? If their performance is similar, what's the
>advantage of GSRD? Could you please explain some differences between these
>two structures?

GSRD is a reference design intended to exhibit high-performance gigabit
rates. It offloads the data path of the Ethernet traffic from the PLB bus,
under the assumption that the arbitrated bus is best used for other things
(control, other data, etc...). With Linux, however, GSRD still only
achieves slightly more than 500Mbps TCP. We see similar numbers with PLB
TEMAC, and with other stacks we see similar numbers as GSRD as well (e.g.,
Treck). The decision points for using GSRD would be a) what else needs to
happen on the PLB in your system, and b) Xilinx support. GSRD is a
reference design, so it's not officially supported through the Xilinx
support chain. However, many of its architectural concepts are being
considered for future EDK IP (sorry, no timeframe). For now, I recommend
PLB TEMAC because it's part of the EDK, supported, and gets as good
performance in most use cases.

>>2. Assuming you have everything tuned for SGDMA based on previous emails,
>I would suspect the bottleneck is the 300MHz CPU *when* running Linux. In
>Linux 2.6 we've not spent any time trying to tune the TCP/Ethernet
>parameters on the target board or the host, so there could be some
>optimizations that can be done at that level. In the exact same system we
>can achieve over 800Mbps using the Treck TCP/IP stack, and with VxWorks it
>was over 600Mbps.
>
>Problem 2: I read XAPP546 on high-performance TCP/IP on Xilinx FPGA
>devices using the Treck embedded TCP/IP stack. I notice that the features
>of the Treck TCP/IP stack include zero-copy send and receive, jumbo-frame
>support, CSUM offload, etc., which achieve a much higher performance than
>not using them. However, in the Xilinx TEMAC core v3.00 these features are
>all supported: zero-copy is supported by sendfile() when using netperf;
>jumbo frames are also supported; CSUM offload and DRE are also supported
>by the hardware. So does this mean I can achieve a similarly high
>performance with PLB_TEMAC v3.00 and without the Treck TCP/IP stack? I
>mean, if all the features of the Treck stack have been included in the
>PLB_TEMAC cores, what's the use of the Treck stack?

Note that Linux only supports zero-copy on the transmit side (i.e.,
sendfile), not on the receive side. I'm not going to recommend one RTOS or
network stack over another. Treck is a general purpose TCP/IP stack that
can be used in a standalone environment or in various RTOS environments (I
think). We've found that Treck, in the case where it is used without an
RTOS, is a higher performing stack than the Linux stack. The VxWorks stack
is also good, and Linux (of the three I've mentioned) seems to be the
slowest. Again, it's possible that the Linux stack could be tuned better,
but we haven't taken the time to try this.
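
As a point of reference for the sendfile() path mentioned above, a
bare-bones zero-copy TCP sender (roughly what netperf's TCP_SENDFILE test
exercises) looks like this; sendfile() is the standard Linux syscall, and
error handling is trimmed for brevity:

    #include <sys/sendfile.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* The kernel hands file pages to the NIC without a user-space copy,
     * which is why TX improves while RX (no zero-copy receive) does not. */
    int send_file_zero_copy(const char *path, const char *dst_ip, int port)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, dst_ip, &dst.sin_addr);
        connect(sock, (struct sockaddr *)&dst, sizeof(dst));

        off_t off = 0;
        while (off < st.st_size)
            if (sendfile(sock, fd, &off, st.st_size - off) < 0)
                break;              /* report or retry in real code */

        close(sock);
        close(fd);
        return 0;
    }
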
* RE: Speed of plb_temac 3.00 on ML403
  2007-02-12 19:45 ` Rick Moleres
@ 2007-02-12 20:39   ` Ming Liu
  0 siblings, 0 replies; 15+ messages in thread
From: Ming Liu @ 2007-02-12 20:39 UTC (permalink / raw)
To: Rick.Moleres; +Cc: linuxppc-embedded
Dear Rick,
Thanks for your kind reply once more. :)

>GSRD is a reference design intended to exhibit high-performance gigabit
>rates. It offloads the data path of the Ethernet traffic from the PLB
>bus, under the assumption that the arbitrated bus is best used for other
>things (control, other data, etc...). With Linux, however, GSRD still
>only achieves slightly more than 500Mbps TCP. We see similar numbers
>with PLB TEMAC, and with other stacks we see similar numbers as GSRD as
>well (e.g., Treck). The decision points for using GSRD would be a) what
>else needs to happen on the PLB in your system, and b) Xilinx support.
>GSRD is a reference design, so it's not officially supported through the
>Xilinx support chain. However, many of its architectural concepts are
>being considered for future EDK IP (sorry, no timeframe). For now, I
>recommend PLB TEMAC because it's part of the EDK, supported, and gets as
>good performance in most use cases.

Well, this time I am totally clear on the concept of GSRD. That is, if I
have other tasks which use the PLB bus a lot, GSRD will take the network
traffic off the PLB and thus improve the network performance, right? So I
would like to agree with you: in my system, I choose PLB_TEMAC.

>Note that Linux only supports zero-copy on the transmit side (i.e.,
>sendfile), not on the receive side. I'm not going to recommend one RTOS
>or network stack over another. Treck is a general purpose TCP/IP stack
>that can be used in a standalone environment or in various RTOS
>environments (I think). We've found that Treck, in the case where it is
>used without an RTOS, is a higher performing stack than the Linux stack.
>The VxWorks stack is also good, and Linux (of the three I've mentioned)
>seems to be the slowest. Again, it's possible that the Linux stack
>could be tuned better, but we haven't taken the time to try this.

I just read some documents on sendfile() and have understood some of it.
So I tried TCP_SENDFILE in netperf this time. With the TCP_SENDFILE
option, I can achieve a higher TX performance of 301.4Mbps for TCP
(without TCP_SENDFILE it's 213.8Mbps - improved by almost 50%). For RX
there is no difference with or without TCP_SENDFILE (278Mbps), which shows
that Linux only supports zero-copy on the TX side, as you mentioned.

So until now, my best performance for TCP is TX (301Mbps) and RX
(278Mbps). There is still a long distance to your result (TX of 550Mbps).
In my system everything is the same as yours (PLB_TEMAC v3.00, SGDMA,
TX/RX DRE and CSUM offload, 16K TX/RX FIFO, 300MHz CPU) except that I am
using the open-source Linux rather than MontaVista Linux 4.0. Will
MontaVista Linux lead to such a much higher performance? Or what's the
real reason why my performance is still not as high as yours? I would
appreciate it a lot if you can give me more hints.

BTW, for your result of TX 550Mbps, did you just use the Linux stack or
also the Treck one?

Thanks again for your time and kind help.
BR
Ming
* Linux on ML403
  2007-02-09 14:16 ` Ming Liu
  2007-02-09 14:57   ` jozsef imrek
  2007-02-09 16:00   ` Rick Moleres
@ 2007-02-11  6:55   ` Linux on ML403 Leonid
  2007-02-11 13:10     ` Ming Liu
  2 siblings, 1 reply; 15+ messages in thread
From: Leonid @ 2007-02-11 6:55 UTC (permalink / raw)
To: linuxppc-embedded
Folks, is everybody using MontaVista Linux on ML403, or are there those
using the standard kernel from kernel.org? Or any other place I can get a
2.6 kernel for free?

I would also appreciate any instructions on how to patch it for the TEMAC
if necessary.

Thanks,

Leonid.
* RE: Linux on ML403
  2007-02-11  6:55 ` Linux on ML403 Leonid
@ 2007-02-11 13:10   ` Ming Liu
  0 siblings, 0 replies; 15+ messages in thread
From: Ming Liu @ 2007-02-11 13:10 UTC (permalink / raw)
To: Leonid; +Cc: linuxppc-embedded
Hi,
From http://source.mvista.com/ you can get the linux-xilinx-26.git kernel.
The TEMAC driver is provided along with the Xilinx EDK. So just overwrite
the corresponding files and try to compile. It would not be strange if
there are some errors - just debug them. Good luck.
BR
Ming
end of thread, other threads:[~2007-02-14  7:17 UTC | newest]

Thread overview: 15+ messages
2006-12-05 19:08 Speed of plb_temac 3.00 on ML403 Rick Moleres
2006-12-12 11:08 ` Ming Liu
2007-02-09 14:16 ` Ming Liu
2007-02-09 14:57   ` jozsef imrek
2007-02-11 15:25     ` Ming Liu
2007-02-12 18:09       ` jozsef imrek
2007-02-12 19:18         ` Ming Liu
2007-02-14  7:24           ` jozsef imrek
2007-02-09 16:00   ` Rick Moleres
2007-02-11  6:22     ` Leonid
2007-02-11 13:37     ` Ming Liu
2007-02-12 19:45       ` Rick Moleres
2007-02-12 20:39         ` Ming Liu
2007-02-11  6:55   ` Linux on ML403 Leonid
2007-02-11 13:10     ` Ming Liu