netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed?
@ 2020-07-01 18:44 Kegl Rohit
  2020-07-02  4:18 ` [EXT] " Andy Duan
  0 siblings, 1 reply; 8+ messages in thread
From: Kegl Rohit @ 2020-07-01 18:44 UTC (permalink / raw)
  To: netdev; +Cc: fugang.duan

[-- Attachment #1: Type: text/plain, Size: 3014 bytes --]

We are running a imx6q system using the FEC network interface driver.
Our goal is to process many small multicast raw frames with ~128 bytes
each at a rate of 4kHz. Therefore the driver must be able to cope with
receiving small packet bursts.

Tests with 6*128 byte brusts @ 4kHz interval (~24MBit/s) caused frame
drops from time to time as seen in IEEE_rx_macerr stat increasing.
Therefore the small FEC hardware FIFO overruns.
When adding additional system load via stress -m 4 --vm-bytes 50M the
situation gets worse. This additional load slows down the whole
system, also the NAPI processing of the FEC driver.

Because of this behaviour I created a flamegraph to get an idea of
time spent in the processing pipeline.
Attached is a flame graph with no additional system stress and one
without. I could observe that most time is spent in dma_sync_*() calls
during rx descriptor processing.

Commit https://github.com/torvalds/linux/commit/1b7bde6d659d30f171259cc2dfba8e5dab34e735
introduced copybreak to increase performance. In our test copybreak is
active the whole time because the 128 bytes packet size is below the
256 default limit.
As I can see from the commit the goal is to allocate a new skb with
the minimum size needed, memcpy the contents to it and pass the new
skb up the network stack. This allows it to keep the larger rx
descriptor buffer (~2KB) DMA mapped without allocation of a new
FEC_ENET_RX_FRSIZE (2KB) buffer each time.

fec_enet_copybreak(u32 length, ...) uses

dma_sync_single_for_cpu(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr), FEC_ENET_RX_FRSIZE - fep->rx_align,
DMA_FROM_DEVICE);
if (!swap)
   memcpy(new_skb->data, (*skb)->data, length);

to sync the descriptor data buffer and memcpy the data to the new skb
without calling dma_unmap().
Later in fec_enet_rx_queue() the dma descriptor buffer is synced again
in the opposite direction.

if (is_copybreak) {
  dma_sync_single_for_device(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr),  FEC_ENET_RX_FRSIZE - fep->rx_align,
DMA_FROM_DEVICE);
}

Now the two main questions:
1. Is it necessary to call dma_sync_single_for_cpu for the whole
buffer size (FEC_ENET_RX_FRSIZE - fep->rx_align), wouldn't syncing the
real packet length which is accessed by memcpy be enough?
Like so: dma_sync_single_for_cpu(&fep->pdev->dev,
fec32_to_cpu(bdp->cbd_bufaddr), (u32) length, DMA_FROM_DEVICE);

2. Is dma_sync_single_for_device even necessary? There is no data
passed back to the device because the skb descriptor buffer is not
modified and the fec peripheral does not need any valid data.
The example in https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
states:
 /* CPU should not write to
  * DMA_FROM_DEVICE-mapped area,
  * so dma_sync_single_for_device() is
  * not needed here. It would be required
  * for DMA_BIDIRECTIONAL mapping if
  * the memory was modified.
 */
I am new to the DMA API on ARM. Are these changes regarding cache
flushing,... allowed? These would increase the copybreak throughput by
reducing CPU load.

[-- Attachment #2: copybreak_memorystress.png --]
[-- Type: image/png, Size: 54512 bytes --]

[-- Attachment #3: copybreak_nostress.png --]
[-- Type: image/png, Size: 55569 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [EXT] net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed?
  2020-07-01 18:44 net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed? Kegl Rohit
@ 2020-07-02  4:18 ` Andy Duan
  2020-07-02  7:38   ` Kegl Rohit
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Duan @ 2020-07-02  4:18 UTC (permalink / raw)
  To: Kegl Rohit, netdev

From: Kegl Rohit <keglrohit@gmail.com> Sent: Thursday, July 2, 2020 2:45 AM
> fec_enet_copybreak(u32 length, ...) uses
> 
> dma_sync_single_for_cpu(&fep->pdev->dev,
> fec32_to_cpu(bdp->cbd_bufaddr), FEC_ENET_RX_FRSIZE - fep->rx_align,
> DMA_FROM_DEVICE); if (!swap)
>    memcpy(new_skb->data, (*skb)->data, length);
> 
> to sync the descriptor data buffer and memcpy the data to the new skb
> without calling dma_unmap().
dma_sync_* is enough, no need to call dma_unmap and dma_map_* that
will heavy load.

> Later in fec_enet_rx_queue() the dma descriptor buffer is synced again in the
> opposite direction.
> 
> if (is_copybreak) {
>   dma_sync_single_for_device(&fep->pdev->dev,
> fec32_to_cpu(bdp->cbd_bufaddr),  FEC_ENET_RX_FRSIZE - fep->rx_align,
> DMA_FROM_DEVICE); }
> 
dma_sync_single_for_cpu(DMA_FROM_DEVICE)
	__dma_inv_area  #invalidate the area

dma_sync_single_for_device(DMA_FROM_DEVICE)
	__dma_inv_area  #invalidate the area
	__dma_clean_area #clean the area

dma_sync_single_for_cpu() and dma_sync_single_for_device() are used in pairs,
there have no problem for usage.

> Now the two main questions:
> 1. Is it necessary to call dma_sync_single_for_cpu for the whole buffer size
> (FEC_ENET_RX_FRSIZE - fep->rx_align), wouldn't syncing the real packet
> length which is accessed by memcpy be enough?
> Like so: dma_sync_single_for_cpu(&fep->pdev->dev,
> fec32_to_cpu(bdp->cbd_bufaddr), (u32) length, DMA_FROM_DEVICE);

In general usage, you don't know the next frame size, and cannot ensure
the buffer is dirty or not, so invalidate the whole area for next frame.

On some arm64 A53, the dcache invalidate on A53 is flush + invalidate, 
and prefetch may fetch the area, that may causes dirty data flushed back
to the dma memory if the area has dirty data.

> 
> 2. Is dma_sync_single_for_device even necessary? There is no data passed
> back to the device because the skb descriptor buffer is not modified and the
> fec peripheral does not need any valid data.
> The example in
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.
> kernel.org%2Fdoc%2FDocumentation%2FDMA-API-HOWTO.txt&amp;data=0
> 2%7C01%7Cfugang.duan%40nxp.com%7C7fb56778153a4139214808d81deed
> a6d%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C1%7C637292258992
> 313637&amp;sdata=4Nv4J7APzmNTv7Dv39tmwpJhFeZ8bNY1eaoAQnx4FdM
> %3D&amp;reserved=0
> states:
>  /* CPU should not write to
>   * DMA_FROM_DEVICE-mapped area,
>   * so dma_sync_single_for_device() is
>   * not needed here. It would be required
>   * for DMA_BIDIRECTIONAL mapping if
>   * the memory was modified.
>  */
That should ensure the whole area is not dirty.

> I am new to the DMA API on ARM. Are these changes regarding cache
> flushing,... allowed? These would increase the copybreak throughput by
> reducing CPU load.

To avoid FIFO overrun, it requires to ensure PHY pause frame is enabled.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [EXT] net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed?
  2020-07-02  4:18 ` [EXT] " Andy Duan
@ 2020-07-02  7:38   ` Kegl Rohit
  2020-07-02  9:14     ` Andy Duan
  0 siblings, 1 reply; 8+ messages in thread
From: Kegl Rohit @ 2020-07-02  7:38 UTC (permalink / raw)
  To: Andy Duan; +Cc: netdev

On Thu, Jul 2, 2020 at 6:18 AM Andy Duan <fugang.duan@nxp.com> wrote:
>
> From: Kegl Rohit <keglrohit@gmail.com> Sent: Thursday, July 2, 2020 2:45 AM
> > fec_enet_copybreak(u32 length, ...) uses
> >
> > dma_sync_single_for_cpu(&fep->pdev->dev,
> > fec32_to_cpu(bdp->cbd_bufaddr), FEC_ENET_RX_FRSIZE - fep->rx_align,
> > DMA_FROM_DEVICE); if (!swap)
> >    memcpy(new_skb->data, (*skb)->data, length);
> >
> > to sync the descriptor data buffer and memcpy the data to the new skb
> > without calling dma_unmap().
> dma_sync_* is enough, no need to call dma_unmap and dma_map_* that
> will heavy load.
>
> > Later in fec_enet_rx_queue() the dma descriptor buffer is synced again in the
> > opposite direction.
> >
> > if (is_copybreak) {
> >   dma_sync_single_for_device(&fep->pdev->dev,
> > fec32_to_cpu(bdp->cbd_bufaddr),  FEC_ENET_RX_FRSIZE - fep->rx_align,
> > DMA_FROM_DEVICE); }
> >
> dma_sync_single_for_cpu(DMA_FROM_DEVICE)
>         __dma_inv_area  #invalidate the area
>
> dma_sync_single_for_device(DMA_FROM_DEVICE)
>         __dma_inv_area  #invalidate the area
>         __dma_clean_area #clean the area
>
> dma_sync_single_for_cpu() and dma_sync_single_for_device() are used in pairs,
> there have no problem for usage.
>
> > Now the two main questions:
> > 1. Is it necessary to call dma_sync_single_for_cpu for the whole buffer size
> > (FEC_ENET_RX_FRSIZE - fep->rx_align), wouldn't syncing the real packet
> > length which is accessed by memcpy be enough?
> > Like so: dma_sync_single_for_cpu(&fep->pdev->dev,
> > fec32_to_cpu(bdp->cbd_bufaddr), (u32) length, DMA_FROM_DEVICE);
>
> In general usage, you don't know the next frame size, and cannot ensure
> the buffer is dirty or not, so invalidate the whole area for next frame.
>
> On some arm64 A53, the dcache invalidate on A53 is flush + invalidate,
> and prefetch may fetch the area, that may causes dirty data flushed back
> to the dma memory if the area has dirty data.
>
> > 2. Is dma_sync_single_for_device even necessary? There is no data passed
> > back to the device because the skb descriptor buffer is not modified and the
> > fec peripheral does not need any valid data.
> > The example in
> > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.
> > kernel.org%2Fdoc%2FDocumentation%2FDMA-API-HOWTO.txt&amp;data=0
> > 2%7C01%7Cfugang.duan%40nxp.com%7C7fb56778153a4139214808d81deed
> > a6d%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C1%7C637292258992
> > 313637&amp;sdata=4Nv4J7APzmNTv7Dv39tmwpJhFeZ8bNY1eaoAQnx4FdM
> > %3D&amp;reserved=0
> > states:
> >  /* CPU should not write to
> >   * DMA_FROM_DEVICE-mapped area,
> >   * so dma_sync_single_for_device() is
> >   * not needed here. It would be required
> >   * for DMA_BIDIRECTIONAL mapping if
> >   * the memory was modified.
> >  */
> That should ensure the whole area is not dirty.

dma_sync_single_for_cpu() and dma_sync_single_for_device() can or must
be used in pairs?
So in this case it is really necessary to sync back the skb data
buffer via dma_sync_single_for_device? Even when the CPU does not
change any bytes in the skb data buffer / readonly like in this case.
And there is no DMA_BIDIRECTIONAL mapping.

I thought copybreak it is not about the next frame size. It is about
the current frame. And the actual length is known via the size field
in the finished DMA descriptor.
Or do you mean that the next received frame could be no copybreak frame.
1. Rx copybreakable frame with sizeX < copybreak
2. copybreak dma_sync_single_for_cpu(dmabuffer, sizeX)
3. copybreak alloc new_skb, memcpy(new_skb, dmabuffer, sizeX)
4. copybreak dma_sync_single_for_device(dmabuffer, sizeX)
5. Rx non copybreakable frame with sizeY >= copybreak
4. dma_unmap_single(dmabuffer, FEC_ENET_RX_FRSIZE - fep->rx_align) is
called and can cause data corruption because not all bytes were marked
dirty even if nobody DMA & CPU touched them?

> > I am new to the DMA API on ARM. Are these changes regarding cache
> > flushing,... allowed? These would increase the copybreak throughput by
> > reducing CPU load.
>
> To avoid FIFO overrun, it requires to ensure PHY pause frame is enabled.

As the errata states this is also not always true, because the first
xoff could arrive too late. Pause frames/flow control is not really
common and could cause troubles with other random network components
acting different or not supporting pause frames correctly. For example
the driver itself does enable pause frames for Gigabit by default. But
we have no Gigabit Phy so no FEC_QUIRK_HAS_GBIT and therefore pause
frames are not supported by the driver as of now.


It looks like copybreak is implemented similar to e1000_main.c
e1000_copybreak().
There is only the real/needed packet length (length =
le16_to_cpu(rx_desc->length)) is synced via dma_sync_single_for_cpu
and no dma_sync_single_for_device.

Here is a diff with the previous changes assuming that
dma_sync_single_for_device must be used to avoid any cache flush backs
even when no data was changed.

diff --git a/drivers/net/ethernet/freescale/fec_main.c
b/drivers/net/ethernet/freescale/fec_main.c
index 2d0d313ee..464783c15 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct net_device
*ndev, struct sk_buff **skb,
                return false;

        dma_sync_single_for_cpu(&fep->pdev->dev,
                                fec32_to_cpu(bdp->cbd_bufaddr),
-                               FEC_ENET_RX_FRSIZE - fep->rx_align,
+                               length,
                                DMA_FROM_DEVICE);
        if (!swap)
                memcpy(new_skb->data, (*skb)->data, length);
        else
@@ -1413,8 +1413,9 @@ fec_enet_rx_queue(struct net_device *ndev, int
budget, u16 queue_id)
        unsigned short status;
        struct  sk_buff *skb_new = NULL;
        struct  sk_buff *skb;
        ushort  pkt_len;
+       ushort  pkt_len_nofcs;
        __u8 *data;
        int     pkt_received = 0;
        struct  bufdesc_ex *ebdp = NULL;
        bool    vlan_packet_rcvd = false;
@@ -1479,9 +1480,10 @@ fec_enet_rx_queue(struct net_device *ndev, int
budget, u16 queue_id)
                /* The packet length includes FCS, but we don't want to
                 * include that when passing upstream as it messes up
                 * bridging applications.
                 */
-               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp, pkt_len - 4,
+               pkt_len_nofcs = pkt_len - 4;
+               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
pkt_len_nofcs,
                                                  need_swap);
                if (!is_copybreak) {
                        skb_new = netdev_alloc_skb(ndev, FEC_ENET_RX_FRSIZE);
                        if (unlikely(!skb_new)) {
@@ -1554,9 +1556,9 @@ fec_enet_rx_queue(struct net_device *ndev, int
budget, u16 queue_id)

                if (is_copybreak) {
                        dma_sync_single_for_device(&fep->pdev->dev,

fec32_to_cpu(bdp->cbd_bufaddr),
-                                                  FEC_ENET_RX_FRSIZE
- fep->rx_align,
+                                                  pkt_len_nofcs,
                                                   DMA_FROM_DEVICE);
                } else {
                        rxq->rx_skbuff[index] = skb_new;
                        fec_enet_new_rxbdp(ndev, bdp, skb_new);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* RE: [EXT] net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed?
  2020-07-02  7:38   ` Kegl Rohit
@ 2020-07-02  9:14     ` Andy Duan
  2020-07-06 14:18       ` Kegl Rohit
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Duan @ 2020-07-02  9:14 UTC (permalink / raw)
  To: Kegl Rohit; +Cc: netdev

From: Kegl Rohit <keglrohit@gmail.com> Sent: Thursday, July 2, 2020 3:39 PM
> On Thu, Jul 2, 2020 at 6:18 AM Andy Duan <fugang.duan@nxp.com> wrote:
> > That should ensure the whole area is not dirty.
> 
> dma_sync_single_for_cpu() and dma_sync_single_for_device() can or must be
> used in pairs?
> So in this case it is really necessary to sync back the skb data buffer via
> dma_sync_single_for_device? Even when the CPU does not change any bytes
> in the skb data buffer / readonly like in this case.

No, if the buffer is not modified, dma_sync_single_for_device() is not necessary.

For some arm64 core, the dcache invalidate on A53 is flush + invalidate, once the buffer
is modified, it will cause read back wrong data without dma_sync_single_for_device(). 
And the driver also support Coldfire platforms, I am not family with the arch.

From current analyze for arm/arm64,  I also think dma_sync_single_for_device() is not
necessary due to the buffer is not modified.

Anyway, it still need to get other experts comment, and it need to do many test and stress test.

> And there is no DMA_BIDIRECTIONAL mapping.
> 
> I thought copybreak it is not about the next frame size. It is about the current
> frame. And the actual length is known via the size field in the finished DMA
> descriptor.
> Or do you mean that the next received frame could be no copybreak frame.
> 1. Rx copybreakable frame with sizeX < copybreak 2. copybreak
> dma_sync_single_for_cpu(dmabuffer, sizeX) 3. copybreak alloc new_skb,
> memcpy(new_skb, dmabuffer, sizeX) 4. copybreak
> dma_sync_single_for_device(dmabuffer, sizeX) 5. Rx non copybreakable
> frame with sizeY >= copybreak 4. dma_unmap_single(dmabuffer,
> FEC_ENET_RX_FRSIZE - fep->rx_align) is called and can cause data corruption
> because not all bytes were marked dirty even if nobody DMA & CPU touched
> them?

No CPU touch, it should be clean.
> 
> > > I am new to the DMA API on ARM. Are these changes regarding cache
> > > flushing,... allowed? These would increase the copybreak throughput
> > > by reducing CPU load.
> >
> > To avoid FIFO overrun, it requires to ensure PHY pause frame is enabled.
> 
> As the errata states this is also not always true, because the first xoff could
> arrive too late. Pause frames/flow control is not really common and could
> cause troubles with other random network components acting different or not
> supporting pause frames correctly. For example the driver itself does enable
> pause frames for Gigabit by default. But we have no Gigabit Phy so no
> FEC_QUIRK_HAS_GBIT and therefore pause frames are not supported by the
> driver as of now.
> 
> 
> It looks like copybreak is implemented similar to e1000_main.c
> e1000_copybreak().
> There is only the real/needed packet length (length =
> le16_to_cpu(rx_desc->length)) is synced via dma_sync_single_for_cpu and no
> dma_sync_single_for_device.
> 
> Here is a diff with the previous changes assuming that
> dma_sync_single_for_device must be used to avoid any cache flush backs
> even when no data was changed.

Below change seems fine, can you collect some data before you send out
the patch for review.
- run iperf stress test to ensure the stability
- collect the performance improvement data 

Thanks.
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index 2d0d313ee..464783c15 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct net_device
> *ndev, struct sk_buff **skb,
>                 return false;
> 
>         dma_sync_single_for_cpu(&fep->pdev->dev,
>                                 fec32_to_cpu(bdp->cbd_bufaddr),
> -                               FEC_ENET_RX_FRSIZE - fep->rx_align,
> +                               length,
>                                 DMA_FROM_DEVICE);
>         if (!swap)
>                 memcpy(new_skb->data, (*skb)->data, length);
>         else
> @@ -1413,8 +1413,9 @@ fec_enet_rx_queue(struct net_device *ndev, int
> budget, u16 queue_id)
>         unsigned short status;
>         struct  sk_buff *skb_new = NULL;
>         struct  sk_buff *skb;
>         ushort  pkt_len;
> +       ushort  pkt_len_nofcs;
>         __u8 *data;
>         int     pkt_received = 0;
>         struct  bufdesc_ex *ebdp = NULL;
>         bool    vlan_packet_rcvd = false;
> @@ -1479,9 +1480,10 @@ fec_enet_rx_queue(struct net_device *ndev, int
> budget, u16 queue_id)
>                 /* The packet length includes FCS, but we don't want to
>                  * include that when passing upstream as it messes up
>                  * bridging applications.
>                  */
> -               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
> pkt_len - 4,
> +               pkt_len_nofcs = pkt_len - 4;
> +               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
> pkt_len_nofcs,
>                                                   need_swap);
>                 if (!is_copybreak) {
>                         skb_new = netdev_alloc_skb(ndev,
> FEC_ENET_RX_FRSIZE);
>                         if (unlikely(!skb_new)) { @@ -1554,9 +1556,9
> @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
> 
>                 if (is_copybreak) {
>                         dma_sync_single_for_device(&fep->pdev->dev,
> 
> fec32_to_cpu(bdp->cbd_bufaddr),
> -
> FEC_ENET_RX_FRSIZE
> - fep->rx_align,
> +
> pkt_len_nofcs,
> 
> DMA_FROM_DEVICE);
>                 } else {
>                         rxq->rx_skbuff[index] = skb_new;
>                         fec_enet_new_rxbdp(ndev, bdp, skb_new);

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [EXT] net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed?
  2020-07-02  9:14     ` Andy Duan
@ 2020-07-06 14:18       ` Kegl Rohit
  2020-07-07  3:44         ` Andy Duan
  0 siblings, 1 reply; 8+ messages in thread
From: Kegl Rohit @ 2020-07-06 14:18 UTC (permalink / raw)
  To: Andy Duan; +Cc: netdev

So you would also say a single dma_sync_single_for_cpu is enough.
You are right it would be great if some mm expert could have a look at
it. If any corruption could happen.

diff --git a/drivers/net/ethernet/freescale/fec_main.c
b/drivers/net/ethernet/freescale/fec_main.c
index 2d0d313ee..acc04726f 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct net_device
*ndev, struct sk_buff **skb,
                return false;

        dma_sync_single_for_cpu(&fep->pdev->dev,
                                fec32_to_cpu(bdp->cbd_bufaddr),
-                               FEC_ENET_RX_FRSIZE - fep->rx_align,
+                               length,
                                DMA_FROM_DEVICE);
        if (!swap)
                memcpy(new_skb->data, (*skb)->data, length);
        else
@@ -1551,14 +1551,9 @@ fec_enet_rx_queue(struct net_device *ndev, int
budget, u16 queue_id)
                                               vlan_tag);

                napi_gro_receive(&fep->napi, skb);

-               if (is_copybreak) {
-                       dma_sync_single_for_device(&fep->pdev->dev,
-
fec32_to_cpu(bdp->cbd_bufaddr),
-                                                  FEC_ENET_RX_FRSIZE
- fep->rx_align,
-                                                  DMA_FROM_DEVICE);
-               } else {
+               if (!is_copybreak) {
                        rxq->rx_skbuff[index] = skb_new;
                        fec_enet_new_rxbdp(ndev, bdp, skb_new);
                }

Testing is pretty complex because of multiple different contexts.
To test the whole rx chain throughput i decided to use UDP with 200
bytes payload.
Wireshark reports 242 bytes on wire, so copybreak is active (< 256 bytes).
I choose larger packets to reduce CPU load caused by the network stack / iperf3.
Smaller packets will benefit even more, but for this test the network
stack / iperf3 could become the bottleneck not the driver itself (e.g.
to test call skb_free instead of napi_gro_receive).
iperf missbehaved sometimes and caused a lot more CPU load, so using iperf3.
Kernelversion is Linux falcon 5.4.8-rt11-yocto-standard #59 SMP PREEMPT
Normaly our system has PREEMPT RT enabled. But I think to test
throughput PREEMPT is a better way to compare results.
NAPI and PREEMPT RT has even more contexts with the additional
threaded irq handler and therefore PREEMPT RT throughput is another
story.
stress -m 1 --vm-bytes 100M was used to simulate additional memory
load. Only one core to keep ressources free for iperf3.


###############################
# NOT PATCHED
###############################
-------------------------------
- NOT STRESSED
-------------------------------
user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10
Connecting to host 192.168.1.2, port 5201
[  4] local 192.168.1.1 port 52032 connected to 192.168.1.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  8.06 MBytes  67.6 Mbits/sec  42279
[  4]   1.00-2.00   sec  8.93 MBytes  74.9 Mbits/sec  46800
[  4]   2.00-3.00   sec  8.92 MBytes  74.8 Mbits/sec  46771
[  4]   3.00-4.00   sec  8.91 MBytes  74.7 Mbits/sec  46701
[  4]   4.00-5.00   sec  8.92 MBytes  74.9 Mbits/sec  46792
[  4]   5.00-6.00   sec  8.93 MBytes  74.9 Mbits/sec  46840
[  4]   6.00-7.00   sec  8.90 MBytes  74.6 Mbits/sec  46636
[  4]   7.00-8.00   sec  8.95 MBytes  75.1 Mbits/sec  46924
[  4]   8.00-9.00   sec  8.91 MBytes  74.7 Mbits/sec  46716
[  4]   9.00-10.00  sec  8.89 MBytes  74.5 Mbits/sec  46592
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter
Lost/Total Datagrams
[  4]   0.00-10.00  sec  88.3 MBytes  74.1 Mbits/sec  0.056 ms
123112/463051 (27%)
[  4] Sent 463051 datagrams

[root@imx6q ~]# iperf3 -s
Server listening on 5201
Accepted connection from 192.168.1.1, port 53394
[  5] local 192.168.1.2 port 5201 connected to 192.168.1.1 port 43224
[ ID] Interval           Transfer     Bitrate         Jitter
Lost/Total Datagrams
[  5]   0.00-1.00   sec  5.56 MBytes  46.6 Mbits/sec  0.046 ms
5520/34650 (16%)
[  5]   1.00-2.00   sec  6.40 MBytes  53.7 Mbits/sec  0.050 ms
13035/46576 (28%)
[  5]   2.00-3.00   sec  6.40 MBytes  53.7 Mbits/sec  0.044 ms
13295/46854 (28%)
[  5]   3.00-4.00   sec  6.26 MBytes  52.5 Mbits/sec  0.056 ms
13849/46692 (30%)
[  5]   4.00-5.00   sec  6.31 MBytes  53.0 Mbits/sec  0.052 ms
13735/46833 (29%)
[  5]   5.00-6.00   sec  6.32 MBytes  53.0 Mbits/sec  0.054 ms
13610/46761 (29%)
[  5]   6.00-7.00   sec  6.32 MBytes  53.0 Mbits/sec  0.043 ms
13780/46931 (29%)
[  5]   7.00-8.00   sec  6.33 MBytes  53.1 Mbits/sec  0.051 ms
13679/46848 (29%)
[  5]   8.00-9.00   sec  6.33 MBytes  53.1 Mbits/sec  0.050 ms
13746/46912 (29%)
[  5]   9.00-10.00  sec  6.33 MBytes  53.1 Mbits/sec  0.043 ms
13437/46626 (29%)
[  5]  10.00-10.17  sec  1.04 MBytes  52.5 Mbits/sec  0.044 ms
2233/7684 (29%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter
Lost/Total Datagrams
[  5]   0.00-10.17  sec  63.6 MBytes  52.5 Mbits/sec  0.044 ms
129919/463367 (28%)  receiver

[root@imx6q ~]# mpstat -P ALL 4 1
Linux 5.4.8-rt11 03/17/00 _armv7l_ (4 CPU)
23:38:51     CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest   %idle
23:38:55     all    2.49    0.00   19.96    0.00    0.00   25.45
0.00    0.00   52.10
23:38:55       0    0.00    0.00    0.25    0.00    0.00   99.75
0.00    0.00    0.00
23:38:55       1    1.54    0.00   11.54    0.00    0.00    0.00
0.00    0.00   86.92
23:38:55       2    1.27    0.00    5.85    0.00    0.00    0.00
0.00    0.00   92.88
23:38:55       3    7.24    0.00   63.31    0.00    0.00    0.00
0.00    0.00   29.46
=> ksoftirqd/0 @ 100%; iperf @ 94%

[root@imx6q ~]# sar -I 60 1
00:13:32         INTR    intr/s
00:13:33           60      0.00
00:13:34           60      0.00
00:13:35           60      0.00
00:13:36           60      0.00
=> 100% napi poll


-------------------------------
- STRESSED
- stress -m 1 --vm-bytes 100M &
-------------------------------
user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10
Connecting to host 192.168.1.2, port 5201
[  4] local 192.168.1.1 port 52129 connected to 192.168.1.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  8.09 MBytes  67.8 Mbits/sec  42399
[  4]   1.00-2.00   sec  8.88 MBytes  74.5 Mbits/sec  46552
[  4]   2.00-3.00   sec  8.92 MBytes  74.8 Mbits/sec  46780
[  4]   3.00-4.00   sec  8.91 MBytes  74.7 Mbits/sec  46688
[  4]   4.00-5.00   sec  8.91 MBytes  74.7 Mbits/sec  46712
[  4]   5.00-6.00   sec  8.91 MBytes  74.8 Mbits/sec  46724
[  4]   6.00-7.00   sec  8.95 MBytes  75.0 Mbits/sec  46900
[  4]   7.00-8.00   sec  8.95 MBytes  75.1 Mbits/sec  46928
[  4]   8.00-9.00   sec  8.93 MBytes  74.9 Mbits/sec  46796
[  4]   9.00-10.00  sec  8.91 MBytes  74.8 Mbits/sec  46732
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter
Lost/Total Datagrams
[  4]   0.00-10.00  sec  88.4 MBytes  74.1 Mbits/sec  0.067 ms
233989/463134 (51%)
[  4] Sent 463134 datagrams

[root@imx6q ~]# iperf3 -s
Server listening on 5201
Accepted connection from 192.168.1.1, port 57892
[  5] local 192.168.1.2 port 5201 connected to 192.168.1.1 port 43729
[ ID] Interval           Transfer     Bitrate         Jitter
Lost/Total Datagrams
[  5]   0.00-1.00   sec  3.59 MBytes  30.1 Mbits/sec  0.071 ms
13027/31836 (41%)
[  5]   1.00-2.00   sec  4.32 MBytes  36.2 Mbits/sec  0.075 ms
23690/46345 (51%)
[  5]   2.00-3.00   sec  4.28 MBytes  35.9 Mbits/sec  0.040 ms
24879/47293 (53%)
[  5]   3.00-4.00   sec  4.05 MBytes  34.0 Mbits/sec  0.055 ms
24430/45651 (54%)
[  5]   4.00-5.00   sec  4.01 MBytes  33.6 Mbits/sec  0.130 ms
25200/46209 (55%)
[  5]   5.00-6.00   sec  4.22 MBytes  35.4 Mbits/sec  0.052 ms
24777/46902 (53%)
[  5]   6.00-7.00   sec  4.04 MBytes  33.9 Mbits/sec  0.057 ms
24452/45635 (54%)
[  5]   7.00-8.00   sec  4.26 MBytes  35.7 Mbits/sec  0.063 ms
26530/48871 (54%)
[  5]   8.00-9.00   sec  4.06 MBytes  34.1 Mbits/sec  0.061 ms
24293/45583 (53%)
[  5]   9.00-10.00  sec  4.15 MBytes  34.8 Mbits/sec  0.155 ms
24752/46499 (53%)
[  5]  10.00-10.25  sec   985 KBytes  32.5 Mbits/sec  0.069 ms
6246/11291 (55%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter
Lost/Total Datagrams
[  5]   0.00-10.25  sec  41.9 MBytes  34.3 Mbits/sec  0.069 ms
242276/462115 (52%)  receiver

[root@imx6q ~]# mpstat -P ALL 4 1
Linux 5.4.8-rt11 03/17/00 _armv7l_ (4 CPU)
23:43:12     CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest   %idle
23:43:16     all    2.51    0.00   42.42    0.00    0.00   25.47
0.00    0.00   29.59
23:43:16       0    0.00    0.00    1.25    0.00    0.00   98.75
0.00    0.00    0.00
23:43:16       1    2.25    0.00   97.75    0.00    0.00    0.00
0.00    0.00    0.00
23:43:16       2    0.25    0.00    2.29    0.00    0.00    0.00
0.00    0.00   97.46
23:43:16       3    8.08    0.00   70.75    0.00    0.00    0.00
0.00    0.00   21.17
=> ksoftirqd/0 @ 100%; stress @ 100%; iperf3 @ 84%

[root@imx6q ~]# sar -I 60 1
00:14:41         INTR    intr/s
00:14:42           60      0.00
00:14:43           60      0.00
00:14:44           60      0.00
=> 100% napi poll


###############################
# PATCHED
###############################
-------------------------------
-NOT STRESSED
-------------------------------
user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10
Connecting to host 192.168.1.2, port 5201
[  4] local 192.168.1.1 port 50177 connected to 192.168.1.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  8.08 MBytes  67.8 Mbits/sec  42373
[  4]   1.00-2.00   sec  8.91 MBytes  74.7 Mbits/sec  46693
[  4]   2.00-3.00   sec  8.93 MBytes  74.9 Mbits/sec  46803
[  4]   3.00-4.00   sec  8.93 MBytes  74.9 Mbits/sec  46804
[  4]   4.00-5.00   sec  8.93 MBytes  74.9 Mbits/sec  46804
[  4]   5.00-6.00   sec  8.92 MBytes  74.8 Mbits/sec  46764
[  4]   6.00-7.00   sec  8.95 MBytes  75.1 Mbits/sec  46944
[  4]   7.00-8.00   sec  8.90 MBytes  74.7 Mbits/sec  46672
[  4]   8.00-9.00   sec  8.91 MBytes  74.8 Mbits/sec  46732
[  4]   9.00-10.00  sec  8.90 MBytes  74.7 Mbits/sec  46660
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter
Lost/Total Datagrams
[  4]   0.00-10.00  sec  88.4 MBytes  74.1 Mbits/sec  0.032 ms
16561/463165 (3.6%)
[  4] Sent 463165 datagrams

[root@imx6q ~]# iperf3 -s
Server listening on 5201
Accepted connection from 192.168.1.1, port 45012
[  5] local 192.168.1.2 port 5201 connected to 192.168.1.1 port 50177
[ ID] Interval           Transfer     Bitrate         Jitter
Lost/Total Datagrams
[  5]   0.00-1.00   sec  7.36 MBytes  61.8 Mbits/sec  0.030 ms
1430/40038 (3.6%)
[  5]   1.00-2.00   sec  8.60 MBytes  72.1 Mbits/sec  0.030 ms
1757/46850 (3.8%)
[  5]   2.00-3.00   sec  8.61 MBytes  72.2 Mbits/sec  0.039 ms
1690/46809 (3.6%)
[  5]   3.00-4.00   sec  8.61 MBytes  72.3 Mbits/sec  0.037 ms
1650/46812 (3.5%)
[  5]   4.00-5.00   sec  8.59 MBytes  72.1 Mbits/sec  0.033 ms
1589/46651 (3.4%)
[  5]   5.00-6.00   sec  8.60 MBytes  72.1 Mbits/sec  0.036 ms
1904/46967 (4.1%)
[  5]   6.00-7.00   sec  8.61 MBytes  72.2 Mbits/sec  0.046 ms
1633/46766 (3.5%)
[  5]   7.00-8.00   sec  8.60 MBytes  72.1 Mbits/sec  0.030 ms
1748/46818 (3.7%)
[  5]   8.00-9.00   sec  8.61 MBytes  72.2 Mbits/sec  0.036 ms
1572/46707 (3.4%)
[  5]   9.00-10.00  sec  8.61 MBytes  72.3 Mbits/sec  0.031 ms
1538/46703 (3.3%)
[  5]  10.00-10.04  sec   389 KBytes  72.1 Mbits/sec  0.032 ms  50/2044 (2.4%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter
Lost/Total Datagrams
[  5]   0.00-10.04  sec  85.2 MBytes  71.1 Mbits/sec  0.032 ms
16561/463165 (3.6%)  receiver

[root@imx6q ~]# mpstat -P ALL 4 1
Linux 5.4.8-rt11-yocto-standard  03/17/00 _armv7l_ (4 CPU)
23:15:50     CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest   %idle
23:15:54     all    3.06    0.00   26.11    0.00    0.00   11.09
0.00    0.00   59.74
23:15:54       0    0.00    0.00    0.00    0.00    0.00   86.44
0.00    0.00   13.56
23:15:54       1   10.25    0.00   89.50    0.00    0.00    0.25
0.00    0.00    0.00
23:15:54       2    0.00    0.00    0.00    0.00    0.00    0.00
0.00    0.00  100.00
23:15:54       3    0.00    0.00    0.00    0.00    0.00    0.00
0.00    0.00  100.00
=> ksoftirqd/* @ 0% CPU; iperf3 @ 100%

[root@imx6q ~]# sar -I 60 1
23:56:10         INTR    intr/s
23:56:11           60  12339.00
23:56:12           60  11476.00
=> irq load high


-------------------------------
- STRESSED
- stress -m 1 --vm-bytes 100M &
-------------------------------
user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10
Connecting to host 192.168.1.2, port 5201
[  4] local 192.168.1.1 port 59042 connected to 192.168.1.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-1.00   sec  7.92 MBytes  66.2 Mbits/sec  41527
[  4]   1.00-2.00   sec  8.95 MBytes  75.4 Mbits/sec  46931
[  4]   2.00-3.00   sec  8.95 MBytes  75.1 Mbits/sec  46923
[  4]   3.00-4.00   sec  8.96 MBytes  75.2 Mbits/sec  46983
[  4]   4.00-5.00   sec  8.96 MBytes  75.2 Mbits/sec  46999
[  4]   5.00-6.00   sec  8.91 MBytes  74.5 Mbits/sec  46688
[  4]   6.00-7.00   sec  8.93 MBytes  75.1 Mbits/sec  46824
[  4]   7.00-8.00   sec  8.94 MBytes  75.0 Mbits/sec  46872
[  4]   8.00-9.00   sec  8.91 MBytes  74.7 Mbits/sec  46700
[  4]   9.00-10.00  sec  8.91 MBytes  74.8 Mbits/sec  46720
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter
Lost/Total Datagrams
[  4]   0.00-10.00  sec  88.3 MBytes  74.1 Mbits/sec  0.042 ms
70360/462955 (15%)
[  4] Sent 462955 datagrams


[root@imx6q ~]# iperf3 -s
Server listening on 5201
Accepted connection from 192.168.1.1, port 46306
[  5] local 192.168.1.2 port 5201 connected to 192.168.1.1 port 59042
[ ID] Interval           Transfer     Bitrate         Jitter
Lost/Total Datagrams
[  5]   0.00-1.00   sec  6.54 MBytes  54.8 Mbits/sec  0.038 ms
5602/39871 (14%)
[  5]   1.00-2.00   sec  7.51 MBytes  63.0 Mbits/sec  0.037 ms
6973/46362 (15%)
[  5]   2.00-3.00   sec  7.54 MBytes  63.3 Mbits/sec  0.037 ms
7414/46966 (16%)
[  5]   3.00-4.00   sec  7.56 MBytes  63.4 Mbits/sec  0.038 ms
7354/46984 (16%)
[  5]   4.00-5.00   sec  7.60 MBytes  63.7 Mbits/sec  0.031 ms
7241/47069 (15%)
[  5]   5.00-6.00   sec  7.58 MBytes  63.6 Mbits/sec  0.033 ms
7134/46865 (15%)
[  5]   6.00-7.00   sec  7.56 MBytes  63.5 Mbits/sec  0.058 ms
6991/46649 (15%)
[  5]   7.00-8.00   sec  7.57 MBytes  63.5 Mbits/sec  0.043 ms
7259/46933 (15%)
[  5]   8.00-9.00   sec  7.56 MBytes  63.4 Mbits/sec  0.038 ms
7065/46721 (15%)
[  5]   9.00-10.00  sec  7.55 MBytes  63.3 Mbits/sec  0.042 ms
7002/46588 (15%)
[  5]  10.00-10.04  sec   317 KBytes  61.8 Mbits/sec  0.042 ms  325/1947 (17%)
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter
Lost/Total Datagrams
[  5]   0.00-10.04  sec  74.9 MBytes  62.6 Mbits/sec  0.042 ms
70360/462955 (15%)  receiver


[root@imx6q ~]# mpstat -P ALL 4 1
Linux 5.4.8-rt11-yocto-standard 03/17/00 _armv7l_ (4 CPU)
23:49:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft
%steal  %guest   %idle
23:50:03     all    3.21    0.00   47.27    0.00    0.00   25.02
0.00    0.00   24.51
23:50:03       0    0.00    0.00    0.50    0.00    0.00   99.50
0.00    0.00    0.00
23:50:03       1    0.00    0.00    0.26    0.00    0.00    0.00
0.00    0.00   99.74
23:50:03       2    9.50    0.00   90.50    0.00    0.00    0.00
0.00    0.00    0.00
23:50:03       3    3.01    0.00   96.99    0.00    0.00    0.00
0.00    0.00    0.00
=> ksoftirqd/0 @ 92%; stress @ 100%; iperf3 @ 100%

[root@imx6q ~]# sar -I 60 1
23:57:08           60    505.00
23:57:09           60    748.00
23:57:10           60    416.00
=> irq load low => napi active but not all the time

===============================================================

As a result the patched system stressed with memory load has a higher
throughput than the unpatched system without memory stress.
The system could nearly keep up with the desktop sending @ 75MBit/s
(100MBit/s link) with i210 / igb.
I think IRQ/s could reach a new peak because of the faster napi_poll /
rx_queue execution times.
iperf, iperf3 is mainly to test throughput. Is there any tool which
checks for connection reliability (packet content, sequence, ...)?

On Thu, Jul 2, 2020 at 11:14 AM Andy Duan <fugang.duan@nxp.com> wrote:
>
> From: Kegl Rohit <keglrohit@gmail.com> Sent: Thursday, July 2, 2020 3:39 PM
> > On Thu, Jul 2, 2020 at 6:18 AM Andy Duan <fugang.duan@nxp.com> wrote:
> > > That should ensure the whole area is not dirty.
> >
> > dma_sync_single_for_cpu() and dma_sync_single_for_device() can or must be
> > used in pairs?
> > So in this case it is really necessary to sync back the skb data buffer via
> > dma_sync_single_for_device? Even when the CPU does not change any bytes
> > in the skb data buffer / readonly like in this case.
>
> No, if the buffer is not modified, dma_sync_single_for_device() is not necessary.
>
> For some arm64 core, the dcache invalidate on A53 is flush + invalidate, once the buffer
> is modified, it will cause read back wrong data without dma_sync_single_for_device().
> And the driver also support Coldfire platforms, I am not family with the arch.
>
> From current analyze for arm/arm64,  I also think dma_sync_single_for_device() is not
> necessary due to the buffer is not modified.
>
> Anyway, it still need to get other experts comment, and it need to do many test and stress test.
>
> > And there is no DMA_BIDIRECTIONAL mapping.
> >
> > I thought copybreak it is not about the next frame size. It is about the current
> > frame. And the actual length is known via the size field in the finished DMA
> > descriptor.
> > Or do you mean that the next received frame could be no copybreak frame.
> > 1. Rx copybreakable frame with sizeX < copybreak 2. copybreak
> > dma_sync_single_for_cpu(dmabuffer, sizeX) 3. copybreak alloc new_skb,
> > memcpy(new_skb, dmabuffer, sizeX) 4. copybreak
> > dma_sync_single_for_device(dmabuffer, sizeX) 5. Rx non copybreakable
> > frame with sizeY >= copybreak 4. dma_unmap_single(dmabuffer,
> > FEC_ENET_RX_FRSIZE - fep->rx_align) is called and can cause data corruption
> > because not all bytes were marked dirty even if nobody DMA & CPU touched
> > them?
>
> No CPU touch, it should be clean.
> >
> > > > I am new to the DMA API on ARM. Are these changes regarding cache
> > > > flushing,... allowed? These would increase the copybreak throughput
> > > > by reducing CPU load.
> > >
> > > To avoid FIFO overrun, it requires to ensure PHY pause frame is enabled.
> >
> > As the errata states this is also not always true, because the first xoff could
> > arrive too late. Pause frames/flow control is not really common and could
> > cause troubles with other random network components acting different or not
> > supporting pause frames correctly. For example the driver itself does enable
> > pause frames for Gigabit by default. But we have no Gigabit Phy so no
> > FEC_QUIRK_HAS_GBIT and therefore pause frames are not supported by the
> > driver as of now.
> >
> >
> > It looks like copybreak is implemented similar to e1000_main.c
> > e1000_copybreak().
> > There is only the real/needed packet length (length =
> > le16_to_cpu(rx_desc->length)) is synced via dma_sync_single_for_cpu and no
> > dma_sync_single_for_device.
> >
> > Here is a diff with the previous changes assuming that
> > dma_sync_single_for_device must be used to avoid any cache flush backs
> > even when no data was changed.
>
> Below change seems fine, can you collect some data before you send out
> the patch for review.
> - run iperf stress test to ensure the stability
> - collect the performance improvement data
>
> Thanks.
> >
> > diff --git a/drivers/net/ethernet/freescale/fec_main.c
> > b/drivers/net/ethernet/freescale/fec_main.c
> > index 2d0d313ee..464783c15 100644
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct net_device
> > *ndev, struct sk_buff **skb,
> >                 return false;
> >
> >         dma_sync_single_for_cpu(&fep->pdev->dev,
> >                                 fec32_to_cpu(bdp->cbd_bufaddr),
> > -                               FEC_ENET_RX_FRSIZE - fep->rx_align,
> > +                               length,
> >                                 DMA_FROM_DEVICE);
> >         if (!swap)
> >                 memcpy(new_skb->data, (*skb)->data, length);
> >         else
> > @@ -1413,8 +1413,9 @@ fec_enet_rx_queue(struct net_device *ndev, int
> > budget, u16 queue_id)
> >         unsigned short status;
> >         struct  sk_buff *skb_new = NULL;
> >         struct  sk_buff *skb;
> >         ushort  pkt_len;
> > +       ushort  pkt_len_nofcs;
> >         __u8 *data;
> >         int     pkt_received = 0;
> >         struct  bufdesc_ex *ebdp = NULL;
> >         bool    vlan_packet_rcvd = false;
> > @@ -1479,9 +1480,10 @@ fec_enet_rx_queue(struct net_device *ndev, int
> > budget, u16 queue_id)
> >                 /* The packet length includes FCS, but we don't want to
> >                  * include that when passing upstream as it messes up
> >                  * bridging applications.
> >                  */
> > -               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
> > pkt_len - 4,
> > +               pkt_len_nofcs = pkt_len - 4;
> > +               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
> > pkt_len_nofcs,
> >                                                   need_swap);
> >                 if (!is_copybreak) {
> >                         skb_new = netdev_alloc_skb(ndev,
> > FEC_ENET_RX_FRSIZE);
> >                         if (unlikely(!skb_new)) { @@ -1554,9 +1556,9
> > @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
> >
> >                 if (is_copybreak) {
> >                         dma_sync_single_for_device(&fep->pdev->dev,
> >
> > fec32_to_cpu(bdp->cbd_bufaddr),
> > -
> > FEC_ENET_RX_FRSIZE
> > - fep->rx_align,
> > +
> > pkt_len_nofcs,
> >
> > DMA_FROM_DEVICE);
> >                 } else {
> >                         rxq->rx_skbuff[index] = skb_new;
> >                         fec_enet_new_rxbdp(ndev, bdp, skb_new);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* RE: [EXT] net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed?
  2020-07-06 14:18       ` Kegl Rohit
@ 2020-07-07  3:44         ` Andy Duan
  2020-07-07 16:14           ` Robin Murphy
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Duan @ 2020-07-07  3:44 UTC (permalink / raw)
  To: Kegl Rohit, catalin.marinas, will, 'Christoph Hellwig'
  Cc: netdev, m.szyprowski, robin.murphy

Hi mm experts,

From: Kegl Rohit <keglrohit@gmail.com> Sent: Monday, July 6, 2020 10:18 PM
> So you would also say a single dma_sync_single_for_cpu is enough.
> You are right it would be great if some mm expert could have a look at it. If
> any corruption could happen.

Kegl want to remove dma_sync_single_for_device(), who thinks the cost is
not necessary, and only dma_sync_single_for_cpu() is enough,  do you have
any comments ?

Thanks!
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index 2d0d313ee..acc04726f 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct net_device
> *ndev, struct sk_buff **skb,
>                 return false;
> 
>         dma_sync_single_for_cpu(&fep->pdev->dev,
>                                 fec32_to_cpu(bdp->cbd_bufaddr),
> -                               FEC_ENET_RX_FRSIZE - fep->rx_align,
> +                               length,
>                                 DMA_FROM_DEVICE);
>         if (!swap)
>                 memcpy(new_skb->data, (*skb)->data, length);
>         else
> @@ -1551,14 +1551,9 @@ fec_enet_rx_queue(struct net_device *ndev, int
> budget, u16 queue_id)
>                                                vlan_tag);
> 
>                 napi_gro_receive(&fep->napi, skb);
> 
> -               if (is_copybreak) {
> -                       dma_sync_single_for_device(&fep->pdev->dev,
> -
> fec32_to_cpu(bdp->cbd_bufaddr),
> -
> FEC_ENET_RX_FRSIZE
> - fep->rx_align,
> -
> DMA_FROM_DEVICE);
> -               } else {
> +               if (!is_copybreak) {
>                         rxq->rx_skbuff[index] = skb_new;
>                         fec_enet_new_rxbdp(ndev, bdp, skb_new);
>                 }
> 
> Testing is pretty complex because of multiple different contexts.
> To test the whole rx chain throughput i decided to use UDP with 200 bytes
> payload.
> Wireshark reports 242 bytes on wire, so copybreak is active (< 256 bytes).
> I choose larger packets to reduce CPU load caused by the network stack /
> iperf3.
> Smaller packets will benefit even more, but for this test the network stack /
> iperf3 could become the bottleneck not the driver itself (e.g.
> to test call skb_free instead of napi_gro_receive).
> iperf missbehaved sometimes and caused a lot more CPU load, so using
> iperf3.
> Kernelversion is Linux falcon 5.4.8-rt11-yocto-standard #59 SMP PREEMPT
> Normaly our system has PREEMPT RT enabled. But I think to test throughput
> PREEMPT is a better way to compare results.
> NAPI and PREEMPT RT has even more contexts with the additional threaded
> irq handler and therefore PREEMPT RT throughput is another story.
> stress -m 1 --vm-bytes 100M was used to simulate additional memory load.
> Only one core to keep ressources free for iperf3.
> 
> 
> ###############################
> # NOT PATCHED
> ###############################
> -------------------------------
> - NOT STRESSED
> -------------------------------
> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting to host
> 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 52032 connected to
> 192.168.1.2 port 5201
> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
> [  4]   0.00-1.00   sec  8.06 MBytes  67.6 Mbits/sec  42279
> [  4]   1.00-2.00   sec  8.93 MBytes  74.9 Mbits/sec  46800
> [  4]   2.00-3.00   sec  8.92 MBytes  74.8 Mbits/sec  46771
> [  4]   3.00-4.00   sec  8.91 MBytes  74.7 Mbits/sec  46701
> [  4]   4.00-5.00   sec  8.92 MBytes  74.9 Mbits/sec  46792
> [  4]   5.00-6.00   sec  8.93 MBytes  74.9 Mbits/sec  46840
> [  4]   6.00-7.00   sec  8.90 MBytes  74.6 Mbits/sec  46636
> [  4]   7.00-8.00   sec  8.95 MBytes  75.1 Mbits/sec  46924
> [  4]   8.00-9.00   sec  8.91 MBytes  74.7 Mbits/sec  46716
> [  4]   9.00-10.00  sec  8.89 MBytes  74.5 Mbits/sec  46592
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Jitter
> Lost/Total Datagrams
> [  4]   0.00-10.00  sec  88.3 MBytes  74.1 Mbits/sec  0.056 ms
> 123112/463051 (27%)
> [  4] Sent 463051 datagrams
> 
> [root@imx6q ~]# iperf3 -s
> Server listening on 5201
> Accepted connection from 192.168.1.1, port 53394 [  5] local 192.168.1.2
> port 5201 connected to 192.168.1.1 port 43224
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-1.00   sec  5.56 MBytes  46.6 Mbits/sec  0.046 ms
> 5520/34650 (16%)
> [  5]   1.00-2.00   sec  6.40 MBytes  53.7 Mbits/sec  0.050 ms
> 13035/46576 (28%)
> [  5]   2.00-3.00   sec  6.40 MBytes  53.7 Mbits/sec  0.044 ms
> 13295/46854 (28%)
> [  5]   3.00-4.00   sec  6.26 MBytes  52.5 Mbits/sec  0.056 ms
> 13849/46692 (30%)
> [  5]   4.00-5.00   sec  6.31 MBytes  53.0 Mbits/sec  0.052 ms
> 13735/46833 (29%)
> [  5]   5.00-6.00   sec  6.32 MBytes  53.0 Mbits/sec  0.054 ms
> 13610/46761 (29%)
> [  5]   6.00-7.00   sec  6.32 MBytes  53.0 Mbits/sec  0.043 ms
> 13780/46931 (29%)
> [  5]   7.00-8.00   sec  6.33 MBytes  53.1 Mbits/sec  0.051 ms
> 13679/46848 (29%)
> [  5]   8.00-9.00   sec  6.33 MBytes  53.1 Mbits/sec  0.050 ms
> 13746/46912 (29%)
> [  5]   9.00-10.00  sec  6.33 MBytes  53.1 Mbits/sec  0.043 ms
> 13437/46626 (29%)
> [  5]  10.00-10.17  sec  1.04 MBytes  52.5 Mbits/sec  0.044 ms
> 2233/7684 (29%)
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-10.17  sec  63.6 MBytes  52.5 Mbits/sec  0.044 ms
> 129919/463367 (28%)  receiver
> 
> [root@imx6q ~]# mpstat -P ALL 4 1
> Linux 5.4.8-rt11 03/17/00 _armv7l_ (4 CPU)
> 23:38:51     CPU    %usr   %nice    %sys %iowait    %irq   %soft
> %steal  %guest   %idle
> 23:38:55     all    2.49    0.00   19.96    0.00    0.00   25.45
> 0.00    0.00   52.10
> 23:38:55       0    0.00    0.00    0.25    0.00    0.00   99.75
> 0.00    0.00    0.00
> 23:38:55       1    1.54    0.00   11.54    0.00    0.00    0.00
> 0.00    0.00   86.92
> 23:38:55       2    1.27    0.00    5.85    0.00    0.00    0.00
> 0.00    0.00   92.88
> 23:38:55       3    7.24    0.00   63.31    0.00    0.00    0.00
> 0.00    0.00   29.46
> => ksoftirqd/0 @ 100%; iperf @ 94%
> 
> [root@imx6q ~]# sar -I 60 1
> 00:13:32         INTR    intr/s
> 00:13:33           60      0.00
> 00:13:34           60      0.00
> 00:13:35           60      0.00
> 00:13:36           60      0.00
> => 100% napi poll
> 
> 
> -------------------------------
> - STRESSED
> - stress -m 1 --vm-bytes 100M &
> -------------------------------
> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting to host
> 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 52129 connected to
> 192.168.1.2 port 5201
> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
> [  4]   0.00-1.00   sec  8.09 MBytes  67.8 Mbits/sec  42399
> [  4]   1.00-2.00   sec  8.88 MBytes  74.5 Mbits/sec  46552
> [  4]   2.00-3.00   sec  8.92 MBytes  74.8 Mbits/sec  46780
> [  4]   3.00-4.00   sec  8.91 MBytes  74.7 Mbits/sec  46688
> [  4]   4.00-5.00   sec  8.91 MBytes  74.7 Mbits/sec  46712
> [  4]   5.00-6.00   sec  8.91 MBytes  74.8 Mbits/sec  46724
> [  4]   6.00-7.00   sec  8.95 MBytes  75.0 Mbits/sec  46900
> [  4]   7.00-8.00   sec  8.95 MBytes  75.1 Mbits/sec  46928
> [  4]   8.00-9.00   sec  8.93 MBytes  74.9 Mbits/sec  46796
> [  4]   9.00-10.00  sec  8.91 MBytes  74.8 Mbits/sec  46732
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Jitter
> Lost/Total Datagrams
> [  4]   0.00-10.00  sec  88.4 MBytes  74.1 Mbits/sec  0.067 ms
> 233989/463134 (51%)
> [  4] Sent 463134 datagrams
> 
> [root@imx6q ~]# iperf3 -s
> Server listening on 5201
> Accepted connection from 192.168.1.1, port 57892 [  5] local 192.168.1.2
> port 5201 connected to 192.168.1.1 port 43729
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-1.00   sec  3.59 MBytes  30.1 Mbits/sec  0.071 ms
> 13027/31836 (41%)
> [  5]   1.00-2.00   sec  4.32 MBytes  36.2 Mbits/sec  0.075 ms
> 23690/46345 (51%)
> [  5]   2.00-3.00   sec  4.28 MBytes  35.9 Mbits/sec  0.040 ms
> 24879/47293 (53%)
> [  5]   3.00-4.00   sec  4.05 MBytes  34.0 Mbits/sec  0.055 ms
> 24430/45651 (54%)
> [  5]   4.00-5.00   sec  4.01 MBytes  33.6 Mbits/sec  0.130 ms
> 25200/46209 (55%)
> [  5]   5.00-6.00   sec  4.22 MBytes  35.4 Mbits/sec  0.052 ms
> 24777/46902 (53%)
> [  5]   6.00-7.00   sec  4.04 MBytes  33.9 Mbits/sec  0.057 ms
> 24452/45635 (54%)
> [  5]   7.00-8.00   sec  4.26 MBytes  35.7 Mbits/sec  0.063 ms
> 26530/48871 (54%)
> [  5]   8.00-9.00   sec  4.06 MBytes  34.1 Mbits/sec  0.061 ms
> 24293/45583 (53%)
> [  5]   9.00-10.00  sec  4.15 MBytes  34.8 Mbits/sec  0.155 ms
> 24752/46499 (53%)
> [  5]  10.00-10.25  sec   985 KBytes  32.5 Mbits/sec  0.069 ms
> 6246/11291 (55%)
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-10.25  sec  41.9 MBytes  34.3 Mbits/sec  0.069 ms
> 242276/462115 (52%)  receiver
> 
> [root@imx6q ~]# mpstat -P ALL 4 1
> Linux 5.4.8-rt11 03/17/00 _armv7l_ (4 CPU)
> 23:43:12     CPU    %usr   %nice    %sys %iowait    %irq   %soft
> %steal  %guest   %idle
> 23:43:16     all    2.51    0.00   42.42    0.00    0.00   25.47
> 0.00    0.00   29.59
> 23:43:16       0    0.00    0.00    1.25    0.00    0.00   98.75
> 0.00    0.00    0.00
> 23:43:16       1    2.25    0.00   97.75    0.00    0.00    0.00
> 0.00    0.00    0.00
> 23:43:16       2    0.25    0.00    2.29    0.00    0.00    0.00
> 0.00    0.00   97.46
> 23:43:16       3    8.08    0.00   70.75    0.00    0.00    0.00
> 0.00    0.00   21.17
> => ksoftirqd/0 @ 100%; stress @ 100%; iperf3 @ 84%
> 
> [root@imx6q ~]# sar -I 60 1
> 00:14:41         INTR    intr/s
> 00:14:42           60      0.00
> 00:14:43           60      0.00
> 00:14:44           60      0.00
> => 100% napi poll
> 
> 
> ###############################
> # PATCHED
> ###############################
> -------------------------------
> -NOT STRESSED
> -------------------------------
> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting to host
> 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 50177 connected to
> 192.168.1.2 port 5201
> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
> [  4]   0.00-1.00   sec  8.08 MBytes  67.8 Mbits/sec  42373
> [  4]   1.00-2.00   sec  8.91 MBytes  74.7 Mbits/sec  46693
> [  4]   2.00-3.00   sec  8.93 MBytes  74.9 Mbits/sec  46803
> [  4]   3.00-4.00   sec  8.93 MBytes  74.9 Mbits/sec  46804
> [  4]   4.00-5.00   sec  8.93 MBytes  74.9 Mbits/sec  46804
> [  4]   5.00-6.00   sec  8.92 MBytes  74.8 Mbits/sec  46764
> [  4]   6.00-7.00   sec  8.95 MBytes  75.1 Mbits/sec  46944
> [  4]   7.00-8.00   sec  8.90 MBytes  74.7 Mbits/sec  46672
> [  4]   8.00-9.00   sec  8.91 MBytes  74.8 Mbits/sec  46732
> [  4]   9.00-10.00  sec  8.90 MBytes  74.7 Mbits/sec  46660
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Jitter
> Lost/Total Datagrams
> [  4]   0.00-10.00  sec  88.4 MBytes  74.1 Mbits/sec  0.032 ms
> 16561/463165 (3.6%)
> [  4] Sent 463165 datagrams
> 
> [root@imx6q ~]# iperf3 -s
> Server listening on 5201
> Accepted connection from 192.168.1.1, port 45012 [  5] local 192.168.1.2
> port 5201 connected to 192.168.1.1 port 50177
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-1.00   sec  7.36 MBytes  61.8 Mbits/sec  0.030 ms
> 1430/40038 (3.6%)
> [  5]   1.00-2.00   sec  8.60 MBytes  72.1 Mbits/sec  0.030 ms
> 1757/46850 (3.8%)
> [  5]   2.00-3.00   sec  8.61 MBytes  72.2 Mbits/sec  0.039 ms
> 1690/46809 (3.6%)
> [  5]   3.00-4.00   sec  8.61 MBytes  72.3 Mbits/sec  0.037 ms
> 1650/46812 (3.5%)
> [  5]   4.00-5.00   sec  8.59 MBytes  72.1 Mbits/sec  0.033 ms
> 1589/46651 (3.4%)
> [  5]   5.00-6.00   sec  8.60 MBytes  72.1 Mbits/sec  0.036 ms
> 1904/46967 (4.1%)
> [  5]   6.00-7.00   sec  8.61 MBytes  72.2 Mbits/sec  0.046 ms
> 1633/46766 (3.5%)
> [  5]   7.00-8.00   sec  8.60 MBytes  72.1 Mbits/sec  0.030 ms
> 1748/46818 (3.7%)
> [  5]   8.00-9.00   sec  8.61 MBytes  72.2 Mbits/sec  0.036 ms
> 1572/46707 (3.4%)
> [  5]   9.00-10.00  sec  8.61 MBytes  72.3 Mbits/sec  0.031 ms
> 1538/46703 (3.3%)
> [  5]  10.00-10.04  sec   389 KBytes  72.1 Mbits/sec  0.032 ms
> 50/2044 (2.4%)
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-10.04  sec  85.2 MBytes  71.1 Mbits/sec  0.032 ms
> 16561/463165 (3.6%)  receiver
> 
> [root@imx6q ~]# mpstat -P ALL 4 1
> Linux 5.4.8-rt11-yocto-standard  03/17/00 _armv7l_ (4 CPU)
> 23:15:50     CPU    %usr   %nice    %sys %iowait    %irq   %soft
> %steal  %guest   %idle
> 23:15:54     all    3.06    0.00   26.11    0.00    0.00   11.09
> 0.00    0.00   59.74
> 23:15:54       0    0.00    0.00    0.00    0.00    0.00   86.44
> 0.00    0.00   13.56
> 23:15:54       1   10.25    0.00   89.50    0.00    0.00    0.25
> 0.00    0.00    0.00
> 23:15:54       2    0.00    0.00    0.00    0.00    0.00    0.00
> 0.00    0.00  100.00
> 23:15:54       3    0.00    0.00    0.00    0.00    0.00    0.00
> 0.00    0.00  100.00
> => ksoftirqd/* @ 0% CPU; iperf3 @ 100%
> 
> [root@imx6q ~]# sar -I 60 1
> 23:56:10         INTR    intr/s
> 23:56:11           60  12339.00
> 23:56:12           60  11476.00
> => irq load high
> 
> 
> -------------------------------
> - STRESSED
> - stress -m 1 --vm-bytes 100M &
> -------------------------------
> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting to host
> 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 59042 connected to
> 192.168.1.2 port 5201
> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
> [  4]   0.00-1.00   sec  7.92 MBytes  66.2 Mbits/sec  41527
> [  4]   1.00-2.00   sec  8.95 MBytes  75.4 Mbits/sec  46931
> [  4]   2.00-3.00   sec  8.95 MBytes  75.1 Mbits/sec  46923
> [  4]   3.00-4.00   sec  8.96 MBytes  75.2 Mbits/sec  46983
> [  4]   4.00-5.00   sec  8.96 MBytes  75.2 Mbits/sec  46999
> [  4]   5.00-6.00   sec  8.91 MBytes  74.5 Mbits/sec  46688
> [  4]   6.00-7.00   sec  8.93 MBytes  75.1 Mbits/sec  46824
> [  4]   7.00-8.00   sec  8.94 MBytes  75.0 Mbits/sec  46872
> [  4]   8.00-9.00   sec  8.91 MBytes  74.7 Mbits/sec  46700
> [  4]   9.00-10.00  sec  8.91 MBytes  74.8 Mbits/sec  46720
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Jitter
> Lost/Total Datagrams
> [  4]   0.00-10.00  sec  88.3 MBytes  74.1 Mbits/sec  0.042 ms
> 70360/462955 (15%)
> [  4] Sent 462955 datagrams
> 
> 
> [root@imx6q ~]# iperf3 -s
> Server listening on 5201
> Accepted connection from 192.168.1.1, port 46306 [  5] local 192.168.1.2
> port 5201 connected to 192.168.1.1 port 59042
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-1.00   sec  6.54 MBytes  54.8 Mbits/sec  0.038 ms
> 5602/39871 (14%)
> [  5]   1.00-2.00   sec  7.51 MBytes  63.0 Mbits/sec  0.037 ms
> 6973/46362 (15%)
> [  5]   2.00-3.00   sec  7.54 MBytes  63.3 Mbits/sec  0.037 ms
> 7414/46966 (16%)
> [  5]   3.00-4.00   sec  7.56 MBytes  63.4 Mbits/sec  0.038 ms
> 7354/46984 (16%)
> [  5]   4.00-5.00   sec  7.60 MBytes  63.7 Mbits/sec  0.031 ms
> 7241/47069 (15%)
> [  5]   5.00-6.00   sec  7.58 MBytes  63.6 Mbits/sec  0.033 ms
> 7134/46865 (15%)
> [  5]   6.00-7.00   sec  7.56 MBytes  63.5 Mbits/sec  0.058 ms
> 6991/46649 (15%)
> [  5]   7.00-8.00   sec  7.57 MBytes  63.5 Mbits/sec  0.043 ms
> 7259/46933 (15%)
> [  5]   8.00-9.00   sec  7.56 MBytes  63.4 Mbits/sec  0.038 ms
> 7065/46721 (15%)
> [  5]   9.00-10.00  sec  7.55 MBytes  63.3 Mbits/sec  0.042 ms
> 7002/46588 (15%)
> [  5]  10.00-10.04  sec   317 KBytes  61.8 Mbits/sec  0.042 ms
> 325/1947 (17%)
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Jitter
> Lost/Total Datagrams
> [  5]   0.00-10.04  sec  74.9 MBytes  62.6 Mbits/sec  0.042 ms
> 70360/462955 (15%)  receiver
> 
> 
> [root@imx6q ~]# mpstat -P ALL 4 1
> Linux 5.4.8-rt11-yocto-standard 03/17/00 _armv7l_ (4 CPU)
> 23:49:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft
> %steal  %guest   %idle
> 23:50:03     all    3.21    0.00   47.27    0.00    0.00   25.02
> 0.00    0.00   24.51
> 23:50:03       0    0.00    0.00    0.50    0.00    0.00   99.50
> 0.00    0.00    0.00
> 23:50:03       1    0.00    0.00    0.26    0.00    0.00    0.00
> 0.00    0.00   99.74
> 23:50:03       2    9.50    0.00   90.50    0.00    0.00    0.00
> 0.00    0.00    0.00
> 23:50:03       3    3.01    0.00   96.99    0.00    0.00    0.00
> 0.00    0.00    0.00
> => ksoftirqd/0 @ 92%; stress @ 100%; iperf3 @ 100%
> 
> [root@imx6q ~]# sar -I 60 1
> 23:57:08           60    505.00
> 23:57:09           60    748.00
> 23:57:10           60    416.00
> => irq load low => napi active but not all the time
> 
> ==============================================================
> =
> 
> As a result the patched system stressed with memory load has a higher
> throughput than the unpatched system without memory stress.
> The system could nearly keep up with the desktop sending @ 75MBit/s
> (100MBit/s link) with i210 / igb.
> I think IRQ/s could reach a new peak because of the faster napi_poll /
> rx_queue execution times.
> iperf, iperf3 is mainly to test throughput. Is there any tool which checks for
> connection reliability (packet content, sequence, ...)?
> 
> On Thu, Jul 2, 2020 at 11:14 AM Andy Duan <fugang.duan@nxp.com> wrote:
> >
> > From: Kegl Rohit <keglrohit@gmail.com> Sent: Thursday, July 2, 2020
> > 3:39 PM
> > > On Thu, Jul 2, 2020 at 6:18 AM Andy Duan <fugang.duan@nxp.com>
> wrote:
> > > > That should ensure the whole area is not dirty.
> > >
> > > dma_sync_single_for_cpu() and dma_sync_single_for_device() can or
> > > must be used in pairs?
> > > So in this case it is really necessary to sync back the skb data
> > > buffer via dma_sync_single_for_device? Even when the CPU does not
> > > change any bytes in the skb data buffer / readonly like in this case.
> >
> > No, if the buffer is not modified, dma_sync_single_for_device() is not
> necessary.
> >
> > For some arm64 core, the dcache invalidate on A53 is flush +
> > invalidate, once the buffer is modified, it will cause read back wrong data
> without dma_sync_single_for_device().
> > And the driver also support Coldfire platforms, I am not family with the arch.
> >
> > From current analyze for arm/arm64,  I also think
> > dma_sync_single_for_device() is not necessary due to the buffer is not
> modified.
> >
> > Anyway, it still need to get other experts comment, and it need to do many
> test and stress test.
> >
> > > And there is no DMA_BIDIRECTIONAL mapping.
> > >
> > > I thought copybreak it is not about the next frame size. It is about
> > > the current frame. And the actual length is known via the size field
> > > in the finished DMA descriptor.
> > > Or do you mean that the next received frame could be no copybreak
> frame.
> > > 1. Rx copybreakable frame with sizeX < copybreak 2. copybreak
> > > dma_sync_single_for_cpu(dmabuffer, sizeX) 3. copybreak alloc
> > > new_skb, memcpy(new_skb, dmabuffer, sizeX) 4. copybreak
> > > dma_sync_single_for_device(dmabuffer, sizeX) 5. Rx non copybreakable
> > > frame with sizeY >= copybreak 4. dma_unmap_single(dmabuffer,
> > > FEC_ENET_RX_FRSIZE - fep->rx_align) is called and can cause data
> > > corruption because not all bytes were marked dirty even if nobody
> > > DMA & CPU touched them?
> >
> > No CPU touch, it should be clean.
> > >
> > > > > I am new to the DMA API on ARM. Are these changes regarding
> > > > > cache flushing,... allowed? These would increase the copybreak
> > > > > throughput by reducing CPU load.
> > > >
> > > > To avoid FIFO overrun, it requires to ensure PHY pause frame is enabled.
> > >
> > > As the errata states this is also not always true, because the first
> > > xoff could arrive too late. Pause frames/flow control is not really
> > > common and could cause troubles with other random network
> components
> > > acting different or not supporting pause frames correctly. For
> > > example the driver itself does enable pause frames for Gigabit by
> > > default. But we have no Gigabit Phy so no FEC_QUIRK_HAS_GBIT and
> > > therefore pause frames are not supported by the driver as of now.
> > >
> > >
> > > It looks like copybreak is implemented similar to e1000_main.c
> > > e1000_copybreak().
> > > There is only the real/needed packet length (length =
> > > le16_to_cpu(rx_desc->length)) is synced via dma_sync_single_for_cpu
> > > and no dma_sync_single_for_device.
> > >
> > > Here is a diff with the previous changes assuming that
> > > dma_sync_single_for_device must be used to avoid any cache flush
> > > backs even when no data was changed.
> >
> > Below change seems fine, can you collect some data before you send out
> > the patch for review.
> > - run iperf stress test to ensure the stability
> > - collect the performance improvement data
> >
> > Thanks.
> > >
> > > diff --git a/drivers/net/ethernet/freescale/fec_main.c
> > > b/drivers/net/ethernet/freescale/fec_main.c
> > > index 2d0d313ee..464783c15 100644
> > > --- a/drivers/net/ethernet/freescale/fec_main.c
> > > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > > @@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct
> > > net_device *ndev, struct sk_buff **skb,
> > >                 return false;
> > >
> > >         dma_sync_single_for_cpu(&fep->pdev->dev,
> > >
> fec32_to_cpu(bdp->cbd_bufaddr),
> > > -                               FEC_ENET_RX_FRSIZE -
> fep->rx_align,
> > > +                               length,
> > >                                 DMA_FROM_DEVICE);
> > >         if (!swap)
> > >                 memcpy(new_skb->data, (*skb)->data, length);
> > >         else
> > > @@ -1413,8 +1413,9 @@ fec_enet_rx_queue(struct net_device *ndev,
> int
> > > budget, u16 queue_id)
> > >         unsigned short status;
> > >         struct  sk_buff *skb_new = NULL;
> > >         struct  sk_buff *skb;
> > >         ushort  pkt_len;
> > > +       ushort  pkt_len_nofcs;
> > >         __u8 *data;
> > >         int     pkt_received = 0;
> > >         struct  bufdesc_ex *ebdp = NULL;
> > >         bool    vlan_packet_rcvd = false;
> > > @@ -1479,9 +1480,10 @@ fec_enet_rx_queue(struct net_device *ndev,
> > > int budget, u16 queue_id)
> > >                 /* The packet length includes FCS, but we don't want
> to
> > >                  * include that when passing upstream as it messes
> up
> > >                  * bridging applications.
> > >                  */
> > > -               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
> > > pkt_len - 4,
> > > +               pkt_len_nofcs = pkt_len - 4;
> > > +               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
> > > pkt_len_nofcs,
> > >
> need_swap);
> > >                 if (!is_copybreak) {
> > >                         skb_new = netdev_alloc_skb(ndev,
> > > FEC_ENET_RX_FRSIZE);
> > >                         if (unlikely(!skb_new)) { @@ -1554,9
> +1556,9
> > > @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16
> > > queue_id)
> > >
> > >                 if (is_copybreak) {
> > >
> dma_sync_single_for_device(&fep->pdev->dev,
> > >
> > > fec32_to_cpu(bdp->cbd_bufaddr),
> > > -
> > > FEC_ENET_RX_FRSIZE
> > > - fep->rx_align,
> > > +
> > > pkt_len_nofcs,
> > >
> > > DMA_FROM_DEVICE);
> > >                 } else {
> > >                         rxq->rx_skbuff[index] = skb_new;
> > >                         fec_enet_new_rxbdp(ndev, bdp, skb_new);

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [EXT] net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed?
  2020-07-07  3:44         ` Andy Duan
@ 2020-07-07 16:14           ` Robin Murphy
  2020-07-08  5:29             ` Andy Duan
  0 siblings, 1 reply; 8+ messages in thread
From: Robin Murphy @ 2020-07-07 16:14 UTC (permalink / raw)
  To: Andy Duan, Kegl Rohit, catalin.marinas, will,
	'Christoph Hellwig'
  Cc: netdev, m.szyprowski

On 2020-07-07 04:44, Andy Duan wrote:
> Hi mm experts,
> 
> From: Kegl Rohit <keglrohit@gmail.com> Sent: Monday, July 6, 2020 10:18 PM
>> So you would also say a single dma_sync_single_for_cpu is enough.
>> You are right it would be great if some mm expert could have a look at it. If
>> any corruption could happen.
> 
> Kegl want to remove dma_sync_single_for_device(), who thinks the cost is
> not necessary, and only dma_sync_single_for_cpu() is enough,  do you have
> any comments ?

Although you could get away with it on modern Arm CPUs if you can 
absolutely guarantee that no cache lines could ever possibly be dirtied, 
other architectures may not have the exact same cache behaviour. Without 
an exceptionally good justification for non-standard API usage and 
rock-solid reasoning that it cannot possibly go wrong anywhere, it's 
better for everyone if you just stick with normal balanced calls. I see 
mention of i.MX6 below, which presumably means you've got a PL310 system 
cache in the mix too, and that is not something I would like to reason 
about even if you paid me to ;)

Now, if you're recycling a mapped buffer by syncing it back and forth to 
read new data between multiple updates, then performing partial syncs on 
only the region you want to access each time is something the API does 
explicitly support, so the patch originally proposed in this thread to 
do that looks superficially reasonable to me.

Robin.

> 
> Thanks!
>>
>> diff --git a/drivers/net/ethernet/freescale/fec_main.c
>> b/drivers/net/ethernet/freescale/fec_main.c
>> index 2d0d313ee..acc04726f 100644
>> --- a/drivers/net/ethernet/freescale/fec_main.c
>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>> @@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct net_device
>> *ndev, struct sk_buff **skb,
>>                  return false;
>>
>>          dma_sync_single_for_cpu(&fep->pdev->dev,
>>                                  fec32_to_cpu(bdp->cbd_bufaddr),
>> -                               FEC_ENET_RX_FRSIZE - fep->rx_align,
>> +                               length,
>>                                  DMA_FROM_DEVICE);
>>          if (!swap)
>>                  memcpy(new_skb->data, (*skb)->data, length);
>>          else
>> @@ -1551,14 +1551,9 @@ fec_enet_rx_queue(struct net_device *ndev, int
>> budget, u16 queue_id)
>>                                                 vlan_tag);
>>
>>                  napi_gro_receive(&fep->napi, skb);
>>
>> -               if (is_copybreak) {
>> -                       dma_sync_single_for_device(&fep->pdev->dev,
>> -
>> fec32_to_cpu(bdp->cbd_bufaddr),
>> -
>> FEC_ENET_RX_FRSIZE
>> - fep->rx_align,
>> -
>> DMA_FROM_DEVICE);
>> -               } else {
>> +               if (!is_copybreak) {
>>                          rxq->rx_skbuff[index] = skb_new;
>>                          fec_enet_new_rxbdp(ndev, bdp, skb_new);
>>                  }
>>
>> Testing is pretty complex because of multiple different contexts.
>> To test the whole rx chain throughput i decided to use UDP with 200 bytes
>> payload.
>> Wireshark reports 242 bytes on wire, so copybreak is active (< 256 bytes).
>> I choose larger packets to reduce CPU load caused by the network stack /
>> iperf3.
>> Smaller packets will benefit even more, but for this test the network stack /
>> iperf3 could become the bottleneck not the driver itself (e.g.
>> to test call skb_free instead of napi_gro_receive).
>> iperf missbehaved sometimes and caused a lot more CPU load, so using
>> iperf3.
>> Kernelversion is Linux falcon 5.4.8-rt11-yocto-standard #59 SMP PREEMPT
>> Normaly our system has PREEMPT RT enabled. But I think to test throughput
>> PREEMPT is a better way to compare results.
>> NAPI and PREEMPT RT has even more contexts with the additional threaded
>> irq handler and therefore PREEMPT RT throughput is another story.
>> stress -m 1 --vm-bytes 100M was used to simulate additional memory load.
>> Only one core to keep ressources free for iperf3.
>>
>>
>> ###############################
>> # NOT PATCHED
>> ###############################
>> -------------------------------
>> - NOT STRESSED
>> -------------------------------
>> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting to host
>> 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 52032 connected to
>> 192.168.1.2 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
>> [  4]   0.00-1.00   sec  8.06 MBytes  67.6 Mbits/sec  42279
>> [  4]   1.00-2.00   sec  8.93 MBytes  74.9 Mbits/sec  46800
>> [  4]   2.00-3.00   sec  8.92 MBytes  74.8 Mbits/sec  46771
>> [  4]   3.00-4.00   sec  8.91 MBytes  74.7 Mbits/sec  46701
>> [  4]   4.00-5.00   sec  8.92 MBytes  74.9 Mbits/sec  46792
>> [  4]   5.00-6.00   sec  8.93 MBytes  74.9 Mbits/sec  46840
>> [  4]   6.00-7.00   sec  8.90 MBytes  74.6 Mbits/sec  46636
>> [  4]   7.00-8.00   sec  8.95 MBytes  75.1 Mbits/sec  46924
>> [  4]   8.00-9.00   sec  8.91 MBytes  74.7 Mbits/sec  46716
>> [  4]   9.00-10.00  sec  8.89 MBytes  74.5 Mbits/sec  46592
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Jitter
>> Lost/Total Datagrams
>> [  4]   0.00-10.00  sec  88.3 MBytes  74.1 Mbits/sec  0.056 ms
>> 123112/463051 (27%)
>> [  4] Sent 463051 datagrams
>>
>> [root@imx6q ~]# iperf3 -s
>> Server listening on 5201
>> Accepted connection from 192.168.1.1, port 53394 [  5] local 192.168.1.2
>> port 5201 connected to 192.168.1.1 port 43224
>> [ ID] Interval           Transfer     Bitrate         Jitter
>> Lost/Total Datagrams
>> [  5]   0.00-1.00   sec  5.56 MBytes  46.6 Mbits/sec  0.046 ms
>> 5520/34650 (16%)
>> [  5]   1.00-2.00   sec  6.40 MBytes  53.7 Mbits/sec  0.050 ms
>> 13035/46576 (28%)
>> [  5]   2.00-3.00   sec  6.40 MBytes  53.7 Mbits/sec  0.044 ms
>> 13295/46854 (28%)
>> [  5]   3.00-4.00   sec  6.26 MBytes  52.5 Mbits/sec  0.056 ms
>> 13849/46692 (30%)
>> [  5]   4.00-5.00   sec  6.31 MBytes  53.0 Mbits/sec  0.052 ms
>> 13735/46833 (29%)
>> [  5]   5.00-6.00   sec  6.32 MBytes  53.0 Mbits/sec  0.054 ms
>> 13610/46761 (29%)
>> [  5]   6.00-7.00   sec  6.32 MBytes  53.0 Mbits/sec  0.043 ms
>> 13780/46931 (29%)
>> [  5]   7.00-8.00   sec  6.33 MBytes  53.1 Mbits/sec  0.051 ms
>> 13679/46848 (29%)
>> [  5]   8.00-9.00   sec  6.33 MBytes  53.1 Mbits/sec  0.050 ms
>> 13746/46912 (29%)
>> [  5]   9.00-10.00  sec  6.33 MBytes  53.1 Mbits/sec  0.043 ms
>> 13437/46626 (29%)
>> [  5]  10.00-10.17  sec  1.04 MBytes  52.5 Mbits/sec  0.044 ms
>> 2233/7684 (29%)
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Jitter
>> Lost/Total Datagrams
>> [  5]   0.00-10.17  sec  63.6 MBytes  52.5 Mbits/sec  0.044 ms
>> 129919/463367 (28%)  receiver
>>
>> [root@imx6q ~]# mpstat -P ALL 4 1
>> Linux 5.4.8-rt11 03/17/00 _armv7l_ (4 CPU)
>> 23:38:51     CPU    %usr   %nice    %sys %iowait    %irq   %soft
>> %steal  %guest   %idle
>> 23:38:55     all    2.49    0.00   19.96    0.00    0.00   25.45
>> 0.00    0.00   52.10
>> 23:38:55       0    0.00    0.00    0.25    0.00    0.00   99.75
>> 0.00    0.00    0.00
>> 23:38:55       1    1.54    0.00   11.54    0.00    0.00    0.00
>> 0.00    0.00   86.92
>> 23:38:55       2    1.27    0.00    5.85    0.00    0.00    0.00
>> 0.00    0.00   92.88
>> 23:38:55       3    7.24    0.00   63.31    0.00    0.00    0.00
>> 0.00    0.00   29.46
>> => ksoftirqd/0 @ 100%; iperf @ 94%
>>
>> [root@imx6q ~]# sar -I 60 1
>> 00:13:32         INTR    intr/s
>> 00:13:33           60      0.00
>> 00:13:34           60      0.00
>> 00:13:35           60      0.00
>> 00:13:36           60      0.00
>> => 100% napi poll
>>
>>
>> -------------------------------
>> - STRESSED
>> - stress -m 1 --vm-bytes 100M &
>> -------------------------------
>> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting to host
>> 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 52129 connected to
>> 192.168.1.2 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
>> [  4]   0.00-1.00   sec  8.09 MBytes  67.8 Mbits/sec  42399
>> [  4]   1.00-2.00   sec  8.88 MBytes  74.5 Mbits/sec  46552
>> [  4]   2.00-3.00   sec  8.92 MBytes  74.8 Mbits/sec  46780
>> [  4]   3.00-4.00   sec  8.91 MBytes  74.7 Mbits/sec  46688
>> [  4]   4.00-5.00   sec  8.91 MBytes  74.7 Mbits/sec  46712
>> [  4]   5.00-6.00   sec  8.91 MBytes  74.8 Mbits/sec  46724
>> [  4]   6.00-7.00   sec  8.95 MBytes  75.0 Mbits/sec  46900
>> [  4]   7.00-8.00   sec  8.95 MBytes  75.1 Mbits/sec  46928
>> [  4]   8.00-9.00   sec  8.93 MBytes  74.9 Mbits/sec  46796
>> [  4]   9.00-10.00  sec  8.91 MBytes  74.8 Mbits/sec  46732
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Jitter
>> Lost/Total Datagrams
>> [  4]   0.00-10.00  sec  88.4 MBytes  74.1 Mbits/sec  0.067 ms
>> 233989/463134 (51%)
>> [  4] Sent 463134 datagrams
>>
>> [root@imx6q ~]# iperf3 -s
>> Server listening on 5201
>> Accepted connection from 192.168.1.1, port 57892 [  5] local 192.168.1.2
>> port 5201 connected to 192.168.1.1 port 43729
>> [ ID] Interval           Transfer     Bitrate         Jitter
>> Lost/Total Datagrams
>> [  5]   0.00-1.00   sec  3.59 MBytes  30.1 Mbits/sec  0.071 ms
>> 13027/31836 (41%)
>> [  5]   1.00-2.00   sec  4.32 MBytes  36.2 Mbits/sec  0.075 ms
>> 23690/46345 (51%)
>> [  5]   2.00-3.00   sec  4.28 MBytes  35.9 Mbits/sec  0.040 ms
>> 24879/47293 (53%)
>> [  5]   3.00-4.00   sec  4.05 MBytes  34.0 Mbits/sec  0.055 ms
>> 24430/45651 (54%)
>> [  5]   4.00-5.00   sec  4.01 MBytes  33.6 Mbits/sec  0.130 ms
>> 25200/46209 (55%)
>> [  5]   5.00-6.00   sec  4.22 MBytes  35.4 Mbits/sec  0.052 ms
>> 24777/46902 (53%)
>> [  5]   6.00-7.00   sec  4.04 MBytes  33.9 Mbits/sec  0.057 ms
>> 24452/45635 (54%)
>> [  5]   7.00-8.00   sec  4.26 MBytes  35.7 Mbits/sec  0.063 ms
>> 26530/48871 (54%)
>> [  5]   8.00-9.00   sec  4.06 MBytes  34.1 Mbits/sec  0.061 ms
>> 24293/45583 (53%)
>> [  5]   9.00-10.00  sec  4.15 MBytes  34.8 Mbits/sec  0.155 ms
>> 24752/46499 (53%)
>> [  5]  10.00-10.25  sec   985 KBytes  32.5 Mbits/sec  0.069 ms
>> 6246/11291 (55%)
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Jitter
>> Lost/Total Datagrams
>> [  5]   0.00-10.25  sec  41.9 MBytes  34.3 Mbits/sec  0.069 ms
>> 242276/462115 (52%)  receiver
>>
>> [root@imx6q ~]# mpstat -P ALL 4 1
>> Linux 5.4.8-rt11 03/17/00 _armv7l_ (4 CPU)
>> 23:43:12     CPU    %usr   %nice    %sys %iowait    %irq   %soft
>> %steal  %guest   %idle
>> 23:43:16     all    2.51    0.00   42.42    0.00    0.00   25.47
>> 0.00    0.00   29.59
>> 23:43:16       0    0.00    0.00    1.25    0.00    0.00   98.75
>> 0.00    0.00    0.00
>> 23:43:16       1    2.25    0.00   97.75    0.00    0.00    0.00
>> 0.00    0.00    0.00
>> 23:43:16       2    0.25    0.00    2.29    0.00    0.00    0.00
>> 0.00    0.00   97.46
>> 23:43:16       3    8.08    0.00   70.75    0.00    0.00    0.00
>> 0.00    0.00   21.17
>> => ksoftirqd/0 @ 100%; stress @ 100%; iperf3 @ 84%
>>
>> [root@imx6q ~]# sar -I 60 1
>> 00:14:41         INTR    intr/s
>> 00:14:42           60      0.00
>> 00:14:43           60      0.00
>> 00:14:44           60      0.00
>> => 100% napi poll
>>
>>
>> ###############################
>> # PATCHED
>> ###############################
>> -------------------------------
>> -NOT STRESSED
>> -------------------------------
>> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting to host
>> 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 50177 connected to
>> 192.168.1.2 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
>> [  4]   0.00-1.00   sec  8.08 MBytes  67.8 Mbits/sec  42373
>> [  4]   1.00-2.00   sec  8.91 MBytes  74.7 Mbits/sec  46693
>> [  4]   2.00-3.00   sec  8.93 MBytes  74.9 Mbits/sec  46803
>> [  4]   3.00-4.00   sec  8.93 MBytes  74.9 Mbits/sec  46804
>> [  4]   4.00-5.00   sec  8.93 MBytes  74.9 Mbits/sec  46804
>> [  4]   5.00-6.00   sec  8.92 MBytes  74.8 Mbits/sec  46764
>> [  4]   6.00-7.00   sec  8.95 MBytes  75.1 Mbits/sec  46944
>> [  4]   7.00-8.00   sec  8.90 MBytes  74.7 Mbits/sec  46672
>> [  4]   8.00-9.00   sec  8.91 MBytes  74.8 Mbits/sec  46732
>> [  4]   9.00-10.00  sec  8.90 MBytes  74.7 Mbits/sec  46660
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Jitter
>> Lost/Total Datagrams
>> [  4]   0.00-10.00  sec  88.4 MBytes  74.1 Mbits/sec  0.032 ms
>> 16561/463165 (3.6%)
>> [  4] Sent 463165 datagrams
>>
>> [root@imx6q ~]# iperf3 -s
>> Server listening on 5201
>> Accepted connection from 192.168.1.1, port 45012 [  5] local 192.168.1.2
>> port 5201 connected to 192.168.1.1 port 50177
>> [ ID] Interval           Transfer     Bitrate         Jitter
>> Lost/Total Datagrams
>> [  5]   0.00-1.00   sec  7.36 MBytes  61.8 Mbits/sec  0.030 ms
>> 1430/40038 (3.6%)
>> [  5]   1.00-2.00   sec  8.60 MBytes  72.1 Mbits/sec  0.030 ms
>> 1757/46850 (3.8%)
>> [  5]   2.00-3.00   sec  8.61 MBytes  72.2 Mbits/sec  0.039 ms
>> 1690/46809 (3.6%)
>> [  5]   3.00-4.00   sec  8.61 MBytes  72.3 Mbits/sec  0.037 ms
>> 1650/46812 (3.5%)
>> [  5]   4.00-5.00   sec  8.59 MBytes  72.1 Mbits/sec  0.033 ms
>> 1589/46651 (3.4%)
>> [  5]   5.00-6.00   sec  8.60 MBytes  72.1 Mbits/sec  0.036 ms
>> 1904/46967 (4.1%)
>> [  5]   6.00-7.00   sec  8.61 MBytes  72.2 Mbits/sec  0.046 ms
>> 1633/46766 (3.5%)
>> [  5]   7.00-8.00   sec  8.60 MBytes  72.1 Mbits/sec  0.030 ms
>> 1748/46818 (3.7%)
>> [  5]   8.00-9.00   sec  8.61 MBytes  72.2 Mbits/sec  0.036 ms
>> 1572/46707 (3.4%)
>> [  5]   9.00-10.00  sec  8.61 MBytes  72.3 Mbits/sec  0.031 ms
>> 1538/46703 (3.3%)
>> [  5]  10.00-10.04  sec   389 KBytes  72.1 Mbits/sec  0.032 ms
>> 50/2044 (2.4%)
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Jitter
>> Lost/Total Datagrams
>> [  5]   0.00-10.04  sec  85.2 MBytes  71.1 Mbits/sec  0.032 ms
>> 16561/463165 (3.6%)  receiver
>>
>> [root@imx6q ~]# mpstat -P ALL 4 1
>> Linux 5.4.8-rt11-yocto-standard  03/17/00 _armv7l_ (4 CPU)
>> 23:15:50     CPU    %usr   %nice    %sys %iowait    %irq   %soft
>> %steal  %guest   %idle
>> 23:15:54     all    3.06    0.00   26.11    0.00    0.00   11.09
>> 0.00    0.00   59.74
>> 23:15:54       0    0.00    0.00    0.00    0.00    0.00   86.44
>> 0.00    0.00   13.56
>> 23:15:54       1   10.25    0.00   89.50    0.00    0.00    0.25
>> 0.00    0.00    0.00
>> 23:15:54       2    0.00    0.00    0.00    0.00    0.00    0.00
>> 0.00    0.00  100.00
>> 23:15:54       3    0.00    0.00    0.00    0.00    0.00    0.00
>> 0.00    0.00  100.00
>> => ksoftirqd/* @ 0% CPU; iperf3 @ 100%
>>
>> [root@imx6q ~]# sar -I 60 1
>> 23:56:10         INTR    intr/s
>> 23:56:11           60  12339.00
>> 23:56:12           60  11476.00
>> => irq load high
>>
>>
>> -------------------------------
>> - STRESSED
>> - stress -m 1 --vm-bytes 100M &
>> -------------------------------
>> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting to host
>> 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 59042 connected to
>> 192.168.1.2 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
>> [  4]   0.00-1.00   sec  7.92 MBytes  66.2 Mbits/sec  41527
>> [  4]   1.00-2.00   sec  8.95 MBytes  75.4 Mbits/sec  46931
>> [  4]   2.00-3.00   sec  8.95 MBytes  75.1 Mbits/sec  46923
>> [  4]   3.00-4.00   sec  8.96 MBytes  75.2 Mbits/sec  46983
>> [  4]   4.00-5.00   sec  8.96 MBytes  75.2 Mbits/sec  46999
>> [  4]   5.00-6.00   sec  8.91 MBytes  74.5 Mbits/sec  46688
>> [  4]   6.00-7.00   sec  8.93 MBytes  75.1 Mbits/sec  46824
>> [  4]   7.00-8.00   sec  8.94 MBytes  75.0 Mbits/sec  46872
>> [  4]   8.00-9.00   sec  8.91 MBytes  74.7 Mbits/sec  46700
>> [  4]   9.00-10.00  sec  8.91 MBytes  74.8 Mbits/sec  46720
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Jitter
>> Lost/Total Datagrams
>> [  4]   0.00-10.00  sec  88.3 MBytes  74.1 Mbits/sec  0.042 ms
>> 70360/462955 (15%)
>> [  4] Sent 462955 datagrams
>>
>>
>> [root@imx6q ~]# iperf3 -s
>> Server listening on 5201
>> Accepted connection from 192.168.1.1, port 46306 [  5] local 192.168.1.2
>> port 5201 connected to 192.168.1.1 port 59042
>> [ ID] Interval           Transfer     Bitrate         Jitter
>> Lost/Total Datagrams
>> [  5]   0.00-1.00   sec  6.54 MBytes  54.8 Mbits/sec  0.038 ms
>> 5602/39871 (14%)
>> [  5]   1.00-2.00   sec  7.51 MBytes  63.0 Mbits/sec  0.037 ms
>> 6973/46362 (15%)
>> [  5]   2.00-3.00   sec  7.54 MBytes  63.3 Mbits/sec  0.037 ms
>> 7414/46966 (16%)
>> [  5]   3.00-4.00   sec  7.56 MBytes  63.4 Mbits/sec  0.038 ms
>> 7354/46984 (16%)
>> [  5]   4.00-5.00   sec  7.60 MBytes  63.7 Mbits/sec  0.031 ms
>> 7241/47069 (15%)
>> [  5]   5.00-6.00   sec  7.58 MBytes  63.6 Mbits/sec  0.033 ms
>> 7134/46865 (15%)
>> [  5]   6.00-7.00   sec  7.56 MBytes  63.5 Mbits/sec  0.058 ms
>> 6991/46649 (15%)
>> [  5]   7.00-8.00   sec  7.57 MBytes  63.5 Mbits/sec  0.043 ms
>> 7259/46933 (15%)
>> [  5]   8.00-9.00   sec  7.56 MBytes  63.4 Mbits/sec  0.038 ms
>> 7065/46721 (15%)
>> [  5]   9.00-10.00  sec  7.55 MBytes  63.3 Mbits/sec  0.042 ms
>> 7002/46588 (15%)
>> [  5]  10.00-10.04  sec   317 KBytes  61.8 Mbits/sec  0.042 ms
>> 325/1947 (17%)
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate         Jitter
>> Lost/Total Datagrams
>> [  5]   0.00-10.04  sec  74.9 MBytes  62.6 Mbits/sec  0.042 ms
>> 70360/462955 (15%)  receiver
>>
>>
>> [root@imx6q ~]# mpstat -P ALL 4 1
>> Linux 5.4.8-rt11-yocto-standard 03/17/00 _armv7l_ (4 CPU)
>> 23:49:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft
>> %steal  %guest   %idle
>> 23:50:03     all    3.21    0.00   47.27    0.00    0.00   25.02
>> 0.00    0.00   24.51
>> 23:50:03       0    0.00    0.00    0.50    0.00    0.00   99.50
>> 0.00    0.00    0.00
>> 23:50:03       1    0.00    0.00    0.26    0.00    0.00    0.00
>> 0.00    0.00   99.74
>> 23:50:03       2    9.50    0.00   90.50    0.00    0.00    0.00
>> 0.00    0.00    0.00
>> 23:50:03       3    3.01    0.00   96.99    0.00    0.00    0.00
>> 0.00    0.00    0.00
>> => ksoftirqd/0 @ 92%; stress @ 100%; iperf3 @ 100%
>>
>> [root@imx6q ~]# sar -I 60 1
>> 23:57:08           60    505.00
>> 23:57:09           60    748.00
>> 23:57:10           60    416.00
>> => irq load low => napi active but not all the time
>>
>> ==============================================================
>> =
>>
>> As a result the patched system stressed with memory load has a higher
>> throughput than the unpatched system without memory stress.
>> The system could nearly keep up with the desktop sending @ 75MBit/s
>> (100MBit/s link) with i210 / igb.
>> I think IRQ/s could reach a new peak because of the faster napi_poll /
>> rx_queue execution times.
>> iperf, iperf3 is mainly to test throughput. Is there any tool which checks for
>> connection reliability (packet content, sequence, ...)?
>>
>> On Thu, Jul 2, 2020 at 11:14 AM Andy Duan <fugang.duan@nxp.com> wrote:
>>>
>>> From: Kegl Rohit <keglrohit@gmail.com> Sent: Thursday, July 2, 2020
>>> 3:39 PM
>>>> On Thu, Jul 2, 2020 at 6:18 AM Andy Duan <fugang.duan@nxp.com>
>> wrote:
>>>>> That should ensure the whole area is not dirty.
>>>>
>>>> dma_sync_single_for_cpu() and dma_sync_single_for_device() can or
>>>> must be used in pairs?
>>>> So in this case it is really necessary to sync back the skb data
>>>> buffer via dma_sync_single_for_device? Even when the CPU does not
>>>> change any bytes in the skb data buffer / readonly like in this case.
>>>
>>> No, if the buffer is not modified, dma_sync_single_for_device() is not
>> necessary.
>>>
>>> For some arm64 core, the dcache invalidate on A53 is flush +
>>> invalidate, once the buffer is modified, it will cause read back wrong data
>> without dma_sync_single_for_device().
>>> And the driver also support Coldfire platforms, I am not family with the arch.
>>>
>>>  From current analyze for arm/arm64,  I also think
>>> dma_sync_single_for_device() is not necessary due to the buffer is not
>> modified.
>>>
>>> Anyway, it still need to get other experts comment, and it need to do many
>> test and stress test.
>>>
>>>> And there is no DMA_BIDIRECTIONAL mapping.
>>>>
>>>> I thought copybreak it is not about the next frame size. It is about
>>>> the current frame. And the actual length is known via the size field
>>>> in the finished DMA descriptor.
>>>> Or do you mean that the next received frame could be no copybreak
>> frame.
>>>> 1. Rx copybreakable frame with sizeX < copybreak 2. copybreak
>>>> dma_sync_single_for_cpu(dmabuffer, sizeX) 3. copybreak alloc
>>>> new_skb, memcpy(new_skb, dmabuffer, sizeX) 4. copybreak
>>>> dma_sync_single_for_device(dmabuffer, sizeX) 5. Rx non copybreakable
>>>> frame with sizeY >= copybreak 4. dma_unmap_single(dmabuffer,
>>>> FEC_ENET_RX_FRSIZE - fep->rx_align) is called and can cause data
>>>> corruption because not all bytes were marked dirty even if nobody
>>>> DMA & CPU touched them?
>>>
>>> No CPU touch, it should be clean.
>>>>
>>>>>> I am new to the DMA API on ARM. Are these changes regarding
>>>>>> cache flushing,... allowed? These would increase the copybreak
>>>>>> throughput by reducing CPU load.
>>>>>
>>>>> To avoid FIFO overrun, it requires to ensure PHY pause frame is enabled.
>>>>
>>>> As the errata states this is also not always true, because the first
>>>> xoff could arrive too late. Pause frames/flow control is not really
>>>> common and could cause troubles with other random network
>> components
>>>> acting different or not supporting pause frames correctly. For
>>>> example the driver itself does enable pause frames for Gigabit by
>>>> default. But we have no Gigabit Phy so no FEC_QUIRK_HAS_GBIT and
>>>> therefore pause frames are not supported by the driver as of now.
>>>>
>>>>
>>>> It looks like copybreak is implemented similar to e1000_main.c
>>>> e1000_copybreak().
>>>> There is only the real/needed packet length (length =
>>>> le16_to_cpu(rx_desc->length)) is synced via dma_sync_single_for_cpu
>>>> and no dma_sync_single_for_device.
>>>>
>>>> Here is a diff with the previous changes assuming that
>>>> dma_sync_single_for_device must be used to avoid any cache flush
>>>> backs even when no data was changed.
>>>
>>> Below change seems fine, can you collect some data before you send out
>>> the patch for review.
>>> - run iperf stress test to ensure the stability
>>> - collect the performance improvement data
>>>
>>> Thanks.
>>>>
>>>> diff --git a/drivers/net/ethernet/freescale/fec_main.c
>>>> b/drivers/net/ethernet/freescale/fec_main.c
>>>> index 2d0d313ee..464783c15 100644
>>>> --- a/drivers/net/ethernet/freescale/fec_main.c
>>>> +++ b/drivers/net/ethernet/freescale/fec_main.c
>>>> @@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct
>>>> net_device *ndev, struct sk_buff **skb,
>>>>                  return false;
>>>>
>>>>          dma_sync_single_for_cpu(&fep->pdev->dev,
>>>>
>> fec32_to_cpu(bdp->cbd_bufaddr),
>>>> -                               FEC_ENET_RX_FRSIZE -
>> fep->rx_align,
>>>> +                               length,
>>>>                                  DMA_FROM_DEVICE);
>>>>          if (!swap)
>>>>                  memcpy(new_skb->data, (*skb)->data, length);
>>>>          else
>>>> @@ -1413,8 +1413,9 @@ fec_enet_rx_queue(struct net_device *ndev,
>> int
>>>> budget, u16 queue_id)
>>>>          unsigned short status;
>>>>          struct  sk_buff *skb_new = NULL;
>>>>          struct  sk_buff *skb;
>>>>          ushort  pkt_len;
>>>> +       ushort  pkt_len_nofcs;
>>>>          __u8 *data;
>>>>          int     pkt_received = 0;
>>>>          struct  bufdesc_ex *ebdp = NULL;
>>>>          bool    vlan_packet_rcvd = false;
>>>> @@ -1479,9 +1480,10 @@ fec_enet_rx_queue(struct net_device *ndev,
>>>> int budget, u16 queue_id)
>>>>                  /* The packet length includes FCS, but we don't want
>> to
>>>>                   * include that when passing upstream as it messes
>> up
>>>>                   * bridging applications.
>>>>                   */
>>>> -               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
>>>> pkt_len - 4,
>>>> +               pkt_len_nofcs = pkt_len - 4;
>>>> +               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
>>>> pkt_len_nofcs,
>>>>
>> need_swap);
>>>>                  if (!is_copybreak) {
>>>>                          skb_new = netdev_alloc_skb(ndev,
>>>> FEC_ENET_RX_FRSIZE);
>>>>                          if (unlikely(!skb_new)) { @@ -1554,9
>> +1556,9
>>>> @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16
>>>> queue_id)
>>>>
>>>>                  if (is_copybreak) {
>>>>
>> dma_sync_single_for_device(&fep->pdev->dev,
>>>>
>>>> fec32_to_cpu(bdp->cbd_bufaddr),
>>>> -
>>>> FEC_ENET_RX_FRSIZE
>>>> - fep->rx_align,
>>>> +
>>>> pkt_len_nofcs,
>>>>
>>>> DMA_FROM_DEVICE);
>>>>                  } else {
>>>>                          rxq->rx_skbuff[index] = skb_new;
>>>>                          fec_enet_new_rxbdp(ndev, bdp, skb_new);

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [EXT] net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed?
  2020-07-07 16:14           ` Robin Murphy
@ 2020-07-08  5:29             ` Andy Duan
  0 siblings, 0 replies; 8+ messages in thread
From: Andy Duan @ 2020-07-08  5:29 UTC (permalink / raw)
  To: Robin Murphy, Kegl Rohit, catalin.marinas, will,
	'Christoph Hellwig'
  Cc: netdev, m.szyprowski

From: Robin Murphy <robin.murphy@arm.com> On 2020-07-07 04:44, Andy Duan wrote:
> > Hi mm experts,
> >
> > From: Kegl Rohit <keglrohit@gmail.com> Sent: Monday, July 6, 2020
> > 10:18 PM
> >> So you would also say a single dma_sync_single_for_cpu is enough.
> >> You are right it would be great if some mm expert could have a look
> >> at it. If any corruption could happen.
> >
> > Kegl want to remove dma_sync_single_for_device(), who thinks the cost
> > is not necessary, and only dma_sync_single_for_cpu() is enough,  do
> > you have any comments ?
> 
> Although you could get away with it on modern Arm CPUs if you can
> absolutely guarantee that no cache lines could ever possibly be dirtied, other
> architectures may not have the exact same cache behaviour. Without an
> exceptionally good justification for non-standard API usage and rock-solid
> reasoning that it cannot possibly go wrong anywhere, it's better for everyone
> if you just stick with normal balanced calls. I see mention of i.MX6 below,
> which presumably means you've got a PL310 system cache in the mix too, and
> that is not something I would like to reason about even if you paid me to ;)
> 
> Now, if you're recycling a mapped buffer by syncing it back and forth to read
> new data between multiple updates, then performing partial syncs on only the
> region you want to access each time is something the API does explicitly
> support, so the patch originally proposed in this thread to do that looks
> superficially reasonable to me.
> 
> Robin.

Thanks for your valuable comments.
So it is better to keep normal balanced dma_sync_* calls.
> 
> >
> > Thanks!
> >>
> >> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> >> b/drivers/net/ethernet/freescale/fec_main.c
> >> index 2d0d313ee..acc04726f 100644
> >> --- a/drivers/net/ethernet/freescale/fec_main.c
> >> +++ b/drivers/net/ethernet/freescale/fec_main.c
> >> @@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct
> >> net_device *ndev, struct sk_buff **skb,
> >>                  return false;
> >>
> >>          dma_sync_single_for_cpu(&fep->pdev->dev,
> >>
> fec32_to_cpu(bdp->cbd_bufaddr),
> >> -                               FEC_ENET_RX_FRSIZE -
> fep->rx_align,
> >> +                               length,
> >>                                  DMA_FROM_DEVICE);
> >>          if (!swap)
> >>                  memcpy(new_skb->data, (*skb)->data, length);
> >>          else
> >> @@ -1551,14 +1551,9 @@ fec_enet_rx_queue(struct net_device *ndev,
> int
> >> budget, u16 queue_id)
> >>                                                 vlan_tag);
> >>
> >>                  napi_gro_receive(&fep->napi, skb);
> >>
> >> -               if (is_copybreak) {
> >> -
> dma_sync_single_for_device(&fep->pdev->dev,
> >> -
> >> fec32_to_cpu(bdp->cbd_bufaddr),
> >> -
> >> FEC_ENET_RX_FRSIZE
> >> - fep->rx_align,
> >> -
> >> DMA_FROM_DEVICE);
> >> -               } else {
> >> +               if (!is_copybreak) {
> >>                          rxq->rx_skbuff[index] = skb_new;
> >>                          fec_enet_new_rxbdp(ndev, bdp,
> skb_new);
> >>                  }
> >>
> >> Testing is pretty complex because of multiple different contexts.
> >> To test the whole rx chain throughput i decided to use UDP with 200
> >> bytes payload.
> >> Wireshark reports 242 bytes on wire, so copybreak is active (< 256 bytes).
> >> I choose larger packets to reduce CPU load caused by the network
> >> stack / iperf3.
> >> Smaller packets will benefit even more, but for this test the network
> >> stack /
> >> iperf3 could become the bottleneck not the driver itself (e.g.
> >> to test call skb_free instead of napi_gro_receive).
> >> iperf missbehaved sometimes and caused a lot more CPU load, so using
> >> iperf3.
> >> Kernelversion is Linux falcon 5.4.8-rt11-yocto-standard #59 SMP
> >> PREEMPT Normaly our system has PREEMPT RT enabled. But I think to
> >> test throughput PREEMPT is a better way to compare results.
> >> NAPI and PREEMPT RT has even more contexts with the additional
> >> threaded irq handler and therefore PREEMPT RT throughput is another
> story.
> >> stress -m 1 --vm-bytes 100M was used to simulate additional memory
> load.
> >> Only one core to keep ressources free for iperf3.
> >>
> >>
> >> ###############################
> >> # NOT PATCHED
> >> ###############################
> >> -------------------------------
> >> - NOT STRESSED
> >> -------------------------------
> >> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting
> >> to host 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 52032
> >> connected to
> >> 192.168.1.2 port 5201
> >> [ ID] Interval           Transfer     Bandwidth       Total
> Datagrams
> >> [  4]   0.00-1.00   sec  8.06 MBytes  67.6 Mbits/sec  42279
> >> [  4]   1.00-2.00   sec  8.93 MBytes  74.9 Mbits/sec  46800
> >> [  4]   2.00-3.00   sec  8.92 MBytes  74.8 Mbits/sec  46771
> >> [  4]   3.00-4.00   sec  8.91 MBytes  74.7 Mbits/sec  46701
> >> [  4]   4.00-5.00   sec  8.92 MBytes  74.9 Mbits/sec  46792
> >> [  4]   5.00-6.00   sec  8.93 MBytes  74.9 Mbits/sec  46840
> >> [  4]   6.00-7.00   sec  8.90 MBytes  74.6 Mbits/sec  46636
> >> [  4]   7.00-8.00   sec  8.95 MBytes  75.1 Mbits/sec  46924
> >> [  4]   8.00-9.00   sec  8.91 MBytes  74.7 Mbits/sec  46716
> >> [  4]   9.00-10.00  sec  8.89 MBytes  74.5 Mbits/sec  46592
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval           Transfer     Bandwidth       Jitter
> >> Lost/Total Datagrams
> >> [  4]   0.00-10.00  sec  88.3 MBytes  74.1 Mbits/sec  0.056 ms
> >> 123112/463051 (27%)
> >> [  4] Sent 463051 datagrams
> >>
> >> [root@imx6q ~]# iperf3 -s
> >> Server listening on 5201
> >> Accepted connection from 192.168.1.1, port 53394 [  5] local
> >> 192.168.1.2 port 5201 connected to 192.168.1.1 port 43224
> >> [ ID] Interval           Transfer     Bitrate         Jitter
> >> Lost/Total Datagrams
> >> [  5]   0.00-1.00   sec  5.56 MBytes  46.6 Mbits/sec  0.046 ms
> >> 5520/34650 (16%)
> >> [  5]   1.00-2.00   sec  6.40 MBytes  53.7 Mbits/sec  0.050 ms
> >> 13035/46576 (28%)
> >> [  5]   2.00-3.00   sec  6.40 MBytes  53.7 Mbits/sec  0.044 ms
> >> 13295/46854 (28%)
> >> [  5]   3.00-4.00   sec  6.26 MBytes  52.5 Mbits/sec  0.056 ms
> >> 13849/46692 (30%)
> >> [  5]   4.00-5.00   sec  6.31 MBytes  53.0 Mbits/sec  0.052 ms
> >> 13735/46833 (29%)
> >> [  5]   5.00-6.00   sec  6.32 MBytes  53.0 Mbits/sec  0.054 ms
> >> 13610/46761 (29%)
> >> [  5]   6.00-7.00   sec  6.32 MBytes  53.0 Mbits/sec  0.043 ms
> >> 13780/46931 (29%)
> >> [  5]   7.00-8.00   sec  6.33 MBytes  53.1 Mbits/sec  0.051 ms
> >> 13679/46848 (29%)
> >> [  5]   8.00-9.00   sec  6.33 MBytes  53.1 Mbits/sec  0.050 ms
> >> 13746/46912 (29%)
> >> [  5]   9.00-10.00  sec  6.33 MBytes  53.1 Mbits/sec  0.043 ms
> >> 13437/46626 (29%)
> >> [  5]  10.00-10.17  sec  1.04 MBytes  52.5 Mbits/sec  0.044 ms
> >> 2233/7684 (29%)
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval           Transfer     Bitrate         Jitter
> >> Lost/Total Datagrams
> >> [  5]   0.00-10.17  sec  63.6 MBytes  52.5 Mbits/sec  0.044 ms
> >> 129919/463367 (28%)  receiver
> >>
> >> [root@imx6q ~]# mpstat -P ALL 4 1
> >> Linux 5.4.8-rt11 03/17/00 _armv7l_ (4 CPU)
> >> 23:38:51
> CPU    %usr   %nice    %sys %iowait    %irq   %soft
> >> %steal  %guest   %idle
> >> 23:38:55     all    2.49    0.00   19.96    0.00    0.00   25.45
> >> 0.00    0.00   52.10
> >> 23:38:55       0    0.00    0.00    0.25    0.00    0.00
> 99.75
> >> 0.00    0.00    0.00
> >> 23:38:55       1    1.54    0.00   11.54    0.00    0.00
> 0.00
> >> 0.00    0.00   86.92
> >> 23:38:55       2    1.27    0.00    5.85    0.00    0.00
> 0.00
> >> 0.00    0.00   92.88
> >> 23:38:55       3    7.24    0.00   63.31    0.00    0.00
> 0.00
> >> 0.00    0.00   29.46
> >> => ksoftirqd/0 @ 100%; iperf @ 94%
> >>
> >> [root@imx6q ~]# sar -I 60 1
> >> 00:13:32         INTR    intr/s
> >> 00:13:33           60      0.00
> >> 00:13:34           60      0.00
> >> 00:13:35           60      0.00
> >> 00:13:36           60      0.00
> >> => 100% napi poll
> >>
> >>
> >> -------------------------------
> >> - STRESSED
> >> - stress -m 1 --vm-bytes 100M &
> >> -------------------------------
> >> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting
> >> to host 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 52129
> >> connected to
> >> 192.168.1.2 port 5201
> >> [ ID] Interval           Transfer     Bandwidth       Total
> Datagrams
> >> [  4]   0.00-1.00   sec  8.09 MBytes  67.8 Mbits/sec  42399
> >> [  4]   1.00-2.00   sec  8.88 MBytes  74.5 Mbits/sec  46552
> >> [  4]   2.00-3.00   sec  8.92 MBytes  74.8 Mbits/sec  46780
> >> [  4]   3.00-4.00   sec  8.91 MBytes  74.7 Mbits/sec  46688
> >> [  4]   4.00-5.00   sec  8.91 MBytes  74.7 Mbits/sec  46712
> >> [  4]   5.00-6.00   sec  8.91 MBytes  74.8 Mbits/sec  46724
> >> [  4]   6.00-7.00   sec  8.95 MBytes  75.0 Mbits/sec  46900
> >> [  4]   7.00-8.00   sec  8.95 MBytes  75.1 Mbits/sec  46928
> >> [  4]   8.00-9.00   sec  8.93 MBytes  74.9 Mbits/sec  46796
> >> [  4]   9.00-10.00  sec  8.91 MBytes  74.8 Mbits/sec  46732
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval           Transfer     Bandwidth       Jitter
> >> Lost/Total Datagrams
> >> [  4]   0.00-10.00  sec  88.4 MBytes  74.1 Mbits/sec  0.067 ms
> >> 233989/463134 (51%)
> >> [  4] Sent 463134 datagrams
> >>
> >> [root@imx6q ~]# iperf3 -s
> >> Server listening on 5201
> >> Accepted connection from 192.168.1.1, port 57892 [  5] local
> >> 192.168.1.2 port 5201 connected to 192.168.1.1 port 43729
> >> [ ID] Interval           Transfer     Bitrate         Jitter
> >> Lost/Total Datagrams
> >> [  5]   0.00-1.00   sec  3.59 MBytes  30.1 Mbits/sec  0.071 ms
> >> 13027/31836 (41%)
> >> [  5]   1.00-2.00   sec  4.32 MBytes  36.2 Mbits/sec  0.075 ms
> >> 23690/46345 (51%)
> >> [  5]   2.00-3.00   sec  4.28 MBytes  35.9 Mbits/sec  0.040 ms
> >> 24879/47293 (53%)
> >> [  5]   3.00-4.00   sec  4.05 MBytes  34.0 Mbits/sec  0.055 ms
> >> 24430/45651 (54%)
> >> [  5]   4.00-5.00   sec  4.01 MBytes  33.6 Mbits/sec  0.130 ms
> >> 25200/46209 (55%)
> >> [  5]   5.00-6.00   sec  4.22 MBytes  35.4 Mbits/sec  0.052 ms
> >> 24777/46902 (53%)
> >> [  5]   6.00-7.00   sec  4.04 MBytes  33.9 Mbits/sec  0.057 ms
> >> 24452/45635 (54%)
> >> [  5]   7.00-8.00   sec  4.26 MBytes  35.7 Mbits/sec  0.063 ms
> >> 26530/48871 (54%)
> >> [  5]   8.00-9.00   sec  4.06 MBytes  34.1 Mbits/sec  0.061 ms
> >> 24293/45583 (53%)
> >> [  5]   9.00-10.00  sec  4.15 MBytes  34.8 Mbits/sec  0.155 ms
> >> 24752/46499 (53%)
> >> [  5]  10.00-10.25  sec   985 KBytes  32.5 Mbits/sec  0.069 ms
> >> 6246/11291 (55%)
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval           Transfer     Bitrate         Jitter
> >> Lost/Total Datagrams
> >> [  5]   0.00-10.25  sec  41.9 MBytes  34.3 Mbits/sec  0.069 ms
> >> 242276/462115 (52%)  receiver
> >>
> >> [root@imx6q ~]# mpstat -P ALL 4 1
> >> Linux 5.4.8-rt11 03/17/00 _armv7l_ (4 CPU)
> >> 23:43:12
> CPU    %usr   %nice    %sys %iowait    %irq   %soft
> >> %steal  %guest   %idle
> >> 23:43:16     all    2.51    0.00   42.42    0.00    0.00   25.47
> >> 0.00    0.00   29.59
> >> 23:43:16       0    0.00    0.00    1.25    0.00    0.00
> 98.75
> >> 0.00    0.00    0.00
> >> 23:43:16       1    2.25    0.00   97.75    0.00    0.00
> 0.00
> >> 0.00    0.00    0.00
> >> 23:43:16       2    0.25    0.00    2.29    0.00    0.00
> 0.00
> >> 0.00    0.00   97.46
> >> 23:43:16       3    8.08    0.00   70.75    0.00    0.00
> 0.00
> >> 0.00    0.00   21.17
> >> => ksoftirqd/0 @ 100%; stress @ 100%; iperf3 @ 84%
> >>
> >> [root@imx6q ~]# sar -I 60 1
> >> 00:14:41         INTR    intr/s
> >> 00:14:42           60      0.00
> >> 00:14:43           60      0.00
> >> 00:14:44           60      0.00
> >> => 100% napi poll
> >>
> >>
> >> ###############################
> >> # PATCHED
> >> ###############################
> >> -------------------------------
> >> -NOT STRESSED
> >> -------------------------------
> >> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting
> >> to host 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 50177
> >> connected to
> >> 192.168.1.2 port 5201
> >> [ ID] Interval           Transfer     Bandwidth       Total
> Datagrams
> >> [  4]   0.00-1.00   sec  8.08 MBytes  67.8 Mbits/sec  42373
> >> [  4]   1.00-2.00   sec  8.91 MBytes  74.7 Mbits/sec  46693
> >> [  4]   2.00-3.00   sec  8.93 MBytes  74.9 Mbits/sec  46803
> >> [  4]   3.00-4.00   sec  8.93 MBytes  74.9 Mbits/sec  46804
> >> [  4]   4.00-5.00   sec  8.93 MBytes  74.9 Mbits/sec  46804
> >> [  4]   5.00-6.00   sec  8.92 MBytes  74.8 Mbits/sec  46764
> >> [  4]   6.00-7.00   sec  8.95 MBytes  75.1 Mbits/sec  46944
> >> [  4]   7.00-8.00   sec  8.90 MBytes  74.7 Mbits/sec  46672
> >> [  4]   8.00-9.00   sec  8.91 MBytes  74.8 Mbits/sec  46732
> >> [  4]   9.00-10.00  sec  8.90 MBytes  74.7 Mbits/sec  46660
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval           Transfer     Bandwidth       Jitter
> >> Lost/Total Datagrams
> >> [  4]   0.00-10.00  sec  88.4 MBytes  74.1 Mbits/sec  0.032 ms
> >> 16561/463165 (3.6%)
> >> [  4] Sent 463165 datagrams
> >>
> >> [root@imx6q ~]# iperf3 -s
> >> Server listening on 5201
> >> Accepted connection from 192.168.1.1, port 45012 [  5] local
> >> 192.168.1.2 port 5201 connected to 192.168.1.1 port 50177
> >> [ ID] Interval           Transfer     Bitrate         Jitter
> >> Lost/Total Datagrams
> >> [  5]   0.00-1.00   sec  7.36 MBytes  61.8 Mbits/sec  0.030 ms
> >> 1430/40038 (3.6%)
> >> [  5]   1.00-2.00   sec  8.60 MBytes  72.1 Mbits/sec  0.030 ms
> >> 1757/46850 (3.8%)
> >> [  5]   2.00-3.00   sec  8.61 MBytes  72.2 Mbits/sec  0.039 ms
> >> 1690/46809 (3.6%)
> >> [  5]   3.00-4.00   sec  8.61 MBytes  72.3 Mbits/sec  0.037 ms
> >> 1650/46812 (3.5%)
> >> [  5]   4.00-5.00   sec  8.59 MBytes  72.1 Mbits/sec  0.033 ms
> >> 1589/46651 (3.4%)
> >> [  5]   5.00-6.00   sec  8.60 MBytes  72.1 Mbits/sec  0.036 ms
> >> 1904/46967 (4.1%)
> >> [  5]   6.00-7.00   sec  8.61 MBytes  72.2 Mbits/sec  0.046 ms
> >> 1633/46766 (3.5%)
> >> [  5]   7.00-8.00   sec  8.60 MBytes  72.1 Mbits/sec  0.030 ms
> >> 1748/46818 (3.7%)
> >> [  5]   8.00-9.00   sec  8.61 MBytes  72.2 Mbits/sec  0.036 ms
> >> 1572/46707 (3.4%)
> >> [  5]   9.00-10.00  sec  8.61 MBytes  72.3 Mbits/sec  0.031 ms
> >> 1538/46703 (3.3%)
> >> [  5]  10.00-10.04  sec   389 KBytes  72.1 Mbits/sec  0.032 ms
> >> 50/2044 (2.4%)
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval           Transfer     Bitrate         Jitter
> >> Lost/Total Datagrams
> >> [  5]   0.00-10.04  sec  85.2 MBytes  71.1 Mbits/sec  0.032 ms
> >> 16561/463165 (3.6%)  receiver
> >>
> >> [root@imx6q ~]# mpstat -P ALL 4 1
> >> Linux 5.4.8-rt11-yocto-standard  03/17/00 _armv7l_ (4 CPU)
> >> 23:15:50
> CPU    %usr   %nice    %sys %iowait    %irq   %soft
> >> %steal  %guest   %idle
> >> 23:15:54     all    3.06    0.00   26.11    0.00    0.00   11.09
> >> 0.00    0.00   59.74
> >> 23:15:54       0    0.00    0.00    0.00    0.00    0.00
> 86.44
> >> 0.00    0.00   13.56
> >> 23:15:54       1   10.25    0.00   89.50    0.00    0.00
> 0.25
> >> 0.00    0.00    0.00
> >> 23:15:54       2    0.00    0.00    0.00    0.00    0.00
> 0.00
> >> 0.00    0.00  100.00
> >> 23:15:54       3    0.00    0.00    0.00    0.00    0.00
> 0.00
> >> 0.00    0.00  100.00
> >> => ksoftirqd/* @ 0% CPU; iperf3 @ 100%
> >>
> >> [root@imx6q ~]# sar -I 60 1
> >> 23:56:10         INTR    intr/s
> >> 23:56:11           60  12339.00
> >> 23:56:12           60  11476.00
> >> => irq load high
> >>
> >>
> >> -------------------------------
> >> - STRESSED
> >> - stress -m 1 --vm-bytes 100M &
> >> -------------------------------
> >> user@ws:~/$ iperf3 -c 192.168.1.2 -u -l 200 -b 100M -t 10 Connecting
> >> to host 192.168.1.2, port 5201 [  4] local 192.168.1.1 port 59042
> >> connected to
> >> 192.168.1.2 port 5201
> >> [ ID] Interval           Transfer     Bandwidth       Total
> Datagrams
> >> [  4]   0.00-1.00   sec  7.92 MBytes  66.2 Mbits/sec  41527
> >> [  4]   1.00-2.00   sec  8.95 MBytes  75.4 Mbits/sec  46931
> >> [  4]   2.00-3.00   sec  8.95 MBytes  75.1 Mbits/sec  46923
> >> [  4]   3.00-4.00   sec  8.96 MBytes  75.2 Mbits/sec  46983
> >> [  4]   4.00-5.00   sec  8.96 MBytes  75.2 Mbits/sec  46999
> >> [  4]   5.00-6.00   sec  8.91 MBytes  74.5 Mbits/sec  46688
> >> [  4]   6.00-7.00   sec  8.93 MBytes  75.1 Mbits/sec  46824
> >> [  4]   7.00-8.00   sec  8.94 MBytes  75.0 Mbits/sec  46872
> >> [  4]   8.00-9.00   sec  8.91 MBytes  74.7 Mbits/sec  46700
> >> [  4]   9.00-10.00  sec  8.91 MBytes  74.8 Mbits/sec  46720
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval           Transfer     Bandwidth       Jitter
> >> Lost/Total Datagrams
> >> [  4]   0.00-10.00  sec  88.3 MBytes  74.1 Mbits/sec  0.042 ms
> >> 70360/462955 (15%)
> >> [  4] Sent 462955 datagrams
> >>
> >>
> >> [root@imx6q ~]# iperf3 -s
> >> Server listening on 5201
> >> Accepted connection from 192.168.1.1, port 46306 [  5] local
> >> 192.168.1.2 port 5201 connected to 192.168.1.1 port 59042
> >> [ ID] Interval           Transfer     Bitrate         Jitter
> >> Lost/Total Datagrams
> >> [  5]   0.00-1.00   sec  6.54 MBytes  54.8 Mbits/sec  0.038 ms
> >> 5602/39871 (14%)
> >> [  5]   1.00-2.00   sec  7.51 MBytes  63.0 Mbits/sec  0.037 ms
> >> 6973/46362 (15%)
> >> [  5]   2.00-3.00   sec  7.54 MBytes  63.3 Mbits/sec  0.037 ms
> >> 7414/46966 (16%)
> >> [  5]   3.00-4.00   sec  7.56 MBytes  63.4 Mbits/sec  0.038 ms
> >> 7354/46984 (16%)
> >> [  5]   4.00-5.00   sec  7.60 MBytes  63.7 Mbits/sec  0.031 ms
> >> 7241/47069 (15%)
> >> [  5]   5.00-6.00   sec  7.58 MBytes  63.6 Mbits/sec  0.033 ms
> >> 7134/46865 (15%)
> >> [  5]   6.00-7.00   sec  7.56 MBytes  63.5 Mbits/sec  0.058 ms
> >> 6991/46649 (15%)
> >> [  5]   7.00-8.00   sec  7.57 MBytes  63.5 Mbits/sec  0.043 ms
> >> 7259/46933 (15%)
> >> [  5]   8.00-9.00   sec  7.56 MBytes  63.4 Mbits/sec  0.038 ms
> >> 7065/46721 (15%)
> >> [  5]   9.00-10.00  sec  7.55 MBytes  63.3 Mbits/sec  0.042 ms
> >> 7002/46588 (15%)
> >> [  5]  10.00-10.04  sec   317 KBytes  61.8 Mbits/sec  0.042 ms
> >> 325/1947 (17%)
> >> - - - - - - - - - - - - - - - - - - - - - - - - -
> >> [ ID] Interval           Transfer     Bitrate         Jitter
> >> Lost/Total Datagrams
> >> [  5]   0.00-10.04  sec  74.9 MBytes  62.6 Mbits/sec  0.042 ms
> >> 70360/462955 (15%)  receiver
> >>
> >>
> >> [root@imx6q ~]# mpstat -P ALL 4 1
> >> Linux 5.4.8-rt11-yocto-standard 03/17/00 _armv7l_ (4 CPU)
> >> 23:49:59
> CPU    %usr   %nice    %sys %iowait    %irq   %soft
> >> %steal  %guest   %idle
> >> 23:50:03     all    3.21    0.00   47.27    0.00    0.00   25.02
> >> 0.00    0.00   24.51
> >> 23:50:03       0    0.00    0.00    0.50    0.00    0.00
> 99.50
> >> 0.00    0.00    0.00
> >> 23:50:03       1    0.00    0.00    0.26    0.00    0.00
> 0.00
> >> 0.00    0.00   99.74
> >> 23:50:03       2    9.50    0.00   90.50    0.00    0.00
> 0.00
> >> 0.00    0.00    0.00
> >> 23:50:03       3    3.01    0.00   96.99    0.00    0.00
> 0.00
> >> 0.00    0.00    0.00
> >> => ksoftirqd/0 @ 92%; stress @ 100%; iperf3 @ 100%
> >>
> >> [root@imx6q ~]# sar -I 60 1
> >> 23:57:08           60    505.00
> >> 23:57:09           60    748.00
> >> 23:57:10           60    416.00
> >> => irq load low => napi active but not all the time
> >>
> >>
> ==============================================================
> >> =
> >>
> >> As a result the patched system stressed with memory load has a higher
> >> throughput than the unpatched system without memory stress.
> >> The system could nearly keep up with the desktop sending @ 75MBit/s
> >> (100MBit/s link) with i210 / igb.
> >> I think IRQ/s could reach a new peak because of the faster napi_poll
> >> / rx_queue execution times.
> >> iperf, iperf3 is mainly to test throughput. Is there any tool which
> >> checks for connection reliability (packet content, sequence, ...)?
> >>
> >> On Thu, Jul 2, 2020 at 11:14 AM Andy Duan <fugang.duan@nxp.com>
> wrote:
> >>>
> >>> From: Kegl Rohit <keglrohit@gmail.com> Sent: Thursday, July 2, 2020
> >>> 3:39 PM
> >>>> On Thu, Jul 2, 2020 at 6:18 AM Andy Duan <fugang.duan@nxp.com>
> >> wrote:
> >>>>> That should ensure the whole area is not dirty.
> >>>>
> >>>> dma_sync_single_for_cpu() and dma_sync_single_for_device() can or
> >>>> must be used in pairs?
> >>>> So in this case it is really necessary to sync back the skb data
> >>>> buffer via dma_sync_single_for_device? Even when the CPU does not
> >>>> change any bytes in the skb data buffer / readonly like in this case.
> >>>
> >>> No, if the buffer is not modified, dma_sync_single_for_device() is
> >>> not
> >> necessary.
> >>>
> >>> For some arm64 core, the dcache invalidate on A53 is flush +
> >>> invalidate, once the buffer is modified, it will cause read back
> >>> wrong data
> >> without dma_sync_single_for_device().
> >>> And the driver also support Coldfire platforms, I am not family with the
> arch.
> >>>
> >>>  From current analyze for arm/arm64,  I also think
> >>> dma_sync_single_for_device() is not necessary due to the buffer is
> >>> not
> >> modified.
> >>>
> >>> Anyway, it still need to get other experts comment, and it need to
> >>> do many
> >> test and stress test.
> >>>
> >>>> And there is no DMA_BIDIRECTIONAL mapping.
> >>>>
> >>>> I thought copybreak it is not about the next frame size. It is
> >>>> about the current frame. And the actual length is known via the
> >>>> size field in the finished DMA descriptor.
> >>>> Or do you mean that the next received frame could be no copybreak
> >> frame.
> >>>> 1. Rx copybreakable frame with sizeX < copybreak 2. copybreak
> >>>> dma_sync_single_for_cpu(dmabuffer, sizeX) 3. copybreak alloc
> >>>> new_skb, memcpy(new_skb, dmabuffer, sizeX) 4. copybreak
> >>>> dma_sync_single_for_device(dmabuffer, sizeX) 5. Rx non
> >>>> copybreakable frame with sizeY >= copybreak 4.
> >>>> dma_unmap_single(dmabuffer, FEC_ENET_RX_FRSIZE - fep->rx_align) is
> >>>> called and can cause data corruption because not all bytes were
> >>>> marked dirty even if nobody DMA & CPU touched them?
> >>>
> >>> No CPU touch, it should be clean.
> >>>>
> >>>>>> I am new to the DMA API on ARM. Are these changes regarding cache
> >>>>>> flushing,... allowed? These would increase the copybreak
> >>>>>> throughput by reducing CPU load.
> >>>>>
> >>>>> To avoid FIFO overrun, it requires to ensure PHY pause frame is
> enabled.
> >>>>
> >>>> As the errata states this is also not always true, because the
> >>>> first xoff could arrive too late. Pause frames/flow control is not
> >>>> really common and could cause troubles with other random network
> >> components
> >>>> acting different or not supporting pause frames correctly. For
> >>>> example the driver itself does enable pause frames for Gigabit by
> >>>> default. But we have no Gigabit Phy so no FEC_QUIRK_HAS_GBIT and
> >>>> therefore pause frames are not supported by the driver as of now.
> >>>>
> >>>>
> >>>> It looks like copybreak is implemented similar to e1000_main.c
> >>>> e1000_copybreak().
> >>>> There is only the real/needed packet length (length =
> >>>> le16_to_cpu(rx_desc->length)) is synced via dma_sync_single_for_cpu
> >>>> and no dma_sync_single_for_device.
> >>>>
> >>>> Here is a diff with the previous changes assuming that
> >>>> dma_sync_single_for_device must be used to avoid any cache flush
> >>>> backs even when no data was changed.
> >>>
> >>> Below change seems fine, can you collect some data before you send
> >>> out the patch for review.
> >>> - run iperf stress test to ensure the stability
> >>> - collect the performance improvement data
> >>>
> >>> Thanks.
> >>>>
> >>>> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> >>>> b/drivers/net/ethernet/freescale/fec_main.c
> >>>> index 2d0d313ee..464783c15 100644
> >>>> --- a/drivers/net/ethernet/freescale/fec_main.c
> >>>> +++ b/drivers/net/ethernet/freescale/fec_main.c
> >>>> @@ -1387,9 +1387,9 @@ static bool fec_enet_copybreak(struct
> >>>> net_device *ndev, struct sk_buff **skb,
> >>>>                  return false;
> >>>>
> >>>>          dma_sync_single_for_cpu(&fep->pdev->dev,
> >>>>
> >> fec32_to_cpu(bdp->cbd_bufaddr),
> >>>> -                               FEC_ENET_RX_FRSIZE -
> >> fep->rx_align,
> >>>> +                               length,
> >>>>                                  DMA_FROM_DEVICE);
> >>>>          if (!swap)
> >>>>                  memcpy(new_skb->data, (*skb)->data, length);
> >>>>          else
> >>>> @@ -1413,8 +1413,9 @@ fec_enet_rx_queue(struct net_device *ndev,
> >> int
> >>>> budget, u16 queue_id)
> >>>>          unsigned short status;
> >>>>          struct  sk_buff *skb_new = NULL;
> >>>>          struct  sk_buff *skb;
> >>>>          ushort  pkt_len;
> >>>> +       ushort  pkt_len_nofcs;
> >>>>          __u8 *data;
> >>>>          int     pkt_received = 0;
> >>>>          struct  bufdesc_ex *ebdp = NULL;
> >>>>          bool    vlan_packet_rcvd = false;
> >>>> @@ -1479,9 +1480,10 @@ fec_enet_rx_queue(struct net_device
> *ndev,
> >>>> int budget, u16 queue_id)
> >>>>                  /* The packet length includes FCS, but we don't
> >>>> want
> >> to
> >>>>                   * include that when passing upstream as it
> messes
> >> up
> >>>>                   * bridging applications.
> >>>>                   */
> >>>> -               is_copybreak = fec_enet_copybreak(ndev, &skb, bdp,
> >>>> pkt_len - 4,
> >>>> +               pkt_len_nofcs = pkt_len - 4;
> >>>> +               is_copybreak = fec_enet_copybreak(ndev, &skb,
> bdp,
> >>>> pkt_len_nofcs,
> >>>>
> >> need_swap);
> >>>>                  if (!is_copybreak) {
> >>>>                          skb_new = netdev_alloc_skb(ndev,
> >>>> FEC_ENET_RX_FRSIZE);
> >>>>                          if (unlikely(!skb_new)) { @@ -1554,9
> >> +1556,9
> >>>> @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16
> >>>> queue_id)
> >>>>
> >>>>                  if (is_copybreak) {
> >>>>
> >> dma_sync_single_for_device(&fep->pdev->dev,
> >>>>
> >>>> fec32_to_cpu(bdp->cbd_bufaddr),
> >>>> -
> >>>> FEC_ENET_RX_FRSIZE
> >>>> - fep->rx_align,
> >>>> +
> >>>> pkt_len_nofcs,
> >>>>
> >>>> DMA_FROM_DEVICE);
> >>>>                  } else {
> >>>>                          rxq->rx_skbuff[index] = skb_new;
> >>>>                          fec_enet_new_rxbdp(ndev, bdp,
> skb_new);

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-07-08  5:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-01 18:44 net: ethernet: freescale: fec: copybreak handling throughput, dma_sync_* optimisations allowed? Kegl Rohit
2020-07-02  4:18 ` [EXT] " Andy Duan
2020-07-02  7:38   ` Kegl Rohit
2020-07-02  9:14     ` Andy Duan
2020-07-06 14:18       ` Kegl Rohit
2020-07-07  3:44         ` Andy Duan
2020-07-07 16:14           ` Robin Murphy
2020-07-08  5:29             ` Andy Duan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).