All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] net: macb: DMA race condition fixes
@ 2018-11-30 18:21 Anssi Hannula
  2018-11-30 18:21 ` [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA Anssi Hannula
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Anssi Hannula @ 2018-11-30 18:21 UTC (permalink / raw)
  To: Nicolas Ferre, David S. Miller; +Cc: netdev

Hi all,

Here are a couple of race condition fixes for the macb driver. The first
two are issues observed on real HW.

Anssi Hannula (3):
      net: macb: fix random memory corruption on RX with 64-bit DMA
      net: macb: fix dropped RX frames due to a race
      net: macb: add missing barriers when reading buffers

 drivers/net/ethernet/cadence/macb_main.c | 34 +++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 6 deletions(-)

-- 
Anssi Hannula / Bitwise Oy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA
  2018-11-30 18:21 [PATCH 0/3] net: macb: DMA race condition fixes Anssi Hannula
@ 2018-11-30 18:21 ` Anssi Hannula
  2018-12-03  4:44   ` Harini Katakam
                     ` (2 more replies)
  2018-11-30 18:21 ` [PATCH 2/3] net: macb: fix dropped RX frames due to a race Anssi Hannula
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 31+ messages in thread
From: Anssi Hannula @ 2018-11-30 18:21 UTC (permalink / raw)
  To: Nicolas Ferre, David S. Miller; +Cc: netdev, Harini Katakam, Michal Simek

64-bit DMA addresses are split in upper and lower halves that are
written in separate fields on GEM. For RX, bit 0 of the address is used
as the ownership bit (RX_USED). When the RX_USED bit is unset the
controller is allowed to write data to the buffer.

The driver does not guarantee that the controller already sees the upper
half when the RX_USED bit is cleared, possibly resulting in the
controller writing an incoming frame to an address with an incorrect
upper half and therefore possibly corrupting unrelated system memory.

Fix that by adding the necessary DMA memory barrier between the writes.

This corruption was observed on a ZynqMP based system.

Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Fixes: fff8019a08b6 ("net: macb: Add 64 bit addressing support for GEM")
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Harini Katakam <harini.katakam@xilinx.com>
Cc: Michal Simek <michal.simek@xilinx.com>
---
 drivers/net/ethernet/cadence/macb_main.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index d8c7ca037ae3..0bc2aab7be40 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct macb_dma_desc *desc, dma_addr_
 	if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
 		desc_64 = macb_64b_desc(bp, desc);
 		desc_64->addrh = upper_32_bits(addr);
+		/* The low bits of RX address contain the RX_USED bit, clearing
+		 * of which allows packet RX. Make sure the high bits are also
+		 * visible to HW at that point.
+		 */
+		dma_wmb();
 	}
 #endif
 	desc->addr = lower_32_bits(addr);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/3] net: macb: fix dropped RX frames due to a race
  2018-11-30 18:21 [PATCH 0/3] net: macb: DMA race condition fixes Anssi Hannula
  2018-11-30 18:21 ` [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA Anssi Hannula
@ 2018-11-30 18:21 ` Anssi Hannula
  2018-12-03  4:52   ` Harini Katakam
  2018-12-05 12:38   ` Claudiu.Beznea
  2018-11-30 18:21 ` [PATCH 3/3] net: macb: add missing barriers when reading buffers Anssi Hannula
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 31+ messages in thread
From: Anssi Hannula @ 2018-11-30 18:21 UTC (permalink / raw)
  To: Nicolas Ferre, David S. Miller; +Cc: netdev

Bit RX_USED set to 0 in the address field allows the controller to write
data to the receive buffer descriptor.

The driver does not ensure the ctrl field is ready (cleared) when the
controller sees the RX_USED=0 written by the driver. The ctrl field might
only be cleared after the controller has already updated it according to
a newly received frame, causing the frame to be discarded in gem_rx() due
to unexpected ctrl field contents.

A message is logged when the above scenario occurs:

  macb ff0b0000.ethernet eth0: not whole frame pointed by descriptor

Fix the issue by ensuring that when the controller sees RX_USED=0 the
ctrl field is already cleared.

This issue was observed on a ZynqMP based system.

Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Fixes: 4df95131ea80 ("net/macb: change RX path for GEM")
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
---
 drivers/net/ethernet/cadence/macb_main.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 0bc2aab7be40..430b7a0f5436 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -935,14 +935,19 @@ static void gem_rx_refill(struct macb_queue *queue)
 
 			if (entry == bp->rx_ring_size - 1)
 				paddr |= MACB_BIT(RX_WRAP);
-			macb_set_addr(bp, desc, paddr);
 			desc->ctrl = 0;
+			/* Setting addr clears RX_USED and allows reception,
+			 * make sure ctrl is cleared first to avoid a race.
+			 */
+			dma_wmb();
+			macb_set_addr(bp, desc, paddr);
 
 			/* properly align Ethernet header */
 			skb_reserve(skb, NET_IP_ALIGN);
 		} else {
-			desc->addr &= ~MACB_BIT(RX_USED);
 			desc->ctrl = 0;
+			dma_wmb();
+			desc->addr &= ~MACB_BIT(RX_USED);
 		}
 	}
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-11-30 18:21 [PATCH 0/3] net: macb: DMA race condition fixes Anssi Hannula
  2018-11-30 18:21 ` [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA Anssi Hannula
  2018-11-30 18:21 ` [PATCH 2/3] net: macb: fix dropped RX frames due to a race Anssi Hannula
@ 2018-11-30 18:21 ` Anssi Hannula
  2018-12-05 12:37   ` Claudiu.Beznea
  2018-12-03  8:26 ` [PATCH 0/3] net: macb: DMA race condition fixes Nicolas.Ferre
  2018-12-05 20:35 ` David Miller
  4 siblings, 1 reply; 31+ messages in thread
From: Anssi Hannula @ 2018-11-30 18:21 UTC (permalink / raw)
  To: Nicolas Ferre, David S. Miller; +Cc: netdev

When reading buffer descriptors on RX or on TX completion, an
RX_USED/TX_USED bit is checked first to ensure that the descriptor has
been populated. However, there are no memory barriers to ensure that the
data protected by the RX_USED/TX_USED bit is up-to-date with respect to
that bit.

Fix that by adding DMA read memory barriers on those paths.

I did not observe any actual issues caused by these being missing,
though.

Tested on a ZynqMP based system.

Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
---
 drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 430b7a0f5436..c93baa8621d5 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
 
 			/* First, update TX stats if needed */
 			if (skb) {
+				/* Ensure all of desc is at least as up-to-date
+				 * as ctrl (TX_USED bit)
+				 */
+				dma_rmb();
+
 				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
 					/* skb now belongs to timestamp buffer
 					 * and will be removed later
@@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
 		rmb();
 
 		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
-		addr = macb_get_addr(bp, desc);
-		ctrl = desc->ctrl;
 
 		if (!rxused)
 			break;
 
+		/* Ensure other data is at least as up-to-date as rxused */
+		dma_rmb();
+
+		addr = macb_get_addr(bp, desc);
+		ctrl = desc->ctrl;
+
 		queue->rx_tail++;
 		count++;
 
@@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
 		/* Make hw descriptor updates visible to CPU */
 		rmb();
 
-		ctrl = desc->ctrl;
-
 		if (!(desc->addr & MACB_BIT(RX_USED)))
 			break;
 
+		/* Ensure other data is at least as up-to-date as addr */
+		dma_rmb();
+
+		ctrl = desc->ctrl;
+
 		if (ctrl & MACB_BIT(RX_SOF)) {
 			if (first_frag != -1)
 				discard_partial_frame(queue, first_frag, tail);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA
  2018-11-30 18:21 ` [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA Anssi Hannula
@ 2018-12-03  4:44   ` Harini Katakam
  2018-12-05 12:37   ` Claudiu.Beznea
  2018-12-05 20:32   ` David Miller
  2 siblings, 0 replies; 31+ messages in thread
From: Harini Katakam @ 2018-12-03  4:44 UTC (permalink / raw)
  To: anssi.hannula
  Cc: Nicolas Ferre, David Miller, netdev, Harini Katakam, Michal Simek

On Fri, Nov 30, 2018 at 11:53 PM Anssi Hannula <anssi.hannula@bitwise.fi> wrote:
>
> 64-bit DMA addresses are split in upper and lower halves that are
> written in separate fields on GEM. For RX, bit 0 of the address is used
> as the ownership bit (RX_USED). When the RX_USED bit is unset the
> controller is allowed to write data to the buffer.
>
> The driver does not guarantee that the controller already sees the upper
> half when the RX_USED bit is cleared, possibly resulting in the
> controller writing an incoming frame to an address with an incorrect
> upper half and therefore possibly corrupting unrelated system memory.
>
> Fix that by adding the necessary DMA memory barrier between the writes.
>
> This corruption was observed on a ZynqMP based system.
>
> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
> Fixes: fff8019a08b6 ("net: macb: Add 64 bit addressing support for GEM")
> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
> Cc: Harini Katakam <harini.katakam@xilinx.com>
> Cc: Michal Simek <michal.simek@xilinx.com>

Acked-by: Harini Katakam <harini.katakam@xilinx.com>

Regards,
Harini

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] net: macb: fix dropped RX frames due to a race
  2018-11-30 18:21 ` [PATCH 2/3] net: macb: fix dropped RX frames due to a race Anssi Hannula
@ 2018-12-03  4:52   ` Harini Katakam
  2018-12-03 10:31     ` Anssi Hannula
  2018-12-05 12:38   ` Claudiu.Beznea
  1 sibling, 1 reply; 31+ messages in thread
From: Harini Katakam @ 2018-12-03  4:52 UTC (permalink / raw)
  To: anssi.hannula; +Cc: Nicolas Ferre, David Miller, netdev

Hi Anssi,
On Fri, Nov 30, 2018 at 11:53 PM Anssi Hannula <anssi.hannula@bitwise.fi> wrote:
>
> Bit RX_USED set to 0 in the address field allows the controller to write
> data to the receive buffer descriptor.
>
> The driver does not ensure the ctrl field is ready (cleared) when the
> controller sees the RX_USED=0 written by the driver. The ctrl field might
> only be cleared after the controller has already updated it according to
> a newly received frame, causing the frame to be discarded in gem_rx() due
> to unexpected ctrl field contents.
>
> A message is logged when the above scenario occurs:
>
>   macb ff0b0000.ethernet eth0: not whole frame pointed by descriptor
>
> Fix the issue by ensuring that when the controller sees RX_USED=0 the
> ctrl field is already cleared.
>
> This issue was observed on a ZynqMP based system.
>

Thanks for the patch.
Could you please describe the test in which this behavior was observed?
Were you able to confirm that this was because of the ctrl field being
cleared late? This error can also be observed under stress when RX UBR
is observed.
I understand it makes sense to clear ctrl field before setting RX used bit.
But I'm trying to understand if a dmb is necessary in the receive data path.

Regards,
Harini

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] net: macb: DMA race condition fixes
  2018-11-30 18:21 [PATCH 0/3] net: macb: DMA race condition fixes Anssi Hannula
                   ` (2 preceding siblings ...)
  2018-11-30 18:21 ` [PATCH 3/3] net: macb: add missing barriers when reading buffers Anssi Hannula
@ 2018-12-03  8:26 ` Nicolas.Ferre
  2018-12-03 23:56   ` David Miller
  2018-12-05 20:35 ` David Miller
  4 siblings, 1 reply; 31+ messages in thread
From: Nicolas.Ferre @ 2018-12-03  8:26 UTC (permalink / raw)
  To: anssi.hannula, davem, Claudiu.Beznea; +Cc: netdev

On 30/11/2018 at 19:21, Anssi Hannula wrote:
> Hi all,
> 
> Here are a couple of race condition fixes for the macb driver. The first
> two are issues observed on real HW.
> 
> Anssi Hannula (3):
>        net: macb: fix random memory corruption on RX with 64-bit DMA
>        net: macb: fix dropped RX frames due to a race
>        net: macb: add missing barriers when reading buffers
> 
>   drivers/net/ethernet/cadence/macb_main.c | 34 +++++++++++++++++++++-----
>   1 file changed, 28 insertions(+), 6 deletions(-)

David,

Can you please delay a bit the acceptance of this series, I would like 
that we assess these findings with tests on our hardware before applying 
them.

Anssi,

Thanks a lot for this series. On our side, we have one pending patch 
about fixing a race condition on our hardware. We would need to redo 
some tests with our patch and yours so that we know better the behavior 
of our hardware.

Best regards,
-- 
Nicolas Ferre

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] net: macb: fix dropped RX frames due to a race
  2018-12-03  4:52   ` Harini Katakam
@ 2018-12-03 10:31     ` Anssi Hannula
  2018-12-03 10:36       ` Harini Katakam
  0 siblings, 1 reply; 31+ messages in thread
From: Anssi Hannula @ 2018-12-03 10:31 UTC (permalink / raw)
  To: Harini Katakam; +Cc: Nicolas Ferre, David Miller, netdev

Hi,

On 3.12.2018 6:52, Harini Katakam wrote:
> Hi Anssi,
> On Fri, Nov 30, 2018 at 11:53 PM Anssi Hannula <anssi.hannula@bitwise.fi> wrote:
>> Bit RX_USED set to 0 in the address field allows the controller to write
>> data to the receive buffer descriptor.
>>
>> The driver does not ensure the ctrl field is ready (cleared) when the
>> controller sees the RX_USED=0 written by the driver. The ctrl field might
>> only be cleared after the controller has already updated it according to
>> a newly received frame, causing the frame to be discarded in gem_rx() due
>> to unexpected ctrl field contents.
>>
>> A message is logged when the above scenario occurs:
>>
>>   macb ff0b0000.ethernet eth0: not whole frame pointed by descriptor
>>
>> Fix the issue by ensuring that when the controller sees RX_USED=0 the
>> ctrl field is already cleared.
>>
>> This issue was observed on a ZynqMP based system.
>>
> Thanks for the patch.
> Could you please describe the test in which this behavior was observed?

Sure. The testcase I used for the patches is:

- RT_FULL kernel,
- CPU-bound SCHED_FF RT priority 15 process (with
rcutree.kthread_prio=20 to avoid RCU starvation),
- Pyropus memtester running for 3GB (system has 4GB memory),
- "ping -f -l 5000 -s 100" running from a PC.

The "not whole frame pointed by descriptor" issue occurs within minutes
and the RX memory corruption within an hour. I did not try to reduce the
testcase to a minimum.

Both were also observed using real production loads (that of course do
not have CPU-bound RT tasks).

> Were you able to confirm that this was because of the ctrl field being
> cleared late? This error can also be observed under stress when RX UBR
> is observed.

I observed that the issue occurred without this patch, and didn't occur
after applying this patch (individually), but I didn't check it further
than that. If you have anything you'd like me to test, let me know.

> I understand it makes sense to clear ctrl field before setting RX used bit.
> But I'm trying to understand if a dmb is necessary in the receive data path.

A barrier is needed as otherwise writes to the ctrl field and addr
fields may be freely reordered as they are just regular writes to
memory, as far as I've understood it, potentially undoing the effect of
changing the order.

If you mean the third patch, same issue with reads - they may be freely
reordered (by e.g. the compiler) as they are just regular reads.

But I don't claim to be an expert on these so please correct me if I'm
wrong :)

-- 
Anssi Hannula / Bitwise Oy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] net: macb: fix dropped RX frames due to a race
  2018-12-03 10:31     ` Anssi Hannula
@ 2018-12-03 10:36       ` Harini Katakam
  0 siblings, 0 replies; 31+ messages in thread
From: Harini Katakam @ 2018-12-03 10:36 UTC (permalink / raw)
  To: anssi.hannula; +Cc: Nicolas Ferre, David Miller, netdev

Hi Anssi,
On Mon, Dec 3, 2018 at 4:02 PM Anssi Hannula <anssi.hannula@bitwise.fi> wrote:
>
> Hi,
>
> On 3.12.2018 6:52, Harini Katakam wrote:
> > Hi Anssi,
> > On Fri, Nov 30, 2018 at 11:53 PM Anssi Hannula <anssi.hannula@bitwise.fi> wrote:
> >> Bit RX_USED set to 0 in the address field allows the controller to write
> >> data to the receive buffer descriptor.
> >>
> >> The driver does not ensure the ctrl field is ready (cleared) when the
> >> controller sees the RX_USED=0 written by the driver. The ctrl field might
> >> only be cleared after the controller has already updated it according to
> >> a newly received frame, causing the frame to be discarded in gem_rx() due
> >> to unexpected ctrl field contents.
> >>
> >> A message is logged when the above scenario occurs:
> >>
> >>   macb ff0b0000.ethernet eth0: not whole frame pointed by descriptor
> >>
> >> Fix the issue by ensuring that when the controller sees RX_USED=0 the
> >> ctrl field is already cleared.
> >>
> >> This issue was observed on a ZynqMP based system.
> >>
> > Thanks for the patch.
> > Could you please describe the test in which this behavior was observed?
>
> Sure. The testcase I used for the patches is:
>
> - RT_FULL kernel,
> - CPU-bound SCHED_FF RT priority 15 process (with
> rcutree.kthread_prio=20 to avoid RCU starvation),
> - Pyropus memtester running for 3GB (system has 4GB memory),
> - "ping -f -l 5000 -s 100" running from a PC.
>
> The "not whole frame pointed by descriptor" issue occurs within minutes
> and the RX memory corruption within an hour. I did not try to reduce the
> testcase to a minimum.
>
> Both were also observed using real production loads (that of course do
> not have CPU-bound RT tasks).
>
> > Were you able to confirm that this was because of the ctrl field being
> > cleared late? This error can also be observed under stress when RX UBR
> > is observed.
>
> I observed that the issue occurred without this patch, and didn't occur
> after applying this patch (individually), but I didn't check it further
> than that. If you have anything you'd like me to test, let me know.

Thanks for the details.

Regards,
Harini

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] net: macb: DMA race condition fixes
  2018-12-03  8:26 ` [PATCH 0/3] net: macb: DMA race condition fixes Nicolas.Ferre
@ 2018-12-03 23:56   ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2018-12-03 23:56 UTC (permalink / raw)
  To: Nicolas.Ferre; +Cc: anssi.hannula, Claudiu.Beznea, netdev

From: <Nicolas.Ferre@microchip.com>
Date: Mon, 3 Dec 2018 08:26:52 +0000

> Can you please delay a bit the acceptance of this series, I would like 
> that we assess these findings with tests on our hardware before applying 
> them.

Sure.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA
  2018-11-30 18:21 ` [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA Anssi Hannula
  2018-12-03  4:44   ` Harini Katakam
@ 2018-12-05 12:37   ` Claudiu.Beznea
  2018-12-05 13:58     ` Anssi Hannula
  2018-12-05 20:32   ` David Miller
  2 siblings, 1 reply; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-05 12:37 UTC (permalink / raw)
  To: anssi.hannula, Nicolas.Ferre, davem; +Cc: netdev, harini.katakam, michal.simek

Hi Anssi,

Few comments... Otherwise I tested this series on a SAMA5D2 Xplained and
SAMA5D4 Xplained under heavy traffic and it seems to behave OK.

Thank you,
Claudiu Beznea

On 30.11.2018 20:21, Anssi Hannula wrote:
> 64-bit DMA addresses are split in upper and lower halves that are
> written in separate fields on GEM. For RX, bit 0 of the address is used
> as the ownership bit (RX_USED). When the RX_USED bit is unset the
> controller is allowed to write data to the buffer.
> 
> The driver does not guarantee that the controller already sees the upper
> half when the RX_USED bit is cleared, possibly resulting in the
> controller writing an incoming frame to an address with an incorrect
> upper half and therefore possibly corrupting unrelated system memory.
> 
> Fix that by adding the necessary DMA memory barrier between the writes.>
> This corruption was observed on a ZynqMP based system.
> 
> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
> Fixes: fff8019a08b6 ("net: macb: Add 64 bit addressing support for GEM")
> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
> Cc: Harini Katakam <harini.katakam@xilinx.com>
> Cc: Michal Simek <michal.simek@xilinx.com>
> ---
>  drivers/net/ethernet/cadence/macb_main.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index d8c7ca037ae3..0bc2aab7be40 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct macb_dma_desc *desc, dma_addr_
>  	if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
>  		desc_64 = macb_64b_desc(bp, desc);
>  		desc_64->addrh = upper_32_bits(addr);
> +		/* The low bits of RX address contain the RX_USED bit, clearing
> +		 * of which allows packet RX. Make sure the high bits are also
> +		 * visible to HW at that point.
> +		 */
> +		dma_wmb();

I think a wmb() would fit better here so that on ARM to also force the
flushing of caches not affected by dmb() by calling arm_heavy_mb().

Thank you,
Claudiu Beznea

>  	}
>  #endif
>  	desc->addr = lower_32_bits(addr);
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-11-30 18:21 ` [PATCH 3/3] net: macb: add missing barriers when reading buffers Anssi Hannula
@ 2018-12-05 12:37   ` Claudiu.Beznea
  2018-12-05 14:00     ` Anssi Hannula
  0 siblings, 1 reply; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-05 12:37 UTC (permalink / raw)
  To: anssi.hannula, Nicolas.Ferre, davem; +Cc: netdev



On 30.11.2018 20:21, Anssi Hannula wrote:
> When reading buffer descriptors on RX or on TX completion, an
> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
> been populated. However, there are no memory barriers to ensure that the
> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
> that bit.
> 
> Fix that by adding DMA read memory barriers on those paths.
> 
> I did not observe any actual issues caused by these being missing,
> though.
> 
> Tested on a ZynqMP based system.
> 
> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
> ---
>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 430b7a0f5436..c93baa8621d5 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>  
>  			/* First, update TX stats if needed */
>  			if (skb) {
> +				/* Ensure all of desc is at least as up-to-date
> +				 * as ctrl (TX_USED bit)
> +				 */
> +				dma_rmb();
> +

Is this necessary? Wouldn't previous rmb() take care of this? At this time
data specific to this descriptor was completed. The TX descriptors for next
data to be send is updated under a locked spinlock.

>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>  					/* skb now belongs to timestamp buffer
>  					 * and will be removed later
> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>  		rmb();
>  
>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
> -		addr = macb_get_addr(bp, desc);
> -		ctrl = desc->ctrl;
>  
>  		if (!rxused)
>  			break;
>  
> +		/* Ensure other data is at least as up-to-date as rxused */
> +		dma_rmb();

Same here, wouldn't previous rmb() should do this job?

> +
> +		addr = macb_get_addr(bp, desc);
> +		ctrl = desc->ctrl;
> +
>  		queue->rx_tail++;
>  		count++;
>  
> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>  		/* Make hw descriptor updates visible to CPU */
>  		rmb();
>  
> -		ctrl = desc->ctrl;
> -
>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>  			break;
>  
> +		/* Ensure other data is at least as up-to-date as addr */
> +		dma_rmb();

Ditto

> +
> +		ctrl = desc->ctrl;
> +
>  		if (ctrl & MACB_BIT(RX_SOF)) {
>  			if (first_frag != -1)
>  				discard_partial_frame(queue, first_frag, tail);
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] net: macb: fix dropped RX frames due to a race
  2018-11-30 18:21 ` [PATCH 2/3] net: macb: fix dropped RX frames due to a race Anssi Hannula
  2018-12-03  4:52   ` Harini Katakam
@ 2018-12-05 12:38   ` Claudiu.Beznea
  1 sibling, 0 replies; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-05 12:38 UTC (permalink / raw)
  To: anssi.hannula, Nicolas.Ferre, davem; +Cc: netdev



On 30.11.2018 20:21, Anssi Hannula wrote:
> Bit RX_USED set to 0 in the address field allows the controller to write
> data to the receive buffer descriptor.
> 
> The driver does not ensure the ctrl field is ready (cleared) when the
> controller sees the RX_USED=0 written by the driver. The ctrl field might
> only be cleared after the controller has already updated it according to
> a newly received frame, causing the frame to be discarded in gem_rx() due
> to unexpected ctrl field contents.
> 
> A message is logged when the above scenario occurs:
> 
>   macb ff0b0000.ethernet eth0: not whole frame pointed by descriptor
> > Fix the issue by ensuring that when the controller sees RX_USED=0 the
> ctrl field is already cleared.
> 
> This issue was observed on a ZynqMP based system.
> 
> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
> Fixes: 4df95131ea80 ("net/macb: change RX path for GEM")
> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
> ---
>  drivers/net/ethernet/cadence/macb_main.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 0bc2aab7be40..430b7a0f5436 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -935,14 +935,19 @@ static void gem_rx_refill(struct macb_queue *queue)
>  
>  			if (entry == bp->rx_ring_size - 1)
>  				paddr |= MACB_BIT(RX_WRAP);
> -			macb_set_addr(bp, desc, paddr);
>  			desc->ctrl = 0;
> +			/* Setting addr clears RX_USED and allows reception,
> +			 * make sure ctrl is cleared first to avoid a race.
> +			 */
> +			dma_wmb();

Same here as in patch 1/1 with regards to memory barrier.

> +			macb_set_addr(bp, desc, paddr);
>  
>  			/* properly align Ethernet header */
>  			skb_reserve(skb, NET_IP_ALIGN);
>  		} else {
> -			desc->addr &= ~MACB_BIT(RX_USED);
>  			desc->ctrl = 0;
> +			dma_wmb();

Ditto

> +			desc->addr &= ~MACB_BIT(RX_USED);
>  		}
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA
  2018-12-05 12:37   ` Claudiu.Beznea
@ 2018-12-05 13:58     ` Anssi Hannula
  0 siblings, 0 replies; 31+ messages in thread
From: Anssi Hannula @ 2018-12-05 13:58 UTC (permalink / raw)
  To: Claudiu.Beznea; +Cc: Nicolas.Ferre, davem, netdev, harini.katakam, michal.simek

On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
> Hi Anssi,

Hi, and thanks for looking at these.

> Few comments... Otherwise I tested this series on a SAMA5D2 Xplained and
> SAMA5D4 Xplained under heavy traffic and it seems to behave OK.
>
> Thank you,
> Claudiu Beznea
>
> On 30.11.2018 20:21, Anssi Hannula wrote:
>> 64-bit DMA addresses are split in upper and lower halves that are
>> written in separate fields on GEM. For RX, bit 0 of the address is used
>> as the ownership bit (RX_USED). When the RX_USED bit is unset the
>> controller is allowed to write data to the buffer.
>>
>> The driver does not guarantee that the controller already sees the upper
>> half when the RX_USED bit is cleared, possibly resulting in the
>> controller writing an incoming frame to an address with an incorrect
>> upper half and therefore possibly corrupting unrelated system memory.
>>
>> Fix that by adding the necessary DMA memory barrier between the writes.>
>> This corruption was observed on a ZynqMP based system.
>>
>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>> Fixes: fff8019a08b6 ("net: macb: Add 64 bit addressing support for GEM")
>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>> Cc: Harini Katakam <harini.katakam@xilinx.com>
>> Cc: Michal Simek <michal.simek@xilinx.com>
>> ---
>>  drivers/net/ethernet/cadence/macb_main.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>> index d8c7ca037ae3..0bc2aab7be40 100644
>> --- a/drivers/net/ethernet/cadence/macb_main.c
>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>> @@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct macb_dma_desc *desc, dma_addr_
>>  	if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
>>  		desc_64 = macb_64b_desc(bp, desc);
>>  		desc_64->addrh = upper_32_bits(addr);
>> +		/* The low bits of RX address contain the RX_USED bit, clearing
>> +		 * of which allows packet RX. Make sure the high bits are also
>> +		 * visible to HW at that point.
>> +		 */
>> +		dma_wmb();
> I think a wmb() would fit better here so that on ARM to also force the
> flushing of caches not affected by dmb() by calling arm_heavy_mb().

Hmm, if we want to simply ensure ordering of the two writes (upper half,
lower half) to DMA memory, isn't dma_wmb() exactly for that purpose?
This situation seems to match the dma_wmb() example in
Documentation/memory-barriers.txt where data is updated before ownership
bit.

If dma_wmb() = dmb(oshst) is not enough on ARM for this purpose, should
not dma_wmb() be fixed then?
Or maybe I'm missing some difference why dma_wmb() is not enough here
but would be OK on some other case (e.g. in the memory-barriers.txt
example)?

> Thank you,
> Claudiu Beznea
>
>>  	}
>>  #endif
>>  	desc->addr = lower_32_bits(addr);
>>

-- 
Anssi Hannula / Bitwise Oy
+358 503803997

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-12-05 12:37   ` Claudiu.Beznea
@ 2018-12-05 14:00     ` Anssi Hannula
  2018-12-06 14:14       ` Claudiu.Beznea
  0 siblings, 1 reply; 31+ messages in thread
From: Anssi Hannula @ 2018-12-05 14:00 UTC (permalink / raw)
  To: Claudiu.Beznea; +Cc: Nicolas.Ferre, davem, netdev

On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>
> On 30.11.2018 20:21, Anssi Hannula wrote:
>> When reading buffer descriptors on RX or on TX completion, an
>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>> been populated. However, there are no memory barriers to ensure that the
>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>> that bit.
>>
>> Fix that by adding DMA read memory barriers on those paths.
>>
>> I did not observe any actual issues caused by these being missing,
>> though.
>>
>> Tested on a ZynqMP based system.
>>
>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>> ---
>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>> index 430b7a0f5436..c93baa8621d5 100644
>> --- a/drivers/net/ethernet/cadence/macb_main.c
>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>  
>>  			/* First, update TX stats if needed */
>>  			if (skb) {
>> +				/* Ensure all of desc is at least as up-to-date
>> +				 * as ctrl (TX_USED bit)
>> +				 */
>> +				dma_rmb();
>> +
> Is this necessary? Wouldn't previous rmb() take care of this? At this time
> data specific to this descriptor was completed. The TX descriptors for next
> data to be send is updated under a locked spinlock.

The previous rmb() is before the TX_USED check, so my understanding is
that the following could happen in theory:

1. rmb().
2. Reads are reordered so that TX timestamp is read first - no barriers
are crossed.
3. HW writes timestamp and sets TX_USED (or they become visible).
4. Code checks TX_USED.
5. Code operates on timestamp that is actually garbage.

I'm not 100% sure that there isn't some lighter/cleaner way to do this
than dma_rmb(), though.

>>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>>  					/* skb now belongs to timestamp buffer
>>  					 * and will be removed later
>> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>>  		rmb();
>>  
>>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>> -		addr = macb_get_addr(bp, desc);
>> -		ctrl = desc->ctrl;
>>  
>>  		if (!rxused)
>>  			break;
>>  
>> +		/* Ensure other data is at least as up-to-date as rxused */
>> +		dma_rmb();
> Same here, wouldn't previous rmb() should do this job?

The scenario I'm concerned about here (and in the last hunk) is:

1. rmb().
2. ctrl is read (i.e. ctrl read reordered before addr read).
3. HW updates to ctrl and addr become visible.
4. RX_USED check.
5. code operates on garbage ctrl.

I think it may be OK to move the earlier rmb() outside the loop so that
there is an rmb() only before and after the RX loop, as I don't at least
immediately see any hard requirement to do it on each loop pass (unlike
the added dma_rmb()). But my intent was to fix issues instead of
optimization so I didn't look too closely into that.

>> +
>> +		addr = macb_get_addr(bp, desc);
>> +		ctrl = desc->ctrl;
>> +
>>  		queue->rx_tail++;
>>  		count++;
>>  
>> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>>  		/* Make hw descriptor updates visible to CPU */
>>  		rmb();
>>  
>> -		ctrl = desc->ctrl;
>> -
>>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>>  			break;
>>  
>> +		/* Ensure other data is at least as up-to-date as addr */
>> +		dma_rmb();
> Ditto
>
>> +
>> +		ctrl = desc->ctrl;
>> +
>>  		if (ctrl & MACB_BIT(RX_SOF)) {
>>  			if (first_frag != -1)
>>  				discard_partial_frame(queue, first_frag, tail);
>>

-- 
Anssi Hannula / Bitwise Oy
+358 503803997

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA
  2018-11-30 18:21 ` [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA Anssi Hannula
  2018-12-03  4:44   ` Harini Katakam
  2018-12-05 12:37   ` Claudiu.Beznea
@ 2018-12-05 20:32   ` David Miller
  2018-12-06 14:16     ` Claudiu.Beznea
  2 siblings, 1 reply; 31+ messages in thread
From: David Miller @ 2018-12-05 20:32 UTC (permalink / raw)
  To: anssi.hannula; +Cc: nicolas.ferre, netdev, harini.katakam, michal.simek

From: Anssi Hannula <anssi.hannula@bitwise.fi>
Date: Fri, 30 Nov 2018 20:21:35 +0200

> @@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct macb_dma_desc *desc, dma_addr_
>  	if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
>  		desc_64 = macb_64b_desc(bp, desc);
>  		desc_64->addrh = upper_32_bits(addr);
> +		/* The low bits of RX address contain the RX_USED bit, clearing
> +		 * of which allows packet RX. Make sure the high bits are also
> +		 * visible to HW at that point.
> +		 */
> +		dma_wmb();
>  	}

I agree with that dma_wmb() is what should be used here.

We are ordering CPU stores with DMA visibility, which is exactly what
the dma_*() are for.

If it doesn't work properly on some architecture's implementation of dma_*(),
those should be fixed rather than papering over it in the drivers.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] net: macb: DMA race condition fixes
  2018-11-30 18:21 [PATCH 0/3] net: macb: DMA race condition fixes Anssi Hannula
                   ` (3 preceding siblings ...)
  2018-12-03  8:26 ` [PATCH 0/3] net: macb: DMA race condition fixes Nicolas.Ferre
@ 2018-12-05 20:35 ` David Miller
  2018-12-07 12:04   ` Anssi Hannula
  4 siblings, 1 reply; 31+ messages in thread
From: David Miller @ 2018-12-05 20:35 UTC (permalink / raw)
  To: anssi.hannula; +Cc: nicolas.ferre, netdev

From: Anssi Hannula <anssi.hannula@bitwise.fi>
Date: Fri, 30 Nov 2018 20:21:34 +0200

> Here are a couple of race condition fixes for the macb driver. The first
> two are issues observed on real HW.

It looks like there is still an active discussion about the memory
barriers in patch #3 being excessive.

Once that is sorted out to everyone's satisfaction, would you
please repost this series with appropriate ACKs, reviewed-by's,
tested-by's, etc. added?

Thank you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-12-05 14:00     ` Anssi Hannula
@ 2018-12-06 14:14       ` Claudiu.Beznea
  2018-12-07 12:00         ` Anssi Hannula
  0 siblings, 1 reply; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-06 14:14 UTC (permalink / raw)
  To: anssi.hannula; +Cc: Nicolas.Ferre, davem, netdev

Hi Anssi,

On 05.12.2018 16:00, Anssi Hannula wrote:
> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>
>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>> When reading buffer descriptors on RX or on TX completion, an
>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>> been populated. However, there are no memory barriers to ensure that the
>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>> that bit.
>>>
>>> Fix that by adding DMA read memory barriers on those paths.
>>>
>>> I did not observe any actual issues caused by these being missing,
>>> though.
>>>
>>> Tested on a ZynqMP based system.
>>>
>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>> ---
>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>> index 430b7a0f5436..c93baa8621d5 100644
>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>  
>>>  			/* First, update TX stats if needed */
>>>  			if (skb) {
>>> +				/* Ensure all of desc is at least as up-to-date
>>> +				 * as ctrl (TX_USED bit)
>>> +				 */
>>> +				dma_rmb();
>>> +
>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>> data specific to this descriptor was completed. The TX descriptors for next
>> data to be send is updated under a locked spinlock.
> 
> The previous rmb() is before the TX_USED check, so my understanding is
> that the following could happen in theory:

We are using this IP on and ARM architecture, so, with regards to rmb(), I
understand from [1] that dsb completes when:
"All explicit memory accesses before this instruction complete.
All Cache, Branch predictor and TLB maintenance operations before this
instruction complete."

> 
> 1. rmb().
According to [1] this should end after all previous instructions (loads,
stores) ends.

> 2. Reads are reordered so that TX timestamp is read first - no barriers
> are crossed.

But, as per [1], no onward instruction will be reached until all
instruction prior to dsb ends, so, after rmb() all descriptor's members
should be updated, right?

> 3. HW writes timestamp and sets TX_USED (or they become visible).

I expect hardware to set TX_USED and timestamp before raising TX complete
interrupt. If so, there should be no on-flight updates of this descriptor,
right? Hardware raised a TX_USED bit read interrupt when it reads a
descriptor like this and hangs TX.

> 4. Code checks TX_USED.
> 5. Code operates on timestamp that is actually garbage.
> 
> I'm not 100% sure that there isn't some lighter/cleaner way to do this
> than dma_rmb(), though.

If you still think this scenario could happen why not calling a dsb in
gem_ptp_do_timestamp(). I feel like that is a proper place to call it.

Moreover, there is bit 32 of desc->ctrl which tells you if a valid
timestamp was placed in the descriptor. But, again, I expect the timestamp
and TX_USED to be set by hardware before raising TX complete interrupt.

[1]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html

> 
>>>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>>>  					/* skb now belongs to timestamp buffer
>>>  					 * and will be removed later
>>> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>>>  		rmb();
>>>  
>>>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>>> -		addr = macb_get_addr(bp, desc);
>>> -		ctrl = desc->ctrl;
>>>  
>>>  		if (!rxused)
>>>  			break;
>>>  
>>> +		/* Ensure other data is at least as up-to-date as rxused */
>>> +		dma_rmb();
>> Same here, wouldn't previous rmb() should do this job?
> 
> The scenario I'm concerned about here (and in the last hunk) is:
> 
> 1. rmb().
> 2. ctrl is read (i.e. ctrl read reordered before addr read).

Same here with regards to [1]. All prior loads, stores should be finished
when dsb ends.

> 3. HW updates to ctrl and addr become visible.
> 4. RX_USED check.
> 5. code operates on garbage ctrl.

If this is happen then the data will be read on next interrupt.

dma_rmb() is a dmb. According to [1]:
"Data Memory Barrier acts as a memory barrier. It ensures that all explicit
memory accesses that appear in program order before the DMB instruction are
observed before any explicit memory accesses that appear in program order
after the DMB instruction. It does not affect the ordering of any other
instructions executing on the processor."

and your code is:

		/* Ensure other data is at least as up-to-date as rxused */
		dma_rmb();

		addr = macb_get_addr(bp, desc);
		ctrl = desc->ctrl;

I understand from this that you want to wait for instructions prior to
dma_rmb() to be finished?

> 
> I think it may be OK to move the earlier rmb() outside the loop so that
> there is an rmb() only before and after the RX loop, as I don't at least
> immediately see any hard requirement to do it on each loop pass (unlike
> the added dma_rmb()). But my intent was to fix issues instead of
> optimization so I didn't look too closely into that.

But you said you did not see any issues with the code as it was previously.

Thank you,
Claudiu Beznea

> 
>>> +
>>> +		addr = macb_get_addr(bp, desc);
>>> +		ctrl = desc->ctrl;
>>> +
>>>  		queue->rx_tail++;
>>>  		count++;
>>>  
>>> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>>>  		/* Make hw descriptor updates visible to CPU */
>>>  		rmb();
>>>  
>>> -		ctrl = desc->ctrl;
>>> -
>>>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>>>  			break;
>>>  
>>> +		/* Ensure other data is at least as up-to-date as addr */
>>> +		dma_rmb();
>> Ditto
>>
>>> +
>>> +		ctrl = desc->ctrl;
>>> +
>>>  		if (ctrl & MACB_BIT(RX_SOF)) {
>>>  			if (first_frag != -1)
>>>  				discard_partial_frame(queue, first_frag, tail);
>>>
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA
  2018-12-05 20:32   ` David Miller
@ 2018-12-06 14:16     ` Claudiu.Beznea
  0 siblings, 0 replies; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-06 14:16 UTC (permalink / raw)
  To: davem, anssi.hannula; +Cc: Nicolas.Ferre, netdev, harini.katakam, michal.simek



On 05.12.2018 22:32, David Miller wrote:
> From: Anssi Hannula <anssi.hannula@bitwise.fi>
> Date: Fri, 30 Nov 2018 20:21:35 +0200
> 
>> @@ -682,6 +682,11 @@ static void macb_set_addr(struct macb *bp, struct macb_dma_desc *desc, dma_addr_
>>  	if (bp->hw_dma_cap & HW_DMA_CAP_64B) {
>>  		desc_64 = macb_64b_desc(bp, desc);
>>  		desc_64->addrh = upper_32_bits(addr);
>> +		/* The low bits of RX address contain the RX_USED bit, clearing
>> +		 * of which allows packet RX. Make sure the high bits are also
>> +		 * visible to HW at that point.
>> +		 */
>> +		dma_wmb();
>>  	}
> 
> I agree with that dma_wmb() is what should be used here.
> 
> We are ordering CPU stores with DMA visibility, which is exactly what
> the dma_*() are for.
> 
> If it doesn't work properly on some architecture's implementation of dma_*(),
> those should be fixed rather than papering over it in the drivers.
> 

Ok, agree. This driver uses rmb()/wmb() all over the place with regards to
DMA descriptors updates, so I tough it might be a reason for that.

Thank you,
Claudiu Beznea

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-12-06 14:14       ` Claudiu.Beznea
@ 2018-12-07 12:00         ` Anssi Hannula
  2018-12-10 10:34           ` Claudiu.Beznea
  0 siblings, 1 reply; 31+ messages in thread
From: Anssi Hannula @ 2018-12-07 12:00 UTC (permalink / raw)
  To: Claudiu.Beznea; +Cc: Nicolas.Ferre, davem, netdev

On 6.12.2018 16:14, Claudiu.Beznea@microchip.com wrote:
> Hi Anssi,

Hi!

> On 05.12.2018 16:00, Anssi Hannula wrote:
>> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>> When reading buffer descriptors on RX or on TX completion, an
>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>> been populated. However, there are no memory barriers to ensure that the
>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>> that bit.
>>>>
>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>
>>>> I did not observe any actual issues caused by these being missing,
>>>> though.
>>>>
>>>> Tested on a ZynqMP based system.
>>>>
>>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>>> ---
>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>  
>>>>  			/* First, update TX stats if needed */
>>>>  			if (skb) {
>>>> +				/* Ensure all of desc is at least as up-to-date
>>>> +				 * as ctrl (TX_USED bit)
>>>> +				 */
>>>> +				dma_rmb();
>>>> +
>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>> data specific to this descriptor was completed. The TX descriptors for next
>>> data to be send is updated under a locked spinlock.
>> The previous rmb() is before the TX_USED check, so my understanding is
>> that the following could happen in theory:
> We are using this IP on and ARM architecture, so, with regards to rmb(), I
> understand from [1] that dsb completes when:
> "All explicit memory accesses before this instruction complete.
> All Cache, Branch predictor and TLB maintenance operations before this
> instruction complete."
>
>> 1. rmb().
> According to [1] this should end after all previous instructions (loads,
> stores) ends.
>
>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>> are crossed.
> But, as per [1], no onward instruction will be reached until all
> instruction prior to dsb ends, so, after rmb() all descriptor's members
> should be updated, right?

The descriptor that triggered the TX interrupt should be visible now, yes.
However, the controller may be writing to any other descriptors at the
same time as the loop is processing through them, as there are multiple
TX buffers.

>> 3. HW writes timestamp and sets TX_USED (or they become visible).
> I expect hardware to set TX_USED and timestamp before raising TX complete
> interrupt. If so, there should be no on-flight updates of this descriptor,
> right? Hardware raised a TX_USED bit read interrupt when it reads a
> descriptor like this and hangs TX.

For the first iteration of the loop, that is correct - there should be
no in-flight writes from controller as it already raised the interrupt.
However, the following iterations of the loop process descriptors that
may or may not have the interrupt raised yet, and therefore may still
have in-flight writes.

>> 4. Code checks TX_USED.
>> 5. Code operates on timestamp that is actually garbage.
>>
>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>> than dma_rmb(), though.
> If you still think this scenario could happen why not calling a dsb in
> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.

OK, I will move it there. Unless we arrive at a conclusion that it is
unnecessary altogether, of course :)

> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
> timestamp was placed in the descriptor. But, again, I expect the timestamp
> and TX_USED to be set by hardware before raising TX complete interrupt.

Yes, but since my concern is that without barriers in between,
desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
set even though ts_1 is not yet an actual timestamp. And per above, all
this may occur before the TX complete interrupt is raised for the
descriptor in question.

I agree that this TX case seems somewhat unlikely to be triggered in
real life at least at present, though, as the ts_1/ts_2 read is so far
after the desc->ctrl read, and behind function calls, so unlikely to be
reordered before desc->ctrl read.
But the similar RX cases below seem more problematic as the racy loads
in question are right after each other in code.

> [1]
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>
>>>>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>>>>  					/* skb now belongs to timestamp buffer
>>>>  					 * and will be removed later
>>>> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>>>>  		rmb();
>>>>  
>>>>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>>>> -		addr = macb_get_addr(bp, desc);
>>>> -		ctrl = desc->ctrl;
>>>>  
>>>>  		if (!rxused)
>>>>  			break;
>>>>  
>>>> +		/* Ensure other data is at least as up-to-date as rxused */
>>>> +		dma_rmb();
>>> Same here, wouldn't previous rmb() should do this job?
>> The scenario I'm concerned about here (and in the last hunk) is:
>>
>> 1. rmb().
>> 2. ctrl is read (i.e. ctrl read reordered before addr read).
> Same here with regards to [1]. All prior loads, stores should be finished
> when dsb ends.

Yes, but the problematic loads are *after* the dsb.

>> 3. HW updates to ctrl and addr become visible.
>> 4. RX_USED check.
>> 5. code operates on garbage ctrl.
> If this is happen then the data will be read on next interrupt.

As long as the RX_USED bit is set, the code in gem_rx() will either drop
the frame or process it, depending on if it seems valid or not. It will
not be processed again on next interrupt.

I'll try to rephrase what I meant:

1. rmb(). This does nothing for loads that occur after it.
2. ctrl is loaded. It does not contain valid data yet.
3. HW receives a new frame and writes ctrl, addr, and they become
visible to processor.
4. addr is loaded. It contains the RX_USED=1 as written in step 3.
5. Code does the RX_USED check, it sees 1 and continues processing the
frame.
6. Code operates on ctrl, but as it was loaded before step 3, it
contains old data.

The old ctrl may e.g. cause the frame to be dropped due to the
RX_SOF/RX_EOF check.


>
> dma_rmb() is a dmb. According to [1]:
> "Data Memory Barrier acts as a memory barrier. It ensures that all explicit
> memory accesses that appear in program order before the DMB instruction are
> observed before any explicit memory accesses that appear in program order
> after the DMB instruction. It does not affect the ordering of any other
> instructions executing on the processor."
>
> and your code is:
>
> 		/* Ensure other data is at least as up-to-date as rxused */
> 		dma_rmb();
>
> 		addr = macb_get_addr(bp, desc);
> 		ctrl = desc->ctrl;
>
> I understand from this that you want to wait for instructions prior to
> dma_rmb() to be finished?

I want the load of "desc->addr" on the "rxused = (desc->addr &
MACB_BIT(RX_USED)) ? true : false;" line to be observed before the later
loads, such as "desc->ctrl".


>> I think it may be OK to move the earlier rmb() outside the loop so that
>> there is an rmb() only before and after the RX loop, as I don't at least
>> immediately see any hard requirement to do it on each loop pass (unlike
>> the added dma_rmb()). But my intent was to fix issues instead of
>> optimization so I didn't look too closely into that.
> But you said you did not see any issues with the code as it was previously.

I meant that I have not observed any issues in practice (though I
haven't really tried to), but I do see theoretical issues which this
patch addresses.

Especially the issues addressed by the two RX hunks seem entirely
possible - the gem_rx() one requires the compiler to just reorder the
two adjacent loads, and the macb_rx() one does not even require any
reordering - desc->ctrl is read before desc->addr in the code.

> Thank you,
> Claudiu Beznea
>
>>>> +
>>>> +		addr = macb_get_addr(bp, desc);
>>>> +		ctrl = desc->ctrl;
>>>> +
>>>>  		queue->rx_tail++;
>>>>  		count++;
>>>>  
>>>> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>>>>  		/* Make hw descriptor updates visible to CPU */
>>>>  		rmb();
>>>>  
>>>> -		ctrl = desc->ctrl;
>>>> -
>>>>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>>>>  			break;
>>>>  
>>>> +		/* Ensure other data is at least as up-to-date as addr */
>>>> +		dma_rmb();
>>> Ditto
>>>
>>>> +
>>>> +		ctrl = desc->ctrl;
>>>> +
>>>>  		if (ctrl & MACB_BIT(RX_SOF)) {
>>>>  			if (first_frag != -1)
>>>>  				discard_partial_frame(queue, first_frag, tail);
>>>>

-- 
Anssi Hannula / Bitwise Oy
+358 503803997

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] net: macb: DMA race condition fixes
  2018-12-05 20:35 ` David Miller
@ 2018-12-07 12:04   ` Anssi Hannula
  2018-12-10 10:58     ` Nicolas.Ferre
  0 siblings, 1 reply; 31+ messages in thread
From: Anssi Hannula @ 2018-12-07 12:04 UTC (permalink / raw)
  To: David Miller, nicolas.ferre; +Cc: netdev, Claudiu Beznea

On 5.12.2018 22:35, David Miller wrote:
> From: Anssi Hannula <anssi.hannula@bitwise.fi>
> Date: Fri, 30 Nov 2018 20:21:34 +0200
>
>> Here are a couple of race condition fixes for the macb driver. The first
>> two are issues observed on real HW.
> It looks like there is still an active discussion about the memory
> barriers in patch #3 being excessive.
>
> Once that is sorted out to everyone's satisfaction, would you
> please repost this series with appropriate ACKs, reviewed-by's,
> tested-by's, etc. added?
>
> Thank you.


OK, I'll do that once everything is sorted.

Nicolas, were Claudiu's tests the ones you wanted to wait for or are we
waiting for more tests?

-- 
Anssi Hannula / Bitwise Oy
+358 503803997

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-12-07 12:00         ` Anssi Hannula
@ 2018-12-10 10:34           ` Claudiu.Beznea
  2018-12-11 13:21             ` Anssi Hannula
  0 siblings, 1 reply; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-10 10:34 UTC (permalink / raw)
  To: anssi.hannula; +Cc: Nicolas.Ferre, davem, netdev



On 07.12.2018 14:00, Anssi Hannula wrote:
> On 6.12.2018 16:14, Claudiu.Beznea@microchip.com wrote:
>> Hi Anssi,
> 
> Hi!
> 
>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>> that bit.
>>>>>
>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>
>>>>> I did not observe any actual issues caused by these being missing,
>>>>> though.
>>>>>
>>>>> Tested on a ZynqMP based system.
>>>>>
>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>>>> ---
>>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>  
>>>>>  			/* First, update TX stats if needed */
>>>>>  			if (skb) {
>>>>> +				/* Ensure all of desc is at least as up-to-date
>>>>> +				 * as ctrl (TX_USED bit)
>>>>> +				 */
>>>>> +				dma_rmb();
>>>>> +
>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>> data to be send is updated under a locked spinlock.
>>> The previous rmb() is before the TX_USED check, so my understanding is
>>> that the following could happen in theory:
>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>> understand from [1] that dsb completes when:
>> "All explicit memory accesses before this instruction complete.
>> All Cache, Branch predictor and TLB maintenance operations before this
>> instruction complete."
>>
>>> 1. rmb().
>> According to [1] this should end after all previous instructions (loads,
>> stores) ends.
>>
>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>> are crossed.
>> But, as per [1], no onward instruction will be reached until all
>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>> should be updated, right?
> 
> The descriptor that triggered the TX interrupt should be visible now, yes.
> However, the controller may be writing to any other descriptors at the
> same time as the loop is processing through them, as there are multiple
> TX buffers.

Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
the beginning of loop intended for that... In the 2nd loop you are
operating with the same descriptor which was read in the first loop.

> 
>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>> I expect hardware to set TX_USED and timestamp before raising TX complete
>> interrupt. If so, there should be no on-flight updates of this descriptor,
>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>> descriptor like this and hangs TX.
> 
> For the first iteration of the loop, that is correct - there should be
> no in-flight writes from controller as it already raised the interrupt.
> However, the following iterations of the loop process descriptors that
> may or may not have the interrupt raised yet, and therefore may still
> have in-flight writes.

I expect the hardware to give up ownership of the descriptor just at the
end of TX operation, so that the word containing TX_USED to be the last one
updated. If you reads the descriptor and TX_USED is not there the first
loop breaks, queue->tx_tail is updated with the last processed descriptor
in the queue (see "queue->tx_tail = tail;") then the next interrupt will
start processing with this last one descriptor.

> 
>>> 4. Code checks TX_USED.
>>> 5. Code operates on timestamp that is actually garbage.
>>>
>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>> than dma_rmb(), though.
>> If you still think this scenario could happen why not calling a dsb in
>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
> 
> OK, I will move it there. Unless we arrive at a conclusion that it is
> unnecessary altogether, of course :)
> 
>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>> and TX_USED to be set by hardware before raising TX complete interrupt.
> 
> Yes, but since my concern is that without barriers in between,
> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
> set even though ts_1 is not yet an actual timestamp. And per above, all
> this may occur before the TX complete interrupt is raised for the
> descriptor in question.

If so, why not placing the barrier when reading timestamps? From my point
of view the place of "dma_rmb()" in this patch doesn't guarantees that
ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).

> 
> I agree that this TX case seems somewhat unlikely to be triggered in
> real life at least at present, though, as the ts_1/ts_2 read is so far
> after the desc->ctrl read, and behind function calls, so unlikely to be
> reordered before desc->ctrl read.
> But the similar RX cases below seem more problematic as the racy loads
> in question are right after each other in code.
> 
>> [1]
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>
>>>>>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>>>>>  					/* skb now belongs to timestamp buffer
>>>>>  					 * and will be removed later
>>>>> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>>>>>  		rmb();
>>>>>  
>>>>>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>>>>> -		addr = macb_get_addr(bp, desc);
>>>>> -		ctrl = desc->ctrl;
>>>>>  
>>>>>  		if (!rxused)
>>>>>  			break;
>>>>>  
>>>>> +		/* Ensure other data is at least as up-to-date as rxused */
>>>>> +		dma_rmb();
>>>> Same here, wouldn't previous rmb() should do this job?
>>> The scenario I'm concerned about here (and in the last hunk) is:
>>>
>>> 1. rmb().
>>> 2. ctrl is read (i.e. ctrl read reordered before addr read).
>> Same here with regards to [1]. All prior loads, stores should be finished
>> when dsb ends.
> 
> Yes, but the problematic loads are *after* the dsb.
> 
>>> 3. HW updates to ctrl and addr become visible.
>>> 4. RX_USED check.
>>> 5. code operates on garbage ctrl.
>> If this is happen then the data will be read on next interrupt.
> 
> As long as the RX_USED bit is set, the code in gem_rx() will either drop
> the frame or process it, depending on if it seems valid or not.

Yes, I wanted to say that if the RX_USED is not set when first reading ctrl
then the frame will not be processed at that moment but on the next interrupt.

It will
> not be processed again on next interrupt.
> 
> I'll try to rephrase what I meant:
> 
> 1. rmb(). This does nothing for loads that occur after it.
> 2. ctrl is loaded. It does not contain valid data yet.
> 3. HW receives a new frame and writes ctrl, addr, and they become
> visible to processor.
> 4. addr is loaded. It contains the RX_USED=1 as written in step 3.
> 5. Code does the RX_USED check, it sees 1 and continues processing the
> frame.
> 6. Code operates on ctrl, but as it was loaded before step 3, it
> contains old data.
> 
> The old ctrl may e.g. cause the frame to be dropped due to the
> RX_SOF/RX_EOF check.

Yes, agree, but placing dma_rmb() before "ctrl = desc->ctrl" will not solve
this case, right? What you want is to have ctrl loaded after you execute
next instruction which refers to it. According to [1] you will have to
place the dma_rmb() after the "ctrl = desc->ctrl" instruction to ensure
next time you refer to it, it contains the proper value.

> 
> 
>>
>> dma_rmb() is a dmb. According to [1]:
>> "Data Memory Barrier acts as a memory barrier. It ensures that all explicit
>> memory accesses that appear in program order before the DMB instruction are
>> observed before any explicit memory accesses that appear in program order
>> after the DMB instruction. It does not affect the ordering of any other
>> instructions executing on the processor."
>>
>> and your code is:
>>
>> 		/* Ensure other data is at least as up-to-date as rxused */
>> 		dma_rmb();
>>
>> 		addr = macb_get_addr(bp, desc);
>> 		ctrl = desc->ctrl;
>>
>> I understand from this that you want to wait for instructions prior to
>> dma_rmb() to be finished?
> 
> I want the load of "desc->addr" on the "rxused = (desc->addr &
> MACB_BIT(RX_USED)) ? true : false;" line to be observed before the later
> loads, such as "desc->ctrl".
> 
> 
>>> I think it may be OK to move the earlier rmb() outside the loop so that
>>> there is an rmb() only before and after the RX loop, as I don't at least
>>> immediately see any hard requirement to do it on each loop pass (unlike
>>> the added dma_rmb()). But my intent was to fix issues instead of
>>> optimization so I didn't look too closely into that.
>> But you said you did not see any issues with the code as it was previously.
> 
> I meant that I have not observed any issues in practice (though I
> haven't really tried to), but I do see theoretical issues which this
> patch addresses.
> 
> Especially the issues addressed by the two RX hunks seem entirely
> possible - the gem_rx() one requires the compiler to just reorder the
> two adjacent loads, and the macb_rx() one does not even require any
> reordering - desc->ctrl is read before desc->addr in the code.

In gem_rx() I agree that a barrier might be necessary, but placing it after
desc->addr will not guarantee you that the load of addr and ctrl will be
consistent. According to [1] the barrier should be after reading of addr
and ctrl.

Thank you,
Claudiu Beznea

> 
>> Thank you,
>> Claudiu Beznea
>>
>>>>> +
>>>>> +		addr = macb_get_addr(bp, desc);
>>>>> +		ctrl = desc->ctrl;
>>>>> +
>>>>>  		queue->rx_tail++;
>>>>>  		count++;
>>>>>  
>>>>> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>>>>>  		/* Make hw descriptor updates visible to CPU */
>>>>>  		rmb();
>>>>>  
>>>>> -		ctrl = desc->ctrl;
>>>>> -
>>>>>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>>>>>  			break;
>>>>>  
>>>>> +		/* Ensure other data is at least as up-to-date as addr */
>>>>> +		dma_rmb();
>>>> Ditto
>>>>
>>>>> +
>>>>> +		ctrl = desc->ctrl;
>>>>> +
>>>>>  		if (ctrl & MACB_BIT(RX_SOF)) {
>>>>>  			if (first_frag != -1)
>>>>>  				discard_partial_frame(queue, first_frag, tail);
>>>>>
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] net: macb: DMA race condition fixes
  2018-12-07 12:04   ` Anssi Hannula
@ 2018-12-10 10:58     ` Nicolas.Ferre
  2018-12-10 11:32       ` Claudiu.Beznea
  0 siblings, 1 reply; 31+ messages in thread
From: Nicolas.Ferre @ 2018-12-10 10:58 UTC (permalink / raw)
  To: anssi.hannula, davem; +Cc: netdev, Claudiu.Beznea

On 07/12/2018 at 13:04, Anssi Hannula wrote:
> On 5.12.2018 22:35, David Miller wrote:
>> From: Anssi Hannula <anssi.hannula@bitwise.fi>
>> Date: Fri, 30 Nov 2018 20:21:34 +0200
>>
>>> Here are a couple of race condition fixes for the macb driver. The first
>>> two are issues observed on real HW.
>> It looks like there is still an active discussion about the memory
>> barriers in patch #3 being excessive.
>>
>> Once that is sorted out to everyone's satisfaction, would you
>> please repost this series with appropriate ACKs, reviewed-by's,
>> tested-by's, etc. added?
>>
>> Thank you.
> 
> 
> OK, I'll do that once everything is sorted.
> 
> Nicolas, were Claudiu's tests the ones you wanted to wait for or are we
> waiting for more tests?

Claudiu will report on our tests. The discussion is still progress from 
what I can tell. Thanks a lot for continuing in participating to this 
thread!

Best regards,
-- 
Nicolas Ferre

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] net: macb: DMA race condition fixes
  2018-12-10 10:58     ` Nicolas.Ferre
@ 2018-12-10 11:32       ` Claudiu.Beznea
  2018-12-10 11:34         ` Claudiu.Beznea
  0 siblings, 1 reply; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-10 11:32 UTC (permalink / raw)
  To: Nicolas.Ferre, anssi.hannula, davem; +Cc: netdev



On 10.12.2018 12:58, Nicolas Ferre - M43238 wrote:
> On 07/12/2018 at 13:04, Anssi Hannula wrote:
>> On 5.12.2018 22:35, David Miller wrote:
>>> From: Anssi Hannula <anssi.hannula@bitwise.fi>
>>> Date: Fri, 30 Nov 2018 20:21:34 +0200
>>>
>>>> Here are a couple of race condition fixes for the macb driver. The first
>>>> two are issues observed on real HW.
>>> It looks like there is still an active discussion about the memory
>>> barriers in patch #3 being excessive.
>>>
>>> Once that is sorted out to everyone's satisfaction, would you
>>> please repost this series with appropriate ACKs, reviewed-by's,
>>> tested-by's, etc. added?
>>>
>>> Thank you.
>>
>>
>> OK, I'll do that once everything is sorted.
>>
>> Nicolas, were Claudiu's tests the ones you wanted to wait for or are we
>> waiting for more tests?
> 
> Claudiu will report on our tests. The discussion is still progress from 
> what I can tell. Thanks a lot for continuing in participating to this 
> thread!

As I said, I tested this series on heavy traffic load and in the setup we
observed the issue with TX used bit read interrupt (see patch "net: macb:
restart tx after tx used bit read" I submitted these days) on our SAMA5D2
Xplained and SAMA5D4 Xplained boards. With patches 1 & 2 from this series I
can say the "not whole frame pointed by descriptor" issue is not present
anymore.

The idea with tests was to see if this series has any impact on "tx used
bit read" issue we faced with. The result is no, the "net: macb: restart tx
after tx used bit read" patch is needed. I asked for postponing of this to
run more tests on my side (independent of this series with memory barriers).

Thank you,
Claudiu Beznea

> 
> Best regards,
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3] net: macb: DMA race condition fixes
  2018-12-10 11:32       ` Claudiu.Beznea
@ 2018-12-10 11:34         ` Claudiu.Beznea
  0 siblings, 0 replies; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-10 11:34 UTC (permalink / raw)
  To: Nicolas.Ferre, anssi.hannula, davem; +Cc: netdev



On 10.12.2018 13:32, Claudiu.Beznea@microchip.com wrote:
> 
> 
> On 10.12.2018 12:58, Nicolas Ferre - M43238 wrote:
>> On 07/12/2018 at 13:04, Anssi Hannula wrote:
>>> On 5.12.2018 22:35, David Miller wrote:
>>>> From: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>> Date: Fri, 30 Nov 2018 20:21:34 +0200
>>>>
>>>>> Here are a couple of race condition fixes for the macb driver. The first
>>>>> two are issues observed on real HW.
>>>> It looks like there is still an active discussion about the memory
>>>> barriers in patch #3 being excessive.
>>>>
>>>> Once that is sorted out to everyone's satisfaction, would you
>>>> please repost this series with appropriate ACKs, reviewed-by's,
>>>> tested-by's, etc. added?
>>>>
>>>> Thank you.
>>>
>>>
>>> OK, I'll do that once everything is sorted.
>>>
>>> Nicolas, were Claudiu's tests the ones you wanted to wait for or are we
>>> waiting for more tests?
>>
>> Claudiu will report on our tests. The discussion is still progress from 
>> what I can tell. Thanks a lot for continuing in participating to this 
>> thread!
> 
> As I said, I tested this series on heavy traffic load and in the setup we
> observed the issue with TX used bit read interrupt (see patch "net: macb:
> restart tx after tx used bit read" I submitted these days) on our SAMA5D2
> Xplained and SAMA5D4 Xplained boards. With patches 1 & 2 from this series I
> can say the "not whole frame pointed by descriptor" issue is not present
> anymore.
> 
> The idea with tests was to see if this series has any impact on "tx used
> bit read" issue we faced with. The result is no, the "net: macb: restart tx
> after tx used bit read" patch is needed. I asked for postponing of this to
> run more tests on my side (independent of this series with memory barriers).

So, for the first 2 patches in this series you can add:

Tested-by: Claudiu Beznea <claudiu.beznea@microchip.com>

> 
> Thank you,
> Claudiu Beznea
> 
>>
>> Best regards,
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-12-10 10:34           ` Claudiu.Beznea
@ 2018-12-11 13:21             ` Anssi Hannula
  2018-12-12 10:58               ` Claudiu.Beznea
  2018-12-12 10:59               ` [PATCH 3/3 v2] net: macb: add missing barriers when reading descriptors Anssi Hannula
  0 siblings, 2 replies; 31+ messages in thread
From: Anssi Hannula @ 2018-12-11 13:21 UTC (permalink / raw)
  To: Claudiu.Beznea; +Cc: Nicolas.Ferre, davem, netdev

On 10.12.2018 12:34, Claudiu.Beznea@microchip.com wrote:
> On 07.12.2018 14:00, Anssi Hannula wrote:
>> On 6.12.2018 16:14, Claudiu.Beznea@microchip.com wrote:
>>> Hi Anssi,
>> Hi!
>>
>>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>>> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>>> that bit.
>>>>>>
>>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>>
>>>>>> I did not observe any actual issues caused by these being missing,
>>>>>> though.
>>>>>>
>>>>>> Tested on a ZynqMP based system.
>>>>>>
>>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>>>>> ---
>>>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>>  
>>>>>>  			/* First, update TX stats if needed */
>>>>>>  			if (skb) {
>>>>>> +				/* Ensure all of desc is at least as up-to-date
>>>>>> +				 * as ctrl (TX_USED bit)
>>>>>> +				 */
>>>>>> +				dma_rmb();
>>>>>> +
>>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>>> data to be send is updated under a locked spinlock.
>>>> The previous rmb() is before the TX_USED check, so my understanding is
>>>> that the following could happen in theory:
>>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>>> understand from [1] that dsb completes when:
>>> "All explicit memory accesses before this instruction complete.
>>> All Cache, Branch predictor and TLB maintenance operations before this
>>> instruction complete."
>>>
>>>> 1. rmb().
>>> According to [1] this should end after all previous instructions (loads,
>>> stores) ends.
>>>
>>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>>> are crossed.
>>> But, as per [1], no onward instruction will be reached until all
>>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>>> should be updated, right?
>> The descriptor that triggered the TX interrupt should be visible now, yes.
>> However, the controller may be writing to any other descriptors at the
>> same time as the loop is processing through them, as there are multiple
>> TX buffers.
> Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
> the beginning of loop intended for that... In the 2nd loop you are
> operating with the same descriptor which was read in the first loop.

I'm concerned about the 2nd iteration of the first for loop in
macb_tx_interrupt().
That operates on a different descriptor due to tail++, and that
different descriptor may have in-flight writes from controller as the
interrupt may or may not already be raised for it.

Or am I missing something?

>>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>>> I expect hardware to set TX_USED and timestamp before raising TX complete
>>> interrupt. If so, there should be no on-flight updates of this descriptor,
>>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>>> descriptor like this and hangs TX.
>> For the first iteration of the loop, that is correct - there should be
>> no in-flight writes from controller as it already raised the interrupt.
>> However, the following iterations of the loop process descriptors that
>> may or may not have the interrupt raised yet, and therefore may still
>> have in-flight writes.
> I expect the hardware to give up ownership of the descriptor just at the
> end of TX operation, so that the word containing TX_USED to be the last one
> updated. If you reads the descriptor and TX_USED is not there the first
> loop breaks, queue->tx_tail is updated with the last processed descriptor
> in the queue (see "queue->tx_tail = tail;") then the next interrupt will
> start processing with this last one descriptor.

Agreed, that is how it works. Issues only occur if we somehow operate on
data before ownership was transferred.

>>>> 4. Code checks TX_USED.
>>>> 5. Code operates on timestamp that is actually garbage.
>>>>
>>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>>> than dma_rmb(), though.
>>> If you still think this scenario could happen why not calling a dsb in
>>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
>> OK, I will move it there. Unless we arrive at a conclusion that it is
>> unnecessary altogether, of course :)
>>
>>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>>> and TX_USED to be set by hardware before raising TX complete interrupt.
>> Yes, but since my concern is that without barriers in between,
>> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
>> set even though ts_1 is not yet an actual timestamp. And per above, all
>> this may occur before the TX complete interrupt is raised for the
>> descriptor in question.
> If so, why not placing the barrier when reading timestamps? From my point
> of view the place of "dma_rmb()" in this patch doesn't guarantees that
> ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).

If the timestamp ts1/ts2 reads are done after dma_rmb(), and dev->ctrl
is read before the barrier, my understanding is that they are guaranteed
to be ordered so that dev->ctrl is read before ts1/ts2.

Per [1], "[DMB] ensures that all explicit memory accesses that appear in
program order before the |DMB| instruction are observed before any
explicit memory accesses that appear in program order after the |DMB|
instruction".
And the compiler barrier ("memory" asm clobber) included within
dma_rmb() ensures that in program order, ctrl is loaded before the
barrier, and ts1/ts2 after it.

But as mentioned, it makes sense to have it closer to the ts1/ts2 reads
so I'll make that change.

[1]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html


>> I agree that this TX case seems somewhat unlikely to be triggered in
>> real life at least at present, though, as the ts_1/ts_2 read is so far
>> after the desc->ctrl read, and behind function calls, so unlikely to be
>> reordered before desc->ctrl read.
>> But the similar RX cases below seem more problematic as the racy loads
>> in question are right after each other in code.
>>
>>> [1]
>>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>>
>>>>>>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>>>>>>  					/* skb now belongs to timestamp buffer
>>>>>>  					 * and will be removed later
>>>>>> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>>>>>>  		rmb();
>>>>>>  
>>>>>>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>>>>>> -		addr = macb_get_addr(bp, desc);
>>>>>> -		ctrl = desc->ctrl;
>>>>>>  
>>>>>>  		if (!rxused)
>>>>>>  			break;
>>>>>>  
>>>>>> +		/* Ensure other data is at least as up-to-date as rxused */
>>>>>> +		dma_rmb();
>>>>> Same here, wouldn't previous rmb() should do this job?
>>>> The scenario I'm concerned about here (and in the last hunk) is:
>>>>
>>>> 1. rmb().
>>>> 2. ctrl is read (i.e. ctrl read reordered before addr read).
>>> Same here with regards to [1]. All prior loads, stores should be finished
>>> when dsb ends.
>> Yes, but the problematic loads are *after* the dsb.
>>
>>>> 3. HW updates to ctrl and addr become visible.
>>>> 4. RX_USED check.
>>>> 5. code operates on garbage ctrl.
>>> If this is happen then the data will be read on next interrupt.
>> As long as the RX_USED bit is set, the code in gem_rx() will either drop
>> the frame or process it, depending on if it seems valid or not.
> Yes, I wanted to say that if the RX_USED is not set when first reading ctrl
> then the frame will not be processed at that moment but on the next interrupt.

Agreed, and the RX_USED=0 case looks OK to me too. The issue is only
with RX_USED=1.

> It will
>> not be processed again on next interrupt.
>>
>> I'll try to rephrase what I meant:
>>
>> 1. rmb(). This does nothing for loads that occur after it.
>> 2. ctrl is loaded. It does not contain valid data yet.
>> 3. HW receives a new frame and writes ctrl, addr, and they become
>> visible to processor.
>> 4. addr is loaded. It contains the RX_USED=1 as written in step 3.
>> 5. Code does the RX_USED check, it sees 1 and continues processing the
>> frame.
>> 6. Code operates on ctrl, but as it was loaded before step 3, it
>> contains old data.
>>
>> The old ctrl may e.g. cause the frame to be dropped due to the
>> RX_SOF/RX_EOF check.
> Yes, agree, but placing dma_rmb() before "ctrl = desc->ctrl" will not solve
> this case, right? What you want is to have ctrl loaded after you execute
> next instruction which refers to it. According to [1] you will have to
> place the dma_rmb() after the "ctrl = desc->ctrl" instruction to ensure
> next time you refer to it, it contains the proper value.

I don't think doing dma_rmb() after ctrl assignment works. I want to
ensure the ordering of loads between ctrl and addr.
ctrl has to be at least as "recent" as addr to avoid a race.

If I did
   rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
   ctrl = desc->ctrl;
   dma_rmb();
both the compiler and the processor would be free to reorder the addr
and ctrl loads, and the race condition described above could still happen.
The loaded ctrl value would likely be kept in a register as it is used
shortly after dma_rmb(). It will not be reloaded as the compiler
considers desc->ctrl to possibly have different contents now (due to the
compiler barrier).

Doing
   rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
   dma_rmb();
   ctrl = desc->ctrl;
the compiler barrier makes sure that in the generated code, addr load is
before DMB and ctrl load is after it, and the DMB makes sure that the
processor observes the addr load before the ctrl load.

Of course, please correct me if I'm wrong or missing something.

>>> dma_rmb() is a dmb. According to [1]:
>>> "Data Memory Barrier acts as a memory barrier. It ensures that all explicit
>>> memory accesses that appear in program order before the DMB instruction are
>>> observed before any explicit memory accesses that appear in program order
>>> after the DMB instruction. It does not affect the ordering of any other
>>> instructions executing on the processor."
>>>
>>> and your code is:
>>>
>>> 		/* Ensure other data is at least as up-to-date as rxused */
>>> 		dma_rmb();
>>>
>>> 		addr = macb_get_addr(bp, desc);
>>> 		ctrl = desc->ctrl;
>>>
>>> I understand from this that you want to wait for instructions prior to
>>> dma_rmb() to be finished?
>> I want the load of "desc->addr" on the "rxused = (desc->addr &
>> MACB_BIT(RX_USED)) ? true : false;" line to be observed before the later
>> loads, such as "desc->ctrl".
>>
>>
>>>> I think it may be OK to move the earlier rmb() outside the loop so that
>>>> there is an rmb() only before and after the RX loop, as I don't at least
>>>> immediately see any hard requirement to do it on each loop pass (unlike
>>>> the added dma_rmb()). But my intent was to fix issues instead of
>>>> optimization so I didn't look too closely into that.
>>> But you said you did not see any issues with the code as it was previously.
>> I meant that I have not observed any issues in practice (though I
>> haven't really tried to), but I do see theoretical issues which this
>> patch addresses.
>>
>> Especially the issues addressed by the two RX hunks seem entirely
>> possible - the gem_rx() one requires the compiler to just reorder the
>> two adjacent loads, and the macb_rx() one does not even require any
>> reordering - desc->ctrl is read before desc->addr in the code.
> In gem_rx() I agree that a barrier might be necessary, but placing it after
> desc->addr will not guarantee you that the load of addr and ctrl will be
> consistent. According to [1] the barrier should be after reading of addr
> and ctrl.
>
> Thank you,
> Claudiu Beznea
>
>>> Thank you,
>>> Claudiu Beznea
>>>
>>>>>> +
>>>>>> +		addr = macb_get_addr(bp, desc);
>>>>>> +		ctrl = desc->ctrl;
>>>>>> +
>>>>>>  		queue->rx_tail++;
>>>>>>  		count++;
>>>>>>  
>>>>>> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>>>>>>  		/* Make hw descriptor updates visible to CPU */
>>>>>>  		rmb();
>>>>>>  
>>>>>> -		ctrl = desc->ctrl;
>>>>>> -
>>>>>>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>>>>>>  			break;
>>>>>>  
>>>>>> +		/* Ensure other data is at least as up-to-date as addr */
>>>>>> +		dma_rmb();
>>>>> Ditto
>>>>>
>>>>>> +
>>>>>> +		ctrl = desc->ctrl;
>>>>>> +
>>>>>>  		if (ctrl & MACB_BIT(RX_SOF)) {
>>>>>>  			if (first_frag != -1)
>>>>>>  				discard_partial_frame(queue, first_frag, tail);
>>>>>>

-- 
Anssi Hannula / Bitwise Oy
+358 503803997

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-12-11 13:21             ` Anssi Hannula
@ 2018-12-12 10:58               ` Claudiu.Beznea
  2018-12-12 11:27                 ` Anssi Hannula
  2018-12-12 10:59               ` [PATCH 3/3 v2] net: macb: add missing barriers when reading descriptors Anssi Hannula
  1 sibling, 1 reply; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-12 10:58 UTC (permalink / raw)
  To: anssi.hannula; +Cc: Nicolas.Ferre, davem, netdev



On 11.12.2018 15:21, Anssi Hannula wrote:
> On 10.12.2018 12:34, Claudiu.Beznea@microchip.com wrote:
>> On 07.12.2018 14:00, Anssi Hannula wrote:
>>> On 6.12.2018 16:14, Claudiu.Beznea@microchip.com wrote:
>>>> Hi Anssi,
>>> Hi!
>>>
>>>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>>>> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>>>> that bit.
>>>>>>>
>>>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>>>
>>>>>>> I did not observe any actual issues caused by these being missing,
>>>>>>> though.
>>>>>>>
>>>>>>> Tested on a ZynqMP based system.
>>>>>>>
>>>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>>>>>> ---
>>>>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>>>  
>>>>>>>  			/* First, update TX stats if needed */
>>>>>>>  			if (skb) {
>>>>>>> +				/* Ensure all of desc is at least as up-to-date
>>>>>>> +				 * as ctrl (TX_USED bit)
>>>>>>> +				 */
>>>>>>> +				dma_rmb();
>>>>>>> +
>>>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>>>> data to be send is updated under a locked spinlock.
>>>>> The previous rmb() is before the TX_USED check, so my understanding is
>>>>> that the following could happen in theory:
>>>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>>>> understand from [1] that dsb completes when:
>>>> "All explicit memory accesses before this instruction complete.
>>>> All Cache, Branch predictor and TLB maintenance operations before this
>>>> instruction complete."
>>>>
>>>>> 1. rmb().
>>>> According to [1] this should end after all previous instructions (loads,
>>>> stores) ends.
>>>>
>>>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>>>> are crossed.
>>>> But, as per [1], no onward instruction will be reached until all
>>>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>>>> should be updated, right?
>>> The descriptor that triggered the TX interrupt should be visible now, yes.
>>> However, the controller may be writing to any other descriptors at the
>>> same time as the loop is processing through them, as there are multiple
>>> TX buffers.
>> Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
>> the beginning of loop intended for that... In the 2nd loop you are
>> operating with the same descriptor which was read in the first loop.
> 
> I'm concerned about the 2nd iteration of the first for loop in
> macb_tx_interrupt().
> That operates on a different descriptor due to tail++, and that
> different descriptor may have in-flight writes from controller as the
> interrupt may or may not already be raised for it.

I agree with that, there may be in-flights updates, but since ownership bit
should be updated at the end of the TX (I expect at this moment to be
updated all the descriptor's parts, including timestamps), if you read a
descriptor and it hasn't been treated by the hardware the descriptor is own
by the hardware and there is this check after ctrl is read:

if (!(ctrl & MACB_BIT(TX_USED)))
	break;

Next time a new tx interrupt will be raised this descriptor would be also
treated.

In case the above instruction doesn't break then it means the descriptor
was transmitted. And since ownership should be updated at the end of the
TX, this means no in-flights updates should be done on this descriptor, so,
when you will read the timestamps these should be the good ones (even it is
the first one, the 3rd, and so on).

Moreover, this new barrier barrier is in the 2nd loop, you're placing a
memory barrier after every step in the 2nd loop for the same descriptor
read in the 1st loop (it was already loaded in a processor register).

To avoid reordering of ctrl and ts1/ts2 loads wouldn't be enough to have
memory barrier after:
ctrl = desc->ctrl;

> 
> Or am I missing something?
> 
>>>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>>>> I expect hardware to set TX_USED and timestamp before raising TX complete
>>>> interrupt. If so, there should be no on-flight updates of this descriptor,
>>>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>>>> descriptor like this and hangs TX.
>>> For the first iteration of the loop, that is correct - there should be
>>> no in-flight writes from controller as it already raised the interrupt.
>>> However, the following iterations of the loop process descriptors that
>>> may or may not have the interrupt raised yet, and therefore may still
>>> have in-flight writes.
>> I expect the hardware to give up ownership of the descriptor just at the
>> end of TX operation, so that the word containing TX_USED to be the last one
>> updated. If you reads the descriptor and TX_USED is not there the first
>> loop breaks, queue->tx_tail is updated with the last processed descriptor
>> in the queue (see "queue->tx_tail = tail;") then the next interrupt will
>> start processing with this last one descriptor.
> 
> Agreed, that is how it works. Issues only occur if we somehow operate on
> data before ownership was transferred.

Could this happens? Shouldn't this:
	if (!(ctrl & MACB_BIT(TX_USED)))
		break;

assure that this could not happen and the fact that the ownership is passed
to CPU at the end of descriptor update?

If load reordering b/w desc->ctrl and desc->ts_1/ts_2 wouldn't be enough to
have a barrier after
ctrl = desc->ctrl instruction?

> 
>>>>> 4. Code checks TX_USED.
>>>>> 5. Code operates on timestamp that is actually garbage.
>>>>>
>>>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>>>> than dma_rmb(), though.
>>>> If you still think this scenario could happen why not calling a dsb in
>>>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
>>> OK, I will move it there. Unless we arrive at a conclusion that it is
>>> unnecessary altogether, of course :)
>>>
>>>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>>>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>>>> and TX_USED to be set by hardware before raising TX complete interrupt.
>>> Yes, but since my concern is that without barriers in between,
>>> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
>>> set even though ts_1 is not yet an actual timestamp. And per above, all
>>> this may occur before the TX complete interrupt is raised for the
>>> descriptor in question.
>> If so, why not placing the barrier when reading timestamps? From my point
>> of view the place of "dma_rmb()" in this patch doesn't guarantees that
>> ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).
> 
> If the timestamp ts1/ts2 reads are done after dma_rmb(), and dev->ctrl
> is read before the barrier, my understanding is that they are guaranteed
> to be ordered so that dev->ctrl is read before ts1/ts2.
> 
> Per [1], "[DMB] ensures that all explicit memory accesses that appear in
> program order before the |DMB| instruction are observed before any
> explicit memory accesses that appear in program order after the |DMB|
> instruction".
> And the compiler barrier ("memory" asm clobber) included within
> dma_rmb() ensures that in program order, ctrl is loaded before the
> barrier, and ts1/ts2 after it.
> 
> But as mentioned, it makes sense to have it closer to the ts1/ts2 reads
> so I'll make that change.
> 
> [1]
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
> 
> 
>>> I agree that this TX case seems somewhat unlikely to be triggered in
>>> real life at least at present, though, as the ts_1/ts_2 read is so far
>>> after the desc->ctrl read, and behind function calls, so unlikely to be
>>> reordered before desc->ctrl read.
>>> But the similar RX cases below seem more problematic as the racy loads
>>> in question are right after each other in code.
>>>
>>>> [1]
>>>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>>>
>>>>>>>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>>>>>>>  					/* skb now belongs to timestamp buffer
>>>>>>>  					 * and will be removed later
>>>>>>> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>>>>>>>  		rmb();
>>>>>>>  
>>>>>>>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>>>>>>> -		addr = macb_get_addr(bp, desc);
>>>>>>> -		ctrl = desc->ctrl;
>>>>>>>  
>>>>>>>  		if (!rxused)
>>>>>>>  			break;
>>>>>>>  
>>>>>>> +		/* Ensure other data is at least as up-to-date as rxused */
>>>>>>> +		dma_rmb();
>>>>>> Same here, wouldn't previous rmb() should do this job?
>>>>> The scenario I'm concerned about here (and in the last hunk) is:
>>>>>
>>>>> 1. rmb().
>>>>> 2. ctrl is read (i.e. ctrl read reordered before addr read).
>>>> Same here with regards to [1]. All prior loads, stores should be finished
>>>> when dsb ends.
>>> Yes, but the problematic loads are *after* the dsb.
>>>
>>>>> 3. HW updates to ctrl and addr become visible.
>>>>> 4. RX_USED check.
>>>>> 5. code operates on garbage ctrl.
>>>> If this is happen then the data will be read on next interrupt.
>>> As long as the RX_USED bit is set, the code in gem_rx() will either drop
>>> the frame or process it, depending on if it seems valid or not.
>> Yes, I wanted to say that if the RX_USED is not set when first reading ctrl
>> then the frame will not be processed at that moment but on the next interrupt.
> 
> Agreed, and the RX_USED=0 case looks OK to me too. The issue is only
> with RX_USED=1.
> 
>> It will
>>> not be processed again on next interrupt.
>>>
>>> I'll try to rephrase what I meant:
>>>
>>> 1. rmb(). This does nothing for loads that occur after it.
>>> 2. ctrl is loaded. It does not contain valid data yet.
>>> 3. HW receives a new frame and writes ctrl, addr, and they become
>>> visible to processor.
>>> 4. addr is loaded. It contains the RX_USED=1 as written in step 3.
>>> 5. Code does the RX_USED check, it sees 1 and continues processing the
>>> frame.
>>> 6. Code operates on ctrl, but as it was loaded before step 3, it
>>> contains old data.
>>>
>>> The old ctrl may e.g. cause the frame to be dropped due to the
>>> RX_SOF/RX_EOF check.
>> Yes, agree, but placing dma_rmb() before "ctrl = desc->ctrl" will not solve
>> this case, right? What you want is to have ctrl loaded after you execute
>> next instruction which refers to it. According to [1] you will have to
>> place the dma_rmb() after the "ctrl = desc->ctrl" instruction to ensure
>> next time you refer to it, it contains the proper value.
> 
> I don't think doing dma_rmb() after ctrl assignment works. I want to
> ensure the ordering of loads between ctrl and addr.
> ctrl has to be at least as "recent" as addr to avoid a race.
> 
> If I did
>    rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>    ctrl = desc->ctrl;
>    dma_rmb();
> both the compiler and the processor would be free to reorder the addr
> and ctrl loads, and the race condition described above could still happen.
> The loaded ctrl value would likely be kept in a register as it is used
> shortly after dma_rmb(). It will not be reloaded as the compiler
> considers desc->ctrl to possibly have different contents now (due to the
> compiler barrier).

Good, I understand this. Agree.

> 
> Doing
>    rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>    dma_rmb();
>    ctrl = desc->ctrl;
> the compiler barrier makes sure that in the generated code, addr load is
> before DMB and ctrl load is after it, and the DMB makes sure that the
> processor observes the addr load before the ctrl load.
> 
> Of course, please correct me if I'm wrong or missing something.
> 
>>>> dma_rmb() is a dmb. According to [1]:
>>>> "Data Memory Barrier acts as a memory barrier. It ensures that all explicit
>>>> memory accesses that appear in program order before the DMB instruction are
>>>> observed before any explicit memory accesses that appear in program order
>>>> after the DMB instruction. It does not affect the ordering of any other
>>>> instructions executing on the processor."
>>>>
>>>> and your code is:
>>>>
>>>> 		/* Ensure other data is at least as up-to-date as rxused */
>>>> 		dma_rmb();
>>>>
>>>> 		addr = macb_get_addr(bp, desc);
>>>> 		ctrl = desc->ctrl;
>>>>
>>>> I understand from this that you want to wait for instructions prior to
>>>> dma_rmb() to be finished?
>>> I want the load of "desc->addr" on the "rxused = (desc->addr &
>>> MACB_BIT(RX_USED)) ? true : false;" line to be observed before the later
>>> loads, such as "desc->ctrl".
>>>
>>>
>>>>> I think it may be OK to move the earlier rmb() outside the loop so that
>>>>> there is an rmb() only before and after the RX loop, as I don't at least
>>>>> immediately see any hard requirement to do it on each loop pass (unlike
>>>>> the added dma_rmb()). But my intent was to fix issues instead of
>>>>> optimization so I didn't look too closely into that.
>>>> But you said you did not see any issues with the code as it was previously.
>>> I meant that I have not observed any issues in practice (though I
>>> haven't really tried to), but I do see theoretical issues which this
>>> patch addresses.
>>>
>>> Especially the issues addressed by the two RX hunks seem entirely
>>> possible - the gem_rx() one requires the compiler to just reorder the
>>> two adjacent loads, and the macb_rx() one does not even require any
>>> reordering - desc->ctrl is read before desc->addr in the code.
>> In gem_rx() I agree that a barrier might be necessary, but placing it after
>> desc->addr will not guarantee you that the load of addr and ctrl will be
>> consistent. According to [1] the barrier should be after reading of addr
>> and ctrl.
>>
>> Thank you,
>> Claudiu Beznea
>>
>>>> Thank you,
>>>> Claudiu Beznea
>>>>
>>>>>>> +
>>>>>>> +		addr = macb_get_addr(bp, desc);
>>>>>>> +		ctrl = desc->ctrl;
>>>>>>> +
>>>>>>>  		queue->rx_tail++;
>>>>>>>  		count++;
>>>>>>>  
>>>>>>> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>>>>>>>  		/* Make hw descriptor updates visible to CPU */
>>>>>>>  		rmb();
>>>>>>>  
>>>>>>> -		ctrl = desc->ctrl;
>>>>>>> -
>>>>>>>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>>>>>>>  			break;
>>>>>>>  
>>>>>>> +		/* Ensure other data is at least as up-to-date as addr */
>>>>>>> +		dma_rmb();
>>>>>> Ditto
>>>>>>
>>>>>>> +
>>>>>>> +		ctrl = desc->ctrl;
>>>>>>> +
>>>>>>>  		if (ctrl & MACB_BIT(RX_SOF)) {
>>>>>>>  			if (first_frag != -1)
>>>>>>>  				discard_partial_frame(queue, first_frag, tail);
>>>>>>>
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 3/3 v2] net: macb: add missing barriers when reading descriptors
  2018-12-11 13:21             ` Anssi Hannula
  2018-12-12 10:58               ` Claudiu.Beznea
@ 2018-12-12 10:59               ` Anssi Hannula
  2018-12-12 23:19                 ` David Miller
  1 sibling, 1 reply; 31+ messages in thread
From: Anssi Hannula @ 2018-12-12 10:59 UTC (permalink / raw)
  To: Claudiu.Beznea; +Cc: Anssi Hannula, Nicolas.Ferre, davem, netdev

When reading buffer descriptors on RX or on TX completion, an
RX_USED/TX_USED bit is checked first to ensure that the descriptors have
been populated, i.e. the ownership has been transferred. However, there
are no memory barriers to ensure that the data protected by the
RX_USED/TX_USED bit is up-to-date with respect to that bit.

Specifically:

- TX timestamp descriptors may be loaded before ctrl is loaded for the
  TX_USED check, which is racy as the descriptors may be updated between
  the loads, causing old timestamp descriptor data to be used.

- RX ctrl may be loaded before addr is loaded for the RX_USED check,
  which is racy as a new frame may be written between the loads, causing
  old ctrl descriptor data to be used.
  This issue exists for both macb_rx() and gem_rx() variants.

Fix the races by adding DMA read memory barriers on those paths and
reordering the reads in macb_rx().

I have not observed any actual problems in practice caused by these
being missing, though.

Tested on a ZynqMP based system.

Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
---

The discussion is still ongoing on whether I have used the barriers
correctly here, but in the meantime, here's an updated version of the
patch.

v2:
- moved the timestamp protection barrier closer to the timestamp reads
- removed unnecessary move of the addr assignment in gem_rx() to keep
  the patch minimal for maximum clarity
- clarified commit message and comments


 drivers/net/ethernet/cadence/macb_main.c | 13 ++++++++++---
 drivers/net/ethernet/cadence/macb_ptp.c  |  2 ++
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 430b7a0f5436..10a2bb44612b 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -1001,11 +1001,15 @@ static int gem_rx(struct macb_queue *queue, int budget)
 
 		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
 		addr = macb_get_addr(bp, desc);
-		ctrl = desc->ctrl;
 
 		if (!rxused)
 			break;
 
+		/* Ensure ctrl is at least as up-to-date as rxused */
+		dma_rmb();
+
+		ctrl = desc->ctrl;
+
 		queue->rx_tail++;
 		count++;
 
@@ -1180,11 +1184,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
 		/* Make hw descriptor updates visible to CPU */
 		rmb();
 
-		ctrl = desc->ctrl;
-
 		if (!(desc->addr & MACB_BIT(RX_USED)))
 			break;
 
+		/* Ensure ctrl is at least as up-to-date as addr */
+		dma_rmb();
+
+		ctrl = desc->ctrl;
+
 		if (ctrl & MACB_BIT(RX_SOF)) {
 			if (first_frag != -1)
 				discard_partial_frame(queue, first_frag, tail);
diff --git a/drivers/net/ethernet/cadence/macb_ptp.c b/drivers/net/ethernet/cadence/macb_ptp.c
index cd5296b84229..a6dc47edc4cf 100644
--- a/drivers/net/ethernet/cadence/macb_ptp.c
+++ b/drivers/net/ethernet/cadence/macb_ptp.c
@@ -319,6 +319,8 @@ int gem_ptp_txstamp(struct macb_queue *queue, struct sk_buff *skb,
 	desc_ptp = macb_ptp_desc(queue->bp, desc);
 	tx_timestamp = &queue->tx_timestamps[head];
 	tx_timestamp->skb = skb;
+	/* ensure ts_1/ts_2 is loaded after ctrl (TX_USED check) */
+	dma_rmb();
 	tx_timestamp->desc_ptp.ts_1 = desc_ptp->ts_1;
 	tx_timestamp->desc_ptp.ts_2 = desc_ptp->ts_2;
 	/* move head */
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-12-12 10:58               ` Claudiu.Beznea
@ 2018-12-12 11:27                 ` Anssi Hannula
  2018-12-13 10:48                   ` Claudiu.Beznea
  0 siblings, 1 reply; 31+ messages in thread
From: Anssi Hannula @ 2018-12-12 11:27 UTC (permalink / raw)
  To: Claudiu.Beznea; +Cc: Nicolas.Ferre, davem, netdev

On 12.12.2018 12:58, Claudiu.Beznea@microchip.com wrote:
>
> On 11.12.2018 15:21, Anssi Hannula wrote:
>> On 10.12.2018 12:34, Claudiu.Beznea@microchip.com wrote:
>>> On 07.12.2018 14:00, Anssi Hannula wrote:
>>>> On 6.12.2018 16:14, Claudiu.Beznea@microchip.com wrote:
>>>>> Hi Anssi,
>>>> Hi!
>>>>
>>>>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>>>>> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>>>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>>>>> that bit.
>>>>>>>>
>>>>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>>>>
>>>>>>>> I did not observe any actual issues caused by these being missing,
>>>>>>>> though.
>>>>>>>>
>>>>>>>> Tested on a ZynqMP based system.
>>>>>>>>
>>>>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>>>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>>>>>>> ---
>>>>>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>>>>  
>>>>>>>>  			/* First, update TX stats if needed */
>>>>>>>>  			if (skb) {
>>>>>>>> +				/* Ensure all of desc is at least as up-to-date
>>>>>>>> +				 * as ctrl (TX_USED bit)
>>>>>>>> +				 */
>>>>>>>> +				dma_rmb();
>>>>>>>> +
>>>>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>>>>> data to be send is updated under a locked spinlock.
>>>>>> The previous rmb() is before the TX_USED check, so my understanding is
>>>>>> that the following could happen in theory:
>>>>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>>>>> understand from [1] that dsb completes when:
>>>>> "All explicit memory accesses before this instruction complete.
>>>>> All Cache, Branch predictor and TLB maintenance operations before this
>>>>> instruction complete."
>>>>>
>>>>>> 1. rmb().
>>>>> According to [1] this should end after all previous instructions (loads,
>>>>> stores) ends.
>>>>>
>>>>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>>>>> are crossed.
>>>>> But, as per [1], no onward instruction will be reached until all
>>>>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>>>>> should be updated, right?
>>>> The descriptor that triggered the TX interrupt should be visible now, yes.
>>>> However, the controller may be writing to any other descriptors at the
>>>> same time as the loop is processing through them, as there are multiple
>>>> TX buffers.
>>> Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
>>> the beginning of loop intended for that... In the 2nd loop you are
>>> operating with the same descriptor which was read in the first loop.
>> I'm concerned about the 2nd iteration of the first for loop in
>> macb_tx_interrupt().
>> That operates on a different descriptor due to tail++, and that
>> different descriptor may have in-flight writes from controller as the
>> interrupt may or may not already be raised for it.
> I agree with that, there may be in-flights updates, but since ownership bit
> should be updated at the end of the TX (I expect at this moment to be
> updated all the descriptor's parts, including timestamps), if you read a
> descriptor and it hasn't been treated by the hardware the descriptor is own
> by the hardware and there is this check after ctrl is read:
>
> if (!(ctrl & MACB_BIT(TX_USED)))
> 	break;
>
> Next time a new tx interrupt will be raised this descriptor would be also
> treated.
>
> In case the above instruction doesn't break then it means the descriptor
> was transmitted. And since ownership should be updated at the end of the
> TX, this means no in-flights updates should be done on this descriptor, so,
> when you will read the timestamps these should be the good ones (even it is
> the first one, the 3rd, and so on).

Agreed with above, the only problem is if the ts1/ts2 load is reordered
before ctrl load.

> Moreover, this new barrier barrier is in the 2nd loop, you're placing a
> memory barrier after every step in the 2nd loop for the same descriptor
> read in the 1st loop (it was already loaded in a processor register).

The barrier is inside "if (skb)", and skb is only set for the last frame
as indicated by the code at the end of the loop:

            /* skb is set only for the last buffer of the frame.
             * WARNING: at this point skb has been freed by
             * macb_tx_unmap().
             */
            if (skb)
                break;


> To avoid reordering of ctrl and ts1/ts2 loads wouldn't be enough to have
> memory barrier after:
> ctrl = desc->ctrl;

Yes, it would.
My v2 moved it into the other direction, into the ts code, though, which
has the added benefit of no barrier when timestamps are disabled (which
IIRC is the default).
But I can move it to just after ctrl if you so prefer.

>> Or am I missing something?
>>
>>>>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>>>>> I expect hardware to set TX_USED and timestamp before raising TX complete
>>>>> interrupt. If so, there should be no on-flight updates of this descriptor,
>>>>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>>>>> descriptor like this and hangs TX.
>>>> For the first iteration of the loop, that is correct - there should be
>>>> no in-flight writes from controller as it already raised the interrupt.
>>>> However, the following iterations of the loop process descriptors that
>>>> may or may not have the interrupt raised yet, and therefore may still
>>>> have in-flight writes.
>>> I expect the hardware to give up ownership of the descriptor just at the
>>> end of TX operation, so that the word containing TX_USED to be the last one
>>> updated. If you reads the descriptor and TX_USED is not there the first
>>> loop breaks, queue->tx_tail is updated with the last processed descriptor
>>> in the queue (see "queue->tx_tail = tail;") then the next interrupt will
>>> start processing with this last one descriptor.
>> Agreed, that is how it works. Issues only occur if we somehow operate on
>> data before ownership was transferred.
> Could this happens? Shouldn't this:
> 	if (!(ctrl & MACB_BIT(TX_USED)))
> 		break;
>
> assure that this could not happen and the fact that the ownership is passed
> to CPU at the end of descriptor update?

I don't see how that prevents desc->ctrl and desc->ts_1/ts_2 reordering,
which is the only issue that I can see.
By "data" above I meant ts_1/ts_2, AFAICS there is nothing else we read
from the descriptors aside from ctrl itself.

> If load reordering b/w desc->ctrl and desc->ts_1/ts_2 wouldn't be enough to
> have a barrier after
> ctrl = desc->ctrl instruction?

Yes, per above.

>>>>>> 4. Code checks TX_USED.
>>>>>> 5. Code operates on timestamp that is actually garbage.
>>>>>>
>>>>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>>>>> than dma_rmb(), though.
>>>>> If you still think this scenario could happen why not calling a dsb in
>>>>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
>>>> OK, I will move it there. Unless we arrive at a conclusion that it is
>>>> unnecessary altogether, of course :)
>>>>
>>>>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>>>>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>>>>> and TX_USED to be set by hardware before raising TX complete interrupt.
>>>> Yes, but since my concern is that without barriers in between,
>>>> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
>>>> set even though ts_1 is not yet an actual timestamp. And per above, all
>>>> this may occur before the TX complete interrupt is raised for the
>>>> descriptor in question.
>>> If so, why not placing the barrier when reading timestamps? From my point
>>> of view the place of "dma_rmb()" in this patch doesn't guarantees that
>>> ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).
>> If the timestamp ts1/ts2 reads are done after dma_rmb(), and dev->ctrl
>> is read before the barrier, my understanding is that they are guaranteed
>> to be ordered so that dev->ctrl is read before ts1/ts2.
>>
>> Per [1], "[DMB] ensures that all explicit memory accesses that appear in
>> program order before the |DMB| instruction are observed before any
>> explicit memory accesses that appear in program order after the |DMB|
>> instruction".
>> And the compiler barrier ("memory" asm clobber) included within
>> dma_rmb() ensures that in program order, ctrl is loaded before the
>> barrier, and ts1/ts2 after it.
>>
>> But as mentioned, it makes sense to have it closer to the ts1/ts2 reads
>> so I'll make that change.
>>
>> [1]
>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>
>>
>>>> I agree that this TX case seems somewhat unlikely to be triggered in
>>>> real life at least at present, though, as the ts_1/ts_2 read is so far
>>>> after the desc->ctrl read, and behind function calls, so unlikely to be
>>>> reordered before desc->ctrl read.
>>>> But the similar RX cases below seem more problematic as the racy loads
>>>> in question are right after each other in code.
>>>>
[...]

-- 
Anssi Hannula / Bitwise Oy
+358 503803997

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3 v2] net: macb: add missing barriers when reading descriptors
  2018-12-12 10:59               ` [PATCH 3/3 v2] net: macb: add missing barriers when reading descriptors Anssi Hannula
@ 2018-12-12 23:19                 ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2018-12-12 23:19 UTC (permalink / raw)
  To: anssi.hannula; +Cc: Claudiu.Beznea, Nicolas.Ferre, netdev


Please, when you update a patch within a patch series, you must always
repost the entire patch series including the introductory posting.

Thank you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
  2018-12-12 11:27                 ` Anssi Hannula
@ 2018-12-13 10:48                   ` Claudiu.Beznea
  0 siblings, 0 replies; 31+ messages in thread
From: Claudiu.Beznea @ 2018-12-13 10:48 UTC (permalink / raw)
  To: anssi.hannula; +Cc: Nicolas.Ferre, davem, netdev



On 12.12.2018 13:27, Anssi Hannula wrote:
> On 12.12.2018 12:58, Claudiu.Beznea@microchip.com wrote:
>>
>> On 11.12.2018 15:21, Anssi Hannula wrote:
>>> On 10.12.2018 12:34, Claudiu.Beznea@microchip.com wrote:
>>>> On 07.12.2018 14:00, Anssi Hannula wrote:
>>>>> On 6.12.2018 16:14, Claudiu.Beznea@microchip.com wrote:
>>>>>> Hi Anssi,
>>>>> Hi!
>>>>>
>>>>>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>>>>>> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>>>>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>>>>>> that bit.
>>>>>>>>>
>>>>>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>>>>>
>>>>>>>>> I did not observe any actual issues caused by these being missing,
>>>>>>>>> though.
>>>>>>>>>
>>>>>>>>> Tested on a ZynqMP based system.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>>>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>>>>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>>>>>>>> ---
>>>>>>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>>>>>  
>>>>>>>>>  			/* First, update TX stats if needed */
>>>>>>>>>  			if (skb) {
>>>>>>>>> +				/* Ensure all of desc is at least as up-to-date
>>>>>>>>> +				 * as ctrl (TX_USED bit)
>>>>>>>>> +				 */
>>>>>>>>> +				dma_rmb();
>>>>>>>>> +
>>>>>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>>>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>>>>>> data to be send is updated under a locked spinlock.
>>>>>>> The previous rmb() is before the TX_USED check, so my understanding is
>>>>>>> that the following could happen in theory:
>>>>>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>>>>>> understand from [1] that dsb completes when:
>>>>>> "All explicit memory accesses before this instruction complete.
>>>>>> All Cache, Branch predictor and TLB maintenance operations before this
>>>>>> instruction complete."
>>>>>>
>>>>>>> 1. rmb().
>>>>>> According to [1] this should end after all previous instructions (loads,
>>>>>> stores) ends.
>>>>>>
>>>>>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>>>>>> are crossed.
>>>>>> But, as per [1], no onward instruction will be reached until all
>>>>>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>>>>>> should be updated, right?
>>>>> The descriptor that triggered the TX interrupt should be visible now, yes.
>>>>> However, the controller may be writing to any other descriptors at the
>>>>> same time as the loop is processing through them, as there are multiple
>>>>> TX buffers.
>>>> Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
>>>> the beginning of loop intended for that... In the 2nd loop you are
>>>> operating with the same descriptor which was read in the first loop.
>>> I'm concerned about the 2nd iteration of the first for loop in
>>> macb_tx_interrupt().
>>> That operates on a different descriptor due to tail++, and that
>>> different descriptor may have in-flight writes from controller as the
>>> interrupt may or may not already be raised for it.
>> I agree with that, there may be in-flights updates, but since ownership bit
>> should be updated at the end of the TX (I expect at this moment to be
>> updated all the descriptor's parts, including timestamps), if you read a
>> descriptor and it hasn't been treated by the hardware the descriptor is own
>> by the hardware and there is this check after ctrl is read:
>>
>> if (!(ctrl & MACB_BIT(TX_USED)))
>> 	break;
>>
>> Next time a new tx interrupt will be raised this descriptor would be also
>> treated.
>>
>> In case the above instruction doesn't break then it means the descriptor
>> was transmitted. And since ownership should be updated at the end of the
>> TX, this means no in-flights updates should be done on this descriptor, so,
>> when you will read the timestamps these should be the good ones (even it is
>> the first one, the 3rd, and so on).
> 
> Agreed with above, the only problem is if the ts1/ts2 load is reordered
> before ctrl load.
> 
>> Moreover, this new barrier barrier is in the 2nd loop, you're placing a
>> memory barrier after every step in the 2nd loop for the same descriptor
>> read in the 1st loop (it was already loaded in a processor register).
> 
> The barrier is inside "if (skb)", and skb is only set for the last frame

Agree, this escaped me.

> as indicated by the code at the end of the loop:
> 
>             /* skb is set only for the last buffer of the frame.
>              * WARNING: at this point skb has been freed by
>              * macb_tx_unmap().
>              */
>             if (skb)
>                 break;
> 
> 
>> To avoid reordering of ctrl and ts1/ts2 loads wouldn't be enough to have
>> memory barrier after:
>> ctrl = desc->ctrl;
> 
> Yes, it would.
> My v2 moved it into the other direction, into the ts code, though, which
> has the added benefit of no barrier when timestamps are disabled (which
> IIRC is the default).
> But I can move it to just after ctrl if you so prefer.

It's ok in gem_ptp_txstamp().

> 
>>> Or am I missing something?
>>>
>>>>>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>>>>>> I expect hardware to set TX_USED and timestamp before raising TX complete
>>>>>> interrupt. If so, there should be no on-flight updates of this descriptor,
>>>>>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>>>>>> descriptor like this and hangs TX.
>>>>> For the first iteration of the loop, that is correct - there should be
>>>>> no in-flight writes from controller as it already raised the interrupt.
>>>>> However, the following iterations of the loop process descriptors that
>>>>> may or may not have the interrupt raised yet, and therefore may still
>>>>> have in-flight writes.
>>>> I expect the hardware to give up ownership of the descriptor just at the
>>>> end of TX operation, so that the word containing TX_USED to be the last one
>>>> updated. If you reads the descriptor and TX_USED is not there the first
>>>> loop breaks, queue->tx_tail is updated with the last processed descriptor
>>>> in the queue (see "queue->tx_tail = tail;") then the next interrupt will
>>>> start processing with this last one descriptor.
>>> Agreed, that is how it works. Issues only occur if we somehow operate on
>>> data before ownership was transferred.
>> Could this happens? Shouldn't this:
>> 	if (!(ctrl & MACB_BIT(TX_USED)))
>> 		break;
>>
>> assure that this could not happen and the fact that the ownership is passed
>> to CPU at the end of descriptor update?
> 
> I don't see how that prevents desc->ctrl and desc->ts_1/ts_2 reordering,
> which is the only issue that I can see.
> By "data" above I meant ts_1/ts_2, AFAICS there is nothing else we read
> from the descriptors aside from ctrl itself.
> 
>> If load reordering b/w desc->ctrl and desc->ts_1/ts_2 wouldn't be enough to
>> have a barrier after
>> ctrl = desc->ctrl instruction?
> 
> Yes, per above.
> 
>>>>>>> 4. Code checks TX_USED.
>>>>>>> 5. Code operates on timestamp that is actually garbage.
>>>>>>>
>>>>>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>>>>>> than dma_rmb(), though.
>>>>>> If you still think this scenario could happen why not calling a dsb in
>>>>>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
>>>>> OK, I will move it there. Unless we arrive at a conclusion that it is
>>>>> unnecessary altogether, of course :)
>>>>>
>>>>>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>>>>>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>>>>>> and TX_USED to be set by hardware before raising TX complete interrupt.
>>>>> Yes, but since my concern is that without barriers in between,
>>>>> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
>>>>> set even though ts_1 is not yet an actual timestamp. And per above, all
>>>>> this may occur before the TX complete interrupt is raised for the
>>>>> descriptor in question.
>>>> If so, why not placing the barrier when reading timestamps? From my point
>>>> of view the place of "dma_rmb()" in this patch doesn't guarantees that
>>>> ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).
>>> If the timestamp ts1/ts2 reads are done after dma_rmb(), and dev->ctrl
>>> is read before the barrier, my understanding is that they are guaranteed
>>> to be ordered so that dev->ctrl is read before ts1/ts2.
>>>
>>> Per [1], "[DMB] ensures that all explicit memory accesses that appear in
>>> program order before the |DMB| instruction are observed before any
>>> explicit memory accesses that appear in program order after the |DMB|
>>> instruction".
>>> And the compiler barrier ("memory" asm clobber) included within
>>> dma_rmb() ensures that in program order, ctrl is loaded before the
>>> barrier, and ts1/ts2 after it.
>>>
>>> But as mentioned, it makes sense to have it closer to the ts1/ts2 reads
>>> so I'll make that change.
>>>
>>> [1]
>>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>>
>>>
>>>>> I agree that this TX case seems somewhat unlikely to be triggered in
>>>>> real life at least at present, though, as the ts_1/ts_2 read is so far
>>>>> after the desc->ctrl read, and behind function calls, so unlikely to be
>>>>> reordered before desc->ctrl read.
>>>>> But the similar RX cases below seem more problematic as the racy loads
>>>>> in question are right after each other in code.
>>>>>
> [...]
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-12-13 10:49 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-30 18:21 [PATCH 0/3] net: macb: DMA race condition fixes Anssi Hannula
2018-11-30 18:21 ` [PATCH 1/3] net: macb: fix random memory corruption on RX with 64-bit DMA Anssi Hannula
2018-12-03  4:44   ` Harini Katakam
2018-12-05 12:37   ` Claudiu.Beznea
2018-12-05 13:58     ` Anssi Hannula
2018-12-05 20:32   ` David Miller
2018-12-06 14:16     ` Claudiu.Beznea
2018-11-30 18:21 ` [PATCH 2/3] net: macb: fix dropped RX frames due to a race Anssi Hannula
2018-12-03  4:52   ` Harini Katakam
2018-12-03 10:31     ` Anssi Hannula
2018-12-03 10:36       ` Harini Katakam
2018-12-05 12:38   ` Claudiu.Beznea
2018-11-30 18:21 ` [PATCH 3/3] net: macb: add missing barriers when reading buffers Anssi Hannula
2018-12-05 12:37   ` Claudiu.Beznea
2018-12-05 14:00     ` Anssi Hannula
2018-12-06 14:14       ` Claudiu.Beznea
2018-12-07 12:00         ` Anssi Hannula
2018-12-10 10:34           ` Claudiu.Beznea
2018-12-11 13:21             ` Anssi Hannula
2018-12-12 10:58               ` Claudiu.Beznea
2018-12-12 11:27                 ` Anssi Hannula
2018-12-13 10:48                   ` Claudiu.Beznea
2018-12-12 10:59               ` [PATCH 3/3 v2] net: macb: add missing barriers when reading descriptors Anssi Hannula
2018-12-12 23:19                 ` David Miller
2018-12-03  8:26 ` [PATCH 0/3] net: macb: DMA race condition fixes Nicolas.Ferre
2018-12-03 23:56   ` David Miller
2018-12-05 20:35 ` David Miller
2018-12-07 12:04   ` Anssi Hannula
2018-12-10 10:58     ` Nicolas.Ferre
2018-12-10 11:32       ` Claudiu.Beznea
2018-12-10 11:34         ` Claudiu.Beznea

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.