Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers

From: Anssi Hannula <anssi.hannula@bitwise.fi>
To: Claudiu.Beznea@microchip.com
Cc: Nicolas.Ferre@microchip.com, davem@davemloft.net, netdev@vger.kernel.org
Subject: Re: [PATCH 3/3] net: macb: add missing barriers when reading buffers
Date: Tue, 11 Dec 2018 15:21:36 +0200	[thread overview]
Message-ID: <ca8a468d-38df-1c79-3616-81b589e55357@bitwise.fi> (raw)
In-Reply-To: <ff275d8f-6f13-7b78-1136-b4a5a18046ba@microchip.com>

On 10.12.2018 12:34, Claudiu.Beznea@microchip.com wrote:
> On 07.12.2018 14:00, Anssi Hannula wrote:
>> On 6.12.2018 16:14, Claudiu.Beznea@microchip.com wrote:
>>> Hi Anssi,
>> Hi!
>>
>>> On 05.12.2018 16:00, Anssi Hannula wrote:
>>>> On 5.12.2018 14:37, Claudiu.Beznea@microchip.com wrote:
>>>>> On 30.11.2018 20:21, Anssi Hannula wrote:
>>>>>> When reading buffer descriptors on RX or on TX completion, an
>>>>>> RX_USED/TX_USED bit is checked first to ensure that the descriptor has
>>>>>> been populated. However, there are no memory barriers to ensure that the
>>>>>> data protected by the RX_USED/TX_USED bit is up-to-date with respect to
>>>>>> that bit.
>>>>>>
>>>>>> Fix that by adding DMA read memory barriers on those paths.
>>>>>>
>>>>>> I did not observe any actual issues caused by these being missing,
>>>>>> though.
>>>>>>
>>>>>> Tested on a ZynqMP based system.
>>>>>>
>>>>>> Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
>>>>>> Fixes: 89e5785fc8a6 ("[PATCH] Atmel MACB ethernet driver")
>>>>>> Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
>>>>>> ---
>>>>>>  drivers/net/ethernet/cadence/macb_main.c | 20 ++++++++++++++++----
>>>>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
>>>>>> index 430b7a0f5436..c93baa8621d5 100644
>>>>>> --- a/drivers/net/ethernet/cadence/macb_main.c
>>>>>> +++ b/drivers/net/ethernet/cadence/macb_main.c
>>>>>> @@ -861,6 +861,11 @@ static void macb_tx_interrupt(struct macb_queue *queue)
>>>>>>  
>>>>>>  			/* First, update TX stats if needed */
>>>>>>  			if (skb) {
>>>>>> +				/* Ensure all of desc is at least as up-to-date
>>>>>> +				 * as ctrl (TX_USED bit)
>>>>>> +				 */
>>>>>> +				dma_rmb();
>>>>>> +
>>>>> Is this necessary? Wouldn't previous rmb() take care of this? At this time
>>>>> data specific to this descriptor was completed. The TX descriptors for next
>>>>> data to be send is updated under a locked spinlock.
>>>> The previous rmb() is before the TX_USED check, so my understanding is
>>>> that the following could happen in theory:
>>> We are using this IP on and ARM architecture, so, with regards to rmb(), I
>>> understand from [1] that dsb completes when:
>>> "All explicit memory accesses before this instruction complete.
>>> All Cache, Branch predictor and TLB maintenance operations before this
>>> instruction complete."
>>>
>>>> 1. rmb().
>>> According to [1] this should end after all previous instructions (loads,
>>> stores) ends.
>>>
>>>> 2. Reads are reordered so that TX timestamp is read first - no barriers
>>>> are crossed.
>>> But, as per [1], no onward instruction will be reached until all
>>> instruction prior to dsb ends, so, after rmb() all descriptor's members
>>> should be updated, right?
>> The descriptor that triggered the TX interrupt should be visible now, yes.
>> However, the controller may be writing to any other descriptors at the
>> same time as the loop is processing through them, as there are multiple
>> TX buffers.
> Yes, but there is the "rmb()" after "desc = macb_tx_desc(queue, tail);" at
> the beginning of loop intended for that... In the 2nd loop you are
> operating with the same descriptor which was read in the first loop.

I'm concerned about the 2nd iteration of the first for loop in
macb_tx_interrupt().
That operates on a different descriptor due to tail++, and that
different descriptor may have in-flight writes from controller as the
interrupt may or may not already be raised for it.

Or am I missing something?

>>>> 3. HW writes timestamp and sets TX_USED (or they become visible).
>>> I expect hardware to set TX_USED and timestamp before raising TX complete
>>> interrupt. If so, there should be no on-flight updates of this descriptor,
>>> right? Hardware raised a TX_USED bit read interrupt when it reads a
>>> descriptor like this and hangs TX.
>> For the first iteration of the loop, that is correct - there should be
>> no in-flight writes from controller as it already raised the interrupt.
>> However, the following iterations of the loop process descriptors that
>> may or may not have the interrupt raised yet, and therefore may still
>> have in-flight writes.
> I expect the hardware to give up ownership of the descriptor just at the
> end of TX operation, so that the word containing TX_USED to be the last one
> updated. If you reads the descriptor and TX_USED is not there the first
> loop breaks, queue->tx_tail is updated with the last processed descriptor
> in the queue (see "queue->tx_tail = tail;") then the next interrupt will
> start processing with this last one descriptor.

Agreed, that is how it works. Issues only occur if we somehow operate on
data before ownership was transferred.

>>>> 4. Code checks TX_USED.
>>>> 5. Code operates on timestamp that is actually garbage.
>>>>
>>>> I'm not 100% sure that there isn't some lighter/cleaner way to do this
>>>> than dma_rmb(), though.
>>> If you still think this scenario could happen why not calling a dsb in
>>> gem_ptp_do_timestamp(). I feel like that is a proper place to call it.
>> OK, I will move it there. Unless we arrive at a conclusion that it is
>> unnecessary altogether, of course :)
>>
>>> Moreover, there is bit 32 of desc->ctrl which tells you if a valid
>>> timestamp was placed in the descriptor. But, again, I expect the timestamp
>>> and TX_USED to be set by hardware before raising TX complete interrupt.
>> Yes, but since my concern is that without barriers in between,
>> desc->ctrl might be read after ts_1/ts_2, so that bit might be seen as
>> set even though ts_1 is not yet an actual timestamp. And per above, all
>> this may occur before the TX complete interrupt is raised for the
>> descriptor in question.
> If so, why not placing the barrier when reading timestamps? From my point
> of view the place of "dma_rmb()" in this patch doesn't guarantees that
> ts1/ts2 were read after the execution of barrier (correct me if I'm wrong).

If the timestamp ts1/ts2 reads are done after dma_rmb(), and dev->ctrl
is read before the barrier, my understanding is that they are guaranteed
to be ordered so that dev->ctrl is read before ts1/ts2.

Per [1], "[DMB] ensures that all explicit memory accesses that appear in
program order before the |DMB| instruction are observed before any
explicit memory accesses that appear in program order after the |DMB|
instruction".
And the compiler barrier ("memory" asm clobber) included within
dma_rmb() ensures that in program order, ctrl is loaded before the
barrier, and ts1/ts2 after it.

But as mentioned, it makes sense to have it closer to the ts1/ts2 reads
so I'll make that change.

[1]
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html

>> I agree that this TX case seems somewhat unlikely to be triggered in
>> real life at least at present, though, as the ts_1/ts_2 read is so far
>> after the desc->ctrl read, and behind function calls, so unlikely to be
>> reordered before desc->ctrl read.
>> But the similar RX cases below seem more problematic as the racy loads
>> in question are right after each other in code.
>>
>>> [1]
>>> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHGHHIE.html
>>>
>>>>>>  				if (gem_ptp_do_txstamp(queue, skb, desc) == 0) {
>>>>>>  					/* skb now belongs to timestamp buffer
>>>>>>  					 * and will be removed later
>>>>>> @@ -1000,12 +1005,16 @@ static int gem_rx(struct macb_queue *queue, int budget)
>>>>>>  		rmb();
>>>>>>  
>>>>>>  		rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
>>>>>> -		addr = macb_get_addr(bp, desc);
>>>>>> -		ctrl = desc->ctrl;
>>>>>>  
>>>>>>  		if (!rxused)
>>>>>>  			break;
>>>>>>  
>>>>>> +		/* Ensure other data is at least as up-to-date as rxused */
>>>>>> +		dma_rmb();
>>>>> Same here, wouldn't previous rmb() should do this job?
>>>> The scenario I'm concerned about here (and in the last hunk) is:
>>>>
>>>> 1. rmb().
>>>> 2. ctrl is read (i.e. ctrl read reordered before addr read).
>>> Same here with regards to [1]. All prior loads, stores should be finished
>>> when dsb ends.
>> Yes, but the problematic loads are *after* the dsb.
>>
>>>> 3. HW updates to ctrl and addr become visible.
>>>> 4. RX_USED check.
>>>> 5. code operates on garbage ctrl.
>>> If this is happen then the data will be read on next interrupt.
>> As long as the RX_USED bit is set, the code in gem_rx() will either drop
>> the frame or process it, depending on if it seems valid or not.
> Yes, I wanted to say that if the RX_USED is not set when first reading ctrl
> then the frame will not be processed at that moment but on the next interrupt.

Agreed, and the RX_USED=0 case looks OK to me too. The issue is only
with RX_USED=1.

> It will
>> not be processed again on next interrupt.
>>
>> I'll try to rephrase what I meant:
>>
>> 1. rmb(). This does nothing for loads that occur after it.
>> 2. ctrl is loaded. It does not contain valid data yet.
>> 3. HW receives a new frame and writes ctrl, addr, and they become
>> visible to processor.
>> 4. addr is loaded. It contains the RX_USED=1 as written in step 3.
>> 5. Code does the RX_USED check, it sees 1 and continues processing the
>> frame.
>> 6. Code operates on ctrl, but as it was loaded before step 3, it
>> contains old data.
>>
>> The old ctrl may e.g. cause the frame to be dropped due to the
>> RX_SOF/RX_EOF check.
> Yes, agree, but placing dma_rmb() before "ctrl = desc->ctrl" will not solve
> this case, right? What you want is to have ctrl loaded after you execute
> next instruction which refers to it. According to [1] you will have to
> place the dma_rmb() after the "ctrl = desc->ctrl" instruction to ensure
> next time you refer to it, it contains the proper value.

I don't think doing dma_rmb() after ctrl assignment works. I want to
ensure the ordering of loads between ctrl and addr.
ctrl has to be at least as "recent" as addr to avoid a race.

If I did
   rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
   ctrl = desc->ctrl;
   dma_rmb();
both the compiler and the processor would be free to reorder the addr
and ctrl loads, and the race condition described above could still happen.
The loaded ctrl value would likely be kept in a register as it is used
shortly after dma_rmb(). It will not be reloaded as the compiler
considers desc->ctrl to possibly have different contents now (due to the
compiler barrier).

Doing
   rxused = (desc->addr & MACB_BIT(RX_USED)) ? true : false;
   dma_rmb();
   ctrl = desc->ctrl;
the compiler barrier makes sure that in the generated code, addr load is
before DMB and ctrl load is after it, and the DMB makes sure that the
processor observes the addr load before the ctrl load.

Of course, please correct me if I'm wrong or missing something.

>>> dma_rmb() is a dmb. According to [1]:
>>> "Data Memory Barrier acts as a memory barrier. It ensures that all explicit
>>> memory accesses that appear in program order before the DMB instruction are
>>> observed before any explicit memory accesses that appear in program order
>>> after the DMB instruction. It does not affect the ordering of any other
>>> instructions executing on the processor."
>>>
>>> and your code is:
>>>
>>> 		/* Ensure other data is at least as up-to-date as rxused */
>>> 		dma_rmb();
>>>
>>> 		addr = macb_get_addr(bp, desc);
>>> 		ctrl = desc->ctrl;
>>>
>>> I understand from this that you want to wait for instructions prior to
>>> dma_rmb() to be finished?
>> I want the load of "desc->addr" on the "rxused = (desc->addr &
>> MACB_BIT(RX_USED)) ? true : false;" line to be observed before the later
>> loads, such as "desc->ctrl".
>>
>>
>>>> I think it may be OK to move the earlier rmb() outside the loop so that
>>>> there is an rmb() only before and after the RX loop, as I don't at least
>>>> immediately see any hard requirement to do it on each loop pass (unlike
>>>> the added dma_rmb()). But my intent was to fix issues instead of
>>>> optimization so I didn't look too closely into that.
>>> But you said you did not see any issues with the code as it was previously.
>> I meant that I have not observed any issues in practice (though I
>> haven't really tried to), but I do see theoretical issues which this
>> patch addresses.
>>
>> Especially the issues addressed by the two RX hunks seem entirely
>> possible - the gem_rx() one requires the compiler to just reorder the
>> two adjacent loads, and the macb_rx() one does not even require any
>> reordering - desc->ctrl is read before desc->addr in the code.
> In gem_rx() I agree that a barrier might be necessary, but placing it after
> desc->addr will not guarantee you that the load of addr and ctrl will be
> consistent. According to [1] the barrier should be after reading of addr
> and ctrl.
>
> Thank you,
> Claudiu Beznea
>
>>> Thank you,
>>> Claudiu Beznea
>>>
>>>>>> +
>>>>>> +		addr = macb_get_addr(bp, desc);
>>>>>> +		ctrl = desc->ctrl;
>>>>>> +
>>>>>>  		queue->rx_tail++;
>>>>>>  		count++;
>>>>>>  
>>>>>> @@ -1180,11 +1189,14 @@ static int macb_rx(struct macb_queue *queue, int budget)
>>>>>>  		/* Make hw descriptor updates visible to CPU */
>>>>>>  		rmb();
>>>>>>  
>>>>>> -		ctrl = desc->ctrl;
>>>>>> -
>>>>>>  		if (!(desc->addr & MACB_BIT(RX_USED)))
>>>>>>  			break;
>>>>>>  
>>>>>> +		/* Ensure other data is at least as up-to-date as addr */
>>>>>> +		dma_rmb();
>>>>> Ditto
>>>>>
>>>>>> +
>>>>>> +		ctrl = desc->ctrl;
>>>>>> +
>>>>>>  		if (ctrl & MACB_BIT(RX_SOF)) {
>>>>>>  			if (first_frag != -1)
>>>>>>  				discard_partial_frame(queue, first_frag, tail);
>>>>>>

-- 
Anssi Hannula / Bitwise Oy
+358 503803997