linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] rt2x00: fix rx queue hang
@ 2019-06-17  9:46 Soeren Moch
  2019-06-18  9:34 ` Stanislaw Gruszka
  0 siblings, 1 reply; 8+ messages in thread
From: Soeren Moch @ 2019-06-17  9:46 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Soeren Moch, Helmut Schaa, Kalle Valo, David S. Miller,
	linux-wireless, netdev, linux-kernel, stable

Since commit ed194d136769 ("usb: core: remove local_irq_save() around
 ->complete() handler") the handlers rt2x00usb_interrupt_rxdone() and
rt2x00usb_interrupt_txdone() are not running with interrupts disabled
anymore. So these handlers are not guaranteed to run completely before
workqueue processing starts. So only mark entries ready for workqueue
processing after proper accounting in the dma done queue.
Note that rt2x00usb_work_rxdone() processes all available entries, not
only such for which queue_work() was called.

This fixes a regression on a RT5370 based wifi stick in AP mode, which
suddenly stopped data transmission after some period of heavy load. Also
stopping the hanging hostapd resulted in the error message "ieee80211
phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush".
Other operation modes are probably affected as well, this just was
the used testcase.

Fixes: ed194d136769 ("usb: core: remove local_irq_save() around ->complete() handler")
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Helmut Schaa <helmut.schaa@googlemail.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # 4.20+
Signed-off-by: Soeren Moch <smoch@web.de>
---
 drivers/net/wireless/ralink/rt2x00/rt2x00dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
index 1b08b01db27b..9c102a501ee6 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
@@ -263,9 +263,9 @@ EXPORT_SYMBOL_GPL(rt2x00lib_dmastart);

 void rt2x00lib_dmadone(struct queue_entry *entry)
 {
-	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
 	clear_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags);
 	rt2x00queue_index_inc(entry, Q_INDEX_DMA_DONE);
+	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
 }
 EXPORT_SYMBOL_GPL(rt2x00lib_dmadone);

--
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] rt2x00: fix rx queue hang
  2019-06-17  9:46 [PATCH] rt2x00: fix rx queue hang Soeren Moch
@ 2019-06-18  9:34 ` Stanislaw Gruszka
  2019-06-21 11:30   ` Soeren Moch
  0 siblings, 1 reply; 8+ messages in thread
From: Stanislaw Gruszka @ 2019-06-18  9:34 UTC (permalink / raw)
  To: Soeren Moch
  Cc: Helmut Schaa, Kalle Valo, David S. Miller, linux-wireless,
	netdev, linux-kernel, stable

Hi

On Mon, Jun 17, 2019 at 11:46:56AM +0200, Soeren Moch wrote:
> Since commit ed194d136769 ("usb: core: remove local_irq_save() around
>  ->complete() handler") the handlers rt2x00usb_interrupt_rxdone() and
> rt2x00usb_interrupt_txdone() are not running with interrupts disabled
> anymore. So these handlers are not guaranteed to run completely before
> workqueue processing starts. So only mark entries ready for workqueue
> processing after proper accounting in the dma done queue.

It was always the case on SMP machines that rt2x00usb_interrupt_{tx/rx}done
can run concurrently with rt2x00_work_{rx,tx}done, so I do not
understand how removing local_irq_save() around complete handler broke
things.

Have you reverted commit ed194d136769 and the revert does solve the problem ?

Between 4.19 and 4.20 we have some quite big changes in rt2x00 driver:

0240564430c0 rt2800: flush and txstatus rework for rt2800mmio
adf26a356f13 rt2x00: use different txstatus timeouts when flushing
5022efb50f62 rt2x00: do not check for txstatus timeout every time on tasklet
0b0d556e0ebb rt2800mmio: use txdone/txstatus routines from lib
5c656c71b1bf rt2800: move usb specific txdone/txstatus routines to rt2800lib

so I'm a bit afraid that one of those changes is real cause of
the issue not ed194d136769 .

> Note that rt2x00usb_work_rxdone() processes all available entries, not
> only such for which queue_work() was called.
> 
> This fixes a regression on a RT5370 based wifi stick in AP mode, which
> suddenly stopped data transmission after some period of heavy load. Also
> stopping the hanging hostapd resulted in the error message "ieee80211
> phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush".
> Other operation modes are probably affected as well, this just was
> the used testcase.

Do you know what actually make the traffic stop,
TX queue hung or RX queue hung?

> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
> index 1b08b01db27b..9c102a501ee6 100644
> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
> @@ -263,9 +263,9 @@ EXPORT_SYMBOL_GPL(rt2x00lib_dmastart);
> 
>  void rt2x00lib_dmadone(struct queue_entry *entry)
>  {
> -	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
>  	clear_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags);
>  	rt2x00queue_index_inc(entry, Q_INDEX_DMA_DONE);
> +	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);

Unfortunately I do not understand how this suppose to fix the problem,
could you elaborate more about this change?

Stanislaw

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] rt2x00: fix rx queue hang
  2019-06-18  9:34 ` Stanislaw Gruszka
@ 2019-06-21 11:30   ` Soeren Moch
  2019-06-25  9:57     ` Stanislaw Gruszka
  0 siblings, 1 reply; 8+ messages in thread
From: Soeren Moch @ 2019-06-21 11:30 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Helmut Schaa, Kalle Valo, David S. Miller, linux-wireless,
	netdev, linux-kernel, stable

Hi!

On 18.06.19 11:34, Stanislaw Gruszka wrote:
> Hi
>
> On Mon, Jun 17, 2019 at 11:46:56AM +0200, Soeren Moch wrote:
>> Since commit ed194d136769 ("usb: core: remove local_irq_save() around
>>  ->complete() handler") the handlers rt2x00usb_interrupt_rxdone() and
>> rt2x00usb_interrupt_txdone() are not running with interrupts disabled
>> anymore. So these handlers are not guaranteed to run completely before
>> workqueue processing starts. So only mark entries ready for workqueue
>> processing after proper accounting in the dma done queue.
> It was always the case on SMP machines that rt2x00usb_interrupt_{tx/rx}done
> can run concurrently with rt2x00_work_{rx,tx}done, so I do not
> understand how removing local_irq_save() around complete handler broke
> things.
I think because completion handlers can be interrupted now and scheduled
away
in the middle of processing.
> Have you reverted commit ed194d136769 and the revert does solve the problem ?
Yes, I already sent a patch for this, see [1]. But this was not considered
an acceptablesolution. Especially RT folks do not like code running with
interrupts disabled,particularly when trying to acquire spinlocks then.

[1] https://lkml.org/lkml/2019/5/31/863
> Between 4.19 and 4.20 we have some quite big changes in rt2x00 driver:
>
> 0240564430c0 rt2800: flush and txstatus rework for rt2800mmio
> adf26a356f13 rt2x00: use different txstatus timeouts when flushing
> 5022efb50f62 rt2x00: do not check for txstatus timeout every time on tasklet
> 0b0d556e0ebb rt2800mmio: use txdone/txstatus routines from lib
> 5c656c71b1bf rt2800: move usb specific txdone/txstatus routines to rt2800lib
>
> so I'm a bit afraid that one of those changes is real cause of
> the issue not ed194d136769 .
I tested 4.20 and 5.1 and see the exact same behavior. Reverting this
usb core patchsolves the problem.
4.19.x (before this usb core patch) is running fine.
>> Note that rt2x00usb_work_rxdone() processes all available entries, not
>> only such for which queue_work() was called.
>>
>> This fixes a regression on a RT5370 based wifi stick in AP mode, which
>> suddenly stopped data transmission after some period of heavy load. Also
>> stopping the hanging hostapd resulted in the error message "ieee80211
>> phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush".
>> Other operation modes are probably affected as well, this just was
>> the used testcase.
> Do you know what actually make the traffic stop,
> TX queue hung or RX queue hung?
I think RX queue hang, as stated in the patch title. "Queue 14" means QID_RX
(rt2x00queue.h, enum data_queue_qid).
I also tried to re-add local_irq_save() in only one of the handlers. Adding
this tort2x00usb_interrupt_rxdone() alone solved the issue, while doing so
for tx alonedid not.

Note that this doesn't mean there is no problem for tx, that's maybe
just more
difficult to trigger.
>> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>> index 1b08b01db27b..9c102a501ee6 100644
>> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>> @@ -263,9 +263,9 @@ EXPORT_SYMBOL_GPL(rt2x00lib_dmastart);
>>
>>  void rt2x00lib_dmadone(struct queue_entry *entry)
>>  {
>> -	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
>>  	clear_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags);
>>  	rt2x00queue_index_inc(entry, Q_INDEX_DMA_DONE);
>> +	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
> Unfortunately I do not understand how this suppose to fix the problem,
> could you elaborate more about this change?
>
Re-adding local_irq_save() around thisrt2x00lib_dmadone()solved
the issue. So I also tried to reverse the order of these calls.
It seems totally plausible to me, that the correct sequence is to
first clear the device assignment, then to set the status to dma_done,
then to trigger the workqueue processing for this entry. When the handler
is scheduled away in the middle of this sequence, now there is no
strange state where the entry can be processed by the workqueue while
not declared dma_done for it.
With this changed sequence there is no need anymore to disable interrupts
for solving the hang issue.

Regards,
Soeren



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] rt2x00: fix rx queue hang
  2019-06-21 11:30   ` Soeren Moch
@ 2019-06-25  9:57     ` Stanislaw Gruszka
  2019-06-26 13:28       ` Soeren Moch
  0 siblings, 1 reply; 8+ messages in thread
From: Stanislaw Gruszka @ 2019-06-25  9:57 UTC (permalink / raw)
  To: Soeren Moch
  Cc: Helmut Schaa, Kalle Valo, David S. Miller, linux-wireless,
	netdev, linux-kernel, stable

Hello

On Fri, Jun 21, 2019 at 01:30:01PM +0200, Soeren Moch wrote:
> On 18.06.19 11:34, Stanislaw Gruszka wrote:
> > Hi
> >
> > On Mon, Jun 17, 2019 at 11:46:56AM +0200, Soeren Moch wrote:
> >> Since commit ed194d136769 ("usb: core: remove local_irq_save() around
> >>  ->complete() handler") the handlers rt2x00usb_interrupt_rxdone() and
> >> rt2x00usb_interrupt_txdone() are not running with interrupts disabled
> >> anymore. So these handlers are not guaranteed to run completely before
> >> workqueue processing starts. So only mark entries ready for workqueue
> >> processing after proper accounting in the dma done queue.
> > It was always the case on SMP machines that rt2x00usb_interrupt_{tx/rx}done
> > can run concurrently with rt2x00_work_{rx,tx}done, so I do not
> > understand how removing local_irq_save() around complete handler broke
> > things.
> I think because completion handlers can be interrupted now and scheduled
> away
> in the middle of processing.
> > Have you reverted commit ed194d136769 and the revert does solve the problem ?
> Yes, I already sent a patch for this, see [1]. But this was not considered
> an acceptablesolution. Especially RT folks do not like code running with
> interrupts disabled,particularly when trying to acquire spinlocks then.
> 
> [1] https://lkml.org/lkml/2019/5/31/863
> > Between 4.19 and 4.20 we have some quite big changes in rt2x00 driver:
> >
> > 0240564430c0 rt2800: flush and txstatus rework for rt2800mmio
> > adf26a356f13 rt2x00: use different txstatus timeouts when flushing
> > 5022efb50f62 rt2x00: do not check for txstatus timeout every time on tasklet
> > 0b0d556e0ebb rt2800mmio: use txdone/txstatus routines from lib
> > 5c656c71b1bf rt2800: move usb specific txdone/txstatus routines to rt2800lib
> >
> > so I'm a bit afraid that one of those changes is real cause of
> > the issue not ed194d136769 .
> I tested 4.20 and 5.1 and see the exact same behavior. Reverting this
> usb core patchsolves the problem.
> 4.19.x (before this usb core patch) is running fine.
> >> Note that rt2x00usb_work_rxdone() processes all available entries, not
> >> only such for which queue_work() was called.
> >>
> >> This fixes a regression on a RT5370 based wifi stick in AP mode, which
> >> suddenly stopped data transmission after some period of heavy load. Also
> >> stopping the hanging hostapd resulted in the error message "ieee80211
> >> phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush".
> >> Other operation modes are probably affected as well, this just was
> >> the used testcase.
> > Do you know what actually make the traffic stop,
> > TX queue hung or RX queue hung?
> I think RX queue hang, as stated in the patch title. "Queue 14" means QID_RX
> (rt2x00queue.h, enum data_queue_qid).
> I also tried to re-add local_irq_save() in only one of the handlers. Adding
> this tort2x00usb_interrupt_rxdone() alone solved the issue, while doing so
> for tx alonedid not.
> 
> Note that this doesn't mean there is no problem for tx, that's maybe
> just more
> difficult to trigger.
> >> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
> >> index 1b08b01db27b..9c102a501ee6 100644
> >> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
> >> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
> >> @@ -263,9 +263,9 @@ EXPORT_SYMBOL_GPL(rt2x00lib_dmastart);
> >>
> >>  void rt2x00lib_dmadone(struct queue_entry *entry)
> >>  {
> >> -	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
> >>  	clear_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags);
> >>  	rt2x00queue_index_inc(entry, Q_INDEX_DMA_DONE);
> >> +	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
> > Unfortunately I do not understand how this suppose to fix the problem,
> > could you elaborate more about this change?
> >
> Re-adding local_irq_save() around thisrt2x00lib_dmadone()solved
> the issue. So I also tried to reverse the order of these calls.
> It seems totally plausible to me, that the correct sequence is to
> first clear the device assignment, then to set the status to dma_done,
> then to trigger the workqueue processing for this entry. When the handler
> is scheduled away in the middle of this sequence, now there is no
> strange state where the entry can be processed by the workqueue while
> not declared dma_done for it.
> With this changed sequence there is no need anymore to disable interrupts
> for solving the hang issue.

Thanks very much for explanations. However I still do not fully
understand the issue. Q_INDEX_DMA_DONE index is only checked on TX
processing (on RX we use only Q_INDEX_DONE and Q_INDEX) and
ENTRY_OWNER_DEVICE_DATA is already cleared before rt2x00lib_dmadone()
in rt2x00usb_interrupt_rxdone() .

So I'm not sure how changing the order solve the problem. Looks
for me that the issue is triggered by some rt2x00lib_dmadone()
call done on error path (not in rt2x00usb_interrupt_rxdone())
and it race with this check:

        if (test_and_set_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags) ||
            test_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags))
                return false;

in rt2x00usb_kick_rx_entry() - we return instead of submit urb.

I'm somewhat reluctant to change the order, because TX processing
might relay on it (we first mark we wait for TX status and
then mark entry is no longer owned by hardware). However on RX
side ENTRY_DATA_STATUS_PENDING bit make no sense as we do not
wait for status. We should remove ENTRY_DATA_STATUS_PENDING on
RX side and perhaps this also will solve issue you observe.
Could you please check below patch, if it fixes the problem as well?

Stanislaw

diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
index b6c1344..731e633 100644
--- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
+++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
@@ -360,8 +360,7 @@ static void rt2x00usb_work_rxdone(struct work_struct *work)
 	while (!rt2x00queue_empty(rt2x00dev->rx)) {
 		entry = rt2x00queue_get_entry(rt2x00dev->rx, Q_INDEX_DONE);
 
-		if (test_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags) ||
-		    !test_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags))
+		if (test_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags))
 			break;
 
 		/*
@@ -413,8 +412,7 @@ static bool rt2x00usb_kick_rx_entry(struct queue_entry *entry, void *data)
 	struct queue_entry_priv_usb *entry_priv = entry->priv_data;
 	int status;
 
-	if (test_and_set_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags) ||
-	    test_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags))
+	if (test_and_set_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags))
 		return false;
 
 	rt2x00lib_dmastart(entry);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] rt2x00: fix rx queue hang
  2019-06-25  9:57     ` Stanislaw Gruszka
@ 2019-06-26 13:28       ` Soeren Moch
  2019-06-29  8:50         ` Stanislaw Gruszka
  0 siblings, 1 reply; 8+ messages in thread
From: Soeren Moch @ 2019-06-26 13:28 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Helmut Schaa, Kalle Valo, David S. Miller, linux-wireless,
	netdev, linux-kernel, stable

On 25.06.19 11:57, Stanislaw Gruszka wrote:
> Hello
>
> On Fri, Jun 21, 2019 at 01:30:01PM +0200, Soeren Moch wrote:
>> On 18.06.19 11:34, Stanislaw Gruszka wrote:
>>> Hi
>>>
>>> On Mon, Jun 17, 2019 at 11:46:56AM +0200, Soeren Moch wrote:
>>>> Since commit ed194d136769 ("usb: core: remove local_irq_save() around
>>>>  ->complete() handler") the handlers rt2x00usb_interrupt_rxdone() and
>>>> rt2x00usb_interrupt_txdone() are not running with interrupts disabled
>>>> anymore. So these handlers are not guaranteed to run completely before
>>>> workqueue processing starts. So only mark entries ready for workqueue
>>>> processing after proper accounting in the dma done queue.
>>> It was always the case on SMP machines that rt2x00usb_interrupt_{tx/rx}done
>>> can run concurrently with rt2x00_work_{rx,tx}done, so I do not
>>> understand how removing local_irq_save() around complete handler broke
>>> things.
>> I think because completion handlers can be interrupted now and scheduled
>> away
>> in the middle of processing.
>>> Have you reverted commit ed194d136769 and the revert does solve the problem ?
>> Yes, I already sent a patch for this, see [1]. But this was not considered
>> an acceptablesolution. Especially RT folks do not like code running with
>> interrupts disabled,particularly when trying to acquire spinlocks then.
>>
>> [1] https://lkml.org/lkml/2019/5/31/863
>>> Between 4.19 and 4.20 we have some quite big changes in rt2x00 driver:
>>>
>>> 0240564430c0 rt2800: flush and txstatus rework for rt2800mmio
>>> adf26a356f13 rt2x00: use different txstatus timeouts when flushing
>>> 5022efb50f62 rt2x00: do not check for txstatus timeout every time on tasklet
>>> 0b0d556e0ebb rt2800mmio: use txdone/txstatus routines from lib
>>> 5c656c71b1bf rt2800: move usb specific txdone/txstatus routines to rt2800lib
>>>
>>> so I'm a bit afraid that one of those changes is real cause of
>>> the issue not ed194d136769 .
>> I tested 4.20 and 5.1 and see the exact same behavior. Reverting this
>> usb core patchsolves the problem.
>> 4.19.x (before this usb core patch) is running fine.
>>>> Note that rt2x00usb_work_rxdone() processes all available entries, not
>>>> only such for which queue_work() was called.
>>>>
>>>> This fixes a regression on a RT5370 based wifi stick in AP mode, which
>>>> suddenly stopped data transmission after some period of heavy load. Also
>>>> stopping the hanging hostapd resulted in the error message "ieee80211
>>>> phy0: rt2x00queue_flush_queue: Warning - Queue 14 failed to flush".
>>>> Other operation modes are probably affected as well, this just was
>>>> the used testcase.
>>> Do you know what actually make the traffic stop,
>>> TX queue hung or RX queue hung?
>> I think RX queue hang, as stated in the patch title. "Queue 14" means QID_RX
>> (rt2x00queue.h, enum data_queue_qid).
>> I also tried to re-add local_irq_save() in only one of the handlers. Adding
>> this tort2x00usb_interrupt_rxdone() alone solved the issue, while doing so
>> for tx alonedid not.
>>
>> Note that this doesn't mean there is no problem for tx, that's maybe
>> just more
>> difficult to trigger.
>>>> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>>>> index 1b08b01db27b..9c102a501ee6 100644
>>>> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>>>> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00dev.c
>>>> @@ -263,9 +263,9 @@ EXPORT_SYMBOL_GPL(rt2x00lib_dmastart);
>>>>
>>>>  void rt2x00lib_dmadone(struct queue_entry *entry)
>>>>  {
>>>> -	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
>>>>  	clear_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags);
>>>>  	rt2x00queue_index_inc(entry, Q_INDEX_DMA_DONE);
>>>> +	set_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags);
>>> Unfortunately I do not understand how this suppose to fix the problem,
>>> could you elaborate more about this change?
>>>
>> Re-adding local_irq_save() around thisrt2x00lib_dmadone()solved
>> the issue. So I also tried to reverse the order of these calls.
>> It seems totally plausible to me, that the correct sequence is to
>> first clear the device assignment, then to set the status to dma_done,
>> then to trigger the workqueue processing for this entry. When the handler
>> is scheduled away in the middle of this sequence, now there is no
>> strange state where the entry can be processed by the workqueue while
>> not declared dma_done for it.
>> With this changed sequence there is no need anymore to disable interrupts
>> for solving the hang issue.
> Thanks very much for explanations. However I still do not fully
> understand the issue. Q_INDEX_DMA_DONE index is only checked on TX
> processing (on RX we use only Q_INDEX_DONE and Q_INDEX) and
> ENTRY_OWNER_DEVICE_DATA is already cleared before rt2x00lib_dmadone()
> in rt2x00usb_interrupt_rxdone() .
>
> So I'm not sure how changing the order solve the problem. Looks
> for me that the issue is triggered by some rt2x00lib_dmadone()
> call done on error path (not in rt2x00usb_interrupt_rxdone())
> and it race with this check:
>
>         if (test_and_set_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags) ||
>             test_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags))
>                 return false;
>
> in rt2x00usb_kick_rx_entry() - we return instead of submit urb.
Hi Stanislaw,

the good news is: your patch below also solves the issue for me. But
removing the ENTRY_DATA_STATUS_PENDING check in
rt2x00usb_kick_rx_entry() alone does not help, while removing this check
in rt2x00usb_work_rxdone() alone does the trick.

So the real race seems to be that the flags set in the completion
handler are not yet visible on the cpu core running the workqueue. And
because the worker is not rescheduled when aborted, the entry can just
wait forever.
Do you think this could make sense?
> I'm somewhat reluctant to change the order, because TX processing
> might relay on it (we first mark we wait for TX status and
> then mark entry is no longer owned by hardware).
OK, maybe it's just good luck that changing the order solves the rx
problem. Or can memory barriers associated with the spinlock in
rt2x00lib_dmadone() be responsible for that?
(I'm testing on a armv7 system, Cortex-A9 quadcore.)

While looking at it, why we double-clear ENTRY_OWNER_DEVICE_DATA in
rt2x00usb_interrupt_rxdone() directly and in rt2x00lib_dmadone() again,
while not doing the same for tx? Would it make more sense to possibly
set ENTRY_DATA_IO_FAILED before clearing ENTRY_OWNER_DEVICE_DATA in
rt2x00usb_interrupt_rxdone() as for tx?
>  However on RX
> side ENTRY_DATA_STATUS_PENDING bit make no sense as we do not
> wait for status. We should remove ENTRY_DATA_STATUS_PENDING on
> RX side and perhaps this also will solve issue you observe.
I agree that removing the unnecessary checks is a good idea in any case.
> Could you please check below patch, if it fixes the problem as well?
At least I could not trigger the problem within transferring 10GB of
data. But maybe the probability for triggering this bug is just lower
because ENTRY_OWNER_DEVICE_DATA is cleared some time before
ENTRY_DATA_STATUS_PENDING is set?

Soeren
>
> Stanislaw
>
> diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> index b6c1344..731e633 100644
> --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c
> @@ -360,8 +360,7 @@ static void rt2x00usb_work_rxdone(struct work_struct *work)
>  	while (!rt2x00queue_empty(rt2x00dev->rx)) {
>  		entry = rt2x00queue_get_entry(rt2x00dev->rx, Q_INDEX_DONE);
>  
> -		if (test_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags) ||
> -		    !test_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags))
> +		if (test_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags))
>  			break;
>  
>  		/*
> @@ -413,8 +412,7 @@ static bool rt2x00usb_kick_rx_entry(struct queue_entry *entry, void *data)
>  	struct queue_entry_priv_usb *entry_priv = entry->priv_data;
>  	int status;
>  
> -	if (test_and_set_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags) ||
> -	    test_bit(ENTRY_DATA_STATUS_PENDING, &entry->flags))
> +	if (test_and_set_bit(ENTRY_OWNER_DEVICE_DATA, &entry->flags))
>  		return false;
>  
>  	rt2x00lib_dmastart(entry);



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] rt2x00: fix rx queue hang
  2019-06-26 13:28       ` Soeren Moch
@ 2019-06-29  8:50         ` Stanislaw Gruszka
  2019-07-01 10:49           ` Soeren Moch
  0 siblings, 1 reply; 8+ messages in thread
From: Stanislaw Gruszka @ 2019-06-29  8:50 UTC (permalink / raw)
  To: Soeren Moch
  Cc: Helmut Schaa, Kalle Valo, David S. Miller, linux-wireless,
	netdev, linux-kernel, stable

Hello

On Wed, Jun 26, 2019 at 03:28:00PM +0200, Soeren Moch wrote:
> Hi Stanislaw,
> 
> the good news is: your patch below also solves the issue for me. But
> removing the ENTRY_DATA_STATUS_PENDING check in
> rt2x00usb_kick_rx_entry() alone does not help, while removing this check
> in rt2x00usb_work_rxdone() alone does the trick.
> 
> So the real race seems to be that the flags set in the completion
> handler are not yet visible on the cpu core running the workqueue. And
> because the worker is not rescheduled when aborted, the entry can just
> wait forever.
> Do you think this could make sense?

Yes.

> > I'm somewhat reluctant to change the order, because TX processing
> > might relay on it (we first mark we wait for TX status and
> > then mark entry is no longer owned by hardware).
> OK, maybe it's just good luck that changing the order solves the rx
> problem. Or can memory barriers associated with the spinlock in
> rt2x00lib_dmadone() be responsible for that?
> (I'm testing on a armv7 system, Cortex-A9 quadcore.)

I'm not sure, rt2x00queue_index_inc() also disable/enable interrupts,
so maybe that make race not reproducible. 

> While looking at it, why we double-clear ENTRY_OWNER_DEVICE_DATA in
> rt2x00usb_interrupt_rxdone() directly and in rt2x00lib_dmadone() again,

rt2x00lib_dmadone() is called also on other palaces (error paths)
when we have to clear flags.

> while not doing the same for tx? 

If I remember correctly we have some races on rx (not happened on tx)
that was solved by using test_and_clear_bit(ENTRY_OWNER_DEVICE_DATA).

> Would it make more sense to possibly
> set ENTRY_DATA_IO_FAILED before clearing ENTRY_OWNER_DEVICE_DATA in
> rt2x00usb_interrupt_rxdone() as for tx?

I don't think so, ENTRY_DATA_IO_FAILED should be only set on error
case.

> >  However on RX
> > side ENTRY_DATA_STATUS_PENDING bit make no sense as we do not
> > wait for status. We should remove ENTRY_DATA_STATUS_PENDING on
> > RX side and perhaps this also will solve issue you observe.
> I agree that removing the unnecessary checks is a good idea in any case.
> > Could you please check below patch, if it fixes the problem as well?
> At least I could not trigger the problem within transferring 10GB of
> data. But maybe the probability for triggering this bug is just lower
> because ENTRY_OWNER_DEVICE_DATA is cleared some time before
> ENTRY_DATA_STATUS_PENDING is set?

Not sure. Anyway, could you post patch removing not needed checks
with proper description/changelog ?

Stanislaw

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] rt2x00: fix rx queue hang
  2019-06-29  8:50         ` Stanislaw Gruszka
@ 2019-07-01 10:49           ` Soeren Moch
  2019-07-01 11:04             ` Stanislaw Gruszka
  0 siblings, 1 reply; 8+ messages in thread
From: Soeren Moch @ 2019-07-01 10:49 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: Helmut Schaa, Kalle Valo, David S. Miller, linux-wireless,
	netdev, linux-kernel, stable

Hello!

On 29.06.19 10:50, Stanislaw Gruszka wrote:
> Hello
>
> On Wed, Jun 26, 2019 at 03:28:00PM +0200, Soeren Moch wrote:
>> Hi Stanislaw,
>>
>> the good news is: your patch below also solves the issue for me. But
>> removing the ENTRY_DATA_STATUS_PENDING check in
>> rt2x00usb_kick_rx_entry() alone does not help, while removing this check
>> in rt2x00usb_work_rxdone() alone does the trick.
>>
>> So the real race seems to be that the flags set in the completion
>> handler are not yet visible on the cpu core running the workqueue. And
>> because the worker is not rescheduled when aborted, the entry can just
>> wait forever.
>> Do you think this could make sense?
> Yes.
>
>>> I'm somewhat reluctant to change the order, because TX processing
>>> might relay on it (we first mark we wait for TX status and
>>> then mark entry is no longer owned by hardware).
>> OK, maybe it's just good luck that changing the order solves the rx
>> problem. Or can memory barriers associated with the spinlock in
>> rt2x00lib_dmadone() be responsible for that?
>> (I'm testing on a armv7 system, Cortex-A9 quadcore.)
> I'm not sure, rt2x00queue_index_inc() also disable/enable interrupts,
> so maybe that make race not reproducible. 
I tested some more, the race is between setting ENTRY_DATA_IO_FAILED (if
needed) and enabling workqueue processing. This enabling was done via
ENTRY_DATA_STATUS_PENDING in my patch. So setting
ENTRY_DATA_STATUS_PENDING behind the spinlock in
rt2x00lib_dmadone()/rt2x00queue_index_inc() moved this very close to
setting of ENTRY_DATA_IO_FAILED (if needed). While still in the wrong
order, this made it very unlikely for the race to show up.
>
>> While looking at it, why we double-clear ENTRY_OWNER_DEVICE_DATA in
>> rt2x00usb_interrupt_rxdone() directly and in rt2x00lib_dmadone() again,
> rt2x00lib_dmadone() is called also on other palaces (error paths)
> when we have to clear flags.
Yes, but also clearing ENTRY_OWNER_DEVICE_DATA in
rt2x00usb_interrupt_rxdone() directly is not necessary and can lead to
the wrong processing order.
>>  while not doing the same for tx? 
> If I remember correctly we have some races on rx (not happened on tx)
> that was solved by using test_and_clear_bit(ENTRY_OWNER_DEVICE_DATA).
I searched in the history, it actually was the other way around. You
changed test_and_clear_bit() to test_bit() in the TX path. I think this
is also the right way to go in RX.
>> Would it make more sense to possibly
>> set ENTRY_DATA_IO_FAILED before clearing ENTRY_OWNER_DEVICE_DATA in
>> rt2x00usb_interrupt_rxdone() as for tx?
> I don't think so, ENTRY_DATA_IO_FAILED should be only set on error
> case.

Yes of course. But if the error occurs, it should be signalled before
enabling the workqueue processing, see the race described above.

After some more testing I'm convinced that this would be the right fix
for this problem. I will send a v2 of this patch accordingly.
>
>>>  However on RX
>>> side ENTRY_DATA_STATUS_PENDING bit make no sense as we do not
>>> wait for status. We should remove ENTRY_DATA_STATUS_PENDING on
>>> RX side and perhaps this also will solve issue you observe.
>> I agree that removing the unnecessary checks is a good idea in any case.
>>> Could you please check below patch, if it fixes the problem as well?
>> At least I could not trigger the problem within transferring 10GB of
>> data. But maybe the probability for triggering this bug is just lower
>> because ENTRY_OWNER_DEVICE_DATA is cleared some time before
>> ENTRY_DATA_STATUS_PENDING is set?
> Not sure. Anyway, could you post patch removing not needed checks
> with proper description/changelog ?
>
OK, I will do so.

Soeren


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] rt2x00: fix rx queue hang
  2019-07-01 10:49           ` Soeren Moch
@ 2019-07-01 11:04             ` Stanislaw Gruszka
  0 siblings, 0 replies; 8+ messages in thread
From: Stanislaw Gruszka @ 2019-07-01 11:04 UTC (permalink / raw)
  To: Soeren Moch
  Cc: Helmut Schaa, Kalle Valo, David S. Miller, linux-wireless,
	netdev, linux-kernel, stable

On Mon, Jul 01, 2019 at 12:49:50PM +0200, Soeren Moch wrote:
> Hello!
> 
> On 29.06.19 10:50, Stanislaw Gruszka wrote:
> > Hello
> >
> > On Wed, Jun 26, 2019 at 03:28:00PM +0200, Soeren Moch wrote:
> >> Hi Stanislaw,
> >>
> >> the good news is: your patch below also solves the issue for me. But
> >> removing the ENTRY_DATA_STATUS_PENDING check in
> >> rt2x00usb_kick_rx_entry() alone does not help, while removing this check
> >> in rt2x00usb_work_rxdone() alone does the trick.
> >>
> >> So the real race seems to be that the flags set in the completion
> >> handler are not yet visible on the cpu core running the workqueue. And
> >> because the worker is not rescheduled when aborted, the entry can just
> >> wait forever.
> >> Do you think this could make sense?
> > Yes.
> >
> >>> I'm somewhat reluctant to change the order, because TX processing
> >>> might relay on it (we first mark we wait for TX status and
> >>> then mark entry is no longer owned by hardware).
> >> OK, maybe it's just good luck that changing the order solves the rx
> >> problem. Or can memory barriers associated with the spinlock in
> >> rt2x00lib_dmadone() be responsible for that?
> >> (I'm testing on a armv7 system, Cortex-A9 quadcore.)
> > I'm not sure, rt2x00queue_index_inc() also disable/enable interrupts,
> > so maybe that make race not reproducible. 
> I tested some more, the race is between setting ENTRY_DATA_IO_FAILED (if
> needed) and enabling workqueue processing. This enabling was done via
> ENTRY_DATA_STATUS_PENDING in my patch. So setting
> ENTRY_DATA_STATUS_PENDING behind the spinlock in
> rt2x00lib_dmadone()/rt2x00queue_index_inc() moved this very close to
> setting of ENTRY_DATA_IO_FAILED (if needed). While still in the wrong
> order, this made it very unlikely for the race to show up.
> >
> >> While looking at it, why we double-clear ENTRY_OWNER_DEVICE_DATA in
> >> rt2x00usb_interrupt_rxdone() directly and in rt2x00lib_dmadone() again,
> > rt2x00lib_dmadone() is called also on other palaces (error paths)
> > when we have to clear flags.
> Yes, but also clearing ENTRY_OWNER_DEVICE_DATA in
> rt2x00usb_interrupt_rxdone() directly is not necessary and can lead to
> the wrong processing order.
> >>  while not doing the same for tx? 
> > If I remember correctly we have some races on rx (not happened on tx)
> > that was solved by using test_and_clear_bit(ENTRY_OWNER_DEVICE_DATA).
> I searched in the history, it actually was the other way around. You
> changed test_and_clear_bit() to test_bit() in the TX path. I think this
> is also the right way to go in RX.
> >> Would it make more sense to possibly
> >> set ENTRY_DATA_IO_FAILED before clearing ENTRY_OWNER_DEVICE_DATA in
> >> rt2x00usb_interrupt_rxdone() as for tx?
> > I don't think so, ENTRY_DATA_IO_FAILED should be only set on error
> > case.
> 
> Yes of course. But if the error occurs, it should be signalled before
> enabling the workqueue processing, see the race described above.
> 
> After some more testing I'm convinced that this would be the right fix
> for this problem. I will send a v2 of this patch accordingly.

Great, now I understand the problem. Thank you very much!

Stanislaw

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-07-01 11:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-17  9:46 [PATCH] rt2x00: fix rx queue hang Soeren Moch
2019-06-18  9:34 ` Stanislaw Gruszka
2019-06-21 11:30   ` Soeren Moch
2019-06-25  9:57     ` Stanislaw Gruszka
2019-06-26 13:28       ` Soeren Moch
2019-06-29  8:50         ` Stanislaw Gruszka
2019-07-01 10:49           ` Soeren Moch
2019-07-01 11:04             ` Stanislaw Gruszka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).