rt2800 tx frame dropping issue.

* rt2800 tx frame dropping issue.
@ 2018-11-20 21:20 Daniel Santos
  2018-11-23 19:45 ` Johannes Berg
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel Santos @ 2018-11-20 21:20 UTC (permalink / raw)
  To: Stanislaw Gruszka, linux-wireless

Hello,

I'm coming to this issue rather late and I realize there has been much
ado about it.  I want to follow up on a thread from 3 months ago
https://marc.info/?l=linux-wireless&m=153511575812945

> I get testing results from T-Bone user in bugzilla:
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
>
> And get this:
>
> [  781.644185] ieee80211 phy0: rt2x00queue_write_tx_frame: Error - Dropping frame due \
> to full tx queue 2 [  781.662620] 2 d1 s1 p1 ms1
>
> Looks like rt2x00 correctly stop queue in mac80211, but sill get skb's. 
>
> So we can do 3 things: increase queue size to 128, increase threshold
> to 16 and make the massage debug one instead of error (I checked
> some other drivers and looks most of them silently drop the frame
> in case queue is full). Especially removing the message can be
> useful since printk can somehow make mt7620 router unresponsive
> for some time resulting in connection drops.
>
> Thoughts ?
>
> Regards
> Stanislaw

I believe I have the answer as to why we're getting frames after we stop
the queue.  I had a little chat about this in #kernelnewbies and some
other devs believe it is intentional.

There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
releasing local->queue_stop_reason_lock and calling the driver's tx
until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in between
these two points the thread can be preempted.  So while we stop the
queue in one thread, there can be 20 other threads in between that have
already checked that the queue was running and have a frame buffer
sitting on their stacks.  I think it could be fixed with the below
patch, but I'm being told that it doesn't need it and that the driver
should just *quietly* drop the frame:

diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index d43dab4..29009e0 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1117,6 +1117,7 @@ struct ieee80211_local {
 	int q_stop_reasons[IEEE80211_MAX_QUEUES][IEEE80211_QUEUE_STOP_REASONS];
 	/* also used to protect ampdu_ac_queue and amdpu_ac_stop_refcnt */
 	spinlock_t queue_stop_reason_lock;
+	spinlock_t tx_path_lock;
 
 	int open_count;
 	int monitors, cooked_mntrs;
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 4f8eceb..2312ac9 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -609,6 +609,7 @@ struct ieee80211_hw *ieee80211_alloc_hw_nm(size_t priv_data_len,
 	spin_lock_init(&local->filter_lock);
 	spin_lock_init(&local->rx_path_lock);
 	spin_lock_init(&local->queue_stop_reason_lock);
+	spin_lock_init(&local->tx_path_lock);
 
 	INIT_LIST_HEAD(&local->chanctx_list);
 	mutex_init(&local->chanctx_mtx);
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 87d807f..8d43e2a 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1311,6 +1311,7 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
 		}
 #endif
 
+		spin_lock_bh(&local->tx_path_lock);
 		spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
 		if (local->queue_stop_reasons[q] ||
 		    (!txpending && !skb_queue_empty(&local->pending[q]))) {
@@ -1327,6 +1328,7 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
 					spin_unlock_irqrestore(
 						&local->queue_stop_reason_lock,
 						flags);
+					spin_unlock_bh(&local->tx_path_lock);
 					ieee80211_purge_tx_queue(&local->hw,
 								 skbs);
 					return true;
@@ -1347,6 +1349,7 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
 
 				spin_unlock_irqrestore(&local->queue_stop_reason_lock,
 						       flags);
+				spin_unlock_bh(&local->tx_path_lock);
 				return false;
 			}
 		}
@@ -1356,6 +1359,7 @@ static bool ieee80211_tx_frags(struct ieee80211_local *local,
 
 		__skb_unlink(skb, skbs);
 		ieee80211_drv_tx(local, vif, sta, skb);
+		spin_unlock_bh(&local->tx_path_lock);
 	}
 
 	return true;


Could anybody illuminate for me why this isn't done?  I know that there
has to be a point where we realize there are too many items in the queue
and we can't keep up, but this would seem to be a terrible way to do
that.  Is it one of those things where it isn't worth the performance
degradation?  Any insights would be most appreciated!! :)

Like Stanislaw, I have not been able to reproduce the problem for
testing, but somebody who has the problem (in production) is going to
get a build where the only change is to remove the rt2x00_err message. 
It makes sense that it would be the problem as you mentioned.  I suppose
the proper thing to do is to just quietly drop it and increment the
struct net_device stats.tx_dropped?

Thanks,
Daniel


^ permalink raw reply related	[flat|nested] 6+ messages in thread