linux-wireless.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ben Greear <greearb@candelatech.com>
To: Tejun Heo <htejun@gmail.com>
Cc: Johannes Berg <johannes@sipsolutions.net>,
	"Luis R. Rodriguez" <mcgrof@gmail.com>,
	linux-wireless@vger.kernel.org
Subject: Re: [PATCH] mac80211: Fix deadlock in ieee80211_do_stop.
Date: Fri, 10 Dec 2010 08:35:09 -0800	[thread overview]
Message-ID: <4D0256BD.5040302@candelatech.com> (raw)
In-Reply-To: <4D024307.2060108@gmail.com>

On 12/10/2010 07:11 AM, Tejun Heo wrote:
> Hello, Ben.
>
> On 12/09/2010 11:23 PM, Ben Greear wrote:
>> I saw a brief hang today, and did a sysrq-t, and then saw the timer
>> printout you added here.  But, I think that was caused by sysrq-t.
>> The system recovered and ran fine.
>
> It would be nice if you turn on printk timestamp.  How brief is brief?
> Can you please turn on printk timestamp?  @115200, the dump would have
> taken ~25 seconds so yes it was mostly caused by sysrq-t dump.  In the
> dump, iface_work is at the same position in R state.  It looks like
> the ifmgd->mtx.  Can you please confirm this with gdb?  This would
> only happen if the lock is highly contended.  Would this be the case
> Johaness?

I'm getting low on time before the holidays and have lots of other
bugs to go after, so not sure I'll get to this soon.

'Brief' was long enough for me to see 'sh' running at 100+ CPU for
a few refreshes in 'top', which is one of the symptoms of this bug
in my case (a bash script is calling 'ip' which is blocked on
rtnl, or some such thing).

>
>> The second time (after several hours of rebooting), the hang was worse
>> and the system ran OOM after maybe 30 seconds.  I did a sysrq-t then.
>>
>> I see quite a few printouts from your debug message, but all of them
>> after things start going OOM, and after sysrq-t.
>>
>> Here's the console capture:
>>
>> http://www.candelatech.com/~greearb/minicom_ath9k_log4.txt
>>
>> Let me know if you need more traces like this if I hit it again.
>
> I don't know the code very well but it looks very suspicious.  A task
> ends up trying to flush a work which can run for extended period of
> time during which memory is aggressively used for buffering (looks
> like skb's are piling up without any limit), which is likely to
> further slow down other stuff.  This sounds like an extremely fragile
> mechanism to me.  When the work is being constantly being rescheduled,
> cancel ends up waiting one fewer time then flush.  If the work is
> running and pending, flush waits for the pending one to finish, while
> cancel would kill the pending one and waits for only the current one
> to finish.  I think it could be that that difference is acting as a
> threshold between going bonkers and staying alive.
>
> Can you please test whether the following patch makes any difference?
> If flush_work() is misbehaving, the following wouldn't fix anything
> but if this livelock is indeed caused by iface_work running too long,
> the problem should go away.
>
> One way or the other, Johannes, please consider fixing the behavior
> here.  It's way too fragile.

Perhaps a related bug:  On module unload with lots of VIFS passing
traffic, I often see crashes due to what appears to be accessing
stale sdata pointers.  Perhaps the logic to disable rx of acks
and other skbs doesn't work properly in the STA shutdown path?

Thanks,
Ben

>
> Thanks.
>
> diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c
> index 7aa8559..86bdfdd 100644
> --- a/net/mac80211/iface.c
> +++ b/net/mac80211/iface.c
> @@ -723,6 +723,7 @@ static void ieee80211_iface_work(struct work_struct *work)
>   	struct sk_buff *skb;
>   	struct sta_info *sta;
>   	struct ieee80211_ra_tid *ra_tid;
> +	unsigned int cnt = 0;
>
>   	if (!ieee80211_sdata_running(sdata))
>   		return;
> @@ -825,6 +826,11 @@ static void ieee80211_iface_work(struct work_struct *work)
>   		}
>
>   		kfree_skb(skb);
> +
> +		if (++cnt>  100) {
> +			ieee80211_queue_work(&local->hw, work);
> +			break;
> +		}
>   	}
>
>   	/* then other type-dependent work */


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

  reply	other threads:[~2010-12-10 16:35 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-12 20:07 [PATCH] mac80211: Fix deadlock in ieee80211_do_stop greearb
2010-11-12 20:08 ` Luis R. Rodriguez
2010-11-12 20:16   ` Ben Greear
2010-11-12 20:49 ` Johannes Berg
2010-11-12 20:57   ` Ben Greear
2010-11-12 21:08     ` Johannes Berg
2010-11-12 21:51       ` Ben Greear
2010-11-13 10:34       ` Tejun Heo
2010-11-15 21:16         ` Ben Greear
2010-11-16 14:19           ` Tejun Heo
2010-11-16 16:51             ` Ben Greear
2010-11-17  8:55               ` Tejun Heo
2010-11-17 17:37                 ` Ben Greear
2010-11-16 17:40             ` Johannes Berg
2010-11-17  8:47               ` Tejun Heo
2010-11-17 18:53                 ` Johannes Berg
2010-11-17 18:59                   ` Ben Greear
2010-11-17 19:03                     ` Johannes Berg
2010-11-18  6:34                   ` Tejun Heo
2010-11-18  7:07                     ` Johannes Berg
2010-11-18  7:22                       ` Tejun Heo
2010-11-18 16:59                         ` Johannes Berg
2010-11-19 14:34                           ` Tejun Heo
2010-11-19 17:57                             ` Johannes Berg
2010-11-19 20:55                               ` Ben Greear
2010-11-19 22:27                                 ` Luis R. Rodriguez
2010-12-08 17:36                                   ` Ben Greear
2010-12-08 18:19                                     ` Ben Greear
2010-12-08 18:28                                       ` Ben Greear
2010-12-09 14:34                                         ` Tejun Heo
2010-12-09 14:42                                           ` Johannes Berg
2010-12-09 14:46                                             ` Tejun Heo
2010-12-09 16:17                                               ` Tejun Heo
     [not found]                                                 ` <4D0156F6.4000306@candelate ch.com>
2010-12-09 17:27                                                 ` Ben Greear
2010-12-09 22:23                                                 ` Ben Greear
2010-12-10 15:11                                                   ` Tejun Heo
2010-12-10 16:35                                                     ` Ben Greear [this message]
2010-11-18 17:55                         ` Ben Greear
2010-11-18 18:04                           ` Tejun Heo
2010-11-18 18:11                             ` Ben Greear
2010-11-17 20:13             ` Ben Greear

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4D0256BD.5040302@candelatech.com \
    --to=greearb@candelatech.com \
    --cc=htejun@gmail.com \
    --cc=johannes@sipsolutions.net \
    --cc=linux-wireless@vger.kernel.org \
    --cc=mcgrof@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).