From: Ben Greear <greearb@candelatech.com>
To: Michal Kazior <michal.kazior@tieto.com>
Cc: ath10k <ath10k@lists.infradead.org>
Subject: Re: Deadlock on (faked) firmware crash, CUS239, modified 10.4.3 firmware.
Date: Thu, 31 Mar 2016 12:16:49 -0700
Message-ID: <56FD77A1.2010807@candelatech.com>
In-Reply-To: <CA+BoTQ=ZPA87-5iK7YtxQSAW-SeXaaBU1j+9gVMC-5c7cpTwrA@mail.gmail.com>

On 03/30/2016 11:32 PM, Michal Kazior wrote:
> On 31 March 2016 at 00:28, Ben Greear <greearb@candelatech.com> wrote:
>>
>>> Hmm.. If it still reproduces can you try the following diff?
>>>
>>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>>> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>>                   list_del_init(&artxq->list);
>>>                   if (ret != -ENOENT)
>>>                           list_add_tail(&artxq->list, &ar->txqs);
>>> +               else if (artxq == last)
>>> +                       last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
>>>
>>>                   ath10k_htt_tx_txq_update(hw, txq);
>>
>>
>> Ok, I added this code, and can still reproduce the problem.
>>
>> Firmware is crashing multiple times a minute in this machine in its
>> current configuration.  Right before it hung, firmware crashed and
>> was restarted, and then I get the hang notification.
>>
>> I don't see any obvious bail-out in the tx_push_pending logic
>> if the firmware crashes?
>
> There's no explicit bail-out, yes. It should bail out if
> ath10k_mac_tx_push_txq() fails though (except -ENOENT, which is
> treated slightly differently but should result in bail-out eventually
> as well as ar->txqs will drain until it's empty).
>
> HTT-tx doesn't check for FW crash but it should be ultimately limited
> by either CE ring size and HTT's num-pending-tx (both should not be
> replenished as FW crashed and interrupts should not come in anymore).
> Whichever the case a <0 retval should result in a bailout.

I tried adding a check for a FW crash yesterday, but that did not help.

Today, I added a limit of 2000 loops.  I see that limit get hit, and then the
kernel crashes.  Maybe my patch is wrong.

I've tried to apply (almost) every patch in linux.ath related to ath10k,
including a few from the mailing list that have not been applied yet.

My push-pending method now looks like this:

void ath10k_mac_tx_push_pending(struct ath10k *ar)
{
	struct ieee80211_hw *hw = ar->hw;
	struct ieee80211_txq *txq;
	struct ath10k_txq *artxq;
	struct ath10k_txq *last;
	int ret;
	int max;
	int loop_max = 2000;

	spin_lock_bh(&ar->txqs_lock);
	rcu_read_lock();

	last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
	while (!list_empty(&ar->txqs)) {
		artxq = list_first_entry(&ar->txqs, struct ath10k_txq, list);
		txq = container_of((void *)artxq, struct ieee80211_txq,
				   drv_priv);

		if (--loop_max == 0) {
			ath10k_err(ar, "Looped 2000 times in tx_push_pending, bailing out.\n");
			break;
		}
		
		/* Prevent aggressive sta/tid taking over tx queue */
		max = 16;
		ret = 0;
		while (ath10k_mac_tx_can_push(hw, txq) && max--) {
			ret = ath10k_mac_tx_push_txq(hw, txq);
			if (ret < 0)
				break;
		}

		list_del_init(&artxq->list);
		if (ret != -ENOENT)
			list_add_tail(&artxq->list, &ar->txqs);
		else if (artxq == last)
			last = list_last_entry(&ar->txqs, struct ath10k_txq, list);

		ath10k_htt_tx_txq_update(hw, txq);

		if (artxq == last || (ret < 0 && ret != -ENOENT))
			break;
	}

	rcu_read_unlock();
	spin_unlock_bh(&ar->txqs_lock);
}
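
To make the termination conditions of the loop above easier to reason about,
here is a minimal userspace sketch of the same control flow. It is only a
model under stated assumptions, not the driver code: struct node,
struct model_txq, model_push() and model_push_pending() are hypothetical
names, the ath10k_mac_tx_can_push() gate is left out, and -ENOSPC stands in
for "any error other than -ENOENT" (for example, no device credits left
because nothing completes after a firmware crash).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's circular, doubly linked list_head. */
struct node {
	struct node *prev, *next;
};

static void node_init(struct node *n)
{
	n->prev = n->next = n;
}

static int node_list_empty(const struct node *head)
{
	return head->next == head;
}

static void node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Hypothetical queue; 'list' is the first member so a node casts back. */
struct model_txq {
	struct node list;
	int pending;	/* frames still waiting */
	int sent;	/* frames handed to the (modeled) device */
};

/* Hypothetical push: -ENOENT if the queue is empty, -ENOSPC if the
 * modeled device has no credits left. */
static int model_push(struct model_txq *q, int *credits)
{
	assert(q->pending >= 0 && *credits >= 0);
	if (q->pending == 0)
		return -ENOENT;
	if (*credits == 0)
		return -ENOSPC;
	q->pending--;
	q->sent++;
	(*credits)--;
	return 0;
}

/* Same control flow as the function above; returns the queues visited. */
static int model_push_pending(struct node *txqs, int *credits, int loop_max)
{
	struct node *last = txqs->prev;
	int visited = 0;

	while (!node_list_empty(txqs)) {
		struct model_txq *q = (struct model_txq *)txqs->next;
		int max = 16;	/* per-visit budget */
		int ret = 0;

		if (--loop_max == 0)
			break;
		visited++;

		while (max-- > 0) {
			ret = model_push(q, credits);
			if (ret < 0)
				break;
		}

		/* Rotate to the tail unless the queue ran empty. */
		node_del_init(&q->list);
		if (ret != -ENOENT)
			node_add_tail(&q->list, txqs);
		else if (&q->list == last)
			last = txqs->prev;

		if (&q->list == last || (ret < 0 && ret != -ENOENT))
			break;
	}
	return visited;
}
```

In this model, three queues holding 40, 0 and 5 frames with plenty of credits
are handled in four visits: the two queues that run empty drop off the list,
and the busy queue is visited a second time because it became the new "last".
With zero credits the first push fails with a non-ENOENT error and the loop
stops after one visit, which is the bail-out Michal describes.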

The crash I get is this:


ath10k_pci 0000:05:00.0: firmware crashed! (uuid 2a118708-977d-43d6-8d40-079ddec99eb3)
ath10k_pci 0000:05:00.0: firmware register dump:
ath10k_pci 0000:05:00.0: [00]: 0x00000009 0x000015B3 0x0099E4B6 0x00955B31
ath10k_pci 0000:05:00.0: [04]: 0x0099E4B6 0x00060130 0x00000005 0x00000016
ath10k_pci 0000:05:00.0: [08]: 0x00455030 0x004402B0 0x004060F0 0x00000007
ath10k_pci 0000:05:00.0: [12]: 0x00000009 0x00000000 0x009533D0 0x009533DF
ath10k_pci 0000:05:00.0: [16]: 0x00953438 0x0A00286E 0x009406B6 0x00000000
ath10k_pci 0000:05:00.0: [20]: 0x4099E4B6 0x00405FEC 0x000000BE 0x00955A00
ath10k_pci 0000:05:00.0: [24]: 0x8099E680 0x0040604C 0x00000000 0xC099E4B6
ath10k_pci 0000:05:00.0: [28]: 0x80986D5F 0x004060AC 0x00423A14 0x004060F0
ath10k_pci 0000:05:00.0: [32]: 0x80984E51 0x004060CC 0x00423A14 0x004060F0
ath10k_pci 0000:05:00.0: [36]: 0x80985CBF 0x004060EC 0x00424654 0x004402B0
ath10k_pci 0000:05:00.0: [40]: 0x809CAE6A 0x0040615C 0x004402B0 0x00424654
ath10k_pci 0000:05:00.0: [44]: 0x80984EBC 0x0040618C 0x004402B0 0x0040623C
ath10k_pci 0000:05:00.0: [48]: 0x809CB3CC 0x0040623C 0x004402B0 0x00411988
ath10k_pci 0000:05:00.0: [52]: 0x80984DE0 0x0040626C 0x00424654 0x004402B0
ath10k_pci 0000:05:00.0: [56]: 0x809CCE08 0x0040635C 0x00424654 0x00423234
ath10k_pci 0000:05:00.0: ath10k_pci ATH10K_DBG_BUFFER:
ath10k: [0000]: 0001854A 17FC4C01 71108880 00050000 00C400BF 000000FF FBFFFFFF 0001854E
ath10k: [0008]: 07FC4C02 00000004 0001854F 0060581D 0001854F 17FC4C01 0F00851C 0000000A
ath10k: [0016]: 06003007 0000FFAA FFFFFFFF 0001854F 17FC4C01 71108880 00000000 00C400BF
ath10k: [0024]: 00000000 00000FF0 0001854F 17FC4C01 71108880 00010000 00C400BF 00000000
ath10k: [0032]: FFFFFFFF 0001854F 17FC4C01 71108880 00020000 00C400BF 00000000 FFFFFFFF
ath10k: [0040]: 0001854F 17FC4C01 71108880 00030000 00C400BF 000000FF FFFFFFFF 0001854F
ath10k: [0048]: 17FC4C01 71108880 00040000 00C400BF 000000FF FFFFFFFF 0001854F 17FC4C01
ath10k: [0056]: 71108880 00050000 00C400BF 000000FF FBFFFFFF 00018550 0060581D 00018550
ath10k: [0064]: 0860581B 0000851C 00000000 00018550 0060581D 00018550 07FC4C02 00000004
ath10k: [0072]: 00018551 0060581D 00018551 17FC4C01 0F00851C 0000000A 06003007 0000FFAA
ath10k: [0080]: FFFFFFFF 00018551 17FC4C01 71108880 00000000 00C400BF 00000000 00000FF0
ath10k: [0088]: 00018551 17FC4C01 71108880 00010000 00C400BF 00000000 FFFFFFFF 00018551
ath10k: [0096]: 17FC4C01 71108880 00020000 00C400BF 00000000 FFFFFFFF 00018551 17FC4C01
ath10k: [0104]: 71108880 00030000 00C400BF 000000FF FFFFFFFF 00018551 17FC4C01 71108880
ath10k: [0112]: 00040000 00C400BF 000000FF FFFFFFFF 00018551 17FC4C01 71108880 00050000
ath10k: [0120]: 00C400BF 000000FF FBFFFFFF 00018551 14605853 51100001 000F0DE4 00000400
ath10k: [0128]: 00000056 00440380 00018551 0060581D 00018551 0460581C 00000001 00018551
ath10k: [0136]: 0060581D 00018551 07FC4C02 00000004 00018552 0060581D 00018552 17FC4C01
ath10k: [0144]: 0F00851C 0000000A 06003007 0000FFAA FFFFFFFF 00018553 17FC4C01 71108880
ath10k: [0152]: 00000000 00C400BF 00000000 00000FF0 00018553 17FC4C01 71108880 00010000
ath10k: [0160]: 00C400BF 00000000 FFFFFFFF 00018553 17FC4C01 71108880 00020000 00C400BF
ath10k: [0168]: 00000000 FFFFFFFF 00018553 17FC4C01 71108880 00030000 00C400BF 000000FF
ath10k: [0176]: FFFFFFFF 00018553 17FC4C01 71108880 00040000 00C400BF 000000FF FFFFFFFF
ath10k: [0184]: 00018553 17FC4C01 71108880 00050000 00C400BF 000000FF FBFFFFFF 00018553
ath10k: [0192]: 07FC4C02 00000001 00018553 07FC4C02 00000001 00018553 0BFC5826 000005E9
ath10k: [0200]: 00000003 00018554 0BFC5822 0000C01D 00000406 00018578 08383812 000F45C4
ath10k: [0208]: 00424654 00018578 10383809 0000143C 00000001 00000000 00000000 0001857B
ath10k: [0216]: 14385853 51100001 000F0D9C 000003FC 00000057 004402B0 0001857B 14385853
ath10k: [0224]: 51100001 000F0D54 000003FE 00000058 004402B0 0001857B 07FC5830 00000008
ath10k: [0232]: 0001857B 14385854 51100002 000F0D54 00000061 00000057 004402B0 0001857B
ath10k: [0240]: 14385851 91107001 00424654 004402B0 00000008 00000006 0001857B 17FC5855
ath10k: [0248]: 91108001 00000000 00000000 00000007 000000BE 0001857B 0FFC5855 91108002
ath10k: [0256]: 004402B0 00000010 0001857B 17FC0001 0099E4B6 000015B3 000015B3 00405EDC
ath10k: [0264]: 00000009
ath10k_pci 0000:05:00.0: ATH10K_END
sta13: drv-set-bitrate-mask had error return: -108
rdev-set-bitrate-mask failed: -108
wlan3: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
ath10k_pci 0000:05:00.0: Looped 2000 times in tx_push_pending, bailing out.
sta22: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta0: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta1: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta2: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
ath10k_pci 0000:05:00.0: Looped 2000 times in tx_push_pending, bailing out.
sta3: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
BUG: unable to handle kernel paging request at 0000000000001000
IP: [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
PGD 0
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 8021q garp mrp stp llc bnep bluetooth fuse macvlan wanlink(O) pktgen 
rpcsec_gss_krb5 nfsv4 nfs fscache iTCO_wdt iTCO_vendor_support ath9k ath10k_pci coretemp ath9k_common hwmon intel_rapl ath10k_core iosf_mbi ath9k_hw 
x86_pkg_temp_thermal intel_powerclamp kvm_intel ath joydev kvm mac80211 irqbypass serio_raw pcspkr cfg80211 i2c_i801 lpc_ich snd_hda_codec_hdmi 
snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 8250_fintek snd_timer snd shpchp 
soundcore tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ata_generic i915 pata_acpi i2c_algo_bit drm_kms_helper e1000e ptp pps_core drm i2c_core video 
fjes ipv6 [last unloaded: nf_conntrack]
CPU: 2 PID: 581 Comm: kworker/u8:4 Tainted: G        W  O    4.4.6+ #21
Hardware name: To be filled by O.E.M. To be filled by O.E.M./HURONRIVER, BIOS 4.6.5 05/02/2012
Workqueue: phy2 ieee80211_iface_work [mac80211]
task: ffff8800d9c90000 ti: ffff880213fd0000 task.ti: ffff880213fd0000
RIP: 0010:[<ffffffffa08e9810>]  [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
RSP: 0018:ffff88021eb03c28  EFLAGS: 00010296
RAX: ffff8800cbfd7000 RBX: ffff8800cbfd5060 RCX: ffff8800cbfd1000
RDX: 0000000000001000 RSI: 00000000d9c90805 RDI: ffff8800cbfd5000
RBP: ffff88021eb03c28 R08: 0000000000000001 R09: 0000000000000000
R10: ffff88021eb03ba8 R11: ffff8800cbfd5030 R12: ffff8800cbfd5060
R13: ffff880214a34902 R14: ffff8800cbfd5018 R15: ffff88021350e1b0
FS:  0000000000000000(0000) GS:ffff88021eb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001000 CR3: 0000000001c0a000 CR4: 00000000000406e0
Stack:
  ffff88021eb03c68 ffffffffa08e985a ffff880214a30a60 ffff880214a35600
  ffff8800cbfd5060 ffff880214a349e0 ffff880214a35430 ffff88021350e1b0
  ffff88021eb03cb8 ffffffffa0ec2bb4 ffff880214a30a60 0000000014a30a60
Call Trace:
  <IRQ>
  [<ffffffffa08e985a>] ieee80211_tx_dequeue+0x41/0xfe [mac80211]
  [<ffffffffa0ec2bb4>] ath10k_mac_tx_push_txq+0x6a/0x13b [ath10k_core]
  [<ffffffffa0ec2ddb>] ath10k_mac_tx_push_pending+0x156/0x16b [ath10k_core]
  [<ffffffffa0ed123d>] ath10k_htt_t2h_msg_handler+0x7d9/0x886 [ath10k_core]
  [<ffffffff816f9f9a>] ? _raw_spin_unlock_bh+0x30/0x33
  [<ffffffffa0fca532>] ? ath10k_pci_hif_send_complete_check+0x5d/0x5d [ath10k_pci]
  [<ffffffffa0fca557>] ath10k_pci_htt_rx_deliver+0x25/0x2a [ath10k_pci]
  [<ffffffffa0fcbb51>] ath10k_pci_process_rx_cb+0x191/0x1c9 [ath10k_pci]
  [<ffffffff810f23ad>] ? __local_bh_enable_ip+0xa4/0xb9
  [<ffffffff816f9f9a>] ? _raw_spin_unlock_bh+0x30/0x33
  [<ffffffffa0fcbbbf>] ath10k_pci_htt_rx_cb+0x24/0x27 [ath10k_pci]
  [<ffffffffa0fce1be>] ath10k_ce_per_engine_service+0x64/0xa0 [ath10k_pci]
  [<ffffffffa0fce260>] ath10k_ce_per_engine_service_any+0x66/0x74 [ath10k_pci]
  [<ffffffffa0fcc4b3>] ath10k_pci_tasklet+0x3a/0x4e [ath10k_pci]
  [<ffffffff810f29e0>] tasklet_action+0xc0/0xcf
  [<ffffffff810f1ff6>] __do_softirq+0x1a4/0x407
  [<ffffffff810f2462>] irq_exit+0x40/0x94
  [<ffffffff810134a2>] do_IRQ+0xd5/0xed
  [<ffffffff816fb24c>] common_interrupt+0x8c/0x8c
  <EOI>
  [<ffffffff81129d49>] ? arch_local_irq_restore+0x6/0xd
  [<ffffffff816f8a3a>] __mutex_unlock_slowpath+0x120/0x137
  [<ffffffff816f8a5a>] mutex_unlock+0x9/0xb
  [<ffffffffa0ebcc38>] ath10k_conf_tx+0x3a9/0x3bb [ath10k_core]
  [<ffffffffa08c2b48>] drv_conf_tx+0x140/0x202 [mac80211]
  [<ffffffffa08f3072>] ieee80211_set_wmm_default+0x1fb/0x24a [mac80211]
  [<ffffffffa0908bc5>] ieee80211_set_disassoc+0x248/0x31f [mac80211]
  [<ffffffffa0908ccf>] ieee80211_sta_connection_lost+0x33/0x69 [mac80211]
  [<ffffffffa090bb8f>] ieee80211_sta_work+0x5fc/0xda9 [mac80211]
  [<ffffffff8112d30b>] ? mark_held_locks+0x5e/0x74
  [<ffffffff8112d490>] ? trace_hardirqs_on_caller+0x16f/0x18b
  [<ffffffff816fa024>] ? _raw_spin_unlock_irqrestore+0x48/0x5d
  [<ffffffffa08d54bd>] ieee80211_iface_work+0x335/0x34e [mac80211]
  [<ffffffff8110471a>] process_one_work+0x260/0x4db
  [<ffffffff81104e50>] worker_thread+0x1e9/0x29b
  [<ffffffff81104c67>] ? rescuer_thread+0x2a8/0x2a8
  [<ffffffff81104c67>] ? rescuer_thread+0x2a8/0x2a8
  [<ffffffff81109bfb>] kthread+0xcf/0xd7
  [<ffffffff81109b2c>] ? kthread_parkme+0x1f/0x1f
  [<ffffffff816faaef>] ret_from_fork+0x3f/0x70
  [<ffffffff81109b2c>] ? kthread_parkme+0x1f/0x1f
Code: 55 48 89 e5 48 39 c7 74 27 48 85 c0 74 24 ff 4f 10 48 8b 08 48 8b 50 08 48 c7 00 00 00 00 00 48 c7 40 08 00 00 00 00 48 89 51 08 <48> 89 0a eb 02 31 c0 5d c3 55 48 89 e5 41 57 41 56 4c 8d 76 b8
RIP  [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
  RSP <ffff88021eb03c28>
CR2: 0000000000001000
---[ end trace eb4cdb33d766b5f3 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..
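
A note on reading the oops, offered as a sketch and not a diagnosis: if the
Code: bytes are decoded correctly, the marked instruction <48> 89 0a is
"mov %rcx,(%rdx)", and RDX is 0x1000, which matches CR2.  "Oops: 0002" is a
write to a not-present page.  The preceding instructions look like the usual
doubly linked unlink that __skb_dequeue() performs, so the faulting store
would be "prev->next = next" with the dequeued skb's prev pointer holding
0x1000.  The userspace model below mirrors that sequence; struct model_skb,
struct model_skb_head and the model_* functions are hypothetical names, and
the register comments are my reading of the dump.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for sk_buff / sk_buff_head list linkage. */
struct model_skb {
	struct model_skb *next, *prev;
};

struct model_skb_head {
	struct model_skb anchor;	/* list head, linked like an skb */
	unsigned int qlen;
};

static void model_queue_init(struct model_skb_head *h)
{
	h->anchor.next = h->anchor.prev = &h->anchor;
	h->qlen = 0;
}

static void model_queue_tail(struct model_skb_head *h, struct model_skb *skb)
{
	assert(skb != NULL);
	skb->next = &h->anchor;
	skb->prev = h->anchor.prev;
	h->anchor.prev->next = skb;
	h->anchor.prev = skb;
	h->qlen++;
}

static struct model_skb *model_dequeue(struct model_skb_head *h)
{
	struct model_skb *skb = h->anchor.next;
	struct model_skb *next, *prev;

	if (skb == &h->anchor)
		return NULL;		/* queue is empty */

	h->qlen--;			/* decl 0x10(%rdi) */
	next = skb->next;		/* mov (%rax),%rcx */
	prev = skb->prev;		/* mov 0x8(%rax),%rdx */
	skb->next = skb->prev = NULL;
	next->prev = prev;		/* mov %rdx,0x8(%rcx) */
	prev->next = next;		/* mov %rcx,(%rdx): faulting store */
	return skb;
}
```

On a well-formed queue the last store is harmless.  It can only fault if the
skb's prev pointer is garbage, which would suggest the queue was corrupted or
the memory behind it had already been freed or reused; the dump alone does
not say which.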

Thanks,
Ben

>
>
> Michał
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


