From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751289AbdFEPy1 (ORCPT ); Mon, 5 Jun 2017 11:54:27 -0400 Received: from smtp.codeaurora.org ([198.145.29.96]:43834 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751153AbdFEPyZ (ORCPT ); Mon, 5 Jun 2017 11:54:25 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org 9EDDB6029B Authentication-Results: pdx-caf-mail.web.codeaurora.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: pdx-caf-mail.web.codeaurora.org; spf=none smtp.mailfrom=kvalo@codeaurora.org From: Kalle Valo To: Brian Norris Cc: Ganapathi Bhat , Nishant Sarmukadam , linux-kernel@vger.kernel.org, Dmitry Torokhov , Amitkumar Karwar , linux-wireless@vger.kernel.org, Johannes Berg Subject: Re: [PATCH 05/14] mwifiex: re-register wiphy across reset References: <20170525001119.64791-1-briannorris@chromium.org> <20170525001119.64791-5-briannorris@chromium.org> <87fufk2hmm.fsf@purkki.adurom.net> <20170601173954.GA138807@google.com> Date: Mon, 05 Jun 2017 18:54:18 +0300 In-Reply-To: <20170601173954.GA138807@google.com> (Brian Norris's message of "Thu, 1 Jun 2017 10:39:56 -0700") Message-ID: <87inka77md.fsf@codeaurora.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Brian Norris writes: > On Thu, Jun 01, 2017 at 12:15:45PM +0300, Kalle Valo wrote: >> Brian Norris writes: >> >> > In general, it's helpful to use the same code for device removal as for >> > device reset, as this tends to have fewer bugs. Let's move the wiphy >> > unregistration code into the common reset and removal code. >> > >> > In particular, it's very hard to properly handle the reset sequence when >> > something fails. Currently, if mwifiex_reinit_sw() fails, we've failed >> > to unregister the associated wiphy, and so running something as simple >> > as "iw phy" can trigger an OOPS, as the wiphy still has hooks back into >> > freed mwifiex data structures. For example, KASAN complained: >> > >> > [... see reset fail for other reasons ...] >> > [ 1184.821158] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 174948 bytes >> > [ 1186.870914] mwifiex_pcie 0000:01:00.0: info: FW download over, size 608396 bytes >> > [ 1187.685990] mwifiex_pcie 0000:01:00.0: WLAN FW is active >> > [ 1187.692673] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -512 >> > [ 1187.699075] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device >> > [ 1187.713476] mwifiex: Failed to bring up adapter: -5 >> > [ 1187.718644] mwifiex_pcie 0000:01:00.0: reinit failed: -5 >> > >> > [... run `iw phy` ...] >> > [ 1212.902419] ================================================================== >> > [ 1212.909806] BUG: KASAN: use-after-free in mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] at addr ffffffc0ad1a8028 >> > [ 1212.920246] Read of size 1 by task iw/3127 >> > [...] >> > [ 1212.934946] page dumped because: kasan: bad access detected >> > [...] >> > [ 1212.950665] Call trace: >> > [ 1212.953148] [] dump_backtrace+0x0/0x190 >> > [ 1212.958572] [] show_stack+0x20/0x28 >> > [ 1212.963648] [] dump_stack+0xa4/0xcc >> > [ 1212.968723] [] kasan_report+0x378/0x500 >> > [ 1212.974140] [] __asan_load1+0x44/0x4c >> > [ 1212.979462] [] mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] >> > [ 1212.987131] [] nl80211_send_wiphy+0x75c/0x2de0 [cfg80211] >> > [ 1212.994246] [] nl80211_dump_wiphy+0x32c/0x438 [cfg80211] >> > [ 1213.001149] [] genl_lock_dumpit+0x48/0x64 >> > [ 1213.006746] [] netlink_dump+0x178/0x398 >> > [ 1213.012171] [] __netlink_dump_start+0x1bc/0x260 >> > [...] >> > >> > This all goes away if we just tear down the wiphy on the way down, and >> > set it back up if/when we bring the device back up. >> >> I don't know exactly what kind of reset this is about, > > Marvell firmwares are known to be quite buggy, and there are plenty of > situations in which they crash (often resulting in a command timeout). That is a common problem with firmwares :/ > The current best workaround for these is to essentially unwind the > whole driver, reset the card, and reprobe the whole thing. See > anywhere that the ->card_reset() callback is called. > > This has been around for a long time on SDIO, and you recently merged my > changes to enable this for PCIe: > > 6d7d579a8243 mwifiex: pcie: add card_reset() support > >> but isn't this a >> quite intrusive solution? For example, phy name changes because of this? > > Yes, it is a bit intrusive. But the whole process is intrusive, as it > deletes all the virtual interfaces and loses your settings. This all > relies on user space being prepared to clean up and reinitialize > everything afterward. > > And yes, this causes a phy name change. > > In favor: this is what the SDIO reset code *used* to do, before this > commit: > > c742e623e941 mwifiex: sdio card reset enhancement > > where the SDIO driver started using the half-baked reset solution > written for PCIe. > > Lastly, I still need to analyze a few more things in this driver, but > AFAICT, if we *don't* unregister the wiphy, we are exposed to quite a > few more race conditions -- not just the easy-to-notice condition > described above. What happens if the wiphy still processes cfg80211 > operations while we're still resetting the firmware? Much of the driver > may not be prepared for this. At the moment, I can't find anything > terribly wrong; if I slow down the reset (e.g., with msleep()s) I can > just trigger complaints about "cmd node not available" or "card is > removed", but I haven't yet found a true bug. > > That's not to say that there aren't such bugs out there. I'd still be > willing to bet there are. And IMO, it seems wise to just do the same > teardown/setup as one would do for (e.g.) 'rmmod', to prevent exposing > *too* many new permutations of "wiphy is available but rest of the > driver is torn down". This feels like a sledge hammer approach causing all sort of problems for user space and I really like the mac80211 approach more. For example, if an ath10k firmware crash happens user only sees a few second pause in data traffic and a warning in kernel log, otherwise everything happens behind the scenes. Of course there are very likely races somewhere but at least I haven't seen that many reports related to firmware restart functionality. > But if none of this is convincing to you, I can take a stab at a > different solution. I don't have any problem applying this patch but more about being curious why doing it like this. And hopefully finding a less intrusive solution in the future. > BTW, since you're taking an interest in this code now, can I trouble you > with a question? Looking at mwifiex_uninit_sw() (after this patchset), > you can see a loop like this: > > /* Stop data */ > for (i = 0; i < adapter->priv_num; i++) { > priv = adapter->priv[i]; > if (priv && priv->netdev) { > mwifiex_stop_net_dev_queue(priv->netdev, adapter); > if (netif_carrier_ok(priv->netdev)) > netif_carrier_off(priv->netdev); > netif_device_detach(priv->netdev); > } > } > > That seems to be the only attempt to prevent user space from talking to > the device while we proceed to shut down (mwifiex_shutdown_drv()). AIUI, > that's wholly insufficient, and we need to actually stop all the virtual > interfaces (and possibly the wiphy as well) first. I'm looking at trying > to move the mwifiex_del_virtual_intf() loop up much further in this > function (but there are other bugs preventing me from doing that yet). > > Does that sound like the right approach to you? I'm kinda figuring this > should better mimic the mac80211 ieee80211_remove_interfaces() > structure. Johannes is much better person to answer this (CCed). -- Kalle Valo