From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: xhci problem -> general protection fault
To: Andrzej Pietrasiewicz, "linux-usb@vger.kernel.org"
Cc: "kernel@collabora.com"
References: <65ac3a73-ca57-c3e8-561b-9ba5c15b3c65@collabora.com> <8230c2a2-719c-ef81-e85d-5921bf8e98e6@collabora.com>
From: Mathias Nyman
Message-ID: <133c123e-e857-7f83-d146-f39c00afe39f@linux.intel.com>
Date: Fri, 25 Sep 2020 16:40:29 +0300
In-Reply-To: <8230c2a2-719c-ef81-e85d-5921bf8e98e6@collabora.com>
X-Mailing-List: linux-usb@vger.kernel.org

On 18.9.2020 17.20, Andrzej Pietrasiewicz wrote:
> Hi Mathias,
>
> On 18.09.2020 at 12:50, Mathias Nyman wrote:
>> On 17.9.2020 18.30, Andrzej Pietrasiewicz wrote:
>>> Dear All,
>>>
>>> I have encountered a problem in xhci which leads to
>>> general protection fault.
>>>
>>> The problem is triggered by running this program:
>>>
>>> https://gitlab.collabora.com/andrzej.p/bulk-cancel.git
>>>
>>> $ sudo ./bulk-cancel -D /dev/bus/usb/002/006 -i 1 -b 1
>>>
>>> where /dev/bus/usb/002/006 is a Gadget Zero:
>>>
>>> It takes less than a minute until the crash happens.
>>> The DMAR (iommu) errors don't happen always, i.e. there are crashes
>>> when they are not reported in the system log. In either case the
>>>
>>> "WARN Cannot submit Set TR Deq Ptr"
>>> "A Set TR Deq Ptr command is pending."
>>> "WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state."
>>>
>>> messages do appear.
>>>
>>
>> Nice testcase and report, thanks.
>>
>> I started looking at issues in this area some time ago, and wrote a couple patches but
>> it was left hanging. The two patches (now rebased on 5.9-rc3) can be found in my tree in the
>> fix_invalid_context_at_stop_endpoint branch
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_invalid_context_at_stop_endpoint
>>
>> If you could give those a try and see if they help I'd be grateful.
>
> No, it doesn't help, albeit the errors are slightly different:
>
> xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
> xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
> dmar_fault: 44 callbacks suppressed
> DRHD: handling fault status reg 3
> DMAR: [DMA Write] Request device [00:14.0] PASID ffffffff fault addr ffcda000 [fault reason 05] PTE Write access is not set
> DMAR: DRHD: handling fault status reg 3

Ok, thanks, the DMA problems make sense to me now.

If a transfer ring stops on a transfer request (TRB) that should be canceled
(manual cancel, or error), it's not enough to just turn the TRB into a no-op.
HW has most likely cached the TRB, and we need to move the transfer ring
dequeue pointer past this TRB.
Moving deq also clears the controller's cache.

We do all this, but if we fail to queue the Set TR Deq command, the TRB (with
its DMA pointers) stays on the ring, and the controller will access the TRB's
DMA address once it continues running. At this point the xhci driver has
already given back the canceled/erroneous TRB, and its buffer is probably
unmapped already. Hence the DMAR entries.

Looks like this part of the code needs a more extensive rewrite. On top of
this, we are not handling races between endpoints halted due to errors and
endpoints stopped by the driver to cancel URBs.

-Mathias
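
The failure mode described above can be illustrated with a small user-space
toy model in C. This is only a sketch under stated assumptions, not the xhci
driver's code: every name in it (toy_ring, toy_trb, toy_queue_set_deq,
toy_cancel_current, toy_restart_faults) is hypothetical, the "controller
cache" is a single copied TRB, and the Set TR Deq command is assumed to
complete immediately when it can be queued. It shows why turning the TRB into
a no-op in memory does not help when the dequeue pointer cannot be moved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the real xhci driver. */
#define TOY_RING_SIZE 4

enum toy_trb_type { TOY_TRB_NORMAL, TOY_TRB_NOOP };

struct toy_trb {
	enum toy_trb_type type;
	size_t buf;			/* index of the data buffer this TRB points at */
};

struct toy_ring {
	struct toy_trb trbs[TOY_RING_SIZE];	/* ring contents in memory */
	bool buf_mapped[TOY_RING_SIZE];		/* is each buffer still DMA-mapped? */
	size_t hw_deq;				/* controller's dequeue index */
	struct toy_trb hw_cached;		/* controller's cached copy of the TRB at hw_deq */
	bool set_deq_pending;			/* an earlier Set TR Deq is still outstanding */
};

static void toy_ring_init(struct toy_ring *r)
{
	for (size_t i = 0; i < TOY_RING_SIZE; i++) {
		r->trbs[i].type = TOY_TRB_NORMAL;
		r->trbs[i].buf = i;
		r->buf_mapped[i] = true;
	}
	r->hw_deq = 0;
	r->hw_cached = r->trbs[0];	/* stopped ring has cached its current TRB */
	r->set_deq_pending = false;
}

/* Try to queue Set TR Deq; refused while another one is pending. */
static bool toy_queue_set_deq(struct toy_ring *r, size_t new_deq)
{
	if (r->set_deq_pending)
		return false;
	/* Assumption for brevity: the command completes at once. */
	r->hw_deq = new_deq % TOY_RING_SIZE;
	r->hw_cached = r->trbs[r->hw_deq];	/* moving deq reloads the cache */
	return true;
}

/* Cancel the TRB the stopped ring sits on; returns whether deq was moved. */
static bool toy_cancel_current(struct toy_ring *r)
{
	size_t cur = r->hw_deq;
	bool moved;

	r->trbs[cur].type = TOY_TRB_NOOP;	/* changes memory, not the cached copy */
	moved = toy_queue_set_deq(r, cur + 1);
	r->buf_mapped[r->trbs[cur].buf] = false;	/* given back and unmapped regardless */
	return moved;
}

/* Would the controller touch an unmapped buffer when the ring restarts? */
static bool toy_restart_faults(const struct toy_ring *r)
{
	return r->hw_cached.type == TOY_TRB_NORMAL &&
	       !r->buf_mapped[r->hw_cached.buf];
}
```

In this model, calling toy_cancel_current() on a freshly initialised ring
moves hw_deq to 1 and toy_restart_faults() reports no fault. Setting
set_deq_pending before the cancel makes toy_queue_set_deq() refuse the
command, the cached TRB still points at the now-unmapped buffer, and
toy_restart_faults() reports a fault, which corresponds to the DMAR entries
in the log.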