From: Russell King - ARM Linux admin
Subject: Re: Unstable Kernel behavior on an ARM based board
Date: Tue, 5 Mar 2019 15:31:11 +0000
Message-ID: <20190305153111.frvgnoba646uf5ar@shell.armlinux.org.uk>
References: <20190305112226.rhbl3dwopmip45ja@shell.armlinux.org.uk> <20190305115730.GE26369@ulmo> <20190305132350.pzfmz4yvh6ujgotn@shell.armlinux.org.uk> <20190305142351.ktciqkj5kycdwilr@shell.armlinux.org.uk> <20190305145853.qsmcaop4ths2jgrl@shell.armlinux.org.uk>
To: Embedded Engineer
Cc: Andrew Lunn, Vladimir Murzin, Jon Hunter, Thierry Reding, linux-tegra@vger.kernel.org, linux-arm-kernel@lists.infradead.org
List-Id: linux-tegra@vger.kernel.org

On Tue, Mar 05, 2019 at 08:11:22PM +0500, Embedded Engineer wrote:
> On Tue, Mar 5, 2019 at 7:58 PM Russell King - ARM Linux admin wrote:
> >
> > Should've been pool->allocation. Sorry about that.
>
> No problems, here are the new logs:
>
> https://pastebin.com/dfey3LwB

Thanks - the patch I posted substantially increases the amount of
checking that is done... so not surprisingly we find new forms of
corruption:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qh, 0xac050240 (corrupted)
00000000: a0 02 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  ....kkkkkkkkkkkk
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

and that corruption occurred _right_ after we allocated the page,
memset the entire page to 0xa7, and wrote the "next" pointers.

Again, similar scenario to the above:

tegra-ehci 7d004000.usb: pool_alloc_page ehci_qtd, 0xac0510c0 (corrupted)
00000000: 20 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7   ...............
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000040: e0 01 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000050: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

which is again right after the page is allocated and initialised.

If we look at the ci_hw_qh case, which is the one originally identified:

tegra-udc 7d000000.usb: pool_alloc_page ci_hw_qh, 0xac056080 (corrupted)
00000000: c0 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000010: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000020: 80 00 00 00 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................
00000030: a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7 a7  ................

Again, we have just allocated the coherent DMA page, memset() it and
written the offsets to it, and it is already corrupted.

Tegra124 does not appear to be dma-coherent, so these allocations will
be for normal, uncached memory. That means the CPU won't be loading
entire cachelines at a time from memory for these accesses, but will be
reading them byte by byte as we print the hex values. The window for
this corruption occurring is now very small.

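For anyone following along who wants to reason about the dumps above, here is a rough user-space model of the sequence described: fill the page with 0xa7, write the "next" offsets, then compare. It is only a sketch and not the kernel's dmapool code. The 4096-byte page size, the little-endian 32-bit offset at the start of each block, the function names init_page and find_corruption, and the 0x60 block size (inferred from the first dump, where the block at page offset 0x240 holds 0x2a0) are all assumptions made for illustration.

```python
import struct

POOL_POISON = 0xA7   # fill byte named in the message above
PAGE_SIZE = 4096     # assumption: one 4 KiB page per pool allocation


def init_page(block_size, page_size=PAGE_SIZE):
    """Fill a page with the poison byte, then store each block's "next" offset."""
    page = bytearray([POOL_POISON]) * page_size
    for offset in range(0, page_size - block_size + 1, block_size):
        # Assumption: the offset of the following block is kept as a
        # little-endian 32-bit value in the first four bytes of each block.
        struct.pack_into("<I", page, offset, offset + block_size)
    return page


def find_corruption(page, block_size):
    """Return the offsets of bytes that differ from a freshly initialised page."""
    expected = init_page(block_size, len(page))
    return [i for i, (got, want) in enumerate(zip(page, expected)) if got != want]


# Hypothetical usage: a clean page reports nothing; overwriting twelve bytes
# with 0x6b, as seen in the ehci_qh dump, is reported byte by byte.
page = init_page(block_size=0x60)
print(find_corruption(page, 0x60))
page[0x244:0x250] = b"\x6b" * 12
print([hex(i) for i in find_corruption(page, 0x60)])
```

Comparing against a freshly built reference page keeps the check independent of which bytes are poison and which are offsets, at the cost of rebuilding the whole page for each check.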
Right now, I don't have anything further to add beyond what I've
already suggested as causes - this is *definitely* memory corruption,
either by something else writing to memory, by CPU writes not being
properly stored in RAM, or by the CPU not being able to reliably read
data back from RAM.

I wonder whether any of the memory testers run with normal, uncached
memory.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up