Date: Sat, 29 Feb 2020 17:03:28 +0000
From: Russell King - ARM Linux admin
To: "Theodore Y. Ts'o"
Cc: Olof Johansson, Jon Nettleton, Andreas Dilger,
	"mark.rutland@arm.com", Lorenzo Pieralisi, "arnd@arndb.de",
	"m.karthikeyan@mobiveil.co.in", "linux-pci@vger.kernel.org",
	"Z.q. Hou", "l.subrahmanya@mobiveil.co.in", "will.deacon@arm.com",
	"linux-kernel@vger.kernel.org", Leo Li, "M.h. Lian", Xiaowei Bao,
	"catalin.marinas@arm.com", "bhelgaas@google.com",
	"andrew.murray@arm.com", "shawnguo@kernel.org", Mingkai Hu,
	"linux-arm-kernel@lists.infradead.org"
Subject: Re: [PATCHv9 00/12] PCI: Recode Mobiveil driver and add PCIe Gen4
	driver for NXP Layerscape SoCs
Message-ID: <20200229170328.GD25745@shell.armlinux.org.uk>
References: <20200110153347.GA29372@e121166-lin.cambridge.arm.com>
	<20200210152257.GD25745@shell.armlinux.org.uk>
	<20200229095550.GX25745@shell.armlinux.org.uk>
	<20200229110456.GY25745@shell.armlinux.org.uk>
	<20200229151907.GA7378@mit.edu>
In-Reply-To: <20200229151907.GA7378@mit.edu>

On Sat, Feb 29, 2020 at 10:19:07AM -0500, Theodore Y.
Ts'o wrote:
> On Sat, Feb 29, 2020 at 11:04:56AM +0000, Russell King - ARM Linux admin wrote:
> > Could it be a race condition, or some problem that's specific to the
> > ARM64 kernel that's provoking this corruption?
>
> Since I got brought in mid-way through this discussion, can someone
> summarize the vital details of the bughunt?  What kernel version is
> involved, and is this a regression?  If so, what's the last version of
> the kernel where you didn't have a problem on this hardware?

It's a new platform, I've run most 5.x kernels on it, but only recently
have I had an NVMe.  Currently running a 5.5 based kernel (for which I
have to patch in support for the platform), and I've no idea if it is
a regression or not.

> Can you trigger this failure reliably?

No - the very first time I ended up with a corrupted ext4 fs was on
the 8th February, and at that time it was put down to the NVMe not
being power-off safe: the machine had crashed sometime overnight,
resulting in a section of my network going offline (due to a pause
frame storm).  So, I powered it down from crashed state - and from
what people tell me, NVMe _may_ keep blocks unwritten to safe media
for a considerable time.

I never bothered to investigate it because the explanation seemed
reasonable, and manually running e2fsck fixed the filesystem.

The system was then booted back into using the NVMe rootfs, and
continued to do so without apparent issue until the 21st Feb, when
I cleanly shut it down, and powered it off.  During the time it was
running, it likely saw many reboots of the 5.5 kernel.

I powered it back on yesterday morning, and this morning it found
the fs corruption while trying to do a logrotate.

As I say in my last email, I suspect it isn't an ext4 bug, but either
a locking implementation issue, coherency issue, or interconnect issue.
The 4k block with the affected inode looks perfectly reasonable with
the only exception that the checksum is incorrect for that one inode -
and other inodes stored in the same 4k block were modified afterwards.
It suggests to me that the writes to update the two 16-bit words
containing the checksum were somehow lost for this particular inode.

> Unfortunately, while I'm regularly running xfstests on x86_64 on a
> Google Compute Engine VM, I'm not doing any runs on arm64.  I can
> certainly build an arm-64.
>
> There's a test-appliance designed to be run on ARM64 here[1].
>
> [1] https://kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/xfstests-amd64.tar.xz

The filename seems to say "amd64" not "arm64"?

> which is a Debian chroot, designed to be run via android-xfstests[2], but
> if you unpack it, it should be possible to enter the chroot and
> trigger the xfstests run manually on any arm64 system.
>
> [2] https://thunk.org/android-xfstests
>
> Does anyone know if kernel CI is running xfstests regularly?

I don't know...
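
As a rough illustration of the lost-checksum-write hypothesis above (not
the kernel's code): the sketch below stores a 32-bit inode checksum as two
16-bit little-endian words, changes a field without rewriting those words,
and shows that verification then fails even though the rest of the inode
looks sane.  Everything here is an assumption made for illustration:
zlib.crc32 is only a stand-in for ext4's seeded crc32c, the field offsets
are what I understand the ext4 on-disk layout to be, and the helper names
are hypothetical.

```python
import struct
import zlib

# Assumed offsets within the on-disk inode (illustrative, not authoritative).
CSUM_LO_OFF = 0x7C   # low 16 bits of the checksum (assumed l_i_checksum_lo)
CSUM_HI_OFF = 0x82   # high 16 bits of the checksum (assumed i_checksum_hi)
SIZE_LO_OFF = 0x04   # low 32 bits of the file size (assumed i_size_lo)
INODE_SIZE = 256


def compute_csum(inode):
    """Checksum the inode with both checksum words treated as zero."""
    buf = bytearray(inode)
    buf[CSUM_LO_OFF:CSUM_LO_OFF + 2] = b"\x00\x00"
    buf[CSUM_HI_OFF:CSUM_HI_OFF + 2] = b"\x00\x00"
    # Stand-in only: ext4 uses crc32c seeded from filesystem and inode data.
    return zlib.crc32(bytes(buf)) & 0xFFFFFFFF


def store_csum(inode):
    """Write the checksum back as two separate 16-bit stores."""
    csum = compute_csum(inode)
    struct.pack_into("<H", inode, CSUM_LO_OFF, csum & 0xFFFF)
    struct.pack_into("<H", inode, CSUM_HI_OFF, csum >> 16)


def verify_csum(inode):
    """Reassemble the two 16-bit words and compare with a fresh checksum."""
    (lo,) = struct.unpack_from("<H", inode, CSUM_LO_OFF)
    (hi,) = struct.unpack_from("<H", inode, CSUM_HI_OFF)
    return ((hi << 16) | lo) == compute_csum(inode)


def demo():
    inode = bytearray(INODE_SIZE)
    struct.pack_into("<I", inode, SIZE_LO_OFF, 4096)
    store_csum(inode)
    ok_before = verify_csum(inode)

    # Update a field, but "lose" the two 16-bit checksum stores.
    struct.pack_into("<I", inode, SIZE_LO_OFF, 8192)
    ok_after_lost_write = verify_csum(inode)

    # Redo the checksum stores, as a successful update would.
    store_csum(inode)
    ok_after_rewrite = verify_csum(inode)
    return ok_before, ok_after_lost_write, ok_after_rewrite


if __name__ == "__main__":
    print(demo())
```

The point is only that a stale pair of checksum words leaves an inode
that fails verification while its other fields, and its neighbours in the
same block, remain plausible, which matches the symptom described above.
It says nothing about why such stores would go missing.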
-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up