Date: Thu, 13 Aug 2020 21:43:59 -0400
From: Zygo Blaxell
To: Marc MERLIN
Cc: "linux-btrfs@vger.kernel.org"
Subject: Re: 5.6 pretty massive unexplained btrfs corruption: parent transid verify failed + open_ctree failed
Message-ID: <20200814014359.GQ5890@hungrycats.org>
References: <20200707035530.GP30660@merlins.org> <20200708034407.GE10769@hungrycats.org> <20200812223433.GA533@merlins.org>
In-Reply-To: <20200812223433.GA533@merlins.org>

On Wed, Aug 12, 2020 at 03:34:33PM -0700, Marc MERLIN wrote:
> On Tue, Jul 07, 2020 at 11:44:07PM -0400, Zygo Blaxell wrote:
> > sdd is running firmware 0957, also found in circa-2014 WD Green.
> > The others are running 01.01RA2 firmware that appears in a model
> > family that includes some broken WD Green and Red models from a few
> > years back (including the venerable datavore 80.00A80).  I have a
> > few of the WD branded versions of these drives.  They are unusable
> > with write cache enabled.  1 in 10 unclean shutdowns leads to
> > filesystem corruption on btrfs; on ext4, to git and postgresql
> > database corruption.  After disabling write cache, I've used them
> > for years with no problems.
> >
> > Hopefully your bcache drive is OK, you didn't post any details on
> > that.  bcache on a drive with buggy firmware write caching fails
> > *spectacularly*.
> >
> > You can work around buggy write cache firmware with a udev rule like
> > this to disable write cache on all the drives:
> >
> > 	ACTION=="add|change", SUBSYSTEM=="block", DRIVERS=="sd", KERNEL=="sd*[!0-9]", RUN+="/sbin/hdparm -W 0 $devnode"
> >
> > Note that in your logs, the kernel reports that 'sdd' has write
> > cache disabled already, maybe due to lack of firmware support or a
> > conservative default setting.  That makes it probably the only drive
> > in that array that is working properly.
> 
> Hi Zygo, took a while to rebuild the array, but it's back up, thanks
> for your tips.
> 
> To avoid such pain in the future, is there a way for me to find out if
> my other drives have such problems so that I disable write caching on
> them now?
> 
> WDC WD10EADS-00L 01.0  PQ: 0 ANSI: 6
> SAMSUNG HD102UJ  1AA0  PQ: 0 ANSI: 6
> ST6000VN0041-2EL SC61  PQ: 0 ANSI: 6
> ST6000VN0041-2EL SC61  PQ: 0 ANSI: 6
> ST6000VN0041-2EL SC61  PQ: 0 ANSI: 6
> ST6000VN0041-2EL SC61  PQ: 0 ANSI: 6
> ST6000VN0041-2EL SC61  PQ: 0 ANSI: 6
> WDC WD40EFRX-68W 0A80  PQ: 0 ANSI: 5
> WDC WD40EFRX-68W 0A80  PQ: 0 ANSI: 5
> WDC WD40EFRX-68W 0A80  PQ: 0 ANSI: 5
> WDC WD40EFRX-68W 0A80  PQ: 0 ANSI: 5
> WDC WD40EFRX-68W 0A80  PQ: 0 ANSI: 5

That "0A80" is the last 4 bytes of the "80.00A80" I mentioned above.
They all need write cache turned off.

I'm told that SC61 firmware has a bug fix for SC60's write cache
issues; however, I have some SC60 drives and haven't had write-cache
problems with them.  They're probably fine.  Same for WD 01.01A01--some
WD Green drives are OK.  I have no data for Samsung firmware.
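If you want to double-check from the running system rather than trust
those truncated inquiry strings, something along these lines would do
it (untested sketch; the /dev/sd? glob and the grep patterns are
assumptions, adjust them for your setup):

	# Untested sketch: print model, firmware revision, and write
	# cache state for each drive.  Adjust the glob for your system.
	for d in /dev/sd?; do
		echo "== $d"
		smartctl -i "$d" | grep -E 'Device Model|Firmware Version'
		hdparm -W "$d"	# prints "write-caching = 1 (on)" or "0 (off)"
	done

Anything showing one of the suspect firmware revisions--or anything
you're not sure about--gets hdparm -W 0, or just use the udev rule
above and be done with it.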
> As for the failure you helped me with (thanks):
> In hindsight, I think I know what went wrong.  I was running bees on
> the array to reclaim duplicate space (1.2TB saved, so it was worth
> it).  bees does lots of metadata operations, so that would explain why
> the last operations that didn't get saved/got corrupted caused the
> fatal corruption I ended up with.

If the drive lost data writes then you'd have csum failures on data
blocks.  Some of my WD Reds with 0A80 firmware had those.

There are a lot of metadata writes all the time on btrfs, even with
data-heavy workloads.  You only need to lose any two metadata writes
(one in single or raid0 profile) to break the filesystem.  bees
typically spends only a single-digit percentage of its time
manipulating metadata.  Most of the bees IO time is waiting for btrfs
to do slow metadata reads while holding a transaction lock.

> Because it's an external array, I do stop it to save power and do this:
> 	umount /dev/mapper/crypt_bcache0
> 	sync
> 	dmsetup remove crypt_bcache0
> 	echo 1 > /sys/block/md6/bcache/stop
> 	mdadm --stop /dev/md6
> 	/etc/init.d/smartmontools stop
> 	sleep 5
> 	pdu disk3 off	<= cuts power
>
> Note that I umount, then sync just in case, then a bunch of stuff to
> free up block devices, but by then the btrfs FS has been unmounted and
> synced.
>
> Yet, I added sleep 5 for good measure before turning the drives off.
> Do you think the cache was so broken that even with all the steps I
> took, it failed to flush?

Anything after bcache stop is write cache flush theatre.  If the
drive's write cache firmware is broken, you can issue all the flushes
you want, but they don't do anything that wouldn't also be achieved by
just waiting for the disks to be idle for a while.

5 seconds seems short to me.  I'd keep the drives powered for a minute
or two before turning them off.  Ideally I'd set an idle timeout on the
drive (hdparm -S) and wait until the disk spins itself down (hdparm -C).
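For example (untested, and the sdX names below are only placeholders
for whatever your md6 member devices happen to be), the end of the
power-down script could look something like this:

	# Untested sketch: spin the drives down and wait for each one to
	# confirm it before cutting power.  Device names are placeholders.
	for d in /dev/sdX /dev/sdY /dev/sdZ; do
		hdparm -S 12 "$d"	# standby after 12 * 5 = 60 seconds idle
	done
	for d in /dev/sdX /dev/sdY /dev/sdZ; do
		until hdparm -C "$d" | grep -q standby; do
			sleep 10	# poll until the drive reports standby
		done
	done
	pdu disk3 off

That at least takes "I cut the power too soon" off the list of suspects.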
> Either way, I now do this as per your recommendation:
> grep md6 /proc/mdstat | sed "s/.*raid5 //" | tr ' ' '\012' | sed "s/1.*//" | while read d; do /sbin/hdparm -v -W 0 /dev/$d; done

Note that after a bus reset while the disks are online, the write
caching flag might revert to its default.

Write cache corruption doesn't always occur at shutdown.  It can also
happen if there is a UNC sector, or even noise on the SATA cables, that
triggers a bus reset.  This wouldn't be detected by the host or
btrfs--as far as the storage stack is concerned, the drive ACKed the
flush, so the data's on the disk, and any losses after that point are
equivalent to disk-level corruption.

> Thanks
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> 
> Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08