From: Chris Murphy
Date: Tue, 28 Aug 2018 17:58:48 -0600
Subject: Re: DRDY errors are not consistent with scrub results
To: Cerem Cem ASLAN
Cc: Chris Murphy, Btrfs BTRFS

On Tue, Aug 28, 2018 at 5:04 PM, Cerem Cem ASLAN wrote:
> What I want to achieve is to add the problematic disk as raid1 and see
> how/when it fails and how BTRFS recovers from those failures. While the
> party goes on, the main system shouldn't be interrupted, since this is a
> production system. For example, I would never expect to end up in such a
> readonly state while trying to add a disk with "unknown health" to the
> system. Was it somewhat expected?

I don't know. I can't tell you how LVM or mdraid behave in the same
situation either. I have certainly come across bug reports where an
underlying device goes read only, the file system falls over completely,
and the developers shrug and say there's nothing they can do.

This situation is a little different and more difficult. You're starting
out with a one-drive setup, so the profile is single/DUP or single/single,
and that doesn't change at add time. That means the second drive is
actually *mandatory* for a brief period, before you've converted to raid1
or higher.

What the design should be here, and whether this is a bug, is a developer
question: maybe the device being added should first be written with
placeholder supers, or even just zeros, in all the places 'dev add' puts
metadata, and only if that succeeds should real updated supers be written
to all devices. It's possible 'dev add' currently writes updated supers to
all devices at the same time, leaving a brief window where the state is
fragile; if a write fails in that window, the file system goes read only
to avoid being damaged.

Anyway, without a call trace I have no idea why it ended up read only, so
I have to speculate.

> Although we know that disk is about to fail, it still survives.

That's a very tenuous rationalization; a drive that rejects even a single
write is considered failed by the md driver. Btrfs is much more tolerant
of this: if the add had succeeded and you were running in production, you
should expect to see thousands of write errors dumped to the kernel log,
because Btrfs still never ejects a bad drive. It keeps trying, and it
keeps reporting the failures. All of that error logging can itself create
more write demand if the logs are on the same volume as the failing
device, which means even more errors to record, and you get an escalating
situation with heavy log writing.

> Shouldn't we expect in such a scenario that when the system tries to
> read or write some data from/to that BROKEN_DISK and recognizes that it
> failed, it will try to recover that part of the data from GOOD_DISK and
> store the recovered data in some other part of the BROKEN_DISK?

Nope. Btrfs can only write supers to fixed locations on the drive, the
same as any other file system. Btrfs metadata could in principle go
elsewhere, since it doesn't have fixed locations, but Btrfs doesn't do
bad sector tracking.
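What it does keep is a set of per-device error counters. If you want to
watch a marginal device like this one, the counters show up with
something like this (the mount point is just an example):

    btrfs device stats /mnt

which prints write_io_errs, read_io_errs, flush_io_errs, corruption_errs
and generation_errs for each device. That's only counting, though, not
relocating anything.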
So once it decides metadata goes in location X, and X reports a write
error, it will not try to write elsewhere; as far as I'm aware, ext4, XFS,
LVM, and md don't either. md does have an optional bad block map it can
use to track bad sectors and remap them to known good ones. Normally the
drive firmware should do this, and once that fails the drive is considered
toast for production purposes.

> Or did I misunderstand the whole thing?

Well, in a way this is sort of user sabotage. It's a valid test, and I'd
say ideally things should fail safely rather than fall over. But at the
same time it's not wrong for developers to say: "Look, if you add a bad
device there's a decent chance we're going to face plant and go read only
to avoid causing worse problems, so next time qualify the drive before
putting it into production." I'm willing to bet all the other file system
devs would say something like that, so even if the Btrfs devs think
something better could happen here, it's probably not a super high
priority.

-- 
Chris Murphy
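P.S. On qualifying a drive first: I don't have a formal procedure to
point at, but at minimum I'd run a SMART long self test and a full read
pass before trusting it, roughly (the device name is just a placeholder):

    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX      # once the test finishes, check the log
    badblocks -sv /dev/sdX    # read-only surface scan, takes a while

And if you retry the raid1 experiment with a drive that passes, the usual
sequence is a 'device add' followed by a convert balance, something like:

    btrfs device add /dev/sdX /mnt
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

Until that balance completes you still have only one copy of most of the
data, so don't count the array as redundant yet.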