Subject: Re: Add device while rebalancing
To: Chris Murphy
References: <571DFCF2.6050604@gmail.com> <571E154C.9060604@gmail.com> <571F4CD0.9050004@gmail.com> <5720A0E8.5000407@gmail.com>
Cc: Juan Alberto Cirez, linux-btrfs
From: "Austin S. Hemmelgarn"
Date: Thu, 28 Apr 2016 07:21:14 -0400

On 2016-04-27 19:19, Chris Murphy wrote:
> On Wed, Apr 27, 2016 at 5:22 AM, Austin S. Hemmelgarn wrote:
>> On 2016-04-26 20:58, Chris Murphy wrote:
>>>
>>> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez wrote:
>>>>
>>>> With GlusterFS as a distributed volume, the files are already
>>>> spread among the servers, causing file I/O to be spread fairly
>>>> evenly among them as well, thus probably providing the benefit one
>>>> might expect from striping (RAID10).
>>>
>>> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
>>> if you lose a drive. But since raid1 is not n-way copies, and only
>>> means two copies, you don't really want the file systems getting
>>> that big, or you increase the chances of a double failure.
>>>
>>> I've always thought it'd be neat, in a Btrfs + GlusterFS setup, if
>>> it were possible for Btrfs to inform GlusterFS of missing/corrupt
>>> files, and then for Btrfs to drop the references to those files,
>>> instead of either rebuilding or remaining degraded, and then to let
>>> GlusterFS deal with replicating those files to maintain redundancy.
>>> That is, the Btrfs volumes would use the single profile for data and
>>> raid1 for metadata. Once there's n-way raid1, each drive can hold a
>>> copy of the metadata, so the volume would tolerate in effect n-1
>>> drive failures. The file system could at least still inform Gluster
>>> (or Ceph) of the missing data, would remain valid (only briefly
>>> degraded), and could still be expanded when new drives become
>>> available.
>>
>> FWIW, I _think_ this can be done with the scrubbing code in
>> GlusterFS. It's designed to repair data mismatches, but I'm not sure
>> how it handles missing copies of data. However, in the current state,
>> there's no way without external scripts to handle re-shaping of the
>> storage bricks if part of them fails.
>
> Yeah, I haven't tried doing a scrub, parsing dmesg for busted file
> paths, and feeding those paths into rm to see what happens. Will they
> get deleted without additional errors? If so, good; then a second
> scrub should come back clean. And then 'btrfs device delete missing'
> to get rid of the broken device *and* cause the missing metadata to
> be replicated again, and in theory the fs should be back to normal.
> But it'd have to be tested with a umount followed by a mount to see
> if -o degraded is still required.
>
I'm not entirely certain. I had been planning on adding a test for
exactly this to my usual testing before the system I use for it went
offline; I just haven't had the time to get it working again. If I find
the time in the near future, I may just test it on my laptop in a VM.
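
If anyone wants to try it before I do, the sequence I'd be testing is
roughly the following. This is an untested sketch: /mnt and /dev/sdc
are placeholders for the test filesystem and one of its surviving
devices, and the grep pattern assumes the "(path: ...)" form the scrub
warnings normally take in dmesg, so adjust to taste:

   # Scrub to surface the errors; -B blocks until the scrub finishes.
   btrfs scrub start -B /mnt
   # Pull the affected paths out of the kernel log. The scrub warnings
   # report paths relative to the fs root, hence the sed to prefix the
   # mount point.
   dmesg | grep -oP 'path: \K[^)]+' | sort -u | sed 's|^|/mnt/|' > /tmp/busted
   # Remove the broken files so btrfs drops the references to them.
   xargs -d '\n' -a /tmp/busted rm -f
   # A second scrub should now come back clean.
   btrfs scrub start -B /mnt
   # Drop the dead device; in theory this also re-replicates any
   # metadata that only has one copy left.
   btrfs device delete missing /mnt
   # Remount to see whether -o degraded is still required.
   umount /mnt && mount /dev/sdc /mnt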
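
As an aside, setting up a volume with the layout Chris describes above
(single profile for data, raid1 for metadata) is just a matter of
picking profiles at mkfs time; with placeholder device names:

   mkfs.btrfs -d single -m raid1 /dev/sdb /dev/sdc /dev/sdd

An existing volume can be converted to the same layout with a balance:

   btrfs balance start -dconvert=single -mconvert=raid1 /mnt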
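
On the GlusterFS side, the scrubbing I'm referring to is the bitrot
detection added in 3.7. Going from memory (so check the docs for your
version), it's driven per-volume with something like:

   gluster volume bitrot <VOLNAME> enable
   gluster volume bitrot <VOLNAME> scrub status

but again, I don't know how it behaves when a copy is missing outright
rather than merely mismatched.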