linux-btrfs.vger.kernel.org archive mirror
* Spare Volume Features
@ 2019-08-29  0:51 Marc Oggier
  2019-08-29  2:21 ` Sean Greenslade
  2019-08-30  8:07 ` Anand Jain
  0 siblings, 2 replies; 9+ messages in thread
From: Marc Oggier @ 2019-08-29  0:51 UTC (permalink / raw)
  To: linux-btrfs

Hi All,

I am currently building a small data server for an experiment.

I was wondering if the spare volume feature introduced a couple of 
years ago (https://patchwork.kernel.org/patch/8687721/) will be 
released soon. It would be awesome to have a drive installed that can 
be used as a spare if one drive of the array dies, to avoid downtime.

Does anyone have news about it, and when it will officially be in the 
kernel/btrfs-progs?

Marc

P.S. It took me a long time to switch to btrfs. I did it less than a 
year ago, and I love it. Keep up the great work, y'all.



* Re: Spare Volume Features
  2019-08-29  0:51 Spare Volume Features Marc Oggier
@ 2019-08-29  2:21 ` Sean Greenslade
  2019-08-29 22:41   ` waxhead
  2019-09-01  3:28   ` Sean Greenslade
  2019-08-30  8:07 ` Anand Jain
  1 sibling, 2 replies; 9+ messages in thread
From: Sean Greenslade @ 2019-08-29  2:21 UTC (permalink / raw)
  To: Marc Oggier, linux-btrfs

On August 28, 2019 5:51:02 PM PDT, Marc Oggier <marc.oggier@megavolts.ch> wrote:
>Hi All,
>
>I am currently building a small data server for an experiment.
>
>I was wondering if the spare volume feature introduced a couple of
>years ago (https://patchwork.kernel.org/patch/8687721/) will be
>released soon. It would be awesome to have a drive installed that can
>be used as a spare if one drive of the array dies, to avoid downtime.
>
>Does anyone have news about it, and when it will officially be in the
>kernel/btrfs-progs?
>
>Marc
>
>P.S. It took me a long time to switch to btrfs. I did it less than a
>year ago, and I love it. Keep up the great work, y'all.

I've been thinking about this issue myself, and I have an (untested) idea for how to accomplish something similar. My file server has three disks in a btrfs raid1. I added a fourth disk to the array as just a normal, participating disk. I keep an eye on the usage to make sure that I never exceed three disks' worth of usage. That way, if one disk dies, there are still enough disks to mount RW (though I may still need to do an explicit degraded mount, not sure). In that scenario, I can just trigger an online full balance to rebuild the missing raid copies on the remaining disks. In theory, minimal to no downtime.
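
Roughly, with made-up device names (untested, and whether the degraded
mount is actually required is the part I'm not sure about):

  # Ongoing setup: the 4th disk just participates normally in the raid1.
  btrfs device add /dev/sde /srv/data
  btrfs filesystem usage /srv/data  # keep usage under three disks' worth

  # If e.g. /dev/sdc dies, remount degraded if the kernel insists on it:
  umount /srv/data
  mount -o degraded /dev/sdb /srv/data

  # Then rebuild the lost copies onto the three survivors, either by
  # dropping the dead device or via a full balance:
  btrfs device delete missing /srv/data
  # (or) btrfs balance start --full-balance /srv/data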

I'm curious if anyone can see any problems with this idea. I've never tested it, and my offsite backups are thorough enough to survive downtime anyway.

--Sean



* Re: Spare Volume Features
  2019-08-29  2:21 ` Sean Greenslade
@ 2019-08-29 22:41   ` waxhead
  2019-09-01  3:28   ` Sean Greenslade
  1 sibling, 0 replies; 9+ messages in thread
From: waxhead @ 2019-08-29 22:41 UTC (permalink / raw)
  To: Sean Greenslade, Marc Oggier, linux-btrfs



Sean Greenslade wrote:
> On August 28, 2019 5:51:02 PM PDT, Marc Oggier <marc.oggier@megavolts.ch> wrote:
>> Hi All,
>>
>> I am currently building a small data server for an experiment.
>>
>> I was wondering if the spare volume feature introduced a couple of
>> years ago (https://patchwork.kernel.org/patch/8687721/) will be
>> released soon. It would be awesome to have a drive installed that can
>> be used as a spare if one drive of the array dies, to avoid downtime.
>>
>> Does anyone have news about it, and when it will officially be in the
>> kernel/btrfs-progs?
>>
>> Marc
>>
>> P.S. It took me a long time to switch to btrfs. I did it less than a
>> year ago, and I love it. Keep up the great work, y'all.
> 
> I've been thinking about this issue myself, and I have an (untested) idea for how to accomplish something similar. My file server has three disks in a btrfs raid1. I added a fourth disk to the array as just a normal, participating disk. I keep an eye on the usage to make sure that I never exceed three disks' worth of usage. That way, if one disk dies, there are still enough disks to mount RW (though I may still need to do an explicit degraded mount, not sure). In that scenario, I can just trigger an online full balance to rebuild the missing raid copies on the remaining disks. In theory, minimal to no downtime.
> 
> I'm curious if anyone can see any problems with this idea. I've never tested it, and my offsite backups are thorough enough to survive downtime anyway.
> 
> --Sean
> 
I'm just a regular btrfs user, but I see tons of problems with this.

When BTRFS introduces per-subvolume (or even per-file) "RAID" or 
redundancy levels, a spare device can quickly become a headache. While 
you can argue that a spare device of equal or larger size than the 
largest device in the pool would suffice in most cases, I don't think 
it is very practical.

What BTRFS needs to do (IMHO) is to reserve spare space instead. This 
means that many smaller devices can be used in case a large device keels 
over.

The spare space of course also needs to be as large as or larger than 
the largest device in the pool, but you would have more flexibility.

For example, the spare space COULD be pre-populated with the most 
important data (hot data tracking) and serve as a speed-up for read 
operations. What is the point of having idle space just waiting to be 
used when you could instead put it to work: increased read speed, or 
extra redundancy for single, dup or even raid0 chunks? Using the spare 
space for SOME potential recovery benefit is better than not using it 
for anything.

When the spare space is needed, you can either simply discard the data 
on the broken device if the spare space already holds a copy of it 
(which makes for superfast recovery), or drop whatever caches the spare 
space is used for and repopulate it by restoring non-redundant data to 
it as soon as you hit a certain error count on another device, etc.

Just like Linux uses spare memory, I think BTRFS is better off using 
the spare space for something rather than nothing. This should of 
course be configurable, just for the record.

Anyway - that is how I, a humble user without the detailed know-how, 
think it should be implemented... :)


* Re: Spare Volume Features
  2019-08-29  0:51 Spare Volume Features Marc Oggier
  2019-08-29  2:21 ` Sean Greenslade
@ 2019-08-30  8:07 ` Anand Jain
  1 sibling, 0 replies; 9+ messages in thread
From: Anand Jain @ 2019-08-30  8:07 UTC (permalink / raw)
  To: Marc Oggier, linux-btrfs


  Use cases in production need a spare device to maintain data
  redundancy in the event of a disk failure in the volume, so this
  feature should be in btrfs.ko. Until we get there, a workaround such
  as monitoring write_io_errs and calling btrfs replace should help.
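
  A rough sketch of that workaround (mount point and device names below
  are just examples):

    # Check the per-device error counters btrfs keeps for the volume;
    # a non-zero [/dev/sdX].write_io_errs line points at the failing disk.
    btrfs device stats /srv/data

    # Swap the failing member for a standby disk without unmounting:
    btrfs replace start /dev/sdb /dev/sdz /srv/data
    btrfs replace status /srv/data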

  HTH.

Thanks, Anand


* Re: Spare Volume Features
  2019-08-29  2:21 ` Sean Greenslade
  2019-08-29 22:41   ` waxhead
@ 2019-09-01  3:28   ` Sean Greenslade
  2019-09-01  8:03     ` Andrei Borzenkov
  1 sibling, 1 reply; 9+ messages in thread
From: Sean Greenslade @ 2019-09-01  3:28 UTC (permalink / raw)
  To: linux-btrfs

On Wed, Aug 28, 2019 at 07:21:14PM -0700, Sean Greenslade wrote:
> On August 28, 2019 5:51:02 PM PDT, Marc Oggier <marc.oggier@megavolts.ch> wrote:
> >Hi All,
> >
> >I am currently building a small data server for an experiment.
> >
> >I was wondering if the spare volume feature introduced a couple of
> >years ago (https://patchwork.kernel.org/patch/8687721/) will be
> >released soon. It would be awesome to have a drive installed that can
> >be used as a spare if one drive of the array dies, to avoid downtime.
> >
> >Does anyone have news about it, and when it will officially be in the
> >kernel/btrfs-progs?
> >
> >Marc
> >
> >P.S. It took me a long time to switch to btrfs. I did it less than a
> >year ago, and I love it. Keep up the great work, y'all.
> 
> I've been thinking about this issue myself, and I have an (untested)
> idea for how to accomplish something similar. My file server has three
> disks in a btrfs raid1. I added a fourth disk to the array as just a
> normal, participating disk. I keep an eye on the usage to make sure
> that I never exceed three disks' worth of usage. That way, if one disk
> dies, there are still enough disks to mount RW (though I may still
> need to do an explicit degraded mount, not sure). In that scenario, I
> can just trigger an online full balance to rebuild the missing raid
> copies on the remaining disks. In theory, minimal to no downtime.
> 
> I'm curious if anyone can see any problems with this idea. I've never
> tested it, and my offsite backups are thorough enough to survive
> downtime anyway.
> 
> --Sean

I decided to do a bit of experimentation to test this theory. The
primary goal was to see if a filesystem could suffer a failed disk and
have that disk removed and rebalanced among the remaining disks without
the filesystem losing data or going read-only. Tested on kernel
5.2.5-arch1-1-ARCH, progs: v5.2.1.

I was actually quite impressed. When I ripped one of the block devices
out from under btrfs, the kernel started spewing tons of BTRFS errors,
but seemed to keep on trucking. I didn't leave it in this state for too
long, but I was reading, writing, and syncing the fs without issue.
After performing a btrfs device delete <MISSING_DEVID>, the filesystem
rebalanced and stopped reporting errors. Looks like this may be a viable
strategy for high-availability filesystems assuming you have adequate
monitoring in place to catch the disk failures quickly. I personally
wouldn't want to fully automate the disk deletion, but it's certainly
possible.
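
A sketch of how such a test can be reproduced with loop devices and
device-mapper (not my exact setup; device names, sizes, and the devid
below are made up):

  # Three backing files; wrap one in device-mapper so its failure can
  # later be simulated by swapping in an 'error' target.
  truncate -s 2G d1.img d2.img d3.img
  D1=$(losetup --find --show d1.img)
  D2=$(losetup --find --show d2.img)
  D3=$(losetup --find --show d3.img)
  dmsetup create victim --table "0 $(blockdev --getsz "$D3") linear $D3 0"

  mkfs.btrfs -d raid1 -m raid1 "$D1" "$D2" /dev/mapper/victim
  btrfs device scan
  mkdir -p /mnt/test && mount "$D1" /mnt/test
  # (The loop device under the mapper carries the same btrfs signature,
  #  so confirm with 'btrfs filesystem show /mnt/test' that the fs was
  #  assembled from /dev/mapper/victim and not from $D3 directly.)
  dd if=/dev/urandom of=/mnt/test/blob bs=1M count=256 conv=fsync

  # "Rip out" the third device: every further I/O to it now errors.
  dmsetup suspend victim
  dmsetup load victim --table "0 $(blockdev --getsz "$D3") error"
  dmsetup resume victim

  # The fs keeps going (noisily); fold the data back onto the two
  # healthy devices. The devid (3 here) is an assumption -- check
  # 'btrfs filesystem show /mnt/test' for the real one.
  btrfs device delete 3 /mnt/test
  btrfs device stats /mnt/test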

--Sean



* Re: Spare Volume Features
  2019-09-01  3:28   ` Sean Greenslade
@ 2019-09-01  8:03     ` Andrei Borzenkov
  2019-09-02  0:52       ` Sean Greenslade
  0 siblings, 1 reply; 9+ messages in thread
From: Andrei Borzenkov @ 2019-09-01  8:03 UTC (permalink / raw)
  To: Sean Greenslade, linux-btrfs

On 01.09.2019 6:28, Sean Greenslade wrote:
> 
> I decided to do a bit of experimentation to test this theory. The
> primary goal was to see if a filesystem could suffer a failed disk and
> have that disk removed and rebalanced among the remaining disks without
> the filesystem losing data or going read-only. Tested on kernel
> 5.2.5-arch1-1-ARCH, progs: v5.2.1.
> 
> I was actually quite impressed. When I ripped one of the block devices
> out from under btrfs, the kernel started spewing tons of BTRFS errors,
> but seemed to keep on trucking. I didn't leave it in this state for too
> long, but I was reading, writing, and syncing the fs without issue.
> After performing a btrfs device delete <MISSING_DEVID>, the filesystem
> rebalanced and stopped reporting errors.

How many devices did the filesystem have? What profiles did the
original filesystem use, and what profiles were present after deleting
the device? Just to be sure there was no silent downgrade from raid1 to
dup or single, for example.


> Looks like this may be a viable
> strategy for high-availability filesystems assuming you have adequate
> monitoring in place to catch the disk failures quickly. I personally
> wouldn't want to fully automate the disk deletion, but it's certainly
> possible.
> 

This would be a valid strategy if we could tell btrfs to reserve enough
spare space; but even this is not enough: every allocation btrfs makes
must leave enough spare space to reconstruct every other chunk if a
device goes missing.

Actually, I now ask myself: what happens when btrfs sees unusable disk
sector(s) in some chunk? Will it automatically reconstruct the content
of this chunk somewhere else? If not, what options are there besides
full device replacement?


* Re: Spare Volume Features
  2019-09-01  8:03     ` Andrei Borzenkov
@ 2019-09-02  0:52       ` Sean Greenslade
  2019-09-02  1:09         ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Sean Greenslade @ 2019-09-02  0:52 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: linux-btrfs

On Sun, Sep 01, 2019 at 11:03:59AM +0300, Andrei Borzenkov wrote:
> On 01.09.2019 6:28, Sean Greenslade wrote:
> > 
> > I decided to do a bit of experimentation to test this theory. The
> > primary goal was to see if a filesystem could suffer a failed disk and
> > have that disk removed and rebalanced among the remaining disks without
> > the filesystem losing data or going read-only. Tested on kernel
> > 5.2.5-arch1-1-ARCH, progs: v5.2.1.
> > 
> > I was actually quite impressed. When I ripped one of the block devices
> > out from under btrfs, the kernel started spewing tons of BTRFS errors,
> > but seemed to keep on trucking. I didn't leave it in this state for too
> > long, but I was reading, writing, and syncing the fs without issue.
> > After performing a btrfs device delete <MISSING_DEVID>, the filesystem
> > rebalanced and stopped reporting errors.
> 
> How many devices did the filesystem have? What profiles did the
> original filesystem use, and what profiles were present after deleting
> the device? Just to be sure there was no silent downgrade from raid1 to
> dup or single, for example.

I did the simplest case: raid1 with three disks, dropping one disk to
end up with raid1 across two disks. I did check, and btrfs fi usage
reported no dup or single chunks.
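
(For reference, the check amounts to something like the following; the
mount point is just an example:)

  btrfs filesystem usage /mnt/test  # per-profile space breakdown
  btrfs filesystem df /mnt/test     # should list only raid1 profiles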

> > Looks like this may be a viable
> > strategy for high-availability filesystems assuming you have adequate
> > monitoring in place to catch the disk failures quickly. I personally
> > wouldn't want to fully automate the disk deletion, but it's certainly
> > possible.
> > 
> 
> This would be a valid strategy if we could tell btrfs to reserve enough
> spare space; but even this is not enough: every allocation btrfs makes
> must leave enough spare space to reconstruct every other chunk if a
> device goes missing.
> 
> Actually, I now ask myself: what happens when btrfs sees unusable disk
> sector(s) in some chunk? Will it automatically reconstruct the content
> of this chunk somewhere else? If not, what options are there besides
> full device replacement?

As far as I can tell, btrfs has no facility for dealing with medium
errors (besides just reporting the error).  I just re-ran a simple test
with a two-device raid1 where one underlying device was removed out from
under btrfs after mounting. Btrfs complains loudly every time writes to
the missing disk fail, but doesn't retry or redirect those writes.  One
half of the raid1 block group makes it to disk, the other gets lost to
the void. The chunk that makes it to disk is still of raid1 type.

Essentially, it seems that btrfs currently has no way of marking a disk
as offline / missing / problematic post-mount. Additionally, and
possibly more troubling, is the fact that a failed chunk write will not
get retried, even if there is another disk that could possibly accept
that write. I think that for my fake-hot-spare proposal to be viable as
a fault resiliency measure, this failed-chunk-retry logic would need to
be implemented. Otherwise you're living without data redundancy for some
old data and some (or potentially all) new data from the moment the
first medium error occurs until the moment the device delete completes
successfully.

--Sean



* Re: Spare Volume Features
  2019-09-02  0:52       ` Sean Greenslade
@ 2019-09-02  1:09         ` Chris Murphy
  2019-09-03 11:35           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2019-09-02  1:09 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Andrei Borzenkov, Sean Greenslade

I'm still mostly convinced the policy questions and management should
be handled by a btrfsd userspace daemon.

Btrfs kernel code itself tolerates quite a lot of read and write
errors, whereas a userspace service could say: yeah, forget that, we're
moving over to the spare.

Also, that userspace daemon could handle the spare device while it's
in spare status. I don't really see why btrfs kernel code needs to
know about it. It's reserved for Btrfs but isn't used by Btrfs, until
a policy is triggered. Plausibly one of the policies isn't even device
failure, but the volume being nearly full. Spares should be assignable
to multiple Btrfs volumes. And that too can be managed by this
hypothetical daemon.
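
To make that concrete, here is a hypothetical sketch of the policy loop
such a daemon could run, built only from existing btrfs-progs commands
(the daemon itself doesn't exist; device names, mount points, and the
polling interval are made up):

  #!/bin/bash
  SPARE=/dev/sdz             # reserved spare, not part of any fs yet
  VOLUMES="/srv/a /srv/b"    # several btrfs volumes share the one spare

  while [ -n "$SPARE" ]; do
      sleep 60
      for mnt in $VOLUMES; do
          # Policy: a member device is accumulating write errors ->
          # hand the shared spare to this volume and replace that device.
          bad=$(btrfs device stats "$mnt" | awk \
            '/write_io_errs/ && $2 > 0 {gsub(/^\[|\].*$/, "", $1); print $1; exit}')
          if [ -n "$bad" ]; then
              btrfs replace start -f "$bad" "$SPARE" "$mnt"
              SPARE=""       # spare consumed; stop offering it
              break
          fi
          # A "volume nearly full" policy could instead do
          # 'btrfs device add "$SPARE" "$mnt"' here.
      done
  done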

--
Chris Murphy


* Re: Spare Volume Features
  2019-09-02  1:09         ` Chris Murphy
@ 2019-09-03 11:35           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 9+ messages in thread
From: Austin S. Hemmelgarn @ 2019-09-03 11:35 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS; +Cc: Andrei Borzenkov, Sean Greenslade

On 2019-09-01 21:09, Chris Murphy wrote:
> I'm still mostly convinced the policy questions and management should
> be handled by a btrfsd userspace daemon.
> 
> Btrfs kernel code itself tolerates quite a lot of read and write
> errors, whereas a userspace service could say: yeah, forget that, we're
> moving over to the spare.
> 
> Also, that userspace daemon could handle the spare device while it's
> in spare status. I don't really see why btrfs kernel code needs to
> know about it. It's reserved for Btrfs but isn't used by Btrfs, until
> a policy is triggered. Plausibly one of the policies isn't even device
> failure, but the volume being nearly full. Spares should be assignable
> to multiple Btrfs volumes. And that too can be managed by this
> hypothetical daemon.
Having the kernel know about it means, among other things, that 
switching to actually using the spare when needed is far less likely to 
be delayed by an arbitrarily long time.  The worst-case scenario in 
userspace is that the daemon gets paged out while its executable is on 
the volume it's supposed to be fixing, and the volume is in a state 
where it returns read errors until it gets fixed.  In such a case, the 
volume will never get fixed without manual intervention.  Such a 
situation is impossible if it's being handled by the kernel.  This could 
be mitigated by using mlock, but that brings its own set of issues.


Thread overview: 9+ messages
2019-08-29  0:51 Spare Volume Features Marc Oggier
2019-08-29  2:21 ` Sean Greenslade
2019-08-29 22:41   ` waxhead
2019-09-01  3:28   ` Sean Greenslade
2019-09-01  8:03     ` Andrei Borzenkov
2019-09-02  0:52       ` Sean Greenslade
2019-09-02  1:09         ` Chris Murphy
2019-09-03 11:35           ` Austin S. Hemmelgarn
2019-08-30  8:07 ` Anand Jain
