* Help with space
@ 2014-02-27 18:19 Justin Brown
  2014-02-27 19:27 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Justin Brown @ 2014-02-27 18:19 UTC (permalink / raw)
  To: linux-btrfs

I've an 18 TB hardware RAID 5 (Areca ARC-1170 w/ eight 3 TB drives) in
need of help.  Disk usage (du) shows 13 TB allocated, yet strangely
enough df shows approx. 780 GB free.  It seems, somehow, btrfs
has eaten roughly 4 TB internally.  I've run a scrub and a balance with
usage=5 with no success; in fact I lost about 20 GB after the
balance attempt.  Some numbers:

terra:/var/lib/nobody/fs/ubfterra # uname -a
Linux terra 3.12.4-2.44-desktop #1 SMP PREEMPT Mon Dec 9 03:14:51 CST
2013 i686 i686 i386 GNU/Linux

terra:/var/lib/nobody/fs/ubfterra # parted -l
Model: Areca ARC-1170-VOL#00 (scsi)
Disk /dev/sdb: 21.0TB
Sector size (logical/physical): 4096B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name              Flags
 1      1049kB  21.0TB  21.0TB               Linux filesystem

terra:/var/lib/nobody/fs/ubfterra # du -shc *
1.7M    40588-4-1376856876.jpg
2.7M    40588-4-1376856876b.jpg
1008G   Anime
180G    Doctor Who (classic)
5.5T    Downloads
28G     Flash Rescue
1.9T    Jus
3.6T    Tornado
4.0K    dirsanime
4.0K    filesanime
55G     home videos
0       testsub
4.0K    unsharedanime
13T     total

terra:/var/lib/nobody/fs/ubfterra # btrfs fi show /dev/sdb1
Label: ubfterra  uuid: 40f0f692-c68c-4af7-ade2-c15a127ceab5
        Total devices 1 FS bytes used 17.61TiB
        devid    1 size 19.10TiB used 18.34TiB path /dev/sdb1

Btrfs v3.12

terra:/var/lib/nobody/fs/ubfterra # btrfs fi df .
Data, single: total=17.58TiB, used=17.57TiB
System, DUP: total=8.00MiB, used=1.93MiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=392.00GiB, used=33.50GiB
Metadata, single: total=8.00MiB, used=0.00


I use no subvolumes nor are there any snapshots, at least as near as I
can tell.  Any suggestions as to how to recover the missing space
assuming it's possible?  Any help is most appreciated.

-Justin

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-27 18:19 Help with space Justin Brown
@ 2014-02-27 19:27 ` Chris Murphy
  2014-02-27 19:51   ` Chris Murphy
  2014-02-28  4:34 ` Roman Mamedov
  2014-02-28  6:13 ` Chris Murphy
  2 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2014-02-27 19:27 UTC (permalink / raw)
  To: Justin Brown; +Cc: linux-btrfs


On Feb 27, 2014, at 11:19 AM, Justin Brown <otakujunction@gmail.com> wrote:

> I've an 18 TB hardware RAID 5 (Areca ARC-1170 w/ eight 3 TB drives) in
> need of help.  Disk usage (du) shows 13 TB allocated, yet strangely
> enough df shows approx. 780 GB free.  It seems, somehow, btrfs
> has eaten roughly 4 TB internally.  I've run a scrub and a balance with
> usage=5 with no success; in fact I lost about 20 GB after the
> balance attempt.  Some numbers:
> 
> terra:/var/lib/nobody/fs/ubfterra # uname -a
> Linux terra 3.12.4-2.44-desktop #1 SMP PREEMPT Mon Dec 9 03:14:51 CST
> 2013 i686 i686 i386 GNU/Linux

This is on i686?

The kernel page cache is limited to 16TB on i686, so effectively your block device is limited to 16TB. While mkfs.btrfs successfully creates a file system this large, the fact that the mount -t btrfs command then succeeds is probably a btrfs bug.

The way this works for XFS and ext4 is that the mount fails:

EXT4-fs (sdc): filesystem too large to mount safely on this system
XFS (sdc): file system too large to be mounted on this system.

If you're on a 32-bit OS, the file system might be toast; I'm not really sure. But I'd immediately stop using it and only use a 64-bit OS for file systems of this size.
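
For reference, the limit is just the 32-bit page cache index times the
page size, which is easy to sanity-check on the box itself (a rough
sketch, not taken from the original report):

  uname -m            # i686 here
  getconf PAGE_SIZE   # 4096 on x86
  # 2^32 page-cache slots * 4 KiB pages = 2^44 bytes = 16 TiB
  echo $(( (1 << 32) * 4096 / (1 << 40) ))   # prints 16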



Chris Murphy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-27 19:27 ` Chris Murphy
@ 2014-02-27 19:51   ` Chris Murphy
  2014-02-27 20:49     ` otakujunction
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2014-02-27 19:51 UTC (permalink / raw)
  To: Btrfs BTRFS


On Feb 27, 2014, at 12:27 PM, Chris Murphy <lists@colorremedies.com> wrote:
> This is on i686?
> 
> The kernel page cache is limited to 16TB on i686, so effectively your block device is limited to 16TB. While mkfs.btrfs successfully creates a file system this large, the fact that the mount -t btrfs command then succeeds is probably a btrfs bug.

Yes Chris, to restate: it's probably a btrfs bug that the mount command succeeds at all.

So let us know if this is i686 or x86_64, because if it's the former it's a bug that should get fixed.


Chris Murphy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-27 19:51   ` Chris Murphy
@ 2014-02-27 20:49     ` otakujunction
  2014-02-27 21:11       ` Chris Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: otakujunction @ 2014-02-27 20:49 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Yes it's an ancient 32 bit machine.  There must be a complex bug involved as the system, when originally mounted, claimed the correct free space and only as used over time did the discrepancy between used and free grow.  I'm afraid I chose btrfs because it appeared capable of breaking the 16 tera limit on a 32 bit system.  If this isn't the case then it's incredible that I've been using this file system for about a year without difficulty until now.

-Justin

Sent from my iPad

> On Feb 27, 2014, at 1:51 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> 
>> On Feb 27, 2014, at 12:27 PM, Chris Murphy <lists@colorremedies.com> wrote:
>> This is on i686?
>> 
>> The kernel page cache is limited to 16TB on i686, so effectively your block device is limited to 16TB. While mkfs.btrfs successfully creates a file system this large, the fact that the mount -t btrfs command then succeeds is probably a btrfs bug.
> 
> Yes Chris, to restate: it's probably a btrfs bug that the mount command succeeds at all.
> 
> So let us know if this is i686 or x86_64, because if it's the former it's a bug that should get fixed.
> 
> 
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-27 20:49     ` otakujunction
@ 2014-02-27 21:11       ` Chris Murphy
  2014-02-28  0:12         ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2014-02-27 21:11 UTC (permalink / raw)
  To: otakujunction; +Cc: Btrfs BTRFS


On Feb 27, 2014, at 1:49 PM, otakujunction@gmail.com wrote:

> Yes it's an ancient 32 bit machine.  There must be a complex bug involved as the system, when originally mounted, claimed the correct free space and only as used over time did the discrepancy between used and free grow.  I'm afraid I chose btrfs because it appeared capable of breaking the 16 tera limit on a 32 bit system.  If this isn't the case then it's incredible that I've been using this file system for about a year without difficulty until now.

Yep, it's not a good bug. This happened some years ago on XFS too, where people would use the file system for a long time and then at 16TB+1byte written to the volume, kablewy! And then it wasn't usable at all, until put on a 64-bit kernel.

http://oss.sgi.com/pipermail/xfs/2014-February/034588.html

I can't tell you if there's a workaround for this other than to go to a 64-bit kernel. Maybe you could partition the raid5 into two 9TB block devices, and then format the two partitions with -d single -m raid1. That way it behaves as one volume, and alternates 1GB chunks between the two partitions. This should perform decently for large files, but otherwise it's possible that the allocator will sometimes be writing to two data chunks on what it thinks are two drives at the same time, when it's actually writing to the same physical device (array) at the same time. Hardware RAID should optimize some of this, but I don't know what the penalty will be, or whether it'll work for your use case.

And I definitely don't know if the kernel page cache limit applies to the block device (partition) or if it applies to the file system. It sounds like it applies to the block device, so this might be a way around this if you had to stick to a 32bit system.
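
Very roughly, that workaround would look something like this (an untested
sketch; the sizes and device names are illustrative only, and it means
wiping the existing partition table and restoring from backup):

  # two partitions, each under the 16TiB page cache limit
  parted /dev/sdb mkpart part1 1MiB 10.5TB
  parted /dev/sdb mkpart part2 10.5TB 21TB
  # one btrfs across both: data in single chunks, metadata mirrored
  mkfs.btrfs -d single -m raid1 -L ubfterra /dev/sdb1 /dev/sdb2
  # either member device mounts the whole file system
  mount /dev/sdb1 /var/lib/nobody/fs/ubfterra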


Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-27 21:11       ` Chris Murphy
@ 2014-02-28  0:12         ` Dave Chinner
  2014-02-28  0:27           ` Chris Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2014-02-28  0:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: otakujunction, Btrfs BTRFS

On Thu, Feb 27, 2014 at 02:11:19PM -0700, Chris Murphy wrote:
> 
> On Feb 27, 2014, at 1:49 PM, otakujunction@gmail.com wrote:
> 
> > Yes it's an ancient 32 bit machine.  There must be a complex bug
> > involved as the system, when originally mounted, claimed the
> > correct free space and only as used over time did the
> > discrepancy between used and free grow.  I'm afraid I chose
> > btrfs because it appeared capable of breaking the 16 tera limit
> > on a 32 bit system.  If this isn't the case then it's incredible
> > that I've been using this file system for about a year without
> > difficulty until now.
> 
> Yep, it's not a good bug. This happened some years ago on XFS too,
> where people would use the file system for a long time and then at
> 16TB+1byte written to the volume, kablewy! And then it wasn't
> usable at all, until put on a 64-bit kernel.
> 
> http://oss.sgi.com/pipermail/xfs/2014-February/034588.html

Well, no, that's not what I said. I said that it was limited on XFS,
not that the limit was a result of a user making a filesystem too
large and then finding out it didn't work. Indeed, you can't do that
on XFS - mkfs will refuse to run on a block device it can't access the
last block on, and the kernel has the same "can I access the last
block of the filesystem" sanity checks that are run at mount and
growfs time.

IOWs, XFS has *never* allowed >16TB on 32 bit systems on Linux. And,
historically speaking, it didn't even allow it on Irix. Irix on 32
bit systems was limited to 1TB (2^31 sectors of 2^9 bytes = 1TB),
and only as Linux gained sufficient capability on 32 bit systems
(e.g.  CONFIG_LBD) was the limit increased. The limit we are now at
is the address space index being 32 bits, so the size is limited by
2^32 * PAGE_SIZE = 2^44 = 16TB....

i.e. Back when XFS was still being ported to Linux from Irix in 2000:

#if !XFS_BIG_FILESYSTEMS
        if (sbp->sb_dblocks > INT_MAX || sbp->sb_rblocks > INT_MAX)  {
                cmn_err(CE_WARN,
"XFS:  File systems greater than 1TB not supported on this system.\n");
                return XFS_ERROR(E2BIG);
        }
#endif

(http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=blob;f=fs/xfs/xfs_mount.c;hb=60a4726a60437654e2af369ccc8458376e1657b9)

So, good story, but it's not true.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  0:12         ` Dave Chinner
@ 2014-02-28  0:27           ` Chris Murphy
  2014-02-28  4:21             ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2014-02-28  0:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: otakujunction, Btrfs BTRFS


On Feb 27, 2014, at 5:12 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Feb 27, 2014 at 02:11:19PM -0700, Chris Murphy wrote:
>> 
>> On Feb 27, 2014, at 1:49 PM, otakujunction@gmail.com wrote:
>> 
>>> Yes it's an ancient 32 bit machine.  There must be a complex bug
>>> involved as the system, when originally mounted, claimed the
>>> correct free space and only as used over time did the
>>> discrepancy between used and free grow.  I'm afraid I chose
>>> btrfs because it appeared capable of breaking the 16 tera limit
>>> on a 32 bit system.  If this isn't the case then it's incredible
>>> that I've been using this file system for about a year without
>>> difficulty until now.
>> 
>> Yep, it's not a good bug. This happened some years ago on XFS too,
>> where people would use the file system for a long time and then at
>> 16TB+1byte written to the volume, kablewy! And then it wasn't
>> usable at all, until put on a 64-bit kernel.
>> 
>> http://oss.sgi.com/pipermail/xfs/2014-February/034588.html
> 
> Well, no, that's not what I said.

What are you thinking I said you said? I wasn't quoting or paraphrasing anything you've said above. I had done a google search on this early and found some rather old threads where some people had this experience of making a large file system on a 32-bit kernel, and only after filling it beyond 16TB did they run into the problem. Here is one of them:

http://lists.centos.org/pipermail/centos/2011-April/109142.html



> I said that it was limited on XFS,
> not that the limit was a result of a user making a filesystem too
> large and then finding out it didn't work. Indeed, you can't do that
> on XFS - mkfs will refuse to run on a block device it can't access the
> last block on, and the kernel has the same "can I access the last
> block of the filesystem" sanity checks that are run at mount and
> growfs time.

Nope. What I reported on the XFS list: I had used mkfs.xfs while running a 32bit kernel on a 20TB virtual disk. It did not fail to make the file system, it failed only to mount it. It was the same booted virtual machine: I created the file system and immediately tried to mount it. If you want the specifics, I'll post on the XFS list with versions and reproduction steps.


> 
> IOWs, XFS has *never* allowed >16TB on 32 bit systems on Linux.

OK that's fine, I've only reported what other people said they experienced, and it comes as no surprise they might have been confused. Although not knowing the size of one's file system would seem to be rare.


Chris Murphy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  0:27           ` Chris Murphy
@ 2014-02-28  4:21             ` Dave Chinner
  2014-02-28  5:49               ` Chris Murphy
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2014-02-28  4:21 UTC (permalink / raw)
  To: Chris Murphy; +Cc: otakujunction, Btrfs BTRFS

On Thu, Feb 27, 2014 at 05:27:48PM -0700, Chris Murphy wrote:
> 
> On Feb 27, 2014, at 5:12 PM, Dave Chinner <david@fromorbit.com>
> wrote:
> 
> > On Thu, Feb 27, 2014 at 02:11:19PM -0700, Chris Murphy wrote:
> >> 
> >> On Feb 27, 2014, at 1:49 PM, otakujunction@gmail.com wrote:
> >> 
> >>> Yes it's an ancient 32 bit machine.  There must be a complex
> >>> bug involved as the system, when originally mounted, claimed
> >>> the correct free space and only as used over time did the
> >>> discrepancy between used and free grow.  I'm afraid I chose
> >>> btrfs because it appeared capable of breaking the 16 tera
> >>> limit on a 32 bit system.  If this isn't the case then it's
> >>> incredible that I've been using this file system for about a
> >>> year without difficulty until now.
> >> 
> >> Yep, it's not a good bug. This happened some years ago on XFS
> >> too, where people would use the file system for a long time and
> >> then at 16TB+1byte written to the volume, kablewy! And then it
> >> wasn't usable at all, until put on a 64-bit kernel.
> >> 
> >> http://oss.sgi.com/pipermail/xfs/2014-February/034588.html
> > 
> > Well, no, that's not what I said.
> 
> What are you thinking I said you said? I wasn't quoting or
> paraphrasing anything you've said above. I had done a google
> search on this early and found some rather old threads where some
> people had this experience of making a large file system on a
> 32-bit kernel, and only after filling it beyond 16TB did they run
> into the problem. Here is one of them:
> 
> http://lists.centos.org/pipermail/centos/2011-April/109142.html

<sigh>

No, he didn't fill it with 16TB of data and then have it fail. He
made a new filesystem *larger* than 16TB and tried to mount it:

| On a CentOS 32-bit backup server with a 17TB LVM logical volume on
| EMC storage.  Worked great, until it rolled 16TB.  Then it quit
| working.  Altogether.  /var/log/messages told me that the
| filesystem was too large to be mounted. Had to re-image the VM as
| a 64-bit CentOS, and then re-attached the RDM's to the LUNs
| holding the PV's for the LV, and it mounted instantly, and we
| kept on trucking.

This just backs up what I told you originally - that XFS has always
refused to mount >16TB filesystems on 32 bit systems.

> > I said that it was limited on XFS, not that the limit was a
> > result of a user making a filesystem too large and then finding
> > out it didn't work. Indeed, you can't do that on XFS - mkfs will
> > refuse to run on a block device it can't access the last block
> > on, and the kernel has the same "can I access the last block of
> > the filesystem" sanity checks that are run at mount and growfs
> > time.
> 
> Nope. What I reported on the XFS list: I had used mkfs.xfs while
> running a 32bit kernel on a 20TB virtual disk. It did not fail to
> make the file system, it failed only to mount it.

You said no such thing. All you said was you couldn't mount a
filesystem > 16TB - you made no mention of how you made the fs, what
the block device was or any other details.

> It was the same
> booted virtual machine: I created the file system and immediately
> tried to mount it. If you want the specifics, I'll post on the XFS list
> with versions and reproduction steps.

Did you check to see whether the block device silently wrapped at
16TB? There's a real good chance it did - but you might have got
lucky because mkfs.xfs uses direct IO and *maybe* that works
correctly on block devices on 32 bit systems. I wouldn't bet on it,
though, given it's something we don't support and therefore never
test....
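
FWIW, one crude way to check for that kind of wrapping, on a scratch
>16TiB block device only since it overwrites data (a sketch; /dev/sdX is
a placeholder), would be something like:

  dd if=/dev/urandom of=/tmp/marker bs=4096 count=1
  # write the marker at exactly 16TiB; with a 32-bit index wrap it
  # lands at offset 0 instead
  dd if=/tmp/marker of=/dev/sdX bs=4096 seek=$((1 << 32)) oflag=direct
  dd if=/dev/sdX of=/tmp/readback bs=4096 count=1 iflag=direct
  cmp /tmp/marker /tmp/readback && echo "wrapped at 16TiB"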

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-27 18:19 Help with space Justin Brown
  2014-02-27 19:27 ` Chris Murphy
@ 2014-02-28  4:34 ` Roman Mamedov
  2014-02-28  7:27   ` Duncan
  2014-05-01  1:52   ` Russell Coker
  2014-02-28  6:13 ` Chris Murphy
  2 siblings, 2 replies; 28+ messages in thread
From: Roman Mamedov @ 2014-02-28  4:34 UTC (permalink / raw)
  To: Justin Brown; +Cc: linux-btrfs


On Thu, 27 Feb 2014 12:19:05 -0600
Justin Brown <otakujunction@gmail.com> wrote:

> I've an 18 TB hardware RAID 5 (Areca ARC-1170 w/ eight 3 TB drives) in

Do you sleep well at night knowing that if one disk fails, you end up with
basically a RAID0 of 7x3TB disks? And that if 2nd one encounters unreadable
sector during rebuild, you lost your data? RAID5 actually stopped working 5
years ago, apparently you didn't get the memo. :)
http://hardware.slashdot.org/story/08/10/21/2126252/why-raid-5-stops-working-in-2009

> need of help.  Disk usage (du) shows 13 TB allocated, yet strangely
> enough df shows approx. 780 GB free.  It seems, somehow, btrfs
> has eaten roughly 4 TB internally.  I've run a scrub and a balance with
> usage=5 with no success; in fact I lost about 20 GB after the

Did you run balance with "-dusage=5" or "-musage=5"? Or both?
What is the output of the balance command?

> terra:/var/lib/nobody/fs/ubfterra # btrfs fi df .
> Data, single: total=17.58TiB, used=17.57TiB
> System, DUP: total=8.00MiB, used=1.93MiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=392.00GiB, used=33.50GiB
                       ^^^^^^^^^

If you were to use "-musage=5", I think this metadata allocation would be
shrunk, and you'd gain a lot more free space.
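
i.e. something along these lines (a sketch; the mountpoint is taken from
your prompt, adjust to taste):

  # compact nearly-empty metadata (and, if you like, data) chunks
  btrfs balance start -musage=5 /var/lib/nobody/fs/ubfterra
  btrfs balance start -dusage=5 /var/lib/nobody/fs/ubfterra
  # check progress from another shell
  btrfs balance status /var/lib/nobody/fs/ubfterra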

But then as others mentioned it may be risky to use this FS on 32-bit at all,
so I'd suggest trying anything else only after you reboot into a 64-bit kernel.

-- 
With respect,
Roman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  4:21             ` Dave Chinner
@ 2014-02-28  5:49               ` Chris Murphy
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2014-02-28  5:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: otakujunction, Btrfs BTRFS


On Feb 27, 2014, at 9:21 PM, Dave Chinner <david@fromorbit.com> wrote:
>> 
>> http://lists.centos.org/pipermail/centos/2011-April/109142.html
> 
> <sigh>
> 
> No, he didn't fill it with 16TB of data and then have it fail. He
> made a new filesystem *larger* than 16TB and tried to mount it:
> 
> | On a CentOS 32-bit backup server with a 17TB LVM logical volume on
> | EMC storage.  Worked great, until it rolled 16TB.  Then it quit
> | working.  Altogether.  /var/log/messages told me that the
> | filesystem was too large to be mounted. Had to re-image the VM as
> | a 64-bit CentOS, and then re-attached the RDM's to the LUNs
> | holding the PV's for the LV, and it mounted instantly, and we
> | kept on trucking.
> 
> This just backs up what I told you originally - that XFS has always
> refused to mount >16TB filesystems on 32 bit systems.

That isn't how I read that at all. It was a 17TB LV, working great (i.e. mounted) until it was filled with 16TB; then it quit working and could not subsequently be mounted until put on a 64-bit kernel.

I don't see how it's "working great" if it's not mountable.



> 
>>> I said that it was limited on XFS, not that the limit was a
>>> result of a user making a filesystem too large and then finding
>>> out it didn't work. Indeed, you can't do that on XFS - mkfs will
>>> refuse to run on a block device it can't access the last block
>>> on, and the kernel has the same "can I access the last block of
>>> the filesystem" sanity checks that are run at mount and growfs
>>> time.
>> 
>> Nope. What I reported on the XFS list: I had used mkfs.xfs while
>> running a 32bit kernel on a 20TB virtual disk. It did not fail to
>> make the file system, it failed only to mount it.
> 
> You said no such thing. All you said was you couldn't mount a
> filesystem > 16TB - you made no mention of how you made the fs, what
> the block device was or any other details.

All correct. It wasn't intended as a bug report; it seemed normal. What I reported was the mount failure.

A VBox 25TB VDI as a single block device, as well as 5x 5TB VDIs in a 20TB linear LV, as well as a 100TB virtual size LV using LVM thinp - all can be formatted with default mkfs.xfs with no complaints.

3.13.4-200.fc20.i686+PAE
xfsprogs-3.1.11-2.fc20.i686


> 
>> It was the same
>> booted virtual machine: I created the file system and immediately
>> tried to mount it. If you want the specifics, I'll post on the XFS list
>> with versions and reproduction steps.
> 
> Did you check to see whether the block device silently wrapped at
> 16TB? There's a real good chance it did - but you might have got
> lucky because mkfs.xfs uses direct IO and *maybe* that works
> correctly on block devices on 32 bit systems. I wouldn't bet on it,
> though, given it's something we don't support and therefore never
> test….

I did not check to see if any of the block devices silently wrapped; I don't know how to do that, although I have a strace of the mkfs on the 100TB virtual LV here:

https://dl.dropboxusercontent.com/u/3253801/mkfsxfs32bit100TBvLV.txt


Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-27 18:19 Help with space Justin Brown
  2014-02-27 19:27 ` Chris Murphy
  2014-02-28  4:34 ` Roman Mamedov
@ 2014-02-28  6:13 ` Chris Murphy
  2014-02-28  6:26   ` Chris Murphy
  2 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2014-02-28  6:13 UTC (permalink / raw)
  To: Justin Brown; +Cc: Btrfs BTRFS, Josef Bacik


On Feb 27, 2014, at 11:19 AM, Justin Brown <otakujunction@gmail.com> wrote:

> terra:/var/lib/nobody/fs/ubfterra # btrfs fi df .
> Data, single: total=17.58TiB, used=17.57TiB
> System, DUP: total=8.00MiB, used=1.93MiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=392.00GiB, used=33.50GiB
> Metadata, single: total=8.00MiB, used=0.00

After glancing at this again, what I thought might be going on might not be going on. The fact it has 17+TB already used, not merely allocated, doesn't seem possible if there's a hard 16TB limit for Btrfs on 32-bit kernels.

But then I don't know why du -h is reporting only 13T total used. And I'm unconvinced this is a balance issue either. Is anything obviously missing from the file system?


Chris Murphy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  6:13 ` Chris Murphy
@ 2014-02-28  6:26   ` Chris Murphy
  2014-02-28  7:39     ` Justin Brown
  0 siblings, 1 reply; 28+ messages in thread
From: Chris Murphy @ 2014-02-28  6:26 UTC (permalink / raw)
  To: Justin Brown; +Cc: Btrfs BTRFS


On Feb 27, 2014, at 11:13 PM, Chris Murphy <lists@colorremedies.com> wrote:

> 
> On Feb 27, 2014, at 11:19 AM, Justin Brown <otakujunction@gmail.com> wrote:
> 
>> terra:/var/lib/nobody/fs/ubfterra # btrfs fi df .
>> Data, single: total=17.58TiB, used=17.57TiB
>> System, DUP: total=8.00MiB, used=1.93MiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=392.00GiB, used=33.50GiB
>> Metadata, single: total=8.00MiB, used=0.00
> 
> After glancing at this again, what I thought might be going on might not be going on. The fact it has 17+TB already used, not merely allocated, doesn't seem possible if there's a hard 16TB limit for Btrfs on 32-bit kernels.
> 
> But then I don't know why du -h is reporting only 13T total used. And I'm unconvinced this is a balance issue either. Is anything obviously missing from the file system?

What are your mount options? Maybe compression?

Clearly du is calculating things differently. I'm getting:

du -sch = 4.2G
df -h    = 5.4G
btrfs df  = 4.7G data and 620MB metadata(total).

I am using compress=lzo.

Chris Murphy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  4:34 ` Roman Mamedov
@ 2014-02-28  7:27   ` Duncan
  2014-02-28  7:37     ` Roman Mamedov
  2014-02-28  7:46     ` Justin Brown
  2014-05-01  1:52   ` Russell Coker
  1 sibling, 2 replies; 28+ messages in thread
From: Duncan @ 2014-02-28  7:27 UTC (permalink / raw)
  To: linux-btrfs

Roman Mamedov posted on Fri, 28 Feb 2014 10:34:36 +0600 as excerpted:

> But then as others mentioned it may be risky to use this FS on 32-bit at
> all, so I'd suggest trying anything else only after you reboot into a
> 64-bit kernel.

Based on what I've read on-list, btrfs is not arch-agnostic, with certain 
on-disk sizes set to native kernel page size, etc, so a filesystem 
created on one arch may well not work on another.

Question: Does this apply to x86/amd64?  Will a filesystem created/used 
on 32-bit x86 even mount/work on 64-bit amd64/x86_64, or does upgrading 
to 64-bit imply backing up (in this case) double-digit TiB of data to 
something other than btrfs and testing it, doing a mkfs on the original 
filesystem once in 64-bit mode, and restoring all that data from backup?

If the existing 32-bit x86 btrfs can't be used on 64-bit amd64, 
transferring all that data (assuming there's something big enough 
available to transfer it to!) to backup and then restoring it is going to 
hurt!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  7:27   ` Duncan
@ 2014-02-28  7:37     ` Roman Mamedov
  2014-02-28  7:46     ` Justin Brown
  1 sibling, 0 replies; 28+ messages in thread
From: Roman Mamedov @ 2014-02-28  7:37 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs


On Fri, 28 Feb 2014 07:27:06 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> Based on what I've read on-list, btrfs is not arch-agnostic, with certain 
> on-disk sizes set to native kernel page size, etc, so a filesystem 
> created on one arch may well not work on another.
> 
> Question: Does this apply to x86/amd64?  Will a filesystem created/used 
> on 32-bit x86 even mount/work on 64-bit amd64/x86_64, or does upgrading 
> to 64-bit imply backing up (in this case) double-digit TiB of data to 
> something other than btrfs and testing it, doing a mkfs on the original 
> filesystem once in 64-bit mode, and restoring all that data from backup?

Page size (4K) is the same on both i386 and amd64. It's also the same on ARM.

The problem arises only on architectures like MIPS and PowerPC, some variants of
which use 16K or 64K page sizes.

Other than this page size issue, it has no arch-specific dependencies,  e.g.
no on-disk structures with "CPU-native integer" sized fields etc, that'd be too
crazy to be true.

-- 
With respect,
Roman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  6:26   ` Chris Murphy
@ 2014-02-28  7:39     ` Justin Brown
  0 siblings, 0 replies; 28+ messages in thread
From: Justin Brown @ 2014-02-28  7:39 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Apologies for the late reply; I'd assumed the issue was closed even
given the unusual behavior.  My mount options are:

/dev/sdb1 on /var/lib/nobody/fs/ubfterra type btrfs
(rw,noatime,nodatasum,nodatacow,noacl,space_cache,skip_balance)

I only recently added nodatacow and skip_balance in an attempt to
figure out where the missing space had gone; I don't know what impact,
if any, they might have on things.  I've got a full balance running at
the moment which, after about a day or so, has managed to process 5%
of the chunks it's considering (988 out of about 18396 chunks balanced
(989 considered), 95% left).  The amount of free space has vacillated
slightly, growing by about a gig and then shrinking back.  As far as
objects missing from the file system, I've not seen any such thing.
I've a lot of files of various data types; the majority is encoded
Japanese animation.  Since I actually play these files via Samba from
an HTPC, particularly the more recent additions, I'd hazard a guess
that if something were breaking I'd have tripped across it by now, the
unusual used-to-free space delta being the exception.  My brother also
uses this RAID for data storage; he's something of a closet
meteorologist and is fascinated by tornadoes.  He hasn't noticed any
unusual behavior either.  I'm in the process of sourcing a 64-bit
capable system in the hope that it will resolve the issue.  Neither of
us is currently writing anything to the file system for fear of things
breaking, but both of us have been reading from it without issue, other
than the noticeable performance impact the balance seems to be having.
Thanks for the help.

-Justin


On Fri, Feb 28, 2014 at 12:26 AM, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Feb 27, 2014, at 11:13 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
>>
>> On Feb 27, 2014, at 11:19 AM, Justin Brown <otakujunction@gmail.com> wrote:
>>
>>> terra:/var/lib/nobody/fs/ubfterra # btrfs fi df .
>>> Data, single: total=17.58TiB, used=17.57TiB
>>> System, DUP: total=8.00MiB, used=1.93MiB
>>> System, single: total=4.00MiB, used=0.00
>>> Metadata, DUP: total=392.00GiB, used=33.50GiB
>>> Metadata, single: total=8.00MiB, used=0.00
>>
>> After glancing at this again, what I thought might be going on might not be going on. The fact it has 17+TB already used, not merely allocated, doesn't seem possible if there's a hard 16TB limit for Btrfs on 32-bit kernels.
>>
>> But then I don't know why du -h is reporting only 13T total used. And I'm unconvinced this is a balance issue either. Is anything obviously missing from the file system?
>
> What are your mount options? Maybe compression?
>
> Clearly du is calculating things differently. I'm getting:
>
> du -sch = 4.2G
> df -h    = 5.4G
> btrfs df  = 4.7G data and 620MB metadata(total).
>
> I am using compress=lzo.
>
> Chris Murphy
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  7:27   ` Duncan
  2014-02-28  7:37     ` Roman Mamedov
@ 2014-02-28  7:46     ` Justin Brown
  1 sibling, 0 replies; 28+ messages in thread
From: Justin Brown @ 2014-02-28  7:46 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

Absolutely.  I'd like to know the answer to this, as 13 TB will take
a considerable amount of time to back up anywhere, assuming I find a
place.  I'm considering rebuilding a smaller RAID with newer drives
(it was originally built using 16 250 GB Western Digital drives; it's
about eleven years old now, having been in use the entire time without
failure, and I'm considering replacing each 250 GB drive with a 3 TB
alternative).  Unfortunately, between upgrading the host and building
a new RAID, the expense isn't something I'm anticipating with
pleasure...

On Fri, Feb 28, 2014 at 1:27 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Roman Mamedov posted on Fri, 28 Feb 2014 10:34:36 +0600 as excerpted:
>
>> But then as others mentioned it may be risky to use this FS on 32-bit at
>> all, so I'd suggest trying anything else only after you reboot into a
>> 64-bit kernel.
>
> Based on what I've read on-list, btrfs is not arch-agnostic, with certain
> on-disk sizes set to native kernel page size, etc, so a filesystem
> created on one arch may well not work on another.
>
> Question: Does this apply to x86/amd64?  Will a filesystem created/used
> on 32-bit x86 even mount/work on 64-bit amd64/x86_64, or does upgrading
> to 64-bit imply backing up (in this case) double-digit TiB of data to
> something other than btrfs and testing it, doing a mkfs on the original
> filesystem once in 64-bit mode, and restoring all that data from backup?
>
> If the existing 32-bit x86 btrfs can't be used on 64-bit amd64,
> transferring all that data (assuming there's something big enough
> available to transfer it to!) to backup and then restoring it is going to
> hurt!
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-02-28  4:34 ` Roman Mamedov
  2014-02-28  7:27   ` Duncan
@ 2014-05-01  1:52   ` Russell Coker
  2014-05-01  5:33     ` Duncan
  1 sibling, 1 reply; 28+ messages in thread
From: Russell Coker @ 2014-05-01  1:52 UTC (permalink / raw)
  To: Roman Mamedov, linux-btrfs

On Fri, 28 Feb 2014 10:34:36 Roman Mamedov wrote:
> > I've an 18 TB hardware RAID 5 (Areca ARC-1170 w/ eight 3 TB drives) in
> 
> Do you sleep well at night knowing that if one disk fails, you end up with
> basically a RAID0 of 7x3TB disks? And that if 2nd one encounters unreadable
> sector during rebuild, you lost your data? RAID5 actually stopped working 5
> years ago, apparently you didn't get the memo. :)
> http://hardware.slashdot.org/story/08/10/21/2126252/why-raid-5-stops-working-in-2009

I've just been doing some experiments with a failing disk used for backups (so 
I'm not losing any real data here).  The "dup" option for metadata means that 
the entire filesystem structure is intact in spite of having lots of errors 
(in another thread I wrote about getting 50+ correctable errors on metadata 
while doing a backup).

My experience is that in the vast majority of disk failures that don't involve 
dropping a disk the majority of disk data will still be readable.  For example 
one time I had a workstation running RAID-1 get too hot in summer and both 
disks developed significant numbers of errors, enough that it couldn't 
maintain a Linux Software RAID-1 (disks got kicked out all the time).  I wrote 
a program to read all the data from disk 0 and read from disk 1 any blocks 
that couldn't be read from disk 0, the result was that after running e2fsck on 
the result I didn't lose any data.
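
These days GNU ddrescue can do essentially the same two-pass trick (a
sketch, not the program I actually used; device names are examples):

  # pass 1: copy everything readable from disk 0, recording bad areas
  ddrescue /dev/sda disk.img rescue.map
  # pass 2: fill only the still-missing areas, reading them from disk 1
  ddrescue /dev/sdb disk.img rescue.map
  # then fsck the assembled image
  e2fsck -f disk.img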

So if you have BTRFS configured to "dup" metadata on a RAID-5 array (either 
hardware RAID or Linux Software RAID) then the probability of losing metadata 
would be a lot lower than for a filesystem which doesn't do checksums and 
doesn't duplicate metadata.  To lose metadata you would need to have two 
errors that line up with both copies of the same metadata block.
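
As a concrete sketch of that setup (device name hypothetical):

  # btrfs on top of a single hardware or md RAID-5 device,
  # with checksums and duplicated metadata
  mkfs.btrfs -d single -m dup /dev/md0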

One problem with many RAID arrays is that it seems to only be possible to 
remove a disk and generate a replacement from parity.  I'd like to be able to 
read all the data from the old disk which is readable and write it to the new 
disk.  Then use the parity from other disks to recover the blocks which 
weren't readable.  That way if you have errors on two disks it won't matter 
unless they both happen to be on the same stripe.  Given that BTRFS RAID-5 
isn't usable yet it seems that the only way to get this result is to use RAID-
Z on ZFS.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-01  1:52   ` Russell Coker
@ 2014-05-01  5:33     ` Duncan
  2014-05-02  1:48       ` Russell Coker
  0 siblings, 1 reply; 28+ messages in thread
From: Duncan @ 2014-05-01  5:33 UTC (permalink / raw)
  To: linux-btrfs

Russell Coker posted on Thu, 01 May 2014 11:52:33 +1000 as excerpted:

> I've just been doing some experiments with a failing disk used for
> backups (so I'm not losing any real data here).

=:^)

> The "dup" option for metadata means that the entire filesystem
> structure is intact in spite of having lots of errors (in another
> thread I wrote about getting 50+ correctable errors on metadata while
> doing a backup).

TL;DR: Discussion of btrfs raid1 and n-way-mirroring.  Bonus discussion
on spinning rust heat-death and death modes in general.

That's why I'm running raid1 for both data and metadata here.  I love 
btrfs' data/metadata checksumming and integrity mechanisms, and having 
that second copy to scrub from in the event of an error on one of them is 
just as important to me as the device-redundancy-and-failure-recovery bit.

I could get the latter on md/raid and did run it for some years, but
there's no way to have it do routine read-time parity cross-checking
and scrubbing (or N-way checking and voting, rewriting a bad copy on
failure, in the case of raid1), even tho it has all the cross-checksums
already there and available to do it; it only actually makes /use/ of
them for recovery if a device fails...
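
The btrfs side of that, for comparison, is just a raid1 mkfs plus a
periodic scrub (a sketch; example devices and mountpoint):

  mkfs.btrfs -d raid1 -m raid1 /dev/sda2 /dev/sdb2
  # verify every checksum and rewrite any bad copy from its good mirror
  btrfs scrub start -Bd /mnt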

My biggest frustration with btrfs ATM is the lack of "true" raid1, aka 
N-way-mirroring.  Btrfs presently only does pair-mirroring, no matter the 
number of devices in the "raid1".  Checksummed-3-way-redundancy really is 
the sweet spot I'd like to hit, and yes it's on the road map, but this 
thing seems to be taking about as long as Christmas does to a five or six 
year old... which is a pretty apt metaphor of my anticipation and the 
eagerness with which I'll be unwrapping and playing with that present 
once it comes! =:^)

> My experience is that in the vast majority of disk failures that don't
> involve dropping a disk the majority of disk data will still be
> readable.  For example one time I had a workstation running RAID-1 get
> too hot in summer and both disks developed significant numbers of
> errors, enough that it couldn't maintain a Linux Software RAID-1 (disks
> got kicked out all the time).  I wrote a program to read all the data
> from disk 0 and read from disk 1 any blocks that couldn't be read from
> disk 0, the result was that after running e2fsck on the result I didn't
> lose any data.

That's rather similar to an experience of mine.  I'm in Phoenix, AZ, and 
outdoor in-the-shade temps can reach near 50C.  Air-conditioning failure 
with a system left running while I was elsewhere.  I came home to the 
"hot car effect", far hotter inside than out, so likely 55-60C ambient 
air temp, very likely 70+ device temps.  The system was still on but 
"frozen" (broiled?) due to disk head crash and possibly CPU thermal 
shutdown.

Surprisingly, after shutting everything down, getting a new AC, and 
letting the system cool for a few hours, it pretty much all came back to 
life, including the CPU(s) (that was pre-multi-core, but I don't remember 
whether it was my dual socket original Opteron, or pre-dual-socket for me 
as well) which I had feared would be dead.

The disk as well came back, minus the sections that were being accessed 
at the time of the head crash, which I expect were physically grooved.

I only had the one main disk running at the time, but fortunately I had 
partitioned it up and had working and backup partitions for everything 
vital, and of course the backup partitions weren't mounted at the time, 
and they came thru just fine (tho without checksumming, so I'll never
know if there were bit-flips).  I could boot from the backup / and mount
the other backups, and a working partition or two that weren't hurt,
without problems.

But I *DID* have quite a time recovering anyway, primarily because my 
rootfs, /usr/ and /var (which had the system's installed package 
database), were three different partitions that ended up being from three 
different backup dates... on gentoo, with its rolling updates!  IIRC I 
had a current /var including the package database, but the package files 
actually on the rootfs and on /usr were from different package versions 
from what the db in /var was tracking, and were different from each other 
as well.  I was still finding stale package remnants nearly two years 
later!

But I continued running that disk for several months until I had some 
money to replace it, then copied the system, by then current again except 
for the occasional stale file, to the new setup.  I always wondered how 
much longer I could have run the heat-tested one, but didn't want to 
trust my luck any further, so retired it.

Which was when I got into md/raid, first mostly raid6, then later redone 
to raid1, once I figured out the fancy dual checksums weren't doing 
anything but slowing me down in normal operations anyway.

And on my new setup, I used a partitioning policy I continue to this day, 
namely, everything that the package manager touches[1] including its 
installed-pkg database on /var goes on rootfs.  With a working rootfs and 
several backups of various ages on various physical devices (that 
filesystem's only 8 gig or so, with only 4 gig or so of data, so I can 
and do now keep multiple alternate rootfs partition backups on multiple 
devices) should I need to use them, that means no matter what age the 
backup I might ultimately end up booting to, the package database it 
contains will remain in sync with the content of the packages it's 
tracking.  No further possibility of database and /var from one backup, 
rootfs from another, and /usr from a third!

Anyway, yes, my experience tracks yours.  Both in that case and when I 
simply run the disks to wear-out (which I sometimes do as a secondary/
backup/low-priority-cache-data device once it starts clicking or 
developing bad sectors or whatever), the devices themselves continue to 
work in general, long after I've begun to see intermittent issues with 
them.

Tho my experience to date has been spinning rust.  My primary
workstation's pair of current main devices are now SSDs (Corsair Neutron
256-gig, NOT Neutron GTX), partitioned identically with multiple btrfs
partitions (in btrfs raid1 mode except for the two separate individual
/boots), and I'm happy with them so far, but I must admit to being a bit
worried about their less familiar failure modes.

> So if you have BTRFS configured to "dup" metadata on a RAID-5 array
> (either hardware RAID or Linux Software RAID) then the probability of
> losing metadata would be a lot lower than for a filesystem which doesn't
> do checksums and doesn't duplicate metadata.  To lose metadata you would
> need to have two errors that line up with both copies of the same
> metadata block.

Like I said, btrfs raid1 both data/metadata here, for exactly that 
reason.  But I'd sure like to make it triplet-mirror instead of being 
limited to pair-mirror, again for exactly that reason.  Currently, I 
figure the chance of both copies independently going bad is lower than 
the risk of a bug in still-under-development btrfs making BOTH copies 
equally bad (even if they pass checksum), and I'm choosing to run btrfs 
knowing that tho I keep non-btrfs backups just in case.  But as btrfs 
matures and stabilizes, the chance of a btrfs bug making both copies bad 
goes down, while the chance of the two copies independently going bad at 
the same place remains the same, and as the two chances reverse in 
likelihood, I'd sure like to have that triplet-mirroring available.

Oh well, the day will come, even if I'm a six-year-old waiting for 
Christmas at this point.  =:^\

> One problem with many RAID arrays is that it seems to only be possible
> to remove a disk and generate a replacement from parity.  I'd like to be
> able to read all the data from the old disk which is readable and write
> it to the new disk.  Then use the parity from other disks to recover the
> blocks which weren't readable.  That way if you have errors on two disks
> it won't matter unless they both happen to be on the same stripe.  Given
> that BTRFS RAID-5 isn't usable yet it seems that the only way to get
> this result is to use RAID-Z on ZFS.

=:^(  But at least you're already in December, in terms of your btrfs 
Christmas, while at best I'm still in November, for mine...

---
[1] Everything the package manager touches:  Minus a few write-required 
state files and the like in /var, which are now symlinked to parallels 
in /home/var, since I keep the rootfs read-only mounted by default these 
days, but by the same token, those operational-write-required files can 
go missing or be out of sync without dramatically affecting operation.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-01  5:33     ` Duncan
@ 2014-05-02  1:48       ` Russell Coker
  2014-05-02  8:23         ` Duncan
  0 siblings, 1 reply; 28+ messages in thread
From: Russell Coker @ 2014-05-02  1:48 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Thu, 1 May 2014, Duncan <1i5t5.duncan@cox.net> wrote:
> That's why I'm running raid1 for both data and metadata here.  I love
> btrfs' data/metadata checksumming and integrity mechanisms, and having
> that second copy to scrub from in the event of an error on one of them is
> just as important to me as the device-redundancy-and-failure-recovery bit.
> 
> I could get the latter on md/raid and did run it for some years, but
> there's no way to have it do routine read-time parity cross-checking
> and scrubbing (or N-way checking and voting, rewriting a bad copy on
> failure, in the case of raid1), even tho it has all the cross-checksums
> already there and available to do it; it only actually makes /use/ of
> them for recovery if a device fails...

Am I missing something or is it impossible to do a disk replace on BTRFS right 
now?

I can delete a device, I can add a device, but I'd like to replace a device.

If a disk has some bad sectors and I delete it from a RAID-1 (or RAID-5) array 
then I'll be one bad sector away from real data loss.  However if I could do a 
replace operation then the old disk would still be available if other disks 
don't work.

Currently it seems that the best thing to do if a disk in a RAID-1 array gets 
bad sectors is to shut the system down, run a program to read all the readable 
data and copy it to a fresh disk, then boot up again and run a scrub to fill 
the holes.  With modern disks that means 6+ hours of down-time for the copy.

> My biggest frustration with btrfs ATM is the lack of "true" raid1, aka
> N-way-mirroring.  Btrfs presently only does pair-mirroring, no matter the
> number of devices in the "raid1".  Checksummed-3-way-redundancy really is
> the sweet spot I'd like to hit, and yes it's on the road map, but this
> thing seems to be taking about as long as Christmas does to a five or six
> year old... which is a pretty apt metaphor of my anticipation and the
> eagerness with which I'll be unwrapping and playing with that present
> once it comes! =:^)

http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf

Whether a true RAID-1 means just 2 copies or N copies is a matter of opinion.  
Papers such as the above seem to clearly imply that RAID-1 is strictly 2 
copies of data.

I don't have a strong opinion on how many copies of data can be involved in a 
RAID-1, but I think that there's no good case to claim that only 2 copies 
means that something isn't "true RAID-1".

> > My experience is that in the vast majority of disk failures that don't
> > involve dropping a disk the majority of disk data will still be
> > readable.  For example one time I had a workstation running RAID-1 get
> > too hot in summer and both disks developed significant numbers of
> > errors, enough that it couldn't maintain a Linux Software RAID-1 (disks
> > got kicked out all the time).  I wrote a program to read all the data
> > from disk 0 and read from disk 1 any blocks that couldn't be read from
> > disk 0, the result was that after running e2fsck on the result I didn't
> > lose any data.
> 
> That's rather similar to an experience of mine.  I'm in Phoenix, AZ, and
> outdoor in-the-shade temps can reach near 50C.  Air-conditioning failure
> with a system left running while I was elsewhere.  I came home to the
> "hot car effect", far hotter inside than out, so likely 55-60C ambient
> air temp, very likely 70+ device temps.  The system was still on but
> "frozen" (broiled?) due to disk head crash and possibly CPU thermal
> shutdown.
> 
> Surprisingly, after shutting everything down, getting a new AC, and
> letting the system cool for a few hours, it pretty much all came back to
> life, including the CPU(s) (that was pre-multi-core, but I don't remember
> whether it was my dual socket original Opteron, or pre-dual-socket for me
> as well) which I had feared would be dead.

CPUs have had thermal shutdown for a long time.  When a CPU lacks such 
controls (as some buggy Opteron chips did a few years ago) it makes the IT 
news.

> Anyway, yes, my experience tracks yours.  Both in that case and when I
> simply run the disks to wear-out (which I sometimes do as a secondary/
> backup/low-priority-cache-data device once it starts clicking or
> developing bad sectors or whatever), the devices themselves continue to
> work in general, long after I've begun to see intermittent issues with
> them.

Disks can continue to work for a long time after they flag errors.  The backup 
disk I'm referring to is one that I got from a client a year ago after the NAS 
it was running in flagged an error.

> > So if you have BTRFS configured to "dup" metadata on a RAID-5 array
> > (either hardware RAID or Linux Software RAID) then the probability of
> > losing metadata would be a lot lower than for a filesystem which doesn't
> > do checksums and doesn't duplicate metadata.  To lose metadata you would
> > need to have two errors that line up with both copies of the same
> > metadata block.
> 
> Like I said, btrfs raid1 both data/metadata here, for exactly that
> reason.  But I'd sure like to make it triplet-mirror instead of being
> limited to pair-mirror, again for exactly that reason.  Currently, I

I'd like to be able to run a combination of "dup" and RAID-1 for metadata.  
ZFS has a "copies" option, it would be good if we could do that.

RAID-1 plus backups is more than adequate for file data for me.  But errors on 
2 disks knocking out some metadata would be a major PITA.

It's nice the way a BTRFS scrub tells you the file names that are affected.
So if I have errors on a pair of disks in a RAID-1 array that don't affect
metadata, then I don't need to do a full restore and try to find and merge
changes that happened after the last backup; I just need to copy the raw
devices to new disks, scrub the filesystem, and then restore from backup any
files that are flagged as bad.

> figure the chance of both copies independently going bad is lower than
> the risk of a bug in still-under-development btrfs making BOTH copies
> equally bad (even if they pass checksum), and I'm choosing to run btrfs
> knowing that tho I keep non-btrfs backups just in case.  But as btrfs
> matures and stabilizes, the chance of a btrfs bug making both copies bad
> goes down, while the chance of the two copies independently going bad at
> the same place remains the same, and as the two chances reverse in
> likelihood, I'd sure like to have that triplet-mirroring available.

I use BTRFS for all my backups too.  I think that the chance of data patterns 
triggering filesystem bugs that break backups as well as primary storage is 
vanishingly small.  The chance of such bugs being latent for long enough that 
I can't easily recreate the data isn't worth worrying about.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-02  1:48       ` Russell Coker
@ 2014-05-02  8:23         ` Duncan
  2014-05-02  9:28           ` Brendan Hide
  2014-05-02 19:21           ` Chris Murphy
  0 siblings, 2 replies; 28+ messages in thread
From: Duncan @ 2014-05-02  8:23 UTC (permalink / raw)
  To: linux-btrfs

Russell Coker posted on Fri, 02 May 2014 11:48:07 +1000 as excerpted:

> On Thu, 1 May 2014, Duncan <1i5t5.duncan@cox.net> wrote:
> 
> Am I missing something or is it impossible to do a disk replace on BTRFS
> right now?
> 
> I can delete a device, I can add a device, but I'd like to replace a
> device.

You're missing something... but it's easy to miss, as I almost missed it
too even tho I was sure it was there.

Something tells me btrfs replace (not device replace, simply replace) 
should be moved to btrfs device replace...
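
i.e. something like (a sketch; device names and mountpoint are
placeholders):

  # replace the failing device in place (reads from it where it can)
  btrfs replace start /dev/old /dev/new /mnt
  btrfs replace status /mnt
  # or, if the old disk is really sick, avoid reading from it:
  btrfs replace start -r /dev/old /dev/new /mnt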

> http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
> 
> Whether a true RAID-1 means just 2 copies or N copies is a matter of
> opinion. Papers such as the above seem to clearly imply that RAID-1 is
> strictly 2 copies of data.

Thanks for that link. =:^)

My position would be that this reflects the original, but not the modern, 
definition.  The paper seems to describe as raid1 what would later come 
to be called raid1+0, which quickly morphed into raid10, leaving the 
raid1 description only covering pure mirror-raid.

And even then, the paper says mirrors in spots without specifically 
defining it as (only) two mirrors, but in others it seems to /assume/, 
without further explanation, just two mirrors.  So I'd argue that even 
then the definition of raid1 allowed more than two mirrors, but that it 
just so happened that the examples and formulae given dealt with only two 
mirrors.

Tho certainly I can see the room for differing opinions on the matter as 
well.

> I don't have a strong opinion on how many copies of data can be involved
> in a RAID-1, but I think that there's no good case to claim that only 2
> copies means that something isn't "true RAID-1".

Well, I'd say two copies, if it's only two devices in the raid1, would
be true raid1.  But if it's, say, four devices in the raid1, as is
certainly possible with btrfs raid1, then if it's not mirrored 4-way
across all devices, it's not true raid1, but rather some sort of hybrid
raid: raid10 (or raid01) if the devices are so arranged, raid1+linear if
arranged that way, or some form that doesn't nicely fall into a well-
defined raid level categorization.

But still, opinions can differ.  Point well made... and taken. =:^)

>> Surprisingly, after shutting everything down, getting a new AC, and
>> letting the system cool for a few hours, it pretty much all came back
>> to life, including the CPU(s) (that was pre-multi-core, but I don't
>> remember whether it was my dual socket original Opteron, or
>> pre-dual-socket for me as well) which I had feared would be dead.
> 
> CPUs have had thermal shutdown for a long time.  When a CPU lacks such
> controls (as some buggy Opteron chips did a few years ago) it makes the
> IT news.

That was certainly some years ago, and I remember that for a while AMD 
Athlons didn't have thermal shutdown yet, while Intel CPUs of the time 
did.  And it was an AMD CPU, as I've run mostly AMD (with only specific 
exceptions) for literally decades now.  What I don't recall for sure is 
whether it was my original AMD Athlon (500 MHz), the Athlon C @ 1.2 GHz, 
or the dual Opteron 242s I ran for several years.  If it was the 
original Athlon, it wouldn't have had thermal shutdown.  If it was the 
Opterons, I think they did.  But I think the Athlon Cs were in the 
period when Intel had introduced thermal shutdown and AMD hadn't, and 
Tom's Hardware among others had dramatic videos of exactly what happened 
if one actually tried to run the things without cooling, compared to 
running an Intel of the period.

But I remember being rather surprised that the CPU(s) was/were unharmed, 
which means it very well could have been the Athlon C era, and I had seen 
the dramatic videos and knew my CPU wasn't protected.

> I'd like to be able to run a combination of "dup" and RAID-1 for
> metadata. ZFS has a "copies" option, it would be good if we could do
> that.

Well, if N-way-mirroring were possible, one could do more or less just 
that easily enough with suitable partitioning, setting the number of 
mirrors for data vs. metadata as appropriate... but with only two-way-
mirroring and dup as the choices, the only way to do it would be 
layering btrfs atop something else, say md/raid.  And that's without 
real-time checksum verification at the md/raid level...
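
Just as a rough sketch of that layering (device names are hypothetical; 
md supplies the two-device mirror, dup supplies a second metadata copy 
on each leg, but md does no checksum verification of its own):

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
# mkfs.btrfs -m dup -d single /dev/md0
# mount /dev/md0 /mnt

That gets four physical copies of metadata and two of data, at the cost 
of btrfs no longer seeing the individual devices.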

> I use BTRFS for all my backups too.  I think that the chance of data
> patterns triggering filesystem bugs that break backups as well as
> primary storage is vanishingly small.  The chance of such bugs being
> latent for long enough that I can't easily recreate the data isn't worth
> worrying about.

The fact that my primary filesystems and their first backups are btrfs 
raid1 on dual SSDs, while secondary backups are on spinning rust, does 
factor into my calculations here.

I ran reiserfs for many years, in fact since I first switched to Linux 
full time in the early kernel 2.4 era.  While it had its problems early 
on, since the introduction of ordered data mode in IIRC 2.6.16 or so, 
reiserfs has proven its reliability here through all sorts of hardware 
issues, including faulty memory, bad power, and that overheated disk, 
and through the infamous ext3 write-back-journal-by-default period as 
well.  I attribute a good part of that reliability to the fact that 
kernel hackers who think they know enough about ext* to mess with it 
are afraid to touch reiserfs, leaving that to the experts; that caution 
(and memories of its earlier writeback issues) is precisely why 
reiserfs didn't suffer the same writeback-by-default problems ext3 had, 
when kernel hackers thought they knew ext3 well enough to try writeback 
with it.

And it's in no small part due to Chris Mason's history with reiserfs and 
the introduction of ordered journaling there, that I trust btrfs to the 
degree I trust it today.

But reiserfs, while it has proven its reliability here time and again, 
simply wasn't designed for SSDs and isn't appropriate for them.  I had 
tried btrfs on spinning rust somewhat earlier and decided it wasn't 
mature enough for my usage at that time, but when I switched to SSD I 
needed to find a new filesystem as well.  Because I do /not/ trust the 
kernel hackers to keep their hands off ext*, while at the same time I 
/do/ routinely run pre-release kernels, occasionally including pre-rc1 
kernels, thereby heightening my exposure to ext* kernel hacking risks, 
I wasn't particularly enthusiastic about switching to ext4 and its ssd 
mode.  Moreover, having run reiserfs with tail-packing for years, I 
viewed the restriction to whole-block allocations as a regression I 
didn't want to deal with.

As a result, when I switched to SSD and needed something more suited to 
ssd than reiserfs, it was little surprise that I decided on a new 
filesystem with a lead developer instrumental in making reiserfs as 
stable as it has been for me, even while keeping my spinning rust backups 
on the reiserfs that has time and again demonstrated for me surprisingly 
good stability in the face of hardware issues.

Meanwhile, I'm not so much afraid of data-pattern triggered btrfs bugs 
directly as I am of some new development-version btrfs bug eating my 
working fs, and then eating the backup too when I boot into it to 
recover, if that backup is also btrfs.  If the backup is instead my 
trusted reiserfs, which I've found so stable over the years, a new 
btrfs bug shouldn't affect it.  I may well discover that my attempt to 
restore from reiserfs to btrfs doesn't work, because the btrfs bug eats 
the new btrfs again as soon as I load it up, but at least the reiserfs 
copy of the same data should still be safe, since the btrfs bug 
wouldn't affect it.

In that regard having reiserfs on the second level backups on spinning 
rust, while running btrfs on the working copy and primary level backups 
on ssd, serves as a firewall against bugs from the still under 
development btrfs eating first my working copy, then the primary backup, 
then the secondary backup and beyond, since the secondary and beyond 
backups are beyond the firewall on a totally different filesystem, which 
shouldn't be susceptible to the same bugs.

Another risk reduction I take some comfort in is the fact that I keep 
my rootfs mounted read-only by default, only remounting it read-write 
for updates to packages or configuration.  Since the rootfs is thus 
likely to be read-only mounted at the time of a crash, and will almost 
certainly be read-only mounted if I'm booting from backup in order to 
restore a damaged working filesystem, it's even more unlikely that a 
bug that might destroy the working copy could destroy the backup as I 
boot from it to try to restore the working copy. =:^)  Of course if the 
bug triggers on /home or the like, it could still destroy the backup 
/home as well, at least the primary btrfs backup, but in that case 
chances remain quite good that the read-only rootfs, with all the usual 
recovery tools, etc, will remain intact and usable to rescue the other 
filesystems.
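
(The mechanics are nothing exotic, just "ro" on the rootfs line in 
fstab and a remount pair around updates, roughly:

# mount -o remount,rw /
  ... update packages or edit configuration ...
# mount -o remount,ro /

with the usual caveat that anything still holding files open for write 
on the rootfs can keep the remount-ro from succeeding.)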

And yet another risk reduction is the fact that I run totally separate 
and independent partition filesystems, not subvolumes on the same 
partition with the same common base filesystem structures, which is what 
a lot of btrfs users are choosing to do.  If btrfs suffers a structure 
destroying bug, they'll lose and have to restore everything, while I'll 
lose and have to restore just one filesystem with its rather more limited 
dataset.

Meanwhile, those partitions are all on dual-copy, checksummed GPT 
(nowadays I use GPT even on USB sticks), with an identical partitioning 
scheme on each of two different physical devices.  So if one partition 
table gets corrupted, the backup copy kicks in.  And if both partition 
tables, at opposite ends of the same device, get corrupted, presumably 
in some power-failure accident while I was actually editing them or 
something, then there's still the other physical device I can boot 
from, using its partition table and gptfdisk to redo the corrupted 
partition table on the damaged device.
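
For reference, gptfdisk's sgdisk can both back a table up to a file and 
clone it between devices; a sketch with placeholder device names:

# sgdisk --backup=/root/sda-gpt.bak /dev/sda      (save sda's GPT to a file)
# sgdisk --load-backup=/root/sda-gpt.bak /dev/sda (restore it later)
# sgdisk --replicate=/dev/sdb /dev/sda            (copy sda's table onto sdb)
# sgdisk --randomize-guids /dev/sdb               (give the copy unique GUIDs)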

If all that fails and my final backup, an external spinning rust device 
not normally even attached to the computer, fails as well, say due to a 
fire or flood or something, I figure at that point I'll have rather more 
important things to worry about, like just surviving and finding a new 
home, than what happened to all those checksum-verified both logically 
and physically redundant layers of backup.  And when I do get around to 
worrying about computers again, well, the really valuable stuff's in my 
head anyway, and if *THAT* copy dies too, well, come visit me in the 
mental ward or cemetery!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-02  8:23         ` Duncan
@ 2014-05-02  9:28           ` Brendan Hide
  2014-05-02 19:21           ` Chris Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Brendan Hide @ 2014-05-02  9:28 UTC (permalink / raw)
  To: Duncan, linux-btrfs; +Cc: Russell Coker

On 02/05/14 10:23, Duncan wrote:
> Russell Coker posted on Fri, 02 May 2014 11:48:07 +1000 as excerpted:
>
>> On Thu, 1 May 2014, Duncan <1i5t5.duncan@cox.net> wrote:
>> [snip]
>> http://www.eecs.berkeley.edu/Pubs/TechRpts/1987/CSD-87-391.pdf
>>
>> Whether a true RAID-1 means just 2 copies or N copies is a matter of
>> opinion. Papers such as the above seem to clearly imply that RAID-1 is
>> strictly 2 copies of data.
> Thanks for that link. =:^)
>
> My position would be that reflects the original, but not the modern,
> definition.  The paper seems to describe as raid1 what would later come
> to be called raid1+0, which quickly morphed into raid10, leaving the
> raid1 description only covering pure mirror-raid.
Personally I'm flexible about the terminology in day-to-day operations 
and discussion, because the end result is "close enough".  But ...

The definition of "RAID 1" is still only a mirror of two devices. As far 
as I'm aware, Linux's mdraid is the only raid system in the world that 
allows N-way mirroring while still referring to it as "RAID1". Due to 
the way it handles data in chunks, and also due to its "rampant layering 
violations", *technically* btrfs's "RAID-like" features are not "RAID".

To differentiate from "RAID", we're already using lowercase "raid" and, 
in the long term, some of us are also looking to do away with the 
"raid{x}" terms altogether, in favour of what Hugo and I last termed 
"csp notation".  Changing the terminology is important - but it is also 
distinctly non-urgent.

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-02  8:23         ` Duncan
  2014-05-02  9:28           ` Brendan Hide
@ 2014-05-02 19:21           ` Chris Murphy
  2014-05-02 21:08             ` Hugo Mills
  2014-05-03 16:31             ` Austin S Hemmelgarn
  1 sibling, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2014-05-02 19:21 UTC (permalink / raw)
  To: Btrfs BTRFS


On May 2, 2014, at 2:23 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> 
> Something tells me btrfs replace (not device replace, simply replace) 
> should be moved to btrfs device replace…

The syntax for "btrfs device" is different though; replace works like balance: btrfs balance start and btrfs replace start. And you can also get a status on it. We don't (yet) have options to stop, pause, or resume, which could come in handy for long rebuilds where a reboot is required (?), although maybe that just gets handled automatically: set it to pause, then unmount, then reboot, then mount and resume.
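
i.e. the two follow the same start/status pattern (devices and mount point below are placeholders):

# btrfs balance start /mnt
# btrfs balance status /mnt
# btrfs replace start /dev/sdX /dev/sdY /mnt
# btrfs replace status /mnt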

> Well, I'd say two copies if it's only two devices in the raid1... would 
> be true raid1.  But if it's say four devices in the raid1, as is 
> certainly possible with btrfs raid1, that if it's not mirrored 4-way 
> across all devices, it's not true raid1, but rather some sort of hybrid 
> raid,  raid10 (or raid01) if the devices are so arranged, raid1+linear if 
> arranged that way, or some form that doesn't nicely fall into a well 
> defined raid level categorization.

Well, md raid1 is always n-way. So if you use -n 3 and specify three devices, you'll get 3-way mirroring (3 mirrors). But I don't know of any hardware raid that works this way. They all seem to treat raid 1 as strictly two devices; at 4 devices it's raid10, and only in pairs.
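
(With md that's literally just the device count at create time, e.g. with placeholder devices:

# mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

which keeps a full copy on each of the three devices.)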

Btrfs raid1 with 3+ devices is unique as far as I can tell. It is something like raid1 (2 copies) + linear/concat, but the allocation is round robin. I don't read code, but based on how a 3-disk raid1 volume grows VDI files as it's filled, it looks like 1GB chunks are copied like this:

Disk1	Disk2	Disk3
134		124		235
679		578		689

So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a chunk 1; disk 2 and 3 each have a chunk 2, and so on. Total of 9GB of data taking up 18GB of space, 6GB on each drive. You can't do this with any other raid1 as far as I know. You do definitely run out of space on one disk first though because of uneven metadata to data chunk allocation.

Anyway I think we're off the rails with raid1 nomenclature as soon as we have 3 devices. It's probably better to call it replication, with an assumed default of 2 replicates unless otherwise specified.

There's definitely a benefit, efficiency wise, to a 3 device volume with 2 replicates. As soon as we go to four disks with 2 replicates it makes more sense to do raid10, although I haven't tested odd-device raid10 setups so I'm not sure what happens.


Chris Murphy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-02 19:21           ` Chris Murphy
@ 2014-05-02 21:08             ` Hugo Mills
  2014-05-02 22:33               ` Chris Murphy
  2014-05-03 16:31             ` Austin S Hemmelgarn
  1 sibling, 1 reply; 28+ messages in thread
From: Hugo Mills @ 2014-05-02 21:08 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 3304 bytes --]

On Fri, May 02, 2014 at 01:21:50PM -0600, Chris Murphy wrote:
> 
> On May 2, 2014, at 2:23 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> > 
> > Something tells me btrfs replace (not device replace, simply replace) 
> > should be moved to btrfs device replace…
> 
> The syntax for "btrfs device" is different though; replace is like balance: btrfs balance start and btrfs replace start. And you can also get a status on it. We don't (yet) have options to stop, start, resume, which could maybe come in handy for long rebuilds and a reboot is required (?) although maybe that just gets handled automatically: set it to pause, then unmount, then reboot, then mount and resume.
> 
> > Well, I'd say two copies if it's only two devices in the raid1... would 
> > be true raid1.  But if it's say four devices in the raid1, as is 
> > certainly possible with btrfs raid1, that if it's not mirrored 4-way 
> > across all devices, it's not true raid1, but rather some sort of hybrid 
> > raid,  raid10 (or raid01) if the devices are so arranged, raid1+linear if 
> > arranged that way, or some form that doesn't nicely fall into a well 
> > defined raid level categorization.
> 
> Well, md raid1 is always n-way. So if you use -n 3 and specify three devices, you'll get 3-way mirroring (3 mirrors). But I don't know any hardware raid that works this way. They all seem to be raid 1 is strictly two devices. At 4 devices it's raid10, and only in pairs.
> 
> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is something like raid1 (2 copies) + linear/concat. But that allocation is round robin. I don't read code but based on how a 3 disk raid1 volume grows VDI files as it's filled it looks like 1GB chunks are copied like this
> 
> Disk1	Disk2	Disk3
> 134		124		235
> 679		578		689
> 
> So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a chunk 1; disk 2 and 3 each have a chunk 2, and so on. Total of 9GB of data taking up 18GB of space, 6GB on each drive. You can't do this with any other raid1 as far as I know. You do definitely run out of space on one disk first though because of uneven metadata to data chunk allocation.

   The algorithm is that when the chunk allocator is asked for a block
group (a pair of chunks for RAID-1), it picks the chunks it needs from
different devices, in order of most free space first.  So, with disks
of size 8, 4, 4, you get:

Disk 1: 12345678
Disk 2: 1357
Disk 3: 2468

and with 8, 8, 4, you get:

Disk 1: 1234568A
Disk 2: 1234579A
Disk 3: 6789
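
   That policy is easy to mimic in a few lines of shell if anyone wants
to try other size combinations.  A toy model only; ties are broken by
device index here, which need not match the real allocator:

#!/bin/bash
# Toy model of the RAID-1 chunk allocator: each block group takes one
# chunk on each of the two devices with the most free space.
# Sizes are counted in 1GiB chunks.
free=(8 4 4)                      # free space per device
alloc=()
for i in "${!free[@]}"; do alloc[$i]=""; done
chunk=1
while :; do
    # device indexes ordered by free space, largest first
    order=$(for i in "${!free[@]}"; do
                echo "$i ${free[$i]}"
            done | sort -k2,2nr -k1,1n | cut -d' ' -f1)
    set -- $order
    d1=$1; d2=$2
    # stop when even the second-best device is full
    [ "${free[$d2]}" -le 0 ] && break
    alloc[$d1]+="$chunk "
    alloc[$d2]+="$chunk "
    free[$d1]=$((free[$d1] - 1))
    free[$d2]=$((free[$d2] - 1))
    chunk=$((chunk + 1))
done
for i in "${!alloc[@]}"; do
    echo "Disk $((i + 1)): ${alloc[$i]}"
done

With free=(8 4 4) it reproduces the first layout above.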

   Hugo.

> Anyway I think we're off the rails with raid1 nomenclature as soon as we have 3 devices. It's probably better to call it replication, with an assumed default of 2 replicates unless otherwise specified.
> 
> There's definitely a benefit to a 3 device volume with 2 replicates, efficiency wise. As soon as we go to four disks 2 replicates it makes more sense to do raid10, although I haven't tested odd device raid10 setups so I'm not sure what happens.
> 
> 
> Chris Murphy
> 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
               --- Prisoner unknown:  Return to Zenda. ---               

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-02 21:08             ` Hugo Mills
@ 2014-05-02 22:33               ` Chris Murphy
  0 siblings, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2014-05-02 22:33 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Btrfs BTRFS


On May 2, 2014, at 3:08 PM, Hugo Mills <hugo@carfax.org.uk> wrote:

> On Fri, May 02, 2014 at 01:21:50PM -0600, Chris Murphy wrote:
>> 
>> On May 2, 2014, at 2:23 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>>> 
>>> Something tells me btrfs replace (not device replace, simply replace) 
>>> should be moved to btrfs device replace…
>> 
>> The syntax for "btrfs device" is different though; replace is like balance: btrfs balance start and btrfs replace start. And you can also get a status on it. We don't (yet) have options to stop, start, resume, which could maybe come in handy for long rebuilds and a reboot is required (?) although maybe that just gets handled automatically: set it to pause, then unmount, then reboot, then mount and resume.
>> 
>>> Well, I'd say two copies if it's only two devices in the raid1... would 
>>> be true raid1.  But if it's say four devices in the raid1, as is 
>>> certainly possible with btrfs raid1, that if it's not mirrored 4-way 
>>> across all devices, it's not true raid1, but rather some sort of hybrid 
>>> raid,  raid10 (or raid01) if the devices are so arranged, raid1+linear if 
>>> arranged that way, or some form that doesn't nicely fall into a well 
>>> defined raid level categorization.
>> 
>> Well, md raid1 is always n-way. So if you use -n 3 and specify three devices, you'll get 3-way mirroring (3 mirrors). But I don't know any hardware raid that works this way. They all seem to be raid 1 is strictly two devices. At 4 devices it's raid10, and only in pairs.
>> 
>> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is something like raid1 (2 copies) + linear/concat. But that allocation is round robin. I don't read code but based on how a 3 disk raid1 volume grows VDI files as it's filled it looks like 1GB chunks are copied like this
>> 
>> Disk1	Disk2	Disk3
>> 134		124		235
>> 679		578		689
>> 
>> So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a chunk 1; disk 2 and 3 each have a chunk 2, and so on. Total of 9GB of data taking up 18GB of space, 6GB on each drive. You can't do this with any other raid1 as far as I know. You do definitely run out of space on one disk first though because of uneven metadata to data chunk allocation.
> 
>   The algorithm is that when the chunk allocator is asked for a block
> group (in pairs of chunks for RAID-1), it picks the number of chunks
> it needs, from different devices, in order of the device with the most
> free space. So, with disks of size 8, 4, 4, you get:
> 
> Disk 1: 12345678
> Disk 2: 1357
> Disk 3: 2468
> 
> and with 8, 8, 4, you get:
> 
> Disk 1: 1234568A
> Disk 2: 1234579A
> Disk 3: 6789

Sure, in my example I was assuming equal size disks. But it's good to have an uneven-disk example too, because it shows all the more the flexibility btrfs replication has over the alternatives, with odd-numbered *and* unevenly sized disks.


Chris Murphy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-02 19:21           ` Chris Murphy
  2014-05-02 21:08             ` Hugo Mills
@ 2014-05-03 16:31             ` Austin S Hemmelgarn
  2014-05-03 19:09               ` Chris Murphy
  1 sibling, 1 reply; 28+ messages in thread
From: Austin S Hemmelgarn @ 2014-05-03 16:31 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

On 05/02/2014 03:21 PM, Chris Murphy wrote:
> 
> On May 2, 2014, at 2:23 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> 
>> Something tells me btrfs replace (not device replace, simply
>> replace) should be moved to btrfs device replace…
> 
> The syntax for "btrfs device" is different though; replace is like
> balance: btrfs balance start and btrfs replace start. And you can
> also get a status on it. We don't (yet) have options to stop,
> start, resume, which could maybe come in handy for long rebuilds
> and a reboot is required (?) although maybe that just gets handled
> automatically: set it to pause, then unmount, then reboot, then
> mount and resume.
> 
>> Well, I'd say two copies if it's only two devices in the raid1...
>> would be true raid1.  But if it's say four devices in the raid1,
>> as is certainly possible with btrfs raid1, that if it's not
>> mirrored 4-way across all devices, it's not true raid1, but
>> rather some sort of hybrid raid,  raid10 (or raid01) if the
>> devices are so arranged, raid1+linear if arranged that way, or
>> some form that doesn't nicely fall into a well defined raid level
>> categorization.
> 
> Well, md raid1 is always n-way. So if you use -n 3 and specify
> three devices, you'll get 3-way mirroring (3 mirrors). But I don't
> know any hardware raid that works this way. They all seem to be
> raid 1 is strictly two devices. At 4 devices it's raid10, and only
> in pairs.
> 
> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
> something like raid1 (2 copies) + linear/concat. But that
> allocation is round robin. I don't read code but based on how a 3
> disk raid1 volume grows VDI files as it's filled it looks like 1GB
> chunks are copied like this
Actually, MD RAID10 can be configured to work almost the same with an
odd number of disks, except it uses (much) smaller chunks, and it does
more intelligent striping of reads.
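
Something along these lines, for example (placeholder devices; the
near-2 layout over three disks keeps two copies of every chunk, much
like the btrfs arrangement above):

# mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
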
> 
> Disk1	Disk2	Disk3
> 134		124		235
> 679		578		689
> 
> So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a
> chunk 1; disk 2 and 3 each have a chunk 2, and so on. Total of 9GB
> of data taking up 18GB of space, 6GB on each drive. You can't do
> this with any other raid1 as far as I know. You do definitely run
> out of space on one disk first though because of uneven metadata to
> data chunk allocation.
> 
> Anyway I think we're off the rails with raid1 nomenclature as soon
> as we have 3 devices. It's probably better to call it replication,
> with an assumed default of 2 replicates unless otherwise
> specified.
> 
> There's definitely a benefit to a 3 device volume with 2
> replicates, efficiency wise. As soon as we go to four disks 2
> replicates it makes more sense to do raid10, although I haven't
> tested odd device raid10 setups so I'm not sure what happens.
> 
> 
> Chris Murphy
> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-03 16:31             ` Austin S Hemmelgarn
@ 2014-05-03 19:09               ` Chris Murphy
  2014-05-03 20:52                 ` Austin S Hemmelgarn
  2014-05-03 23:16                 ` Chris Murphy
  0 siblings, 2 replies; 28+ messages in thread
From: Chris Murphy @ 2014-05-03 19:09 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Btrfs BTRFS


On May 3, 2014, at 10:31 AM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:

> On 05/02/2014 03:21 PM, Chris Murphy wrote:
>> 
>> On May 2, 2014, at 2:23 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>>> 
>>> Something tells me btrfs replace (not device replace, simply
>>> replace) should be moved to btrfs device replace…
>> 
>> The syntax for "btrfs device" is different though; replace is like
>> balance: btrfs balance start and btrfs replace start. And you can
>> also get a status on it. We don't (yet) have options to stop,
>> start, resume, which could maybe come in handy for long rebuilds
>> and a reboot is required (?) although maybe that just gets handled
>> automatically: set it to pause, then unmount, then reboot, then
>> mount and resume.
>> 
>>> Well, I'd say two copies if it's only two devices in the raid1...
>>> would be true raid1.  But if it's say four devices in the raid1,
>>> as is certainly possible with btrfs raid1, that if it's not
>>> mirrored 4-way across all devices, it's not true raid1, but
>>> rather some sort of hybrid raid,  raid10 (or raid01) if the
>>> devices are so arranged, raid1+linear if arranged that way, or
>>> some form that doesn't nicely fall into a well defined raid level
>>> categorization.
>> 
>> Well, md raid1 is always n-way. So if you use -n 3 and specify
>> three devices, you'll get 3-way mirroring (3 mirrors). But I don't
>> know any hardware raid that works this way. They all seem to be
>> raid 1 is strictly two devices. At 4 devices it's raid10, and only
>> in pairs.
>> 
>> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
>> something like raid1 (2 copies) + linear/concat. But that
>> allocation is round robin. I don't read code but based on how a 3
>> disk raid1 volume grows VDI files as it's filled it looks like 1GB
>> chunks are copied like this
> Actually, MD RAID10 can be configured to work almost the same with an
> odd number of disks, except it uses (much) smaller chunks, and it does
> more intelligent striping of reads.

The efficiency of storage depends on the file system placed on top. Btrfs will allocate space exclusively for metadata, and it's possible much of that space either won't or can't be used. So ext4 or XFS on md probably is more efficient in that regard; but then Btrfs also has compression options so this clouds the efficiency analysis.

For striping of reads, there is a note in man 4 md about the raid10 layouts: "The 'far' arrangement can give sequential read performance equal to that of a RAID0 array, but at the cost of reduced write performance." The default layout for raid10 is near 2. I think read performance is roughly a wash with the defaults, while with the far layout md reads are better but writes are worse.
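
The layout is chosen at create time, e.g. one or the other of (placeholder devices):

# mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=4 /dev/sd[b-e]
# mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 /dev/sd[b-e]

The first is the near-2 default, the second the far-2 arrangement the man page is talking about.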

I'm not sure how Btrfs performs reads with multiple devices.

Chris Murphy


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-03 19:09               ` Chris Murphy
@ 2014-05-03 20:52                 ` Austin S Hemmelgarn
  2014-05-03 23:16                 ` Chris Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Austin S Hemmelgarn @ 2014-05-03 20:52 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 05/03/2014 03:09 PM, Chris Murphy wrote:
> 
> On May 3, 2014, at 10:31 AM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
> 
>> On 05/02/2014 03:21 PM, Chris Murphy wrote:
>>>
>>> On May 2, 2014, at 2:23 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>>>>
>>>> Something tells me btrfs replace (not device replace, simply
>>>> replace) should be moved to btrfs device replace…
>>>
>>> The syntax for "btrfs device" is different though; replace is like
>>> balance: btrfs balance start and btrfs replace start. And you can
>>> also get a status on it. We don't (yet) have options to stop,
>>> start, resume, which could maybe come in handy for long rebuilds
>>> and a reboot is required (?) although maybe that just gets handled
>>> automatically: set it to pause, then unmount, then reboot, then
>>> mount and resume.
>>>
>>>> Well, I'd say two copies if it's only two devices in the raid1...
>>>> would be true raid1.  But if it's say four devices in the raid1,
>>>> as is certainly possible with btrfs raid1, that if it's not
>>>> mirrored 4-way across all devices, it's not true raid1, but
>>>> rather some sort of hybrid raid,  raid10 (or raid01) if the
>>>> devices are so arranged, raid1+linear if arranged that way, or
>>>> some form that doesn't nicely fall into a well defined raid level
>>>> categorization.
>>>
>>> Well, md raid1 is always n-way. So if you use -n 3 and specify
>>> three devices, you'll get 3-way mirroring (3 mirrors). But I don't
>>> know any hardware raid that works this way. They all seem to be
>>> raid 1 is strictly two devices. At 4 devices it's raid10, and only
>>> in pairs.
>>>
>>> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
>>> something like raid1 (2 copies) + linear/concat. But that
>>> allocation is round robin. I don't read code but based on how a 3
>>> disk raid1 volume grows VDI files as it's filled it looks like 1GB
>>> chunks are copied like this
>> Actually, MD RAID10 can be configured to work almost the same with an
>> odd number of disks, except it uses (much) smaller chunks, and it does
>> more intelligent striping of reads.
> 
> The efficiency of storage depends on the file system placed on top. Btrfs will allocate space exclusively for metadata, and it's possible much of that space either won't or can't be used. So ext4 or XFS on md probably is more efficient in that regard; but then Btrfs also has compression options so this clouds the efficiency analysis.
> 
> For striping of reads, there is a note in man 4 md about the layout with respect to raid10: "The 'far' arrangement can give sequential read performance equal to that of a RAID0 array, but at the cost of reduced write performance." The default layout for raid10 is near 2. I think either the read performance is a wash with defaults, and md reads are better while writes are worse with the far layout.
> 
> I'm not sure how Btrfs performs reads with multiple devices.
While I haven't tested MD RAID10 specifically, I do know that when used
as a backend for mirrored striping on LVM, it does, by default, get
better read performance than BTRFS (although the difference is usually
not very significant for most use cases).

As far as how BTRFS performs reads with multiple devices, it uses the
following algorithm (at least this is my understanding of it, I may be
wrong):
1. Create a 0-indexed list of the devices that the block is stored on.
2. Take the PID of the process that issued the read() call modulo the
number of devices that the requested block is stored on, and dispatch the
read to the device with that index in the aforementioned list.
3. If checksum verification fails, then try other devices from the list
in sequential order.
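
So, to a first approximation, which copy a given reader hits is decided
by nothing more than its PID.  For a two-copy chunk that boils down to
(illustrative only):

# echo $(( $$ % 2 ))     (0 = first device in the list, 1 = second)

and a single process keeps that same answer for every read it issues,
which is why one big cp never gets spread across both copies.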

While this algorithm gets relatively good performance for many use
cases, and causes very little overhead in the read path, it is still
sub-optimal in almost all cases, and produces bad results in a few
cases, such as copying very large files, or any other case where only a
single process/thread is reading very large amounts of data.

As far as improving it goes, one option would be dispatching the read
to the least recently accessed device.  Such a strategy would not
introduce much more overhead in the read path (a few 64-bit compares),
and would allow reads to be striped across devices much more
efficiently.  To get much better than that would require tracking where
the last access to each device was, and dispatching to whichever one
was closest.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Help with space
  2014-05-03 19:09               ` Chris Murphy
  2014-05-03 20:52                 ` Austin S Hemmelgarn
@ 2014-05-03 23:16                 ` Chris Murphy
  1 sibling, 0 replies; 28+ messages in thread
From: Chris Murphy @ 2014-05-03 23:16 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Austin S Hemmelgarn


On May 3, 2014, at 1:09 PM, Chris Murphy <lists@colorremedies.com> wrote:

> 
> On May 3, 2014, at 10:31 AM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
> 
>> On 05/02/2014 03:21 PM, Chris Murphy wrote:
>>> 
>>> Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
>>> something like raid1 (2 copies) + linear/concat. But that
>>> allocation is round robin. I don't read code but based on how a 3
>>> disk raid1 volume grows VDI files as it's filled it looks like 1GB
>>> chunks are copied like this
>> Actually, MD RAID10 can be configured to work almost the same with an
>> odd number of disks, except it uses (much) smaller chunks, and it does
>> more intelligent striping of reads.
> 
> The efficiency of storage depends on the file system placed on top. Btrfs will allocate space exclusively for metadata, and it's possible much of that space either won't or can't be used. So ext4 or XFS on md probably is more efficient in that regard; but then Btrfs also has compression options so this clouds the efficiency analysis.
> 
> For striping of reads, there is a note in man 4 md about the layout with respect to raid10: "The 'far' arrangement can give sequential read performance equal to that of a RAID0 array, but at the cost of reduced write performance." The default layout for raid10 is near 2. I think either the read performance is a wash with defaults, and md reads are better while writes are worse with the far layout.
> 
> I'm not sure how Btrfs performs reads with multiple devices.


Also, for unequal sized devices, for example 12G,6G,6G, Btrfs raid1 is OK with this and uses the space efficiently, whereas md raid10 does not. First, it complains when creating, asking if I want to continue anyway. Second, it ends up with *less* usable space than if it had 3x 6GB drives.

12G,6G,6G md raid10
# mdadm -C /dev/md0 -n 3 -l raid10 --assume-clean /dev/sd[bcd]
mdadm: largest drive (/dev/sdb) exceeds size (6283264K) by more than 1%.
# mdadm -D /dev/md0 (partial)
     Array Size : 9424896 (8.99 GiB 9.65 GB)
  Used Dev Size : 6283264 (5.99 GiB 6.43 GB)

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        9.0G   33M  9.0G   1% /mnt

12G,6G,6G btrfs raid1

# mkfs.btrfs -d raid1 -m raid1 /dev/sd[bcd]
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         24G  1.3M   12G   1% /mnt


For performance workloads, this is probably a pathological configuration since it depends on disproportionate reading almost no matter what. But for those who happen to have uneven devices available, and favor space usage efficiency over performance, it's a nice capability.


Chris Murphy

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-05-03 23:16 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-27 18:19 Help with space Justin Brown
2014-02-27 19:27 ` Chris Murphy
2014-02-27 19:51   ` Chris Murphy
2014-02-27 20:49     ` otakujunction
2014-02-27 21:11       ` Chris Murphy
2014-02-28  0:12         ` Dave Chinner
2014-02-28  0:27           ` Chris Murphy
2014-02-28  4:21             ` Dave Chinner
2014-02-28  5:49               ` Chris Murphy
2014-02-28  4:34 ` Roman Mamedov
2014-02-28  7:27   ` Duncan
2014-02-28  7:37     ` Roman Mamedov
2014-02-28  7:46     ` Justin Brown
2014-05-01  1:52   ` Russell Coker
2014-05-01  5:33     ` Duncan
2014-05-02  1:48       ` Russell Coker
2014-05-02  8:23         ` Duncan
2014-05-02  9:28           ` Brendan Hide
2014-05-02 19:21           ` Chris Murphy
2014-05-02 21:08             ` Hugo Mills
2014-05-02 22:33               ` Chris Murphy
2014-05-03 16:31             ` Austin S Hemmelgarn
2014-05-03 19:09               ` Chris Murphy
2014-05-03 20:52                 ` Austin S Hemmelgarn
2014-05-03 23:16                 ` Chris Murphy
2014-02-28  6:13 ` Chris Murphy
2014-02-28  6:26   ` Chris Murphy
2014-02-28  7:39     ` Justin Brown
