* RAID1 disk upgrade method
@ 2016-01-22  3:45 Sean Greenslade
  2016-01-22  4:37 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Sean Greenslade @ 2016-01-22  3:45 UTC (permalink / raw)
  To: linux-btrfs

Hi, all. I have a box running a btrfs raid1 of two disks. One of the
disks started reallocating sectors, so I've decided to replace it
pre-emptively. And since larger disks are a bit cheaper now, I'm trading
up. The current disks are 2x 2TB, and I'm going to be putting in 2x 3TB
disks. Hopefully this should be reasonably straightforward, since the
raid is still healthy, but I wanted to ask what the best way to go about
doing this would be.

I have the ability (through shuffling other drive bays around) to mount
the 2 existing drives + one new drive all at once. So my first blush
thought would be to mount one of the new drives, partition it, then
"btrfs replace" the worse existing drive.

Another possibility is to "btrfs add" the new drive, balance, then
"btrfs device delete" the old drive. Would that make more sense if the
old drive is still (mostly) good?

Or maybe I could just create a new btrfs partition on the new device,
copy over the data, then shuffle the disks around and balance the new
single partition into raid1.


Which of these makes the most sense? Or is there something else I
haven't thought of?

System info:

[sean@rat ~]$ uname -a
Linux rat 4.3.3-3-ARCH #1 SMP PREEMPT Wed Jan 20 08:12:23 CET 2016
x86_64 GNU/Linux

[sean@rat ~]$ btrfs --version
btrfs-progs v4.3.1

All drives are spinning rust. Original raid1 was created ~Aug 2013, on
kernel 3.10.6.


Thanks,

--Sean

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-22  3:45 RAID1 disk upgrade method Sean Greenslade
@ 2016-01-22  4:37 ` Chris Murphy
  2016-01-22 10:54 ` Duncan
  2016-01-22 14:27 ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 31+ messages in thread
From: Chris Murphy @ 2016-01-22  4:37 UTC (permalink / raw)
  To: Sean Greenslade; +Cc: Btrfs BTRFS

On Thu, Jan 21, 2016 at 8:45 PM, Sean Greenslade
<sean@seangreenslade.com> wrote:

> I have the ability (through shuffling other drive bays around) to mount
> the 2 existing drives + one new drive all at once. So my first blush
> thought would be to mount one of the new drives, partition it, then
> "btrfs replace" the worse existing drive.

Yes. If there are many bad sectors you can use -r, which causes reads
to happen from the drive you're not replacing, only reading from the
"bad" drive if there's a checksum mismatch or a read error on the
"good" drive.

Just like in other threads though, it's still important to make sure
that you don't have an all-too-common misconfiguration for RAID
setups: the drive's error timeout needs to be shorter than the
kernel's. To check, run these on each drive:


# smartctl -l scterc /dev/sdX
# cat /sys/block/sdX/device/timeout

If SCT ERC is not enabled, use -l scterc,70,70 to make it 7 seconds.
That's per drive and it's not persistent through reboots.

If SCT ERC is not supported then you need to increase the SCSI command
timer setting by writing a value such as 160 to that timeout path, per
drive, also not a persistent setting.
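
For example, putting the two options together (sdX is a placeholder;
the first line sets the 7 second ERC where supported, the second is
the fallback of raising the kernel's command timer, and both have to
be reapplied after every boot):

# smartctl -l scterc,70,70 /dev/sdX
# echo 160 > /sys/block/sdX/device/timeout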



> Another possibility is to "btrfs add" the new drive, balance, then
> "btrfs device delete" the old drive. Would that make more sense if the
> old drive is still (mostly) good?

No, just use 'btrfs replace start' without the -r option if the old
drive is mostly good.

Add then delete is only necessary if the replacement is smaller in
size. So make sure the partition you use for replace is at least as
large as the device (partition) being replaced.



> Or maybe I could just create a new btrfs partition on the new device,
> copy over the data, then shuffle the disks around and balance the new
> single partition into raid1.

Oh, you mean make a new file system and use btrfs send/receive (or
rsync or cp)? That's all OK also; later you can wipe the good old
drive, add it, then do a -dconvert=raid1 -mconvert=raid1 balance to
convert to raid1.
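
As a sketch, with made-up device and mount point names (and the copy
step done however you prefer):

# mkfs.btrfs /dev/sdNEW1
# mount /dev/sdNEW1 /mnt/new
  ... copy the data over (send/receive, rsync or cp) ...
# wipefs -a /dev/sdOLD1
# btrfs device add /dev/sdOLD1 /mnt/new
# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/new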

There's an advantage if there are features not being used by the older
formatting from the 2013 btrfs-progs. I forget when skinny extents
became the default. While that can be set with btrfstune, I'm not sure
how to cause all existing data to get rewritten to use it (?)



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-22  3:45 RAID1 disk upgrade method Sean Greenslade
  2016-01-22  4:37 ` Chris Murphy
@ 2016-01-22 10:54 ` Duncan
  2016-01-23 21:41   ` Sean Greenslade
  2016-01-22 14:27 ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 31+ messages in thread
From: Duncan @ 2016-01-22 10:54 UTC (permalink / raw)
  To: linux-btrfs

Sean Greenslade posted on Thu, 21 Jan 2016 22:45:38 -0500 as excerpted:

> Hi, all. I have a box running a btrfs raid1 of two disks. One of the
> disks started reallocating sectors, so I've decided to replace it
> pre-emptively. And since larger disks are a bit cheaper now, I'm trading
> up. The current disks are 2x 2TB, and I'm going to be putting in 2x 3TB
> disks. Hopefully this should be reasonably straightforward, since the
> raid is still healthy, but I wanted to ask what the best way to go about
> doing this would be.
> 
> I have the ability (through shuffling other drive bays around) to mount
> the 2 existing drives + one new drive all at once. So my first blush
> thought would be to mount one of the new drives, partition it, then
> "btrfs replace" the worse existing drive.

I just did exactly this, not too long ago, tho in my case everything was 
exactly the same size, and SSD.

I had originally purchased three SSDs of the same type and size, with the 
intent of one of them going in my netbook, which disappeared before that 
replacement was done, so I had a spare.  The other two were installed in 
my main machine, GPT-partitioned up into about 10 (rather small, nothing 
over 50 GiB) partitions each, with a bunch of space left at the end (I 
needed ~128 GiB SSDs and purchased 256 GB aka 238 GiB SSDs, but the 
recommendation is to leave about 25% unpartitioned for use by the FTL 
anyway, and I was leaving ~47%, so...).

Other than the BIOS and EFI reserved partitions (which are both small 
enough that I make it a habit to set up both, since that makes it easier 
to put the devices in a different machine even tho only one will actually 
be used), all partitions were btrfs, with both devices partitioned 
identically and holding working and backup copies of several partitions.  
Save for /boot, which was btrfs dup mode, working copy on one device and 
backup on the other, all btrfs were raid1 for both data and metadata.

One device prematurely started reallocating sectors, however, and while 
I continued to run the btrfs raid1 on it for a while, using btrfs scrub 
to fix things up from the good device, eventually I decided it was time 
to do the switch-out.

First thing of course was setting it up with partitioning identical to 
the other two devices.  Once that was done, on all the btrfs raid1, the 
switch-out was straightforward, btrfs replace start on each.  As 
everything was SSD and the partitions were all under 50 GiB, the replaces 
only took a few minutes each, but of course I had about 8 of them to do...

The /boot partition was also easy enough, simply mkfs.btrfs the new one, 
mount it, and copy everything over as if I was doing a backup, the same 
as I routinely do from working to backup copy on most of my partitions.

Of course then I had to grub-install to the new device, so I could boot 
from it.  Again routine: after grub updates I always grub-install to each 
device, one at a time.  After the first grub-install I reboot to the grub 
prompt with the other device disconnected, to ensure it installed 
correctly and is still bootable to grub, before doing the second, and 
then again in reverse after the second.  That way, if there's any 
problem, I have either the untouched old one or the already-tested new 
one to boot from.
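
In command terms that's just something like this, run once per device 
(device names are placeholders, and this is the BIOS/GPT case; an EFI 
setup would point grub-install at the EFI system partition instead):

# grub-install /dev/sda
# grub-install /dev/sdb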

The only two caveats I'm aware of with btrfs replace are the ones CMurphy 
already brought up: size, and the functionality of the device being 
replaced.  As my existing device was still working and I had just 
finished scrubs on all btrfs, the replaces went without a hitch.

Size-wise, if the new target device/partition is larger, do the replace, 
and then double-check to see that it used the full device or partition.  
If you need to, resize the btrfs to use the full size.  (If it's smaller, 
as CMurphy says, use the add/delete method instead of replace, but you 
say it's larger in this case so...)
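
The check and resize after the replace would look something like this 
(the mountpoint and devid are placeholders):

# btrfs filesystem show /mnt
# btrfs filesystem resize <devid>:max /mnt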

If the old device is in serious trouble but still at least installed, use 
the replace option that only reads from that device if absolutely 
necessary.  If the old one doesn't work at all, procedures are slightly 
different, but that wasn't an issue for me, and by what you posted, 
shouldn't be one for you, either.

> Another possibility is to "btrfs add" the new drive, balance, then
> "btrfs device delete" the old drive. Would that make more sense if the
> old drive is still (mostly) good?

Replace is what I'd recommend in your case, as it's the most 
straightforward.  Add then delete works too, and has to be used on older 
systems where replace isn't available yet as an option, or when the 
replacement is smaller.  I think it may still be necessary for raid56 
mode too, as last I knew, replace couldn't yet handle that.  But replace 
is more straightforward and thus easier, and is recommended where it's 
an option.

However, where people are doing add then delete, there's no need to do a 
balance between them, as the device delete runs a balance as part of the 
process.  Indeed, running a balance after the add but before the delete 
simply puts more stress on the device being replaced, so it's definitely 
NOT recommended if that device isn't perfectly healthy.  It also takes 
longer, of course, since you're effectively doing two balances in a row: 
one as a balance, one as a device delete.

> Or maybe I could just create a new btrfs partition on the new device,
> copy over the data, then shuffle the disks around and balance the new
> single partition into raid1.

That should work just fine as well, but it's less straightforward, and in 
theory at least a bit more risky while the device being replaced is still 
mostly working: the new btrfs is single-device, and likely single-data as 
well, so you don't have the fallback to a second copy that you get with 
raid1 if the one copy is bad.  Of course, if you're getting bad copies on 
a new device something's already wrong, which is why I said in theory, 
but there you have it.

The one reason you might still want to do it that way is as CMurphy said, 
if the old btrfs was sufficiently old that it was missing features you 
wanted to enable on the new one.

Actually, here, most of my btrfs are not only raid1, but also exist as 
two filesystem copies, the working and backup btrfs on different 
partitions of the same two raid1 devices, with that as my primary backup 
(of course I have a secondary backup on other devices, spinning rust in 
my case while the main devices are SSD).  So I routinely mkfs.btrfs the 
backup copy and copy everything over from the working copy once again, 
thus both updating the backup and getting the benefit of any new btrfs 
features enabled by the fresh mkfs.btrfs.  Of course a backup isn't 
complete until it's tested, so I routinely reboot and mount the fresh 
backup copy as that test.  And while I'm booted to that backup, once I'm 
very sure it's good, it's trivially easy to reverse the process: do a 
fresh mkfs.btrfs of the normal working copy and copy everything from the 
backup I'm running on back to it.  That both takes advantage of the new 
features on the new working copy, and ensures no weird issues due to 
some long-fixed and forgotten, but still lurking, bug waiting to come 
out and bite me on a working copy that's been around for years.

For those without the space luxury of identically partitioned duplicate 
working and backup raid1 copies of the same data (which makes the above 
refresh of the working copy to a brand-new btrfs a routine, optional 
reverse-run of the regular backup cycle), the above "separate second 
filesystem" device upgrade method does have the advantage of starting 
with a fresh btrfs with all the newer features, which the btrfs replace 
and btrfs device add/delete methods on an existing filesystem don't.

> Which of these makes the most sense? Or is there something else I
> haven't thought of?

In general, I'd recommend the replace as the most straightforward, unless 
your existing filesystem is old enough that it doesn't have some of the 
newer features enabled, and you want to take the opportunity of the 
switch to enable them, in which case the copy to new filesystem option 
does allow you to do that.

> System info:
> 
> [sean@rat ~]$ uname -a
> Linux rat 4.3.3-3-ARCH #1 SMP PREEMPT Wed Jan 20
> 08:12:23 CET 2016 x86_64 GNU/Linux
> 
> [sean@rat ~]$ btrfs --version
> btrfs-progs v4.3.1
> 
> All drives are spinning rust. Original raid1 was created ~Aug 2013, on
> kernel 3.10.6.

Thank you.  A lot of posts don't include that information, but it's nice 
to have, to be certain you actually have a new enough kernel and 
userspace for a working btrfs replace command, etc. =:^)

And since you do have the raid1 creation kernel info there too, I can 
tell you that yes, a number of filesystem features are now default that 
weren't, back on kernel 3.10, including I believe 16k node size (the 
default back then was 4k, tho 16k was an available option, just not the 
default).  I'm quite sure that was before skinny metadata by default, as 
well.  Whether the newer features are worth the additional hassle of 
doing the new mkfs.btrfs and copy, as opposed to the more straightforward 
btrfs replace, is up to you, but yes, the defaults are slightly different 
now, so you have that additional information to consider when choosing 
your upgrade method. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-22  3:45 RAID1 disk upgrade method Sean Greenslade
  2016-01-22  4:37 ` Chris Murphy
  2016-01-22 10:54 ` Duncan
@ 2016-01-22 14:27 ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-22 14:27 UTC (permalink / raw)
  To: Sean Greenslade, linux-btrfs

On 2016-01-21 22:45, Sean Greenslade wrote:
> Hi, all. I have a box running a btrfs raid1 of two disks. One of the
> disks started reallocating sectors, so I've decided to replace it
> pre-emptively. And since larger disks are a bit cheaper now, I'm trading
> up. The current disks are 2x 2TB, and I'm going to be putting in 2x 3TB
> disks. Hopefully this should be reasonably straightforward, since the
> raid is still healthy, but I wanted to ask what the best way to go about
> doing this would be.
>
> I have the ability (through shuffling other drive bays around) to mount
> the 2 existing drives + one new drive all at once. So my first blush
> thought would be to mount one of the new drives, partition it, then
> "btrfs replace" the worse existing drive.
>
> Another possibility is to "btrfs add" the new drive, balance, then
> "btrfs device delete" the old drive. Would that make more sense if the
> old drive is still (mostly) good?
>
> Or maybe I could just create a new btrfs partition on the new device,
> copy over the data, then shuffle the disks around and balance the new
> single partition into raid1.
>
>
> Which of these makes the most sense? Or is there something else I
> haven't thought of?
Just to further back up what everyone else has said, 'btrfs replace' is 
the preferred way to do this, largely because it's significantly more 
efficient and puts less stress on the disks.  Using add/delete requires 
rewriting everything on the filesystem at least once, and possibly twice 
depending on how you do it, whereas replace just rewrites the half of 
each chunk that's on the disk being replaced, and updates some metadata 
on the other disk.

That said, if you have the time, it may be better for you to use 
send/receive and create a new filesystem from scratch, so that you can 
be certain you have a clean filesystem with all the new features.  If 
you do go this way, use send/receive if possible; it's much more 
efficient than rsync or cp, and can preserve almost everything (unlike 
rsync).
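
A minimal sketch of the send/receive route, with made-up subvolume and 
mount point names (send only works on read-only snapshots, hence the -r 
snapshot first):

# btrfs subvolume snapshot -r /mnt/old/data /mnt/old/data-ro
# btrfs send /mnt/old/data-ro | btrfs receive /mnt/new/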

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-22 10:54 ` Duncan
@ 2016-01-23 21:41   ` Sean Greenslade
  2016-01-24  0:03     ` Chris Murphy
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Greenslade @ 2016-01-23 21:41 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Jan 22, 2016 at 10:54:58AM +0000, Duncan wrote:
> And since you do have the raid1 creation kernel info there too, I can 
> tell you that yes, a number of filesystem features are now default that 
> weren't, back on kernel 3.10, including I believe 16k node size (the 
> default back then was 4k, tho 16k was an available option, just not the 
> default).  I'm quite sure that was before skinny metadata by default, as 
> well.  Whether the newer features are worth the additional hassle of 
> doing the new mkfs.btrfs and copy, as opposed to the more straightforward 
> btrfs replace, is up to you, but yes, the defaults are slightly different 
> now, so you have that additional information to consider when choosing 
> your upgrade method. =:^)

Thanks Duncan, Austin, and Chris. I think my plan is going to be to do
btrfs replace on the disks, and try to enable skinny extents with
btrfstune. I'm not really concerned about the node size, as from what I
can tell it's just a slight performance bump.

Question about btrfstune: It seems to only operate on unmounted
partitions. If I want to enable skinny extents on my raid1, do I need to
run btrfstune on both drives, or just one? And I'm assuming it will just
apply to newly-allocated extents, so I should enable it before I start
the replace, correct?

--Sean

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-23 21:41   ` Sean Greenslade
@ 2016-01-24  0:03     ` Chris Murphy
  2016-01-27 22:45       ` Sean Greenslade
  0 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2016-01-24  0:03 UTC (permalink / raw)
  To: Sean Greenslade; +Cc: Btrfs BTRFS

On Sat, Jan 23, 2016 at 2:41 PM, Sean Greenslade
<sean@seangreenslade.com> wrote:
> On Fri, Jan 22, 2016 at 10:54:58AM +0000, Duncan wrote:
>> And since you do have the raid1 creation kernel info there too, I can
>> tell you that yes, a number of filesystem features are now default that
>> weren't, back on kernel 3.10, including I believe 16k node size (the
>> default back then was 4k, tho 16k was an available option, just not the
>> default).  I'm quite sure that was before skinny metadata by default, as
>> well.  Whether the newer features are worth the additional hassle of
>> doing the new mkfs.btrfs and copy, as opposed to the more straightforward
>> btrfs replace, is up to you, but yes, the defaults are slightly different
>> now, so you have that additional information to consider when choosing
>> your upgrade method. =:^)
>
> Thanks Duncan, Austin, and Chris. I think my plan is going to be to do
> btrfs replace on the disks, and try to enable skinny extents with
> btrfstune. I'm not really concerned about the node size, as from what I
> can tell it's just a slight performance bump.
>
> Question about btrfstune: It seems to only operate on unmounted
> partitions. If I want to enable skinny extents on my raid1, do I need to
> run btrfstune on both drives, or just one?

Just one; it would be applied to the whole fs.
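
If I remember the option right, it would be something along these lines 
(run unmounted, against any one member device; the -x flag is from 
memory, so double check the btrfstune man page first; device and mount 
point are placeholders):

# umount /mnt
# btrfstune -x /dev/sdb1
# mount /dev/sdb1 /mnt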

> And I'm assuming it will just
> apply to newly-allocated extents, so I should enable it before I start
> the replace, correct?

I do not expect the replace to change the extent format. The replace
(or add then delete) operation isn't allocating new extents, it's just
copying them as-is in the chunks that contain them. So I don't think
it matters when you do it.

Just make sure you have backups before, during, and after all of this,
which should be the case no matter Btrfs or not. Mixed extent types
should work fine, but for all anyone knows you could get into an edge
case at a later time. Looks like skinny extents became the default in
btrfs-progs 3.18.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-24  0:03     ` Chris Murphy
@ 2016-01-27 22:45       ` Sean Greenslade
  2016-01-27 23:55         ` Sean Greenslade
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Greenslade @ 2016-01-27 22:45 UTC (permalink / raw)
  To: Btrfs BTRFS

On Sat, Jan 23, 2016 at 05:03:15PM -0700, Chris Murphy wrote:
> Just make sure you have backups before, during, and after all of this,
> which should be the case no matter Btrfs or not. Mixed extent types
> should work fine, but for all anyone knows you could get into an edge
> case at a later time. Looks like skinny extents became the default in
> btrfs-progs 3.18.

OK, disks have arrived, and I've completed the first replace.
Interestingly enough, the replace seems to have succeeded, however btrfs
fi show doesn't seem to think so.

Dmesg:
[Wed Jan 27 13:03:09 2016] BTRFS info (device sdb1): disk space caching
is enabled
[Wed Jan 27 13:03:09 2016] BTRFS: has skinny extents
[Wed Jan 27 13:03:09 2016] BTRFS: bdev /dev/sdb1 errs: wr 0, rd 186,
flush 0, corrupt 0, gen 0
[Wed Jan 27 13:05:23 2016]  sdc: sdc1
[Wed Jan 27 13:05:25 2016]  sdc: sdc1
[Wed Jan 27 13:08:37 2016] BTRFS: dev_replace from /dev/sdb1 (devid 2)
to /dev/sdc1 started
[Wed Jan 27 16:34:49 2016] BTRFS: dev_replace from /dev/sdb1 (devid 2)
to /dev/sdc1 finished

Btrfs fi show:
warning, device 3 is missing
warning, device 3 is missing
warning devid 3 not found already
Label: none  uuid: 490b8b7c-59c4-45dc-ac63-6a90f0966776
        Total devices 2 FS bytes used 1.45TiB
		devid    1 size 1.82TiB used 1.52TiB path /dev/sda1
		*** Some devices missing

I haven't rebooted or remounted yet, so I'm curious if this is a bug,
a normal thing that is fixed by a reboot, or what.

Still running the same versions from my original email, but here they
are again:

Linux rat 4.3.3-3-ARCH #1 SMP PREEMPT Wed Jan 20 08:12:23 CET 2016
x86_64 GNU/Linux

btrfs-progs v4.3.1


Thanks,

--Sean

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-27 22:45       ` Sean Greenslade
@ 2016-01-27 23:55         ` Sean Greenslade
  2016-01-28 12:31           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Greenslade @ 2016-01-27 23:55 UTC (permalink / raw)
  To: Btrfs BTRFS

On Wed, Jan 27, 2016 at 05:45:49PM -0500, Sean Greenslade wrote:
> OK, disks have arrived, and I've completed the first replace.
> Interestingly enough, the replace seems to have succeeded, however btrfs
> fi show doesn't seem to think so.
> 
> Dmesg:
> [Wed Jan 27 13:03:09 2016] BTRFS info (device sdb1): disk space caching
> is enabled
> [Wed Jan 27 13:03:09 2016] BTRFS: has skinny extents
> [Wed Jan 27 13:03:09 2016] BTRFS: bdev /dev/sdb1 errs: wr 0, rd 186,
> flush 0, corrupt 0, gen 0
> [Wed Jan 27 13:05:23 2016]  sdc: sdc1
> [Wed Jan 27 13:05:25 2016]  sdc: sdc1
> [Wed Jan 27 13:08:37 2016] BTRFS: dev_replace from /dev/sdb1 (devid 2)
> to /dev/sdc1 started
> [Wed Jan 27 16:34:49 2016] BTRFS: dev_replace from /dev/sdb1 (devid 2)
> to /dev/sdc1 finished
> 
> Btrfs fi show:
> warning, device 3 is missing
> warning, device 3 is missing
> warning devid 3 not found already
> Label: none  uuid: 490b8b7c-59c4-45dc-ac63-6a90f0966776
>         Total devices 2 FS bytes used 1.45TiB
> 		devid    1 size 1.82TiB used 1.52TiB path /dev/sda1
> 		*** Some devices missing
> 
> I haven't rebooted or remounted yet, so I'm curious if this is a bug,
> a normal thing that is fixed by a reboot, or what.

Got the opportunity to reboot, and things appear to be OK. Still, I
would expect replace to work without requiring a reboot, so this may
still be a bug. I'm running a scrub to verify things, and once that
completes I'll do the second replace and see if I encounter the same
problem.

--Sean


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-27 23:55         ` Sean Greenslade
@ 2016-01-28 12:31           ` Austin S. Hemmelgarn
  2016-01-28 15:37             ` Sean Greenslade
  0 siblings, 1 reply; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-28 12:31 UTC (permalink / raw)
  To: Sean Greenslade, Btrfs BTRFS

On 2016-01-27 18:55, Sean Greenslade wrote:
> On Wed, Jan 27, 2016 at 05:45:49PM -0500, Sean Greenslade wrote:
>> OK, disks have arrived, and I've completed the first replace.
>> Interestingly enough, the replace seems to have succeeded, however btrfs
>> fi show doesn't seem to think so.
>>
>> Dmesg:
>> [Wed Jan 27 13:03:09 2016] BTRFS info (device sdb1): disk space caching
>> is enabled
>> [Wed Jan 27 13:03:09 2016] BTRFS: has skinny extents
>> [Wed Jan 27 13:03:09 2016] BTRFS: bdev /dev/sdb1 errs: wr 0, rd 186,
>> flush 0, corrupt 0, gen 0
>> [Wed Jan 27 13:05:23 2016]  sdc: sdc1
>> [Wed Jan 27 13:05:25 2016]  sdc: sdc1
>> [Wed Jan 27 13:08:37 2016] BTRFS: dev_replace from /dev/sdb1 (devid 2)
>> to /dev/sdc1 started
>> [Wed Jan 27 16:34:49 2016] BTRFS: dev_replace from /dev/sdb1 (devid 2)
>> to /dev/sdc1 finished
>>
>> Btrfs fi show:
>> warning, device 3 is missing
>> warning, device 3 is missing
>> warning devid 3 not found already
>> Label: none  uuid: 490b8b7c-59c4-45dc-ac63-6a90f0966776
>>          Total devices 2 FS bytes used 1.45TiB
>> 		devid    1 size 1.82TiB used 1.52TiB path /dev/sda1
>> 		*** Some devices missing
>>
>> I haven't rebooted or remounted yet, so I'm curious if this is a bug,
>> a normal thing that is fixed by a reboot, or what.
>
> Got the opportunity to reboot, and things appear to be OK. Still, I
> would expect replace to work without requiring a reboot, so this may
> still be a bug. I'm running a scrub to verify things, and once that
> completes I'll do the second replace and see if I encounter the same
> problem.
That is unusual, it's supposed to work without needing a reboot or 
rescan, so I think you may have found a bug.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 12:31           ` Austin S. Hemmelgarn
@ 2016-01-28 15:37             ` Sean Greenslade
  2016-01-28 16:18               ` Chris Murphy
  0 siblings, 1 reply; 31+ messages in thread
From: Sean Greenslade @ 2016-01-28 15:37 UTC (permalink / raw)
  To: Btrfs BTRFS

On Thu, Jan 28, 2016 at 07:31:06AM -0500, Austin S. Hemmelgarn wrote:
> >Got the opportunity to reboot, and things appear to be OK. Still, I
> >would expect replace to work without requiring a reboot, so this may
> >still be a bug. I'm running a scrub to verify things, and once that
> >completes I'll do the second replace and see if I encounter the same
> >problem.
>
> That is unusual, it's supposed to work without needing a reboot or rescan,
> so I think you may have found a bug.

Did the second replace, and encountered a slightly different issue.
Btrfs fi show did list both new devices after the replace completed,
however the partition was no longer mounted. Trying to mount, the mount
command returned 0 but the partition was not actually mounted. I got
this in dmesg:

[Thu Jan 28 10:20:20 2016] BTRFS info (device sdd1): disk space caching
is enabled
[Thu Jan 28 10:20:20 2016] BTRFS: has skinny extents
[Thu Jan 28 10:20:20 2016] BTRFS: bdev /dev/sda1 errs: wr 0, rd 186,
flush 0, corrupt 0, gen 0

I ejected and re-inserted the HDDs, and everything was happy again. The
mount actually succeeded, with the same dmesg output. If I had to guess,
I'd say there's probably some stale state left over in the kernel after
the replace. I may try to create a test case to reproduce this later if
I have time.

Additionally, the filesystem did not expand to the larger drives after
the replace. So I ran btrfs fi resize max, and I got the following:


Label: none  uuid: 573863ec-d55e-4817-8c11-70bb8523b643
        Total devices 2 FS bytes used 1.64TiB
		devid    1 size 2.73TiB used 1.71TiB path /dev/sdb1
		devid    2 size 1.82TiB used 1.71TiB path /dev/sda1

[Thu Jan 28 10:32:36 2016] BTRFS: new size for /dev/sdb1 is 3000591912960

It only resized one of the two devices, and I'm suspicious because the
one that didn't resize is also the one that reports read errors in the
mount dmesg output.

Any ideas on what to try next? I'm kicking off a scrub for now.

--Sean


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 15:37             ` Sean Greenslade
@ 2016-01-28 16:18               ` Chris Murphy
  2016-01-28 18:47                 ` Sean Greenslade
  0 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2016-01-28 16:18 UTC (permalink / raw)
  To: Sean Greenslade; +Cc: Btrfs BTRFS

On Thu, Jan 28, 2016 at 8:37 AM, Sean Greenslade
<sean@seangreenslade.com> wrote:
> On Thu, Jan 28, 2016 at 07:31:06AM -0500, Austin S. Hemmelgarn wrote:
>> >Got the opportunity to reboot, and things appear to be OK. Still, I
>> >would expect replace to work without requiring a reboot, so this may
>> >still be a bug. I'm running a scrub to verify things, and once that
>> >completes I'll do the second replace and see if I encounter the same
>> >problem.
>>
>> That is unusual, it's supposed to work without needing a reboot or rescan,
>> so I think you may have found a bug.
>
> Did the second replace, and encountered a slightly different issue.
> Btrfs fi show did list both new devices after the replace completed,
> however the partition was no longer mounted. Trying to mount, the mount
> command returned 0 but the partition was not actually mounted. I got
> this in dmesg:
>
> [Thu Jan 28 10:20:20 2016] BTRFS info (device sdd1): disk space caching
> is enabled
> [Thu Jan 28 10:20:20 2016] BTRFS: has skinny extents
> [Thu Jan 28 10:20:20 2016] BTRFS: bdev /dev/sda1 errs: wr 0, rd 186,
> flush 0, corrupt 0, gen 0

Those read errors are a persistent counter. Use 'btrfs dev stat' to
see them for each device, and use -z to clear. I think this is in
DEV_ITEM, and it should be dev.uuid based, so the counter ought to be
with this specific device, not merely "sda1". So ... I'd look in the
journal for the time during the replace and see where those read
errors might have come from if this is supposed to be a new drive and
you're not expecting read errors already.
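
For example (mount point is a placeholder; the -z form prints the
counters and then resets them):

# btrfs device stats /mnt
# btrfs device stats -z /mnt
# journalctl -k | grep -iE 'ata|btrfs'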

Like I mentioned in my first reply to this thread, sct erc... it's
very important to get these settings right.




>
> I ejected and re-inserted the HDDs, and everything was happy again. The
> mount actually succeeded, with the same dmesg output. If I had to guess,
> I'd say there's probably some stale state left over in the kernel after
> the replace. I may try to create a test case to reproduce this later if
> I have time.
>
> Additionally, the filesystem did not expand to the larger drives after
> the replace. So I ran btrfs fi resize max, and I got the following:
>
>
> Label: none  uuid: 573863ec-d55e-4817-8c11-70bb8523b643
>         Total devices 2 FS bytes used 1.64TiB
>                 devid    1 size 2.73TiB used 1.71TiB path /dev/sdb1
>                 devid    2 size 1.82TiB used 1.71TiB path /dev/sda1
>
> [Thu Jan 28 10:32:36 2016] BTRFS: new size for /dev/sdb1 is 3000591912960
>
> It only resized one of the two devices, and I'm suspicious because the
> one that didn't resize is also the one that reports read errors in the
> mount dmesg output.
>
> Any ideas on what to try next? I'm kicking off a scrub for now.

Three things from the man page:

-       resize [<devid>:][+/-]<size>[kKmMgGtTpPeE]|[<devid>:]max <path>

-           Resize a mounted filesystem identified by path. A
particular device can be resized by specifying a devid.

-           If max is passed, the filesystem will occupy all available
space on the device respecting devid (remember, devid 1 by default).


Try:

btrfs fi resize 2:max <mountpoint>




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 16:18               ` Chris Murphy
@ 2016-01-28 18:47                 ` Sean Greenslade
  2016-01-28 19:37                   ` Austin S. Hemmelgarn
                                     ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Sean Greenslade @ 2016-01-28 18:47 UTC (permalink / raw)
  To: Btrfs BTRFS

On Thu, Jan 28, 2016 at 09:18:06AM -0700, Chris Murphy wrote:
> Those read errors are a persistent counter. Use 'btrfs dev stat' to
> see them for each device, and use -z to clear. I think this is in
> DEV_ITEM, and it should be dev.uuid based, so the counter ought to be
> with this specific device, not merely "sda1". So ... I'd look in the
> journal for the time during the replace and see where those read
> errors might have come from if this is supposed to be a new drive and
> you're not expecting read errors already.
> 
> Like I mentioned in my first reply to this thread, sct erc... it's
> very important to get these settings right.

I don't see anything that indicates read errors in my journal or dmesg,
though it's hard to tell given the rather scary-looking messages I get
whenever I eject a drive:

[Thu Jan 28 10:38:10 2016] ata6.00: exception Emask 0x10 SAct 0x8 SErr 0x280100 action 0x6 frozen
[Thu Jan 28 10:38:10 2016] ata6.00: irq_stat 0x08000000, interface fatal error
[Thu Jan 28 10:38:10 2016] ata6: SError: { UnrecovData 10B8B BadCRC }
[Thu Jan 28 10:38:10 2016] ata6.00: failed command: READ FPDMA QUEUED
[Thu Jan 28 10:38:10 2016] ata6.00: cmd 60/00:18:00:79:02/05:00:00:00:00/40 tag 3 ncq 655360 in
                                    res 40/00:18:00:79:02/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[Thu Jan 28 10:38:10 2016] ata6.00: status: { DRDY }
[Thu Jan 28 10:38:10 2016] ata6: hard resetting link
[Thu Jan 28 10:38:10 2016] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)

> Three things from the man page:
> 
> -       resize [<devid>:][+/-]<size>[kKmMgGtTpPeE]|[<devid>:]max <path>
> 
> -           Resize a mounted filesystem identified by path. A
> particular device can be resized by specifying a devid.
> 
> -           If max is passed, the filesystem will occupy all available
> space on the device respecting devid (remember, devid 1 by default).
> 
> 
> Try:
> 
> btrfs fi resize 2:max <mountpoint>

OK, I just misunderstood how that syntax worked. All seems good now.
I'll try to play around with some dummy configurations this weekend to
see if I can reproduce the post-replace mount bug.

Thanks, everyone!

--Sean


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 18:47                 ` Sean Greenslade
@ 2016-01-28 19:37                   ` Austin S. Hemmelgarn
  2016-01-28 19:46                     ` Chris Murphy
  2016-01-28 19:39                   ` Chris Murphy
  2016-02-14  0:44                   ` Sean Greenslade
  2 siblings, 1 reply; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-28 19:37 UTC (permalink / raw)
  To: Sean Greenslade, Btrfs BTRFS

On 2016-01-28 13:47, Sean Greenslade wrote:
> On Thu, Jan 28, 2016 at 09:18:06AM -0700, Chris Murphy wrote:
>> Those read errors are a persistent counter. Use 'btrfs dev stat' to
>> see them for each device, and use -z to clear. I think this is in
>> DEV_ITEM, and it should be dev.uuid based, so the counter ought to be
>> with this specific device, not merely "sda1". So ... I'd look in the
>> journal for the time during the replace and see where those read
>> errors might have come from if this is supposed to be a new drive and
>> you're not expecting read errors already.
>>
>> Like I mentioned in my first reply to this thread, sct erc... it's
>> very important to get these settings right.
>
> I don't see anything that indicates read errors in my journal or dmesg,
> though it's hard to tell given the rather scary-looking messages I get
> whenever I eject a drive:
>
> [Thu Jan 28 10:38:10 2016] ata6.00: exception Emask 0x10 SAct 0x8 SErr 0x280100 action 0x6 frozen
> [Thu Jan 28 10:38:10 2016] ata6.00: irq_stat 0x08000000, interface fatal error
> [Thu Jan 28 10:38:10 2016] ata6: SError: { UnrecovData 10B8B BadCRC }
> [Thu Jan 28 10:38:10 2016] ata6.00: failed command: READ FPDMA QUEUED
> [Thu Jan 28 10:38:10 2016] ata6.00: cmd 60/00:18:00:79:02/05:00:00:00:00/40 tag 3 ncq 655360 in
>                                      res 40/00:18:00:79:02/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
> [Thu Jan 28 10:38:10 2016] ata6.00: status: { DRDY }
> [Thu Jan 28 10:38:10 2016] ata6: hard resetting link
> [Thu Jan 28 10:38:10 2016] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
>
If by eject you mean disconnect from the system, this is exactly the 
output I would expect if you haven't done something to tell the kernel 
the disk is disappearing.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 18:47                 ` Sean Greenslade
  2016-01-28 19:37                   ` Austin S. Hemmelgarn
@ 2016-01-28 19:39                   ` Chris Murphy
  2016-01-28 22:51                     ` Duncan
  2016-02-14  0:44                   ` Sean Greenslade
  2 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2016-01-28 19:39 UTC (permalink / raw)
  To: Sean Greenslade; +Cc: Btrfs BTRFS

On Thu, Jan 28, 2016 at 11:47 AM, Sean Greenslade
<sean@seangreenslade.com> wrote:
> On Thu, Jan 28, 2016 at 09:18:06AM -0700, Chris Murphy wrote:
>> Those read errors are a persistent counter. Use 'btrfs dev stat' to
>> see them for each device, and use -z to clear. I think this is in
>> DEV_ITEM, and it should be dev.uuid based, so the counter ought to be
>> with this specific device, not merely "sda1". So ... I'd look in the
>> journal for the time during the replace and see where those read
>> errors might have come from if this is supposed to be a new drive and
>> you're not expecting read errors already.
>>
>> Like I mentioned in my first reply to this thread, sct erc... it's
>> very important to get these settings right.
>
> I don't see anything that indicates read errors in my journal or dmesg,
> though it's hard to tell given the rather scary-looking messages I get
> whenever I eject a drive:
>
> [Thu Jan 28 10:38:10 2016] ata6.00: exception Emask 0x10 SAct 0x8 SErr 0x280100 action 0x6 frozen
> [Thu Jan 28 10:38:10 2016] ata6.00: irq_stat 0x08000000, interface fatal error
> [Thu Jan 28 10:38:10 2016] ata6: SError: { UnrecovData 10B8B BadCRC }
> [Thu Jan 28 10:38:10 2016] ata6.00: failed command: READ FPDMA QUEUED
> [Thu Jan 28 10:38:10 2016] ata6.00: cmd 60/00:18:00:79:02/05:00:00:00:00/40 tag 3 ncq 655360 in
>                                     res 40/00:18:00:79:02/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
> [Thu Jan 28 10:38:10 2016] ata6.00: status: { DRDY }
> [Thu Jan 28 10:38:10 2016] ata6: hard resetting link
> [Thu Jan 28 10:38:10 2016] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
>
>> Three things from the man page:
>>
>> -       resize [<devid>:][+/-]<size>[kKmMgGtTpPeE]|[<devid>:]max <path>
>>
>> -           Resize a mounted filesystem identified by path. A
>> particular device can be resized by specifying a devid.
>>
>> -           If max is passed, the filesystem will occupy all available
>> space on the device respecting devid (remember, devid 1 by default).
>>
>>
>> Try:
>>
>> btrfs fi resize 2:max <mountpoint>
>
> OK, I just misunderstood how that syntax worked.

I've gotten tripped up by this more than once myself, and have to keep
coming back to the man page; is it max:2 or 2:max or what? Actually the
easier thing to do is skip the full man page and get a mini cheat sheet
instead, by running the command with too few arguments:

[chris@f23m ~]$  btrfs fi resize
btrfs filesystem resize: too few arguments
usage: btrfs filesystem resize
[devid:][+/-]<newsize>[kKmMgGtTpPeE]|[devid:]max <path>

While you don't get the "devid 1 is the default" clue with this, you do
get the syntax formatting.


>All seems good now.
> I'll try to play around with some dummy configurations this weekend to
> see if I can reproduce the post-replace mount bug.

There are bugs, and also missing features that seem like bugs!

I broke a raid1 a few weeks ago, I *think* because I mounted rw
degraded and for some reason a single chunk was created on, let's call
it, drive A. Later, drives A and B were together again, but drive B
never got a copy of that chunk, even after a scrub, and I didn't know
this. Still later drive A was obliterated, and now drive B would not
mount rw, only ro. So no data loss, but the raid1 was broken and I had
to recreate the volume from scratch.

Just before that, I did a replace and also found one device had single
chunks. Since I haven't reproduced it yet, I have no bug report to
file so far. But that'd definitely be a bug. So before and after each
replace, check btrfs fi df and btrfs fi usage to make sure there are
only raid1 chunks, and no single chunks for data, metadata or system.
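
e.g. something like this, with a placeholder mount point (if the grep
prints anything, or usage shows any Data,single / Metadata,single
lines, a -dconvert=raid1 -mconvert=raid1 balance is needed to get back
to all-raid1 chunks):

# btrfs fi df /mnt | grep -i single
# btrfs fi usage /mnt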

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 19:37                   ` Austin S. Hemmelgarn
@ 2016-01-28 19:46                     ` Chris Murphy
  2016-01-28 19:49                       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2016-01-28 19:46 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Sean Greenslade, Btrfs BTRFS

On Thu, Jan 28, 2016 at 12:37 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-28 13:47, Sean Greenslade wrote:
>>
>> On Thu, Jan 28, 2016 at 09:18:06AM -0700, Chris Murphy wrote:
>>>
>>> Those read errors are a persistent counter. Use 'btrfs dev stat' to
>>> see them for each device, and use -z to clear. I think this is in
>>> DEV_ITEM, and it should be dev.uuid based, so the counter ought to be
>>> with this specific device, not merely "sda1". So ... I'd look in the
>>> journal for the time during the replace and see where those read
>>> errors might have come from if this is supposed to be a new drive and
>>> you're not expecting read errors already.
>>>
>>> Like I mentioned in my first reply to this thread, sct erc... it's
>>> very important to get these settings right.
>>
>>
>> I don't see anything that indicates read errors in my journal or dmesg,
>> though it's hard to tell given the rather scary-looking messages I get
>> whenever I eject a drive:
>>
>> [Thu Jan 28 10:38:10 2016] ata6.00: exception Emask 0x10 SAct 0x8 SErr
>> 0x280100 action 0x6 frozen
>> [Thu Jan 28 10:38:10 2016] ata6.00: irq_stat 0x08000000, interface fatal
>> error
>> [Thu Jan 28 10:38:10 2016] ata6: SError: { UnrecovData 10B8B BadCRC }
>> [Thu Jan 28 10:38:10 2016] ata6.00: failed command: READ FPDMA QUEUED
>> [Thu Jan 28 10:38:10 2016] ata6.00: cmd
>> 60/00:18:00:79:02/05:00:00:00:00/40 tag 3 ncq 655360 in
>>                                      res
>> 40/00:18:00:79:02/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
>> [Thu Jan 28 10:38:10 2016] ata6.00: status: { DRDY }
>> [Thu Jan 28 10:38:10 2016] ata6: hard resetting link
>> [Thu Jan 28 10:38:10 2016] ata6: SATA link up 3.0 Gbps (SStatus 123
>> SControl 320)
>>
>> If by eject you mean disconnect from the system, this is exactly the output
> I would expect if you haven't done something to tell the kernel the disk is
> disappearing.


How about something like:

# hdparm -Y /dev/sdb
# echo 1 > /sys/block/sdb/device/delete

Then physically disconnect the drive, assuming hot-plug is supported
by all hardware?

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 19:46                     ` Chris Murphy
@ 2016-01-28 19:49                       ` Austin S. Hemmelgarn
  2016-01-28 20:24                         ` Chris Murphy
  0 siblings, 1 reply; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-28 19:49 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Sean Greenslade, Btrfs BTRFS

On 2016-01-28 14:46, Chris Murphy wrote:
> On Thu, Jan 28, 2016 at 12:37 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-01-28 13:47, Sean Greenslade wrote:
>>>
>>> On Thu, Jan 28, 2016 at 09:18:06AM -0700, Chris Murphy wrote:
>>>>
>>>> Those read errors are a persistent counter. Use 'btrfs dev stat' to
>>>> see them for each device, and use -z to clear. I think this is in
>>>> DEV_ITEM, and it should be dev.uuid based, so the counter ought to be
>>>> with this specific device, not merely "sda1". So ... I'd look in the
>>>> journal for the time during the replace and see where those read
>>>> errors might have come from if this is supposed to be a new drive and
>>>> you're not expecting read errors already.
>>>>
>>>> Like I mentioned in my first reply to this thread, sct erc... it's
>>>> very important to get these settings right.
>>>
>>>
>>> I don't see anything that indicates read errors in my journal or dmesg,
>>> though it's hard to tell given the rather scary-looking messages I get
>>> whenever I eject a drive:
>>>
>>> [Thu Jan 28 10:38:10 2016] ata6.00: exception Emask 0x10 SAct 0x8 SErr
>>> 0x280100 action 0x6 frozen
>>> [Thu Jan 28 10:38:10 2016] ata6.00: irq_stat 0x08000000, interface fatal
>>> error
>>> [Thu Jan 28 10:38:10 2016] ata6: SError: { UnrecovData 10B8B BadCRC }
>>> [Thu Jan 28 10:38:10 2016] ata6.00: failed command: READ FPDMA QUEUED
>>> [Thu Jan 28 10:38:10 2016] ata6.00: cmd
>>> 60/00:18:00:79:02/05:00:00:00:00/40 tag 3 ncq 655360 in
>>>                                       res
>>> 40/00:18:00:79:02/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
>>> [Thu Jan 28 10:38:10 2016] ata6.00: status: { DRDY }
>>> [Thu Jan 28 10:38:10 2016] ata6: hard resetting link
>>> [Thu Jan 28 10:38:10 2016] ata6: SATA link up 3.0 Gbps (SStatus 123
>>> SControl 320)
>>>
>> If by eject you mean disconnect from the system, this is exactly the output
>> I would expect if you haven't done something to tell the kernel the disk is
>> disappearing.
>
>
> How about something like:
>
> # hdparm -Y /dev/sdb
>> # echo 1 > /sys/block/sdb/device/delete
>
> Then physically disconnect the drive, assuming hot-plug is supported
> by all hardware?
>
That should safely disconnect the device, but you may still have to 
touch some of the PM related stuff in the /sys/class/ directories for 
the disk itself, and possibly do something to force it to flush the 
write cache (toggling the write cache off then back on again usually 
does this).  That said, the hdparm -Y is probably not necessary 
depending on what else you do (it technically isn't even guaranteed to 
spin down the disk anyway, and the internal design of most modern HDDs 
means that as long as you keep the drive level while you're removing 
power, you don't technically have to spin it down first).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 19:49                       ` Austin S. Hemmelgarn
@ 2016-01-28 20:24                         ` Chris Murphy
  2016-01-28 20:41                           ` Sean Greenslade
  2016-01-28 20:44                           ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 31+ messages in thread
From: Chris Murphy @ 2016-01-28 20:24 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Sean Greenslade, Btrfs BTRFS

On Thu, Jan 28, 2016 at 12:49 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-28 14:46, Chris Murphy wrote:
>>
>> On Thu, Jan 28, 2016 at 12:37 PM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>>
>>> On 2016-01-28 13:47, Sean Greenslade wrote:
>>>>
>>>>
>>>> On Thu, Jan 28, 2016 at 09:18:06AM -0700, Chris Murphy wrote:
>>>>>
>>>>>
>>>>> Those read errors are a persistent counter. Use 'btrfs dev stat' to
>>>>> see them for each device, and use -z to clear. I think this is in
>>>>> DEV_ITEM, and it should be dev.uuid based, so the counter ought to be
>>>>> with this specific device, not merely "sda1". So ... I'd look in the
>>>>> journal for the time during the replace and see where those read
>>>>> errors might have come from if this is supposed to be a new drive and
>>>>> you're not expecting read errors already.
>>>>>
>>>>> Like I mentioned in my first reply to this thread, sct erc... it's
>>>>> very important to get these settings right.
>>>>
>>>>
>>>>
>>>> I don't see anything that indicates read errors in my journal or dmesg,
>>>> though it's hard to tell given the rather scary-looking messages I get
>>>> whenever I eject a drive:
>>>>
>>>> [Thu Jan 28 10:38:10 2016] ata6.00: exception Emask 0x10 SAct 0x8 SErr
>>>> 0x280100 action 0x6 frozen
>>>> [Thu Jan 28 10:38:10 2016] ata6.00: irq_stat 0x08000000, interface fatal
>>>> error
>>>> [Thu Jan 28 10:38:10 2016] ata6: SError: { UnrecovData 10B8B BadCRC }
>>>> [Thu Jan 28 10:38:10 2016] ata6.00: failed command: READ FPDMA QUEUED
>>>> [Thu Jan 28 10:38:10 2016] ata6.00: cmd
>>>> 60/00:18:00:79:02/05:00:00:00:00/40 tag 3 ncq 655360 in
>>>>                                       res
>>>> 40/00:18:00:79:02/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
>>>> [Thu Jan 28 10:38:10 2016] ata6.00: status: { DRDY }
>>>> [Thu Jan 28 10:38:10 2016] ata6: hard resetting link
>>>> [Thu Jan 28 10:38:10 2016] ata6: SATA link up 3.0 Gbps (SStatus 123
>>>> SControl 320)
>>>>
>>> If by eject you mean disconnect from the system, this is exactly the
>>> output
>>> I would expect if you haven't done something to tell the kernel the disk
>>> is
>>> disappearing.
>>
>>
>>
>> How about something like:
>>
>> # hdparm -Y /dev/sdb
>> # echo 1 > /sys/block/sdb/device/delete
>>
>> Then physically disconnect the drive, assuming hot-plug is supported
>> by all hardware?
>>
> That should safely disconnect the device, but you may still have to touch
> some of the PM related stuff in the /sys/class/ directories for the disk
> itself, and possibly do something to force it to flush the write cache
> (toggling the write cache off then back on again usually does this).

Interesting, I figured a umount should include telling the drive to
flush the write cache; but maybe not, if the drive or connection (e.g.
a USB enclosure) doesn't support FUA?

I wonder what the kernel sends to the device on restart/poweroff?



 That
> said, the hdparm -Y is probably not necessary depending on what else you do
> (it technically isn't even guaranteed to spin down the disk anyway, and
> internal design of most modern HDD's means that as long as you keep the
> drive level while you're removing power, you don't technically have to spin
> it down first).

If I don't, my drives make a loud clank, and the SMART attribute 192,
Power-off Retract Count, goes up by one. This never happens on a
normal power off. So some message is being sent to the drive at
restart/poweroff that's different from just pulling the drive, even if
that message isn't the same thing as whatever hdparm -Y sends.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 20:24                         ` Chris Murphy
@ 2016-01-28 20:41                           ` Sean Greenslade
  2016-01-28 20:44                           ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 31+ messages in thread
From: Sean Greenslade @ 2016-01-28 20:41 UTC (permalink / raw)
  To: Btrfs BTRFS

On Thu, Jan 28, 2016 at 01:24:07PM -0700, Chris Murphy wrote:
> >> How about something like:
> >>
> >> # hdparm -Y /dev/sdb
> >> # echo 1 > /sys/block/sdb/device/delete
> >>
> >> Then physically disconnect the drive, assuming hot-plug is supported
> >> by all hardware?
> >>
> > That should safely disconnect the device, but you may still have to touch
> > some of the PM related stuff in the /sys/class/ directories for the disk
> > itself, and possibly do something to force it to flush the write cache
> > (toggling the write cache off then back on again usually does this).
> 
> Interesting, I figured a umount should include telling the drive to
> flush the write cache; but maybe not, if the drive or connection (i.e.
> USB enclosure) doesn't support FUA?
> 
> I wonder what the kernel sends to the device on restart/poweroff?

Yes, I did neglect to mention that. Wherever I said eject, I meant "send
1 to /sys delete, wait for drive to spin down, then remove." I also
unmount all partitions that are on that disk.

It's not an enclosure, it's hot-swap sata ports on the motherboard.

--Sean


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 20:24                         ` Chris Murphy
  2016-01-28 20:41                           ` Sean Greenslade
@ 2016-01-28 20:44                           ` Austin S. Hemmelgarn
  2016-01-28 23:01                             ` Chris Murphy
  1 sibling, 1 reply; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-28 20:44 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Sean Greenslade, Btrfs BTRFS

On 2016-01-28 15:24, Chris Murphy wrote:
> On Thu, Jan 28, 2016 at 12:49 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-01-28 14:46, Chris Murphy wrote:
>>>
>>> On Thu, Jan 28, 2016 at 12:37 PM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>>
>>>> On 2016-01-28 13:47, Sean Greenslade wrote:
>>>>>
>>>>>
>>>>> On Thu, Jan 28, 2016 at 09:18:06AM -0700, Chris Murphy wrote:
>>>>>>
>>>>>>
>>>>>> Those read errors are a persistent counter. Use 'btrfs dev stat' to
>>>>>> see them for each device, and use -z to clear. I think this is in
>>>>>> DEV_ITEM, and it should be dev.uuid based, so the counter ought to be
>>>>>> with this specific device, not merely "sda1". So ... I'd look in the
>>>>>> journal for the time during the replace and see where those read
>>>>>> errors might have come from if this is supposed to be a new drive and
>>>>>> you're not expecting read errors already.
>>>>>>
>>>>>> Like I mentioned in my first reply to this thread, sct erc... it's
>>>>>> very important to get these settings right.
>>>>>
>>>>>
>>>>>
>>>>> I don't see anything that indicates read errors in my journal or dmesg,
>>>>> though it's hard to tell given the rather scary-looking messages I get
>>>>> whenever I eject a drive:
>>>>>
>>>>> [Thu Jan 28 10:38:10 2016] ata6.00: exception Emask 0x10 SAct 0x8 SErr
>>>>> 0x280100 action 0x6 frozen
>>>>> [Thu Jan 28 10:38:10 2016] ata6.00: irq_stat 0x08000000, interface fatal
>>>>> error
>>>>> [Thu Jan 28 10:38:10 2016] ata6: SError: { UnrecovData 10B8B BadCRC }
>>>>> [Thu Jan 28 10:38:10 2016] ata6.00: failed command: READ FPDMA QUEUED
>>>>> [Thu Jan 28 10:38:10 2016] ata6.00: cmd
>>>>> 60/00:18:00:79:02/05:00:00:00:00/40 tag 3 ncq 655360 in
>>>>>                                        res
>>>>> 40/00:18:00:79:02/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
>>>>> [Thu Jan 28 10:38:10 2016] ata6.00: status: { DRDY }
>>>>> [Thu Jan 28 10:38:10 2016] ata6: hard resetting link
>>>>> [Thu Jan 28 10:38:10 2016] ata6: SATA link up 3.0 Gbps (SStatus 123
>>>>> SControl 320)
>>>>>
>>>> If by eject you mean disconnect from the system, this is exactly the
>>>> output
>>>> I would expect if you haven't done something to tell the kernel the disk
>>>> is
>>>> disappearing.
>>>
>>>
>>>
>>> How about something like:
>>>
>>> # hdparm -Y /dev/sdb
>>> # echo 1 > /sys/block/sdb/device/delete
>>>
>>> Then physically disconnect the drive, assuming hot-plug is supported
>>> by all hardware?
>>>
>> That should safely disconnect the device, but you may still have to touch
>> some of the PM related stuff in the /sys/class/ directories for the disk
>> itself, and possibly do something to force it to flush the write cache
>> (toggling the write cache off then back on again usually does this).
>
> Interesting, I figured a umount should include telling the drive to
> flush the write cache; but maybe not, if the drive or connection (i.e.
> USB enclosure) doesn't support FUA?
It's supposed to send an FUA, but depending on the hardware, this may 
either disappear on the way to the disk, or more likely just be a no-op.  
A lot of cheap older HDDs just ignore it, and I've seen a lot of USB 
enclosures that just eat the command and don't pass anything to the 
disk, so sometimes you have to get creative to actually flush the cache.  
It's worth noting that most such disks are not safe to use BTRFS on 
anyway though, because FUA is part of what's used to force write barriers.
>
> I wonder what the kernel sends to the device on restart/poweroff?
For most SATA drives, I'm pretty certain that it doesn't do much of 
anything, although it may well tell the disk to spin down.  I'm not as 
versed on the SATA spec as I am WRT SCSI, but I am pretty certain that 
there isn't any command that is 100% guaranteed to spin down the disk.

For SCSI drives, there's a specific command to tell the device to power 
down (and a corresponding one to spin up, which is how HBA's do 
sequenced spin-up of drives) which gets issued.

For USB, it's technically the same command set as SCSI, except most USB 
enclosures don't properly translate the command to the drive.
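To make that concrete, a removal sequence for a SATA disk might look
something like this (the device name and mount point are just
placeholders, and sdparm is only needed if you're going through a
SCSI/USB path that actually passes the command along):

# umount /mnt/whatever                           # make sure nothing is still using the disk
# hdparm -W 0 /dev/sdX && hdparm -W 1 /dev/sdX   # toggle the on-disk write cache to force a flush
# hdparm -y /dev/sdX                             # ATA STANDBY IMMEDIATE (spin down); -Y puts it to sleep instead
# sdparm --command=stop /dev/sdX                 # SCSI/USB equivalent, if the bridge supports it
# echo 1 > /sys/block/sdX/device/delete          # detach the device from the kernel before pulling it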
>
>   That
>> said, the hdparm -Y is probably not nessecary depending on what else you do
>> (it technically isn't even guaranteed to spin down the disk anyway, and
>> internal design of most modern HDD's means that as long as you keep the
>> drive level while you're removing power, you don't technically have to spin
>> it down first).
>
> If I don't, my drives make a loud clank, and the smart attribute 192
> Power-off Retract Count, goes up by one. This never happens on a
> normal power off. So some message is being sent to the drive at
> restart/poweroff that's different than just pulling the drive, even if
> that message isn't the same thing as whatever hdparm -Y sends.
>
I'm not saying it's a good idea to not tell the drive to spin down, just 
that it won't damage most modern drives as long as they're kept level 
while they spin down and you don't do it all the time.

Almost every modern hard disk uses a voice-coil actuator for the heads 
which is balanced such that having no power to the coil causes the 
forces from the spinning disks to park the heads, so pulling power will 
(more than 99.9% of the time) not cause a head crash like it would on a 
lot of older servo-based drives, as long as you keep the drive level.  
The clank you hear is the end of the head armature opposite the heads 
hitting the mechanical stop that's there to prevent them from completely 
decoupling from the disk.  This gets counted in the SMART attributes 
because over extremely long times (usually tens of thousands of cycles) 
it will eventually wear out that mechanical stop and things will stop 
working, so it technically is a failure condition, but you're almost 
certain to hit some other failure condition before this becomes an issue.

The interesting thing is that some drives actually _rely_ on this 
behavior to park the heads (I've seen a lot of Seagate desktop drives 
that appear to do this, although they use a rubber stopper instead of 
metal or plastic, so it tends to last longer).
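
If you want to watch this on your own drives, something like the
following (device name assumed) shows the relevant counters before and
after a power pull:

# smartctl -A /dev/sdX | grep -Ei 'power-off_retract|load_cycle'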

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 19:39                   ` Chris Murphy
@ 2016-01-28 22:51                     ` Duncan
  0 siblings, 0 replies; 31+ messages in thread
From: Duncan @ 2016-01-28 22:51 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Thu, 28 Jan 2016 12:39:20 -0700 as excerpted:

> I've gotten tripped up by this more than once myself, and have to keep
> coming back to the man page; is it max:2 or 2:max or what? Actually the
> easier thing is to skip the man page entirely: run the command with too
> few arguments and you get a mini cheat sheet instead, so you can do
> 
> [chris@f23m ~]$ btrfs fi resize
> btrfs filesystem resize: too few arguments
> usage: btrfs filesystem resize [devid:][+/-]<newsize>[kKmMgGtTpPeE]|[devid:]max <path>
> 
> While you don't get the devid 1 is default clue with this, you get the
> syntax formatting.

When I first started with btrfs, I set up a helper script that checks for 
a command and subcommand and a reasonable number of parameters.  If it 
gets them, it passes everything on as-is.  If it doesn't, or if they don't 
look like commands/subcommands that btrfs will understand, it gives you 
the list of commands/subcommands for that level, or prints the --help 
output and asks for additional parameters.  Then it prints the assembled 
command for you and asks for a final OK before running it.

Later, I modified it to handle mkfs.btrfs as well, taking info from my 
fstab, etc., so I can recreate the backup filesystems and it'll set label, 
raid1, devices, btrfs feature options, etc., then again print the 
assembled mkfs.btrfs command and ask for a final OK before running it.

That way I don't have to keep track of all the parameter details for the 
various commands and subcommands.  If I happen to remember them I can add 
them to the initial command line.  If not, I get prompted for them as 
appropriate. =:^)  That, or creating a bunch of separate scriptlet stubs 
or aliases that take care of most of the details (the solution I used for 
emerge, the default Gentoo package manager), are the two ways I've found 
to deal with these all-in-one commands, which otherwise tend to be way 
too complex for me to work with without constantly referencing the 
manpage as I try to enter the command.
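
For illustration, a minimal sketch of just the confirmation part (not my 
actual script, which also does the mkfs.btrfs and fstab handling 
mentioned above):

#!/bin/sh
# Sketch of the confirmation wrapper: show what will run, ask, then run it.
if [ $# -lt 2 ]; then
    btrfs "$@" 2>&1 | head -n 20    # let btrfs print its usage / cheat sheet
    printf 'Additional arguments: '
    read -r extra
    set -- "$@" $extra              # word splitting of $extra is intentional
fi
printf 'About to run: btrfs %s\nOK? [y/N] ' "$*"
read -r answer
[ "$answer" = y ] && exec btrfs "$@"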


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 20:44                           ` Austin S. Hemmelgarn
@ 2016-01-28 23:01                             ` Chris Murphy
  2016-01-29 12:14                               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 31+ messages in thread
From: Chris Murphy @ 2016-01-28 23:01 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Sean Greenslade, Btrfs BTRFS

On Thu, Jan 28, 2016 at 1:44 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
>> Interesting, I figured a umount should include telling the drive to
>> flush the write cache; but maybe not, if the drive or connection (i.e.
>> USB enclosure) doesn't support FUA?
>
> It's supposed to send an FUA, but depending on the hardware, this may either
> disappear on the way to the disk, or more likely just be a no-op.  A lot of
> cheap older HDD's just ignore it, and I've seen a lot of USB enclosures that
> just eat the command and don't pass anything to the disk, so sometimes you
> have to get creative to actually flush the cache.  It's worth noting that
> most such disks are not safe to use BTRFS on anyway though, because FUA is
> part of what's used to force write barriers.

Err. Really?

[    0.833452] scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD
840  DB6Q PQ: 0 ANSI: 5
[    0.835810] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES)
filtered out
[    0.835827] ata3.00: configured for UDMA/100
[    0.838010] usb 1-1: new high-speed USB device number 2 using ehci-pci
[    0.839785] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    0.839810] sd 0:0:0:0: [sda] 488397168 512-byte logical blocks:
(250 GB/233 GiB)
[    0.840381] sd 0:0:0:0: [sda] Write Protect is off
[    0.840393] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    0.840634] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA

This is not a cheap or old HDD. It's not in an enclosure. I get the
same message for a new Toshiba 1TiB drive I just stuck in a new Intel
NUC. So now what?


>> If I don't, my drives make a loud clank, and the smart attribute 192
>> Power-off Retract Count, goes up by one. This never happens on a
>> normal power off. So some message is being sent to the drive at
>> restart/poweroff that's different than just pulling the drive, even if
>> that message isn't the same thing as whatever hdparm -Y sends.
>>
> I'm not saying it's a good idea to not tell the drive to spin down, just
> that it won't damage most modern drives as long as they're kept level while
> they spin down and you don't do it all the time.

Gotcha.


>
> Almost every modern hard disk uses a voice-coil actuator for the heads which
> gets balanced such that having no power to the coil causes the forces from
> the spinning disks to park the heads, so pulling power will (more than 99.9%
> of the time) not cause a head cash like a lot of older servo based drives as
> long as you keep the drive level.  The clank you hear is the end of the head
> armature opposite the heads hitting the mechanical stop that's present to
> prevent them from completely decoupling from the disk.  This gets accounted
> in SMART attributes because over extremely long times (usually tens
> thousands of cycles), this will eventually wear out that mechanical stop,
> and things will stop working, so it technically is a failure condition, but
> you're almost certain to hit some other failure condition before this
> becomes an issue.

OK.

>
> The interesting thing is that some drives actually _rely_ on this behavior
> to park the heads (I've seen a lot of Seagate desktop drives that appear to
> do this, although they use a rubber stopper instead of metal or plastic, so
> it tends to last longer).

Cute.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 23:01                             ` Chris Murphy
@ 2016-01-29 12:14                               ` Austin S. Hemmelgarn
  2016-01-29 20:27                                 ` Henk Slager
                                                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-29 12:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Sean Greenslade, Btrfs BTRFS

On 2016-01-28 18:01, Chris Murphy wrote:
> On Thu, Jan 28, 2016 at 1:44 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>>> Interesting, I figured a umount should include telling the drive to
>>> flush the write cache; but maybe not, if the drive or connection (i.e.
>>> USB enclosure) doesn't support FUA?
>>
>> It's supposed to send an FUA, but depending on the hardware, this may either
>> disappear on the way to the disk, or more likely just be a no-op.  A lot of
>> cheap older HDD's just ignore it, and I've seen a lot of USB enclosures that
>> just eat the command and don't pass anything to the disk, so sometimes you
>> have to get creative to actually flush the cache.  It's worth noting that
>> most such disks are not safe to use BTRFS on anyway though, because FUA is
>> part of what's used to force write barriers.
>
> Err. Really?
>
> [    0.833452] scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD
> 840  DB6Q PQ: 0 ANSI: 5
> [    0.835810] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES)
> filtered out
> [    0.835827] ata3.00: configured for UDMA/100
> [    0.838010] usb 1-1: new high-speed USB device number 2 using ehci-pci
> [    0.839785] sd 0:0:0:0: Attached scsi generic sg0 type 0
> [    0.839810] sd 0:0:0:0: [sda] 488397168 512-byte logical blocks:
> (250 GB/233 GiB)
> [    0.840381] sd 0:0:0:0: [sda] Write Protect is off
> [    0.840393] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
> [    0.840634] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
> enabled, doesn't support DPO or FUA
>
> This is not a cheap or old HDD. It's not in an enclosure. I get the
> same message for a new Toshiba 1TiB drive I just stuck in a new Intel
> NUC. So now what?
Well, depending on how the kernel talks to the device, there are ways 
around this, but most of them are slow (like waiting for the write cache 
to drain).  Just like with SCT ERC, most drives marketed for 'desktop' 
usage don't actually support FUA, but they report that fact correctly, so 
the kernel can often work around it.  Most of the older drives that have 
issues actually report that they support it, but just treat it like a 
no-op.  Last I checked, Seagate's 'NAS' drives and whatever they've 
re-branded their other enterprise line as, as well as WD's 'Red' drives, 
support both SCT ERC and FUA, but I don't know about any other brands 
(most of the Hitachi, Toshiba, and Samsung drives I've seen do not 
support FUA).  This is in fact part of the reason I'm saving up to get 
good NAS-rated drives for my home server, because those almost always 
support both SCT ERC and FUA.

That said, you may want to test the performance difference with the 
write cache disabled; depending on how the kernel is trying to emulate 
write barriers, it may actually speed things up.
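
If you want to check a specific drive, something along these lines 
should work (device name assumed; the FUA write commands show up in 
hdparm's feature list when the drive claims to support them):

# dmesg | grep -i fua                 # what the kernel decided at probe time
# hdparm -I /dev/sdX | grep -i fua    # look for WRITE DMA FUA EXT in the feature list
# hdparm -W 0 /dev/sdX                # disable the on-disk write cache (not persistent)
# hdparm -W 1 /dev/sdX                # re-enable it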
>
>
>>> If I don't, my drives make a loud clank, and the smart attribute 192
>>> Power-off Retract Count, goes up by one. This never happens on a
>>> normal power off. So some message is being sent to the drive at
>>> restart/poweroff that's different than just pulling the drive, even if
>>> that message isn't the same thing as whatever hdparm -Y sends.
>>>
>> I'm not saying it's a good idea to not tell the drive to spin down, just
>> that it won't damage most modern drives as long as they're kept level while
>> they spin down and you don't do it all the time.
>
> Gotcha.
>
>
>>
>> Almost every modern hard disk uses a voice-coil actuator for the heads which
>> gets balanced such that having no power to the coil causes the forces from
>> the spinning disks to park the heads, so pulling power will (more than 99.9%
>> of the time) not cause a head cash like a lot of older servo based drives as
>> long as you keep the drive level.  The clank you hear is the end of the head
>> armature opposite the heads hitting the mechanical stop that's present to
>> prevent them from completely decoupling from the disk.  This gets accounted
>> in SMART attributes because over extremely long times (usually tens
>> thousands of cycles), this will eventually wear out that mechanical stop,
>> and things will stop working, so it technically is a failure condition, but
>> you're almost certain to hit some other failure condition before this
>> becomes an issue.
>
> OK.
>
>>
>> The interesting thing is that some drives actually _rely_ on this behavior
>> to park the heads (I've seen a lot of Seagate desktop drives that appear to
>> do this, although they use a rubber stopper instead of metal or plastic, so
>> it tends to last longer).
>
> Cute.
>
Yeah, I've used a lot of Seagate drives that appear to do this, and 
they've always failed in some other way before this mechanism wore out. 
It is kind of nice, though, that it gives you clearly audible 
confirmation that the drive has spun down.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-29 12:14                               ` Austin S. Hemmelgarn
@ 2016-01-29 20:27                                 ` Henk Slager
  2016-01-29 20:40                                   ` Austin S. Hemmelgarn
  2016-01-29 20:41                                 ` Chris Murphy
  2016-01-30 14:50                                 ` Patrik Lundquist
  2 siblings, 1 reply; 31+ messages in thread
From: Henk Slager @ 2016-01-29 20:27 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Jan 29, 2016 at 1:14 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-28 18:01, Chris Murphy wrote:
>>
>> On Thu, Jan 28, 2016 at 1:44 PM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>>>
>>>> Interesting, I figured a umount should include telling the drive to
>>>> flush the write cache; but maybe not, if the drive or connection (i.e.
>>>> USB enclosure) doesn't support FUA?
>>>
>>>
>>> It's supposed to send an FUA, but depending on the hardware, this may
>>> either
>>> disappear on the way to the disk, or more likely just be a no-op.  A lot
>>> of
>>> cheap older HDD's just ignore it, and I've seen a lot of USB enclosures
>>> that
>>> just eat the command and don't pass anything to the disk, so sometimes
>>> you
>>> have to get creative to actually flush the cache.  It's worth noting that
>>> most such disks are not safe to use BTRFS on anyway though, because FUA
>>> is
>>> part of what's used to force write barriers.
>>
>>
>> Err. Really?
>>
>> [    0.833452] scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD
>> 840  DB6Q PQ: 0 ANSI: 5
>> [    0.835810] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES)
>> filtered out
>> [    0.835827] ata3.00: configured for UDMA/100
>> [    0.838010] usb 1-1: new high-speed USB device number 2 using ehci-pci
>> [    0.839785] sd 0:0:0:0: Attached scsi generic sg0 type 0
>> [    0.839810] sd 0:0:0:0: [sda] 488397168 512-byte logical blocks:
>> (250 GB/233 GiB)
>> [    0.840381] sd 0:0:0:0: [sda] Write Protect is off
>> [    0.840393] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
>> [    0.840634] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
>> enabled, doesn't support DPO or FUA
>>
>> This is not a cheap or old HDD. It's not in an enclosure. I get the
>> same message for a new Toshiba 1TiB drive I just stuck in a new Intel
>> NUC. So now what?
>
> Well, depending on how the kernel talks to the device, there are ways around
> this, but most of them are slow (like waiting for the write cache to drain).
> Just like SCT ERC, most drives marketed for 'desktop' usage don't actually
> support FUA, but they report this fact correctly, so the kernel can often
> work around it.  Most of the older drives that have issues actually report
> that they support it, but just treat it like a no-op.  Last I checked,
> Seagate's 'NAS' drives and whatever they've re-branded their other
> enterprise line as, as well as WD's 'Red' drives support both SCT ERC and
> FUA, but I don't know about any other brands (most of the Hitachi, Toshiba,
> and Samsung drives I've seen do not support FUA).  This is in-fact part of
> the reason I'm saving up to get good NAS rated drives for my home server,
> because those almost always support both SCT ERC and FUA.

[    0.895207] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
SCT ERC is supported though.
This is a 4TB (64MB buffer) WD40EFRX-68WT0N0, firmware 82.00A82,
sold as a 'NAS' drive.

How long do you think data will stay dirty in the drive's write buffer
(average/min/max)?

Another thing I noticed is that with a Seagate 8TB SMR drive (no
FUA), the drive may keep doing internal (re)writes between zones for a
considerable time after an OS-level 'sync' has finished (you can hear
the head movements even though no I/O is reported at the OS / SATA
level). I don't think it is just committing the dirty parts of its
128MB buffer; that should not take so long. Since noticing this, I am
not so sure how quickly I can shut down and power off the system and
drive after e.g. btrfs receive has finished. Maybe the rewriting can be
interrupted and restarted without data corruption; I hope it can, but I
am just guessing.

> That said, you may want to test the performance difference with the write
> cache disabled, depending on how the kernel is trying to emulate write
> barriers, it may actually speed things up.
>
>>
>>
>>>> If I don't, my drives make a loud clank, and the smart attribute 192
>>>> Power-off Retract Count, goes up by one. This never happens on a
>>>> normal power off. So some message is being sent to the drive at
>>>> restart/poweroff that's different than just pulling the drive, even if
>>>> that message isn't the same thing as whatever hdparm -Y sends.
>>>>
>>> I'm not saying it's a good idea to not tell the drive to spin down, just
>>> that it won't damage most modern drives as long as they're kept level
>>> while
>>> they spin down and you don't do it all the time.
>>
>>
>> Gotcha.
>>
>>
>>>
>>> Almost every modern hard disk uses a voice-coil actuator for the heads
>>> which
>>> gets balanced such that having no power to the coil causes the forces
>>> from
>>> the spinning disks to park the heads, so pulling power will (more than
>>> 99.9%
>>> of the time) not cause a head cash like a lot of older servo based drives
>>> as
>>> long as you keep the drive level.  The clank you hear is the end of the
>>> head
>>> armature opposite the heads hitting the mechanical stop that's present to
>>> prevent them from completely decoupling from the disk.  This gets
>>> accounted
>>> in SMART attributes because over extremely long times (usually tens
>>> thousands of cycles), this will eventually wear out that mechanical stop,
>>> and things will stop working, so it technically is a failure condition,
>>> but
>>> you're almost certain to hit some other failure condition before this
>>> becomes an issue.
>>
>>
>> OK.
>>
>>>
>>> The interesting thing is that some drives actually _rely_ on this
>>> behavior
>>> to park the heads (I've seen a lot of Seagate desktop drives that appear
>>> to
>>> do this, although they use a rubber stopper instead of metal or plastic,
>>> so
>>> it tends to last longer).
>>
>>
>> Cute.
>>
> Yeah, I've used a lot of Seagate drives that appear to do this, and they've
> always failed in some way other than this failing.  It is kind of nice
> though that it means you get clearly audible confirmation that the drive has
> spun down.
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-29 20:27                                 ` Henk Slager
@ 2016-01-29 20:40                                   ` Austin S. Hemmelgarn
  2016-01-29 22:06                                     ` Henk Slager
  0 siblings, 1 reply; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2016-01-29 20:40 UTC (permalink / raw)
  To: Henk Slager, Btrfs BTRFS

On 2016-01-29 15:27, Henk Slager wrote:
> On Fri, Jan 29, 2016 at 1:14 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-01-28 18:01, Chris Murphy wrote:
>>>
>>> On Thu, Jan 28, 2016 at 1:44 PM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>>>
>>>>> Interesting, I figured a umount should include telling the drive to
>>>>> flush the write cache; but maybe not, if the drive or connection (i.e.
>>>>> USB enclosure) doesn't support FUA?
>>>>
>>>>
>>>> It's supposed to send an FUA, but depending on the hardware, this may
>>>> either
>>>> disappear on the way to the disk, or more likely just be a no-op.  A lot
>>>> of
>>>> cheap older HDD's just ignore it, and I've seen a lot of USB enclosures
>>>> that
>>>> just eat the command and don't pass anything to the disk, so sometimes
>>>> you
>>>> have to get creative to actually flush the cache.  It's worth noting that
>>>> most such disks are not safe to use BTRFS on anyway though, because FUA
>>>> is
>>>> part of what's used to force write barriers.
>>>
>>>
>>> Err. Really?
>>>
>>> [    0.833452] scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD
>>> 840  DB6Q PQ: 0 ANSI: 5
>>> [    0.835810] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES)
>>> filtered out
>>> [    0.835827] ata3.00: configured for UDMA/100
>>> [    0.838010] usb 1-1: new high-speed USB device number 2 using ehci-pci
>>> [    0.839785] sd 0:0:0:0: Attached scsi generic sg0 type 0
>>> [    0.839810] sd 0:0:0:0: [sda] 488397168 512-byte logical blocks:
>>> (250 GB/233 GiB)
>>> [    0.840381] sd 0:0:0:0: [sda] Write Protect is off
>>> [    0.840393] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
>>> [    0.840634] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
>>> enabled, doesn't support DPO or FUA
>>>
>>> This is not a cheap or old HDD. It's not in an enclosure. I get the
>>> same message for a new Toshiba 1TiB drive I just stuck in a new Intel
>>> NUC. So now what?
>>
>> Well, depending on how the kernel talks to the device, there are ways around
>> this, but most of them are slow (like waiting for the write cache to drain).
>> Just like SCT ERC, most drives marketed for 'desktop' usage don't actually
>> support FUA, but they report this fact correctly, so the kernel can often
>> work around it.  Most of the older drives that have issues actually report
>> that they support it, but just treat it like a no-op.  Last I checked,
>> Seagate's 'NAS' drives and whatever they've re-branded their other
>> enterprise line as, as well as WD's 'Red' drives support both SCT ERC and
>> FUA, but I don't know about any other brands (most of the Hitachi, Toshiba,
>> and Samsung drives I've seen do not support FUA).  This is in-fact part of
>> the reason I'm saving up to get good NAS rated drives for my home server,
>> because those almost always support both SCT ERC and FUA.
>
> [    0.895207] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
> enabled, doesn't support DPO or FUA
> SCT ERC is supported though.
> This is a 4TB (64MB buffer size) WD40EFRX-68WT0N0  FirmWare 82.00A82
> and sold as 'NAS' drive.
That is at the same time troubling and not all that surprising ('SSDs 
don't implement it, so why should we?'; I hate marketing idiocy...).  I 
was apparently misinformed about WD's disks (although given the apparent 
insanity of the firmware on some of their drives, that really doesn't 
surprise me either).
>
> How long do you think data will stay dirty in the drives writebuffer
> (average/min/max)?
That depends on a huge number of factors, and I don't really have a good 
answer.  The 1TB 7200RPM single-platter Seagate drives I'm using right 
now (which have a 64MB cache) take less than 0.1 seconds for streaming 
writes, and less than 0.5 seconds on average for scattered writes, so 
it's not too bad most of the time, but it's still a performance hit, and 
I do get marginally better performance by turning off the on-disk 
write cache (I've got a very atypical workload though, so YMMV).
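
One crude way to get a feel for it on a given drive (device name
assumed) is to time an explicit cache flush right after a burst of
writes:

# time hdparm -F /dev/sdX   # -F issues a FLUSH CACHE, so this roughly measures how long the dirty cache takes to drain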
>
> Another thing I noticed, is that with a Seagate 8TB SMR drive (no
> FUA), the drive might be doing internal (re)writes between zones a
> considerable time after OS level 'sync' has finished (I think, you can
> also hear the head movements although no I/O reported on OS level /
> SATA level). I think it is then not just committing its dirty parts of
> the 128MB buffer, that should not take so long. Since then, I am not
> so sure how fast I can shutdown+switchoff the system+drive after e.g.
> btrfs receive has finished. But maybe the rewriting can be interrupted
> and restarted without data corruption, I hope it can, I am just
> guessing.
This really doesn't surprise me, and is a large part of why I will be 
avoiding SMR drives for as long as possible.  The very design means that 
unless you have a battery-backed write cache, you've got serious 
potential to lose data on unclean shutdowns.  One that is properly 
designed should have no issues with this, but proper design of anything 
these days is becoming the exception, not the rule.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-29 12:14                               ` Austin S. Hemmelgarn
  2016-01-29 20:27                                 ` Henk Slager
@ 2016-01-29 20:41                                 ` Chris Murphy
  2016-01-30 14:50                                 ` Patrik Lundquist
  2 siblings, 0 replies; 31+ messages in thread
From: Chris Murphy @ 2016-01-29 20:41 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Sean Greenslade, Btrfs BTRFS

On Fri, Jan 29, 2016 at 5:14 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>
> That said, you may want to test the performance difference with the write
> cache disabled, depending on how the kernel is trying to emulate write
> barriers, it may actually speed things up.

With all of the laptop drives I've tested, WDC Blue and Black, an HGST,
and a Toshiba, using hdparm to disable the write cache resulted in
writes becoming absolutely abysmal. Instead of ~120MB/s writes, they
went to 4MB/s. Unusable. It's so bad I'm thinking it might even be a
bug.
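
For reference, a test along these lines shows the difference (the mount
point is just an example; conv=fdatasync makes dd include the final
cache flush in its timing):

# hdparm -W 0 /dev/sdX
# dd if=/dev/zero of=/mnt/test/ddtest bs=1M count=512 conv=fdatasync
# hdparm -W 1 /dev/sdX
# dd if=/dev/zero of=/mnt/test/ddtest bs=1M count=512 conv=fdatasync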




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-29 20:40                                   ` Austin S. Hemmelgarn
@ 2016-01-29 22:06                                     ` Henk Slager
  2016-02-01 12:08                                       ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 31+ messages in thread
From: Henk Slager @ 2016-01-29 22:06 UTC (permalink / raw)
  To: Btrfs BTRFS

On Fri, Jan 29, 2016 at 9:40 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-01-29 15:27, Henk Slager wrote:
>>
>> On Fri, Jan 29, 2016 at 1:14 PM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>>
>>> On 2016-01-28 18:01, Chris Murphy wrote:
>>>>
>>>>
>>>> On Thu, Jan 28, 2016 at 1:44 PM, Austin S. Hemmelgarn
>>>> <ahferroin7@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> Interesting, I figured a umount should include telling the drive to
>>>>>> flush the write cache; but maybe not, if the drive or connection (i.e.
>>>>>> USB enclosure) doesn't support FUA?
>>>>>
>>>>>
>>>>>
>>>>> It's supposed to send an FUA, but depending on the hardware, this may
>>>>> either
>>>>> disappear on the way to the disk, or more likely just be a no-op.  A
>>>>> lot
>>>>> of
>>>>> cheap older HDD's just ignore it, and I've seen a lot of USB enclosures
>>>>> that
>>>>> just eat the command and don't pass anything to the disk, so sometimes
>>>>> you
>>>>> have to get creative to actually flush the cache.  It's worth noting
>>>>> that
>>>>> most such disks are not safe to use BTRFS on anyway though, because FUA
>>>>> is
>>>>> part of what's used to force write barriers.
>>>>
>>>>
>>>>
>>>> Err. Really?
>>>>
>>>> [    0.833452] scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD
>>>> 840  DB6Q PQ: 0 ANSI: 5
>>>> [    0.835810] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES)
>>>> filtered out
>>>> [    0.835827] ata3.00: configured for UDMA/100
>>>> [    0.838010] usb 1-1: new high-speed USB device number 2 using
>>>> ehci-pci
>>>> [    0.839785] sd 0:0:0:0: Attached scsi generic sg0 type 0
>>>> [    0.839810] sd 0:0:0:0: [sda] 488397168 512-byte logical blocks:
>>>> (250 GB/233 GiB)
>>>> [    0.840381] sd 0:0:0:0: [sda] Write Protect is off
>>>> [    0.840393] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
>>>> [    0.840634] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
>>>> enabled, doesn't support DPO or FUA
>>>>
>>>> This is not a cheap or old HDD. It's not in an enclosure. I get the
>>>> same message for a new Toshiba 1TiB drive I just stuck in a new Intel
>>>> NUC. So now what?
>>>
>>>
>>> Well, depending on how the kernel talks to the device, there are ways
>>> around
>>> this, but most of them are slow (like waiting for the write cache to
>>> drain).
>>> Just like SCT ERC, most drives marketed for 'desktop' usage don't
>>> actually
>>> support FUA, but they report this fact correctly, so the kernel can often
>>> work around it.  Most of the older drives that have issues actually
>>> report
>>> that they support it, but just treat it like a no-op.  Last I checked,
>>> Seagate's 'NAS' drives and whatever they've re-branded their other
>>> enterprise line as, as well as WD's 'Red' drives support both SCT ERC and
>>> FUA, but I don't know about any other brands (most of the Hitachi,
>>> Toshiba,
>>> and Samsung drives I've seen do not support FUA).  This is in-fact part
>>> of
>>> the reason I'm saving up to get good NAS rated drives for my home server,
>>> because those almost always support both SCT ERC and FUA.
>>
>>
>> [    0.895207] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
>> enabled, doesn't support DPO or FUA
>> SCT ERC is supported though.
>> This is a 4TB (64MB buffer size) WD40EFRX-68WT0N0  FirmWare 82.00A82
>> and sold as 'NAS' drive.
>
> That is at the same time troubling and not all that surprising (
> SSD's don't implement it so why should we?'  I hate marketing idiocy...).  I
> was apparently misinformed about WD's disks (although given the apparent
> insanity of the firmware on some of their drives, that really doesn't
> surprise me either).
>>
>>
>> How long do you think data will stay dirty in the drives writebuffer
>> (average/min/max)?
>
> That depends on a huge number of factors, and I don't really have a good
> answer.  The 1TB 7200RPM single platter Seagate drives I'm using right now
> (which have a 64MB cache) take less than 0.1 second for streaming writes,
> and less than 0.5 on average for scattered writes, so it's not too bad most
> of the time, but it's still a performance hit, and I do get marginally
> better performance by turning off the on-disk write-cache (I've got a very
> atypical workload though, so YMMV).

I think you are referring to the transfer from the PC's main RAM via
SATA into the 64MB buffer. What I am trying to estimate is the transfer
time from the 64MB buffer to the platter(s). There are indeed a huge
number of factors, and without insight into the drive's ASIC/firmware
design these are just assumptions, but anyhow I am giving it a try:

- min:  assume the complete 64MB is dirty and forms one sequential block
on an outer cylinder, with no seek needed, then 64MB / 150MB/s = ~0.5s

- max:  assume 16k scattered (all non-sequential) writes of one physical
sector (4KiB) each, 150MB/s outer-cylinder write speed, 75MB/s
inner-cylinder write speed, 4ms average seek time, no merging of writes
per head position, 1 platter side, then
 ( 4k / 150M ) * 8k = ~200ms +
 ( 4k / 75M ) * 8k  = ~400ms +
  16k * 4ms         = ~64s,
 so in total more than a minute in this very simple worst-case
model. Drive firmware can't be that inefficient, so seeks are probably
mostly mitigated, which likely puts it at around 1s or a few seconds.

This would all mean that after the default 30s commit in btrfs, the
drive's power supply must not fail for another 0.5s to a few seconds.
If there is a power loss in this timeframe the fs can get corrupted, but
AFAIU there are previous roots, generations etc. that can be used, so
the btrfs filesystem can restart without mount failures etc., just with
possibly 30 seconds plus a few seconds of data loss.

So if those calculations make sense, I conclude that I am not that
worried about the lack of FUA in normal (non-SMR) spinning drives.
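
As a quick sanity check of the arithmetic above (same assumptions:
16384 blocks of 4KiB, half at 150MB/s, half at 75MB/s, 4ms average
seek per block):

$ awk 'BEGIN { n = 64*1024/4;
               xfer = (n/2)*4096/150e6 + (n/2)*4096/75e6;
               seek = n*0.004;
               printf "transfer ~%.2fs, seeks ~%.0fs\n", xfer, seek }'

which comes out to roughly 0.7s of transfer plus ~66s of seeks, in line
with the figures above.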

>> Another thing I noticed, is that with a Seagate 8TB SMR drive (no
>> FUA), the drive might be doing internal (re)writes between zones a
>> considerable time after OS level 'sync' has finished (I think, you can
>> also hear the head movements although no I/O reported on OS level /
>> SATA level). I think it is then not just committing its dirty parts of
>> the 128MB buffer, that should not take so long. Since then, I am not
>> so sure how fast I can shutdown+switchoff the system+drive after e.g.
>> btrfs receive has finished. But maybe the rewriting can be interrupted
>> and restarted without data corruption, I hope it can, I am just
>> guessing.
>
> This really doesn't surprise me, and is a large part of why I will be
> avoiding SMR drives for a long as possible.  The very design means that
> unless you have a battery backed write-cache, you've got serious potential
> to lose data due to unclean shutdowns.  One which is properly designed
> should have no issues with this, but proper design of anything these days is
> becoming the exception, not the rule.
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-29 12:14                               ` Austin S. Hemmelgarn
  2016-01-29 20:27                                 ` Henk Slager
  2016-01-29 20:41                                 ` Chris Murphy
@ 2016-01-30 14:50                                 ` Patrik Lundquist
  2016-01-30 19:44                                   ` Chris Murphy
  2016-02-04 19:20                                   ` Patrik Lundquist
  2 siblings, 2 replies; 31+ messages in thread
From: Patrik Lundquist @ 2016-01-30 14:50 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

On 29 January 2016 at 13:14, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>
> Last I checked, Seagate's 'NAS' drives and whatever they've re-branded their other enterprise line as, as well as WD's 'Red' drives support both SCT ERC and FUA, but I don't know about any other brands (most of the Hitachi, Toshiba, and Samsung drives I've seen do not support FUA).

I don't know about WD Red Pro but my WD Reds don't support FUA.

Can I list supported commands with something like hdparm? I'm curious
about a WD Re in a LSI RAID.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-30 14:50                                 ` Patrik Lundquist
@ 2016-01-30 19:44                                   ` Chris Murphy
  2016-02-04 19:20                                   ` Patrik Lundquist
  1 sibling, 0 replies; 31+ messages in thread
From: Chris Murphy @ 2016-01-30 19:44 UTC (permalink / raw)
  To: Patrik Lundquist; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

On Sat, Jan 30, 2016 at 7:50 AM, Patrik Lundquist
<patrik.lundquist@gmail.com> wrote:
> On 29 January 2016 at 13:14, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>
>> Last I checked, Seagate's 'NAS' drives and whatever they've re-branded their other enterprise line as, as well as WD's 'Red' drives support both SCT ERC and FUA, but I don't know about any other brands (most of the Hitachi, Toshiba, and Samsung drives I've seen do not support FUA).
>
> I don't know about WD Red Pro but my WD Reds don't support FUA.
>
> Can I list supported commands with something like hdparm? I'm curious
> about a WD Re in a LSI RAID.

Blast from the past.

https://lwn.net/Articles/400541/

I kinda wonder where things are at with all of this now, especially
since the VFS changes that have happened recently. There were also
some Btrfs fsync patches a while back to improve performance. There
are so many apps asking for fsync and fdatasync now that it seems
almost overkill, and now there are optimizations to handle all of that
(sometimes unnecessary) fsyncing. I also wonder about the differences
between file systems in that respect, and whether it is possible to
better abstract such things from developers so they don't have to do
filesystem-specific stuff.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-29 22:06                                     ` Henk Slager
@ 2016-02-01 12:08                                       ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 31+ messages in thread
From: Austin S. Hemmelgarn @ 2016-02-01 12:08 UTC (permalink / raw)
  To: Henk Slager, Btrfs BTRFS

On 2016-01-29 17:06, Henk Slager wrote:
> On Fri, Jan 29, 2016 at 9:40 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-01-29 15:27, Henk Slager wrote:
>>>
>>> On Fri, Jan 29, 2016 at 1:14 PM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>>
>>>> On 2016-01-28 18:01, Chris Murphy wrote:
>>>>>
>>>>>
>>>>> On Thu, Jan 28, 2016 at 1:44 PM, Austin S. Hemmelgarn
>>>>> <ahferroin7@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Interesting, I figured a umount should include telling the drive to
>>>>>>> flush the write cache; but maybe not, if the drive or connection (i.e.
>>>>>>> USB enclosure) doesn't support FUA?
>>>>>>
>>>>>>
>>>>>>
>>>>>> It's supposed to send an FUA, but depending on the hardware, this may
>>>>>> either
>>>>>> disappear on the way to the disk, or more likely just be a no-op.  A
>>>>>> lot
>>>>>> of
>>>>>> cheap older HDD's just ignore it, and I've seen a lot of USB enclosures
>>>>>> that
>>>>>> just eat the command and don't pass anything to the disk, so sometimes
>>>>>> you
>>>>>> have to get creative to actually flush the cache.  It's worth noting
>>>>>> that
>>>>>> most such disks are not safe to use BTRFS on anyway though, because FUA
>>>>>> is
>>>>>> part of what's used to force write barriers.
>>>>>
>>>>>
>>>>>
>>>>> Err. Really?
>>>>>
>>>>> [    0.833452] scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD
>>>>> 840  DB6Q PQ: 0 ANSI: 5
>>>>> [    0.835810] ata3.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES)
>>>>> filtered out
>>>>> [    0.835827] ata3.00: configured for UDMA/100
>>>>> [    0.838010] usb 1-1: new high-speed USB device number 2 using
>>>>> ehci-pci
>>>>> [    0.839785] sd 0:0:0:0: Attached scsi generic sg0 type 0
>>>>> [    0.839810] sd 0:0:0:0: [sda] 488397168 512-byte logical blocks:
>>>>> (250 GB/233 GiB)
>>>>> [    0.840381] sd 0:0:0:0: [sda] Write Protect is off
>>>>> [    0.840393] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
>>>>> [    0.840634] sd 0:0:0:0: [sda] Write cache: enabled, read cache:
>>>>> enabled, doesn't support DPO or FUA
>>>>>
>>>>> This is not a cheap or old HDD. It's not in an enclosure. I get the
>>>>> same message for a new Toshiba 1TiB drive I just stuck in a new Intel
>>>>> NUC. So now what?
>>>>
>>>>
>>>> Well, depending on how the kernel talks to the device, there are ways
>>>> around
>>>> this, but most of them are slow (like waiting for the write cache to
>>>> drain).
>>>> Just like SCT ERC, most drives marketed for 'desktop' usage don't
>>>> actually
>>>> support FUA, but they report this fact correctly, so the kernel can often
>>>> work around it.  Most of the older drives that have issues actually
>>>> report
>>>> that they support it, but just treat it like a no-op.  Last I checked,
>>>> Seagate's 'NAS' drives and whatever they've re-branded their other
>>>> enterprise line as, as well as WD's 'Red' drives support both SCT ERC and
>>>> FUA, but I don't know about any other brands (most of the Hitachi,
>>>> Toshiba,
>>>> and Samsung drives I've seen do not support FUA).  This is in-fact part
>>>> of
>>>> the reason I'm saving up to get good NAS rated drives for my home server,
>>>> because those almost always support both SCT ERC and FUA.
>>>
>>>
>>> [    0.895207] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
>>> enabled, doesn't support DPO or FUA
>>> SCT ERC is supported though.
>>> This is a 4TB (64MB buffer size) WD40EFRX-68WT0N0  FirmWare 82.00A82
>>> and sold as 'NAS' drive.
>>
>> That is at the same time troubling and not all that surprising (
>> SSD's don't implement it so why should we?'  I hate marketing idiocy...).  I
>> was apparently misinformed about WD's disks (although given the apparent
>> insanity of the firmware on some of their drives, that really doesn't
>> surprise me either).
>>>
>>>
>>> How long do you think data will stay dirty in the drives writebuffer
>>> (average/min/max)?
>>
>> That depends on a huge number of factors, and I don't really have a good
>> answer.  The 1TB 7200RPM single platter Seagate drives I'm using right now
>> (which have a 64MB cache) take less than 0.1 second for streaming writes,
>> and less than 0.5 on average for scattered writes, so it's not too bad most
>> of the time, but it's still a performance hit, and I do get marginally
>> better performance by turning off the on-disk write-cache (I've got a very
>> atypical workload though, so YMMV).
>
> I think you refer to the transfer from PC main RAM via SATA to 64MB buffer.
> What I try to estimate is the transfer-time from 64MB buffer to the
> platter(s). Indeed a huge number of factors and without insight in the
> drives ASIC/firmware design, just assumptions, but anyhow I am giving
> it a try:
Ah, I misunderstood what you meant, sorry for the confusion.
>
> - min:  assume complete 64MB dirty and 1 sequential datablock in outer
> cyl, no seek done, then 64 / 150 = ~0.5s
>
> - max:  assume only 1 physical sector sized max scattered (all non
> sequential) datablocks, 150MB/s outer cyl write speed, 75MB/s inner
> cyl write speed, 4ms avg seektime, no merging writes per head
> position, 1 (side) platter, then
>   ( 4k / 150M ) * 8k = ~200ms +
>   ( 4k / 75M ) * 8k = ~400ms +
>    16k * 4ms = ~64s,
>   so in total more than 1 minute in this very simple and worst-case
> model. Drive firmware can't be so inefficient, so seeks are probably
> mostly mitigated, so then it is likely around 1s or a few seconds.
>
> This all would mean that after default 30s commit in btrfs, the
> drive's powersupply must not fail for 0.5s..few seconds.
> If there is powerloss in this timeframe, the fs can get corrupt, but
> AFAIU, there is previous roots, generations etc that can be used, such
> that btrfs fs can restart without mount failure etc, just possibly 30
> + few seconds dataloss.
In theory, that entirely depends on how the drive batches and possibly 
reorders writes in cache.  If things get reordered badly enough, it's 
entirely possible to end up with all of your superblocks pointing at 
invalid tree roots.  The point of FUA as used by most filesystems is to 
act as a _very_ strong write barrier (it's supposed to flush the write 
cache, so anything issued before it can't get reordered after it).
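
If you want to see whether flushes (and FUA writes, when supported)
actually make it down to a given device, blktrace can show it (device
name assumed; needs root and the blktrace package):

# btrace /dev/sdX

and look at the RWBS column: a leading F on a request is a flush (e.g.
'FWS' for an empty flush) and a trailing F would be a FUA write, so
during a btrfs commit you should see flush requests go by even on
drives that don't do FUA.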
>
> So if those calculations make sense, I am concluding that I am not
> that worried about lack of FUA in normal (non-SMR) spinning drives.
Understandable; the failure modes it's supposed to protect against are 
relatively rare, so unless you are working with data that you can't 
afford to have to restore from backup, or are using a system that 
absolutely has to come back online without administrative intervention 
after an unclean shutdown, it's usually not needed.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-30 14:50                                 ` Patrik Lundquist
  2016-01-30 19:44                                   ` Chris Murphy
@ 2016-02-04 19:20                                   ` Patrik Lundquist
  1 sibling, 0 replies; 31+ messages in thread
From: Patrik Lundquist @ 2016-02-04 19:20 UTC (permalink / raw)
  To: Btrfs BTRFS

On 30 January 2016 at 15:50, Patrik Lundquist
<patrik.lundquist@gmail.com> wrote:
> On 29 January 2016 at 13:14, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>
>> Last I checked, Seagate's 'NAS' drives and whatever they've re-branded their other enterprise line as, as well as WD's 'Red' drives support both SCT ERC and FUA, but I don't know about any other brands (most of the Hitachi, Toshiba, and Samsung drives I've seen do not support FUA).
>
> I don't know about WD Red Pro but my WD Reds don't support FUA.
>
> Can I list supported commands with something like hdparm? I'm curious
> about a WD Re in a LSI RAID.

No FUA in WD Re either.

[20312.701155] scsi 4:0:0:0: Direct-Access     ATA      WDC
WD5003ABYZ-0 1S03 PQ: 0 ANSI: 5
[20312.701453] sd 4:0:0:0: [sdb] 976773168 512-byte logical blocks:
(500 GB/465 GiB)
[20312.701454] sd 4:0:0:0: Attached scsi generic sg2 type 0
[20312.701603] sd 4:0:0:0: [sdb] Write Protect is off
[20312.701609] sd 4:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[20312.701663] sd 4:0:0:0: [sdb] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[20312.712396] sd 4:0:0:0: [sdb] Attached SCSI disk

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RAID1 disk upgrade method
  2016-01-28 18:47                 ` Sean Greenslade
  2016-01-28 19:37                   ` Austin S. Hemmelgarn
  2016-01-28 19:39                   ` Chris Murphy
@ 2016-02-14  0:44                   ` Sean Greenslade
  2 siblings, 0 replies; 31+ messages in thread
From: Sean Greenslade @ 2016-02-14  0:44 UTC (permalink / raw)
  To: Btrfs BTRFS

On Thu, Jan 28, 2016 at 01:47:36PM -0500, Sean Greenslade wrote:
> OK, I just misunderstood how that syntax worked. All seems good now.
> I'll try to play around with some dummy configurations this weekend to
> see if I can reproduce the post-replace mount bug.

So I finally got some time to play with this, and I am entirely unable
to reproduce these errors with virtual loop disks. I'm going to chalk
these errors up to transient SATA nastiness, since that's happened on
this system before. Either way, there was no data loss during this
entire operation, so besides a few extra unplanned reboots, things went
extremely well. Excellent work on btrfs, devs, and thanks to everyone
who chimed in to help me. 

--Sean


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2016-02-14  0:44 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-22  3:45 RAID1 disk upgrade method Sean Greenslade
2016-01-22  4:37 ` Chris Murphy
2016-01-22 10:54 ` Duncan
2016-01-23 21:41   ` Sean Greenslade
2016-01-24  0:03     ` Chris Murphy
2016-01-27 22:45       ` Sean Greenslade
2016-01-27 23:55         ` Sean Greenslade
2016-01-28 12:31           ` Austin S. Hemmelgarn
2016-01-28 15:37             ` Sean Greenslade
2016-01-28 16:18               ` Chris Murphy
2016-01-28 18:47                 ` Sean Greenslade
2016-01-28 19:37                   ` Austin S. Hemmelgarn
2016-01-28 19:46                     ` Chris Murphy
2016-01-28 19:49                       ` Austin S. Hemmelgarn
2016-01-28 20:24                         ` Chris Murphy
2016-01-28 20:41                           ` Sean Greenslade
2016-01-28 20:44                           ` Austin S. Hemmelgarn
2016-01-28 23:01                             ` Chris Murphy
2016-01-29 12:14                               ` Austin S. Hemmelgarn
2016-01-29 20:27                                 ` Henk Slager
2016-01-29 20:40                                   ` Austin S. Hemmelgarn
2016-01-29 22:06                                     ` Henk Slager
2016-02-01 12:08                                       ` Austin S. Hemmelgarn
2016-01-29 20:41                                 ` Chris Murphy
2016-01-30 14:50                                 ` Patrik Lundquist
2016-01-30 19:44                                   ` Chris Murphy
2016-02-04 19:20                                   ` Patrik Lundquist
2016-01-28 19:39                   ` Chris Murphy
2016-01-28 22:51                     ` Duncan
2016-02-14  0:44                   ` Sean Greenslade
2016-01-22 14:27 ` Austin S. Hemmelgarn
