* Re: write corruption due to bio cloning on raid5/6
@ 2017-07-24 20:22 Janos Toth F.
  2017-07-26 16:07 ` Liu Bo
  0 siblings, 1 reply; 8+ messages in thread
From: Janos Toth F. @ 2017-07-24 20:22 UTC (permalink / raw)
  To: Btrfs BTRFS

I accidentally ran into this problem (it's pretty silly because I
almost never run RC kernels or do dio writes but somehow I just
happened to do both at once, exactly before I read your patch notes).
I didn't initially catch any issues (I see no related messages in the
kernel log) but after seeing your patch, I started a scrub (*) and it
hung.
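
For reference, the kind of dio write and follow-up scrub I mean was
roughly the following (path, size and mount point are only
illustrative, not the actual commands I ran):

  # O_DIRECT write onto the raid5 data profile
  dd if=/dev/zero of=/mnt/data/dio-test bs=1M count=1024 oflag=direct
  # then check the whole filesystem
  btrfs scrub start -B /mnt/data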

Is there a way to fix a filesystem corrupted by this bug or does it
need to be destroyed and recreated? (It's m=s=raid10, d=raid5 with
5x4TB HDDs.) There is a partial backup (of everything really
important, the rest is not important enough to be kept in multiple
copies, hence the desire for raid5...) and everything seems to be
readable anyway (so could be saved if needed) but nuking a big fs is
never fun...

Scrub just hangs and pretty much makes the whole system hang (it
needs a power cycle to reboot). Everything else runs smoothly,
though. Btrfs check (read-only, normal-mem mode) finds no errors,
the kernel log is clean, etc.

I think I deleted all the affected dio-written test-files even before
I started scrubbing, so that doesn't seem to do the trick. Any other
ideas?


* By the way, I see raid56 scrub is still painfully slow (~30MB/s per
disk, with raw disk speeds of >100MB/s). I forgot about this issue
since I last used raid5 a few years ago.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write corruption due to bio cloning on raid5/6
  2017-07-24 20:22 write corruption due to bio cloning on raid5/6 Janos Toth F.
@ 2017-07-26 16:07 ` Liu Bo
  2017-07-27 14:14   ` Janos Toth F.
  0 siblings, 1 reply; 8+ messages in thread
From: Liu Bo @ 2017-07-26 16:07 UTC (permalink / raw)
  To: Janos Toth F.; +Cc: Btrfs BTRFS

On Mon, Jul 24, 2017 at 10:22:53PM +0200, Janos Toth F. wrote:
> I accidentally ran into this problem (it's pretty silly because I
> almost never run RC kernels or do dio writes but somehow I just
> happened to do both at once, exactly before I read your patch notes).
> I didn't initially catch any issues (I see no related messages in the
> kernel log) but after seeing your patch, I started a scrub (*) and it
> hung.
> 
> Is there a way to fix a filesystem corrupted by this bug or does it
> need to be destroyed and recreated? (It's m=s=raid10, d=raid5 with
> 5x4TB HDDs.) There is a partial backup (of everything really
> important, the rest is not important enough to be kept in multiple
> copies, hence the desire for raid5...) and everything seems to be
> readable anyway (so could be saved if needed) but nuking a big fs is
> never fun...

It should only affect the dio-written files: the bug makes btrfs
write garbage into those files, so checksum verification fails when
reading them. Nothing else is affected by this bug.

Since you use m=s=raid10, the filesystem metadata is OK, so I think
the scrub hang could be a different problem.
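
You can double check the profiles with something like the below (mount
point and numbers are made up, just to show the expected layout):

  $ btrfs filesystem df /mnt
  Data, RAID5: total=10.00TiB, used=9.20TiB
  System, RAID10: total=64.00MiB, used=704.00KiB
  Metadata, RAID10: total=12.00GiB, used=10.50GiB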


> 
> Scrub just hangs and pretty much makes the whole system hang (it
> needs a power cycle to reboot). Everything else runs smoothly,
> though. Btrfs check (read-only, normal-mem mode) finds no errors,
> the kernel log is clean, etc.
> 
> I think I deleted all the affected dio-written test-files even before
> I started scrubbing, so that doesn't seem to do the trick. Any other
> ideas?
>

A hang can normally be caught by sysrq-w; could you please try it
and see if anything shows up in the kernel log?
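
Something like this should do it (as root, assuming sysrq is enabled);
it dumps backtraces of blocked tasks into the kernel log:

  echo 1 > /proc/sys/kernel/sysrq   # enable all sysrq functions
  echo w > /proc/sysrq-trigger      # dump tasks in uninterruptible sleep
  dmesg | tail -n 200               # look for the blocked scrub threads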

Thanks,

-liubo
> 
> * By the way, I see raid56 scrub is still painfully slow (~30MB/s per
> disk, with raw disk speeds of >100MB/s). I forgot about this issue
> since I last used raid5 a few years ago.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write corruption due to bio cloning on raid5/6
  2017-07-26 16:07 ` Liu Bo
@ 2017-07-27 14:14   ` Janos Toth F.
  2017-07-27 20:44     ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Janos Toth F. @ 2017-07-27 14:14 UTC (permalink / raw)
  To: bo.li.liu; +Cc: Btrfs BTRFS

> It should only affect the dio-written files: the bug makes btrfs
> write garbage into those files, so checksum verification fails when
> reading them. Nothing else is affected by this bug.

Thanks for confirming that.  I thought so, but I had removed the
affected temporary files even before I knew they were corrupt, yet I
still had trouble with the follow-up scrub, so I got confused about
the scope of the issue.
However, I am not sure whether some other software that regularly runs
in the background uses DIO (I don't think so, but I can't say for
sure).

> A hang can normally be caught by sysrq-w; could you please try it
> and see if anything shows up in the kernel log?

It's not a total system hang. The filesystem in question effectively
becomes read-only (I forgot to check if it actually turns RO or writes
just silently hang) and scrub hangs (it doesn't seem to do any disk
I/O and can't be cancelled gracefully). A graceful reboot or shutdown
silently fails.

In the meantime, I switched to Linux 4.12.3, updated the firmware on
the HDDs and ran an extended SMART self-test (which found no errors),
used cp to copy everything (not for backup but as a form of "crude
scrub" [see *], which yielded zero errors) and now finally started a
scrub (in foreground and read-only mode this time).

* This is off-topic but raid5 scrub is painful. The disks run at
constant ~100% utilization while performing at ~1/5 of their
sequential read speeds. And despite explicitly asking idle IO priority
when launching scrub, the filesystem becomes unbearably slow (while
scrub takes a day or so to finish ... or get to the point where it
hung the last time around, close to the end).
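
What I mean by asking for idle IO priority is launching scrub roughly
like this (the mount point is illustrative):

  btrfs scrub start -c 3 /mnt/data      # -c 3 = idle ioprio class
  # or wrapping the userspace command instead:
  ionice -c 3 btrfs scrub start -B /mnt/data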

I find it a little strange that BFQ and the on-board disk caching with
NCQ + FUA (look-ahead read caching and write cache reordering enabled
with 128MB on-board caches) can't mitigate the issue a little better
(whatever scrub is doing "wrong" from a performance perspective).

If scrub hangs again, I will try to extract something useful from the logs.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write corruption due to bio cloning on raid5/6
  2017-07-27 14:14   ` Janos Toth F.
@ 2017-07-27 20:44     ` Duncan
  2017-07-29  3:02       ` Janos Toth F.
  0 siblings, 1 reply; 8+ messages in thread
From: Duncan @ 2017-07-27 20:44 UTC (permalink / raw)
  To: linux-btrfs

Janos Toth F. posted on Thu, 27 Jul 2017 16:14:47 +0200 as excerpted:

> * This is off-topic but raid5 scrub is painful. The disks run at
> constant ~100% utilization while performing at ~1/5 of their sequential
> read speeds. And despite explicitly asking idle IO priority when
> launching scrub, the filesystem becomes unbearably slow (while scrub
> takes a day or so to finish ... or get to the point where it hung the
> last time around, close to the end).

That's because basically all the userspace scrub command does is make the 
appropriate kernel calls to have it do the real scrub.  So priority-
idling the userspace scrub doesn't do what it does on normal userspace 
jobs that do much of the work themselves.

The problem is that idle-prioritizing the kernel threads actually doing 
the work could risk a deadlock due to lock inversion, since they're 
kernel threads and aren't designed with the idea of people messing with 
their priority in mind.

Meanwhile, that's yet another reason btrfs raid56 mode isn't recommended 
at this time.  Try btrfs raid1 or raid10 mode instead, or possibly btrfs 
raid1, single or raid0 mode on top of a pair of mdraid5s or similar.  Tho 
parity-raid mode in general (that is, not btrfs-specific) is known for 
being slow in various cases, with raid10 normally being the best 
performing closest alternative.  (Tho in the btrfs-specific case, btrfs 
raid1 on top of a pair of mdraid/dmraid/whatever raid0s, is the normally 
recommended higher performance reasonably low danger alternative.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write corruption due to bio cloning on raid5/6
  2017-07-27 20:44     ` Duncan
@ 2017-07-29  3:02       ` Janos Toth F.
  2017-07-29 23:05         ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Janos Toth F. @ 2017-07-29  3:02 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

The read-only scrub finished without errors/hangs (with kernel
4.12.3). So, I guess the hangs were caused by:
1: another bug in 4.13-RC1
2: crazy-random SATA/disk-controller issue
3: interference between various btrfs tools [*]
4: something in the background did DIO write with 4.13-RC1 (but all
affected content was eventually overwritten/deleted between the scrub
attempts)

[*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't
expect interference issues), so I didn't disable the scheduled
maintenance script which deletes old files, recursively defrags the
whole fs and runs a balance with usage=33 filters. I guess any of
those (especially balance) could potentially cause scrub to hang.
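
A rough sketch of what that maintenance script does (paths, age limit
and filter values are illustrative, not the real script):

  #!/bin/sh
  find /mnt/data/old -type f -mtime +30 -delete         # delete old files
  btrfs filesystem defragment -r /mnt/data              # recursive defrag
  btrfs balance start -dusage=33 -musage=33 /mnt/data   # usage=33 filters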

On Thu, Jul 27, 2017 at 10:44 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Janos Toth F. posted on Thu, 27 Jul 2017 16:14:47 +0200 as excerpted:
>
>> * This is off-topic but raid5 scrub is painful. The disks run at
>> constant ~100% utilization while performing at ~1/5 of their sequential
>> read speeds. And despite explicitly asking idle IO priority when
>> launching scrub, the filesystem becomes unbearably slow (while scrub
>> takes a day or so to finish ... or get to the point where it hung the
>> last time around, close to the end).
>
> That's because basically all the userspace scrub command does is make the
> appropriate kernel calls to have it do the real scrub.  So priority-
> idling the userspace scrub doesn't do what it does on normal userspace
> jobs that do much of the work themselves.
>
> The problem is that idle-prioritizing the kernel threads actually doing
> the work could risk a deadlock due to lock inversion, since they're
> kernel threads and aren't designed with the idea of people messing with
> their priority in mind.
>
> Meanwhile, that's yet another reason btrfs raid56 mode isn't recommended
> at this time.  Try btrfs raid1 or raid10 mode instead, or possibly btrfs
> raid1, single or raid0 mode on top of a pair of mdraid5s or similar.  Tho
> parity-raid mode in general (that is, not btrfs-specific) is known for
> being slow in various cases, with raid10 normally being the best
> performing closest alternative.  (Tho in the btrfs-specific case, btrfs
> raid1 on top of a pair of mdraid/dmraid/whatever raid0s, is the normally
> recommended higher performance reasonably low danger alternative.)

If this applies to all RAID flavors, then I consider the built-in help
and the manual page of scrub misleading (and if it's RAID56-only, the
manual should still mention that RAID56 is an exception).

Also, a resumed scrub seems to skip a lot of data. It picks up where
it left off but then prematurely reports a job well done. I remember
noticing similar behavior with balance cancel/resume on RAID5 a few
years ago (it went on for a few more chunks but left the rest alone
and reported completion --- I am not sure whether that's fixed now or
whether these have a common root cause).
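
To be clear, by resume I mean the ordinary cancel/resume cycle, roughly
(mount point illustrative):

  btrfs scrub cancel /mnt/data    # interrupt; the position is saved
  btrfs scrub resume /mnt/data    # picks up at the saved position...
  btrfs scrub status /mnt/data    # ...but the scrubbed byte count ends up too low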

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write corruption due to bio cloning on raid5/6
  2017-07-29  3:02       ` Janos Toth F.
@ 2017-07-29 23:05         ` Duncan
  2017-07-30  1:39           ` Janos Toth F.
  0 siblings, 1 reply; 8+ messages in thread
From: Duncan @ 2017-07-29 23:05 UTC (permalink / raw)
  To: linux-btrfs

Janos Toth F. posted on Sat, 29 Jul 2017 05:02:48 +0200 as excerpted:

> The read-only scrub finished without errors/hangs (with kernel
> 4.12.3). So, I guess the hangs were caused by:
> 1: another bug in 4.13-RC1
> 2: crazy-random SATA/disk-controller issue
> 3: interference between various btrfs tools [*]
> 4: something in the background did DIO write with 4.13-RC1 (but all
> affected content was eventually overwritten/deleted between the scrub
> attempts)
> 
> [*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't
> expect interference issues), so I didn't disable the scheduled
> maintenance script which deletes old files, recursively defrags the
> whole fs and runs a balance with usage=33 filters. I guess any of
> those (especially balance) could potentially cause scrub to hang.

That #3, interference between btrfs tools, could be it.  It seems btrfs
in general is getting stable enough now that we're beginning to see
bugs exposed by running two or more tools at once: the devs have
apparently caught and fixed enough of the single-usage race bugs that
individual tools work reasonably well, and it's now the concurrent
multi-usage races, which no one was thinking about when they were
writing the code, that are being exposed.  At least, there have lately
been a number of such bugs either definitely or probably traced to
concurrent usage, then reported and fixed, more than I remember seeing
in the past.


(TL;DR folks can stop at that.)

Incidentally, that's one more advantage to my own strategy of multiple
independent small btrfs, keeping everything small enough that
maintenance jobs are at least tolerably short, making it actually
practical to run them.

Tho my case is surely an extreme, with everything on ssd and my largest
btrfs, even after recently switching my media filesystems to ssd and
btrfs, being 80 GiB (usable and per device, btrfs raid1 on paired
partitions, each on a different physical ssd).  I use neither quotas,
which don't scale well on btrfs and I don't need them, nor snapshots,
which have a bit of a scaling issue (tho apparently not as bad as
quotas) as well, because weekly or monthly backups are enough here, and
the filesystems are small enough (and on ssd) to do full-copy backups
in minutes each.  In fact, making backups easier was a major reason I
switched even the backups and media devices to all ssd, this time.

So scrubs are trivially short enough I can run them and wait for the
results while composing posts such as this (bscrub is my scrub script,
run here by my admin user with a stub setup so sudo isn't required):

$$ bscrub /mm 
scrub device /dev/sda11 (id 1) done
        scrub started at Sat Jul 29 14:50:54 2017 and finished after 00:01:08
        total bytes scrubbed: 33.98GiB with 0 errors
scrub device /dev/sdb11 (id 2) done
        scrub started at Sat Jul 29 14:50:54 2017 and finished after 00:01:08
        total bytes scrubbed: 33.98GiB with 0 errors

Just over a minute for a scrub of both devices on my largest 80 gig per
device btrfs. =:^)  Tho projecting to full it might be 2 and a half minutes...

Tho of course parity-raid scrubs would be far slower, at a WAG an hour or two,
for similar size on spinning rust...

Balances are similar, but being on ssd and not needing one on any of the still
relatively freshly redone filesystems ATM, I don't feel inclined to needlessly
spend a write cycle just for demonstration.

With filesystem maintenance runtimes of a minute, definitely under five minutes,
per filesystem, and with full backups under 10, I don't /need/ to run more than
one tool at once, and backups can trivially be kept fresh enough that I don't
really feel the need to schedule maintenance and risk running more than one
that way, either, particularly when I know it'll be done in minutes if I run it
manually. =:^)

Like I said, I'm obviously an extreme case, but equally obviously, while I see
the runtime concurrency bug reports on the list, it's not something likely to
affect me personally. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write corruption due to bio cloning on raid5/6
  2017-07-29 23:05         ` Duncan
@ 2017-07-30  1:39           ` Janos Toth F.
  2017-07-30  5:26             ` Duncan
  0 siblings, 1 reply; 8+ messages in thread
From: Janos Toth F. @ 2017-07-30  1:39 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

Reply to the TL;DR part, so TL;DR marker again...

Well, I live on the other extreme now. I want as few filesystems as
possible and viable (it's obviously impossible to have a real backup
within the same fs and/or device, and with the current
size/performance/price differences between HDD and SSD it makes sense
to separate the "small and fast" from the "big and slow" storage, but
other than that...). I always believed (even before I got a real grasp
on these things and could explain my view or argue about it) that
"subvolumes" (in a general sense, but let's use this word here) should
reside below filesystems (and be totally optional), that filesystems
should spread over a whole disk or (md- or hardware) RAID volume
(forget the MSDOS partitions), and that even these ZFS/Btrfs-style
subvolumes should be used sparingly (only when you really have a good
enough reason to create one, although it matters nowhere near as much
with subvolumes as it does with partitions).

I remember the days when I thought it was important to create separate
partitions for different kinds of data (10+ years ago, when I was aware
I didn't have the experience to deviate from common general teachings).
I remember all the pain of randomly running out of space on any and all
filesystems and eventually mixing the various kinds of data on every
theoretically-segregated filesystem (wherever I found free space),
causing a nightmare of a broken sorting system (like a library after a
tornado), and then all the horror of my first Russian-roulette-like
experiences of resizing partitions and filesystems to make the
segregation decent again. And I saw much worse on other people's
machines. At one point, I decided to create as few partitions as
possible (and I really like the idea of zero partitions; I don't miss
MSDOS).
I still get shivers if I need to resize a filesystem, due to the
memories of those early tragic experiences when I never won the
lottery on the "trial and error" runs but lost filesystems with both
hands, and learned what widespread silent corruption is and how you
can refresh your backups with corrupted copies... Let's not take me
back to those early days, please. I don't want to live in a cave
anymore. Thank you, modern filesystems (and their authors). :)

And on that note... Assuming I had interference problems, they were
caused by my own mistake/negligence. I can always make similar or
bigger human mistakes, independent of disk-level segregation. (For
example, no amount of partitions will save any data if I accidentally
wipe the entire drive with dd, or if I have it security-locked by the
controller and lose the passwords, etc...)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: write corruption due to bio cloning on raid5/6
  2017-07-30  1:39           ` Janos Toth F.
@ 2017-07-30  5:26             ` Duncan
  0 siblings, 0 replies; 8+ messages in thread
From: Duncan @ 2017-07-30  5:26 UTC (permalink / raw)
  To: linux-btrfs

Janos Toth F. posted on Sun, 30 Jul 2017 03:39:10 +0200 as excerpted:

[OT but related topic continues...]

> I still get shivers if I need to resize a filesystem, due to the
> memories of those early tragic experiences when I never won the
> lottery on the "trial and error" runs but lost filesystems with both
> hands, and learned what widespread silent corruption is and how you
> can refresh your backups with corrupted copies... Let's not take me
> back to those early days, please. I don't want to live in a cave
> anymore. Thank you, modern filesystems (and their authors). :)
> 
> And on that note... Assuming I had interference problems, they were
> caused by my own mistake/negligence. I can always make similar or
> bigger human mistakes, independent of disk-level segregation. (For
> example, no amount of partitions will save any data if I accidentally
> wipe the entire drive with dd, or if I have it security-locked by the
> controller and lose the passwords, etc...)

I was glad to say goodbye to MSDOS/MBR style partitions as well, but just 
as happy to enthusiastically endorse GPT/EFI style partitions, with their 
effectively unlimited partition numbers (128 allowed at the default table 
size), no primary/logical partition stuff to worry about, partition (as 
opposed to filesystem in the partition) labels/names, integrity 
checksums, and second copy at the other end of the device. =:^)
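
For instance, gptfdisk's sgdisk can show the partition names and verify
the checksums and the backup table directly (device name is just an
example):

  sgdisk -p /dev/sda    # print the table, including partition names
  sgdisk -v /dev/sda    # verify CRCs and the backup header/table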

And while all admins have their fat-finger or fat-head, aka brown-bag, 
experiences, I've never erased the wrong partition, tho I can certainly 
remember being /very/ careful the first couple times I did partitioning, 
back in the 90s on MSDOS.  Thankfully, these days even ssds are 
"reasonably" priced, and spinning rust is the trivial cost of perhaps a 
couple meals out, so as long as there's backups on other physical 
devices, getting even the device name wrong simply means losing perhaps 
your working copy instead of redoing the layout of one of your backups.

And of course you can see the existing layout on the device before you 
repartition it, and if it's not what you expected or there are any other 
problems, you just back out without doing that final writeout of the new 
partition table.

FWIW my last brown-bag was writing and running a script as root, with a 
variable-name typo that left the variable empty in an rm -rf $varname/.  I 
caught and stopped it after it had emptied /bin, while it was in /etc, I 
believe.  Luckily I could boot to the (primary) backup.
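
A minimal sketch of that failure mode, and the guards I'd use now (the
variable names are made up):

  #!/bin/sh
  # the bug: $vardir is set, but the typo'd $varidr expands to nothing,
  # so the command below effectively becomes rm -rf /
  vardir=/var/tmp/workdir
  rm -rf "$varidr/"

  # the guards: abort on unset variables, and refuse an empty expansion
  set -u
  rm -rf "${vardir:?}/"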

But meanwhile, two experiences that set in concrete the practicality of 
separate filesystems on their own partitions, for me:

1) Back on MS, IE4-beta era.  I was running the public beta when the MSIE 
devs decided that for performance reasons they needed to write directly 
to the IE cache index on disk, bypassing the usual filesystem methods.  
What they didn't think about, however, was IE's new integration into the 
Explorer shell, meaning it was running all the time.

So along come people running the beta, running their scheduled defrag, 
which decides the index is fragmented and moves it out from under the 
(of course still running) Explorer shell, so the next time IE 
direct-writes to what WAS the cache index, it overwrites whatever file 
defrag moved into that spot after it moved the cache file out.

The eventual fix was to set the system attribute on the cache index, so 
the defragger wouldn't touch it.

I know a number of people running that beta that lost important files to 
that, when those files got moved into the old on-disk location of the 
cache index file and overwritten by IE when it direct-wrote to what it 
/thought/ was still the on-disk location of its index file.

But I was fine, never in any danger, because IE's "Temporary Internet 
Files" cache was on a dedicated tempfiles filesystem.  So the only files 
it overwrote for me were temporary in any case.

2) Some years ago, during a Phoenix summer, my AC went out.  I was in a 
trailer at the time, so without the AC it got hot pretty quickly, and I 
was away, with the computer left on, at the time it went out.

The high in the shade that day was about 47C/117F, and the trailer was in 
the sun, so it easily hit 55-60C/131-140F inside.  The computer was 
obviously going to be hotter than that, and the spinning disk in the 
computer hotter still, so it easily hit 70C/158F or higher.

The CPU shut down of course, and was fine when I turned it back on after 
a cooldown.

The disk... not so fine.  I'm sure it physically head-crashed and if I 
had taken it apart I'd have found grooves on the platter.

But... disks were a lot more expensive back then, and I didn't have 
another disk with backups.

What I *DID* have were backup partitions on the same disk, and because 
they weren't mounted at the time, the head didn't try seeking to them, 
and they weren't damaged (at least not beyond what could be repaired). 
When I went to assess things after everything cooled down, the damage was 
(almost) all on the mounted partitions, damaged beyond repair, but I 
continued to run that disk from what had been the backup partitions, 
until some months later when I scraped together enough money to buy more 
disks, this time enough of them to do RAID and make proper other-physical-
device backups.  (Prices were by that time starting to come down enough 
so I could buy multiple at a time, and I bought four 300-gig disks, my 
first SATAs, as the mobo had four SATA ports.)

Had I not had it partitioned off, the seeks across the much bigger single 
active partition would have been very likely to do far greater damage, 
and I'd have probably lost everything.

Tho I /did/ decide I had it /too/ partitioned up at that time, because I 
ended up working off of a / backup from one date, a /var backup from 
another, and a /usr backup from a third, which really complicated package 
management trying to get things back to sanity, because the package 
manager database on /var didn't match the actual package versions running 
on / and /usr.  That's why these days, with a small exception for the 
operational-writable-mount-necessary /var/lib since / is operational read-
only mounted, everything the package manager touches, including both what 
it installs and its installation database, is on /, so regardless of what 
backup I end up on, the installation database now on / matches what's 
actually installed on that /.

Meanwhile, as mentioned, these days I keep /, including /bin /etc /usr 
and (most of) /var, mounted read-only, so it's unlikely to be damaged 
unless I'm actually updating at the time of a mishap.  /boot is its own 
partition, unmounted most of the time.  /var/log is its own sub-GiB 
partition (with journald set to volatile only, syslog-ng does the logging 
to permanent storage), so it's actually writable, but a runaway log won't 
fill up anything critical.  /home is its own partition, with a 
/home/var/lib that the read-only /var/lib symlinks to, for the rather 
small quantity of generic system stuff that needs to be writable.  I'm a 
gentooer so I build all my updates, and the gentoo and overlay trees, 
along with sources, the local binpkg cache, ccache and the kernel tree, 
are all on /pkgs, which is only mounted when I'm updating.  That's the 
core.  Then there are the media partition and the text and binary news 
partitions, separate from each other and from /home both so usage is 
controlled (quota-by-filesystem-size) and so they can be unmounted if I'm 
not using them, reducing the amount of potential damage should a mishap 
occur.
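
An illustrative fstab sketch of that layout (labels and options are
made up, not my literal config):

  LABEL=root  /         btrfs  ro,noatime      0 0
  LABEL=boot  /boot     btrfs  noauto,noatime  0 0
  LABEL=log   /var/log  btrfs  rw,noatime      0 0
  LABEL=home  /home     btrfs  rw,noatime      0 0
  LABEL=pkgs  /pkgs     btrfs  noauto,noatime  0 0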

Then of course there's two levels of backup of everything, with each 
backup partition and filesystem the same size as the working copy, with 
one backup on the same large working pair of ssds, and the other on a 
smaller backup pair.

Because only what's necessary for normal ops and for whatever I'm doing 
at that time are mounted, and root is mounted read-only unless I'm 
updating, both the likely potential damage and total filesystem-loss 
scenarios are much more limited and thus much easier to recover from than 
if the entire working system was on the same filesystem.

I actually learned /that/ lesson the hard way, tho, when I had everything 
on massive mdraid, and it took hours to rebuild from a simple ungraceful 
shutdown (this was before write-intent bitmaps).  After redoing the 
layout to multiple mdraids, each assembled from parallel partitions on 
the respective physical devices (spinning rust at the time), rebuild of a single 
mdraid was under 10 minutes, where after the first one I could go back to 
work with less disruption, and rebuild of all the typically active mdraids 
was under 30 minutes.  The difference was in all the data that wasn't in 
actual use at the time, but still assembled into the active raid under 
the first setup, while under the second, the other raids weren't even 
active.
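
These days a write-intent bitmap avoids most of that, and on current
mdadm it can even be added to an existing array (device name
illustrative):

  mdadm --grow --bitmap=internal /dev/md0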

Meanwhile, thru multiple generations of layout revisions and updates, I 
now have all filesystems at pretty close to their perfect sizes, so 
filesystem size, far from being an inconvenient barrier to work around, 
actually works as convenient size-quota enforcement. If one of the 
filesystems starts getting too full, it's time to investigate why, take 
corrective action and do some cleanup. =:^)

And if I actually decide I need more room than is free on that filesystem 
for something, worst-case, I have a set of backups on a second pair of 
ssds, that I can make sure are current, then blkdiscard the entire 
physical device for the pair of working-copy ssds, and redo the layout.  
But I'm close enough to optimal now, that I usually only need to do that 
when I'm updating devices anyway.  Or at least, I've found the last two 
existing layouts more than workable until I updated physical devices and 
needed to do a new layout for them anyway. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-07-30  5:27 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-24 20:22 write corruption due to bio cloning on raid5/6 Janos Toth F.
2017-07-26 16:07 ` Liu Bo
2017-07-27 14:14   ` Janos Toth F.
2017-07-27 20:44     ` Duncan
2017-07-29  3:02       ` Janos Toth F.
2017-07-29 23:05         ` Duncan
2017-07-30  1:39           ` Janos Toth F.
2017-07-30  5:26             ` Duncan
