* Power down tests...
@ 2017-08-04  5:51 Shyam Prasad N
  2017-08-04  6:00 ` Adam Borowski
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Shyam Prasad N @ 2017-08-04  5:51 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

We're running a couple of experiments on our servers with btrfs
(kernel version 4.4).
And we're running some abrupt power-off tests for a couple of scenarios:

1. We have a filesystem on top of two different btrfs filesystems
(distributed across N disks). i.e. Our filesystem lays out data and
metadata on top of these two filesystems. With the test workload, it
is going to generate a large number of 16MB files on the system.
On an abrupt power-off and the following reboot, what are the
recommended steps to run? We're attempting a btrfs mount, which seems
to fail sometimes. If it fails, we run fsck and then mount the btrfs.
The issue that we're facing is that a few files have been zero-sized.
As a result, there is either data loss or an inconsistency in the
stacked filesystem's metadata.
We're mounting the btrfs with a commit period of 5s. However, I do
expect btrfs to journal the I/Os that are still dirty. Why, then, are
we seeing the above behaviour?

2. Another test that we're running is to create one virtual disk on
each of multiple NFS mounts (soft mounts with a timeout of 1 min), and
use these virtual disks as individual devices for a single btrfs
filesystem. What is the expected btrfs behaviour when one of the
virtual disks becomes unresponsive for a period of time (say 5 min)?
Does the expectation change if the NFS mounts are mounted with the
sync option?
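
For reference, each NFS mount in this setup looks roughly like the
following (server, export and mount point names are placeholders;
timeo is in tenths of a second, so 600 means the 1-minute timeout):

$ mount -t nfs -o soft,timeo=600,retrans=2 server:/export/disk1 /mnt/nfs1
# the "sync" variant of the question would add the sync option:
$ mount -t nfs -o soft,timeo=600,retrans=2,sync server:/export/disk1 /mnt/nfs1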

Thanks in advance for any help you can offer.
-- 
-Shyam


* Re: Power down tests...
  2017-08-04  5:51 Power down tests Shyam Prasad N
@ 2017-08-04  6:00 ` Adam Borowski
       [not found]   ` <CANT5p=qvu9tCf73+_PuAj9Ryy69p3JjAyHFY_pAA9eUsTz_ELA@mail.gmail.com>
  2017-08-07  2:15 ` Chris Murphy
  2017-08-07  2:22 ` Chris Murphy
  2 siblings, 1 reply; 11+ messages in thread
From: Adam Borowski @ 2017-08-04  6:00 UTC (permalink / raw)
  To: Shyam Prasad N; +Cc: linux-btrfs

On Fri, Aug 04, 2017 at 11:21:15AM +0530, Shyam Prasad N wrote:
> We're running a couple of experiments on our servers with btrfs
> (kernel version 4.4).
> And we're running some abrupt power-off tests for a couple of scenarios:
> 
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks). i.e. Our filesystem lays out data and
> metadata on top of these two filesystems. With the test workload, it
> is going to generate a good amount of 16MB files on top of the system.
> On abrupt power-off and following reboot, what is the recommended
> steps to be run. We're attempting btrfs mount, which seems to fail
> sometimes. If it fails, we run a fsck and then mount the btrfs. The
> issue that we're facing is that a few files have been zero-sized. As a
> result, there is either a data-loss, or inconsistency in the stacked
> filesystem's metadata.

Sounds like you want to mount with -o flushoncommit.

> We're mounting the btrfs with commit period of 5s. However, I do
> expect btrfs to journal the I/Os that are still dirty. Why then are we
> seeing the above behaviour.

By default, btrfs guarantees only metadata consistency, like most
filesystems. This improves performance at the cost of failing use
cases like yours.
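
A minimal sketch of what that might look like, keeping your 5s commit
interval (device and mount point are placeholders):

# flush dirty data as part of every transaction commit
$ mount -o flushoncommit,commit=5 /dev/sdb /mnt/btrfs1
# or the equivalent /etc/fstab line:
# /dev/sdb  /mnt/btrfs1  btrfs  flushoncommit,commit=5  0 0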

-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄⠀⠀⠀⠀ • use glitches to walk on water


* Re: Power down tests...
       [not found]   ` <CANT5p=qvu9tCf73+_PuAj9Ryy69p3JjAyHFY_pAA9eUsTz_ELA@mail.gmail.com>
@ 2017-08-04  7:22     ` Adam Borowski
  2017-08-04  7:49       ` Shyam Prasad N
  0 siblings, 1 reply; 11+ messages in thread
From: Adam Borowski @ 2017-08-04  7:22 UTC (permalink / raw)
  To: Shyam Prasad N; +Cc: linux-btrfs

On Fri, Aug 04, 2017 at 12:15:12PM +0530, Shyam Prasad N wrote:
> Is flushoncommit not a default option on version
> 4.4? Do I need to specifically set this option?

It's not the default.

-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄⠀⠀⠀⠀ • use glitches to walk on water


* Re: Power down tests...
  2017-08-04  7:22     ` Adam Borowski
@ 2017-08-04  7:49       ` Shyam Prasad N
       [not found]         ` <20170804145401.78e50990@job>
  0 siblings, 1 reply; 11+ messages in thread
From: Shyam Prasad N @ 2017-08-04  7:49 UTC (permalink / raw)
  To: Adam Borowski; +Cc: linux-btrfs

Oh ok. I read this in the man page and assumed that it's on by default:
       flushoncommit, noflushoncommit
           (default: on)

           This option forces any data dirtied by a write in a prior
           transaction to commit as part of the current commit. This
           makes the committed state a fully consistent view of the file
           system from the application’s perspective (i.e., it includes
           all completed file system operations). This was previously
           the behavior only when a snapshot was created.

           Disabling flushing may improve performance but is not crash-safe.


Maybe this needs a correction?

On Fri, Aug 4, 2017 at 12:52 PM, Adam Borowski <kilobyte@angband.pl> wrote:
> On Fri, Aug 04, 2017 at 12:15:12PM +0530, Shyam Prasad N wrote:
>> Is flushoncommit not a default option on version
>> 4.4? Do I need to specifically set this option?
>
> It's not the default.
>
> --
> ⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
> ⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
> ⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
> ⠈⠳⣄⠀⠀⠀⠀ • use glitches to walk on water



-- 
-Shyam


* Re: Power down tests...
       [not found]         ` <20170804145401.78e50990@job>
@ 2017-08-04 12:09           ` Shyam Prasad N
  2017-08-07  2:25             ` Chris Murphy
  0 siblings, 1 reply; 11+ messages in thread
From: Shyam Prasad N @ 2017-08-04 12:09 UTC (permalink / raw)
  To: Dmitrii Tcvetkov, Adam Borowski; +Cc: linux-btrfs

Thanks, guys. I've enabled that option now. Let's see how it goes.
One general question regarding the stability of btrfs in kernel
version 4.4: is it okay for power-off test cases, or are there many
important fixes in newer kernels?

On Fri, Aug 4, 2017 at 5:24 PM, Dmitrii Tcvetkov <demfloro@demfloro.ru> wrote:
> On Fri, 4 Aug 2017 13:19:39 +0530
> Shyam Prasad N <nspmangalore@gmail.com> wrote:
>
>> Oh ok. I read this in the man page and assumed that it's on by
>> default: flushoncommit, noflushoncommit
>>            (default: on)
>>
>>            This option forces any data dirtied by a write in a prior
>> transaction to commit as part of the current commit. This makes the
>> committed state a fully consistent view of the file system from the
>>            application’s perspective (i.e., it includes all completed
>> file system operations). This was previously the behavior only when a
>> snapshot was created.
>>
>>            Disabling flushing may improve performance but is not
>> crash-safe.
>>
>>
>> Maybe this needs a correction?
>
> In 4.12 btrfs-progs man pages it's already updated.
>
> $ man 5 btrfs
> ...
>        flushoncommit, noflushoncommit
>            (default: off)
>
>            This option forces any data dirtied by a write in a prior
>            transaction to commit as part of the current commit,
>            effectively a full filesystem sync.
>
>            This makes the committed state a fully consistent view of
>            the file system from the application’s perspective (i.e., it
>            includes all completed file system operations). This was
>            previously the behavior only when a snapshot was created.
>
>            When off, the filesystem is consistent but buffered writes
>            may last more than one transaction commit.
>
>



-- 
-Shyam


* Re: Power down tests...
  2017-08-04  5:51 Power down tests Shyam Prasad N
  2017-08-04  6:00 ` Adam Borowski
@ 2017-08-07  2:15 ` Chris Murphy
  2017-08-07  3:35   ` Adam Borowski
  2017-08-07  6:53   ` Shyam Prasad N
  2017-08-07  2:22 ` Chris Murphy
  2 siblings, 2 replies; 11+ messages in thread
From: Chris Murphy @ 2017-08-07  2:15 UTC (permalink / raw)
  To: Shyam Prasad N; +Cc: Btrfs BTRFS

On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N <nspmangalore@gmail.com> wrote:
> Hi all,
>
> We're running a couple of experiments on our servers with btrfs
> (kernel version 4.4).
> And we're running some abrupt power-off tests for a couple of scenarios:
>
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks). i.e. Our filesystem lays out data and
> metadata on top of these two filesystems.

This is astronomically more complicated than the already complicated
scenario with one file system on a single normal partition of a well
behaved (non-lying) single drive.

You have multiple devices, so any one or all of them could drop data
during the power failure and in different amounts. In the best case
scenario, at next mount the supers are checked on all the devices, and
the lowest common denominator generation is found, and therefore the
lowest common denominator root tree. No matter what, it means some
data is going to be lost.

Next, there is a file system on top of a file system; I assume it's a
file that's loopback mounted?

>With the test workload, it
> is going to generate a good amount of 16MB files on top of the system.
> On abrupt power-off and following reboot, what is the recommended
> steps to be run. We're attempting btrfs mount, which seems to fail
> sometimes. If it fails, we run a fsck and then mount the btrfs.

I'd want to know why it fails. And then I'd check all the supers on
all the devices  with 'btrfs inspect-internal dump-super -fa <dev>'.

Are all the copies on a given device the same and valid? Are all the
copies among all devices the same and valid? I'm expecting there will
be discrepancies and then you have to figure out if the mount logic is
really finding the right root to try to mount. I'm not sure if kernel
code by default reports back in detail what logic it's using and
exactly where it fails, or if you just get the generic open_ctree
mount failure message.
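
A rough way to eyeball that (device names here are placeholders for
your actual member devices):

# dump every superblock copy and pull out its location and generation
$ btrfs inspect-internal dump-super -fa /dev/sdb | grep -E '^(superblock:|generation)'
$ btrfs inspect-internal dump-super -fa /dev/sdc | grep -E '^(superblock:|generation)'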

And then it's an open question whether the supers need fixing, or
whether the 'usebackuproot' mount option is the way to go. It might
depend on the status of the supers how that logic ends up working.
Again, it might be useful if there were debug info that explicitly
shows the mount logic actually being used, dumped to kernel messages.
I'm not sure if that code exists when CONFIG_BTRFS_DEBUG is enabled
(as in, I haven't looked but I've thought it really could come in
handy in some of the cases we see of mount failure where we can't tell
where things are getting stuck with the existing reporting).



>The
> issue that we're facing is that a few files have been zero-sized.

I can't tell you if that's a bug or not because I'm not sure how your
software creates these 16M backing files, if they're fallocated or
touched or what. It's plausible they're created as zero-length files,
and the file system successfully creates them, and then data is written
to them, but before there is either committed metadata or an updated
super pointing to the new root tree you get a power failure. And in
that case, I expect a zero length file or maybe some partial amount of
data is there.


>As a
> result, there is either a data-loss, or inconsistency in the stacked
> filesystem's metadata.

Sounds expected for any file system, but chances are there's more
missing with a CoW file system since by nature it rolls back to the
most recent sane checkpoint for the fs metadata without any regard to
what data is lost to make that happen. The goal is to not lose the
file system in such a case, as some amount of data loss is always
going to happen, which is why power losses need to be avoided (UPSes
and such). The
fact that you have a file system on top of a file system makes it more
fragile because the 2nd file system's metadata *IS* data as far as the
1st file system is concerned. And that data is considered expendable.


> We're mounting the btrfs with commit period of 5s. However, I do
> expect btrfs to journal the I/Os that are still dirty. Why then are we
> seeing the above behaviour.

commit=5 might make the problem worse by requiring such constant
flushing of dirty data that you're getting a bunch of disk contention;
hard to say, since there are no details about the workload at the time
of the power failure. Changing nothing else but the commit= mount
option, what difference do you see (with a scientific sample), if any,
between commit=5 and the default commit=30 when it comes to the amount
of data loss?
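
Something along these lines should work for that comparison (the mount
point is a placeholder):

# switch the commit interval without unmounting, then rerun the test
$ mount -o remount,commit=30 /mnt/btrfs1
# check the options btrfs reports (defaults may be omitted)
$ grep btrfs /proc/mounts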

Another thing we don't know is how the application or service writing
out these 16M backing files behaves when it comes to fsync, fdatasync,
or fadvise.



-- 
Chris Murphy


* Re: Power down tests...
  2017-08-04  5:51 Power down tests Shyam Prasad N
  2017-08-04  6:00 ` Adam Borowski
  2017-08-07  2:15 ` Chris Murphy
@ 2017-08-07  2:22 ` Chris Murphy
  2017-08-07  7:38   ` Shyam Prasad N
  2 siblings, 1 reply; 11+ messages in thread
From: Chris Murphy @ 2017-08-07  2:22 UTC (permalink / raw)
  To: Shyam Prasad N; +Cc: Btrfs BTRFS

On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N <nspmangalore@gmail.com> wrote:
> Hi all,
>
> We're running a couple of experiments on our servers with btrfs
> (kernel version 4.4).
> And we're running some abrupt power-off tests for a couple of scenarios:
>
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks).

What's the layout from physical devices all the way to your 16M file?
This is hardware raid, lvm linear, Btrfs raid? All of that matters.

Do the drives have write caching disabled? You might be better off
with the drive write cache disabled, and then add bcache or dm-cache
and an SSD to compensate. But that's just speculation on my part. The
write cache in the drives is definitely volatile, and disabling it
will definitely make writes slower. So, you might have slightly better
luck with another layout.
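
If you want to test that, this is the usual way to toggle it on SATA
drives (device name is a placeholder; SAS/SCSI drives would need
sdparm instead):

# turn off the drive's volatile write cache, then read the setting back
$ hdparm -W 0 /dev/sdb
$ hdparm -W /dev/sdb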

But the bottom line is, you need to figure out a way to avoid *any*
data loss in your files because otherwise that means the 2nd file
system has data loss and even corruption. This is not something a file
system choice can solve. You need reliable power and reliable
shutdown. And you may also need a cluster file system like ceph or
glusterfs instead of depending on a single box to stay upright.



-- 
Chris Murphy


* Re: Power down tests...
  2017-08-04 12:09           ` Shyam Prasad N
@ 2017-08-07  2:25             ` Chris Murphy
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Murphy @ 2017-08-07  2:25 UTC (permalink / raw)
  To: Shyam Prasad N; +Cc: Dmitrii Tcvetkov, Adam Borowski, Btrfs BTRFS

On Fri, Aug 4, 2017 at 6:09 AM, Shyam Prasad N <nspmangalore@gmail.com> wrote:
> Thanks guys. I've enabled that option now. Let's see how it goes.
> One general question regarding the stability of btrfs in kernel
> version 4.4. Is this okay for power off test cases? Or are there many
> important fixes in newer kernels?


$ git log --grep=power tags/v4.4...tags/v4.12 -- fs/btrfs

The answer is yes, there are power-failure-related fixes since 4.4. I
can't tell you offhand to what degree they've been backported; you'd
have to do a search against whatever specific sub-version of 4.4
you're using.
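
For example, something like this, assuming a clone that has the stable
tags (v4.4.79 here is just a placeholder for whatever you're running):

$ git log --oneline --grep=power tags/v4.4...tags/v4.4.79 -- fs/btrfs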



-- 
Chris Murphy


* Re: Power down tests...
  2017-08-07  2:15 ` Chris Murphy
@ 2017-08-07  3:35   ` Adam Borowski
  2017-08-07  6:53   ` Shyam Prasad N
  1 sibling, 0 replies; 11+ messages in thread
From: Adam Borowski @ 2017-08-07  3:35 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Shyam Prasad N, Btrfs BTRFS

On Sun, Aug 06, 2017 at 08:15:45PM -0600, Chris Murphy wrote:
> On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N <nspmangalore@gmail.com> wrote:
> > We're running a couple of experiments on our servers with btrfs
> > (kernel version 4.4).
> > And we're running some abrupt power-off tests for a couple of scenarios:
> >
> > 1. We have a filesystem on top of two different btrfs filesystems
> > (distributed across N disks). i.e. Our filesystem lays out data and
> > metadata on top of these two filesystems.
> 
> This is astronomically more complicated than the already complicated
> scenario with one file system on a single normal partition of a well
> behaved (non-lying) single drive.
> 
> You have multiple devices, so any one or all of them could drop data
> during the power failure and in different amounts. In the best case
> scenario, at next mount the supers are checked on all the devices, and
> the lowest common denominator generation is found, and therefore the
> lowest common denominator root tree. No matter what it means some data
> is going to be lost.

That's exactly why we have CoW.  Unless at least one of the disks lies,
there's no way for data from a fully committed transaction to be lost.
Any writes after that are _supposed_ to be lost.

Reordering writes between disks is no different from reordering writes on a
single disk.  Even more so with NVMe where you have multiple parallel writes
on the same device, with multiple command queues.  You know the transaction
has hit the, uhm, platters, only once every device says so, and that's when
you can start writing the new superblock.
> 
> > The issue that we're facing is that a few files have been zero-sized.
> 
> I can't tell you if that's a bug or not because I'm not sure how your
> software creates these 16M backing files, if they're fallocated or
> touched or what. It's plausible they're created as zero length files,
> and the file system successfully creates them, and then data is written
> to them, but before there is either committed metadata or an updated
> super pointing to the new root tree you get a power failure. And in
> that case, I expect a zero length file or maybe some partial amount of
> data is there.

It's the so-called O_PONIES issue.  No filesystem can know whether you want
files written immediately (abysmal performance) or held in cache until later
(sacrificing durability).  The only portable interface to do so is
f{,data}sync: any write that hasn't been synced cannot be relied upon.
Some traditional filesystems have implicitly synced things, but all such
details are filesystem specific.

Btrfs in particular has -o flushoncommit, which instead of a fsync after
every single write gathers writes from the last 30 seconds and flushes them
as one transaction.

More generic interfaces have been proposed but none has been implemented
yet.  Heck, I'm playing with one such idea myself, although I'm not sure if
I know enough to ensure the semantics I have in mind.

> > As a result, there is either a data-loss, or inconsistency in the
> > stacked filesystem's metadata.
> 
> Sounds expected for any file system, but chances are there's more
> missing with a CoW file system since by nature it rolls back to the
> most recent sane checkpoint for the fs metadata without any regard to
> what data is lost to make that happen. The goal is to not lose the
> file system in such a case, as some amount of data is always going to
> happen

All it takes is to _somehow_ tell the filesystem you demand the same
guarantees for data as it already provides for metadata.  And a CoW
or log-based filesystem can actually deliver such a demand.

> and why power losses need to be avoided (UPS's and such).

A UPS can't protect you from a kernel crash, a motherboard running out
of smoke, a stick of memory going bad or unseated, a power supply
deciding it wants a break from delivering the juice (for redundant
power supplies, the thingy mediating power will do so), etc., etc.
There's no way around crash tolerance.

> The
> fact that you have a file system on top of a file system makes it more
> fragile because the 2nd file system's metadata *IS* data as far as the
> 1st file system is concerned. And that data is considered expendable.

Only because by default the underlying filesystem has been taught to
consider it expendable.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄⠀⠀⠀⠀ • use glitches to walk on water


* Re: Power down tests...
  2017-08-07  2:15 ` Chris Murphy
  2017-08-07  3:35   ` Adam Borowski
@ 2017-08-07  6:53   ` Shyam Prasad N
  1 sibling, 0 replies; 11+ messages in thread
From: Shyam Prasad N @ 2017-08-07  6:53 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Hi Chris,

Thanks for the detailed reply. :)
Read my answers inline:

On Mon, Aug 7, 2017 at 7:45 AM, Chris Murphy <lists@colorremedies.com> wrote:
>
> This is astronomically more complicated than the already complicated
> scenario with one file system on a single normal partition of a well
> behaved (non-lying) single drive.
>
> You have multiple devices, so any one or all of them could drop data
> during the power failure and in different amounts. In the best case
> scenario, at next mount the supers are checked on all the devices, and
> the lowest common denominator generation is found, and therefore the
> lowest common denominator root tree. No matter what it means some data
> is going to be lost.

True. This is something that we're experimenting with, since we can
use many btrfs features. Except for these power-off issues, we haven't
faced many other issues.

>
> Next there is a file system on top of a file system, I assume it's a
> file that's loopback mounted?
>

Not exactly loopback mounted. We are, however, distributing the data
and metadata across different btrfs files and reading them to present
a filesystem view to the client.

>
> I'd want to know why it fails. And then I'd check all the supers on
> all the devices  with 'btrfs inspect-internal dump-super -fa <dev>'.
>
> Are all the copies on a given device the same and valid? Are all the
> copies among all devices the same and valid? I'm expecting there will
> be discrepancies and then you have to figure out if the mount logic is
> really finding the right root to try to mount. I'm not sure if kernel
> code by default reports back in detail what logic its using and
> exactly where it fails, or if you just get the generic open_ctree
> mount failure message.
>
> And then it's an open question whether the supers need fixing, or
> whether the 'usebackuproot' mount option is the way to go. It might
> depend on the status of the supers how that logic ends up working.
> Again, it might be useful if there were debug info that explicitly
> shows the mount logic actually being used, dumped to kernel messages.
> I'm not sure if that code exists when CONFIG_BTRFS_DEBUG is enabled
> (as in, I haven't looked but I've thought it really could come in
> handy in some of the cases we see of mount failure where we can't tell
> where things are getting stuck with the existing reporting).
>

Unfortunately, we don't have this data now, since we've started a
fresh batch of similar tests with a couple of new mount options (-o
flushoncommit,recovery). If we hit the issue again, I'll share the
data here.

>
> I can't tell you if that's a bug or not because I'm not sure how your
> software creates these 16M backing files, if they're fallocated or
> touched or what. It's plausible they're created as zero length files,
> and the file system successfully creates them, and then data is written
> to them, but before there is either committed metadata or an updated
> super pointing to the new root tree you get a power failure. And in
> that case, I expect a zero length file or maybe some partial amount of
> data is there.
>

The files are first touched, then truncated to 16M, before being
written to. So it does make sense that, on recovery, we ended up with
zero-sized files. Btrfs could be showing us a consistent older
filesystem rather than an inconsistent newer one.
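
For the record, the creation sequence is roughly the following (file
name and mount point are placeholders); adding an explicit fsync at
the end, as mentioned below, is what we'd need for the data itself to
survive a power cut:

$ touch /mnt/btrfs1/chunk-0001.img
$ truncate -s 16M /mnt/btrfs1/chunk-0001.img
# write the payload and fsync it before treating it as durable
$ dd if=/dev/urandom of=/mnt/btrfs1/chunk-0001.img bs=1M count=16 conv=notrunc,fsync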

>
> Sounds expected for any file system, but chances are there's more
> missing with a CoW file system since by nature it rolls back to the
> most recent sane checkpoint for the fs metadata without any regard to
> what data is lost to make that happen. The goal is to not lose the
> file system in such a case, as some amount of data is always going to
> happen, and why power losses need to be avoided (UPS's and such). The
> fact that you have a file system on top of a file system makes it more
> fragile because the 2nd file system's metadata *IS* data as far as the
> 1st file system is concerned. And that data is considered expendable.
>

Yes, you're right. That is a downside when we stack one FS on top of
another. As long as we minimize the scope for seeing filesystem
inconsistencies, we should be okay, even if the data is slightly
older.
We were using ext4 for the same purpose with good results on power-off
and recovery. With flushoncommit, hopefully, we should see better
results on btrfs as well. Let's see.

>
> commit 5s might make the problem worse by requiring such constant
> flushing of dirty data that you're getting a bunch of disk contention,
> hard to say since there's no details about the workload at the time of
> the power failure. Changing nothing else but the commit= mount option,
> what difference do you see (with a scientific sample) if any between
> commit 5 and default commit 30 when it comes to the amount of data
> loss?

We're not choking the disk with the workload now, if that is what
you're asking. The disks can take a lot more load.

>
> Another thing we don't know is the application or service writing out
> these 16M backing files behavior when it comes to fsync or fdatasync
> or fadvise.

Yeah. That is something we've considered. Strictly speaking, we should
fsync the files in our test scripts.
However, in this one case of a zero-sized file, the stacked filesystem
says that the file should be non-zero-sized. So the I/O was not lost
in the client cache.

>
>
>
> --
> Chris Murphy



-- 
-Shyam


* Re: Power down tests...
  2017-08-07  2:22 ` Chris Murphy
@ 2017-08-07  7:38   ` Shyam Prasad N
  0 siblings, 0 replies; 11+ messages in thread
From: Shyam Prasad N @ 2017-08-07  7:38 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

Hi Chris,
Good points that you make.

We're making use of btrfs raid only (one of the reasons we want to
move to btrfs).
However, during this test, we haven't run with multi-disk btrfs raid.
We just have one disk. (This test setup doesn't have too many disks.)

We do have our metadata replicated as well. For data, we do have
regular async backups.
However, this is something that we noticed (somewhat more frequently)
while testing out btrfs as our data store, as compared to ext4.
We have tests going on with flushoncommit and recovery mount options
running on the same setup. Hopefully, we'll have near-ext4-like
behaviour with this w.r.t. power-off recovery.

Regards,
Shyam

On Mon, Aug 7, 2017 at 7:52 AM, Chris Murphy <lists@colorremedies.com> wrote:
> On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N <nspmangalore@gmail.com> wrote:
>> Hi all,
>>
>> We're running a couple of experiments on our servers with btrfs
>> (kernel version 4.4).
>> And we're running some abrupt power-off tests for a couple of scenarios:
>>
>> 1. We have a filesystem on top of two different btrfs filesystems
>> (distributed across N disks).
>
> What's the layout from physical devices all the way to your 16M file?
> This is hardware raid, lvm linear, Btrfs raid? All of that matters.
>
> Do the drives have write caching disabled? You might be better off
> with the drive write cache disabled, and then add bcache or dm-cache
> and an SSD to compensate. But that's just speculation on my part. The
> write cache in the drives is definitely volatile. And disabling them
> will definitely make writes slower. So, you might have slightly better
> luck with another layout.
>
> But the bottom line is, you need to figure out a way to avoid *any*
> data loss in your files because otherwise that means the 2nd file
> system has data loss and even corruption. This is not something a file
> system choice can solve. You need reliable power and reliable
> shutdown. And you may also need a cluster file system like ceph or
> glusterfs instead of depending on a single box to stay upright.
>
>
>
> --
> Chris Murphy



-- 
-Shyam

