* The FAQ on fsync/O_SYNC
@ 2015-04-19 13:20 Craig Ringer
  2015-04-19 14:28 ` Martin Steigerwald
  2015-04-20  3:29 ` Craig Ringer
  0 siblings, 2 replies; 17+ messages in thread
From: Craig Ringer @ 2015-04-19 13:20 UTC (permalink / raw)
  To: linux-btrfs

Hi all

I'm looking into the advisability of running PostgreSQL on BTRFS, and
after looking at the FAQ there's something I'm hoping you could
clarify.

The wiki FAQ says:

"Btrfs does not force all dirty data to disk on every fsync or O_SYNC
operation, fsync is designed to be fast."

Is that wording intended narrowly, to contrast with ext3's nasty habit
of flushing *all* dirty blocks for the entire file system whenever
anyone calls fsync() ? Or is it intended broadly, to say that btrfs's
fsync won't necessarily flush all data blocks (just metadata) ?

Is that statement still true in recent BTRFS versions (3.18, etc)?


PostgreSQL (and any other transactional database) absolutely requires
a system call that provides a hard guarantee that all dirty blocks for
a given file are on durable storage. For metadata operations that
matter to data integrity, it must be able to get the same guarantee
for the metadata too.

The documentation for fsync says that:

       fsync() transfers ("flushes") all modified in-core data of (i.e., modi‐
       fied  buffer cache pages for) the file referred to by the file descrip‐
       tor fd to the disk device (or other permanent storage device)  so  that
       all  changed information can be retrieved even after the system crashed
       or was rebooted.  This includes writing  through  or  flushing  a  disk
       cache  if  present.   The call blocks until the device reports that the
       transfer has completed.  It also flushes metadata  information  associ‐
       ated with the file (see stat(2)).


so I'm hoping that the FAQ writer was just comparing with ext3, and
that btrfs's fsync() fully flushes all dirty blocks and metadata for a
file or directory. (I haven't yet had a chance to test this on a
machine with slow flushes, or to do any plug-pull testing.)


Also on the FAQ:

https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_crash_guarantees_of_overwrite-by-rename.3F

it might be a good idea to recommend that applications really should
fsync() the directory if they want a crash safety guarantee, and that
doing so (hopefully?) won't flush dirty file blocks, just directory
metadata.
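
For concreteness, the overwrite-by-rename pattern I mean looks roughly
like this (a minimal C sketch; file names and error handling are
illustrative only):

    /* Crash-safe overwrite-by-rename: write a temporary file, fsync()
     * it, rename() it over the target, then fsync() the containing
     * directory so the rename itself is durable. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void die(const char *msg) { perror(msg); exit(1); }

    int main(void)
    {
        const char *data = "new file contents\n";

        int fd = open("data.txt.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) die("open tmp");
        if (write(fd, data, strlen(data)) != (ssize_t)strlen(data))
            die("write");
        if (fsync(fd) != 0) die("fsync file");  /* data + inode durable */
        if (close(fd) != 0) die("close");

        if (rename("data.txt.tmp", "data.txt") != 0) die("rename");

        /* fsync the directory so the rename is durable too */
        int dfd = open(".", O_RDONLY | O_DIRECTORY);
        if (dfd < 0) die("open dir");
        if (fsync(dfd) != 0) die("fsync dir");
        close(dfd);
        return 0;
    }

The open question is whether that final directory fsync() is both
necessary and cheap on btrfs.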

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


* Re: The FAQ on fsync/O_SYNC
  2015-04-19 13:20 The FAQ on fsync/O_SYNC Craig Ringer
@ 2015-04-19 14:28 ` Martin Steigerwald
  2015-04-19 14:31   ` Craig Ringer
  2015-04-20  3:29 ` Craig Ringer
  1 sibling, 1 reply; 17+ messages in thread
From: Martin Steigerwald @ 2015-04-19 14:28 UTC (permalink / raw)
  To: Craig Ringer; +Cc: linux-btrfs

On Sunday, 19 April 2015 at 21:20:11, Craig Ringer wrote:
> Hi all

Hi Craig,

> I'm looking into the advisability of running PostgreSQL on BTRFS, and
> after looking at the FAQ there's something I'm hoping you could
> clarify.
> 
> The wiki FAQ says:
> 
> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC
> operation, fsync is designed to be fast."
> 
> Is that wording intended narrowly, to contrast with ext3's nasty habit
> of flushing *all* dirty blocks for the entire file system whenever
> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's
> fsync won't necessarily flush all data blocks (just metadata) ?
> 
> Is that statement still true in recent BTRFS versions (3.18, etc)?

I don't know, so I'll leave that for others to answer. I always assumed 
a strong fsync() guarantee, as in "it's on disk", with BTRFS. So I am 
interested in that as well.

But for databases, did you consider the copy-on-write fragmentation BTRFS 
will cause? Even with autodefrag, AFAIK it is not recommended for large 
databases, at least on rotating media.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: The FAQ on fsync/O_SYNC
  2015-04-19 14:28 ` Martin Steigerwald
@ 2015-04-19 14:31   ` Craig Ringer
  2015-04-19 15:10     ` Martin Steigerwald
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Craig Ringer @ 2015-04-19 14:31 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs

On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> wrote:
> On Sunday, 19 April 2015 at 21:20:11, Craig Ringer wrote:
>> Hi all
>
> Hi Craig,
>
>> I'm looking into the advisability of running PostgreSQL on BTRFS, and
>> after looking at the FAQ there's something I'm hoping you could
>> clarify.
>>
>> The wiki FAQ says:
>>
>> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC
>> operation, fsync is designed to be fast."
>>
>> Is that wording intended narrowly, to contrast with ext3's nasty habit
>> of flushing *all* dirty blocks for the entire file system whenever
>> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's
>> fsync won't necessarily flush all data blocks (just metadata) ?
>>
>> Is that statement still true in recent BTRFS versions (3.18, etc)?
>
> I don't know, so I'll leave that for others to answer. I always assumed
> a strong fsync() guarantee, as in "it's on disk", with BTRFS. So I am
> interested in that as well.
>
> But for databases, did you consider the copy-on-write fragmentation BTRFS
> will cause? Even with autodefrag, AFAIK it is not recommended for large
> databases, at least on rotating media.

I did, and any testing would need to look at the efficacy of the
chattr +C option on the database directory tree.

PostgreSQL is itself copy-on-write (because of multi-version
concurrency control), so it doesn't make much sense to have the FS
doing another layer of COW.

I'm curious as to whether +C has any effect on BTRFS's durability, too.
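
One thing worth noting for any such test: as far as I know, +C only
takes effect on files that are empty when the flag is set (or that
inherit it from a +C directory), so it has to be applied before the
database writes any data. A minimal C sketch of setting the flag
programmatically, assuming <linux/fs.h> exposes FS_NOCOW_FL as recent
kernels do:

    /* Set the btrfs no-CoW attribute (what chattr +C sets) on a file.
     * The flag must be applied while the file is still empty, so it is
     * set here immediately after O_CREAT, before any data is written.
     * The file name is illustrative only. */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("tablefile", O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0) { perror("open"); return 1; }

        int attr = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &attr) != 0) { perror("GETFLAGS"); return 1; }
        attr |= FS_NOCOW_FL;                /* the bit behind chattr +C */
        if (ioctl(fd, FS_IOC_SETFLAGS, &attr) != 0) { perror("SETFLAGS"); return 1; }

        /* ... data written now will be updated in place, not CoWed ... */
        close(fd);
        return 0;
    }

In practice, setting +C on the data directory itself so that new files
inherit it is probably the more convenient route.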

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


* Re: The FAQ on fsync/O_SYNC
  2015-04-19 14:31   ` Craig Ringer
@ 2015-04-19 15:10     ` Martin Steigerwald
  2015-04-19 15:18       ` Hugo Mills
  2015-04-19 15:28     ` Russell Coker
  2015-04-20  4:27     ` Zygo Blaxell
  2 siblings, 1 reply; 17+ messages in thread
From: Martin Steigerwald @ 2015-04-19 15:10 UTC (permalink / raw)
  To: Craig Ringer; +Cc: linux-btrfs

On Sunday, 19 April 2015 at 22:31:02, Craig Ringer wrote:
> On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> wrote:
> > On Sunday, 19 April 2015 at 21:20:11, Craig Ringer wrote:
> >> Hi all
> > 
> > Hi Craig,
> > 
> >> I'm looking into the advisability of running PostgreSQL on BTRFS, and
> >> after looking at the FAQ there's something I'm hoping you could
> >> clarify.
> >> 
> >> The wiki FAQ says:
> >> 
> >> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC
> >> operation, fsync is designed to be fast."
> >> 
> >> Is that wording intended narrowly, to contrast with ext3's nasty
> >> habit
> >> of flushing *all* dirty blocks for the entire file system whenever
> >> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's
> >> fsync won't necessarily flush all data blocks (just metadata) ?
> >> 
> >> Is that statement still true in recent BTRFS versions (3.18, etc)?
> > 
> > I don't know, so I'll leave that for others to answer. I always assumed
> > a strong fsync() guarantee, as in "it's on disk", with BTRFS. So I am
> > interested in that as well.
> > 
> > But for databases, did you consider the copy-on-write fragmentation
> > BTRFS will cause? Even with autodefrag, AFAIK it is not recommended
> > for large databases, at least on rotating media.
> 
> I did, and any testing would need to look at the efficacy of the
> chattr +C option on the database directory tree.
> 
> PostgreSQL is itself copy-on-write (because of multi-version
> concurrency control), so it doesn't make much sense to have the FS
> doing another layer of COW.
> 
> I'm curious as to whether +C has any effect on BTRFS's durability, too.

You will lose the ability to snapshot that directory tree then.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


* Re: The FAQ on fsync/O_SYNC
  2015-04-19 15:10     ` Martin Steigerwald
@ 2015-04-19 15:18       ` Hugo Mills
  2015-04-19 17:50         ` Martin Steigerwald
  0 siblings, 1 reply; 17+ messages in thread
From: Hugo Mills @ 2015-04-19 15:18 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Craig Ringer, linux-btrfs


On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote:
> On Sunday, 19 April 2015 at 22:31:02, Craig Ringer wrote:
> > On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> wrote:
> > > On Sunday, 19 April 2015 at 21:20:11, Craig Ringer wrote:
> > >> Hi all
> > > 
> > > Hi Craig,
> > > 
> > >> I'm looking into the advisability of running PostgreSQL on BTRFS, and
> > >> after looking at the FAQ there's something I'm hoping you could
> > >> clarify.
> > >> 
> > >> The wiki FAQ says:
> > >> 
> > >> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC
> > >> operation, fsync is designed to be fast."
> > >> 
> > >> Is that wording intended narrowly, to contrast with ext3's nasty
> > >> habit
> > >> of flushing *all* dirty blocks for the entire file system whenever
> > >> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's
> > >> fsync won't necessarily flush all data blocks (just metadata) ?
> > >> 
> > >> Is that statement still true in recent BTRFS versions (3.18, etc)?
> > > 
> > > I don't know, so I'll leave that for others to answer. I always assumed
> > > a strong fsync() guarantee, as in "it's on disk", with BTRFS. So I am
> > > interested in that as well.
> > > 
> > > But for databases, did you consider the copy-on-write fragmentation
> > > BTRFS will cause? Even with autodefrag, AFAIK it is not recommended
> > > for large databases, at least on rotating media.
> > 
> > I did, and any testing would need to look at the efficacy of the
> > chattr +C option on the database directory tree.
> > 
> > > PostgreSQL is itself copy-on-write (because of multi-version
> > concurrency control), so it doesn't make much sense to have the FS
> > doing another layer of COW.
> > 
> > I'm curious as to whether +C has any effect on BTRFS's durability, too.
> 
> You will lose the ability to snapshot that directory tree then.

   No you won't.

   The +C attribute still allows snapshotting and reflink copies.
However, after the snapshot, writes to either copy will result in that
copy being CoWed. (Specifically, writes to an extent of a +C file with
more than one reference to the extent will result in a CoW operation,
until there is only one reference, and then the writes will not be
CoWed again).

   The practical upshot of this is that every snapshot of, and
subsequent writes to, a +C file will introduce fragmentation in the
same way that writes to a non-+C file would.

   You also have a disadvantage with +C that you lose the checksumming
features of the FS, and hence the self-healing properties if you're
running with btrfs-native RAID.

   Hugo.

-- 
Hugo Mills             | Nothing right in my left brain. Nothing left in my
hugo@... carfax.org.uk | right brain.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |



* Re: The FAQ on fsync/O_SYNC
  2015-04-19 14:31   ` Craig Ringer
  2015-04-19 15:10     ` Martin Steigerwald
@ 2015-04-19 15:28     ` Russell Coker
  2015-04-20  4:27     ` Zygo Blaxell
  2 siblings, 0 replies; 17+ messages in thread
From: Russell Coker @ 2015-04-19 15:28 UTC (permalink / raw)
  To: Craig Ringer; +Cc: Martin Steigerwald, linux-btrfs

On Mon, 20 Apr 2015, Craig Ringer <craig@2ndquadrant.com> wrote:
> PostgreSQL is itself copy-on-write (because of multi-version
> concurrency control), so it doesn't make much sense to have the FS
> doing another layer of COW.

That's a matter of opinion.

I think it's great if PostgreSQL can do internal checksums and error 
correction.  But I'd rather not have to test that functionality in the field.

Really I prefer to have the ZFS copies= option for databases.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: The FAQ on fsync/O_SYNC
  2015-04-19 15:18       ` Hugo Mills
@ 2015-04-19 17:50         ` Martin Steigerwald
  2015-04-19 18:18           ` Hugo Mills
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Steigerwald @ 2015-04-19 17:50 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Craig Ringer, linux-btrfs


On Sunday, 19 April 2015 at 15:18:51, Hugo Mills wrote:
> On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote:
> > On Sunday, 19 April 2015 at 22:31:02, Craig Ringer wrote:
> > > On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> wrote:
> > > > On Sunday, 19 April 2015 at 21:20:11, Craig Ringer wrote:
> > > >> Hi all
> > > > 
> > > > Hi Craig,
> > > > 
> > > >> I'm looking into the advisability of running PostgreSQL on BTRFS,
> > > >> and
> > > >> after looking at the FAQ there's something I'm hoping you could
> > > >> clarify.
> > > >> 
> > > >> The wiki FAQ says:
> > > >> 
> > > >> "Btrfs does not force all dirty data to disk on every fsync or
> > > >> O_SYNC
> > > >> operation, fsync is designed to be fast."
> > > >> 
> > > >> Is that wording intended narrowly, to contrast with ext3's nasty
> > > >> habit
> > > >> of flushing *all* dirty blocks for the entire file system
> > > >> whenever
> > > >> anyone calls fsync() ? Or is it intended broadly, to say that
> > > >> btrfs's
> > > >> fsync won't necessarily flush all data blocks (just metadata) ?
> > > >> 
> > > >> Is that statement still true in recent BTRFS versions (3.18,
> > > >> etc)?
> > > > 
> > > > I don't know, so I'll leave that for others to answer. I always
> > > > assumed a strong fsync() guarantee, as in "it's on disk", with
> > > > BTRFS. So I am interested in that as well.
> > > > 
> > > > But for databases, did you consider the copy-on-write fragmentation
> > > > BTRFS will cause? Even with autodefrag, AFAIK it is not recommended
> > > > for large databases, at least on rotating media.
> > > 
> > > I did, and any testing would need to look at the efficacy of the
> > > chattr +C option on the database directory tree.
> > > 
> > > PostgreSQL is itself copy-on-write (because of multi-version
> > > concurrency control), so it doesn't make much sense to have the FS
> > > doing another layer of COW.
> > > 
> > > I'm curious as to whether +C has any effect on BTRFS's durability,
> > > too.
> > 
> > You will lose the ability to snapshot that directory tree then.
> 
>    No you won't.
> 
>    The +C attribute still allows snapshotting and reflink copies.
> However, after the snapshot, writes to either copy will result in that
> copy being CoWed. (Specifically, writes to an extent of a +C file with
> more than one reference to the extent will result in a CoW operation,
> until there is only one reference, and then the writes will not be
> CoWed again).
> 
>    The practical upshot of this is that every snapshot of, and
> subsequent writes to, a +C file will introduce fragmentation in the
> same way that writes to a non-+C file would.
> 
>    You also have a disadvantage with +C that you lose the checksumming
> features of the FS, and hence the self-healing properties if you're
> running with btrfs-native RAID.

Thanks for clarifying this, Hugo, so chattr +C will make the directory 
cowed again.

And there is no checksumming on the FS at all anymore. Why is the latter? 
Why can't BTRFS checksum nocowed objects, or at least the cowed ones, in 
the same FS? Because of atomicity guarantees?

If this has been answered before and I missed it, feel free to point me 
to it; I didn't find anything obvious with my quick search.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



* Re: The FAQ on fsync/O_SYNC
  2015-04-19 17:50         ` Martin Steigerwald
@ 2015-04-19 18:18           ` Hugo Mills
  2015-04-19 18:41             ` Martin Steigerwald
  0 siblings, 1 reply; 17+ messages in thread
From: Hugo Mills @ 2015-04-19 18:18 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Craig Ringer, linux-btrfs


On Sun, Apr 19, 2015 at 07:50:32PM +0200, Martin Steigerwald wrote:
> On Sunday, 19 April 2015 at 15:18:51, Hugo Mills wrote:
> > On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote:
> > > On Sunday, 19 April 2015 at 22:31:02, Craig Ringer wrote:
> > > > I'm curious as to whether +C has any effect on BTRFS's durability,
> > > > too.
> > > 
> > > You will lose the ability to snapshot that directory tree then.
> > 
> >    No you won't.
> > 
> >    The +C attribute still allows snapshotting and reflink copies.
> > However, after the snapshot, writes to either copy will result in that
> > copy being CoWed. (Specifically, writes to an extent of a +C file with
> > more than one reference to the extent will result in a CoW operation,
> > until there is only one reference, and then the writes will not be
> > CoWed again).
> > 
> >    The practical upshot of this is that every snapshot of, and
> > subsequent writes to, a +C file will introduce fragmentation in the
> > same way that writes to a non-+C file would.
> > 
> >    You also have a disadvantage with +C that you lose the checksumming
> > features of the FS, and hence the self-healing properties if you're
> > running with btrfs-native RAID.
> 
> Thanks for clarifying this, Hugo, so chattr +C will make the directory 
> cowed again.

   Not quite sure what you mean there.

   If you set +C on a file or directory, there's no CoW operations on
write to any of the affected files, *except* if there's a snapshot, in
which case the file being written to will have *one* CoW operation
before reverting to nodatacow again.

> And there is no checksumming on the FS at all anymore. Why is the latter? 
> Why can't BTRFS checksum nocowed objects, or at least the cowed ones, in 
> the same FS? Because of atomicity guarantees?

   Atomicity, indeed. You need to be able to update the checksum and
the new data atomically. This is possible when the data can be CoWed,
but if the data is being modified in place, there must be a short
period of time when the two parts are out of sync.
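
   To make that window concrete, an illustrative sketch (plain C, not
btrfs code; the on-disk layout and the checksum are invented for the
example) of an in-place update where the data write and the checksum
write cannot happen in one atomic step:

    /* Toy in-place update: data block at offset 0, its checksum stored
     * separately at offset 4096.  There is no way to update both with
     * a single atomic write. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static uint32_t toy_csum(const char *b, size_t n)  /* not btrfs's crc32c */
    {
        uint32_t c = 0;
        while (n--) c = c * 31 + (uint8_t)*b++;
        return c;
    }

    int main(void)
    {
        int fd = open("toy.img", O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        char block[4096] = "new data";
        uint32_t csum = toy_csum(block, sizeof block);

        pwrite(fd, block, sizeof block, 0);   /* data updated in place... */
        /* ...a crash HERE leaves new data beside the old checksum, so a
         * later read reports a mismatch even though the data is valid */
        pwrite(fd, &csum, sizeof csum, 4096); /* ...checksum updated after */

        close(fd);
        return 0;
    }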

   Just to make it clear, the lack of checksums is *only* for those
files which are marked +C. The rest of the FS is unaffected.

> If this has been answered before and I missed it, feel free to point me 
> to it; I didn't find anything obvious with my quick search.

   It's certainly knowledge that's out there (it's been discussed at
length on IRC, for example), but I don't know if there's anything
written up on the wiki.

   Hugo.

-- 
Hugo Mills             | Jenkins! Chap with the wings there! Five rounds
hugo@... carfax.org.uk | rapid!
http://carfax.org.uk/  |                 Brigadier Alistair Lethbridge-Stewart
PGP: E2AB1DE4          |                                Dr Who and the Daemons



* Re: The FAQ on fsync/O_SYNC
  2015-04-19 18:18           ` Hugo Mills
@ 2015-04-19 18:41             ` Martin Steigerwald
  2015-04-19 18:51               ` Hugo Mills
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Steigerwald @ 2015-04-19 18:41 UTC (permalink / raw)
  To: Hugo Mills; +Cc: Craig Ringer, linux-btrfs


On Sunday, 19 April 2015 at 18:18:24, Hugo Mills wrote:
> On Sun, Apr 19, 2015 at 07:50:32PM +0200, Martin Steigerwald wrote:
> > On Sunday, 19 April 2015 at 15:18:51, Hugo Mills wrote:
> > > On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote:
> > > > On Sunday, 19 April 2015 at 22:31:02, Craig Ringer wrote:
> > > > > I'm curious as to whether +C has any effect on BTRFS's
> > > > > durability,
> > > > > too.
> > > > 
> > > > You will lose the ability to snapshot that directory tree then.
> > > > 
> > >    No you won't.
> > >    
> > >    The +C attribute still allows snapshotting and reflink copies.
> > > 
> > > However, after the snapshot, writes to either copy will result in
> > > that
> > > copy being CoWed. (Specifically, writes to an extent of a +C file
> > > with
> > > more than one reference to the extent will result in a CoW
> > > operation,
> > > until there is only one reference, and then the writes will not be
> > > CoWed again).
> > > 
> > >    The practical upshot of this is that every snapshot of, and
> > > 
> > > subsequent writes to, a +C file will introduce fragmentation in the
> > > same way that writes to a non-+C file would.
> > > 
> > >    You also have a disadvantage with +C that you lose the
> > >    checksumming
> > > 
> > > features of the FS, and hence the self-healing properties if you're
> > > running with btrfs-native RAID.
> > 
> > Thanks for clarifying this, Hugo, so chattr +C will make the directory
> > cowed again.
> 
>    Not quite sure what you mean there.
> 
>    If you set +C on a file or directory, there's no CoW operations on
> write to any of the affected files, *except* if there's a snapshot, in
> which case the file being written to will have *one* CoW operation
> before reverting to nodatacow again.

What do you mean by *one* CoW operation? Will BTRFS duplicate the whole 
file to keep it no-cowed? Or, hmmm, I think now I get it: there will be one 
CoW operation for each write (well, with some granularity, per extent?), but 
*just* one, because then the written data is cowed away from the snapshot 
and can then be written again in a no-cowed way.

> > And there is no checksumming on the FS at all anymore. Why is the
> > latter? Why can't BTRFS checksum nocowed objects, or at least the cowed
> > ones, in the same FS? Because of atomicity guarantees?
> 
>    Atomicity, indeed. You need to be able to update the checksum and
> the new data atomically. This is possible when the data can be CoWed,
> but if the data is being modified in place, there must be a short
> period of time when the two parts are out of sync.

And would it be too much effort, or too much of a performance penalty, to 
make any checksum check wait until they are in sync again?

>    Just to make it clear, the lack of checksums is *only* for those
> files which are marked +C. The rest of the FS is unaffected.

Thanks for that clarification. I read your original wording as if the whole 
FS was affected.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7



* Re: The FAQ on fsync/O_SYNC
  2015-04-19 18:41             ` Martin Steigerwald
@ 2015-04-19 18:51               ` Hugo Mills
  0 siblings, 0 replies; 17+ messages in thread
From: Hugo Mills @ 2015-04-19 18:51 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Craig Ringer, linux-btrfs


On Sun, Apr 19, 2015 at 08:41:39PM +0200, Martin Steigerwald wrote:
> On Sunday, 19 April 2015 at 18:18:24, Hugo Mills wrote:
> > On Sun, Apr 19, 2015 at 07:50:32PM +0200, Martin Steigerwald wrote:
> > > On Sunday, 19 April 2015 at 15:18:51, Hugo Mills wrote:
> > > > On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote:
> > > > > On Sunday, 19 April 2015 at 22:31:02, Craig Ringer wrote:
> > > > > > I'm curious as to whether +C has any effect on BTRFS's
> > > > > > durability,
> > > > > > too.
> > > > > 
> > > > > You will lose the ability to snapshot that directory tree then.
> > > > > 
> > > >    No you won't.
> > > >    
> > > >    The +C attribute still allows snapshotting and reflink copies.
> > > > 
> > > > However, after the snapshot, writes to either copy will result in
> > > > that
> > > > copy being CoWed. (Specifically, writes to an extent of a +C file
> > > > with
> > > > more than one reference to the extent will result in a CoW
> > > > operation,
> > > > until there is only one reference, and then the writes will not be
> > > > CoWed again).
> > > > 
> > > >    The practical upshot of this is that every snapshot of, and
> > > > 
> > > > subsequent writes to, a +C file will introduce fragmentation in the
> > > > same way that writes to a non-+C file would.
> > > > 
> > > >    You also have a disadvantage with +C that you lose the
> > > >    checksumming
> > > > 
> > > > features of the FS, and hence the self-healing properties if you're
> > > > running with btrfs-native RAID.
> > > 
> > > Thanks for clarifying this, Hugo, so chattr +C will make the directory
> > > cowed again.
> > 
> >    Not quite sure what you mean there.
> > 
> >    If you set +C on a file or directory, there's no CoW operations on
> > write to any of the affected files, *except* if there's a snapshot, in
> > which case the file being written to will have *one* CoW operation
> > before reverting to nodatacow again.
> 
> What do you mean by *one* CoW operation? Will BTRFS duplicate the whole 
> file to keep it no-cowed? Or, hmmm, I think now I get it: there will be one 
> CoW operation for each write (well, with some granularity, per extent?), but 
> *just* one, because then the written data is cowed away from the snapshot 
> and can then be written again in a no-cowed way.

   Correct.

   Granularity -- the storage of data is on a block basis (4k), but
extent size goes down to individual bytes. I think this means that if
you read and then write a single byte and then sync, the block is read
into the page cache, the byte is modified, the block is written out
(elsewhere, because it's CoW), and then the extents are updated to
reference the one modified byte from the new block. I'm not 100% sure
on that, though.

> > > And there is not checksumming on the FS at all anymore. Why is the
> > > later? Why can´t BTRFS checkum nocowed objects or at least the cowed
> > > ones in the same FS? Cause of atomicity guarentees?
> > 
> >    Atomicity, indeed. You need to be able to update the checksum and
> > the new data atomically. This is possible when the data can be CoWed,
> > but if the data is being modified in place, there must be a short
> > period of time when the two parts are out of sync.
> 
> And would it be too much effort, or too much of a performance penalty, to 
> make any checksum check wait until they are in sync again?

   It's more that if the system crashes or suffers a power outage in
that time window, you've got a mismatch that shows EIO even though the
data is valid.

> >    Just to make it clear, the lack of checksums is *only* for those
> > files which are marked +C. The rest of the FS is unaffected.
> 
> Thanks for that clarification. I read your original wording as if the whole 
> FS was affected.

   Only if the FS is mounted with nodatacow. :)

   Hugo.

-- 
Hugo Mills             | Jenkins! Chap with the wings there! Five rounds
hugo@... carfax.org.uk | rapid!
http://carfax.org.uk/  |                 Brigadier Alistair Lethbridge-Stewart
PGP: E2AB1DE4          |                                Dr Who and the Daemons



* Re: The FAQ on fsync/O_SYNC
  2015-04-19 13:20 The FAQ on fsync/O_SYNC Craig Ringer
  2015-04-19 14:28 ` Martin Steigerwald
@ 2015-04-20  3:29 ` Craig Ringer
  1 sibling, 0 replies; 17+ messages in thread
From: Craig Ringer @ 2015-04-20  3:29 UTC (permalink / raw)
  To: linux-btrfs

While the discussion on -C was interesting, I'm really interested in
btrfs's fsync() behaviour, per the original post:

On 19 April 2015 at 21:20, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> I'm looking into the advisability of running PostgreSQL on BTRFS, and
> after looking at the FAQ there's something I'm hoping you could
> clarify.
>
> The wiki FAQ says:
>
> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC
> operation, fsync is designed to be fast."
>
> Is that wording intended narrowly, to contrast with ext3's nasty habit
> of flushing *all* dirty blocks for the entire file system whenever
> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's
> fsync won't necessarily flush all data blocks (just metadata) ?
>
> Is that statement still true in recent BTRFS versions (3.18, etc)?
>
>
> PostgreSQL (and any other transactional database) absolutely requires
> a system call that provides a hard guarantee that all dirty blocks for
> a given file are on durable storage. For metadata operations that
> matter to data integrity, it must be able to get the same guarantee
> for the metadata too.
>
> The documentation for fsync says that:
>
>        fsync() transfers ("flushes") all modified in-core data of (i.e., modi‐
>        fied  buffer cache pages for) the file referred to by the file descrip‐
>        tor fd to the disk device (or other permanent storage device)  so  that
>        all  changed information can be retrieved even after the system crashed
>        or was rebooted.  This includes writing  through  or  flushing  a  disk
>        cache  if  present.   The call blocks until the device reports that the
>        transfer has completed.  It also flushes metadata  information  associ‐
>        ated with the file (see stat(2)).
>
>
> so I'm hoping that the FAQ writer was just comparing with ext3, and
> that btrfs's fsync() fully flushes all dirty blocks and metadata for a
> file or directory. (I haven't yet had a chance to test this on a
> machine with slow flushes, or to do any plug-pull testing.)
>
>
> Also on the FAQ:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_crash_guarantees_of_overwrite-by-rename.3F
>
> it might be a good idea to recommend that applications really should
> fsync() the directory if they want a crash safety guarantee, and that
> doing so (hopefully?) won't flush dirty file blocks, just directory
> metadata.
>
> --
>  Craig Ringer                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


* Re: The FAQ on fsync/O_SYNC
  2015-04-19 14:31   ` Craig Ringer
  2015-04-19 15:10     ` Martin Steigerwald
  2015-04-19 15:28     ` Russell Coker
@ 2015-04-20  4:27     ` Zygo Blaxell
  2015-04-20  6:07       ` Duncan
                         ` (2 more replies)
  2 siblings, 3 replies; 17+ messages in thread
From: Zygo Blaxell @ 2015-04-20  4:27 UTC (permalink / raw)
  To: Craig Ringer; +Cc: Martin Steigerwald, linux-btrfs


On Sun, Apr 19, 2015 at 10:31:02PM +0800, Craig Ringer wrote:
> On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> wrote:
> > On Sunday, 19 April 2015 at 21:20:11, Craig Ringer wrote:
> >> Hi all
> >
> > Hi Craig,
> >
> >> I'm looking into the advisability of running PostgreSQL on BTRFS, and
> >> after looking at the FAQ there's something I'm hoping you could
> >> clarify.
> >>
> >> The wiki FAQ says:
> >>
> >> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC
> >> operation, fsync is designed to be fast."
> >>
> >> Is that wording intended narrowly, to contrast with ext3's nasty habit
> >> of flushing *all* dirty blocks for the entire file system whenever
> >> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's
> >> fsync won't necessarily flush all data blocks (just metadata) ?

Normal writes to btrfs filesystems using the versioned filesystem tree are
consistent(ish), atomic, and durable; however, they have high latency as
the filesystem normally delays commit until triggered by a periodic timer
(or sync()--not fsync), then writes all outstanding dirty pages in memory.

btrfs handles fsync separately from the main versioned filesystem tree in
order to decrease the latency of fsync operations.  There is a 'log tree'
which behaves like a journal and contains data flushed with fsync() since
the last fully committed btrfs root.  After a crash, assuming no bugs,
the log is replayed over the last committed version of the filesystem
tree to implement fsync durability.

Unfortunately, in my experience, the log tree's most noticeable effect
at the moment seems to be to add a crapton of special-case code paths,
many of which do contain bugs, which are being fixed one at a time by
btrfs developers.  :-/

> >> Is that statement still true in recent BTRFS versions (3.18, etc)?

3.18 was released 133 days ago.  It has only been 49 days since the last
commit that fixes a btrfs data loss bug involving fsync (3a8b36f on Mar 1,
appearing in mainline as of v4.0-rc3), and 27 days since a commit that
fixes a problem involving fsync and discard (dcc82f4 on Mar 23, queued
for v4.1).

There has been a stream of fsync fixes in the past year, but it would
be naive to believe that there are not still more bugs to be found given
the frequency and recentness of fixes.

> > I don't know, so I'll leave that for others to answer. I always assumed
> > a strong fsync() guarantee, as in "it's on disk", with BTRFS. So I am
> > interested in that as well.

That's the intention; however, btrfs is not there yet.

It has been only 28 days since I last detected corrupted data on a
btrfs instance:  after a crash and log tree replay, extents from the
*beginning* of several files written just before the crash were missing,
but the *ends* of the files were present and correct.

There are also cases where btrfs cannot read data that *is* on disk.
I encounter that bug *every* day on some test systems, but can't yet
reproduce it with less than a TB of data and heavy workloads.  :-P

> > But for databases, did you consider the copy-on-write fragmentation BTRFS
> > will cause? Even with autodefrag, AFAIK it is not recommended for large
> > databases, at least on rotating media.
> 
> I did, and any testing would need to look at the efficacy of the
> chattr +C option on the database directory tree.
> 
> PostgreSQL is itself copy-on-write (because of multi-version
> concurrency control), so it doesn't make much sense to have the FS
> doing another layer of COW.

I noticed that redundancy and ended up picking btrfs over PostgreSQL.

I disable fsync in PostgreSQL (as well as a half-dozen assorted
applications that use sqlite, or just seem to like calling fsync
often), turn off full-page-writes on the journal--and also clear the
btrfs log tree before every mount.  A database can happily rely on
btrfs to preserve write ordering as long as all of its data is in one
filesystem(*) and btrfs never gets to replay its log tree (i.e. using
only the every-30-seconds global filesystem commit).  The database is
only able to offer async_commit mode when there is no fsync, but my
applications all want async_commit for performance reasons anyway.
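
For reference, a sketch of the postgresql.conf settings this
corresponds to (the parameter names are stock PostgreSQL; whether this
is safe depends entirely on the single-filesystem write-ordering
assumption above):

    # Sketch only -- safe ONLY under the btrfs write-ordering assumption
    # described above; on most filesystems fsync = off risks corruption.
    fsync = off                # never call fsync(); rely on btrfs commits
    full_page_writes = off     # a CoW filesystem doesn't tear pages
    synchronous_commit = off   # commits become durable at the next FS commit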

Disabling fsync from PostgreSQL avoids the bugs in the btrfs
implementation of fsync and the log tree.  With fsync + log tree, I
was rebuilding corrupted PostgreSQL databases from backups after almost
every reboot, and sometimes even more often than that.

I stopped testing PostgreSQL with fsync 277 days ago, and I have
PostgreSQL instances running since then without fsync that are 117 days
old...so that configuration seems as stable as anything else in btrfs.

Note that 117 days ago this btrfs instance corrupted itself beyond repair
(garbage tree node pointers with correct checksums!) and the entire
filesystem had to be mkfs'ed and rebuilt from backup.

For reference, my PostgreSQL workload is a nearly continuous stream of
transactions modifying 10K-15K pages per commit (80-120 MB random writes,
plus indexes).

> I'm curious as to whether +C has any effect on BTRFS's durability, too.

I would expect it to be strictly equal to or worse than the CoW
durability.  It would have all the same general filesystem bugs as btrfs,
plus extra bugs that are specific to the no-CoW btrfs code paths, and
you lose write ordering and btrfs data integrity and repair capabilities,
and you have to enable fsync and log tree replay and dodge the bugs.


> -- 
>  Craig Ringer                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

(*) or maybe subvol.  I haven't tested a multi-subvol-single-filesystem
btrfs, but I don't see much real-world advantage in configuring that way.



* Re: The FAQ on fsync/O_SYNC
  2015-04-20  4:27     ` Zygo Blaxell
@ 2015-04-20  6:07       ` Duncan
  2015-04-21  1:31         ` Zygo Blaxell
  2015-04-20  8:13       ` Gian-Carlo Pascutto
  2015-04-21 19:07       ` Chris Murphy
  2 siblings, 1 reply; 17+ messages in thread
From: Duncan @ 2015-04-20  6:07 UTC (permalink / raw)
  To: linux-btrfs

Zygo Blaxell posted on Mon, 20 Apr 2015 00:27:31 -0400 as excerpted:

> Normal writes to btrfs filesystems using the versioned filesystem tree
> are consistent(ish), atomic, and durable; however, they have high
> latency as the filesystem normally delays commit until triggered by a
> periodic timer (or sync()--not fsync), then writes all outstanding dirty
> pages in memory.
> 
> btrfs handles fsync separately from the main versioned filesystem tree
> in order to decrease the latency of fsync operations.  There is a 'log
> tree' which behaves like a journal and contains data flushed with
> fsync() since the last fully committed btrfs root.  After a crash,
> assuming no bugs, the log is replayed over the last committed version of
> the filesystem tree to implement fsync durability.
> 
> Unfortunately, in my experience, the log tree's most noticeable effect
> at the moment seems to be to add a crapton of special-case code paths,
> many of which do contain bugs, which are being fixed one at a time by
> btrfs developers.  :-/

Thanks, Zygo.  That's the clearest explanation I've seen on why the 
supposedly atomic-commit btrfs still has a log, and what it actually 
does.  I wasn't entirely clear on that, myself.

Meanwhile, yes, log-replay bugs do seem to be one of the sore spots ATM.  
I'm glad it's getting some focus now.  It needed it.

>> >> Is that statement still true in recent BTRFS versions (3.18, etc)?
> 
> 3.18 was released 133 days ago.  It has only been 49 days since the last
> commit that fixes a btrfs data loss bug involving fsync (3a8b36f on Mar
> 1, appearing in mainline as of v4.0-rc3), and 27 days since a commit
> that fixes a problem involving fsync and discard (dcc82f4 on Mar 23,
> queued for v4.1).
> 
> There has been a stream of fsync fixes in the past year, but it would be
> naive to believe that there are not still more bugs to be found given
> the frequency and recentness of fixes.

Telling commentary on what is "recent" in btrfs context, vs. what is 
"recent" in many distros' context, particularly in "enterprise" distro 
context. =8^0

4.0 is out.  There's a reason people may want to stay one version back by 
default, on 3.19 currently: it can take a few weeks for early reports to 
develop into a coherent problem, and staying one stable series back allows 
for that, and for deciding exactly when one is comfortable upgrading.  But 
in a btrfs context anyway, with 4.0 out, if you're not on at least 3.19 
yet, you should be able to point to the bug explaining /why/.  If you 
can't, arguably, you should either be upgrading yesterday if not sooner, 
or you really should choose some other filesystem, as btrfs simply isn't 
at the stability required for your use-case yet, and you unnecessarily 
risk data loss to already found and fixed bugs as a result.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: The FAQ on fsync/O_SYNC
  2015-04-20  4:27     ` Zygo Blaxell
  2015-04-20  6:07       ` Duncan
@ 2015-04-20  8:13       ` Gian-Carlo Pascutto
  2015-04-20 15:19         ` Zygo Blaxell
  2015-04-21 19:07       ` Chris Murphy
  2 siblings, 1 reply; 17+ messages in thread
From: Gian-Carlo Pascutto @ 2015-04-20  8:13 UTC (permalink / raw)
  To: linux-btrfs

On 20-04-15 06:27, Zygo Blaxell wrote:

>> I'm curious as to whether +C has any effect on BTRFS's durability, too.
> 
> I would expect it to be strictly equal to or worse than the CoW
> durability.

In addition to the stuff pointed out, I've wondered about this:
PostgreSQL full_page_writes copies 8k pages in order to prevent
corruption from partial writes. But btrfs has 16k pages by default, so a
corrupted FS page would corrupt more data than PostgreSQL protects.

Maybe it's not an issue if the underlying HW has 512b/4k sectors, or
maybe I'm misunderstanding what the respective features assume, but
unless informed to the contrary I wouldn't be entirely comfortable with
this.

(With CoW enabled, you don't have partial writes, so the point is moot)

-- 
GCP



* Re: The FAQ on fsync/O_SYNC
  2015-04-20  8:13       ` Gian-Carlo Pascutto
@ 2015-04-20 15:19         ` Zygo Blaxell
  0 siblings, 0 replies; 17+ messages in thread
From: Zygo Blaxell @ 2015-04-20 15:19 UTC (permalink / raw)
  To: Gian-Carlo Pascutto; +Cc: linux-btrfs


On Mon, Apr 20, 2015 at 10:13:47AM +0200, Gian-Carlo Pascutto wrote:
> On 20-04-15 06:27, Zygo Blaxell wrote:
> 
> >> I'm curious as to whether +C has any effect on BTRFS's durability, too.
> > 
> > I would expect it to be strictly equal to or worse than the CoW
> > durability.
> 
> In addition to the stuff pointed out, I've wondered about this:
> PostgreSQL full_page_writes copies 8k pages in order to prevent
> corruption from partial writes. But btrfs has 16k pages by default, so a
> corrupted FS page would corrupt more data than PostgreSQL protects.

There are multiple page sizes in a btrfs.

*Metadata* pages are 16K by default.  Metadata is always CoW in btrfs.

*Data* pages are 4K by default.  AIUI the btrfs code currently heavily
relies on data page and host CPU page sizes being identical.  I don't
know what happens on architectures with 8K page sizes--it's been years
since I booted an Alpha machine.

> Maybe it's not an issue if the underlying HW has 512b/4k sectors, or
> maybe I'm misunderstanding what the respective features assume, but
> unless informed to the contrary I wouldn't be entirely comfortable with
> this.

Usually writes are issued in multiples of 4K pages at a time from
the kernel, so the underlying sector size only comes into play when
something interrupts a disk between individual sector writes within
a page.

full_page_writes should handle no-CoW on btrfs (or any other filesystem)
just fine.  It's the *other* btrfs bugs you have to worry about.  ;)

> (With CoW enabled, you don't have partial writes, so the point is moot)
> 
> -- 
> GCP
> 



* Re: The FAQ on fsync/O_SYNC
  2015-04-20  6:07       ` Duncan
@ 2015-04-21  1:31         ` Zygo Blaxell
  0 siblings, 0 replies; 17+ messages in thread
From: Zygo Blaxell @ 2015-04-21  1:31 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs


On Mon, Apr 20, 2015 at 06:07:09AM +0000, Duncan wrote:
> 4.0 is out.  There's reason people may want to stick one version back by 
> default, to 3.19 currently, since it can take a few weeks for early 
> reports to develop into a coherent problem, and sticking one stable 
> series back allows for that, and deciding exactly when one is comfortable 
> upgrading.  But in btrfs context anyway, with 4.0 out, if you're not on 
> at least 3.19 yet, you should be able to point to the bug explaining
> /why/.  If you can't, arguably, you should be either upgrading yesterday 
> if not sooner, or you really should choose some other filesystem, as 
> btrfs simply isn't at the stability required for your use-case yet, and 
> you unnecessarily risk data loss to already found and fixed bugs as a 
> result.

I'm not sure that "run the latest kernel" or even "run the latest kernel
minus N weeks or months" is good advice for user data integrity at
present.  It's certainly unsupported by any test data I'm seeing.

If the intention is to discover and report or fix btrfs bugs, or confirm
that known bugs have been corrected, then the latest kernel (or a -next
integration branch) is the only one to run.  If the intention is to
use btrfs for data storage, then the kernel selection process is much
different.

In the stable kernels (the v3.xx.y Git tags with no other patches)
in the last year, there have been a number of btrfs regressions, from
memory leaks to deadlocks to filesystem-crashing corruption issues:

	2 severe corruption (i.e. destroy the filesystem) or memory leak
	issues (i.e. leak all the RAM and crash slowly and messily)
	I've encountered in my own testing,

	2 kernel panic or memory leak issues that I avoided by accident
	because the fix came out before I could pull the regression into
	a build,

	3 failure modes in new code leading to deadlock or temporary
	inability to retrieve stored data that first appeared in v3.13
	or v3.15, and as of today are not yet resolved.

My testing process runs like this (slightly simplified):

	1.  Build stable and/or Linus tagged kernels + integration-queue
	patches + locally-generated patches if any.

	2.  Run these kernels on various machines with workloads,
	observe and analyze failures.

	3.  When a machine fails to do its work due to a kernel issue,
	restart at step 1 with a different version or more patches.
	Note this includes more issues than btrfs; e.g. sometimes
	a kernel is not usable because of ACPI or WiFi issues that make
	btrfs test results irrelevant.

	4.  If a kernel build succeeds for N or more days, expand the set
	of test machines to get more test coverage, and go back to step 2.

	5.  If N >= 60 with no (severe) problems, consider that kernel
	stable and bless it for production.

Linux kernels getting to step 5 are rare and precious things even when
not testing btrfs.  The last kernel to reach step 5 for me was v3.12.x.
Before that was 835 days of searching for a successor to the kernel I
was running in production at the time.  :-/


> -- 
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
> 



* Re: The FAQ on fsync/O_SYNC
  2015-04-20  4:27     ` Zygo Blaxell
  2015-04-20  6:07       ` Duncan
  2015-04-20  8:13       ` Gian-Carlo Pascutto
@ 2015-04-21 19:07       ` Chris Murphy
  2 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2015-04-21 19:07 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Craig Ringer, Martin Steigerwald, Btrfs BTRFS

On Sun, Apr 19, 2015 at 10:27 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Sun, Apr 19, 2015 at 10:31:02PM +0800, Craig Ringer wrote:

>> I'm curious as to whether +C has any effect on BTRFS's durability, too.
>
> I would expect it to be strictly equal to or worse than the CoW
> durability.  It would have all the same general filesystem bugs as btrfs,
> plus extra bugs that are specific to the no-CoW btrfs code paths, and
> you lose write ordering and btrfs data integrity and repair capabilities,
> and you have to enable fsync and log tree replay and dodge the bugs.

Interesting. systemd-journald now uses +C by default on journal files.
I've had journals become corrupt, per its own journalctl --verify
command, but so far it happens infrequently enough that I haven't
discovered the pattern, i.e. whether it's worse with +C. And it's even
possible there are bugs in systemd-journald causing the problem.

-- 
Chris Murphy

