* The FAQ on fsync/O_SYNC @ 2015-04-19 13:20 Craig Ringer 2015-04-19 14:28 ` Martin Steigerwald 2015-04-20 3:29 ` Craig Ringer 0 siblings, 2 replies; 17+ messages in thread From: Craig Ringer @ 2015-04-19 13:20 UTC (permalink / raw) To: linux-btrfs Hi all I'm looking into the advisability of running PostgreSQL on BTRFS, and after looking at the FAQ there's something I'm hoping you could clarify. The wiki FAQ says: "Btrfs does not force all dirty data to disk on every fsync or O_SYNC operation, fsync is designed to be fast." Is that wording intended narrowly, to contrast with ext3's nasty habit of flushing *all* dirty blocks for the entire file system whenever anyone calls fsync() ? Or is it intended broadly, to say that btrfs's fsync won't necessarily flush all data blocks (just metadata) ? Is that statement still true in recent BTRFS versions (3.18, etc)? PostgreSQL (and any other transactional database) absolutely requires that there be a system call that will provide a hard guarantee that all dirty blocks for a given file are on durable storage. In the case of data-integrity-significant metadata operations it has to be able to get the same guarantee on metadata too. The documentation for fsync says that: fsync() transfers ("flushes") all modified in-core data of (i.e., modi‐ fied buffer cache pages for) the file referred to by the file descrip‐ tor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associ‐ ated with the file (see stat(2)). so I'm hoping that the FAQ writer was just comparing with ext3, and that btrfs's fsync() fully flushes all dirty blocks and metadata for a file or directory. 
(I haven't had a chance to do any testing on a machine with slow flushes yet, or any plug-pull testing.) Also on the FAQ: https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_crash_guarantees_of_overwrite-by-rename.3F it might be a good idea to recommend that applications really should fsync() the directory if they want a crash-safety guarantee, and that doing so (hopefully?) won't flush dirty file blocks, just directory metadata. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services ^ permalink raw reply [flat|nested] 17+ messages in thread
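The overwrite-by-rename pattern under discussion (fsync the new file, rename it into place, then fsync the containing directory so the rename itself is durable) can be sketched like this. A minimal illustration only, not taken from PostgreSQL or any other project's actual code:

```python
import os

def atomic_overwrite(path, data):
    """Crash-safe overwrite-by-rename: flush the new file's contents,
    atomically rename it over the target, then fsync the directory so
    the new directory entry also reaches stable storage."""
    dirname = os.path.dirname(path) or "."
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # flush the temp file's data and metadata
    finally:
        os.close(fd)
    os.rename(tmp, path)          # atomic within a single filesystem
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)             # make the rename itself durable
    finally:
        os.close(dfd)
```

Whether the directory fsync flushes dirty file blocks as well is exactly the question raised above; the pattern itself is filesystem-independent.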
* Re: The FAQ on fsync/O_SYNC 2015-04-19 13:20 The FAQ on fsync/O_SYNC Craig Ringer @ 2015-04-19 14:28 ` Martin Steigerwald 2015-04-19 14:31 ` Craig Ringer 2015-04-20 3:29 ` Craig Ringer 1 sibling, 1 reply; 17+ messages in thread From: Martin Steigerwald @ 2015-04-19 14:28 UTC (permalink / raw) To: Craig Ringer; +Cc: linux-btrfs Am Sonntag, 19. April 2015, 21:20:11 schrieb Craig Ringer: > Hi all Hi Craig, > I'm looking into the advisability of running PostgreSQL on BTRFS, and > after looking at the FAQ there's something I'm hoping you could > clarify. > > The wiki FAQ says: > > "Btrfs does not force all dirty data to disk on every fsync or O_SYNC > operation, fsync is designed to be fast." > > Is that wording intended narrowly, to contrast with ext3's nasty habit > of flushing *all* dirty blocks for the entire file system whenever > anyone calls fsync() ? Or is it intended broadly, to say that btrfs's > fsync won't necessarily flush all data blocks (just metadata) ? > > Is that statement still true in recent BTRFS versions (3.18, etc)? I don't know, so I'll leave that for others to answer. I always assumed a strong fsync() guarantee, as in "it's on disk", with BTRFS. So I am interested in that as well. But for databases, did you consider the copy-on-write fragmentation BTRFS will give? Even with autodefrag, AFAIK it is not recommended for large databases, at least on rotating media. Ciao, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: The FAQ on fsync/O_SYNC 2015-04-19 14:28 ` Martin Steigerwald @ 2015-04-19 14:31 ` Craig Ringer 2015-04-19 15:10 ` Martin Steigerwald ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Craig Ringer @ 2015-04-19 14:31 UTC (permalink / raw) To: Martin Steigerwald; +Cc: linux-btrfs On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> wrote: > Am Sonntag, 19. April 2015, 21:20:11 schrieb Craig Ringer: >> Hi all > > Hi Craig, > >> I'm looking into the advisability of running PostgreSQL on BTRFS, and >> after looking at the FAQ there's something I'm hoping you could >> clarify. >> >> The wiki FAQ says: >> >> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC >> operation, fsync is designed to be fast." >> >> Is that wording intended narrowly, to contrast with ext3's nasty habit >> of flushing *all* dirty blocks for the entire file system whenever >> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's >> fsync won't necessarily flush all data blocks (just metadata) ? >> >> Is that statement still true in recent BTRFS versions (3.18, etc)? > > I don´t know, thus leave that for others to answer. I always assumed a > strong fsync() guarentee as in "its on disk" with BTRFS. So I am > interested in that as well. > > But for databases, did you consider the copy on write fragmentation BTRFS > will give? Even with autodefrag, afaik it is not recommended to use it for > large databases on rotating media at least. I did, and any testing would need to look at the efficacy of the chattr +C option on the database directory tree. PostgreSQL is itself copy-on-write (because of multi-version concurrency control), so it doesn't make much sense to have the FS doing another layer of COW. I'm curious as to whether +C has any effect on BTRFS's durability, too. 
-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
* Re: The FAQ on fsync/O_SYNC 2015-04-19 14:31 ` Craig Ringer @ 2015-04-19 15:10 ` Martin Steigerwald 2015-04-19 15:18 ` Hugo Mills 2015-04-19 15:28 ` Russell Coker 2015-04-20 4:27 ` Zygo Blaxell 2 siblings, 1 reply; 17+ messages in thread From: Martin Steigerwald @ 2015-04-19 15:10 UTC (permalink / raw) To: Craig Ringer; +Cc: linux-btrfs Am Sonntag, 19. April 2015, 22:31:02 schrieb Craig Ringer: > On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> wrote: > > Am Sonntag, 19. April 2015, 21:20:11 schrieb Craig Ringer: > >> Hi all > > > > Hi Craig, > > > >> I'm looking into the advisability of running PostgreSQL on BTRFS, and > >> after looking at the FAQ there's something I'm hoping you could > >> clarify. > >> > >> The wiki FAQ says: > >> > >> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC > >> operation, fsync is designed to be fast." > >> > >> Is that wording intended narrowly, to contrast with ext3's nasty > >> habit > >> of flushing *all* dirty blocks for the entire file system whenever > >> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's > >> fsync won't necessarily flush all data blocks (just metadata) ? > >> > >> Is that statement still true in recent BTRFS versions (3.18, etc)? > > > > I don´t know, thus leave that for others to answer. I always assumed a > > strong fsync() guarentee as in "its on disk" with BTRFS. So I am > > interested in that as well. > > > > But for databases, did you consider the copy on write fragmentation > > BTRFS will give? Even with autodefrag, afaik it is not recommended to > > use it for large databases on rotating media at least. > > I did, and any testing would need to look at the efficacy of the > chattr +C option on the database directory tree. > > PostgreSQL is its self copy-on-write (because of multi-version > concurrency control), so it doesn't make much sense to have the FS > doing another layer of COW. 
> > I'm curious as to whether +C has any effect on BTRFS's durability, too. You will lose the ability to snapshot that directory tree then. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: The FAQ on fsync/O_SYNC 2015-04-19 15:10 ` Martin Steigerwald @ 2015-04-19 15:18 ` Hugo Mills 2015-04-19 17:50 ` Martin Steigerwald 0 siblings, 1 reply; 17+ messages in thread From: Hugo Mills @ 2015-04-19 15:18 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Craig Ringer, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2884 bytes --] On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote: > Am Sonntag, 19. April 2015, 22:31:02 schrieb Craig Ringer: > > On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> > wrote: > > > Am Sonntag, 19. April 2015, 21:20:11 schrieb Craig Ringer: > > >> Hi all > > > > > > Hi Craig, > > > > > >> I'm looking into the advisability of running PostgreSQL on BTRFS, and > > >> after looking at the FAQ there's something I'm hoping you could > > >> clarify. > > >> > > >> The wiki FAQ says: > > >> > > >> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC > > >> operation, fsync is designed to be fast." > > >> > > >> Is that wording intended narrowly, to contrast with ext3's nasty > > >> habit > > >> of flushing *all* dirty blocks for the entire file system whenever > > >> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's > > >> fsync won't necessarily flush all data blocks (just metadata) ? > > >> > > >> Is that statement still true in recent BTRFS versions (3.18, etc)? > > > > > > I don´t know, thus leave that for others to answer. I always assumed a > > > strong fsync() guarentee as in "its on disk" with BTRFS. So I am > > > interested in that as well. > > > > > > But for databases, did you consider the copy on write fragmentation > > > BTRFS will give? Even with autodefrag, afaik it is not recommended to > > > use it for large databases on rotating media at least. > > > > I did, and any testing would need to look at the efficacy of the > > chattr +C option on the database directory tree. 
> > > > PostgreSQL is its self copy-on-write (because of multi-version > > concurrency control), so it doesn't make much sense to have the FS > > doing another layer of COW. > > > > I'm curious as to whether +C has any effect on BTRFS's durability, too. > > You will loose the ability to snapshot that directory tree then. No you won't. The +C attribute still allows snapshotting and reflink copies. However, after the snapshot, writes to either copy will result in that copy being CoWed. (Specifically, writes to an extent of a +C file with more than one reference to the extent will result in a CoW operation, until there is only one reference, and then the writes will not be CoWed again). The practical upshot of this is that every snapshot of, and subsequent writes to, a +C file will introduce fragmentation in the same way that writes to a non-+C file would. You also have a disadvantage with +C that you lose the checksumming features of the FS, and hence the self-healing properties if you're running with btrfs-native RAID. Hugo. -- Hugo Mills | Nothing right in my left brain. Nothing left in my hugo@... carfax.org.uk | right brain. http://carfax.org.uk/ | PGP: E2AB1DE4 | [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
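Hugo's CoW-once semantics for +C files under snapshots can be restated as a toy reference-count model. This is purely illustrative (invented classes, not actual btrfs structures): a write to a shared extent triggers exactly one CoW, after which writes go in place again.

```python
class Extent:
    """Toy model of an extent: data plus a reference count."""
    def __init__(self, data):
        self.data = data
        self.refs = 1

class NoCowFile:
    """A +C (nodatacow) file: writes are in place unless the extent is
    shared with a snapshot, in which case exactly one CoW happens."""
    def __init__(self, data):
        self.extent = Extent(data)

    def snapshot(self):
        # Snapshots of +C files still work: both copies share the extent.
        snap = NoCowFile.__new__(NoCowFile)
        snap.extent = self.extent
        self.extent.refs += 1
        return snap

    def write(self, data):
        if self.extent.refs > 1:
            # Extent shared with a snapshot: one CoW, then sole ownership.
            self.extent.refs -= 1
            self.extent = Extent(data)
        else:
            self.extent.data = data  # in place, no CoW

f = NoCowFile("v1")
s = f.snapshot()
f.write("v2")   # CoWed once: the snapshot keeps "v1"
f.write("v3")   # extent now unshared: modified in place
assert s.extent.data == "v1" and f.extent.data == "v3"
```

This also shows why each snapshot reintroduces fragmentation: the one CoW per shared extent places the new data elsewhere.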
* Re: The FAQ on fsync/O_SYNC 2015-04-19 15:18 ` Hugo Mills @ 2015-04-19 17:50 ` Martin Steigerwald 2015-04-19 18:18 ` Hugo Mills 0 siblings, 1 reply; 17+ messages in thread From: Martin Steigerwald @ 2015-04-19 17:50 UTC (permalink / raw) To: Hugo Mills; +Cc: Craig Ringer, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3504 bytes --] Am Sonntag, 19. April 2015, 15:18:51 schrieb Hugo Mills: > On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote: > > Am Sonntag, 19. April 2015, 22:31:02 schrieb Craig Ringer: > > > On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> > > > > wrote: > > > > Am Sonntag, 19. April 2015, 21:20:11 schrieb Craig Ringer: > > > >> Hi all > > > > > > > > Hi Craig, > > > > > > > >> I'm looking into the advisability of running PostgreSQL on BTRFS, > > > >> and > > > >> after looking at the FAQ there's something I'm hoping you could > > > >> clarify. > > > >> > > > >> The wiki FAQ says: > > > >> > > > >> "Btrfs does not force all dirty data to disk on every fsync or > > > >> O_SYNC > > > >> operation, fsync is designed to be fast." > > > >> > > > >> Is that wording intended narrowly, to contrast with ext3's nasty > > > >> habit > > > >> of flushing *all* dirty blocks for the entire file system > > > >> whenever > > > >> anyone calls fsync() ? Or is it intended broadly, to say that > > > >> btrfs's > > > >> fsync won't necessarily flush all data blocks (just metadata) ? > > > >> > > > >> Is that statement still true in recent BTRFS versions (3.18, > > > >> etc)? > > > > > > > > I don´t know, thus leave that for others to answer. I always > > > > assumed a > > > > strong fsync() guarentee as in "its on disk" with BTRFS. So I am > > > > interested in that as well. > > > > > > > > But for databases, did you consider the copy on write > > > > fragmentation > > > > BTRFS will give? Even with autodefrag, afaik it is not recommended > > > > to > > > > use it for large databases on rotating media at least. 
> > > > > > I did, and any testing would need to look at the efficacy of the > > > chattr +C option on the database directory tree. > > > > > > PostgreSQL is its self copy-on-write (because of multi-version > > > concurrency control), so it doesn't make much sense to have the FS > > > doing another layer of COW. > > > > > > I'm curious as to whether +C has any effect on BTRFS's durability, > > > too. > > > > You will loose the ability to snapshot that directory tree then. > > No you won't. > > The +C attribute still allows snapshotting and reflink copies. > However, after the snapshot, writes to either copy will result in that > copy being CoWed. (Specifically, writes to an extent of a +C file with > more than one reference to the extent will result in a CoW operation, > until there is only one reference, and then the writes will not be > CoWed again). > > The practical upshot of this is that every snapshot of, and > subsequent writes to, a +C file will introduce fragmentation in the > same way that writes to a non-+C file would. > > You also have a disadvantage with +C that you lose the checksumming > features of the FS, and hence the self-healing properties if you're > running with btrfs-native RAID. Thanks for clarifying this, Hugo, so chattr +C will make the directory cowed again. And there is no checksumming on the FS at all anymore. Why is the latter? Why can't BTRFS checksum nocowed objects, or at least the cowed ones in the same FS? Because of atomicity guarantees? If this has been answered before and I missed it, feel free to point me to it; I didn't find anything obvious with my quick search. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: The FAQ on fsync/O_SYNC 2015-04-19 17:50 ` Martin Steigerwald @ 2015-04-19 18:18 ` Hugo Mills 2015-04-19 18:41 ` Martin Steigerwald 0 siblings, 1 reply; 17+ messages in thread From: Hugo Mills @ 2015-04-19 18:18 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Craig Ringer, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 2825 bytes --] On Sun, Apr 19, 2015 at 07:50:32PM +0200, Martin Steigerwald wrote: > Am Sonntag, 19. April 2015, 15:18:51 schrieb Hugo Mills: > > On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote: > > > Am Sonntag, 19. April 2015, 22:31:02 schrieb Craig Ringer: > > > > I'm curious as to whether +C has any effect on BTRFS's durability, > > > > too. > > > > > > You will loose the ability to snapshot that directory tree then. > > > > No you won't. > > > > The +C attribute still allows snapshotting and reflink copies. > > However, after the snapshot, writes to either copy will result in that > > copy being CoWed. (Specifically, writes to an extent of a +C file with > > more than one reference to the extent will result in a CoW operation, > > until there is only one reference, and then the writes will not be > > CoWed again). > > > > The practical upshot of this is that every snapshot of, and > > subsequent writes to, a +C file will introduce fragmentation in the > > same way that writes to a non-+C file would. > > > > You also have a disadvantage with +C that you lose the checksumming > > features of the FS, and hence the self-healing properties if you're > > running with btrfs-native RAID. > > Thanks for clarifying this Hugo, so chattr +C will make the directory > cowed again. Not quite sure what you mean there. If you set +C on a file or directory, there's no CoW operations on write to any of the affected files, *except* if there's a snapshot, in which case the file being written to will have *one* CoW operation before reverting to nodatacow again. > And there is not checksumming on the FS at all anymore. 
Why is the later? > Why can´t BTRFS checkum nocowed objects or at least the cowed ones in the > same FS? Cause of atomicity guarentees? Atomicity, indeed. You need to be able to update the checksum and the new data atomically. This is possible when the data can be CoWed, but if the data is being modified in place, there must be a short period of time when the two parts are out of sync. Just to make it clear, the lack of checksums is *only* for those files which are marked +C. The rest of the FS is unaffected. > If this has been answered before, and I missed it, feel free to point me > to it, I didn´t find anything obvious with my quick search. It's certainly knowledge that's out there (it's been discussed at length on IRC, for example), but I don't know if there's anything written up on the wiki. Hugo. -- Hugo Mills | Jenkins! Chap with the wings there! Five rounds hugo@... carfax.org.uk | rapid! http://carfax.org.uk/ | Brigadier Alistair Lethbridge-Stewart PGP: E2AB1DE4 | Dr Who and the Daemons [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
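Hugo's atomicity argument can be shown with a toy model: an in-place write that loses power between updating the data and updating the checksum leaves valid data that fails verification, while a CoW write publishes data and checksum together in a single pointer swap. Illustrative only (btrfs actually keeps data checksums in a separate csum tree):

```python
import zlib

class Block:
    """Toy block: data plus its CRC32 checksum."""
    def __init__(self, data):
        self.data = data
        self.csum = zlib.crc32(data)

def write_in_place(block, data, crash_midway=False):
    block.data = data                 # step 1: overwrite data in place
    if crash_midway:
        return                        # power loss before step 2
    block.csum = zlib.crc32(data)     # step 2: update the checksum

def write_cow(ptr, data):
    # Build the new block elsewhere, then swap one pointer: a crash
    # sees either the old block or the new one, never a mix.
    ptr[0] = Block(data)

blk = Block(b"old")
write_in_place(blk, b"new", crash_midway=True)
assert zlib.crc32(blk.data) != blk.csum    # data is valid, but verification fails

ptr = [Block(b"old")]
write_cow(ptr, b"new")
assert zlib.crc32(ptr[0].data) == ptr[0].csum
```

The mismatch in the in-place case is exactly the window Hugo describes: checksumming nodatacow files would turn a survivable crash into spurious read errors.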
* Re: The FAQ on fsync/O_SYNC 2015-04-19 18:18 ` Hugo Mills @ 2015-04-19 18:41 ` Martin Steigerwald 2015-04-19 18:51 ` Hugo Mills 0 siblings, 1 reply; 17+ messages in thread From: Martin Steigerwald @ 2015-04-19 18:41 UTC (permalink / raw) To: Hugo Mills; +Cc: Craig Ringer, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3162 bytes --] Am Sonntag, 19. April 2015, 18:18:24 schrieb Hugo Mills: > On Sun, Apr 19, 2015 at 07:50:32PM +0200, Martin Steigerwald wrote: > > Am Sonntag, 19. April 2015, 15:18:51 schrieb Hugo Mills: > > > On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote: > > > > Am Sonntag, 19. April 2015, 22:31:02 schrieb Craig Ringer: > > > > > I'm curious as to whether +C has any effect on BTRFS's > > > > > durability, > > > > > too. > > > > > > > > You will loose the ability to snapshot that directory tree then. > > > > > > > No you won't. > > > > > > The +C attribute still allows snapshotting and reflink copies. > > > > > > However, after the snapshot, writes to either copy will result in > > > that > > > copy being CoWed. (Specifically, writes to an extent of a +C file > > > with > > > more than one reference to the extent will result in a CoW > > > operation, > > > until there is only one reference, and then the writes will not be > > > CoWed again). > > > > > > The practical upshot of this is that every snapshot of, and > > > > > > subsequent writes to, a +C file will introduce fragmentation in the > > > same way that writes to a non-+C file would. > > > > > > You also have a disadvantage with +C that you lose the > > > checksumming > > > > > > features of the FS, and hence the self-healing properties if you're > > > running with btrfs-native RAID. > > > > Thanks for clarifying this Hugo, so chattr +C will make the directory > > cowed again. > > Not quite sure what you mean there. 
> > If you set +C on a file or directory, there's no CoW operations on > write to any of the affected files, *except* if there's a snapshot, in > which case the file being written to will have *one* CoW operation > before reverting to nodatacow again. What do you mean by *one* CoW operation: will BTRFS duplicate the whole file to keep it no-cowed? Or, hmmm, I think now I get it: there will be one CoW operation for each write – well, with some granularity, extent? – but *just* one, because then the written data is cowed away from the snapshot and can then be written again in a no-cowed way. > > And there is not checksumming on the FS at all anymore. Why is the > > later? Why can´t BTRFS checkum nocowed objects or at least the cowed > > ones in the same FS? Cause of atomicity guarentees? > > Atomicity, indeed. You need to be able to update the checksum and > the new data atomically. This is possible when the data can be CoWed, > but if the data is being modified in place, there must be a short > period of time when the two parts are out of sync. And would it be too much effort, or too much of a performance penalty, to let any checksum check wait till they are in sync again? > > Just to make it clear, the lack of checksums is *only* for those > files which are marked +C. The rest of the FS is unaffected. Thanks for that clarification. I read your original wording as if the whole FS was affected. Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: The FAQ on fsync/O_SYNC 2015-04-19 18:41 ` Martin Steigerwald @ 2015-04-19 18:51 ` Hugo Mills 0 siblings, 0 replies; 17+ messages in thread From: Hugo Mills @ 2015-04-19 18:51 UTC (permalink / raw) To: Martin Steigerwald; +Cc: Craig Ringer, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4189 bytes --] On Sun, Apr 19, 2015 at 08:41:39PM +0200, Martin Steigerwald wrote: > Am Sonntag, 19. April 2015, 18:18:24 schrieb Hugo Mills: > > On Sun, Apr 19, 2015 at 07:50:32PM +0200, Martin Steigerwald wrote: > > > Am Sonntag, 19. April 2015, 15:18:51 schrieb Hugo Mills: > > > > On Sun, Apr 19, 2015 at 05:10:30PM +0200, Martin Steigerwald wrote: > > > > > Am Sonntag, 19. April 2015, 22:31:02 schrieb Craig Ringer: > > > > > > I'm curious as to whether +C has any effect on BTRFS's > > > > > > durability, > > > > > > too. > > > > > > > > > > You will loose the ability to snapshot that directory tree then. > > > > > > > > > No you won't. > > > > > > > > The +C attribute still allows snapshotting and reflink copies. > > > > > > > > However, after the snapshot, writes to either copy will result in > > > > that > > > > copy being CoWed. (Specifically, writes to an extent of a +C file > > > > with > > > > more than one reference to the extent will result in a CoW > > > > operation, > > > > until there is only one reference, and then the writes will not be > > > > CoWed again). > > > > > > > > The practical upshot of this is that every snapshot of, and > > > > > > > > subsequent writes to, a +C file will introduce fragmentation in the > > > > same way that writes to a non-+C file would. > > > > > > > > You also have a disadvantage with +C that you lose the > > > > checksumming > > > > > > > > features of the FS, and hence the self-healing properties if you're > > > > running with btrfs-native RAID. > > > > > > Thanks for clarifying this Hugo, so chattr +C will make the directory > > > cowed again. > > > > Not quite sure what you mean there. 
> > > > If you set +C on a file or directory, there's no CoW operations on > > write to any of the affected files, *except* if there's a snapshot, in > > which case the file being written to will have *one* CoW operation > > before reverting to nodatacow again. > > What do you mean by *one* CoW operation, will BTRFS duplicate the whole > file to keep it no-cowed? Or, hmmm, I think now I get it: There will be one > CoW operation for each write – well, with some granularity, extent? –, but > *just* one, cause then the written data is cowed from the snapshot and > then can be written again in a no-cowed way. Correct. Granularity -- the storage of data is on a block basis (4k), but extent size goes down to individual bytes. I think this means that if you read and then write a single byte and then sync, the block is read into the page cache, the byte is modified, the block is written out (elsewhere, because it's CoW), and then the extents are updated to reference the one modified byte from the new block. I'm not 100% sure on that, though. > > > And there is not checksumming on the FS at all anymore. Why is the > > > later? Why can´t BTRFS checkum nocowed objects or at least the cowed > > > ones in the same FS? Cause of atomicity guarentees? > > > > Atomicity, indeed. You need to be able to update the checksum and > > the new data atomically. This is possible when the data can be CoWed, > > but if the data is being modified in place, there must be a short > > period of time when the two parts are out of sync. > > And it would be too much effort or too much of a performance penalty to let > any checksum check wait till they are in sync again? It's more that if the system crashes or suffers a power outage in that time window, you've got a mismatch that shows EIO even though the data is valid. > > Just to make it clear, the lack of checksums is *only* for those > > files which are marked +C. The rest of the FS is unaffected. > > Thanks for that clarification. 
> I read your original wording as if the whole FS was affected. Only if the FS is mounted with nodatacow. :) Hugo. -- Hugo Mills | Jenkins! Chap with the wings there! Five rounds hugo@... carfax.org.uk | rapid! http://carfax.org.uk/ | Brigadier Alistair Lethbridge-Stewart PGP: E2AB1DE4 | Dr Who and the Daemons
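Hugo hedges the granularity description above ("not 100% sure"), but the block-granular read-modify-write part can be sketched abstractly. A toy model of a one-byte write to 4 KiB blocks, not btrfs code:

```python
BLOCK = 4096  # blocks are the storage granularity; extents can be finer

def write_byte(blocks, offset, value):
    """One-byte write as a block-granular read-modify-write: the whole
    4 KiB block is read, patched in memory, and a modified copy is
    written back (under CoW, that copy would land at a new location)."""
    i, within = divmod(offset, BLOCK)
    buf = bytearray(blocks[i])   # read the block into the page cache
    buf[within] = value          # modify the single byte
    blocks[i] = bytes(buf)       # write the modified copy out

blocks = [bytes(BLOCK), bytes(BLOCK)]
write_byte(blocks, 4097, 0xFF)   # byte 1 of the second block
assert blocks[1][1] == 0xFF and blocks[0] == bytes(BLOCK)
```

Whether btrfs then references only the one modified byte from the new block via a fine-grained extent is the part Hugo was unsure about; the model stays at block level.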
* Re: The FAQ on fsync/O_SYNC 2015-04-19 14:31 ` Craig Ringer 2015-04-19 15:10 ` Martin Steigerwald @ 2015-04-19 15:28 ` Russell Coker 2015-04-20 4:27 ` Zygo Blaxell 2 siblings, 0 replies; 17+ messages in thread From: Russell Coker @ 2015-04-19 15:28 UTC (permalink / raw) To: Craig Ringer; +Cc: Martin Steigerwald, linux-btrfs On Mon, 20 Apr 2015, Craig Ringer <craig@2ndquadrant.com> wrote: > PostgreSQL is its self copy-on-write (because of multi-version > concurrency control), so it doesn't make much sense to have the FS > doing another layer of COW. That's a matter of opinion. I think it's great if PostgreSQL can do internal checksums and error correction. But I'd rather not have to test that functionality in the field. Really I prefer to have the ZFS copies= option for databases. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
* Re: The FAQ on fsync/O_SYNC 2015-04-19 14:31 ` Craig Ringer 2015-04-19 15:10 ` Martin Steigerwald 2015-04-19 15:28 ` Russell Coker @ 2015-04-20 4:27 ` Zygo Blaxell 2015-04-20 6:07 ` Duncan ` (2 more replies) 2 siblings, 3 replies; 17+ messages in thread From: Zygo Blaxell @ 2015-04-20 4:27 UTC (permalink / raw) To: Craig Ringer; +Cc: Martin Steigerwald, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 6074 bytes --] On Sun, Apr 19, 2015 at 10:31:02PM +0800, Craig Ringer wrote: > On 19 April 2015 at 22:28, Martin Steigerwald <martin@lichtvoll.de> wrote: > > Am Sonntag, 19. April 2015, 21:20:11 schrieb Craig Ringer: > >> Hi all > > > > Hi Craig, > > > >> I'm looking into the advisability of running PostgreSQL on BTRFS, and > >> after looking at the FAQ there's something I'm hoping you could > >> clarify. > >> > >> The wiki FAQ says: > >> > >> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC > >> operation, fsync is designed to be fast." > >> > >> Is that wording intended narrowly, to contrast with ext3's nasty habit > >> of flushing *all* dirty blocks for the entire file system whenever > >> anyone calls fsync() ? Or is it intended broadly, to say that btrfs's > >> fsync won't necessarily flush all data blocks (just metadata) ? Normal writes to btrfs filesystems using the versioned filesystem tree are consistent(ish), atomic, and durable; however, they have high latency as the filesystem normally delays commit until triggered by a periodic timer (or sync()--not fsync), then writes all outstanding dirty pages in memory. btrfs handles fsync separately from the main versioned filesystem tree in order to decrease the latency of fsync operations. There is a 'log tree' which behaves like a journal and contains data flushed with fsync() since the last fully committed btrfs root. After a crash, assuming no bugs, the log is replayed over the last committed version of the filesystem tree to implement fsync durability. 
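The commit/log-tree split described above can be modelled with a toy sketch: a periodic commit publishes the whole in-memory tree atomically, while fsync appends to a log that is replayed over the last committed root after a crash. Purely illustrative; all names are invented and real btrfs trees are far more involved:

```python
class ToyBtrfs:
    """Toy model of btrfs's commit vs. log-tree behaviour."""
    def __init__(self):
        self.committed = {}   # last fully committed tree root
        self.dirty = {}       # in-memory changes, not yet durable
        self.log = []         # fsync'd entries since the last commit

    def write(self, path, data):
        self.dirty[path] = data

    def fsync(self, path):
        # Low-latency durability: only this file goes to the log tree.
        self.log.append((path, self.dirty[path]))

    def commit(self):
        # The periodic (~30s) global commit, or sync(): publish
        # everything atomically and empty the log.
        self.committed = {**self.committed, **self.dirty}
        self.dirty.clear()
        self.log.clear()

    def crash_and_recover(self):
        # Dirty pages are lost; the log is replayed over the last root.
        recovered = dict(self.committed)
        for path, data in self.log:
            recovered[path] = data
        return recovered

fs = ToyBtrfs()
fs.write("a", "A"); fs.write("b", "B")
fs.fsync("a")                     # only "a" is durable before the crash
assert fs.crash_and_recover() == {"a": "A"}
```

The bugs discussed in this thread live in the replay step: the model assumes replay is correct, which is exactly what was not always true at the time.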
Unfortunately, in my experience, the log tree's most noticeable effect at the moment seems to be to add a crapton of special-case code paths, many of which do contain bugs, which are being fixed one at a time by btrfs developers. :-/ > >> Is that statement still true in recent BTRFS versions (3.18, etc)? 3.18 was released 133 days ago. It has only been 49 days since the last commit that fixes a btrfs data loss bug involving fsync (3a8b36f on Mar 1, appearing in mainline as of v4.0-rc3), and 27 days since a commit that fixes a problem involving fsync and discard (dcc82f4 on Mar 23, queued for v4.1). There has been a stream of fsync fixes in the past year, but it would be naive to believe that there are not still more bugs to be found given the frequency and recentness of fixes. > > I don´t know, thus leave that for others to answer. I always assumed a > > strong fsync() guarentee as in "its on disk" with BTRFS. So I am > > interested in that as well. That's the intention; however, btrfs is not there yet. It has been only 28 days since I last detected corrupted data on a btrfs instance: after a crash and log tree replay, extents from the *beginning* of several files written just before the crash were missing, but the *ends* of the files were present and correct. There are also cases where btrfs cannot read data that *is* on disk. I encounter that bug *every* day on some test systems, but can't yet reproduce it with less than a TB of data and heavy workloads. :-P > > But for databases, did you consider the copy on write fragmentation BTRFS > > will give? Even with autodefrag, afaik it is not recommended to use it for > > large databases on rotating media at least. > > I did, and any testing would need to look at the efficacy of the > chattr +C option on the database directory tree. > > PostgreSQL is its self copy-on-write (because of multi-version > concurrency control), so it doesn't make much sense to have the FS > doing another layer of COW. 
I noticed that redundancy and ended up picking btrfs over PostgreSQL. I disable fsync in PostgreSQL (as well as a half-dozen assorted applications that use sqlite, or just seem to like calling fsync often), turn off full-page-writes on the journal--and also clear the btrfs log tree before every mount. A database can happily rely on btrfs to preserve write ordering as long as all of its data is in one filesystem(*) and btrfs never gets to replay its log tree (i.e. using only the every-30-seconds global filesystem commit). The database is only able to offer async_commit mode when there is no fsync, but my applications all want async_commit for performance reasons anyway. Disabling fsync from PostgreSQL avoids the bugs in the btrfs implementation of fsync and the log tree. With fsync + log tree, I was rebuilding corrupted PostgreSQL databases from backups after almost every reboot, and sometimes even more often than that. I stopped testing PostgreSQL with fsync 277 days ago, and I have PostgreSQL instances running since then without fsync that are 117 days old...so that configuration seems as stable as anything else in btrfs. Note that 117 days ago this btrfs instance corrupted itself beyond repair (garbage tree node pointers with correct checksums!) and the entire filesystem had to be mkfs'ed and rebuilt from backup. For reference, my PostgreSQL workload is a nearly continuous stream of transactions modifying 10K-15K pages per commit (80-120 MB random writes, plus indexes). > I'm curious as to whether +C has any effect on BTRFS's durability, too. I would expect it to be strictly equal to or worse than the CoW durability. It would have all the same general filesystem bugs as btrfs, plus extra bugs that are specific to the no-CoW btrfs code paths, and you lose write ordering and btrfs data integrity and repair capabilities, and you have to enable fsync and log tree replay and dodge the bugs. 
(*) or maybe subvol. I haven't tested a multi-subvol single-filesystem btrfs, but I don't see much real-world advantage in configuring that way.
* Re: The FAQ on fsync/O_SYNC

From: Duncan @ 2015-04-20 6:07 UTC
To: linux-btrfs

Zygo Blaxell posted on Mon, 20 Apr 2015 00:27:31 -0400 as excerpted:

> Normal writes to btrfs filesystems using the versioned filesystem tree
> are consistent(ish), atomic, and durable; however, they have high
> latency as the filesystem normally delays commit until triggered by a
> periodic timer (or sync()--not fsync), then writes all outstanding dirty
> pages in memory.
>
> btrfs handles fsync separately from the main versioned filesystem tree
> in order to decrease the latency of fsync operations. There is a 'log
> tree' which behaves like a journal and contains data flushed with
> fsync() since the last fully committed btrfs root. After a crash,
> assuming no bugs, the log is replayed over the last committed version of
> the filesystem tree to implement fsync durability.
>
> Unfortunately, in my experience, the log tree's most noticeable effect
> at the moment seems to be to add a crapton of special-case code paths,
> many of which do contain bugs, which are being fixed one at a time by
> btrfs developers. :-/

Thanks, Zygo. That's the clearest explanation I've seen of why the supposedly atomic-commit btrfs still has a log, and what the log actually does. I wasn't entirely clear on that myself.

Meanwhile, yes, log-replay bugs do seem to be one of the sore spots ATM. I'm glad it's getting some focus now. It needed it.

>> >> Is that statement still true in recent BTRFS versions (3.18, etc)?
>
> 3.18 was released 133 days ago.
> It has only been 49 days since the last commit that fixes a btrfs data
> loss bug involving fsync (3a8b36f on Mar 1, appearing in mainline as of
> v4.0-rc3), and 27 days since a commit that fixes a problem involving
> fsync and discard (dcc82f4 on Mar 23, queued for v4.1).
>
> There has been a stream of fsync fixes in the past year, but it would be
> naive to believe that there are not still more bugs to be found given
> the frequency and recency of fixes.

Telling commentary on what is "recent" in a btrfs context, vs. what is "recent" in many distros' context, particularly in an "enterprise" distro context. =8^0

4.0 is out. There's reason people may want to stick one version back by default, to 3.19 currently, since it can take a few weeks for early reports to develop into a coherent problem, and sticking one stable series back allows for that, and for deciding exactly when one is comfortable upgrading. But in a btrfs context anyway, with 4.0 out, if you're not on at least 3.19 yet, you should be able to point to the bug explaining /why/. If you can't, arguably, you should either be upgrading yesterday if not sooner, or you really should choose some other filesystem, as btrfs simply isn't at the stability required for your use-case yet, and you unnecessarily risk data loss to already-found-and-fixed bugs as a result.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: The FAQ on fsync/O_SYNC

From: Zygo Blaxell @ 2015-04-21 1:31 UTC
To: Duncan; +Cc: linux-btrfs

On Mon, Apr 20, 2015 at 06:07:09AM +0000, Duncan wrote:
> 4.0 is out. There's reason people may want to stick one version back by
> default, to 3.19 currently, since it can take a few weeks for early
> reports to develop into a coherent problem, and sticking one stable
> series back allows for that, and deciding exactly when one is comfortable
> upgrading. But in btrfs context anyway, with 4.0 out, if you're not on
> at least 3.19 yet, you should be able to point to the bug explaining
> /why/. If you can't, arguably, you should be either upgrading yesterday
> if not sooner, or you really should choose some other filesystem, as
> btrfs simply isn't at the stability required for your use-case yet, and
> you unnecessarily risk data loss to already found and fixed bugs as a
> result.

I'm not sure that "run the latest kernel" or even "run the latest kernel minus N weeks or months" is good advice for user data integrity at present. It's certainly unsupported by any test data I'm seeing.

If the intention is to discover and report or fix btrfs bugs, or to confirm that known bugs have been corrected, then the latest kernel (or a -next integration branch) is the only one to run. If the intention is to use btrfs for data storage, then the kernel selection process is much different.

In the stable kernels (the v3.xx.y Git tags with no other patches) in the last year, there have been a number of btrfs regressions, from memory leaks to deadlocks to filesystem-crashing corruption issues:

 - 2 severe corruption (i.e. destroy the filesystem) or memory leak issues (i.e. leak all the RAM and crash slowly and messily) that I've encountered in my own testing,

 - 2 kernel panic or memory leak issues that I avoided by accident because the fix came out before I could pull the regression into a build,

 - 3 failure modes in new code, leading to deadlock or temporary inability to retrieve stored data, that first appeared in v3.13 or v3.15 and as of today are not yet resolved.

My testing process runs like this (slightly simplified):

1. Build stable and/or Linus tagged kernels + integration-queue patches + locally-generated patches, if any.

2. Run these kernels on various machines with workloads; observe and analyze failures.

3. When a machine fails to do its work due to a kernel issue, restart at step 1 with a different version or more patches. Note this includes more issues than btrfs; e.g. sometimes a kernel is not usable because of ACPI or WiFi issues that make btrfs test results irrelevant.

4. If a kernel build succeeds for N or more days, expand the set of test machines to get more test coverage, and go back to step 2.

5. If N >= 60 with no (severe) problems, consider that kernel stable and bless it for production.

Linux kernels getting to step 5 are rare and precious things, even when not testing btrfs. The last kernel to reach step 5 for me was v3.12.x. Before that came 835 days of searching for a successor to the kernel I was running in production at the time. :-/
* Re: The FAQ on fsync/O_SYNC

From: Gian-Carlo Pascutto @ 2015-04-20 8:13 UTC
To: linux-btrfs

On 20-04-15 06:27, Zygo Blaxell wrote:

>> I'm curious as to whether +C has any effect on BTRFS's durability, too.
>
> I would expect it to be strictly equal to or worse than the CoW
> durability.

In addition to the stuff pointed out, I've wondered about this: PostgreSQL full_page_writes copies 8k pages in order to prevent corruption from partial writes. But btrfs has 16k pages by default, so a corrupted FS page would corrupt more data than PostgreSQL protects.

Maybe it's not an issue if the underlying HW has 512b/4k sectors, or maybe I'm misunderstanding what the respective features assume, but unless informed to the contrary I wouldn't be entirely comfortable with this.

(With CoW enabled, you don't have partial writes, so the point is moot.)

-- 
GCP
* Re: The FAQ on fsync/O_SYNC

From: Zygo Blaxell @ 2015-04-20 15:19 UTC
To: Gian-Carlo Pascutto; +Cc: linux-btrfs

On Mon, Apr 20, 2015 at 10:13:47AM +0200, Gian-Carlo Pascutto wrote:
> In addition to the stuff pointed out, I've wondered about this:
> PostgreSQL full_page_writes copies 8k pages in order to prevent
> corruption from partial writes. But btrfs has 16k pages by default, so a
> corrupted FS page would corrupt more data than PostgreSQL protects.

There are multiple page sizes in a btrfs. *Metadata* pages are 16K by default, and metadata is always CoW in btrfs. *Data* pages are 4K by default. AIUI the btrfs code currently heavily relies on data page and host CPU page sizes being identical. I don't know what happens on architectures with 8K page sizes--it's been years since I booted an Alpha machine.

> Maybe it's not an issue if the underlying HW has 512b/4k sectors, or
> maybe I'm misunderstanding what the respective features assume, but
> unless informed to the contrary I wouldn't be entirely comfortable with
> this.

Usually writes are issued in multiples of 4K pages at a time from the kernel, so the underlying sector size only comes into play when something interrupts a disk between individual sector writes within a page. full_page_writes should handle no-CoW on btrfs (or any other filesystem) just fine. It's the *other* btrfs bugs you have to worry about.
;)
* Re: The FAQ on fsync/O_SYNC

From: Chris Murphy @ 2015-04-21 19:07 UTC
To: Zygo Blaxell; +Cc: Craig Ringer, Martin Steigerwald, Btrfs BTRFS

On Sun, Apr 19, 2015 at 10:27 PM, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> On Sun, Apr 19, 2015 at 10:31:02PM +0800, Craig Ringer wrote:
>> I'm curious as to whether +C has any effect on BTRFS's durability, too.
>
> I would expect it to be strictly equal to or worse than the CoW
> durability. It would have all the same general filesystem bugs as btrfs,
> plus extra bugs that are specific to the no-CoW btrfs code paths, and
> you lose write ordering and btrfs data integrity and repair capabilities,
> and you have to enable fsync and log tree replay and dodge the bugs.

Interesting. systemd-journald now uses +C by default on journal files. I've had journals become corrupt, per its own journalctl --verify command, but so far it happens infrequently enough that I haven't discovered the pattern -- whether it's worse with +C. And it's even possible there are bugs in systemd-journald causing the problem.

-- 
Chris Murphy
* Re: The FAQ on fsync/O_SYNC

From: Craig Ringer @ 2015-04-20 3:29 UTC
To: linux-btrfs

While the discussion on +C was interesting, I'm really interested in btrfs's fsync() behaviour, per the original post:

On 19 April 2015 at 21:20, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> I'm looking into the advisability of running PostgreSQL on BTRFS, and
> after looking at the FAQ there's something I'm hoping you could
> clarify.
>
> The wiki FAQ says:
>
> "Btrfs does not force all dirty data to disk on every fsync or O_SYNC
> operation, fsync is designed to be fast."
>
> Is that wording intended narrowly, to contrast with ext3's nasty habit
> of flushing *all* dirty blocks for the entire file system whenever
> anyone calls fsync()? Or is it intended broadly, to say that btrfs's
> fsync won't necessarily flush all data blocks (just metadata)?
>
> Is that statement still true in recent BTRFS versions (3.18, etc)?
>
> PostgreSQL (and any other transactional database) absolutely requires
> that there be a system call that will provide a hard guarantee that
> all dirty blocks for a given file are on durable storage. In the case
> of data-integrity-significant metadata operations it has to be able to
> get the same guarantee on metadata too.
>
> The documentation for fsync says that:
>
>     fsync() transfers ("flushes") all modified in-core data of (i.e.,
>     modified buffer cache pages for) the file referred to by the file
>     descriptor fd to the disk device (or other permanent storage
>     device) so that all changed information can be retrieved even
>     after the system crashed or was rebooted. This includes writing
>     through or flushing a disk cache if present. The call blocks
>     until the device reports that the transfer has completed. It also
>     flushes metadata information associated with the file (see
>     stat(2)).
>
> so I'm hoping that the FAQ writer was just comparing with ext3, and
> that btrfs's fsync() fully flushes all dirty blocks and metadata for a
> file or directory. (I haven't had a chance to do any testing on a
> machine with slow flushes yet, or any plug-pull testing.)
>
> Also on the FAQ:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_crash_guarantees_of_overwrite-by-rename.3F
>
> it might be a good idea to recommend that applications really should
> fsync() the directory if they want a crash-safety guarantee, and that
> doing so (hopefully?) won't flush dirty file blocks, just directory
> metadata.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services