* Questions about filesystems from SQLite author presentation @ 2020-01-06 7:24 Sitsofe Wheeler 2020-01-06 10:15 ` Dave Chinner 2020-01-06 15:40 ` Amir Goldstein 0 siblings, 2 replies; 11+ messages in thread From: Sitsofe Wheeler @ 2020-01-06 7:24 UTC (permalink / raw) To: linux-fsdevel; +Cc: drh At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled "Things to discuss" (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 ) and had a few questions: 1. Reliable ways to discover detailed filesystem properties 2. fbarrier() 3. Notify the OS about unused regions in the database file For 1. I think Jan Kara said that supporting it was undesirable because details like just how many additional fsyncs were needed depend on competing constraints (https://youtu.be/-oP2BOsMpdo?t=6063 ). Someone mentioned there was a patch for fsinfo to discover if you were on a network filesystem (https://www.youtube.com/watch?v=-oP2BOsMpdo&feature=youtu.be&t=5525 )... For 2. there was a talk by MySQL dev Sergei Golubchik ( https://youtu.be/-oP2BOsMpdo?t=1219 ) discussing how barriers had been taken out and whether there was a replacement. In https://youtu.be/-oP2BOsMpdo?t=1731 Chris Mason(?) seems to suggest that the desired effect could be achieved with io_uring chaining. For 3. it sounded like Jan Kara was saying there wasn't anything at the moment (hypothetically you could introduce a call that marked the extents as "unwritten" but it doesn't sound like you can do that today) and even if you wanted to use something like TRIM it wouldn't be worth it unless you were trimming a large (gigabytes) amount of data (https://youtu.be/-oP2BOsMpdo?t=6330 ). However, there were even more questions in the briefing paper (https://sqlite.org/lpc2019/doc/trunk/briefing.md and search for '?') that couldn't be asked due to limited time. 
Does anyone know the answer to the extended questions and whether the above is the right deduction for the questions that were asked? -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Questions about filesystems from SQLite author presentation 2020-01-06 7:24 Questions about filesystems from SQLite author presentation Sitsofe Wheeler @ 2020-01-06 10:15 ` Dave Chinner 2020-01-07 8:40 ` Sitsofe Wheeler 2020-01-07 8:47 ` Jan Kara 2020-01-06 15:40 ` Amir Goldstein 1 sibling, 2 replies; 11+ messages in thread From: Dave Chinner @ 2020-01-06 10:15 UTC (permalink / raw) To: Sitsofe Wheeler; +Cc: linux-fsdevel, drh On Mon, Jan 06, 2020 at 07:24:53AM +0000, Sitsofe Wheeler wrote: > At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite > (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled > "Things to discuss" > (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 ) > and had a few questions: > > 1. Reliable ways to discover detailed filesystem properties > 2. fbarrier() > 3. Notify the OS about unused regions in the database file > > For 1. I think Jan Kara said that supporting it was undesirable for > details like just how much additional fsync were needed due to > competing constraints (https://youtu.be/-oP2BOsMpdo?t=6063 ). Someone > mentioned there was a > patch for fsinfo to discover if you were on a network filesystem > (https://www.youtube.com/watch?v=-oP2BOsMpdo&feature=youtu.be&t=5525 > )... > For 2. there was a talk by MySQL dev Sergei Golubchik ( > https://youtu.be/-oP2BOsMpdo?t=1219 ) talking about how barriers had > been taken out and was there a replacement. In > https://youtu.be/-oP2BOsMpdo?t=1731 Chris Mason(?) seems to suggest > that the desired effect could be achieved with io_uring chaining. Even though it wasn't explicitly mentioned, I'm pretty sure that those "write barriers" for ordering groups of writes against other groups of writes are intended to be used for data integrity purposes. The problem is that data integrity writes also require any uncommitted filesystem metadata to be written in the correct order to disk along with the data. i.e. 
you can write to the log file, but if the transactions during that write that allocate space and/or convert it to written space have not been committed to the journal then the data is not on stable storage and so data completion ordering cannot be relied on for integrity related operations. This is why write ordering always comes back to "you need to use fdatasync(), O_DSYNC or RWF_DSYNC" - it is the only way to guarantee the integrity of an initial data write(s) right down to the hardware before starting the new dependent write(s). Hence AIO_FSYNC and now chained operations in io_uring to allow fsync to be issued asynchronously and be used as a "write barrier" between groups of order dependent IOs... > For 3. it sounded like Jan Kara was saying there wasn't anything at > the moment (hypothetically you could introduce a call that marked the > extents as "unwritten" but it doesn't sound like you can do that You can do that with fallocate() - FALLOC_FL_ZERO_RANGE will mark the unused range as unwritten in XFS, or you can just punch a hole to free the unused space with FALLOC_FL_PUNCH_HOLE... > today) and even if you wanted to use something like TRIM it wouldn't > be worth it unless you were trimming a large (gigabytes) amount of > data (https://youtu.be/-oP2BOsMpdo?t=6330 ). Punch the space out, then run a periodic background fstrim so the filesystem can issue efficient TRIM commands over free space... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
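The ordering rule Dave describes can be illustrated with a short sketch (Python, standard library only; the payload/commit-record split is a hypothetical WAL-style example, not SQLite's actual code): issue the dependent write only after fdatasync() has returned for the writes it depends on, so both the data and the metadata that maps it are stable first.

```python
import os
import tempfile

def append_with_barrier(fd, payload, commit_record):
    os.write(fd, payload)        # log payload (may allocate new blocks)
    os.fdatasync(fd)             # barrier: payload + allocation metadata stable
    os.write(fd, commit_record)  # dependent write: only now is it safe to issue
    os.fdatasync(fd)             # make the commit record itself durable

fd, path = tempfile.mkstemp()
try:
    append_with_barrier(fd, b"payload-bytes ", b"COMMIT\n")
    os.lseek(fd, 0, os.SEEK_SET)
    content = os.read(fd, 1024)
finally:
    os.close(fd)
    os.unlink(path)
```

With io_uring the same ordering can be expressed asynchronously by linking the fsync SQE to the preceding write SQEs (IOSQE_IO_LINK), which is the chaining mentioned in the talk.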
* Re: Questions about filesystems from SQLite author presentation 2020-01-06 10:15 ` Dave Chinner @ 2020-01-07 8:40 ` Sitsofe Wheeler 2020-01-07 8:55 ` Jan Kara 2020-01-07 8:47 ` Jan Kara 1 sibling, 1 reply; 11+ messages in thread From: Sitsofe Wheeler @ 2020-01-07 8:40 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-fsdevel, drh, Jan Kara On Mon, 6 Jan 2020 at 10:16, Dave Chinner <david@fromorbit.com> wrote: > > Hence AIO_FSYNC and now chained operations in io_uring to allow > fsync to be issued asynchronously and be used as a "write barrier" > between groups of order dependent IOs... Thanks for detailing this. > > For 3. it sounded like Jan Kara was saying there wasn't anything at > > the moment (hypothetically you could introduce a call that marked the > > extents as "unwritten" but it doesn't sound like you can do that > > You can do that with fallocate() - FALLOC_FL_ZERO_RANGE will mark > the unused range as unwritten in XFS, or you can just punch a hole > to free the unused space with FALLOC_FL_PUNCH_HOLE... Ah! > > today) and even if you wanted to use something like TRIM it wouldn't > > be worth it unless you were trimming a large (gigabytes) amount of > > data (https://youtu.be/-oP2BOsMpdo?t=6330 ). > > Punch the space out, then run a periodic background fstrim so the > filesystem can issue efficient TRIM commands over free space... Jan mentions this over on https://youtu.be/-oP2BOsMpdo?t=6268 . Basically he advises against hole punching if you're going to write to that area again because it fragments the file, hurts future performance etc. But I guess if you were using FALLOC_FL_ZERO_RANGE no hole is punched (so no fragmentation) and you likely get faster reads of that area until the data is rewritten too. Are areas that have had FALLOC_FL_ZERO_RANGE run on them eligible for trimming if someone goes on to do a background trim (Jan - doesn't this sound like the best of both worlds)? 
My question is what happens if you call FALLOC_FL_ZERO_RANGE and your filesystem is too dumb to mark extents unwritten - will it literally go away and write a bunch of zeros over that region while your disk is a slow HDD, or will that call just fail? It's almost like you need something that can tell you if FALLOC_FL_ZERO_RANGE is efficient... -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Questions about filesystems from SQLite author presentation 2020-01-07 8:40 ` Sitsofe Wheeler @ 2020-01-07 8:55 ` Jan Kara 2020-01-07 17:18 ` Darrick J. Wong 0 siblings, 1 reply; 11+ messages in thread From: Jan Kara @ 2020-01-07 8:55 UTC (permalink / raw) To: Sitsofe Wheeler; +Cc: Dave Chinner, linux-fsdevel, drh, Jan Kara On Tue 07-01-20 08:40:00, Sitsofe Wheeler wrote: > On Mon, 6 Jan 2020 at 10:16, Dave Chinner <david@fromorbit.com> wrote: > > > today) and even if you wanted to use something like TRIM it wouldn't > > > be worth it unless you were trimming a large (gigabytes) amount of > > > data (https://youtu.be/-oP2BOsMpdo?t=6330 ). > > > > Punch the space out, then run a periodic background fstrim so the > > filesystem can issue efficient TRIM commands over free space... > > Jan mentions this over on https://youtu.be/-oP2BOsMpdo?t=6268 . > Basically he advises against hole punching if you're going to write to > that area again because it fragments the file, hurts future > performance etc. But I guess if you were using FALLOC_FL_ZERO_RANGE no > hole is punched (so no fragmentation) and you likely get faster reads > of that area until the data is rewritten too. Yes, no fragmentation in this case (well, there's still the fact that the extent tree needs to record that a particular range is marked as unwritten so that will get fragmented but it is merged again as soon as the range is written). > Are areas that have had > FALLOC_FL_ZERO_RANGE run on them eligible for trimming if someone goes > on to do a background trim (Jan - doesn't this sound like the best of > both worlds)? No, these areas are still allocated for the file and thus background trim will not touch them. Conceivably, we could use trim for such areas but in practice this is going to be too expensive to discover them (you'd need to read all the inodes and their extent trees) at least for ext4 and I believe for xfs as well. 
> My question is what happens if you call FALLOC_FL_ZERO_RANGE and your > filesystem is too dumb to mark extents unwritten - will it literally > go away and write a bunch of zeros over that region and your disk is a > slow HDD or will that call just fail? It's almost like you need > something that can tell you if FALLOC_FL_ZERO_RANGE is efficient... It is up to the filesystem how it implements the operation, but so far we have managed to maintain a situation where FALLOC_FL_ZERO_RANGE returns an error if it is not efficient. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
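Jan's "returns an error if not efficient" behaviour means an application can probe support at runtime. A minimal sketch (Python with ctypes, since the flag constant is not exposed by the standard library; the flag value is copied from <linux/falloc.h>, and the True/False convention below is this sketch's own, not a kernel API):

```python
import ctypes
import ctypes.util
import errno
import os
import tempfile

FALLOC_FL_ZERO_RANGE = 0x10  # from <linux/falloc.h>

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_longlong, ctypes.c_longlong]
libc.fallocate.restype = ctypes.c_int

def try_zero_range(fd, offset, length):
    # True: the fs zeroed the range cheaply (unwritten-extent metadata).
    # False: the fs refused rather than writing zeros (EOPNOTSUPP/EINVAL).
    if libc.fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length) == 0:
        return True
    err = ctypes.get_errno()
    if err in (errno.EOPNOTSUPP, errno.EINVAL):
        return False
    raise OSError(err, os.strerror(err))

fd, path = tempfile.mkstemp()
os.write(fd, b"\xff" * 8192)
supported = try_zero_range(fd, 0, 4096)  # e.g. True on ext4/xfs, False on tmpfs
os.lseek(fd, 0, os.SEEK_SET)
head = os.read(fd, 4096)                 # zeroed only if the call succeeded
os.close(fd)
os.unlink(path)
```

Because a refusing filesystem leaves the data untouched, the caller can fall back to PUNCH_HOLE or to doing nothing, without ever paying for a bulk zero write.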
* Re: Questions about filesystems from SQLite author presentation 2020-01-07 8:55 ` Jan Kara @ 2020-01-07 17:18 ` Darrick J. Wong 0 siblings, 0 replies; 11+ messages in thread From: Darrick J. Wong @ 2020-01-07 17:18 UTC (permalink / raw) To: Jan Kara; +Cc: Sitsofe Wheeler, Dave Chinner, linux-fsdevel, drh On Tue, Jan 07, 2020 at 09:55:06AM +0100, Jan Kara wrote: > On Tue 07-01-20 08:40:00, Sitsofe Wheeler wrote: > > On Mon, 6 Jan 2020 at 10:16, Dave Chinner <david@fromorbit.com> wrote: > > > > today) and even if you wanted to use something like TRIM it wouldn't > > > > be worth it unless you were trimming a large (gigabytes) amount of > > > > data (https://youtu.be/-oP2BOsMpdo?t=6330 ). > > > > > > Punch the space out, then run a periodic background fstrim so the > > > filesystem can issue efficient TRIM commands over free space... > > > > Jan mentions this over on https://youtu.be/-oP2BOsMpdo?t=6268 . > > Basically he advises against hole punching if you're going to write to > > that area again because it fragments the file, hurts future > > performance etc. But I guess if you were using FALLOC_FL_ZERO_RANGE no > > hole is punched (so no fragmentation) and you likely get faster reads > > of that area until the data is rewritten too. > > Yes, no fragmentation in this case (well, there's still the fact that > the extent tree needs to record that a particular range is marked as > unwritten so that will get fragmented but it is merged again as soon as the > range is written). > > > Are areas that have had > > FALLOC_FL_ZERO_RANGE run on them eligible for trimming if someone goes > > on to do a background trim (Jan - doesn't this sound like the best of > > both worlds)? > > No, these areas are still allocated for the file and thus background trim > will not touch them. 
Conceivably, we could use trim for such areas but > technically this is going to be too expensive to discover them (you'd need > to read all the inodes and their extent trees to discover them) at least > for ext4 and I believe for xfs as well. > > > My question is what happens if you call FALLOC_FL_ZERO_RANGE and your > > filesystem is too dumb to mark extents unwritten - will it literally > > go away and write a bunch of zeros over that region and your disk is a > > slow HDD or will that call just fail? It's almost like you need > > something that can tell you if FALLOC_FL_ZERO_RANGE is efficient... > > It is up to the filesystem how it implements the operation but so far we > managed to maintain a situation that FALLOC_FL_ZERO_RANGE returns an error if > it is not efficient. The manpage says "...the specified range will not be physically zeroed out on the device (except for partial blocks at the either end of the range), and I/O is (otherwise) required only to update metadata." I think that should be sufficient to hold the fs authors to "FALLOC_FL_ZERO_RANGE must be efficient". Though I've also wondered if that means the fs is free to call blkdev_issue_zeroout with NOFALLBACK in lieu of using unwritten extents? --D > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Questions about filesystems from SQLite author presentation 2020-01-06 10:15 ` Dave Chinner 2020-01-07 8:40 ` Sitsofe Wheeler @ 2020-01-07 8:47 ` Jan Kara 1 sibling, 0 replies; 11+ messages in thread From: Jan Kara @ 2020-01-07 8:47 UTC (permalink / raw) To: Dave Chinner; +Cc: Sitsofe Wheeler, linux-fsdevel, drh On Mon 06-01-20 21:15:18, Dave Chinner wrote: > On Mon, Jan 06, 2020 at 07:24:53AM +0000, Sitsofe Wheeler wrote: > > For 3. it sounded like Jan Kara was saying there wasn't anything at > > the moment (hypothetically you could introduce a call that marked the > > extents as "unwritten" but it doesn't sound like you can do that > > You can do that with fallocate() - FALLOC_FL_ZERO_RANGE will mark > the unused range as unwritten in XFS, or you can just punch a hole > to free the unused space with FALLOC_FL_PUNCH_HOLE... Yes, this works for ext4 the same way. > > today) and even if you wanted to use something like TRIM it wouldn't > > be worth it unless you were trimming a large (gigabytes) amount of > > data (https://youtu.be/-oP2BOsMpdo?t=6330 ). > > Punch the space out, then run a periodic background fstrim so the > filesystem can issue efficient TRIM commands over free space... Yes, in that particular case Richard was mentioning with Sqlite, he was asking about a situation where he has a DB file which has 64k free here, 256k free there and whether it helps the OS in any way to tell that these areas are free (but will likely get reused in the future). And in this case I told him that punching out the free space is going to do more harm than good (due to fragmentation) and using FALLOC_FL_ZERO_RANGE isn't going to bring any benefit to the filesystem or the storage. He was also wondering whether using TRIM for these free areas on disk is useful and I told him that for current devices I don't think it will bring any benefit with the sizes he is talking about. 
Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Questions about filesystems from SQLite author presentation 2020-01-06 7:24 Questions about filesystems from SQLite author presentation Sitsofe Wheeler 2020-01-06 10:15 ` Dave Chinner @ 2020-01-06 15:40 ` Amir Goldstein 2020-01-06 16:42 ` Matthew Wilcox ` (2 more replies) 1 sibling, 3 replies; 11+ messages in thread From: Amir Goldstein @ 2020-01-06 15:40 UTC (permalink / raw) To: Sitsofe Wheeler Cc: linux-fsdevel, drh, Jan Kara, Dave Chinner, Theodore Tso, harshad shirwadkar On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote: > > At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite > (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled > "Things to discuss" > (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 ) > and had a few questions: > [...] > > However, there were even more questions in the briefing paper > (https://sqlite.org/lpc2019/doc/trunk/briefing.md and search for '?') > that couldn't be asked due to limited time. Does anyone know the > answer to the extended questions and whether the above is the right > deduction for the questions that were asked? > As Jan said, there is a difference between the answer to "what is the current behavior" and "what are filesystem developers willing to commit as behavior that will remain the same in the future", but I will try to provide some answers to your questions. > If a power loss occurs at about the same time that a file is being extended > with new data, will the file be guaranteed to contain valid data after reboot, > or might the extended area of the file contain all zeros or all ones or > arbitrary content? In other words, is the file data always committed to disk > ahead of the file size? While that statement would generally be true (ever since ext3 journal=ordered...), you have no such guarantee. Getting such a guarantee would require a new API like O_ATOMIC. 
> If a power loss occurs at about the same time as a file truncation, is it possible > that the truncated area of the file will contain arbitrary data after reboot? > In other words, is the file size guaranteed to be committed to disk before the > data sections are released? That statement is generally true for filesystems that claim to be crash consistent. And the filesystems that do not claim to be crash consistent provide no guarantees at all w.r.t. power loss, so it's not worth talking about them in this context. > If a write occurs on one or two bytes of a file at about the same time as a power > loss, are other bytes of the file guaranteed to be unchanged after reboot? > Or might some other bytes within the same sector have been modified as well? I don't see how other bytes could change in this scenario, but I don't know if the hardware provides this guarantee. Maybe someone else knows the answer. > When you create a new file, write to it, and fdatasync() successfully, is it also > necessary to open and fsync() the containing directory in order to ensure that the > file will still be there following reboot from a power loss? There is no guarantee that the file will be there after power loss without fsync() of the containing directory. In practice, with current upstream xfs and ext4 the file will be there after reboot, because at the moment, fdatasync() of a new file implies a journal flush, which also includes the file creation. With current upstream btrfs the file may not be there after reboot. I tried to promote a new API to provide a weaker guarantee in LSF/MM 2019 [1][2]. The idea is an API used by an application that does not need durability - it doesn't care if the new file is there or not after power loss, but if the file is there, its data should be valid. I do not know if sqlite could potentially use such an API. If there is a potential use, I did not find it. 
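The conservative pattern behind Amir's "fsync() the containing directory" answer can be sketched as follows (Python; `create_durably` is an illustrative name, not an existing API): fdatasync() the new file for its contents, then fsync() the parent directory so the directory entry itself is durable on all filesystems, btrfs included.

```python
import os
import tempfile

def create_durably(dirpath, name, data):
    path = os.path.join(dirpath, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, data)
        os.fdatasync(fd)       # file data (and size) durable
    finally:
        os.close(fd)
    dfd = os.open(dirpath, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dfd)          # the directory entry (the name) durable
    finally:
        os.close(dfd)
    return path

d = tempfile.mkdtemp()
p = create_durably(d, "journal-000", b"commit record")
```

The directory fsync is what the weaker API proposal would let applications skip when they only need "valid if present" semantics.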
Specifically, the proposed API DOES NOT have the semantics of fbarrier() mentioned in the sqlite briefing doc. [See more about fdatasync() at the bottom of my reply...] > Has a file been unlinked or renamed since it was opened? > (SQLite accomplishes this now by remembering the device and inode numbers > obtained from fstat() and comparing them against the results of subsequent stat() > calls against the original filename. Is there a more efficient way to do this?) name_to_handle_at() is a better way to make sure that a file with the same name wasn't replaced by another, because inode numbers get frequently recycled in create/delete workloads. > Has a particular file been created since the most recent reboot? statx(2) exposes "birth time" (STATX_BTIME) which some filesystems support depending on how they were formatted (e.g. ext4 inode size). In any case, statx reports if btime info is available or not. > Is it possible (or helpful) to tell the filesystem that the content of a particular file > does not need to survive reboot? Not that I know of. > Is it possible (or helpful) to tell the filesystem that a particular file can be > unlinked upon reboot? Not that I know of. > Is it possible (or helpful) to tell the filesystem about parts of the database > file that are currently unused and that the filesystem can zero-out without > harming the database? As Dave already replied, FALLOC_FL_ZERO_RANGE. [...more about fdatasync()] One thing that I think is worth mentioning, I discussed it on LSF [3], is the cost of requiring applications developers to use the most strict API (i.e. fsync()), because filesystem developers don't want to commit to new APIs - When the same filesystem hosts two different workloads: 1. sqlite with many frequent small transaction commits 2. Creating many small files with no need for durability (e.g. untar) Both workloads may in practice hurt each other on many filesystems. 
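The fstat()/stat() comparison SQLite uses today looks roughly like this (Python sketch; `same_file` is an illustrative name). The inode-recycling caveat is why name_to_handle_at() is more robust: the check below only stays reliable while the original descriptor is held open, which pins the old inode and its number.

```python
import os
import tempfile

def same_file(path, opened_fd):
    # Does `path` still name the inode we opened? False after an
    # unlink or rename-over, since (st_dev, st_ino) then differ.
    fstat = os.fstat(opened_fd)
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return (st.st_dev, st.st_ino) == (fstat.st_dev, fstat.st_ino)

d = tempfile.mkdtemp()
db = os.path.join(d, "app.db")
open(db, "wb").write(b"v1")
fd = os.open(db, os.O_RDONLY)
before = same_file(db, fd)      # True: the name still points at our inode
os.unlink(db)
open(db, "wb").write(b"v2")     # same name, new file, new inode
after = same_file(db, fd)       # False: the name was replaced underneath us
os.close(fd)
```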
The frequent fdatasync() calls from sqlite will sometimes cause journal flushes, which flush more than sqlite needs, take more time to commit and slow down the other metadata intensive workload. Ext4 is trying to address this issue without extending the API [4]. XFS was a bit better than ext4 at avoiding unneeded journal flushes, but those could still take place. Btrfs is generally better in this regard (fdatasync() effects are quite isolated to the file). So how can sqlite developers help to improve the situation? If you ask me, I would suggest providing benchmark results from mixed workloads, like the one I described above. If you can demonstrate the negative effects that frequent fdatasync() calls on a single sqlite db have on the system performance as a whole, then there is surely something that could be done to fix the problem. Thanks, Amir. [1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/ [2] https://lore.kernel.org/linux-fsdevel/20190527172655.9287-1-amir73il@gmail.com/ [3] https://lwn.net/Articles/788938/ [4] https://lore.kernel.org/linux-ext4/20191001074101.256523-1-harshadshirwadkar@gmail.com/ ^ permalink raw reply [flat|nested] 11+ messages in thread
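A hedged sketch of the kind of mixed-workload measurement Amir is asking for (Python; the file counts and workload shapes are arbitrary choices for illustration, and on tmpfs fdatasync() is nearly free, so only the harness shape is meaningful - the interesting numbers come from a journalled filesystem on real storage):

```python
import os
import tempfile
import threading
import time

def untar_like(dirpath, n):
    # Metadata-heavy background workload: many small file creations.
    for i in range(n):
        with open(os.path.join(dirpath, "f%04d" % i), "wb") as f:
            f.write(b"x" * 512)

workdir = tempfile.mkdtemp()
dbfd, dbpath = tempfile.mkstemp(dir=workdir)

t = threading.Thread(target=untar_like, args=(workdir, 500))
t.start()
latencies = []
for _ in range(50):                # 50 small "transaction commits"
    os.write(dbfd, b"txn-record")
    t0 = time.perf_counter()
    os.fdatasync(dbfd)             # may trigger a full journal flush on ext4/xfs
    latencies.append(time.perf_counter() - t0)
t.join()
os.close(dbfd)
worst = max(latencies)
```

Comparing the latency distribution with and without the background thread, across ext4, xfs and btrfs, would quantify the cross-workload interference described above.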
* Re: Questions about filesystems from SQLite author presentation 2020-01-06 15:40 ` Amir Goldstein @ 2020-01-06 16:42 ` Matthew Wilcox 2020-01-07 9:28 ` Sitsofe Wheeler 2020-01-06 18:31 ` Amir Goldstein 2020-01-07 9:16 ` Jan Kara 2 siblings, 1 reply; 11+ messages in thread From: Matthew Wilcox @ 2020-01-06 16:42 UTC (permalink / raw) To: Amir Goldstein Cc: Sitsofe Wheeler, linux-fsdevel, drh, Jan Kara, Dave Chinner, Theodore Tso, harshad shirwadkar On Mon, Jan 06, 2020 at 05:40:20PM +0200, Amir Goldstein wrote: > On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote: > > If a write occurs on one or two bytes of a file at about the same time as a power > > loss, are other bytes of the file guaranteed to be unchanged after reboot? > > Or might some other bytes within the same sector have been modified as well? > > I don't see how other bytes could change in this scenario, but I don't > know if the > hardware provides this guarantee. Maybe someone else knows the answer. The question is nonsense because there is no way to write less than one sector to a hardware device, by definition. So, treating this question as being a read-modify-write of a single sector (assuming the "two bytes" don't cross a sector boundary): Hardware vendors are reluctant to provide this guarantee, but it's essential to constructing a reliable storage system. We wrote the NVMe spec in such a way that vendors must provide single-sector-atomicity guarantees, and I hope they haven't managed to wiggle some nonsense into the spec that allows them to not make that guarantee. The below is a quote from the 1.4 spec. For those not versed in NVMe spec-ese, "0's based value" means that putting a zero in this field means the value of AWUPF is 1. Atomic Write Unit Power Fail (AWUPF): This field indicates the size of the write operation guaranteed to be written atomically to the NVM across all namespaces with any supported namespace format during a power fail or error condition. 
If a specific namespace guarantees a larger size than is reported in this field, then this namespace specific size is reported in the NAWUPF field in the Identify Namespace data structure. Refer to section 6.4. This field is specified in logical blocks and is a 0’s based value. The AWUPF value shall be less than or equal to the AWUN value. If a write command is submitted with size less than or equal to the AWUPF value, the host is guaranteed that the write is atomic to the NVM with respect to other read or write commands. If a write command is submitted that is greater than this size, there is no guarantee of command atomicity. If the write size is less than or equal to the AWUPF value and the write command fails, then subsequent read commands for the associated logical blocks shall return data from the previous successful write command. If a write command is submitted with size greater than the AWUPF value, then there is no guarantee of data returned on subsequent reads of the associated logical blocks. I take neither blame nor credit for what other storage standards may implement; this is the only one I had a hand in, and I had to fight hard to get it. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Questions about filesystems from SQLite author presentation 2020-01-06 16:42 ` Matthew Wilcox @ 2020-01-07 9:28 ` Sitsofe Wheeler 0 siblings, 0 replies; 11+ messages in thread From: Sitsofe Wheeler @ 2020-01-07 9:28 UTC (permalink / raw) To: Matthew Wilcox Cc: Amir Goldstein, linux-fsdevel, drh, Jan Kara, Dave Chinner, Theodore Tso, harshad shirwadkar On Mon, 6 Jan 2020 at 16:42, Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Jan 06, 2020 at 05:40:20PM +0200, Amir Goldstein wrote: > > On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote: > > > If a write occurs on one or two bytes of a file at about the same time as a power > > > loss, are other bytes of the file guaranteed to be unchanged after reboot? > > > Or might some other bytes within the same sector have been modified as well? > > > > I don't see how other bytes could change in this scenario, but I don't > > know if the > > hardware provides this guarantee. Maybe someone else knows the answer. > > The question is nonsense because there is no way to write less than one > sector to a hardware device, by definition. So, treating this question > as being a read-modify-write of a single sector (assuming the "two bytes" > don't cross a sector boundary): > > Hardware vendors are reluctant to provide this guarantee, but it's > essential to constructing a reliable storage system. We wrote the NVMe > spec in such a way that vendors must provide single-sector-atomicity > guarantees, and I hope they haven't managed to wiggle some nonsense > into the spec that allows them to not make that guarantee. The below > is a quote from the 1.4 spec. For those not versed in NVMe spec-ese, > "0's based value" means that putting a zero in this field means the > value of AWUPF is 1. Wow - that's the first time I've seen someone go on the record as saying a sector write is atomic (albeit only for NVMe disks) without having it instantly debated! 
Sadly there's no way of guaranteeing this atomicity from userspace if https://youtu.be/-oP2BOsMpdo?t=3557 (where Chris Mason(?) warns there can be corner cases trying to use O_DIRECT) is to be believed though? > I take neither blame nor credit for what other storage standards may > implement; this is the only one I had a hand in, and I had to fight > hard to get it. So there's no consensus for SATA/SCSI etc (https://stackoverflow.com/questions/2009063/are-disk-sector-writes-atomic )? Just need to wait until there's NVMe everywhere :-) -- Sitsofe | http://sucs.org/~sits/ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Questions about filesystems from SQLite author presentation 2020-01-06 15:40 ` Amir Goldstein 2020-01-06 16:42 ` Matthew Wilcox @ 2020-01-06 18:31 ` Amir Goldstein 2020-01-07 9:16 ` Jan Kara 2 siblings, 0 replies; 11+ messages in thread From: Amir Goldstein @ 2020-01-06 18:31 UTC (permalink / raw) To: Sitsofe Wheeler Cc: linux-fsdevel, drh, Jan Kara, Dave Chinner, Theodore Tso, harshad shirwadkar On Mon, Jan 6, 2020 at 5:40 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote: > > > > At Linux Plumbers 2019 Dr Richard Hipp presented a talk about SQLite > > (https://youtu.be/-oP2BOsMpdo?t=5525 ). One of the slides was titled > > "Things to discuss" > > (https://sqlite.org/lpc2019/doc/trunk/slides/sqlite-intro.html/#6 ) > > and had a few questions: > > > [...] > > > > However, there were even more questions in the briefing paper > > (https://sqlite.org/lpc2019/doc/trunk/briefing.md and search for '?') > > that couldn't be asked due to limited time. Does anyone know the > > answer to the extended questions and whether the the above is right > > deduction for the questions that were asked? > > > > As Jan said, there is a difference between the answer to "what is the > current behavior" and "what are filesystem developers willing to commit > as behavior that will remain the same in the future", but I will try to provide > some answers to your questions. > > > If a power loss occurs at about the same time that a file is being extended > > with new data, will the file be guaranteed to contain valid data after reboot, > > or might the extended area of the file contain all zeros or all ones or > > arbitrary content? In other words, is the file data always committed to disk > > ahead of the file size? > > While that statement would generally be true (ever since ext3 > journal=ordered...), Bah! scratch that. The statement is generally not true. 
Due to delayed allocation with xfs/ext4 you are much more likely to find the extended areas contain all zeroes. The only guarantee AFAIK is that with a truncate+extend sequence, you won't find the old data in the re-extended area. Thanks, Amir. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Questions about filesystems from SQLite author presentation 2020-01-06 15:40 ` Amir Goldstein 2020-01-06 16:42 ` Matthew Wilcox 2020-01-06 18:31 ` Amir Goldstein @ 2020-01-07 9:16 ` Jan Kara 2 siblings, 0 replies; 11+ messages in thread From: Jan Kara @ 2020-01-07 9:16 UTC (permalink / raw) To: Amir Goldstein Cc: Sitsofe Wheeler, linux-fsdevel, drh, Jan Kara, Dave Chinner, Theodore Tso, harshad shirwadkar On Mon 06-01-20 17:40:20, Amir Goldstein wrote: > On Mon, Jan 6, 2020 at 9:26 AM Sitsofe Wheeler <sitsofe@gmail.com> wrote: > > If a power loss occurs at about the same time that a file is being extended > > with new data, will the file be guaranteed to contain valid data after reboot, > > or might the extended area of the file contain all zeros or all ones or > > arbitrary content? In other words, is the file data always committed to disk > > ahead of the file size? > > While that statement would generally be true (ever since ext3 > journal=ordered...), > you have no such guarantee. Getting such a guarantee would require a new API > like O_ATOMIC. This is not quite true. 1) The rule you can rely on is: No random data in a file. So after a power failure the state of the file can be: a) original file state b) file size increased (possibly only partially), each block in the extended area contains either correct data or zeros. There are exceptions to this for filesystems that don't maintain metadata consistency on crash such as ext2, vfat, udf, or ext4 in data=writeback mode. There the outcome after a crash is undefined... > > If a write occurs on one or two bytes of a file at about the same time as a power > > loss, are other bytes of the file guaranteed to be unchanged after reboot? > > Or might some other bytes within the same sector have been modified as well? > > I don't see how other bytes could change in this scenario, but I don't > know if the hardware provides this guarantee. Maybe someone else knows > the answer. 
As Matthew wrote, this boils down to whether the HW provides sector write atomicity. Practically that seems to be the case. > > Is it possible (or helpful) to tell the filesystem that the content of a particular file > > does not need to survive reboot? > > Not that I know of. > > > Is it possible (or helpful) to tell the filesystem that a particular file can be > > unlinked upon reboot? > > Not that I know of. Well, you could create the file with the O_TMPFILE flag. That will give you an unlinked inode which will just get deleted once the file is closed (and also on reboot). If you don't want to keep the file open all the time you use it, then I don't know of a way. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 11+ messages in thread
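Jan's O_TMPFILE suggestion in sketch form (Python on Linux; the flag also needs filesystem support, so the snippet degrades to reporting "unsupported" rather than failing outright):

```python
import os

O_TMPFILE = getattr(os, "O_TMPFILE", 0)  # Linux-only; 0 forces the except path elsewhere
try:
    # Open an unnamed inode inside /tmp: it never gets a directory
    # entry, so it cannot survive a reboot (or even the close below).
    fd = os.open("/tmp", os.O_RDWR | O_TMPFILE, 0o600)
    os.write(fd, b"scratch data")
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 64)
    os.close(fd)       # inode freed here - nothing to clean up
    supported = True
except OSError:        # e.g. EISDIR or EOPNOTSUPP without O_TMPFILE support
    supported = False
    data = None
```

Because the inode never has a name, there is nothing to unlink on reboot; if the file later turns out to be worth keeping, linkat(2) on /proc/self/fd/N can give it one.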
end of thread, other threads:[~2020-01-07 17:18 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-01-06 7:24 Questions about filesystems from SQLite author presentation Sitsofe Wheeler 2020-01-06 10:15 ` Dave Chinner 2020-01-07 8:40 ` Sitsofe Wheeler 2020-01-07 8:55 ` Jan Kara 2020-01-07 17:18 ` Darrick J. Wong 2020-01-07 8:47 ` Jan Kara 2020-01-06 15:40 ` Amir Goldstein 2020-01-06 16:42 ` Matthew Wilcox 2020-01-07 9:28 ` Sitsofe Wheeler 2020-01-06 18:31 ` Amir Goldstein 2020-01-07 9:16 ` Jan Kara