* [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Allison Collins
Date: 2020-02-13 22:33 UTC
To: lsf-pc
Cc: xfs, linux-fsdevel

Hi all,

I know there's a lot of discussion on the list right now, but I'd like
to get this out before too much time gets away. I would like to propose
the topic of atomic writes. I realize the topic has been discussed
before, but I have not found much recent activity on it, so perhaps we
can revisit it. We have a customer who may have an interest, so I would
like to discuss the current state of things and how we can move
forward: whether efforts are still in progress, and if not, what we
have learned from past attempts.

I also understand there are multiple ways to solve this problem, and
people may have opinions on them. I've noticed some older patch sets
trying to use a flag to control when dirty pages are flushed, though I
think our customer would prefer a hardware solution via NVMe devices.
So I would like to see whether others have similar interests, and what
their thoughts may be.

Thanks everyone!
Allison
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Darrick J. Wong
Date: 2020-02-14 4:42 UTC
To: Allison Collins
Cc: lsf-pc, xfs, linux-fsdevel

On Thu, Feb 13, 2020 at 03:33:08PM -0700, Allison Collins wrote:
> I also understand there are multiple ways to solve this problem, and
> people may have opinions on them. I've noticed some older patch sets
> trying to use a flag to control when dirty pages are flushed, though I
> think our customer would prefer a hardware solution via NVMe devices.
> So I would like to see whether others have similar interests, and what
> their thoughts may be.

Hmmm, well, there are a number of different ways one could do this:

1) Userspace allocates an O_TMPFILE file, clones all the file data to
it, makes whatever changes it wants (thus invoking COW writes), and
then calls some ioctl to swap the differing extent maps atomically (a
sketch of this flow follows below). For XFS we have most of those
pieces, but we'd have to add a log intent item to track the progress of
the remap so that we can complete the remap if the system goes down.
This has potentially the best flexibility (multiple processes can
coordinate to stage multiple updates to non-overlapping ranges of the
file) but is also a nice foot bazooka.

2) Set O_ATOMIC on the file, ensure that all writes are staged via COW,
and defer the COW remap step until we hit the synchronization point.
When that happens, we persist the new mappings somewhere (e.g. well
beyond all possible EOF in the XFS case) and then start an atomic remap
operation to move the new blocks into place in the file. (XFS would
still have to add a new log intent item here to finish the remapping if
the system goes down.) Less of a foot bazooka, but it leaves lingering
questions, like: what do you do if multiple processes want to run their
own atomic updates?

(Note that I think you need some sort of higher-level progress tracking
of the remap operation, because we can't leave a torn write behind just
because the computer crashed.)

3) A magic pwritev2 API that lets userspace talk directly to hardware
atomic writes, though I don't know how userspace discovers what the
hardware limits are. I'm assuming the usual sysfs knobs?

Note that #1 and #2 are done entirely in software, which makes them
less performant, but OTOH there's effectively no limit (besides
available physical space) on how much data or how many non-contiguous
extents we can stage and commit.

--D
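For concreteness, a minimal C sketch of option #1's userspace flow.
O_TMPFILE and the FICLONE ioctl exist today; the final extent-swap
ioctl does not yet, so its name and value below are purely
hypothetical:

/*
 * Sketch of option #1: stage changes in an O_TMPFILE clone, then ask
 * the filesystem to swap the differing extents back atomically.
 * O_TMPFILE and FICLONE are real interfaces; the extent-swap ioctl is
 * the missing piece, so its name, value, and argument are invented
 * here purely for illustration.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* FICLONE */

#define XFS_IOC_ATOMIC_SWAPEXT 0	/* hypothetical placeholder */

int atomic_file_update(int src_fd, const char *dir,
		       const void *buf, size_t len, off_t off)
{
	/* Unlinked temp file on the same filesystem as the source. */
	int tmp_fd = open(dir, O_TMPFILE | O_RDWR, 0600);
	if (tmp_fd < 0)
		return -1;

	/* Share all of the source file's blocks with the temp file. */
	if (ioctl(tmp_fd, FICLONE, src_fd) < 0)
		goto fail;

	/* Modify the clone; every write COWs the extents it touches. */
	if (pwrite(tmp_fd, buf, len, off) != (ssize_t)len)
		goto fail;

	/* Atomically swap only the extents that now differ back into
	 * the source file (hypothetical ioctl, per option #1 above). */
	if (ioctl(src_fd, XFS_IOC_ATOMIC_SWAPEXT, &tmp_fd) < 0)
		goto fail;

	close(tmp_fd);
	return 0;
fail:
	close(tmp_fd);
	return -1;
}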
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Matthew Wilcox
Date: 2020-02-15 19:53 UTC
To: Darrick J. Wong
Cc: Allison Collins, lsf-pc, xfs, linux-fsdevel

On Thu, Feb 13, 2020 at 08:42:42PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 13, 2020 at 03:33:08PM -0700, Allison Collins wrote:
> > I also understand there are multiple ways to solve this problem, and
> > people may have opinions on them. [...]
>
> Hmmm, well, there are a number of different ways one could do this:

Interesting. Your answer implies a question of "How do we expose a
filesystem's ability to do atomic writes to userspace", whereas I
thought Allison's question was "What spec do we write to give to the
NVMe vendors so that filesystems can optimise their atomic writes".

I am very interested in the question of atomic writes, but I don't know
that we're going to have the right people in the room to design a
userspace API. Maybe this is more of a Plumbers topic? I think the two
main users of a userspace API would be databases (sqlite, mysql,
postgres, others) and package managers (dpkg, rpm, others?). Then there
would be the miscellaneous users who just want things to work and don't
really care about performance (writing a game's high-score file,
updating /etc/sudoers).

That might argue in favour of having two independent APIs: one that's
simple, probably quite slow, but safe, and one that's complex, fast and
safe. There's also an option for simple, fast and unsafe, but, y'know,
we already have that ...

Your response also implies that atomic writes are only done to a single
file at a time, which isn't true for either databases or package
managers. I wonder if the snapshot/reflink paradigm is the right one
for multi-file atomic updates, or if we can use the same underlying
mechanism to implement an API that better fits how userspace actually
wants to do atomic updates.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Dave Chinner
Date: 2020-02-16 21:41 UTC
To: Matthew Wilcox
Cc: Darrick J. Wong, Allison Collins, lsf-pc, xfs, linux-fsdevel

On Sat, Feb 15, 2020 at 11:53:07AM -0800, Matthew Wilcox wrote:
> Interesting. Your answer implies a question of "How do we expose a
> filesystem's ability to do atomic writes to userspace", whereas I
> thought Allison's question was "What spec do we write to give to the
> NVMe vendors so that filesystems can optimise their atomic writes".

Well, hardware offload from a filesystem perspective already has one
easy userspace API: RWF_ATOMIC using direct IO. We already do "hardware
offload" of persistence for pure overwrites (RWF_DSYNC -> REQ_FUA
write) so we can avoid a device cache flush in this case. I suspect
that we could do a similar thing at the filesystem level: pure atomic
overwrites only require that no metadata is being modified for the
write, similar to the REQ_FUA optimisation. The difference is that
REQ_ATOMIC would currently fail if it can't be offloaded (hence the
need for a software atomic overwrite fallback), and we'd need
REQ_ATOMIC plumbed through the block layer and drivers...

> I am very interested in the question of atomic writes, but I don't know
> that we're going to have the right people in the room to design a
> userspace API. Maybe this is more of a Plumbers topic? I think the two
> main users of a userspace API would be databases (sqlite, mysql,
> postgres, others) and package managers (dpkg, rpm, others?). Then there
> would be the miscellaneous users who just want things to work and don't
> really care about performance (writing a game's high-score file,
> updating /etc/sudoers).

I'm not sure we need a new userspace API: RWF_ATOMIC gives userspace
exactly what is needed to define exact atomic write boundaries...

However, the real difficulty with atomic writes is buffered IO, and I'm
still not sure that makes any sense. It requires the page cache to
track atomic write boundaries and the order in which the pages were
dirtied. It also requires writeback to flush pages in that order and as
single atomic IOs. There's an open question as to whether we can report
the result of an atomic write to userspace (i.e. the cached data)
before it has been written back successfully: is it a successful atomic
write if the data has only reached the page cache, and if so, can
userspace do anything useful with that information? In other words, you
can't use buffered atomic writes for integrity purposes, because you
can't control from userspace the order in which they reach disk.
Unless, of course, the page cache tracks *global* atomic write ordering
across all files and filesystems, and fsync() "syncs the world"...
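As a concrete sketch of that pure-overwrite case: direct IO plus a
single pwritev2() call. RWF_ATOMIC was not a mainline flag when this
was written, so the flag value below is an assumption:

/*
 * Sketch of a pure atomic overwrite via direct IO and pwritev2().
 * RWF_ATOMIC is the proposed per-write flag under discussion, not a
 * mainline interface at the time of this thread; the value here is an
 * assumption for illustration only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* assumed flag value */
#endif

int atomic_overwrite(const char *path, void *buf, size_t len, off_t off)
{
	/* Direct IO so the write bypasses the page cache entirely;
	 * buf, len, and off must meet the device's alignment
	 * constraints (e.g. buf allocated with posix_memalign()). */
	int fd = open(path, O_WRONLY | O_DIRECT);
	if (fd < 0)
		return -1;

	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/* Either the whole write reaches the media or none of it does;
	 * the call fails if that guarantee can't be honored. */
	ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
	close(fd);
	return ret == (ssize_t)len ? 0 : -1;
}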
> That might argue in favour of having two independent APIs: one that's
> simple, probably quite slow, but safe, and one that's complex, fast and
> safe. There's also an option for simple, fast and unsafe, but, y'know,
> we already have that ...
>
> Your response also implies that atomic writes are only done to a single
> file at a time, which isn't true for either databases or package
> managers. I wonder if the snapshot/reflink paradigm is the right one
> for multi-file atomic updates, or if we can use the same underlying
> mechanism to implement an API that better fits how userspace actually
> wants to do atomic updates.

A reflink mechanism would allow concurrent, independent atomic writes
to independent files (because reflink is per-file). If implemented
correctly, a reflink mechanism would also allow multiple concurrent
ordered atomic writes to a single file. But globally ordered atomic
writes in the kernel? It is far simpler to let userspace use direct IO
with RWF_ATOMIC and do cross-file ordering itself, based on IO
completion notifications....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
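A sketch of that completion-ordered scheme: the second atomic write is
not submitted until the first one's completion has been reaped.
liburing is real and was available at the time; RWF_ATOMIC is the same
assumed flag as in the earlier sketch:

/*
 * Cross-file ordering done entirely in userspace: submit the write to
 * file A, wait for its completion notification, and only then submit
 * the write to file B. Uses liburing; RWF_ATOMIC is assumed, as noted
 * above.
 */
#include <liburing.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* assumed flag value */
#endif

static int submit_and_wait(struct io_uring *ring, int fd,
			   struct iovec *iov, off_t off)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;

	if (!sqe)
		return -1;
	io_uring_prep_writev(sqe, fd, iov, 1, off);
	sqe->rw_flags = RWF_ATOMIC;	/* per-write atomic semantics */
	io_uring_submit(ring);

	/* Block until this write's completion arrives. */
	if (io_uring_wait_cqe(ring, &cqe) < 0 || cqe->res < 0)
		return -1;
	io_uring_cqe_seen(ring, cqe);
	return 0;
}

int ordered_atomic_writes(int fd_a, struct iovec *iov_a,
			  int fd_b, struct iovec *iov_b)
{
	struct io_uring ring;
	int ret = -1;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return -1;

	/* The write to B is not even issued until A's completion has
	 * been seen, enforcing cross-file ordering from userspace. */
	if (submit_and_wait(&ring, fd_a, iov_a, 0) == 0 &&
	    submit_and_wait(&ring, fd_b, iov_b, 0) == 0)
		ret = 0;

	io_uring_queue_exit(&ring);
	return ret;
}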
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Steve French
Date: 2020-02-20 21:30 UTC
To: Darrick J. Wong
Cc: Allison Collins, lsf-pc, xfs, linux-fsdevel

The idea of using O_TMPFILE is interesting, but opening an O_TMPFILE is
awkward for network file systems, because it is not an atomic operation
there either (it becomes a create/close followed by an open).

On Thu, Feb 13, 2020 at 10:43 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> 1) Userspace allocates an O_TMPFILE file, clones all the file data to
> it, makes whatever changes it wants (thus invoking COW writes), and
> then calls some ioctl to swap the differing extent maps atomically.
> [...]

-- 
Thanks,

Steve