* [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Allison Collins @ 2020-02-13 22:33 UTC
  To: lsf-pc; +Cc: xfs, linux-fsdevel

Hi all,

I know there's a lot of discussion on the list right now, but I'd like 
to get this out before too much time gets away.  I would like to propose 
the topic of atomic writes.  I realize the topic has been discussed 
before, but I have not found much activity for it recently so perhaps we 
can revisit it.  We do have a customer who may have an interest, so I 
would like to discuss the current state of things, and how we can move 
forward.  Are efforts still in progress, and if not, what have we 
learned from past attempts?

I also understand there are multiple ways to solve this problem that 
people may have opinions on.  I've noticed some older patch sets trying 
to use a flag to control when dirty pages are flushed, though I think 
our customer would like to see a hardware solution via NVMe devices.  So 
I would like to see if others have similar interests as well and what 
their thoughts may be.  Thanks everyone!

Allison


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Darrick J. Wong @ 2020-02-14  4:42 UTC
  To: Allison Collins; +Cc: lsf-pc, xfs, linux-fsdevel

On Thu, Feb 13, 2020 at 03:33:08PM -0700, Allison Collins wrote:
> Hi all,
> 
> I know there's a lot of discussion on the list right now, but I'd like to
> get this out before too much time gets away.  I would like to propose the
> topic of atomic writes.  I realize the topic has been discussed before, but
> I have not found much activity for it recently so perhaps we can revisit it.
> We do have a customer who may have an interest, so I would like to discuss
> the current state of things, and how we can move forward.  Are efforts still
> in progress, and if not, what have we learned from past attempts?
> 
> I also understand there are multiple ways to solve this problem that people
> may have opinions on.  I've noticed some older patch sets trying to use a
> flag to control when dirty pages are flushed, though I think our customer
> would like to see a hardware solution via NVMe devices.  So I would like to
> see if others have similar interests as well and what their thoughts may be.
> Thanks everyone!

Hmmm well there are a number of different ways one could do this--

1) Userspace allocates an O_TMPFILE file, clones all the file data to
it, makes whatever changes it wants (thus invoking COW writes), and then
calls some ioctl to swap the differing extent maps atomically.  For XFS
we have most of those pieces, but we'd have to add a log intent item to
track the progress of the remap so that we can complete the remap if the
system goes down.  This has potentially the best flexibility (multiple
processes can coordinate to stage multiple updates to non-overlapping
ranges of the file) but is also a nice foot bazooka.
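
Roughly, the userspace side could look like this sketch (FICLONE is
real; the swap ioctl at the end is made up, since that's exactly the
piece we'd have to add):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */

/* Hypothetical ioctl: atomically swap the extents that differ. */
#define XFS_IOC_SWAPEXT_ATOMIC	_IOW('X', 129, int)

int atomic_replace(int fd, const char *dir)
{
	int ret = -1;

	/* Unlinked temp file; dir must be on the same fs as fd. */
	int tmpfd = open(dir, O_TMPFILE | O_RDWR, 0600);
	if (tmpfd < 0)
		return -1;

	/* Share every extent of the original file... */
	if (ioctl(tmpfd, FICLONE, fd) < 0)
		goto out;

	/* ...so writes to tmpfd COW only the blocks we touch. */
	if (pwrite(tmpfd, "new", 3, 0) != 3)
		goto out;

	/* Commit: swap the differing extent maps in one transaction. */
	ret = ioctl(fd, XFS_IOC_SWAPEXT_ATOMIC, tmpfd);
out:
	close(tmpfd);
	return ret;
}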

2) Set O_ATOMIC on the file, ensure that all writes are staged via COW,
and defer the COW remap step until we hit the synchronization point.
When that happens, we persist the new mappings somewhere (e.g. well
beyond all possible EOF in the XFS case) and then start an atomic remap
operation to move the new blocks into place in the file.  (XFS would
still have to add a new log intent item here to finish the remapping if
the system goes down.)  Less foot bazooka but leaves lingering questions
like what do you do if multiple processes want to run their own atomic
updates?
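
From userspace that flow might be as simple as this sketch (O_ATOMIC
is hypothetical, and I'm assuming fsync() is the synchronization
point):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#ifndef O_ATOMIC
#define O_ATOMIC	040000000	/* hypothetical flag value */
#endif

int main(void)
{
	int fd = open("datafile", O_RDWR | O_ATOMIC);
	if (fd < 0)
		return 1;

	/* Both writes are staged as COW and stay invisible on disk... */
	pwrite(fd, "aaaa", 4, 0);
	pwrite(fd, "bbbb", 4, 65536);

	/* ...until the remap commits at the sync point: all or nothing. */
	fsync(fd);
	return close(fd);
}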

(Note that I think you need some sort of higher level progress tracking
of the remap operation, because we can't leave a torn write behind just
because the computer crashed.)

3) Magic pwritev2 API that lets userspace drive hardware atomic writes
directly, though I don't know how userspace discovers what the hardware
limits are.  I'm assuming the usual sysfs knobs?
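
Something like this, with completely made-up knob names:

#include <stdio.h>

int main(void)
{
	unsigned long max_bytes = 0;

	/* Hypothetical sysfs attribute for the device's atomic limit. */
	FILE *f = fopen("/sys/block/nvme0n1/queue/atomic_write_max_bytes",
			"r");

	if (f && fscanf(f, "%lu", &max_bytes) == 1)
		printf("atomic write limit: %lu bytes\n", max_bytes);
	if (f)
		fclose(f);
	return 0;
}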

Note that #1 and #2 are done entirely in software, which makes them less
performant but OTOH there's effectively no limit (besides available
physical space) on how much data or how many non-contiguous extents we
can stage and commit.

--D

> Allison


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Matthew Wilcox @ 2020-02-15 19:53 UTC
  To: Darrick J. Wong; +Cc: Allison Collins, lsf-pc, xfs, linux-fsdevel

On Thu, Feb 13, 2020 at 08:42:42PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 13, 2020 at 03:33:08PM -0700, Allison Collins wrote:
> > I also understand there are multiple ways to solve this problem that people
> > may have opinions on.  I've noticed some older patch sets trying to use a
> > flag to control when dirty pages are flushed, though I think our customer
> > would like to see a hardware solution via NVMe devices.  So I would like to
> > see if others have similar interests as well and what their thoughts may be.
> > Thanks everyone!
> 
> Hmmm well there are a number of different ways one could do this--

Interesting.  Your answer implies a question of "How do we expose
a filesystem's ability to do atomic writes to userspace", whereas I
thought Allison's question was "What spec do we write to give to the
NVMe vendors so that filesystems can optimise their atomic writes".

I am very interested in the question of atomic writes, but I don't
know that we're going to have the right people in the room to design
a userspace API.  Maybe this is more of a Plumbers topic?  I think
the two main users of a userspace API would be databases (sqlite,
mysql, postgres, others) and package managers (dpkg, rpm, others?).
Then there would be the miscellaneous users who just want things to work
and don't really care about performance (writing a game's high score file,
updating /etc/sudoers).

That might argue in favour of having two independent APIs, one that's
simple, probably quite slow, but safe, and one that's complex, fast
and safe.  There's also an option for simple, fast and unsafe, but,
y'know, we already have that ...

Your response also implies that atomic writes are only done to a single
file at a time, which isn't true for either databases or for package
managers.  I wonder if the snapshot/reflink paradigm is the right one
for multi-file atomic updates, or if we can use the same underlying
mechanism to implement an API which better fits how userspace actually
wants to do atomic updates.



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Dave Chinner @ 2020-02-16 21:41 UTC
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Allison Collins, lsf-pc, xfs, linux-fsdevel

On Sat, Feb 15, 2020 at 11:53:07AM -0800, Matthew Wilcox wrote:
> On Thu, Feb 13, 2020 at 08:42:42PM -0800, Darrick J. Wong wrote:
> > Hmmm well there are a number of different ways one could do this--
> 
> Interesting.  Your answer implies a question of "How do we expose
> a filesystem's ability to do atomic writes to userspace", whereas I
> thought Allison's question was "What spec do we write to give to the
> NVMe vendors so that filesystems can optimise their atomic writes".

Well, hardware offload from a filesystem perspective already has one
easy userspace API: RWF_ATOMIC using direct IO. We already do
"hardware offload" of persistence for pure overwrites (RWF_DSYNC ->
REQ_FUA write) so we can avoid a device cache flush in this case.

I suspect that we could do a similar thing at the filesystem level -
pure atomic overwrites only require that no metadata is being
modified for the write, similar to the REQ_FUA optimisation. The
difference being that REQ_ATOMIC would currently fail if it can't be
offloaded (hence the need for a software atomic overwrite), and we'd
need REQ_ATOMIC plumbed through the block layer and drivers...

> I am very interested in the question of atomic writes, but I don't
> know that we're going to have the right people in the room to design
> a userspace API.  Maybe this is more of a Plumbers topic?  I think
> the two main users of a userspace API would be databases (sqlite,
> mysql, postgres, others) and package managers (dpkg, rpm, others?).
> Then there would be the miscellaneous users who just want things to work
> and don't really care about performance (writing a game's high score file,
> updating /etc/sudoers).

I'm not sure we need a new userspace API: RWF_ATOMIC gives userspace
exactly what is needed to define exact atomic write boundaries...
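
i.e. the call site is just this (sketch; assuming RWF_ATOMIC lands in
the uapi headers with roughly these semantics):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* hypothetical flag value */
#endif

int main(void)
{
	void *buf;
	int fd = open("datafile", O_RDWR | O_DIRECT);

	/* Direct IO wants an aligned buffer. */
	if (fd < 0 || posix_memalign(&buf, 4096, 16384))
		return 1;
	memset(buf, 'x', 16384);

	struct iovec iov = { .iov_base = buf, .iov_len = 16384 };

	/* One atomic unit: all 16k lands on stable storage or none of
	 * it does, and the write fails up front if the fs/device can't
	 * honour that guarantee. */
	return pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) == 16384 ? 0 : 1;
}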

However, the difficulty with atomic writes is buffered IO, and I'm
still not sure that makes any sense. This requires the page cache to
track atomic write boundaries and the order in which the pages were
dirtied. It also requires writeback to flush pages in that order and
as single atomic IOs.

There's an open question as to whether we can report the results of
the atomic write to userspace (i.e. the cached data) before it has
been written back successfully: is it a successful atomic write if
the write has only reached the page cache, and if so, can userspace do
anything useful with that information?  i.e. you can't use buffered
atomic writes for integrity purposes because you can't control the
order they go to disk in from userspace. Unless, of course, the page
cache is tracking *global* atomic write ordering across all files
and filesystems and fsync() "syncs the world"...

> That might argue in favour of having two independent APIs, one that's
> simple, probably quite slow, but safe, and one that's complex, fast
> and safe.  There's also an option for simple, fast and unsafe, but,
> y'know, we already have that ...
> 
> Your response also implies that atomic writes are only done to a single
> file at a time, which isn't true for either databases or for package
> managers.  I wonder if the snapshot/reflink paradigm is the right one
> for multi-file atomic updates, or if we can use the same underlying
> mechanism to implement an API which better fits how userspace actually
> wants to do atomic updates.

A reflink mechanism would allow concurrent independent atomic writes
to independent files (because reflink is per-file). If implemented
correctly, a reflink mechanism would also allow multiple concurrent
ordered atomic writes to a single file. But to do globally ordered
atomic writes in the kernel? Far simpler just to let userspace use
direct IO, RWF_ATOMIC and do cross-file ordering based on IO
completion notifications....
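
i.e. the classic journal-then-data pattern reduces to program order,
because a synchronous direct IO write has completed by the time the
syscall returns (sketch; both fds opened O_DIRECT, same hypothetical
RWF_ATOMIC as above):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* hypothetical flag value */
#endif

int ordered_update(int logfd, int datafd, struct iovec *log_iov,
		   struct iovec *data_iov, off_t log_off, off_t data_off)
{
	/* Don't touch the data file until the journal write completes. */
	if (pwritev2(logfd, log_iov, 1, log_off, RWF_ATOMIC) < 0)
		return -1;
	return pwritev2(datafd, data_iov, 1, data_off, RWF_ATOMIC) < 0
		? -1 : 0;
}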

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Atomic Writes
From: Steve French @ 2020-02-20 21:30 UTC
  To: Darrick J. Wong; +Cc: Allison Collins, lsf-pc, xfs, linux-fsdevel

The idea of using O_TMPFILE is interesting ... but opening an
O_TMPFILE is awkward for network file systems because it is not an
atomic operation either ... (create/close then open)

On Thu, Feb 13, 2020 at 10:43 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> Hmmm well there are a number of different ways one could do this--
>
> 1) Userspace allocates an O_TMPFILE file, clones all the file data to
> it, makes whatever changes it wants (thus invoking COW writes), and then
> calls some ioctl to swap the differing extent maps atomically.  For XFS
> we have most of those pieces, but we'd have to add a log intent item to
> track the progress of the remap so that we can complete the remap if the
> system goes down.  This has potentially the best flexibility (multiple
> processes can coordinate to stage multiple updates to non-overlapping
> ranges of the file) but is also a nice foot bazooka.
>
> [...]



-- 
Thanks,

Steve

