* write atomicity with xfs ... current status?
@ 2020-03-16 20:59 Ober, Frank
  2020-03-16 21:59 ` Darrick J. Wong
  2020-03-17 19:19 ` Christoph Hellwig
  0 siblings, 2 replies; 10+ messages in thread
From: Ober, Frank @ 2020-03-16 20:59 UTC (permalink / raw)
  To: linux-xfs

Hi, Intel is looking into whether it makes sense to take an existing, popular filesystem and patch it for write atomicity at the sector-count level, meaning we would protect a configured number of sectors using parameters that each layer in the kernel would synchronize on. We could use parameters for this that come from the NVMe specification, such as awun or awupf, set across the (affected) layers up to a user space program such as InnoDB/MySQL, which would benefit, as would other software. The MySQL target is a strong use case, as its InnoDB has a double write buffer that could be removed if write atomicity were protected at 16KiB for the file opens and with fsync(). 

My question is: why hasn't xfs write atomicity advanced further? It seems this was tried in the 3.x kernel era a few years ago, but nothing was committed, as documented here:
               http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC

Is xfs write atomicity still being pursued, and with what design objective? There is a long thread on write atomicity here, https://lwn.net/Articles/789600/, with lots of ideas but, as far as I can tell, no progress.

Is my design idea above simply too simplistic (trying to protect a configured block size, i.e. sector count, through the filesystem and block layers), and what is really making it unattainable?

Thanks for the feedback
Frank Ober
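For concreteness, the 16KiB unit in question is an InnoDB page write. A minimal sketch of that I/O pattern follows (a hypothetical helper, not InnoDB code; real InnoDB opens its tablespaces with O_DIRECT, and today it needs the doublewrite buffer precisely because this single pwrite() plus fdatasync() is not atomic on most hardware):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define INNODB_PAGE_SIZE 16384

/* Write one 16KiB page at a page-aligned offset and make it durable.
 * Returns 0 on success, -1 on a short write or fdatasync failure. */
int write_page(int fd, const void *page, off_t page_no)
{
	off_t off = page_no * (off_t)INNODB_PAGE_SIZE;

	if (pwrite(fd, page, INNODB_PAGE_SIZE, off) != INNODB_PAGE_SIZE)
		return -1;
	return fdatasync(fd);	/* data durable before we report success */
}
```

If the device and kernel guaranteed that this single 16KiB write either lands whole or not at all, the doublewrite buffer (and its second copy of every page) could go away.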

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: write atomicity with xfs ... current status?
  2020-03-16 20:59 write atomicity with xfs ... current status? Ober, Frank
@ 2020-03-16 21:59 ` Darrick J. Wong
  2020-03-16 23:32   ` Dave Chinner
  2020-03-17 19:19 ` Christoph Hellwig
  1 sibling, 1 reply; 10+ messages in thread
From: Darrick J. Wong @ 2020-03-16 21:59 UTC (permalink / raw)
  To: Ober, Frank; +Cc: linux-xfs

On Mon, Mar 16, 2020 at 08:59:54PM +0000, Ober, Frank wrote:
> Hi, Intel is looking into whether it makes sense to take an existing,
> popular filesystem and patch it for write atomicity at the
> sector-count level, meaning we would protect a configured number of
> sectors using parameters that each layer in the kernel would
> synchronize on. We could use parameters for this that come from the
> NVMe specification, such as awun or awupf

<gesundheit>

Oh, that was an acronym...

> set across the (affected)
> layers up to a user space program such as InnoDB/MySQL, which would
> benefit, as would other software. The MySQL target is a strong use
> case, as its InnoDB has a double write buffer that could be removed
> if write atomicity were protected at 16KiB for the file opens and
> with fsync(). 

We probably need a better elaboration of the exact use cases of atomic
writes since I haven't been to LSF in a couple of years (and probably
not this year either).  I can think of a couple of access modes off the
top of my head:

1) atomic directio write where either you stay under the hardware atomic
write limit and we use it, or...

2) software atomic writes where we use the xfs copy-on-write mechanism
to stage the new blocks and later map them back into the inode, where
"later" is either an explicit fsync or an O_SYNC write or something...

3) ...or a totally separate interface where userspace does something
along the lines of:

	write_fd = stage_writes(fd);

which creates an O_TMPFILE and reflinks all of fd's content to it

	write(write_fd...);

	err = commit_writes(write_fd, fd);

which then uses extent remapping to push all the changed blocks back to
the original file if it hasn't changed.  Bonus: other threads don't see
the new data until commit_writes() finishes, and we can introduce new
log items to make sure that once we start committing we can finish it
even if the system goes down.

> My question is: why hasn't xfs write atomicity advanced further? It
> seems this was tried in the 3.x kernel era a few years ago, but
> nothing was committed, as documented here:
>
>                http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC
> 
> Is xfs write atomicity still being pursued, and with what design
> objective? There is a long thread on write atomicity here,
> https://lwn.net/Articles/789600/, with lots of ideas but, as far as
> I can tell, no progress.
> 
> Is my design idea above simply too simplistic (trying to protect a
> configured block size, i.e. sector count, through the filesystem
> and block layers), and what is really making it unattainable?

Lack of developer time, AFAICT.

--D

> Thanks for the feedback
> Frank Ober


* Re: write atomicity with xfs ... current status?
  2020-03-16 21:59 ` Darrick J. Wong
@ 2020-03-16 23:32   ` Dave Chinner
  2020-03-17 22:56     ` Ober, Frank
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2020-03-16 23:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Ober, Frank, linux-xfs

On Mon, Mar 16, 2020 at 02:59:13PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 16, 2020 at 08:59:54PM +0000, Ober, Frank wrote:
> > Hi, Intel is looking into whether it makes sense to take an existing,
> > popular filesystem and patch it for write atomicity at the
> > sector-count level, meaning we would protect a configured number of
> > sectors using parameters that each layer in the kernel would
> > synchronize on. We could use parameters for this that come from the
> > NVMe specification, such as awun or awupf
> 
> <gesundheit>
> 
> Oh, that was an acronym...
> 
> > set across the (affected)
> > layers up to a user space program such as InnoDB/MySQL, which would
> > benefit, as would other software. The MySQL target is a strong use
> > case, as its InnoDB has a double write buffer that could be removed
> > if write atomicity were protected at 16KiB for the file opens and
> > with fsync(). 
> 
> We probably need a better elaboration of the exact use cases of atomic
> writes since I haven't been to LSF in a couple of years (and probably
> not this year either).  I can think of a couple of access modes off the
> top of my head:
> 
> 1) atomic directio write where either you stay under the hardware atomic
> write limit and we use it, or...

We've plumbed RWF_DSYNC to use REQ_FUA IO for pure overwrites if the
hardware supports it. We can do exactly the same thing for
RWF_ATOMIC - it succeeds if:

- we can issue it as a single bio
- the lower layers can take the entire atomic bio without splitting
  it.
- we treat O_ATOMIC as O_DSYNC so that any metadata changes required
  also get synced to disk before signalling IO completion. If no
  metadata updates are required, then it's an open question as to
  whether REQ_FUA is also required with REQ_ATOMIC...

Anything else returns an "atomic write IO not possible" error.
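That acceptance test can be sketched as pure logic. Hedged: RWF_ATOMIC, the limits structure, and every name below are hypothetical; the real gate would live in the block layer's bio submission/splitting path:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device limits; nothing here is a real kernel API. */
struct atomic_limits {
	uint32_t lba_size;	/* logical block size */
	uint32_t atomic_max;	/* largest write the device completes atomically */
};

/* Accept only IOs that could be issued as one unsplittable bio:
 * whole logical blocks, within the device limit, and naturally
 * aligned (assuming power-of-two sizes) so no internal boundary is
 * crossed.  Anything failing this would get the "atomic write IO not
 * possible" error instead of being silently split. */
bool atomic_write_ok(const struct atomic_limits *l, uint64_t off, uint32_t len)
{
	if (len == 0 || len > l->atomic_max)
		return false;
	if (len & (len - 1))
		return false;			/* assume power-of-two atomic units */
	if (off % l->lba_size || len % l->lba_size)
		return false;			/* whole logical blocks only */
	return off % len == 0;			/* natural alignment */
}
```

The O_DSYNC/REQ_FUA side of the proposal is orthogonal to this check: it governs when completion is signalled, not whether the IO qualifies.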

> 2) software atomic writes where we use the xfs copy-on-write mechanism
> to stage the new blocks and later map them back into the inode, where
> "later" is either an explicit fsync or an O_SYNC write or something...

That's a possible fallback, but we can't guarantee that the write
will be atomic - partial write failure can still occur as page cache
writeback can be split into arbitrary IOs and transactions....

> 3) ...or a totally separate interface where userspace does something
> along the lines of:
> 
> 	write_fd = stage_writes(fd);
> 
> which creates an O_TMPFILE and reflinks all of fd's content to it
> 
> 	write(write_fd...);
> 
> 	err = commit_writes(write_fd, fd);
> 
> which then uses extent remapping to push all the changed blocks back to
> the original file if it hasn't changed.  Bonus: other threads don't see
> the new data until commit_writes() finishes, and we can introduce new
> log items to make sure that once we start committing we can finish it
> even if the system goes down.

Which is essentially userspace library code that runs multiple
syscalls to do the necessary work. commit_writes() is basically a
ranged swap-extents call. i.e.:

	write_fd = open(O_TMPFILE)
	clone_file_range(fd, write_fd, /* overwrite range */)
	loop (overwrite range) {
		write(write_fd)
	}
	fsync(write_fd)
	swap_extents(fd, write_fd, /* overwrite range */)
	fsync(fd)

i.e. this is basically the same process as a partial file defrag
operation. Hence I don't think the kernel needs to be involved in
the software emulation of atomic writes at all. IOWs, if the kernel
returns a "cannot do an atomic write" error to RWF_ATOMIC,
userspace can simply do the slow atomic overwrite as per above
without needing any special kernel code at all...
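The stage/write/fsync/commit sequence above can be sketched portably. Loud assumptions: clone_file_range() and a ranged swap_extents() need reflink-capable filesystem support, so this stand-in stages into a mkstemp() file and commits with an atomic rename() over the whole file. It illustrates the shape of the emulation, not the extent mechanics:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Replace the contents of path with buf[0..len) such that readers see
 * either the old contents or the new, never a torn write. */
int atomic_overwrite(const char *path, const void *buf, size_t len)
{
	char tmp[4096];

	snprintf(tmp, sizeof tmp, "%s.stage-XXXXXX", path);
	int wfd = mkstemp(tmp);			/* stand-in for stage_writes()/O_TMPFILE */
	if (wfd < 0)
		return -1;
	if (write(wfd, buf, len) != (ssize_t)len /* the overwrite itself */
	    || fsync(wfd) != 0			/* new data durable before commit */
	    || rename(tmp, path) != 0) {	/* stand-in for the ranged swap_extents() */
		close(wfd);
		unlink(tmp);
		return -1;
	}
	close(wfd);				/* a full version would also fsync the parent dir */
	return 0;
}
```

The reflink version is strictly better for partial-file overwrites (no full-file copy), but the control flow the application sees is the same.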

> > My question is: why hasn't xfs write atomicity advanced further? It
> > seems this was tried in the 3.x kernel era a few years ago, but
> > nothing was committed, as documented here:
> >
> >                http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC
> > 
> > Is xfs write atomicity still being pursued, and with what design
> > objective? There is a long thread on write atomicity here,
> > https://lwn.net/Articles/789600/, with lots of ideas but, as far
> > as I can tell, no progress.
> > 
> > Is my design idea above simply too simplistic (trying to protect a
> > configured block size, i.e. sector count, through the filesystem
> > and block layers), and what is really making it unattainable?
> 
> Lack of developer time, AFAICT.

There's multiple other things, I think:

1. no hardware that provides usable atomic write semantics.
2. no device or block layer support for atomic write IOs; we need
   IO level infrastructure before the filesystems can do anything
   useful
3. no support in page cache for tracking atomic write ranges, so
   atomic writes via buffered IO are rather difficult without using
   temporary files and extent swapping tricks...
4. emulation in userspace is easy if you have clone_file_range()
   support, even if it is slow. We aren't hearing from app
   developers emulating atomic writes for kernel side acceleration
   because it won't work on ext4.

Once we get 1 and 2, then we can support atomic direct IO writes
through XFS via RWF_ATOMIC with relative ease. Item 4 probably
requires some mods to XFS's swap_extent function to properly support
file ranges: the API supports ranges, but the implementation only
supports "full file range"...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: write atomicity with xfs ... current status?
  2020-03-16 20:59 write atomicity with xfs ... current status? Ober, Frank
  2020-03-16 21:59 ` Darrick J. Wong
@ 2020-03-17 19:19 ` Christoph Hellwig
  2020-03-17 22:55   ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2020-03-17 19:19 UTC (permalink / raw)
  To: Ober, Frank; +Cc: linux-xfs

Atomic writes are still waiting for more time to finish things off.

That being said, while I had a prototype that used the NVMe atomic
write size, I will never submit it to mainline in that particular
form.

NVMe does not have any flag to force atomic writes, so a too large
or misaligned write will be executed by the device without errors.
That kind of interface is way too fragile to be used in production.


* Re: write atomicity with xfs ... current status?
  2020-03-17 19:19 ` Christoph Hellwig
@ 2020-03-17 22:55   ` Dave Chinner
  2020-03-18  7:54     ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2020-03-17 22:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ober, Frank, linux-xfs

On Tue, Mar 17, 2020 at 12:19:54PM -0700, Christoph Hellwig wrote:
> Atomic writes are still waiting for more time to finish things off.
> 
> That being said, while I had a prototype that used the NVMe atomic
> write size, I will never submit it to mainline in that particular
> form.
> 
> NVMe does not have any flag to force atomic writes, so a too large
> or misaligned write will be executed by the device without errors.
> That kind of interface is way too fragile to be used in production.

I didn't realise that the NVMe standard had such a glaring flaw.
That basically makes atomic writes useless for anything that
actually requires atomicity. Has the standard been fixed yet? And
does this mean that hardware with usable atomic writes is still
years away?

/me is left to wonder how the NVMe standards process screwed this
up so badly....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: write atomicity with xfs ... current status?
  2020-03-16 23:32   ` Dave Chinner
@ 2020-03-17 22:56     ` Ober, Frank
  2020-03-18  2:27       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Ober, Frank @ 2020-03-17 22:56 UTC (permalink / raw)
  To: Dave Chinner, Darrick J. Wong, Dimitri
  Cc: linux-xfs, Barczak, Mariusz, Barajas, Felipe

Thanks Dave and Darrick, adding Dimitri Kravtchuk from Oracle to this thread.

If Intel produced an SSD that was atomic at just the block size level (as in using awun, the atomic write unit of the NVMe spec), would that mean we could do the following?

We've plumbed RWF_DSYNC to use REQ_FUA IO for pure overwrites if the hardware supports it. We can do exactly the same thing for RWF_ATOMIC - it succeeds if:

- we can issue it as a single bio
- the lower layers can take the entire atomic bio without splitting
  it.
- we treat O_ATOMIC as O_DSYNC so that any metadata changes required
  also get synced to disk before signalling IO completion. If no
  metadata updates are required, then it's an open question as to whether REQ_FUA is also required with REQ_ATOMIC...

Anything else returns an "atomic write IO not possible" error.

One design goal on the hw side is not to slow the SSD down; the firmware footprint is smaller in an Optane SSD and we don't want to slow that down. What's the fastest approach for something like InnoDB writes? Can we take small steps that produce value for direct IO and specific files, which is common in database architectures (even one table per file)? Streamlining one block size that can be tied to specific file opens seems valuable.

Is there some failure in this thinking?




* Re: write atomicity with xfs ... current status?
  2020-03-17 22:56     ` Ober, Frank
@ 2020-03-18  2:27       ` Dave Chinner
  2020-03-18  8:00         ` Christoph Hellwig
  2020-03-19  1:07         ` Ober, Frank
  0 siblings, 2 replies; 10+ messages in thread
From: Dave Chinner @ 2020-03-18  2:27 UTC (permalink / raw)
  To: Ober, Frank
  Cc: Darrick J. Wong, Dimitri, linux-xfs, Barczak, Mariusz, Barajas, Felipe

[ Hi Frank, your email program is really badly mangling quoting and
line wrapping. Can you see if you can get it to behave better for
us? I think I've fixed it below. ]

On Tue, Mar 17, 2020 at 10:56:53PM +0000, Ober, Frank wrote:
> Thanks Dave and Darrick, adding Dimitri Kravtchuk from Oracle to
> this thread.
> 
> If Intel produced an SSD that was atomic at just the block size
> level (as in using awun, the atomic write unit of the NVMe spec)

What is this "atomic block size" going to be, and how is it going to
be advertised to the block layer and filesystems?

> would that mean we could do the following?

> > We've plumbed RWF_DSYNC to use REQ_FUA IO for pure overwrites if
> > the hardware supports it. We can do exactly the same thing for
> > RWF_ATOMIC - it succeeds if:
> > 
> > - we can issue it as a single bio
> > - the lower layers can take the entire atomic bio without
> >   splitting it. 
> > - we treat O_ATOMIC as O_DSYNC so that any metadata changes
> >   required also get synced to disk before signalling IO
> >   completion. If no metadata updates are required, then it's an
> >   open question as to whether REQ_FUA is also required with
> >   REQ_ATOMIC...
> > 
> > Anything else returns an "atomic write IO not possible" error.

So, as I said, you're agreeing that an atomic write is essentially a
variant of a data integrity write, but with more strict size and
alignment requirements than a normal RWF_DSYNC write?

> One design goal on the hw side is not to slow the SSD down; the
> firmware footprint is smaller in an Optane SSD and we don't
> want to slow that down.

I really don't care what the impact on the SSD firmware size or
speed is - if the hardware can't guarantee atomic writes right down
to the physical media with full data integrity guarantees, and/or
doesn't advertise its atomic write limits to the OS and filesystem,
then it's simply not usable.

Please focus on correctness of behaviour first - speed is completely
irrelevant if we don't have correctness guarantees from the
hardware.

> What's the fastest approach for
> something like InnoDB writes? Can we take small steps that produce
> value for direct IO and specific files, which is common in database
> architectures (even one table per file)? Streamlining one block size
> that can be tied to specific file opens seems valuable.

Atomic writes have nothing to do with individual files. Either the
device under the filesystem can do atomic writes or it can't. What
files we do atomic writes to is irrelevant; what we need to know at
the filesystem level is the alignment and size restrictions on
atomic writes so we can allocate space appropriately and/or reject
user IO as out of bounds.

i.e. we already have size and alignment restrictions for direct IO
(typically single logical sector size). For atomic direct IO we will
have a different set of size and alignment restrictions, and like
the logical sector size, we need to get that from the hardware
somehow, and then make use of it in the filesystem appropriately.

Ideally the hardware would supply us with a minimum atomic IO size
and alignment and a maximum size. e.g. minimum might be the
physical sector size (we can always do atomic physical sector
size/aligned IOs) but the maximum is likely going to be some device
internal limit. If we require a minimum and maximum from the device
and the device only supports one atomic IO size, it can simply set
min = max.
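A sketch of what that handshake gives the filesystem to check against (all names hypothetical; note how a device with one fixed atomic size just reports min == max):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits the device would advertise up the stack. */
struct atomic_range {
	uint32_t min_bytes;	/* physical sector size: always atomically writable */
	uint32_t max_bytes;	/* device-internal upper limit */
};

/* Filesystem-side check: is this IO size one the device can commit
 * atomically?  Must be whole physical sectors within [min, max]. */
bool size_in_atomic_range(const struct atomic_range *r, uint32_t len)
{
	return len >= r->min_bytes && len <= r->max_bytes &&
	       len % r->min_bytes == 0;
}
```

Extent allocation would then additionally have to honour the device's alignment so that an accepted size also lands on an atomic boundary.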

Then it will be up to the filesystem to align extents to those
limits, and prevent user IOs that don't match the device
size/alignment restrictions placed on atomic writes...

But, first, you're going to need to get sane atomic write behaviour
standardised in the NVMe spec, yes? Otherwise nobody can use it
because we aren't guaranteed the same behaviour from device to
device...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: write atomicity with xfs ... current status?
  2020-03-17 22:55   ` Dave Chinner
@ 2020-03-18  7:54     ` Christoph Hellwig
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2020-03-18  7:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Ober, Frank, linux-xfs

On Wed, Mar 18, 2020 at 09:55:05AM +1100, Dave Chinner wrote:
> > That being said, while I had a prototype that used the NVMe atomic
> > write size, I will never submit it to mainline in that particular
> > form.
> > 
> > NVMe does not have any flag to force atomic writes, so a too large
> > or misaligned write will be executed by the device without errors.
> > That kind of interface is way too fragile to be used in production.
> 
> I didn't realise that the NVMe standard had such a glaring flaw.
> That basically makes atomic writes useless for anything that
> actually requires atomicity. Has the standard been fixed yet?

No.

> And
> does this mean that hardware with usable atomic writes is still
> years away?

At least for the hardware I'm familiar with, checking a flag and
failing the command if the conditions are not met might be a
relatively simple firmware fix.  It just needs a big enough customer
to ask for it, not just some Linux developers.


* Re: write atomicity with xfs ... current status?
  2020-03-18  2:27       ` Dave Chinner
@ 2020-03-18  8:00         ` Christoph Hellwig
  2020-03-19  1:07         ` Ober, Frank
  1 sibling, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2020-03-18  8:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ober, Frank, Darrick J. Wong, Dimitri, linux-xfs, Barczak,
	Mariusz, Barajas, Felipe

On Wed, Mar 18, 2020 at 01:27:19PM +1100, Dave Chinner wrote:
> What is this "atomic block size" going to be, and how is it going to
> be advertised to the block layer and filesystems?

Enterprise SSDs typically support a few k.  That being said, without
a scatter/gather write command that isn't very useful except for a
few DB log cases.  That is why filesystem-implemented atomic
semantics, which have been my primary interest, are a lot more
interesting.

The NVMe SSDs advertise this size in a really convoluted way, because
the limits can be global, per-namespace, and also have nasty offsets.

Take a look at Section 6.4 of the NVMe 1.4 spec:

https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

This is how I wired it up for my POC Linux series:

http://git.infradead.org/users/hch/xfs.git/commitdiff/66079e128d7fa452f45f8a4ffce1597157098dbe
http://git.infradead.org/users/hch/xfs.git/commitdiff/70dc57ff030bf3ce0f37678002ef36b5ab5ed42e
http://git.infradead.org/users/hch/xfs.git/commitdiff/b2f1a09c47b4404ef0d18aad576f4b2ca086a3e0
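For reference, the controller-level fields can be pulled straight out of a raw Identify Controller buffer (CNS 01h). A sketch under the NVMe 1.4 layout, where AWUN sits at bytes 527:526 and AWUPF at 529:528, both little-endian and 0's based (a raw 0 means one logical block); this deliberately ignores the namespace-level NAWUN/NAWUPF and the boundary offsets that make the real logic convoluted:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint16_t le16(const uint8_t *p)
{
	return p[0] | (uint16_t)p[1] << 8;	/* little-endian 16-bit field */
}

/* Return the atomic write units in logical blocks, converting from
 * the spec's 0's-based encoding (raw value + 1). */
uint32_t awun_blocks(const uint8_t id_ctrl[4096])
{
	return (uint32_t)le16(id_ctrl + 526) + 1;	/* AWUN, bytes 527:526 */
}

uint32_t awupf_blocks(const uint8_t id_ctrl[4096])
{
	return (uint32_t)le16(id_ctrl + 528) + 1;	/* AWUPF, bytes 529:528 */
}
```

On real hardware the buffer would come from an Identify admin command (what `nvme id-ctrl` issues under the hood); the buffer in the test below is synthetic.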


* RE: write atomicity with xfs ... current status?
  2020-03-18  2:27       ` Dave Chinner
  2020-03-18  8:00         ` Christoph Hellwig
@ 2020-03-19  1:07         ` Ober, Frank
  1 sibling, 0 replies; 10+ messages in thread
From: Ober, Frank @ 2020-03-19  1:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Dimitri, linux-xfs, Barczak, Mariusz, Barajas, Felipe

Dave, 
Is the NVMe 1.4 specification really broken? It provides boundaries as noted.
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

Check section 6.4, page 249. There are several ways to do this, and this ratified specification goes quite deep on atomic writes and describes what you are saying. I know you told me in another note that there is a glaring hole in the specification, but is that hole still present in 1.4?

The layers above the drive could leverage Identify Namespace, which the drive's controller could advertise to anyone looking for awun and awupf, but if this were required to be an offset, all we can provide is either 512B or 4096B, which are the two NVMe block sizes that are atomic on our drives today.
If awun/awupf were the offset to our standard block size (512B or 4096B), would that work?

nvme id-ctrl /dev/nvme0n1 | grep aw
awun      : 0
awupf     : 0

Frank



end of thread, other threads:[~2020-03-19  1:07 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-16 20:59 write atomicity with xfs ... current status? Ober, Frank
2020-03-16 21:59 ` Darrick J. Wong
2020-03-16 23:32   ` Dave Chinner
2020-03-17 22:56     ` Ober, Frank
2020-03-18  2:27       ` Dave Chinner
2020-03-18  8:00         ` Christoph Hellwig
2020-03-19  1:07         ` Ober, Frank
2020-03-17 19:19 ` Christoph Hellwig
2020-03-17 22:55   ` Dave Chinner
2020-03-18  7:54     ` Christoph Hellwig
