All of lore.kernel.org
 help / color / mirror / Atom feed
* [LSF/MM TOPIC] atomic block device
@ 2014-02-15 15:04 Dan Williams
  2014-02-15 17:55 ` Andy Rudoff
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Dan Williams @ 2014-02-15 15:04 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-fsdevel, jmoyer, david, Chris Mason, Jens Axboe,
	Bryan E Veal, Annie Foong

In response to Dave's call [1] and highlighting Jeff's attend request
[2] I'd like to stoke a discussion on an emulation layer for atomic
block commands.  Specifically, SNIA has laid out their position on the
command set an atomic block device may support (NVM Programming Model
[3]) and it is a good conversation piece for this effort.  The goal
would be to review the proposed operations, identify the capabilities
that would be readily useful to filesystems / existing use cases, and
tear down a straw man implementation proposal.

The SNIA defined capabilities that seem the highest priority to implement are:
* ATOMIC_MULTIWRITE - dis-contiguous LBA ranges, power fail atomic, no
ordering constraint relative to other i/o

* ATOMIC_WRITE - contiguous LBA range, power fail atomic, no ordering
constraint relative to other i/o

* EXISTS - not an atomic command, but defined in the NPM.  It is akin
to SEEK_{DATA|HOLE} to test whether an LBA is mapped or unmapped.  If
the LBA is mapped additionally specifies whether data is present or
the LBA is only allocated.

* SCAR - again not an atomic command, but once we have metadata can
implement a bad block list, analogous to the bad-block-list support in
md.

Initial thought is that this functionality is better implemented as a
library a block device driver (bio-based or request-based) can call to
emulate these features.  In the case where the feature is directly
supported by the underlying hardware device the emulation layer will
stub out and pass it through.  The argument for not doing this as a
device-mapper target or stacked block device driver is to ease
provisioning and make the emulation transparent.  On the other hand,
the argument for doing this as a virtual block device is that the
"failed to parse device metadata" is a known failure scenario for
dm/md, but not sd for example.

Thoughts?

--
Dan

[1]: http://marc.info/?l=linux-fsdevel&m=138438717002687&w=2
[2]: http://marc.info/?l=linux-fsdevel&m=139041672718333&w=2
[3]: http://snia.org/sites/default/files/NVMProgrammingModel_v1.pdf

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-15 15:04 [LSF/MM TOPIC] atomic block device Dan Williams
@ 2014-02-15 17:55 ` Andy Rudoff
  2014-02-15 18:29   ` Howard Chu
  2014-02-15 18:02 ` James Bottomley
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Andy Rudoff @ 2014-02-15 17:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: lsf-pc, linux-fsdevel, jmoyer, david, Chris Mason, Jens Axboe,
	Bryan E Veal, Annie Foong

On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> In response to Dave's call [1] and highlighting Jeff's attend request
> [2] I'd like to stoke a discussion on an emulation layer for atomic
> block commands.  Specifically, SNIA has laid out their position on the
> command set an atomic block device may support (NVM Programming Model
> [3]) and it is a good conversation piece for this effort.  The goal
> would be to review the proposed operations, identify the capabilities
> that would be readily useful to filesystems / existing use cases, and
> tear down a straw man implementation proposal.
...
> The argument for not doing this as a
> device-mapper target or stacked block device driver is to ease
> provisioning and make the emulation transparent.  On the other hand,
> the argument for doing this as a virtual block device is that the
> "failed to parse device metadata" is a known failure scenario for
> dm/md, but not sd for example.


Hi Dan,

Like Jeff, I'm a member of the NVMP workgroup and I'd like to ring in
here with a couple observations.  I think the most interesting cases
where atomics provide a benefit are cases where storage is RAIDed
across multiple devices.  Part of the argument for atomic writes on
SSDs is that databases and file systems can save bandwidth and
complexity by avoiding write-ahead-logging.  But even if every SSD
supported it, the majority of production databases span across devices
for either capacity, performance, or, most likely, high availability
reasons.  So in my opinion, that very much supports the idea of doing
atomics at a layer where it applies to SW RAIDed storage (as I believe
Dave and others are suggesting).

On the other side of the coin, I remember Dave talking about this
during our NVM discussion at LSF last year and I got the impression
the size and number of writes he'd need supported before he could
really stop using his journaling code was potentially large.  Dave:
perhaps you can re-state the number of writes and their total size
that would have to be supported by block level atomics in order for
them to be worth using by XFS?

Finally, I think atomics for file system use is interesting, but also
exposing them for database use is very interesting.  That means
exposing the size and number of writes supported to the app and making
the file system able to turn around and leverage those when a database
app tries to use them via the file system.  This has been the primary
focus of the NVMP workgroup, helping ISVs determine what features they
can leverage in a uniform way.  So my point here is we get the most
use out of atomics by exposing them both in-kernel for file systems
and in user space for apps.

-andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-15 15:04 [LSF/MM TOPIC] atomic block device Dan Williams
  2014-02-15 17:55 ` Andy Rudoff
@ 2014-02-15 18:02 ` James Bottomley
  2014-02-15 18:15   ` Andy Rudoff
       [not found] ` <CABBL8E+r+Uao9aJsezy16K_JXQgVuoD7ArepB46WTS=zruHL4g@mail.gmail.com>
  2014-02-17 13:05 ` Chris Mason
  3 siblings, 1 reply; 18+ messages in thread
From: James Bottomley @ 2014-02-15 18:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: lsf-pc, linux-fsdevel, jmoyer, david, Chris Mason, Jens Axboe,
	Bryan E Veal, Annie Foong, linux-scsi, Christoph Lameter

On Sat, 2014-02-15 at 07:04 -0800, Dan Williams wrote:
> In response to Dave's call [1] and highlighting Jeff's attend request
> [2] I'd like to stoke a discussion on an emulation layer for atomic
> block commands.  Specifically, SNIA has laid out their position on the
> command set an atomic block device may support (NVM Programming Model
> [3]) and it is a good conversation piece for this effort.  The goal
> would be to review the proposed operations, identify the capabilities
> that would be readily useful to filesystems / existing use cases, and
> tear down a straw man implementation proposal.
> 
> The SNIA defined capabilities that seem the highest priority to implement are:
> * ATOMIC_MULTIWRITE - dis-contiguous LBA ranges, power fail atomic, no
> ordering constraint relative to other i/o
> 
> * ATOMIC_WRITE - contiguous LBA range, power fail atomic, no ordering
> constraint relative to other i/o
> 
> * EXISTS - not an atomic command, but defined in the NPM.  It is akin
> to SEEK_{DATA|HOLE} to test whether an LBA is mapped or unmapped.  If
> the LBA is mapped additionally specifies whether data is present or
> the LBA is only allocated.
> 
> * SCAR - again not an atomic command, but once we have metadata can
> implement a bad block list, analogous to the bad-block-list support in
> md.
> 
> Initial thought is that this functionality is better implemented as a
> library a block device driver (bio-based or request-based) can call to
> emulate these features.  In the case where the feature is directly
> supported by the underlying hardware device the emulation layer will
> stub out and pass it through.  The argument for not doing this as a
> device-mapper target or stacked block device driver is to ease
> provisioning and make the emulation transparent.  On the other hand,
> the argument for doing this as a virtual block device is that the
> "failed to parse device metadata" is a known failure scenario for
> dm/md, but not sd for example.
> 
> Thoughts?

Actually, this topic has already been suggested by Christoph Lameter ...
he just didn't copy any external mailing lists (bad Christoph, rolled up
newspaper for you).

For those following at home, the SNIA proposal is here:

http://snia.org/sites/default/files/NVMProgrammingModel_v1.pdf

And this was my initial reply:

OK, I'm prepared to look through it, but I should warn you that after
the SNIA HBAAPI cluster fuck I'm not well disposed towards any APIs that
come out of SNIA.  I've read the first 30 pages and they don't inspire
confidence; it's basically going the same way as HBA API.  The failure
there was trying to define universal interfaces for every OS regardless
of the existing interfaces they currently had.  this NVM model seems to
define a lot of existing stuff in block and VFS but slightly
differently.  Why do you think it's a good idea?

I'll further add  what we really need are use cases, not an API
chocolate box.  I think some DB people will be coming to LSF, so we
should really talk use cases with them.

James



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-15 18:02 ` James Bottomley
@ 2014-02-15 18:15   ` Andy Rudoff
  2014-02-15 20:25     ` James Bottomley
  0 siblings, 1 reply; 18+ messages in thread
From: Andy Rudoff @ 2014-02-15 18:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Dan Williams, lsf-pc, linux-fsdevel, jmoyer, david, Chris Mason,
	Jens Axboe, Bryan E Veal, Annie Foong, linux-scsi,
	Christoph Lameter

> OK, I'm prepared to look through it, but I should warn you that after
> the SNIA HBAAPI cluster fuck I'm not well disposed towards any APIs that
> come out of SNIA.  I've read the first 30 pages and they don't inspire
> confidence; it's basically going the same way as HBA API.  The failure
> there was trying to define universal interfaces for every OS regardless
> of the existing interfaces they currently had.  this NVM model seems to
> define a lot of existing stuff in block and VFS but slightly
> differently.  Why do you think it's a good idea?

Note that the NVMP workgroup did not define any APIs.  Instead, we
concentrated on defining the actions that we see applications needing
(or being able to use) and defining some common terminology.  We leave
the API defintion to the operating system authors so they can create
them in the way that makes the most sense for their environment (much
like you are suggesting above, I think).

> I'll further add  what we really need are use cases, not an API
> chocolate box.  I think some DB people will be coming to LSF, so we
> should really talk use cases with them.

Totally agree.

-andy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-15 17:55 ` Andy Rudoff
@ 2014-02-15 18:29   ` Howard Chu
  2014-02-15 18:31     ` Howard Chu
  0 siblings, 1 reply; 18+ messages in thread
From: Howard Chu @ 2014-02-15 18:29 UTC (permalink / raw)
  To: Andy Rudoff, Dan Williams
  Cc: lsf-pc, linux-fsdevel, jmoyer, david, Chris Mason, Jens Axboe,
	Bryan E Veal, Annie Foong

Andy Rudoff wrote:
> On the other side of the coin, I remember Dave talking about this
> during our NVM discussion at LSF last year and I got the impression
> the size and number of writes he'd need supported before he could
> really stop using his journaling code was potentially large.  Dave:
> perhaps you can re-state the number of writes and their total size
> that would have to be supported by block level atomics in order for
> them to be worth using by XFS?

If you're dealing with a typical update-in-place database then there's no 
upper bound on this, a DB transaction can be arbitrarily large and any partial 
write will result in corrupted data structures.

On the other hand, with a multi-version copy-on-write DB (like mine, 
http://symas.com/mdb/ ) all you need is a guarantee that all data writes 
complete before any metadata is updated.

IMO, catering to the update-in-place approach is an exercise in futility since 
it will require significant memory resources on every link in the storage 
chain and whatever amount you have available will never be sufficient.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-15 18:29   ` Howard Chu
@ 2014-02-15 18:31     ` Howard Chu
  0 siblings, 0 replies; 18+ messages in thread
From: Howard Chu @ 2014-02-15 18:31 UTC (permalink / raw)
  To: Andy Rudoff, Dan Williams
  Cc: lsf-pc, linux-fsdevel, jmoyer, david, Chris Mason, Jens Axboe,
	Bryan E Veal, Annie Foong

Howard Chu wrote:
> Andy Rudoff wrote:
>> On the other side of the coin, I remember Dave talking about this
>> during our NVM discussion at LSF last year and I got the impression
>> the size and number of writes he'd need supported before he could
>> really stop using his journaling code was potentially large.  Dave:
>> perhaps you can re-state the number of writes and their total size
>> that would have to be supported by block level atomics in order for
>> them to be worth using by XFS?
>
> If you're dealing with a typical update-in-place database then there's no
> upper bound on this, a DB transaction can be arbitrarily large and any partial
> write will result in corrupted data structures.
>
> On the other hand, with a multi-version copy-on-write DB (like mine,
> http://symas.com/mdb/ ) all you need is a guarantee that all data writes
> complete before any metadata is updated.
>
> IMO, catering to the update-in-place approach is an exercise in futility since
> it will require significant memory resources on every link in the storage
> chain and whatever amount you have available will never be sufficient.
>
My proposal from last November could be implemented without requiring any more 
state than already present in current storage controllers.

http://www.spinics.net/lists/linux-fsdevel/msg70047.html

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-15 18:15   ` Andy Rudoff
@ 2014-02-15 20:25     ` James Bottomley
  2014-03-20 20:10       ` Jeff Moyer
  0 siblings, 1 reply; 18+ messages in thread
From: James Bottomley @ 2014-02-15 20:25 UTC (permalink / raw)
  To: Andy Rudoff
  Cc: Dan Williams, lsf-pc, linux-fsdevel, jmoyer, david, Chris Mason,
	Jens Axboe, Bryan E Veal, Annie Foong, linux-scsi,
	Christoph Lameter

On Sat, 2014-02-15 at 11:15 -0700, Andy Rudoff wrote:
> > OK, I'm prepared to look through it, but I should warn you that after
> > the SNIA HBAAPI cluster fuck I'm not well disposed towards any APIs that
> > come out of SNIA.  I've read the first 30 pages and they don't inspire
> > confidence; it's basically going the same way as HBA API.  The failure
> > there was trying to define universal interfaces for every OS regardless
> > of the existing interfaces they currently had.  this NVM model seems to
> > define a lot of existing stuff in block and VFS but slightly
> > differently.  Why do you think it's a good idea?
> 
> Note that the NVMP workgroup did not define any APIs.  Instead, we
> concentrated on defining the actions that we see applications needing
> (or being able to use) and defining some common terminology.  We leave
> the API defintion to the operating system authors so they can create
> them in the way that makes the most sense for their environment (much
> like you are suggesting above, I think).

Well, the actions had define input and output properties ... we can
argue about what level of semantics you have to define before an action
becomes an API, but the real question 

> > I'll further add  what we really need are use cases, not an API
> > chocolate box.  I think some DB people will be coming to LSF, so we
> > should really talk use cases with them.
> 
> Totally agree.

OK, so what the Database people are currently fretting about is how the
Linux cache fights with the WAL.  Pretty much all DBs sit on filesystems
these days, so the first question is are block operations even relevant
and do the (rather scant) file operations fit their need.  The basic
problem with the file mode is the granularity.  What does a DB do with
transactions which go over the limits?  It also looks like the NVM file
actions need to work over DIO, so the question is how. (And the other
problem is that only a few DBs seem to use DIO).

James



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
       [not found] ` <CABBL8E+r+Uao9aJsezy16K_JXQgVuoD7ArepB46WTS=zruHL4g@mail.gmail.com>
@ 2014-02-15 21:35   ` Dan Williams
  2014-02-17  8:56   ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dan Williams @ 2014-02-15 21:35 UTC (permalink / raw)
  To: Andy Rudoff
  Cc: lsf-pc, linux-fsdevel, jmoyer, david, Chris Mason, Jens Axboe,
	Bryan E Veal, Annie Foong

On Sat, Feb 15, 2014 at 9:47 AM, Andy Rudoff <andy@rudoff.com> wrote:
> On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@intel.com>
> wrote:
>>
>> In response to Dave's call [1] and highlighting Jeff's attend request
>> [2] I'd like to stoke a discussion on an emulation layer for atomic
>> block commands.  Specifically, SNIA has laid out their position on the
>> command set an atomic block device may support (NVM Programming Model
>> [3]) and it is a good conversation piece for this effort.  The goal
>> would be to review the proposed operations, identify the capabilities
>> that would be readily useful to filesystems / existing use cases, and
>> tear down a straw man implementation proposal.
>
> ...
>>
>> The argument for not doing this as a
>> device-mapper target or stacked block device driver is to ease
>> provisioning and make the emulation transparent.  On the other hand,
>> the argument for doing this as a virtual block device is that the
>> "failed to parse device metadata" is a known failure scenario for
>> dm/md, but not sd for example.
>
>
> Hi Dan,

Hi Andy.

> Like Jeff, I'm a member of the NVMP workgroup and I'd like to ring in here
> with a couple observations.  I think the most interesting cases where
> atomics provide a benefit are cases where storage is RAIDed across multiple
> devices.  Part of the argument for atomic writes on SSDs is that databases
> and file systems can save bandwidth and complexity by avoiding
> write-ahead-logging.  But even if every SSD supported it, the majority of
> production databases span across devices for either capacity, performance,
> or, most likely, high availability reasons.

The primary Facebook database server (Type 3 [1]) is single-device,
are they an outlier?  I would think scale-out architectures in general
handle database capacity and availability by scaling at the node
level... that said I don't doubt that some are dependent on
multi-device configurations.

[1]: http://opencompute.org/summit/ (slide 12)

> So in my opinion, that very
> much supports the idea of doing atomics at a layer where it applies to SW
> RAIDed storage (as I believe Dave and others are suggesting).

Sure this can expand to a multi-device capability, but that is
incremental to the single device use case.

> On the other side of the coin, I remember Dave talking about this during our
> NVM discussion at LSF last year and I got the impression the size and number
> of writes he'd need supported before he could really stop using his
> journaling code was potentially large.  Dave: perhaps you can re-state the
> number of writes and their total size that would have to be supported by
> block level atomics in order for them to be worth using by XFS?

...and that's the driving example of the value of having a solution
like this upstream.  Beat up on a common layer to determine the
minimum practical requirements across different use cases.

> Finally, I think atomics for file system use is interesting, but also
> exposing them for database use is very interesting.  That means exposing the
> size and number of writes supported to the app and making the file system
> able to turn around and leverage those when a database app tries to use them
> via the file system.  This has been the primary focus of the NVMP workgroup,
> helping ISVs determine what features they can leverage in a uniform way.  So
> my point here is we get the most use out of atomics by exposing them both
> in-kernel for file systems and in user space for apps.

*nod*

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
       [not found] ` <CABBL8E+r+Uao9aJsezy16K_JXQgVuoD7ArepB46WTS=zruHL4g@mail.gmail.com>
  2014-02-15 21:35   ` Dan Williams
@ 2014-02-17  8:56   ` Dave Chinner
  2014-02-17  9:51     ` [Lsf-pc] " Jan Kara
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2014-02-17  8:56 UTC (permalink / raw)
  To: Andy Rudoff
  Cc: Dan Williams, lsf-pc, linux-fsdevel, jmoyer, Chris Mason,
	Jens Axboe, Bryan E Veal, Annie Foong

On Sat, Feb 15, 2014 at 10:47:12AM -0700, Andy Rudoff wrote:
> On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@intel.com>wrote:
> 
> > In response to Dave's call [1] and highlighting Jeff's attend request
> > [2] I'd like to stoke a discussion on an emulation layer for atomic
> > block commands.  Specifically, SNIA has laid out their position on the
> > command set an atomic block device may support (NVM Programming Model
> > [3]) and it is a good conversation piece for this effort.  The goal
> > would be to review the proposed operations, identify the capabilities
> > that would be readily useful to filesystems / existing use cases, and
> > tear down a straw man implementation proposal.
> >
> ...
> 
> > The argument for not doing this as a
> > device-mapper target or stacked block device driver is to ease
> > provisioning and make the emulation transparent.  On the other hand,
> > the argument for doing this as a virtual block device is that the
> > "failed to parse device metadata" is a known failure scenario for
> > dm/md, but not sd for example.
> >
> 
> Hi Dan,
> 
> Like Jeff, I'm a member of the NVMP workgroup and I'd like to ring in here
> with a couple observations.  I think the most interesting cases where
> atomics provide a benefit are cases where storage is RAIDed across multiple
> devices.  Part of the argument for atomic writes on SSDs is that databases
> and file systems can save bandwidth and complexity by avoiding
> write-ahead-logging.  But even if every SSD supported it, the majority of
> production databases span across devices for either capacity, performance,
> or, most likely, high availability reasons.  So in my opinion, that very
> much supports the idea of doing atomics at a layer where it applies to SW
> RAIDed storage (as I believe Dave and others are suggesting).
> 
> On the other side of the coin, I remember Dave talking about this during
> our NVM discussion at LSF last year and I got the impression the size and
> number of writes he'd need supported before he could really stop using his
> journaling code was potentially large.  Dave: perhaps you can re-state the
> number of writes and their total size that would have to be supported by
> block level atomics in order for them to be worth using by XFS?

Hi Andy - the numbers I gave last year were at the upper end of the
number of iovecs we can dump into an atomic checkpoint in the XFS
log at a time. because that is typically based on log size and the
log can be up to 2GB in size, this tends to max out at somewhere
around 150-200,000 individual iovecs and/or roughly 100MB of
metadata.

Yeah, it's a lot, but keep in mind that a workload running 250,000
file creates a second on XFS is retiring somewhere around 300,000
individual transactions per second, each of which will typically
have 10-20 dirty regions in them.  If we were to write them as
individual atomic writes at transaction commit time we'd need to
sustain somewhere in the order of 3-6 _million IOPS_ to maintain
this transaction rate with individual atomic writes for each
transaction.

That would also introduce unacceptible IO latency as we can't modify
metadata while it is under IO, especially as a large number of these
regions are redirtied repeatedly during ongoing operations(e.g.
directory data and index blocks). Hence to avoid this problem with
atomic writes, we need still need asynchronous transactions and
in-memory aggregation of changes.  IOWs, checkpoints are the until
of atomic write we need to for support in XFS.

We can limit the size of checkpoints in XFS without too much
trouble, either by amount of data or number of iovecs, but that
comes at a performance code. To maintain current levels of
performance we need a decent amount of in-memory change aggregation
and hence we are going to need - at minimum - thousands of vectors
in each atomic write. I'd prefer tens of thousands to hundreds of
thousands of vectors because that's our typical unit of "atomic
write" at current performance levels, but several thousand vectors
and tens of MB is sufficient to start with....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] atomic block device
  2014-02-17  8:56   ` Dave Chinner
@ 2014-02-17  9:51     ` Jan Kara
  2014-02-17 10:20       ` Howard Chu
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2014-02-17  9:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andy Rudoff, Jens Axboe, Bryan E Veal, Annie Foong, Chris Mason,
	jmoyer, linux-fsdevel, Dan Williams, lsf-pc

On Mon 17-02-14 19:56:27, Dave Chinner wrote:
> On Sat, Feb 15, 2014 at 10:47:12AM -0700, Andy Rudoff wrote:
> > On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@intel.com>wrote:
> > 
> > > In response to Dave's call [1] and highlighting Jeff's attend request
> > > [2] I'd like to stoke a discussion on an emulation layer for atomic
> > > block commands.  Specifically, SNIA has laid out their position on the
> > > command set an atomic block device may support (NVM Programming Model
> > > [3]) and it is a good conversation piece for this effort.  The goal
> > > would be to review the proposed operations, identify the capabilities
> > > that would be readily useful to filesystems / existing use cases, and
> > > tear down a straw man implementation proposal.
> > >
> > ...
> > 
> > > The argument for not doing this as a
> > > device-mapper target or stacked block device driver is to ease
> > > provisioning and make the emulation transparent.  On the other hand,
> > > the argument for doing this as a virtual block device is that the
> > > "failed to parse device metadata" is a known failure scenario for
> > > dm/md, but not sd for example.
> > >
> > 
> > Hi Dan,
> > 
> > Like Jeff, I'm a member of the NVMP workgroup and I'd like to ring in here
> > with a couple observations.  I think the most interesting cases where
> > atomics provide a benefit are cases where storage is RAIDed across multiple
> > devices.  Part of the argument for atomic writes on SSDs is that databases
> > and file systems can save bandwidth and complexity by avoiding
> > write-ahead-logging.  But even if every SSD supported it, the majority of
> > production databases span across devices for either capacity, performance,
> > or, most likely, high availability reasons.  So in my opinion, that very
> > much supports the idea of doing atomics at a layer where it applies to SW
> > RAIDed storage (as I believe Dave and others are suggesting).
> > 
> > On the other side of the coin, I remember Dave talking about this during
> > our NVM discussion at LSF last year and I got the impression the size and
> > number of writes he'd need supported before he could really stop using his
> > journaling code was potentially large.  Dave: perhaps you can re-state the
> > number of writes and their total size that would have to be supported by
> > block level atomics in order for them to be worth using by XFS?
> 
> Hi Andy - the numbers I gave last year were at the upper end of the
> number of iovecs we can dump into an atomic checkpoint in the XFS
> log at a time. because that is typically based on log size and the
> log can be up to 2GB in size, this tends to max out at somewhere
> around 150-200,000 individual iovecs and/or roughly 100MB of
> metadata.
> 
> Yeah, it's a lot, but keep in mind that a workload running 250,000
> file creates a second on XFS is retiring somewhere around 300,000
> individual transactions per second, each of which will typically
> have 10-20 dirty regions in them.  If we were to write them as
> individual atomic writes at transaction commit time we'd need to
> sustain somewhere in the order of 3-6 _million IOPS_ to maintain
> this transaction rate with individual atomic writes for each
> transaction.
> 
> That would also introduce unacceptible IO latency as we can't modify
> metadata while it is under IO, especially as a large number of these
> regions are redirtied repeatedly during ongoing operations(e.g.
> directory data and index blocks). Hence to avoid this problem with
> atomic writes, we need still need asynchronous transactions and
> in-memory aggregation of changes.  IOWs, checkpoints are the until
> of atomic write we need to for support in XFS.
> 
> We can limit the size of checkpoints in XFS without too much
> trouble, either by amount of data or number of iovecs, but that
> comes at a performance code. To maintain current levels of
> performance we need a decent amount of in-memory change aggregation
> and hence we are going to need - at minimum - thousands of vectors
> in each atomic write. I'd prefer tens of thousands to hundreds of
> thousands of vectors because that's our typical unit of "atomic
> write" at current performance levels, but several thousand vectors
> and tens of MB is sufficient to start with....
  I did the math for ext4 and it worked out rather similarly. After the
transaction batching we do in memory, we have transactions which are tens
of MB in size. These go first to a physically contiguous journal during
transaction commit (that's the easy part but it would already save us one
cache flush + FUA write) and then during checkpoint to final locations on
disk which can be physically discontiguous so that can be thousands to tens
of thousands different locations (this would save us another cache flush +
FUA write).

Similarly as in XFS case it is easy to force smaller transactions in ext4
but the smaller you make them the larger it the journaling overhead...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] atomic block device
  2014-02-17  9:51     ` [Lsf-pc] " Jan Kara
@ 2014-02-17 10:20       ` Howard Chu
  2014-02-18  0:10         ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Howard Chu @ 2014-02-17 10:20 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Andy Rudoff, Jens Axboe, Bryan E Veal, Annie Foong, Chris Mason,
	jmoyer, linux-fsdevel, Dan Williams, lsf-pc

Jan Kara wrote:
> On Mon 17-02-14 19:56:27, Dave Chinner wrote:
>> On Sat, Feb 15, 2014 at 10:47:12AM -0700, Andy Rudoff wrote:
>>> On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@intel.com>wrote:
>>>
>>>> In response to Dave's call [1] and highlighting Jeff's attend request
>>>> [2] I'd like to stoke a discussion on an emulation layer for atomic
>>>> block commands.  Specifically, SNIA has laid out their position on the
>>>> command set an atomic block device may support (NVM Programming Model
>>>> [3]) and it is a good conversation piece for this effort.  The goal
>>>> would be to review the proposed operations, identify the capabilities
>>>> that would be readily useful to filesystems / existing use cases, and
>>>> tear down a straw man implementation proposal.
>>>>
>>> ...
>>>
>>>> The argument for not doing this as a
>>>> device-mapper target or stacked block device driver is to ease
>>>> provisioning and make the emulation transparent.  On the other hand,
>>>> the argument for doing this as a virtual block device is that the
>>>> "failed to parse device metadata" is a known failure scenario for
>>>> dm/md, but not sd for example.
>>>>
>>>
>>> Hi Dan,
>>>
>>> Like Jeff, I'm a member of the NVMP workgroup and I'd like to ring in here
>>> with a couple observations.  I think the most interesting cases where
>>> atomics provide a benefit are cases where storage is RAIDed across multiple
>>> devices.  Part of the argument for atomic writes on SSDs is that databases
>>> and file systems can save bandwidth and complexity by avoiding
>>> write-ahead-logging.  But even if every SSD supported it, the majority of
>>> production databases span across devices for either capacity, performance,
>>> or, most likely, high availability reasons.  So in my opinion, that very
>>> much supports the idea of doing atomics at a layer where it applies to SW
>>> RAIDed storage (as I believe Dave and others are suggesting).
>>>
>>> On the other side of the coin, I remember Dave talking about this during
>>> our NVM discussion at LSF last year and I got the impression the size and
>>> number of writes he'd need supported before he could really stop using his
>>> journaling code was potentially large.  Dave: perhaps you can re-state the
>>> number of writes and their total size that would have to be supported by
>>> block level atomics in order for them to be worth using by XFS?
>>
>> Hi Andy - the numbers I gave last year were at the upper end of the
>> number of iovecs we can dump into an atomic checkpoint in the XFS
>> log at a time. because that is typically based on log size and the
>> log can be up to 2GB in size, this tends to max out at somewhere
>> around 150-200,000 individual iovecs and/or roughly 100MB of
>> metadata.
>>
>> Yeah, it's a lot, but keep in mind that a workload running 250,000
>> file creates a second on XFS is retiring somewhere around 300,000
>> individual transactions per second, each of which will typically
>> have 10-20 dirty regions in them.  If we were to write them as
>> individual atomic writes at transaction commit time we'd need to
>> sustain somewhere in the order of 3-6 _million IOPS_ to maintain
>> this transaction rate with individual atomic writes for each
>> transaction.
>>
>> That would also introduce unacceptible IO latency as we can't modify
>> metadata while it is under IO, especially as a large number of these
>> regions are redirtied repeatedly during ongoing operations(e.g.
>> directory data and index blocks). Hence to avoid this problem with
>> atomic writes, we need still need asynchronous transactions and
>> in-memory aggregation of changes.  IOWs, checkpoints are the until
>> of atomic write we need to for support in XFS.
>>
>> We can limit the size of checkpoints in XFS without too much
>> trouble, either by amount of data or number of iovecs, but that
>> comes at a performance code. To maintain current levels of
>> performance we need a decent amount of in-memory change aggregation
>> and hence we are going to need - at minimum - thousands of vectors
>> in each atomic write. I'd prefer tens of thousands to hundreds of
>> thousands of vectors because that's our typical unit of "atomic
>> write" at current performance levels, but several thousand vectors
>> and tens of MB is sufficient to start with....
>    I did the math for ext4 and it worked out rather similarly. After the
> transaction batching we do in memory, we have transactions which are tens
> of MB in size. These go first to a physically contiguous journal during
> transaction commit (that's the easy part but it would already save us one
> cache flush + FUA write) and then during checkpoint to final locations on
> disk which can be physically discontiguous so that can be thousands to tens
> of thousands different locations (this would save us another cache flush +
> FUA write).
>
> Similarly as in XFS case it is easy to force smaller transactions in ext4
> but the smaller you make them the larger it the journaling overhead...

Again, if you simply tag writes with group IDs as I outlined before 
http://www.spinics.net/lists/linux-fsdevel/msg70047.html then you don't need 
explicit cache flushes, nor do you need to worry about transaction size 
limits. All you actually need is to ensure the ordering of a specific set of 
writes in relation to another specific set of writes, completely independent 
of other arbitrary writes. You folks are cooking up a solution for NVMe that's 
only practical when data transfer rates are fast enough that a 100MB write can 
be done in ~1ms, whereas a simple tweak of command tagging will work for 
everything from the slowest HDD to the fastest storage device.

As it is, the Atomic Write mechanism will be unusable for DBs when the 
transaction size exceeds whatever limit a particular device supports, thus 
requiring DB software to still provide a fallback mechanism, e.g. standard 
WAL, which only results in more complicated software. That's not a solution, 
that's just a new problem.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-15 15:04 [LSF/MM TOPIC] atomic block device Dan Williams
                   ` (2 preceding siblings ...)
       [not found] ` <CABBL8E+r+Uao9aJsezy16K_JXQgVuoD7ArepB46WTS=zruHL4g@mail.gmail.com>
@ 2014-02-17 13:05 ` Chris Mason
  2014-02-18 19:07   ` Dan Williams
  3 siblings, 1 reply; 18+ messages in thread
From: Chris Mason @ 2014-02-17 13:05 UTC (permalink / raw)
  To: Dan Williams, lsf-pc
  Cc: linux-fsdevel, jmoyer, david, Jens Axboe, Bryan E Veal, Annie Foong

On 02/15/2014 10:04 AM, Dan Williams wrote:
> In response to Dave's call [1] and highlighting Jeff's attend request
> [2] I'd like to stoke a discussion on an emulation layer for atomic
> block commands.  Specifically, SNIA has laid out their position on the
> command set an atomic block device may support (NVM Programming Model
> [3]) and it is a good conversation piece for this effort.  The goal
> would be to review the proposed operations, identify the capabilities
> that would be readily useful to filesystems / existing use cases, and
> tear down a straw man implementation proposal.
>
> The SNIA defined capabilities that seem the highest priority to implement are:
> * ATOMIC_MULTIWRITE - dis-contiguous LBA ranges, power fail atomic, no
> ordering constraint relative to other i/o
>
> * ATOMIC_WRITE - contiguous LBA range, power fail atomic, no ordering
> constraint relative to other i/o
>
> * EXISTS - not an atomic command, but defined in the NPM.  It is akin
> to SEEK_{DATA|HOLE} to test whether an LBA is mapped or unmapped.  If
> the LBA is mapped additionally specifies whether data is present or
> the LBA is only allocated.
>
> * SCAR - again not an atomic command, but once we have metadata can
> implement a bad block list, analogous to the bad-block-list support in
> md.
>
> Initial thought is that this functionality is better implemented as a
> library a block device driver (bio-based or request-based) can call to
> emulate these features.  In the case where the feature is directly
> supported by the underlying hardware device the emulation layer will
> stub out and pass it through.  The argument for not doing this as a
> device-mapper target or stacked block device driver is to ease
> provisioning and make the emulation transparent.  On the other hand,
> the argument for doing this as a virtual block device is that the
> "failed to parse device metadata" is a known failure scenario for
> dm/md, but not sd for example.

Hi Dan,

I'd suggest a dm device instead of a special library, mostly because the 
emulated device is likely to need some kind of cleanup action after a 
crash, and the dm model is best suited to cleanly provide that.  It's 
also a good fit for people that want to duct tape a small amount of very 
fast nvm onto relatively slower devices.

The absolute minimum to provide something useful is a 16K discontig 
atomic.  That won't help the filesystems much, but it will allow mysql 
to turn off double buffering.  Oracle would benefit from ~64K, mostly 
from a safety point of view since they don't double buffer.

Helping the filesystems is harder, we need atomics bigger than any 
individual device is likely to provide.  But as Dave says elsewhere in 
the thread, we can limit that for specific workloads.

I'm not sold on SCAR, since I'd expect the FTL or drive firmware provide 
that for us, what use case do you have in mind there?

-chris

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] atomic block device
  2014-02-17 10:20       ` Howard Chu
@ 2014-02-18  0:10         ` Dave Chinner
  2014-02-18  8:59           ` Alex Elsayed
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2014-02-18  0:10 UTC (permalink / raw)
  To: Howard Chu
  Cc: Jan Kara, Jens Axboe, Andy Rudoff, Annie Foong, Chris Mason,
	jmoyer, Bryan E Veal, linux-fsdevel, Dan Williams, lsf-pc

On Mon, Feb 17, 2014 at 02:20:50AM -0800, Howard Chu wrote:
> Jan Kara wrote:
> >On Mon 17-02-14 19:56:27, Dave Chinner wrote:
> >>On Sat, Feb 15, 2014 at 10:47:12AM -0700, Andy Rudoff wrote:
> >>>On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@intel.com>wrote:
> >>>
> >>>>In response to Dave's call [1] and highlighting Jeff's attend request
> >>>>[2] I'd like to stoke a discussion on an emulation layer for atomic
> >>>>block commands.  Specifically, SNIA has laid out their position on the
> >>>>command set an atomic block device may support (NVM Programming Model
> >>>>[3]) and it is a good conversation piece for this effort.  The goal
> >>>>would be to review the proposed operations, identify the capabilities
> >>>>that would be readily useful to filesystems / existing use cases, and
> >>>>tear down a straw man implementation proposal.
> >>>>
> >>>...
> >>>
> >>>>The argument for not doing this as a
> >>>>device-mapper target or stacked block device driver is to ease
> >>>>provisioning and make the emulation transparent.  On the other hand,
> >>>>the argument for doing this as a virtual block device is that the
> >>>>"failed to parse device metadata" is a known failure scenario for
> >>>>dm/md, but not sd for example.
> >>>>
> >>>
> >>>Hi Dan,
> >>>
> >>>Like Jeff, I'm a member of the NVMP workgroup and I'd like to ring in here
> >>>with a couple observations.  I think the most interesting cases where
> >>>atomics provide a benefit are cases where storage is RAIDed across multiple
> >>>devices.  Part of the argument for atomic writes on SSDs is that databases
> >>>and file systems can save bandwidth and complexity by avoiding
> >>>write-ahead-logging.  But even if every SSD supported it, the majority of
> >>>production databases span across devices for either capacity, performance,
> >>>or, most likely, high availability reasons.  So in my opinion, that very
> >>>much supports the idea of doing atomics at a layer where it applies to SW
> >>>RAIDed storage (as I believe Dave and others are suggesting).
> >>>
> >>>On the other side of the coin, I remember Dave talking about this during
> >>>our NVM discussion at LSF last year and I got the impression the size and
> >>>number of writes he'd need supported before he could really stop using his
> >>>journaling code was potentially large.  Dave: perhaps you can re-state the
> >>>number of writes and their total size that would have to be supported by
> >>>block level atomics in order for them to be worth using by XFS?
> >>
> >>Hi Andy - the numbers I gave last year were at the upper end of the
> >>number of iovecs we can dump into an atomic checkpoint in the XFS
> >>log at a time. because that is typically based on log size and the
> >>log can be up to 2GB in size, this tends to max out at somewhere
> >>around 150-200,000 individual iovecs and/or roughly 100MB of
> >>metadata.
> >>
> >>Yeah, it's a lot, but keep in mind that a workload running 250,000
> >>file creates a second on XFS is retiring somewhere around 300,000
> >>individual transactions per second, each of which will typically
> >>have 10-20 dirty regions in them.  If we were to write them as
> >>individual atomic writes at transaction commit time we'd need to
> >>sustain somewhere in the order of 3-6 _million IOPS_ to maintain
> >>this transaction rate with individual atomic writes for each
> >>transaction.
> >>
> >>That would also introduce unacceptible IO latency as we can't modify
> >>metadata while it is under IO, especially as a large number of these
> >>regions are redirtied repeatedly during ongoing operations(e.g.
> >>directory data and index blocks). Hence to avoid this problem with
> >>atomic writes, we need still need asynchronous transactions and
> >>in-memory aggregation of changes.  IOWs, checkpoints are the until
> >>of atomic write we need to for support in XFS.
> >>
> >>We can limit the size of checkpoints in XFS without too much
> >>trouble, either by amount of data or number of iovecs, but that
> >>comes at a performance code. To maintain current levels of
> >>performance we need a decent amount of in-memory change aggregation
> >>and hence we are going to need - at minimum - thousands of vectors
> >>in each atomic write. I'd prefer tens of thousands to hundreds of
> >>thousands of vectors because that's our typical unit of "atomic
> >>write" at current performance levels, but several thousand vectors
> >>and tens of MB is sufficient to start with....
> >   I did the math for ext4 and it worked out rather similarly. After the
> >transaction batching we do in memory, we have transactions which are tens
> >of MB in size. These go first to a physically contiguous journal during
> >transaction commit (that's the easy part but it would already save us one
> >cache flush + FUA write) and then during checkpoint to final locations on
> >disk which can be physically discontiguous so that can be thousands to tens
> >of thousands different locations (this would save us another cache flush +
> >FUA write).
> >
> >Similarly as in XFS case it is easy to force smaller transactions in ext4
> >but the smaller you make them the larger it the journaling overhead...
> 
> Again, if you simply tag writes with group IDs as I outlined before
> http://www.spinics.net/lists/linux-fsdevel/msg70047.html then you
> don't need explicit cache flushes, nor do you need to worry about
> transaction size limits.
>
> All you actually need is to ensure the
> ordering of a specific set of writes in relation to another specific
> set of writes, completely independent of other arbitrary writes. You
> folks are cooking up a solution for NVMe that's only practical when
> data transfer rates are fast enough that a 100MB write can be done
> in ~1ms, whereas a simple tweak of command tagging will work for
> everything from the slowest HDD to the fastest storage device.

Perhaps you'd like to outline how you avoid IO priority inversion in
a journal with such a scheme where current checkpoints are held off
by all other metadata writeback because, by definition, metadata
writeback must be in a lower ordered tag group than the current
checkpoint.

> As it is, the Atomic Write mechanism will be unusable for DBs when
> the transaction size exceeds whatever limit a particular device
> supports, thus requiring DB software to still provide a fallback
> mechanism, e.g. standard WAL, which only results in more complicated
> software. That's not a solution, that's just a new problem.

Realistically, I haven't seen a single proposal coming out of the
hardware vendors that makes filesystem journalling more efficient
than it already is. Atomic writes might be able to save a journal
flush on an fsync() and so make databases go faster, but it gives
up a whole heap of other optimisations that make non-database
workloads go fast. e.g. untarring a tarball.

Similarly, things like ordered writes are great until you consider
how they interact with journalling and cause priority inversion
issues. The only way to make use of ordered writes is to design the
filesystem around ordered writes from the ground up. i.e. the
soft updates complexity problem. Unlike atomic writes, this can't
easily be retrofitted to an exising filesystem, and once you have
soft updates in place you are effectively fixing the format and
features of the filesystem in stone because if you need to change a
single operation or on disk structure you have to work out the
dependency graph for the entire filesystem from the ground up again.

Perhaps - just perhaps - we're doing this all wrong. Bottom up
design of hardware offload features has a history of resulting in
functionality that looks good on paper but can't be used in general
production systems because it is too limited or has undesirable side
effects.  Perhaps we need to be more top down, similar to how I
proposed a "dm-atomic" layer to implement atomic writes in software.

That is, design the software layer first, then convert filesystems
to use it. Once the concept is proven (a software implementation
should be no slower than what it replaced), the hardware offload
primitives can be derived from the algorithms that the software
offload uses.

i.e. design offload algorithms that work for existing users, prove
then work, then provide those primitives in hardware knowing that
they work and will be useful....

You can implement all this ordered group writes in a DM module quite
easily, it's trivial to extend submit_bio to take a 64 bit sequence
tag for ordered group writes. All metadata IO in XFS already has an
ordered 64 bit tag associated with it (funnily enough, called the
Log Sequence Number) and you can tell XFS not to send cache flushes
simply by using the nobarrier mount option.

So there's your proof of concept implementation - prove it works,
that priority inversion isn't a problem and that performance is
equivalent to the existing cache flush based implementation, and
then you have a proposal that we can take seriously.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] atomic block device
  2014-02-18  0:10         ` Dave Chinner
@ 2014-02-18  8:59           ` Alex Elsayed
  2014-02-18 13:17             ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Elsayed @ 2014-02-18  8:59 UTC (permalink / raw)
  To: linux-fsdevel

Dave Chinner wrote:

(My apologies for snipping, but I only wanted to address a very small part 
of what you said)
> Similarly, things like ordered writes are great until you consider
> how they interact with journalling and cause priority inversion
> issues. The only way to make use of ordered writes is to design the
> filesystem around ordered writes from the ground up. i.e. the
> soft updates complexity problem. Unlike atomic writes, this can't
> easily be retrofitted to an exising filesystem, and once you have
> soft updates in place you are effectively fixing the format and
> features of the filesystem in stone because if you need to change a
> single operation or on disk structure you have to work out the
> dependency graph for the entire filesystem from the ground up again.

One thing that keeps coming to mind whenever ordering guarantees in 
filesystems come up is Featherstitch[1], which is (to quote the article) "a 
generalization of the soft updates system of write dependencies and rollback 
data" (since "not enough file systems geniuses exist in the world to write 
and maintain more than one instance of soft updates").

Aside from its relevance to your observations on soft updates, it had a 
userspace API that provided similar guarantees to Howard Chu's suggestion.

[1] article by Valerie Aurora: https://lwn.net/Articles/354861/


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] atomic block device
  2014-02-18  8:59           ` Alex Elsayed
@ 2014-02-18 13:17             ` Dave Chinner
  2014-02-18 14:09               ` Theodore Ts'o
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2014-02-18 13:17 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: linux-fsdevel

On Tue, Feb 18, 2014 at 12:59:49AM -0800, Alex Elsayed wrote:
> Dave Chinner wrote:
> 
> (My apologies for snipping, but I only wanted to address a very small part 
> of what you said)
> > Similarly, things like ordered writes are great until you consider
> > how they interact with journalling and cause priority inversion
> > issues. The only way to make use of ordered writes is to design the
> > filesystem around ordered writes from the ground up. i.e. the
> > soft updates complexity problem. Unlike atomic writes, this can't
> > easily be retrofitted to an exising filesystem, and once you have
> > soft updates in place you are effectively fixing the format and
> > features of the filesystem in stone because if you need to change a
> > single operation or on disk structure you have to work out the
> > dependency graph for the entire filesystem from the ground up again.
> 
> One thing that keeps coming to mind whenever ordering guarantees in 
> filesystems come up is Featherstitch[1], which is (to quote the article) "a 
> generalization of the soft updates system of write dependencies and rollback 
> data" (since "not enough file systems geniuses exist in the world to write 
> and maintain more than one instance of soft updates").

Generalising a complex concept by abstracting it doesn't mean the
result is easier to understand. Indeed, when I first looked at
featherstitch back when that article was published (as a follow to
Val's article on soft updates where she characterised them as "an
evolutionary dead end") I couldn't find anything in the source code
that generalised the process of determining the dependencies in a
filesystem and verifying that the featherstitch core handled them
correctly.

i.e. the complexity problem is still there - all it provides is a
generalised method of tracking changes and resolving dependencies
that have been defined. Determining whether a filesystem has a
dependency that the featherstitch resolver doesn't handle requires
exactly the same understanding of the dependencies in the filesystem
that implementing soft updates requires.

> Aside from its relevance to your observations on soft updates, it had a 
> userspace API that provided similar guarantees to Howard Chu's suggestion.

Which requires featherstich to be implemented in all filesystems
so that the dependencies that the userspace API introduces can be
resolved correctly. It's an all or nothing solution, and requires
deep, dark surgery to every single filesystem that you want to
support that API.

Worse: I've actually looked at the featherstitch code and it's
pretty nasty.  All the filesystem modules it has are built into the
featherstitch kernel module, and called through a VFS shim layer
that re-implements much of the generic paths to add callouts to the
filesystem modules to track metadata updates in a featherstitch
aware fashion.

This is not helped by a lack of documentation and a distinct lack of
useful comments in the code. Not to mention that it's full of TODO
and FIXME items and it's error handling:

        r = patch_create_empty_list(NULL, &top, top_keep, NULL);
        if(r < 0)
                kpanic("Can't recover from failure!");

took lessons from btrfs and topped the class.

Quite frankly, featherstitch *was* a research project and it shows.
I say *was*, because it hasn't been updated since 2.6.20 and so from
that point of view it is a dead project.  Maybe some of the ideas
can be used in some way, but IMO the complexity of algorithms and
implementation just kills these ordered dependency graph solutions
stone dead.

Besides that, it needs a a complete rearchitecting and
re-implementation, as would all the filesystems that use it.  Hence
I just don't see it as a viable path to solving the issues at hand.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] atomic block device
  2014-02-18 13:17             ` Dave Chinner
@ 2014-02-18 14:09               ` Theodore Ts'o
  0 siblings, 0 replies; 18+ messages in thread
From: Theodore Ts'o @ 2014-02-18 14:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Alex Elsayed, linux-fsdevel

In addition to Dave's comments, consider the following from Val's
article with Alex quoted:

   The overall performance result was that the Featherstitch
   implementations were at par or somewhat better with the comparable
   ext3 version for elapsed time, but used significantly more CPU
   time.....

   So, you can use Featherstitch to re-implement all kinds of file
   system consistency schemes - soft updates, copy-on-write,
   journaling of all flavors - and it will go about as fast the old
   version while using up more of your CPU.

And note that this was comparing against ext3, which is not exactly a
shining example of performance.  (i.e., ext4 and xfs tend to beat ext3
handily on most benchmarks.)

Furthermore, given the sort of dependency tracking which Featherstich
is attempting, I suspect that the results will be at the very least
interesting on a system with a large number of cores; it's very likely
that it's CPU scalability leaves much to be desired.

Finally, note that many disk drives do not perform all that well with
writeback caching disabled (which is required for soft update and its
variants).  So when people do benchmarks comparing soft updates versus
traditional file systems, and important question to ask is (1) did
they remember to disable writeback caching for the soft updates run
(which is not the default, and if you don't disable it, you lose your
powerfail relibility), and (2) was writeback caching enabled or
disabled when benchmarking the traditional system, which can safely
use the default HDD writeback caching.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-17 13:05 ` Chris Mason
@ 2014-02-18 19:07   ` Dan Williams
  0 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2014-02-18 19:07 UTC (permalink / raw)
  To: Chris Mason
  Cc: lsf-pc, linux-fsdevel, jmoyer, david, Jens Axboe, Bryan E Veal,
	Annie Foong

On Mon, Feb 17, 2014 at 5:05 AM, Chris Mason <clm@fb.com> wrote:
> On 02/15/2014 10:04 AM, Dan Williams wrote:
>>
>> In response to Dave's call [1] and highlighting Jeff's attend request
>> [2] I'd like to stoke a discussion on an emulation layer for atomic
>> block commands.  Specifically, SNIA has laid out their position on the
>> command set an atomic block device may support (NVM Programming Model
>> [3]) and it is a good conversation piece for this effort.  The goal
>> would be to review the proposed operations, identify the capabilities
>> that would be readily useful to filesystems / existing use cases, and
>> tear down a straw man implementation proposal.
>>
>> The SNIA defined capabilities that seem the highest priority to implement
>> are:
>> * ATOMIC_MULTIWRITE - dis-contiguous LBA ranges, power fail atomic, no
>> ordering constraint relative to other i/o
>>
>> * ATOMIC_WRITE - contiguous LBA range, power fail atomic, no ordering
>> constraint relative to other i/o
>>
>> * EXISTS - not an atomic command, but defined in the NPM.  It is akin
>> to SEEK_{DATA|HOLE} to test whether an LBA is mapped or unmapped.  If
>> the LBA is mapped additionally specifies whether data is present or
>> the LBA is only allocated.
>>
>> * SCAR - again not an atomic command, but once we have metadata can
>> implement a bad block list, analogous to the bad-block-list support in
>> md.
>>
>> Initial thought is that this functionality is better implemented as a
>> library a block device driver (bio-based or request-based) can call to
>> emulate these features.  In the case where the feature is directly
>> supported by the underlying hardware device the emulation layer will
>> stub out and pass it through.  The argument for not doing this as a
>>
>> device-mapper target or stacked block device driver is to ease
>> provisioning and make the emulation transparent.  On the other hand,
>> the argument for doing this as a virtual block device is that the
>> "failed to parse device metadata" is a known failure scenario for
>> dm/md, but not sd for example.
>
>
> Hi Dan,
>
> I'd suggest a dm device instead of a special library, mostly because the
> emulated device is likely to need some kind of cleanup action after a crash,
> and the dm model is best suited to cleanly provide that.  It's also a good
> fit for people that want to duct tape a small amount of very fast nvm onto
> relatively slower devices.

Hi Chris,

I can see that.  It would be a surprising if sda fails to show up due
to metadata corruption.  Support for making the transition transparent
when the backing device supports the offloads can come later.

> The absolute minimum to provide something useful is a 16K discontig atomic.
> That won't help the filesystems much, but it will allow mysql to turn off
> double buffering.  Oracle would benefit from ~64K, mostly from a safety
> point of view since they don't double buffer.
>
> Helping the filesystems is harder, we need atomics bigger than any
> individual device is likely to provide.  But as Dave says elsewhere in the
> thread, we can limit that for specific workloads.

This sounds like a difference between "atomically handle a set of
commands up to the device's in-flight queue depth" vs "guarantee
atomic commit of transactions that may have landed on media a while
ago with a current set of in-flight requests".  If I am parsing the
difference correctly?

> I'm not sold on SCAR, since I'd expect the FTL or drive firmware provide
> that for us, what use case do you have in mind there?

The only use case I know for SCAR is the internal functionality RAID
firmwares implement to continue array rebuild upon encountering a bad
block.  Rather than stop the rebuild or silently corrupt data, "scar"
the lba ranges on the incoming rebuild target that otherwise could not
be recovered due to bad blocks on the other array members.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [LSF/MM TOPIC] atomic block device
  2014-02-15 20:25     ` James Bottomley
@ 2014-03-20 20:10       ` Jeff Moyer
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Moyer @ 2014-03-20 20:10 UTC (permalink / raw)
  To: James Bottomley
  Cc: Andy Rudoff, Dan Williams, lsf-pc, linux-fsdevel, david,
	Chris Mason, Jens Axboe, Bryan E Veal, Annie Foong, linux-scsi,
	Christoph Lameter

James Bottomley <James.Bottomley@HansenPartnership.com> writes:

> OK, so what the Database people are currently fretting about is how the
> Linux cache fights with the WAL.  Pretty much all DBs sit on filesystems
> these days, so the first question is are block operations even relevant

Yes, they are relevant so long as there are users.  Not all databases
run on file systems.  More to the point, I think the spec includes them
for completeness.

> and do the (rather scant) file operations fit their need.  The basic
> problem with the file mode is the granularity.  What does a DB do with
> transactions which go over the limits?

s/transactions/atomic multiwrites?  There are a couple of options,
there.  You could either emulate the multiwrite transparently to the
application, or you could fail it.  I think we'd have to support both,
since some applications will not want the less performant fallback.

> It also looks like the NVM file actions need to work over DIO, so the
> question is how.

DIO just means avoid using the page cache (or, I guess you could make it
more generic by saying avoid double buffering).  If the hardware
supports atomic writes, then this is easy.  If it doesn't, then we can
still do the emulation.  DIO already has fallback to buffered, so it's
not like we always honor that flag.

> (And the other problem is that only a few DBs seem to use DIO).

This is the first time I have ever heard anyone state that not using DIO
was a problem.  :-)  Also, I'm not sure what problem you are thinking
of.  Perhaps you are referring to the interactions of the WAL and the
page cache (that you mentioned in the first paragraph).

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-03-20 20:10 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-15 15:04 [LSF/MM TOPIC] atomic block device Dan Williams
2014-02-15 17:55 ` Andy Rudoff
2014-02-15 18:29   ` Howard Chu
2014-02-15 18:31     ` Howard Chu
2014-02-15 18:02 ` James Bottomley
2014-02-15 18:15   ` Andy Rudoff
2014-02-15 20:25     ` James Bottomley
2014-03-20 20:10       ` Jeff Moyer
     [not found] ` <CABBL8E+r+Uao9aJsezy16K_JXQgVuoD7ArepB46WTS=zruHL4g@mail.gmail.com>
2014-02-15 21:35   ` Dan Williams
2014-02-17  8:56   ` Dave Chinner
2014-02-17  9:51     ` [Lsf-pc] " Jan Kara
2014-02-17 10:20       ` Howard Chu
2014-02-18  0:10         ` Dave Chinner
2014-02-18  8:59           ` Alex Elsayed
2014-02-18 13:17             ` Dave Chinner
2014-02-18 14:09               ` Theodore Ts'o
2014-02-17 13:05 ` Chris Mason
2014-02-18 19:07   ` Dan Williams

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.