* [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
@ 2015-03-16  6:00 ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2015-03-16  6:00 UTC (permalink / raw)
  To: xfs; +Cc: linux-fsdevel

Hi Folks,

As I told many people at Vault last week, I used the (long) plane
flights to Boston to write a document outlining how we should modify
the on-disk structures of XFS to support host aware SMR drives.

TL;DR: not a lot of change to the XFS kernel code is required, and no
specific SMR awareness is needed by the kernel code.  Only relatively
minor tweaks to the on-disk format will be needed, and most of the
userspace changes are relatively straightforward, too.

The source for that document can be found in this git tree here:

git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation

in the file design/xfs-smr-structure.asciidoc. Alternatively,
pull it straight from cgit:

https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc

Or there is a pdf version built from the current TOT on the xfs.org
wiki here:

http://xfs.org/index.php/Host_Aware_SMR_architecture

Happy reading!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-16  6:00 ` Dave Chinner
@ 2015-03-16 15:28   ` James Bottomley
  -1 siblings, 0 replies; 21+ messages in thread
From: James Bottomley @ 2015-03-16 15:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel, linux-scsi

[cc to linux-scsi added since this seems relevant]
On Mon, 2015-03-16 at 17:00 +1100, Dave Chinner wrote:
> Hi Folks,
> 
> As I told many people at Vault last week, I used the (long) plane
> flights to Boston to write a document outlining how we should modify
> the on-disk structures of XFS to support host aware SMR drives.
> 
> TL;DR: not a lot of change to the XFS kernel code is required, and no
> specific SMR awareness is needed by the kernel code.  Only relatively
> minor tweaks to the on-disk format will be needed, and most of the
> userspace changes are relatively straightforward, too.
> 
> The source for that document can be found in this git tree here:
> 
> git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> 
> in the file design/xfs-smr-structure.asciidoc. Alternatively,
> pull it straight from cgit:
> 
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> 
> Or there is a pdf version built from the current TOT on the xfs.org
> wiki here:
> 
> http://xfs.org/index.php/Host_Aware_SMR_architecture
> 
> Happy reading!

I don't think it would have caused too much heartache to post the
entire doc to the list, but anyway...

My first comment is a meta question: what happened to the idea of
separating the fs block allocator from filesystems?  It looks like a
lot of the updates could be duplicated into other filesystems, so it
might be a very opportune time to think about this.
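
To make the idea concrete, a generic allocator would more or less reduce
to an operations table that each filesystem (or an SMR-aware module)
plugs its policy into.  A minimal sketch, with every name hypothetical --
this is not an existing kernel interface:

#include <stdint.h>

/* Hypothetical request/result types for a filesystem-neutral allocator. */
struct alloc_request {
        uint64_t     hint_block; /* locality hint, e.g. near the parent dir */
        uint64_t     min_len;    /* smallest acceptable extent */
        uint64_t     max_len;    /* ideal extent length */
        unsigned int flags;      /* e.g. "sequential only" for SMR zones */
};

struct alloc_result {
        uint64_t start_block;
        uint64_t len;
};

/*
 * A pluggable allocator is just a policy object: generic code calls these
 * hooks, and the filesystem (or an SMR-aware module) supplies them.
 */
struct block_allocator_ops {
        int  (*alloc)(void *priv, const struct alloc_request *req,
                      struct alloc_result *res);
        void (*free)(void *priv, uint64_t start_block, uint64_t len);
        int  (*reserve)(void *priv, uint64_t len);
};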


> == Data zones
> 
> What we need is a mechanism for tracking the location of zones (i.e. start LBA),
> free space/write pointers within each zone, and some way of keeping track of
> that information across mounts. If we assign a real time bitmap/summary inode
> pair to each zone, we have a method of tracking free space in the zone. We can
> use the existing bitmap allocator with a small tweak (sequentially ascending,
> packed extent allocation only) to ensure that newly written blocks are allocated
> in a sane manner.
> 
> We're going to need userspace to be able to see the contents of these inodes;
> read-only access will be needed to analyse the contents of the zone, so we're
> going to need a special directory to expose this information. It would be useful
> to have a ".zones" directory hanging off the root directory that contains all
> the zone allocation inodes so userspace can simply open them.

The ZBC standard is being constructed.  However, all revisions agree
that the drive is perfectly capable of tracking the zone pointers (and
even the zone status).  Rather than having you duplicate the information
within the XFS metadata, surely it's better for us to come up with some
block-layer way of reading it from the disk (and caching it for faster
access)?
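
For illustration, the kind of per-zone record such a shared, block-level
cache might hand back to a filesystem could look something like the
sketch below; the structure and the lookup function are hypothetical,
not an existing block-layer API:

#include <stdint.h>

/* Hypothetical per-zone record as a shared block-level cache might expose it. */
struct zone_desc {
        uint64_t start_lba;      /* first LBA of the zone */
        uint64_t len;            /* zone length in LBAs */
        uint64_t write_pointer;  /* next writable LBA, as reported by the drive */
        uint8_t  type;           /* conventional (CMR) or sequential-write (SMR) */
        uint8_t  cond;           /* empty, open, full, read-only, ... */
};

/*
 * Hypothetical lookup: the block layer answers from its cache and only
 * refreshes from the drive (REPORT ZONES) when the entry is missing or
 * marked stale, so filesystems never issue ZBC commands themselves.
 */
int blk_get_zone_desc(void *bdev, uint64_t lba, struct zone_desc *out);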


> == Quantification of Random Write Zone Capacity
> 
> A basic guideline is that for 4k blocks and zones of 256MB, we'll need 8kB of
> bitmap space and two inodes, so call it 10kB per 256MB zone. That's 40MB per TB
> for free space bitmaps. We'll want to support at least 1 million inodes per TB,
> so that's another 512MB per TB, plus another 256MB per TB for directory
> structures. There are other bits and pieces of metadata as well (attribute space,
> internal freespace btrees, reverse map btrees, etc.).
> 
> So, at minimum we will probably need at least 2GB of random write space per TB
> of SMR zone data space. Plus a couple of GB for the journal if we want the easy
> option. For those drive vendors out there that are listening and want good
> performance, replace the CMR region with an SSD....

This seems to be a place where standards work is still needed.  Right at
the moment for Host Managed, the physical layout of the drives makes it
reasonably simple to convert edge zones from SMR to CMR and vice versa
at the expense of changing capacity.  It really sounds like we need a
simple, programmatic way of doing this.  The question I'd have is: are
you happy with just telling manufacturers ahead of time how much CMR
space you need and hoping they comply, or should we push for a
standards-based way of flipping end zones to CMR?


> === Crash recovery
> 
> Write pointer location is undefined after power failure. It could be at an old
> location, the current location or anywhere in between. The only guarantee that
> we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
> least be in a position at or past the location of the fsync.
> 
> Hence before a filesystem runs journal recovery, all its zone allocation write
> pointers need to be set to what the drive thinks they are, and all of the zone
> allocation beyond the write pointer need to be cleared. We could do this during
> log recovery in kernel, but that means we need full ZBC awareness in log
> recovery to iterate and query all the zones.

If you just use a cached zone pointer provided by the block layer, this should
never be a problem because you'd always know where the drive thought the
pointer was.
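
In other words, recovery would start by resynchronising the filesystem's
idea of each zone's allocation point with whatever pointer the drive
reports.  A rough sketch of that loop, with hypothetical helpers standing
in for the real query and bitmap code:

#include <stdint.h>

/* Hypothetical per-zone allocation state kept by the filesystem. */
struct fs_zone {
        void     *dev;
        uint64_t  start_lba;
        uint64_t  len;           /* zone length in blocks */
        uint64_t  alloc_cursor;  /* next block the allocator may hand out */
};

/* Placeholders for the real drive query and bitmap updates. */
int  query_drive_write_pointer(void *dev, uint64_t zone_start, uint64_t *wp);
void mark_allocated(struct fs_zone *z, uint64_t first, uint64_t end);
void mark_free(struct fs_zone *z, uint64_t first, uint64_t end);

/*
 * Before journal replay, force the in-memory allocation state of each
 * sequential zone to agree with the drive: everything below the reported
 * write pointer is treated as allocated (journal replay decides what is
 * live), everything at or above it is free and must be written in order.
 */
static int resync_zone_alloc_state(struct fs_zone *z)
{
        uint64_t wp;
        int err = query_drive_write_pointer(z->dev, z->start_lba, &wp);

        if (err)
                return err;

        z->alloc_cursor = wp;
        mark_allocated(z, z->start_lba, wp);
        mark_free(z, wp, z->start_lba + z->len);
        return 0;
}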


> === RAID on SMR....
> 
> How does RAID work with SMR, and exactly what does that look like to
> the filesystem?
> 
> How does libzbc work with RAID given it is implemented through the scsi ioctl
> interface?

Probably need to cc dm-devel here.  However, I think we're all agreed
this is RAID across multiple devices, rather than within a single
device?  In which case we just need a way of ensuring identical zoning
on the raided devices and what you get is either a standard zone (for
mirror) or a larger zone (for hamming etc).
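
Put another way, the RAID layer's job largely reduces to verifying that
the members have identical zone geometry and then exporting either the
native zone size (mirror) or a multiple of it (striped/parity layouts).
A small sketch of that check, with hypothetical types:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device geometry as reported by each member drive. */
struct member_geom {
        uint64_t zone_len;       /* LBAs per zone, assumed uniform */
        uint64_t nr_zones;
};

/*
 * Exposed zone length for an array of 'n' identically-zoned members:
 * mirrors keep the native zone size, striped/parity layouts multiply it
 * by the number of data-bearing members.
 */
static bool array_zone_geometry(const struct member_geom *m, int n,
                                int data_members, uint64_t *exposed_zone_len)
{
        int i;

        for (i = 1; i < n; i++)
                if (m[i].zone_len != m[0].zone_len ||
                    m[i].nr_zones != m[0].nr_zones)
                        return false;   /* refuse to assemble mismatched zoning */

        *exposed_zone_len = m[0].zone_len * (uint64_t)data_members;
        return true;
}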

James



* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-16 15:28   ` James Bottomley
@ 2015-03-16 18:23     ` Adrian Palmer
  -1 siblings, 0 replies; 21+ messages in thread
From: Adrian Palmer @ 2015-03-16 18:23 UTC (permalink / raw)
  To: James Bottomley, Dave Chinner
  Cc: xfs, Linux Filesystem Development List, linux-scsi, ext4 development

Thanks for the document!  I think we are off to a good start going in
a common direction.  We have quite a few details to iron out, but I
feel that we are getting there by everyone simply expressing what's
needed.

My additions are in-line.


Adrian Palmer
Firmware Engineer II
R&D Firmware
Seagate, Longmont Colorado
720-684-1307
adrian.palmer@seagate.com


On Mon, Mar 16, 2015 at 9:28 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> [cc to linux-scsi added since this seems relevant]
> On Mon, 2015-03-16 at 17:00 +1100, Dave Chinner wrote:
>> Hi Folks,
>>
>> As I told many people at Vault last week, I used the (long) plane
>> flights to Boston to write a document outlining how we should modify
>> the on-disk structures of XFS to support host aware SMR drives.
>>
>> TL;DR: not a lot of change to the XFS kernel code is required, and no
>> specific SMR awareness is needed by the kernel code.  Only relatively
>> minor tweaks to the on-disk format will be needed, and most of the
>> userspace changes are relatively straightforward, too.
>>
>> The source for that document can be found in this git tree here:
>>
>> git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
>>
>> in the file design/xfs-smr-structure.asciidoc. Alternatively,
>> pull it straight from cgit:
>>
>> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
>>
>> Or there is a pdf version built from the current TOT on the xfs.org
>> wiki here:
>>
>> http://xfs.org/index.php/Host_Aware_SMR_architecture
>>
>> Happy reading!
>
> I don't think it would have caused too much heartache to post the
> entire doc to the list, but anyway...
> 
> My first comment is a meta question: what happened to the idea of
> separating the fs block allocator from filesystems?  It looks like a
> lot of the updates could be duplicated into other filesystems, so it
> might be a very opportune time to think about this.
>

That's not a half-bad idea.  In speaking to the EXT4 dev group, we're
already looking at pulling the block allocator out and making it
pluggable.  I'm looking at doing a clean re-write anyway for SMR.
However, the question I have is about CoW vs non-CoW system differences
in allocation preferences, and what other changes need to be made in
*all* the file systems.

>
>> == Data zones
>>
>> What we need is a mechanism for tracking the location of zones (i.e. start LBA),
>> free space/write pointers within each zone, and some way of keeping track of
>> that information across mounts. If we assign a real time bitmap/summary inode
>> pair to each zone, we have a method of tracking free space in the zone. We can
>> use the existing bitmap allocator with a small tweak (sequentially ascending,
>> packed extent allocation only) to ensure that newly written blocks are allocated
>> in a sane manner.
>>
>> We're going to need userspace to be able to see the contents of these inodes;
>> read-only access will be needed to analyse the contents of the zone, so we're
>> going to need a special directory to expose this information. It would be useful
>> to have a ".zones" directory hanging off the root directory that contains all
>> the zone allocation inodes so userspace can simply open them.
>
> The ZBC standard is being constructed.  However, all revisions agree
> that the drive is perfectly capable of tracking the zone pointers (and
> even the zone status).  Rather than having you duplicate the information
>> within the XFS metadata, surely it's better for us to come up with some
>> block-layer way of reading it from the disk (and caching it for faster
> access)?
>

In discussions with Dr. Reinecke, it seems extremely prudent to have a
kernel cache somewhere.  The SD driver would be the base for updating
the cache, but it would need to be available to the allocators, to
sysfs for userspace utilities, and possibly to other processes.  In
EXT4, I don't think it's feasible to have the cache -- however, the
metadata will MIRROR the cache (BG# = Zone#, data bitmap = WP, etc.).
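
The mirroring described here is easy to picture: with one block group
per zone, the write pointer is simply the boundary between set and clear
bits in that group's data bitmap.  A toy sketch of the correspondence
(sizes and helper names are illustrative only):

#include <stdint.h>

#define BLOCK_SIZE       4096ULL
#define ZONE_SIZE        (256ULL << 20)          /* 256 MB zone */
#define BLOCKS_PER_ZONE  (ZONE_SIZE / BLOCK_SIZE) /* = blocks per block group */

/* With BG# == Zone#, a filesystem block maps straight to its zone. */
static inline uint64_t block_to_zone(uint64_t fs_block)
{
        return fs_block / BLOCKS_PER_ZONE;
}

/*
 * The data bitmap mirrors the write pointer: if the drive reports the WP
 * at 'wp_lba', then all bitmap bits below the returned index are set
 * (written) and everything at or above it is clear and must be allocated
 * strictly in ascending order.
 */
static inline uint64_t wp_to_first_free_bit(uint64_t wp_lba,
                                            uint64_t zone_start_lba,
                                            unsigned int lbas_per_block)
{
        return (wp_lba - zone_start_lba) / lbas_per_block;
}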

>
>> == Quantification of Random Write Zone Capacity
>>
>> A basic guideline is that for 4k blocks and zones of 256MB, we'll need 8kB of
>> bitmap space and two inodes, so call it 10kB per 256MB zone. That's 40MB per TB
>> for free space bitmaps. We'll want to support at least 1 million inodes per TB,
>> so that's another 512MB per TB, plus another 256MB per TB for directory
>> structures. There are other bits and pieces of metadata as well (attribute space,
>> internal freespace btrees, reverse map btrees, etc.).
>>
>> So, at minimum we will probably need at least 2GB of random write space per TB
>> of SMR zone data space. Plus a couple of GB for the journal if we want the easy
>> option. For those drive vendors out there that are listening and want good
>> performance, replace the CMR region with an SSD....
>
> This seems to be a place where standards work is still needed.  Right at
> the moment for Host Managed, the physical layout of the drives makes it
> reasonably simple to convert edge zones from SMR to CMR and vice versa
> at the expense of changing capacity.  It really sounds like we need a
> simple, programmatic way of doing this.  The question I'd have is: are
> you happy with just telling manufacturers ahead of time how much CMR
> space you need and hoping they comply, or should we push for a
> standards-based way of flipping end zones to CMR?
>

I agree this is an issue, but for HA (and less so for HM), there is a
lot of flexibility needed here.  In our BoFs at Vault, we talked about
partitioning needs.  We cannot assume that there is 1 partition per
disk, and that it has absolute boundaries.  Sure, a data disk can have
1 partition from LBA 0 to the end of the disk, but an OS disk can't.
For example, GPT and EFI cause problems.  On the other end, gamers and
hobbyists tend to dual/triple boot....  There cannot be a
one-size-fits-all partition layout.

The conversion between CMR and SMR zones is not simple.  That's a
hardware format.  Any change in the LBA space would be non-linear.

One idea that I came up with in our BoFs is using flash with an FTL.
If the manufacturers put in enough flash to cover 8 or so zones, then
a command could be implemented to allow the flash to be assigned to
zones.  That way, a limited number of CMR zones can be placed anywhere
on the disk without disrupting format or LBA space.  However, ZAC/ZBC
is to be applied to flash also...

>
>> === Crash recovery
>>
>> Write pointer location is undefined after power failure. It could be at an old
>> location, the current location or anywhere in between. The only guarantee that
>> we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
>> least be in a position at or past the location of the fsync.
>>
>> Hence before a filesystem runs journal recovery, all its zone allocation write
>> pointers need to be set to what the drive thinks they are, and all of the zone
>> allocation beyond the write pointer need to be cleared. We could do this during
>> log recovery in kernel, but that means we need full ZBC awareness in log
>> recovery to iterate and query all the zones.
>
> If you just use a cached zone pointer provided by the block layer, this should
> never be a problem because you'd always know where the drive thought the
> pointer was.

This would require a look at the order of updating the stack
information, and also at WCD vs WCE (write cache disabled vs enabled)
behavior.  As for the WP, the spec says that any data after the WP is
returned with a clear pattern (zeros on Seagate drives) -- it is
already cleared.

>
>
>> === RAID on SMR....
>>
>> How does RAID work with SMR, and exactly what does that look like to
>> the filesystem?
>>
>> How does libzbc work with RAID given it is implemented through the scsi ioctl
>> interface?
>
> Probably need to cc dm-devel here.  However, I think we're all agreed
> this is RAID across multiple devices, rather than within a single
> device?  In which case we just need a way of ensuring identical zoning
> on the raided devices and what you get is either a standard zone (for
> mirror) or a larger zone (for hamming etc).
>

I agree.  It's up to the DM to mangle the zones and provide properly
modified zone info up to the FS.  In the case of mirroring, keep the
same zone size, just half the total number of zones (or half of them
in a read-only/full condition).  In striped paradigms, double the zone
size (or more if the zone sizes don't match, or if there are more than
2 drives) and let the DM mod the block numbers to determine the
correct disk.  For EXT4, this REQUIRES the equivalent of 8k blocks.
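
One way to read the striped case: if the DM exposes zones N times the
native size and deals blocks out round-robin, every member still sees
purely sequential writes within its physical zone.  A sketch of that
address arithmetic, as one possible interpretation rather than a
description of any existing DM target:

#include <stdint.h>

struct stripe_map {
        int      dev;            /* which member device */
        uint64_t dev_block;      /* block number on that device */
};

/*
 * Round-robin striping at block granularity over 'ndevs' identically
 * zoned devices.  The exposed zone is ndevs * zone_blocks long; dealing
 * blocks round-robin means each member is still written strictly
 * sequentially inside its physical zone, which is what SMR requires.
 */
static struct stripe_map map_block(uint64_t logical_block,
                                   uint64_t zone_blocks, int ndevs)
{
        uint64_t exposed_zone = logical_block / (zone_blocks * ndevs);
        uint64_t offset       = logical_block % (zone_blocks * ndevs);
        struct stripe_map m;

        m.dev = offset % ndevs;
        m.dev_block = exposed_zone * zone_blocks + offset / ndevs;
        return m;
}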

> James
>
>

== Kernel implementation

The allocator will need to learn about multiple allocation zones based on
bitmaps. They aren't really allocation groups, but the initialisation and
iteration of them is going to be similar to allocation groups. To get us going
we can do some simple mapping between inode AG and data AZ so that we
keep some form of locality to related data (e.g. grouping of data by parent
directory).

We can do simple things first - simply rotoring allocation across zones will get
us moving very quickly, and then we can refine it once we have more than just a
proof of concept prototype.

Optimising data allocation for SMR is going to be tricky, and I hope to be able
to leave that to drive vendor engineers....

Ideally, we won't need a zbc interface in the kernel, except to erase zones.
I'd like to see an interface that doesn't even require that. For example, we
issue a discard (TRIM) on an entire zone and that erases it and resets the
write pointer. This way we need no new infrastructure at the filesystem layer
to implement SMR awareness. In effect, the kernel isn't even aware that it's
an SMR drive underneath it.


Dr. Reinecke has already done the Discard/TRIM stuff.  However, he
has so far ignored the zone management pieces.  I have thought
(briefly) about the possible need for a new allocator: the group
allocator.  As there can only be (relatively) few zones available at
any one time, we might need a mechanism to tell which are available
and which are not.  The stack will have to work together to find a way
to request and use zones in an orderly fashion.
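
Such a "group allocator" might be little more than a small table of
currently open zones with a hard cap, handed out in rotor order.  A
minimal, purely illustrative sketch of the bookkeeping:

#include <stdint.h>

#define MAX_OPEN_ZONES  8       /* illustrative cap on simultaneously open zones */

struct open_zone {
        uint64_t zone_no;
        uint64_t alloc_cursor;   /* next free block inside the zone */
        int      in_use;         /* slot holds a currently open zone */
};

struct group_allocator {
        struct open_zone open[MAX_OPEN_ZONES];
        unsigned int     rotor;  /* round-robin cursor */
};

/*
 * Pick the next open zone in rotor order to allocate from, or return -1
 * if no zone is currently open and one must first be opened (or another
 * finished) to make a slot available.
 */
static int pick_open_zone(struct group_allocator *ga)
{
        unsigned int i;

        for (i = 0; i < MAX_OPEN_ZONES; i++) {
                unsigned int slot = (ga->rotor + i) % MAX_OPEN_ZONES;

                if (ga->open[slot].in_use) {
                        ga->rotor = (slot + 1) % MAX_OPEN_ZONES;
                        return (int)slot;
                }
        }
        return -1;
}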



* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-16 18:23     ` Adrian Palmer
@ 2015-03-16 19:06       ` James Bottomley
  -1 siblings, 0 replies; 21+ messages in thread
From: James Bottomley @ 2015-03-16 19:06 UTC (permalink / raw)
  To: Adrian Palmer
  Cc: Dave Chinner, xfs, Linux Filesystem Development List, linux-scsi,
	ext4 development

On Mon, 2015-03-16 at 12:23 -0600, Adrian Palmer wrote:
[...]
> >> == Data zones
> >>
> >> What we need is a mechanism for tracking the location of zones (i.e. start LBA),
> >> free space/write pointers within each zone, and some way of keeping track of
> >> that information across mounts. If we assign a real time bitmap/summary inode
> >> pair to each zone, we have a method of tracking free space in the zone. We can
> >> use the existing bitmap allocator with a small tweak (sequentially ascending,
> >> packed extent allocation only) to ensure that newly written blocks are allocated
> >> in a sane manner.
> >>
> >> We're going to need userspace to be able to see the contents of these inodes;
> >> read-only access will be needed to analyse the contents of the zone, so we're
> >> going to need a special directory to expose this information. It would be useful
> >> to have a ".zones" directory hanging off the root directory that contains all
> >> the zone allocation inodes so userspace can simply open them.
> >
> > The ZBC standard is being constructed.  However, all revisions agree
> > that the drive is perfectly capable of tracking the zone pointers (and
> > even the zone status).  Rather than having you duplicate the information
> > within the XFS metadata, surely it's better for us to come up with some
> > block-layer way of reading it from the disk (and caching it for faster
> > access)?
> >
> 
> In discussions with Dr. Reinecke, it seems extremely prudent to have a
> kernel cache somewhere.  The SD driver would be the base for updating
> the cache, but it would need to be available to the allocators, to
> sysfs for userspace utilities, and possibly to other processes.  In
> EXT4, I don't think it's feasible to have the cache -- however, the
> metadata will MIRROR the cache (BG# = Zone#, data bitmap = WP, etc.).

I think I've got two points.  First, if we're caching it, we should
have a single cache and everyone should use it.  There may be a good
reason why we can't do this, but I'd like to see it explained before
everyone goes off and invents their own zone pointer cache.  Second,
if we do it in one place, we can make the cache properly shrinkable
(the information can be purged under memory pressure and re-fetched
if requested).
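
Behaviourally, the cache only needs two properties: entries can be
dropped at any time under memory pressure, and a miss is repaired by
re-querying the drive.  A simplified model of that contract (not the
kernel's actual shrinker machinery):

#include <stddef.h>
#include <stdint.h>

struct wp_entry {
        uint64_t zone_start;
        uint64_t write_pointer;
        int      valid;
};

struct wp_cache {
        struct wp_entry *slots;  /* one entry per zone */
        size_t           nr;
};

/* Placeholder for the real drive query (REPORT ZONES or similar). */
extern int query_write_pointer(uint64_t zone_start, uint64_t *wp);

/* Lookup: answer from the cache if valid, otherwise re-fetch from the
 * drive and repopulate the slot.  Safe to call after a purge. */
static int wp_cache_get(struct wp_cache *c, size_t zone, uint64_t *wp)
{
        if (zone >= c->nr)
                return -1;

        if (!c->slots[zone].valid) {
                int err = query_write_pointer(c->slots[zone].zone_start, wp);
                if (err)
                        return err;
                c->slots[zone].write_pointer = *wp;
                c->slots[zone].valid = 1;
        }
        *wp = c->slots[zone].write_pointer;
        return 0;
}

/* Purge under memory pressure: drop everything; the next lookup refetches. */
static void wp_cache_shrink(struct wp_cache *c)
{
        size_t i;

        for (i = 0; i < c->nr; i++)
                c->slots[i].valid = 0;
}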

> >
> >> == Quantification of Random Write Zone Capacity
> >>
> >> A basic guideline is that for 4k blocks and zones of 256MB, we'll need 8kB of
> >> bitmap space and two inodes, so call it 10kB per 256MB zone. That's 40MB per TB
> >> for free space bitmaps. We'll want to support at least 1 million inodes per TB,
> >> so that's another 512MB per TB, plus another 256MB per TB for directory
> >> structures. There are other bits and pieces of metadata as well (attribute space,
> >> internal freespace btrees, reverse map btrees, etc.).
> >>
> >> So, at minimum we will probably need at least 2GB of random write space per TB
> >> of SMR zone data space. Plus a couple of GB for the journal if we want the easy
> >> option. For those drive vendors out there that are listening and want good
> >> performance, replace the CMR region with an SSD....
> >
> > This seems to be a place where standards work is still needed.  Right at
> > the moment for Host Managed, the physical layout of the drives makes it
> > reasonably simple to convert edge zones from SMR to CMR and vice versa
> > at the expense of changing capacity.  It really sounds like we need a
> > simple, programmatic way of doing this.  The question I'd have is: are
> > you happy with just telling manufacturers ahead of time how much CMR
> > space you need and hoping they comply, or should we push for a
> > standards-based way of flipping end zones to CMR?
> >
> 
> I agree this is an issue, but for HA (and less so for HM), there is a
> lot of flexibility needed here.  In our BoFs at Vault, we talked about
> partitioning needs.  We cannot assume that there is 1 partition per
> disk, and that it has absolute boundaries.  Sure, a data disk can have
> 1 partition from LBA 0 to the end of the disk, but an OS disk can't.
> For example, GPT and EFI cause problems.  On the other end, gamers and
> hobbyists tend to dual/triple boot....  There cannot be a
> one-size-fits-all partition layout.
> 
> The conversion between CMR and SMR zones is not simple.  That's a
> hardware format.  Any change in the LBA space would be non-linear.
> 
> One idea that I came up with in our BoFs is using flash with an FTL.
> If the manufacturers put in enough flash to cover 8 or so zones, then
> a command could be implemented to allow the flash to be assigned to
> zones.  That way, a limited number of CMR zones can be placed anywhere
> on the disk without disrupting format or LBA space.  However, ZAC/ZBC
> is to be applied to flash also...

Perhaps we need to step back a bit.  The problem is that most
filesystems will require some CMR space for metadata that is
continuously updated in place.  The amount will probably vary wildly by
specific filesystem and size, but it looks like everyone (except
possibly btrfs) will need some.  One possibility is that we let the
drives be reformatted in place, say as part of the initial filesystem
format, so the CMR requirements get tuned exactly.  The other is that we
simply let the manufacturers give us "enough" and try to determine what
"enough" is.

I suspect forcing a tuning command through the ZBC workgroup would be a
nice quick way of getting the manufacturers to focus on what is
possible, but I think we do need some way of closing out this either/or
debate (we tune or you tune).

> >
> >> === Crash recovery
> >>
> >> Write pointer location is undefined after power failure. It could be at an old
> >> location, the current location or anywhere in between. The only guarantee that
> >> we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
> >> least be in a position at or past the location of the fsync.
> >>
> >> Hence before a filesystem runs journal recovery, all its zone allocation write
> >> pointers need to be set to what the drive thinks they are, and all of the zone
> >> allocation beyond the write pointer need to be cleared. We could do this during
> >> log recovery in kernel, but that means we need full ZBC awareness in log
> >> recovery to iterate and query all the zones.
> >
> > If you just use a cached zone pointer provided by the block layer, this should
> > never be a problem because you'd always know where the drive thought the
> > pointer was.
> 
> This would require a look at the order of updating the stack
> information, and also at WCD vs WCE (write cache disabled vs enabled)
> behavior.  As for the WP, the spec says that any data after the WP is
> returned with a clear pattern (zeros on Seagate drives) -- it is
> already cleared.

As long as the drive behaves to spec, our consistency algorithms should
be able to cope.  We would expect that on a crash the write pointer
would be further back than we think it should be, but then the FS will
just follow its consistency recovery procedures and either roll back or
roll forward the transactions from where the WP is.  In some ways, the
WP will help us, because today we re-commit a lot of transactions that
may already be on disk, since we don't clearly know where the device
stopped writing data.

> >> === RAID on SMR....
> >>
> >> How does RAID work with SMR, and exactly what does that look like to
> >> the filesystem?
> >>
> >> How does libzbc work with RAID given it is implemented through the scsi ioctl
> >> interface?
> >
> > Probably need to cc dm-devel here.  However, I think we're all agreed
> > this is RAID across multiple devices, rather than within a single
> > device?  In which case we just need a way of ensuring identical zoning
> > on the raided devices and what you get is either a standard zone (for
> > mirror) or a larger zone (for hamming etc).
> >
> 
> I agree.  It's up to the DM to mangle the zones and provide properly
> modified zone info up to the FS.  In the case of mirroring, keep the
> same zone size, just half the total number of zones (or half of them
> in a read-only/full condition).  In striped paradigms, double the zone
> size (or more if the zone sizes don't match, or if there are more than
> 2 drives) and let the DM mod the block numbers to determine the
> correct disk.  For EXT4, this REQUIRES the equivalent of 8k blocks.
> 
> > James
> >
> >
> 
> == Kernel implementation
> 
> The allocator will need to learn about multiple allocation zones based on
> bitmaps. They aren't really allocation groups, but the initialisation and
> iteration of them is going to be similar to allocation groups. To get us going
> we can do some simple mapping between inode AG and data AZ so that we
> keep some form of locality to related data (e.g. grouping of data by parent
> directory).
> 
> We can do simple things first - simply rotoring allocation across zones will get
> us moving very quickly, and then we can refine it once we have more than just a
> proof of concept prototype.
> 
> Optimising data allocation for SMR is going to be tricky, and I hope to be able
> to leave that to drive vendor engineers....

I think we'd all be interested in whether the "write and return
allocation position" approach suggested at LSF/MM would prove useful
for this (and whether the manufacturers are interested in prototyping
it with us).

> Ideally, we won't need a zbc interface in the kernel, except to erase zones.
> I'd like to see an interface that doesn't even require that. For example, we
> issue a discard (TRIM) on an entire zone and that erases it and resets the
> write pointer. This way we need no new infrastructure at the filesystem layer
> to implement SMR awareness. In effect, the kernel isn't even aware that it's
> an SMR drive underneath it.
> 
> 
> Dr. Reinecke has already done the Discard/TRIM stuff.  However, he
> has so far ignored the zone management pieces.  I have thought
> (briefly) about the possible need for a new allocator: the group
> allocator.  As there can only be (relatively) few zones available at
> any one time, we might need a mechanism to tell which are available
> and which are not.  The stack will have to work together to find a way
> to request and use zones in an orderly fashion.

Here I think the sense of LSF/MM was that only allowing a fixed number
of zones to be open would get a bit unmanageable (unless the drive
silently manages it for us).  The idea of different-sized zones is also
a complicating factor.  The other open question is: if we go for fully
drive-managed, what sort of alignment, size, trim and anything else
should we do to make the drive's job easier?  I'm guessing we won't
really have a practical answer to any of these until we see how the
market responds.

James



* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-16 19:06       ` James Bottomley
@ 2015-03-16 20:20         ` Dave Chinner
  -1 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2015-03-16 20:20 UTC (permalink / raw)
  To: James Bottomley
  Cc: Adrian Palmer, xfs, Linux Filesystem Development List,
	linux-scsi, ext4 development

On Mon, Mar 16, 2015 at 03:06:27PM -0400, James Bottomley wrote:
> On Mon, 2015-03-16 at 12:23 -0600, Adrian Palmer wrote:
> [...]
> > >> == Data zones
> > >>
> > >> What we need is a mechanism for tracking the location of zones (i.e. start LBA),
> > >> free space/write pointers within each zone, and some way of keeping track of
> > >> that information across mounts. If we assign a real time bitmap/summary inode
> > >> pair to each zone, we have a method of tracking free space in the zone. We can
> > >> use the existing bitmap allocator with a small tweak (sequentially ascending,
> > >> packed extent allocation only) to ensure that newly written blocks are allocated
> > >> in a sane manner.
> > >>
> > >> We're going to need userspace to be able to see the contents of these inodes;
> > >> read only access wil be needed to analyse the contents of the zone, so we're
> > >> going to need a special directory to expose this information. It would be useful
> > >> to have a ".zones" directory hanging off the root directory that contains all
> > >> the zone allocation inodes so userspace can simply open them.
> > >
> > > The ZBC standard is being constructed.  However, all revisions agree
> > > that the drive is perfectly capable of tracking the zone pointers (and
> > > even the zone status).  Rather than having you duplicate the information
> > > within the XFS metadata, surely it's better with us to come up with some
> > > block way of reading it from the disk (and caching it for faster
> > > access)?

You misunderstand my proposal - XFS doesn't track the write pointer
in its metadata at all. It tracks a sequential allocation target
block in each zone via the per-zone allocation bitmap inode. The
assumption is that this will match the underlying zone write
pointer, as long as we verify they match when we first go to
allocate from the zone.
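
As a rough illustration of that verification step - made-up names stand
in for the real zone allocation inode and ZBC query code, so treat this
as a sketch of the idea rather than XFS code:

#include <stdint.h>
#include <stdio.h>

struct zone_alloc_state {
	uint64_t zone_start;	/* first block of the zone */
	uint64_t alloc_target;	/* next block the fs intends to hand out */
};

/* Stand-in for a ZBC REPORT ZONES query; here it just pretends the
 * drive's write pointer sits a little behind our cached target, as it
 * might after a crash. */
static uint64_t query_drive_write_pointer(uint64_t zone_start)
{
	return zone_start + 100;
}

/* Before the first allocation from a zone, trust the device: pull the
 * filesystem's sequential allocation target back (or forward) to the
 * write pointer the drive reports. */
static uint64_t zone_alloc_sync(struct zone_alloc_state *z)
{
	uint64_t wp = query_drive_write_pointer(z->zone_start);

	if (z->alloc_target != wp)
		z->alloc_target = wp;
	return z->alloc_target;
}

int main(void)
{
	struct zone_alloc_state z = {
		.zone_start = 65536,
		.alloc_target = 65536 + 128,
	};

	printf("allocate from block %llu\n",
	       (unsigned long long)zone_alloc_sync(&z));
	return 0;
}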

> > In discussions with Dr. Reinecke, it seems extremely prudent to have a
> > kernel cache somewhere.  The SD driver would be the base for updating
> > the cache, but it would need to be available to the allocators, the
> > /sys fs for userspace utilities, and possibly other processes.  In
> > EXT4, I don't think it's feasible to have the cache -- however, the
> > metadata will MIRROR the cache ( BG# = Zone#, databitmap = WP, etc)
> 
> I think I've got two points: if we're caching it, we should have a
> single cache and everyone should use it.  There may be a good reason why
> we can't do this, but I'd like to see it explained before everyone goes
> off and invents their own zone pointer cache.  If we do it in one place,
> we can make the cache properly shrinkable (the information can be purged
> under memory pressure and re-fetched if requested).

Sure, but XFS won't have its own cache, so what the kernel does
here when we occasionally query the location of the write pointer is
irrelevant to me...

> > >> == Quantification of Random Write Zone Capacity
> > >>
> > >> A basic guideline is that for 4k blocks and zones of 256MB, we'll need 8kB of
> > >> bitmap space and two inodes, so call it 10kB per 256MB zone. That's 40MB per TB
> > >> for free space bitmaps. We'll want to suport at least 1 million inodes per TB,
> > >> so that's another 512MB per TB, plus another 256MB per TB for directory
> > >> structures. There's other bits and pieces of metadata as well (attribute space,
> > >> internal freespace btrees, reverse map btrees, etc.
> > >>
> > >> So, at minimum we will probably need at least 2GB of random write space per TB
> > >> of SMR zone data space. Plus a couple of GB for the journal if we want the easy
> > >> option. For those drive vendors out there that are listening and want good
> > >> performance, replace the CMR region with a SSD....
> > >
> > > This seems to be a place where standards work is still needed.  Right at
> > > the moment for Host Managed, the physical layout of the drives makes it
> > > reasonably simple to convert edge zones from SMR to CMR and vice versa
> > > at the expense of changing capacity.  It really sounds like we need a
> > > simple, programmatic way of doing this.  The question I'd have is: are
> > > you happy with just telling manufacturers ahead of time how much CMR
> > > space you need and hoping they comply, or should we push for a standards
> > > way of flipping end zones to CMR?

I've taken what manufacturers are already shipping and found that it
is sufficient for our purposes. They've already set the precedent, so
we'll be dependent on them maintaining that same percentage of
CMR:SMR regions in their drives. Otherwise, they won't have
filesystems that run on their drives and they won't sell any of
them.

i.e. we don't need to standardise anything here - the problem is
already solved.

> possibly btrfs) will need some.  One possibility is that we let the
> drives be reformatted in place, say as part of the initial filesystem
> format, so the CMR requirements get tuned exactly.  The other is that we
> simply let the manufacturers give us "enough" and try to determine what
> "enough" is.

Drive manufacturers are already giving us "enough" for the market
space in which we expect XFS-on-SMR drives to be used. Making it
tunable is silly - if you are that close to the edge then DM can build
you a device that has a larger CMR from an SSD....

> I suspect forcing a tuning command through the ZBC workgroup would be a
> nice quick way of getting the manufacturers to focus on what is
> possible, but I think we do need some way of closing out this either/or
> debate (we tune or you tune).

It's already there in shipping drives...

> > >> === Crash recovery
> > >>
> > >> Write pointer location is undefined after power failure. It could be at an old
> > >> location, the current location or anywhere in between. The only guarantee that
> > >> we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
> > >> least be in a position at or past the location of the fsync.
> > >>
> > >> Hence before a filesystem runs journal recovery, all it's zone allocation write
> > >> pointers need to be set to what the drive thinks they are, and all of the zone
> > >> allocation beyond the write pointer need to be cleared. We could do this during
> > >> log recovery in kernel, but that means we need full ZBC awareness in log
> > >> recovery to iterate and query all the zones.
> > >
> > > If you just use a cached zone pointer provided by block, this should
> > > never be a problem because you'd always know where the drive thought the
> > > pointer was.
> > 
> > This would require a look at the order of updating the stack
> > information, and also WCD vs WCE behavior.  As for the WP, the spec
> > says that any data after the WP is returned with a clear pattern
> > (zeros on Seagate drives) -- it is already cleared.
> 
> As long as the drive behaves to spec, our consistency algorithms should
> be able to cope.  We would expect that on a crash the write pointer
> would be further back than we think it should be, but then the FS will
> just follow its consistency recovery procedures and either roll back or
> forward the transactions from where the WP is at.

Journal recovery doesn't work that way - you can't roll back random
changes midway through recovery and expect the result to be a
consistent filesystem.

If we run recovery fully, then we have blocks allocated to files
beyond the write pointer and that leaves us two choices:

	- writing zeros to the blocks allocated beyond the write
	  pointer during log recovery to get stuff back in sync,
	  prevent stale data exposure and double-referenced blocks
	- revoke the allocated blocks beyond the write pointer so
	  they can be allocated correctly on the next write.

Either way, it's different behaviour and we need to run write pointer
synchronisation after log recovery to detect the problems...
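
A toy sketch of what that per-zone post-recovery check might look like,
purely to illustrate the two remedies above - the types and names are
hypothetical, not XFS code:

#include <stdint.h>
#include <stdio.h>

struct zone_state {
	uint64_t wp;			/* write pointer reported by the drive */
	uint64_t highest_allocated;	/* from recovered fs metadata */
};

enum wp_fixup { WP_IN_SYNC, WP_ZERO_TAIL, WP_REVOKE_BLOCKS };

static enum wp_fixup wp_resync(const struct zone_state *z, int prefer_zeroing)
{
	if (z->highest_allocated <= z->wp)
		return WP_IN_SYNC;	/* nothing allocated beyond the WP */
	/*
	 * Blocks are allocated beyond where the drive stopped writing:
	 * either write zeros out to highest_allocated to catch the WP up,
	 * or revoke those allocations so the next write re-allocates them.
	 */
	return prefer_zeroing ? WP_ZERO_TAIL : WP_REVOKE_BLOCKS;
}

int main(void)
{
	struct zone_state z = { .wp = 1000, .highest_allocated = 1024 };

	printf("fixup = %d\n", wp_resync(&z, 0));
	return 0;
}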

> In some ways, the WP
> will help us, because we do a lot of re-committing transactions that may
> be on disk currently because we don't clearly know where the device
> stopped writing data.

And therein lies the fundamental reason why write pointer
synchronisation after unclean shutdown is a really hard problem.

> > == Kernel implementation
> > 
> > The allocator will need to learn about multiple allocation zones based on
> > bitmaps. They aren't really allocation groups, but the initialisation and
> > iteration of them is going to be similar to allocation groups. To get use going
> > we can do some simple mapping between inode AG and data AZ mapping so that we
> > keep some form of locality to related data (e.g. grouping of data by parent
> > directory).
> > 
> > We can do simple things first - simply rotoring allocation across zones will get
> > us moving very quickly, and then we can refine it once we have more than just a
> > proof of concept prototype.
> > 
> > Optimising data allocation for SMR is going to be tricky, and I hope to be able
> > to leave that to drive vendor engineers....

Maybe in 5 years' time....

> I think we'd all be interested in whether the write and return
> allocation position suggested at LSF/MM would prove useful for this (and
> whether the manufacturers are interested in prototyping it with us).

Right, that's where we need to head. I've got several other block
layer interfaces in mind that could use exactly this semantic to
avoid significant complexity in the filesystem layers.

> > Ideally, we won't need a zbc interface in the kernel, except to erase zones.
> > I'd like to see an interface that doesn't even require that. For example, we
> > issue a discard (TRIM) on an entire  zone and that erases it and
> > resets the write
> > pointer. This way we need no new infrastructure at the filesystem layer to
> > implement SMR awareness. In effect, the kernel isn't even aware that it's an SMR
> > drive underneath it.
> > 
> > 
> > Dr. Reinecke has already done the Discard/TRIM stuff.  However, he's
> > as of yet ignored the zone management pieces.  I have thought
> > (briefly) of the possible need for a new allocator:  the group
> > allocator.  As there can only be a few (relatively) zones available at
> > any one time, We might need a mechanism to tell which are available
> > and which are not.  The stack will have to collectively work together
> > to find a way to request and use zones in an orderly fashion.
> 
> Here I think the sense of LSF/MM was that only allowing a fixed number
> of zones to be open would get a bit unmanageable (unless the drive
> silently manages it for us).  The idea of different sized zones is also
> a complicating factor.

Not for XFS - my proposal handles variable sized zones without any
additional complexity. Indeed, it will handle zone sizes from 16MB
to 1TB without any modification - mkfs handles it all when it
queries the zones and sets up the zone allocation inodes...

And we limit the number of "open zones" by the number of zone groups
we allow concurrent allocation to....
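
For illustration, the bound falls out naturally if new allocations
simply rotor across a small fixed set of zone groups - a toy sketch with
invented names, assuming four concurrently open groups:

#include <stdint.h>
#include <stdio.h>

#define CONCURRENT_ZONE_GROUPS 4	/* assumed allocation concurrency */

struct zone_group {
	uint64_t next_block;	/* sequential allocation cursor */
};

static struct zone_group groups[CONCURRENT_ZONE_GROUPS];
static unsigned int rotor;

/* Pick the next zone group round-robin; at most CONCURRENT_ZONE_GROUPS
 * zones ever have in-flight sequential writes. */
static struct zone_group *pick_zone_group(void)
{
	struct zone_group *zg = &groups[rotor];

	rotor = (rotor + 1) % CONCURRENT_ZONE_GROUPS;
	return zg;
}

int main(void)
{
	for (int i = 0; i < 6; i++)
		printf("allocation %d -> group %ld\n", i,
		       (long)(pick_zone_group() - groups));
	return 0;
}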

> The other open question is that if we go for
> fully drive managed, what sort of alignment, size, trim + anything else
> should we do to make the drive's job easier.  I'm guessing we won't
> really have a practical answer to any of these until we see how the
> market responds.

I'm not aiming this proposal at drive managed, or even host-managed
drives: this proposal is for full host-aware (i.e. error on
out-of-order write) drive support. If you have drive managed SMR,
then there's pretty much nothing to change in existing filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-16 15:28   ` James Bottomley
@ 2015-03-16 20:32     ` Dave Chinner
  -1 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2015-03-16 20:32 UTC (permalink / raw)
  To: James Bottomley; +Cc: xfs, linux-fsdevel, linux-scsi

On Mon, Mar 16, 2015 at 11:28:53AM -0400, James Bottomley wrote:
> [cc to linux-scsi added since this seems relevant]
> On Mon, 2015-03-16 at 17:00 +1100, Dave Chinner wrote:
> > Hi Folks,
> > 
> > As I told many people at Vault last week, I wrote a document
> > outlining how we should modify the on-disk structures of XFS to
> > support host aware SMR drives on the (long) plane flights to Boston.
> > 
> > TL;DR: not a lot of change to the XFS kernel code is required, no
> > specific SMR awareness is needed by the kernel code.  Only
> > relatively minor tweaks to the on-disk format will be needed and
> > most of the userspace changes are relatively straight forward, too.
> > 
> > The source for that document can be found in this git tree here:
> > 
> > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> > 
> > in the file design/xfs-smr-structure.asciidoc. Alternatively,
> > pull it straight from cgit:
> > 
> > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> > 
> > Or there is a pdf version built from the current TOT on the xfs.org
> > wiki here:
> > 
> > http://xfs.org/index.php/Host_Aware_SMR_architecture
> > 
> > Happy reading!
> 
> I don't think it would have caused too much heartache to post the entire
> doc to the list, but anyway
> 
> The first is a meta question: What happened to the idea of separating
> the fs block allocator from filesystems?  It looks like a lot of the
> updates could be duplicated into other filesystems, so it might be a
> very opportune time to think about this.

Which requires a complete rework of the fs/block layer. That's the
long term goal, but we aren't going to be there for a few years yet.
Hust look at how long it's taken for copy offload (which is trivial
compared to allocation offload) to be implemented....

> > === RAID on SMR....
> > 
> > How does RAID work with SMR, and exactly what does that look like to
> > the filesystem?
> > 
> > How does libzbc work with RAID given it is implemented through the scsi ioctl
> > interface?
> 
> Probably need to cc dm-devel here.  However, I think we're all agreed
> this is RAID across multiple devices, rather than within a single
> device?  In which case we just need a way of ensuring identical zoning
> on the raided devices and what you get is either a standard zone (for
> mirror) or a larger zone (for hamming etc).

Any sort of RAID is a bloody hard problem, hence the fact that I'm
designing a solution for a filesystem on top of an entire bare
drive. I'm not trying to solve every use case in the world, just the
one where the drive manufactures think SMR will be mostly used: the
back end of "never delete" distributed storage environments....

We can't wait for years for infrastructure layers to catch up in the
brave new world of shipping SMR drives. We may not like them, but we
have to make stuff work. I'm not trying to solve every problem - I'm
just trying to address the biggest use case I see for SMR devices
and it just so happens that XFS is already used pervasively in that
same use case, mostly within the same "no raid, fs per entire
device" constraints as I've documented for this proposal...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-16 20:20         ` Dave Chinner
  (?)
@ 2015-03-16 22:48         ` Cyril Guyot
  -1 siblings, 0 replies; 21+ messages in thread
From: Cyril Guyot @ 2015-03-16 22:48 UTC (permalink / raw)
  To: linux-fsdevel

Dave Chinner <david <at> fromorbit.com> writes:

> I'm not aiming this proposal at drive managed, or even host-managed
> drives: this proposal is for full host-aware (i.e. error on
> out-of-order write) drive support.

I just wanted to clarify that what you are describing - a device that errors
out on writes not at the zone write pointer - is actually a host-managed
device, as per the ZBC/ZAC standards.

Best regards,
Cyril


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-16 20:32     ` Dave Chinner
@ 2015-03-17  1:12       ` Alireza Haghdoost
  -1 siblings, 0 replies; 21+ messages in thread
From: Alireza Haghdoost @ 2015-03-17  1:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: James Bottomley, xfs, Linux Filesystem Development List, linux-scsi

On Mon, Mar 16, 2015 at 3:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 16, 2015 at 11:28:53AM -0400, James Bottomley wrote:
>> [cc to linux-scsi added since this seems relevant]
>> On Mon, 2015-03-16 at 17:00 +1100, Dave Chinner wrote:
>> > Hi Folks,
>> >
>> > As I told many people at Vault last week, I wrote a document
>> > outlining how we should modify the on-disk structures of XFS to
>> > support host aware SMR drives on the (long) plane flights to Boston.
>> >
>> > TL;DR: not a lot of change to the XFS kernel code is required, no
>> > specific SMR awareness is needed by the kernel code.  Only
>> > relatively minor tweaks to the on-disk format will be needed and
>> > most of the userspace changes are relatively straight forward, too.
>> >
>> > The source for that document can be found in this git tree here:
>> >
>> > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
>> >
>> > in the file design/xfs-smr-structure.asciidoc. Alternatively,
>> > pull it straight from cgit:
>> >
>> > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
>> >
>> > Or there is a pdf version built from the current TOT on the xfs.org
>> > wiki here:
>> >
>> > http://xfs.org/index.php/Host_Aware_SMR_architecture
>> >
>> > Happy reading!
>>
>> I don't think it would have caused too much heartache to post the entire
>> doc to the list, but anyway
>>
>> The first is a meta question: What happened to the idea of separating
>> the fs block allocator from filesystems?  It looks like a lot of the
>> updates could be duplicated into other filesystems, so it might be a
>> very opportune time to think about this.
>
> Which requires a complete rework of the fs/block layer. That's the
> long term goal, but we aren't going to be there for a few years yet.
> Hust look at how long it's taken for copy offload (which is trivial
> compared to allocation offload) to be implemented....
>
>> > === RAID on SMR....
>> >
>> > How does RAID work with SMR, and exactly what does that look like to
>> > the filesystem?
>> >
>> > How does libzbc work with RAID given it is implemented through the scsi ioctl
>> > interface?
>>
>> Probably need to cc dm-devel here.  However, I think we're all agreed
>> this is RAID across multiple devices, rather than within a single
>> device?  In which case we just need a way of ensuring identical zoning
>> on the raided devices and what you get is either a standard zone (for
>> mirror) or a larger zone (for hamming etc).
>
> Any sort of RAID is a bloody hard problem, hence the fact that I'm
> designing a solution for a filesystem on top of an entire bare
> drive. I'm not trying to solve every use case in the world, just the
> one where the drive manufactures think SMR will be mostly used: the
> back end of "never delete" distributed storage environments....
> We can't wait for years for infrastructure layers to catch up in the
> brave new world of shipping SMR drives. We may not like them, but we
> have to make stuff work. I'm not trying to solve every problem - I'm
> just tryin gto address the biggest use case I see for SMR devices
> and it just so happens that XFS is already used pervasively in that
> same use case, mostly within the same "no raid, fs per entire
> device" constraints as I've documented for this proposal...
>
> Cheers,
>
> Dave.


I am confused about what kind of application you are referring to for
this "back end, no raid, fs per entire device" model. Are you going to
rely on the application to do replication for disk failure protection?

I think it is a good idea to devise the filesystem changes with at
least some concern for their impact on RAID. My impression is that
these changes would push more in-place parity updates if the filesystem
were deployed on top of a parity-based RAID array, since they would
convert most random I/Os into sequential I/Os that may well land in the
same parity stripe.

--Alireza

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-17  1:12       ` Alireza Haghdoost
  (?)
@ 2015-03-17  6:06       ` Dave Chinner
  -1 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2015-03-17  6:06 UTC (permalink / raw)
  To: Alireza Haghdoost
  Cc: James Bottomley, Linux Filesystem Development List, linux-scsi, xfs

On Mon, Mar 16, 2015 at 08:12:16PM -0500, Alireza Haghdoost wrote:
> On Mon, Mar 16, 2015 at 3:32 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Mar 16, 2015 at 11:28:53AM -0400, James Bottomley wrote:
> >> Probably need to cc dm-devel here.  However, I think we're all agreed
> >> this is RAID across multiple devices, rather than within a single
> >> device?  In which case we just need a way of ensuring identical zoning
> >> on the raided devices and what you get is either a standard zone (for
> >> mirror) or a larger zone (for hamming etc).
> >
> > Any sort of RAID is a bloody hard problem, hence the fact that I'm
> > designing a solution for a filesystem on top of an entire bare
> > drive. I'm not trying to solve every use case in the world, just the
> > one where the drive manufactures think SMR will be mostly used: the
> > back end of "never delete" distributed storage environments....
> > We can't wait for years for infrastructure layers to catch up in the
> > brave new world of shipping SMR drives. We may not like them, but we
> > have to make stuff work. I'm not trying to solve every problem - I'm
> > just tryin gto address the biggest use case I see for SMR devices
> > and it just so happens that XFS is already used pervasively in that
> > same use case, mostly within the same "no raid, fs per entire
> > device" constraints as I've documented for this proposal...
> 
> I am confused what kind of application you are referring to for this
> "back end, no raid, fs per entire device". Are you gonna rely on the
> application to do replication for disk failure protection ?

Exactly. Think distributed storage such as Ceph and gluster where
the data redundancy and failure recovery algorithms are in layers
*above* the local filesystem, not in the storage below the fs.  The
"no raid, fs per device" model is already a very common back end
storage configuration for such deployments.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-16  6:00 ` Dave Chinner
@ 2015-03-17 13:25   ` Brian Foster
  -1 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-03-17 13:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 6421 bytes --]

On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> Hi Folks,
> 
> As I told many people at Vault last week, I wrote a document
> outlining how we should modify the on-disk structures of XFS to
> support host aware SMR drives on the (long) plane flights to Boston.
> 
> TL;DR: not a lot of change to the XFS kernel code is required, no
> specific SMR awareness is needed by the kernel code.  Only
> relatively minor tweaks to the on-disk format will be needed and
> most of the userspace changes are relatively straight forward, too.
> 
> The source for that document can be found in this git tree here:
> 
> git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> 
> in the file design/xfs-smr-structure.asciidoc. Alternatively,
> pull it straight from cgit:
> 
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> 
> Or there is a pdf version built from the current TOT on the xfs.org
> wiki here:
> 
> http://xfs.org/index.php/Host_Aware_SMR_architecture
> 
> Happy reading!
> 

Hi Dave,

Thanks for sharing this. Here are some thoughts/notes/questions/etc.
from a first pass. This is mostly XFS oriented and I'll try to break it
down by section.

I've also attached a diff to the original doc with some typo fixes and
whatnot. Feel free to just fold it into the original doc if you like.

== Concepts

- With regard to the assumption that the CMR region is not spread around
the drive, I saw at least one presentation at Vault that suggested
otherwise (the skylight one iirc). That said, it was theoretical and
based on a drive-managed drive. It is in no way clear to me whether that
is something to expect for host-managed drives.

- It isn't clear to me here and in other places whether you propose to
use the CMR regions as a "metadata device" or require some other
randomly writeable storage to serve that purpose.

== Journal modifications

- The tail->head log zeroing behavior on mount comes to mind here. Maybe
the writes are still sequential and it's not a problem, but we should
consider that as part of the proposal. It's probably not critical, as we do
have the option of using the CMR region here (as noted). I assume we can
also cleanly relocate the log without breaking anything else (e.g., the
current location is performance oriented rather than architectural,
yes?).

== Data zones

- Will this actually support data overwrite or will that return an error?

- TBH, I've never looked at realtime functionality so I don't grok the
high level approach yet. I'm wondering... have you considered a design
based on reflink and copy-on-write? I know the current plan is to
disentangle the reflink tree from the rmap tree, but my understanding is
the reflink tree is still in the pipeline. Assuming we have that
functionality, it seems like there's potential to use it to overcome
some of the overwrite complexity. Just as a handwaving example, use the
per-zone inode to hold an additional reference to each allocated extent
in the zone, so all writes are handled as if the file had a clone. Once
the zoneino holds the only remaining reference, the extent is effectively
freed and thus stale with respect to the zone cleaner logic.
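
A hand-wavy sketch of that bookkeeping, just to make the idea concrete
(invented names, not reflink code): every extent carries the zone
inode's reference plus one per file mapping it, and once only the
zoneino's reference remains the extent is stale space for the cleaner.

#include <stdbool.h>
#include <stdio.h>

struct zone_extent {
	unsigned int refcount;	/* zone inode ref + file refs */
};

static void zone_extent_init(struct zone_extent *ext)
{
	ext->refcount = 1;	/* the zone inode's own reference */
}

static void file_maps_extent(struct zone_extent *ext)
{
	ext->refcount++;	/* a file shares the extent, clone-style */
}

static void file_drops_extent(struct zone_extent *ext)
{
	ext->refcount--;	/* overwrite/unlink drops the file's ref */
}

/* Only the zone inode still references it: stale space for the cleaner. */
static bool extent_is_stale(const struct zone_extent *ext)
{
	return ext->refcount == 1;
}

int main(void)
{
	struct zone_extent e;

	zone_extent_init(&e);
	file_maps_extent(&e);
	file_drops_extent(&e);
	printf("stale: %s\n", extent_is_stale(&e) ? "yes" : "no");
	return 0;
}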

I suspect we would still need an allocation strategy, but I expect we're
going to have zone metadata regardless that will help deal with that.
Note that the current sparse inode proposal includes an allocation range
limit mechanism (for the case where an inode record overlaps an AG boundary),
which could potentially be used/extended to build something on top of
the existing allocator for zone allocation (e.g., if we had some kind of
zone record with the write pointer that indicated where it's safe to
allocate from). Again, just thinking out loud here.
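
As a toy sketch of that idea - invented names, only meant to show how a
write pointer plus an allocation range limit could constrain the
existing allocator:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

struct zone_record {
	uint64_t zone_start;
	uint64_t zone_end;	/* one past the last block of the zone */
	uint64_t write_pointer;	/* only safe to allocate at/after this */
};

/* Clamp an allocation request to the zone's safe range; returns false
 * if the zone has no usable space left for a sequential allocation. */
static bool zone_alloc_range(const struct zone_record *zr,
			     uint64_t *start, uint64_t *len)
{
	if (zr->write_pointer >= zr->zone_end)
		return false;		/* zone is full */
	*start = zr->write_pointer;	/* must start at the write pointer */
	if (*start + *len > zr->zone_end)
		*len = zr->zone_end - *start;
	return true;
}

int main(void)
{
	struct zone_record zr = { 0, 65536, 65000 };
	uint64_t start = 0, len = 4096;

	if (zone_alloc_range(&zr, &start, &len))
		printf("allocate %llu blocks at %llu\n",
		       (unsigned long long)len, (unsigned long long)start);
	return 0;
}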

== Zone cleaner

- Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
figure out what it's supposed to say. ;)

- The idea sounds sane, but the dependency on userspace for a critical
fs mechanism sounds a bit scary to be honest. Is in-kernel allocation
going to throttle/depend on background work in the userspace cleaner in
the event of low writeable free space? What if that userspace thing
dies, etc.? I suppose an implementation with as much mechanism in libxfs
as possible allows us greatest flexibility to go in either direction
here.

- I'm also wondering how much real overlap there is with xfs_fsr (another
thing I haven't really looked at :) beyond that it calls swapext.
E.g., cleaning a zone sounds like it must map back to N files that could
have allocated extents in the zone, rather than considering individual
files for defragmentation, and fragmentation of the parent file may not
be as much of a consideration as resetting zones, etc. It sounds like a
separate tool might be warranted, even if there is code to steal from
fsr. :)
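
Just to sketch the policy side of such a tool - the threshold and names
here are invented for illustration; the real thing would read the
/.zones/ summary inodes and use rmap to find the owners to move:

#include <stdint.h>
#include <stdio.h>

struct zone_summary {
	uint64_t total_blocks;
	uint64_t stale_blocks;	/* freed/overwritten but not yet reclaimable */
};

/* Worth cleaning once, say, 75% of the zone is stale (the threshold is
 * a policy knob, not a number from the design document). */
static int zone_needs_cleaning(const struct zone_summary *zs)
{
	return zs->stale_blocks * 4 >= zs->total_blocks * 3;
}

int main(void)
{
	struct zone_summary zones[] = {
		{ .total_blocks = 65536, .stale_blocks = 10000 },
		{ .total_blocks = 65536, .stale_blocks = 60000 },
	};

	for (unsigned i = 0; i < sizeof(zones) / sizeof(zones[0]); i++)
		printf("zone %u: %s\n", i,
		       zone_needs_cleaning(&zones[i]) ? "clean" : "leave");
	return 0;
}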

== Reverse mapping btrees

- This is something I still need to grok, perhaps just because the rmap
code isn't available yet. But I'll note that this does seem like
another bit that could be unnecessary if we could get away with using
the traditional allocator.

== Mkfs

- We have references to the "metadata device" as well as random write
regions. Similar to my question above, is there an expectation of a
separate physical metadata device or is that terminology for the random
write regions?

Finally, some general/summary notes:

- Some kind of data structure outline would eventually make a nice
addition to this document. I understand it's probably too early yet,
but we are talking about new per-zone inodes, new and interesting
relationships between AGs and zones (?), etc. Fine grained detail is not
required, but an outline or visual that describes the high-level
mappings goes a long way to facilitate reasoning about the design.

- A big question I had (and something that is touched on down-thread with
respect to embedded flash) is whether the random write zones are runtime
configurable. If so, couldn't this facilitate use of existing AG
metadata (now that I think of it, it's not clear to me whether the
realtime mechanism excludes or coexists with AGs)? IOW, we obviously
need this kind of space for inodes, dirs, xattrs, btrees, etc.
regardless. It would be interesting if we had the added flexibility to
align it with AGs.

Thanks again!

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

[-- Attachment #2: smr-typos.diff --]
[-- Type: text/plain, Size: 8149 bytes --]

diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
index dd959ab..2fea88f 100644
--- a/design/xfs-smr-structure.asciidoc
+++ b/design/xfs-smr-structure.asciidoc
@@ -95,7 +95,7 @@ going to need a special directory to expose this information. It would be useful
 to have a ".zones" directory hanging off the root directory that contains all
 the zone allocation inodes so userspace can simply open them.
 
-THis biggest issue that has come to light here is the number of zones in a
+This biggest issue that has come to light here is the number of zones in a
 device. Zones are typically 256MB in size, and so we are looking at 4,000
 zones/TB. For a 10TB drive, that's 40,000 zones we have to keep track of. And if
 the devices keep getting larger at the expected rate, we're going to have to
@@ -112,24 +112,24 @@ also have other benefits...
 While it seems like tracking free space is trivial for the purposes of
 allocation (and it is!), the complexity comes when we start to delete or
 overwrite data. Suddenly zones no longer contain contiguous ranges of valid
-data; they have "freed" extents in the middle of them that contian stale data.
+data; they have "freed" extents in the middle of them that contain stale data.
 We can't use that "stale space" until the entire zone is made up of "stale"
 extents. Hence we need a Cleaner.
 
 === Zone Cleaner
 
 The purpose of the cleaner is to find zones that are mostly stale space and
-consolidate the remaining referenced data into a new, contigious zone, enabling
+consolidate the remaining referenced data into a new, contiguous zone, enabling
 us to then "clean" the stale zone and make it available for writing new data
 again.
 
-The real complexity here is finding the owner of the data that needs to be move,
-but we are in the process of solving that with the reverse mapping btree and
-parent pointer functionality. This gives us the mechanism by which we can
+The real complexity here is finding the owner of the data that needs to be
+moved, but we are in the process of solving that with the reverse mapping btree
+and parent pointer functionality. This gives us the mechanism by which we can
 quickly re-organise files that have extents in zones that need cleaning.
 
 The key word here is "reorganise". We have a tool that already reorganises file
-layout: xfs_fsr. The "Cleaner" is a finely targetted policy for xfs_fsr -
+layout: xfs_fsr. The "Cleaner" is a finely targeted policy for xfs_fsr -
 instead of trying to minimise fixpel fragments, it finds zones that need
 cleaning by reading their summary info from the /.zones/ directory and analysing
 the free bitmap state if there is a high enough percentage of stale blocks. From
@@ -142,7 +142,7 @@ Hence we don't actually need any major new data moving functionality in the
 kernel to enable this, except maybe an event channel for the kernel to tell
 xfs_fsr it needs to do some cleaning work.
 
-If we arrange zones into zoen groups, we also have a method for keeping new
+If we arrange zones into zone groups, we also have a method for keeping new
 allocations out of regions we are re-organising. That is, we need to be able to
 mark zone groups as "read only" so the kernel will not attempt to allocate from
 them while the cleaner is running and re-organising the data within the zones in
@@ -166,17 +166,17 @@ inode to track the zone's owner information.
 == Mkfs
 
 Mkfs is going to have to integrate with the userspace zbc libraries to query the
-layout of zones from the underlying disk and then do some magic to lay out al
+layout of zones from the underlying disk and then do some magic to lay out all
 the necessary metadata correctly. I don't see there being any significant
 challenge to doing this, but we will need a stable libzbc API to work with and
-it will need ot be packaged by distros.
+it will need to be packaged by distros.
 
-If mkfs cannot find ensough random write space for the amount of metadata we
-need to track all the space in the sequential write zones and a decent amount of
-internal fielsystem metadata (inodes, etc) then it will need to fail. Drive
-vendors are going to need to provide sufficient space in these regions for us
-to be able to make use of it, otherwise we'll simply not be able to do what we
-need to do.
+If mkfs cannot find enough random write space for the amount of metadata we need
+to track all the space in the sequential write zones and a decent amount of
+internal filesystem metadata (inodes, etc) then it will need to fail. Drive
+vendors are going to need to provide sufficient space in these regions for us to
+be able to make use of it, otherwise we'll simply not be able to do what we need
+to do.
 
 mkfs will need to initialise all the zone allocation inodes, reset all the zone
 write pointers, create the /.zones directory, place the log in an appropriate
@@ -187,13 +187,13 @@ place and initialise the metadata device as well.
 Because we've limited the metadata to a section of the drive that can be
 overwritten, we don't have to make significant changes to xfs_repair. It will
 need to be taught about the multiple zone allocation bitmaps for it's space
-reference checking, but otherwise all the infrastructure we need ifor using
+reference checking, but otherwise all the infrastructure we need for using
 bitmaps for verifying used space should already be there.
 
-THere be dragons waiting for us if we don't have random write zones for
+There be dragons waiting for us if we don't have random write zones for
 metadata. If that happens, we cannot repair metadata in place and we will have
 to redesign xfs_repair from the ground up to support such functionality. That's
-jus tnot going to happen, so we'll need drives with a significant amount of
+just not going to happen, so we'll need drives with a significant amount of
 random write space for all our metadata......
 
 == Quantification of Random Write Zone Capacity
@@ -214,7 +214,7 @@ performance, replace the CMR region with a SSD....
 
 The allocator will need to learn about multiple allocation zones based on
 bitmaps. They aren't really allocation groups, but the initialisation and
-iteration of them is going to be similar to allocation groups. To get use going
+iteration of them is going to be similar to allocation groups. To get us going
 we can do some simple mapping between inode AG and data AZ mapping so that we
 keep some form of locality to related data (e.g. grouping of data by parent
 directory).
@@ -273,19 +273,19 @@ location, the current location or anywhere in between. The only guarantee that
 we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
 least be in a position at or past the location of the fsync.
 
-Hence before a filesystem runs journal recovery, all it's zone allocation write
+Hence before a filesystem runs journal recovery, all its zone allocation write
 pointers need to be set to what the drive thinks they are, and all of the zone
 allocation beyond the write pointer need to be cleared. We could do this during
 log recovery in kernel, but that means we need full ZBC awareness in log
 recovery to iterate and query all the zones.
 
-Hence it's not clear if we want to do this in userspace as that has it's own
-problems e.g. we'd need to  have xfs.fsck detect that it's a smr filesystem and
+Hence it's not clear if we want to do this in userspace as that has its own
+problems e.g. we'd need to  have xfs.fsck detect that it's an smr filesystem and
 perform that recovery, or write a mount.xfs helper that does it prior to
 mounting the filesystem. Either way, we need to synchronise the on-disk
 filesystem state to the internal disk zone state before doing anything else.
 
-This needs more thought, because I have a nagging suspiscion that we need to do
+This needs more thought, because I have a nagging suspicion that we need to do
 this write pointer resynchronisation *after log recovery* has completed so we
 can determine if we've got to now go and free extents that the filesystem has
 allocated and are referenced by some inode out there. This, again, will require

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
@ 2015-03-17 13:25   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-03-17 13:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

[-- Attachment #1: Type: text/plain, Size: 6421 bytes --]

On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> Hi Folks,
> 
> As I told many people at Vault last week, I wrote a document
> outlining how we should modify the on-disk structures of XFS to
> support host aware SMR drives on the (long) plane flights to Boston.
> 
> TL;DR: not a lot of change to the XFS kernel code is required, no
> specific SMR awareness is needed by the kernel code.  Only
> relatively minor tweaks to the on-disk format will be needed and
> most of the userspace changes are relatively straight forward, too.
> 
> The source for that document can be found in this git tree here:
> 
> git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> 
> in the file design/xfs-smr-structure.asciidoc. Alternatively,
> pull it straight from cgit:
> 
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> 
> Or there is a pdf version built from the current TOT on the xfs.org
> wiki here:
> 
> http://xfs.org/index.php/Host_Aware_SMR_architecture
> 
> Happy reading!
> 

Hi Dave,

Thanks for sharing this. Here are some thoughts/notes/questions/etc.
from a first pass. This is mostly XFS oriented and I'll try to break it
down by section.

I've also attached a diff to the original doc with some typo fixes and
whatnot. Feel free to just fold it into the original doc if you like.

== Concepts

- With regard to the assumption that the CMR region is not spread around
the drive, I saw at least one presentation at Vault that suggested
otherwise (the skylight one iirc). That said, it was theoretical and
based on a drive-managed drive. It is in no way clear to me whether that
is something to expect for host-managed drives.

- It isn't clear to me here and in other places whether you propose to
use the CMR regions as a "metadata device" or require some other
randomly writeable storage to serve that purpose.

== Journal modifications

- The tail->head log zeroing behavior on mount comes to mind here. Maybe
the writes are still sequential and it's not a problem, but we should
consider that with the proposition. It's probably not critical as we do
have the out of using the cmr region here (as noted). I assume we can
also cleanly relocate the log without breaking anything else (e.g., the
current location is performance oriented rather than architectural,
yes?).

== Data zones

- Will this actually support data overwrite or will that return error?

- TBH, I've never looked at realtime functionality so I don't grok the
high level approach yet. I'm wondering... have you considered a design
based on reflink and copy-on-write? I know the current plan is to
disentangle the reflink tree from the rmap tree, but my understanding is
the reflink tree is still in the pipeline. Assuming we have that
functionality, it seems like there's potential to use it to overcome
some of the overwrite complexity. Just as a handwaving example, use the
per-zone inode to hold an additional reference to each allocated extent
in the zone, thus all writes are handled as if the file had a clone. If
the only reference drops to the zoneino, the extent is freed and thus
stale wrt to the zone cleaner logic.

I suspect we would still need an allocation strategy, but I expect we're
going to have zone metadata regardless that will help deal with that.
Note that the current sparse inode proposal includes an allocation range
limit mechanism (for the inode record overlaps an ag boundary case),
which could potentially be used/extended to build something on top of
the existing allocator for zone allocation (e.g., if we had some kind of
zone record with the write pointer that indicated where it's safe to
allocate from). Again, just thinking out loud here.
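
To make that slightly less handwavy, here's a toy C sketch of the sort of
zone record plus range-limited allocation I'm imagining. Every name in it
is invented for the example; none of this is existing XFS code or a
concrete proposal:

/*
 * A toy sketch of the idea above. Every name here is invented for
 * illustration and none of it is existing XFS code. The per-zone record
 * carries the write pointer so an allocation can be range-limited to the
 * only part of a sequential zone that is safe to write: [wp, zone end).
 */
#include <stdint.h>
#include <stdbool.h>

struct zone_alloc_rec {
	uint64_t	za_start;	/* first block of the zone */
	uint64_t	za_nblocks;	/* blocks in the zone */
	uint64_t	za_write_ptr;	/* next sequentially writeable block */
};

/*
 * Try to allocate 'len' blocks from the zone. Sequential zones only allow
 * writes at the write pointer, so the allocation either starts exactly
 * there or we have to move on to another zone.
 */
static bool
zone_alloc_range(struct zone_alloc_rec *zr, uint64_t len, uint64_t *bno)
{
	uint64_t zone_end = zr->za_start + zr->za_nblocks;

	if (zr->za_write_ptr + len > zone_end)
		return false;			/* won't fit, try next zone */

	*bno = zr->za_write_ptr;		/* must start at the wp */
	zr->za_write_ptr += len;		/* advance our cached wp */
	return true;
}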

== Zone cleaner

- Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
figure out what it's supposed to say. ;)

- The idea sounds sane, but the dependency on userspace for a critical
fs mechanism sounds a bit scary to be honest. Is in kernel allocation
going to throttle/depend on background work in the userspace cleaner in
the event of low writeable free space? What if that userspace thing
dies, etc.? I suppose an implementation with as much mechanism in libxfs
as possible allows us greatest flexibility to go in either direction
here.

- I'm also wondering how much real overlap there is in xfs_fsr (another
thing I haven't really looked at :) beyond that it calls swapext.
E.g., cleaning a zone sounds like it must map back to N files that could
have allocated extents in the zone vs. considering individual files for
defragmentation, fragmentation of the parent file may not be as much of
a consideration as resetting zones, etc. It sounds like a separate tool
might be warranted, even if there is code to steal from fsr. :)

== Reverse mapping btrees

- This is something I still need to grok, perhaps just because the rmap
code isn't available yet. But I'll note that this does seem like
another bit that could be unnecessary if we could get away with using
the traditional allocator.

== Mkfs

- We have references to the "metadata device" as well as random write
regions. Similar to my question above, is there an expectation of a
separate physical metadata device or is that terminology for the random
write regions?

Finally, some general/summary notes:

- Some kind of data structure outline would eventually make a nice
addition to this document. I understand it's probably too early yet,
but we are talking about new per-zone inodes, new and interesting
relationships between AGs and zones (?), etc. Fine grained detail is not
required, but an outline or visual that describes the high-level
mappings goes a long way to facilitate reasoning about the design.

- A big question I had (and something that is touched on down thread wrt
to embedded flash) is whether the random write zones are runtime
configurable. If so, couldn't this facilitate use of existing AG
metadata (now that I think of it, it's not clear to me whether the
realtime mechanism excludes or coexists with AGs)? IOW, we obviously
need this kind of space for inodes, dirs, xattrs, btrees, etc.
regardless. It would be interesting if we had the added flexibility to
align it with AGs.

Thanks again!

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

[-- Attachment #2: smr-typos.diff --]
[-- Type: text/plain, Size: 8149 bytes --]

diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
index dd959ab..2fea88f 100644
--- a/design/xfs-smr-structure.asciidoc
+++ b/design/xfs-smr-structure.asciidoc
@@ -95,7 +95,7 @@ going to need a special directory to expose this information. It would be useful
 to have a ".zones" directory hanging off the root directory that contains all
 the zone allocation inodes so userspace can simply open them.
 
-THis biggest issue that has come to light here is the number of zones in a
+This biggest issue that has come to light here is the number of zones in a
 device. Zones are typically 256MB in size, and so we are looking at 4,000
 zones/TB. For a 10TB drive, that's 40,000 zones we have to keep track of. And if
 the devices keep getting larger at the expected rate, we're going to have to
@@ -112,24 +112,24 @@ also have other benefits...
 While it seems like tracking free space is trivial for the purposes of
 allocation (and it is!), the complexity comes when we start to delete or
 overwrite data. Suddenly zones no longer contain contiguous ranges of valid
-data; they have "freed" extents in the middle of them that contian stale data.
+data; they have "freed" extents in the middle of them that contain stale data.
 We can't use that "stale space" until the entire zone is made up of "stale"
 extents. Hence we need a Cleaner.
 
 === Zone Cleaner
 
 The purpose of the cleaner is to find zones that are mostly stale space and
-consolidate the remaining referenced data into a new, contigious zone, enabling
+consolidate the remaining referenced data into a new, contiguous zone, enabling
 us to then "clean" the stale zone and make it available for writing new data
 again.
 
-The real complexity here is finding the owner of the data that needs to be move,
-but we are in the process of solving that with the reverse mapping btree and
-parent pointer functionality. This gives us the mechanism by which we can
+The real complexity here is finding the owner of the data that needs to be
+moved, but we are in the process of solving that with the reverse mapping btree
+and parent pointer functionality. This gives us the mechanism by which we can
 quickly re-organise files that have extents in zones that need cleaning.
 
 The key word here is "reorganise". We have a tool that already reorganises file
-layout: xfs_fsr. The "Cleaner" is a finely targetted policy for xfs_fsr -
+layout: xfs_fsr. The "Cleaner" is a finely targeted policy for xfs_fsr -
 instead of trying to minimise fixpel fragments, it finds zones that need
 cleaning by reading their summary info from the /.zones/ directory and analysing
 the free bitmap state if there is a high enough percentage of stale blocks. From
@@ -142,7 +142,7 @@ Hence we don't actually need any major new data moving functionality in the
 kernel to enable this, except maybe an event channel for the kernel to tell
 xfs_fsr it needs to do some cleaning work.
 
-If we arrange zones into zoen groups, we also have a method for keeping new
+If we arrange zones into zone groups, we also have a method for keeping new
 allocations out of regions we are re-organising. That is, we need to be able to
 mark zone groups as "read only" so the kernel will not attempt to allocate from
 them while the cleaner is running and re-organising the data within the zones in
@@ -166,17 +166,17 @@ inode to track the zone's owner information.
 == Mkfs
 
 Mkfs is going to have to integrate with the userspace zbc libraries to query the
-layout of zones from the underlying disk and then do some magic to lay out al
+layout of zones from the underlying disk and then do some magic to lay out all
 the necessary metadata correctly. I don't see there being any significant
 challenge to doing this, but we will need a stable libzbc API to work with and
-it will need ot be packaged by distros.
+it will need to be packaged by distros.
 
-If mkfs cannot find ensough random write space for the amount of metadata we
-need to track all the space in the sequential write zones and a decent amount of
-internal fielsystem metadata (inodes, etc) then it will need to fail. Drive
-vendors are going to need to provide sufficient space in these regions for us
-to be able to make use of it, otherwise we'll simply not be able to do what we
-need to do.
+If mkfs cannot find enough random write space for the amount of metadata we need
+to track all the space in the sequential write zones and a decent amount of
+internal filesystem metadata (inodes, etc) then it will need to fail. Drive
+vendors are going to need to provide sufficient space in these regions for us to
+be able to make use of it, otherwise we'll simply not be able to do what we need
+to do.
 
 mkfs will need to initialise all the zone allocation inodes, reset all the zone
 write pointers, create the /.zones directory, place the log in an appropriate
@@ -187,13 +187,13 @@ place and initialise the metadata device as well.
 Because we've limited the metadata to a section of the drive that can be
 overwritten, we don't have to make significant changes to xfs_repair. It will
 need to be taught about the multiple zone allocation bitmaps for it's space
-reference checking, but otherwise all the infrastructure we need ifor using
+reference checking, but otherwise all the infrastructure we need for using
 bitmaps for verifying used space should already be there.
 
-THere be dragons waiting for us if we don't have random write zones for
+There be dragons waiting for us if we don't have random write zones for
 metadata. If that happens, we cannot repair metadata in place and we will have
 to redesign xfs_repair from the ground up to support such functionality. That's
-jus tnot going to happen, so we'll need drives with a significant amount of
+just not going to happen, so we'll need drives with a significant amount of
 random write space for all our metadata......
 
 == Quantification of Random Write Zone Capacity
@@ -214,7 +214,7 @@ performance, replace the CMR region with a SSD....
 
 The allocator will need to learn about multiple allocation zones based on
 bitmaps. They aren't really allocation groups, but the initialisation and
-iteration of them is going to be similar to allocation groups. To get use going
+iteration of them is going to be similar to allocation groups. To get us going
 we can do some simple mapping between inode AG and data AZ mapping so that we
 keep some form of locality to related data (e.g. grouping of data by parent
 directory).
@@ -273,19 +273,19 @@ location, the current location or anywhere in between. The only guarantee that
 we have is that if we flushed the cache (i.e. fsync'd a file) then they will at
 least be in a position at or past the location of the fsync.
 
-Hence before a filesystem runs journal recovery, all it's zone allocation write
+Hence before a filesystem runs journal recovery, all its zone allocation write
 pointers need to be set to what the drive thinks they are, and all of the zone
 allocation beyond the write pointer need to be cleared. We could do this during
 log recovery in kernel, but that means we need full ZBC awareness in log
 recovery to iterate and query all the zones.
 
-Hence it's not clear if we want to do this in userspace as that has it's own
-problems e.g. we'd need to  have xfs.fsck detect that it's a smr filesystem and
+Hence it's not clear if we want to do this in userspace as that has its own
+problems e.g. we'd need to  have xfs.fsck detect that it's an smr filesystem and
 perform that recovery, or write a mount.xfs helper that does it prior to
 mounting the filesystem. Either way, we need to synchronise the on-disk
 filesystem state to the internal disk zone state before doing anything else.
 
-This needs more thought, because I have a nagging suspiscion that we need to do
+This needs more thought, because I have a nagging suspicion that we need to do
 this write pointer resynchronisation *after log recovery* has completed so we
 can determine if we've got to now go and free extents that the filesystem has
 allocated and are referenced by some inode out there. This, again, will require

[-- Attachment #3: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-17 13:25   ` Brian Foster
@ 2015-03-17 21:28     ` Dave Chinner
  -1 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2015-03-17 21:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, linux-fsdevel

On Tue, Mar 17, 2015 at 09:25:15AM -0400, Brian Foster wrote:
> On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> > Hi Folks,
> > 
> > As I told many people at Vault last week, I wrote a document
> > outlining how we should modify the on-disk structures of XFS to
> > support host aware SMR drives on the (long) plane flights to Boston.
> > 
> > TL;DR: not a lot of change to the XFS kernel code is required, no
> > specific SMR awareness is needed by the kernel code.  Only
> > relatively minor tweaks to the on-disk format will be needed and
> > most of the userspace changes are relatively straight forward, too.
> > 
> > The source for that document can be found in this git tree here:
> > 
> > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> > 
> > in the file design/xfs-smr-structure.asciidoc. Alternatively,
> > pull it straight from cgit:
> > 
> > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> > 
> > Or there is a pdf version built from the current TOT on the xfs.org
> > wiki here:
> > 
> > http://xfs.org/index.php/Host_Aware_SMR_architecture
> > 
> > Happy reading!
> > 
> 
> Hi Dave,
> 
> Thanks for sharing this. Here are some thoughts/notes/questions/etc.
> from a first pass. This is mostly XFS oriented and I'll try to break it
> down by section.
> 
> I've also attached a diff to the original doc with some typo fixes and
> whatnot. Feel free to just fold it into the original doc if you like.
> 
> == Concepts
> 
> - With regard to the assumption that the CMR region is not spread around
> the drive, I saw at least one presentation at Vault that suggested
> otherwise (the skylight one iirc). That said, it was theoretical and
> based on a drive-managed drive. It is in no way clear to me whether that
> is something to expect for host-managed drives.

AFAIK, the CMR region is contiguous. The skylight paper spells it
out pretty clearly that it is a contiguous 20-25GB region on the
outer edge of the seagate drives. Other vendors I've spoken to
indicate that the region in host managed drives is also contiguous
and at the outer edge, and some vendors have indicated they have
much more of it than the seagate drives analysed in the skylight
paper.

If it is not contiguous, then we can use DM to make that problem go
away. i.e. use DM to stitch the CMR zones back together into a
contiguous LBA region. Then we can size AGs in the data device to
map to the size of the individual disjoint CMR regions, and we
have a neat, well aligned, isolated solution to the problem without
having to modify the XFS code at all.

> - It isn't clear to me here and in other places whether you propose to
> use the CMR regions as a "metadata device" or require some other
> randomly writeable storage to serve that purpose.

CMR as the "metadata device" if there is nothing else we can use.
I'd really like to see hybrid drives with the "CMR" zone being the
flash region in the drive....

> == Journal modifications
> 
> - The tail->head log zeroing behavior on mount comes to mind here. Maybe
> the writes are still sequential and it's not a problem, but we should
> consider that with the proposition.  It's probably not critical as we do
> have the out of using the cmr region here (as noted). I assume we can
> also cleanly relocate the log without breaking anything else (e.g., the
> current location is performance oriented rather than architectural,
> yes?).

We place the log anywhere in the data device LBA space. You might
want to go look up what L_AGNUM does in mkfs. :)

And if we can use the CMR region for the log, then that's what we'll
do - "no modifications required" is always the best solution.

> == Data zones
> 
> - Will this actually support data overwrite or will that return error?

We'll support data overwrite. xfs_get_blocks() will need to detect
overwrite....

> - TBH, I've never looked at realtime functionality so I don't grok the
> high level approach yet. I'm wondering... have you considered a design
> based on reflink and copy-on-write?

Yes, I have. Complex, invasive and we don't even have basic reflink
infrastructure yet. Such a solution pushes us a couple of years
out, as opposed to having something before the end of the year...

> I know the current plan is to
> disentangle the reflink tree from the rmap tree, but my understanding is
> the reflink tree is still in the pipeline. Assuming we have that
> functionality, it seems like there's potential to use it to overcome
> some of the overwrite complexity.

There isn't much overwrite complexity - it's simply clearing bits
in a zone bitmap to indicate free space, allocating new blocks and
then rewriting bmbt extent records. It's fairly simple, really ;)

> Just as a handwaving example, use the
> per-zone inode to hold an additional reference to each allocated extent
> in the zone, thus all writes are handled as if the file had a clone. If
> the only reference drops to the zoneino, the extent is freed and thus
> stale wrt to the zone cleaner logic.
> 
> I suspect we would still need an allocation strategy, but I expect we're
> going to have zone metadata regardless that will help deal with that.
> Note that the current sparse inode proposal includes an allocation range
> limit mechanism (for the inode record overlaps an ag boundary case),
> which could potentially be used/extended to build something on top of
> the existing allocator for zone allocation (e.g., if we had some kind of
> zone record with the write pointer that indicated where it's safe to
> allocate from). Again, just thinking out loud here.

Yup, but the bitmap allocator doesn't have support for many of the
btree allocator controls.  It's a simple, fast, deterministic
allocator, and we only need it to track freed space in the zones
as all allocation from the zones is going to be sequential...
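
To illustrate how little machinery that actually needs, here is a toy
model of the per-zone tracking. All names are invented for the example;
nothing below is real XFS code:

/*
 * Toy model of the per-zone space tracking (invented names, not XFS
 * code). Allocation is purely sequential: bump the write pointer.
 * Freeing a block after a delete, or after an overwrite has moved the
 * data elsewhere, just sets a bit in the zone's stale bitmap, which is
 * what the cleaner later reads to decide if the zone is worth reclaiming.
 */
#include <stdint.h>

#define ZONE_BLOCKS	65536U			/* 256MB zone of 4k blocks */

struct zone_space {
	uint32_t	write_ptr;		/* blocks allocated so far */
	uint32_t	nr_stale;		/* blocks freed since reset */
	uint8_t		stale[ZONE_BLOCKS / 8];	/* 1 = freed/stale block */
};

/* Sequential-only allocation: hand out the next 'len' blocks or fail. */
static int zone_alloc(struct zone_space *z, uint32_t len, uint32_t *bno)
{
	if (z->write_ptr + len > ZONE_BLOCKS)
		return -1;			/* zone full, pick another */
	*bno = z->write_ptr;
	z->write_ptr += len;
	return 0;
}

/* Freeing (delete/overwrite) never rewrites the zone, it only marks bits. */
static void zone_free(struct zone_space *z, uint32_t bno, uint32_t len)
{
	for (uint32_t b = bno; b < bno + len; b++) {
		if (!(z->stale[b >> 3] & (1U << (b & 7)))) {
			z->stale[b >> 3] |= 1U << (b & 7);
			z->nr_stale++;
		}
	}
}

/* The cleaner's policy input: how much of the written space is stale. */
static unsigned int zone_stale_percent(const struct zone_space *z)
{
	return z->write_ptr ? z->nr_stale * 100U / z->write_ptr : 0;
}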

> == Zone cleaner
> 
> - Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
> figure out what it's supposed to say. ;)
> 
> - The idea sounds sane, but the dependency on userspace for a critical
> fs mechanism sounds a bit scary to be honest. Is in kernel allocation
> going to throttle/depend on background work in the userspace cleaner in
> the event of low writeable free space?

Of course. ENOSPC always throttles ;)

I expect the cleaner will work a zone group at a time, locking new,
non-cleaner based allocations out of the zone group while it cleans
zones. This means the cleaner should always be able to make progress
w.r.t. ENOSPC - it gets triggered on a zone group before it runs out
of clean zones for freespace defrag purposes....
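
Purely as an illustration of the shape of that loop, something like the
sketch below. Every name in it is made up, and the helpers are stand-ins
for whatever /.zones/ summary reads and kernel interfaces the real tool
ends up using:

/*
 * Rough shape of that userspace loop, for illustration only. All names
 * are invented; the helpers are stand-ins for whatever /.zones/ summary
 * reads and kernel interfaces the real tool would use.
 */
#include <stdbool.h>
#include <stddef.h>

#define STALE_THRESHOLD	75	/* % stale before a zone is worth cleaning */

struct zone_summary {
	int		zone_id;
	unsigned int	stale_pct;	/* from the zone's summary info */
};

/* Placeholder hooks: the real versions would talk to the kernel. */
static int zg_set_readonly(int zg, bool ro)  { (void)zg; (void)ro; return 0; }
static int move_live_extents(int zone_id)    { (void)zone_id; return 0; }
static int reset_write_pointer(int zone_id)  { (void)zone_id; return 0; }

/*
 * Clean one zone group: lock ordinary allocations out, migrate the live
 * data out of sufficiently stale zones, then reset those zones' write
 * pointers so they become clean, writeable space again.
 */
static int clean_zone_group(int zg, struct zone_summary *zones, size_t nr)
{
	int error = zg_set_readonly(zg, true);

	if (error)
		return error;

	for (size_t i = 0; i < nr && !error; i++) {
		if (zones[i].stale_pct < STALE_THRESHOLD)
			continue;
		error = move_live_extents(zones[i].zone_id);
		if (!error)
			error = reset_write_pointer(zones[i].zone_id);
	}

	zg_set_readonly(zg, false);	/* reopen the group for allocation */
	return error;
}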

I also expect that the cleaner won't be used in many bulk storage
applications as data is never deleted. I also expect that XFS-SMR
won't be used for general purpose storage applications - that's what
solid state storage will be used for - and so the cleaner is not
something we need to focus a lot of time and effort on.

And the thing that distributed storage guys should love: if we put
the cleaner in userspace, then they can *write their own cleaners*
that are customised to their own storage algorithms.

> What if that userspace thing
> dies, etc.? I suppose an implementation with as much mechanism in libxfs
> as possible allows us greatest flexibility to go in either direction
> here.

If the cleaner dies or can't make progress, we ENOSPC. Whether the
cleaner is in kernel or userspace is irrelevant to how we handle
such cases.

> - I'm also wondering how much real overlap there is in xfs_fsr (another
> thing I haven't really looked at :) beyond that it calls swapext.
> E.g., cleaning a zone sounds like it must map back to N files that could
> have allocated extents in the zone vs. considering individual files for
> defragmentation, fragmentation of the parent file may not be as much of
> a consideration as resetting zones, etc. It sounds like a separate tool
> might be warranted, even if there is code to steal from fsr. :)

As I implied above, zone cleaning is addressing exactly the same
problem as we are currently working on in xfs_fsr: defragmenting
free space.

> == Reverse mapping btrees
> 
> - This is something I still need to grok, perhaps just because the rmap
> code isn't available yet. But I'll note that this does seem like
> another bit that could be unnecessary if we could get away with using
> the traditional allocator.
> 
> == Mkfs
> 
> - We have references to the "metadata device" as well as random write
> regions. Similar to my question above, is there an expectation of a
> separate physical metadata device or is that terminology for the random
> write regions?

"metadata device" == "data device" == "CMR" == "random write region"

> Finally, some general/summary notes:
> 
> - Some kind of data structure outline would eventually make a nice
> addition to this document. I understand it's probably too early yet,
> but we are talking about new per-zone inodes, new and interesting
> relationships between AGs and zones (?), etc. Fine grained detail is not
> required, but an outline or visual that describes the high-level
> mappings goes a long way to facilitate reasoning about the design.

Sure, a plane flight is not long enough to do this. Future
revisions, as the structure is clarified.

> - A big question I had (and something that is touched on down thread wrt
> to embedded flash) is whether the random write zones are runtime
> configurable. If so, couldn't this facilitate use of existing AG
> metadata (now that I think of it, it's not clear to me whether the
> realtime mechanism excludes or coexists with AGs)?

the "realtime device" contains only user data. It contains no
filesystem metadata at all. That separation of user data and
filesystem metadata is what makes it so appealing for supporting SMR
devices....

> IOW, we obviously
> need this kind of space for inodes, dirs, xattrs, btrees, etc.
> regardless. It would be interesting if we had the added flexibility to
> align it with AGs.

I'm trying to keep the solution as simple as possible. No alignment,
single whole disk only, metadata in the "data device" on CMR and
user data in "real time" zones on SMR.

> diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
> index dd959ab..2fea88f 100644

Oh, there's a patch. Thanks! ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [ANNOUNCE] xfs: Supporting Host Aware SMR Drives
  2015-03-17 21:28     ` Dave Chinner
  (?)
@ 2015-03-21 14:48     ` Brian Foster
  -1 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2015-03-21 14:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, xfs

On Wed, Mar 18, 2015 at 08:28:35AM +1100, Dave Chinner wrote:
> On Tue, Mar 17, 2015 at 09:25:15AM -0400, Brian Foster wrote:
> > On Mon, Mar 16, 2015 at 05:00:20PM +1100, Dave Chinner wrote:
> > > Hi Folks,
> > > 
> > > As I told many people at Vault last week, I wrote a document
> > > outlining how we should modify the on-disk structures of XFS to
> > > support host aware SMR drives on the (long) plane flights to Boston.
> > > 
> > > TL;DR: not a lot of change to the XFS kernel code is required, no
> > > specific SMR awareness is needed by the kernel code.  Only
> > > relatively minor tweaks to the on-disk format will be needed and
> > > most of the userspace changes are relatively straight forward, too.
> > > 
> > > The source for that document can be found in this git tree here:
> > > 
> > > git://git.kernel.org/pub/scm/fs/xfs/xfs-documentation
> > > 
> > > in the file design/xfs-smr-structure.asciidoc. Alternatively,
> > > pull it straight from cgit:
> > > 
> > > https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc
> > > 
> > > Or there is a pdf version built from the current TOT on the xfs.org
> > > wiki here:
> > > 
> > > http://xfs.org/index.php/Host_Aware_SMR_architecture
> > > 
> > > Happy reading!
> > > 
> > 
> > Hi Dave,
> > 
> > Thanks for sharing this. Here are some thoughts/notes/questions/etc.
> > from a first pass. This is mostly XFS oriented and I'll try to break it
> > down by section.
> > 
> > I've also attached a diff to the original doc with some typo fixes and
> > whatnot. Feel free to just fold it into the original doc if you like.
> > 
> > == Concepts
> > 
> > - With regard to the assumption that the CMR region is not spread around
> > the drive, I saw at least one presentation at Vault that suggested
> > otherwise (the skylight one iirc). That said, it was theoretical and
> > based on a drive-managed drive. It is in no way clear to me whether that
> > is something to expect for host-managed drives.
> 
> AFAIK, the CMR region is contiguous. The skylight paper spells it
> out pretty clearly that it is a contiguous 20-25GB region on the
> outer edge of the seagate drives. Other vendors I've spoken to
> indicate that the region in host managed drives is also contiguous
> and at the outer edge, and some vendors have indicated they have
> much more of it than the seagate drives analysed in the skylight
> paper.
> 
> If it is not contiguous, then we can use DM to make that problem go
> away. i.e. use DM to stitch the CMR zones back together into a
> contiguous LBA region. Then we can size AGs in the data device to
> map to the size of the individual disjoint CMR regions, and we
> have a neat, well aligned, isolated solution to the problem without
> having to modify the XFS code at all.
> 

Looking back at the slides, that was apparently one of the emulated
drives. So I guess that bit was more oriented towards showcasing the
experimental method than to suggest how one of the drives works.
Regardless, it seems reasonable to me to use dm to stitch things
together (or go the other direction and split things up) if need be.

> > - It isn't clear to me here and in other places whether you propose to
> > use the CMR regions as a "metadata device" or require some other
> > randomly writeable storage to serve that purpose.
> 
> CMR as the "metadata device" if there is nothing else we can use.
> I'd really like to see hybrid drives with the "CMR" zone being the
> flash region in the drive....
> 

Ok.

> > == Journal modifications
> > 
> > - The tail->head log zeroing behavior on mount comes to mind here. Maybe
> > the writes are still sequential and it's not a problem, but we should
> > consider that with the proposition.  It's probably not critical as we do
> > have the out of using the cmr region here (as noted). I assume we can
> > also cleanly relocate the log without breaking anything else (e.g., the
> > current location is performance oriented rather than architectural,
> > yes?).
> 
> We place the log anywhere in the data device LBA space. You might
> want to go look up what L_AGNUM does in mkfs. :)
> 
> And if we can use the CMR region for the log, then that's what we'll
> do - "no modifications required" is always the best solution.
> 
> > == Data zones
> > 
> > - Will this actually support data overwrite or will that return error?
> 
> We'll support data overwrite. xfs_get_blocks() will need to detect
> overwrite....
> 
> > - TBH, I've never looked at realtime functionality so I don't grok the
> > high level approach yet. I'm wondering... have you considered a design
> > based on reflink and copy-on-write?
> 
> Yes, I have. Complex, invasive and we don't even have basic reflink
> infrastructure yet. Such a solution pushes us a couple of years
> out, as opposed to having something before the end of the year...
> 

It certainly would take longer to implement, but the point is that it's
a potential reuse of a mechanism we already plan to implement. I suppose
zone-aware allocation is a simpler problem for now and we can
revisit it down the road.

> > I know the current plan is to
> > disentangle the reflink tree from the rmap tree, but my understanding is
> > the reflink tree is still in the pipeline. Assuming we have that
> > functionality, it seems like there's potential to use it to overcome
> > some of the overwrite complexity.
> 
> There isn't much overwrite complexity - it's simply clearing bits
> in a zone bitmap to indicate free space, allocating new blocks and
> then rewriting bmbt extent records. It's fairly simple, really ;)
> 

Perhaps, but it's not really the act of marking blocks allocated or free
that I was interested in. It's the combination of managing the zone
write constraints in the write path and the allocator, finding free
blocks vs. stale blocks, etc. (e.g., the "extent lifecycle" for lack of
a better term).

> > Just as a handwaving example, use the
> > per-zone inode to hold an additional reference to each allocated extent
> > in the zone, thus all writes are handled as if the file had a clone. If
> > the only reference drops to the zoneino, the extent is freed and thus
> > stale wrt to the zone cleaner logic.
> > 
> > I suspect we would still need an allocation strategy, but I expect we're
> > going to have zone metadata regardless that will help deal with that.
> > Note that the current sparse inode proposal includes an allocation range
> > limit mechanism (for the inode record overlaps an ag boundary case),
> > which could potentially be used/extended to build something on top of
> > the existing allocator for zone allocation (e.g., if we had some kind of
> > zone record with the write pointer that indicated where it's safe to
> > allocate from). Again, just thinking out loud here.
> 
> Yup, but the bitmap allocator doesn't have support for many of the
> btree allocator controls.  It's a simple, fast, deterministic
> allocator, and we only need it to track freed space in the zones
> as all allocation from the zones is going to be sequential...
> 

Right, the point is that the traditional allocator has some mechanisms
that might facilitate zone compliant allocation provided we have the
associated zone metadata. E.g., the allocation range mechanism
facilitates allocation within a particular zone, within a "usable" range
of a zone, or across a wider set of zones of similar state, depending on
the allocator implementation details.

Anyways, I don't want to hijack this thread too much. :) I might send
you something separately for a sanity check or brainstorming purposes.

> > == Zone cleaner
> > 
> > - Paragraph 3 - "fixpel?" I would have just fixed this, but I can't
> > figure out what it's supposed to say. ;)
> > 
> > - The idea sounds sane, but the dependency on userspace for a critical
> > fs mechanism sounds a bit scary to be honest. Is in kernel allocation
> > going to throttle/depend on background work in the userspace cleaner in
> > the event of low writeable free space?
> 
> Of course. ENOSPC always throttles ;)
> 

Heh. :)

> I expect the cleaner will work a zone group at a time, locking new,
> non-cleaner based allocations out of the zone group while it cleans
> zones. This means the cleaner should always be able to make progress
> w.r.t. ENOSPC - it gets triggered on a zone group before it runs out
> of clean zones for freespace defrag purposes....
> 

There are some interesting allocation dynamics going on here that aren't
fully clear to me. E.g., on the one hand we want zone groups to be
fairly large to help manage the zone count, on the other we're
potentially locking out a TB-sized zone group at a time while the
userspace tool does its thing..? I take it this means we'll also want
some way to actually do zone-cleaning allocations (i.e., the extents
copied from the cleaned zones) from this zone from the userspace tool
while other general users are locked out. Even with that, incorporating
any kind of locality into the allocator seems futile if the target zone
group for an independently active file could be locked down at any given
point in time.

Maybe 256MB zone groups means that's less of a practical issue..? I'm
probably reading too far into it at this point... :P

> I also expect that the cleaner won't be used in many bulk storage
> applications as data is never deleted. I also expect that XFS-SMR
> won't be used for general purpose storage applications - that's what
> solid state storage will be used for - and so the cleaner is not
> something we need to focus a lot of time and effort on.
> 
> And the thing that distributed storage guys should love: if we put
> the cleaner in userspace, then they can *write their own cleaners*
> that are customised to their own storage algorithms.
> 
> > What if that userspace thing
> > dies, etc.? I suppose an implementation with as much mechanism in libxfs
> > as possible allows us greatest flexibility to go in either direction
> > here.
> 
> If the cleaner dies or can't make progress, we ENOSPC. Whether the
> cleaner is in kernel or userspace is irrelevant to how we handle
> such cases.
> 
> > - I'm also wondering how much real overlap there is in xfs_fsr (another
> > thing I haven't really looked at :) beyond that it calls swapext.
> > E.g., cleaning a zone sounds like it must map back to N files that could
> > have allocated extents in the zone vs. considering individual files for
> > defragmentation, fragmentation of the parent file may not be as much of
> > a consideration as resetting zones, etc. It sounds like a separate tool
> > might be warranted, even if there is code to steal from fsr. :)
> 
> As I implied above, zone cleaning is addressing exactly the same
> problem as we are currently working on in xfs_fsr: defragmenting
> free space.
> 

Ah, Ok. That is an interesting connection. There also seems to be an
interesting correlation between zone cleaning and overwrite handling +
unlink/truncate + discard handling (if you represent a zone with an
inode that tracks a particular fsb range and references "stale" blocks
before they are ultimately freed).

> > == Reverse mapping btrees
> > 
> > - This is something I still need to grok, perhaps just because the rmap
> > code isn't available yet. But I'll note that this does seem like
> > another bit that could be unnecessary if we could get away with using
> > the traditional allocator.
> > 
> > == Mkfs
> > 
> > - We have references to the "metadata device" as well as random write
> > regions. Similar to my question above, is there an expectation of a
> > separate physical metadata device or is that terminology for the random
> > write regions?
> 
> "metadata device" == "data device" == "CMR" == "random write region"
> 
> > Finally, some general/summary notes:
> > 
> > - Some kind of data structure outline would eventually make a nice
> > addition to this document. I understand it's probably too early yet,
> > but we are talking about new per-zone inodes, new and interesting
> > relationships between AGs and zones (?), etc. Fine grained detail is not
> > required, but an outline or visual that describes the high-level
> > mappings goes a long way to facilitate reasoning about the design.
> 
> Sure, a plane flight is not long enough to do this. Future
> revisions, as the structure is clarified.
> 

Of course. :)

> > - A big question I had (and something that is touched on down thread wrt
> > to embedded flash) is whether the random write zones are runtime
> > configurable. If so, couldn't this facilitate use of existing AG
> > metadata (now that I think of it, it's not clear to me whether the
> > realtime mechanism excludes or coexists with AGs)?
> 
> the "realtime device" contains only user data. It contains no
> filesystem metadata at all. That separation of user data and
> filesystem metadata is what makes it so appealing for supporting SMR
> devices....
> 
> > IOW, we obviously
> > need this kind of space for inodes, dirs, xattrs, btrees, etc.
> > regardless. It would be interesting if we had the added flexibility to
> > align it with AGs.
> 
> I'm trying to keep the solution as simple as possible. No alignment,
> single whole disk only, metadata in the "data device" on CMR and
> user data in "real time" zones on SMR.
> 

Understood. From the commentary here and our irc discussion, my take
away is that the primary objective is to get to some kind of SMR capable
solution sooner rather than later. Beyond that, you have concerns about
the complexity of making the current format work with smr drives. That
all sounds reasonable to me.

I get a bit more concerned when we start talking about implementing
solutions to the same problems we've mostly solved with the existing
algorithms, such as zone reservation vs. preallocation, zone group
rotoring vs. ag rotoring, etc. At some point, I think it will be worth
taking a harder look at whether we could reuse the more traditional
layout and algorithms...

Brian

> > diff --git a/design/xfs-smr-structure.asciidoc b/design/xfs-smr-structure.asciidoc
> > index dd959ab..2fea88f 100644
> 
> Oh, there's a patch. Thanks! ;)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-03-21 14:48 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-16  6:00 [ANNOUNCE] xfs: Supporting Host Aware SMR Drives Dave Chinner
2015-03-16  6:00 ` Dave Chinner
2015-03-16 15:28 ` James Bottomley
2015-03-16 15:28   ` James Bottomley
2015-03-16 18:23   ` Adrian Palmer
2015-03-16 18:23     ` Adrian Palmer
2015-03-16 19:06     ` James Bottomley
2015-03-16 19:06       ` James Bottomley
2015-03-16 20:20       ` Dave Chinner
2015-03-16 20:20         ` Dave Chinner
2015-03-16 22:48         ` Cyril Guyot
2015-03-16 20:32   ` Dave Chinner
2015-03-16 20:32     ` Dave Chinner
2015-03-17  1:12     ` Alireza Haghdoost
2015-03-17  1:12       ` Alireza Haghdoost
2015-03-17  6:06       ` Dave Chinner
2015-03-17 13:25 ` Brian Foster
2015-03-17 13:25   ` Brian Foster
2015-03-17 21:28   ` Dave Chinner
2015-03-17 21:28     ` Dave Chinner
2015-03-21 14:48     ` Brian Foster
