* Re: [Lsf-pc] [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems)
       [not found] <CAKdFiL6CMHTHWCvSFjFNw6hzvPLDEWWAC-o3wdivq9co7SX1FA@mail.gmail.com>
@ 2015-01-07 14:57 ` Sasha Levin
  2015-01-10 13:40   ` Hannes Reinecke
  2015-01-13 20:32 ` Adrian Palmer
  1 sibling, 1 reply; 7+ messages in thread
From: Sasha Levin @ 2015-01-07 14:57 UTC (permalink / raw)
  To: Adrian Palmer, lsf-pc; +Cc: linux-fsdevel, linux-ide, linux-scsi

On 01/06/2015 06:29 PM, Adrian Palmer wrote:
> I'd like to host a discussion of SMRFFS and ZAC for consumer and cloud
> systems at LSF/MM. I want to gather community consensus at LSF/MM of the
> required technical kernel changes before this topic is presented at Vault.
> 
> Subtopics:
> 
> On-disk metadata structures and data algorithms
> Explicit in-order write requirement and a look at the IO stack
> New IOCTLs to call from the FS and the need to know about the underlying
> disk -- no longer completely disk agnostic

Where can we read about the details of SMRFFS before LSF/MM / Vault?


Thanks,
Sasha


* Re: [Lsf-pc] [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems)
  2015-01-07 14:57 ` [Lsf-pc] [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems) Sasha Levin
@ 2015-01-10 13:40   ` Hannes Reinecke
  0 siblings, 0 replies; 7+ messages in thread
From: Hannes Reinecke @ 2015-01-10 13:40 UTC (permalink / raw)
  To: Sasha Levin, Adrian Palmer, lsf-pc; +Cc: linux-fsdevel, linux-ide, linux-scsi

On 01/07/2015 03:57 PM, Sasha Levin wrote:
> On 01/06/2015 06:29 PM, Adrian Palmer wrote:
>> I'd like to host a discussion of SMRFFS and ZAC for consumer and cloud
>> systems at LSF/MM. I want to gather community consensus at LSF/MM of the
>> required technical kernel changes before this topic is presented at Vault.
>>
>> Subtopics:
>>
>> On-disk metadata structures and data algorithms
>> Explicit in-order write requirement and a look at the IO stack
>> New IOCTLs to call from the FS and the need to know about the underlying
>> disk -- no longer completely disk agnostic
> 
> Where can we read about the details of SMRFFS before LSF/MM / Vault?
> 
Please be aware that I've been working on a ZAC prototype, and have
applied for a session at LSF/MM.
And my paper for Vault has just been accepted (which will be a survey of
existing filesystems and their suitability for SMR drives).

Can you keep me in the loop here?
Maybe we should have a joint session at LSF ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems)
       [not found] <CAKdFiL6CMHTHWCvSFjFNw6hzvPLDEWWAC-o3wdivq9co7SX1FA@mail.gmail.com>
  2015-01-07 14:57 ` [Lsf-pc] [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems) Sasha Levin
@ 2015-01-13 20:32 ` Adrian Palmer
  2015-01-13 21:50   ` Andreas Dilger
  1 sibling, 1 reply; 7+ messages in thread
From: Adrian Palmer @ 2015-01-13 20:32 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-ide, linux-scsi, linux-fsdevel

This seemed to bounce on most of the lists to which it was originally
sent.  I'm resending.

I've uploaded an introductory design document at
https://github.com/Seagate/SMR_FS-EXT4. I'll update regularly.  Please
feel free to send questions my way.

It seems there are many subtopics related to SMR requested for this conference.

Adrian

On Tue, Jan 6, 2015 at 4:29 PM, Adrian Palmer <adrian.palmer@seagate.com> wrote:
> I agree wholeheartedly with Dr. Reinecke in discussing what is becoming my
> favourite topic also. I support the need for generic filesystem support with
> SMR and ZAC/ZBC drives.
>
> Dr. Reinecke has already proposed a discussion on the ZAC/ZBC
> implementation.  As a complementary topic, I want to discuss the generic
> filesystem support for Host Aware (HA) / Host Managed (HM) drives.
>
> We at Seagate are developing an SMR Friendly File System (SMRFFS) for this
> very purpose.  Instead of a new filesystem with a long development time, we
> are implementing it as an HA extension to EXT4 (and it WILL be backwards
> compatible with minimal code paths).  I'll be talking about the on-disk
> changes we need to consider as well as the needed kernel changes common to
> all generic filesystems.  Later, we intend to evaluate the work for use in
> other filesystems and kernel processes.
>
> I'd like to host a discussion of SMRFFS and ZAC for consumer and cloud
> systems at LSF/MM. I want to gather community consensus at LSF/MM of the
> required technical kernel changes before this topic is presented at Vault.
>
> Subtopics:
>
> On-disk metadata structures and data algorithms
> Explicit in-order write requirement and a look at the IO stack
> New IOCTLs to call from the FS and the need to know about the underlying
> disk -- no longer completely disk agnostic
>
>
> Adrian Palmer
> Firmware Engineer II
> R&D Firmware
> Seagate, Longmont Colorado
> 720-684-1307


* Re: [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems)
  2015-01-13 20:32 ` Adrian Palmer
@ 2015-01-13 21:50   ` Andreas Dilger
  2015-01-13 23:26     ` Adrian Palmer
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Dilger @ 2015-01-13 21:50 UTC (permalink / raw)
  To: Adrian Palmer; +Cc: ext4 development, Linux Filesystem Development List

On Jan 13, 2015, at 1:32 PM, Adrian Palmer <adrian.palmer@seagate.com> wrote:
> This seemed to bounce on most of the lists to which it was originally
> sent.  I'm resending.
> 
> I've uploaded an introductory design document at
> https://github.com/Seagate/SMR_FS-EXT4. I'll update regularly.  Please
> feel free to send questions my way.
> 
> It seems there are many subtopics related to SMR requested for this conference.

I'm replying to this on the linux-ext4 list since it is mostly of
interest to ext4 developers, and I'm not in control over who attends
the LSF/MM conference.  Also, there will be an ext4 developer meeting
during/adjacent to LSF/MM that you should probably attend.

I think one of the important design decisions that needs to be made
early on is whether it is possible to directly access some storage
that can be updated with small random writes (either a separate flash
LUN on the device, or a section of the disk that is formatted for 4kB
sectors without SMR write requirements).

That would allow writing metadata (superblock, bitmap, group descriptor,
inode table, journal, in decreasing order of importance) in random
order instead of imposing possibly painful read-modify-write or COW
semantics on the whole filesystem.

As for the journal, I think it would be possible to handle that in a
way that is very SMR friendly.  It is written in linear order, and if
mke2fs can size/align the journal file with SMR write regions then the
only thing that needs to happen is to size/align journal transactions
and the journal superblock with SMR write regions as well. 
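A rough sketch of that sizing constraint (assuming the 256 MiB zone size current drives report; the helper below is purely illustrative, not mke2fs code):

```python
# Illustrative sketch only -- not mke2fs code.  Assumes 256 MiB zones,
# the size current SMR drives report (the standard does not fix it).
ZONE_SIZE = 256 * 1024 * 1024

def align_journal(requested_bytes, zone_size=ZONE_SIZE):
    """Round a requested journal size up to a whole number of zones,
    so the journal file starts and ends on zone boundaries."""
    zones = -(-requested_bytes // zone_size)  # ceiling division
    return zones * zone_size

# A default 128 MiB journal would grow to exactly one zone:
print(align_journal(128 * 1024 * 1024) // (1024 * 1024))  # 256
```

The same rounding would then apply to transaction batches, so commits land on zone-aligned boundaries.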

I saw on your SMR_FS-EXT4 README that you are looking at 8KB sector size.
Please correct my poor understanding of SMR, but isn't 8KB a lot smaller
than what the actual erase block size (or chunks or whatever they are
named)?  I thought the erase blocks were on the order of MB in size?

Are you already aware of the "bigalloc" feature?  That may provide most
of what you need already.  It may be appropriate to default to e.g. 1MB
bigalloc size for SMR drives, so that it is clear to users that the
effective IO/allocation size is large for that filesystem.

> On Tue, Jan 6, 2015 at 4:29 PM, Adrian Palmer <adrian.palmer@seagate.com> wrote:
>> I agree wholeheartedly with Dr. Reinecke in discussing what is becoming my
>> favourite topic also. I support the need for generic filesystem support with
>> SMR and ZAC/ZBC drives.
>> 
>> Dr. Reinecke has already proposed a discussion on the ZAC/ZBC
>> implementation.  As a complementary topic, I want to discuss the generic
>> filesystem support for Host Aware (HA) / Host Managed (HM) drives.
>> 
>> We at Seagate are developing an SMR Friendly File System (SMRFFS) for this
>> very purpose.  Instead of a new filesystem with a long development time, we
>> are implementing it as an HA extension to EXT4 (and it WILL be backwards
>> compatible with minimal code paths).  I'll be talking about the on-disk
>> changes we need to consider as well as the needed kernel changes common to
>> all generic filesystems.  Later, we intend to evaluate the work for use in
>> other filesystems and kernel processes.
>> 
>> I'd like to host a discussion of SMRFFS and ZAC for consumer and cloud
>> systems at LSF/MM. I want to gather community consensus at LSF/MM of the
>> required technical kernel changes before this topic is presented at Vault.
>> 
>> Subtopics:
>> 
>> On-disk metadata structures and data algorithms
>> Explicit in-order write requirement and a look at the IO stack
>> New IOCTLs to call from the FS and the need to know about the underlying
>> disk -- no longer completely disk agnostic
>> 
>> 
>> Adrian Palmer
>> Firmware Engineer II
>> R&D Firmware
>> Seagate, Longmont Colorado
>> 720-684-1307
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas







* Re: [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems)
  2015-01-13 21:50   ` Andreas Dilger
@ 2015-01-13 23:26     ` Adrian Palmer
  2015-02-15 20:27       ` Alireza Haghdoost
  0 siblings, 1 reply; 7+ messages in thread
From: Adrian Palmer @ 2015-01-13 23:26 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: ext4 development, Linux Filesystem Development List

Andreas;

Thanks.  I appear to have overlooked the ext4 list for some reason
(the most obvious list).

On Tue, Jan 13, 2015 at 2:50 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jan 13, 2015, at 1:32 PM, Adrian Palmer <adrian.palmer@seagate.com> wrote:
>> This seemed to bounce on most of the lists to which it was originally
>> sent.  I'm resending.
>>
>> I've uploaded an introductory design document at
>> https://github.com/Seagate/SMR_FS-EXT4. I'll update regularly.  Please
>> feel free to send questions my way.
>>
>> It seems there are many subtopics related to SMR requested for this conference.
>
> I'm replying to this on the linux-ext4 list since it is mostly of
> interest to ext4 developers, and I'm not in control over who attends
> the LSF/MM conference.  Also, there will be an ext4 developer meeting
> during/adjacent to LSF/MM that you should probably attend.

Is this co-located, or part of LSF/MM?  I would be very willing to
attend if I can.

>
> I think one of the important design decisions that needs to be made
> early on is whether it is possible to directly access some storage
> that can be updated with small random writes (either a separate flash
> LUN on the device, or a section of the disk that is formatted for 4kB
> sectors without SMR write requirements).

This would be nice, but I'm looking more generally at what I call
'single disk' systems.  Several more complicated FSs use a separate
flash drive for this purpose, but ext4 expects 1 vdev, and thus only
one type of media (agnostic).  We have hybrid HDDs that have flash on
them, but the LBA space isn't separate, so the FS or the DM couldn't
very easily treat them as 2 devices.

Also, talk in the standards committee has resulted in the allowance of
zero or more zones to be conventional PMR formatted vs SMR.  The idea
is to use the first zone on the disk.  That doesn't help because 1) the
GPT table is there and 2) partitions can be anywhere on the disk.  This
is set at manufacture time, and is not a change that can be made in
the field.

>
> That would allow writing metadata (superblock, bitmap, group descriptor,
> inode table, journal, in decreasing order of importance) in random
> order instead of imposing possibly painful read-modify-write or COW
> semantics on the whole filesystem.

Yeah.  Big design point.  For backwards compatibility, 1) the
superblock must reside in known locations and 2) any location change
in metadata would (eventually) require the superblock to be written in
place.  As such, the bitmaps are almost constantly updated: either
in place, or the in-place update passes through the group descriptor to
the superblock.

To make data more linear, I'm coding the data bitmap to mirror the
write pointer information from the disk.  That would make the update of
the data bitmap, while not trivial, much less important.  For
the metadata, I'm exploring the idea of putting a scratchpad in the
devicemapper to hold a zone worth of data to be compacted/rewritten in
place.  That will require some thought.  We should get to coding that
in a couple of weeks.
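A minimal model of the write pointer behaviour the bitmap would mirror (the class and names below are invented for this sketch, not SMRFFS or kernel code):

```python
# Toy model of a host-managed zone -- invented for illustration,
# not SMRFFS or kernel code.
class Zone:
    def __init__(self, start_lba, length):
        self.start = start_lba
        self.length = length
        self.wp = start_lba  # write pointer: next LBA the drive accepts

    def write(self, lba, nblocks):
        # A host-managed zone rejects any write not at the write pointer,
        # which is why the FS must track it to make write decisions.
        if lba != self.wp:
            raise IOError("unaligned write; zone is sequential-write-only")
        if self.wp + nblocks > self.start + self.length:
            raise IOError("write crosses zone boundary")
        self.wp += nblocks

    def reset_write_pointer(self):
        # Acts like an eraser for the whole zone (cf. discard/trim).
        self.wp = self.start

z = Zone(start_lba=0, length=512)
z.write(0, 128)
z.write(128, 64)   # must continue exactly at the write pointer
z.reset_write_pointer()
print(z.wp)        # 0 -- the FS now sees the zone as empty
```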

>
> As for the journal, I think it would be possible to handle that in a
> way that is very SMR friendly.  It is written in linear order, and if
> mke2fs can size/align the journal file with SMR write regions then the
> only thing that needs to happen is to size/align journal transactions
> and the journal superblock with SMR write regions as well.

Agreed.  A circular buffer would be nice, but that's in ZACv2.  In the
meantime, I'm looking at using 2 zones as a buffer, freeing one while
using the other, both in forward-only writes.  I remember Ts'o had a
proposal out for the journal.  We intend to rely on that when we get
to the journal (unfortunately, some time after LSF/MM).

>
> I saw on your SMR_FS-EXT4 README that you are looking at 8KB sector size.
> Please correct my poor understanding of SMR, but isn't 8KB a lot smaller
> than what the actual erase block size (or chunks or whatever they are
> named)?  I thought the erase blocks were on the order of MB in size?

SMR doesn't use erase blocks like flash.  The analogous concept is a
zone, which I admit is similar.  Zones are (currently) 256MiB, and
nothing in the standard requires this size -- it can change or even be
irregular.
Current BG max size is 128MiB (4k).  An 8K cluster allows for a BG to
match a zone in size -- a new BG doesn't (can't) start in the middle
of a zone.  Also, the BG/zone can be managed as a single unit for purposes
of file collocation/defragmentation.  The ResetWritePointer command
acts like an eraser, zeroing out the BG (using the same code path as
discard and trim).  The difference is that the FS is now aware of the
state of the zone, using the information to make write decisions --
and is NOT media agnostic anymore.
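A back-of-envelope check of the 8K figure, assuming ext4's standard rule that one block-size bitmap bounds a group at 8 * block_size blocks (illustrative arithmetic, not fs code):

```python
# Back-of-envelope check -- assumes ext4's rule that one bitmap block
# (8 * block_size bits) bounds the blocks per group.  Not fs code.
ZONE = 256 * 2**20  # 256 MiB zone, the size current drives report

for bs in (4096, 8192):
    max_bg_bytes = (8 * bs) * bs  # max blocks per group * block size
    fits = max_bg_bytes >= ZONE
    print(f"{bs // 1024}K blocks: max BG {max_bg_bytes // 2**20} MiB, "
          f"can match a zone: {fits}")
# 4K blocks: max BG 128 MiB, can match a zone: False
# 8K blocks: max BG 512 MiB, can match a zone: True
```

So with 4K blocks a group tops out at half a zone, while 8K leaves room to size a group to exactly one 256MiB zone.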

>
> Are you already aware of the "bigalloc" feature?  That may provide most
> of what you need already.  It may be appropriate to default to e.g. 1MB
> bigalloc size for SMR drives, so that it is clear to users that the
> effective IO/allocation size is large for that filesystem.

We've looked at this, and found several problems.  The biggest is that
it is still experimental, and it requires extents.  SMR HA and
HM don't like extents, as those require backward writes.  We are
looking at a combination of code to scavenge from flex_bg and meta_bg
to create the large BG and move the metadata around on the disk.  We
are finding that the developer resources required on that path are
MUCH less - LSF/MM is only 2 months away.


Thanks again for the questions

Adrian

>
>> On Tue, Jan 6, 2015 at 4:29 PM, Adrian Palmer <adrian.palmer@seagate.com> wrote:
>>> I agree wholeheartedly with Dr. Reinecke in discussing what is becoming my
>>> favourite topic also. I support the need for generic filesystem support with
>>> SMR and ZAC/ZBC drives.
>>>
>>> Dr. Reinecke has already proposed a discussion on the ZAC/ZBC
>>> implementation.  As a complementary topic, I want to discuss the generic
>>> filesystem support for Host Aware (HA) / Host Managed (HM) drives.
>>>
>>> We at Seagate are developing an SMR Friendly File System (SMRFFS) for this
>>> very purpose.  Instead of a new filesystem with a long development time, we
>>> are implementing it as an HA extension to EXT4 (and it WILL be backwards
>>> compatible with minimal code paths).  I'll be talking about the on-disk
>>> changes we need to consider as well as the needed kernel changes common to
>>> all generic filesystems.  Later, we intend to evaluate the work for use in
>>> other filesystems and kernel processes.
>>>
>>> I'd like to host a discussion of SMRFFS and ZAC for consumer and cloud
>>> systems at LSF/MM. I want to gather community consensus at LSF/MM of the
>>> required technical kernel changes before this topic is presented at Vault.
>>>
>>> Subtopics:
>>>
>>> On-disk metadata structures and data algorithms
>>> Explicit in-order write requirement and a look at the IO stack
>>> New IOCTLs to call from the FS and the need to know about the underlying
>>> disk -- no longer completely disk agnostic
>>>
>>>
>>> Adrian Palmer
>>> Firmware Engineer II
>>> R&D Firmware
>>> Seagate, Longmont Colorado
>>> 720-684-1307
>
>
> Cheers, Andreas
>
>
>
>
>


* Re: [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems)
  2015-01-13 23:26     ` Adrian Palmer
@ 2015-02-15 20:27       ` Alireza Haghdoost
  2015-02-16  5:02         ` Adrian Palmer
  0 siblings, 1 reply; 7+ messages in thread
From: Alireza Haghdoost @ 2015-02-15 20:27 UTC (permalink / raw)
  To: Adrian Palmer
  Cc: Andreas Dilger, ext4 development, Linux Filesystem Development List

>> I think one of the important design decisions that needs to be made
>> early on is whether it is possible to directly access some storage
>> that can be updated with small random writes (either a separate flash
>> LUN on the device, or a section of the disk that is formatted for 4kB
>> sectors without SMR write requirements).
>
> This would be nice, but I'm looking more generally at what I call
> 'single disk' systems.  Several more complicated FSs use a separate
> flash drive for this purpose, but ext4 expects 1 vdev, and thus only
> one type of media (agnostic).  We have hybrid HDDs that have flash on
> them, but the LBA space isn't separate, so the FS or the DM couldn't
> very easily treat them as 2 devices.
>

Adrian,
What if the vdev exposed to ext4 is composed of an md device
instead of a regular block device?  In other words, how do you see
these changes in the EXT4 file system applying to a software RAID
array of SMR drives?

--Alireza


* Re: [LSF/MM TOPIC] - SMR Modifications to EXT4 (and other generic file systems)
  2015-02-15 20:27       ` Alireza Haghdoost
@ 2015-02-16  5:02         ` Adrian Palmer
  0 siblings, 0 replies; 7+ messages in thread
From: Adrian Palmer @ 2015-02-16  5:02 UTC (permalink / raw)
  To: Alireza Haghdoost
  Cc: Andreas Dilger, ext4 development, Linux Filesystem Development List

That is an issue on deck to explore further.  The DM needs
to manage each disk independently, but aggregate them and present them
as 1 vdev.  The trick to be figured out is how it mixes the disks
in a ZBD-aware way.  Stripes of 256MiB are easily handled, but
impractical.  Stripes of 128k are practical, but not easily handled.

I see the changes that we're exploring/implementing as working on both
an SMR drive and a conventional drive.  ZBD does not require SMR, so
the superblock and group descriptor changes should not affect
conventional drives.  In fact, the gd will mark the bg as a
conventional zone by default, but can still use the ZBD changes
(forward-write and defragmentation) without the writepointer
information.

EXT4 will need to be forward-write only as per SMR/ZBD.  If working
with a combination of drives with small stripe sizes, re-writes would
work on one drive (conventional) but not the other (SMR).  The bulk of
the change would need to be in the DM, and will likely not bleed over
to the FS.  The exception I can see is that the bg size may need to
increase to accommodate multiple zones on multiple SMR drives (768MiB
or 1GiB BGs for RAID5).  The DM would be responsible for aggregating
the REPORT_ZONE data before presenting it to the FS (which would
behave as normally expected).  Of note, the standard requires zone
size as a power of 2, so a 3-disk RAID5 may violate that on
implementation.  RAID0 has similar constraints, and RAID1 can
operate in the same paradigm with no changes to zone information.

So, in short, the DM would have to be modified to pass the aggregated
zone information up to EXT4.  I don't see much divergence in the
proposed redesign of EXT4.
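The power-of-2 concern can be checked in a couple of lines.  The aggregation rule here -- one zone per member disk per logical zone, matching the 768MiB RAID5 figure above -- is my reading of the proposal, not settled design:

```python
# Illustrative check of the power-of-2 constraint on aggregated zones.
# Assumes the DM exposes one zone per member disk per logical zone
# (per the 768 MiB RAID5 figure above) -- an assumption, not settled design.
ZONE = 256 * 2**20

def is_pow2(n):
    return n > 0 and (n & (n - 1)) == 0

for name, members in (("RAID1", 1), ("RAID0 (2 disks)", 2), ("RAID5 (3 disks)", 3)):
    agg = members * ZONE  # RAID1 mirrors, so zone size is unchanged
    print(f"{name}: {agg // 2**20} MiB aggregated zone, power of 2: {is_pow2(agg)}")
# RAID1: 256 MiB aggregated zone, power of 2: True
# RAID0 (2 disks): 512 MiB aggregated zone, power of 2: True
# RAID5 (3 disks): 768 MiB aggregated zone, power of 2: False
```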


Adrian Palmer
Firmware Engineer II
R&D Firmware
Seagate, Longmont Colorado
720-684-1307


On Sun, Feb 15, 2015 at 1:27 PM, Alireza Haghdoost <haghdoost@gmail.com> wrote:
>>> I think one of the important design decisions that needs to be made
>>> early on is whether it is possible to directly access some storage
>>> that can be updated with small random writes (either a separate flash
>>> LUN on the device, or a section of the disk that is formatted for 4kB
>>> sectors without SMR write requirements).
>>
>> This would be nice, but I looking more generally to what I call
>> 'single disk' systems.  Several more complicated FSs use a separate
>> flash drive for this purpose, but ext4 expects 1 vdev, and thus only
>> one type of media (agnostic).  We have hybrid HDD that have flash on
>> them, but the lba space isn't separate, so the FS or the DM couldn't
>> very easily treat them as 2 devices.
>>
>
> Adrian,
> What if the vdev exposed to ext4 is composed of an md device
> instead of a regular block device?  In other words, how do you see
> these changes in the EXT4 file system applying to a software RAID
> array of SMR drives?
>
> --Alireza

