* [RFC] Draft Linux kernel interfaces for ZBC drives
@ 2014-01-31  5:38 Theodore Ts'o
  2014-01-31 13:07 ` Matthew Wilcox
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-01-31  5:38 UTC (permalink / raw)
  To: linux-fsdevel


I've been reading the draft ZBC specifications, especially 14-010r1[1],
and I've created the following draft kernel interfaces, which I present
as a strawman proposal for comments.

[1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf

As noted in the comments below, supporting variable length SMR zones
does result in more complexity at the file system / userspace interface
layer.  Life would certainly get simpler if these zones were fixed
length.

                                                        - Ted


/*
 * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
 * will have 32,768 zones.   That means if we tried to use a contiguous
 * array we would need to allocate 768k of contiguous, non-swappable
 * kernel memory.  (Boo, hiss.) 
 *
 * This is large enough that it would be painful to hang an array off the
 * block_device structure.  So we will define a function
 * blkdev_query_zones() to selectively return information for some
 * number of zones.
 */
struct zone_status {
       sector_t	z_start;
       __u32	z_length;
       __u32	z_write_ptr_offset;  /* offset */
       __u32	z_checkpoint_offset; /* offset */
       __u32	z_flags;	     /* full, ro, offline, reset_requested */
};

#define Z_FLAGS_FULL		0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAG_RESET_REQUESTED	0x0008

#define Z_FLAG_TYPE_MASK	0x0F00
#define Z_FLAG_TYPE_CONVENTIONAL 0x0000
#define Z_FLAG_TYPE_SEQUENTIAL	0x0100


/*
 * Query the block_device bdev for information about the zones
 * starting at start_sector that match the criteria specified by
 * free_sectors_criteria.  Zone status information for at most
 * max_zones will be placed into the memory array ret_zones.  The
 * return value contains the number of zones actually returned.
 *
 * If free_sectors_criteria is positive, then return zones that have
 * at least that many sectors available to be written.  If it is zero,
 * then match all zones.  If free_sectors_criteria is negative, then
 * return the zones that match the following criteria:
 *
 *      -1     Return all read-only zones
 *      -2     Return all offline zones
 *      -3     Return all zones where the write ptr != the checkpoint ptr
 */
extern int blkdev_query_zones(struct block_device *bdev,
			      sector_t start_sector,
			      int free_sectors_criteria,
       			      struct zone_status *ret_zones,
			      int max_zones);

/*
 * Reset the write pointer for a sequential write zone.
 *
 * Returns -EINVAL if the start_sector is not the beginning of a
 * sequential write zone.
 */
extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
				 sector_t start_sector);
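
To make the intended calling convention concrete, here is a rough,
uncompiled sketch of how a file system block allocator might use these
two functions.  Everything other than the two interfaces declared above
is made up for illustration:

/*
 * Illustrative sketch only (not part of the proposal): find a zone
 * with at least "needed" free sectors and return the sector where
 * the next write would have to go (the write pointer).
 */
static sector_t example_pick_write_sector(struct block_device *bdev,
					   int needed)
{
	struct zone_status zs;
	int ret;

	ret = blkdev_query_zones(bdev, 0, needed, &zs, 1);
	if (ret <= 0)
		return (sector_t) -1;	/* no suitable zone, or error */

	/* Sequential write zones must be written at the write pointer. */
	return zs.z_start + zs.z_write_ptr_offset;
}

A cleaner would similarly call blkdev_reset_zone_ptr(bdev, zs.z_start)
once all live data has been migrated out of a zone it wants to reuse.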


/* 
 * ----------------------------
 */

/* 
 * The zone_status structure could be a lot smaller if zones were a
 * constant, fixed size: we could then address zones using a 16-bit
 * integer instead of a 64-bit starting LBA, and this structure could
 * be half the size (12 bytes).
 *
 * We can also further shrink the structure by removing the
 * z_checkpoint_offset element, since most of the time
 * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
 * only time they will be different is after a write is interrupted
 * by an unexpected power removal.
 *
 * With the smaller structure, we could fit all of the zones in an 8TB
 * SMR drive in 256k, which maybe we could afford to vmalloc().
 */
struct simplified_zone_status {
       __u32	z_write_ptr_offset;  /* offset */
       __u32	z_flags;
};

/* add a new flag */
#define Z_FLAG_POWER_FAIL_WRITE 0x0010 /* write_ptr != checkpoint ptr */
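
As a concrete illustration of the fixed-length argument: if zones were
a constant, power-of-two size, the zone lookup would collapse to a
shift and an array index.  The names below are made up, and 256 MB
zones with 512-byte sectors are assumed:

#define EX_ZONE_SECTOR_SHIFT	19	/* 256 MB / 512-byte sectors = 2^19 */

static inline __u16 example_zone_index(sector_t sector)
{
	return (__u16)(sector >> EX_ZONE_SECTOR_SHIFT);
}

static inline sector_t example_zone_start(__u16 zone)
{
	return (sector_t)zone << EX_ZONE_SECTOR_SHIFT;
}

With 32,768 zones on an 8TB drive, the zone index fits comfortably in
16 bits.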

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
@ 2014-01-31 13:07 ` Matthew Wilcox
  2014-01-31 15:44   ` Theodore Ts'o
  2014-02-03 21:01 ` Jeff Moyer
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2014-01-31 13:07 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Fri, Jan 31, 2014 at 12:38:22AM -0500, Theodore Ts'o wrote:
> /*
>  * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
>  * will have 32,768 zones.   That means if we tried to use a contiguous
>  * array we would need to allocate 768k of contiguous, non-swappable
>  * kernel memory.  (Boo, hiss.) 
>  *
>  * This large enough that it would be painful to hang an array off the
>  * block_device structure.  So we will define a function
>  * blkdev_query_zones() to selectively return information for some
>  * number of zones.
>  */
> struct zone_status {
>        sector_t	z_start;
>        __u32	z_length;
>        __u32	z_write_ptr_offset;  /* offset */
>        __u32	z_checkpoint_offset; /* offset */
>        __u32	z_flags;	     /* full, ro, offline, reset_requested */
> };
> 
> /*
>  * Query the block_device bdev for information about the zones
>  * starting at start_sector that match the criteria specified by
>  * free_sectors_criteria.  Zone status information for at most
>  * max_zones will be placed into the memory array ret_zones.  The
>  * return value contains the number of zones actually returned.
>  *
>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *      -1     Return all read-only zones
>  *      -2     Return all offline zones
>  *      -3     Return all zones where the write ptr != the checkpoint ptr
>  */
> extern int blkdev_query_zones(struct block_device *bdev,
> 			      sector_t start_sector,
> 			      int free_sectors_criteria,
>        			      struct zone_status *ret_zones,
> 			      int max_zones);

So the caller does:

	zones = kmalloc(max * sizeof *zones, GFP_KERNEL);
	blkdev_query_zones(bdev, sector, fsc, zones, max);
...
	kfree(zones);

Just want to be sure I understand the lifetime rules on the memory used.
I imagine the block layer will have some kind of compressed representation,
probably a tree of some kind, then expand that representation into the
zone_status.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31 13:07 ` Matthew Wilcox
@ 2014-01-31 15:44   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-01-31 15:44 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

On Fri, Jan 31, 2014 at 06:07:54AM -0700, Matthew Wilcox wrote:
> > extern int blkdev_query_zones(struct block_device *bdev,
> > 			      sector_t start_sector,
> > 			      int free_sectors_criteria,
> >        			      struct zone_status *ret_zones,
> > 			      int max_zones);
> 
> So the caller does:
> 
> 	zones = kmalloc(max * sizeof *zones, GFP_KERNEL);
> 	blkdev_query_zones(bdev, sector, fsc, zones, max);
> ...
> 	kfree(zones);
> 
> Just want to be sure I understand the lifetime rules on the memory used.

Yes.  Or if the caller is looking for a single zone which has at least
256 free sectors, the structure might be allocated on the stack:

{
	struct zone_status zs;
	int ret;

	ret = blkdev_query_zones(bdev, 0, 256, &zs, 1);
	if (ret == 0)
		return -ENOSPC;
	....
}
	 
> I imagine the block layer will have some kind of compressed representation,
> probably a tree of some kind, then expand that representation into the
> zone_status.

Yes.  Zones which are off-line, full, or empty don't require storage
of the write pointer or checkpoint LBA, for example.  So if the vast
majority of the zones are either full, or empty, it wouldn't take that
much space to store the zone information.  One of the reasons why I
think we should have an interface using a function is so we can change
the underlying compressed representation without changing all of the
users of the zone status information.
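
For example (purely illustrative, and not part of the proposed
interface), the compressed representation might be little more than a
couple of bits of state per zone, plus a side structure that holds
write pointers only for the partially written zones:

enum ex_zone_state {		/* 2 bits per zone */
	EX_ZONE_EMPTY,		/* write ptr == 0, nothing else to store */
	EX_ZONE_FULL,		/* write ptr == zone length */
	EX_ZONE_OFFLINE,	/* no write ptr at all */
	EX_ZONE_OPEN,		/* write ptr kept in the side structure */
};

struct ex_zone_cache {
	unsigned long	*state_bits;	/* packed 2-bit states */
	struct rb_root	open_zones;	/* nodes carry (zone, write ptr)
					 * for OPEN zones only */
};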

Cheers,

					- Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
  2014-01-31 13:07 ` Matthew Wilcox
@ 2014-02-03 21:01 ` Jeff Moyer
  2014-02-03 21:07   ` Martin K. Petersen
  2014-02-03 21:38   ` Theodore Ts'o
  2014-02-03 21:03 ` Eric Sandeen
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Jeff Moyer @ 2014-02-03 21:01 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

"Theodore Ts'o" <tytso@mit.edu> writes:

>  * We can also further shrink the structure by removing the
>  * z_checkpoint_offset element, since most of the time
>  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
>  * only time they will be different is after a write is interrupted
>  * via an unexpected power removal

This may fall into the nit-picking category, but at runtime I'd expect
the write pointer and the checkpoint lba to be different more often than
not, unless you're doing all FUA writes, or are issuing flushes after
every write.

After a power loss event, do we know what READ will return when you try
to read between the checkpoint lba and the write pointer?  I didn't see
that in the linked specification, and I think it's important to know.

Cheers,
Jeff

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
  2014-01-31 13:07 ` Matthew Wilcox
  2014-02-03 21:01 ` Jeff Moyer
@ 2014-02-03 21:03 ` Eric Sandeen
  2014-02-03 22:17   ` Theodore Ts'o
  2014-02-04  2:00 ` HanBin Yoon
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2014-02-03 21:03 UTC (permalink / raw)
  To: Theodore Ts'o, linux-fsdevel

On 1/30/14, 11:38 PM, Theodore Ts'o wrote:
> I've been reading the draft ZBC specifications, especially 14-010r1[1],
> and I've created the following draft kernel interfaces, which I present
> as a strawman proposal for comments.
> 
> [1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf
> 
> As noted in the comments below, supporting variable length SMR zones
> does result in more complexity at the file system / userspace interface
> layer.  Life would certainly get simpler if these zones were fixed
> length.

Hi Ted - 

Just to flesh out the context for these a bit, what do you envision as the
consumer of these interfaces?  Things in the block layer?  A DM target?
Existing filesystems?  A new filesystem?

I suppose we'll need an interface similar to this at whatever layer has
to deal with it.  I've got my own opinions on where we might handle
it (IMHO, retrofitting 3 or 4 major filesystems sounds like more
than we really want to take on), but it'd be nice to know what you're
thinking here.

Thanks,
-Eric

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-03 21:01 ` Jeff Moyer
@ 2014-02-03 21:07   ` Martin K. Petersen
  2014-02-03 21:38   ` Theodore Ts'o
  1 sibling, 0 replies; 19+ messages in thread
From: Martin K. Petersen @ 2014-02-03 21:07 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Theodore Ts'o, linux-fsdevel

>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes:

Jeff> After a power loss event, do we know what READ will return when
Jeff> you try to read between the checkpoint lba and the write pointer?
Jeff> I didn't see that in the linked specification, and I think it's
Jeff> important to know.

That's still being actively discussed.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-03 21:01 ` Jeff Moyer
  2014-02-03 21:07   ` Martin K. Petersen
@ 2014-02-03 21:38   ` Theodore Ts'o
  2014-02-03 22:26     ` Jeff Moyer
  1 sibling, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-03 21:38 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-fsdevel

On Mon, Feb 03, 2014 at 04:01:15PM -0500, Jeff Moyer wrote:
> "Theodore Ts'o" <tytso@mit.edu> writes:
> 
> >  * We can also further shrink the structure by removing the
> >  * z_checkpoint_offset element, since most of the time
> >  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
> >  * only time they will be different is after a write is interrupted
> >  * via an unexpected power removal
> 
> This may fall into the nit-picking category, but at runtime I'd expect
> the write pointer and the checkpoint lba to be different more often than
> not, unless you're doing all FUA writes, or are issuing flushes after
> every write.

Sure, but the only time we care is after an unexpected power removal,
and I would expect that shortly after the system is rebooted, the file
system or userspace storage space application would want to take care
of dealing with recovery right away.

So I'm not really proposing to track the z_checkpoint_offset except
to report writes to the storage device that might have failed due to
power failures, since presumably this is the only time users of this
interface would care.

					- Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-03 21:03 ` Eric Sandeen
@ 2014-02-03 22:17   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-03 22:17 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-fsdevel

On Mon, Feb 03, 2014 at 03:03:02PM -0600, Eric Sandeen wrote:
> 
> Just to flesh out the context for these a bit, what do you envision as the
> consumer of these interfaces?  Things in the block layer?  A DM target?
> Existing filesystems?  A new filesystem?

All of the above.  In addition, I propose to expose this interface via
an ioctl for block devices, since there may be userspace applications
that want to manage the SMR drive directly from user space.  It would
also be used by ext4 to better align the journal (Project ext4-01 in
the SMR project spreadsheet).

I also want to expose the same ioctl for an ext4 file where all of its
blocks cover SMR zones directly --- that is some of the motivation for
the mke2fs mk_hugefile functionality that I've been working on, since I
have a specific userspace use case in mind for that, where we want to
get the management advantages of a file system but the userspace
application might want to treat one or more huge files as miniature
SMR disks.

> I suppose we'll need an interface similar to this at whatever layer has
> to deal with it.  I've got my own opinions on where we might handle
> it (IMHO, retrofitting 3 or 4 major filesystems sounds like more
> than really want to take on), but it'd be nice to know what you're
> thinking here.

Well, even if a device mapper shim layer which takes a restricted
mode SMR drive and handles the indirection layer (i.e. projects
Core-05 and Core-06) is a good thing, and I think it is, it still
might be a good idea to have the file system's block allocator make
its allocations more SMR-friendly.  Similarly, we
might use something like the very lazy journal updates (Ext4-02) to
make life easier for the shim layer, as well as using this interface
so we can better make block allocation decisions (Ext4-03).

So my vision for ext4 is not that it would ever be able to use
restricted mode SMR drives directly.  I expect that we would either
run on drives that implement cooperative (host aware) SMR mode, or
run on top of a device mapper shim that provided cooperative mode
layered on top of a restricted mode SMR drive.  So I
wouldn't call this a "massive retrofitting" of ext4, but just some
medium-sized changes that would make ext4 more SMR friendly.  Some of
these changes, such as the lazy journal updates, should also improve
performance for HDD and especially for eMMC flash storage.

And it may very well be that not all file systems will want to put in
the work to make them more SMR friendly, and that's fine.  Just as
ext4 hasn't really focused on supporting massive RAID arrays, and
that's OK, because users can always use XFS for that --- similarly,
it may be completely rational for the XFS developers to decide that it
doesn't make sense for them to worry about making XFS SMR-aware.

Cheers,

      	    	      	  	    	     - Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-03 21:38   ` Theodore Ts'o
@ 2014-02-03 22:26     ` Jeff Moyer
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff Moyer @ 2014-02-03 22:26 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

"Theodore Ts'o" <tytso@mit.edu> writes:

> On Mon, Feb 03, 2014 at 04:01:15PM -0500, Jeff Moyer wrote:
>> "Theodore Ts'o" <tytso@mit.edu> writes:
>> 
>> >  * We can also further shrink the structure by removing the
>> >  * z_checkpoint_offset element, since most of the time
>> >  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
>> >  * only time they will be different is after a write is interrupted
>> >  * via an unexpected power removal
>> 
>> This may fall into the nit-picking category, but at runtime I'd expect
>> the write pointer and the checkpoint lba to be different more often than
>> not, unless you're doing all FUA writes, or are issuing flushes after
>> every write.
>
> Sure, but the only time we care is after an unexpected power removal,
> and I would expect that shortly after the system is rebooted, the file
> system or userspace storage space application would want to take care
> of dealing with recovery right away.
>
> So I'm not really proposing to track the z_checkpoint_offset except
> report writes to the storage device that might have failed due to
> power failures, since presumably this is the only time users of this
> interface would care.

I agree it would be silly to track the checkpoint lba.  I only took
issue with your comment.  ;-)

Cheers,
Jeff

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
                   ` (2 preceding siblings ...)
  2014-02-03 21:03 ` Eric Sandeen
@ 2014-02-04  2:00 ` HanBin Yoon
  2014-02-04 16:27   ` Theodore Ts'o
  2014-02-11 18:43 ` [RFC] Draft Linux kernel interfaces for SMR/ZBC drives Theodore Ts'o
  2014-02-21 10:02 ` [RFC] Draft Linux kernel interfaces for ZBC drives Rohan Puri
  5 siblings, 1 reply; 19+ messages in thread
From: HanBin Yoon @ 2014-02-04  2:00 UTC (permalink / raw)
  To: linux-fsdevel

Theodore Ts'o <tytso@mit.edu> writes:

> #define Z_FLAGS_FULL		0x0001
> #define Z_FLAGS_OFFLINE		0x0002
> #define Z_FLAGS_RO		0x0004
> #define Z_FLAG_RESET_REQUESTED	0x0008
> 
> #define Z_FLAG_TYPE_MASK	0x0F00
> #define Z_FLAG_TYPE_CONVENTIONAL 0x0000
> #define Z_FLAG_TYPE_SEQUENTIAL	0x0100

Just a minor point, but I noticed that the specification (14-010r1)
orders these flags in the zone descriptor format (Table 6) differently
from the above #define's. I thought it might be handy to have these
sync up? For example:

#define Z_FLAG_RESET_REQUESTED	0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAGS_FULL		0x0008

#define Z_FLAG_TYPE_MASK	0x0F00
#define Z_FLAG_TYPE_CONVENTIONAL 0x0100 (Table 1 in 14-009r1)
#define Z_FLAG_TYPE_SEQUENTIAL	0x0200 (Table 1 in 14-009r1)

Or to be little-endian:

#define Z_FLAG_TYPE_MASK	0x000F
#define Z_FLAG_TYPE_CONVENTIONAL 0x0001 (Table 1 in 14-009r1)
#define Z_FLAG_TYPE_SEQUENTIAL	0x0002 (Table 1 in 14-009r1)

#define Z_FLAG_RESET_REQUESTED	0x0100
#define Z_FLAGS_OFFLINE		0x0200
#define Z_FLAGS_RO		0x0400
#define Z_FLAGS_FULL		0x0800


>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *      -1     Return all read-only zones
>  *      -2     Return all offline zones
>  *      -3     Return all zones where the write ptr != the checkpoint ptr

"all" above for -1/-2/-3 is still limited by (int) max_zones, correct?

I was also wondering whether the returned (struct) zone_status'es should 
have any ordering, e.g., if they should preserve the ordering given by the 
drive. According to the specification for REPORT ZONES, "The descriptors 
shall be sorted in ascending order based on the zone start LBA value." If 
this ordering is preserved, maybe it will help to reduce seek distance 
(assuming correlation between ascending LBA and going from OD to ID)?

Han



* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-04  2:00 ` HanBin Yoon
@ 2014-02-04 16:27   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-04 16:27 UTC (permalink / raw)
  To: HanBin Yoon; +Cc: linux-fsdevel

On Tue, Feb 04, 2014 at 02:00:58AM +0000, HanBin Yoon wrote:
> 
> Just a minor point, but I noticed that the specification (14-010r1) had an 
> ordering for these flags in the zone descriptor format (Table 6) in a 
> different way from the above #define's. I thought it might be handy to have 
> these sync up? For example:

Sure, making them line up would make it a little easier to
translate the response from the drive into the Linux kernel
structure.

> >  * If free_sectors_criteria is positive, then return zones that have
> >  * at least that many sectors available to be written.  If it is zero,
> >  * then match all zones.  If free_sectors_criteria is negative, then
> >  * return the zones that match the following criteria:
> >  *
> >  *      -1     Return all read-only zones
> >  *      -2     Return all offline zones
> >  *      -3     Return all zones where the write ptr != the checkpoint ptr
> 
> "all" above for -1/-2/-3 is still limited by (int) max_zones, correct?

Good point, that is what I intended.  We can better clarify this by
replacing "Return all..." with "Match all...".

> I was also wondering whether the returned (struct) zone_status'es should 
> have any ordering, e.g., if they should preserve the ordering given by the 
> drive. According to the specification for REPORT ZONES, "The descriptors 
> shall be sorted in ascending order based on the zone start LBA value." If 
> this ordering is preserved, maybe it will help to reduce seek distance 
> (assuming correlation between ascending LBA and going from OD to ID)?

Yes, I was assuming they would be returned in ascending LBA order,
but we should explicitly specify this in the interface specification.

Cheers,

					- Ted

* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
                   ` (3 preceding siblings ...)
  2014-02-04  2:00 ` HanBin Yoon
@ 2014-02-11 18:43 ` Theodore Ts'o
  2014-02-11 19:04   ` Andreas Dilger
  2014-02-21 10:02 ` [RFC] Draft Linux kernel interfaces for ZBC drives Rohan Puri
  5 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-11 18:43 UTC (permalink / raw)
  To: linux-fsdevel

Based on the comments raised on the list, here is a revised version of
the proposed ZBC kernel interface.

Changes from the last version:

1)  Aligned the Z_FLAG values with the ZBC specification to
	simplify implementations
2)  Aligned the free_sectors_criteria values (mostly) with the ZBC
	specification
3)  Clarified the behaviour of blkdev_query_zones()
4)  Added an ioctl interface to expose this functionality to userspace
5)  Removed the proposed simplified data variant

Please let me know what you think!

						- Ted


/*
 * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
 * will have 32,768 zones.   That means if we tried to use a contiguous
 * array we would need to allocate 768k of contiguous, non-swappable
 * kernel memory.  (Boo, hiss.) 
 *
 * This is large enough that it would be painful to hang an array off the
 * block_device structure.  So we will define a function
 * blkdev_query_zones() to selectively return information for some
 * number of zones.
 *
 * It is anticipated that the block device driver will store this
 * information in a compressed form, and that z_checkpoint_offset will
 * not be dynamically tracked.  That is, the checkpoint offset will,
 * if non-zero, indicate that the drive suffered a power fail event, and
 * the file system or userspace process may need to implement recovery
 * procedures.  Once the file system or userspace process writes to an
 * SMR band, the checkpoint offset will be cleared and future queries
 * for the SMR band will return the checkpoint offset == write_ptr.
 */
struct zone_status {
       sector_t	z_start;
       __u32	z_length;
       __u32	z_write_ptr_offset;  /* offset */
       __u32	z_checkpoint_offset; /* offset */
       __u32	z_flags;	     /* full, ro, offline, reset_requested */
};

#define Z_FLAG_RESET_REQUESTED	0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAGS_FULL		0x0008

#define Z_FLAG_TYPE_MASK	0x0F00
#define Z_FLAG_TYPE_CONVENTIONAL 0x0100
#define Z_FLAG_TYPE_SEQUENTIAL	0x0200


/*
 * Query the block_device bdev for information about the zones
 * starting at start_sector that match the criteria specified by
 * free_sectors_criteria.  Zone status information for at most
 * max_zones will be placed into the memory array ret_zones (which is
 * allocated by the caller, not by the blkdev_query_zones function),
 * in ascending LBA order.  The return value will be a kernel error
 * code if negative, or the number of zones actually returned if
 * non-negative.
 *
 * If free_sectors_criteria is positive, then return zones that have
 * at least that many sectors available to be written.  If it is zero,
 * then match all zones.  If free_sectors_criteria is negative, then
 * return the zones that match the following criteria:
 *
 *	-1     Match all full zones
 *	-2     Match all open zones
 *		  (the zone has at least one written sector and is not full)
 *	-3     Match all free zones
 *		  (the zone has no written sectors)
 *      -4     Match all read-only zones
 *      -5     Match all offline zones
 *      -6     Match all zones where the write ptr != the checkpoint ptr
 *
 * The negative values are taken from Table 4 of 14-010r1, with the
 * exception of -6, which is not in the draft spec --- but IMHO should
 * be :-) It is anticipated, though, that the kernel will keep this
 * info in memory and so will handle matching zones which meet
 * these criteria itself, without needing to issue a ZBC command for
 * each call to blkdev_query_zones().
 */
extern int blkdev_query_zones(struct block_device *bdev,
			      sector_t start_sector,
			      int free_sectors_criteria,
			      int max_zones,
       			      struct zone_status *ret_zones);

/*
 * Reset the write pointer for a sequential write zone.
 *
 * Returns -EINVAL if the start_sector is not the beginning of a
 * sequential write zone.
 */
extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
				 sector_t start_sector);
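
To make the intended use of the new criteria concrete, here is a rough,
uncompiled sketch of a power-fail recovery scan that a file system
might run at mount time using criterion -6.  Everything except the two
interfaces above is invented for illustration:

#define EX_CRIT_POWER_FAIL	(-6)	/* write ptr != checkpoint ptr */
#define EX_BATCH		32

static int example_powerfail_scan(struct block_device *bdev)
{
	struct zone_status zones[EX_BATCH];
	sector_t cursor = 0;
	int i, n;

	do {
		n = blkdev_query_zones(bdev, cursor, EX_CRIT_POWER_FAIL,
				       EX_BATCH, zones);
		if (n < 0)
			return n;
		for (i = 0; i < n; i++) {
			/*
			 * A real file system would verify or salvage the
			 * data up to z_checkpoint_offset before resetting.
			 */
			blkdev_reset_zone_ptr(bdev, zones[i].z_start);
			cursor = zones[i].z_start + zones[i].z_length;
		}
	} while (n == EX_BATCH);

	return 0;
}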


/* ioctl interface */

ZBCQUERY
	u64 starting_lba	/* IN */
	u32 criteria		/* IN */
	u32 *num_zones		/* IN/OUT */
	struct zone_status *ptr	/* OUT */

ZBCRESETZONE
	u64 starting_lba
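
For the userspace side, neither the argument layout nor the ioctl
numbers are nailed down by the above.  Purely as an illustration
(struct zbc_query and the other names here are placeholders), a
userspace SMR management tool might do something like:

struct zbc_query {			/* hypothetical argument layout */
	__u64			starting_lba;	/* IN     */
	__u32			criteria;	/* IN     */
	__u32			num_zones;	/* IN/OUT */
	struct zone_status	*zones;		/* OUT, caller-allocated */
};

/* Needs <sys/ioctl.h>, <errno.h>, and a real ZBCQUERY ioctl number. */
static int example_query_zones(int fd, struct zone_status *zs, __u32 max)
{
	struct zbc_query q = {
		.starting_lba	= 0,
		.criteria	= 0,	/* 0 == match all zones */
		.num_zones	= max,
		.zones		= zs,
	};

	if (ioctl(fd, ZBCQUERY, &q) < 0)
		return -errno;
	return q.num_zones;		/* zones actually returned */
}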



* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-02-11 18:43 ` [RFC] Draft Linux kernel interfaces for SMR/ZBC drives Theodore Ts'o
@ 2014-02-11 19:04   ` Andreas Dilger
  2014-02-11 19:53     ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2014-02-11 19:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Feb 11, 2014, at 11:43 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> Based on the comments raised on the list, here is a revised version of
> the proposed ZBC kernel interface.
> 
> Changes from the last version:
> 
> 1)  Aligned ZBC_FLAG values to be aligned with the ZBC specification to
> 	simplify implementations
> 2)  Aligned the free_sector_criteria values to be mostly aligned with the ZBC
> 	specification
> 3)  Clarified the behaviour of blkdev_query_zones()
> 4)  Added an ioctl interface to expose this functionality to userspace
> 5)  Removed the proposed simplified data variant
> 
> Please let me know what you think!

Should ZBCRESETZONE take a length or number of zones to reset?

Cheers, Andreas

> /*
> * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
> * will have 32,768 zones.   That means if we tried to use a contiguous
> * array we would need to allocate 768k of contiguous, non-swappable
> * kernel memory.  (Boo, hiss.) 
> *
> * This large enough that it would be painful to hang an array off the
> * block_device structure.  So we will define a function
> * blkdev_query_zones() to selectively return information for some
> * number of zones.
> *
> * It is anticipated that the block device driver will store this
> * information in a compressed form, and that z_checkpoint_offset will
> * not be dynamically tracked.  That is, the checkpoint offset will,
> * if non-zero, indicates that drive suffered a power fail event, and
> * the file system or userspace process may need to implement recovery
> * procedures.  Once the file system or userspace process writes to an
> * SMR band, the checkpoint offset will be cleared and future queries
> * for the SMR band will return the checkpoint offset == write_ptr.
> */
> struct zone_status {
>       sector_t	z_start;
>       __u32	z_length;
>       __u32	z_write_ptr_offset;  /* offset */
>       __u32	z_checkpoint_offset; /* offset */
>       __u32	z_flags;	     /* full, ro, offline, reset_requested */
> };
> 
> #define Z_FLAG_RESET_REQUESTED	0x0001
> #define Z_FLAGS_OFFLINE		0x0002
> #define Z_FLAGS_RO		0x0004
> #define Z_FLAGS_FULL		0x0008
> 
> #define Z_FLAG_TYPE_MASK	0x0F00
> #define Z_FLAG_TYPE_CONVENTIONAL 0x0100
> #define Z_FLAG_TYPE_SEQUENTIAL	0x0200
> 
> 
> /*
>  * Query the block_device bdev for information about the zones
>  * starting at start_sector that match the criteria specified by
>  * free_sectors_criteria.  Zone status information for at most
>  * max_zones will be placed into the memory array ret_zones (which is
>  * allocated by the caller, not by the blkdev_query_zones function),
>  * in ascending LBA order.  The return value will be a kernel error
>  * code if negative, or the number of zones actually returned if
>  * non-nonegative.
>  *
>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *	-1     Match all full zones
>  *	-2     Match all open zones
>  *		(the zone has at least one written sector and is not full)
>  *	-3     Match all free zones
>  *		(the zone has no written sectors)
>  *      -4     Match all read-only zones
>  *      -5     Match all offline zones
>  *      -6     Match all zones where the write ptr != the checkpoint ptr
>  *
>  * The negative values are taken from Table 4 of 14-010r1, with the
>  * exception of -6, which is not in the draft spec --- but IMHO should
>  * be :-) It is anticipated, though, that the kernel will keep this
>  * info in in memory and so will handle matching zones which meet
>  * these criteria itself, without needing to issue a ZBC command for
>  * each call to blkdev_query_zones().
>  */
> extern int blkdev_query_zones(struct block_device *bdev,
> 			      sector_t start_sector,
> 			      int free_sectors_criteria,
> 			      int max_zones,
>       			      struct zone_status *ret_zones);
> 
> /*
>  * Reset the write pointer for a sequential write zone.
>  *
>  * Returns -EINVAL if the start_sector is not the beginning of a
>  * sequential write zone.
>  */
> extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
> 				 sector_t start_sector);
> 
> 
> /* ioctl interface */
> 
> ZBCQUERY
> 	u64 starting_lba	/* IN */
> 	u32 criteria		/* IN */
> 	u32 *num_zones		/* IN/OUT */
> 	struct zone_status *ptr	/* OUT */
> 
> ZBCRESETZONE
> 	u64 starting_lba
> 
> 

* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-02-11 19:04   ` Andreas Dilger
@ 2014-02-11 19:53     ` Theodore Ts'o
  2014-02-13  2:08       ` Andreas Dilger
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-11 19:53 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel

On Tue, Feb 11, 2014 at 12:04:12PM -0700, Andreas Dilger wrote:
> 
> Should ZBCRESETZONE take a length or number of zones to reset?

I was assuming that we would reset one zone at a time, which I think
is the main way this would be used.

Thinking about this some more, one other way we could do this is to
reuse BLKDISCARD, with the restriction that the starting LBA (offset)
must be the first LBA for a zone, and offset + length - 1 must be the
last LBA for a zone.  If either of these restrictions is violated,
then BLKDISCARD would return EINVAL --- or maybe some other error code,
since EINVAL can also mean an invalid ioctl code.  
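
The check itself would be cheap; roughly (sketch only,
example_lookup_zone() is a made-up helper that fills in the zone_status
for the zone containing start_sector):

static int example_discard_to_reset(struct block_device *bdev,
				    sector_t start_sector, sector_t nr_sects)
{
	struct zone_status zs;

	if (example_lookup_zone(bdev, start_sector, &zs))
		return -EINVAL;
	/* The discard must cover exactly one whole zone. */
	if (start_sector != zs.z_start || nr_sects != zs.z_length)
		return -EINVAL;	/* or some other errno; see below */
	return blkdev_reset_zone_ptr(bdev, zs.z_start);
}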

Hmm... ERANGE would be good, but strerror(ERANGE) returns "Math
result not representable", which would be highly confusing, although
it falls within the grand tradition of confusing Unix error messages,
such as "not a typewriter".  :-)

Any other suggestions?

						- Ted

* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-02-11 19:53     ` Theodore Ts'o
@ 2014-02-13  2:08       ` Andreas Dilger
  2014-02-13  3:09         ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2014-02-13  2:08 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Feb 11, 2014, at 12:53 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Feb 11, 2014 at 12:04:12PM -0700, Andreas Dilger wrote:
>> 
>> Should ZBCRESETZONE take a length or number of zones to reset?
> 
> I was assuming that we would reset one zone at a time, which I think
> is the main way this would be used.
> 
> Thinking about this some more, one other way we could do is to reuse
> BLKDISCARD, with the restriction that the starting LBA (offset) must
> be the first LBA for a zone, and offset + length - 1 must be last LBA
> for a zone.  If either of these restrictions are violated, then
> BLKDISCARD would return EINVAL --- or maybe some other error code,
> since EINVAL can also mean an invalid ioctl code.  
> 
> Hmm... ERANGE would be good, but the strerror(ERANGE) returns "Math
> result not representable", which would be highly confusing, although
> it falls within the grand tradition of confusing Unix error messages,
> such as "not a typewriter".  :-)
> 
> Any other suggestions?

What about FITRIM?  I'm not sure of the exact semantics of ZBCRESETZONE,
but at first glance it seems similar to resetting an erase block on an
SSD.  That might also be beneficial if the SMR drives have directly
addressable on-board flash that is mapped to a Z_FLAG_TYPE_RANDOM zone
type (I wish...) and it would pass the FITRIM straight through.

For non-SSD "trim" it would just reset the SMR zone but not return zeroes.

Cheers, Andreas







* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-02-13  2:08       ` Andreas Dilger
@ 2014-02-13  3:09         ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-13  3:09 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel

On Wed, Feb 12, 2014 at 07:08:20PM -0700, Andreas Dilger wrote:
> 
> What about FITRIM?  I'm not sure of the exact semantics of ZBCRESETZONE,
> but at first glance it seems similar to resetting an erase block on an
> SSD.  That might also be beneficial if the SMR drives have directly
> addressable on-board flash that is mapped to a Z_FLAG_TYPE_RANDOM zone
> type (I wish...) and it would pass the FITRIM straight through.

But that's not what the FITRIM ioctl does.  The FITRIM ioctl is a
request for the file system to send trim/discard for all blocks which
are not in use by the file system.  It doesn't map well to how
ZBCRESETZONE works.

It is possible to map ZBCRESETZONE to BLKDISCARD, except that
non-SMR/ZBC drives don't have a concept of a write pointer, so while
it is true that all of the data blocks are "reset" a la an erase
block, with an MTD flash device, once an erase block is reset, you can
write to any page within the erase block, in any order, whereas this
is not true for an SMR/ZBC device.

> For non-SSD "trim" it would just reset the SMR zone but not return zeroes.

I think you are thinking of BLKDISCARD, not FITRIM.

							- Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
                   ` (4 preceding siblings ...)
  2014-02-11 18:43 ` [RFC] Draft Linux kernel interfaces for SMR/ZBC drives Theodore Ts'o
@ 2014-02-21 10:02 ` Rohan Puri
  2014-02-21 15:49   ` Theodore Ts'o
  5 siblings, 1 reply; 19+ messages in thread
From: Rohan Puri @ 2014-02-21 10:02 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux FS Devel

On Fri, Jan 31, 2014 at 11:08 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> I've been reading the draft ZBC specifications, especially 14-010r1[1],
> and I've created the following draft kernel interfaces, which I present
> as a strawman proposal for comments.
>
> [1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf
>
> As noted in the comments below, supporting variable length SMR zones
> does result in more complexity at the file system / userspace interface
> layer.  Life would certainly get simpler if these zones were fixed
> length.
>
>                                                         - Ted
>
>
> /*
>  * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
>  * will have 32,768 zones.   That means if we tried to use a contiguous
>  * array we would need to allocate 768k of contiguous, non-swappable
>  * kernel memory.  (Boo, hiss.)
>  *
>  * This large enough that it would be painful to hang an array off the
>  * block_device structure.  So we will define a function
>  * blkdev_query_zones() to selectively return information for some
>  * number of zones.
>  */
> struct zone_status {
>        sector_t z_start;
>        __u32    z_length;
>        __u32    z_write_ptr_offset;  /* offset */
>        __u32    z_checkpoint_offset; /* offset */
>        __u32    z_flags;             /* full, ro, offline, reset_requested */
> };
>
> #define Z_FLAGS_FULL            0x0001
> #define Z_FLAGS_OFFLINE         0x0002
> #define Z_FLAGS_RO              0x0004
> #define Z_FLAG_RESET_REQUESTED  0x0008
>
> #define Z_FLAG_TYPE_MASK        0x0F00
> #define Z_FLAG_TYPE_CONVENTIONAL 0x0000
> #define Z_FLAG_TYPE_SEQUENTIAL  0x0100
>
>
> /*
>  * Query the block_device bdev for information about the zones
>  * starting at start_sector that match the criteria specified by
>  * free_sectors_criteria.  Zone status information for at most
>  * max_zones will be placed into the memory array ret_zones.  The
>  * return value contains the number of zones actually returned.
>  *
>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *      -1     Return all read-only zones
>  *      -2     Return all offline zones
>  *      -3     Return all zones where the write ptr != the checkpoint ptr
>  */
> extern int blkdev_query_zones(struct block_device *bdev,
>                               sector_t start_sector,
>                               int free_sectors_criteria,
>                               struct zone_status *ret_zones,
>                               int max_zones);
In this API, the caller would allocate the memory for ret_zones as
sizeof(struct zone_status) * max_zones, right? There can be a case
where the return value is less than max_zones; in that case we would
have preallocated extra memory for (max_zones - retval) entries that
would not be used (since they would not contain valid zone_status
structs). As the HDD ages it can become prone to failures, so
differences between the two values can happen. Can we pass a double
pointer for ret_zones, so that the API allocates the memory and the
caller frees it? I would like to know your views on this. (This does
not apply to the single zone_status example that you gave.)

>
> /*
>  * Reset the write pointer for a sequential write zone.
>  *
>  * Returns -EINVAL if the start_sector is not the beginning of a
>  * sequential write zone.
>  */
> extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
>                                  sector_t start_sector);
>
>
> /*
>  * ----------------------------
>  */
>
> /*
>  * The zone_status structure could be a lot smaller if zones are a
>  * constant fixed size, then we could address zones using an 16 bit
>  * integer, instead of using a 64-bit starting lba number then this
>  * structure could half the size (12 bytes).
>  *
>  * We can also further shrink the structure by removing the
>  * z_checkpoint_offset element, since most of the time
>  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
>  * only time they will be different is after a write is interrupted
>  * via an unexpected power removal
>  *
>  * With the smaller structure, we could fit all of the zones in an 8TB
>  * SMR drive in 256k, which maybe we could afford to vmalloc()
>  */
> struct simplified_zone_status {
>        __u32    z_write_ptr_offset;  /* offset */
>        __u32    z_flags;
> };
>
> /* add a new flag */
> #define Z_FLAG_POWER_FAIL_WRITE 0x0010 /* write_ptr != checkpoint ptr */

- Regards,
     Rohan

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-21 10:02 ` [RFC] Draft Linux kernel interfaces for ZBC drives Rohan Puri
@ 2014-02-21 15:49   ` Theodore Ts'o
  2014-02-25  9:36     ` Rohan Puri
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-21 15:49 UTC (permalink / raw)
  To: Rohan Puri; +Cc: Linux FS Devel

On Fri, Feb 21, 2014 at 03:32:52PM +0530, Rohan Puri wrote:
> > extern int blkdev_query_zones(struct block_device *bdev,
> >                               sector_t start_sector,
> >                               int free_sectors_criteria,
> >                               struct zone_status *ret_zones,
> >                               int max_zones);
>
> In this api, the caller would allocate the memory for ret_zones as
> sizeof(struct zone_status) * max_zones, right? There can be a case
> where return value is less than max_zones, in this case we would be
> preallocating extra memory for (max_zones - ret val) that would not be
> used (since they would not contain valid zone_status structs). As the
> hdd ages, it can be prone to failures, instances of differences of the
> two values can happen. Can we pass a double pointer to ret_zones, so
> that the api allocates the memory and the caller can free it? Would
> like to know your views on this. This thing will be invalid for the
> single zone_status example that you gave.

I think you are making the assumption here that max_zones will
normally be the maximum number of zones available on the disk.  In
practice, this will never be true.  Consider that an 8TB SMR drive with
256 MB zones will have 32,768 zones.  The kernel will *not* want to
allocate 768k of non-swappable kernel memory on a regular basis.
(There is no guarantee there will be that number of contiguous pages
available, and if you use vmalloc() instead, it's slower since it
involves page table operations.)  Also, when will the kernel ever want
to see all of the zones all at once, anyway?

So it's likely that the caller will always be allocating a relatively
small number of zones (I suspect it will always be less than 128), and
if the caller needs more zones, it will simply call
blkdev_query_zones() with a larger start_sector value and get the next
128 zones.

So your concern about preallocating extra memory for zones that would
not be used is not, I believe, a major issue.


My anticipation is that the kernel will be storing the information
returned by blkdev_query_zones() in a much more compact fashion (since we
don't need to store the write pointer if the zone is completely full,
or completely empty, which will very often be the case, I suspect),
and there will be a different interface that will be used by block
device drivers to send this information to the block device layer
library function which will be maintaining this information in a
compact form.

I know that I still need to spec out some functions to make life
easier for the block device drivers that will be interfacing with the
ZBC maintenance layer.  They will probably look something like this:

extern int blkdev_set_zone_info(struct block_device *bdev,
       	   			struct zone_status *zone_info);

blkdev_set_zone_info() would get called once per zone when the block
device is initially set up.  My assumption is that the block device
layer will query the drive initially, and grab all of this
information, and keep it in the compressed form.  (Since querying this
data each time the OS needs it will likely be too expensive; even if
the ZBC commands don't have the same insanity as the non-queueable TRIM
command, the fact that we need to go out to the disk means that we
will need to send a disk command and wait for a command completion
interrupt, which would be sad.)
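
In other words, the setup path in a driver might look roughly like this
(uncompiled sketch; sd_example_report_zones() stands in for whatever
driver-internal helper issues REPORT ZONES and converts one descriptor
into a struct zone_status):

static int example_init_zone_info(struct block_device *bdev)
{
	struct zone_status zs;
	sector_t next = 0;
	int ret;

	/* -ENODATA from the made-up helper means "past the last zone". */
	while ((ret = sd_example_report_zones(bdev, next, &zs)) == 0) {
		ret = blkdev_set_zone_info(bdev, &zs);
		if (ret)
			return ret;
		next = zs.z_start + zs.z_length;
	}
	return ret == -ENODATA ? 0 : ret;
}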

I suspect we will also need commands such as these for the convenience
of the block device driver:

extern int blkdev_update_write_ptr(struct block_device *bdev,
       	   			   sector_t start_sector,
				   u32 write_ptr);

extern int blkdev_update_zone_info(struct block_device *bdev,
       	   			   struct zone_status *zone_info);

And we will probably want to define that in blkdev_query_zones(), if
start_sector is not located at the beginning of a zone, the first
zone returned will be the zone containing the specified sector.  (We'll
need this in the event that the T10 committee allows for variable
sized zones, instead of the much simpler fixed-size zone design, since
given a sector number, the block driver or the file system above the
ZBC OS management layer would have no way of mapping a sector number
to a specific zone.)
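
With variable-sized zones, that containing-zone lookup can still be
done in memory without a drive round trip, e.g. by a binary search over
the cached, LBA-sorted zone information (sketch only; the cached array
itself is hypothetical):

static int example_zone_containing(const struct zone_status *z,
				   int nr_zones, sector_t sector)
{
	int lo = 0, hi = nr_zones - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (sector < z[mid].z_start)
			hi = mid - 1;
		else if (sector >= z[mid].z_start + z[mid].z_length)
			lo = mid + 1;
		else
			return mid;	/* index of the containing zone */
	}
	return -ENOENT;
}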

So I suspect that as we start implementing device mapper SMR simulators
and actual SAS/SATA block device drivers which will interface with the
ZBC prototype drives, there may be other functions we will need to
implement in order to make life easier for both of these systems.

Cheers,

					- Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-21 15:49   ` Theodore Ts'o
@ 2014-02-25  9:36     ` Rohan Puri
  0 siblings, 0 replies; 19+ messages in thread
From: Rohan Puri @ 2014-02-25  9:36 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux FS Devel

On Fri, Feb 21, 2014 at 9:19 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, Feb 21, 2014 at 03:32:52PM +0530, Rohan Puri wrote:
>> > extern int blkdev_query_zones(struct block_device *bdev,
>> >                               sector_t start_sector,
>> >                               int free_sectors_criteria,
>> >                               struct zone_status *ret_zones,
>> >                               int max_zones);
>>
>> In this api, the caller would allocate the memory for ret_zones as
>> sizeof(struct zone_status) * max_zones, right? There can be a case
>> where return value is less than max_zones, in this case we would be
>> preallocating extra memory for (max_zones - ret val) that would not be
>> used (since they would not contain valid zone_status structs). As the
>> hdd ages, it can be prone to failures, instances of differences of the
>> two values can happen. Can we pass a double pointer to ret_zones, so
>> that the api allocates the memory and the caller can free it? Would
>> like to know your views on this. This thing will be invalid for the
>> single zone_status example that you gave.
>
> I think you are making the assumption here that max_zones will
> normally be the maximum number of zones available on the disk.  In
No, not the maximum; anything greater than 1 and less than the maximum
number of available zones.  Consider that out of 32,768 zones the
kernel wants to query 1000 zones; there can be a case where
information for only 700-800 of those zones can be obtained, and for
the remaining 200-300 zones the information couldn't be obtained due
to some error condition.  Now we would have preallocated an extra
(200-300) * 24 bytes.  This will only happen in an error path, and I
am not quite sure about the probability of it.  What are your views
on this?
> practice, this will never be true.  Consider that a 8TB SMR drive with
> 256 MB zones will have 32,768 zones.  The kernel will *not* want to
> allocate 768k of non-swappable kernel memory on a regular basis.
> (There is no guarantee there will be that number of contiguous pages
> available, and if you use vmalloc() instead, it's slower since it
> involves page table operations.)  Also, when will the kernel ever want
> to see all of the zones all at once, anyway?
>
Any filesystem that is SMR-aware could need this, right? For example
its block allocator, to optimise for fragmentation and so on?
> So it's likely that the caller will always be allocating, a relatively
> small number of zones (I suspect it will always be less than 128), and
Agreed, but this number has to be optimal to reduce the number of disk reads.
> if the caller needs more zones, it will simply call
> blkdev_qeury_zones() with a larger start_sector value and get the next
> 128 zones.
>
> So your concern about preallocating extra memory for zones that would
> not be used is I don't belive a major issue.
>
Yes, this could only happen with disk read errors and requests for
more than one zone's information.
>
> My anticipation is that kernel will be storing the information
> returned blkdev_query_zones() in a much more compact fashion (since we
> don't need to store the write pointer if the zone is completely full,
> or completely empty, which will very often be the case, I suspect),
> and there will be a different interface that will be used by block
> device drivers to send this information to the block device layer
> library function which will be maintaining this information in a
> compact form.
>
> I know that I still need to spec out some functions to make life
> easier for the block device drivers that will be interfacing into ZBC
> maintenance layer.   They will probably look something like this:
>
> extern int blkdev_set_zone_info(struct block_device *bdev,
>                                 struct zone_status *zone_info);
>
> blkdev_set_zone_info() would get called once per zone when the block
> device is initially set up.  My assumption is that the block device
Would this happen every time on OS boot-up? If so, won't it
increase the OS boot time?
> layer will query the drive initially, and grab all of this
> information, and keep it in the compressed form.  (Since querying this
> data each time the OS needs it will likely be too expensive; even if
> the ZBC commands don't have the same insanity as the non-queable TRIM
> command, the fact that we need to go out to the disk means that we
> will need to send a disk command and wait for an command completion
> interrupt, which would be sad.)
>
Agree.
> I suspect we will also need commands such as these for the convenience
> of the block device driver:
>
> extern int blkdev_update_write_ptr(struct block_device *bdev,
>                                    sector_t start_sector,
>                                    u32 write_ptr);
>
> extern int blkdev_update_zone_info(struct block_device *bdev,
>                                    struct zone_status *zone_info);
>
Will this update the on-disk state or the in-memory state, i.e.
write-through or write-back?
> And we will probably want to define that in blockdev_query_zones(), if
> start_sector is not located at the beginning of a zone, that the first
> zone returned will be zone containing the specified sector.  (We'll
> need this in the event that the T10 committee allows for variable
> sized zones, instead of the much simpler fixed-size zone design, since
> given a sector number, the block driver or the file system above the
> ZBC OS management layer would have no way of mapping a sector number
> to a specific zone.)
>
> So I suspect as start implementing device mapper SMR simulators and
> actual SAS/SATA block device drivers which will interface with the ZBC
> prototype drives, there may be other functions we will need to
> implement in order to make life easier both for these systems.
>
I am interested in project Core-04, the SMR simulator. I read a
project report related to it, from research conducted at UCSC:
http://www.ssrc.ucsc.edu/Papers/ssrctr-12-05.pdf
Also, I would like to know your inputs for Core-04.

> Cheers,
>
>                                         - Ted

- Regards,
     Rohan
