* [RFC] Draft Linux kernel interfaces for ZBC drives
@ 2014-01-31  5:38 Theodore Ts'o
  2014-01-31 13:07 ` Matthew Wilcox
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-01-31  5:38 UTC (permalink / raw)
  To: linux-fsdevel


I've been reading the draft ZBC specifications, especially 14-010r1[1],
and I've created the following draft kernel interfaces, which I present
as a strawman proposal for comments.

[1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf

As noted in the comments below, supporting variable length SMR zones
does result in more complexity at the file system / userspace interface
layer.  Life would certainly get simpler if these zones were fixed
length.

                                                        - Ted


/*
 * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
 * will have 32,768 zones.   That means if we tried to use a contiguous
 * array we would need to allocate 768k of contiguous, non-swappable
 * kernel memory.  (Boo, hiss.) 
 *
 * This is large enough that it would be painful to hang an array off the
 * block_device structure.  So we will define a function
 * blkdev_query_zones() to selectively return information for some
 * number of zones.
 */
struct zone_status {
       sector_t	z_start;
       __u32	z_length;
       __u32	z_write_ptr_offset;  /* offset */
       __u32	z_checkpoint_offset; /* offset */
       __u32	z_flags;	     /* full, ro, offline, reset_requested */
};

#define Z_FLAGS_FULL		0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAG_RESET_REQUESTED	0x0008

#define Z_FLAG_TYPE_MASK	0x0F00
#define Z_FLAG_TYPE_CONVENTIONAL 0x0000
#define Z_FLAG_TYPE_SEQUENTIAL	0x0100


/*
 * Query the block_device bdev for information about the zones
 * starting at start_sector that match the criteria specified by
 * free_sectors_criteria.  Zone status information for at most
 * max_zones will be placed into the memory array ret_zones.  The
 * return value contains the number of zones actually returned.
 *
 * If free_sectors_criteria is positive, then return zones that have
 * at least that many sectors available to be written.  If it is zero,
 * then match all zones.  If free_sectors_criteria is negative, then
 * return the zones that match the following criteria:
 *
 *      -1     Return all read-only zones
 *      -2     Return all offline zones
 *      -3     Return all zones where the write ptr != the checkpoint ptr
 */
extern int blkdev_query_zones(struct block_device *bdev,
			      sector_t start_sector,
			      int free_sectors_criteria,
       			      struct zone_status *ret_zones,
			      int max_zones);

/*
 * Reset the write pointer for a sequential write zone.
 *
 * Returns -EINVAL if the start_sector is not the beginning of a
 * sequential write zone.
 */
extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
				 sector_t start_sector);
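
To make the intended calling convention concrete, here is a rough,
uncompiled sketch of how a file system block allocator might use these
two functions.  Everything other than the two interfaces declared above
is made up for illustration:

/*
 * Illustrative sketch only (not part of the proposal): find a zone
 * with at least "needed" free sectors and return the sector where
 * the next write would have to go (the write pointer).
 */
static sector_t example_pick_write_sector(struct block_device *bdev,
					   int needed)
{
	struct zone_status zs;
	int ret;

	ret = blkdev_query_zones(bdev, 0, needed, &zs, 1);
	if (ret <= 0)
		return (sector_t) -1;	/* no suitable zone, or error */

	/* Sequential write zones must be written at the write pointer. */
	return zs.z_start + zs.z_write_ptr_offset;
}

A cleaner would similarly call blkdev_reset_zone_ptr(bdev, zs.z_start)
once all live data has been migrated out of a zone it wants to reuse.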


/* 
 * ----------------------------
 */

/* 
 * The zone_status structure could be a lot smaller if zones were a
 * constant, fixed size: we could then address zones using a 16-bit
 * integer instead of a 64-bit starting LBA, and this structure could
 * be half the size (12 bytes).
 *
 * We can also further shrink the structure by removing the
 * z_checkpoint_offset element, since most of the time
 * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
 * only time they will be different is after a write is interrupted
 * by an unexpected power removal.
 *
 * With the smaller structure, we could fit all of the zones in an 8TB
 * SMR drive in 256k, which maybe we could afford to vmalloc().
 */
struct simplified_zone_status {
       __u32	z_write_ptr_offset;  /* offset */
       __u32	z_flags;
};

/* add a new flag */
#define Z_FLAG_POWER_FAIL_WRITE 0x0010 /* write_ptr != checkpoint ptr */
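
As a concrete illustration of the fixed-length argument: if zones were
a constant, power-of-two size, the zone lookup would collapse to a
shift and an array index.  The names below are made up, and 256 MB
zones with 512-byte sectors are assumed:

#define EX_ZONE_SECTOR_SHIFT	19	/* 256 MB / 512-byte sectors = 2^19 */

static inline __u16 example_zone_index(sector_t sector)
{
	return (__u16)(sector >> EX_ZONE_SECTOR_SHIFT);
}

static inline sector_t example_zone_start(__u16 zone)
{
	return (sector_t)zone << EX_ZONE_SECTOR_SHIFT;
}

With 32,768 zones on an 8TB drive, the zone index fits comfortably in
16 bits.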

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
@ 2014-01-31 13:07 ` Matthew Wilcox
  2014-01-31 15:44   ` Theodore Ts'o
  2014-02-03 21:01 ` Jeff Moyer
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2014-01-31 13:07 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Fri, Jan 31, 2014 at 12:38:22AM -0500, Theodore Ts'o wrote:
> /*
>  * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
>  * will have 32,768 zones.   That means if we tried to use a contiguous
>  * array we would need to allocate 768k of contiguous, non-swappable
>  * kernel memory.  (Boo, hiss.) 
>  *
>  * This large enough that it would be painful to hang an array off the
>  * block_device structure.  So we will define a function
>  * blkdev_query_zones() to selectively return information for some
>  * number of zones.
>  */
> struct zone_status {
>        sector_t	z_start;
>        __u32	z_length;
>        __u32	z_write_ptr_offset;  /* offset */
>        __u32	z_checkpoint_offset; /* offset */
>        __u32	z_flags;	     /* full, ro, offline, reset_requested */
> };
> 
> /*
>  * Query the block_device bdev for information about the zones
>  * starting at start_sector that match the criteria specified by
>  * free_sectors_criteria.  Zone status information for at most
>  * max_zones will be placed into the memory array ret_zones.  The
>  * return value contains the number of zones actually returned.
>  *
>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *      -1     Return all read-only zones
>  *      -2     Return all offline zones
>  *      -3     Return all zones where the write ptr != the checkpoint ptr
>  */
> extern int blkdev_query_zones(struct block_device *bdev,
> 			      sector_t start_sector,
> 			      int free_sectors_criteria,
>        			      struct zone_status *ret_zones,
> 			      int max_zones);

So the caller does:

	zones = kmalloc(max * sizeof *zones, GFP_KERNEL);
	blkdev_query_zones(bdev, sector, fsc, zones, max);
...
	kfree(zones);

Just want to be sure I understand the lifetime rules on the memory used.
I imagine the block layer will have some kind of compressed representation,
probably a tree of some kind, then expand that representation into the
zone_status.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31 13:07 ` Matthew Wilcox
@ 2014-01-31 15:44   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-01-31 15:44 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel

On Fri, Jan 31, 2014 at 06:07:54AM -0700, Matthew Wilcox wrote:
> > extern int blkdev_query_zones(struct block_device *bdev,
> > 			      sector_t start_sector,
> > 			      int free_sectors_criteria,
> >        			      struct zone_status *ret_zones,
> > 			      int max_zones);
> 
> So the caller does:
> 
> 	zones = kmalloc(max * sizeof *zones, GFP_KERNEL);
> 	blkdev_query_zones(bdev, sector, fsc, zones, max);
> ...
> 	kfree(zones);
> 
> Just want to be sure I understand the lifetime rules on the memory used.

Yes.  Or if the caller is looking for a single zone which has at least
256 free sectors, the structure might be allocated on the stack:

{
	struct zone_status zs;
	int ret;

	ret = blkdev_query_zones(bdev, 0, 256, &zs, 1);
	if (ret == 0)
		return -ENOSPC;
	....
}
	 
> I imagine the block layer will have some kind of compressed representation,
> probably a tree of some kind, then expand that representation into the
> zone_status.

Yes.  Zones which are off-line, full, or empty don't require storage
of the write pointer or checkpoint LBA, for example.  So if the vast
majority of the zones are either full, or empty, it wouldn't take that
much space to store the zone information.  One of the reasons why I
think we should have an interface using a function is so we can change
the underlying compressed representation without changing all of the
users of the zone status information.
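
For example (purely illustrative, and not part of the proposed
interface), the compressed representation might be little more than a
couple of bits of state per zone, plus a side structure that holds
write pointers only for the partially written zones:

enum ex_zone_state {		/* 2 bits per zone */
	EX_ZONE_EMPTY,		/* write ptr == 0, nothing else to store */
	EX_ZONE_FULL,		/* write ptr == zone length */
	EX_ZONE_OFFLINE,	/* no write ptr at all */
	EX_ZONE_OPEN,		/* write ptr kept in the side structure */
};

struct ex_zone_cache {
	unsigned long	*state_bits;	/* packed 2-bit states */
	struct rb_root	open_zones;	/* nodes carry (zone, write ptr)
					 * for OPEN zones only */
};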

Cheers,

					- Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
  2014-01-31 13:07 ` Matthew Wilcox
@ 2014-02-03 21:01 ` Jeff Moyer
  2014-02-03 21:07   ` Martin K. Petersen
  2014-02-03 21:38   ` Theodore Ts'o
  2014-02-03 21:03 ` Eric Sandeen
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Jeff Moyer @ 2014-02-03 21:01 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

"Theodore Ts'o" <tytso@mit.edu> writes:

>  * We can also further shrink the structure by removing the
>  * z_checkpoint_offset element, since most of the time
>  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
>  * only time they will be different is after a write is interrupted
>  * via an unexpected power removal

This may fall into the nit-picking category, but at runtime I'd expect
the write pointer and the checkpoint lba to be different more often than
not, unless you're doing all FUA writes, or are issuing flushes after
every write.

After a power loss event, do we know what READ will return when you try
to read between the checkpoint lba and the write pointer?  I didn't see
that in the linked specification, and I think it's important to know.

Cheers,
Jeff

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
  2014-01-31 13:07 ` Matthew Wilcox
  2014-02-03 21:01 ` Jeff Moyer
@ 2014-02-03 21:03 ` Eric Sandeen
  2014-02-03 22:17   ` Theodore Ts'o
  2014-02-04  2:00 ` HanBin Yoon
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Eric Sandeen @ 2014-02-03 21:03 UTC (permalink / raw)
  To: Theodore Ts'o, linux-fsdevel

On 1/30/14, 11:38 PM, Theodore Ts'o wrote:
> I've been reading the draft ZBC specifications, especially 14-010r1[1],
> and I've created the following draft kernel interfaces, which I present
> as a strawman proposal for comments.
> 
> [1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf
> 
> As noted in the comments below, supporting variable length SMR zones
> does result in more complexity at the file system / userspace interface
> layer.  Life would certainly get simpler if these zones were fixed
> length.

Hi Ted - 

Just to flesh out the context for these a bit, what do you envision as the
consumer of these interfaces?  Things in the block layer?  A DM target?
Existing filesystems?  A new filesystem?

I suppose we'll need an interface similar to this at whatever layer has
to deal with it.  I've got my own opinions on where we might handle
it (IMHO, retrofitting 3 or 4 major filesystems sounds like more
than we really want to take on), but it'd be nice to know what you're
thinking here.

Thanks,
-Eric

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-03 21:01 ` Jeff Moyer
@ 2014-02-03 21:07   ` Martin K. Petersen
  2014-02-03 21:38   ` Theodore Ts'o
  1 sibling, 0 replies; 19+ messages in thread
From: Martin K. Petersen @ 2014-02-03 21:07 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Theodore Ts'o, linux-fsdevel

>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes:

Jeff> After a power loss event, do we know what READ will return when
Jeff> you try to read between the checkpoint lba and the write pointer?
Jeff> I didn't see that in the linked specification, and I think it's
Jeff> important to know.

That's still being actively discussed.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-03 21:01 ` Jeff Moyer
  2014-02-03 21:07   ` Martin K. Petersen
@ 2014-02-03 21:38   ` Theodore Ts'o
  2014-02-03 22:26     ` Jeff Moyer
  1 sibling, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-03 21:38 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-fsdevel

On Mon, Feb 03, 2014 at 04:01:15PM -0500, Jeff Moyer wrote:
> "Theodore Ts'o" <tytso@mit.edu> writes:
> 
> >  * We can also further shrink the structure by removing the
> >  * z_checkpoint_offset element, since most of the time
> >  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
> >  * only time they will be different is after a write is interrupted
> >  * via an unexpected power removal
> 
> This may fall into the nit-picking category, but at runtime I'd expect
> the write pointer and the checkpoint lba to be different more often than
> not, unless you're doing all FUA writes, or are issuing flushes after
> every write.

Sure, but the only time we care is after an unexpected power removal,
and I would expect that shortly after the system is rebooted, the file
system or userspace storage space application would want to take care
of dealing with recovery right away.

So I'm not really proposing to track the z_checkpoint_offset except
to report writes to the storage device that might have failed due to
power failures, since presumably this is the only time users of this
interface would care.

					- Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-03 21:03 ` Eric Sandeen
@ 2014-02-03 22:17   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-03 22:17 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-fsdevel

On Mon, Feb 03, 2014 at 03:03:02PM -0600, Eric Sandeen wrote:
> 
> Just to flesh out the context for these a bit, what do you envision as the
> consumer of these interfaces?  Things in the block layer?  A DM target?
> Existing filesystems?  A new filesystem?

All of the above.  In addition, I propose to expose this interface via
an ioctl for block devices, since there may be userspace applications
that want to manage the SMR drive directly from user space.  It would
also be used by ext4 to better align the journal (Project ext4-01 in
the SMR project spreadsheet).

I also want to expose the same ioctl for an ext4 file where all of its
blocks cover SMR zones directly --- that is some of the motivation for
the mke2fs mk_hugefile functionality that I've been working on, since I
have a specific userspace use case in mind for that, where we want to
get the management advantages of a file system but the userspace
application might want to treat one or more huge files as miniature
SMR disks.

> I suppose we'll need an interface similar to this at whatever layer has
> to deal with it.  I've got my own opinions on where we might handle
> it (IMHO, retrofitting 3 or 4 major filesystems sounds like more
> than really want to take on), but it'd be nice to know what you're
> thinking here.

Well, even if a device mapper shim layer which takes a restricted
mode SMR drive and handles the indirection layer (i.e. projects
Core-05 and Core-06) is a good thing, and I think it is, it still
might be a good idea to have the file system's block allocator make
its allocations more SMR-friendly.  Similarly, we
might use something like the very lazy journal updates (Ext4-02) to
make life easier for the shim layer, as well as using this interface
so we can better make block allocation decisions (Ext4-03).

So my vision for ext4 is not that it would ever be able to use
restricted mode SMR drives directly.  I expect that we would either
run on drives that implement cooperative (host aware) SMR mode, or
run on top of a device mapper shim that provided cooperative mode
layered on top of a restricted mode SMR drive.  So I
wouldn't call this a "massive retrofitting" of ext4, but just some
medium-sized changes that would make ext4 more SMR friendly.  Some of
these changes, such as the lazy journal updates, should also improve
performance for HDD and especially for eMMC flash storage.

And it may very well be that not all file systems will want to put in
the work to make them more SMR friendly, and that's fine.  Just as
ext4 hasn't really focused on supporting massive RAID arrays, and
that's OK, because users can always use XFS for that --- similarly,
it may be completely rational for the XFS developers to decide that it
doesn't make sense for them to worry about making XFS SMR-aware.

Cheers,

      	    	      	  	    	     - Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-03 21:38   ` Theodore Ts'o
@ 2014-02-03 22:26     ` Jeff Moyer
  0 siblings, 0 replies; 19+ messages in thread
From: Jeff Moyer @ 2014-02-03 22:26 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

"Theodore Ts'o" <tytso@mit.edu> writes:

> On Mon, Feb 03, 2014 at 04:01:15PM -0500, Jeff Moyer wrote:
>> "Theodore Ts'o" <tytso@mit.edu> writes:
>> 
>> >  * We can also further shrink the structure by removing the
>> >  * z_checkpoint_offset element, since most of the time
>> >  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
>> >  * only time they will be different is after a write is interrupted
>> >  * via an unexpected power removal
>> 
>> This may fall into the nit-picking category, but at runtime I'd expect
>> the write pointer and the checkpoint lba to be different more often than
>> not, unless you're doing all FUA writes, or are issuing flushes after
>> every write.
>
> Sure, but the only time we care is after an unexpected power removal,
> and I would expect that shortly after the system is rebooted, the file
> system or userspace storage space application would want to take care
> of dealing with recovery right away.
>
> So I'm not really proposing to track the z_checkpoint_offset except
> report writes to the storage device that might have failed due to
> power failures, since presumably this is the only time users of this
> interface would care.

I agree it would be silly to track the checkpoint lba.  I only took
issue with your comment.  ;-)

Cheers,
Jeff

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
                   ` (2 preceding siblings ...)
  2014-02-03 21:03 ` Eric Sandeen
@ 2014-02-04  2:00 ` HanBin Yoon
  2014-02-04 16:27   ` Theodore Ts'o
  2014-02-11 18:43 ` [RFC] Draft Linux kernel interfaces for SMR/ZBC drives Theodore Ts'o
  2014-02-21 10:02 ` [RFC] Draft Linux kernel interfaces for ZBC drives Rohan Puri
  5 siblings, 1 reply; 19+ messages in thread
From: HanBin Yoon @ 2014-02-04  2:00 UTC (permalink / raw)
  To: linux-fsdevel

Theodore Ts'o <tytso@mit.edu> writes:

> #define Z_FLAGS_FULL		0x0001
> #define Z_FLAGS_OFFLINE		0x0002
> #define Z_FLAGS_RO		0x0004
> #define Z_FLAG_RESET_REQUESTED	0x0008
> 
> #define Z_FLAG_TYPE_MASK	0x0F00
> #define Z_FLAG_TYPE_CONVENTIONAL 0x0000
> #define Z_FLAG_TYPE_SEQUENTIAL	0x0100

Just a minor point, but I noticed that the specification (14-010r1)
orders these flags in the zone descriptor format (Table 6) differently
from the above #define's. I thought it might be handy to have these
sync up? For example:

#define Z_FLAG_RESET_REQUESTED	0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAGS_FULL		0x0008

#define Z_FLAG_TYPE_MASK	0x0F00
#define Z_FLAG_TYPE_CONVENTIONAL 0x0100 (Table 1 in 14-009r1)
#define Z_FLAG_TYPE_SEQUENTIAL	0x0200 (Table 1 in 14-009r1)

Or to be little-endian:

#define Z_FLAG_TYPE_MASK	0x000F
#define Z_FLAG_TYPE_CONVENTIONAL 0x0001 (Table 1 in 14-009r1)
#define Z_FLAG_TYPE_SEQUENTIAL	0x0002 (Table 1 in 14-009r1)

#define Z_FLAG_RESET_REQUESTED	0x0100
#define Z_FLAGS_OFFLINE		0x0200
#define Z_FLAGS_RO		0x0400
#define Z_FLAGS_FULL		0x0800


>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *      -1     Return all read-only zones
>  *      -2     Return all offline zones
>  *      -3     Return all zones where the write ptr != the checkpoint ptr

"all" above for -1/-2/-3 is still limited by (int) max_zones, correct?

I was also wondering whether the returned (struct) zone_status'es should 
have any ordering, e.g., if they should preserve the ordering given by the 
drive. According to the specification for REPORT ZONES, "The descriptors 
shall be sorted in ascending order based on the zone start LBA value." If 
this ordering is preserved, maybe it will help to reduce seek distance 
(assuming correlation between ascending LBA and going from OD to ID)?

Han



* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-04  2:00 ` HanBin Yoon
@ 2014-02-04 16:27   ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-04 16:27 UTC (permalink / raw)
  To: HanBin Yoon; +Cc: linux-fsdevel

On Tue, Feb 04, 2014 at 02:00:58AM +0000, HanBin Yoon wrote:
> 
> Just a minor point, but I noticed that the specification (14-010r1) had an 
> ordering for these flags in the zone descriptor format (Table 6) in a 
> different way from the above #define's. I thought it might be handy to have 
> these sync up? For example:

Sure, making them line up would make it a little easier to
translate the response from the drive into the Linux kernel
structure.

> >  * If free_sectors_criteria is positive, then return zones that have
> >  * at least that many sectors available to be written.  If it is zero,
> >  * then match all zones.  If free_sectors_criteria is negative, then
> >  * return the zones that match the following criteria:
> >  *
> >  *      -1     Return all read-only zones
> >  *      -2     Return all offline zones
> >  *      -3     Return all zones where the write ptr != the checkpoint ptr
> 
> "all" above for -1/-2/-3 is still limited by (int) max_zones, correct?

Good point, that is what I intended.  We can better clarify this by
replacing "Return all..." with "Match all...".

> I was also wondering whether the returned (struct) zone_status'es should 
> have any ordering, e.g., if they should preserve the ordering given by the 
> drive. According to the specification for REPORT ZONES, "The descriptors 
> shall be sorted in ascending order based on the zone start LBA value." If 
> this ordering is preserved, maybe it will help to reduce seek distance 
> (assuming correlation between ascending LBA and going from OD to ID)?

Yes, I was assuming they would be returned in ascending LBA order,
but we should explicitly specify this in the interface specification.

Cheers,

					- Ted

* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
                   ` (3 preceding siblings ...)
  2014-02-04  2:00 ` HanBin Yoon
@ 2014-02-11 18:43 ` Theodore Ts'o
  2014-02-11 19:04   ` Andreas Dilger
  2014-02-21 10:02 ` [RFC] Draft Linux kernel interfaces for ZBC drives Rohan Puri
  5 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-11 18:43 UTC (permalink / raw)
  To: linux-fsdevel

Based on the comments raised on the list, here is a revised version of
the proposed ZBC kernel interface.

Changes from the last version:

1)  Aligned the Z_FLAG values with the ZBC specification to
	simplify implementations
2)  Aligned the free_sectors_criteria values (mostly) with the ZBC
	specification
3)  Clarified the behaviour of blkdev_query_zones()
4)  Added an ioctl interface to expose this functionality to userspace
5)  Removed the proposed simplified data variant

Please let me know what you think!

						- Ted


/*
 * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
 * will have 32,768 zones.   That means if we tried to use a contiguous
 * array we would need to allocate 768k of contiguous, non-swappable
 * kernel memory.  (Boo, hiss.) 
 *
 * This is large enough that it would be painful to hang an array off the
 * block_device structure.  So we will define a function
 * blkdev_query_zones() to selectively return information for some
 * number of zones.
 *
 * It is anticipated that the block device driver will store this
 * information in a compressed form, and that z_checkpoint_offset will
 * not be dynamically tracked.  That is, the checkpoint offset will,
 * if non-zero, indicate that the drive suffered a power fail event, and
 * the file system or userspace process may need to implement recovery
 * procedures.  Once the file system or userspace process writes to an
 * SMR band, the checkpoint offset will be cleared and future queries
 * for the SMR band will return the checkpoint offset == write_ptr.
 */
struct zone_status {
       sector_t	z_start;
       __u32	z_length;
       __u32	z_write_ptr_offset;  /* offset */
       __u32	z_checkpoint_offset; /* offset */
       __u32	z_flags;	     /* full, ro, offline, reset_requested */
};

#define Z_FLAG_RESET_REQUESTED	0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAGS_FULL		0x0008

#define Z_FLAG_TYPE_MASK	0x0F00
#define Z_FLAG_TYPE_CONVENTIONAL 0x0100
#define Z_FLAG_TYPE_SEQUENTIAL	0x0200


/*
 * Query the block_device bdev for information about the zones
 * starting at start_sector that match the criteria specified by
 * free_sectors_criteria.  Zone status information for at most
 * max_zones will be placed into the memory array ret_zones (which is
 * allocated by the caller, not by the blkdev_query_zones function),
 * in ascending LBA order.  The return value will be a kernel error
 * code if negative, or the number of zones actually returned if
 * non-negative.
 *
 * If free_sectors_criteria is positive, then return zones that have
 * at least that many sectors available to be written.  If it is zero,
 * then match all zones.  If free_sectors_criteria is negative, then
 * return the zones that match the following criteria:
 *
 *	-1     Match all full zones
 *	-2     Match all open zones
 *		  (the zone has at least one written sector and is not full)
 *	-3     Match all free zones
 *		  (the zone has no written sectors)
 *      -4     Match all read-only zones
 *      -5     Match all offline zones
 *      -6     Match all zones where the write ptr != the checkpoint ptr
 *
 * The negative values are taken from Table 4 of 14-010r1, with the
 * exception of -6, which is not in the draft spec --- but IMHO should
 * be :-) It is anticipated, though, that the kernel will keep this
 * info in memory and so will handle matching zones which meet
 * these criteria itself, without needing to issue a ZBC command for
 * each call to blkdev_query_zones().
 */
extern int blkdev_query_zones(struct block_device *bdev,
			      sector_t start_sector,
			      int free_sectors_criteria,
			      int max_zones,
       			      struct zone_status *ret_zones);

/*
 * Reset the write pointer for a sequential write zone.
 *
 * Returns -EINVAL if the start_sector is not the beginning of a
 * sequential write zone.
 */
extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
				 sector_t start_sector);
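
To make the intended use of the new criteria concrete, here is a rough,
uncompiled sketch of a power-fail recovery scan that a file system
might run at mount time using criterion -6.  Everything except the two
interfaces above is invented for illustration:

#define EX_CRIT_POWER_FAIL	(-6)	/* write ptr != checkpoint ptr */
#define EX_BATCH		32

static int example_powerfail_scan(struct block_device *bdev)
{
	struct zone_status zones[EX_BATCH];
	sector_t cursor = 0;
	int i, n;

	do {
		n = blkdev_query_zones(bdev, cursor, EX_CRIT_POWER_FAIL,
				       EX_BATCH, zones);
		if (n < 0)
			return n;
		for (i = 0; i < n; i++) {
			/*
			 * A real file system would verify or salvage the
			 * data up to z_checkpoint_offset before resetting.
			 */
			blkdev_reset_zone_ptr(bdev, zones[i].z_start);
			cursor = zones[i].z_start + zones[i].z_length;
		}
	} while (n == EX_BATCH);

	return 0;
}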


/* ioctl interface */

ZBCQUERY
	u64 starting_lba	/* IN */
	u32 criteria		/* IN */
	u32 *num_zones		/* IN/OUT */
	struct zone_status *ptr	/* OUT */

ZBCRESETZONE
	u64 starting_lba
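
For the userspace side, neither the argument layout nor the ioctl
numbers are nailed down by the above.  Purely as an illustration
(struct zbc_query and the other names here are placeholders), a
userspace SMR management tool might do something like:

struct zbc_query {			/* hypothetical argument layout */
	__u64			starting_lba;	/* IN     */
	__u32			criteria;	/* IN     */
	__u32			num_zones;	/* IN/OUT */
	struct zone_status	*zones;		/* OUT, caller-allocated */
};

/* Needs <sys/ioctl.h>, <errno.h>, and a real ZBCQUERY ioctl number. */
static int example_query_zones(int fd, struct zone_status *zs, __u32 max)
{
	struct zbc_query q = {
		.starting_lba	= 0,
		.criteria	= 0,	/* 0 == match all zones */
		.num_zones	= max,
		.zones		= zs,
	};

	if (ioctl(fd, ZBCQUERY, &q) < 0)
		return -errno;
	return q.num_zones;		/* zones actually returned */
}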



* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-02-11 18:43 ` [RFC] Draft Linux kernel interfaces for SMR/ZBC drives Theodore Ts'o
@ 2014-02-11 19:04   ` Andreas Dilger
  2014-02-11 19:53     ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2014-02-11 19:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Feb 11, 2014, at 11:43 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> Based on the comments raised on the list, here is a revised version of
> the proposed ZBC kernel interface.
> 
> Changes from the last version:
> 
> 1)  Aligned ZBC_FLAG values to be aligned with the ZBC specification to
> 	simplify implementations
> 2)  Aligned the free_sector_criteria values to be mostly aligned with the ZBC
> 	specification
> 3)  Clarified the behaviour of blkdev_query_zones()
> 4)  Added an ioctl interface to expose this functionality to userspace
> 5)  Removed the proposed simplified data variant
> 
> Please let me know what you think!

Should ZBCRESETZONE take a length or number of zones to reset?

Cheers, Andreas

> /*
> * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
> * will have 32,768 zones.   That means if we tried to use a contiguous
> * array we would need to allocate 768k of contiguous, non-swappable
> * kernel memory.  (Boo, hiss.) 
> *
> * This large enough that it would be painful to hang an array off the
> * block_device structure.  So we will define a function
> * blkdev_query_zones() to selectively return information for some
> * number of zones.
> *
> * It is anticipated that the block device driver will store this
> * information in a compressed form, and that z_checkpoint_offset will
> * not be dynamically tracked.  That is, the checkpoint offset will,
> * if non-zero, indicates that drive suffered a power fail event, and
> * the file system or userspace process may need to implement recovery
> * procedures.  Once the file system or userspace process writes to an
> * SMR band, the checkpoint offset will be cleared and future queries
> * for the SMR band will return the checkpoint offset == write_ptr.
> */
> struct zone_status {
>       sector_t	z_start;
>       __u32	z_length;
>       __u32	z_write_ptr_offset;  /* offset */
>       __u32	z_checkpoint_offset; /* offset */
>       __u32	z_flags;	     /* full, ro, offline, reset_requested */
> };
> 
> #define Z_FLAG_RESET_REQUESTED	0x0001
> #define Z_FLAGS_OFFLINE		0x0002
> #define Z_FLAGS_RO		0x0004
> #define Z_FLAGS_FULL		0x0008
> 
> #define Z_FLAG_TYPE_MASK	0x0F00
> #define Z_FLAG_TYPE_CONVENTIONAL 0x0100
> #define Z_FLAG_TYPE_SEQUENTIAL	0x0200
> 
> 
> /*
>  * Query the block_device bdev for information about the zones
>  * starting at start_sector that match the criteria specified by
>  * free_sectors_criteria.  Zone status information for at most
>  * max_zones will be placed into the memory array ret_zones (which is
>  * allocated by the caller, not by the blkdev_query_zones function),
>  * in ascending LBA order.  The return value will be a kernel error
>  * code if negative, or the number of zones actually returned if
>  * non-nonegative.
>  *
>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *	-1     Match all full zones
>  *	-2     Match all open zones
>  *		(the zone has at least one written sector and is not full)
>  *	-3     Match all free zones
>  *		(the zone has no written sectors)
>  *      -4     Match all read-only zones
>  *      -5     Match all offline zones
>  *      -6     Match all zones where the write ptr != the checkpoint ptr
>  *
>  * The negative values are taken from Table 4 of 14-010r1, with the
>  * exception of -6, which is not in the draft spec --- but IMHO should
>  * be :-) It is anticipated, though, that the kernel will keep this
>  * info in in memory and so will handle matching zones which meet
>  * these criteria itself, without needing to issue a ZBC command for
>  * each call to blkdev_query_zones().
>  */
> extern int blkdev_query_zones(struct block_device *bdev,
> 			      sector_t start_sector,
> 			      int free_sectors_criteria,
> 			      int max_zones,
>       			      struct zone_status *ret_zones);
> 
> /*
>  * Reset the write pointer for a sequential write zone.
>  *
>  * Returns -EINVAL if the start_sector is not the beginning of a
>  * sequential write zone.
>  */
> extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
> 				 sector_t start_sector);
> 
> 
> /* ioctl interface */
> 
> ZBCQUERY
> 	u64 starting_lba	/* IN */
> 	u32 criteria		/* IN */
> 	u32 *num_zones		/* IN/OUT */
> 	struct zone_status *ptr	/* OUT */
> 
> ZBCRESETZONE
> 	u64 starting_lba
> 
> 

* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-02-11 19:04   ` Andreas Dilger
@ 2014-02-11 19:53     ` Theodore Ts'o
  2014-02-13  2:08       ` Andreas Dilger
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-11 19:53 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel

On Tue, Feb 11, 2014 at 12:04:12PM -0700, Andreas Dilger wrote:
> 
> Should ZBCRESETZONE take a length or number of zones to reset?

I was assuming that we would reset one zone at a time, which I think
is the main way this would be used.

Thinking about this some more, one other way we could do this is to
reuse BLKDISCARD, with the restriction that the starting LBA (offset)
must be the first LBA for a zone, and offset + length - 1 must be the
last LBA for a zone.  If either of these restrictions is violated,
then BLKDISCARD would return EINVAL --- or maybe some other error code,
since EINVAL can also mean an invalid ioctl code.  
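
The check itself would be cheap; roughly (sketch only,
example_lookup_zone() is a made-up helper that fills in the zone_status
for the zone containing start_sector):

static int example_discard_to_reset(struct block_device *bdev,
				    sector_t start_sector, sector_t nr_sects)
{
	struct zone_status zs;

	if (example_lookup_zone(bdev, start_sector, &zs))
		return -EINVAL;
	/* The discard must cover exactly one whole zone. */
	if (start_sector != zs.z_start || nr_sects != zs.z_length)
		return -EINVAL;	/* or some other errno; see below */
	return blkdev_reset_zone_ptr(bdev, zs.z_start);
}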

Hmm... ERANGE would be good, but strerror(ERANGE) returns "Math
result not representable", which would be highly confusing, although
it falls within the grand tradition of confusing Unix error messages,
such as "not a typewriter".  :-)

Any other suggestions?

						- Ted

* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-02-11 19:53     ` Theodore Ts'o
@ 2014-02-13  2:08       ` Andreas Dilger
  2014-02-13  3:09         ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Andreas Dilger @ 2014-02-13  2:08 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-fsdevel

On Feb 11, 2014, at 12:53 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Feb 11, 2014 at 12:04:12PM -0700, Andreas Dilger wrote:
>> 
>> Should ZBCRESETZONE take a length or number of zones to reset?
> 
> I was assuming that we would reset one zone at a time, which I think
> is the main way this would be used.
> 
> Thinking about this some more, one other way we could do is to reuse
> BLKDISCARD, with the restriction that the starting LBA (offset) must
> be the first LBA for a zone, and offset + length - 1 must be last LBA
> for a zone.  If either of these restrictions are violated, then
> BLKDISCARD would return EINVAL --- or maybe some other error code,
> since EINVAL can also mean an invalid ioctl code.  
> 
> Hmm... ERANGE would be good, but the strerror(ERANGE) returns "Math
> result not representable", which would be highly confusing, although
> it falls within the grand tradition of confusing Unix error messages,
> such as "not a typewriter".  :-)
> 
> Any other suggestions?

What about FITRIM?  I'm not sure of the exact semantics of ZBCRESETZONE,
but at first glance it seems similar to resetting an erase block on an
SSD.  That might also be beneficial if the SMR drives have directly
addressable on-board flash that is mapped to a Z_FLAG_TYPE_RANDOM zone
type (I wish...) and it would pass the FITRIM straight through.

For non-SSD "trim" it would just reset the SMR zone but not return zeroes.

Cheers, Andreas







* Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
  2014-02-13  2:08       ` Andreas Dilger
@ 2014-02-13  3:09         ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-13  3:09 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-fsdevel

On Wed, Feb 12, 2014 at 07:08:20PM -0700, Andreas Dilger wrote:
> 
> What about FITRIM?  I'm not sure of the exact semantics of ZBCRESETZONE,
> but at first glance it seems similar to resetting an erase block on an
> SSD.  That might also be beneficial if the SMR drives have directly
> addressable on-board flash that is mapped to a Z_FLAG_TYPE_RANDOM zone
> type (I wish...) and it would pass the FITRIM straight through.

But that's not what the FITRIM ioctl does.  The FITRIM ioctl is a
request for the file system to send trim/discard for all blocks which
are not in use by the file system.  It doesn't map well to how
ZBCRESETZONE works.

It is possible to map ZBCRESETZONE to BLKDISCARD, except that
non-SMR/ZBC drives don't have a concept of a write pointer, so while
it is true that all of the data blocks are "reset" a la an erase
block, with an MTD flash device, once an erase block is reset, you can
write to any page within the erase block, in any order, whereas this
is not true for an SMR/ZBC device.

> For non-SSD "trim" it would just reset the SMR zone but not return zeroes.

I think you are thinking of BLKDISCARD, not FITRIM.

							- Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-01-31  5:38 [RFC] Draft Linux kernel interfaces for ZBC drives Theodore Ts'o
                   ` (4 preceding siblings ...)
  2014-02-11 18:43 ` [RFC] Draft Linux kernel interfaces for SMR/ZBC drives Theodore Ts'o
@ 2014-02-21 10:02 ` Rohan Puri
  2014-02-21 15:49   ` Theodore Ts'o
  5 siblings, 1 reply; 19+ messages in thread
From: Rohan Puri @ 2014-02-21 10:02 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux FS Devel

On Fri, Jan 31, 2014 at 11:08 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> I've been reading the draft ZBC specifications, especially 14-010r1[1],
> and I've created the following draft kernel interfaces, which I present
> as a strawman proposal for comments.
>
> [1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf
>
> As noted in the comments below, supporting variable length SMR zones
> does result in more complexity at the file system / userspace interface
> layer.  Life would certainly get simpler if these zones were fixed
> length.
>
>                                                         - Ted
>
>
> /*
>  * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
>  * will have 32,768 zones.   That means if we tried to use a contiguous
>  * array we would need to allocate 768k of contiguous, non-swappable
>  * kernel memory.  (Boo, hiss.)
>  *
>  * This large enough that it would be painful to hang an array off the
>  * block_device structure.  So we will define a function
>  * blkdev_query_zones() to selectively return information for some
>  * number of zones.
>  */
> struct zone_status {
>        sector_t z_start;
>        __u32    z_length;
>        __u32    z_write_ptr_offset;  /* offset */
>        __u32    z_checkpoint_offset; /* offset */
>        __u32    z_flags;             /* full, ro, offline, reset_requested */
> };
>
> #define Z_FLAGS_FULL            0x0001
> #define Z_FLAGS_OFFLINE         0x0002
> #define Z_FLAGS_RO              0x0004
> #define Z_FLAG_RESET_REQUESTED  0x0008
>
> #define Z_FLAG_TYPE_MASK        0x0F00
> #define Z_FLAG_TYPE_CONVENTIONAL 0x0000
> #define Z_FLAG_TYPE_SEQUENTIAL  0x0100
>
>
> /*
>  * Query the block_device bdev for information about the zones
>  * starting at start_sector that match the criteria specified by
>  * free_sectors_criteria.  Zone status information for at most
>  * max_zones will be placed into the memory array ret_zones.  The
>  * return value contains the number of zones actually returned.
>  *
>  * If free_sectors_criteria is positive, then return zones that have
>  * at least that many sectors available to be written.  If it is zero,
>  * then match all zones.  If free_sectors_criteria is negative, then
>  * return the zones that match the following criteria:
>  *
>  *      -1     Return all read-only zones
>  *      -2     Return all offline zones
>  *      -3     Return all zones where the write ptr != the checkpoint ptr
>  */
> extern int blkdev_query_zones(struct block_device *bdev,
>                               sector_t start_sector,
>                               int free_sectors_criteria,
>                               struct zone_status *ret_zones,
>                               int max_zones);
In this API, the caller would allocate the memory for ret_zones as
sizeof(struct zone_status) * max_zones, right? There can be a case
where the return value is less than max_zones; in that case we would
have preallocated extra memory for (max_zones - retval) entries that
would not be used (since they would not contain valid zone_status
structs). As the HDD ages it can become prone to failures, so
differences between the two values can happen. Can we pass a double
pointer for ret_zones, so that the API allocates the memory and the
caller frees it? I would like to know your views on this. (This does
not apply to the single zone_status example that you gave.)

>
> /*
>  * Reset the write pointer for a sequential write zone.
>  *
>  * Returns -EINVAL if the start_sector is not the beginning of a
>  * sequential write zone.
>  */
> extern int blkdev_reset_zone_ptr(struct block_dev *bdev,
>                                  sector_t start_sector);
>
>
> /*
>  * ----------------------------
>  */
>
> /*
>  * The zone_status structure could be a lot smaller if zones are a
>  * constant fixed size, then we could address zones using an 16 bit
>  * integer, instead of using a 64-bit starting lba number then this
>  * structure could half the size (12 bytes).
>  *
>  * We can also further shrink the structure by removing the
>  * z_checkpoint_offset element, since most of the time
>  * z_write_ptr_offset and z_checkpoint_offset will be the same.  The
>  * only time they will be different is after a write is interrupted
>  * via an unexpected power removal
>  *
>  * With the smaller structure, we could fit all of the zones in an 8TB
>  * SMR drive in 256k, which maybe we could afford to vmalloc()
>  */
> struct simplified_zone_status {
>        __u32    z_write_ptr_offset;  /* offset */
>        __u32    z_flags;
> };
>
> /* add a new flag */
> #define Z_FLAG_POWER_FAIL_WRITE 0x0010 /* write_ptr != checkpoint ptr */

- Regards,
     Rohan

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-21 10:02 ` [RFC] Draft Linux kernel interfaces for ZBC drives Rohan Puri
@ 2014-02-21 15:49   ` Theodore Ts'o
  2014-02-25  9:36     ` Rohan Puri
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2014-02-21 15:49 UTC (permalink / raw)
  To: Rohan Puri; +Cc: Linux FS Devel

On Fri, Feb 21, 2014 at 03:32:52PM +0530, Rohan Puri wrote:
> > extern int blkdev_query_zones(struct block_device *bdev,
> >                               sector_t start_sector,
> >                               int free_sectors_criteria,
> >                               struct zone_status *ret_zones,
> >                               int max_zones);
>
> In this api, the caller would allocate the memory for ret_zones as
> sizeof(struct zone_status) * max_zones, right? There can be a case
> where return value is less than max_zones, in this case we would be
> preallocating extra memory for (max_zones - ret val) that would not be
> used (since they would not contain valid zone_status structs). As the
> hdd ages, it can be prone to failures, instances of differences of the
> two values can happen. Can we pass a double pointer to ret_zones, so
> that the api allocates the memory and the caller can free it? Would
> like to know your views on this. This thing will be invalid for the
> single zone_status example that you gave.

I think you are making the assumption here that max_zones will
normally be the maximum number of zones available on the disk.  In
practice, this will never be true.  Consider that an 8TB SMR drive with
256 MB zones will have 32,768 zones.  The kernel will *not* want to
allocate 768k of non-swappable kernel memory on a regular basis.
(There is no guarantee there will be that number of contiguous pages
available, and if you use vmalloc() instead, it's slower since it
involves page table operations.)  Also, when will the kernel ever want
to see all of the zones all at once, anyway?

So it's likely that the caller will always be allocating a relatively
small number of zones (I suspect it will always be less than 128), and
if the caller needs more zones, it will simply call
blkdev_query_zones() with a larger start_sector value and get the next
128 zones.

So your concern about preallocating extra memory for zones that would
not be used is not, I believe, a major issue.


My anticipation is that the kernel will be storing the information
returned by blkdev_query_zones() in a much more compact fashion (since we
don't need to store the write pointer if the zone is completely full,
or completely empty, which will very often be the case, I suspect),
and there will be a different interface that will be used by block
device drivers to send this information to the block device layer
library function which will be maintaining this information in a
compact form.

I know that I still need to spec out some functions to make life
easier for the block device drivers that will be interfacing with the
ZBC maintenance layer.  They will probably look something like this:

extern int blkdev_set_zone_info(struct block_device *bdev,
       	   			struct zone_status *zone_info);

blkdev_set_zone_info() would get called once per zone when the block
device is initially set up.  My assumption is that the block device
layer will query the drive initially, and grab all of this
information, and keep it in the compressed form.  (Since querying this
data each time the OS needs it will likely be too expensive; even if
the ZBC commands don't have the same insanity as the non-queueable TRIM
command, the fact that we need to go out to the disk means that we
will need to send a disk command and wait for a command completion
interrupt, which would be sad.)
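
In other words, the setup path in a driver might look roughly like this
(uncompiled sketch; sd_example_report_zones() stands in for whatever
driver-internal helper issues REPORT ZONES and converts one descriptor
into a struct zone_status):

static int example_init_zone_info(struct block_device *bdev)
{
	struct zone_status zs;
	sector_t next = 0;
	int ret;

	/* -ENODATA from the made-up helper means "past the last zone". */
	while ((ret = sd_example_report_zones(bdev, next, &zs)) == 0) {
		ret = blkdev_set_zone_info(bdev, &zs);
		if (ret)
			return ret;
		next = zs.z_start + zs.z_length;
	}
	return ret == -ENODATA ? 0 : ret;
}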

I suspect we will also need commands such as these for the convenience
of the block device driver:

extern int blkdev_update_write_ptr(struct block_device *bdev,
       	   			   sector_t start_sector,
				   u32 write_ptr);

extern int blkdev_update_zone_info(struct block_device *bdev,
       	   			   struct zone_status *zone_info);

And we will probably want to define that in blkdev_query_zones(), if
start_sector is not located at the beginning of a zone, the first
zone returned will be the zone containing the specified sector.  (We'll
need this in the event that the T10 committee allows for variable
sized zones, instead of the much simpler fixed-size zone design, since
given a sector number, the block driver or the file system above the
ZBC OS management layer would have no way of mapping a sector number
to a specific zone.)
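
With variable-sized zones, that containing-zone lookup can still be
done in memory without a drive round trip, e.g. by a binary search over
the cached, LBA-sorted zone information (sketch only; the cached array
itself is hypothetical):

static int example_zone_containing(const struct zone_status *z,
				   int nr_zones, sector_t sector)
{
	int lo = 0, hi = nr_zones - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (sector < z[mid].z_start)
			hi = mid - 1;
		else if (sector >= z[mid].z_start + z[mid].z_length)
			lo = mid + 1;
		else
			return mid;	/* index of the containing zone */
	}
	return -ENOENT;
}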

So I suspect that as we start implementing device mapper SMR simulators
and actual SAS/SATA block device drivers which will interface with the
ZBC prototype drives, there may be other functions we will need to
implement in order to make life easier for both of these systems.

Cheers,

					- Ted

* Re: [RFC] Draft Linux kernel interfaces for ZBC drives
  2014-02-21 15:49   ` Theodore Ts'o
@ 2014-02-25  9:36     ` Rohan Puri
  0 siblings, 0 replies; 19+ messages in thread
From: Rohan Puri @ 2014-02-25  9:36 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Linux FS Devel

On Fri, Feb 21, 2014 at 9:19 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, Feb 21, 2014 at 03:32:52PM +0530, Rohan Puri wrote:
>> > extern int blkdev_query_zones(struct block_device *bdev,
>> >                               sector_t start_sector,
>> >                               int free_sectors_criteria,
>> >                               struct zone_status *ret_zones,
>> >                               int max_zones);
>>
>> In this api, the caller would allocate the memory for ret_zones as
>> sizeof(struct zone_status) * max_zones, right? There can be a case
>> where return value is less than max_zones, in this case we would be
>> preallocating extra memory for (max_zones - ret val) that would not be
>> used (since they would not contain valid zone_status structs). As the
>> hdd ages, it can be prone to failures, instances of differences of the
>> two values can happen. Can we pass a double pointer to ret_zones, so
>> that the api allocates the memory and the caller can free it? Would
>> like to know your views on this. This thing will be invalid for the
>> single zone_status example that you gave.
>
> I think you are making the assumption here that max_zones will
> normally be the maximum number of zones available on the disk.  In
No, not the maximum; anything greater than 1 and less than the maximum
number of available zones.  Consider that out of 32,768 zones the
kernel wants to query 1000 zones; there can be a case where
information for only 700-800 of those zones can be obtained, and for
the remaining 200-300 zones the information couldn't be obtained due
to some error condition.  Now we would have preallocated an extra
(200-300) * 24 bytes.  This will only happen in an error path, and I
am not quite sure about the probability of it.  What are your views
on this?
> practice, this will never be true.  Consider that a 8TB SMR drive with
> 256 MB zones will have 32,768 zones.  The kernel will *not* want to
> allocate 768k of non-swappable kernel memory on a regular basis.
> (There is no guarantee there will be that number of contiguous pages
> available, and if you use vmalloc() instead, it's slower since it
> involves page table operations.)  Also, when will the kernel ever want
> to see all of the zones all at once, anyway?
>
Any filesystem that is SMR-aware could need this, right? For example
its block allocator, to optimise for fragmentation and so on?
> So it's likely that the caller will always be allocating, a relatively
> small number of zones (I suspect it will always be less than 128), and
Agreed, but this number has to be optimal to reduce the number of disk reads.
> if the caller needs more zones, it will simply call
> blkdev_qeury_zones() with a larger start_sector value and get the next
> 128 zones.
>
> So your concern about preallocating extra memory for zones that would
> not be used is I don't belive a major issue.
>
Yes, this could only happen with disk read errors and requests for
more than one zone's information.
>
> My anticipation is that kernel will be storing the information
> returned blkdev_query_zones() in a much more compact fashion (since we
> don't need to store the write pointer if the zone is completely full,
> or completely empty, which will very often be the case, I suspect),
> and there will be a different interface that will be used by block
> device drivers to send this information to the block device layer
> library function which will be maintaining this information in a
> compact form.
>
> I know that I still need to spec out some functions to make life
> easier for the block device drivers that will be interfacing into ZBC
> maintenance layer.   They will probably look something like this:
>
> extern int blkdev_set_zone_info(struct block_device *bdev,
>                                 struct zone_status *zone_info);
>
> blkdev_set_zone_info() would get called once per zone when the block
> device is initially set up.  My assumption is that the block device
Would this happen every time on OS boot-up? If so, won't it
increase the OS boot time?
> layer will query the drive initially, and grab all of this
> information, and keep it in the compressed form.  (Since querying this
> data each time the OS needs it will likely be too expensive; even if
> the ZBC commands don't have the same insanity as the non-queable TRIM
> command, the fact that we need to go out to the disk means that we
> will need to send a disk command and wait for an command completion
> interrupt, which would be sad.)
>
Agree.
> I suspect we will also need commands such as these for the convenience
> of the block device driver:
>
> extern int blkdev_update_write_ptr(struct block_device *bdev,
>                                    sector_t start_sector,
>                                    u32 write_ptr);
>
> extern int blkdev_update_zone_info(struct block_device *bdev,
>                                    struct zone_status *zone_info);
>
Will this update the on-disk state or the in-memory state, i.e.
write-through or write-back?
> And we will probably want to define that in blockdev_query_zones(), if
> start_sector is not located at the beginning of a zone, that the first
> zone returned will be zone containing the specified sector.  (We'll
> need this in the event that the T10 committee allows for variable
> sized zones, instead of the much simpler fixed-size zone design, since
> given a sector number, the block driver or the file system above the
> ZBC OS management layer would have no way of mapping a sector number
> to a specific zone.)
>
> So I suspect as start implementing device mapper SMR simulators and
> actual SAS/SATA block device drivers which will interface with the ZBC
> prototype drives, there may be other functions we will need to
> implement in order to make life easier both for these systems.
>
I am interested in project Core-04, the SMR simulator. I read a
project report related to it, from research conducted at UCSC:
http://www.ssrc.ucsc.edu/Papers/ssrctr-12-05.pdf
Also, I would like to know your inputs for Core-04.

> Cheers,
>
>                                         - Ted

- Regards,
     Rohan
