From: Theodore Ts'o
Subject: [RFC] Draft Linux kernel interfaces for ZBC drives
Date: Fri, 31 Jan 2014 00:38:22 -0500
To: linux-fsdevel@vger.kernel.org

I've been reading the draft ZBC specifications, especially 14-010r1[1],
and I've created the following draft kernel interfaces, which I present
as a strawman proposal for comments.

[1] http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-010r1.pdf

As noted in the comments below, supporting variable-length SMR zones
does result in more complexity at the file system / userspace interface
layer.  Life would certainly get simpler if these zones were fixed
length.

						- Ted

/*
 * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
 * will have 32,768 zones.  That means if we tried to use a contiguous
 * array we would need to allocate 768k of contiguous, non-swappable
 * kernel memory.  (Boo, hiss.)
 *
 * This is large enough that it would be painful to hang an array off
 * the block_device structure.  So we will define a function
 * blkdev_query_zones() to selectively return information for some
 * number of zones.
 */
struct zone_status {
	sector_t	z_start;
	__u32		z_length;
	__u32		z_write_ptr_offset;	/* offset */
	__u32		z_checkpoint_offset;	/* offset */
	__u32		z_flags;	/* full, ro, offline, reset_requested */
};

#define Z_FLAGS_FULL		0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAG_RESET_REQUESTED	0x0008

#define Z_FLAG_TYPE_MASK	 0x0F00
#define Z_FLAG_TYPE_CONVENTIONAL 0x0000
#define Z_FLAG_TYPE_SEQUENTIAL	 0x0100

/*
 * Query the block_device bdev for information about the zones
 * starting at start_sector that match the criteria specified by
 * free_sectors_criteria.  Zone status information for at most
 * max_zones will be placed into the memory array ret_zones.  The
 * return value contains the number of zones actually returned.
 *
 * If free_sectors_criteria is positive, then return zones that have
 * at least that many sectors available to be written.  If it is zero,
 * then match all zones.  If free_sectors_criteria is negative, then
 * return the zones that match the following criteria:
 *
 *   -1   Return all read-only zones
 *   -2   Return all offline zones
 *   -3   Return all zones where the write ptr != the checkpoint ptr
 */
extern int blkdev_query_zones(struct block_device *bdev,
			      sector_t start_sector,
			      int free_sectors_criteria,
			      struct zone_status *ret_zones,
			      int max_zones);

/*
 * Reset the write pointer for a sequential write zone.
 *
 * Returns -EINVAL if the start_sector is not the beginning of a
 * sequential write zone.
 */
extern int blkdev_reset_zone_ptr(struct block_device *bdev,
				 sector_t start_sector);

/*
 * ----------------------------
 */

/*
 * The zone_status structure could be a lot smaller if zones were a
 * constant, fixed size: we could then address zones using a 16-bit
 * integer instead of a 64-bit starting LBA number, and this structure
 * could be half the size (12 bytes).
 *
 * We can also further shrink the structure by removing the
 * z_checkpoint_offset element, since most of the time
 * z_write_ptr_offset and z_checkpoint_offset will be the same.
The
 * only time they will be different is after a write is interrupted
 * via an unexpected power removal.
 *
 * With the smaller structure, we could fit all of the zones in an 8TB
 * SMR drive in 256k, which maybe we could afford to vmalloc().
 */
struct simplified_zone_status {
	__u32	z_write_ptr_offset;	/* offset */
	__u32	z_flags;
};

/* add a new flag */
#define Z_FLAG_POWER_FAIL_WRITE	0x0010	/* write_ptr != checkpoint ptr */