From mboxrd@z Thu Jan 1 00:00:00 1970
From: Theodore Ts'o
Subject: Re: [RFC] Draft Linux kernel interfaces for SMR/ZBC drives
Date: Tue, 11 Feb 2014 13:43:43 -0500
Message-ID: <20140211184343.GA11971@thunk.org>
References:
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: linux-fsdevel@vger.kernel.org
Return-path:
Received: from imap.thunk.org ([74.207.234.97]:56614 "EHLO imap.thunk.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751667AbaBKSnr (ORCPT ); Tue, 11 Feb 2014 13:43:47 -0500
Content-Disposition: inline
In-Reply-To:
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Based on the comments raised on the list, here is a revised version of
the proposed ZBC kernel interface.

Changes from the last version:

1) Aligned the ZBC_FLAG values with the ZBC specification to simplify
   implementations
2) Aligned the free_sectors_criteria values (mostly) with the ZBC
   specification
3) Clarified the behaviour of blkdev_query_zones()
4) Added an ioctl interface to expose this functionality to userspace
5) Removed the proposed simplified data variant

Please let me know what you think!

						- Ted

/*
 * Note: this structure is 24 bytes.  Using 256 MB zones, an 8TB drive
 * will have 32,768 zones.  That means if we tried to use a contiguous
 * array we would need to allocate 768k of contiguous, non-swappable
 * kernel memory.  (Boo, hiss.)
 *
 * This is large enough that it would be painful to hang an array off
 * the block_device structure.  So we will define a function
 * blkdev_query_zones() to selectively return information for some
 * number of zones.
 *
 * It is anticipated that the block device driver will store this
 * information in a compressed form, and that z_checkpoint_offset will
 * not be dynamically tracked.  That is, the checkpoint offset, if
 * non-zero, indicates that the drive suffered a power fail event, and
 * the file system or userspace process may need to implement recovery
 * procedures.  Once the file system or userspace process writes to an
 * SMR band, the checkpoint offset will be cleared and future queries
 * for the SMR band will return the checkpoint offset == write_ptr.
 */
struct zone_status {
	sector_t	z_start;
	__u32		z_length;
	__u32		z_write_ptr_offset;  /* offset */
	__u32		z_checkpoint_offset; /* offset */
	__u32		z_flags;	     /* full, ro, offline, reset_requested */
};

#define Z_FLAG_RESET_REQUESTED	0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAGS_FULL		0x0008

#define Z_FLAG_TYPE_MASK	0x0F00
#define Z_FLAG_TYPE_CONVENTIONAL 0x0100
#define Z_FLAG_TYPE_SEQUENTIAL	0x0200

/*
 * Query the block_device bdev for information about the zones
 * starting at start_sector that match the criteria specified by
 * free_sectors_criteria.  Zone status information for at most
 * max_zones will be placed into the memory array ret_zones (which is
 * allocated by the caller, not by the blkdev_query_zones function),
 * in ascending LBA order.  The return value will be a kernel error
 * code if negative, or the number of zones actually returned if
 * non-negative.
 *
 * If free_sectors_criteria is positive, then return zones that have
 * at least that many sectors available to be written.  If it is zero,
 * then match all zones.  If free_sectors_criteria is negative, then
 * return the zones that match the following criteria:
 *
 *      -1     Match all full zones
 *      -2     Match all open zones
 *             (the zone has at least one written sector and is not full)
 *      -3     Match all free zones
 *             (the zone has no written sectors)
 *      -4     Match all read-only zones
 *      -5     Match all offline zones
 *      -6     Match all zones where the write ptr != the checkpoint ptr
 *
 * The negative values are taken from Table 4 of 14-010r1, with the
 * exception of -6, which is not in the draft spec, but IMHO should
 * be :-)  It is anticipated, though, that the kernel will keep this
 * info in memory and so will handle matching zones which meet
 * these criteria itself, without needing to issue a ZBC command for
 * each call to blkdev_query_zones().
 */
extern int blkdev_query_zones(struct block_device *bdev,
			      sector_t start_sector,
			      int free_sectors_criteria,
			      int max_zones,
			      struct zone_status *ret_zones);

/*
 * Reset the write pointer for a sequential write zone.
 *
 * Returns -EINVAL if the start_sector is not the beginning of a
 * sequential write zone.
 */
extern int blkdev_reset_zone_ptr(struct block_device *bdev,
				 sector_t start_sector);

/* ioctl interface */

ZBCQUERY
	u64 starting_lba	/* IN */
	u32 criteria		/* IN */
	u32 *num_zones		/* IN/OUT */
	struct zone_status *ptr	/* OUT */

ZBCRESETZONE
	u64 starting_lba
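
To make the free_sectors_criteria matching rules described above
concrete, here is a minimal userspace sketch.  It is not the proposed
kernel implementation.  Everything beyond the struct and flag
definitions is an assumption made for illustration: the names
zone_matches() and sketch_query_zones() are hypothetical, sector_t is
taken to be 64 bits, the zones live in a plain in-memory array rather
than a compressed driver representation, "full" is decided purely by
Z_FLAGS_FULL, an unknown negative criterion returns -EINVAL, and
"starting at start_sector" is read as "the first zone that ends after
start_sector".

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t sector_t;	/* assumption: 64-bit sector numbers */

struct zone_status {
	sector_t z_start;
	uint32_t z_length;
	uint32_t z_write_ptr_offset;
	uint32_t z_checkpoint_offset;
	uint32_t z_flags;
};

/* 32,768 zones * 24 bytes = 768 KiB, the figure quoted in the proposal. */
static_assert(sizeof(struct zone_status) == 24, "zone_status should be 24 bytes");

#define Z_FLAG_RESET_REQUESTED	0x0001
#define Z_FLAGS_OFFLINE		0x0002
#define Z_FLAGS_RO		0x0004
#define Z_FLAGS_FULL		0x0008

/* Hypothetical helper: 1 if z matches, 0 if not, -EINVAL if unknown. */
static int zone_matches(const struct zone_status *z, int criteria)
{
	uint32_t free_sectors = z->z_length - z->z_write_ptr_offset;
	int full = (z->z_flags & Z_FLAGS_FULL) != 0;

	if (criteria == 0)
		return 1;			/* match all zones */
	if (criteria > 0)			/* at least N writable sectors */
		return !full && free_sectors >= (uint32_t)criteria;

	switch (criteria) {
	case -1: return full;
	case -2: return z->z_write_ptr_offset > 0 && !full;	/* open */
	case -3: return z->z_write_ptr_offset == 0;		/* free */
	case -4: return (z->z_flags & Z_FLAGS_RO) != 0;
	case -5: return (z->z_flags & Z_FLAGS_OFFLINE) != 0;
	case -6: return z->z_write_ptr_offset != z->z_checkpoint_offset;
	default: return -EINVAL;
	}
}

/*
 * Hypothetical stand-in for blkdev_query_zones(): scans an array of
 * zones sorted by ascending LBA instead of a struct block_device.
 */
static int sketch_query_zones(const struct zone_status *zones, int nr_zones,
			      sector_t start_sector, int criteria,
			      int max_zones, struct zone_status *ret_zones)
{
	int found = 0;

	for (int i = 0; i < nr_zones && found < max_zones; i++) {
		const struct zone_status *z = &zones[i];
		int match;

		/* Skip zones that end at or before start_sector. */
		if (z->z_start + z->z_length <= start_sector)
			continue;

		match = zone_matches(z, criteria);
		if (match < 0)
			return match;
		if (match)
			ret_zones[found++] = *z;
	}
	return found;
}
```

A caller would allocate ret_zones itself, as in the proposal, and could
resume a scan by passing the end of the last returned zone as the next
start_sector.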