[PATCH v2 0/6] Page I/O

From: Matthew Wilcox
Date: 2014-03-23 19:08 UTC
To: linux-fsdevel, linux-mm, linux-kernel
Cc: Matthew Wilcox, willy

Page I/O allows us to read and write pages to storage without allocating
any memory; in particular, it avoids allocating a BIO.  This is useful
for swap and reduces per-I/O overhead on fast storage devices.  The
downside is that it removes all batching from the I/O path, potentially
sending dozens of commands for a large I/O instead of just one.

This iteration of the Page I/O patchset has been tested with xfstests
on ext4 on brd, with no unexpected failures.

Changes since v1:
 - Rebased to 3.14-rc7
 - Separated out the clean_buffers() refactoring into its own patch
 - Changed the page_endio() interface to take an error code rather than
   a boolean 'success'.  All of its callers prefer this (and my earlier
   patchset got this wrong in one caller).
 - Added kerneldoc to bdev_read_page() and bdev_write_page()
 - bdev_write_page() now does less on failure.  Since its two customers
   (swap and mpage) want to do different things to the page flags on
   failure, let them.
 - Dropped the virtio_blk patch, since I don't think it should be included

Keith Busch (1):
  NVMe: Add support for rw_page

Matthew Wilcox (5):
  Factor clean_buffers() out of __mpage_writepage()
  Factor page_endio() out of mpage_end_io()
  Add bdev_read_page() and bdev_write_page()
  swap: Use bdev_read_page() / bdev_write_page()
  brd: Add support for rw_page

 drivers/block/brd.c       |  10 ++++
 drivers/block/nvme-core.c | 129 +++++++++++++++++++++++++++++++++++++---------
 fs/block_dev.c            |  63 ++++++++++++++++++++++
 fs/mpage.c                |  84 +++++++++++++++---------------
 include/linux/blkdev.h    |   4 ++
 include/linux/pagemap.h   |   2 +
 mm/filemap.c              |  25 +++++++++
 mm/page_io.c              |  23 ++++++++-
 8 files changed, 273 insertions(+), 67 deletions(-)

--
1.9.0
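For orientation, the shape of the interface this series introduces,
condensed from the patches below: a driver fills in the new rw_page hook
in its block_device_operations, and filesystems and swap go through the
bdev wrappers, falling back to a BIO whenever the hook is absent or
returns a soft error.  The submit_bio_fallback() name is illustrative
only, standing in for the existing BIO submission path:

	/* New hook (patch 3/6): */
	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);

	/* Caller pattern (patches 3/6 and 4/6): try the page-based path,
	 * fall back to a BIO on any soft error such as -EOPNOTSUPP. */
	if (bdev_read_page(bdev, sector, page))
		submit_bio_fallback(bdev, sector, page);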
[PATCH v2 1/6] Factor clean_buffers() out of __mpage_writepage()

From: Matthew Wilcox
Date: 2014-03-23 19:08 UTC
To: linux-fsdevel, linux-mm, linux-kernel
Cc: Matthew Wilcox, willy

__mpage_writepage() is over 200 lines long, has 20 local variables, four
goto labels and could desperately use simplification.  Splitting
clean_buffers() into a helper function improves matters a little,
removing 20+ lines from it.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/mpage.c | 54 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index 4979ffa..4cc9c5d 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -439,6 +439,35 @@ struct mpage_data {
 	unsigned use_writepage;
 };
 
+/*
+ * We have our BIO, so we can now mark the buffers clean.  Make
+ * sure to only clean buffers which we know we'll be writing.
+ */
+static void clean_buffers(struct page *page, unsigned first_unmapped)
+{
+	unsigned buffer_counter = 0;
+	struct buffer_head *bh, *head;
+	if (!page_has_buffers(page))
+		return;
+	head = page_buffers(page);
+	bh = head;
+
+	do {
+		if (buffer_counter++ == first_unmapped)
+			break;
+		clear_buffer_dirty(bh);
+		bh = bh->b_this_page;
+	} while (bh != head);
+
+	/*
+	 * we cannot drop the bh if the page is not uptodate or a concurrent
+	 * readpage would fail to serialize with the bh and it would read from
+	 * disk before we reach the platter.
+	 */
+	if (buffer_heads_over_limit && PageUptodate(page))
+		try_to_free_buffers(page);
+}
+
 static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 		      void *data)
 {
@@ -591,30 +620,7 @@ alloc_new:
 		goto alloc_new;
 	}
 
-	/*
-	 * OK, we have our BIO, so we can now mark the buffers clean.  Make
-	 * sure to only clean buffers which we know we'll be writing.
-	 */
-	if (page_has_buffers(page)) {
-		struct buffer_head *head = page_buffers(page);
-		struct buffer_head *bh = head;
-		unsigned buffer_counter = 0;
-
-		do {
-			if (buffer_counter++ == first_unmapped)
-				break;
-			clear_buffer_dirty(bh);
-			bh = bh->b_this_page;
-		} while (bh != head);
-
-		/*
-		 * we cannot drop the bh if the page is not uptodate
-		 * or a concurrent readpage would fail to serialize with the bh
-		 * and it would read from disk before we reach the platter.
-		 */
-		if (buffer_heads_over_limit && PageUptodate(page))
-			try_to_free_buffers(page);
-	}
+	clean_buffers(page, first_unmapped);
 
 	BUG_ON(PageWriteback(page));
 	set_page_writeback(page);
--
1.9.0
[PATCH v2 2/6] Factor page_endio() out of mpage_end_io()

From: Matthew Wilcox
Date: 2014-03-23 19:08 UTC
To: linux-fsdevel, linux-mm, linux-kernel
Cc: Matthew Wilcox, willy

page_endio() takes care of updating all the appropriate page flags once
I/O has finished to a page.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 fs/mpage.c              | 18 +-----------------
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 25 +++++++++++++++++++++++++
 3 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index 4cc9c5d..10da0da 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -48,23 +48,7 @@ static void mpage_end_io(struct bio *bio, int err)
 
 	bio_for_each_segment_all(bv, bio, i) {
 		struct page *page = bv->bv_page;
-
-		if (bio_data_dir(bio) == READ) {
-			if (!err) {
-				SetPageUptodate(page);
-			} else {
-				ClearPageUptodate(page);
-				SetPageError(page);
-			}
-			unlock_page(page);
-		} else { /* bio_data_dir(bio) == WRITE */
-			if (err) {
-				SetPageError(page);
-				if (page->mapping)
-					set_bit(AS_EIO, &page->mapping->flags);
-			}
-			end_page_writeback(page);
-		}
+		page_endio(page, bio_data_dir(bio), err);
 	}
 
 	bio_put(bio);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1710d1b..396fddf 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -416,6 +416,8 @@ static inline void wait_on_page_writeback(struct page *page)
 extern void end_page_writeback(struct page *page);
 void wait_for_stable_page(struct page *page);
 
+void page_endio(struct page *page, int rw, int err);
+
 /*
  * Add an arbitrary waiter to a page's wait queue
  */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6a..1b8c028 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -631,6 +631,31 @@ void end_page_writeback(struct page *page)
 }
 EXPORT_SYMBOL(end_page_writeback);
 
+/*
+ * After completing I/O on a page, call this routine to update the page
+ * flags appropriately
+ */
+void page_endio(struct page *page, int rw, int err)
+{
+	if (rw == READ) {
+		if (!err) {
+			SetPageUptodate(page);
+		} else {
+			ClearPageUptodate(page);
+			SetPageError(page);
+		}
+		unlock_page(page);
+	} else { /* rw == WRITE */
+		if (err) {
+			SetPageError(page);
+			if (page->mapping)
+				set_bit(AS_EIO, &page->mapping->flags);
+		}
+		end_page_writeback(page);
+	}
+}
+EXPORT_SYMBOL_GPL(page_endio);
+
 /**
  * __lock_page - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
--
1.9.0
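For a driver that completes one-page I/Os from its interrupt handler, the
intended use is a single page_endio() call in the completion path.  A
minimal sketch; the mydev_* names are hypothetical:

	/* Hardware finished a one-page transfer started by ->rw_page. */
	static void mydev_complete(struct mydev_cmd *cmd, int hw_status)
	{
		int err = hw_status ? -EIO : 0;

		/* For READ: sets PageUptodate or PageError and unlocks the
		 * page.  For WRITE: sets PageError/AS_EIO on error and ends
		 * writeback. */
		page_endio(cmd->page, cmd->rw, err);
	}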
[PATCH v2 3/6] Add bdev_read_page() and bdev_write_page()

From: Matthew Wilcox
Date: 2014-03-23 19:08 UTC
To: linux-fsdevel, linux-mm, linux-kernel
Cc: Matthew Wilcox, willy

A block device driver may choose to provide a rw_page operation.  It
will be called when the filesystem is attempting to do page-sized I/O
to page cache pages (ie not for direct I/O).  This does preclude I/Os
that are larger than page size, so it may only be a performance gain
for some devices.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com>
---
 fs/block_dev.c         | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/mpage.c             | 12 ++++++++++
 include/linux/blkdev.h |  4 ++++
 3 files changed, 79 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1e86823..62eabf5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -363,6 +363,69 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 }
 EXPORT_SYMBOL(blkdev_fsync);
 
+/**
+ * bdev_read_page() - Start reading a page from a block device
+ * @bdev:	The device to read the page from
+ * @sector:	The offset on the device to read the page to (need not be aligned)
+ * @page:	The page to read
+ *
+ * On entry, the page should be locked.  It will be unlocked when the page
+ * has been read.  If the block driver implements rw_page synchronously,
+ * that will be true on exit from this function, but it need not be.
+ *
+ * Errors returned by this function are usually "soft", eg out of memory, or
+ * queue full; callers should try a different route to read this page rather
+ * than propagate an error back up the stack.
+ *
+ * Return: negative errno if an error occurs, 0 if submission was successful.
+ */
+int bdev_read_page(struct block_device *bdev, sector_t sector,
+			struct page *page)
+{
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	if (!ops->rw_page)
+		return -EOPNOTSUPP;
+	return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+}
+EXPORT_SYMBOL_GPL(bdev_read_page);
+
+/**
+ * bdev_write_page() - Start writing a page to a block device
+ * @bdev:	The device to write the page to
+ * @sector:	The offset on the device to write the page to (need not be aligned)
+ * @page:	The page to write
+ * @wbc:	The writeback_control for the write
+ *
+ * On entry, the page should be locked and not currently under writeback.
+ * On exit, if the write started successfully, the page will be unlocked and
+ * under writeback.  If the write failed already (eg the driver failed to
+ * queue the page to the device), the page will still be locked.  If the
+ * caller is a ->writepage implementation, it will need to unlock the page.
+ *
+ * Errors returned by this function are usually "soft", eg out of memory, or
+ * queue full; callers should try a different route to write this page rather
+ * than propagate an error back up the stack.
+ *
+ * Return: negative errno if an error occurs, 0 if submission was successful.
+ */
+int bdev_write_page(struct block_device *bdev, sector_t sector,
+			struct page *page, struct writeback_control *wbc)
+{
+	int result;
+	int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+	const struct block_device_operations *ops = bdev->bd_disk->fops;
+	if (!ops->rw_page)
+		return -EOPNOTSUPP;
+	set_page_writeback(page);
+	result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
+	if (result)
+		end_page_writeback(page);
+	else
+		unlock_page(page);
+	return result;
+}
+EXPORT_SYMBOL_GPL(bdev_write_page);
+
 /*
  * pseudo-fs
  */
diff --git a/fs/mpage.c b/fs/mpage.c
index 10da0da..5f9ed62 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -269,6 +269,11 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 
 alloc_new:
 	if (bio == NULL) {
+		if (first_hole == blocks_per_page) {
+			if (!bdev_read_page(bdev, blocks[0] << (blkbits - 9),
+									page))
+				goto out;
+		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				min_t(int, nr_pages, bio_get_nr_vecs(bdev)),
 				GFP_KERNEL);
@@ -587,6 +592,13 @@ page_is_mapped:
 
 alloc_new:
 	if (bio == NULL) {
+		if (first_unmapped == blocks_per_page) {
+			if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
+								page, wbc)) {
+				clean_buffers(page, first_unmapped);
+				goto out;
+			}
+		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
 				bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..f6f6965 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1558,6 +1558,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
 struct block_device_operations {
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
+	int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
 	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
 	int (*direct_access) (struct block_device *, sector_t,
@@ -1576,6 +1577,9 @@ struct block_device_operations {
 
 extern int __blkdev_driver_ioctl(struct block_device *, fmode_t,
 				 unsigned int, unsigned long);
+extern int bdev_read_page(struct block_device *, sector_t, struct page *);
+extern int bdev_write_page(struct block_device *, sector_t, struct page *,
+						struct writeback_control *);
 #else /* CONFIG_BLOCK */
 /*
  * stubs for when the block layer is configured out
--
1.9.0
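Putting the kerneldoc contract together, a ->writepage-style caller would
look like the following sketch.  The example_* names and the bdev/sector
variables are illustrative; the real callers are the mpage and swap
changes in this series:

	static int example_writepage(struct page *page,
				     struct writeback_control *wbc)
	{
		if (!bdev_write_page(bdev, sector, page, wbc))
			return 0;	/* page unlocked, under writeback */

		/* Soft error (-EOPNOTSUPP, -ENOMEM, ...): the page is still
		 * locked and not under writeback, so take the BIO route. */
		return example_bio_writepage(page, wbc);
	}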
[PATCH v2 4/6] swap: Use bdev_read_page() / bdev_write_page()

From: Matthew Wilcox
Date: 2014-03-23 19:08 UTC
To: linux-fsdevel, linux-mm, linux-kernel
Cc: Matthew Wilcox, willy

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 mm/page_io.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 7c59ef6..43d7220 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -248,11 +248,16 @@ out:
 	return ret;
 }
 
+static sector_t swap_page_sector(struct page *page)
+{
+	return (sector_t)__page_file_index(page) << (PAGE_CACHE_SHIFT - 9);
+}
+
 int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	void (*end_write_func)(struct bio *, int))
 {
 	struct bio *bio;
-	int ret = 0, rw = WRITE;
+	int ret, rw = WRITE;
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	if (sis->flags & SWP_FILE) {
@@ -297,6 +302,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		return ret;
 	}
 
+	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
+	if (!ret) {
+		count_vm_event(PSWPOUT);
+		return 0;
+	}
+
+	ret = 0;
 	bio = get_swap_bio(GFP_NOIO, page, end_write_func);
 	if (bio == NULL) {
 		set_page_dirty(page);
@@ -317,7 +329,7 @@ out:
 int swap_readpage(struct page *page)
 {
 	struct bio *bio;
-	int ret = 0;
+	int ret;
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -338,6 +350,13 @@ int swap_readpage(struct page *page)
 		return ret;
 	}
 
+	ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
+	if (!ret) {
+		count_vm_event(PSWPIN);
+		return 0;
+	}
+
+	ret = 0;
 	bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
 	if (bio == NULL) {
 		unlock_page(page);
--
1.9.0
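The arithmetic in swap_page_sector() just converts a swap page index into
512-byte sectors.  Assuming 4 KiB pages (PAGE_CACHE_SHIFT == 12):

	/* Page n of the swap device starts at sector n << (12 - 9),
	 * ie 8 sectors per page.  bdev_read_page()/bdev_write_page()
	 * then add get_start_sect(bdev), so partition offsets are
	 * handled in the block layer rather than here. */
	sector_t sector = (sector_t)n << (PAGE_CACHE_SHIFT - 9); /* == 8 * n */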
[PATCH v2 5/6] NVMe: Add support for rw_page

From: Matthew Wilcox
Date: 2014-03-23 19:08 UTC
To: linux-fsdevel, linux-mm, linux-kernel
Cc: Keith Busch, willy, Matthew Wilcox

From: Keith Busch <keith.busch@intel.com>

This demonstrates the full potential of rw_page in a real device driver.
By adding a dma_addr_t to the preallocated per-command data structure, we
can avoid doing any memory allocation in the rw_page path.  For example,
that lets us swap without allocating any memory.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 drivers/block/nvme-core.c | 129 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 105 insertions(+), 24 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 51824d1..10ccd80 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -118,12 +118,13 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
 }
 
-typedef void (*nvme_completion_fn)(struct nvme_dev *, void *,
+typedef void (*nvme_completion_fn)(struct nvme_dev *, void *, dma_addr_t,
 						struct nvme_completion *);
 
 struct nvme_cmd_info {
 	nvme_completion_fn fn;
 	void *ctx;
+	dma_addr_t dma;
 	unsigned long timeout;
 	int aborted;
 };
@@ -153,7 +154,7 @@ static unsigned nvme_queue_extra(int depth)
  * May be called with local interrupts disabled and the q_lock held,
  * or with interrupts enabled and no locks held.
  */
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
+static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx, dma_addr_t dma,
				nvme_completion_fn handler, unsigned timeout)
 {
 	int depth = nvmeq->q_depth - 1;
@@ -168,17 +169,18 @@ static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
 
 	info[cmdid].fn = handler;
 	info[cmdid].ctx = ctx;
+	info[cmdid].dma = dma;
 	info[cmdid].timeout = jiffies + timeout;
 	info[cmdid].aborted = 0;
 	return cmdid;
 }
 
-static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
+static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx, dma_addr_t dma,
				nvme_completion_fn handler, unsigned timeout)
 {
 	int cmdid;
 	wait_event_killable(nvmeq->sq_full,
-		(cmdid = alloc_cmdid(nvmeq, ctx, handler, timeout)) >= 0);
+		(cmdid = alloc_cmdid(nvmeq, ctx, dma, handler, timeout)) >= 0);
 	return (cmdid < 0) ? -EINTR : cmdid;
 }
 
@@ -190,7 +192,7 @@ static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
 #define CMD_CTX_FLUSH		(0x318 + CMD_CTX_BASE)
 #define CMD_CTX_ABORT		(0x31C + CMD_CTX_BASE)
 
-static void special_completion(struct nvme_dev *dev, void *ctx,
+static void special_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
						struct nvme_completion *cqe)
 {
 	if (ctx == CMD_CTX_CANCELLED)
@@ -217,7 +219,7 @@ static void special_completion(struct nvme_dev *dev, void *ctx,
 	dev_warn(&dev->pci_dev->dev, "Unknown special completion %p\n", ctx);
 }
 
-static void async_completion(struct nvme_dev *dev, void *ctx,
+static void async_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
						struct nvme_completion *cqe)
 {
 	struct async_cmd_info *cmdinfo = ctx;
@@ -229,7 +231,7 @@ static void async_completion(struct nvme_dev *dev, void *ctx,
 /*
  * Called with local interrupts disabled and the q_lock held.  May not sleep.
  */
-static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid, dma_addr_t *dmap,
						nvme_completion_fn *fn)
 {
 	void *ctx;
@@ -241,6 +243,8 @@ static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
 	}
 	if (fn)
 		*fn = info[cmdid].fn;
+	if (dmap)
+		*dmap = info[cmdid].dma;
 	ctx = info[cmdid].ctx;
 	info[cmdid].fn = special_completion;
 	info[cmdid].ctx = CMD_CTX_COMPLETED;
@@ -249,13 +253,15 @@ static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
 	return ctx;
 }
 
-static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid, dma_addr_t *dmap,
						nvme_completion_fn *fn)
 {
 	void *ctx;
 	struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
 	if (fn)
 		*fn = info[cmdid].fn;
+	if (dmap)
+		*dmap = info[cmdid].dma;
 	ctx = info[cmdid].ctx;
 	info[cmdid].fn = special_completion;
 	info[cmdid].ctx = CMD_CTX_CANCELLED;
@@ -371,7 +377,7 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
 	part_stat_unlock();
 }
 
-static void bio_completion(struct nvme_dev *dev, void *ctx,
+static void bio_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
						struct nvme_completion *cqe)
 {
 	struct nvme_iod *iod = ctx;
@@ -593,7 +599,7 @@ static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 
 int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns)
 {
-	int cmdid = alloc_cmdid(nvmeq, (void *)CMD_CTX_FLUSH,
+	int cmdid = alloc_cmdid(nvmeq, (void *)CMD_CTX_FLUSH, 0,
					special_completion, NVME_IO_TIMEOUT);
 	if (unlikely(cmdid < 0))
 		return cmdid;
@@ -628,7 +634,7 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	iod->private = bio;
 
 	result = -EBUSY;
-	cmdid = alloc_cmdid(nvmeq, iod, bio_completion, NVME_IO_TIMEOUT);
+	cmdid = alloc_cmdid(nvmeq, iod, 0, bio_completion, NVME_IO_TIMEOUT);
 	if (unlikely(cmdid < 0))
 		goto free_iod;
 
@@ -684,7 +690,7 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
 	return 0;
 
 free_cmdid:
-	free_cmdid(nvmeq, cmdid, NULL);
+	free_cmdid(nvmeq, cmdid, NULL, NULL);
 free_iod:
 	nvme_free_iod(nvmeq->dev, iod);
 nomem:
@@ -700,6 +706,7 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 
 	for (;;) {
 		void *ctx;
+		dma_addr_t dma;
 		nvme_completion_fn fn;
 		struct nvme_completion cqe = nvmeq->cqes[head];
 		if ((le16_to_cpu(cqe.status) & 1) != phase)
@@ -710,8 +717,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
 			phase = !phase;
 		}
 
-		ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
-		fn(nvmeq->dev, ctx, &cqe);
+		ctx = free_cmdid(nvmeq, cqe.command_id, &dma, &fn);
+		fn(nvmeq->dev, ctx, dma, &cqe);
 	}
 
 	/* If the controller ignores the cq head doorbell and continuously
@@ -781,7 +788,7 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 static void nvme_abort_command(struct nvme_queue *nvmeq, int cmdid)
 {
 	spin_lock_irq(&nvmeq->q_lock);
-	cancel_cmdid(nvmeq, cmdid, NULL);
+	cancel_cmdid(nvmeq, cmdid, NULL, NULL);
 	spin_unlock_irq(&nvmeq->q_lock);
 }
 
@@ -791,7 +798,7 @@ struct sync_cmd_info {
 	int status;
 };
 
-static void sync_completion(struct nvme_dev *dev, void *ctx,
+static void sync_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
						struct nvme_completion *cqe)
 {
 	struct sync_cmd_info *cmdinfo = ctx;
@@ -813,7 +820,7 @@ int nvme_submit_sync_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
 	cmdinfo.task = current;
 	cmdinfo.status = -EINTR;
 
-	cmdid = alloc_cmdid_killable(nvmeq, &cmdinfo, sync_completion,
+	cmdid = alloc_cmdid_killable(nvmeq, &cmdinfo, 0, sync_completion,
								timeout);
 	if (cmdid < 0)
 		return cmdid;
@@ -838,9 +845,8 @@ static int nvme_submit_async_cmd(struct nvme_queue *nvmeq,
			struct nvme_command *cmd,
			struct async_cmd_info *cmdinfo, unsigned timeout)
 {
-	int cmdid;
-
-	cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, async_completion, timeout);
+	int cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, 0, async_completion,
+								timeout);
 	if (cmdid < 0)
 		return cmdid;
 	cmdinfo->status = -EINTR;
@@ -1001,8 +1007,8 @@ static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
 	if (!dev->abort_limit)
 		return;
 
-	a_cmdid = alloc_cmdid(dev->queues[0], CMD_CTX_ABORT, special_completion,
-							ADMIN_TIMEOUT);
+	a_cmdid = alloc_cmdid(dev->queues[0], CMD_CTX_ABORT, 0,
				special_completion, ADMIN_TIMEOUT);
 	if (a_cmdid < 0)
 		return;
@@ -1035,6 +1041,7 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
 
 	for_each_set_bit(cmdid, nvmeq->cmdid_data, depth) {
 		void *ctx;
+		dma_addr_t dma;
 		nvme_completion_fn fn;
 		static struct nvme_completion cqe = {
 			.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
@@ -1050,8 +1057,8 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
 		}
 		dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid,
								nvmeq->qid);
-		ctx = cancel_cmdid(nvmeq, cmdid, &fn);
-		fn(nvmeq->dev, ctx, &cqe);
+		ctx = cancel_cmdid(nvmeq, cmdid, &dma, &fn);
+		fn(nvmeq->dev, ctx, dma, &cqe);
 	}
 }
 
@@ -1539,6 +1546,79 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 	return status;
 }
 
+static void pgrd_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
+						struct nvme_completion *cqe)
+{
+	struct page *page = ctx;
+	u16 status = le16_to_cpup(&cqe->status) >> 1;
+
+	dma_unmap_page(&dev->pci_dev->dev, dma,
+					PAGE_CACHE_SIZE, DMA_FROM_DEVICE);
+	page_endio(page, READ, status != NVME_SC_SUCCESS);
+}
+
+static void pgwr_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
+						struct nvme_completion *cqe)
+{
+	struct page *page = ctx;
+	u16 status = le16_to_cpup(&cqe->status) >> 1;
+
+	dma_unmap_page(&dev->pci_dev->dev, dma, PAGE_CACHE_SIZE, DMA_TO_DEVICE);
+	page_endio(page, WRITE, status != NVME_SC_SUCCESS);
+}
+
+static const enum dma_data_direction nvme_to_direction[] = {
+	DMA_NONE, DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_BIDIRECTIONAL
+};
+
+static int nvme_rw_page(struct block_device *bdev, sector_t sector,
+						struct page *page, int rw)
+{
+	struct nvme_ns *ns = bdev->bd_disk->private_data;
+	u8 op = (rw & WRITE) ? nvme_cmd_write : nvme_cmd_read;
+	nvme_completion_fn fn = (rw & WRITE) ? pgwr_completion :
+						pgrd_completion;
+	dma_addr_t dma;
+	int cmdid;
+	struct nvme_command *cmd;
+	enum dma_data_direction dma_dir = nvme_to_direction[op & 3];
+	struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
+	dma = dma_map_page(nvmeq->q_dmadev, page, 0, PAGE_CACHE_SIZE, dma_dir);
+
+	if (rw == WRITE)
+		cmdid = alloc_cmdid(nvmeq, page, dma, fn, NVME_IO_TIMEOUT);
+	else
+		cmdid = alloc_cmdid_killable(nvmeq, page, dma, fn,
							NVME_IO_TIMEOUT);
+	if (unlikely(cmdid < 0)) {
+		dma_unmap_page(nvmeq->q_dmadev, dma, PAGE_CACHE_SIZE,
+							DMA_FROM_DEVICE);
+		put_nvmeq(nvmeq);
+		return -EBUSY;
+	}
+
+	spin_lock_irq(&nvmeq->q_lock);
+	cmd = &nvmeq->sq_cmds[nvmeq->sq_tail];
+	memset(cmd, 0, sizeof(*cmd));
+
+	cmd->rw.opcode = op;
+	cmd->rw.command_id = cmdid;
+	cmd->rw.nsid = cpu_to_le32(ns->ns_id);
+	cmd->rw.slba = cpu_to_le64(nvme_block_nr(ns, sector));
+	cmd->rw.length = cpu_to_le16((PAGE_CACHE_SIZE >> ns->lba_shift) - 1);
+	cmd->rw.prp1 = cpu_to_le64(dma);
+
+	if (++nvmeq->sq_tail == nvmeq->q_depth)
+		nvmeq->sq_tail = 0;
+	writel(nvmeq->sq_tail, nvmeq->q_db);
+
+	nvme_process_cq(nvmeq);
+	spin_unlock_irq(&nvmeq->q_lock);
+	put_nvmeq(nvmeq);
+
+	return 0;
+}
+
 static int nvme_user_admin_cmd(struct nvme_dev *dev,
					struct nvme_admin_cmd __user *ucmd)
 {
@@ -1655,6 +1735,7 @@ static void nvme_release(struct gendisk *disk, fmode_t mode)
 
 static const struct block_device_operations nvme_fops = {
 	.owner		= THIS_MODULE,
+	.rw_page	= nvme_rw_page,
 	.ioctl		= nvme_ioctl,
 	.compat_ioctl	= nvme_compat_ioctl,
 	.open		= nvme_open,
--
1.9.0
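How the one-page command built in nvme_rw_page() is sized, assuming a
4 KiB page and a 512-byte LBA format (ns->lba_shift == 9); note that the
NVMe length field is a zero-based count:

	/*
	 * slba   = nvme_block_nr(ns, sector)    starting LBA of the page
	 * length = (4096 >> 9) - 1 = 7          ie 8 LBAs, zero-based count
	 * prp1   = dma                          one page-aligned page, so a
	 *                                       single PRP entry suffices
	 */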
[PATCH v2 6/6] brd: Add support for rw_page

From: Matthew Wilcox
Date: 2014-03-23 19:08 UTC
To: linux-fsdevel, linux-mm, linux-kernel
Cc: Matthew Wilcox, willy

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
 drivers/block/brd.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index e73b85c..807d3d5 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -360,6 +360,15 @@ out:
 	bio_endio(bio, err);
 }
 
+static int brd_rw_page(struct block_device *bdev, sector_t sector,
+		       struct page *page, int rw)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	int err = brd_do_bvec(brd, page, PAGE_CACHE_SIZE, 0, rw, sector);
+	page_endio(page, rw & WRITE, err);
+	return err;
+}
+
 #ifdef CONFIG_BLK_DEV_XIP
 static int brd_direct_access(struct block_device *bdev, sector_t sector,
 			void **kaddr, unsigned long *pfn)
@@ -419,6 +428,7 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode,
 
 static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
+	.rw_page =		brd_rw_page,
 	.ioctl =		brd_ioctl,
#ifdef CONFIG_BLK_DEV_XIP
 	.direct_access =	brd_direct_access,
--
1.9.0