* [PATCH v2 0/6] Page I/O
From: Matthew Wilcox @ 2014-03-23 19:08 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy
Page I/O allows us to read/write pages to storage without allocating any
memory (in particular, it avoids allocating a BIO). This is particularly
useful for swap, where allocating memory while trying to write a page out
under memory pressure is best avoided, and it reduces overhead for fast
storage devices. The downside is that it removes all batching from the
I/O path, potentially sending dozens of commands for a large I/O instead
of just one.
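In outline, callers try the new path first and fall back to building a
BIO. A minimal sketch (not code from the patches below;
read_page_via_bio() is a hypothetical stand-in for the caller's existing
BIO path):

	int err = bdev_read_page(bdev, sector, page);
	if (err) {
		/*
		 * -EOPNOTSUPP (no ->rw_page) or a "soft" error such as
		 * -ENOMEM: fall back to allocating and submitting a BIO.
		 */
		err = read_page_via_bio(bdev, sector, page);
	}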
This iteration of the Page I/O patchset has been tested with xfstests
on ext4 on brd, and there are no unexpected failures.
Changes since v1:
- Rebased to 3.14-rc7
- Separate out the clean_buffers() refactoring into its own patch
- Change the page_endio() interface to take an error code rather than
  a boolean 'success'. All of its callers prefer this (and my earlier
  patchset got this wrong in one caller); see the sketch after this list.
- Added kerneldoc to bdev_read_page() and bdev_write_page()
- bdev_write_page() now does less on failure. Since its two customers
(swap and mpage) want to do different things to the page flags on
failure, let them.
- Drop the virtio_blk patch, since I don't think it should be included
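To illustrate the page_endio() change from the second item above (the v1
form is reconstructed from that changelog entry, so treat it as a sketch):

	page_endio(page, READ, 0);	/* v2: success */
	page_endio(page, READ, -EIO);	/* v2: failure, passing the errno */
	/* v1 took a boolean instead: page_endio(page, READ, success) */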
Keith Busch (1):
NVMe: Add support for rw_page
Matthew Wilcox (5):
Factor clean_buffers() out of __mpage_writepage()
Factor page_endio() out of mpage_end_io()
Add bdev_read_page() and bdev_write_page()
swap: Use bdev_read_page() / bdev_write_page()
brd: Add support for rw_page
drivers/block/brd.c | 10 ++++
drivers/block/nvme-core.c | 129 +++++++++++++++++++++++++++++++++++++---------
fs/block_dev.c | 63 ++++++++++++++++++++++
fs/mpage.c | 84 +++++++++++++++---------------
include/linux/blkdev.h | 4 ++
include/linux/pagemap.h | 2 +
mm/filemap.c | 25 +++++++++
mm/page_io.c | 23 ++++++++-
8 files changed, 273 insertions(+), 67 deletions(-)
--
1.9.0
* [PATCH v2 1/6] Factor clean_buffers() out of __mpage_writepage()
From: Matthew Wilcox @ 2014-03-23 19:08 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy
__mpage_writepage() is over 200 lines long, has 20 local variables and
four goto labels, and could desperately use simplification. Splitting
the buffer-cleaning logic out into a clean_buffers() helper improves
matters a little, removing 20+ lines from it.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
fs/mpage.c | 54 ++++++++++++++++++++++++++++++------------------------
1 file changed, 30 insertions(+), 24 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index 4979ffa..4cc9c5d 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -439,6 +439,35 @@ struct mpage_data {
unsigned use_writepage;
};
+/*
+ * We have our BIO, so we can now mark the buffers clean. Make
+ * sure to only clean buffers which we know we'll be writing.
+ */
+static void clean_buffers(struct page *page, unsigned first_unmapped)
+{
+ unsigned buffer_counter = 0;
+ struct buffer_head *bh, *head;
+ if (!page_has_buffers(page))
+ return;
+ head = page_buffers(page);
+ bh = head;
+
+ do {
+ if (buffer_counter++ == first_unmapped)
+ break;
+ clear_buffer_dirty(bh);
+ bh = bh->b_this_page;
+ } while (bh != head);
+
+ /*
+ * we cannot drop the bh if the page is not uptodate or a concurrent
+ * readpage would fail to serialize with the bh and it would read from
+ * disk before we reach the platter.
+ */
+ if (buffer_heads_over_limit && PageUptodate(page))
+ try_to_free_buffers(page);
+}
+
static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
void *data)
{
@@ -591,30 +620,7 @@ alloc_new:
goto alloc_new;
}
- /*
- * OK, we have our BIO, so we can now mark the buffers clean. Make
- * sure to only clean buffers which we know we'll be writing.
- */
- if (page_has_buffers(page)) {
- struct buffer_head *head = page_buffers(page);
- struct buffer_head *bh = head;
- unsigned buffer_counter = 0;
-
- do {
- if (buffer_counter++ == first_unmapped)
- break;
- clear_buffer_dirty(bh);
- bh = bh->b_this_page;
- } while (bh != head);
-
- /*
- * we cannot drop the bh if the page is not uptodate
- * or a concurrent readpage would fail to serialize with the bh
- * and it would read from disk before we reach the platter.
- */
- if (buffer_heads_over_limit && PageUptodate(page))
- try_to_free_buffers(page);
- }
+ clean_buffers(page, first_unmapped);
BUG_ON(PageWriteback(page));
set_page_writeback(page);
--
1.9.0
* [PATCH v2 2/6] Factor page_endio() out of mpage_end_io()
From: Matthew Wilcox @ 2014-03-23 19:08 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy
page_endio() takes care of updating all the appropriate page flags once
I/O to a page has finished.
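With this, an end_io handler collapses to one call per page. A minimal
sketch of a converted handler (the name is illustrative; the body matches
the mpage_end_io() conversion below):

	static void my_end_io(struct bio *bio, int err)
	{
		struct bio_vec *bv;
		int i;

		bio_for_each_segment_all(bv, bio, i)
			page_endio(bv->bv_page, bio_data_dir(bio), err);
		bio_put(bio);
	}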
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
fs/mpage.c | 18 +-----------------
include/linux/pagemap.h | 2 ++
mm/filemap.c | 25 +++++++++++++++++++++++++
3 files changed, 28 insertions(+), 17 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index 4cc9c5d..10da0da 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -48,23 +48,7 @@ static void mpage_end_io(struct bio *bio, int err)
bio_for_each_segment_all(bv, bio, i) {
struct page *page = bv->bv_page;
-
- if (bio_data_dir(bio) == READ) {
- if (!err) {
- SetPageUptodate(page);
- } else {
- ClearPageUptodate(page);
- SetPageError(page);
- }
- unlock_page(page);
- } else { /* bio_data_dir(bio) == WRITE */
- if (err) {
- SetPageError(page);
- if (page->mapping)
- set_bit(AS_EIO, &page->mapping->flags);
- }
- end_page_writeback(page);
- }
+ page_endio(page, bio_data_dir(bio), err);
}
bio_put(bio);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1710d1b..396fddf 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -416,6 +416,8 @@ static inline void wait_on_page_writeback(struct page *page)
extern void end_page_writeback(struct page *page);
void wait_for_stable_page(struct page *page);
+void page_endio(struct page *page, int rw, int err);
+
/*
* Add an arbitrary waiter to a page's wait queue
*/
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6a..1b8c028 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -631,6 +631,31 @@ void end_page_writeback(struct page *page)
}
EXPORT_SYMBOL(end_page_writeback);
+/*
+ * After completing I/O on a page, call this routine to update the page
+ * flags appropriately
+ */
+void page_endio(struct page *page, int rw, int err)
+{
+ if (rw == READ) {
+ if (!err) {
+ SetPageUptodate(page);
+ } else {
+ ClearPageUptodate(page);
+ SetPageError(page);
+ }
+ unlock_page(page);
+ } else { /* rw == WRITE */
+ if (err) {
+ SetPageError(page);
+ if (page->mapping)
+ set_bit(AS_EIO, &page->mapping->flags);
+ }
+ end_page_writeback(page);
+ }
+}
+EXPORT_SYMBOL_GPL(page_endio);
+
/**
* __lock_page - get a lock on the page, assuming we need to sleep to get it
* @page: the page to lock
--
1.9.0
* [PATCH v2 3/6] Add bdev_read_page() and bdev_write_page()
From: Matthew Wilcox @ 2014-03-23 19:08 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy
A block device driver may choose to provide a rw_page operation. It
will be called when the filesystem attempts page-sized I/O to page cache
pages (i.e. not for direct I/O). This does preclude I/Os that are larger
than page size, so it may only be a performance gain for some devices.
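A sketch of the expected usage from a ->writepage implementation
(abbreviated; see the kerneldoc below for the exact locking contract):

	int err = bdev_write_page(bdev, sector, page, wbc);
	if (!err)
		return 0;	/* page is unlocked and under writeback */
	/*
	 * A "soft" failure: the page is still locked and not under
	 * writeback, so fall back to submitting a BIO (or unlock and
	 * redirty the page if no fallback is available).
	 */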
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com>
---
fs/block_dev.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/mpage.c | 12 ++++++++++
include/linux/blkdev.h | 4 ++++
3 files changed, 79 insertions(+)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1e86823..62eabf5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -363,6 +363,69 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
}
EXPORT_SYMBOL(blkdev_fsync);
+/**
+ * bdev_read_page() - Start reading a page from a block device
+ * @bdev: The device to read the page from
+ * @sector: The offset on the device to read the page to (need not be aligned)
+ * @page: The page to read
+ *
+ * On entry, the page should be locked. It will be unlocked when the page
+ * has been read. If the block driver implements rw_page synchronously,
+ * that will be true on exit from this function, but it need not be.
+ *
+ * Errors returned by this function are usually "soft", eg out of memory, or
+ * queue full; callers should try a different route to read this page rather
+ * than propagate an error back up the stack.
+ *
+ * Return: negative errno if an error occurs, 0 if submission was successful.
+ */
+int bdev_read_page(struct block_device *bdev, sector_t sector,
+ struct page *page)
+{
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ if (!ops->rw_page)
+ return -EOPNOTSUPP;
+ return ops->rw_page(bdev, sector + get_start_sect(bdev), page, READ);
+}
+EXPORT_SYMBOL_GPL(bdev_read_page);
+
+/**
+ * bdev_write_page() - Start writing a page to a block device
+ * @bdev: The device to write the page to
+ * @sector: The offset on the device to write the page to (need not be aligned)
+ * @page: The page to write
+ * @wbc: The writeback_control for the write
+ *
+ * On entry, the page should be locked and not currently under writeback.
+ * On exit, if the write started successfully, the page will be unlocked and
+ * under writeback. If the write failed already (eg the driver failed to
+ * queue the page to the device), the page will still be locked. If the
+ * caller is a ->writepage implementation, it will need to unlock the page.
+ *
+ * Errors returned by this function are usually "soft", eg out of memory, or
+ * queue full; callers should try a different route to write this page rather
+ * than propagate an error back up the stack.
+ *
+ * Return: negative errno if an error occurs, 0 if submission was successful.
+ */
+int bdev_write_page(struct block_device *bdev, sector_t sector,
+ struct page *page, struct writeback_control *wbc)
+{
+ int result;
+ int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ if (!ops->rw_page)
+ return -EOPNOTSUPP;
+ set_page_writeback(page);
+ result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, rw);
+ if (result)
+ end_page_writeback(page);
+ else
+ unlock_page(page);
+ return result;
+}
+EXPORT_SYMBOL_GPL(bdev_write_page);
+
/*
* pseudo-fs
*/
diff --git a/fs/mpage.c b/fs/mpage.c
index 10da0da..5f9ed62 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -269,6 +269,11 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
alloc_new:
if (bio == NULL) {
+ if (first_hole == blocks_per_page) {
+ if (!bdev_read_page(bdev, blocks[0] << (blkbits - 9),
+ page))
+ goto out;
+ }
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
min_t(int, nr_pages, bio_get_nr_vecs(bdev)),
GFP_KERNEL);
@@ -587,6 +592,13 @@ page_is_mapped:
alloc_new:
if (bio == NULL) {
+ if (first_unmapped == blocks_per_page) {
+ if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
+ page, wbc)) {
+ clean_buffers(page, first_unmapped);
+ goto out;
+ }
+ }
bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
if (bio == NULL)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..f6f6965 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1558,6 +1558,7 @@ static inline bool blk_integrity_is_initialized(struct gendisk *g)
struct block_device_operations {
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
+ int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*direct_access) (struct block_device *, sector_t,
@@ -1576,6 +1577,9 @@ struct block_device_operations {
extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
unsigned long);
+extern int bdev_read_page(struct block_device *, sector_t, struct page *);
+extern int bdev_write_page(struct block_device *, sector_t, struct page *,
+ struct writeback_control *);
#else /* CONFIG_BLOCK */
/*
* stubs for when the block layer is configured out
--
1.9.0
* [PATCH v2 4/6] swap: Use bdev_read_page() / bdev_write_page()
From: Matthew Wilcox @ 2014-03-23 19:08 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy
Try the device's rw_page method first, falling back to the existing BIO
path when the method is absent or the submission fails.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
mm/page_io.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index 7c59ef6..43d7220 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -248,11 +248,16 @@ out:
return ret;
}
+static sector_t swap_page_sector(struct page *page)
+{
+ return (sector_t)__page_file_index(page) << (PAGE_CACHE_SHIFT - 9);
+}
+
int __swap_writepage(struct page *page, struct writeback_control *wbc,
void (*end_write_func)(struct bio *, int))
{
struct bio *bio;
- int ret = 0, rw = WRITE;
+ int ret, rw = WRITE;
struct swap_info_struct *sis = page_swap_info(page);
if (sis->flags & SWP_FILE) {
@@ -297,6 +302,13 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
return ret;
}
+ ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
+ if (!ret) {
+ count_vm_event(PSWPOUT);
+ return 0;
+ }
+
+ ret = 0;
bio = get_swap_bio(GFP_NOIO, page, end_write_func);
if (bio == NULL) {
set_page_dirty(page);
@@ -317,7 +329,7 @@ out:
int swap_readpage(struct page *page)
{
struct bio *bio;
- int ret = 0;
+ int ret;
struct swap_info_struct *sis = page_swap_info(page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -338,6 +350,13 @@ int swap_readpage(struct page *page)
return ret;
}
+ ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
+ if (!ret) {
+ count_vm_event(PSWPIN);
+ return 0;
+ }
+
+ ret = 0;
bio = get_swap_bio(GFP_KERNEL, page, end_swap_bio_read);
if (bio == NULL) {
unlock_page(page);
--
1.9.0
* [PATCH v2 5/6] NVMe: Add support for rw_page
From: Matthew Wilcox @ 2014-03-23 19:08 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Keith Busch, willy, Matthew Wilcox
From: Keith Busch <keith.busch@intel.com>
This demonstrates the full potential of rw_page in a real device driver.
By adding a dma_addr_t to the preallocated per-command data structure, we
can avoid doing any memory allocation in the rw_page path. In particular,
that lets us swap out pages without allocating any memory.
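The submit side, in outline (simplified from the patch below, with error
handling omitted), showing that nothing is allocated per I/O:

	dma_addr_t dma = dma_map_page(nvmeq->q_dmadev, page, 0,
				      PAGE_CACHE_SIZE, dma_dir);
	/* the page and its DMA address are stashed in the preallocated
	 * nvme_cmd_info slot -- no kmalloc anywhere on this path */
	int cmdid = alloc_cmdid(nvmeq, page, dma, fn, NVME_IO_TIMEOUT);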
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
drivers/block/nvme-core.c | 129 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 105 insertions(+), 24 deletions(-)
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 51824d1..10ccd80 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -118,12 +118,13 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
}
-typedef void (*nvme_completion_fn)(struct nvme_dev *, void *,
+typedef void (*nvme_completion_fn)(struct nvme_dev *, void *, dma_addr_t,
struct nvme_completion *);
struct nvme_cmd_info {
nvme_completion_fn fn;
void *ctx;
+ dma_addr_t dma;
unsigned long timeout;
int aborted;
};
@@ -153,7 +154,7 @@ static unsigned nvme_queue_extra(int depth)
* May be called with local interrupts disabled and the q_lock held,
* or with interrupts enabled and no locks held.
*/
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
+static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx, dma_addr_t dma,
nvme_completion_fn handler, unsigned timeout)
{
int depth = nvmeq->q_depth - 1;
@@ -168,17 +169,18 @@ static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
info[cmdid].fn = handler;
info[cmdid].ctx = ctx;
+ info[cmdid].dma = dma;
info[cmdid].timeout = jiffies + timeout;
info[cmdid].aborted = 0;
return cmdid;
}
-static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
+static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx, dma_addr_t dma,
nvme_completion_fn handler, unsigned timeout)
{
int cmdid;
wait_event_killable(nvmeq->sq_full,
- (cmdid = alloc_cmdid(nvmeq, ctx, handler, timeout)) >= 0);
+ (cmdid = alloc_cmdid(nvmeq, ctx, dma, handler, timeout)) >= 0);
return (cmdid < 0) ? -EINTR : cmdid;
}
@@ -190,7 +192,7 @@ static int alloc_cmdid_killable(struct nvme_queue *nvmeq, void *ctx,
#define CMD_CTX_FLUSH (0x318 + CMD_CTX_BASE)
#define CMD_CTX_ABORT (0x31C + CMD_CTX_BASE)
-static void special_completion(struct nvme_dev *dev, void *ctx,
+static void special_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
struct nvme_completion *cqe)
{
if (ctx == CMD_CTX_CANCELLED)
@@ -217,7 +219,7 @@ static void special_completion(struct nvme_dev *dev, void *ctx,
dev_warn(&dev->pci_dev->dev, "Unknown special completion %p\n", ctx);
}
-static void async_completion(struct nvme_dev *dev, void *ctx,
+static void async_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
struct nvme_completion *cqe)
{
struct async_cmd_info *cmdinfo = ctx;
@@ -229,7 +231,7 @@ static void async_completion(struct nvme_dev *dev, void *ctx,
/*
* Called with local interrupts disabled and the q_lock held. May not sleep.
*/
-static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid, dma_addr_t *dmap,
nvme_completion_fn *fn)
{
void *ctx;
@@ -241,6 +243,8 @@ static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
}
if (fn)
*fn = info[cmdid].fn;
+ if (dmap)
+ *dmap = info[cmdid].dma;
ctx = info[cmdid].ctx;
info[cmdid].fn = special_completion;
info[cmdid].ctx = CMD_CTX_COMPLETED;
@@ -249,13 +253,15 @@ static void *free_cmdid(struct nvme_queue *nvmeq, int cmdid,
return ctx;
}
-static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid,
+static void *cancel_cmdid(struct nvme_queue *nvmeq, int cmdid, dma_addr_t *dmap,
nvme_completion_fn *fn)
{
void *ctx;
struct nvme_cmd_info *info = nvme_cmd_info(nvmeq);
if (fn)
*fn = info[cmdid].fn;
+ if (dmap)
+ *dmap = info[cmdid].dma;
ctx = info[cmdid].ctx;
info[cmdid].fn = special_completion;
info[cmdid].ctx = CMD_CTX_CANCELLED;
@@ -371,7 +377,7 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
part_stat_unlock();
}
-static void bio_completion(struct nvme_dev *dev, void *ctx,
+static void bio_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
struct nvme_completion *cqe)
{
struct nvme_iod *iod = ctx;
@@ -593,7 +599,7 @@ static int nvme_submit_flush(struct nvme_queue *nvmeq, struct nvme_ns *ns,
int nvme_submit_flush_data(struct nvme_queue *nvmeq, struct nvme_ns *ns)
{
- int cmdid = alloc_cmdid(nvmeq, (void *)CMD_CTX_FLUSH,
+ int cmdid = alloc_cmdid(nvmeq, (void *)CMD_CTX_FLUSH, 0,
special_completion, NVME_IO_TIMEOUT);
if (unlikely(cmdid < 0))
return cmdid;
@@ -628,7 +634,7 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
iod->private = bio;
result = -EBUSY;
- cmdid = alloc_cmdid(nvmeq, iod, bio_completion, NVME_IO_TIMEOUT);
+ cmdid = alloc_cmdid(nvmeq, iod, 0, bio_completion, NVME_IO_TIMEOUT);
if (unlikely(cmdid < 0))
goto free_iod;
@@ -684,7 +690,7 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
return 0;
free_cmdid:
- free_cmdid(nvmeq, cmdid, NULL);
+ free_cmdid(nvmeq, cmdid, NULL, NULL);
free_iod:
nvme_free_iod(nvmeq->dev, iod);
nomem:
@@ -700,6 +706,7 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
for (;;) {
void *ctx;
+ dma_addr_t dma;
nvme_completion_fn fn;
struct nvme_completion cqe = nvmeq->cqes[head];
if ((le16_to_cpu(cqe.status) & 1) != phase)
@@ -710,8 +717,8 @@ static int nvme_process_cq(struct nvme_queue *nvmeq)
phase = !phase;
}
- ctx = free_cmdid(nvmeq, cqe.command_id, &fn);
- fn(nvmeq->dev, ctx, &cqe);
+ ctx = free_cmdid(nvmeq, cqe.command_id, &dma, &fn);
+ fn(nvmeq->dev, ctx, dma, &cqe);
}
/* If the controller ignores the cq head doorbell and continuously
@@ -781,7 +788,7 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
static void nvme_abort_command(struct nvme_queue *nvmeq, int cmdid)
{
spin_lock_irq(&nvmeq->q_lock);
- cancel_cmdid(nvmeq, cmdid, NULL);
+ cancel_cmdid(nvmeq, cmdid, NULL, NULL);
spin_unlock_irq(&nvmeq->q_lock);
}
@@ -791,7 +798,7 @@ struct sync_cmd_info {
int status;
};
-static void sync_completion(struct nvme_dev *dev, void *ctx,
+static void sync_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
struct nvme_completion *cqe)
{
struct sync_cmd_info *cmdinfo = ctx;
@@ -813,7 +820,7 @@ int nvme_submit_sync_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
cmdinfo.task = current;
cmdinfo.status = -EINTR;
- cmdid = alloc_cmdid_killable(nvmeq, &cmdinfo, sync_completion,
+ cmdid = alloc_cmdid_killable(nvmeq, &cmdinfo, 0, sync_completion,
timeout);
if (cmdid < 0)
return cmdid;
@@ -838,9 +845,8 @@ static int nvme_submit_async_cmd(struct nvme_queue *nvmeq,
struct nvme_command *cmd,
struct async_cmd_info *cmdinfo, unsigned timeout)
{
- int cmdid;
-
- cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, async_completion, timeout);
+ int cmdid = alloc_cmdid_killable(nvmeq, cmdinfo, 0, async_completion,
+ timeout);
if (cmdid < 0)
return cmdid;
cmdinfo->status = -EINTR;
@@ -1001,8 +1007,8 @@ static void nvme_abort_cmd(int cmdid, struct nvme_queue *nvmeq)
if (!dev->abort_limit)
return;
- a_cmdid = alloc_cmdid(dev->queues[0], CMD_CTX_ABORT, special_completion,
- ADMIN_TIMEOUT);
+ a_cmdid = alloc_cmdid(dev->queues[0], CMD_CTX_ABORT, 0,
+ special_completion, ADMIN_TIMEOUT);
if (a_cmdid < 0)
return;
@@ -1035,6 +1041,7 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
for_each_set_bit(cmdid, nvmeq->cmdid_data, depth) {
void *ctx;
+ dma_addr_t dma;
nvme_completion_fn fn;
static struct nvme_completion cqe = {
.status = cpu_to_le16(NVME_SC_ABORT_REQ << 1),
@@ -1050,8 +1057,8 @@ static void nvme_cancel_ios(struct nvme_queue *nvmeq, bool timeout)
}
dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n", cmdid,
nvmeq->qid);
- ctx = cancel_cmdid(nvmeq, cmdid, &fn);
- fn(nvmeq->dev, ctx, &cqe);
+ ctx = cancel_cmdid(nvmeq, cmdid, &dma, &fn);
+ fn(nvmeq->dev, ctx, dma, &cqe);
}
}
@@ -1539,6 +1546,79 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
return status;
}
+static void pgrd_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
+ struct nvme_completion *cqe)
+{
+ struct page *page = ctx;
+ u16 status = le16_to_cpup(&cqe->status) >> 1;
+
+ dma_unmap_page(&dev->pci_dev->dev, dma,
+ PAGE_CACHE_SIZE, DMA_FROM_DEVICE);
+ page_endio(page, READ, status != NVME_SC_SUCCESS);
+}
+
+static void pgwr_completion(struct nvme_dev *dev, void *ctx, dma_addr_t dma,
+ struct nvme_completion *cqe)
+{
+ struct page *page = ctx;
+ u16 status = le16_to_cpup(&cqe->status) >> 1;
+
+ dma_unmap_page(&dev->pci_dev->dev, dma, PAGE_CACHE_SIZE, DMA_TO_DEVICE);
+ page_endio(page, WRITE, status != NVME_SC_SUCCESS);
+}
+
+static const enum dma_data_direction nvme_to_direction[] = {
+ DMA_NONE, DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_BIDIRECTIONAL
+};
+
+static int nvme_rw_page(struct block_device *bdev, sector_t sector,
+ struct page *page, int rw)
+{
+ struct nvme_ns *ns = bdev->bd_disk->private_data;
+ u8 op = (rw & WRITE) ? nvme_cmd_write : nvme_cmd_read;
+ nvme_completion_fn fn = (rw & WRITE) ? pgwr_completion :
+ pgrd_completion;
+ dma_addr_t dma;
+ int cmdid;
+ struct nvme_command *cmd;
+ enum dma_data_direction dma_dir = nvme_to_direction[op & 3];
+ struct nvme_queue *nvmeq = get_nvmeq(ns->dev);
+ dma = dma_map_page(nvmeq->q_dmadev, page, 0, PAGE_CACHE_SIZE, dma_dir);
+
+ if (rw == WRITE)
+ cmdid = alloc_cmdid(nvmeq, page, dma, fn, NVME_IO_TIMEOUT);
+ else
+ cmdid = alloc_cmdid_killable(nvmeq, page, dma, fn,
+ NVME_IO_TIMEOUT);
+ if (unlikely(cmdid < 0)) {
+ dma_unmap_page(nvmeq->q_dmadev, dma, PAGE_CACHE_SIZE,
+ dma_dir);
+ put_nvmeq(nvmeq);
+ return -EBUSY;
+ }
+
+ spin_lock_irq(&nvmeq->q_lock);
+ cmd = &nvmeq->sq_cmds[nvmeq->sq_tail];
+ memset(cmd, 0, sizeof(*cmd));
+
+ cmd->rw.opcode = op;
+ cmd->rw.command_id = cmdid;
+ cmd->rw.nsid = cpu_to_le32(ns->ns_id);
+ cmd->rw.slba = cpu_to_le64(nvme_block_nr(ns, sector));
+ cmd->rw.length = cpu_to_le16((PAGE_CACHE_SIZE >> ns->lba_shift) - 1);
+ cmd->rw.prp1 = cpu_to_le64(dma);
+
+ if (++nvmeq->sq_tail == nvmeq->q_depth)
+ nvmeq->sq_tail = 0;
+ writel(nvmeq->sq_tail, nvmeq->q_db);
+
+ nvme_process_cq(nvmeq);
+ spin_unlock_irq(&nvmeq->q_lock);
+ put_nvmeq(nvmeq);
+
+ return 0;
+}
+
static int nvme_user_admin_cmd(struct nvme_dev *dev,
struct nvme_admin_cmd __user *ucmd)
{
@@ -1655,6 +1735,7 @@ static void nvme_release(struct gendisk *disk, fmode_t mode)
static const struct block_device_operations nvme_fops = {
.owner = THIS_MODULE,
+ .rw_page = nvme_rw_page,
.ioctl = nvme_ioctl,
.compat_ioctl = nvme_compat_ioctl,
.open = nvme_open,
--
1.9.0
* [PATCH v2 6/6] brd: Add support for rw_page
From: Matthew Wilcox @ 2014-03-23 19:08 UTC (permalink / raw)
To: linux-fsdevel, linux-mm, linux-kernel; +Cc: Matthew Wilcox, willy
brd completes page-sized I/O synchronously, so brd_rw_page() simply
calls brd_do_bvec() and signals completion through page_endio().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
---
drivers/block/brd.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index e73b85c..807d3d5 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -360,6 +360,15 @@ out:
bio_endio(bio, err);
}
+static int brd_rw_page(struct block_device *bdev, sector_t sector,
+ struct page *page, int rw)
+{
+ struct brd_device *brd = bdev->bd_disk->private_data;
+ int err = brd_do_bvec(brd, page, PAGE_CACHE_SIZE, 0, rw, sector);
+ page_endio(page, rw & WRITE, err);
+ return err;
+}
+
#ifdef CONFIG_BLK_DEV_XIP
static int brd_direct_access(struct block_device *bdev, sector_t sector,
void **kaddr, unsigned long *pfn)
@@ -419,6 +428,7 @@ static int brd_ioctl(struct block_device *bdev, fmode_t mode,
static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
+ .rw_page = brd_rw_page,
.ioctl = brd_ioctl,
#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
--
1.9.0