* [PATCH v6 00/19] Change readahead API
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This series adds a readahead address_space operation to eventually
replace the readpages operation.  The key difference is that
pages are added to the page cache as they are allocated (and
then looked up by the filesystem) instead of being passed on a
list to the readpages operation for the filesystem to add to the
page cache itself.  It's a net reduction in code for each
implementation, more efficient than walking a list, and solves
the direct-write vs buffered-read problem reported by Yu Kuai at
https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/
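
For reference, the old and new operations side by side (prototypes as
added to struct address_space_operations later in this series):

	/* Old: the filesystem receives pages on a list and must add
	 * them to the page cache itself; the return value ends up
	 * ignored. */
	int (*readpages)(struct file *filp, struct address_space *mapping,
			struct list_head *pages, unsigned nr_pages);

	/* New: pages are already locked and in the page cache; the
	 * filesystem finds them through the readahead_control. */
	void (*readahead)(struct readahead_control *);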

The only unconverted filesystems are those which use fscache.
Their conversion is pending Dave Howells' fscache rewrite, which will
make the job substantially easier.

v6:
 - Name the private members of readahead_control with a leading underscore
   (suggested by Christoph Hellwig)
 - Fix whitespace in rst file
 - Remove misleading comment in btrfs patch
 - Add readahead_next() API and use it in iomap
 - Add iomap_readahead kerneldoc
 - Fix the mpage_readahead kerneldoc
 - Make various readahead functions return void
 - Keep readahead_index() and readahead_offset() pointing to the start of
   this batch through the body.  No current user requires this, but it's
   less surprising.
 - Add kerneldoc for page_cache_readahead_limit
 - Make page_idx an unsigned long, and rename it to just 'i'
 - Get rid of page_offset local variable
 - Add patch to call memalloc_nofs_save() before allocating pages (suggested
   by Michal Hocko)
 - Resplit a lot of patches for more logical progression and easier review
   (suggested by John Hubbard)
 - Added sign-offs where received and still deemed relevant

v5 switched to passing a readahead_control struct (mirroring the
writeback_control struct passed to writepages).  This has a number of
advantages:
 - It fixes a number of bugs in various implementations, e.g. forgetting
   to increment 'start', an off-by-one error in 'nr_pages', or treating
   'start' as a byte offset instead of a page offset.
 - It allows us to change the arguments without changing all the
   implementations of ->readahead which just call mpage_readahead() or
   iomap_readahead()
 - Figuring out which pages haven't been attempted by the implementation
   is more natural this way.
 - There's less code in each implementation.
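
As a sketch of the effect: an implementation reads the request through
accessors rather than hand-maintained counters (readahead_index() and
readahead_count() are added later in this series; myfs_readahead and
myfs_submit_range are hypothetical names):

	static void myfs_readahead(struct readahead_control *rac)
	{
		pgoff_t index = readahead_index(rac);	/* first page */
		unsigned int nr = readahead_count(rac);	/* page count */

		/* No private 'start'/'nr_pages' bookkeeping to get
		 * wrong, and no byte-vs-page confusion: byte offsets
		 * come from readahead_offset() instead. */
		myfs_submit_range(rac->file, index, nr);
	}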

Matthew Wilcox (Oracle) (19):
  mm: Return void from various readahead functions
  mm: Ignore return value of ->readpages
  mm: Use readahead_control to pass arguments
  mm: Rearrange readahead loop
  mm: Remove 'page_offset' from readahead loop
  mm: rename readahead loop variable to 'i'
  mm: Put readahead pages in cache earlier
  mm: Add readahead address space operation
  mm: Add page_cache_readahead_limit
  fs: Convert mpage_readpages to mpage_readahead
  btrfs: Convert from readpages to readahead
  erofs: Convert uncompressed files from readpages to readahead
  erofs: Convert compressed files from readpages to readahead
  ext4: Convert from readpages to readahead
  f2fs: Convert from readpages to readahead
  fuse: Convert from readpages to readahead
  iomap: Restructure iomap_readpages_actor
  iomap: Convert from readpages to readahead
  mm: Use memalloc_nofs_save in readahead path

 Documentation/filesystems/locking.rst |   6 +-
 Documentation/filesystems/vfs.rst     |  13 ++
 drivers/staging/exfat/exfat_super.c   |   7 +-
 fs/block_dev.c                        |   7 +-
 fs/btrfs/extent_io.c                  |  46 ++-----
 fs/btrfs/extent_io.h                  |   3 +-
 fs/btrfs/inode.c                      |  16 +--
 fs/erofs/data.c                       |  39 ++----
 fs/erofs/zdata.c                      |  29 ++--
 fs/ext2/inode.c                       |  10 +-
 fs/ext4/ext4.h                        |   3 +-
 fs/ext4/inode.c                       |  23 ++--
 fs/ext4/readpage.c                    |  22 ++-
 fs/ext4/verity.c                      |  35 +----
 fs/f2fs/data.c                        |  50 +++----
 fs/f2fs/f2fs.h                        |   5 +-
 fs/f2fs/verity.c                      |  35 +----
 fs/fat/inode.c                        |   7 +-
 fs/fuse/file.c                        |  46 +++----
 fs/gfs2/aops.c                        |  23 ++--
 fs/hpfs/file.c                        |   7 +-
 fs/iomap/buffered-io.c                | 118 +++++++----------
 fs/iomap/trace.h                      |   2 +-
 fs/isofs/inode.c                      |   7 +-
 fs/jfs/inode.c                        |   7 +-
 fs/mpage.c                            |  38 ++----
 fs/nilfs2/inode.c                     |  15 +--
 fs/ocfs2/aops.c                       |  34 ++---
 fs/omfs/file.c                        |   7 +-
 fs/qnx6/inode.c                       |   7 +-
 fs/reiserfs/inode.c                   |   8 +-
 fs/udf/inode.c                        |   7 +-
 fs/xfs/xfs_aops.c                     |  13 +-
 fs/zonefs/super.c                     |   7 +-
 include/linux/fs.h                    |   2 +
 include/linux/iomap.h                 |   3 +-
 include/linux/mpage.h                 |   4 +-
 include/linux/pagemap.h               |  90 +++++++++++++
 include/trace/events/erofs.h          |   6 +-
 include/trace/events/f2fs.h           |   6 +-
 mm/internal.h                         |   8 +-
 mm/migrate.c                          |   2 +-
 mm/readahead.c                        | 184 +++++++++++++++++---------
 43 files changed, 474 insertions(+), 533 deletions(-)


base-commit: 11a48a5a18c63fd7621bb050228cebf13566e4d8
-- 
2.25.0



* [PATCH v6 01/19] mm: Return void from various readahead functions
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

ondemand_readahead has two callers, neither of which uses the return
value.  That means that both ra_submit() and __do_page_cache_readahead()
can return void, and we don't need to worry that a present page in the
readahead window causes us to return a smaller nr_pages than we ought
to have.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h  |  8 ++++----
 mm/readahead.c | 24 ++++++++++--------------
 2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3cf20ab3ca01..f779f058118b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -49,18 +49,18 @@ void unmap_page_range(struct mmu_gather *tlb,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details);
 
-extern unsigned int __do_page_cache_readahead(struct address_space *mapping,
+extern void __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
 		unsigned long lookahead_size);
 
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */
-static inline unsigned long ra_submit(struct file_ra_state *ra,
+static inline void ra_submit(struct file_ra_state *ra,
 		struct address_space *mapping, struct file *filp)
 {
-	return __do_page_cache_readahead(mapping, filp,
-					ra->start, ra->size, ra->async_size);
+	__do_page_cache_readahead(mapping, filp,
+			ra->start, ra->size, ra->async_size);
 }
 
 /*
diff --git a/mm/readahead.c b/mm/readahead.c
index 2fe72cd29b47..8ce46d69e6ae 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -149,10 +149,8 @@ static int read_pages(struct address_space *mapping, struct file *filp,
  * the pages first, then submits them for I/O. This avoids the very bad
  * behaviour which would occur if page allocations are causing VM writeback.
  * We really don't want to intermingle reads and writes like that.
- *
- * Returns the number of pages requested, or the maximum amount of I/O allowed.
  */
-unsigned int __do_page_cache_readahead(struct address_space *mapping,
+void __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
 		unsigned long lookahead_size)
 {
@@ -166,7 +164,7 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 
 	if (isize == 0)
-		goto out;
+		return;
 
 	end_index = ((isize - 1) >> PAGE_SHIFT);
 
@@ -211,8 +209,6 @@ unsigned int __do_page_cache_readahead(struct address_space *mapping,
 	if (nr_pages)
 		read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
 	BUG_ON(!list_empty(&page_pool));
-out:
-	return nr_pages;
 }
 
 /*
@@ -378,11 +374,10 @@ static int try_context_readahead(struct address_space *mapping,
 /*
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
-static unsigned long
-ondemand_readahead(struct address_space *mapping,
-		   struct file_ra_state *ra, struct file *filp,
-		   bool hit_readahead_marker, pgoff_t offset,
-		   unsigned long req_size)
+static void ondemand_readahead(struct address_space *mapping,
+		struct file_ra_state *ra, struct file *filp,
+		bool hit_readahead_marker, pgoff_t offset,
+		unsigned long req_size)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
 	unsigned long max_pages = ra->ra_pages;
@@ -428,7 +423,7 @@ ondemand_readahead(struct address_space *mapping,
 		rcu_read_unlock();
 
 		if (!start || start - offset > max_pages)
-			return 0;
+			return;
 
 		ra->start = start;
 		ra->size = start - offset;	/* old async_size */
@@ -464,7 +459,8 @@ ondemand_readahead(struct address_space *mapping,
 	 * standalone, small random read
 	 * Read as is, and do not pollute the readahead state.
 	 */
-	return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+	__do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+	return;
 
 initial_readahead:
 	ra->start = offset;
@@ -489,7 +485,7 @@ ondemand_readahead(struct address_space *mapping,
 		}
 	}
 
-	return ra_submit(ra, mapping, filp);
+	ra_submit(ra, mapping, filp);
 }
 
 /**
-- 
2.25.0



* [PATCH v6 02/19] mm: Ignore return value of ->readpages
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, Christoph Hellwig, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

We used to assign the return value to a variable, which we then ignored.
Remove the pretence of caring.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 mm/readahead.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 8ce46d69e6ae..12d13b7792da 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -113,17 +113,16 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 
 EXPORT_SYMBOL(read_cache_pages);
 
-static int read_pages(struct address_space *mapping, struct file *filp,
+static void read_pages(struct address_space *mapping, struct file *filp,
 		struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
 {
 	struct blk_plug plug;
 	unsigned page_idx;
-	int ret;
 
 	blk_start_plug(&plug);
 
 	if (mapping->a_ops->readpages) {
-		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
+		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
 		/* Clean up the remaining pages */
 		put_pages_list(pages);
 		goto out;
@@ -136,12 +135,9 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 			mapping->a_ops->readpage(filp, page);
 		put_page(page);
 	}
-	ret = 0;
 
 out:
 	blk_finish_plug(&plug);
-
-	return ret;
 }
 
 /*
-- 
2.25.0



* [PATCH v6 03/19] mm: Use readahead_control to pass arguments
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

In this patch, the readahead_control struct is used only to pass
arguments between __do_page_cache_readahead() and read_pages(), but it
will be extended in upcoming patches.  Also add the readahead_count()
accessor.
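
Condensed from the diff below, the calling pattern this introduces:

	struct readahead_control rac = {
		.mapping = mapping,
		.file = filp,
		._nr_pages = 0,
	};

	/* ... allocate pages, bumping rac._nr_pages for each one ... */

	if (readahead_count(&rac))
		read_pages(&rac, &page_pool, gfp_mask);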

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 17 +++++++++++++++++
 mm/readahead.c          | 36 +++++++++++++++++++++---------------
 2 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ccb14b6a16b5..982ecda2d4a2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -630,6 +630,23 @@ static inline int add_to_page_cache(struct page *page,
 	return error;
 }
 
+/*
+ * Readahead is of a block of consecutive pages.
+ */
+struct readahead_control {
+	struct file *file;
+	struct address_space *mapping;
+/* private: use the readahead_* accessors instead */
+	pgoff_t _start;
+	unsigned int _nr_pages;
+};
+
+/* The number of pages in this readahead block */
+static inline unsigned int readahead_count(struct readahead_control *rac)
+{
+	return rac->_nr_pages;
+}
+
 static inline unsigned long dir_pages(struct inode *inode)
 {
 	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
diff --git a/mm/readahead.c b/mm/readahead.c
index 12d13b7792da..15329309231f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -113,26 +113,29 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 
 EXPORT_SYMBOL(read_cache_pages);
 
-static void read_pages(struct address_space *mapping, struct file *filp,
-		struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
+static void read_pages(struct readahead_control *rac, struct list_head *pages,
+		gfp_t gfp)
 {
+	const struct address_space_operations *aops = rac->mapping->a_ops;
 	struct blk_plug plug;
 	unsigned page_idx;
 
 	blk_start_plug(&plug);
 
-	if (mapping->a_ops->readpages) {
-		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
+	if (aops->readpages) {
+		aops->readpages(rac->file, rac->mapping, pages,
+				readahead_count(rac));
 		/* Clean up the remaining pages */
 		put_pages_list(pages);
 		goto out;
 	}
 
-	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
+	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
 		struct page *page = lru_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
-			mapping->a_ops->readpage(filp, page);
+		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
+				gfp))
+			aops->readpage(rac->file, page);
 		put_page(page);
 	}
 
@@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
-	unsigned int nr_pages = 0;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	struct readahead_control rac = {
+		.mapping = mapping,
+		.file = filp,
+		._nr_pages = 0,
+	};
 
 	if (isize == 0)
 		return;
@@ -180,10 +187,9 @@ void __do_page_cache_readahead(struct address_space *mapping,
 			 * contiguous pages before continuing with the next
 			 * batch.
 			 */
-			if (nr_pages)
-				read_pages(mapping, filp, &page_pool, nr_pages,
-						gfp_mask);
-			nr_pages = 0;
+			if (readahead_count(&rac))
+				read_pages(&rac, &page_pool, gfp_mask);
+			rac._nr_pages = 0;
 			continue;
 		}
 
@@ -194,7 +200,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		list_add(&page->lru, &page_pool);
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
-		nr_pages++;
+		rac._nr_pages++;
 	}
 
 	/*
@@ -202,8 +208,8 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	 * uptodate then the caller will launch readpage again, and
 	 * will then handle the error.
 	 */
-	if (nr_pages)
-		read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
+	if (readahead_count(&rac))
+		read_pages(&rac, &page_pool, gfp_mask);
 	BUG_ON(!list_empty(&page_pool));
 }
 
-- 
2.25.0



* [PATCH v6 04/19] mm: Rearrange readahead loop
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Move the declaration of 'page' to inside the loop and move the 'kick
off a fresh batch' code to the end of the loop for easier use in
subsequent patches.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 15329309231f..3eca59c43a45 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -154,7 +154,6 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		unsigned long lookahead_size)
 {
 	struct inode *inode = mapping->host;
-	struct page *page;
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
@@ -175,6 +174,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	 * Preallocate as many pages as we will need.
 	 */
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
+		struct page *page;
 		pgoff_t page_offset = offset + page_idx;
 
 		if (page_offset > end_index)
@@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		page = xa_load(&mapping->i_pages, page_offset);
 		if (page && !xa_is_value(page)) {
 			/*
-			 * Page already present?  Kick off the current batch of
-			 * contiguous pages before continuing with the next
-			 * batch.
+			 * Page already present?  Kick off the current batch
+			 * of contiguous pages before continuing with the
+			 * next batch.  This page may be the one we would
+			 * have intended to mark as Readahead, but we don't
+			 * have a stable reference to this page, and it's
+			 * not worth getting one just for that.
 			 */
-			if (readahead_count(&rac))
-				read_pages(&rac, &page_pool, gfp_mask);
-			rac._nr_pages = 0;
-			continue;
+			goto read;
 		}
 
 		page = __page_cache_alloc(gfp_mask);
@@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
 		rac._nr_pages++;
+		continue;
+read:
+		if (readahead_count(&rac))
+			read_pages(&rac, &page_pool, gfp_mask);
+		rac._nr_pages = 0;
 	}
 
 	/*
-- 
2.25.0



* [PATCH v6 04/16] mm: Tweak readahead loop slightly
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Eliminate the page_offset variable, which was just confusing;
record the start of each consecutive run of pages in the
readahead_control, and move the 'kick off a fresh batch' code to
the end of the loop for easier use in the next patch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 15329309231f..74791b96013f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -154,7 +154,6 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		unsigned long lookahead_size)
 {
 	struct inode *inode = mapping->host;
-	struct page *page;
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
@@ -163,6 +162,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	struct readahead_control rac = {
 		.mapping = mapping,
 		.file = filp,
+		._start = offset,
 		._nr_pages = 0,
 	};
 
@@ -175,32 +175,39 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	 * Preallocate as many pages as we will need.
 	 */
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
-		pgoff_t page_offset = offset + page_idx;
+		struct page *page;
 
-		if (page_offset > end_index)
+		if (offset > end_index)
 			break;
 
-		page = xa_load(&mapping->i_pages, page_offset);
+		page = xa_load(&mapping->i_pages, offset);
 		if (page && !xa_is_value(page)) {
 			/*
-			 * Page already present?  Kick off the current batch of
-			 * contiguous pages before continuing with the next
-			 * batch.
+			 * Page already present?  Kick off the current batch
+			 * of contiguous pages before continuing with the
+			 * next batch.  This page may be the one we would
+			 * have intended to mark as Readahead, but we don't
+			 * have a stable reference to this page, and it's
+			 * not worth getting one just for that.
 			 */
-			if (readahead_count(&rac))
-				read_pages(&rac, &page_pool, gfp_mask);
-			rac._nr_pages = 0;
-			continue;
+			goto read;
 		}
 
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			break;
-		page->index = page_offset;
+		page->index = offset;
 		list_add(&page->lru, &page_pool);
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
 		rac._nr_pages++;
+		offset++;
+		continue;
+read:
+		if (readahead_count(&rac))
+			read_pages(&rac, &page_pool, gfp_mask);
+		rac._nr_pages = 0;
+		rac._start = ++offset;
 	}
 
 	/*
-- 
2.25.0



* [PATCH v6 05/16] mm: Put readahead pages in cache earlier
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

At allocation time, put the pages in the cache unless we're using
->readpages.  Add the readahead_for_each() iterator for the benefit of
the ->readpage fallback.  This iterator supports huge pages, even though
none of the filesystems to be converted supports them yet.
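
The fallback path in read_pages() (see the diff below) then becomes a
simple walk over pages that are already locked and in the page cache;
readahead_next() advances by hpage_nr_pages(page), which is how huge
pages are supported:

	struct page *page;

	readahead_for_each(rac, page) {
		aops->readpage(rac->file, page);
		put_page(page);
	}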

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 24 ++++++++++++++++++++++++
 mm/readahead.c          | 34 +++++++++++++++++-----------------
 2 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 982ecda2d4a2..3613154e79e4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -639,8 +639,32 @@ struct readahead_control {
 /* private: use the readahead_* accessors instead */
 	pgoff_t _start;
 	unsigned int _nr_pages;
+	unsigned int _batch_count;
 };
 
+static inline struct page *readahead_page(struct readahead_control *rac)
+{
+	struct page *page;
+
+	if (!rac->_nr_pages)
+		return NULL;
+
+	page = xa_load(&rac->mapping->i_pages, rac->_start);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	rac->_batch_count = hpage_nr_pages(page);
+
+	return page;
+}
+
+static inline void readahead_next(struct readahead_control *rac)
+{
+	rac->_nr_pages -= rac->_batch_count;
+	rac->_start += rac->_batch_count;
+}
+
+#define readahead_for_each(rac, page)					\
+	for (; (page = readahead_page(rac)); readahead_next(rac))
+
 /* The number of pages in this readahead block */
 static inline unsigned int readahead_count(struct readahead_control *rac)
 {
diff --git a/mm/readahead.c b/mm/readahead.c
index 74791b96013f..7663de534734 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -113,12 +113,11 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 
 EXPORT_SYMBOL(read_cache_pages);
 
-static void read_pages(struct readahead_control *rac, struct list_head *pages,
-		gfp_t gfp)
+static void read_pages(struct readahead_control *rac, struct list_head *pages)
 {
 	const struct address_space_operations *aops = rac->mapping->a_ops;
+	struct page *page;
 	struct blk_plug plug;
-	unsigned page_idx;
 
 	blk_start_plug(&plug);
 
@@ -127,19 +126,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 				readahead_count(rac));
 		/* Clean up the remaining pages */
 		put_pages_list(pages);
-		goto out;
-	}
-
-	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
-		struct page *page = lru_to_page(pages);
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
-				gfp))
+	} else {
+		readahead_for_each(rac, page) {
 			aops->readpage(rac->file, page);
-		put_page(page);
+			put_page(page);
+		}
 	}
 
-out:
 	blk_finish_plug(&plug);
 }
 
@@ -159,6 +152,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	int page_idx;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	bool use_list = mapping->a_ops->readpages;
 	struct readahead_control rac = {
 		.mapping = mapping,
 		.file = filp,
@@ -196,8 +190,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			break;
-		page->index = offset;
-		list_add(&page->lru, &page_pool);
+		if (use_list) {
+			page->index = offset;
+			list_add(&page->lru, &page_pool);
+		} else if (add_to_page_cache_lru(page, mapping, offset,
+					gfp_mask) < 0) {
+			put_page(page);
+			goto read;
+		}
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
 		rac._nr_pages++;
@@ -205,7 +205,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		continue;
 read:
 		if (readahead_count(&rac))
-			read_pages(&rac, &page_pool, gfp_mask);
+			read_pages(&rac, &page_pool);
 		rac._nr_pages = 0;
 		rac._start = ++offset;
 	}
@@ -216,7 +216,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	 * will then handle the error.
 	 */
 	if (readahead_count(&rac))
-		read_pages(&rac, &page_pool, gfp_mask);
+		read_pages(&rac, &page_pool);
 	BUG_ON(!list_empty(&page_pool));
 }
 
-- 
2.25.0



* [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Eliminate the page_offset variable, which was easily confused with the
'offset' parameter, and record the start of each consecutive run of
pages in the readahead_control.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 3eca59c43a45..74791b96013f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -162,6 +162,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	struct readahead_control rac = {
 		.mapping = mapping,
 		.file = filp,
+		._start = offset,
 		._nr_pages = 0,
 	};
 
@@ -175,12 +176,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	 */
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		struct page *page;
-		pgoff_t page_offset = offset + page_idx;
 
-		if (page_offset > end_index)
+		if (offset > end_index)
 			break;
 
-		page = xa_load(&mapping->i_pages, page_offset);
+		page = xa_load(&mapping->i_pages, offset);
 		if (page && !xa_is_value(page)) {
 			/*
 			 * Page already present?  Kick off the current batch
@@ -196,16 +196,18 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			break;
-		page->index = page_offset;
+		page->index = offset;
 		list_add(&page->lru, &page_pool);
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
 		rac._nr_pages++;
+		offset++;
 		continue;
 read:
 		if (readahead_count(&rac))
 			read_pages(&rac, &page_pool, gfp_mask);
 		rac._nr_pages = 0;
+		rac._start = ++offset;
 	}
 
 	/*
-- 
2.25.0



* [PATCH v6 06/16] mm: Add readahead address space operation
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This replaces ->readpages with a saner interface:
 - Return void instead of an ignored error code.
 - Pages are already in the page cache when ->readahead is called.
 - Implementation looks up the pages in the page cache instead of
   having them passed in a linked list.
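
A minimal sketch of a conforming implementation, following the contract
documented in vfs.rst below (myfs_readahead and myfs_start_io are
hypothetical names):

	static void myfs_readahead(struct readahead_control *rac)
	{
		struct page *page;

		readahead_for_each(rac, page) {
			/* Page is locked and already in the page cache. */
			if (myfs_start_io(rac->file, page))
				unlock_page(page); /* I/O error: just unlock */
			/* Drop our reference once I/O has been started (or
			 * abandoned); completion unlocks the page. */
			put_page(page);
		}
	}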

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/filesystems/locking.rst |  6 +++++-
 Documentation/filesystems/vfs.rst     | 13 +++++++++++++
 include/linux/fs.h                    |  2 ++
 include/linux/pagemap.h               | 18 ++++++++++++++++++
 mm/readahead.c                        |  8 +++++++-
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 5057e4d9dcd1..0ebc4491025a 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -239,6 +239,7 @@ prototypes::
 	int (*readpage)(struct file *, struct page *);
 	int (*writepages)(struct address_space *, struct writeback_control *);
 	int (*set_page_dirty)(struct page *page);
+	void (*readahead)(struct readahead_control *);
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
 	int (*write_begin)(struct file *, struct address_space *mapping,
@@ -271,7 +272,8 @@ writepage:		yes, unlocks (see below)
 readpage:		yes, unlocks
 writepages:
 set_page_dirty		no
-readpages:
+readahead:		yes, unlocks
+readpages:		no
 write_begin:		locks the page		 exclusive
 write_end:		yes, unlocks		 exclusive
 bmap:
@@ -295,6 +297,8 @@ the request handler (/dev/loop).
 ->readpage() unlocks the page, either synchronously or via I/O
 completion.
 
+->readahead() unlocks the pages like ->readpage().
+
 ->readpages() populates the pagecache with the passed pages and starts
 I/O against them.  They come unlocked upon I/O completion.
 
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..81ab30fbe45c 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -706,6 +706,7 @@ cache in your filesystem.  The following members are defined:
 		int (*readpage)(struct file *, struct page *);
 		int (*writepages)(struct address_space *, struct writeback_control *);
 		int (*set_page_dirty)(struct page *page);
+		void (*readahead)(struct readahead_control *);
 		int (*readpages)(struct file *filp, struct address_space *mapping,
 				 struct list_head *pages, unsigned nr_pages);
 		int (*write_begin)(struct file *, struct address_space *mapping,
@@ -781,12 +782,24 @@ cache in your filesystem.  The following members are defined:
 	If defined, it should set the PageDirty flag, and the
 	PAGECACHE_TAG_DIRTY tag in the radix tree.
 
+``readahead``
+	Called by the VM to read pages associated with the address_space
+	object.  The pages are consecutive in the page cache and are
+	locked.  The implementation should decrement the page refcount
+	after starting I/O on each page.  Usually the page will be
+	unlocked by the I/O completion handler.  If the function does
+	not attempt I/O on some pages, the caller will decrement the page
+	refcount and unlock the pages for you.	Set PageUptodate if the
+	I/O completes successfully.  Setting PageError on any page will
+	be ignored; simply unlock the page if an I/O error occurs.
+
 ``readpages``
 	called by the VM to read pages associated with the address_space
 	object.  This is essentially just a vector version of readpage.
 	Instead of just one page, several pages are requested.
 	readpages is only used for read-ahead, so read errors are
 	ignored.  If anything goes wrong, feel free to give up.
+	This interface is deprecated; implement readahead instead.
 
 ``write_begin``
 	Called by the generic buffered write code to ask the filesystem
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..d4e2d2964346 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -292,6 +292,7 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+struct readahead_control;
 
 /*
  * Write life time hint values.
@@ -375,6 +376,7 @@ struct address_space_operations {
 	 */
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
+	void (*readahead)(struct readahead_control *);
 
 	int (*write_begin)(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3613154e79e4..bd4291f78f41 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -665,6 +665,24 @@ static inline void readahead_next(struct readahead_control *rac)
 #define readahead_for_each(rac, page)					\
 	for (; (page = readahead_page(rac)); readahead_next(rac))
 
+/* The byte offset into the file of this readahead block */
+static inline loff_t readahead_offset(struct readahead_control *rac)
+{
+	return (loff_t)rac->_start * PAGE_SIZE;
+}
+
+/* The number of bytes in this readahead block */
+static inline loff_t readahead_length(struct readahead_control *rac)
+{
+	return (loff_t)rac->_nr_pages * PAGE_SIZE;
+}
+
+/* The index of the first page in this readahead block */
+static inline unsigned int readahead_index(struct readahead_control *rac)
+{
+	return rac->_start;
+}
+
 /* The number of pages in this readahead block */
 static inline unsigned int readahead_count(struct readahead_control *rac)
 {
diff --git a/mm/readahead.c b/mm/readahead.c
index 7663de534734..5be7e1cb8666 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
 
 	blk_start_plug(&plug);
 
-	if (aops->readpages) {
+	if (aops->readahead) {
+		aops->readahead(rac);
+		readahead_for_each(rac, page) {
+			unlock_page(page);
+			put_page(page);
+		}
+	} else if (aops->readpages) {
 		aops->readpages(rac->file, rac->mapping, pages,
 				readahead_count(rac));
 		/* Clean up the remaining pages */
-- 
2.25.0



* [PATCH v6 06/19] mm: rename readahead loop variable to 'i'
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, John Hubbard, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Change the type of page_idx to unsigned long, and rename it -- it's
just a loop counter, not a page index.

Suggested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/readahead.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 74791b96013f..bdc5759000d3 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -156,7 +156,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
-	int page_idx;
+	unsigned long i;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 	struct readahead_control rac = {
@@ -174,7 +174,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
+	for (i = 0; i < nr_to_read; i++) {
 		struct page *page;
 
 		if (offset > end_index)
@@ -198,7 +198,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 			break;
 		page->index = offset;
 		list_add(&page->lru, &page_pool);
-		if (page_idx == nr_to_read - lookahead_size)
+		if (i == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
 		rac._nr_pages++;
 		offset++;
-- 
2.25.0



* [PATCH v6 07/16] mm: Add page_cache_readahead_limit
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

ext4 and f2fs have duplicated the guts of the readahead code so
they can read past i_size.  Instead, separate out that core into
page_cache_readahead_limit() so they can call it directly.
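
Callers that need to read past i_size (the fs-verity Merkle tree
readers below) then invoke the helper directly; from the diff:

	/* Read up to num_ra_pages pages starting at 'index', with no
	 * lookahead marker and no i_size clamp (end_index is LONG_MAX). */
	page_cache_readahead_limit(inode->i_mapping, NULL, index, LONG_MAX,
			num_ra_pages, 0);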

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ext4/verity.c        | 35 ++-----------------------------
 fs/f2fs/verity.c        | 35 ++-----------------------------
 include/linux/pagemap.h |  4 ++++
 mm/readahead.c          | 46 +++++++++++++++++++++++++----------------
 4 files changed, 36 insertions(+), 84 deletions(-)

diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index dc5ec724d889..f6e0bf05933e 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -342,37 +342,6 @@ static int ext4_get_verity_descriptor(struct inode *inode, void *buf,
 	return desc_size;
 }
 
-/*
- * Prefetch some pages from the file's Merkle tree.
- *
- * This is basically a stripped-down version of __do_page_cache_readahead()
- * which works on pages past i_size.
- */
-static void ext4_merkle_tree_readahead(struct address_space *mapping,
-				       pgoff_t start_index, unsigned long count)
-{
-	LIST_HEAD(pages);
-	unsigned int nr_pages = 0;
-	struct page *page;
-	pgoff_t index;
-	struct blk_plug plug;
-
-	for (index = start_index; index < start_index + count; index++) {
-		page = xa_load(&mapping->i_pages, index);
-		if (!page || xa_is_value(page)) {
-			page = __page_cache_alloc(readahead_gfp_mask(mapping));
-			if (!page)
-				break;
-			page->index = index;
-			list_add(&page->lru, &pages);
-			nr_pages++;
-		}
-	}
-	blk_start_plug(&plug);
-	ext4_mpage_readpages(mapping, &pages, NULL, nr_pages, true);
-	blk_finish_plug(&plug);
-}
-
 static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
@@ -386,8 +355,8 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			ext4_merkle_tree_readahead(inode->i_mapping, index,
-						   num_ra_pages);
+			page_cache_readahead_limit(inode->i_mapping, NULL,
+					index, LONG_MAX, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index d7d430a6f130..71a3e36721fa 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -222,37 +222,6 @@ static int f2fs_get_verity_descriptor(struct inode *inode, void *buf,
 	return size;
 }
 
-/*
- * Prefetch some pages from the file's Merkle tree.
- *
- * This is basically a stripped-down version of __do_page_cache_readahead()
- * which works on pages past i_size.
- */
-static void f2fs_merkle_tree_readahead(struct address_space *mapping,
-				       pgoff_t start_index, unsigned long count)
-{
-	LIST_HEAD(pages);
-	unsigned int nr_pages = 0;
-	struct page *page;
-	pgoff_t index;
-	struct blk_plug plug;
-
-	for (index = start_index; index < start_index + count; index++) {
-		page = xa_load(&mapping->i_pages, index);
-		if (!page || xa_is_value(page)) {
-			page = __page_cache_alloc(readahead_gfp_mask(mapping));
-			if (!page)
-				break;
-			page->index = index;
-			list_add(&page->lru, &pages);
-			nr_pages++;
-		}
-	}
-	blk_start_plug(&plug);
-	f2fs_mpage_readpages(mapping, &pages, NULL, nr_pages, true);
-	blk_finish_plug(&plug);
-}
-
 static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
@@ -266,8 +235,8 @@ static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			f2fs_merkle_tree_readahead(inode->i_mapping, index,
-						   num_ra_pages);
+			page_cache_readahead_limit(inode->i_mapping, NULL,
+					index, LONG_MAX, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bd4291f78f41..4f36c06d064d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -389,6 +389,10 @@ extern struct page * read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern int read_cache_pages(struct address_space *mapping,
 		struct list_head *pages, filler_t *filler, void *data);
+void page_cache_readahead_limit(struct address_space *mapping,
+		struct file *file, pgoff_t offset, pgoff_t end_index,
+		unsigned long nr_to_read, unsigned long lookahead_size);
+
 
 static inline struct page *read_mapping_page(struct address_space *mapping,
 				pgoff_t index, void *data)
diff --git a/mm/readahead.c b/mm/readahead.c
index 5be7e1cb8666..566693f4e539 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -142,35 +142,21 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
 	blk_finish_plug(&plug);
 }
 
-/*
- * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
- * the pages first, then submits them for I/O. This avoids the very bad
- * behaviour which would occur if page allocations are causing VM writeback.
- * We really don't want to intermingle reads and writes like that.
- */
-void __do_page_cache_readahead(struct address_space *mapping,
-		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
-		unsigned long lookahead_size)
+void page_cache_readahead_limit(struct address_space *mapping,
+		struct file *file, pgoff_t offset, pgoff_t end_index,
+		unsigned long nr_to_read, unsigned long lookahead_size)
 {
-	struct inode *inode = mapping->host;
-	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	int page_idx;
-	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 	bool use_list = mapping->a_ops->readpages;
 	struct readahead_control rac = {
 		.mapping = mapping,
-		.file = filp,
+		.file = file,
 		._start = offset,
 		._nr_pages = 0,
 	};
 
-	if (isize == 0)
-		return;
-
-	end_index = ((isize - 1) >> PAGE_SHIFT);
-
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -225,6 +211,30 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		read_pages(&rac, &page_pool);
 	BUG_ON(!list_empty(&page_pool));
 }
+EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
+
+/*
+ * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
+ * the pages first, then submits them for I/O. This avoids the very bad
+ * behaviour which would occur if page allocations are causing VM writeback.
+ * We really don't want to intermingle reads and writes like that.
+ */
+void __do_page_cache_readahead(struct address_space *mapping,
+		struct file *file, pgoff_t offset, unsigned long nr_to_read,
+		unsigned long lookahead_size)
+{
+	struct inode *inode = mapping->host;
+	unsigned long end_index;	/* The last page we want to read */
+	loff_t isize = i_size_read(inode);
+
+	if (isize == 0)
+		return;
+
+	end_index = ((isize - 1) >> PAGE_SHIFT);
+
+	page_cache_readahead_limit(mapping, file, offset, end_index,
+			nr_to_read, lookahead_size);
+}
 
 /*
  * Chunk the readahead into 2 megabyte units, so that we don't pin too much
-- 
2.25.0



* [PATCH v6 07/19] mm: Put readahead pages in cache earlier
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

At allocation time, put the pages in the cache unless we're using
->readpages.  Add the readahead_for_each() iterator for the benefit of
the ->readpage fallback.  This iterator supports huge pages, even though
none of the filesystems to be converted supports them yet.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 24 ++++++++++++++++++++++++
 mm/readahead.c          | 34 +++++++++++++++++-----------------
 2 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 982ecda2d4a2..3613154e79e4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -639,8 +639,32 @@ struct readahead_control {
 /* private: use the readahead_* accessors instead */
 	pgoff_t _start;
 	unsigned int _nr_pages;
+	unsigned int _batch_count;
 };
 
+static inline struct page *readahead_page(struct readahead_control *rac)
+{
+	struct page *page;
+
+	if (!rac->_nr_pages)
+		return NULL;
+
+	page = xa_load(&rac->mapping->i_pages, rac->_start);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	rac->_batch_count = hpage_nr_pages(page);
+
+	return page;
+}
+
+static inline void readahead_next(struct readahead_control *rac)
+{
+	rac->_nr_pages -= rac->_batch_count;
+	rac->_start += rac->_batch_count;
+}
+
+#define readahead_for_each(rac, page)					\
+	for (; (page = readahead_page(rac)); readahead_next(rac))
+
 /* The number of pages in this readahead block */
 static inline unsigned int readahead_count(struct readahead_control *rac)
 {
diff --git a/mm/readahead.c b/mm/readahead.c
index bdc5759000d3..9e430daae42f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -113,12 +113,11 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 
 EXPORT_SYMBOL(read_cache_pages);
 
-static void read_pages(struct readahead_control *rac, struct list_head *pages,
-		gfp_t gfp)
+static void read_pages(struct readahead_control *rac, struct list_head *pages)
 {
 	const struct address_space_operations *aops = rac->mapping->a_ops;
+	struct page *page;
 	struct blk_plug plug;
-	unsigned page_idx;
 
 	blk_start_plug(&plug);
 
@@ -127,19 +126,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
 				readahead_count(rac));
 		/* Clean up the remaining pages */
 		put_pages_list(pages);
-		goto out;
-	}
-
-	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
-		struct page *page = lru_to_page(pages);
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
-				gfp))
+	} else {
+		readahead_for_each(rac, page) {
 			aops->readpage(rac->file, page);
-		put_page(page);
+			put_page(page);
+		}
 	}
 
-out:
 	blk_finish_plug(&plug);
 }
 
@@ -159,6 +152,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	unsigned long i;
 	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	bool use_list = mapping->a_ops->readpages;
 	struct readahead_control rac = {
 		.mapping = mapping,
 		.file = filp,
@@ -196,8 +190,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			break;
-		page->index = offset;
-		list_add(&page->lru, &page_pool);
+		if (use_list) {
+			page->index = offset;
+			list_add(&page->lru, &page_pool);
+		} else if (add_to_page_cache_lru(page, mapping, offset,
+					gfp_mask) < 0) {
+			put_page(page);
+			goto read;
+		}
 		if (i == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
 		rac._nr_pages++;
@@ -205,7 +205,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		continue;
 read:
 		if (readahead_count(&rac))
-			read_pages(&rac, &page_pool, gfp_mask);
+			read_pages(&rac, &page_pool);
 		rac._nr_pages = 0;
 		rac._start = ++offset;
 	}
@@ -216,7 +216,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
 	 * will then handle the error.
 	 */
 	if (readahead_count(&rac))
-		read_pages(&rac, &page_pool, gfp_mask);
+		read_pages(&rac, &page_pool);
 	BUG_ON(!list_empty(&page_pool));
 }
 
-- 
2.25.0



* [PATCH v6 08/16] fs: Convert mpage_readpages to mpage_readahead
From: Matthew Wilcox @ 2020-02-17 18:45 UTC
  To: linux-fsdevel
  Cc: linux-xfs, Junxiao Bi, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Implement the new readahead aop and convert all users of
mpage_readpages (block_dev, exfat, ext2, fat, gfs2, hpfs, isofs, jfs,
nilfs2, ocfs2, omfs, qnx6, reiserfs & udf).  The conversions are all
trivial except for GFS2 & OCFS2.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
---
 drivers/staging/exfat/exfat_super.c |  7 +++---
 fs/block_dev.c                      |  7 +++---
 fs/ext2/inode.c                     | 10 +++-----
 fs/fat/inode.c                      |  7 +++---
 fs/gfs2/aops.c                      | 23 ++++++-----------
 fs/hpfs/file.c                      |  7 +++---
 fs/iomap/buffered-io.c              |  2 +-
 fs/isofs/inode.c                    |  7 +++---
 fs/jfs/inode.c                      |  7 +++---
 fs/mpage.c                          | 38 +++++++++--------------------
 fs/nilfs2/inode.c                   | 15 +++---------
 fs/ocfs2/aops.c                     | 34 ++++++++++----------------
 fs/omfs/file.c                      |  7 +++---
 fs/qnx6/inode.c                     |  7 +++---
 fs/reiserfs/inode.c                 |  8 +++---
 fs/udf/inode.c                      |  7 +++---
 include/linux/mpage.h               |  4 +--
 mm/migrate.c                        |  2 +-
 18 files changed, 73 insertions(+), 126 deletions(-)

diff --git a/drivers/staging/exfat/exfat_super.c b/drivers/staging/exfat/exfat_super.c
index b81d2a87b82e..96aad9b16d31 100644
--- a/drivers/staging/exfat/exfat_super.c
+++ b/drivers/staging/exfat/exfat_super.c
@@ -3002,10 +3002,9 @@ static int exfat_readpage(struct file *file, struct page *page)
 	return  mpage_readpage(page, exfat_get_block);
 }
 
-static int exfat_readpages(struct file *file, struct address_space *mapping,
-			   struct list_head *pages, unsigned int nr_pages)
+static void exfat_readahead(struct readahead_control *rac)
 {
-	return  mpage_readpages(mapping, pages, nr_pages, exfat_get_block);
+	mpage_readahead(rac, exfat_get_block);
 }
 
 static int exfat_writepage(struct page *page, struct writeback_control *wbc)
@@ -3104,7 +3103,7 @@ static sector_t _exfat_bmap(struct address_space *mapping, sector_t block)
 
 static const struct address_space_operations exfat_aops = {
 	.readpage    = exfat_readpage,
-	.readpages   = exfat_readpages,
+	.readahead   = exfat_readahead,
 	.writepage   = exfat_writepage,
 	.writepages  = exfat_writepages,
 	.write_begin = exfat_write_begin,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..2fd9c7bd61f6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -614,10 +614,9 @@ static int blkdev_readpage(struct file * file, struct page * page)
 	return block_read_full_page(page, blkdev_get_block);
 }
 
-static int blkdev_readpages(struct file *file, struct address_space *mapping,
-			struct list_head *pages, unsigned nr_pages)
+static void blkdev_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, blkdev_get_block);
+	mpage_readahead(rac, blkdev_get_block);
 }
 
 static int blkdev_write_begin(struct file *file, struct address_space *mapping,
@@ -2062,7 +2061,7 @@ static int blkdev_writepages(struct address_space *mapping,
 
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
-	.readpages	= blkdev_readpages,
+	.readahead	= blkdev_readahead,
 	.writepage	= blkdev_writepage,
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c885cf7d724b..2875c0a705b5 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -877,11 +877,9 @@ static int ext2_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, ext2_get_block);
 }
 
-static int
-ext2_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void ext2_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
+	mpage_readahead(rac, ext2_get_block);
 }
 
 static int
@@ -967,7 +965,7 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
 
 const struct address_space_operations ext2_aops = {
 	.readpage		= ext2_readpage,
-	.readpages		= ext2_readpages,
+	.readahead		= ext2_readahead,
 	.writepage		= ext2_writepage,
 	.write_begin		= ext2_write_begin,
 	.write_end		= ext2_write_end,
@@ -981,7 +979,7 @@ const struct address_space_operations ext2_aops = {
 
 const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
-	.readpages		= ext2_readpages,
+	.readahead		= ext2_readahead,
 	.writepage		= ext2_nobh_writepage,
 	.write_begin		= ext2_nobh_write_begin,
 	.write_end		= nobh_write_end,
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 594b05ae16c9..3496f5fc3e6d 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -210,10 +210,9 @@ static int fat_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, fat_get_block);
 }
 
-static int fat_readpages(struct file *file, struct address_space *mapping,
-			 struct list_head *pages, unsigned nr_pages)
+static void fat_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, fat_get_block);
+	mpage_readahead(rac, fat_get_block);
 }
 
 static void fat_write_failed(struct address_space *mapping, loff_t to)
@@ -344,7 +343,7 @@ int fat_block_truncate_page(struct inode *inode, loff_t from)
 
 static const struct address_space_operations fat_aops = {
 	.readpage	= fat_readpage,
-	.readpages	= fat_readpages,
+	.readahead	= fat_readahead,
 	.writepage	= fat_writepage,
 	.writepages	= fat_writepages,
 	.write_begin	= fat_write_begin,
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index ba83b49ce18c..5e63c13c12c1 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -577,7 +577,7 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
 }
 
 /**
- * gfs2_readpages - Read a bunch of pages at once
+ * gfs2_readahead - Read a bunch of pages at once
  * @file: The file to read from
  * @mapping: Address space info
  * @pages: List of pages to read
@@ -590,31 +590,24 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
  *    obviously not something we'd want to do on too regular a basis.
  *    Any I/O we ignore at this time will be done via readpage later.
  * 2. We don't handle stuffed files here we let readpage do the honours.
- * 3. mpage_readpages() does most of the heavy lifting in the common case.
+ * 3. mpage_readahead() does most of the heavy lifting in the common case.
  * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places.
  */
 
-static int gfs2_readpages(struct file *file, struct address_space *mapping,
-			  struct list_head *pages, unsigned nr_pages)
+static void gfs2_readahead(struct readahead_control *rac)
 {
-	struct inode *inode = mapping->host;
+	struct inode *inode = rac->mapping->host;
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_holder gh;
-	int ret;
 
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	ret = gfs2_glock_nq(&gh);
-	if (unlikely(ret))
+	if (gfs2_glock_nq(&gh))
 		goto out_uninit;
 	if (!gfs2_is_stuffed(ip))
-		ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
+		mpage_readahead(rac, gfs2_block_map);
 	gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
-	if (unlikely(gfs2_withdrawn(sdp)))
-		ret = -EIO;
-	return ret;
 }
 
 /**
@@ -828,7 +821,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepage = gfs2_writepage,
 	.writepages = gfs2_writepages,
 	.readpage = gfs2_readpage,
-	.readpages = gfs2_readpages,
+	.readahead = gfs2_readahead,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
 	.releasepage = gfs2_releasepage,
@@ -842,7 +835,7 @@ static const struct address_space_operations gfs2_jdata_aops = {
 	.writepage = gfs2_jdata_writepage,
 	.writepages = gfs2_jdata_writepages,
 	.readpage = gfs2_readpage,
-	.readpages = gfs2_readpages,
+	.readahead = gfs2_readahead,
 	.set_page_dirty = jdata_set_page_dirty,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index b36abf9cb345..2de0d3492d15 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -125,10 +125,9 @@ static int hpfs_writepage(struct page *page, struct writeback_control *wbc)
 	return block_write_full_page(page, hpfs_get_block, wbc);
 }
 
-static int hpfs_readpages(struct file *file, struct address_space *mapping,
-			  struct list_head *pages, unsigned nr_pages)
+static void hpfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, hpfs_get_block);
+	mpage_readahead(rac, hpfs_get_block);
 }
 
 static int hpfs_writepages(struct address_space *mapping,
@@ -198,7 +197,7 @@ static int hpfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 const struct address_space_operations hpfs_aops = {
 	.readpage = hpfs_readpage,
 	.writepage = hpfs_writepage,
-	.readpages = hpfs_readpages,
+	.readahead = hpfs_readahead,
 	.writepages = hpfs_writepages,
 	.write_begin = hpfs_write_begin,
 	.write_end = hpfs_write_end,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c84c4c027c4..cb3511eb152a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -359,7 +359,7 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 	}
 
 	/*
-	 * Just like mpage_readpages and block_read_full_page we always
+	 * Just like mpage_readahead and block_read_full_page we always
 	 * return 0 and just mark the page as PageError on errors.  This
 	 * should be cleaned up all through the stack eventually.
 	 */
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 62c0462dc89f..95b1f377ad09 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -1185,10 +1185,9 @@ static int isofs_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, isofs_get_block);
 }
 
-static int isofs_readpages(struct file *file, struct address_space *mapping,
-			struct list_head *pages, unsigned nr_pages)
+static void isofs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, isofs_get_block);
+	mpage_readahead(rac, isofs_get_block);
 }
 
 static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
@@ -1198,7 +1197,7 @@ static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
 
 static const struct address_space_operations isofs_aops = {
 	.readpage = isofs_readpage,
-	.readpages = isofs_readpages,
+	.readahead = isofs_readahead,
 	.bmap = _isofs_bmap
 };
 
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 9486afcdac76..6f65bfa9f18d 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -296,10 +296,9 @@ static int jfs_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, jfs_get_block);
 }
 
-static int jfs_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void jfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, jfs_get_block);
+	mpage_readahead(rac, jfs_get_block);
 }
 
 static void jfs_write_failed(struct address_space *mapping, loff_t to)
@@ -358,7 +357,7 @@ static ssize_t jfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 const struct address_space_operations jfs_aops = {
 	.readpage	= jfs_readpage,
-	.readpages	= jfs_readpages,
+	.readahead	= jfs_readahead,
 	.writepage	= jfs_writepage,
 	.writepages	= jfs_writepages,
 	.write_begin	= jfs_write_begin,
diff --git a/fs/mpage.c b/fs/mpage.c
index ccba3c4c4479..8a09e6002dc2 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -91,7 +91,7 @@ mpage_alloc(struct block_device *bdev,
 }
 
 /*
- * support function for mpage_readpages.  The fs supplied get_block might
+ * support function for mpage_readahead.  The fs supplied get_block might
  * return an up to date buffer.  This is used to map that buffer into
  * the page, which allows readpage to avoid triggering a duplicate call
  * to get_block.
@@ -338,13 +338,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 }
 
 /**
- * mpage_readpages - populate an address space with some pages & start reads against them
- * @mapping: the address_space
- * @pages: The address of a list_head which contains the target pages.  These
- *   pages have their ->index populated and are otherwise uninitialised.
- *   The page at @pages->prev has the lowest file offset, and reads should be
- *   issued in @pages->prev to @pages->next order.
- * @nr_pages: The number of pages at *@pages
+ * mpage_readahead - start reads against pages
+ * @rac: Describes which pages to read.
  * @get_block: The filesystem's block mapper function.
  *
  * This function walks the pages and the blocks within each page, building and
@@ -381,36 +376,25 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
  *
  * This all causes the disk requests to be issued in the correct order.
  */
-int
-mpage_readpages(struct address_space *mapping, struct list_head *pages,
-				unsigned nr_pages, get_block_t get_block)
+void mpage_readahead(struct readahead_control *rac, get_block_t get_block)
 {
+	struct page *page;
 	struct mpage_readpage_args args = {
 		.get_block = get_block,
 		.is_readahead = true,
 	};
-	unsigned page_idx;
-
-	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
-		struct page *page = lru_to_page(pages);
 
+	readahead_for_each(rac, page) {
 		prefetchw(&page->flags);
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping,
-					page->index,
-					readahead_gfp_mask(mapping))) {
-			args.page = page;
-			args.nr_pages = nr_pages - page_idx;
-			args.bio = do_mpage_readpage(&args);
-		}
+		args.page = page;
+		args.nr_pages = readahead_count(rac);
+		args.bio = do_mpage_readpage(&args);
 		put_page(page);
 	}
-	BUG_ON(!list_empty(pages));
 	if (args.bio)
 		mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio);
-	return 0;
 }
-EXPORT_SYMBOL(mpage_readpages);
+EXPORT_SYMBOL(mpage_readahead);
 
 /*
  * This isn't called much at all
@@ -563,7 +547,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 		 * Page has buffers, but they are all unmapped. The page was
 		 * created by pagein or read over a hole which was handled by
 		 * block_read_full_page().  If this address_space is also
-		 * using mpage_readpages then this can rarely happen.
+		 * using mpage_readahead then this can rarely happen.
 		 */
 		goto confused;
 	}
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 671085512e0f..ceeb3b441844 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -145,18 +145,9 @@ static int nilfs_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, nilfs_get_block);
 }
 
-/**
- * nilfs_readpages() - implement readpages() method of nilfs_aops {}
- * address_space_operations.
- * @file - file struct of the file to be read
- * @mapping - address_space struct used for reading multiple pages
- * @pages - the pages to be read
- * @nr_pages - number of pages to be read
- */
-static int nilfs_readpages(struct file *file, struct address_space *mapping,
-			   struct list_head *pages, unsigned int nr_pages)
+static void nilfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, nilfs_get_block);
+	mpage_readahead(rac, nilfs_get_block);
 }
 
 static int nilfs_writepages(struct address_space *mapping,
@@ -308,7 +299,7 @@ const struct address_space_operations nilfs_aops = {
 	.readpage		= nilfs_readpage,
 	.writepages		= nilfs_writepages,
 	.set_page_dirty		= nilfs_set_page_dirty,
-	.readpages		= nilfs_readpages,
+	.readahead		= nilfs_readahead,
 	.write_begin		= nilfs_write_begin,
 	.write_end		= nilfs_write_end,
 	/* .releasepage		= nilfs_releasepage, */
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 3a67a6518ddf..e8137efaafec 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -350,14 +350,11 @@ static int ocfs2_readpage(struct file *file, struct page *page)
  * grow out to a tree. If need be, detecting boundary extents could
  * trivially be added in a future version of ocfs2_get_block().
  */
-static int ocfs2_readpages(struct file *filp, struct address_space *mapping,
-			   struct list_head *pages, unsigned nr_pages)
+static void ocfs2_readahead(struct readahead_control *rac)
 {
-	int ret, err = -EIO;
-	struct inode *inode = mapping->host;
+	int ret;
+	struct inode *inode = rac->mapping->host;
 	struct ocfs2_inode_info *oi = OCFS2_I(inode);
-	loff_t start;
-	struct page *last;
 
 	/*
 	 * Use the nonblocking flag for the dlm code to avoid page
@@ -365,36 +362,31 @@ static int ocfs2_readpages(struct file *filp, struct address_space *mapping,
 	 */
 	ret = ocfs2_inode_lock_full(inode, NULL, 0, OCFS2_LOCK_NONBLOCK);
 	if (ret)
-		return err;
+		return;
 
-	if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
-		ocfs2_inode_unlock(inode, 0);
-		return err;
-	}
+	if (down_read_trylock(&oi->ip_alloc_sem) == 0)
+		goto out_unlock;
 
 	/*
 	 * Don't bother with inline-data. There isn't anything
 	 * to read-ahead in that case anyway...
 	 */
 	if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
-		goto out_unlock;
+		goto out_up;
 
 	/*
 	 * Check whether a remote node truncated this file - we just
 	 * drop out in that case as it's not worth handling here.
 	 */
-	last = lru_to_page(pages);
-	start = (loff_t)last->index << PAGE_SHIFT;
-	if (start >= i_size_read(inode))
-		goto out_unlock;
+	if (readahead_offset(rac) >= i_size_read(inode))
+		goto out_up;
 
-	err = mpage_readpages(mapping, pages, nr_pages, ocfs2_get_block);
+	mpage_readahead(rac, ocfs2_get_block);
 
-out_unlock:
+out_up:
 	up_read(&oi->ip_alloc_sem);
+out_unlock:
 	ocfs2_inode_unlock(inode, 0);
-
-	return err;
 }
 
 /* Note: Because we don't support holes, our allocation has
@@ -2474,7 +2466,7 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 const struct address_space_operations ocfs2_aops = {
 	.readpage		= ocfs2_readpage,
-	.readpages		= ocfs2_readpages,
+	.readahead		= ocfs2_readahead,
 	.writepage		= ocfs2_writepage,
 	.write_begin		= ocfs2_write_begin,
 	.write_end		= ocfs2_write_end,
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index d640b9388238..d7b5f09d298c 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -289,10 +289,9 @@ static int omfs_readpage(struct file *file, struct page *page)
 	return block_read_full_page(page, omfs_get_block);
 }
 
-static int omfs_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void omfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, omfs_get_block);
+	mpage_readahead(rac, omfs_get_block);
 }
 
 static int omfs_writepage(struct page *page, struct writeback_control *wbc)
@@ -373,7 +372,7 @@ const struct inode_operations omfs_file_inops = {
 
 const struct address_space_operations omfs_aops = {
 	.readpage = omfs_readpage,
-	.readpages = omfs_readpages,
+	.readahead = omfs_readahead,
 	.writepage = omfs_writepage,
 	.writepages = omfs_writepages,
 	.write_begin = omfs_write_begin,
diff --git a/fs/qnx6/inode.c b/fs/qnx6/inode.c
index 345db56c98fd..755293c8c71a 100644
--- a/fs/qnx6/inode.c
+++ b/fs/qnx6/inode.c
@@ -99,10 +99,9 @@ static int qnx6_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, qnx6_get_block);
 }
 
-static int qnx6_readpages(struct file *file, struct address_space *mapping,
-		   struct list_head *pages, unsigned nr_pages)
+static void qnx6_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, qnx6_get_block);
+	mpage_readahead(rac, qnx6_get_block);
 }
 
 /*
@@ -499,7 +498,7 @@ static sector_t qnx6_bmap(struct address_space *mapping, sector_t block)
 }
 static const struct address_space_operations qnx6_aops = {
 	.readpage	= qnx6_readpage,
-	.readpages	= qnx6_readpages,
+	.readahead	= qnx6_readahead,
 	.bmap		= qnx6_bmap
 };
 
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 6419e6dacc39..0031070b3692 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -1160,11 +1160,9 @@ int reiserfs_get_block(struct inode *inode, sector_t block,
 	return retval;
 }
 
-static int
-reiserfs_readpages(struct file *file, struct address_space *mapping,
-		   struct list_head *pages, unsigned nr_pages)
+static void reiserfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, reiserfs_get_block);
+	mpage_readahead(rac, reiserfs_get_block);
 }
 
 /*
@@ -3434,7 +3432,7 @@ int reiserfs_setattr(struct dentry *dentry, struct iattr *attr)
 const struct address_space_operations reiserfs_address_space_operations = {
 	.writepage = reiserfs_writepage,
 	.readpage = reiserfs_readpage,
-	.readpages = reiserfs_readpages,
+	.readahead = reiserfs_readahead,
 	.releasepage = reiserfs_releasepage,
 	.invalidatepage = reiserfs_invalidatepage,
 	.write_begin = reiserfs_write_begin,
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index e875bc5668ee..adaba8e8b326 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -195,10 +195,9 @@ static int udf_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, udf_get_block);
 }
 
-static int udf_readpages(struct file *file, struct address_space *mapping,
-			struct list_head *pages, unsigned nr_pages)
+static void udf_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, udf_get_block);
+	mpage_readahead(rac, udf_get_block);
 }
 
 static int udf_write_begin(struct file *file, struct address_space *mapping,
@@ -234,7 +233,7 @@ static sector_t udf_bmap(struct address_space *mapping, sector_t block)
 
 const struct address_space_operations udf_aops = {
 	.readpage	= udf_readpage,
-	.readpages	= udf_readpages,
+	.readahead	= udf_readahead,
 	.writepage	= udf_writepage,
 	.writepages	= udf_writepages,
 	.write_begin	= udf_write_begin,
diff --git a/include/linux/mpage.h b/include/linux/mpage.h
index 001f1fcf9836..f4f5e90a6844 100644
--- a/include/linux/mpage.h
+++ b/include/linux/mpage.h
@@ -13,9 +13,9 @@
 #ifdef CONFIG_BLOCK
 
 struct writeback_control;
+struct readahead_control;
 
-int mpage_readpages(struct address_space *mapping, struct list_head *pages,
-				unsigned nr_pages, get_block_t get_block);
+void mpage_readahead(struct readahead_control *, get_block_t get_block);
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block);
diff --git a/mm/migrate.c b/mm/migrate.c
index b1092876e537..a32122095702 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1020,7 +1020,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * to the LRU. Later, when the IO completes the pages are
 		 * marked uptodate and unlocked. However, the queueing
 		 * could be merging multiple pages for one bio (e.g.
-		 * mpage_readpages). If an allocation happens for the
+		 * mpage_readahead). If an allocation happens for the
 		 * second or third page, the process can end up locking
 		 * the same page twice and deadlocking. Rather than
 		 * trying to be clever about what pages can be locked,
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 08/19] mm: Add readahead address space operation
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (11 preceding siblings ...)
  2020-02-17 18:45 ` [PATCH v6 08/16] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
@ 2020-02-17 18:45 ` Matthew Wilcox
  2020-02-18  6:21   ` Dave Chinner
                     ` (2 more replies)
  2020-02-17 18:45 ` [PATCH v6 09/16] btrfs: Convert from readpages to readahead Matthew Wilcox
                   ` (21 subsequent siblings)
  34 siblings, 3 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

This replaces ->readpages with a saner interface:
 - Return void instead of an ignored error code.
 - Pages are already in the page cache when ->readahead is called.
 - Implementation looks up the pages in the page cache instead of
   having them passed in a linked list.
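
For a filesystem author, a minimal implementation follows directly
from the rules above (a sketch under assumed names: myfs_readahead()
and myfs_start_read() are hypothetical and not part of this patch):

	static void myfs_readahead(struct readahead_control *rac)
	{
		struct page *page;

		readahead_for_each(rac, page) {
			/* page is locked and already in the page cache */
			myfs_start_read(rac->mapping->host, page);
			/*
			 * Drop the iterator's reference; I/O completion
			 * unlocks the page.
			 */
			put_page(page);
		}
	}

Any pages not consumed this way are unlocked and released by the
caller, so an implementation may stop early on error.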

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 Documentation/filesystems/locking.rst |  6 +++++-
 Documentation/filesystems/vfs.rst     | 13 +++++++++++++
 include/linux/fs.h                    |  2 ++
 include/linux/pagemap.h               | 18 ++++++++++++++++++
 mm/readahead.c                        |  8 +++++++-
 5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 5057e4d9dcd1..0ebc4491025a 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -239,6 +239,7 @@ prototypes::
 	int (*readpage)(struct file *, struct page *);
 	int (*writepages)(struct address_space *, struct writeback_control *);
 	int (*set_page_dirty)(struct page *page);
+	void (*readahead)(struct readahead_control *);
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
 	int (*write_begin)(struct file *, struct address_space *mapping,
@@ -271,7 +272,8 @@ writepage:		yes, unlocks (see below)
 readpage:		yes, unlocks
 writepages:
 set_page_dirty		no
-readpages:
+readahead:		yes, unlocks
+readpages:		no
 write_begin:		locks the page		 exclusive
 write_end:		yes, unlocks		 exclusive
 bmap:
@@ -295,6 +297,8 @@ the request handler (/dev/loop).
 ->readpage() unlocks the page, either synchronously or via I/O
 completion.
 
+->readahead() unlocks the pages like ->readpage().
+
 ->readpages() populates the pagecache with the passed pages and starts
 I/O against them.  They come unlocked upon I/O completion.
 
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..81ab30fbe45c 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -706,6 +706,7 @@ cache in your filesystem.  The following members are defined:
 		int (*readpage)(struct file *, struct page *);
 		int (*writepages)(struct address_space *, struct writeback_control *);
 		int (*set_page_dirty)(struct page *page);
+		void (*readahead)(struct readahead_control *);
 		int (*readpages)(struct file *filp, struct address_space *mapping,
 				 struct list_head *pages, unsigned nr_pages);
 		int (*write_begin)(struct file *, struct address_space *mapping,
@@ -781,12 +782,24 @@ cache in your filesystem.  The following members are defined:
 	If defined, it should set the PageDirty flag, and the
 	PAGECACHE_TAG_DIRTY tag in the radix tree.
 
+``readahead``
+	Called by the VM to read pages associated with the address_space
+	object.  The pages are consecutive in the page cache and are
+	locked.  The implementation should decrement the page refcount
+	after starting I/O on each page.  Usually the page will be
+	unlocked by the I/O completion handler.  If the function does
+	not attempt I/O on some pages, the caller will decrement the page
+	refcount and unlock the pages for you.	Set PageUptodate if the
+	I/O completes successfully.  Setting PageError on any page will
+	be ignored; simply unlock the page if an I/O error occurs.
+
 ``readpages``
 	called by the VM to read pages associated with the address_space
 	object.  This is essentially just a vector version of readpage.
 	Instead of just one page, several pages are requested.
 	readpages is only used for read-ahead, so read errors are
 	ignored.  If anything goes wrong, feel free to give up.
+	This interface is deprecated; implement readahead instead.
 
 ``write_begin``
 	Called by the generic buffered write code to ask the filesystem
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..d4e2d2964346 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -292,6 +292,7 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+struct readahead_control;
 
 /*
  * Write life time hint values.
@@ -375,6 +376,7 @@ struct address_space_operations {
 	 */
 	int (*readpages)(struct file *filp, struct address_space *mapping,
 			struct list_head *pages, unsigned nr_pages);
+	void (*readahead)(struct readahead_control *);
 
 	int (*write_begin)(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3613154e79e4..bd4291f78f41 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -665,6 +665,24 @@ static inline void readahead_next(struct readahead_control *rac)
 #define readahead_for_each(rac, page)					\
 	for (; (page = readahead_page(rac)); readahead_next(rac))
 
+/* The byte offset into the file of this readahead block */
+static inline loff_t readahead_offset(struct readahead_control *rac)
+{
+	return (loff_t)rac->_start * PAGE_SIZE;
+}
+
+/* The number of bytes in this readahead block */
+static inline loff_t readahead_length(struct readahead_control *rac)
+{
+	return (loff_t)rac->_nr_pages * PAGE_SIZE;
+}
+
+/* The index of the first page in this readahead block */
+static inline unsigned int readahead_index(struct readahead_control *rac)
+{
+	return rac->_start;
+}
+
 /* The number of pages in this readahead block */
 static inline unsigned int readahead_count(struct readahead_control *rac)
 {
diff --git a/mm/readahead.c b/mm/readahead.c
index 9e430daae42f..975ff5e387be 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
 
 	blk_start_plug(&plug);
 
-	if (aops->readpages) {
+	if (aops->readahead) {
+		aops->readahead(rac);
+		readahead_for_each(rac, page) {
+			unlock_page(page);
+			put_page(page);
+		}
+	} else if (aops->readpages) {
 		aops->readpages(rac->file, rac->mapping, pages,
 				readahead_count(rac));
 		/* Clean up the remaining pages */
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 09/16] btrfs: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (12 preceding siblings ...)
  2020-02-17 18:45 ` [PATCH v6 08/19] mm: Add readahead address space operation Matthew Wilcox
@ 2020-02-17 18:45 ` Matthew Wilcox
  2020-02-17 18:45 ` [PATCH v6 09/19] mm: Add page_cache_readahead_limit Matthew Wilcox
                   ` (20 subsequent siblings)
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in btrfs.  Add a
readahead_for_each_batch() iterator to optimise the loop over the XArray.
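
The intended consumption pattern is (a sketch only; myfs_read_contig()
stands in for a submission helper like contiguous_readpages(), which
is expected to start I/O and drop the reference on each page):

	struct page *pagepool[16];
	unsigned int nr;

	readahead_for_each_batch(rac, pagepool, ARRAY_SIZE(pagepool), nr) {
		/* pagepool[0..nr-1] are consecutive pages of the file */
		myfs_read_contig(rac->mapping, pagepool, nr);
	}

Each iteration hands back up to ARRAY_SIZE(pagepool) consecutive
locked pages, so btrfs no longer needs the add_to_page_cache_lru()
loop it previously used to discover a contiguous run.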

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/btrfs/extent_io.c    | 48 ++++++++++++++---------------------------
 fs/btrfs/extent_io.h    |  3 +--
 fs/btrfs/inode.c        | 16 ++++++--------
 include/linux/pagemap.h | 27 +++++++++++++++++++++++
 4 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c0f202741e09..d9f66058e0a7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4278,52 +4278,36 @@ int extent_writepages(struct address_space *mapping,
 	return ret;
 }
 
-int extent_readpages(struct address_space *mapping, struct list_head *pages,
-		     unsigned nr_pages)
+void extent_readahead(struct readahead_control *rac)
 {
 	struct bio *bio = NULL;
 	unsigned long bio_flags = 0;
 	struct page *pagepool[16];
 	struct extent_map *em_cached = NULL;
-	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
-	int nr = 0;
+	struct extent_io_tree *tree = &BTRFS_I(rac->mapping->host)->io_tree;
 	u64 prev_em_start = (u64)-1;
+	int nr;
 
-	while (!list_empty(pages)) {
-		u64 contig_end = 0;
-
-		for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) {
-			struct page *page = lru_to_page(pages);
-
-			prefetchw(&page->flags);
-			list_del(&page->lru);
-			if (add_to_page_cache_lru(page, mapping, page->index,
-						readahead_gfp_mask(mapping))) {
-				put_page(page);
-				break;
-			}
-
-			pagepool[nr++] = page;
-			contig_end = page_offset(page) + PAGE_SIZE - 1;
-		}
-
-		if (nr) {
-			u64 contig_start = page_offset(pagepool[0]);
+	readahead_for_each_batch(rac, pagepool, ARRAY_SIZE(pagepool), nr) {
+		u64 contig_start = page_offset(pagepool[0]);
+		u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1;
 
-			ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
+		ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
 
-			contiguous_readpages(tree, pagepool, nr, contig_start,
-				     contig_end, &em_cached, &bio, &bio_flags,
-				     &prev_em_start);
-		}
+		contiguous_readpages(tree, pagepool, nr, contig_start,
+				contig_end, &em_cached, &bio, &bio_flags,
+				&prev_em_start);
 	}
 
 	if (em_cached)
 		free_extent_map(em_cached);
 
-	if (bio)
-		return submit_one_bio(bio, 0, bio_flags);
-	return 0;
+	if (bio) {
+		int ret = submit_one_bio(bio, 0, bio_flags);
+		if (ret < 0) {
+			/* XXX: unlock the pages here? */
+		}
+	}
 }
 
 /*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5d205bbaafdc..bddac32948c7 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -198,8 +198,7 @@ int extent_writepages(struct address_space *mapping,
 		      struct writeback_control *wbc);
 int btree_write_cache_pages(struct address_space *mapping,
 			    struct writeback_control *wbc);
-int extent_readpages(struct address_space *mapping, struct list_head *pages,
-		     unsigned nr_pages);
+void extent_readahead(struct readahead_control *rac);
 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		__u64 start, __u64 len);
 void set_page_extent_mapped(struct page *page);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5b3ec93ff911..d964b2a78ed8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4794,8 +4794,8 @@ static void evict_inode_truncate_pages(struct inode *inode)
 
 	/*
 	 * Keep looping until we have no more ranges in the io tree.
-	 * We can have ongoing bios started by readpages (called from readahead)
-	 * that have their endio callback (extent_io.c:end_bio_extent_readpage)
+	 * We can have ongoing bios started by readahead that have
+	 * their endio callback (extent_io.c:end_bio_extent_readpage)
 	 * still in progress (unlocked the pages in the bio but did not yet
 	 * unlocked the ranges in the io tree). Therefore this means some
 	 * ranges can still be locked and eviction started because before
@@ -6996,11 +6996,11 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 			 * for it to complete) and then invalidate the pages for
 			 * this range (through invalidate_inode_pages2_range()),
 			 * but that can lead us to a deadlock with a concurrent
-			 * call to readpages() (a buffered read or a defrag call
+			 * call to readahead (a buffered read or a defrag call
 			 * triggered a readahead) on a page lock due to an
 			 * ordered dio extent we created before but did not have
 			 * yet a corresponding bio submitted (whence it can not
-			 * complete), which makes readpages() wait for that
+			 * complete), which makes readahead wait for that
 			 * ordered extent to complete while holding a lock on
 			 * that page.
 			 */
@@ -8239,11 +8239,9 @@ static int btrfs_writepages(struct address_space *mapping,
 	return extent_writepages(mapping, wbc);
 }
 
-static int
-btrfs_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void btrfs_readahead(struct readahead_control *rac)
 {
-	return extent_readpages(mapping, pages, nr_pages);
+	extent_readahead(rac);
 }
 
 static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
@@ -10448,7 +10446,7 @@ static const struct address_space_operations btrfs_aops = {
 	.readpage	= btrfs_readpage,
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
-	.readpages	= btrfs_readpages,
+	.readahead	= btrfs_readahead,
 	.direct_IO	= btrfs_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4f36c06d064d..1bbb60a0bf16 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -669,6 +669,33 @@ static inline void readahead_next(struct readahead_control *rac)
 #define readahead_for_each(rac, page)					\
 	for (; (page = readahead_page(rac)); readahead_next(rac))
 
+static inline unsigned int readahead_page_batch(struct readahead_control *rac,
+		struct page **array, unsigned int size)
+{
+	unsigned int batch = 0;
+	XA_STATE(xas, &rac->mapping->i_pages, rac->_start);
+	struct page *page;
+
+	rac->_batch_count = 0;
+	xas_for_each(&xas, page, rac->_start + rac->_nr_pages - 1) {
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		VM_BUG_ON_PAGE(PageTail(page), page);
+		array[batch++] = page;
+		rac->_batch_count += hpage_nr_pages(page);
+		if (PageHead(page))
+			xas_set(&xas, rac->_start + rac->_batch_count);
+
+		if (batch == size)
+			break;
+	}
+
+	return batch;
+}
+
+#define readahead_for_each_batch(rac, array, size, nr)			\
+	for (; (nr = readahead_page_batch(rac, array, size));		\
+			readahead_next(rac))
+
 /* The byte offset into the file of this readahead block */
 static inline loff_t readahead_offset(struct readahead_control *rac)
 {
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 09/19] mm: Add page_cache_readahead_limit
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (13 preceding siblings ...)
  2020-02-17 18:45 ` [PATCH v6 09/16] btrfs: Convert from readpages to readahead Matthew Wilcox
@ 2020-02-17 18:45 ` Matthew Wilcox
  2020-02-18  6:31   ` Dave Chinner
  2020-02-19  1:32   ` John Hubbard
  2020-02-17 18:45 ` [PATCH v6 10/16] erofs: Convert uncompressed files from readpages to readahead Matthew Wilcox
                   ` (19 subsequent siblings)
  34 siblings, 2 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

ext4 and f2fs have duplicated the guts of the readahead code so
that they can read past i_size.  Instead, separate out the guts of
the readahead code into page_cache_readahead_limit() so that they
can call it directly.
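
Both conversions end up making the same call (this is the form visible
in the hunks below):

	page_cache_readahead_limit(inode->i_mapping, NULL, index,
			LONG_MAX, num_ra_pages, 0);

Passing LONG_MAX as end_index is what allows reading past i_size, and
a lookahead_size of 0 means none of the pages are marked to trigger
further readahead.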

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ext4/verity.c        | 35 ++---------------------
 fs/f2fs/verity.c        | 35 ++---------------------
 include/linux/pagemap.h |  4 +++
 mm/readahead.c          | 61 +++++++++++++++++++++++++++++------------
 4 files changed, 52 insertions(+), 83 deletions(-)

diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index dc5ec724d889..f6e0bf05933e 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -342,37 +342,6 @@ static int ext4_get_verity_descriptor(struct inode *inode, void *buf,
 	return desc_size;
 }
 
-/*
- * Prefetch some pages from the file's Merkle tree.
- *
- * This is basically a stripped-down version of __do_page_cache_readahead()
- * which works on pages past i_size.
- */
-static void ext4_merkle_tree_readahead(struct address_space *mapping,
-				       pgoff_t start_index, unsigned long count)
-{
-	LIST_HEAD(pages);
-	unsigned int nr_pages = 0;
-	struct page *page;
-	pgoff_t index;
-	struct blk_plug plug;
-
-	for (index = start_index; index < start_index + count; index++) {
-		page = xa_load(&mapping->i_pages, index);
-		if (!page || xa_is_value(page)) {
-			page = __page_cache_alloc(readahead_gfp_mask(mapping));
-			if (!page)
-				break;
-			page->index = index;
-			list_add(&page->lru, &pages);
-			nr_pages++;
-		}
-	}
-	blk_start_plug(&plug);
-	ext4_mpage_readpages(mapping, &pages, NULL, nr_pages, true);
-	blk_finish_plug(&plug);
-}
-
 static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
@@ -386,8 +355,8 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			ext4_merkle_tree_readahead(inode->i_mapping, index,
-						   num_ra_pages);
+			page_cache_readahead_limit(inode->i_mapping, NULL,
+					index, LONG_MAX, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index d7d430a6f130..71a3e36721fa 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -222,37 +222,6 @@ static int f2fs_get_verity_descriptor(struct inode *inode, void *buf,
 	return size;
 }
 
-/*
- * Prefetch some pages from the file's Merkle tree.
- *
- * This is basically a stripped-down version of __do_page_cache_readahead()
- * which works on pages past i_size.
- */
-static void f2fs_merkle_tree_readahead(struct address_space *mapping,
-				       pgoff_t start_index, unsigned long count)
-{
-	LIST_HEAD(pages);
-	unsigned int nr_pages = 0;
-	struct page *page;
-	pgoff_t index;
-	struct blk_plug plug;
-
-	for (index = start_index; index < start_index + count; index++) {
-		page = xa_load(&mapping->i_pages, index);
-		if (!page || xa_is_value(page)) {
-			page = __page_cache_alloc(readahead_gfp_mask(mapping));
-			if (!page)
-				break;
-			page->index = index;
-			list_add(&page->lru, &pages);
-			nr_pages++;
-		}
-	}
-	blk_start_plug(&plug);
-	f2fs_mpage_readpages(mapping, &pages, NULL, nr_pages, true);
-	blk_finish_plug(&plug);
-}
-
 static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
@@ -266,8 +235,8 @@ static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			f2fs_merkle_tree_readahead(inode->i_mapping, index,
-						   num_ra_pages);
+			page_cache_readahead_limit(inode->i_mapping, NULL,
+					index, LONG_MAX, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bd4291f78f41..4f36c06d064d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -389,6 +389,10 @@ extern struct page * read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern int read_cache_pages(struct address_space *mapping,
 		struct list_head *pages, filler_t *filler, void *data);
+void page_cache_readahead_limit(struct address_space *mapping,
+		struct file *file, pgoff_t offset, pgoff_t end_index,
+		unsigned long nr_to_read, unsigned long lookahead_size);
+
 
 static inline struct page *read_mapping_page(struct address_space *mapping,
 				pgoff_t index, void *data)
diff --git a/mm/readahead.c b/mm/readahead.c
index 975ff5e387be..94d499cfb657 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -142,35 +142,38 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
 	blk_finish_plug(&plug);
 }
 
-/*
- * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
- * the pages first, then submits them for I/O. This avoids the very bad
- * behaviour which would occur if page allocations are causing VM writeback.
- * We really don't want to intermingle reads and writes like that.
+/**
+ * page_cache_readahead_limit - Start readahead beyond a file's i_size.
+ * @mapping: File address space.
+ * @file: This instance of the open file; used for authentication.
+ * @offset: First page index to read.
+ * @end_index: The maximum page index to read.
+ * @nr_to_read: The number of pages to read.
+ * @lookahead_size: Where to start the next readahead.
+ *
+ * This function is for filesystems to call when they want to start
+ * readahead potentially beyond a file's stated i_size.  If you want
+ * to start readahead on a normal file, you probably want to call
+ * page_cache_async_readahead() or page_cache_sync_readahead() instead.
+ *
+ * Context: File is referenced by caller.  Mutexes may be held by caller.
+ * May sleep, but will not reenter filesystem to reclaim memory.
  */
-void __do_page_cache_readahead(struct address_space *mapping,
-		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
-		unsigned long lookahead_size)
+void page_cache_readahead_limit(struct address_space *mapping,
+		struct file *file, pgoff_t offset, pgoff_t end_index,
+		unsigned long nr_to_read, unsigned long lookahead_size)
 {
-	struct inode *inode = mapping->host;
-	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	unsigned long i;
-	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 	bool use_list = mapping->a_ops->readpages;
 	struct readahead_control rac = {
 		.mapping = mapping,
-		.file = filp,
+		.file = file,
 		._start = offset,
 		._nr_pages = 0,
 	};
 
-	if (isize == 0)
-		return;
-
-	end_index = ((isize - 1) >> PAGE_SHIFT);
-
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -225,6 +228,30 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		read_pages(&rac, &page_pool);
 	BUG_ON(!list_empty(&page_pool));
 }
+EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
+
+/*
+ * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
+ * the pages first, then submits them for I/O. This avoids the very bad
+ * behaviour which would occur if page allocations are causing VM writeback.
+ * We really don't want to intermingle reads and writes like that.
+ */
+void __do_page_cache_readahead(struct address_space *mapping,
+		struct file *file, pgoff_t offset, unsigned long nr_to_read,
+		unsigned long lookahead_size)
+{
+	struct inode *inode = mapping->host;
+	unsigned long end_index;	/* The last page we want to read */
+	loff_t isize = i_size_read(inode);
+
+	if (isize == 0)
+		return;
+
+	end_index = ((isize - 1) >> PAGE_SHIFT);
+
+	page_cache_readahead_limit(mapping, file, offset, end_index,
+			nr_to_read, lookahead_size);
+}
 
 /*
  * Chunk the readahead into 2 megabyte units, so that we don't pin too much
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 10/16] erofs: Convert uncompressed files from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (14 preceding siblings ...)
  2020-02-17 18:45 ` [PATCH v6 09/19] mm: Add page_cache_readahead_limit Matthew Wilcox
@ 2020-02-17 18:45 ` Matthew Wilcox
  2020-02-17 18:45 ` [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
                   ` (18 subsequent siblings)
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in erofs.
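
One detail of the conversion (excerpted from the hunk below): the
tracepoint now reports the range using the new helpers,

	trace_erofs_readpages(rac->mapping->host, readahead_index(rac),
			readahead_count(rac), true);

taking the starting page index instead of a page pointer, which is why
the TP_PROTO in include/trace/events/erofs.h changes as well.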

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/erofs/data.c              | 39 +++++++++++++-----------------------
 fs/erofs/zdata.c             |  2 +-
 include/trace/events/erofs.h |  6 +++---
 3 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index fc3a8d8064f8..82ebcee9d178 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -280,47 +280,36 @@ static int erofs_raw_access_readpage(struct file *file, struct page *page)
 	return 0;
 }
 
-static int erofs_raw_access_readpages(struct file *filp,
-				      struct address_space *mapping,
-				      struct list_head *pages,
-				      unsigned int nr_pages)
+static void erofs_raw_access_readahead(struct readahead_control *rac)
 {
 	erofs_off_t last_block;
 	struct bio *bio = NULL;
-	gfp_t gfp = readahead_gfp_mask(mapping);
-	struct page *page = list_last_entry(pages, struct page, lru);
-
-	trace_erofs_readpages(mapping->host, page, nr_pages, true);
+	struct page *page;
 
-	for (; nr_pages; --nr_pages) {
-		page = list_entry(pages->prev, struct page, lru);
+	trace_erofs_readpages(rac->mapping->host, readahead_index(rac),
+			readahead_count(rac), true);
 
+	readahead_for_each(rac, page) {
 		prefetchw(&page->flags);
-		list_del(&page->lru);
 
-		if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
-			bio = erofs_read_raw_page(bio, mapping, page,
-						  &last_block, nr_pages, true);
+		bio = erofs_read_raw_page(bio, rac->mapping, page, &last_block,
+				readahead_count(rac), true);
 
-			/* all the page errors are ignored when readahead */
-			if (IS_ERR(bio)) {
-				pr_err("%s, readahead error at page %lu of nid %llu\n",
-				       __func__, page->index,
-				       EROFS_I(mapping->host)->nid);
+		/* all the page errors are ignored when readahead */
+		if (IS_ERR(bio)) {
+			pr_err("%s, readahead error at page %lu of nid %llu\n",
+			       __func__, page->index,
+			       EROFS_I(rac->mapping->host)->nid);
 
-				bio = NULL;
-			}
+			bio = NULL;
 		}
 
-		/* pages could still be locked */
 		put_page(page);
 	}
-	DBG_BUGON(!list_empty(pages));
 
 	/* the rare case (end in gaps) */
 	if (bio)
 		submit_bio(bio);
-	return 0;
 }
 
 static int erofs_get_block(struct inode *inode, sector_t iblock,
@@ -358,7 +347,7 @@ static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
 /* for uncompressed (aligned) files and raw access for other files */
 const struct address_space_operations erofs_raw_access_aops = {
 	.readpage = erofs_raw_access_readpage,
-	.readpages = erofs_raw_access_readpages,
+	.readahead = erofs_raw_access_readahead,
 	.bmap = erofs_bmap,
 };
 
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 80e47f07d946..17f45fcb8c5c 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1315,7 +1315,7 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
 	struct page *head = NULL;
 	LIST_HEAD(pagepool);
 
-	trace_erofs_readpages(mapping->host, lru_to_page(pages),
+	trace_erofs_readpages(mapping->host, lru_to_page(pages)->index,
 			      nr_pages, false);
 
 	f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
diff --git a/include/trace/events/erofs.h b/include/trace/events/erofs.h
index 27f5caa6299a..bf9806fd1306 100644
--- a/include/trace/events/erofs.h
+++ b/include/trace/events/erofs.h
@@ -113,10 +113,10 @@ TRACE_EVENT(erofs_readpage,
 
 TRACE_EVENT(erofs_readpages,
 
-	TP_PROTO(struct inode *inode, struct page *page, unsigned int nrpage,
+	TP_PROTO(struct inode *inode, pgoff_t start, unsigned int nrpage,
 		bool raw),
 
-	TP_ARGS(inode, page, nrpage, raw),
+	TP_ARGS(inode, start, nrpage, raw),
 
 	TP_STRUCT__entry(
 		__field(dev_t,		dev	)
@@ -129,7 +129,7 @@ TRACE_EVENT(erofs_readpages,
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->nid	= EROFS_I(inode)->nid;
-		__entry->start	= page->index;
+		__entry->start	= start;
 		__entry->nrpage	= nrpage;
 		__entry->raw	= raw;
 	),
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (15 preceding siblings ...)
  2020-02-17 18:45 ` [PATCH v6 10/16] erofs: Convert uncompressed files from readpages to readahead Matthew Wilcox
@ 2020-02-17 18:45 ` Matthew Wilcox
  2020-02-18  1:51   ` [Ocfs2-devel] " Joseph Qi
                     ` (3 more replies)
  2020-02-17 18:45 ` [PATCH v6 11/19] btrfs: Convert from readpages to readahead Matthew Wilcox
                   ` (17 subsequent siblings)
  34 siblings, 4 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, Junxiao Bi, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Implement the new readahead aop and convert all callers (block_dev,
exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
reiserfs & udf).  The callers are all trivial except for GFS2 & OCFS2.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
---
 drivers/staging/exfat/exfat_super.c |  7 +++---
 fs/block_dev.c                      |  7 +++---
 fs/ext2/inode.c                     | 10 +++-----
 fs/fat/inode.c                      |  7 +++---
 fs/gfs2/aops.c                      | 23 ++++++-----------
 fs/hpfs/file.c                      |  7 +++---
 fs/iomap/buffered-io.c              |  2 +-
 fs/isofs/inode.c                    |  7 +++---
 fs/jfs/inode.c                      |  7 +++---
 fs/mpage.c                          | 38 +++++++++--------------------
 fs/nilfs2/inode.c                   | 15 +++---------
 fs/ocfs2/aops.c                     | 34 ++++++++++----------------
 fs/omfs/file.c                      |  7 +++---
 fs/qnx6/inode.c                     |  7 +++---
 fs/reiserfs/inode.c                 |  8 +++---
 fs/udf/inode.c                      |  7 +++---
 include/linux/mpage.h               |  4 +--
 mm/migrate.c                        |  2 +-
 18 files changed, 73 insertions(+), 126 deletions(-)

diff --git a/drivers/staging/exfat/exfat_super.c b/drivers/staging/exfat/exfat_super.c
index b81d2a87b82e..96aad9b16d31 100644
--- a/drivers/staging/exfat/exfat_super.c
+++ b/drivers/staging/exfat/exfat_super.c
@@ -3002,10 +3002,9 @@ static int exfat_readpage(struct file *file, struct page *page)
 	return  mpage_readpage(page, exfat_get_block);
 }
 
-static int exfat_readpages(struct file *file, struct address_space *mapping,
-			   struct list_head *pages, unsigned int nr_pages)
+static void exfat_readahead(struct readahead_control *rac)
 {
-	return  mpage_readpages(mapping, pages, nr_pages, exfat_get_block);
+	mpage_readahead(rac, exfat_get_block);
 }
 
 static int exfat_writepage(struct page *page, struct writeback_control *wbc)
@@ -3104,7 +3103,7 @@ static sector_t _exfat_bmap(struct address_space *mapping, sector_t block)
 
 static const struct address_space_operations exfat_aops = {
 	.readpage    = exfat_readpage,
-	.readpages   = exfat_readpages,
+	.readahead   = exfat_readahead,
 	.writepage   = exfat_writepage,
 	.writepages  = exfat_writepages,
 	.write_begin = exfat_write_begin,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..2fd9c7bd61f6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -614,10 +614,9 @@ static int blkdev_readpage(struct file * file, struct page * page)
 	return block_read_full_page(page, blkdev_get_block);
 }
 
-static int blkdev_readpages(struct file *file, struct address_space *mapping,
-			struct list_head *pages, unsigned nr_pages)
+static void blkdev_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, blkdev_get_block);
+	mpage_readahead(rac, blkdev_get_block);
 }
 
 static int blkdev_write_begin(struct file *file, struct address_space *mapping,
@@ -2062,7 +2061,7 @@ static int blkdev_writepages(struct address_space *mapping,
 
 static const struct address_space_operations def_blk_aops = {
 	.readpage	= blkdev_readpage,
-	.readpages	= blkdev_readpages,
+	.readahead	= blkdev_readahead,
 	.writepage	= blkdev_writepage,
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c885cf7d724b..2875c0a705b5 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -877,11 +877,9 @@ static int ext2_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, ext2_get_block);
 }
 
-static int
-ext2_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void ext2_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
+	mpage_readahead(rac, ext2_get_block);
 }
 
 static int
@@ -967,7 +965,7 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
 
 const struct address_space_operations ext2_aops = {
 	.readpage		= ext2_readpage,
-	.readpages		= ext2_readpages,
+	.readahead		= ext2_readahead,
 	.writepage		= ext2_writepage,
 	.write_begin		= ext2_write_begin,
 	.write_end		= ext2_write_end,
@@ -981,7 +979,7 @@ const struct address_space_operations ext2_aops = {
 
 const struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
-	.readpages		= ext2_readpages,
+	.readahead		= ext2_readahead,
 	.writepage		= ext2_nobh_writepage,
 	.write_begin		= ext2_nobh_write_begin,
 	.write_end		= nobh_write_end,
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 594b05ae16c9..3496f5fc3e6d 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -210,10 +210,9 @@ static int fat_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, fat_get_block);
 }
 
-static int fat_readpages(struct file *file, struct address_space *mapping,
-			 struct list_head *pages, unsigned nr_pages)
+static void fat_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, fat_get_block);
+	mpage_readahead(rac, fat_get_block);
 }
 
 static void fat_write_failed(struct address_space *mapping, loff_t to)
@@ -344,7 +343,7 @@ int fat_block_truncate_page(struct inode *inode, loff_t from)
 
 static const struct address_space_operations fat_aops = {
 	.readpage	= fat_readpage,
-	.readpages	= fat_readpages,
+	.readahead	= fat_readahead,
 	.writepage	= fat_writepage,
 	.writepages	= fat_writepages,
 	.write_begin	= fat_write_begin,
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index ba83b49ce18c..5e63c13c12c1 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -577,7 +577,5 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
 }
 
 /**
- * gfs2_readpages - Read a bunch of pages at once
- * @file: The file to read from
- * @mapping: Address space info
- * @pages: List of pages to read
+ * gfs2_readahead - Read a bunch of pages at once
+ * @rac: Read-ahead control structure
@@ -590,31 +588,24 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
  *    obviously not something we'd want to do on too regular a basis.
  *    Any I/O we ignore at this time will be done via readpage later.
  * 2. We don't handle stuffed files here we let readpage do the honours.
- * 3. mpage_readpages() does most of the heavy lifting in the common case.
+ * 3. mpage_readahead() does most of the heavy lifting in the common case.
  * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places.
  */
 
-static int gfs2_readpages(struct file *file, struct address_space *mapping,
-			  struct list_head *pages, unsigned nr_pages)
+static void gfs2_readahead(struct readahead_control *rac)
 {
-	struct inode *inode = mapping->host;
+	struct inode *inode = rac->mapping->host;
 	struct gfs2_inode *ip = GFS2_I(inode);
-	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_holder gh;
-	int ret;
 
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
-	ret = gfs2_glock_nq(&gh);
-	if (unlikely(ret))
+	if (gfs2_glock_nq(&gh))
 		goto out_uninit;
 	if (!gfs2_is_stuffed(ip))
-		ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
+		mpage_readahead(rac, gfs2_block_map);
 	gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
-	if (unlikely(gfs2_withdrawn(sdp)))
-		ret = -EIO;
-	return ret;
 }
 
 /**
@@ -828,7 +819,7 @@ static const struct address_space_operations gfs2_aops = {
 	.writepage = gfs2_writepage,
 	.writepages = gfs2_writepages,
 	.readpage = gfs2_readpage,
-	.readpages = gfs2_readpages,
+	.readahead = gfs2_readahead,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
 	.releasepage = gfs2_releasepage,
@@ -842,7 +833,7 @@ static const struct address_space_operations gfs2_jdata_aops = {
 	.writepage = gfs2_jdata_writepage,
 	.writepages = gfs2_jdata_writepages,
 	.readpage = gfs2_readpage,
-	.readpages = gfs2_readpages,
+	.readahead = gfs2_readahead,
 	.set_page_dirty = jdata_set_page_dirty,
 	.bmap = gfs2_bmap,
 	.invalidatepage = gfs2_invalidatepage,
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index b36abf9cb345..2de0d3492d15 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -125,10 +125,9 @@ static int hpfs_writepage(struct page *page, struct writeback_control *wbc)
 	return block_write_full_page(page, hpfs_get_block, wbc);
 }
 
-static int hpfs_readpages(struct file *file, struct address_space *mapping,
-			  struct list_head *pages, unsigned nr_pages)
+static void hpfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, hpfs_get_block);
+	mpage_readahead(rac, hpfs_get_block);
 }
 
 static int hpfs_writepages(struct address_space *mapping,
@@ -198,7 +197,7 @@ static int hpfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 const struct address_space_operations hpfs_aops = {
 	.readpage = hpfs_readpage,
 	.writepage = hpfs_writepage,
-	.readpages = hpfs_readpages,
+	.readahead = hpfs_readahead,
 	.writepages = hpfs_writepages,
 	.write_begin = hpfs_write_begin,
 	.write_end = hpfs_write_end,
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c84c4c027c4..cb3511eb152a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -359,7 +359,7 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 	}
 
 	/*
-	 * Just like mpage_readpages and block_read_full_page we always
+	 * Just like mpage_readahead and block_read_full_page we always
 	 * return 0 and just mark the page as PageError on errors.  This
 	 * should be cleaned up all through the stack eventually.
 	 */
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 62c0462dc89f..95b1f377ad09 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -1185,10 +1185,9 @@ static int isofs_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, isofs_get_block);
 }
 
-static int isofs_readpages(struct file *file, struct address_space *mapping,
-			struct list_head *pages, unsigned nr_pages)
+static void isofs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, isofs_get_block);
+	mpage_readahead(rac, isofs_get_block);
 }
 
 static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
@@ -1198,7 +1197,7 @@ static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
 
 static const struct address_space_operations isofs_aops = {
 	.readpage = isofs_readpage,
-	.readpages = isofs_readpages,
+	.readahead = isofs_readahead,
 	.bmap = _isofs_bmap
 };
 
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 9486afcdac76..6f65bfa9f18d 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -296,10 +296,9 @@ static int jfs_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, jfs_get_block);
 }
 
-static int jfs_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void jfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, jfs_get_block);
+	mpage_readahead(rac, jfs_get_block);
 }
 
 static void jfs_write_failed(struct address_space *mapping, loff_t to)
@@ -358,7 +357,7 @@ static ssize_t jfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 const struct address_space_operations jfs_aops = {
 	.readpage	= jfs_readpage,
-	.readpages	= jfs_readpages,
+	.readahead	= jfs_readahead,
 	.writepage	= jfs_writepage,
 	.writepages	= jfs_writepages,
 	.write_begin	= jfs_write_begin,
diff --git a/fs/mpage.c b/fs/mpage.c
index ccba3c4c4479..8a09e6002dc2 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -91,7 +91,7 @@ mpage_alloc(struct block_device *bdev,
 }
 
 /*
- * support function for mpage_readpages.  The fs supplied get_block might
+ * support function for mpage_readahead.  The fs supplied get_block might
  * return an up to date buffer.  This is used to map that buffer into
  * the page, which allows readpage to avoid triggering a duplicate call
  * to get_block.
@@ -338,13 +338,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 }
 
 /**
- * mpage_readpages - populate an address space with some pages & start reads against them
- * @mapping: the address_space
- * @pages: The address of a list_head which contains the target pages.  These
- *   pages have their ->index populated and are otherwise uninitialised.
- *   The page at @pages->prev has the lowest file offset, and reads should be
- *   issued in @pages->prev to @pages->next order.
- * @nr_pages: The number of pages at *@pages
+ * mpage_readahead - start reads against pages
+ * @rac: Describes which pages to read.
  * @get_block: The filesystem's block mapper function.
  *
  * This function walks the pages and the blocks within each page, building and
@@ -381,36 +376,25 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
  *
  * This all causes the disk requests to be issued in the correct order.
  */
-int
-mpage_readpages(struct address_space *mapping, struct list_head *pages,
-				unsigned nr_pages, get_block_t get_block)
+void mpage_readahead(struct readahead_control *rac, get_block_t get_block)
 {
+	struct page *page;
 	struct mpage_readpage_args args = {
 		.get_block = get_block,
 		.is_readahead = true,
 	};
-	unsigned page_idx;
-
-	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
-		struct page *page = lru_to_page(pages);
 
+	readahead_for_each(rac, page) {
 		prefetchw(&page->flags);
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping,
-					page->index,
-					readahead_gfp_mask(mapping))) {
-			args.page = page;
-			args.nr_pages = nr_pages - page_idx;
-			args.bio = do_mpage_readpage(&args);
-		}
+		args.page = page;
+		args.nr_pages = readahead_count(rac);
+		args.bio = do_mpage_readpage(&args);
 		put_page(page);
 	}
-	BUG_ON(!list_empty(pages));
 	if (args.bio)
 		mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio);
-	return 0;
 }
-EXPORT_SYMBOL(mpage_readpages);
+EXPORT_SYMBOL(mpage_readahead);
 
 /*
  * This isn't called much at all
@@ -563,7 +547,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 		 * Page has buffers, but they are all unmapped. The page was
 		 * created by pagein or read over a hole which was handled by
 		 * block_read_full_page().  If this address_space is also
-		 * using mpage_readpages then this can rarely happen.
+		 * using mpage_readahead then this can rarely happen.
 		 */
 		goto confused;
 	}
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 671085512e0f..ceeb3b441844 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -145,18 +145,9 @@ static int nilfs_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, nilfs_get_block);
 }
 
-/**
- * nilfs_readpages() - implement readpages() method of nilfs_aops {}
- * address_space_operations.
- * @file - file struct of the file to be read
- * @mapping - address_space struct used for reading multiple pages
- * @pages - the pages to be read
- * @nr_pages - number of pages to be read
- */
-static int nilfs_readpages(struct file *file, struct address_space *mapping,
-			   struct list_head *pages, unsigned int nr_pages)
+static void nilfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, nilfs_get_block);
+	mpage_readahead(rac, nilfs_get_block);
 }
 
 static int nilfs_writepages(struct address_space *mapping,
@@ -308,7 +299,7 @@ const struct address_space_operations nilfs_aops = {
 	.readpage		= nilfs_readpage,
 	.writepages		= nilfs_writepages,
 	.set_page_dirty		= nilfs_set_page_dirty,
-	.readpages		= nilfs_readpages,
+	.readahead		= nilfs_readahead,
 	.write_begin		= nilfs_write_begin,
 	.write_end		= nilfs_write_end,
 	/* .releasepage		= nilfs_releasepage, */
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 3a67a6518ddf..e8137efaafec 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -350,14 +350,11 @@ static int ocfs2_readpage(struct file *file, struct page *page)
  * grow out to a tree. If need be, detecting boundary extents could
  * trivially be added in a future version of ocfs2_get_block().
  */
-static int ocfs2_readpages(struct file *filp, struct address_space *mapping,
-			   struct list_head *pages, unsigned nr_pages)
+static void ocfs2_readahead(struct readahead_control *rac)
 {
-	int ret, err = -EIO;
-	struct inode *inode = mapping->host;
+	int ret;
+	struct inode *inode = rac->mapping->host;
 	struct ocfs2_inode_info *oi = OCFS2_I(inode);
-	loff_t start;
-	struct page *last;
 
 	/*
 	 * Use the nonblocking flag for the dlm code to avoid page
@@ -365,36 +362,31 @@ static int ocfs2_readpages(struct file *filp, struct address_space *mapping,
 	 */
 	ret = ocfs2_inode_lock_full(inode, NULL, 0, OCFS2_LOCK_NONBLOCK);
 	if (ret)
-		return err;
+		return;
 
-	if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
-		ocfs2_inode_unlock(inode, 0);
-		return err;
-	}
+	if (down_read_trylock(&oi->ip_alloc_sem) == 0)
+		goto out_unlock;
 
 	/*
 	 * Don't bother with inline-data. There isn't anything
 	 * to read-ahead in that case anyway...
 	 */
 	if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
-		goto out_unlock;
+		goto out_up;
 
 	/*
 	 * Check whether a remote node truncated this file - we just
 	 * drop out in that case as it's not worth handling here.
 	 */
-	last = lru_to_page(pages);
-	start = (loff_t)last->index << PAGE_SHIFT;
-	if (start >= i_size_read(inode))
-		goto out_unlock;
+	if (readahead_offset(rac) >= i_size_read(inode))
+		goto out_up;
 
-	err = mpage_readpages(mapping, pages, nr_pages, ocfs2_get_block);
+	mpage_readahead(rac, ocfs2_get_block);
 
-out_unlock:
+out_up:
 	up_read(&oi->ip_alloc_sem);
+out_unlock:
 	ocfs2_inode_unlock(inode, 0);
-
-	return err;
 }
 
 /* Note: Because we don't support holes, our allocation has
@@ -2474,7 +2466,7 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 const struct address_space_operations ocfs2_aops = {
 	.readpage		= ocfs2_readpage,
-	.readpages		= ocfs2_readpages,
+	.readahead		= ocfs2_readahead,
 	.writepage		= ocfs2_writepage,
 	.write_begin		= ocfs2_write_begin,
 	.write_end		= ocfs2_write_end,
diff --git a/fs/omfs/file.c b/fs/omfs/file.c
index d640b9388238..d7b5f09d298c 100644
--- a/fs/omfs/file.c
+++ b/fs/omfs/file.c
@@ -289,10 +289,9 @@ static int omfs_readpage(struct file *file, struct page *page)
 	return block_read_full_page(page, omfs_get_block);
 }
 
-static int omfs_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void omfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, omfs_get_block);
+	mpage_readahead(rac, omfs_get_block);
 }
 
 static int omfs_writepage(struct page *page, struct writeback_control *wbc)
@@ -373,7 +372,7 @@ const struct inode_operations omfs_file_inops = {
 
 const struct address_space_operations omfs_aops = {
 	.readpage = omfs_readpage,
-	.readpages = omfs_readpages,
+	.readahead = omfs_readahead,
 	.writepage = omfs_writepage,
 	.writepages = omfs_writepages,
 	.write_begin = omfs_write_begin,
diff --git a/fs/qnx6/inode.c b/fs/qnx6/inode.c
index 345db56c98fd..755293c8c71a 100644
--- a/fs/qnx6/inode.c
+++ b/fs/qnx6/inode.c
@@ -99,10 +99,9 @@ static int qnx6_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, qnx6_get_block);
 }
 
-static int qnx6_readpages(struct file *file, struct address_space *mapping,
-		   struct list_head *pages, unsigned nr_pages)
+static void qnx6_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, qnx6_get_block);
+	mpage_readahead(rac, qnx6_get_block);
 }
 
 /*
@@ -499,7 +498,7 @@ static sector_t qnx6_bmap(struct address_space *mapping, sector_t block)
 }
 static const struct address_space_operations qnx6_aops = {
 	.readpage	= qnx6_readpage,
-	.readpages	= qnx6_readpages,
+	.readahead	= qnx6_readahead,
 	.bmap		= qnx6_bmap
 };
 
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 6419e6dacc39..0031070b3692 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -1160,11 +1160,9 @@ int reiserfs_get_block(struct inode *inode, sector_t block,
 	return retval;
 }
 
-static int
-reiserfs_readpages(struct file *file, struct address_space *mapping,
-		   struct list_head *pages, unsigned nr_pages)
+static void reiserfs_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, reiserfs_get_block);
+	mpage_readahead(rac, reiserfs_get_block);
 }
 
 /*
@@ -3434,7 +3432,7 @@ int reiserfs_setattr(struct dentry *dentry, struct iattr *attr)
 const struct address_space_operations reiserfs_address_space_operations = {
 	.writepage = reiserfs_writepage,
 	.readpage = reiserfs_readpage,
-	.readpages = reiserfs_readpages,
+	.readahead = reiserfs_readahead,
 	.releasepage = reiserfs_releasepage,
 	.invalidatepage = reiserfs_invalidatepage,
 	.write_begin = reiserfs_write_begin,
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index e875bc5668ee..adaba8e8b326 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -195,10 +195,9 @@ static int udf_readpage(struct file *file, struct page *page)
 	return mpage_readpage(page, udf_get_block);
 }
 
-static int udf_readpages(struct file *file, struct address_space *mapping,
-			struct list_head *pages, unsigned nr_pages)
+static void udf_readahead(struct readahead_control *rac)
 {
-	return mpage_readpages(mapping, pages, nr_pages, udf_get_block);
+	mpage_readahead(rac, udf_get_block);
 }
 
 static int udf_write_begin(struct file *file, struct address_space *mapping,
@@ -234,7 +233,7 @@ static sector_t udf_bmap(struct address_space *mapping, sector_t block)
 
 const struct address_space_operations udf_aops = {
 	.readpage	= udf_readpage,
-	.readpages	= udf_readpages,
+	.readahead	= udf_readahead,
 	.writepage	= udf_writepage,
 	.writepages	= udf_writepages,
 	.write_begin	= udf_write_begin,
diff --git a/include/linux/mpage.h b/include/linux/mpage.h
index 001f1fcf9836..f4f5e90a6844 100644
--- a/include/linux/mpage.h
+++ b/include/linux/mpage.h
@@ -13,9 +13,9 @@
 #ifdef CONFIG_BLOCK
 
 struct writeback_control;
+struct readahead_control;
 
-int mpage_readpages(struct address_space *mapping, struct list_head *pages,
-				unsigned nr_pages, get_block_t get_block);
+void mpage_readahead(struct readahead_control *, get_block_t get_block);
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block);
diff --git a/mm/migrate.c b/mm/migrate.c
index b1092876e537..a32122095702 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1020,7 +1020,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * to the LRU. Later, when the IO completes the pages are
 		 * marked uptodate and unlocked. However, the queueing
 		 * could be merging multiple pages for one bio (e.g.
-		 * mpage_readpages). If an allocation happens for the
+		 * mpage_readahead). If an allocation happens for the
 		 * second or third page, the process can end up locking
 		 * the same page twice and deadlocking. Rather than
 		 * trying to be clever about what pages can be locked,
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 11/19] btrfs: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (16 preceding siblings ...)
  2020-02-17 18:45 ` [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
@ 2020-02-17 18:45 ` Matthew Wilcox
  2020-02-18  6:57   ` Dave Chinner
  2020-02-17 18:46 ` [PATCH v6 11/16] erofs: Convert compressed files " Matthew Wilcox
                   ` (16 subsequent siblings)
  34 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:45 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in btrfs.  Add a
readahead_for_each_batch() iterator to optimise the loop over pages in
the XArray.
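
Purely as an illustration (not part of the patch), a filesystem could
consume the new iterator along these lines; example_readahead() and
example_submit_read() are hypothetical names:

	static void example_readahead(struct readahead_control *rac)
	{
		struct page *pages[16];
		unsigned int nr;

		/* Each pass gathers up to 16 pages that are already
		 * locked and in the page cache; nr is how many. */
		readahead_for_each_batch(rac, pages, ARRAY_SIZE(pages), nr) {
			/* stand-in for whatever builds and submits one
			 * bio covering pages[0]..pages[nr - 1] */
			example_submit_read(pages, nr);
		}
	}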

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/btrfs/extent_io.c    | 46 +++++++++++++----------------------------
 fs/btrfs/extent_io.h    |  3 +--
 fs/btrfs/inode.c        | 16 +++++++-------
 include/linux/pagemap.h | 27 ++++++++++++++++++++++++
 4 files changed, 49 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c0f202741e09..e97a6acd6f5d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4278,52 +4278,34 @@ int extent_writepages(struct address_space *mapping,
 	return ret;
 }
 
-int extent_readpages(struct address_space *mapping, struct list_head *pages,
-		     unsigned nr_pages)
+void extent_readahead(struct readahead_control *rac)
 {
 	struct bio *bio = NULL;
 	unsigned long bio_flags = 0;
 	struct page *pagepool[16];
 	struct extent_map *em_cached = NULL;
-	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
-	int nr = 0;
+	struct extent_io_tree *tree = &BTRFS_I(rac->mapping->host)->io_tree;
 	u64 prev_em_start = (u64)-1;
+	int nr;
 
-	while (!list_empty(pages)) {
-		u64 contig_end = 0;
-
-		for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) {
-			struct page *page = lru_to_page(pages);
-
-			prefetchw(&page->flags);
-			list_del(&page->lru);
-			if (add_to_page_cache_lru(page, mapping, page->index,
-						readahead_gfp_mask(mapping))) {
-				put_page(page);
-				break;
-			}
-
-			pagepool[nr++] = page;
-			contig_end = page_offset(page) + PAGE_SIZE - 1;
-		}
+	readahead_for_each_batch(rac, pagepool, ARRAY_SIZE(pagepool), nr) {
+		u64 contig_start = page_offset(pagepool[0]);
+		u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1;
 
-		if (nr) {
-			u64 contig_start = page_offset(pagepool[0]);
+		ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
 
-			ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
-
-			contiguous_readpages(tree, pagepool, nr, contig_start,
-				     contig_end, &em_cached, &bio, &bio_flags,
-				     &prev_em_start);
-		}
+		contiguous_readpages(tree, pagepool, nr, contig_start,
+				contig_end, &em_cached, &bio, &bio_flags,
+				&prev_em_start);
 	}
 
 	if (em_cached)
 		free_extent_map(em_cached);
 
-	if (bio)
-		return submit_one_bio(bio, 0, bio_flags);
-	return 0;
+	if (bio) {
+		if (submit_one_bio(bio, 0, bio_flags))
+			return;
+	}
 }
 
 /*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 5d205bbaafdc..bddac32948c7 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -198,8 +198,7 @@ int extent_writepages(struct address_space *mapping,
 		      struct writeback_control *wbc);
 int btree_write_cache_pages(struct address_space *mapping,
 			    struct writeback_control *wbc);
-int extent_readpages(struct address_space *mapping, struct list_head *pages,
-		     unsigned nr_pages);
+void extent_readahead(struct readahead_control *rac);
 int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		__u64 start, __u64 len);
 void set_page_extent_mapped(struct page *page);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7d26b4bfb2c6..61d5137ce4e9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4802,8 +4802,8 @@ static void evict_inode_truncate_pages(struct inode *inode)
 
 	/*
 	 * Keep looping until we have no more ranges in the io tree.
-	 * We can have ongoing bios started by readpages (called from readahead)
-	 * that have their endio callback (extent_io.c:end_bio_extent_readpage)
+	 * We can have ongoing bios started by readahead that have
+	 * their endio callback (extent_io.c:end_bio_extent_readpage)
 	 * still in progress (unlocked the pages in the bio but did not yet
 	 * unlocked the ranges in the io tree). Therefore this means some
 	 * ranges can still be locked and eviction started because before
@@ -7004,11 +7004,11 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 			 * for it to complete) and then invalidate the pages for
 			 * this range (through invalidate_inode_pages2_range()),
 			 * but that can lead us to a deadlock with a concurrent
-			 * call to readpages() (a buffered read or a defrag call
+			 * call to readahead (a buffered read or a defrag call
 			 * triggered a readahead) on a page lock due to an
 			 * ordered dio extent we created before but did not have
 			 * yet a corresponding bio submitted (whence it can not
-			 * complete), which makes readpages() wait for that
+			 * complete), which makes readahead wait for that
 			 * ordered extent to complete while holding a lock on
 			 * that page.
 			 */
@@ -8247,11 +8247,9 @@ static int btrfs_writepages(struct address_space *mapping,
 	return extent_writepages(mapping, wbc);
 }
 
-static int
-btrfs_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void btrfs_readahead(struct readahead_control *rac)
 {
-	return extent_readpages(mapping, pages, nr_pages);
+	extent_readahead(rac);
 }
 
 static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
@@ -10456,7 +10454,7 @@ static const struct address_space_operations btrfs_aops = {
 	.readpage	= btrfs_readpage,
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
-	.readpages	= btrfs_readpages,
+	.readahead	= btrfs_readahead,
 	.direct_IO	= btrfs_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4f36c06d064d..1bbb60a0bf16 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -669,6 +669,33 @@ static inline void readahead_next(struct readahead_control *rac)
 #define readahead_for_each(rac, page)					\
 	for (; (page = readahead_page(rac)); readahead_next(rac))
 
+static inline unsigned int readahead_page_batch(struct readahead_control *rac,
+		struct page **array, unsigned int size)
+{
+	unsigned int batch = 0;
+	XA_STATE(xas, &rac->mapping->i_pages, rac->_start);
+	struct page *page;
+
+	rac->_batch_count = 0;
+	xas_for_each(&xas, page, rac->_start + rac->_nr_pages - 1) {
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		VM_BUG_ON_PAGE(PageTail(page), page);
+		array[batch++] = page;
+		rac->_batch_count += hpage_nr_pages(page);
+		if (PageHead(page))
+			xas_set(&xas, rac->_start + rac->_batch_count);
+
+		if (batch == size)
+			break;
+	}
+
+	return batch;
+}
+
+#define readahead_for_each_batch(rac, array, size, nr)			\
+	for (; (nr = readahead_page_batch(rac, array, size));		\
+			readahead_next(rac))
+
 /* The byte offset into the file of this readahead block */
 static inline loff_t readahead_offset(struct readahead_control *rac)
 {
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 11/16] erofs: Convert compressed files from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (17 preceding siblings ...)
  2020-02-17 18:45 ` [PATCH v6 11/19] btrfs: Convert from readpages to readahead Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-19  2:34   ` Gao Xiang
  2020-02-17 18:46 ` [PATCH v6 12/19] erofs: Convert uncompressed " Matthew Wilcox
                   ` (15 subsequent siblings)
  34 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in erofs.
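
In sketch form (hypothetical function name, details elided), the shape
this conversion takes: pages now arrive already locked and in the page
cache, so the add_to_page_cache_lru() handling disappears and the op
chains the batch through page_private much as z_erofs_readahead() does:

	static void example_readahead(struct readahead_control *rac)
	{
		struct page *page, *head = NULL;

		readahead_for_each(rac, page) {
			/* link pages into a private chain instead of
			 * walking the old ->readpages list */
			set_page_private(page, (unsigned long)head);
			head = page;
		}
		/* ... walk the chain, decompress, unlock the pages ... */
	}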

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/erofs/zdata.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 17f45fcb8c5c..7c02015d501d 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1303,28 +1303,23 @@ static bool should_decompress_synchronously(struct erofs_sb_info *sbi,
 	return nr <= sbi->max_sync_decompress_pages;
 }
 
-static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
-			     struct list_head *pages, unsigned int nr_pages)
+static void z_erofs_readahead(struct readahead_control *rac)
 {
-	struct inode *const inode = mapping->host;
+	struct inode *const inode = rac->mapping->host;
 	struct erofs_sb_info *const sbi = EROFS_I_SB(inode);
 
-	bool sync = should_decompress_synchronously(sbi, nr_pages);
+	bool sync = should_decompress_synchronously(sbi, readahead_count(rac));
 	struct z_erofs_decompress_frontend f = DECOMPRESS_FRONTEND_INIT(inode);
-	gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
-	struct page *head = NULL;
+	struct page *page, *head = NULL;
 	LIST_HEAD(pagepool);
 
-	trace_erofs_readpages(mapping->host, lru_to_page(pages)->index,
-			      nr_pages, false);
+	trace_erofs_readpages(inode, readahead_index(rac),
+			readahead_count(rac), false);
 
-	f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
-
-	for (; nr_pages; --nr_pages) {
-		struct page *page = lru_to_page(pages);
+	f.headoffset = readahead_offset(rac);
 
+	readahead_for_each(rac, page) {
 		prefetchw(&page->flags);
-		list_del(&page->lru);
 
 		/*
 		 * A pure asynchronous readahead is indicated if
@@ -1333,11 +1328,6 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
 		 */
 		sync &= !(PageReadahead(page) && !head);
 
-		if (add_to_page_cache_lru(page, mapping, page->index, gfp)) {
-			list_add(&page->lru, &pagepool);
-			continue;
-		}
-
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
@@ -1366,11 +1356,10 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
 
 	/* clean up the remaining free pages */
 	put_pages_list(&pagepool);
-	return 0;
 }
 
 const struct address_space_operations z_erofs_aops = {
 	.readpage = z_erofs_readpage,
-	.readpages = z_erofs_readpages,
+	.readahead = z_erofs_readahead,
 };
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 12/19] erofs: Convert uncompressed files from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (18 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 11/16] erofs: Convert compressed files " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-19  2:39   ` Gao Xiang
  2020-02-19  3:04   ` Dave Chinner
  2020-02-17 18:46 ` [PATCH v6 12/16] ext4: Convert " Matthew Wilcox
                   ` (14 subsequent siblings)
  34 siblings, 2 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in erofs.
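
A minimal sketch of the accessors this conversion relies on
(hypothetical filesystem; only the readahead_*() helpers come from
this series):

	static void example_readahead(struct readahead_control *rac)
	{
		struct page *page;

		/* readahead_index() is the first page's index,
		 * readahead_offset() the same position in bytes and
		 * readahead_count() the number of pages in the batch. */
		pr_debug("readahead %lu+%u (byte %lld)\n",
			 readahead_index(rac), readahead_count(rac),
			 readahead_offset(rac));

		readahead_for_each(rac, page) {
			/* ... map the block and queue a read for page ... */
			put_page(page);
		}
	}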

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/erofs/data.c              | 39 +++++++++++++-----------------------
 fs/erofs/zdata.c             |  2 +-
 include/trace/events/erofs.h |  6 +++---
 3 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index fc3a8d8064f8..82ebcee9d178 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -280,47 +280,36 @@ static int erofs_raw_access_readpage(struct file *file, struct page *page)
 	return 0;
 }
 
-static int erofs_raw_access_readpages(struct file *filp,
-				      struct address_space *mapping,
-				      struct list_head *pages,
-				      unsigned int nr_pages)
+static void erofs_raw_access_readahead(struct readahead_control *rac)
 {
 	erofs_off_t last_block;
 	struct bio *bio = NULL;
-	gfp_t gfp = readahead_gfp_mask(mapping);
-	struct page *page = list_last_entry(pages, struct page, lru);
-
-	trace_erofs_readpages(mapping->host, page, nr_pages, true);
+	struct page *page;
 
-	for (; nr_pages; --nr_pages) {
-		page = list_entry(pages->prev, struct page, lru);
+	trace_erofs_readpages(rac->mapping->host, readahead_index(rac),
+			readahead_count(rac), true);
 
+	readahead_for_each(rac, page) {
 		prefetchw(&page->flags);
-		list_del(&page->lru);
 
-		if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
-			bio = erofs_read_raw_page(bio, mapping, page,
-						  &last_block, nr_pages, true);
+		bio = erofs_read_raw_page(bio, rac->mapping, page, &last_block,
+				readahead_count(rac), true);
 
-			/* all the page errors are ignored when readahead */
-			if (IS_ERR(bio)) {
-				pr_err("%s, readahead error at page %lu of nid %llu\n",
-				       __func__, page->index,
-				       EROFS_I(mapping->host)->nid);
+		/* all the page errors are ignored when readahead */
+		if (IS_ERR(bio)) {
+			pr_err("%s, readahead error at page %lu of nid %llu\n",
+			       __func__, page->index,
+			       EROFS_I(rac->mapping->host)->nid);
 
-				bio = NULL;
-			}
+			bio = NULL;
 		}
 
-		/* pages could still be locked */
 		put_page(page);
 	}
-	DBG_BUGON(!list_empty(pages));
 
 	/* the rare case (end in gaps) */
 	if (bio)
 		submit_bio(bio);
-	return 0;
 }
 
 static int erofs_get_block(struct inode *inode, sector_t iblock,
@@ -358,7 +347,7 @@ static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
 /* for uncompressed (aligned) files and raw access for other files */
 const struct address_space_operations erofs_raw_access_aops = {
 	.readpage = erofs_raw_access_readpage,
-	.readpages = erofs_raw_access_readpages,
+	.readahead = erofs_raw_access_readahead,
 	.bmap = erofs_bmap,
 };
 
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 80e47f07d946..17f45fcb8c5c 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1315,7 +1315,7 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
 	struct page *head = NULL;
 	LIST_HEAD(pagepool);
 
-	trace_erofs_readpages(mapping->host, lru_to_page(pages),
+	trace_erofs_readpages(mapping->host, lru_to_page(pages)->index,
 			      nr_pages, false);
 
 	f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
diff --git a/include/trace/events/erofs.h b/include/trace/events/erofs.h
index 27f5caa6299a..bf9806fd1306 100644
--- a/include/trace/events/erofs.h
+++ b/include/trace/events/erofs.h
@@ -113,10 +113,10 @@ TRACE_EVENT(erofs_readpage,
 
 TRACE_EVENT(erofs_readpages,
 
-	TP_PROTO(struct inode *inode, struct page *page, unsigned int nrpage,
+	TP_PROTO(struct inode *inode, pgoff_t start, unsigned int nrpage,
 		bool raw),
 
-	TP_ARGS(inode, page, nrpage, raw),
+	TP_ARGS(inode, start, nrpage, raw),
 
 	TP_STRUCT__entry(
 		__field(dev_t,		dev	)
@@ -129,7 +129,7 @@ TRACE_EVENT(erofs_readpages,
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->nid	= EROFS_I(inode)->nid;
-		__entry->start	= page->index;
+		__entry->start	= start;
 		__entry->nrpage	= nrpage;
 		__entry->raw	= raw;
 	),
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 12/16] ext4: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (19 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 12/19] erofs: Convert uncompressed " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-17 18:46 ` [PATCH v6 13/19] erofs: Convert compressed files " Matthew Wilcox
                   ` (13 subsequent siblings)
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in ext4.
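
The calling convention used below reduces to this sketch (names
hypothetical): one worker serves both ->readpage and ->readahead, with
a NULL readahead_control meaning "just this single page":

	static int example_mpage_readpages(struct address_space *mapping,
			struct readahead_control *rac, struct page *page)
	{
		unsigned int nr_pages = rac ? readahead_count(rac) : 1;

		for (; nr_pages; nr_pages--) {
			if (rac)
				page = readahead_page(rac);
			/* ... map blocks and queue the read for page ... */
			if (rac)
				put_page(page);
		}
		return 0;
	}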

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ext4/ext4.h     |  3 +--
 fs/ext4/inode.c    | 23 ++++++++++-------------
 fs/ext4/readpage.c | 22 ++++++++--------------
 3 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9a2ee2428ecc..3af755da101d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3276,8 +3276,7 @@ static inline void ext4_set_de_type(struct super_block *sb,
 
 /* readpages.c */
 extern int ext4_mpage_readpages(struct address_space *mapping,
-				struct list_head *pages, struct page *page,
-				unsigned nr_pages, bool is_readahead);
+		struct readahead_control *rac, struct page *page);
 extern int __init ext4_init_post_read_processing(void);
 extern void ext4_exit_post_read_processing(void);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1305b810c44a..7770e38e17e7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3218,7 +3218,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
 static int ext4_readpage(struct file *file, struct page *page)
 {
 	int ret = -EAGAIN;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file_inode(file);
 
 	trace_ext4_readpage(page);
 
@@ -3226,23 +3226,20 @@ static int ext4_readpage(struct file *file, struct page *page)
 		ret = ext4_readpage_inline(inode, page);
 
 	if (ret == -EAGAIN)
-		return ext4_mpage_readpages(page->mapping, NULL, page, 1,
-						false);
+		return ext4_mpage_readpages(page->mapping, NULL, page);
 
 	return ret;
 }
 
-static int
-ext4_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void ext4_readahead(struct readahead_control *rac)
 {
-	struct inode *inode = mapping->host;
+	struct inode *inode = rac->mapping->host;
 
-	/* If the file has inline data, no need to do readpages. */
+	/* If the file has inline data, no need to do readahead. */
 	if (ext4_has_inline_data(inode))
-		return 0;
+		return;
 
-	return ext4_mpage_readpages(mapping, pages, NULL, nr_pages, true);
+	ext4_mpage_readpages(rac->mapping, rac, NULL);
 }
 
 static void ext4_invalidatepage(struct page *page, unsigned int offset,
@@ -3587,7 +3584,7 @@ static int ext4_set_page_dirty(struct page *page)
 
 static const struct address_space_operations ext4_aops = {
 	.readpage		= ext4_readpage,
-	.readpages		= ext4_readpages,
+	.readahead		= ext4_readahead,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_write_begin,
@@ -3604,7 +3601,7 @@ static const struct address_space_operations ext4_aops = {
 
 static const struct address_space_operations ext4_journalled_aops = {
 	.readpage		= ext4_readpage,
-	.readpages		= ext4_readpages,
+	.readahead		= ext4_readahead,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_write_begin,
@@ -3620,7 +3617,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 
 static const struct address_space_operations ext4_da_aops = {
 	.readpage		= ext4_readpage,
-	.readpages		= ext4_readpages,
+	.readahead		= ext4_readahead,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_da_write_begin,
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index c1769afbf799..e14841ade612 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -7,8 +7,8 @@
  *
  * This was originally taken from fs/mpage.c
  *
- * The intent is the ext4_mpage_readpages() function here is intended
- * to replace mpage_readpages() in the general case, not just for
+ * The ext4_mpage_readpages() function here is intended to
+ * replace mpage_readahead() in the general case, not just for
  * encrypted files.  It has some limitations (see below), where it
  * will fall back to read_block_full_page(), but these limitations
  * should only be hit when page_size != block_size.
@@ -222,8 +222,7 @@ static inline loff_t ext4_readpage_limit(struct inode *inode)
 }
 
 int ext4_mpage_readpages(struct address_space *mapping,
-			 struct list_head *pages, struct page *page,
-			 unsigned nr_pages, bool is_readahead)
+		struct readahead_control *rac, struct page *page)
 {
 	struct bio *bio = NULL;
 	sector_t last_block_in_bio = 0;
@@ -241,6 +240,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
 	int length;
 	unsigned relative_block = 0;
 	struct ext4_map_blocks map;
+	unsigned int nr_pages = rac ? readahead_count(rac) : 1;
 
 	map.m_pblk = 0;
 	map.m_lblk = 0;
@@ -251,14 +251,9 @@ int ext4_mpage_readpages(struct address_space *mapping,
 		int fully_mapped = 1;
 		unsigned first_hole = blocks_per_page;
 
-		if (pages) {
-			page = lru_to_page(pages);
-
+		if (rac) {
+			page = readahead_page(rac);
 			prefetchw(&page->flags);
-			list_del(&page->lru);
-			if (add_to_page_cache_lru(page, mapping, page->index,
-				  readahead_gfp_mask(mapping)))
-				goto next_page;
 		}
 
 		if (page_has_buffers(page))
@@ -381,7 +376,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
 			bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9);
 			bio->bi_end_io = mpage_end_io;
 			bio_set_op_attrs(bio, REQ_OP_READ,
-						is_readahead ? REQ_RAHEAD : 0);
+						rac ? REQ_RAHEAD : 0);
 		}
 
 		length = first_hole << blkbits;
@@ -406,10 +401,9 @@ int ext4_mpage_readpages(struct address_space *mapping,
 		else
 			unlock_page(page);
 	next_page:
-		if (pages)
+		if (rac)
 			put_page(page);
 	}
-	BUG_ON(pages && !list_empty(pages));
 	if (bio)
 		submit_bio(bio);
 	return 0;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 13/19] erofs: Convert compressed files from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (20 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 12/16] ext4: Convert " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-19  3:08   ` Dave Chinner
  2020-02-17 18:46 ` [PATCH v6 13/16] f2fs: Convert " Matthew Wilcox
                   ` (12 subsequent siblings)
  34 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in erofs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/erofs/zdata.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 17f45fcb8c5c..7c02015d501d 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1303,28 +1303,23 @@ static bool should_decompress_synchronously(struct erofs_sb_info *sbi,
 	return nr <= sbi->max_sync_decompress_pages;
 }
 
-static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
-			     struct list_head *pages, unsigned int nr_pages)
+static void z_erofs_readahead(struct readahead_control *rac)
 {
-	struct inode *const inode = mapping->host;
+	struct inode *const inode = rac->mapping->host;
 	struct erofs_sb_info *const sbi = EROFS_I_SB(inode);
 
-	bool sync = should_decompress_synchronously(sbi, nr_pages);
+	bool sync = should_decompress_synchronously(sbi, readahead_count(rac));
 	struct z_erofs_decompress_frontend f = DECOMPRESS_FRONTEND_INIT(inode);
-	gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
-	struct page *head = NULL;
+	struct page *page, *head = NULL;
 	LIST_HEAD(pagepool);
 
-	trace_erofs_readpages(mapping->host, lru_to_page(pages)->index,
-			      nr_pages, false);
+	trace_erofs_readpages(inode, readahead_index(rac),
+			readahead_count(rac), false);
 
-	f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
-
-	for (; nr_pages; --nr_pages) {
-		struct page *page = lru_to_page(pages);
+	f.headoffset = readahead_offset(rac);
 
+	readahead_for_each(rac, page) {
 		prefetchw(&page->flags);
-		list_del(&page->lru);
 
 		/*
 		 * A pure asynchronous readahead is indicated if
@@ -1333,11 +1328,6 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
 		 */
 		sync &= !(PageReadahead(page) && !head);
 
-		if (add_to_page_cache_lru(page, mapping, page->index, gfp)) {
-			list_add(&page->lru, &pagepool);
-			continue;
-		}
-
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
@@ -1366,11 +1356,10 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
 
 	/* clean up the remaining free pages */
 	put_pages_list(&pagepool);
-	return 0;
 }
 
 const struct address_space_operations z_erofs_aops = {
 	.readpage = z_erofs_readpage,
-	.readpages = z_erofs_readpages,
+	.readahead = z_erofs_readahead,
 };
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 13/16] f2fs: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (21 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 13/19] erofs: Convert compressed files " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-17 18:46 ` [PATCH v6 14/19] ext4: " Matthew Wilcox
                   ` (11 subsequent siblings)
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in f2fs.
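
One behavioural point worth a sketch: ->readahead returns void, so a
conversion like this can simply bail out and leave the pages to be read
by later ->readpage calls; example_backend_ready() is a hypothetical
stand-in for checks like f2fs_is_compress_backend_ready():

	static void example_readahead(struct readahead_control *rac)
	{
		if (!example_backend_ready(rac->mapping->host))
			return;	/* untouched pages fall back to ->readpage */

		/* ... normal multi-page read path ... */
	}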

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/f2fs/data.c              | 50 +++++++++++++++----------------------
 fs/f2fs/f2fs.h              |  5 ++--
 include/trace/events/f2fs.h |  6 ++---
 3 files changed, 25 insertions(+), 36 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b27b72107911..87964e4cb6b8 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2159,13 +2159,11 @@ int f2fs_read_multi_pages(struct compress_ctx *cc, struct bio **bio_ret,
  * use ->readpage() or do the necessary surgery to decouple ->readpages()
  * from read-ahead.
  */
-int f2fs_mpage_readpages(struct address_space *mapping,
-			struct list_head *pages, struct page *page,
-			unsigned nr_pages, bool is_readahead)
+int f2fs_mpage_readpages(struct inode *inode, struct readahead_control *rac,
+		struct page *page)
 {
 	struct bio *bio = NULL;
 	sector_t last_block_in_bio = 0;
-	struct inode *inode = mapping->host;
 	struct f2fs_map_blocks map;
 #ifdef CONFIG_F2FS_FS_COMPRESSION
 	struct compress_ctx cc = {
@@ -2179,6 +2177,7 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 		.nr_cpages = 0,
 	};
 #endif
+	unsigned nr_pages = rac ? readahead_count(rac) : 1;
 	unsigned max_nr_pages = nr_pages;
 	int ret = 0;
 
@@ -2192,15 +2191,9 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 	map.m_may_create = false;
 
 	for (; nr_pages; nr_pages--) {
-		if (pages) {
-			page = list_last_entry(pages, struct page, lru);
-
+		if (rac) {
+			page = readahead_page(rac);
 			prefetchw(&page->flags);
-			list_del(&page->lru);
-			if (add_to_page_cache_lru(page, mapping,
-						  page_index(page),
-						  readahead_gfp_mask(mapping)))
-				goto next_page;
 		}
 
 #ifdef CONFIG_F2FS_FS_COMPRESSION
@@ -2210,7 +2203,7 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 				ret = f2fs_read_multi_pages(&cc, &bio,
 							max_nr_pages,
 							&last_block_in_bio,
-							is_readahead);
+							rac);
 				f2fs_destroy_compress_ctx(&cc);
 				if (ret)
 					goto set_error_page;
@@ -2233,7 +2226,7 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 #endif
 
 		ret = f2fs_read_single_page(inode, page, max_nr_pages, &map,
-					&bio, &last_block_in_bio, is_readahead);
+					&bio, &last_block_in_bio, rac);
 		if (ret) {
 #ifdef CONFIG_F2FS_FS_COMPRESSION
 set_error_page:
@@ -2242,8 +2235,10 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 			zero_user_segment(page, 0, PAGE_SIZE);
 			unlock_page(page);
 		}
+#ifdef CONFIG_F2FS_FS_COMPRESSION
 next_page:
-		if (pages)
+#endif
+		if (rac)
 			put_page(page);
 
 #ifdef CONFIG_F2FS_FS_COMPRESSION
@@ -2253,16 +2248,15 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 				ret = f2fs_read_multi_pages(&cc, &bio,
 							max_nr_pages,
 							&last_block_in_bio,
-							is_readahead);
+							rac);
 				f2fs_destroy_compress_ctx(&cc);
 			}
 		}
 #endif
 	}
-	BUG_ON(pages && !list_empty(pages));
 	if (bio)
 		__submit_bio(F2FS_I_SB(inode), bio, DATA);
-	return pages ? 0 : ret;
+	return ret;
 }
 
 static int f2fs_read_data_page(struct file *file, struct page *page)
@@ -2281,28 +2275,24 @@ static int f2fs_read_data_page(struct file *file, struct page *page)
 	if (f2fs_has_inline_data(inode))
 		ret = f2fs_read_inline_data(inode, page);
 	if (ret == -EAGAIN)
-		ret = f2fs_mpage_readpages(page_file_mapping(page),
-						NULL, page, 1, false);
+		ret = f2fs_mpage_readpages(inode, NULL, page);
 	return ret;
 }
 
-static int f2fs_read_data_pages(struct file *file,
-			struct address_space *mapping,
-			struct list_head *pages, unsigned nr_pages)
+static void f2fs_readahead(struct readahead_control *rac)
 {
-	struct inode *inode = mapping->host;
-	struct page *page = list_last_entry(pages, struct page, lru);
+	struct inode *inode = rac->mapping->host;
 
-	trace_f2fs_readpages(inode, page, nr_pages);
+	trace_f2fs_readpages(inode, readahead_index(rac), readahead_count(rac));
 
 	if (!f2fs_is_compress_backend_ready(inode))
-		return 0;
+		return;
 
 	/* If the file has inline data, skip readpages */
 	if (f2fs_has_inline_data(inode))
-		return 0;
+		return;
 
-	return f2fs_mpage_readpages(mapping, pages, NULL, nr_pages, true);
+	f2fs_mpage_readpages(inode, rac, NULL);
 }
 
 int f2fs_encrypt_one_page(struct f2fs_io_info *fio)
@@ -3784,7 +3774,7 @@ static void f2fs_swap_deactivate(struct file *file)
 
 const struct address_space_operations f2fs_dblock_aops = {
 	.readpage	= f2fs_read_data_page,
-	.readpages	= f2fs_read_data_pages,
+	.readahead	= f2fs_readahead,
 	.writepage	= f2fs_write_data_page,
 	.writepages	= f2fs_write_data_pages,
 	.write_begin	= f2fs_write_begin,
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 5355be6b6755..b5e72dee8826 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -3344,9 +3344,8 @@ int f2fs_reserve_new_block(struct dnode_of_data *dn);
 int f2fs_get_block(struct dnode_of_data *dn, pgoff_t index);
 int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from);
 int f2fs_reserve_block(struct dnode_of_data *dn, pgoff_t index);
-int f2fs_mpage_readpages(struct address_space *mapping,
-			struct list_head *pages, struct page *page,
-			unsigned nr_pages, bool is_readahead);
+int f2fs_mpage_readpages(struct inode *inode, struct readahead_control *rac,
+		struct page *page);
 struct page *f2fs_get_read_data_page(struct inode *inode, pgoff_t index,
 			int op_flags, bool for_write);
 struct page *f2fs_find_data_page(struct inode *inode, pgoff_t index);
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index 67a97838c2a0..d72da4a33883 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -1375,9 +1375,9 @@ TRACE_EVENT(f2fs_writepages,
 
 TRACE_EVENT(f2fs_readpages,
 
-	TP_PROTO(struct inode *inode, struct page *page, unsigned int nrpage),
+	TP_PROTO(struct inode *inode, pgoff_t start, unsigned int nrpage),
 
-	TP_ARGS(inode, page, nrpage),
+	TP_ARGS(inode, start, nrpage),
 
 	TP_STRUCT__entry(
 		__field(dev_t,	dev)
@@ -1389,7 +1389,7 @@ TRACE_EVENT(f2fs_readpages,
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= page->index;
+		__entry->start	= start;
 		__entry->nrpage	= nrpage;
 	),
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 14/19] ext4: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (22 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 13/16] f2fs: Convert " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-19  3:16   ` Dave Chinner
  2020-02-19  3:29   ` Eric Biggers
  2020-02-17 18:46 ` [PATCH v6 14/16] fuse: " Matthew Wilcox
                   ` (10 subsequent siblings)
  34 siblings, 2 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in ext4.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ext4/ext4.h     |  3 +--
 fs/ext4/inode.c    | 23 ++++++++++-------------
 fs/ext4/readpage.c | 22 ++++++++--------------
 3 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4441331d06cc..1570a0b51b73 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3279,8 +3279,7 @@ static inline void ext4_set_de_type(struct super_block *sb,
 
 /* readpages.c */
 extern int ext4_mpage_readpages(struct address_space *mapping,
-				struct list_head *pages, struct page *page,
-				unsigned nr_pages, bool is_readahead);
+		struct readahead_control *rac, struct page *page);
 extern int __init ext4_init_post_read_processing(void);
 extern void ext4_exit_post_read_processing(void);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e60aca791d3f..b3349bfb75b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3218,7 +3218,7 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
 static int ext4_readpage(struct file *file, struct page *page)
 {
 	int ret = -EAGAIN;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = file_inode(file);
 
 	trace_ext4_readpage(page);
 
@@ -3226,23 +3226,20 @@ static int ext4_readpage(struct file *file, struct page *page)
 		ret = ext4_readpage_inline(inode, page);
 
 	if (ret == -EAGAIN)
-		return ext4_mpage_readpages(page->mapping, NULL, page, 1,
-						false);
+		return ext4_mpage_readpages(page->mapping, NULL, page);
 
 	return ret;
 }
 
-static int
-ext4_readpages(struct file *file, struct address_space *mapping,
-		struct list_head *pages, unsigned nr_pages)
+static void ext4_readahead(struct readahead_control *rac)
 {
-	struct inode *inode = mapping->host;
+	struct inode *inode = rac->mapping->host;
 
-	/* If the file has inline data, no need to do readpages. */
+	/* If the file has inline data, no need to do readahead. */
 	if (ext4_has_inline_data(inode))
-		return 0;
+		return;
 
-	return ext4_mpage_readpages(mapping, pages, NULL, nr_pages, true);
+	ext4_mpage_readpages(rac->mapping, rac, NULL);
 }
 
 static void ext4_invalidatepage(struct page *page, unsigned int offset,
@@ -3587,7 +3584,7 @@ static int ext4_set_page_dirty(struct page *page)
 
 static const struct address_space_operations ext4_aops = {
 	.readpage		= ext4_readpage,
-	.readpages		= ext4_readpages,
+	.readahead		= ext4_readahead,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_write_begin,
@@ -3604,7 +3601,7 @@ static const struct address_space_operations ext4_aops = {
 
 static const struct address_space_operations ext4_journalled_aops = {
 	.readpage		= ext4_readpage,
-	.readpages		= ext4_readpages,
+	.readahead		= ext4_readahead,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_write_begin,
@@ -3620,7 +3617,7 @@ static const struct address_space_operations ext4_journalled_aops = {
 
 static const struct address_space_operations ext4_da_aops = {
 	.readpage		= ext4_readpage,
-	.readpages		= ext4_readpages,
+	.readahead		= ext4_readahead,
 	.writepage		= ext4_writepage,
 	.writepages		= ext4_writepages,
 	.write_begin		= ext4_da_write_begin,
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index c1769afbf799..e14841ade612 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -7,8 +7,8 @@
  *
  * This was originally taken from fs/mpage.c
  *
- * The intent is the ext4_mpage_readpages() function here is intended
- * to replace mpage_readpages() in the general case, not just for
+ * The ext4_mpage_readpages() function here is intended to
+ * replace mpage_readahead() in the general case, not just for
  * encrypted files.  It has some limitations (see below), where it
  * will fall back to read_block_full_page(), but these limitations
  * should only be hit when page_size != block_size.
@@ -222,8 +222,7 @@ static inline loff_t ext4_readpage_limit(struct inode *inode)
 }
 
 int ext4_mpage_readpages(struct address_space *mapping,
-			 struct list_head *pages, struct page *page,
-			 unsigned nr_pages, bool is_readahead)
+		struct readahead_control *rac, struct page *page)
 {
 	struct bio *bio = NULL;
 	sector_t last_block_in_bio = 0;
@@ -241,6 +240,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
 	int length;
 	unsigned relative_block = 0;
 	struct ext4_map_blocks map;
+	unsigned int nr_pages = rac ? readahead_count(rac) : 1;
 
 	map.m_pblk = 0;
 	map.m_lblk = 0;
@@ -251,14 +251,9 @@ int ext4_mpage_readpages(struct address_space *mapping,
 		int fully_mapped = 1;
 		unsigned first_hole = blocks_per_page;
 
-		if (pages) {
-			page = lru_to_page(pages);
-
+		if (rac) {
+			page = readahead_page(rac);
 			prefetchw(&page->flags);
-			list_del(&page->lru);
-			if (add_to_page_cache_lru(page, mapping, page->index,
-				  readahead_gfp_mask(mapping)))
-				goto next_page;
 		}
 
 		if (page_has_buffers(page))
@@ -381,7 +376,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
 			bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9);
 			bio->bi_end_io = mpage_end_io;
 			bio_set_op_attrs(bio, REQ_OP_READ,
-						is_readahead ? REQ_RAHEAD : 0);
+						rac ? REQ_RAHEAD : 0);
 		}
 
 		length = first_hole << blkbits;
@@ -406,10 +401,9 @@ int ext4_mpage_readpages(struct address_space *mapping,
 		else
 			unlock_page(page);
 	next_page:
-		if (pages)
+		if (rac)
 			put_page(page);
 	}
-	BUG_ON(pages && !list_empty(pages));
 	if (bio)
 		submit_bio(bio);
 	return 0;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 14/16] fuse: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (23 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 14/19] ext4: " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-17 18:46 ` [PATCH v6 15/19] f2fs: " Matthew Wilcox
                   ` (9 subsequent siblings)
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in fuse.  Switching away from the
read_cache_pages() helper removes an implicit call to put_page(), so
the matching get_page() call in fuse_readpages_fill() can be dropped.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/fuse/file.c | 46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9d67b830fb7a..f64f98708b5e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -923,9 +923,8 @@ struct fuse_fill_data {
 	unsigned int max_pages;
 };
 
-static int fuse_readpages_fill(void *_data, struct page *page)
+static int fuse_readpages_fill(struct fuse_fill_data *data, struct page *page)
 {
-	struct fuse_fill_data *data = _data;
 	struct fuse_io_args *ia = data->ia;
 	struct fuse_args_pages *ap = &ia->ap;
 	struct inode *inode = data->inode;
@@ -941,10 +940,8 @@ static int fuse_readpages_fill(void *_data, struct page *page)
 					fc->max_pages);
 		fuse_send_readpages(ia, data->file);
 		data->ia = ia = fuse_io_alloc(NULL, data->max_pages);
-		if (!ia) {
-			unlock_page(page);
+		if (!ia)
 			return -ENOMEM;
-		}
 		ap = &ia->ap;
 	}
 
@@ -954,7 +951,6 @@ static int fuse_readpages_fill(void *_data, struct page *page)
 		return -EIO;
 	}
 
-	get_page(page);
 	ap->pages[ap->num_pages] = page;
 	ap->descs[ap->num_pages].length = PAGE_SIZE;
 	ap->num_pages++;
@@ -962,37 +958,33 @@ static int fuse_readpages_fill(void *_data, struct page *page)
 	return 0;
 }
 
-static int fuse_readpages(struct file *file, struct address_space *mapping,
-			  struct list_head *pages, unsigned nr_pages)
+static void fuse_readahead(struct readahead_control *rac)
 {
-	struct inode *inode = mapping->host;
+	struct inode *inode = rac->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_fill_data data;
-	int err;
+	struct page *page;
 
-	err = -EIO;
 	if (is_bad_inode(inode))
-		goto out;
+		return;
 
-	data.file = file;
+	data.file = rac->file;
 	data.inode = inode;
-	data.nr_pages = nr_pages;
-	data.max_pages = min_t(unsigned int, nr_pages, fc->max_pages);
-;
+	data.nr_pages = readahead_count(rac);
+	data.max_pages = min_t(unsigned int, data.nr_pages, fc->max_pages);
 	data.ia = fuse_io_alloc(NULL, data.max_pages);
-	err = -ENOMEM;
 	if (!data.ia)
-		goto out;
+		return;
 
-	err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data);
-	if (!err) {
-		if (data.ia->ap.num_pages)
-			fuse_send_readpages(data.ia, file);
-		else
-			fuse_io_free(data.ia);
+	readahead_for_each(rac, page) {
+		if (fuse_readpages_fill(&data, page) != 0)
+			return;
 	}
-out:
-	return err;
+
+	if (data.ia->ap.num_pages)
+		fuse_send_readpages(data.ia, rac->file);
+	else
+		fuse_io_free(data.ia);
 }
 
 static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
@@ -3373,10 +3365,10 @@ static const struct file_operations fuse_file_operations = {
 
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
+	.readahead	= fuse_readahead,
 	.writepage	= fuse_writepage,
 	.writepages	= fuse_writepages,
 	.launder_page	= fuse_launder_page,
-	.readpages	= fuse_readpages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 15/19] f2fs: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (24 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 14/16] fuse: " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-17 18:46 ` [PATCH v6 15/16] iomap: " Matthew Wilcox
                   ` (8 subsequent siblings)
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in f2fs.
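
As a rough sketch, not part of the patch itself, the calling convention
adopted here lets one helper serve both ->readpage and ->readahead: a
NULL rac means a single caller-supplied page, while a non-NULL rac
supplies the pages.  Names prefixed foo_ are illustrative:

int foo_mpage_readpages(struct inode *inode, struct readahead_control *rac,
		struct page *page)
{
	/* one page for ->readpage, a batch for ->readahead */
	unsigned nr_pages = rac ? readahead_count(rac) : 1;

	for (; nr_pages; nr_pages--) {
		if (rac) {
			/* already locked and in the page cache */
			page = readahead_page(rac);
		}
		/* ... map blocks and queue @page for read I/O ... */
		if (rac)
			put_page(page);
	}
	return 0;
}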

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/f2fs/data.c              | 50 +++++++++++++++----------------------
 fs/f2fs/f2fs.h              |  5 ++--
 include/trace/events/f2fs.h |  6 ++---
 3 files changed, 25 insertions(+), 36 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b27b72107911..87964e4cb6b8 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2159,13 +2159,11 @@ int f2fs_read_multi_pages(struct compress_ctx *cc, struct bio **bio_ret,
  * use ->readpage() or do the necessary surgery to decouple ->readpages()
  * from read-ahead.
  */
-int f2fs_mpage_readpages(struct address_space *mapping,
-			struct list_head *pages, struct page *page,
-			unsigned nr_pages, bool is_readahead)
+int f2fs_mpage_readpages(struct inode *inode, struct readahead_control *rac,
+		struct page *page)
 {
 	struct bio *bio = NULL;
 	sector_t last_block_in_bio = 0;
-	struct inode *inode = mapping->host;
 	struct f2fs_map_blocks map;
 #ifdef CONFIG_F2FS_FS_COMPRESSION
 	struct compress_ctx cc = {
@@ -2179,6 +2177,7 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 		.nr_cpages = 0,
 	};
 #endif
+	unsigned nr_pages = rac ? readahead_count(rac) : 1;
 	unsigned max_nr_pages = nr_pages;
 	int ret = 0;
 
@@ -2192,15 +2191,9 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 	map.m_may_create = false;
 
 	for (; nr_pages; nr_pages--) {
-		if (pages) {
-			page = list_last_entry(pages, struct page, lru);
-
+		if (rac) {
+			page = readahead_page(rac);
 			prefetchw(&page->flags);
-			list_del(&page->lru);
-			if (add_to_page_cache_lru(page, mapping,
-						  page_index(page),
-						  readahead_gfp_mask(mapping)))
-				goto next_page;
 		}
 
 #ifdef CONFIG_F2FS_FS_COMPRESSION
@@ -2210,7 +2203,7 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 				ret = f2fs_read_multi_pages(&cc, &bio,
 							max_nr_pages,
 							&last_block_in_bio,
-							is_readahead);
+							rac);
 				f2fs_destroy_compress_ctx(&cc);
 				if (ret)
 					goto set_error_page;
@@ -2233,7 +2226,7 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 #endif
 
 		ret = f2fs_read_single_page(inode, page, max_nr_pages, &map,
-					&bio, &last_block_in_bio, is_readahead);
+					&bio, &last_block_in_bio, rac);
 		if (ret) {
 #ifdef CONFIG_F2FS_FS_COMPRESSION
 set_error_page:
@@ -2242,8 +2235,10 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 			zero_user_segment(page, 0, PAGE_SIZE);
 			unlock_page(page);
 		}
+#ifdef CONFIG_F2FS_FS_COMPRESSION
 next_page:
-		if (pages)
+#endif
+		if (rac)
 			put_page(page);
 
 #ifdef CONFIG_F2FS_FS_COMPRESSION
@@ -2253,16 +2248,15 @@ int f2fs_mpage_readpages(struct address_space *mapping,
 				ret = f2fs_read_multi_pages(&cc, &bio,
 							max_nr_pages,
 							&last_block_in_bio,
-							is_readahead);
+							rac);
 				f2fs_destroy_compress_ctx(&cc);
 			}
 		}
 #endif
 	}
-	BUG_ON(pages && !list_empty(pages));
 	if (bio)
 		__submit_bio(F2FS_I_SB(inode), bio, DATA);
-	return pages ? 0 : ret;
+	return ret;
 }
 
 static int f2fs_read_data_page(struct file *file, struct page *page)
@@ -2281,28 +2275,24 @@ static int f2fs_read_data_page(struct file *file, struct page *page)
 	if (f2fs_has_inline_data(inode))
 		ret = f2fs_read_inline_data(inode, page);
 	if (ret == -EAGAIN)
-		ret = f2fs_mpage_readpages(page_file_mapping(page),
-						NULL, page, 1, false);
+		ret = f2fs_mpage_readpages(inode, NULL, page);
 	return ret;
 }
 
-static int f2fs_read_data_pages(struct file *file,
-			struct address_space *mapping,
-			struct list_head *pages, unsigned nr_pages)
+static void f2fs_readahead(struct readahead_control *rac)
 {
-	struct inode *inode = mapping->host;
-	struct page *page = list_last_entry(pages, struct page, lru);
+	struct inode *inode = rac->mapping->host;
 
-	trace_f2fs_readpages(inode, page, nr_pages);
+	trace_f2fs_readpages(inode, readahead_index(rac), readahead_count(rac));
 
 	if (!f2fs_is_compress_backend_ready(inode))
-		return 0;
+		return;
 
 	/* If the file has inline data, skip readpages */
 	if (f2fs_has_inline_data(inode))
-		return 0;
+		return;
 
-	return f2fs_mpage_readpages(mapping, pages, NULL, nr_pages, true);
+	f2fs_mpage_readpages(inode, rac, NULL);
 }
 
 int f2fs_encrypt_one_page(struct f2fs_io_info *fio)
@@ -3784,7 +3774,7 @@ static void f2fs_swap_deactivate(struct file *file)
 
 const struct address_space_operations f2fs_dblock_aops = {
 	.readpage	= f2fs_read_data_page,
-	.readpages	= f2fs_read_data_pages,
+	.readahead	= f2fs_readahead,
 	.writepage	= f2fs_write_data_page,
 	.writepages	= f2fs_write_data_pages,
 	.write_begin	= f2fs_write_begin,
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 5355be6b6755..b5e72dee8826 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -3344,9 +3344,8 @@ int f2fs_reserve_new_block(struct dnode_of_data *dn);
 int f2fs_get_block(struct dnode_of_data *dn, pgoff_t index);
 int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from);
 int f2fs_reserve_block(struct dnode_of_data *dn, pgoff_t index);
-int f2fs_mpage_readpages(struct address_space *mapping,
-			struct list_head *pages, struct page *page,
-			unsigned nr_pages, bool is_readahead);
+int f2fs_mpage_readpages(struct inode *inode, struct readahead_control *rac,
+		struct page *page);
 struct page *f2fs_get_read_data_page(struct inode *inode, pgoff_t index,
 			int op_flags, bool for_write);
 struct page *f2fs_find_data_page(struct inode *inode, pgoff_t index);
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index 67a97838c2a0..d72da4a33883 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -1375,9 +1375,9 @@ TRACE_EVENT(f2fs_writepages,
 
 TRACE_EVENT(f2fs_readpages,
 
-	TP_PROTO(struct inode *inode, struct page *page, unsigned int nrpage),
+	TP_PROTO(struct inode *inode, pgoff_t start, unsigned int nrpage),
 
-	TP_ARGS(inode, page, nrpage),
+	TP_ARGS(inode, start, nrpage),
 
 	TP_STRUCT__entry(
 		__field(dev_t,	dev)
@@ -1389,7 +1389,7 @@ TRACE_EVENT(f2fs_readpages,
 	TP_fast_assign(
 		__entry->dev	= inode->i_sb->s_dev;
 		__entry->ino	= inode->i_ino;
-		__entry->start	= page->index;
+		__entry->start	= start;
 		__entry->nrpage	= nrpage;
 	),
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 15/16] iomap: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (25 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 15/19] f2fs: " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-17 18:46 ` [PATCH v6 16/19] fuse: " Matthew Wilcox
                   ` (7 subsequent siblings)
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in iomap.  Convert XFS and ZoneFS to
use it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 116 ++++++++++++++++-------------------------
 fs/iomap/trace.h       |   2 +-
 fs/xfs/xfs_aops.c      |  13 ++---
 fs/zonefs/super.c      |   7 ++-
 include/linux/iomap.h  |   3 +-
 5 files changed, 54 insertions(+), 87 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index cb3511eb152a..2bfcd5242264 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -214,9 +214,8 @@ iomap_read_end_io(struct bio *bio)
 struct iomap_readpage_ctx {
 	struct page		*cur_page;
 	bool			cur_page_in_bio;
-	bool			is_readahead;
 	struct bio		*bio;
-	struct list_head	*pages;
+	struct readahead_control *rac;
 };
 
 static void
@@ -307,11 +306,11 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		if (ctx->bio)
 			submit_bio(ctx->bio);
 
-		if (ctx->is_readahead) /* same as readahead_gfp_mask */
+		if (ctx->rac) /* same as readahead_gfp_mask */
 			gfp |= __GFP_NORETRY | __GFP_NOWARN;
 		ctx->bio = bio_alloc(gfp, min(BIO_MAX_PAGES, nr_vecs));
 		ctx->bio->bi_opf = REQ_OP_READ;
-		if (ctx->is_readahead)
+		if (ctx->rac)
 			ctx->bio->bi_opf |= REQ_RAHEAD;
 		ctx->bio->bi_iter.bi_sector = sector;
 		bio_set_dev(ctx->bio, iomap->bdev);
@@ -367,104 +366,77 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 }
 EXPORT_SYMBOL_GPL(iomap_readpage);
 
-static struct page *
-iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
-		loff_t length, loff_t *done)
-{
-	while (!list_empty(pages)) {
-		struct page *page = lru_to_page(pages);
-
-		if (page_offset(page) >= (u64)pos + length)
-			break;
-
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
-				GFP_NOFS))
-			return page;
-
-		/*
-		 * If we already have a page in the page cache at index we are
-		 * done.  Upper layers don't care if it is uptodate after the
-		 * readpages call itself as every page gets checked again once
-		 * actually needed.
-		 */
-		*done += PAGE_SIZE;
-		put_page(page);
-	}
-
-	return NULL;
-}
-
 static loff_t
-iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
+iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
-	loff_t done, ret;
+	loff_t ret, done = 0;
 
-	for (done = 0; done < length; done += ret) {
-		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
-			if (!ctx->cur_page_in_bio)
-				unlock_page(ctx->cur_page);
-			put_page(ctx->cur_page);
-			ctx->cur_page = NULL;
-		}
+	while (done < length) {
 		if (!ctx->cur_page) {
-			ctx->cur_page = iomap_next_page(inode, ctx->pages,
-					pos, length, &done);
-			if (!ctx->cur_page)
-				break;
+			ctx->cur_page = readahead_page(ctx->rac);
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
 				ctx, iomap, srcmap);
+		if (WARN_ON(ret == 0))
+			break;
+		done += ret;
+		if (offset_in_page(pos + done) == 0) {
+			readahead_next(ctx->rac);
+			if (!ctx->cur_page_in_bio)
+				unlock_page(ctx->cur_page);
+			put_page(ctx->cur_page);
+			ctx->cur_page = NULL;
+		}
 	}
 
 	return done;
 }
 
-int
-iomap_readpages(struct address_space *mapping, struct list_head *pages,
-		unsigned nr_pages, const struct iomap_ops *ops)
+/**
+ * iomap_readahead - Attempt to read pages from a file.
+ * @rac: Describes the pages to be read.
+ * @ops: The operations vector for the filesystem.
+ *
+ * This function is for filesystems to call to implement their readahead
+ * address_space operation.
+ *
+ * Context: The file is pinned by the caller, and the pages to be read are
+ * all locked and have an elevated refcount.  This function will unlock
+ * the pages (once I/O has completed on them, or I/O has been determined to
+ * not be necessary).  It will also decrease the refcount once the pages
+ * have been submitted for I/O.  After this point, the page may be removed
+ * from the page cache, and should not be referenced.
+ */
+void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
 {
+	struct inode *inode = rac->mapping->host;
 	struct iomap_readpage_ctx ctx = {
-		.pages		= pages,
-		.is_readahead	= true,
+		.rac	= rac,
 	};
-	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
-	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
-	loff_t length = last - pos + PAGE_SIZE, ret = 0;
+	loff_t pos = readahead_offset(rac);
+	loff_t length = readahead_length(rac);
 
-	trace_iomap_readpages(mapping->host, nr_pages);
+	trace_iomap_readahead(inode, readahead_count(rac));
 
 	while (length > 0) {
-		ret = iomap_apply(mapping->host, pos, length, 0, ops,
-				&ctx, iomap_readpages_actor);
+		loff_t ret = iomap_apply(inode, pos, length, 0, ops,
+				&ctx, iomap_readahead_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
-			goto done;
+			break;
 		}
 		pos += ret;
 		length -= ret;
 	}
-	ret = 0;
-done:
+
 	if (ctx.bio)
 		submit_bio(ctx.bio);
-	if (ctx.cur_page) {
-		if (!ctx.cur_page_in_bio)
-			unlock_page(ctx.cur_page);
-		put_page(ctx.cur_page);
-	}
-
-	/*
-	 * Check that we didn't lose a page due to the arcance calling
-	 * conventions..
-	 */
-	WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
-	return ret;
+	BUG_ON(ctx.cur_page);
 }
-EXPORT_SYMBOL_GPL(iomap_readpages);
+EXPORT_SYMBOL_GPL(iomap_readahead);
 
 /*
  * iomap_is_partially_uptodate checks whether blocks within a page are
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 6dc227b8c47e..d6ba705f938a 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -39,7 +39,7 @@ DEFINE_EVENT(iomap_readpage_class, name,	\
 	TP_PROTO(struct inode *inode, int nr_pages), \
 	TP_ARGS(inode, nr_pages))
 DEFINE_READPAGE_EVENT(iomap_readpage);
-DEFINE_READPAGE_EVENT(iomap_readpages);
+DEFINE_READPAGE_EVENT(iomap_readahead);
 
 DECLARE_EVENT_CLASS(iomap_page_class,
 	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 58e937be24ce..6e68eeb50b07 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -621,14 +621,11 @@ xfs_vm_readpage(
 	return iomap_readpage(page, &xfs_read_iomap_ops);
 }
 
-STATIC int
-xfs_vm_readpages(
-	struct file		*unused,
-	struct address_space	*mapping,
-	struct list_head	*pages,
-	unsigned		nr_pages)
+STATIC void
+xfs_vm_readahead(
+	struct readahead_control	*rac)
 {
-	return iomap_readpages(mapping, pages, nr_pages, &xfs_read_iomap_ops);
+	iomap_readahead(rac, &xfs_read_iomap_ops);
 }
 
 static int
@@ -644,7 +641,7 @@ xfs_iomap_swapfile_activate(
 
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
-	.readpages		= xfs_vm_readpages,
+	.readahead		= xfs_vm_readahead,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
 	.set_page_dirty		= iomap_set_page_dirty,
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 8bc6ef82d693..8327a01d3bac 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -78,10 +78,9 @@ static int zonefs_readpage(struct file *unused, struct page *page)
 	return iomap_readpage(page, &zonefs_iomap_ops);
 }
 
-static int zonefs_readpages(struct file *unused, struct address_space *mapping,
-			    struct list_head *pages, unsigned int nr_pages)
+static void zonefs_readahead(struct readahead_control *rac)
 {
-	return iomap_readpages(mapping, pages, nr_pages, &zonefs_iomap_ops);
+	iomap_readahead(rac, &zonefs_iomap_ops);
 }
 
 /*
@@ -128,7 +127,7 @@ static int zonefs_writepages(struct address_space *mapping,
 
 static const struct address_space_operations zonefs_file_aops = {
 	.readpage		= zonefs_readpage,
-	.readpages		= zonefs_readpages,
+	.readahead		= zonefs_readahead,
 	.writepage		= zonefs_writepage,
 	.writepages		= zonefs_writepages,
 	.set_page_dirty		= iomap_set_page_dirty,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..bc20bd04c2a2 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -155,8 +155,7 @@ loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
-int iomap_readpages(struct address_space *mapping, struct list_head *pages,
-		unsigned nr_pages, const struct iomap_ops *ops);
+void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 int iomap_set_page_dirty(struct page *page);
 int iomap_is_partially_uptodate(struct page *page, unsigned long from,
 		unsigned long count);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 16/19] fuse: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (26 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 15/16] iomap: " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-19  3:22   ` Dave Chinner
  2020-02-17 18:46 ` [PATCH v6 16/16] mm: Use memalloc_nofs_save in readahead path Matthew Wilcox
                   ` (6 subsequent siblings)
  34 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in fuse.  Switching away from the
read_cache_pages() helper removes an implicit call to put_page(), so
the matching get_page() call in fuse_readpages_fill() can be dropped.
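
Schematically, and only as a sketch with an illustrative foo_ helper:
read_cache_pages() dropped a reference to each page after the fill
callback returned, so the callback needed its own get_page().  With
readahead_for_each() the filesystem receives pages that are already
locked, in the page cache and with an elevated refcount, and releases
that reference itself:

static void foo_readahead(struct readahead_control *rac)
{
	struct page *page;

	readahead_for_each(rac, page) {
		foo_queue_read(page);	/* hypothetical: attach page to a bio */
		put_page(page);		/* drop the reference from iteration */
	}
	/* submit whatever foo_queue_read() batched up */
}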

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/fuse/file.c | 46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9d67b830fb7a..f64f98708b5e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -923,9 +923,8 @@ struct fuse_fill_data {
 	unsigned int max_pages;
 };
 
-static int fuse_readpages_fill(void *_data, struct page *page)
+static int fuse_readpages_fill(struct fuse_fill_data *data, struct page *page)
 {
-	struct fuse_fill_data *data = _data;
 	struct fuse_io_args *ia = data->ia;
 	struct fuse_args_pages *ap = &ia->ap;
 	struct inode *inode = data->inode;
@@ -941,10 +940,8 @@ static int fuse_readpages_fill(void *_data, struct page *page)
 					fc->max_pages);
 		fuse_send_readpages(ia, data->file);
 		data->ia = ia = fuse_io_alloc(NULL, data->max_pages);
-		if (!ia) {
-			unlock_page(page);
+		if (!ia)
 			return -ENOMEM;
-		}
 		ap = &ia->ap;
 	}
 
@@ -954,7 +951,6 @@ static int fuse_readpages_fill(void *_data, struct page *page)
 		return -EIO;
 	}
 
-	get_page(page);
 	ap->pages[ap->num_pages] = page;
 	ap->descs[ap->num_pages].length = PAGE_SIZE;
 	ap->num_pages++;
@@ -962,37 +958,33 @@ static int fuse_readpages_fill(void *_data, struct page *page)
 	return 0;
 }
 
-static int fuse_readpages(struct file *file, struct address_space *mapping,
-			  struct list_head *pages, unsigned nr_pages)
+static void fuse_readahead(struct readahead_control *rac)
 {
-	struct inode *inode = mapping->host;
+	struct inode *inode = rac->mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_fill_data data;
-	int err;
+	struct page *page;
 
-	err = -EIO;
 	if (is_bad_inode(inode))
-		goto out;
+		return;
 
-	data.file = file;
+	data.file = rac->file;
 	data.inode = inode;
-	data.nr_pages = nr_pages;
-	data.max_pages = min_t(unsigned int, nr_pages, fc->max_pages);
-;
+	data.nr_pages = readahead_count(rac);
+	data.max_pages = min_t(unsigned int, data.nr_pages, fc->max_pages);
 	data.ia = fuse_io_alloc(NULL, data.max_pages);
-	err = -ENOMEM;
 	if (!data.ia)
-		goto out;
+		return;
 
-	err = read_cache_pages(mapping, pages, fuse_readpages_fill, &data);
-	if (!err) {
-		if (data.ia->ap.num_pages)
-			fuse_send_readpages(data.ia, file);
-		else
-			fuse_io_free(data.ia);
+	readahead_for_each(rac, page) {
+		if (fuse_readpages_fill(&data, page) != 0)
+			return;
 	}
-out:
-	return err;
+
+	if (data.ia->ap.num_pages)
+		fuse_send_readpages(data.ia, rac->file);
+	else
+		fuse_io_free(data.ia);
 }
 
 static ssize_t fuse_cache_read_iter(struct kiocb *iocb, struct iov_iter *to)
@@ -3373,10 +3365,10 @@ static const struct file_operations fuse_file_operations = {
 
 static const struct address_space_operations fuse_file_aops  = {
 	.readpage	= fuse_readpage,
+	.readahead	= fuse_readahead,
 	.writepage	= fuse_writepage,
 	.writepages	= fuse_writepages,
 	.launder_page	= fuse_launder_page,
-	.readpages	= fuse_readpages,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.bmap		= fuse_bmap,
 	.direct_IO	= fuse_direct_IO,
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 16/16] mm: Use memalloc_nofs_save in readahead path
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (27 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 16/19] fuse: " Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-17 18:46 ` [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor Matthew Wilcox
                   ` (5 subsequent siblings)
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, Michal Hocko, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	Cong Wang, linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Ensure that memory allocations in the readahead path do not attempt to
reclaim file-backed pages, which could lead to a deadlock.  It is
possible, though unlikely, that this is the root cause of a problem
observed by Cong Wang.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
---
 mm/readahead.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/readahead.c b/mm/readahead.c
index 566693f4e539..ae8abab939a3 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -22,6 +22,7 @@
 #include <linux/mm_inline.h>
 #include <linux/blk-cgroup.h>
 #include <linux/fadvise.h>
+#include <linux/sched/mm.h>
 
 #include "internal.h"
 
@@ -157,6 +158,18 @@ void page_cache_readahead_limit(struct address_space *mapping,
 		._nr_pages = 0,
 	};
 
+	/*
+	 * Partway through the readahead operation, we will have added
+	 * locked pages to the page cache, but will not yet have submitted
+	 * them for I/O.  Adding another page may need to allocate memory,
+	 * which can trigger memory reclaim.  Telling the VM we're in
+	 * the middle of a filesystem operation will cause it to not
+	 * touch file-backed pages, preventing a deadlock.  Most (all?)
+	 * filesystems already specify __GFP_NOFS in their mapping's
+	 * gfp_mask, but let's be explicit here.
+	 */
+	unsigned int nofs = memalloc_nofs_save();
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -210,6 +223,7 @@ void page_cache_readahead_limit(struct address_space *mapping,
 	if (readahead_count(&rac))
 		read_pages(&rac, &page_pool);
 	BUG_ON(!list_empty(&page_pool));
+	memalloc_nofs_restore(nofs);
 }
 EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (28 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 16/16] mm: Use memalloc_nofs_save in readahead path Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-19  3:17   ` John Hubbard
  2020-02-19  3:29   ` Dave Chinner
  2020-02-17 18:46 ` [PATCH v6 18/19] iomap: Convert from readpages to readahead Matthew Wilcox
                   ` (4 subsequent siblings)
  34 siblings, 2 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

By putting the 'have we reached the end of the page' condition at the end
of the loop instead of the beginning, we can remove the 'submit the last
page' code from iomap_readpages().  Also check that iomap_readpage_actor()
didn't return 0, which would lead to an endless loop.
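
In outline, the restructured actor loop in the hunk below becomes:

	while (done < length) {
		if (!ctx->cur_page) {
			/* pick up the next page to fill */
		}
		ret = iomap_readpage_actor(inode, pos + done, length - done,
				ctx, iomap, srcmap);
		if (WARN_ON(ret == 0))
			break;	/* no progress; avoid looping forever */
		done += ret;
		if (offset_in_page(pos + done) == 0) {
			/* finished a page: unlock/put it, clear cur_page */
		}
	}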

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index cb3511eb152a..44303f370b2d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
-	loff_t done, ret;
+	loff_t ret, done = 0;
 
-	for (done = 0; done < length; done += ret) {
-		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
-			if (!ctx->cur_page_in_bio)
-				unlock_page(ctx->cur_page);
-			put_page(ctx->cur_page);
-			ctx->cur_page = NULL;
-		}
+	while (done < length) {
 		if (!ctx->cur_page) {
 			ctx->cur_page = iomap_next_page(inode, ctx->pages,
 					pos, length, &done);
@@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
 				ctx, iomap, srcmap);
+		if (WARN_ON(ret == 0))
+			break;
+		done += ret;
+		if (offset_in_page(pos + done) == 0) {
+			if (!ctx->cur_page_in_bio)
+				unlock_page(ctx->cur_page);
+			put_page(ctx->cur_page);
+			ctx->cur_page = NULL;
+		}
 	}
 
 	return done;
@@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
 done:
 	if (ctx.bio)
 		submit_bio(ctx.bio);
-	if (ctx.cur_page) {
-		if (!ctx.cur_page_in_bio)
-			unlock_page(ctx.cur_page);
-		put_page(ctx.cur_page);
-	}
+	BUG_ON(ctx.cur_page);
 
 	/*
 	 * Check that we didn't lose a page due to the arcance calling
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 18/19] iomap: Convert from readpages to readahead
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (29 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-19  3:40   ` Dave Chinner
  2020-02-17 18:46 ` [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path Matthew Wilcox
                   ` (3 subsequent siblings)
  34 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the new readahead operation in iomap.  Convert XFS and ZoneFS to
use it.
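
For a filesystem built on iomap the conversion follows this pattern
(illustrative foo_ names; the XFS and zonefs hunks below do exactly
this):

static void foo_readahead(struct readahead_control *rac)
{
	iomap_readahead(rac, &foo_iomap_ops);
}

const struct address_space_operations foo_aops = {
	.readpage	= foo_readpage,
	.readahead	= foo_readahead,	/* replaces .readpages */
	/* ... other operations unchanged ... */
};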

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/iomap/buffered-io.c | 91 +++++++++++++++---------------------------
 fs/iomap/trace.h       |  2 +-
 fs/xfs/xfs_aops.c      | 13 +++---
 fs/zonefs/super.c      |  7 ++--
 include/linux/iomap.h  |  3 +-
 5 files changed, 42 insertions(+), 74 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 44303f370b2d..2bfcd5242264 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -214,9 +214,8 @@ iomap_read_end_io(struct bio *bio)
 struct iomap_readpage_ctx {
 	struct page		*cur_page;
 	bool			cur_page_in_bio;
-	bool			is_readahead;
 	struct bio		*bio;
-	struct list_head	*pages;
+	struct readahead_control *rac;
 };
 
 static void
@@ -307,11 +306,11 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		if (ctx->bio)
 			submit_bio(ctx->bio);
 
-		if (ctx->is_readahead) /* same as readahead_gfp_mask */
+		if (ctx->rac) /* same as readahead_gfp_mask */
 			gfp |= __GFP_NORETRY | __GFP_NOWARN;
 		ctx->bio = bio_alloc(gfp, min(BIO_MAX_PAGES, nr_vecs));
 		ctx->bio->bi_opf = REQ_OP_READ;
-		if (ctx->is_readahead)
+		if (ctx->rac)
 			ctx->bio->bi_opf |= REQ_RAHEAD;
 		ctx->bio->bi_iter.bi_sector = sector;
 		bio_set_dev(ctx->bio, iomap->bdev);
@@ -367,36 +366,8 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
 }
 EXPORT_SYMBOL_GPL(iomap_readpage);
 
-static struct page *
-iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
-		loff_t length, loff_t *done)
-{
-	while (!list_empty(pages)) {
-		struct page *page = lru_to_page(pages);
-
-		if (page_offset(page) >= (u64)pos + length)
-			break;
-
-		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
-				GFP_NOFS))
-			return page;
-
-		/*
-		 * If we already have a page in the page cache at index we are
-		 * done.  Upper layers don't care if it is uptodate after the
-		 * readpages call itself as every page gets checked again once
-		 * actually needed.
-		 */
-		*done += PAGE_SIZE;
-		put_page(page);
-	}
-
-	return NULL;
-}
-
 static loff_t
-iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
+iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
 		void *data, struct iomap *iomap, struct iomap *srcmap)
 {
 	struct iomap_readpage_ctx *ctx = data;
@@ -404,10 +375,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 
 	while (done < length) {
 		if (!ctx->cur_page) {
-			ctx->cur_page = iomap_next_page(inode, ctx->pages,
-					pos, length, &done);
-			if (!ctx->cur_page)
-				break;
+			ctx->cur_page = readahead_page(ctx->rac);
 			ctx->cur_page_in_bio = false;
 		}
 		ret = iomap_readpage_actor(inode, pos + done, length - done,
@@ -416,6 +384,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 			break;
 		done += ret;
 		if (offset_in_page(pos + done) == 0) {
+			readahead_next(ctx->rac);
 			if (!ctx->cur_page_in_bio)
 				unlock_page(ctx->cur_page);
 			put_page(ctx->cur_page);
@@ -426,44 +395,48 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
 	return done;
 }
 
-int
-iomap_readpages(struct address_space *mapping, struct list_head *pages,
-		unsigned nr_pages, const struct iomap_ops *ops)
+/**
+ * iomap_readahead - Attempt to read pages from a file.
+ * @rac: Describes the pages to be read.
+ * @ops: The operations vector for the filesystem.
+ *
+ * This function is for filesystems to call to implement their readahead
+ * address_space operation.
+ *
+ * Context: The file is pinned by the caller, and the pages to be read are
+ * all locked and have an elevated refcount.  This function will unlock
+ * the pages (once I/O has completed on them, or I/O has been determined to
+ * not be necessary).  It will also decrease the refcount once the pages
+ * have been submitted for I/O.  After this point, the page may be removed
+ * from the page cache, and should not be referenced.
+ */
+void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
 {
+	struct inode *inode = rac->mapping->host;
 	struct iomap_readpage_ctx ctx = {
-		.pages		= pages,
-		.is_readahead	= true,
+		.rac	= rac,
 	};
-	loff_t pos = page_offset(list_entry(pages->prev, struct page, lru));
-	loff_t last = page_offset(list_entry(pages->next, struct page, lru));
-	loff_t length = last - pos + PAGE_SIZE, ret = 0;
+	loff_t pos = readahead_offset(rac);
+	loff_t length = readahead_length(rac);
 
-	trace_iomap_readpages(mapping->host, nr_pages);
+	trace_iomap_readahead(inode, readahead_count(rac));
 
 	while (length > 0) {
-		ret = iomap_apply(mapping->host, pos, length, 0, ops,
-				&ctx, iomap_readpages_actor);
+		loff_t ret = iomap_apply(inode, pos, length, 0, ops,
+				&ctx, iomap_readahead_actor);
 		if (ret <= 0) {
 			WARN_ON_ONCE(ret == 0);
-			goto done;
+			break;
 		}
 		pos += ret;
 		length -= ret;
 	}
-	ret = 0;
-done:
+
 	if (ctx.bio)
 		submit_bio(ctx.bio);
 	BUG_ON(ctx.cur_page);
-
-	/*
-	 * Check that we didn't lose a page due to the arcance calling
-	 * conventions..
-	 */
-	WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
-	return ret;
 }
-EXPORT_SYMBOL_GPL(iomap_readpages);
+EXPORT_SYMBOL_GPL(iomap_readahead);
 
 /*
  * iomap_is_partially_uptodate checks whether blocks within a page are
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 6dc227b8c47e..d6ba705f938a 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -39,7 +39,7 @@ DEFINE_EVENT(iomap_readpage_class, name,	\
 	TP_PROTO(struct inode *inode, int nr_pages), \
 	TP_ARGS(inode, nr_pages))
 DEFINE_READPAGE_EVENT(iomap_readpage);
-DEFINE_READPAGE_EVENT(iomap_readpages);
+DEFINE_READPAGE_EVENT(iomap_readahead);
 
 DECLARE_EVENT_CLASS(iomap_page_class,
 	TP_PROTO(struct inode *inode, struct page *page, unsigned long off,
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 58e937be24ce..6e68eeb50b07 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -621,14 +621,11 @@ xfs_vm_readpage(
 	return iomap_readpage(page, &xfs_read_iomap_ops);
 }
 
-STATIC int
-xfs_vm_readpages(
-	struct file		*unused,
-	struct address_space	*mapping,
-	struct list_head	*pages,
-	unsigned		nr_pages)
+STATIC void
+xfs_vm_readahead(
+	struct readahead_control	*rac)
 {
-	return iomap_readpages(mapping, pages, nr_pages, &xfs_read_iomap_ops);
+	iomap_readahead(rac, &xfs_read_iomap_ops);
 }
 
 static int
@@ -644,7 +641,7 @@ xfs_iomap_swapfile_activate(
 
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
-	.readpages		= xfs_vm_readpages,
+	.readahead		= xfs_vm_readahead,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
 	.set_page_dirty		= iomap_set_page_dirty,
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 8bc6ef82d693..8327a01d3bac 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -78,10 +78,9 @@ static int zonefs_readpage(struct file *unused, struct page *page)
 	return iomap_readpage(page, &zonefs_iomap_ops);
 }
 
-static int zonefs_readpages(struct file *unused, struct address_space *mapping,
-			    struct list_head *pages, unsigned int nr_pages)
+static void zonefs_readahead(struct readahead_control *rac)
 {
-	return iomap_readpages(mapping, pages, nr_pages, &zonefs_iomap_ops);
+	iomap_readahead(rac, &zonefs_iomap_ops);
 }
 
 /*
@@ -128,7 +127,7 @@ static int zonefs_writepages(struct address_space *mapping,
 
 static const struct address_space_operations zonefs_file_aops = {
 	.readpage		= zonefs_readpage,
-	.readpages		= zonefs_readpages,
+	.readahead		= zonefs_readahead,
 	.writepage		= zonefs_writepage,
 	.writepages		= zonefs_writepages,
 	.set_page_dirty		= iomap_set_page_dirty,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 8b09463dae0d..bc20bd04c2a2 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -155,8 +155,7 @@ loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
 ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
 		const struct iomap_ops *ops);
 int iomap_readpage(struct page *page, const struct iomap_ops *ops);
-int iomap_readpages(struct address_space *mapping, struct list_head *pages,
-		unsigned nr_pages, const struct iomap_ops *ops);
+void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 int iomap_set_page_dirty(struct page *page);
 int iomap_is_partially_uptodate(struct page *page, unsigned long from,
 		unsigned long count);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (30 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 18/19] iomap: Convert from readpages to readahead Matthew Wilcox
@ 2020-02-17 18:46 ` Matthew Wilcox
  2020-02-19  3:43   ` Dave Chinner
  2020-02-17 18:48 ` [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (2 subsequent siblings)
  34 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, Michal Hocko, linux-kernel, Matthew Wilcox (Oracle),
	linux-f2fs-devel, cluster-devel, linux-mm, ocfs2-devel,
	Cong Wang, linux-ext4, linux-erofs, linux-btrfs

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Ensure that memory allocations in the readahead path do not attempt to
reclaim file-backed pages, which could lead to a deadlock.  It is
possible, though unlikely, that this is the root cause of a problem
observed by Cong Wang.
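
The scoped-NOFS pattern sketched here, with a hypothetical foo_ caller
and allocation, makes every allocation inside the region behave as if
GFP_NOFS were passed, whatever gfp_mask the call site uses:

static void foo_readahead_region(struct address_space *mapping)
{
	unsigned int nofs = memalloc_nofs_save();
	struct page *page;

	/*
	 * Reclaim triggered by this allocation will not re-enter
	 * filesystem code while we hold locked pages that have not
	 * yet been submitted for I/O.
	 */
	page = __page_cache_alloc(readahead_gfp_mask(mapping));
	/* ... add @page to the cache, build and submit the I/O ... */

	memalloc_nofs_restore(nofs);
}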

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
---
 mm/readahead.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/readahead.c b/mm/readahead.c
index 94d499cfb657..8f9c0dba24e7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -22,6 +22,7 @@
 #include <linux/mm_inline.h>
 #include <linux/blk-cgroup.h>
 #include <linux/fadvise.h>
+#include <linux/sched/mm.h>
 
 #include "internal.h"
 
@@ -174,6 +175,18 @@ void page_cache_readahead_limit(struct address_space *mapping,
 		._nr_pages = 0,
 	};
 
+	/*
+	 * Partway through the readahead operation, we will have added
+	 * locked pages to the page cache, but will not yet have submitted
+	 * them for I/O.  Adding another page may need to allocate memory,
+	 * which can trigger memory reclaim.  Telling the VM we're in
+	 * the middle of a filesystem operation will cause it to not
+	 * touch file-backed pages, preventing a deadlock.  Most (all?)
+	 * filesystems already specify __GFP_NOFS in their mapping's
+	 * gfp_mask, but let's be explicit here.
+	 */
+	unsigned int nofs = memalloc_nofs_save();
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -227,6 +240,7 @@ void page_cache_readahead_limit(struct address_space *mapping,
 	if (readahead_count(&rac))
 		read_pages(&rac, &page_pool);
 	BUG_ON(!list_empty(&page_pool));
+	memalloc_nofs_restore(nofs);
 }
 EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 00/19] Change readahead API
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (31 preceding siblings ...)
  2020-02-17 18:46 ` [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path Matthew Wilcox
@ 2020-02-17 18:48 ` Matthew Wilcox
  2020-02-18  4:56 ` Dave Chinner
  2020-02-18 20:49 ` John Hubbard
  34 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-17 18:48 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On Mon, Feb 17, 2020 at 10:45:41AM -0800, Matthew Wilcox wrote:
> This series adds a readahead address_space operation to eventually

*sigh*.  Clearly I forgot to rm -rf an earlier version.  Please disregard
any patches labelled n/16.  I can send a v7 if this is too much hassle.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [Ocfs2-devel] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
  2020-02-17 18:45 ` [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
@ 2020-02-18  1:51   ` Joseph Qi
  2020-02-18  6:37   ` Dave Chinner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 111+ messages in thread
From: Joseph Qi @ 2020-02-18  1:51 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: cluster-devel, linux-kernel, linux-f2fs-devel, linux-xfs,
	linux-mm, linux-btrfs, linux-ext4, linux-erofs, ocfs2-devel



On 20/2/18 02:45, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Implement the new readahead aop and convert all callers (block_dev,
> exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
> reiserfs & udf).  The callers are all trivial except for GFS2 & OCFS2.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
> ---
>  drivers/staging/exfat/exfat_super.c |  7 +++---
>  fs/block_dev.c                      |  7 +++---
>  fs/ext2/inode.c                     | 10 +++-----
>  fs/fat/inode.c                      |  7 +++---
>  fs/gfs2/aops.c                      | 23 ++++++-----------
>  fs/hpfs/file.c                      |  7 +++---
>  fs/iomap/buffered-io.c              |  2 +-
>  fs/isofs/inode.c                    |  7 +++---
>  fs/jfs/inode.c                      |  7 +++---
>  fs/mpage.c                          | 38 +++++++++--------------------
>  fs/nilfs2/inode.c                   | 15 +++---------
>  fs/ocfs2/aops.c                     | 34 ++++++++++----------------
>  fs/omfs/file.c                      |  7 +++---
>  fs/qnx6/inode.c                     |  7 +++---
>  fs/reiserfs/inode.c                 |  8 +++---
>  fs/udf/inode.c                      |  7 +++---
>  include/linux/mpage.h               |  4 +--
>  mm/migrate.c                        |  2 +-
>  18 files changed, 73 insertions(+), 126 deletions(-)
> 
> diff --git a/drivers/staging/exfat/exfat_super.c b/drivers/staging/exfat/exfat_super.c
> index b81d2a87b82e..96aad9b16d31 100644
> --- a/drivers/staging/exfat/exfat_super.c
> +++ b/drivers/staging/exfat/exfat_super.c
> @@ -3002,10 +3002,9 @@ static int exfat_readpage(struct file *file, struct page *page)
>  	return  mpage_readpage(page, exfat_get_block);
>  }
>  
> -static int exfat_readpages(struct file *file, struct address_space *mapping,
> -			   struct list_head *pages, unsigned int nr_pages)
> +static void exfat_readahead(struct readahead_control *rac)
>  {
> -	return  mpage_readpages(mapping, pages, nr_pages, exfat_get_block);
> +	mpage_readahead(rac, exfat_get_block);
>  }
>  
>  static int exfat_writepage(struct page *page, struct writeback_control *wbc)
> @@ -3104,7 +3103,7 @@ static sector_t _exfat_bmap(struct address_space *mapping, sector_t block)
>  
>  static const struct address_space_operations exfat_aops = {
>  	.readpage    = exfat_readpage,
> -	.readpages   = exfat_readpages,
> +	.readahead   = exfat_readahead,
>  	.writepage   = exfat_writepage,
>  	.writepages  = exfat_writepages,
>  	.write_begin = exfat_write_begin,
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 69bf2fb6f7cd..2fd9c7bd61f6 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -614,10 +614,9 @@ static int blkdev_readpage(struct file * file, struct page * page)
>  	return block_read_full_page(page, blkdev_get_block);
>  }
>  
> -static int blkdev_readpages(struct file *file, struct address_space *mapping,
> -			struct list_head *pages, unsigned nr_pages)
> +static void blkdev_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, blkdev_get_block);
> +	mpage_readahead(rac, blkdev_get_block);
>  }
>  
>  static int blkdev_write_begin(struct file *file, struct address_space *mapping,
> @@ -2062,7 +2061,7 @@ static int blkdev_writepages(struct address_space *mapping,
>  
>  static const struct address_space_operations def_blk_aops = {
>  	.readpage	= blkdev_readpage,
> -	.readpages	= blkdev_readpages,
> +	.readahead	= blkdev_readahead,
>  	.writepage	= blkdev_writepage,
>  	.write_begin	= blkdev_write_begin,
>  	.write_end	= blkdev_write_end,
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c885cf7d724b..2875c0a705b5 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -877,11 +877,9 @@ static int ext2_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, ext2_get_block);
>  }
>  
> -static int
> -ext2_readpages(struct file *file, struct address_space *mapping,
> -		struct list_head *pages, unsigned nr_pages)
> +static void ext2_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
> +	mpage_readahead(rac, ext2_get_block);
>  }
>  
>  static int
> @@ -967,7 +965,7 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
>  
>  const struct address_space_operations ext2_aops = {
>  	.readpage		= ext2_readpage,
> -	.readpages		= ext2_readpages,
> +	.readahead		= ext2_readahead,
>  	.writepage		= ext2_writepage,
>  	.write_begin		= ext2_write_begin,
>  	.write_end		= ext2_write_end,
> @@ -981,7 +979,7 @@ const struct address_space_operations ext2_aops = {
>  
>  const struct address_space_operations ext2_nobh_aops = {
>  	.readpage		= ext2_readpage,
> -	.readpages		= ext2_readpages,
> +	.readahead		= ext2_readahead,
>  	.writepage		= ext2_nobh_writepage,
>  	.write_begin		= ext2_nobh_write_begin,
>  	.write_end		= nobh_write_end,
> diff --git a/fs/fat/inode.c b/fs/fat/inode.c
> index 594b05ae16c9..3496f5fc3e6d 100644
> --- a/fs/fat/inode.c
> +++ b/fs/fat/inode.c
> @@ -210,10 +210,9 @@ static int fat_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, fat_get_block);
>  }
>  
> -static int fat_readpages(struct file *file, struct address_space *mapping,
> -			 struct list_head *pages, unsigned nr_pages)
> +static void fat_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, fat_get_block);
> +	mpage_readahead(rac, fat_get_block);
>  }
>  
>  static void fat_write_failed(struct address_space *mapping, loff_t to)
> @@ -344,7 +343,7 @@ int fat_block_truncate_page(struct inode *inode, loff_t from)
>  
>  static const struct address_space_operations fat_aops = {
>  	.readpage	= fat_readpage,
> -	.readpages	= fat_readpages,
> +	.readahead	= fat_readahead,
>  	.writepage	= fat_writepage,
>  	.writepages	= fat_writepages,
>  	.write_begin	= fat_write_begin,
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index ba83b49ce18c..5e63c13c12c1 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -577,7 +577,7 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
>  }
>  
>  /**
> - * gfs2_readpages - Read a bunch of pages at once
> + * gfs2_readahead - Read a bunch of pages at once
>   * @file: The file to read from
>   * @mapping: Address space info
>   * @pages: List of pages to read
> @@ -590,31 +590,24 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
>   *    obviously not something we'd want to do on too regular a basis.
>   *    Any I/O we ignore at this time will be done via readpage later.
>   * 2. We don't handle stuffed files here we let readpage do the honours.
> - * 3. mpage_readpages() does most of the heavy lifting in the common case.
> + * 3. mpage_readahead() does most of the heavy lifting in the common case.
>   * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places.
>   */
>  
> -static int gfs2_readpages(struct file *file, struct address_space *mapping,
> -			  struct list_head *pages, unsigned nr_pages)
> +static void gfs2_readahead(struct readahead_control *rac)
>  {
> -	struct inode *inode = mapping->host;
> +	struct inode *inode = rac->mapping->host;
>  	struct gfs2_inode *ip = GFS2_I(inode);
> -	struct gfs2_sbd *sdp = GFS2_SB(inode);
>  	struct gfs2_holder gh;
> -	int ret;
>  
>  	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
> -	ret = gfs2_glock_nq(&gh);
> -	if (unlikely(ret))
> +	if (gfs2_glock_nq(&gh))
>  		goto out_uninit;
>  	if (!gfs2_is_stuffed(ip))
> -		ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
> +		mpage_readahead(rac, gfs2_block_map);
>  	gfs2_glock_dq(&gh);
>  out_uninit:
>  	gfs2_holder_uninit(&gh);
> -	if (unlikely(gfs2_withdrawn(sdp)))
> -		ret = -EIO;
> -	return ret;
>  }
>  
>  /**
> @@ -828,7 +821,7 @@ static const struct address_space_operations gfs2_aops = {
>  	.writepage = gfs2_writepage,
>  	.writepages = gfs2_writepages,
>  	.readpage = gfs2_readpage,
> -	.readpages = gfs2_readpages,
> +	.readahead = gfs2_readahead,
>  	.bmap = gfs2_bmap,
>  	.invalidatepage = gfs2_invalidatepage,
>  	.releasepage = gfs2_releasepage,
> @@ -842,7 +835,7 @@ static const struct address_space_operations gfs2_jdata_aops = {
>  	.writepage = gfs2_jdata_writepage,
>  	.writepages = gfs2_jdata_writepages,
>  	.readpage = gfs2_readpage,
> -	.readpages = gfs2_readpages,
> +	.readahead = gfs2_readahead,
>  	.set_page_dirty = jdata_set_page_dirty,
>  	.bmap = gfs2_bmap,
>  	.invalidatepage = gfs2_invalidatepage,
> diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
> index b36abf9cb345..2de0d3492d15 100644
> --- a/fs/hpfs/file.c
> +++ b/fs/hpfs/file.c
> @@ -125,10 +125,9 @@ static int hpfs_writepage(struct page *page, struct writeback_control *wbc)
>  	return block_write_full_page(page, hpfs_get_block, wbc);
>  }
>  
> -static int hpfs_readpages(struct file *file, struct address_space *mapping,
> -			  struct list_head *pages, unsigned nr_pages)
> +static void hpfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, hpfs_get_block);
> +	mpage_readahead(rac, hpfs_get_block);
>  }
>  
>  static int hpfs_writepages(struct address_space *mapping,
> @@ -198,7 +197,7 @@ static int hpfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  const struct address_space_operations hpfs_aops = {
>  	.readpage = hpfs_readpage,
>  	.writepage = hpfs_writepage,
> -	.readpages = hpfs_readpages,
> +	.readahead = hpfs_readahead,
>  	.writepages = hpfs_writepages,
>  	.write_begin = hpfs_write_begin,
>  	.write_end = hpfs_write_end,
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7c84c4c027c4..cb3511eb152a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -359,7 +359,7 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
>  	}
>  
>  	/*
> -	 * Just like mpage_readpages and block_read_full_page we always
> +	 * Just like mpage_readahead and block_read_full_page we always
>  	 * return 0 and just mark the page as PageError on errors.  This
>  	 * should be cleaned up all through the stack eventually.
>  	 */
> diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
> index 62c0462dc89f..95b1f377ad09 100644
> --- a/fs/isofs/inode.c
> +++ b/fs/isofs/inode.c
> @@ -1185,10 +1185,9 @@ static int isofs_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, isofs_get_block);
>  }
>  
> -static int isofs_readpages(struct file *file, struct address_space *mapping,
> -			struct list_head *pages, unsigned nr_pages)
> +static void isofs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, isofs_get_block);
> +	mpage_readahead(rac, isofs_get_block);
>  }
>  
>  static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
> @@ -1198,7 +1197,7 @@ static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
>  
>  static const struct address_space_operations isofs_aops = {
>  	.readpage = isofs_readpage,
> -	.readpages = isofs_readpages,
> +	.readahead = isofs_readahead,
>  	.bmap = _isofs_bmap
>  };
>  
> diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
> index 9486afcdac76..6f65bfa9f18d 100644
> --- a/fs/jfs/inode.c
> +++ b/fs/jfs/inode.c
> @@ -296,10 +296,9 @@ static int jfs_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, jfs_get_block);
>  }
>  
> -static int jfs_readpages(struct file *file, struct address_space *mapping,
> -		struct list_head *pages, unsigned nr_pages)
> +static void jfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, jfs_get_block);
> +	mpage_readahead(rac, jfs_get_block);
>  }
>  
>  static void jfs_write_failed(struct address_space *mapping, loff_t to)
> @@ -358,7 +357,7 @@ static ssize_t jfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  
>  const struct address_space_operations jfs_aops = {
>  	.readpage	= jfs_readpage,
> -	.readpages	= jfs_readpages,
> +	.readahead	= jfs_readahead,
>  	.writepage	= jfs_writepage,
>  	.writepages	= jfs_writepages,
>  	.write_begin	= jfs_write_begin,
> diff --git a/fs/mpage.c b/fs/mpage.c
> index ccba3c4c4479..8a09e6002dc2 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -91,7 +91,7 @@ mpage_alloc(struct block_device *bdev,
>  }
>  
>  /*
> - * support function for mpage_readpages.  The fs supplied get_block might
> + * support function for mpage_readahead.  The fs supplied get_block might
>   * return an up to date buffer.  This is used to map that buffer into
>   * the page, which allows readpage to avoid triggering a duplicate call
>   * to get_block.
> @@ -338,13 +338,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>  }
>  
>  /**
> - * mpage_readpages - populate an address space with some pages & start reads against them
> - * @mapping: the address_space
> - * @pages: The address of a list_head which contains the target pages.  These
> - *   pages have their ->index populated and are otherwise uninitialised.
> - *   The page at @pages->prev has the lowest file offset, and reads should be
> - *   issued in @pages->prev to @pages->next order.
> - * @nr_pages: The number of pages at *@pages
> + * mpage_readahead - start reads against pages
> + * @rac: Describes which pages to read.
>   * @get_block: The filesystem's block mapper function.
>   *
>   * This function walks the pages and the blocks within each page, building and
> @@ -381,36 +376,25 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>   *
>   * This all causes the disk requests to be issued in the correct order.
>   */
> -int
> -mpage_readpages(struct address_space *mapping, struct list_head *pages,
> -				unsigned nr_pages, get_block_t get_block)
> +void mpage_readahead(struct readahead_control *rac, get_block_t get_block)
>  {
> +	struct page *page;
>  	struct mpage_readpage_args args = {
>  		.get_block = get_block,
>  		.is_readahead = true,
>  	};
> -	unsigned page_idx;
> -
> -	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> -		struct page *page = lru_to_page(pages);
>  
> +	readahead_for_each(rac, page) {
>  		prefetchw(&page->flags);
> -		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, mapping,
> -					page->index,
> -					readahead_gfp_mask(mapping))) {
> -			args.page = page;
> -			args.nr_pages = nr_pages - page_idx;
> -			args.bio = do_mpage_readpage(&args);
> -		}
> +		args.page = page;
> +		args.nr_pages = readahead_count(rac);
> +		args.bio = do_mpage_readpage(&args);
>  		put_page(page);
>  	}
> -	BUG_ON(!list_empty(pages));
>  	if (args.bio)
>  		mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio);
> -	return 0;
>  }
> -EXPORT_SYMBOL(mpage_readpages);
> +EXPORT_SYMBOL(mpage_readahead);
>  
>  /*
>   * This isn't called much at all
> @@ -563,7 +547,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>  		 * Page has buffers, but they are all unmapped. The page was
>  		 * created by pagein or read over a hole which was handled by
>  		 * block_read_full_page().  If this address_space is also
> -		 * using mpage_readpages then this can rarely happen.
> +		 * using mpage_readahead then this can rarely happen.
>  		 */
>  		goto confused;
>  	}
> diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
> index 671085512e0f..ceeb3b441844 100644
> --- a/fs/nilfs2/inode.c
> +++ b/fs/nilfs2/inode.c
> @@ -145,18 +145,9 @@ static int nilfs_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, nilfs_get_block);
>  }
>  
> -/**
> - * nilfs_readpages() - implement readpages() method of nilfs_aops {}
> - * address_space_operations.
> - * @file - file struct of the file to be read
> - * @mapping - address_space struct used for reading multiple pages
> - * @pages - the pages to be read
> - * @nr_pages - number of pages to be read
> - */
> -static int nilfs_readpages(struct file *file, struct address_space *mapping,
> -			   struct list_head *pages, unsigned int nr_pages)
> +static void nilfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, nilfs_get_block);
> +	mpage_readahead(rac, nilfs_get_block);
>  }
>  
>  static int nilfs_writepages(struct address_space *mapping,
> @@ -308,7 +299,7 @@ const struct address_space_operations nilfs_aops = {
>  	.readpage		= nilfs_readpage,
>  	.writepages		= nilfs_writepages,
>  	.set_page_dirty		= nilfs_set_page_dirty,
> -	.readpages		= nilfs_readpages,
> +	.readahead		= nilfs_readahead,
>  	.write_begin		= nilfs_write_begin,
>  	.write_end		= nilfs_write_end,
>  	/* .releasepage		= nilfs_releasepage, */
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 3a67a6518ddf..e8137efaafec 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -350,14 +350,11 @@ static int ocfs2_readpage(struct file *file, struct page *page)
>   * grow out to a tree. If need be, detecting boundary extents could
>   * trivially be added in a future version of ocfs2_get_block().
>   */
> -static int ocfs2_readpages(struct file *filp, struct address_space *mapping,
> -			   struct list_head *pages, unsigned nr_pages)
> +static void ocfs2_readahead(struct readahead_control *rac)
>  {
> -	int ret, err = -EIO;
> -	struct inode *inode = mapping->host;
> +	int ret;
> +	struct inode *inode = rac->mapping->host;
>  	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> -	loff_t start;
> -	struct page *last;
>  
>  	/*
>  	 * Use the nonblocking flag for the dlm code to avoid page
> @@ -365,36 +362,31 @@ static int ocfs2_readpages(struct file *filp, struct address_space *mapping,
>  	 */
>  	ret = ocfs2_inode_lock_full(inode, NULL, 0, OCFS2_LOCK_NONBLOCK);
>  	if (ret)
> -		return err;
> +		return;
>  
> -	if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
> -		ocfs2_inode_unlock(inode, 0);
> -		return err;
> -	}
> +	if (down_read_trylock(&oi->ip_alloc_sem) == 0)
> +		goto out_unlock;
>  
>  	/*
>  	 * Don't bother with inline-data. There isn't anything
>  	 * to read-ahead in that case anyway...
>  	 */
>  	if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
> -		goto out_unlock;
> +		goto out_up;
>  
>  	/*
>  	 * Check whether a remote node truncated this file - we just
>  	 * drop out in that case as it's not worth handling here.
>  	 */
> -	last = lru_to_page(pages);
> -	start = (loff_t)last->index << PAGE_SHIFT;
> -	if (start >= i_size_read(inode))
> -		goto out_unlock;
> +	if (readahead_offset(rac) >= i_size_read(inode))
> +		goto out_up;
>  
> -	err = mpage_readpages(mapping, pages, nr_pages, ocfs2_get_block);
> +	mpage_readahead(rac, ocfs2_get_block);
>  
> -out_unlock:
> +out_up:
>  	up_read(&oi->ip_alloc_sem);
> +out_unlock:
>  	ocfs2_inode_unlock(inode, 0);
> -
> -	return err;
>  }
>  
>  /* Note: Because we don't support holes, our allocation has
> @@ -2474,7 +2466,7 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  
>  const struct address_space_operations ocfs2_aops = {
>  	.readpage		= ocfs2_readpage,
> -	.readpages		= ocfs2_readpages,
> +	.readahead		= ocfs2_readahead,
>  	.writepage		= ocfs2_writepage,
>  	.write_begin		= ocfs2_write_begin,
>  	.write_end		= ocfs2_write_end,
For the ocfs2 part, looks good.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

> diff --git a/fs/omfs/file.c b/fs/omfs/file.c
> index d640b9388238..d7b5f09d298c 100644
> --- a/fs/omfs/file.c
> +++ b/fs/omfs/file.c
> @@ -289,10 +289,9 @@ static int omfs_readpage(struct file *file, struct page *page)
>  	return block_read_full_page(page, omfs_get_block);
>  }
>  
> -static int omfs_readpages(struct file *file, struct address_space *mapping,
> -		struct list_head *pages, unsigned nr_pages)
> +static void omfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, omfs_get_block);
> +	mpage_readahead(rac, omfs_get_block);
>  }
>  
>  static int omfs_writepage(struct page *page, struct writeback_control *wbc)
> @@ -373,7 +372,7 @@ const struct inode_operations omfs_file_inops = {
>  
>  const struct address_space_operations omfs_aops = {
>  	.readpage = omfs_readpage,
> -	.readpages = omfs_readpages,
> +	.readahead = omfs_readahead,
>  	.writepage = omfs_writepage,
>  	.writepages = omfs_writepages,
>  	.write_begin = omfs_write_begin,
> diff --git a/fs/qnx6/inode.c b/fs/qnx6/inode.c
> index 345db56c98fd..755293c8c71a 100644
> --- a/fs/qnx6/inode.c
> +++ b/fs/qnx6/inode.c
> @@ -99,10 +99,9 @@ static int qnx6_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, qnx6_get_block);
>  }
>  
> -static int qnx6_readpages(struct file *file, struct address_space *mapping,
> -		   struct list_head *pages, unsigned nr_pages)
> +static void qnx6_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, qnx6_get_block);
> +	mpage_readahead(rac, qnx6_get_block);
>  }
>  
>  /*
> @@ -499,7 +498,7 @@ static sector_t qnx6_bmap(struct address_space *mapping, sector_t block)
>  }
>  static const struct address_space_operations qnx6_aops = {
>  	.readpage	= qnx6_readpage,
> -	.readpages	= qnx6_readpages,
> +	.readahead	= qnx6_readahead,
>  	.bmap		= qnx6_bmap
>  };
>  
> diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
> index 6419e6dacc39..0031070b3692 100644
> --- a/fs/reiserfs/inode.c
> +++ b/fs/reiserfs/inode.c
> @@ -1160,11 +1160,9 @@ int reiserfs_get_block(struct inode *inode, sector_t block,
>  	return retval;
>  }
>  
> -static int
> -reiserfs_readpages(struct file *file, struct address_space *mapping,
> -		   struct list_head *pages, unsigned nr_pages)
> +static void reiserfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, reiserfs_get_block);
> +	mpage_readahead(rac, reiserfs_get_block);
>  }
>  
>  /*
> @@ -3434,7 +3432,7 @@ int reiserfs_setattr(struct dentry *dentry, struct iattr *attr)
>  const struct address_space_operations reiserfs_address_space_operations = {
>  	.writepage = reiserfs_writepage,
>  	.readpage = reiserfs_readpage,
> -	.readpages = reiserfs_readpages,
> +	.readahead = reiserfs_readahead,
>  	.releasepage = reiserfs_releasepage,
>  	.invalidatepage = reiserfs_invalidatepage,
>  	.write_begin = reiserfs_write_begin,
> diff --git a/fs/udf/inode.c b/fs/udf/inode.c
> index e875bc5668ee..adaba8e8b326 100644
> --- a/fs/udf/inode.c
> +++ b/fs/udf/inode.c
> @@ -195,10 +195,9 @@ static int udf_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, udf_get_block);
>  }
>  
> -static int udf_readpages(struct file *file, struct address_space *mapping,
> -			struct list_head *pages, unsigned nr_pages)
> +static void udf_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, udf_get_block);
> +	mpage_readahead(rac, udf_get_block);
>  }
>  
>  static int udf_write_begin(struct file *file, struct address_space *mapping,
> @@ -234,7 +233,7 @@ static sector_t udf_bmap(struct address_space *mapping, sector_t block)
>  
>  const struct address_space_operations udf_aops = {
>  	.readpage	= udf_readpage,
> -	.readpages	= udf_readpages,
> +	.readahead	= udf_readahead,
>  	.writepage	= udf_writepage,
>  	.writepages	= udf_writepages,
>  	.write_begin	= udf_write_begin,
> diff --git a/include/linux/mpage.h b/include/linux/mpage.h
> index 001f1fcf9836..f4f5e90a6844 100644
> --- a/include/linux/mpage.h
> +++ b/include/linux/mpage.h
> @@ -13,9 +13,9 @@
>  #ifdef CONFIG_BLOCK
>  
>  struct writeback_control;
> +struct readahead_control;
>  
> -int mpage_readpages(struct address_space *mapping, struct list_head *pages,
> -				unsigned nr_pages, get_block_t get_block);
> +void mpage_readahead(struct readahead_control *, get_block_t get_block);
>  int mpage_readpage(struct page *page, get_block_t get_block);
>  int mpage_writepages(struct address_space *mapping,
>  		struct writeback_control *wbc, get_block_t get_block);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b1092876e537..a32122095702 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1020,7 +1020,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		 * to the LRU. Later, when the IO completes the pages are
>  		 * marked uptodate and unlocked. However, the queueing
>  		 * could be merging multiple pages for one bio (e.g.
> -		 * mpage_readpages). If an allocation happens for the
> +		 * mpage_readahead). If an allocation happens for the
>  		 * second or third page, the process can end up locking
>  		 * the same page twice and deadlocking. Rather than
>  		 * trying to be clever about what pages can be locked,
> 

* Re: [PATCH v6 01/19] mm: Return void from various readahead functions
  2020-02-17 18:45 ` [PATCH v6 01/19] mm: Return void from various readahead functions Matthew Wilcox
@ 2020-02-18  4:47   ` Dave Chinner
  2020-02-18 21:05   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  4:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:42AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> ondemand_readahead has two callers, neither of which use the return value.
> That means that both ra_submit and __do_page_cache_readahead() can return
> void, and we don't need to worry that a present page in the readahead
> window causes us to return a smaller nr_pages than we ought to have.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 02/19] mm: Ignore return value of ->readpages
  2020-02-17 18:45 ` [PATCH v6 02/19] mm: Ignore return value of ->readpages Matthew Wilcox
@ 2020-02-18  4:48   ` Dave Chinner
  2020-02-18 21:33   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  4:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	Christoph Hellwig, linux-btrfs

On Mon, Feb 17, 2020 at 10:45:43AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We used to assign the return value to a variable, which we then ignored.
> Remove the pretence of caring.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  mm/readahead.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)

Simple enough.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 00/19] Change readahead API
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (32 preceding siblings ...)
  2020-02-17 18:48 ` [PATCH v6 00/19] Change readahead API Matthew Wilcox
@ 2020-02-18  4:56 ` Dave Chinner
  2020-02-18 13:42   ` Matthew Wilcox
  2020-02-18 20:49 ` John Hubbard
  34 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  4:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:41AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This series adds a readahead address_space operation to eventually
> replace the readpages operation.  The key difference is that
> pages are added to the page cache as they are allocated (and
> then looked up by the filesystem) instead of passing them on a
> list to the readpages operation and having the filesystem add
> them to the page cache.  It's a net reduction in code for each
> implementation, more efficient than walking a list, and solves
> the direct-write vs buffered-read problem reported by yu kuai at
> https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/
> 
> The only unconverted filesystems are those which use fscache.
> Their conversion is pending Dave Howells' rewrite which will make the
> conversion substantially easier.

Latest version in your git tree:

$ ▶ glo -n 5 willy/readahead
4be497096c04 mm: Use memalloc_nofs_save in readahead path
ff63497fcb98 iomap: Convert from readpages to readahead
26aee60e89b5 iomap: Restructure iomap_readpages_actor
8115bcca7312 fuse: Convert from readpages to readahead
3db3d10d9ea1 f2fs: Convert from readpages to readahead
$

merged into a 5.6-rc2 tree fails at boot on my test vm:

[    2.423116] ------------[ cut here ]------------
[    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
[    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
[    2.430617] CPU: 4 PID: 1 Comm: sh Not tainted 5.6.0-rc2-dgc+ #1800
[    2.432405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    2.434744] RIP: 0010:__list_add_valid+0x67/0x70
[    2.436107] Code: c6 4c 89 ca 48 c7 c7 10 41 58 82 e8 55 29 89 ff 0f 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 60 41 58 82 e8 3b 29 89 ff <0f> 0b 31 c7
[    2.441161] RSP: 0000:ffffc900018a3bb0 EFLAGS: 00010082
[    2.442548] RAX: 0000000000000000 RBX: ffffea000efff4c0 RCX: 0000000000000256
[    2.444432] RDX: 0000000000000001 RSI: 0000000000000086 RDI: ffffffff8288a8b0
[    2.446315] RBP: ffffea000efff4c8 R08: ffffc900018a3a65 R09: 0000000000000256
[    2.448199] R10: 0000000000000008 R11: ffffc900018a3a65 R12: ffffea000efff4c8
[    2.450072] R13: ffff8883bfffee60 R14: 0000000000000010 R15: 0000000000000001
[    2.451959] FS:  0000000000000000(0000) GS:ffff8883b9c00000(0000) knlGS:0000000000000000
[    2.454083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.455604] CR2: 00000000ffffffff CR3: 00000003b9a37002 CR4: 0000000000060ee0
[    2.457484] Call Trace:
[    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
[    2.459376]  pagevec_lru_move_fn+0x87/0xd0
[    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
[    2.461712]  lru_add_drain_cpu+0x8d/0x160
[    2.462787]  lru_add_drain+0x18/0x20
[    2.463757]  shift_arg_pages+0xb8/0x180
[    2.464789]  ? vprintk_emit+0x101/0x1c0
[    2.465813]  ? printk+0x58/0x6f
[    2.466659]  setup_arg_pages+0x205/0x240
[    2.467716]  load_elf_binary+0x34a/0x1560
[    2.468789]  ? get_user_pages_remote+0x159/0x280
[    2.470024]  ? selinux_inode_permission+0x10d/0x1e0
[    2.471323]  ? _raw_read_unlock+0xa/0x20
[    2.472375]  ? load_misc_binary+0x2b2/0x410
[    2.473492]  search_binary_handler+0x60/0xe0
[    2.474634]  __do_execve_file.isra.0+0x512/0x850
[    2.475888]  ? rest_init+0xc6/0xc6
[    2.476801]  do_execve+0x21/0x30
[    2.477671]  try_to_run_init_process+0x10/0x34
[    2.478855]  kernel_init+0xe2/0xfa
[    2.479776]  ret_from_fork+0x1f/0x30
[    2.480737] ---[ end trace e77079de9b22dc6a ]---

I just dropped the ext4 conversion from my local tree so I can boot
the machine and test XFS. Might have some more info when that
crashes and burns...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 03/19] mm: Use readahead_control to pass arguments
  2020-02-17 18:45 ` [PATCH v6 03/19] mm: Use readahead_control to pass arguments Matthew Wilcox
@ 2020-02-18  5:03   ` Dave Chinner
  2020-02-18 13:56     ` Matthew Wilcox
  2020-02-18 22:22   ` John Hubbard
  1 sibling, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  5:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> In this patch, only between __do_page_cache_readahead() and
> read_pages(), but it will be extended in upcoming patches.  Also add
> the readahead_count() accessor.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/pagemap.h | 17 +++++++++++++++++
>  mm/readahead.c          | 36 +++++++++++++++++++++---------------
>  2 files changed, 38 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..982ecda2d4a2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -630,6 +630,23 @@ static inline int add_to_page_cache(struct page *page,
>  	return error;
>  }
>  
> +/*
> + * Readahead is of a block of consecutive pages.
> + */
> +struct readahead_control {
> +	struct file *file;
> +	struct address_space *mapping;
> +/* private: use the readahead_* accessors instead */
> +	pgoff_t _start;
> +	unsigned int _nr_pages;
> +};
> +
> +/* The number of pages in this readahead block */
> +static inline unsigned int readahead_count(struct readahead_control *rac)
> +{
> +	return rac->_nr_pages;
> +}
> +
>  static inline unsigned long dir_pages(struct inode *inode)
>  {
>  	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 12d13b7792da..15329309231f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -113,26 +113,29 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
>  
>  EXPORT_SYMBOL(read_cache_pages);
>  
> -static void read_pages(struct address_space *mapping, struct file *filp,
> -		struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
> +static void read_pages(struct readahead_control *rac, struct list_head *pages,
> +		gfp_t gfp)
>  {
> +	const struct address_space_operations *aops = rac->mapping->a_ops;
>  	struct blk_plug plug;
>  	unsigned page_idx;

Splitting out the aops rather than the mapping here just looks
weird, especially as you need the mapping later in the function.
Using aops doesn't even reduce the code side....

>  
>  	blk_start_plug(&plug);
>  
> -	if (mapping->a_ops->readpages) {
> -		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
> +	if (aops->readpages) {
> +		aops->readpages(rac->file, rac->mapping, pages,
> +				readahead_count(rac));
>  		/* Clean up the remaining pages */
>  		put_pages_list(pages);
>  		goto out;
>  	}
>  
> -	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> +	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
>  		struct page *page = lru_to_page(pages);
>  		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> -			mapping->a_ops->readpage(filp, page);
> +		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
> +				gfp))
> +			aops->readpage(rac->file, page);

... it just makes this less readable by splitting the if() over two
lines...

>  		put_page(page);
>  	}
>  
> @@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
>  	int page_idx;
> -	unsigned int nr_pages = 0;
>  	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> +	struct readahead_control rac = {
> +		.mapping = mapping,
> +		.file = filp,
> +		._nr_pages = 0,
> +	};

No need to initialise _nr_pages to zero, leaving it out will do the
same thing.

>  
>  	if (isize == 0)
>  		return;
> @@ -180,10 +187,9 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  			 * contiguous pages before continuing with the next
>  			 * batch.
>  			 */
> -			if (nr_pages)
> -				read_pages(mapping, filp, &page_pool, nr_pages,
> -						gfp_mask);
> -			nr_pages = 0;
> +			if (readahead_count(&rac))
> +				read_pages(&rac, &page_pool, gfp_mask);
> +			rac._nr_pages = 0;

Hmmm. Wondering if it makes sense to move the gfp_mask into the
readahead control structure - if we have to pass the gfp_mask down
all the way alongside the rac, then I think it makes sense to do
that...
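
i.e. a minimal sketch, with the new field name purely illustrative:

	struct readahead_control {
		struct file *file;
		struct address_space *mapping;
		gfp_t gfp;	/* mask to use for page allocations */
	/* private: use the readahead_* accessors instead */
		pgoff_t _start;
		unsigned int _nr_pages;
	};

Then read_pages() and anything else that needs to allocate pages can
pull the mask straight out of the rac it is already being handed.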

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 04/19] mm: Rearrange readahead loop
  2020-02-17 18:45 ` [PATCH v6 04/19] mm: Rearrange readahead loop Matthew Wilcox
@ 2020-02-18  5:08   ` Dave Chinner
  2020-02-18 13:57     ` Matthew Wilcox
  2020-02-18 22:33   ` John Hubbard
  1 sibling, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  5:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Move the declaration of 'page' to inside the loop and move the 'kick
> off a fresh batch' code to the end of the function for easier use in
> subsequent patches.

Stale? The "kick off" code is moved to the tail of the loop, not the
end of the function.

> @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		page = xa_load(&mapping->i_pages, page_offset);
>  		if (page && !xa_is_value(page)) {
>  			/*
> -			 * Page already present?  Kick off the current batch of
> -			 * contiguous pages before continuing with the next
> -			 * batch.
> +			 * Page already present?  Kick off the current batch
> +			 * of contiguous pages before continuing with the
> +			 * next batch.  This page may be the one we would
> +			 * have intended to mark as Readahead, but we don't
> +			 * have a stable reference to this page, and it's
> +			 * not worth getting one just for that.
>  			 */
> -			if (readahead_count(&rac))
> -				read_pages(&rac, &page_pool, gfp_mask);
> -			rac._nr_pages = 0;
> -			continue;
> +			goto read;
>  		}
>  
>  		page = __page_cache_alloc(gfp_mask);
> @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		if (page_idx == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
>  		rac._nr_pages++;
> +		continue;
> +read:
> +		if (readahead_count(&rac))
> +			read_pages(&rac, &page_pool, gfp_mask);
> +		rac._nr_pages = 0;
>  	}

Also, why? This adds a goto from branched code that continues, then
adds a continue so the unbranched code doesn't execute the code the
goto jumps to. In absence of any explanation, this isn't an
improvement and doesn't make any sense...

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop
  2020-02-17 18:45 ` [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop Matthew Wilcox
@ 2020-02-18  5:14   ` Dave Chinner
  2020-02-18 23:08   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  5:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:48AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Eliminate the page_offset variable which was confusing with the
> 'offset' parameter and record the start of each consecutive run of
> pages in the readahead_control.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/readahead.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)

Looks ok, but having the readahead dispatch out of line from the
case that triggers it makes it hard to follow.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 06/19] mm: rename readahead loop variable to 'i'
  2020-02-17 18:45 ` [PATCH v6 06/19] mm: rename readahead loop variable to 'i' Matthew Wilcox
@ 2020-02-18  5:33   ` Dave Chinner
  2020-02-18 23:11   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  5:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, John Hubbard, linux-kernel, linux-f2fs-devel,
	cluster-devel, linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4,
	linux-erofs, linux-btrfs

On Mon, Feb 17, 2020 at 10:45:50AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Change the type of page_idx to unsigned long, and rename it -- it's
> just a loop counter, not a page index.
> 
> Suggested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/readahead.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-17 18:45 ` [PATCH v6 07/19] mm: Put readahead pages in cache earlier Matthew Wilcox
@ 2020-02-18  6:14   ` Dave Chinner
  2020-02-18 15:42     ` Matthew Wilcox
  2020-02-19  0:01   ` John Hubbard
  1 sibling, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  6:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:52AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> At allocation time, put the pages in the cache unless we're using
> ->readpages.  Add the readahead_for_each() iterator for the benefit of
> the ->readpage fallback.  This iterator supports huge pages, even though
> none of the filesystems to be converted do yet.

This could be better written - took me some time to get my head
around it and the code.

"When populating the page cache for readahead, mappings that don't
use ->readpages need to have their pages added to the page cache
before ->readpage is called. Do this insertion earlier so that the
pages can be looked up immediately prior to ->readpage calls rather
than passing them on a linked list. This early insert functionality
is also required by the upcoming ->readahead method that will
replace ->readpages.

Optimise and simplify the readpage loop by adding a
readahead_for_each() iterator to provide the pages we need to read.
This iterator also supports huge pages, even though none of the
filesystems have been converted to use them yet."

> +static inline struct page *readahead_page(struct readahead_control *rac)
> +{
> +	struct page *page;
> +
> +	if (!rac->_nr_pages)
> +		return NULL;

Hmmmm.

> +
> +	page = xa_load(&rac->mapping->i_pages, rac->_start);
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	rac->_batch_count = hpage_nr_pages(page);

So we could have rac->_nr_pages = 2, and then we get an order 2
large page returned, and so rac->_batch_count = 4.
> +
> +	return page;
> +}
> +
> +static inline void readahead_next(struct readahead_control *rac)
> +{
> +	rac->_nr_pages -= rac->_batch_count;
> +	rac->_start += rac->_batch_count;

This results in rac->_nr_pages = -2 (or a huge positive number).
That means that readahead_page() will not terminate when it should,
and potentially will panic if it doesn't find the page that it
thinks should be there at rac->_start + 4...
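
Something like this (completely untested) would at least stop
_nr_pages from wrapping:

	static inline void readahead_next(struct readahead_control *rac)
	{
		/*
		 * A large page may extend beyond the end of the window
		 * we were asked to read, so don't let _nr_pages wrap.
		 */
		if (rac->_batch_count > rac->_nr_pages)
			rac->_batch_count = rac->_nr_pages;
		rac->_nr_pages -= rac->_batch_count;
		rac->_start += rac->_batch_count;
	}

though arguably the readahead window should never be allowed to end
in the middle of a large page in the first place.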

> +#define readahead_for_each(rac, page)					\
> +	for (; (page = readahead_page(rac)); readahead_next(rac))
> +
>  /* The number of pages in this readahead block */
>  static inline unsigned int readahead_count(struct readahead_control *rac)
>  {
> diff --git a/mm/readahead.c b/mm/readahead.c
> index bdc5759000d3..9e430daae42f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -113,12 +113,11 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
>  
>  EXPORT_SYMBOL(read_cache_pages);
>  
> -static void read_pages(struct readahead_control *rac, struct list_head *pages,
> -		gfp_t gfp)
> +static void read_pages(struct readahead_control *rac, struct list_head *pages)
>  {
>  	const struct address_space_operations *aops = rac->mapping->a_ops;
> +	struct page *page;
>  	struct blk_plug plug;
> -	unsigned page_idx;
>  
>  	blk_start_plug(&plug);
>  
> @@ -127,19 +126,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
>  				readahead_count(rac));
>  		/* Clean up the remaining pages */
>  		put_pages_list(pages);
> -		goto out;
> -	}
> -
> -	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
> -		struct page *page = lru_to_page(pages);
> -		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
> -				gfp))
> +	} else {
> +		readahead_for_each(rac, page) {
>  			aops->readpage(rac->file, page);
> -		put_page(page);
> +			put_page(page);
> +		}
>  	}

Nice simplification and gets rid of the need for rac->mapping, but I
still find the aops variable weird.

> -out:
>  	blk_finish_plug(&plug);
>  }
>  
> @@ -159,6 +152,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	unsigned long i;
>  	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> +	bool use_list = mapping->a_ops->readpages;
>  	struct readahead_control rac = {
>  		.mapping = mapping,
>  		.file = filp,

[ I do find these unstructured mixes of declarations and
initialisations dense and difficult to read.... ]

> @@ -196,8 +190,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		page = __page_cache_alloc(gfp_mask);
>  		if (!page)
>  			break;
> -		page->index = offset;
> -		list_add(&page->lru, &page_pool);
> +		if (use_list) {
> +			page->index = offset;
> +			list_add(&page->lru, &page_pool);
> +		} else if (add_to_page_cache_lru(page, mapping, offset,
> +					gfp_mask) < 0) {
> +			put_page(page);
> +			goto read;
> +		}

Ok, so that's why you put read code at the end of the loop. To turn
the code into spaghetti :/

How much does this simplify down when we get rid of ->readpages and
can restructure the loop? This really seems like you're trying to
flatten two nested loops into one by the use of goto....
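
i.e. once ->readpages is gone (and the pages list along with it),
the two loops can just be written out directly. A completely
untested sketch, ignoring the SetPageReadahead marking:

	bool failed = false;

	while (index <= end_index && !failed) {
		/* Skip over any pages already present in the cache. */
		while (index <= end_index) {
			struct page *page = xa_load(&mapping->i_pages, index);

			if (!page || xa_is_value(page))
				break;
			index++;
		}

		/* Start the next contiguous batch here. */
		rac._start = index;
		rac._nr_pages = 0;

		/* Allocate and insert pages up to the next cached one. */
		while (index <= end_index) {
			struct page *page = __page_cache_alloc(gfp_mask);

			if (!page) {
				/* OOM - submit what we have and give up. */
				failed = true;
				break;
			}
			if (add_to_page_cache_lru(page, mapping, index,
						  gfp_mask) < 0) {
				/* Lost a race; skip this index. */
				put_page(page);
				index++;
				break;
			}
			rac._nr_pages++;
			index++;
		}

		/* Submit whatever this batch gathered. */
		if (readahead_count(&rac))
			read_pages(&rac);
	}

No goto required.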

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 08/19] mm: Add readahead address space operation
  2020-02-17 18:45 ` [PATCH v6 08/19] mm: Add readahead address space operation Matthew Wilcox
@ 2020-02-18  6:21   ` Dave Chinner
  2020-02-18 16:10     ` Matthew Wilcox
  2020-02-19  0:12   ` John Hubbard
  2020-02-19  3:10   ` Eric Biggers
  2 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  6:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This replaces ->readpages with a saner interface:
>  - Return void instead of an ignored error code.
>  - Pages are already in the page cache when ->readahead is called.

Might read better as:

 - Page cache is already populated with locked pages when
   ->readahead is called.

>  - Implementation looks up the pages in the page cache instead of
>    having them passed in a linked list.

Add:

 - cleanup of unused readahead handled by ->readahead caller, not
   the method implementation.

> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  Documentation/filesystems/locking.rst |  6 +++++-
>  Documentation/filesystems/vfs.rst     | 13 +++++++++++++
>  include/linux/fs.h                    |  2 ++
>  include/linux/pagemap.h               | 18 ++++++++++++++++++
>  mm/readahead.c                        |  8 +++++++-
>  5 files changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 5057e4d9dcd1..0ebc4491025a 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -239,6 +239,7 @@ prototypes::
>  	int (*readpage)(struct file *, struct page *);
>  	int (*writepages)(struct address_space *, struct writeback_control *);
>  	int (*set_page_dirty)(struct page *page);
> +	void (*readahead)(struct readahead_control *);
>  	int (*readpages)(struct file *filp, struct address_space *mapping,
>  			struct list_head *pages, unsigned nr_pages);
>  	int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -271,7 +272,8 @@ writepage:		yes, unlocks (see below)
>  readpage:		yes, unlocks
>  writepages:
>  set_page_dirty		no
> -readpages:
> +readahead:		yes, unlocks
> +readpages:		no
>  write_begin:		locks the page		 exclusive
>  write_end:		yes, unlocks		 exclusive
>  bmap:
> @@ -295,6 +297,8 @@ the request handler (/dev/loop).
>  ->readpage() unlocks the page, either synchronously or via I/O
>  completion.
>  
> +->readahead() unlocks the pages like ->readpage().
> +

"... the pages that I/O is attempted on ..."

>  ->readpages() populates the pagecache with the passed pages and starts
>  I/O against them.  They come unlocked upon I/O completion.
>  
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..81ab30fbe45c 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are defined:
>  		int (*readpage)(struct file *, struct page *);
>  		int (*writepages)(struct address_space *, struct writeback_control *);
>  		int (*set_page_dirty)(struct page *page);
> +		void (*readahead)(struct readahead_control *);
>  		int (*readpages)(struct file *filp, struct address_space *mapping,
>  				 struct list_head *pages, unsigned nr_pages);
>  		int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are defined:
>  	If defined, it should set the PageDirty flag, and the
>  	PAGECACHE_TAG_DIRTY tag in the radix tree.
>  
> +``readahead``
> +	Called by the VM to read pages associated with the address_space
> +	object.  The pages are consecutive in the page cache and are
> +	locked.  The implementation should decrement the page refcount
> +	after starting I/O on each page.  Usually the page will be
> +	unlocked by the I/O completion handler.  If the function does
> +	not attempt I/O on some pages, the caller will decrement the page
> +	refcount and unlock the pages for you.	Set PageUptodate if the
> +	I/O completes successfully.  Setting PageError on any page will
> +	be ignored; simply unlock the page if an I/O error occurs.
> +
>  ``readpages``
>  	called by the VM to read pages associated with the address_space
>  	object.  This is essentially just a vector version of readpage.
>  	Instead of just one page, several pages are requested.
>  	readpages is only used for read-ahead, so read errors are
>  	ignored.  If anything goes wrong, feel free to give up.
> +	This interface is deprecated; implement readahead instead.

What is the removal schedule for the deprecated interface? 

> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 3613154e79e4..bd4291f78f41 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -665,6 +665,24 @@ static inline void readahead_next(struct readahead_control *rac)
>  #define readahead_for_each(rac, page)					\
>  	for (; (page = readahead_page(rac)); readahead_next(rac))
>  
> +/* The byte offset into the file of this readahead block */
> +static inline loff_t readahead_offset(struct readahead_control *rac)
> +{
> +	return (loff_t)rac->_start * PAGE_SIZE;
> +}

Urk. Didn't an earlier patch use "offset" for the page index? That
was what "mm: Remove 'page_offset' from readahead loop" did, right?

That's just going to cause confusion to have different units for
readahead "offsets"....

> +
> +/* The number of bytes in this readahead block */
> +static inline loff_t readahead_length(struct readahead_control *rac)
> +{
> +	return (loff_t)rac->_nr_pages * PAGE_SIZE;
> +}
> +
> +/* The index of the first page in this readahead block */
> +static inline unsigned int readahead_index(struct readahead_control *rac)
> +{
> +	return rac->_start;
> +}

Based on this, I suspect the earlier patch should use "index" rather
than "offset" when walking the page cache indexes...

> +
>  /* The number of pages in this readahead block */
>  static inline unsigned int readahead_count(struct readahead_control *rac)
>  {
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9e430daae42f..975ff5e387be 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
>  
>  	blk_start_plug(&plug);
>  
> -	if (aops->readpages) {
> +	if (aops->readahead) {
> +		aops->readahead(rac);
> +		readahead_for_each(rac, page) {
> +			unlock_page(page);
> +			put_page(page);
> +		}

This needs a comment to explain the unwinding that needs to be done
here. I'm not going to remember in a year's time that this is just
for the pages that weren't submitted by ->readahead....
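
Something like:

		/*
		 * ->readahead() consumed the pages it started I/O on
		 * via the rac iterator and dropped its references.
		 * Any page still left in the rac was not submitted,
		 * so unlock it and drop our reference here rather
		 * than leaving locked pages in the page cache.
		 */

above the loop would do it.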

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 09/19] mm: Add page_cache_readahead_limit
  2020-02-17 18:45 ` [PATCH v6 09/19] mm: Add page_cache_readahead_limit Matthew Wilcox
@ 2020-02-18  6:31   ` Dave Chinner
  2020-02-18 19:54     ` Matthew Wilcox
  2020-02-19  1:32   ` John Hubbard
  1 sibling, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  6:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> ext4 and f2fs have duplicated the guts of the readahead code so
> they can read past i_size.  Instead, separate out the guts of the
> readahead code so they can call it directly.

Gross and nasty (hosting non-stale data beyond EOF in the page
cache, that is).

Code is pretty simple, but...

>  }
>  
> -/*
> - * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
> - * the pages first, then submits them for I/O. This avoids the very bad
> - * behaviour which would occur if page allocations are causing VM writeback.
> - * We really don't want to intermingle reads and writes like that.
> +/**
> + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
> + * @mapping: File address space.
> + * @file: This instance of the open file; used for authentication.
> + * @offset: First page index to read.
> + * @end_index: The maximum page index to read.
> + * @nr_to_read: The number of pages to read.
> + * @lookahead_size: Where to start the next readahead.
> + *
> + * This function is for filesystems to call when they want to start
> + * readahead potentially beyond a file's stated i_size.  If you want
> + * to start readahead on a normal file, you probably want to call
> + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
> + *
> + * Context: File is referenced by caller.  Mutexes may be held by caller.
> + * May sleep, but will not reenter filesystem to reclaim memory.
>   */
> -void __do_page_cache_readahead(struct address_space *mapping,
> -		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> -		unsigned long lookahead_size)
> +void page_cache_readahead_limit(struct address_space *mapping,

... I don't think the function name conveys its purpose. It's
really a ranged readahead that ignores where i_size lies, i.e.

	page_cache_readahead_range(mapping, start, end, nr_to_read)

seems like a better API to me, and then you can drop the "start
readahead beyond i_size" comments and replace it with "Range is not
limited by the inode's i_size and hence can be used to read data
stored beyond EOF into the page cache."

Also: "This is almost certainly not the function you want to call.
Use page_cache_async_readahead or page_cache_sync_readahead()
instead."
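
i.e. keep the arguments, just give it a name (and parameter names)
that describe what it does - a sketch:

	void page_cache_readahead_range(struct address_space *mapping,
			struct file *file, pgoff_t start, pgoff_t end,
			unsigned long nr_to_read,
			unsigned long lookahead_size);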

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
  2020-02-17 18:45 ` [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
  2020-02-18  1:51   ` [Ocfs2-devel] " Joseph Qi
@ 2020-02-18  6:37   ` Dave Chinner
  2020-02-19  2:48   ` John Hubbard
  2020-02-19  3:28   ` Eric Biggers
  3 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  6:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, Junxiao Bi, linux-kernel, linux-f2fs-devel,
	cluster-devel, linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4,
	linux-erofs, linux-btrfs

On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Implement the new readahead aop and convert all callers (block_dev,
> exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
> reiserfs & udf).  The callers are all trivial except for GFS2 & OCFS2.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
> ---
>  drivers/staging/exfat/exfat_super.c |  7 +++---
>  fs/block_dev.c                      |  7 +++---
>  fs/ext2/inode.c                     | 10 +++-----
>  fs/fat/inode.c                      |  7 +++---
>  fs/gfs2/aops.c                      | 23 ++++++-----------
>  fs/hpfs/file.c                      |  7 +++---
>  fs/iomap/buffered-io.c              |  2 +-
>  fs/isofs/inode.c                    |  7 +++---
>  fs/jfs/inode.c                      |  7 +++---
>  fs/mpage.c                          | 38 +++++++++--------------------
>  fs/nilfs2/inode.c                   | 15 +++---------
>  fs/ocfs2/aops.c                     | 34 ++++++++++----------------
>  fs/omfs/file.c                      |  7 +++---
>  fs/qnx6/inode.c                     |  7 +++---
>  fs/reiserfs/inode.c                 |  8 +++---
>  fs/udf/inode.c                      |  7 +++---
>  include/linux/mpage.h               |  4 +--
>  mm/migrate.c                        |  2 +-
>  18 files changed, 73 insertions(+), 126 deletions(-)

That's actually a pretty simple changeover. Nothing really scary
there. :)

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 11/19] btrfs: Convert from readpages to readahead
  2020-02-17 18:45 ` [PATCH v6 11/19] btrfs: Convert from readpages to readahead Matthew Wilcox
@ 2020-02-18  6:57   ` Dave Chinner
  2020-02-18 21:12     ` Matthew Wilcox
  0 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18  6:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:45:59AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use the new readahead operation in btrfs.  Add a
> readahead_for_each_batch() iterator to optimise the loop in the XArray.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/btrfs/extent_io.c    | 46 +++++++++++++----------------------------
>  fs/btrfs/extent_io.h    |  3 +--
>  fs/btrfs/inode.c        | 16 +++++++-------
>  include/linux/pagemap.h | 27 ++++++++++++++++++++++++
>  4 files changed, 49 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c0f202741e09..e97a6acd6f5d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4278,52 +4278,34 @@ int extent_writepages(struct address_space *mapping,
>  	return ret;
>  }
>  
> -int extent_readpages(struct address_space *mapping, struct list_head *pages,
> -		     unsigned nr_pages)
> +void extent_readahead(struct readahead_control *rac)
>  {
>  	struct bio *bio = NULL;
>  	unsigned long bio_flags = 0;
>  	struct page *pagepool[16];
>  	struct extent_map *em_cached = NULL;
> -	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
> -	int nr = 0;
> +	struct extent_io_tree *tree = &BTRFS_I(rac->mapping->host)->io_tree;
>  	u64 prev_em_start = (u64)-1;
> +	int nr;
>  
> -	while (!list_empty(pages)) {
> -		u64 contig_end = 0;
> -
> -		for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) {
> -			struct page *page = lru_to_page(pages);
> -
> -			prefetchw(&page->flags);
> -			list_del(&page->lru);
> -			if (add_to_page_cache_lru(page, mapping, page->index,
> -						readahead_gfp_mask(mapping))) {
> -				put_page(page);
> -				break;
> -			}
> -
> -			pagepool[nr++] = page;
> -			contig_end = page_offset(page) + PAGE_SIZE - 1;
> -		}
> +	readahead_for_each_batch(rac, pagepool, ARRAY_SIZE(pagepool), nr) {
> +		u64 contig_start = page_offset(pagepool[0]);
> +		u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1;

So this assumes a contiguous page range is returned, right?

>  
> -		if (nr) {
> -			u64 contig_start = page_offset(pagepool[0]);
> +		ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);

Ok, yes it does. :)

I don't see how readahead_for_each_batch() guarantees that, though.

>  
> -			ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
> -
> -			contiguous_readpages(tree, pagepool, nr, contig_start,
> -				     contig_end, &em_cached, &bio, &bio_flags,
> -				     &prev_em_start);
> -		}
> +		contiguous_readpages(tree, pagepool, nr, contig_start,
> +				contig_end, &em_cached, &bio, &bio_flags,
> +				&prev_em_start);
>  	}
>  
>  	if (em_cached)
>  		free_extent_map(em_cached);
>  
> -	if (bio)
> -		return submit_one_bio(bio, 0, bio_flags);
> -	return 0;
> +	if (bio) {
> +		if (submit_one_bio(bio, 0, bio_flags))
> +			return;
> +	}
>  }

Shouldn't that just be

	if (bio)
		submit_one_bio(bio, 0, bio_flags);

> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 4f36c06d064d..1bbb60a0bf16 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -669,6 +669,33 @@ static inline void readahead_next(struct readahead_control *rac)
>  #define readahead_for_each(rac, page)					\
>  	for (; (page = readahead_page(rac)); readahead_next(rac))
>  
> +static inline unsigned int readahead_page_batch(struct readahead_control *rac,
> +		struct page **array, unsigned int size)
> +{
> +	unsigned int batch = 0;

Confusing when put alongside rac->_batch_count counting the number
of pages in the batch, and "batch" being the index into the page
array, and they aren't the same counts....

> +	XA_STATE(xas, &rac->mapping->i_pages, rac->_start);
> +	struct page *page;
> +
> +	rac->_batch_count = 0;
> +	xas_for_each(&xas, page, rac->_start + rac->_nr_pages - 1) {

That just iterates over the pages in the [start, end] range, doesn't
it? What guarantees that this fills the array with a contiguous page
range?

> +		VM_BUG_ON_PAGE(!PageLocked(page), page);
> +		VM_BUG_ON_PAGE(PageTail(page), page);
> +		array[batch++] = page;
> +		rac->_batch_count += hpage_nr_pages(page);
> +		if (PageHead(page))
> +			xas_set(&xas, rac->_start + rac->_batch_count);

What on earth does this do? Comments please!
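
AFAICT it's skipping over the tail entries of a large page. If so,
it needs to say that, e.g.:

		/*
		 * A head page occupies hpage_nr_pages() consecutive
		 * slots in the xarray, so move the cursor past the
		 * tail entries rather than visiting each one.
		 */
		if (PageHead(page))
			xas_set(&xas, rac->_start + rac->_batch_count);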

> +
> +		if (batch == size)
> +			break;
> +	}
> +
> +	return batch;
> +}

Seems a bit big for an inline function.

> +
> +#define readahead_for_each_batch(rac, array, size, nr)			\
> +	for (; (nr = readahead_page_batch(rac, array, size));		\
> +			readahead_next(rac))

I had to go look at the caller to work out what "size" referred to
here.

This is complex enough that it needs proper API documentation.
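
Even a kerneldoc stub would help, e.g. (this is just my reading of
the code, so check the details):

	/**
	 * readahead_for_each_batch - iterate over a readahead request in batches
	 * @rac: The readahead request.
	 * @array: An array of struct page pointers to fill.
	 * @size: The maximum number of pages to put in @array.
	 * @nr: Set to the number of pages in the current batch.
	 */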

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 00/19] Change readahead API
  2020-02-18  4:56 ` Dave Chinner
@ 2020-02-18 13:42   ` Matthew Wilcox
  2020-02-18 21:26     ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 13:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> Latest version in your git tree:
> 
> $ ▶ glo -n 5 willy/readahead
> 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> ff63497fcb98 iomap: Convert from readpages to readahead
> 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> 8115bcca7312 fuse: Convert from readpages to readahead
> 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> $
> 
> merged into a 5.6-rc2 tree fails at boot on my test vm:
> 
> [    2.423116] ------------[ cut here ]------------
> [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> [    2.457484] Call Trace:
> [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> [    2.462787]  lru_add_drain+0x18/0x20

Are you sure that was 4be497096c04?  I ask because there was a
version pushed to that git tree that did contain a list double-add
(due to a mismerge when shuffling patches).  I noticed it and fixed
it, and 4be497096c04 doesn't have that problem.  I also test with
CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
probabilistic because it'll depend on the timing between whatever other
list is being used and the page actually being added to the LRU.


* Re: [PATCH v6 03/19] mm: Use readahead_control to pass arguments
  2020-02-18  5:03   ` Dave Chinner
@ 2020-02-18 13:56     ` Matthew Wilcox
  2020-02-18 22:46       ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 13:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	Christoph Hellwig, linux-btrfs

On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote:
> > +static void read_pages(struct readahead_control *rac, struct list_head *pages,
> > +		gfp_t gfp)
> >  {
> > +	const struct address_space_operations *aops = rac->mapping->a_ops;
> >  	struct blk_plug plug;
> >  	unsigned page_idx;
> 
> Splitting out the aops rather than the mapping here just looks
> weird, especially as you need the mapping later in the function.
> Using aops doesn't even reduce the code side....

It does in subsequent patches ... I agree it looks a little weird here,
but I think in the final form, it makes sense:

static void read_pages(struct readahead_control *rac, struct list_head *pages)
{
        const struct address_space_operations *aops = rac->mapping->a_ops;
        struct page *page;
        struct blk_plug plug;

        blk_start_plug(&plug);

        if (aops->readahead) {
                aops->readahead(rac);
                readahead_for_each(rac, page) {
                        unlock_page(page);
                        put_page(page);
                }
        } else if (aops->readpages) {
                aops->readpages(rac->file, rac->mapping, pages,
                                readahead_count(rac));
                /* Clean up the remaining pages */
                put_pages_list(pages);
        } else {
                readahead_for_each(rac, page) {
                        aops->readpage(rac->file, page);
                        put_page(page);
                }
        }

        blk_finish_plug(&plug);
}

It'll look even better once ->readpages goes away.

> > @@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space *mapping,
> >  	unsigned long end_index;	/* The last page we want to read */
> >  	LIST_HEAD(page_pool);
> >  	int page_idx;
> > -	unsigned int nr_pages = 0;
> >  	loff_t isize = i_size_read(inode);
> >  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> > +	struct readahead_control rac = {
> > +		.mapping = mapping,
> > +		.file = filp,
> > +		._nr_pages = 0,
> > +	};
> 
> No need to initialise _nr_pages to zero, leaving it out will do the
> same thing.

Yes, it does, but I wanted to make it explicit here.

> > +			if (readahead_count(&rac))
> > +				read_pages(&rac, &page_pool, gfp_mask);
> > +			rac._nr_pages = 0;
> 
> Hmmm. Wondering ig it make sense to move the gfp_mask to the readahead
> control structure - if we have to pass the gfp_mask down all the
> way along side the rac, then I think it makes sense to do that...

So we end up removing it later on in this series, but I do wonder if
it would make sense anyway.  By the end of the series, we still have
this in iomap:

                if (ctx->rac) /* same as readahead_gfp_mask */
                        gfp |= __GFP_NORETRY | __GFP_NOWARN;

and we could get rid of that by passing gfp flags down in the rac.  On the
other hand, I don't know why it doesn't just use readahead_gfp_mask()
here anyway ... Christoph?

* Re: [PATCH v6 04/19] mm: Rearrange readahead loop
  2020-02-18  5:08   ` Dave Chinner
@ 2020-02-18 13:57     ` Matthew Wilcox
  2020-02-18 22:48       ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 13:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 04:08:24PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > Move the declaration of 'page' to inside the loop and move the 'kick
> > off a fresh batch' code to the end of the function for easier use in
> > subsequent patches.
> 
> Stale? The "kick off" code is moved to the tail of the loop, not the
> end of the function.

Braino; I meant to write end of the loop.

> > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
> >  		page = xa_load(&mapping->i_pages, page_offset);
> >  		if (page && !xa_is_value(page)) {
> >  			/*
> > -			 * Page already present?  Kick off the current batch of
> > -			 * contiguous pages before continuing with the next
> > -			 * batch.
> > +			 * Page already present?  Kick off the current batch
> > +			 * of contiguous pages before continuing with the
> > +			 * next batch.  This page may be the one we would
> > +			 * have intended to mark as Readahead, but we don't
> > +			 * have a stable reference to this page, and it's
> > +			 * not worth getting one just for that.
> >  			 */
> > -			if (readahead_count(&rac))
> > -				read_pages(&rac, &page_pool, gfp_mask);
> > -			rac._nr_pages = 0;
> > -			continue;
> > +			goto read;
> >  		}
> >  
> >  		page = __page_cache_alloc(gfp_mask);
> > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
> >  		if (page_idx == nr_to_read - lookahead_size)
> >  			SetPageReadahead(page);
> >  		rac._nr_pages++;
> > +		continue;
> > +read:
> > +		if (readahead_count(&rac))
> > +			read_pages(&rac, &page_pool, gfp_mask);
> > +		rac._nr_pages = 0;
> >  	}
> 
> Also, why? This adds a goto from branched code that continues, then
> adds a continue so the unbranched code doesn't execute the code the
> goto jumps to. In absence of any explanation, this isn't an
> improvement and doesn't make any sense...

I thought I was explaining it ... "for easier use in subsequent patches".

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-18  6:14   ` Dave Chinner
@ 2020-02-18 15:42     ` Matthew Wilcox
  2020-02-19  0:59       ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 15:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 05:14:59PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2020 at 10:45:52AM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > At allocation time, put the pages in the cache unless we're using
> > ->readpages.  Add the readahead_for_each() iterator for the benefit of
> > the ->readpage fallback.  This iterator supports huge pages, even though
> > none of the filesystems to be converted do yet.
> 
> This could be better written - took me some time to get my head
> around it and the code.
> 
> "When populating the page cache for readahead, mappings that don't
> use ->readpages need to have their pages added to the page cache
> before ->readpage is called. Do this insertion earlier so that the
> pages can be looked up immediately prior to ->readpage calls rather
> than passing them on a linked list. This early insert functionality
> is also required by the upcoming ->readahead method that will
> replace ->readpages.
> 
> Optimise and simplify the readpage loop by adding a
> readahead_for_each() iterator to provide the pages we need to read.
> This iterator also supports huge pages, even though none of the
> filesystems have been converted to use them yet."

Thanks, I'll use that.

> > +static inline struct page *readahead_page(struct readahead_control *rac)
> > +{
> > +	struct page *page;
> > +
> > +	if (!rac->_nr_pages)
> > +		return NULL;
> 
> Hmmmm.
> 
> > +
> > +	page = xa_load(&rac->mapping->i_pages, rac->_start);
> > +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > +	rac->_batch_count = hpage_nr_pages(page);
> 
> So we could have rac->_nr_pages = 2, and then we get an order 2
> large page returned, and so rac->_batch_count = 4.

Well, no, we couldn't.  rac->_nr_pages is incremented by 4 when we add
an order-2 page to the readahead.  I can put a
	BUG_ON(rac->_batch_count > rac->_nr_pages)
in here to be sure to catch any logic error like that.
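
i.e. a sketch of the quoted hunk with that check added:

	page = xa_load(&rac->mapping->i_pages, rac->_start);
	VM_BUG_ON_PAGE(!PageLocked(page), page);
	rac->_batch_count = hpage_nr_pages(page);
	/* A batch can never be larger than what was added to the rac */
	BUG_ON(rac->_batch_count > rac->_nr_pages);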

> > @@ -159,6 +152,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
> >  	unsigned long i;
> >  	loff_t isize = i_size_read(inode);
> >  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> > +	bool use_list = mapping->a_ops->readpages;
> >  	struct readahead_control rac = {
> >  		.mapping = mapping,
> >  		.file = filp,
> 
> [ I do find these unstructured mixes of declarations and
> initialisations dense and difficult to read.... ]

Fair ... although I didn't create this mess, I can tidy it up a bit.

> > -		page->index = offset;
> > -		list_add(&page->lru, &page_pool);
> > +		if (use_list) {
> > +			page->index = offset;
> > +			list_add(&page->lru, &page_pool);
> > +		} else if (add_to_page_cache_lru(page, mapping, offset,
> > +					gfp_mask) < 0) {
> > +			put_page(page);
> > +			goto read;
> > +		}
> 
> Ok, so that's why you put read code at the end of the loop. To turn
> the code into spaghetti :/
> 
> How much does this simplify down when we get rid of ->readpages and
> can restructure the loop? This really seems like you're trying to
> flatten two nested loops into one by the use of goto....

I see it as having two failure cases in this loop.  One for "page is
already present" (which already existed) and one for "allocated a page,
but failed to add it to the page cache" (which used to be done later).
I didn't want to duplicate the "call read_pages()" code.  So I reshuffled
the code rather than add a nested loop.  I don't think the nested loop
is easier to read (we'll be at 5 levels of indentation for some statements).
Could do it this way ...

@@ -218,18 +218,17 @@ void page_cache_readahead_limit(struct address_space *mapping,
                } else if (add_to_page_cache_lru(page, mapping, offset,
                                        gfp_mask) < 0) {
                        put_page(page);
-                       goto read;
+read:
+                       if (readahead_count(&rac))
+                               read_pages(&rac, &page_pool);
+                       rac._nr_pages = 0;
+                       rac._start = ++offset;
+                       continue;
                }
                if (i == nr_to_read - lookahead_size)
                        SetPageReadahead(page);
                rac._nr_pages++;
                offset++;
-               continue;
-read:
-               if (readahead_count(&rac))
-                       read_pages(&rac, &page_pool);
-               rac._nr_pages = 0;
-               rac._start = ++offset;
        }
 
        /*

but I'm not sure that's any better.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 08/19] mm: Add readahead address space operation
  2020-02-18  6:21   ` Dave Chinner
@ 2020-02-18 16:10     ` Matthew Wilcox
  2020-02-19  1:04       ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 16:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 05:21:47PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > This replaces ->readpages with a saner interface:
> >  - Return void instead of an ignored error code.
> >  - Pages are already in the page cache when ->readahead is called.
> 
> Might read better as:
> 
>  - Page cache is already populated with locked pages when
>    ->readahead is called.

Will do.

> >  - Implementation looks up the pages in the page cache instead of
> >    having them passed in a linked list.
> 
> Add:
> 
>  - cleanup of unused readahead handled by ->readahead caller, not
>    the method implementation.

The readpages caller does that cleanup too, so it's not an advantage
to the readahead interface.

        if (mapping->a_ops->readpages) {
                ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
                /* Clean up the remaining pages */
                put_pages_list(pages);
                goto out;
        }

> >  ``readpages``
> >  	called by the VM to read pages associated with the address_space
> >  	object.  This is essentially just a vector version of readpage.
> >  	Instead of just one page, several pages are requested.
> >  	readpages is only used for read-ahead, so read errors are
> >  	ignored.  If anything goes wrong, feel free to give up.
> > +	This interface is deprecated; implement readahead instead.
> 
> What is the removal schedule for the deprecated interface? 

I mentioned that in the cover letter; once Dave Howells has the fscache
branch merged, I'll do the remaining filesystems.  Should be within the
next couple of merge windows.

> > +/* The byte offset into the file of this readahead block */
> > +static inline loff_t readahead_offset(struct readahead_control *rac)
> > +{
> > +	return (loff_t)rac->_start * PAGE_SIZE;
> > +}
> 
> Urk. Didn't an earlier patch use "offset" for the page index? That's
> what "mm: Remove 'page_offset' from readahead loop" did, right?
> 
> That's just going to cause confusion to have different units for
> readahead "offsets"....

We are ... not consistent anywhere in the VM/VFS with our naming.
Unfortunately.

$ grep -n offset mm/filemap.c 
391: * @start:	offset in bytes where the range starts
...
815:	pgoff_t offset = old->index;
...
2020:	unsigned long offset;      /* offset into pagecache page */
...
2257:	*ppos = ((loff_t)index << PAGE_SHIFT) + offset;

That last one's my favourite.  Not to mention the fine distinction you
and I discussed recently between offset_in_page() and page_offset().

Best of all, even our types encode the ambiguity of an 'offset'.  We have
pgoff_t and loff_t (replacing the earlier off_t).

So, new rule.  'pos' is the number of bytes into a file.  'index' is the
number of PAGE_SIZE pages into a file.  We don't use the word 'offset'
at all.  'length' as a byte count and 'count' as a page count seem like
fine names to me.
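
Under that rule, the conversions between the two units are trivial.
Illustrative helpers (names hypothetical, not proposed for the series):

static inline loff_t index_to_pos(pgoff_t index)
{
	return (loff_t)index << PAGE_SHIFT;	/* page index -> byte position */
}

static inline pgoff_t pos_to_index(loff_t pos)
{
	return pos >> PAGE_SHIFT;		/* byte position -> page index */
}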

> > -	if (aops->readpages) {
> > +	if (aops->readahead) {
> > +		aops->readahead(rac);
> > +		readahead_for_each(rac, page) {
> > +			unlock_page(page);
> > +			put_page(page);
> > +		}
> 
> This needs a comment to explain the unwinding that needs to be done
> here. I'm not going to remember in a year's time that this is just
> for the pages that weren't submitted by ->readahead....

ACK.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 09/19] mm: Add page_cache_readahead_limit
  2020-02-18  6:31   ` Dave Chinner
@ 2020-02-18 19:54     ` Matthew Wilcox
  2020-02-19  1:08       ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 19:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 05:31:10PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > ext4 and f2fs have duplicated the guts of the readahead code so
> > they can read past i_size.  Instead, separate out the guts of the
> > readahead code so they can call it directly.
> 
> Gross and nasty (hosting non-stale data beyond EOF in the page
> cache, that is).

I thought you meant sneaking changes into the VFS (that were rejected) by
copying VFS code and modifying it ...

> > +/**
> > + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
> > + * @mapping: File address space.
> > + * @file: This instance of the open file; used for authentication.
> > + * @offset: First page index to read.
> > + * @end_index: The maximum page index to read.
> > + * @nr_to_read: The number of pages to read.
> > + * @lookahead_size: Where to start the next readahead.
> > + *
> > + * This function is for filesystems to call when they want to start
> > + * readahead potentially beyond a file's stated i_size.  If you want
> > + * to start readahead on a normal file, you probably want to call
> > + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
> > + *
> > + * Context: File is referenced by caller.  Mutexes may be held by caller.
> > + * May sleep, but will not reenter filesystem to reclaim memory.
> >   */
> > -void __do_page_cache_readahead(struct address_space *mapping,
> > -		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> > -		unsigned long lookahead_size)
> > +void page_cache_readahead_limit(struct address_space *mapping,
> 
> ... I don't think the function name conveys its purpose. It's
> really a ranged readahead that ignores where i_size lies. i.e
> 
> 	page_cache_readahead_range(mapping, start, end, nr_to_read)
> 
> seems like a better API to me, and then you can drop the "start
> readahead beyond i_size" comments and replace it with "Range is not
> limited by the inode's i_size and hence can be used to read data
> stored beyond EOF into the page cache."

I'm concerned that calling it 'range' implies "I want to read between
start and end" rather than "I want to read nr_to_read at start, oh but
don't go past end".

Maybe the right way to do this is have the three callers cap nr_to_read.
Well, the one caller ... after all, f2fs and ext4 have no desire to
cap the length.  Then we can call it page_cache_readahead_exceed() or
page_cache_readahead_dangerous() or something else like that to make it
clear that you shouldn't be calling it.

> Also: "This is almost certainly not the function you want to call.
> Use page_cache_async_readahead or page_cache_sync_readahead()
> instead."

+1 to that ;-)

Here's what I currently have:

From d202dda7a92566496fe9e233ee7855fb560324ce Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Mon, 10 Feb 2020 18:31:15 -0500
Subject: [PATCH] mm: Add page_cache_readahead_exceed

ext4 and f2fs have duplicated the guts of the readahead code so
they can read past i_size.  Instead, separate out the guts of the
readahead code so they can call it directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/ext4/verity.c        | 35 ++--------------------
 fs/f2fs/verity.c        | 35 ++--------------------
 include/linux/pagemap.h |  3 ++
 mm/readahead.c          | 66 ++++++++++++++++++++++++++++-------------
 4 files changed, 52 insertions(+), 87 deletions(-)

diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index dc5ec724d889..172ebf860014 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -342,37 +342,6 @@ static int ext4_get_verity_descriptor(struct inode *inode, void *buf,
 	return desc_size;
 }
 
-/*
- * Prefetch some pages from the file's Merkle tree.
- *
- * This is basically a stripped-down version of __do_page_cache_readahead()
- * which works on pages past i_size.
- */
-static void ext4_merkle_tree_readahead(struct address_space *mapping,
-				       pgoff_t start_index, unsigned long count)
-{
-	LIST_HEAD(pages);
-	unsigned int nr_pages = 0;
-	struct page *page;
-	pgoff_t index;
-	struct blk_plug plug;
-
-	for (index = start_index; index < start_index + count; index++) {
-		page = xa_load(&mapping->i_pages, index);
-		if (!page || xa_is_value(page)) {
-			page = __page_cache_alloc(readahead_gfp_mask(mapping));
-			if (!page)
-				break;
-			page->index = index;
-			list_add(&page->lru, &pages);
-			nr_pages++;
-		}
-	}
-	blk_start_plug(&plug);
-	ext4_mpage_readpages(mapping, &pages, NULL, nr_pages, true);
-	blk_finish_plug(&plug);
-}
-
 static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
@@ -386,8 +355,8 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			ext4_merkle_tree_readahead(inode->i_mapping, index,
-						   num_ra_pages);
+			page_cache_readahead_exceed(inode->i_mapping, NULL,
+					index, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
index d7d430a6f130..f240ad087162 100644
--- a/fs/f2fs/verity.c
+++ b/fs/f2fs/verity.c
@@ -222,37 +222,6 @@ static int f2fs_get_verity_descriptor(struct inode *inode, void *buf,
 	return size;
 }
 
-/*
- * Prefetch some pages from the file's Merkle tree.
- *
- * This is basically a stripped-down version of __do_page_cache_readahead()
- * which works on pages past i_size.
- */
-static void f2fs_merkle_tree_readahead(struct address_space *mapping,
-				       pgoff_t start_index, unsigned long count)
-{
-	LIST_HEAD(pages);
-	unsigned int nr_pages = 0;
-	struct page *page;
-	pgoff_t index;
-	struct blk_plug plug;
-
-	for (index = start_index; index < start_index + count; index++) {
-		page = xa_load(&mapping->i_pages, index);
-		if (!page || xa_is_value(page)) {
-			page = __page_cache_alloc(readahead_gfp_mask(mapping));
-			if (!page)
-				break;
-			page->index = index;
-			list_add(&page->lru, &pages);
-			nr_pages++;
-		}
-	}
-	blk_start_plug(&plug);
-	f2fs_mpage_readpages(mapping, &pages, NULL, nr_pages, true);
-	blk_finish_plug(&plug);
-}
-
 static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index,
 					       unsigned long num_ra_pages)
@@ -266,8 +235,8 @@ static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
 		if (page)
 			put_page(page);
 		else if (num_ra_pages > 1)
-			f2fs_merkle_tree_readahead(inode->i_mapping, index,
-						   num_ra_pages);
+			page_cache_readahead_exceed(inode->i_mapping, NULL,
+					index, num_ra_pages, 0);
 		page = read_mapping_page(inode->i_mapping, index, NULL);
 	}
 	return page;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 48c3bca57df6..1f7964d2b8ca 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -623,6 +623,9 @@ void page_cache_sync_readahead(struct address_space *, struct file_ra_state *,
 void page_cache_async_readahead(struct address_space *, struct file_ra_state *,
 		struct file *, struct page *, pgoff_t index,
 		unsigned long req_count);
+void page_cache_readahead_exceed(struct address_space *, struct file *,
+		pgoff_t index, unsigned long nr_to_read,
+		unsigned long lookahead_count);
 
 /*
  * Like add_to_page_cache_locked, but used to add newly allocated pages:
diff --git a/mm/readahead.c b/mm/readahead.c
index 9dd431fa16c9..cad26287ad8b 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -142,45 +142,43 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
 	blk_finish_plug(&plug);
 }
 
-/*
- * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
- * the pages first, then submits them for I/O. This avoids the very bad
- * behaviour which would occur if page allocations are causing VM writeback.
- * We really don't want to intermingle reads and writes like that.
+/**
+ * page_cache_readahead_exceed - Start unchecked readahead.
+ * @mapping: File address space.
+ * @file: This instance of the open file; used for authentication.
+ * @index: First page index to read.
+ * @nr_to_read: The number of pages to read.
+ * @lookahead_size: Where to start the next readahead.
+ *
+ * This function is for filesystems to call when they want to start
+ * readahead beyond a file's stated i_size.  This is almost certainly
+ * not the function you want to call.  Use page_cache_async_readahead()
+ * or page_cache_sync_readahead() instead.
+ *
+ * Context: File is referenced by caller.  Mutexes may be held by caller.
+ * May sleep, but will not reenter filesystem to reclaim memory.
  */
-void __do_page_cache_readahead(struct address_space *mapping,
-		struct file *filp, pgoff_t index, unsigned long nr_to_read,
+void page_cache_readahead_exceed(struct address_space *mapping,
+		struct file *file, pgoff_t index, unsigned long nr_to_read,
 		unsigned long lookahead_size)
 {
-	struct inode *inode = mapping->host;
-	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
 	unsigned long i;
-	loff_t isize = i_size_read(inode);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 	bool use_list = mapping->a_ops->readpages;
 	struct readahead_control rac = {
 		.mapping = mapping,
-		.file = filp,
+		.file = file,
 		._start = index,
 		._nr_pages = 0,
 	};
 
-	if (isize == 0)
-		return;
-
-	end_index = ((isize - 1) >> PAGE_SHIFT);
-
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
 	for (i = 0; i < nr_to_read; i++) {
-		struct page *page;
-
-		if (index > end_index)
-			break;
+		struct page *page = xa_load(&mapping->i_pages, index);
 
-		page = xa_load(&mapping->i_pages, index);
 		if (page && !xa_is_value(page)) {
 			/*
 			 * Page already present?  Kick off the current batch
@@ -225,6 +223,32 @@ void __do_page_cache_readahead(struct address_space *mapping,
 		read_pages(&rac, &page_pool);
 	BUG_ON(!list_empty(&page_pool));
 }
+EXPORT_SYMBOL_GPL(page_cache_readahead_exceed);
+
+/*
+ * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
+ * the pages first, then submits them for I/O. This avoids the very bad
+ * behaviour which would occur if page allocations are causing VM writeback.
+ * We really don't want to intermingle reads and writes like that.
+ */
+void __do_page_cache_readahead(struct address_space *mapping,
+		struct file *file, pgoff_t index, unsigned long nr_to_read,
+		unsigned long lookahead_size)
+{
+	struct inode *inode = mapping->host;
+	loff_t isize = i_size_read(inode);
+	pgoff_t end_index;
+
+	if (isize == 0)
+		return;
+
+	end_index = (isize - 1) >> PAGE_SHIFT;
+	if (end_index < index + nr_to_read)
+		nr_to_read = end_index - index;
+
+	page_cache_readahead_exceed(mapping, file, index, nr_to_read,
+			lookahead_size);
+}
 
 /*
  * Chunk the readahead into 2 megabyte units, so that we don't pin too much
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 00/19] Change readahead API
  2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
                   ` (33 preceding siblings ...)
  2020-02-18  4:56 ` Dave Chinner
@ 2020-02-18 20:49 ` John Hubbard
  34 siblings, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-18 20:49 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This series adds a readahead address_space operation to eventually
> replace the readpages operation.  The key difference is that
> pages are added to the page cache as they are allocated (and
> then looked up by the filesystem) instead of passing them on a
> list to the readpages operation and having the filesystem add
> them to the page cache.  It's a net reduction in code for each
> implementation, more efficient than walking a list, and solves
> the direct-write vs buffered-read problem reported by yu kuai at
> https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yukuai3@huawei.com/
> 
> The only unconverted filesystems are those which use fscache.
> Their conversion is pending Dave Howells' rewrite which will make the
> conversion substantially easier.

Hi Matthew,

I see that Dave Chinner is reviewing this series, but I'm trying out his recent
advice about code reviews [1], and so I'm not going to read his comments first.
So you may see some duplication or contradictions this time around.


[1] Somewhere in this thread, "[LSF/MM/BPF TOPIC] FS Maintainers Don't Scale": 
https://lore.kernel.org/r/20200131052520.GC6869@magnolia


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 01/19] mm: Return void from various readahead functions
  2020-02-17 18:45 ` [PATCH v6 01/19] mm: Return void from various readahead functions Matthew Wilcox
  2020-02-18  4:47   ` Dave Chinner
@ 2020-02-18 21:05   ` John Hubbard
  2020-02-18 21:21     ` Matthew Wilcox
  1 sibling, 1 reply; 111+ messages in thread
From: John Hubbard @ 2020-02-18 21:05 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> ondemand_readahead has two callers, neither of which use the return value.
> That means that both ra_submit and __do_page_cache_readahead() can return
> void, and we don't need to worry that a present page in the readahead
> window causes us to return a smaller nr_pages than we ought to have.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/internal.h  |  8 ++++----
>  mm/readahead.c | 24 ++++++++++--------------
>  2 files changed, 14 insertions(+), 18 deletions(-)


This is an easy review and obviously correct, so:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


Thoughts for the future of the API:

I will add that I could envision another patchset that went in the
opposite direction, and attempted to preserve the information about
how many pages were successfully read ahead. And that would be nice
to have (at least IMHO), even all the way out to the syscall level,
especially for the readahead syscall.

Of course, vague opinions about how the API might be improved are less
pressing than cleaning up the code now--I'm just bringing this up because
I suspect some people will wonder, "wouldn't it be helpful if the 
syscall would tell me what happened here? Success (returning 0) doesn't
necessarily mean any pages were even read ahead." It just seems worth 
mentioning.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 11/19] btrfs: Convert from readpages to readahead
  2020-02-18  6:57   ` Dave Chinner
@ 2020-02-18 21:12     ` Matthew Wilcox
  2020-02-19  1:23       ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 21:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 05:57:58PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2020 at 10:45:59AM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > Use the new readahead operation in btrfs.  Add a
> > readahead_for_each_batch() iterator to optimise the loop in the XArray.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >  fs/btrfs/extent_io.c    | 46 +++++++++++++----------------------------
> >  fs/btrfs/extent_io.h    |  3 +--
> >  fs/btrfs/inode.c        | 16 +++++++-------
> >  include/linux/pagemap.h | 27 ++++++++++++++++++++++++
> >  4 files changed, 49 insertions(+), 43 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index c0f202741e09..e97a6acd6f5d 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -4278,52 +4278,34 @@ int extent_writepages(struct address_space *mapping,
> >  	return ret;
> >  }
> >  
> > -int extent_readpages(struct address_space *mapping, struct list_head *pages,
> > -		     unsigned nr_pages)
> > +void extent_readahead(struct readahead_control *rac)
> >  {
> >  	struct bio *bio = NULL;
> >  	unsigned long bio_flags = 0;
> >  	struct page *pagepool[16];
> >  	struct extent_map *em_cached = NULL;
> > -	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
> > -	int nr = 0;
> > +	struct extent_io_tree *tree = &BTRFS_I(rac->mapping->host)->io_tree;
> >  	u64 prev_em_start = (u64)-1;
> > +	int nr;
> >  
> > -	while (!list_empty(pages)) {
> > -		u64 contig_end = 0;
> > -
> > -		for (nr = 0; nr < ARRAY_SIZE(pagepool) && !list_empty(pages);) {
> > -			struct page *page = lru_to_page(pages);
> > -
> > -			prefetchw(&page->flags);
> > -			list_del(&page->lru);
> > -			if (add_to_page_cache_lru(page, mapping, page->index,
> > -						readahead_gfp_mask(mapping))) {
> > -				put_page(page);
> > -				break;
> > -			}
> > -
> > -			pagepool[nr++] = page;
> > -			contig_end = page_offset(page) + PAGE_SIZE - 1;
> > -		}
> > +	readahead_for_each_batch(rac, pagepool, ARRAY_SIZE(pagepool), nr) {
> > +		u64 contig_start = page_offset(pagepool[0]);
> > +		u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1;
> 
> So this assumes a contiguous page range is returned, right?

Yes.  That's documented in the readahead API and is the behaviour of
the code.  I mean, btrfs asserts it's true while most of the rest of
the kernel is indifferent to it, but it's the documented and actual
behaviour.

> >  
> > -		if (nr) {
> > -			u64 contig_start = page_offset(pagepool[0]);
> > +		ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
> 
> Ok, yes it does. :)
> 
> I don't see how readahead_for_each_batch() guarantees that, though.

I ... don't see how it doesn't?  We start at rac->_start and iterate
through the consecutive pages in the page cache.  readahead_for_each_batch()
does assume that __do_page_cache_readahead() has its current behaviour
of putting the pages in the page cache in order, and kicks off a new
call to ->readahead() every time it has to skip an index for whatever
reason (eg page already in page cache).

> > -	if (bio)
> > -		return submit_one_bio(bio, 0, bio_flags);
> > -	return 0;
> > +	if (bio) {
> > +		if (submit_one_bio(bio, 0, bio_flags))
> > +			return;
> > +	}
> >  }
> 
> Shouldn't that just be
> 
> 	if (bio)
> 		submit_one_bio(bio, 0, bio_flags);

It should, but some overzealous person decided to mark submit_one_bio()
as __must_check, so I have to work around that.

> > +static inline unsigned int readahead_page_batch(struct readahead_control *rac,
> > +		struct page **array, unsigned int size)
> > +{
> > +	unsigned int batch = 0;
> 
> Confusing when put alongside rac->_batch_count counting the number
> of pages in the batch, and "batch" being the index into the page
> array, and they aren't the same counts....

Yes.  Renamed to 'i'.

> > +	XA_STATE(xas, &rac->mapping->i_pages, rac->_start);
> > +	struct page *page;
> > +
> > +	rac->_batch_count = 0;
> > +	xas_for_each(&xas, page, rac->_start + rac->_nr_pages - 1) {
> 
> That just iterates pages in the start,end range, doesn't it? What
> guarantees that this fills the array with a contiguous page range?

The behaviour of __do_page_cache_readahead().  Dave Howells also has a
usecase for xas_for_each_contig(), so I'm going to add that soon.

> > +		VM_BUG_ON_PAGE(!PageLocked(page), page);
> > +		VM_BUG_ON_PAGE(PageTail(page), page);
> > +		array[batch++] = page;
> > +		rac->_batch_count += hpage_nr_pages(page);
> > +		if (PageHead(page))
> > +			xas_set(&xas, rac->_start + rac->_batch_count);
> 
> What on earth does this do? Comments please!

		/*
		 * The page cache isn't using multi-index entries yet,
		 * so xas_for_each() won't do the right thing for
		 * large pages.  This can be removed once the page cache
		 * is converted.
		 */
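
i.e., in context (sketch, using the 'batch' -> 'i' rename from above):

		array[i++] = page;
		rac->_batch_count += hpage_nr_pages(page);
		/*
		 * The page cache isn't using multi-index entries yet,
		 * so xas_for_each() won't do the right thing for
		 * large pages.  This can be removed once the page cache
		 * is converted.
		 */
		if (PageHead(page))
			xas_set(&xas, rac->_start + rac->_batch_count);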

> > +
> > +		if (batch == size)
> > +			break;
> > +	}
> > +
> > +	return batch;
> > +}
> 
> Seems a bit big for an inline function.

It's only called by btrfs at the moment.  If it gets more than one caller,
then sure, let's move it out of line.

> > +
> > +#define readahead_for_each_batch(rac, array, size, nr)			\
> > +	for (; (nr = readahead_page_batch(rac, array, size));		\
> > +			readahead_next(rac))
> 
> I had to go look at the caller to work out what "size" referred to
> here.
> 
> This is complex enough that it needs proper API documentation.

How about just:

-#define readahead_for_each_batch(rac, array, size, nr)                 \
-       for (; (nr = readahead_page_batch(rac, array, size));           \
+#define readahead_for_each_batch(rac, array, array_sz, nr)             \
+       for (; (nr = readahead_page_batch(rac, array, array_sz));       \

(corresponding rename in readahead_page_batch).  I mean, we could also
do:

#define readahead_for_each_batch(rac, array, nr)			\
	for (; (nr = readahead_page_batch(rac, array, ARRAY_SIZE(array))); \
			readahead_next(rac))

making it less flexible, but easier to use.
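
The btrfs loop above would then just be (sketch):

	struct page *pagepool[16];
	int nr;

	readahead_for_each_batch(rac, pagepool, nr) {
		/* pagepool[0..nr-1] is an index-contiguous batch */
		u64 contig_start = page_offset(pagepool[0]);
		u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1;

		ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
	}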

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 01/19] mm: Return void from various readahead functions
  2020-02-18 21:05   ` John Hubbard
@ 2020-02-18 21:21     ` Matthew Wilcox
  2020-02-18 21:52       ` John Hubbard
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 21:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 01:05:29PM -0800, John Hubbard wrote:
> This is an easy review and obviously correct, so:
> 
>     Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thanks

> Thoughts for the future of the API:
> 
> I will add that I could envision another patchset that went in the
> opposite direction, and attempted to preserve the information about
> how many pages were successfully read ahead. And that would be nice
> to have (at least IMHO), even all the way out to the syscall level,
> especially for the readahead syscall.

Right, and that was where I went initially.  It turns out to be a
non-trivial amount of work to do the book-keeping to find out how many
pages were _attempted_, and since we don't wait for the I/O to complete,
we don't know how many _succeeded_, and we also don't know how many
weren't attempted because they were already there, and how many weren't
attempted because somebody else has raced with us and is going to attempt
them themselves, and how many weren't attempted because we just ran out
of memory, and decided to give up.

Also, we don't know how many pages were successfully read, and then the
system decided to evict before the program found out how many were read,
let alone before it did any action based on that.

So, given all that complexity, and the fact that nobody actually does
anything with the limited and incorrect information we tried to provide
today, I think it's fair to say that anybody who wants to start to do
anything with that information can delve into all the complexity around
"what number should we return, and what does it really mean".  In the
meantime, let's just ditch the complexity and pretense that this number
means anything.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 00/19] Change readahead API
  2020-02-18 13:42   ` Matthew Wilcox
@ 2020-02-18 21:26     ` Dave Chinner
  2020-02-19  3:45       ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18 21:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > Latest version in your git tree:
> > 
> > $ ▶ glo -n 5 willy/readahead
> > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > ff63497fcb98 iomap: Convert from readpages to readahead
> > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > 8115bcca7312 fuse: Convert from readpages to readahead
> > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > $
> > 
> > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > 
> > [    2.423116] ------------[ cut here ]------------
> > [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> > [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > [    2.457484] Call Trace:
> > [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> > [    2.462787]  lru_add_drain+0x18/0x20
> 
> Are you sure that was 4be497096c04 ?  I ask because there was a

Yes, because it's the only version I've actually merged into my
working tree, compiled and tried to run. :P

> version pushed to that git tree that did contain a list double-add
> (due to a mismerge when shuffling patches).  I noticed it and fixed
> it, and 4be497096c04 doesn't have that problem.  I also test with
> CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> probabilistic because it'll depend on the timing between whatever other
> list is being used and the page actually being added to the LRU.

I'll see if I can reproduce it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 02/19] mm: Ignore return value of ->readpages
  2020-02-17 18:45 ` [PATCH v6 02/19] mm: Ignore return value of ->readpages Matthew Wilcox
  2020-02-18  4:48   ` Dave Chinner
@ 2020-02-18 21:33   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-18 21:33 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs,
	Christoph Hellwig, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> We used to assign the return value to a variable, which we then ignored.
> Remove the pretence of caring.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  mm/readahead.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)

Looks good,

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 01/19] mm: Return void from various readahead functions
  2020-02-18 21:21     ` Matthew Wilcox
@ 2020-02-18 21:52       ` John Hubbard
  0 siblings, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-18 21:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On 2/18/20 1:21 PM, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 01:05:29PM -0800, John Hubbard wrote:
>> This is an easy review and obviously correct, so:
>>
>>     Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> Thanks
> 
>> Thoughts for the future of the API:
>>
>> I will add that I could envision another patchset that went in the
>> opposite direction, and attempted to preserve the information about
>> how many pages were successfully read ahead. And that would be nice
>> to have (at least IMHO), even all the way out to the syscall level,
>> especially for the readahead syscall.
> 
> Right, and that was where I went initially.  It turns out to be a
> non-trivial aount of work to do the book-keeping to find out how many
> pages were _attempted_, and since we don't wait for the I/O to complete,
> we don't know how many _succeeded_, and we also don't know how many
> weren't attempted because they were already there, and how many weren't
> attempted because somebody else has raced with us and is going to attempt
> them themselves, and how many weren't attempted because we just ran out
> of memory, and decided to give up.
> 
> Also, we don't know how many pages were successfully read, and then the
> system decided to evict before the program found out how many were read,
> let alone before it did any action based on that.
> 


That is even worse than I initially thought. :)


> So, given all that complexity, and the fact that nobody actually does
> anything with the limited and incorrect information we tried to provide
> today, I think it's fair to say that anybody who wants to start to do
> anything with that information can delve into all the complexity around
> "what number should we return, and what does it really mean".  In the


Yes, and now that you mention it, it's really tough to pick a single number
that answers the right questions that the user space caller might have. whew.


> meantime, let's just ditch the complexity and pretense that this number
> means anything.
> 

Definitely. Thanks for the notes here.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 03/19] mm: Use readahead_control to pass arguments
  2020-02-17 18:45 ` [PATCH v6 03/19] mm: Use readahead_control to pass arguments Matthew Wilcox
  2020-02-18  5:03   ` Dave Chinner
@ 2020-02-18 22:22   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-18 22:22 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> In this patch, only between __do_page_cache_readahead() and
> read_pages(), but it will be extended in upcoming patches.  Also add
> the readahead_count() accessor.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/pagemap.h | 17 +++++++++++++++++
>  mm/readahead.c          | 36 +++++++++++++++++++++---------------
>  2 files changed, 38 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..982ecda2d4a2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -630,6 +630,23 @@ static inline int add_to_page_cache(struct page *page,
>  	return error;
>  }
>  
> +/*
> + * Readahead is of a block of consecutive pages.
> + */
> +struct readahead_control {
> +	struct file *file;
> +	struct address_space *mapping;
> +/* private: use the readahead_* accessors instead */


Really a minor point, sorry...what about documenting "input", "output", 
"input/output" instead? I ask because:

a) public and private seems sort of meaningless here: even in this initial
   patch, the code starts off by setting .file, .mapping, and .nr_pages.

b) The part that's confusing, and that might benefit from either documentation
   or naming changes, is the way _nr_pages is used. Is it "number of pages
   requested to read ahead", or "number of pages just read", or "number of
   pages remaining to be read"? I've had trouble keeping it straight because
   I recall it being used differently at different points.
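
Something like this annotated version is what I have in mind (the
in/out tags are only my guess at the semantics from reading this patch):

struct readahead_control {
	struct file *file;		/* in: may be NULL */
	struct address_space *mapping;	/* in: mapping to populate */
/* private: use the readahead_* accessors instead */
	pgoff_t _start;			/* in/out: first index of the batch */
	unsigned int _nr_pages;		/* in/out: pages in this batch */
};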


> +	pgoff_t _start;
> +	unsigned int _nr_pages;
> +};
> +
> +/* The number of pages in this readahead block */
> +static inline unsigned int readahead_count(struct readahead_control *rac)
> +{
> +	return rac->_nr_pages;
> +}


I took a peek at the generated code, and was reassured to see that this really
does work even in the "for" loops. Once in a while I like to get my faith in
the compiler renewed. :)

> +
>  static inline unsigned long dir_pages(struct inode *inode)
>  {
>  	return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 12d13b7792da..15329309231f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -113,26 +113,29 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
>  
>  EXPORT_SYMBOL(read_cache_pages);
>  
> -static void read_pages(struct address_space *mapping, struct file *filp,
> -		struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
> +static void read_pages(struct readahead_control *rac, struct list_head *pages,
> +		gfp_t gfp)
>  {
> +	const struct address_space_operations *aops = rac->mapping->a_ops;
>  	struct blk_plug plug;
>  	unsigned page_idx;
>  
>  	blk_start_plug(&plug);
>  
> -	if (mapping->a_ops->readpages) {
> -		mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
> +	if (aops->readpages) {
> +		aops->readpages(rac->file, rac->mapping, pages,
> +				readahead_count(rac));
>  		/* Clean up the remaining pages */
>  		put_pages_list(pages);
>  		goto out;
>  	}
>  
> -	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> +	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
>  		struct page *page = lru_to_page(pages);
>  		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> -			mapping->a_ops->readpage(filp, page);
> +		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
> +				gfp))
> +			aops->readpage(rac->file, page);
>  		put_page(page);
>  	}
>  
> @@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
>  	int page_idx;
> -	unsigned int nr_pages = 0;
>  	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> +	struct readahead_control rac = {
> +		.mapping = mapping,
> +		.file = filp,
> +		._nr_pages = 0,
> +	};
>  
>  	if (isize == 0)
>  		return;
> @@ -180,10 +187,9 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  			 * contiguous pages before continuing with the next
>  			 * batch.
>  			 */
> -			if (nr_pages)
> -				read_pages(mapping, filp, &page_pool, nr_pages,
> -						gfp_mask);
> -			nr_pages = 0;
> +			if (readahead_count(&rac))
> +				read_pages(&rac, &page_pool, gfp_mask);
> +			rac._nr_pages = 0;
>  			continue;
>  		}
>  
> @@ -194,7 +200,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		list_add(&page->lru, &page_pool);
>  		if (page_idx == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
> -		nr_pages++;
> +		rac._nr_pages++;
>  	}
>  
>  	/*
> @@ -202,8 +208,8 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	 * uptodate then the caller will launch readpage again, and
>  	 * will then handle the error.
>  	 */
> -	if (nr_pages)
> -		read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);
> +	if (readahead_count(&rac))
> +		read_pages(&rac, &page_pool, gfp_mask);
>  	BUG_ON(!list_empty(&page_pool));
>  }
>  
> 

In any case, this patch faithfully preserves the existing logic, so regardless of any
documentation decisions, 

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 04/19] mm: Rearrange readahead loop
  2020-02-17 18:45 ` [PATCH v6 04/19] mm: Rearrange readahead loop Matthew Wilcox
  2020-02-18  5:08   ` Dave Chinner
@ 2020-02-18 22:33   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-18 22:33 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Move the declaration of 'page' to inside the loop and move the 'kick
> off a fresh batch' code to the end of the function for easier use in
> subsequent patches.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/readahead.c | 21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 15329309231f..3eca59c43a45 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -154,7 +154,6 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		unsigned long lookahead_size)
>  {
>  	struct inode *inode = mapping->host;
> -	struct page *page;
>  	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
>  	int page_idx;
> @@ -175,6 +174,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	 * Preallocate as many pages as we will need.
>  	 */
>  	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> +		struct page *page;
>  		pgoff_t page_offset = offset + page_idx;
>  
>  		if (page_offset > end_index)
> @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		page = xa_load(&mapping->i_pages, page_offset);
>  		if (page && !xa_is_value(page)) {
>  			/*
> -			 * Page already present?  Kick off the current batch of
> -			 * contiguous pages before continuing with the next
> -			 * batch.
> +			 * Page already present?  Kick off the current batch
> +			 * of contiguous pages before continuing with the
> +			 * next batch.  This page may be the one we would
> +			 * have intended to mark as Readahead, but we don't
> +			 * have a stable reference to this page, and it's
> +			 * not worth getting one just for that.
>  			 */
> -			if (readahead_count(&rac))
> -				read_pages(&rac, &page_pool, gfp_mask);
> -			rac._nr_pages = 0;
> -			continue;


A fine point:  you'll get better readability and a less complex function by
factoring that into a static subroutine, instead of jumping around with
gotos. (This clearly wants to be a subroutine, and in fact you've effectively
created one inside this function, at the "read:" label.) Either way, though...
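
Something along these lines, say (just a sketch - the read_batch() name
is made up here):

static void read_batch(struct readahead_control *rac,
		       struct list_head *pool, gfp_t gfp_mask)
{
	if (readahead_count(rac))
		read_pages(rac, pool, gfp_mask);
	rac->_nr_pages = 0;
}

Both the "page already present" branch and the tail of the function
could then call read_batch() instead of sharing a label.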


> +			goto read;
>  		}
>  
>  		page = __page_cache_alloc(gfp_mask);
> @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		if (page_idx == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
>  		rac._nr_pages++;
> +		continue;
> +read:
> +		if (readahead_count(&rac))
> +			read_pages(&rac, &page_pool, gfp_mask);
> +		rac._nr_pages = 0;
>  	}
>  
>  	/*
> 

...no errors spotted, I'm confident that this patch is correct,

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 03/19] mm: Use readahead_control to pass arguments
  2020-02-18 13:56     ` Matthew Wilcox
@ 2020-02-18 22:46       ` Dave Chinner
  2020-02-18 22:52         ` Matthew Wilcox
  0 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-18 22:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	Christoph Hellwig, linux-btrfs

On Tue, Feb 18, 2020 at 05:56:18AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote:
> > > +static void read_pages(struct readahead_control *rac, struct list_head *pages,
> > > +		gfp_t gfp)
> > >  {
> > > +	const struct address_space_operations *aops = rac->mapping->a_ops;
> > >  	struct blk_plug plug;
> > >  	unsigned page_idx;
> > 
> > Splitting out the aops rather than the mapping here just looks
> > weird, especially as you need the mapping later in the function.
> > Using aops doesn't even reduce the code size....
> 
> It does in subsequent patches ... I agree it looks a little weird here,
> but I think in the final form, it makes sense:

Ok. Perhaps just an additional commit comment to say "read_pages() is
changed to be aops centric as @rac abstracts away all other
implementation details by the end of the patchset."

> > > +			if (readahead_count(&rac))
> > > +				read_pages(&rac, &page_pool, gfp_mask);
> > > +			rac._nr_pages = 0;
> > 
> > Hmmm. Wondering if it makes sense to move the gfp_mask to the readahead
> > control structure - if we have to pass the gfp_mask down all the
> > way alongside the rac, then I think it makes sense to do that...
> 
> So we end up removing it later on in this series, but I do wonder if
> it would make sense anyway.  By the end of the series, we still have
> this in iomap:
> 
>                 if (ctx->rac) /* same as readahead_gfp_mask */
>                         gfp |= __GFP_NORETRY | __GFP_NOWARN;
> 
> and we could get rid of that by passing gfp flags down in the rac.  On the
> other hand, I don't know why it doesn't just use readahead_gfp_mask()
> here anyway ... Christoph?

mapping->gfp_mask is awful. Is it a mask, or is it a valid set of
allocation flags? Or both?  Some callers to mapping_gfp_constraint()
use it as a mask, some callers to mapping_gfp_constraint() use it
as base flags that context specific flags get masked out of,
readahead_gfp_mask() callers use it as the entire set of gfp flags
for allocation.

That whole API sucks - undocumented as to what it's supposed to do
and how it's supposed to be used. Hence it's difficult to use
correctly or understand whether it's being used correctly. And
reading callers only leads to more confusion and crazy code like in
do_mpage_readpage() where readahead returns a mask that is used as
base flags and normal reads return a masked set of base flags...

The iomap code is obviously correct when it comes to gfp flag
manipulation. We start with GFP_KERNEL context, then constrain it
via the mask held in mapping->gfp_mask, then if it's readahead we
allow the allocation to silently fail.

Simple to read and understand code, versus having weird code that
requires the reader to decipher an undocumented and inconsistent API
to understand how the gfp flags have been calculated and are valid.
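
To be concrete, that iomap pattern boils down to something like this
(paraphrasing the snippet quoted above, not the verbatim source):

	gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);

	if (ctx->rac)	/* readahead, so allow silent failure */
		gfp |= __GFP_NORETRY | __GFP_NOWARN;

GFP_KERNEL as the base context, mapping->gfp_mask applied strictly as
a constraint, and the readahead-only relaxation made explicit at the
end.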

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 04/19] mm: Rearrange readahead loop
  2020-02-18 13:57     ` Matthew Wilcox
@ 2020-02-18 22:48       ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-18 22:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 05:57:36AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:08:24PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > Move the declaration of 'page' to inside the loop and move the 'kick
> > > off a fresh batch' code to the end of the function for easier use in
> > > subsequent patches.
> > 
> > Stale? the "kick off" code is moved to the tail of the loop, not the
> > end of the function.
> 
> Braino; I meant to write end of the loop.
> 
> > > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
> > >  		page = xa_load(&mapping->i_pages, page_offset);
> > >  		if (page && !xa_is_value(page)) {
> > >  			/*
> > > -			 * Page already present?  Kick off the current batch of
> > > -			 * contiguous pages before continuing with the next
> > > -			 * batch.
> > > +			 * Page already present?  Kick off the current batch
> > > +			 * of contiguous pages before continuing with the
> > > +			 * next batch.  This page may be the one we would
> > > +			 * have intended to mark as Readahead, but we don't
> > > +			 * have a stable reference to this page, and it's
> > > +			 * not worth getting one just for that.
> > >  			 */
> > > -			if (readahead_count(&rac))
> > > -				read_pages(&rac, &page_pool, gfp_mask);
> > > -			rac._nr_pages = 0;
> > > -			continue;
> > > +			goto read;
> > >  		}
> > >  
> > >  		page = __page_cache_alloc(gfp_mask);
> > > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
> > >  		if (page_idx == nr_to_read - lookahead_size)
> > >  			SetPageReadahead(page);
> > >  		rac._nr_pages++;
> > > +		continue;
> > > +read:
> > > +		if (readahead_count(&rac))
> > > +			read_pages(&rac, &page_pool, gfp_mask);
> > > +		rac._nr_pages = 0;
> > >  	}
> > 
> > Also, why? This adds a goto from branched code that continues, then
> > adds a continue so the unbranched code doesn't execute the code the
> > goto jumps to. In absence of any explanation, this isn't an
> > improvement and doesn't make any sense...
> 
> I thought I was explaining it ... "for easier use in subsequent patches".

Sorry, my braino there. :) I commented on the problem with the first
part of the sentence, then the rest of the sentence completely
failed to sink in.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 03/19] mm: Use readahead_control to pass arguments
  2020-02-18 22:46       ` Dave Chinner
@ 2020-02-18 22:52         ` Matthew Wilcox
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-18 22:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	Christoph Hellwig, linux-btrfs

On Wed, Feb 19, 2020 at 09:46:10AM +1100, Dave Chinner wrote:
> On Tue, Feb 18, 2020 at 05:56:18AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote:
> > > On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote:
> > > > +static void read_pages(struct readahead_control *rac, struct list_head *pages,
> > > > +		gfp_t gfp)
> > > >  {
> > > > +	const struct address_space_operations *aops = rac->mapping->a_ops;
> > > >  	struct blk_plug plug;
> > > >  	unsigned page_idx;
> > > 
> > > Splitting out the aops rather than the mapping here just looks
> > > weird, especially as you need the mapping later in the function.
> > > Using aops doesn't even reduce the code size....
> > 
> > It does in subsequent patches ... I agree it looks a little weird here,
> > but I think in the final form, it makes sense:
> 
> Ok. Perhaps just an additional commit comment to say "read_pages() is
> changed to be aops centric as @rac abstracts away all other
> implementation details by the end of the patchset."

ACK, will add.

> > > > +			if (readahead_count(&rac))
> > > > +				read_pages(&rac, &page_pool, gfp_mask);
> > > > +			rac._nr_pages = 0;
> > > 
> > > Hmmm. Wondering if it makes sense to move the gfp_mask to the readahead
> > > control structure - if we have to pass the gfp_mask down all the
> > > way alongside the rac, then I think it makes sense to do that...
> > 
> > So we end up removing it later on in this series, but I do wonder if
> > it would make sense anyway.  By the end of the series, we still have
> > this in iomap:
> > 
> >                 if (ctx->rac) /* same as readahead_gfp_mask */
> >                         gfp |= __GFP_NORETRY | __GFP_NOWARN;
> > 
> > and we could get rid of that by passing gfp flags down in the rac.  On the
> > other hand, I don't know why it doesn't just use readahead_gfp_mask()
> > here anyway ... Christoph?
> 
> mapping->gfp_mask is awful. Is it a mask, or is it a valid set of
> allocation flags? Or both?  Some callers to mapping_gfp_constraint()
> use it as a mask, some callers to mapping_gfp_constraint() use it
> as base flags that context specific flags get masked out of,
> readahead_gfp_mask() callers use it as the entire set of gfp flags
> for allocation.
> 
> That whole API sucks - undocumented as to what it's supposed to do
> and how it's supposed to be used. Hence it's difficult to use
> correctly or understand whether it's being used correctly. And
> reading callers only leads to more confusion and crazy code like in
> do_mpage_readpage() where readahead returns a mask that is used as
> base flags and normal reads return a masked set of base flags...
> 
> The iomap code is obviously correct when it comes to gfp flag
> manipulation. We start with GFP_KERNEL context, then constrain it
> via the mask held in mapping->gfp_mask, then if it's readahead we
> allow the allocation to silently fail.
> 
> Simple to read and understand code, versus having weird code that
> requires the reader to decipher an undocumented and inconsistent API
> to understand how the gfp flags have been calculated and are valid.

I think a lot of this is not so much a criticism of mapping->gfp_mask
as a criticism of the whole GFP flags concept.  Some of the flags make
allocations more likely to succeed, others make them more likely to
fail.  Some of them allow the allocator to do more things; some prevent
the allocator from doing things it would otherwise do.  Some of them
aren't flags at all.  Some of them are mutually incompatible (and will
be warned about if set in combination), some of them will silently win
over other flags.

I think they made a certain amount of clunky sense when they were added,
but they've grown to a point where they don't make sense any more and
partly that's because there's nobody standing over the allocator with
a flaming sword promising certain death to anyone who adds a new flag
without thoroughly documenting its interactions with every other flag.

I am no longer a fan of GFP flags ;-)

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 04/16] mm: Tweak readahead loop slightly
  2020-02-17 18:45 ` [PATCH v6 04/16] mm: Tweak readahead loop slightly Matthew Wilcox
@ 2020-02-18 22:57   ` John Hubbard
  2020-02-18 23:00     ` John Hubbard
  0 siblings, 1 reply; 111+ messages in thread
From: John Hubbard @ 2020-02-18 22:57 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Eliminate the page_offset variable which was just confusing;
> record the start of each consecutive run of pages in the


OK...presumably for the benefit of a following patch, since it is not 
actually consumed in this patch.

> readahead_control, and move the 'kick off a fresh batch' code to
> the end of the function for easier use in the next patch.


That last bit was actually done in the previous patch, rather than this
one, right?

> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/readahead.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 15329309231f..74791b96013f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -154,7 +154,6 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		unsigned long lookahead_size)
>  {
>  	struct inode *inode = mapping->host;
> -	struct page *page;
>  	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
>  	int page_idx;
> @@ -163,6 +162,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	struct readahead_control rac = {
>  		.mapping = mapping,
>  		.file = filp,
> +		._start = offset,
>  		._nr_pages = 0,
>  	};
>  
> @@ -175,32 +175,39 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	 * Preallocate as many pages as we will need.
>  	 */
>  	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> -		pgoff_t page_offset = offset + page_idx;


You know...this ends up incrementing offset each time through the
loop, so yes, the behavior is the same as when using "offset + page_idx".
However, now it's a little harder to see that.

IMHO the page_offset variable is not actually a bad thing, here. I'd rather
keep it, all other things being equal (and I don't see any other benefits
here: line count is the same, for example).

What do you think?


thanks,
-- 
John Hubbard
NVIDIA

> +		struct page *page;
>  
> -		if (page_offset > end_index)
> +		if (offset > end_index)
>  			break;
>  
> -		page = xa_load(&mapping->i_pages, page_offset);
> +		page = xa_load(&mapping->i_pages, offset);
>  		if (page && !xa_is_value(page)) {
>  			/*
> -			 * Page already present?  Kick off the current batch of
> -			 * contiguous pages before continuing with the next
> -			 * batch.
> +			 * Page already present?  Kick off the current batch
> +			 * of contiguous pages before continuing with the
> +			 * next batch.  This page may be the one we would
> +			 * have intended to mark as Readahead, but we don't
> +			 * have a stable reference to this page, and it's
> +			 * not worth getting one just for that.
>  			 */
> -			if (readahead_count(&rac))
> -				read_pages(&rac, &page_pool, gfp_mask);
> -			rac._nr_pages = 0;
> -			continue;
> +			goto read;
>  		}
>  
>  		page = __page_cache_alloc(gfp_mask);
>  		if (!page)
>  			break;
> -		page->index = page_offset;
> +		page->index = offset;
>  		list_add(&page->lru, &page_pool);
>  		if (page_idx == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
>  		rac._nr_pages++;
> +		offset++;
> +		continue;
> +read:
> +		if (readahead_count(&rac))
> +			read_pages(&rac, &page_pool, gfp_mask);
> +		rac._nr_pages = 0;
> +		rac._start = ++offset;
>  	}
>  
>  	/*
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 04/16] mm: Tweak readahead loop slightly
  2020-02-18 22:57   ` John Hubbard
@ 2020-02-18 23:00     ` John Hubbard
  0 siblings, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-18 23:00 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/18/20 2:57 PM, John Hubbard wrote:
> On 2/17/20 10:45 AM, Matthew Wilcox wrote:
>> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>>
>> Eliminate the page_offset variable which was just confusing;
>> record the start of each consecutive run of pages in the
> 
> 

Darn it, I incorrectly reviewed the N/16 patch, instead of the N/19, for 
this one. I thought I had deleted all those! Let me try again with the
correct patch, sorry!!

thanks,
-- 
John Hubbard
NVIDIA

> OK...presumably for the benefit of a following patch, since it is not 
> actually consumed in this patch.
> 
>> readahead_control, and move the 'kick off a fresh batch' code to
>> the end of the function for easier use in the next patch.
> 
> 
> That last bit was actually done in the previous patch, rather than this
> one, right?
> 
>>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>  mm/readahead.c | 31 +++++++++++++++++++------------
>>  1 file changed, 19 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/readahead.c b/mm/readahead.c
>> index 15329309231f..74791b96013f 100644
>> --- a/mm/readahead.c
>> +++ b/mm/readahead.c
>> @@ -154,7 +154,6 @@ void __do_page_cache_readahead(struct address_space *mapping,
>>  		unsigned long lookahead_size)
>>  {
>>  	struct inode *inode = mapping->host;
>> -	struct page *page;
>>  	unsigned long end_index;	/* The last page we want to read */
>>  	LIST_HEAD(page_pool);
>>  	int page_idx;
>> @@ -163,6 +162,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>>  	struct readahead_control rac = {
>>  		.mapping = mapping,
>>  		.file = filp,
>> +		._start = offset,
>>  		._nr_pages = 0,
>>  	};
>>  
>> @@ -175,32 +175,39 @@ void __do_page_cache_readahead(struct address_space *mapping,
>>  	 * Preallocate as many pages as we will need.
>>  	 */
>>  	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>> -		pgoff_t page_offset = offset + page_idx;
> 
> 
> You know...this ends up incrementing offset each time through the
> loop, so yes, the behavior is the same as when using "offset + page_idx".
> However, now it's a little harder to see that.
> 
> IMHO the page_offset variable is not actually a bad thing, here. I'd rather
> keep it, all other things being equal (and I don't see any other benefits
> here: line count is the same, for example).
> 
> What do you think?
> 
> 
> thanks,
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop
  2020-02-17 18:45 ` [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop Matthew Wilcox
  2020-02-18  5:14   ` Dave Chinner
@ 2020-02-18 23:08   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-18 23:08 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Eliminate the page_offset variable which was confusing with the
> 'offset' parameter and record the start of each consecutive run of
> pages in the readahead_control.


...presumably for the benefit of a subsequent patch, since it's not
consumed in this patch.

Thanks for breaking these up, btw, it really helps.


> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/readahead.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 3eca59c43a45..74791b96013f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -162,6 +162,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	struct readahead_control rac = {
>  		.mapping = mapping,
>  		.file = filp,
> +		._start = offset,
>  		._nr_pages = 0,
>  	};
>  
> @@ -175,12 +176,11 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	 */
>  	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
>  		struct page *page;
> -		pgoff_t page_offset = offset + page_idx;


OK, this is still something I want to mention (I wrote the same thing when reviewing 
the wrong version of this patch, a moment ago).

You know...this ends up incrementing offset each time through the
loop, so yes, the behavior is the same as when using "offset + page_idx".
However, now it's a little harder to see that.

IMHO the page_offset variable is not actually a bad thing, here. I'd rather
keep it, all other things being equal (and I don't see any other benefits
here: line count is about the same, for example).

What do you think? (I don't feel strongly about this fine point.)


thanks,
-- 
John Hubbard
NVIDIA


>  
> -		if (page_offset > end_index)
> +		if (offset > end_index)
>  			break;
>  
> -		page = xa_load(&mapping->i_pages, page_offset);
> +		page = xa_load(&mapping->i_pages, offset);
>  		if (page && !xa_is_value(page)) {
>  			/*
>  			 * Page already present?  Kick off the current batch
> @@ -196,16 +196,18 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		page = __page_cache_alloc(gfp_mask);
>  		if (!page)
>  			break;
> -		page->index = page_offset;
> +		page->index = offset;
>  		list_add(&page->lru, &page_pool);
>  		if (page_idx == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
>  		rac._nr_pages++;
> +		offset++;
>  		continue;
>  read:
>  		if (readahead_count(&rac))
>  			read_pages(&rac, &page_pool, gfp_mask);
>  		rac._nr_pages = 0;
> +		rac._start = ++offset;
>  	}
>  
>  	/*
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 06/19] mm: rename readahead loop variable to 'i'
  2020-02-17 18:45 ` [PATCH v6 06/19] mm: rename readahead loop variable to 'i' Matthew Wilcox
  2020-02-18  5:33   ` Dave Chinner
@ 2020-02-18 23:11   ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-18 23:11 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Change the type of page_idx to unsigned long, and rename it -- it's
> just a loop counter, not a page index.
> 
> Suggested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/readahead.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 

Looks good,

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/mm/readahead.c b/mm/readahead.c
> index 74791b96013f..bdc5759000d3 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -156,7 +156,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	struct inode *inode = mapping->host;
>  	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
> -	int page_idx;
> +	unsigned long i;
>  	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
>  	struct readahead_control rac = {
> @@ -174,7 +174,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */
> -	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
> +	for (i = 0; i < nr_to_read; i++) {
>  		struct page *page;
>  
>  		if (offset > end_index)
> @@ -198,7 +198,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  			break;
>  		page->index = offset;
>  		list_add(&page->lru, &page_pool);
> -		if (page_idx == nr_to_read - lookahead_size)
> +		if (i == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
>  		rac._nr_pages++;
>  		offset++;
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-17 18:45 ` [PATCH v6 07/19] mm: Put readahead pages in cache earlier Matthew Wilcox
  2020-02-18  6:14   ` Dave Chinner
@ 2020-02-19  0:01   ` John Hubbard
  2020-02-19  1:02     ` Matthew Wilcox
  2020-02-19 14:41     ` Matthew Wilcox
  1 sibling, 2 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-19  0:01 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> At allocation time, put the pages in the cache unless we're using
> ->readpages.  Add the readahead_for_each() iterator for the benefit of
> the ->readpage fallback.  This iterator supports huge pages, even though
> none of the filesystems to be converted do yet.
> 


"Also, remove the gfp argument from read_pages(), now that read_pages()
no longer does allocation."

Generally looks accurate, just a few notes below:


> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/pagemap.h | 24 ++++++++++++++++++++++++
>  mm/readahead.c          | 34 +++++++++++++++++-----------------
>  2 files changed, 41 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 982ecda2d4a2..3613154e79e4 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -639,8 +639,32 @@ struct readahead_control {
>  /* private: use the readahead_* accessors instead */
>  	pgoff_t _start;
>  	unsigned int _nr_pages;
> +	unsigned int _batch_count;
>  };
>  
> +static inline struct page *readahead_page(struct readahead_control *rac)
> +{
> +	struct page *page;
> +
> +	if (!rac->_nr_pages)
> +		return NULL;
> +
> +	page = xa_load(&rac->mapping->i_pages, rac->_start);
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	rac->_batch_count = hpage_nr_pages(page);
> +
> +	return page;
> +}
> +
> +static inline void readahead_next(struct readahead_control *rac)
> +{
> +	rac->_nr_pages -= rac->_batch_count;
> +	rac->_start += rac->_batch_count;
> +}
> +
> +#define readahead_for_each(rac, page)					\
> +	for (; (page = readahead_page(rac)); readahead_next(rac))
> +


How about this instead? It uses the "for" loop fully and more naturally,
and is easier to read. And it does the same thing:

static inline struct page *readahead_page(struct readahead_control *rac)
{
	struct page *page;

	if (!rac->_nr_pages)
		return NULL;

	page = xa_load(&rac->mapping->i_pages, rac->_start);
	VM_BUG_ON_PAGE(!PageLocked(page), page);
	rac->_batch_count = hpage_nr_pages(page);

	return page;
}

static inline struct page *readahead_next(struct readahead_control *rac)
{
	rac->_nr_pages -= rac->_batch_count;
	rac->_start += rac->_batch_count;

	return readahead_page(rac);
}

#define readahead_for_each(rac, page)			\
	for (page = readahead_page(rac); page != NULL;	\
	     page = readahead_page(rac))




>  /* The number of pages in this readahead block */
>  static inline unsigned int readahead_count(struct readahead_control *rac)
>  {
> diff --git a/mm/readahead.c b/mm/readahead.c
> index bdc5759000d3..9e430daae42f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -113,12 +113,11 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
>  
>  EXPORT_SYMBOL(read_cache_pages);
>  
> -static void read_pages(struct readahead_control *rac, struct list_head *pages,
> -		gfp_t gfp)
> +static void read_pages(struct readahead_control *rac, struct list_head *pages)
>  {
>  	const struct address_space_operations *aops = rac->mapping->a_ops;
> +	struct page *page;
>  	struct blk_plug plug;
> -	unsigned page_idx;
>  
>  	blk_start_plug(&plug);
>  
> @@ -127,19 +126,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages,
>  				readahead_count(rac));
>  		/* Clean up the remaining pages */
>  		put_pages_list(pages);
> -		goto out;
> -	}
> -
> -	for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
> -		struct page *page = lru_to_page(pages);
> -		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, rac->mapping, page->index,
> -				gfp))
> +	} else {
> +		readahead_for_each(rac, page) {
>  			aops->readpage(rac->file, page);
> -		put_page(page);
> +			put_page(page);
> +		}
>  	}
>  
> -out:
>  	blk_finish_plug(&plug);
>  }
>  
> @@ -159,6 +152,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	unsigned long i;
>  	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> +	bool use_list = mapping->a_ops->readpages;


fwiw, "bool have_readpages" seems like a better name (after all, that's how read_pages() 
effectively is written: "if you have .readpages, then..."), but I can see both sides 
of that bikeshed. :)


>  	struct readahead_control rac = {
>  		.mapping = mapping,
>  		.file = filp,
> @@ -196,8 +190,14 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		page = __page_cache_alloc(gfp_mask);
>  		if (!page)
>  			break;
> -		page->index = offset;
> -		list_add(&page->lru, &page_pool);
> +		if (use_list) {
> +			page->index = offset;
> +			list_add(&page->lru, &page_pool);
> +		} else if (add_to_page_cache_lru(page, mapping, offset,
> +					gfp_mask) < 0) {


It would be a little safer, from a maintenance point of view, to check for != 0
rather than for < 0.  Most (all?) existing callers check that way, and it's good
to stay with the pack there.
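
i.e. something like this - same logic, only the comparison changes:

		} else if (add_to_page_cache_lru(page, mapping, offset,
					gfp_mask) != 0) {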


> +			put_page(page);
> +			goto read;
> +		}
>  		if (i == nr_to_read - lookahead_size)
>  			SetPageReadahead(page);
>  		rac._nr_pages++;
> @@ -205,7 +205,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		continue;
>  read:
>  		if (readahead_count(&rac))
> -			read_pages(&rac, &page_pool, gfp_mask);
> +			read_pages(&rac, &page_pool);
>  		rac._nr_pages = 0;
>  		rac._start = ++offset;
>  	}
> @@ -216,7 +216,7 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  	 * will then handle the error.
>  	 */
>  	if (readahead_count(&rac))
> -		read_pages(&rac, &page_pool, gfp_mask);
> +		read_pages(&rac, &page_pool);
>  	BUG_ON(!list_empty(&page_pool));
>  }
>  
> 


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 08/19] mm: Add readahead address space operation
  2020-02-17 18:45 ` [PATCH v6 08/19] mm: Add readahead address space operation Matthew Wilcox
  2020-02-18  6:21   ` Dave Chinner
@ 2020-02-19  0:12   ` John Hubbard
  2020-02-19  3:10   ` Eric Biggers
  2 siblings, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-19  0:12 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> This replaces ->readpages with a saner interface:
>  - Return void instead of an ignored error code.
>  - Pages are already in the page cache when ->readahead is called.
>  - Implementation looks up the pages in the page cache instead of
>    having them passed in a linked list.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  Documentation/filesystems/locking.rst |  6 +++++-
>  Documentation/filesystems/vfs.rst     | 13 +++++++++++++
>  include/linux/fs.h                    |  2 ++
>  include/linux/pagemap.h               | 18 ++++++++++++++++++
>  mm/readahead.c                        |  8 +++++++-
>  5 files changed, 45 insertions(+), 2 deletions(-)
> 

Looks nice,

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 5057e4d9dcd1..0ebc4491025a 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -239,6 +239,7 @@ prototypes::
>  	int (*readpage)(struct file *, struct page *);
>  	int (*writepages)(struct address_space *, struct writeback_control *);
>  	int (*set_page_dirty)(struct page *page);
> +	void (*readahead)(struct readahead_control *);
>  	int (*readpages)(struct file *filp, struct address_space *mapping,
>  			struct list_head *pages, unsigned nr_pages);
>  	int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -271,7 +272,8 @@ writepage:		yes, unlocks (see below)
>  readpage:		yes, unlocks
>  writepages:
>  set_page_dirty		no
> -readpages:
> +readahead:		yes, unlocks
> +readpages:		no
>  write_begin:		locks the page		 exclusive
>  write_end:		yes, unlocks		 exclusive
>  bmap:
> @@ -295,6 +297,8 @@ the request handler (/dev/loop).
>  ->readpage() unlocks the page, either synchronously or via I/O
>  completion.
>  
> +->readahead() unlocks the pages like ->readpage().
> +
>  ->readpages() populates the pagecache with the passed pages and starts
>  I/O against them.  They come unlocked upon I/O completion.
>  
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..81ab30fbe45c 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are defined:
>  		int (*readpage)(struct file *, struct page *);
>  		int (*writepages)(struct address_space *, struct writeback_control *);
>  		int (*set_page_dirty)(struct page *page);
> +		void (*readahead)(struct readahead_control *);
>  		int (*readpages)(struct file *filp, struct address_space *mapping,
>  				 struct list_head *pages, unsigned nr_pages);
>  		int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are defined:
>  	If defined, it should set the PageDirty flag, and the
>  	PAGECACHE_TAG_DIRTY tag in the radix tree.
>  
> +``readahead``
> +	Called by the VM to read pages associated with the address_space
> +	object.  The pages are consecutive in the page cache and are
> +	locked.  The implementation should decrement the page refcount
> +	after starting I/O on each page.  Usually the page will be
> +	unlocked by the I/O completion handler.  If the function does
> +	not attempt I/O on some pages, the caller will decrement the page
> +	refcount and unlock the pages for you.	Set PageUptodate if the
> +	I/O completes successfully.  Setting PageError on any page will
> +	be ignored; simply unlock the page if an I/O error occurs.
> +
>  ``readpages``
>  	called by the VM to read pages associated with the address_space
>  	object.  This is essentially just a vector version of readpage.
>  	Instead of just one page, several pages are requested.
>  	readpages is only used for read-ahead, so read errors are
>  	ignored.  If anything goes wrong, feel free to give up.
> +	This interface is deprecated; implement readahead instead.
>  
>  ``write_begin``
>  	Called by the generic buffered write code to ask the filesystem
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..d4e2d2964346 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -292,6 +292,7 @@ enum positive_aop_returns {
>  struct page;
>  struct address_space;
>  struct writeback_control;
> +struct readahead_control;
>  
>  /*
>   * Write life time hint values.
> @@ -375,6 +376,7 @@ struct address_space_operations {
>  	 */
>  	int (*readpages)(struct file *filp, struct address_space *mapping,
>  			struct list_head *pages, unsigned nr_pages);
> +	void (*readahead)(struct readahead_control *);
>  
>  	int (*write_begin)(struct file *, struct address_space *mapping,
>  				loff_t pos, unsigned len, unsigned flags,
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 3613154e79e4..bd4291f78f41 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -665,6 +665,24 @@ static inline void readahead_next(struct readahead_control *rac)
>  #define readahead_for_each(rac, page)					\
>  	for (; (page = readahead_page(rac)); readahead_next(rac))
>  
> +/* The byte offset into the file of this readahead block */
> +static inline loff_t readahead_offset(struct readahead_control *rac)
> +{
> +	return (loff_t)rac->_start * PAGE_SIZE;
> +}
> +
> +/* The number of bytes in this readahead block */
> +static inline loff_t readahead_length(struct readahead_control *rac)
> +{
> +	return (loff_t)rac->_nr_pages * PAGE_SIZE;
> +}
> +
> +/* The index of the first page in this readahead block */
> +static inline unsigned int readahead_index(struct readahead_control *rac)
> +{
> +	return rac->_start;
> +}
> +
>  /* The number of pages in this readahead block */
>  static inline unsigned int readahead_count(struct readahead_control *rac)
>  {
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9e430daae42f..975ff5e387be 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
>  
>  	blk_start_plug(&plug);
>  
> -	if (aops->readpages) {
> +	if (aops->readahead) {
> +		aops->readahead(rac);
> +		readahead_for_each(rac, page) {
> +			unlock_page(page);
> +			put_page(page);
> +		}
> +	} else if (aops->readpages) {
>  		aops->readpages(rac->file, rac->mapping, pages,
>  				readahead_count(rac));
>  		/* Clean up the remaining pages */
> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-18 15:42     ` Matthew Wilcox
@ 2020-02-19  0:59       ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  0:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 07:42:22AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:14:59PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:52AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > At allocation time, put the pages in the cache unless we're using
> > > ->readpages.  Add the readahead_for_each() iterator for the benefit of
> > > the ->readpage fallback.  This iterator supports huge pages, even though
> > > none of the filesystems to be converted do yet.
> > 
> > This could be better written - took me some time to get my head
> > around it and the code.
> > 
> > "When populating the page cache for readahead, mappings that don't
> > use ->readpages need to have their pages added to the page cache
> > before ->readpage is called. Do this insertion earlier so that the
> > pages can be looked up immediately prior to ->readpage calls rather
> > than passing them on a linked list. This early insert functionality
> > is also required by the upcoming ->readahead method that will
> > replace ->readpages.
> > 
> > Optimise and simplify the readpage loop by adding a
> > readahead_for_each() iterator to provide the pages we need to read.
> > This iterator also supports huge pages, even though none of the
> > filesystems have been converted to use them yet."
> 
> Thanks, I'll use that.
> 
> > > +static inline struct page *readahead_page(struct readahead_control *rac)
> > > +{
> > > +	struct page *page;
> > > +
> > > +	if (!rac->_nr_pages)
> > > +		return NULL;
> > 
> > Hmmmm.
> > 
> > > +
> > > +	page = xa_load(&rac->mapping->i_pages, rac->_start);
> > > +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > +	rac->_batch_count = hpage_nr_pages(page);
> > 
> > So we could have rac->_nr_pages = 2, and then we get an order 2
> > large page returned, and so rac->_batch_count = 4.
> 
> Well, no, we couldn't.  rac->_nr_pages is incremented by 4 when we add
> an order-2 page to the readahead.

I don't see any code that does that. :)

i.e. we aren't actually putting high order pages into the page
cache here (page_alloc() allocates order-0 pages), so there's
nothing in the patch that tells me how rac->_nr_pages behaves
when allocating large pages...

IOWs, we have an undocumented assumption in the implementation...

> I can put a
> 	BUG_ON(rac->_batch_count > rac->_nr_pages)
> in here to be sure to catch any logic error like that.

Definitely necessary given that we don't insert large pages for
readahead yet. A comment explaining the assumptions that the
code makes for large pages is probably in order, too.
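
Something like this right where the batch count is set would do
(sketch only):

	/*
	 * Large pages must be accounted in _nr_pages in units of
	 * hpage_nr_pages() when they are inserted, otherwise a batch
	 * could claim more pages than remain in this readahead block.
	 */
	BUG_ON(rac->_batch_count > rac->_nr_pages);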

> > > -		page->index = offset;
> > > -		list_add(&page->lru, &page_pool);
> > > +		if (use_list) {
> > > +			page->index = offset;
> > > +			list_add(&page->lru, &page_pool);
> > > +		} else if (add_to_page_cache_lru(page, mapping, offset,
> > > +					gfp_mask) < 0) {
> > > +			put_page(page);
> > > +			goto read;
> > > +		}
> > 
> > Ok, so that's why you put read code at the end of the loop. To turn
> > the code into spaghetti :/
> > 
> > How much does this simplify down when we get rid of ->readpages and
> > can restructure the loop? This really seems like you're trying to
> > flatten two nested loops into one by the use of goto....
> 
> I see it as having two failure cases in this loop.  One for "page is
> already present" (which already existed) and one for "allocated a page,
> but failed to add it to the page cache" (which used to be done later).
> I didn't want to duplicate the "call read_pages()" code.  So I reshuffled
> the code rather than add a nested loop.  I don't think the nested loop
> is easier to read (we'll be at 5 levels of indentation for some statements).
> Could do it this way ...

Can we move the update of @rac inside read_pages()? The next
start offset^Windex we start at is rac._start + rac._nr_pages, right?

so read_pages() could do:

{
	if (readahead_count(rac)) {
		/* do readahead */
	}

	/* advance the readahead cursor */
	rac->_start += rac->_nr_pages;
	rac->_nr_pages = 0;
}

and then we only need to call read_pages() in these cases, and the
code duplication is avoided...
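
i.e. the main loop body could then handle the "already present" case
with just (sketch):

		if (page && !xa_is_value(page)) {
			/* read_pages() advances the cursor itself */
			read_pages(&rac, &page_pool);
			continue;
		}

and the goto/continue shuffle disappears.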

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-19  0:01   ` John Hubbard
@ 2020-02-19  1:02     ` Matthew Wilcox
  2020-02-19  1:13       ` John Hubbard
  2020-02-19  3:24       ` John Hubbard
  2020-02-19 14:41     ` Matthew Wilcox
  1 sibling, 2 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19  1:02 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 04:01:43PM -0800, John Hubbard wrote:
> How about this instead? It uses the "for" loop fully and more naturally,
> and is easier to read. And it does the same thing:
> 
> static inline struct page *readahead_page(struct readahead_control *rac)
> {
> 	struct page *page;
> 
> 	if (!rac->_nr_pages)
> 		return NULL;
> 
> 	page = xa_load(&rac->mapping->i_pages, rac->_start);
> 	VM_BUG_ON_PAGE(!PageLocked(page), page);
> 	rac->_batch_count = hpage_nr_pages(page);
> 
> 	return page;
> }
> 
> static inline struct page *readahead_next(struct readahead_control *rac)
> {
> 	rac->_nr_pages -= rac->_batch_count;
> 	rac->_start += rac->_batch_count;
> 
> 	return readahead_page(rac);
> }
> 
> #define readahead_for_each(rac, page)			\
> 	for (page = readahead_page(rac); page != NULL;	\
> 	     page = readahead_page(rac))

I'm assuming you mean 'page = readahead_next(rac)' on that second line.

If you keep reading all the way to the penultimate patch, it won't work
for iomap ... at least not in the same way.


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 08/19] mm: Add readahead address space operation
  2020-02-18 16:10     ` Matthew Wilcox
@ 2020-02-19  1:04       ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  1:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 08:10:04AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:21:47PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > This replaces ->readpages with a saner interface:
> > >  - Return void instead of an ignored error code.
> > >  - Pages are already in the page cache when ->readahead is called.
> > 
> > Might read better as:
> > 
> >  - Page cache is already populated with locked pages when
> >    ->readahead is called.
> 
> Will do.
> 
> > >  - Implementation looks up the pages in the page cache instead of
> > >    having them passed in a linked list.
> > 
> > Add:
> > 
> >  - cleanup of unused readahead handled by ->readahead caller, not
> >    the method implementation.
> 
> The readpages caller does that cleanup too, so it's not an advantage
> to the readahead interface.

Right. I kind of read the list as "the reasons the new API is sane"
not as "how readahead is different to readpages"....

> > >  ``readpages``
> > >  	called by the VM to read pages associated with the address_space
> > >  	object.  This is essentially just a vector version of readpage.
> > >  	Instead of just one page, several pages are requested.
> > >  	readpages is only used for read-ahead, so read errors are
> > >  	ignored.  If anything goes wrong, feel free to give up.
> > > +	This interface is deprecated; implement readahead instead.
> > 
> > What is the removal schedule for the deprecated interface? 
> 
> I mentioned that in the cover letter; once Dave Howells has the fscache
> branch merged, I'll do the remaining filesystems.  Should be within the
> next couple of merge windows.

Sure, but I like to see actual release tags with the deprecation
notice so that it's obvious to the reader whether this is
something new and/or when they can expect it to go away.

> > > +/* The byte offset into the file of this readahead block */
> > > +static inline loff_t readahead_offset(struct readahead_control *rac)
> > > +{
> > > +	return (loff_t)rac->_start * PAGE_SIZE;
> > > +}
> > 
> > Urk. Didn't an earlier patch use "offset" for the page index? That's
> > what "mm: Remove 'page_offset' from readahead loop" did, right?
> > 
> > That's just going to cause confusion to have different units for
> > readahead "offsets"....
> 
> We are ... not consistent anywhere in the VM/VFS with our naming.
> Unfortunately.
> 
> $ grep -n offset mm/filemap.c 
> 391: * @start:	offset in bytes where the range starts
> ...
> 815:	pgoff_t offset = old->index;
> ...
> 2020:	unsigned long offset;      /* offset into pagecache page */
> ...
> 2257:	*ppos = ((loff_t)index << PAGE_SHIFT) + offset;
> 
> That last one's my favourite.  Not to mention the fine distinction you
> and I discussed recently between offset_in_page() and page_offset().
> 
> Best of all, even our types encode the ambiguity of an 'offset'.  We have
> pgoff_t and loff_t (replacing the earlier off_t).
> 
> So, new rule.  'pos' is the number of bytes into a file.  'index' is the
> number of PAGE_SIZE pages into a file.  We don't use the word 'offset'
> at all.  'length' as a byte count and 'count' as a page count seem like
> fine names to me.

That sounds very reasonable to me. Another patchset in the making? :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 09/19] mm: Add page_cache_readahead_limit
  2020-02-18 19:54     ` Matthew Wilcox
@ 2020-02-19  1:08       ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  1:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 11:54:04AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:31:10PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > ext4 and f2fs have duplicated the guts of the readahead code so
> > > they can read past i_size.  Instead, separate out the guts of the
> > > readahead code so they can call it directly.
> > 
> > Gross and nasty (hosting non-stale data beyond EOF in the page
> > cache, that is).
> 
> I thought you meant sneaking changes into the VFS (that were rejected) by
> copying VFS code and modifying it ...

Well, now that you mention it... :P

> > > +/**
> > > + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
> > > + * @mapping: File address space.
> > > + * @file: This instance of the open file; used for authentication.
> > > + * @offset: First page index to read.
> > > + * @end_index: The maximum page index to read.
> > > + * @nr_to_read: The number of pages to read.
> > > + * @lookahead_size: Where to start the next readahead.
> > > + *
> > > + * This function is for filesystems to call when they want to start
> > > + * readahead potentially beyond a file's stated i_size.  If you want
> > > + * to start readahead on a normal file, you probably want to call
> > > + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
> > > + *
> > > + * Context: File is referenced by caller.  Mutexes may be held by caller.
> > > + * May sleep, but will not reenter filesystem to reclaim memory.
> > >   */
> > > -void __do_page_cache_readahead(struct address_space *mapping,
> > > -		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> > > -		unsigned long lookahead_size)
> > > +void page_cache_readahead_limit(struct address_space *mapping,
> > 
> > ... I don't think the function name conveys it's purpose. It's
> > really a ranged readahead that ignores where i_size lies. i.e
> > 
> > 	page_cache_readahead_range(mapping, start, end, nr_to_read)
> > 
> > seems like a better API to me, and then you can drop the "start
> > readahead beyond i_size" comments and replace it with "Range is not
> > limited by the inode's i_size and hence can be used to read data
> > stored beyond EOF into the page cache."
> 
> I'm concerned that calling it 'range' implies "I want to read between
> start and end" rather than "I want to read nr_to_read at start, oh but
> don't go past end".
> 
> Maybe the right way to do this is have the three callers cap nr_to_read.
> Well, the one caller ... after all, f2fs and ext4 have no desire to
> cap the length.  Then we can call it page_cache_readahead_exceed() or
> page_cache_readahead_dangerous() or something else like that to make it
> clear that you shouldn't be calling it.

Fair point.

And in reading this, it occurred to me that what we are enabling is
an "out of bounds" readahead function. so
page_cache_readahead_OOB() or *_unbounded() might be a better name....

>   * Like add_to_page_cache_locked, but used to add newly allocated pages:
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9dd431fa16c9..cad26287ad8b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -142,45 +142,43 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
>  	blk_finish_plug(&plug);
>  }
>  
> -/*
> - * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
> - * the pages first, then submits them for I/O. This avoids the very bad
> - * behaviour which would occur if page allocations are causing VM writeback.
> - * We really don't want to intermingle reads and writes like that.
> +/**
> + * page_cache_readahead_exceed - Start unchecked readahead.
> + * @mapping: File address space.
> + * @file: This instance of the open file; used for authentication.
> + * @index: First page index to read.
> + * @nr_to_read: The number of pages to read.
> + * @lookahead_size: Where to start the next readahead.
> + *
> + * This function is for filesystems to call when they want to start
> + * readahead beyond a file's stated i_size.  This is almost certainly
> + * not the function you want to call.  Use page_cache_async_readahead()
> + * or page_cache_sync_readahead() instead.
> + *
> + * Context: File is referenced by caller.  Mutexes may be held by caller.
> + * May sleep, but will not reenter filesystem to reclaim memory.

Yup, looks much better.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-19  1:02     ` Matthew Wilcox
@ 2020-02-19  1:13       ` John Hubbard
  2020-02-19  3:24       ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-19  1:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On 2/18/20 5:02 PM, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:01:43PM -0800, John Hubbard wrote:
>> How about this instead? It uses the "for" loop fully and more naturally,
>> and is easier to read. And it does the same thing:
>>
>> static inline struct page *readahead_page(struct readahead_control *rac)
>> {
>> 	struct page *page;
>>
>> 	if (!rac->_nr_pages)
>> 		return NULL;
>>
>> 	page = xa_load(&rac->mapping->i_pages, rac->_start);
>> 	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> 	rac->_batch_count = hpage_nr_pages(page);
>>
>> 	return page;
>> }
>>
>> static inline struct page *readahead_next(struct readahead_control *rac)
>> {
>> 	rac->_nr_pages -= rac->_batch_count;
>> 	rac->_start += rac->_batch_count;
>>
>> 	return readahead_page(rac);
>> }
>>
>> #define readahead_for_each(rac, page)			\
>> 	for (page = readahead_page(rac); page != NULL;	\
>> 	     page = readahead_page(rac))
> 
> I'm assuming you mean 'page = readahead_next(rac)' on that second line.


Yep. :)
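
For the record, the corrected macro reads:

#define readahead_for_each(rac, page)			\
	for (page = readahead_page(rac); page != NULL;	\
	     page = readahead_next(rac))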


> 
> If you keep reading all the way to the penultimate patch, it won't work
> for iomap ... at least not in the same way.
> 

OK, getting there...


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v6 11/19] btrfs: Convert from readpages to readahead
  2020-02-18 21:12     ` Matthew Wilcox
@ 2020-02-19  1:23       ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  1:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 01:12:28PM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:57:58PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:59AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
....
> > >  
> > > -		if (nr) {
> > > -			u64 contig_start = page_offset(pagepool[0]);
> > > +		ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
> > 
> > Ok, yes it does. :)
> > 
> > I don't see how readahead_for_each_batch() guarantees that, though.
> 
> I ... don't see how it doesn't?  We start at rac->_start and iterate
> through the consecutive pages in the page cache.  readahead_for_each_batch()
> does assume that __do_page_cache_readahead() has its current behaviour
> of putting the pages in the page cache in order, and kicks off a new
> call to ->readahead() every time it has to skip an index for whatever
> reason (eg page already in page cache).

And there is the comment I was looking for while reading
readahead_for_each_batch() :)
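
For anyone else reading along, a simplified sketch of that behaviour,
assuming the allocation loop shape from earlier in the series (the
names match the series; the exact code may differ):

	for (i = 0; i < nr_to_read; i++) {
		struct page *page = xa_load(&mapping->i_pages, offset + i);

		if (page && !xa_is_value(page)) {
			/*
			 * Page already in the cache: submit whatever has
			 * been batched so far and restart the batch after
			 * the existing page, so each ->readahead() call
			 * only ever sees a contiguous run of new pages.
			 */
			if (readahead_count(&rac))
				read_pages(&rac, &page_pool);
			rac._start = offset + i + 1;
			rac._nr_pages = 0;
			continue;
		}

		/* Otherwise: allocate the page, add it to the page cache,
		 * and count it in the current batch. */
	}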

> 
> > > -	if (bio)
> > > -		return submit_one_bio(bio, 0, bio_flags);
> > > -	return 0;
> > > +	if (bio) {
> > > +		if (submit_one_bio(bio, 0, bio_flags))
> > > +			return;
> > > +	}
> > >  }
> > 
> > Shouldn't that just be
> > 
> > 	if (bio)
> > 		submit_one_bio(bio, 0, bio_flags);
> 
> It should, but some overzealous person decided to mark submit_one_bio()
> as __must_check, so I have to work around that.

/me looks at code

Ngggh.

I rather dislike functions that are named in a way that makes them
look like they belong to core kernel APIs but are in reality local
static functions.

I'd ask for this to be fixed if it was generic code, but it's btrfs
specific code so they can deal with the ugliness of their own
creation. :/

> > Confusing when put alongside rac->_batch_count counting the number
> > of pages in the batch, and "batch" being the index into the page
> > array, and they aren't the same counts....
> 
> Yes.  Renamed to 'i'.
> 
> > > +	XA_STATE(xas, &rac->mapping->i_pages, rac->_start);
> > > +	struct page *page;
> > > +
> > > +	rac->_batch_count = 0;
> > > +	xas_for_each(&xas, page, rac->_start + rac->_nr_pages - 1) {
> > 
> > That just iterates pages in the start,end doesn't it? What
> > guarantees that this fills the array with a contiguous page range?
> 
> The behaviour of __do_page_cache_readahead().  Dave Howells also has a
> usecase for xas_for_each_contig(), so I'm going to add that soon.
> 
> > > +		VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > +		VM_BUG_ON_PAGE(PageTail(page), page);
> > > +		array[batch++] = page;
> > > +		rac->_batch_count += hpage_nr_pages(page);
> > > +		if (PageHead(page))
> > > +			xas_set(&xas, rac->_start + rac->_batch_count);
> > 
> > What on earth does this do? Comments please!
> 
> 		/*
> 		 * The page cache isn't using multi-index entries yet,
> 		 * so xas_for_each() won't do the right thing for
> 		 * large pages.  This can be removed once the page cache
> 		 * is converted.
> 		 */

Oh, it's changing the internal xarray lookup cursor position to
point at the correct next page index? Perhaps it's better to say
that instead of "won't do the right thing"?
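
In other words, assuming the current one-slot-per-subpage scheme for
compound pages in the page cache, for a 16-page compound page found
at index 0 the iteration does roughly:

	rac->_batch_count += hpage_nr_pages(page);	/* now 16 */
	if (PageHead(page))
		/* Re-seed the cursor so the next iteration starts at
		 * index 16 instead of returning the tail-page slots. */
		xas_set(&xas, rac->_start + rac->_batch_count);

With a multi-index page cache the re-seeding goes away, as the
comment says.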

> > > +#define readahead_for_each_batch(rac, array, size, nr)			\
> > > +	for (; (nr = readahead_page_batch(rac, array, size));		\
> > > +			readahead_next(rac))
> > 
> > I had to go look at the caller to work out what "size" referred to
> > here.
> > 
> > This is complex enough that it needs proper API documentation.
> 
> How about just:
> 
> -#define readahead_for_each_batch(rac, array, size, nr)                 \
> -       for (; (nr = readahead_page_batch(rac, array, size));           \
> +#define readahead_for_each_batch(rac, array, array_sz, nr)             \
> +       for (; (nr = readahead_page_batch(rac, array, array_sz));       \

Yup, that's fine - now the macro documents itself.
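
As a usage sketch (assuming rac is the readahead_control passed to
->readahead(); the array size and the per-page work are illustrative
only, and handle_page() is a hypothetical helper):

	struct page *pages[16];
	unsigned int nr, i;

	readahead_for_each_batch(rac, pages, ARRAY_SIZE(pages), nr) {
		for (i = 0; i < nr; i++)
			handle_page(pages[i]);
	}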

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 09/19] mm: Add page_cache_readahead_limit
  2020-02-17 18:45 ` [PATCH v6 09/19] mm: Add page_cache_readahead_limit Matthew Wilcox
  2020-02-18  6:31   ` Dave Chinner
@ 2020-02-19  1:32   ` John Hubbard
  2020-02-19  2:23     ` Matthew Wilcox
  1 sibling, 1 reply; 111+ messages in thread
From: John Hubbard @ 2020-02-19  1:32 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> ext4 and f2fs have duplicated the guts of the readahead code so
> they can read past i_size.  Instead, separate out the guts of the
> readahead code so they can call it directly.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/ext4/verity.c        | 35 ++---------------------
>  fs/f2fs/verity.c        | 35 ++---------------------
>  include/linux/pagemap.h |  4 +++
>  mm/readahead.c          | 61 +++++++++++++++++++++++++++++------------
>  4 files changed, 52 insertions(+), 83 deletions(-)


Just some minor ideas below, mostly documentation, so:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>

> 
> diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
> index dc5ec724d889..f6e0bf05933e 100644
> --- a/fs/ext4/verity.c
> +++ b/fs/ext4/verity.c
> @@ -342,37 +342,6 @@ static int ext4_get_verity_descriptor(struct inode *inode, void *buf,
>  	return desc_size;
>  }
>  
> -/*
> - * Prefetch some pages from the file's Merkle tree.
> - *
> - * This is basically a stripped-down version of __do_page_cache_readahead()
> - * which works on pages past i_size.
> - */
> -static void ext4_merkle_tree_readahead(struct address_space *mapping,
> -				       pgoff_t start_index, unsigned long count)
> -{
> -	LIST_HEAD(pages);
> -	unsigned int nr_pages = 0;
> -	struct page *page;
> -	pgoff_t index;
> -	struct blk_plug plug;
> -
> -	for (index = start_index; index < start_index + count; index++) {
> -		page = xa_load(&mapping->i_pages, index);
> -		if (!page || xa_is_value(page)) {
> -			page = __page_cache_alloc(readahead_gfp_mask(mapping));
> -			if (!page)
> -				break;
> -			page->index = index;
> -			list_add(&page->lru, &pages);
> -			nr_pages++;
> -		}
> -	}
> -	blk_start_plug(&plug);
> -	ext4_mpage_readpages(mapping, &pages, NULL, nr_pages, true);
> -	blk_finish_plug(&plug);
> -}
> -
>  static struct page *ext4_read_merkle_tree_page(struct inode *inode,
>  					       pgoff_t index,
>  					       unsigned long num_ra_pages)
> @@ -386,8 +355,8 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
>  		if (page)
>  			put_page(page);
>  		else if (num_ra_pages > 1)
> -			ext4_merkle_tree_readahead(inode->i_mapping, index,
> -						   num_ra_pages);
> +			page_cache_readahead_limit(inode->i_mapping, NULL,
> +					index, LONG_MAX, num_ra_pages, 0);


LONG_MAX seems bold at first, but then again I can't think of anything smaller 
that makes any sense, and the previous code didn't have a limit either...OK.

I also wondered about the NULL file parameter, and wonder if we're stripping out
information that is needed for authentication, given that that's what the newly
written kerneldoc says the "file" arg is for. But it seems that if we're this 
deep in the fs code's read routines, file system authentication has long since 
been addressed.

And actually I don't yet (still working through the patches) see any authentication,
so maybe that parameter will turn out to be unnecessary.

Anyway, it's nice to see this factored out into a single routine.


>  		page = read_mapping_page(inode->i_mapping, index, NULL);
>  	}
>  	return page;
> diff --git a/fs/f2fs/verity.c b/fs/f2fs/verity.c
> index d7d430a6f130..71a3e36721fa 100644
> --- a/fs/f2fs/verity.c
> +++ b/fs/f2fs/verity.c
> @@ -222,37 +222,6 @@ static int f2fs_get_verity_descriptor(struct inode *inode, void *buf,
>  	return size;
>  }
>  
> -/*
> - * Prefetch some pages from the file's Merkle tree.
> - *
> - * This is basically a stripped-down version of __do_page_cache_readahead()
> - * which works on pages past i_size.
> - */
> -static void f2fs_merkle_tree_readahead(struct address_space *mapping,
> -				       pgoff_t start_index, unsigned long count)
> -{
> -	LIST_HEAD(pages);
> -	unsigned int nr_pages = 0;
> -	struct page *page;
> -	pgoff_t index;
> -	struct blk_plug plug;
> -
> -	for (index = start_index; index < start_index + count; index++) {
> -		page = xa_load(&mapping->i_pages, index);
> -		if (!page || xa_is_value(page)) {
> -			page = __page_cache_alloc(readahead_gfp_mask(mapping));
> -			if (!page)
> -				break;
> -			page->index = index;
> -			list_add(&page->lru, &pages);
> -			nr_pages++;
> -		}
> -	}
> -	blk_start_plug(&plug);
> -	f2fs_mpage_readpages(mapping, &pages, NULL, nr_pages, true);
> -	blk_finish_plug(&plug);
> -}
> -
>  static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
>  					       pgoff_t index,
>  					       unsigned long num_ra_pages)
> @@ -266,8 +235,8 @@ static struct page *f2fs_read_merkle_tree_page(struct inode *inode,
>  		if (page)
>  			put_page(page);
>  		else if (num_ra_pages > 1)
> -			f2fs_merkle_tree_readahead(inode->i_mapping, index,
> -						   num_ra_pages);
> +			page_cache_readahead_limit(inode->i_mapping, NULL,
> +					index, LONG_MAX, num_ra_pages, 0);
>  		page = read_mapping_page(inode->i_mapping, index, NULL);
>  	}
>  	return page;
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index bd4291f78f41..4f36c06d064d 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -389,6 +389,10 @@ extern struct page * read_cache_page_gfp(struct address_space *mapping,
>  				pgoff_t index, gfp_t gfp_mask);
>  extern int read_cache_pages(struct address_space *mapping,
>  		struct list_head *pages, filler_t *filler, void *data);
> +void page_cache_readahead_limit(struct address_space *mapping,
> +		struct file *file, pgoff_t offset, pgoff_t end_index,
> +		unsigned long nr_to_read, unsigned long lookahead_size);
> +
>  
>  static inline struct page *read_mapping_page(struct address_space *mapping,
>  				pgoff_t index, void *data)
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 975ff5e387be..94d499cfb657 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -142,35 +142,38 @@ static void read_pages(struct readahead_control *rac, struct list_head *pages)
>  	blk_finish_plug(&plug);
>  }
>  
> -/*
> - * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
> - * the pages first, then submits them for I/O. This avoids the very bad
> - * behaviour which would occur if page allocations are causing VM writeback.
> - * We really don't want to intermingle reads and writes like that.
> +/**
> + * page_cache_readahead_limit - Start readahead beyond a file's i_size.


Maybe: 

    "Start readahead to a caller-specified end point" ?

(It's only *potentially* beyond the file's i_size.)


> + * @mapping: File address space.
> + * @file: This instance of the open file; used for authentication.
> + * @offset: First page index to read.
> + * @end_index: The maximum page index to read.
> + * @nr_to_read: The number of pages to read.


How about:

    "The number of pages to read, as long as end_index is not exceeded."


> + * @lookahead_size: Where to start the next readahead.


Pre-existing, but...it's hard to understand how a size is "where to start".
Should we rename this arg?

> + *
> + * This function is for filesystems to call when they want to start
> + * readahead potentially beyond a file's stated i_size.  If you want
> + * to start readahead on a normal file, you probably want to call
> + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
> + *
> + * Context: File is referenced by caller.  Mutexes may be held by caller.
> + * May sleep, but will not reenter filesystem to reclaim memory.


In fact, can we say "must not reenter filesystem"? 


>   */
> -void __do_page_cache_readahead(struct address_space *mapping,
> -		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> -		unsigned long lookahead_size)
> +void page_cache_readahead_limit(struct address_space *mapping,
> +		struct file *file, pgoff_t offset, pgoff_t end_index,
> +		unsigned long nr_to_read, unsigned long lookahead_size)
>  {
> -	struct inode *inode = mapping->host;
> -	unsigned long end_index;	/* The last page we want to read */
>  	LIST_HEAD(page_pool);
>  	unsigned long i;
> -	loff_t isize = i_size_read(inode);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
>  	bool use_list = mapping->a_ops->readpages;
>  	struct readahead_control rac = {
>  		.mapping = mapping,
> -		.file = filp,
> +		.file = file,
>  		._start = offset,
>  		._nr_pages = 0,
>  	};
>  
> -	if (isize == 0)
> -		return;
> -
> -	end_index = ((isize - 1) >> PAGE_SHIFT);
> -
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */
> @@ -225,6 +228,30 @@ void __do_page_cache_readahead(struct address_space *mapping,
>  		read_pages(&rac, &page_pool);
>  	BUG_ON(!list_empty(&page_pool));
>  }
> +EXPORT_SYMBOL_GPL(page_cache_readahead_limit);
> +
> +/*
> + * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
> + * the pages first, then submits them for I/O. This avoids the very bad
> + * behaviour which would occur if page allocations are causing VM writeback.
> + * We really don't want to intermingle reads and writes like that.
> + */
> +void __do_page_cache_readahead(struct address_space *mapping,
> +		struct file *file, pgoff_t offset, unsigned long nr_to_read,
> +		unsigned long lookahead_size)
> +{
> +	struct inode *inode = mapping->host;
> +	unsigned long end_index;	/* The last page we want to read */
> +	loff_t isize = i_size_read(inode);
> +
> +	if (isize == 0)
> +		return;
> +
> +	end_index = ((isize - 1) >> PAGE_SHIFT);
> +
> +	page_cache_readahead_limit(mapping, file, offset, end_index,
> +			nr_to_read, lookahead_size);
> +}
>  
>  /*
>   * Chunk the readahead into 2 megabyte units, so that we don't pin too much
> 


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v6 09/19] mm: Add page_cache_readahead_limit
  2020-02-19  1:32   ` John Hubbard
@ 2020-02-19  2:23     ` Matthew Wilcox
  2020-02-19  2:46       ` John Hubbard
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19  2:23 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 05:32:31PM -0800, John Hubbard wrote:
> > +			page_cache_readahead_limit(inode->i_mapping, NULL,
> > +					index, LONG_MAX, num_ra_pages, 0);
> 
> 
> LONG_MAX seems bold at first, but then again I can't think of anything smaller 
> that makes any sense, and the previous code didn't have a limit either...OK.

Probably worth looking at Dave's review of this and what we've just
negotiated on the other subthread ... LONG_MAX is gone.

> I also wondered about the NULL file parameter, and wonder if we're stripping out
> information that is needed for authentication, given that that's what the newly
> written kerneldoc says the "file" arg is for. But it seems that if we're this 
> deep in the fs code's read routines, file system authentication has long since 
> been addressed.

The authentication is for network filesystems.  Local filesystems
generally don't use the 'file' parameter, and since we're going to be
calling back into the filesystem's own readahead routine, we know it's
not needed.

> And actually I don't yet (still working through the patches) see any authentication,
> so maybe that parameter will turn out to be unnecessary.
> 
> Anyway, it's nice to see this factored out into a single routine.

I'm kind of thinking about pushing the rac in the other direction too,
so page_cache_readahead_unlimited(rac, nr_to_read, lookahead_size).

> > +/**
> > + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
> 
> 
> Maybe: 
> 
>     "Start readahead to a caller-specified end point" ?
> 
> (It's only *potentially* beyond the file's i_size.)

My current tree has:
 * page_cache_readahead_exceed - Start unchecked readahead.


> > + * @mapping: File address space.
> > + * @file: This instance of the open file; used for authentication.
> > + * @offset: First page index to read.
> > + * @end_index: The maximum page index to read.
> > + * @nr_to_read: The number of pages to read.
> 
> 
> How about:
> 
>     "The number of pages to read, as long as end_index is not exceeded."

API change makes this irrelevant ;-)

> > + * @lookahead_size: Where to start the next readahead.
> 
> Pre-existing, but...it's hard to understand how a size is "where to start".
> Should we rename this arg?

It should probably be lookahead_count.
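
For reference, the marker placement that consumes it (matching the
pre-existing __do_page_cache_readahead() behaviour): the page that
many slots before the end of the window is flagged, and hitting that
flag later is what kicks off the next asynchronous readahead:

	if (i == nr_to_read - lookahead_size)
		SetPageReadahead(page);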

> > + *
> > + * This function is for filesystems to call when they want to start
> > + * readahead potentially beyond a file's stated i_size.  If you want
> > + * to start readahead on a normal file, you probably want to call
> > + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
> > + *
> > + * Context: File is referenced by caller.  Mutexes may be held by caller.
> > + * May sleep, but will not reenter filesystem to reclaim memory.
> 
> In fact, can we say "must not reenter filesystem"? 

I think it depends which side of the API you're looking at which wording
you prefer ;-)



* Re: [PATCH v6 11/16] erofs: Convert compressed files from readpages to readahead
  2020-02-17 18:46 ` [PATCH v6 11/16] erofs: Convert compressed files " Matthew Wilcox
@ 2020-02-19  2:34   ` Gao Xiang
  0 siblings, 0 replies; 111+ messages in thread
From: Gao Xiang @ 2020-02-19  2:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:46:00AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use the new readahead operation in erofs.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

It looks good to me. Some further optimization is possible, but a
straightforward transform first makes sense. I haven't tested the
whole series yet; I will test it later.

Acked-by: Gao Xiang <gaoxiang25@huawei.com>

Thanks,
Gao Xiang

> ---
>  fs/erofs/zdata.c | 29 +++++++++--------------------
>  1 file changed, 9 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index 17f45fcb8c5c..7c02015d501d 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -1303,28 +1303,23 @@ static bool should_decompress_synchronously(struct erofs_sb_info *sbi,
>  	return nr <= sbi->max_sync_decompress_pages;
>  }
>  
> -static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
> -			     struct list_head *pages, unsigned int nr_pages)
> +static void z_erofs_readahead(struct readahead_control *rac)
>  {
> -	struct inode *const inode = mapping->host;
> +	struct inode *const inode = rac->mapping->host;
>  	struct erofs_sb_info *const sbi = EROFS_I_SB(inode);
>  
> -	bool sync = should_decompress_synchronously(sbi, nr_pages);
> +	bool sync = should_decompress_synchronously(sbi, readahead_count(rac));
>  	struct z_erofs_decompress_frontend f = DECOMPRESS_FRONTEND_INIT(inode);
> -	gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
> -	struct page *head = NULL;
> +	struct page *page, *head = NULL;
>  	LIST_HEAD(pagepool);
>  
> -	trace_erofs_readpages(mapping->host, lru_to_page(pages)->index,
> -			      nr_pages, false);
> +	trace_erofs_readpages(inode, readahead_index(rac),
> +			readahead_count(rac), false);
>  
> -	f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
> -
> -	for (; nr_pages; --nr_pages) {
> -		struct page *page = lru_to_page(pages);
> +	f.headoffset = readahead_offset(rac);
>  
> +	readahead_for_each(rac, page) {
>  		prefetchw(&page->flags);
> -		list_del(&page->lru);
>  
>  		/*
>  		 * A pure asynchronous readahead is indicated if
> @@ -1333,11 +1328,6 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
>  		 */
>  		sync &= !(PageReadahead(page) && !head);
>  
> -		if (add_to_page_cache_lru(page, mapping, page->index, gfp)) {
> -			list_add(&page->lru, &pagepool);
> -			continue;
> -		}
> -
>  		set_page_private(page, (unsigned long)head);
>  		head = page;
>  	}
> @@ -1366,11 +1356,10 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
>  
>  	/* clean up the remaining free pages */
>  	put_pages_list(&pagepool);
> -	return 0;
>  }
>  
>  const struct address_space_operations z_erofs_aops = {
>  	.readpage = z_erofs_readpage,
> -	.readpages = z_erofs_readpages,
> +	.readahead = z_erofs_readahead,
>  };
>  
> -- 
> 2.25.0
> 
> 


* Re: [PATCH v6 12/19] erofs: Convert uncompressed files from readpages to readahead
  2020-02-17 18:46 ` [PATCH v6 12/19] erofs: Convert uncompressed " Matthew Wilcox
@ 2020-02-19  2:39   ` Gao Xiang
  2020-02-19  3:04   ` Dave Chinner
  1 sibling, 0 replies; 111+ messages in thread
From: Gao Xiang @ 2020-02-19  2:39 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:46:01AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use the new readahead operation in erofs
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---

It looks good to me, and I will test it later as well.

Acked-by: Gao Xiang <gaoxiang25@huawei.com>

Thanks,
Gao Xiang

>  fs/erofs/data.c              | 39 +++++++++++++-----------------------
>  fs/erofs/zdata.c             |  2 +-
>  include/trace/events/erofs.h |  6 +++---
>  3 files changed, 18 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/erofs/data.c b/fs/erofs/data.c
> index fc3a8d8064f8..82ebcee9d178 100644
> --- a/fs/erofs/data.c
> +++ b/fs/erofs/data.c
> @@ -280,47 +280,36 @@ static int erofs_raw_access_readpage(struct file *file, struct page *page)
>  	return 0;
>  }
>  
> -static int erofs_raw_access_readpages(struct file *filp,
> -				      struct address_space *mapping,
> -				      struct list_head *pages,
> -				      unsigned int nr_pages)
> +static void erofs_raw_access_readahead(struct readahead_control *rac)
>  {
>  	erofs_off_t last_block;
>  	struct bio *bio = NULL;
> -	gfp_t gfp = readahead_gfp_mask(mapping);
> -	struct page *page = list_last_entry(pages, struct page, lru);
> -
> -	trace_erofs_readpages(mapping->host, page, nr_pages, true);
> +	struct page *page;
>  
> -	for (; nr_pages; --nr_pages) {
> -		page = list_entry(pages->prev, struct page, lru);
> +	trace_erofs_readpages(rac->mapping->host, readahead_index(rac),
> +			readahead_count(rac), true);
>  
> +	readahead_for_each(rac, page) {
>  		prefetchw(&page->flags);
> -		list_del(&page->lru);
>  
> -		if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) {
> -			bio = erofs_read_raw_page(bio, mapping, page,
> -						  &last_block, nr_pages, true);
> +		bio = erofs_read_raw_page(bio, rac->mapping, page, &last_block,
> +				readahead_count(rac), true);
>  
> -			/* all the page errors are ignored when readahead */
> -			if (IS_ERR(bio)) {
> -				pr_err("%s, readahead error at page %lu of nid %llu\n",
> -				       __func__, page->index,
> -				       EROFS_I(mapping->host)->nid);
> +		/* all the page errors are ignored when readahead */
> +		if (IS_ERR(bio)) {
> +			pr_err("%s, readahead error at page %lu of nid %llu\n",
> +			       __func__, page->index,
> +			       EROFS_I(rac->mapping->host)->nid);
>  
> -				bio = NULL;
> -			}
> +			bio = NULL;
>  		}
>  
> -		/* pages could still be locked */
>  		put_page(page);
>  	}
> -	DBG_BUGON(!list_empty(pages));
>  
>  	/* the rare case (end in gaps) */
>  	if (bio)
>  		submit_bio(bio);
> -	return 0;
>  }
>  
>  static int erofs_get_block(struct inode *inode, sector_t iblock,
> @@ -358,7 +347,7 @@ static sector_t erofs_bmap(struct address_space *mapping, sector_t block)
>  /* for uncompressed (aligned) files and raw access for other files */
>  const struct address_space_operations erofs_raw_access_aops = {
>  	.readpage = erofs_raw_access_readpage,
> -	.readpages = erofs_raw_access_readpages,
> +	.readahead = erofs_raw_access_readahead,
>  	.bmap = erofs_bmap,
>  };
>  
> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
> index 80e47f07d946..17f45fcb8c5c 100644
> --- a/fs/erofs/zdata.c
> +++ b/fs/erofs/zdata.c
> @@ -1315,7 +1315,7 @@ static int z_erofs_readpages(struct file *filp, struct address_space *mapping,
>  	struct page *head = NULL;
>  	LIST_HEAD(pagepool);
>  
> -	trace_erofs_readpages(mapping->host, lru_to_page(pages),
> +	trace_erofs_readpages(mapping->host, lru_to_page(pages)->index,
>  			      nr_pages, false);
>  
>  	f.headoffset = (erofs_off_t)lru_to_page(pages)->index << PAGE_SHIFT;
> diff --git a/include/trace/events/erofs.h b/include/trace/events/erofs.h
> index 27f5caa6299a..bf9806fd1306 100644
> --- a/include/trace/events/erofs.h
> +++ b/include/trace/events/erofs.h
> @@ -113,10 +113,10 @@ TRACE_EVENT(erofs_readpage,
>  
>  TRACE_EVENT(erofs_readpages,
>  
> -	TP_PROTO(struct inode *inode, struct page *page, unsigned int nrpage,
> +	TP_PROTO(struct inode *inode, pgoff_t start, unsigned int nrpage,
>  		bool raw),
>  
> -	TP_ARGS(inode, page, nrpage, raw),
> +	TP_ARGS(inode, start, nrpage, raw),
>  
>  	TP_STRUCT__entry(
>  		__field(dev_t,		dev	)
> @@ -129,7 +129,7 @@ TRACE_EVENT(erofs_readpages,
>  	TP_fast_assign(
>  		__entry->dev	= inode->i_sb->s_dev;
>  		__entry->nid	= EROFS_I(inode)->nid;
> -		__entry->start	= page->index;
> +		__entry->start	= start;
>  		__entry->nrpage	= nrpage;
>  		__entry->raw	= raw;
>  	),
> -- 
> 2.25.0
> 
> 


* Re: [PATCH v6 09/19] mm: Add page_cache_readahead_limit
  2020-02-19  2:23     ` Matthew Wilcox
@ 2020-02-19  2:46       ` John Hubbard
  0 siblings, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-19  2:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On 2/18/20 6:23 PM, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:32:31PM -0800, John Hubbard wrote:
>>> +			page_cache_readahead_limit(inode->i_mapping, NULL,
>>> +					index, LONG_MAX, num_ra_pages, 0);
>>
>>
>> LONG_MAX seems bold at first, but then again I can't think of anything smaller 
>> that makes any sense, and the previous code didn't have a limit either...OK.
> 
> Probably worth looking at Dave's review of this and what we've just
> negotiated on the other subthread ... LONG_MAX is gone.


Great. OK, I see where it's going there.

> 
>> I also wondered about the NULL file parameter, and wonder if we're stripping out
>> information that is needed for authentication, given that that's what the newly
>> written kerneldoc says the "file" arg is for. But it seems that if we're this 
>> deep in the fs code's read routines, file system authentication has long since 
>> been addressed.
> 
> The authentication is for network filesystems.  Local filesystems
> generally don't use the 'file' parameter, and since we're going to be
> calling back into the filesystem's own readahead routine, we know it's
> not needed.
> 
>> And actually I don't yet (still working through the patches) see any authentication,
>> so maybe that parameter will turn out to be unnecessary.
>>
>> Anyway, it's nice to see this factored out into a single routine.
> 
> I'm kind of thinking about pushing the rac in the other direction too,
> so page_cache_readahead_unlimited(rac, nr_to_read, lookahead_size).
> 
>>> +/**
>>> + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
>>
>>
>> Maybe: 
>>
>>     "Start readahead to a caller-specified end point" ?
>>
>> (It's only *potentially* beyond the file's i_size.)
> 
> My current tree has:
>  * page_cache_readahead_exceed - Start unchecked readahead.


Sounds good.

> 
> 
>>> + * @mapping: File address space.
>>> + * @file: This instance of the open file; used for authentication.
>>> + * @offset: First page index to read.
>>> + * @end_index: The maximum page index to read.
>>> + * @nr_to_read: The number of pages to read.
>>
>>
>> How about:
>>
>>     "The number of pages to read, as long as end_index is not exceeded."
> 
> API change makes this irrelevant ;-)
> 
>>> + * @lookahead_size: Where to start the next readahead.
>>
>> Pre-existing, but...it's hard to understand how a size is "where to start".
>> Should we rename this arg?
> 
> It should probably be lookahead_count.
> 
>>> + *
>>> + * This function is for filesystems to call when they want to start
>>> + * readahead potentially beyond a file's stated i_size.  If you want
>>> + * to start readahead on a normal file, you probably want to call
>>> + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
>>> + *
>>> + * Context: File is referenced by caller.  Mutexes may be held by caller.
>>> + * May sleep, but will not reenter filesystem to reclaim memory.
>>
>> In fact, can we say "must not reenter filesystem"? 
> 
> I think it depends which side of the API you're looking at which wording
> you prefer ;-)
> 
> 

Yes. We should try to write these so that it's clear which way we're looking:
in or out. 


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
  2020-02-17 18:45 ` [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
  2020-02-18  1:51   ` [Ocfs2-devel] " Joseph Qi
  2020-02-18  6:37   ` Dave Chinner
@ 2020-02-19  2:48   ` John Hubbard
  2020-02-19  3:28   ` Eric Biggers
  3 siblings, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-19  2:48 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, Junxiao Bi, linux-kernel, linux-f2fs-devel,
	cluster-devel, linux-mm, ocfs2-devel, linux-ext4, linux-erofs,
	linux-btrfs

On 2/17/20 10:45 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Implement the new readahead aop and convert all callers (block_dev,
> exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
> reiserfs & udf).  The callers are all trivial except for GFS2 & OCFS2.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> # ocfs2
> ---
>  drivers/staging/exfat/exfat_super.c |  7 +++---
>  fs/block_dev.c                      |  7 +++---
>  fs/ext2/inode.c                     | 10 +++-----
>  fs/fat/inode.c                      |  7 +++---
>  fs/gfs2/aops.c                      | 23 ++++++-----------
>  fs/hpfs/file.c                      |  7 +++---
>  fs/iomap/buffered-io.c              |  2 +-
>  fs/isofs/inode.c                    |  7 +++---
>  fs/jfs/inode.c                      |  7 +++---
>  fs/mpage.c                          | 38 +++++++++--------------------
>  fs/nilfs2/inode.c                   | 15 +++---------
>  fs/ocfs2/aops.c                     | 34 ++++++++++----------------
>  fs/omfs/file.c                      |  7 +++---
>  fs/qnx6/inode.c                     |  7 +++---
>  fs/reiserfs/inode.c                 |  8 +++---
>  fs/udf/inode.c                      |  7 +++---
>  include/linux/mpage.h               |  4 +--
>  mm/migrate.c                        |  2 +-
>  18 files changed, 73 insertions(+), 126 deletions(-)
> 

I didn't spot any errors in this, so:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> diff --git a/drivers/staging/exfat/exfat_super.c b/drivers/staging/exfat/exfat_super.c
> index b81d2a87b82e..96aad9b16d31 100644
> --- a/drivers/staging/exfat/exfat_super.c
> +++ b/drivers/staging/exfat/exfat_super.c
> @@ -3002,10 +3002,9 @@ static int exfat_readpage(struct file *file, struct page *page)
>  	return  mpage_readpage(page, exfat_get_block);
>  }
>  
> -static int exfat_readpages(struct file *file, struct address_space *mapping,
> -			   struct list_head *pages, unsigned int nr_pages)
> +static void exfat_readahead(struct readahead_control *rac)
>  {
> -	return  mpage_readpages(mapping, pages, nr_pages, exfat_get_block);
> +	mpage_readahead(rac, exfat_get_block);
>  }
>  
>  static int exfat_writepage(struct page *page, struct writeback_control *wbc)
> @@ -3104,7 +3103,7 @@ static sector_t _exfat_bmap(struct address_space *mapping, sector_t block)
>  
>  static const struct address_space_operations exfat_aops = {
>  	.readpage    = exfat_readpage,
> -	.readpages   = exfat_readpages,
> +	.readahead   = exfat_readahead,
>  	.writepage   = exfat_writepage,
>  	.writepages  = exfat_writepages,
>  	.write_begin = exfat_write_begin,
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 69bf2fb6f7cd..2fd9c7bd61f6 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -614,10 +614,9 @@ static int blkdev_readpage(struct file * file, struct page * page)
>  	return block_read_full_page(page, blkdev_get_block);
>  }
>  
> -static int blkdev_readpages(struct file *file, struct address_space *mapping,
> -			struct list_head *pages, unsigned nr_pages)
> +static void blkdev_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, blkdev_get_block);
> +	mpage_readahead(rac, blkdev_get_block);
>  }
>  
>  static int blkdev_write_begin(struct file *file, struct address_space *mapping,
> @@ -2062,7 +2061,7 @@ static int blkdev_writepages(struct address_space *mapping,
>  
>  static const struct address_space_operations def_blk_aops = {
>  	.readpage	= blkdev_readpage,
> -	.readpages	= blkdev_readpages,
> +	.readahead	= blkdev_readahead,
>  	.writepage	= blkdev_writepage,
>  	.write_begin	= blkdev_write_begin,
>  	.write_end	= blkdev_write_end,
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c885cf7d724b..2875c0a705b5 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -877,11 +877,9 @@ static int ext2_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, ext2_get_block);
>  }
>  
> -static int
> -ext2_readpages(struct file *file, struct address_space *mapping,
> -		struct list_head *pages, unsigned nr_pages)
> +static void ext2_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
> +	mpage_readahead(rac, ext2_get_block);
>  }
>  
>  static int
> @@ -967,7 +965,7 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
>  
>  const struct address_space_operations ext2_aops = {
>  	.readpage		= ext2_readpage,
> -	.readpages		= ext2_readpages,
> +	.readahead		= ext2_readahead,
>  	.writepage		= ext2_writepage,
>  	.write_begin		= ext2_write_begin,
>  	.write_end		= ext2_write_end,
> @@ -981,7 +979,7 @@ const struct address_space_operations ext2_aops = {
>  
>  const struct address_space_operations ext2_nobh_aops = {
>  	.readpage		= ext2_readpage,
> -	.readpages		= ext2_readpages,
> +	.readahead		= ext2_readahead,
>  	.writepage		= ext2_nobh_writepage,
>  	.write_begin		= ext2_nobh_write_begin,
>  	.write_end		= nobh_write_end,
> diff --git a/fs/fat/inode.c b/fs/fat/inode.c
> index 594b05ae16c9..3496f5fc3e6d 100644
> --- a/fs/fat/inode.c
> +++ b/fs/fat/inode.c
> @@ -210,10 +210,9 @@ static int fat_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, fat_get_block);
>  }
>  
> -static int fat_readpages(struct file *file, struct address_space *mapping,
> -			 struct list_head *pages, unsigned nr_pages)
> +static void fat_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, fat_get_block);
> +	mpage_readahead(rac, fat_get_block);
>  }
>  
>  static void fat_write_failed(struct address_space *mapping, loff_t to)
> @@ -344,7 +343,7 @@ int fat_block_truncate_page(struct inode *inode, loff_t from)
>  
>  static const struct address_space_operations fat_aops = {
>  	.readpage	= fat_readpage,
> -	.readpages	= fat_readpages,
> +	.readahead	= fat_readahead,
>  	.writepage	= fat_writepage,
>  	.writepages	= fat_writepages,
>  	.write_begin	= fat_write_begin,
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index ba83b49ce18c..5e63c13c12c1 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -577,7 +577,7 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
>  }
>  
>  /**
> - * gfs2_readpages - Read a bunch of pages at once
> + * gfs2_readahead - Read a bunch of pages at once
>   * @file: The file to read from
>   * @mapping: Address space info
>   * @pages: List of pages to read
> @@ -590,31 +590,24 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
>   *    obviously not something we'd want to do on too regular a basis.
>   *    Any I/O we ignore at this time will be done via readpage later.
>   * 2. We don't handle stuffed files here we let readpage do the honours.
> - * 3. mpage_readpages() does most of the heavy lifting in the common case.
> + * 3. mpage_readahead() does most of the heavy lifting in the common case.
>   * 4. gfs2_block_map() is relied upon to set BH_Boundary in the right places.
>   */
>  
> -static int gfs2_readpages(struct file *file, struct address_space *mapping,
> -			  struct list_head *pages, unsigned nr_pages)
> +static void gfs2_readahead(struct readahead_control *rac)
>  {
> -	struct inode *inode = mapping->host;
> +	struct inode *inode = rac->mapping->host;
>  	struct gfs2_inode *ip = GFS2_I(inode);
> -	struct gfs2_sbd *sdp = GFS2_SB(inode);
>  	struct gfs2_holder gh;
> -	int ret;
>  
>  	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
> -	ret = gfs2_glock_nq(&gh);
> -	if (unlikely(ret))
> +	if (gfs2_glock_nq(&gh))
>  		goto out_uninit;
>  	if (!gfs2_is_stuffed(ip))
> -		ret = mpage_readpages(mapping, pages, nr_pages, gfs2_block_map);
> +		mpage_readahead(rac, gfs2_block_map);
>  	gfs2_glock_dq(&gh);
>  out_uninit:
>  	gfs2_holder_uninit(&gh);
> -	if (unlikely(gfs2_withdrawn(sdp)))
> -		ret = -EIO;
> -	return ret;
>  }
>  
>  /**
> @@ -828,7 +821,7 @@ static const struct address_space_operations gfs2_aops = {
>  	.writepage = gfs2_writepage,
>  	.writepages = gfs2_writepages,
>  	.readpage = gfs2_readpage,
> -	.readpages = gfs2_readpages,
> +	.readahead = gfs2_readahead,
>  	.bmap = gfs2_bmap,
>  	.invalidatepage = gfs2_invalidatepage,
>  	.releasepage = gfs2_releasepage,
> @@ -842,7 +835,7 @@ static const struct address_space_operations gfs2_jdata_aops = {
>  	.writepage = gfs2_jdata_writepage,
>  	.writepages = gfs2_jdata_writepages,
>  	.readpage = gfs2_readpage,
> -	.readpages = gfs2_readpages,
> +	.readahead = gfs2_readahead,
>  	.set_page_dirty = jdata_set_page_dirty,
>  	.bmap = gfs2_bmap,
>  	.invalidatepage = gfs2_invalidatepage,
> diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
> index b36abf9cb345..2de0d3492d15 100644
> --- a/fs/hpfs/file.c
> +++ b/fs/hpfs/file.c
> @@ -125,10 +125,9 @@ static int hpfs_writepage(struct page *page, struct writeback_control *wbc)
>  	return block_write_full_page(page, hpfs_get_block, wbc);
>  }
>  
> -static int hpfs_readpages(struct file *file, struct address_space *mapping,
> -			  struct list_head *pages, unsigned nr_pages)
> +static void hpfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, hpfs_get_block);
> +	mpage_readahead(rac, hpfs_get_block);
>  }
>  
>  static int hpfs_writepages(struct address_space *mapping,
> @@ -198,7 +197,7 @@ static int hpfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  const struct address_space_operations hpfs_aops = {
>  	.readpage = hpfs_readpage,
>  	.writepage = hpfs_writepage,
> -	.readpages = hpfs_readpages,
> +	.readahead = hpfs_readahead,
>  	.writepages = hpfs_writepages,
>  	.write_begin = hpfs_write_begin,
>  	.write_end = hpfs_write_end,
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7c84c4c027c4..cb3511eb152a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -359,7 +359,7 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
>  	}
>  
>  	/*
> -	 * Just like mpage_readpages and block_read_full_page we always
> +	 * Just like mpage_readahead and block_read_full_page we always
>  	 * return 0 and just mark the page as PageError on errors.  This
>  	 * should be cleaned up all through the stack eventually.
>  	 */
> diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
> index 62c0462dc89f..95b1f377ad09 100644
> --- a/fs/isofs/inode.c
> +++ b/fs/isofs/inode.c
> @@ -1185,10 +1185,9 @@ static int isofs_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, isofs_get_block);
>  }
>  
> -static int isofs_readpages(struct file *file, struct address_space *mapping,
> -			struct list_head *pages, unsigned nr_pages)
> +static void isofs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, isofs_get_block);
> +	mpage_readahead(rac, isofs_get_block);
>  }
>  
>  static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
> @@ -1198,7 +1197,7 @@ static sector_t _isofs_bmap(struct address_space *mapping, sector_t block)
>  
>  static const struct address_space_operations isofs_aops = {
>  	.readpage = isofs_readpage,
> -	.readpages = isofs_readpages,
> +	.readahead = isofs_readahead,
>  	.bmap = _isofs_bmap
>  };
>  
> diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
> index 9486afcdac76..6f65bfa9f18d 100644
> --- a/fs/jfs/inode.c
> +++ b/fs/jfs/inode.c
> @@ -296,10 +296,9 @@ static int jfs_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, jfs_get_block);
>  }
>  
> -static int jfs_readpages(struct file *file, struct address_space *mapping,
> -		struct list_head *pages, unsigned nr_pages)
> +static void jfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, jfs_get_block);
> +	mpage_readahead(rac, jfs_get_block);
>  }
>  
>  static void jfs_write_failed(struct address_space *mapping, loff_t to)
> @@ -358,7 +357,7 @@ static ssize_t jfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  
>  const struct address_space_operations jfs_aops = {
>  	.readpage	= jfs_readpage,
> -	.readpages	= jfs_readpages,
> +	.readahead	= jfs_readahead,
>  	.writepage	= jfs_writepage,
>  	.writepages	= jfs_writepages,
>  	.write_begin	= jfs_write_begin,
> diff --git a/fs/mpage.c b/fs/mpage.c
> index ccba3c4c4479..8a09e6002dc2 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -91,7 +91,7 @@ mpage_alloc(struct block_device *bdev,
>  }
>  
>  /*
> - * support function for mpage_readpages.  The fs supplied get_block might
> + * support function for mpage_readahead.  The fs supplied get_block might
>   * return an up to date buffer.  This is used to map that buffer into
>   * the page, which allows readpage to avoid triggering a duplicate call
>   * to get_block.
> @@ -338,13 +338,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>  }
>  
>  /**
> - * mpage_readpages - populate an address space with some pages & start reads against them
> - * @mapping: the address_space
> - * @pages: The address of a list_head which contains the target pages.  These
> - *   pages have their ->index populated and are otherwise uninitialised.
> - *   The page at @pages->prev has the lowest file offset, and reads should be
> - *   issued in @pages->prev to @pages->next order.
> - * @nr_pages: The number of pages at *@pages
> + * mpage_readahead - start reads against pages
> + * @rac: Describes which pages to read.
>   * @get_block: The filesystem's block mapper function.
>   *
>   * This function walks the pages and the blocks within each page, building and
> @@ -381,36 +376,25 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>   *
>   * This all causes the disk requests to be issued in the correct order.
>   */
> -int
> -mpage_readpages(struct address_space *mapping, struct list_head *pages,
> -				unsigned nr_pages, get_block_t get_block)
> +void mpage_readahead(struct readahead_control *rac, get_block_t get_block)
>  {
> +	struct page *page;
>  	struct mpage_readpage_args args = {
>  		.get_block = get_block,
>  		.is_readahead = true,
>  	};
> -	unsigned page_idx;
> -
> -	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> -		struct page *page = lru_to_page(pages);
>  
> +	readahead_for_each(rac, page) {
>  		prefetchw(&page->flags);
> -		list_del(&page->lru);
> -		if (!add_to_page_cache_lru(page, mapping,
> -					page->index,
> -					readahead_gfp_mask(mapping))) {
> -			args.page = page;
> -			args.nr_pages = nr_pages - page_idx;
> -			args.bio = do_mpage_readpage(&args);
> -		}
> +		args.page = page;
> +		args.nr_pages = readahead_count(rac);
> +		args.bio = do_mpage_readpage(&args);
>  		put_page(page);
>  	}
> -	BUG_ON(!list_empty(pages));
>  	if (args.bio)
>  		mpage_bio_submit(REQ_OP_READ, REQ_RAHEAD, args.bio);
> -	return 0;
>  }
> -EXPORT_SYMBOL(mpage_readpages);
> +EXPORT_SYMBOL(mpage_readahead);
>  
>  /*
>   * This isn't called much at all
> @@ -563,7 +547,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>  		 * Page has buffers, but they are all unmapped. The page was
>  		 * created by pagein or read over a hole which was handled by
>  		 * block_read_full_page().  If this address_space is also
> -		 * using mpage_readpages then this can rarely happen.
> +		 * using mpage_readahead then this can rarely happen.
>  		 */
>  		goto confused;
>  	}
> diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
> index 671085512e0f..ceeb3b441844 100644
> --- a/fs/nilfs2/inode.c
> +++ b/fs/nilfs2/inode.c
> @@ -145,18 +145,9 @@ static int nilfs_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, nilfs_get_block);
>  }
>  
> -/**
> - * nilfs_readpages() - implement readpages() method of nilfs_aops {}
> - * address_space_operations.
> - * @file - file struct of the file to be read
> - * @mapping - address_space struct used for reading multiple pages
> - * @pages - the pages to be read
> - * @nr_pages - number of pages to be read
> - */
> -static int nilfs_readpages(struct file *file, struct address_space *mapping,
> -			   struct list_head *pages, unsigned int nr_pages)
> +static void nilfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, nilfs_get_block);
> +	mpage_readahead(rac, nilfs_get_block);
>  }
>  
>  static int nilfs_writepages(struct address_space *mapping,
> @@ -308,7 +299,7 @@ const struct address_space_operations nilfs_aops = {
>  	.readpage		= nilfs_readpage,
>  	.writepages		= nilfs_writepages,
>  	.set_page_dirty		= nilfs_set_page_dirty,
> -	.readpages		= nilfs_readpages,
> +	.readahead		= nilfs_readahead,
>  	.write_begin		= nilfs_write_begin,
>  	.write_end		= nilfs_write_end,
>  	/* .releasepage		= nilfs_releasepage, */
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 3a67a6518ddf..e8137efaafec 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -350,14 +350,11 @@ static int ocfs2_readpage(struct file *file, struct page *page)
>   * grow out to a tree. If need be, detecting boundary extents could
>   * trivially be added in a future version of ocfs2_get_block().
>   */
> -static int ocfs2_readpages(struct file *filp, struct address_space *mapping,
> -			   struct list_head *pages, unsigned nr_pages)
> +static void ocfs2_readahead(struct readahead_control *rac)
>  {
> -	int ret, err = -EIO;
> -	struct inode *inode = mapping->host;
> +	int ret;
> +	struct inode *inode = rac->mapping->host;
>  	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> -	loff_t start;
> -	struct page *last;
>  
>  	/*
>  	 * Use the nonblocking flag for the dlm code to avoid page
> @@ -365,36 +362,31 @@ static int ocfs2_readpages(struct file *filp, struct address_space *mapping,
>  	 */
>  	ret = ocfs2_inode_lock_full(inode, NULL, 0, OCFS2_LOCK_NONBLOCK);
>  	if (ret)
> -		return err;
> +		return;
>  
> -	if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
> -		ocfs2_inode_unlock(inode, 0);
> -		return err;
> -	}
> +	if (down_read_trylock(&oi->ip_alloc_sem) == 0)
> +		goto out_unlock;
>  
>  	/*
>  	 * Don't bother with inline-data. There isn't anything
>  	 * to read-ahead in that case anyway...
>  	 */
>  	if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL)
> -		goto out_unlock;
> +		goto out_up;
>  
>  	/*
>  	 * Check whether a remote node truncated this file - we just
>  	 * drop out in that case as it's not worth handling here.
>  	 */
> -	last = lru_to_page(pages);
> -	start = (loff_t)last->index << PAGE_SHIFT;
> -	if (start >= i_size_read(inode))
> -		goto out_unlock;
> +	if (readahead_offset(rac) >= i_size_read(inode))
> +		goto out_up;
>  
> -	err = mpage_readpages(mapping, pages, nr_pages, ocfs2_get_block);
> +	mpage_readahead(rac, ocfs2_get_block);
>  
> -out_unlock:
> +out_up:
>  	up_read(&oi->ip_alloc_sem);
> +out_unlock:
>  	ocfs2_inode_unlock(inode, 0);
> -
> -	return err;
>  }
>  
>  /* Note: Because we don't support holes, our allocation has
> @@ -2474,7 +2466,7 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>  
>  const struct address_space_operations ocfs2_aops = {
>  	.readpage		= ocfs2_readpage,
> -	.readpages		= ocfs2_readpages,
> +	.readahead		= ocfs2_readahead,
>  	.writepage		= ocfs2_writepage,
>  	.write_begin		= ocfs2_write_begin,
>  	.write_end		= ocfs2_write_end,
> diff --git a/fs/omfs/file.c b/fs/omfs/file.c
> index d640b9388238..d7b5f09d298c 100644
> --- a/fs/omfs/file.c
> +++ b/fs/omfs/file.c
> @@ -289,10 +289,9 @@ static int omfs_readpage(struct file *file, struct page *page)
>  	return block_read_full_page(page, omfs_get_block);
>  }
>  
> -static int omfs_readpages(struct file *file, struct address_space *mapping,
> -		struct list_head *pages, unsigned nr_pages)
> +static void omfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, omfs_get_block);
> +	mpage_readahead(rac, omfs_get_block);
>  }
>  
>  static int omfs_writepage(struct page *page, struct writeback_control *wbc)
> @@ -373,7 +372,7 @@ const struct inode_operations omfs_file_inops = {
>  
>  const struct address_space_operations omfs_aops = {
>  	.readpage = omfs_readpage,
> -	.readpages = omfs_readpages,
> +	.readahead = omfs_readahead,
>  	.writepage = omfs_writepage,
>  	.writepages = omfs_writepages,
>  	.write_begin = omfs_write_begin,
> diff --git a/fs/qnx6/inode.c b/fs/qnx6/inode.c
> index 345db56c98fd..755293c8c71a 100644
> --- a/fs/qnx6/inode.c
> +++ b/fs/qnx6/inode.c
> @@ -99,10 +99,9 @@ static int qnx6_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, qnx6_get_block);
>  }
>  
> -static int qnx6_readpages(struct file *file, struct address_space *mapping,
> -		   struct list_head *pages, unsigned nr_pages)
> +static void qnx6_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, qnx6_get_block);
> +	mpage_readahead(rac, qnx6_get_block);
>  }
>  
>  /*
> @@ -499,7 +498,7 @@ static sector_t qnx6_bmap(struct address_space *mapping, sector_t block)
>  }
>  static const struct address_space_operations qnx6_aops = {
>  	.readpage	= qnx6_readpage,
> -	.readpages	= qnx6_readpages,
> +	.readahead	= qnx6_readahead,
>  	.bmap		= qnx6_bmap
>  };
>  
> diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
> index 6419e6dacc39..0031070b3692 100644
> --- a/fs/reiserfs/inode.c
> +++ b/fs/reiserfs/inode.c
> @@ -1160,11 +1160,9 @@ int reiserfs_get_block(struct inode *inode, sector_t block,
>  	return retval;
>  }
>  
> -static int
> -reiserfs_readpages(struct file *file, struct address_space *mapping,
> -		   struct list_head *pages, unsigned nr_pages)
> +static void reiserfs_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, reiserfs_get_block);
> +	mpage_readahead(rac, reiserfs_get_block);
>  }
>  
>  /*
> @@ -3434,7 +3432,7 @@ int reiserfs_setattr(struct dentry *dentry, struct iattr *attr)
>  const struct address_space_operations reiserfs_address_space_operations = {
>  	.writepage = reiserfs_writepage,
>  	.readpage = reiserfs_readpage,
> -	.readpages = reiserfs_readpages,
> +	.readahead = reiserfs_readahead,
>  	.releasepage = reiserfs_releasepage,
>  	.invalidatepage = reiserfs_invalidatepage,
>  	.write_begin = reiserfs_write_begin,
> diff --git a/fs/udf/inode.c b/fs/udf/inode.c
> index e875bc5668ee..adaba8e8b326 100644
> --- a/fs/udf/inode.c
> +++ b/fs/udf/inode.c
> @@ -195,10 +195,9 @@ static int udf_readpage(struct file *file, struct page *page)
>  	return mpage_readpage(page, udf_get_block);
>  }
>  
> -static int udf_readpages(struct file *file, struct address_space *mapping,
> -			struct list_head *pages, unsigned nr_pages)
> +static void udf_readahead(struct readahead_control *rac)
>  {
> -	return mpage_readpages(mapping, pages, nr_pages, udf_get_block);
> +	mpage_readahead(rac, udf_get_block);
>  }
>  
>  static int udf_write_begin(struct file *file, struct address_space *mapping,
> @@ -234,7 +233,7 @@ static sector_t udf_bmap(struct address_space *mapping, sector_t block)
>  
>  const struct address_space_operations udf_aops = {
>  	.readpage	= udf_readpage,
> -	.readpages	= udf_readpages,
> +	.readahead	= udf_readahead,
>  	.writepage	= udf_writepage,
>  	.writepages	= udf_writepages,
>  	.write_begin	= udf_write_begin,
> diff --git a/include/linux/mpage.h b/include/linux/mpage.h
> index 001f1fcf9836..f4f5e90a6844 100644
> --- a/include/linux/mpage.h
> +++ b/include/linux/mpage.h
> @@ -13,9 +13,9 @@
>  #ifdef CONFIG_BLOCK
>  
>  struct writeback_control;
> +struct readahead_control;
>  
> -int mpage_readpages(struct address_space *mapping, struct list_head *pages,
> -				unsigned nr_pages, get_block_t get_block);
> +void mpage_readahead(struct readahead_control *, get_block_t get_block);
>  int mpage_readpage(struct page *page, get_block_t get_block);
>  int mpage_writepages(struct address_space *mapping,
>  		struct writeback_control *wbc, get_block_t get_block);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b1092876e537..a32122095702 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1020,7 +1020,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>  		 * to the LRU. Later, when the IO completes the pages are
>  		 * marked uptodate and unlocked. However, the queueing
>  		 * could be merging multiple pages for one bio (e.g.
> -		 * mpage_readpages). If an allocation happens for the
> +		 * mpage_readahead). If an allocation happens for the
>  		 * second or third page, the process can end up locking
>  		 * the same page twice and deadlocking. Rather than
>  		 * trying to be clever about what pages can be locked,
> 


* Re: [PATCH v6 12/19] erofs: Convert uncompressed files from readpages to readahead
  2020-02-17 18:46 ` [PATCH v6 12/19] erofs: Convert uncompressed " Matthew Wilcox
  2020-02-19  2:39   ` Gao Xiang
@ 2020-02-19  3:04   ` Dave Chinner
  1 sibling, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:46:01AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use the new readahead operation in erofs
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/erofs/data.c              | 39 +++++++++++++-----------------------
>  fs/erofs/zdata.c             |  2 +-
>  include/trace/events/erofs.h |  6 +++---
>  3 files changed, 18 insertions(+), 29 deletions(-)

Looks fine from the perspective of page iteration and error handling.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 13/19] erofs: Convert compressed files from readpages to readahead
  2020-02-17 18:46 ` [PATCH v6 13/19] erofs: Convert compressed files " Matthew Wilcox
@ 2020-02-19  3:08   ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:46:03AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use the new readahead operation in erofs.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/erofs/zdata.c | 29 +++++++++--------------------
>  1 file changed, 9 insertions(+), 20 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 08/19] mm: Add readahead address space operation
  2020-02-17 18:45 ` [PATCH v6 08/19] mm: Add readahead address space operation Matthew Wilcox
  2020-02-18  6:21   ` Dave Chinner
  2020-02-19  0:12   ` John Hubbard
@ 2020-02-19  3:10   ` Eric Biggers
  2020-02-19  3:35     ` Eric Biggers
  2020-02-19 16:52     ` Matthew Wilcox
  2 siblings, 2 replies; 111+ messages in thread
From: Eric Biggers @ 2020-02-19  3:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cluster-devel, linux-kernel, linux-f2fs-devel, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, linux-ext4, linux-erofs,
	ocfs2-devel

On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote:
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..81ab30fbe45c 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are defined:
>  		int (*readpage)(struct file *, struct page *);
>  		int (*writepages)(struct address_space *, struct writeback_control *);
>  		int (*set_page_dirty)(struct page *page);
> +		void (*readahead)(struct readahead_control *);
>  		int (*readpages)(struct file *filp, struct address_space *mapping,
>  				 struct list_head *pages, unsigned nr_pages);
>  		int (*write_begin)(struct file *, struct address_space *mapping,
> @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are defined:
>  	If defined, it should set the PageDirty flag, and the
>  	PAGECACHE_TAG_DIRTY tag in the radix tree.
>  
> +``readahead``
> +	Called by the VM to read pages associated with the address_space
> +	object.  The pages are consecutive in the page cache and are
> +	locked.  The implementation should decrement the page refcount
> +	after starting I/O on each page.  Usually the page will be
> +	unlocked by the I/O completion handler.  If the function does
> +	not attempt I/O on some pages, the caller will decrement the page
> +	refcount and unlock the pages for you.	Set PageUptodate if the
> +	I/O completes successfully.  Setting PageError on any page will
> +	be ignored; simply unlock the page if an I/O error occurs.
> +

This is unclear about how "not attempting I/O" works and how that affects who is
responsible for putting and unlocking the pages.  How does the caller know which
pages were not attempted?  Can any arbitrary subset of pages be not attempted,
or just the last N pages?

In the code, the caller actually uses readahead_for_each() to iterate through
and put+unlock the pages.  That implies that ->readahead() must also use
readahead_for_each(), in order for the iterator to be advanced
correctly... Right?  And the ownership of each page is transferred to the callee
when the callee advances the iterator past that page.

I don't see how ext4_readahead() and f2fs_readahead() can work at all, given
that they don't advance the iterator.
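
To make that contract concrete, a minimal sketch of an implementation
(hypothetical example_readahead()/example_start_io() names, using the
iterator as posted in this series):

	static void example_readahead(struct readahead_control *rac)
	{
		struct page *page;

		readahead_for_each(rac, page) {
			example_start_io(page);	/* completion unlocks the page */
			put_page(page);		/* drop the submission reference */
		}
		/* pages we never iterated to are unlocked/put by the caller */
	}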

- Eric


* Re: [PATCH v6 14/19] ext4: Convert from readpages to readahead
  2020-02-17 18:46 ` [PATCH v6 14/19] ext4: " Matthew Wilcox
@ 2020-02-19  3:16   ` Dave Chinner
  2020-02-19  3:29   ` Eric Biggers
  1 sibling, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:46:05AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use the new readahead operation in ext4
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

There's nothing I can see in this that would cause that list
corruption I saw with ext4.

I'll re-introduce the patch and see if it falls over again.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
  2020-02-17 18:46 ` [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor Matthew Wilcox
@ 2020-02-19  3:17   ` John Hubbard
  2020-02-19  5:35     ` Matthew Wilcox
  2020-02-19  3:29   ` Dave Chinner
  1 sibling, 1 reply; 111+ messages in thread
From: John Hubbard @ 2020-02-19  3:17 UTC (permalink / raw)
  To: Matthew Wilcox, linux-fsdevel
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-ext4, linux-erofs, linux-btrfs

On 2/17/20 10:46 AM, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> By putting the 'have we reached the end of the page' condition at the end
> of the loop instead of the beginning, we can remove the 'submit the last
> page' code from iomap_readpages().  Also check that iomap_readpage_actor()
> didn't return 0, which would lead to an endless loop.


Also added a new WARN_ON() and BUG(), although I'm wondering about the BUG
below...


> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 25 ++++++++++++-------------
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cb3511eb152a..44303f370b2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_readpage_ctx *ctx = data;
> -	loff_t done, ret;
> +	loff_t ret, done = 0;
>  
> -	for (done = 0; done < length; done += ret) {


nit: this "for" loop was perfect just the way it was. :) I'd vote here for reverting
the change to a "while" loop. Because with this change, now the code has to 
separately initialize "done", separately increment "done", and the beauty of a
for loop is that the loop init and control is all clearly in one place. For things
that follow that model (as in this case!), that's a Good Thing.

And I don't see any technical reason (even in the following patch) that requires 
this change.


> -		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> -			if (!ctx->cur_page_in_bio)
> -				unlock_page(ctx->cur_page);
> -			put_page(ctx->cur_page);
> -			ctx->cur_page = NULL;
> -		}
> +	while (done < length) {
>  		if (!ctx->cur_page) {
>  			ctx->cur_page = iomap_next_page(inode, ctx->pages,
>  					pos, length, &done);
> @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  		}
>  		ret = iomap_readpage_actor(inode, pos + done, length - done,
>  				ctx, iomap, srcmap);
> +		if (WARN_ON(ret == 0))
> +			break;
> +		done += ret;
> +		if (offset_in_page(pos + done) == 0) {
> +			if (!ctx->cur_page_in_bio)
> +				unlock_page(ctx->cur_page);
> +			put_page(ctx->cur_page);
> +			ctx->cur_page = NULL;
> +		}
>  	}
>  
>  	return done;
> @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
>  done:
>  	if (ctx.bio)
>  		submit_bio(ctx.bio);
> -	if (ctx.cur_page) {
> -		if (!ctx.cur_page_in_bio)
> -			unlock_page(ctx.cur_page);
> -		put_page(ctx.cur_page);
> -	}
> +	BUG_ON(ctx.cur_page);


Is a full BUG_ON() definitely called for here? Seems like a WARN might suffice...


>  
>  	/*
>  	 * Check that we didn't lose a page due to the arcane calling
> 



thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v6 16/19] fuse: Convert from readpages to readahead
  2020-02-17 18:46 ` [PATCH v6 16/19] fuse: " Matthew Wilcox
@ 2020-02-19  3:22   ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:46:09AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use the new readahead operation in fuse.  Switching away from the
> read_cache_pages() helper gets rid of an implicit call to put_page(),
> so we can get rid of the get_page() call in fuse_readpages_fill().
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Looks OK.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-19  1:02     ` Matthew Wilcox
  2020-02-19  1:13       ` John Hubbard
@ 2020-02-19  3:24       ` John Hubbard
  1 sibling, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-19  3:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On 2/18/20 5:02 PM, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:01:43PM -0800, John Hubbard wrote:
>> How about this instead? It uses the "for" loop fully and more naturally,
>> and is easier to read. And it does the same thing:
>>
>> static inline struct page *readahead_page(struct readahead_control *rac)
>> {
>> 	struct page *page;
>>
>> 	if (!rac->_nr_pages)
>> 		return NULL;
>>
>> 	page = xa_load(&rac->mapping->i_pages, rac->_start);
>> 	VM_BUG_ON_PAGE(!PageLocked(page), page);
>> 	rac->_batch_count = hpage_nr_pages(page);
>>
>> 	return page;
>> }
>>
>> static inline struct page *readahead_next(struct readahead_control *rac)
>> {
>> 	rac->_nr_pages -= rac->_batch_count;
>> 	rac->_start += rac->_batch_count;
>>
>> 	return readahead_page(rac);
>> }
>>
>> #define readahead_for_each(rac, page)			\
>> 	for (page = readahead_page(rac); page != NULL;	\
>> 	     page = readahead_page(rac))
> 
> I'm assuming you mean 'page = readahead_next(rac)' on that second line.
> 
> If you keep reading all the way to the penultimate patch, it won't work
> for iomap ... at least not in the same way.
> 

OK, so after an initial look at patch 18's ("iomap: Convert from readpages to
readahead") use of readahead_page() and readahead_next(), I'm not sure what 
I'm missing. Seems like it would work...?

thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
  2020-02-17 18:45 ` [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
                     ` (2 preceding siblings ...)
  2020-02-19  2:48   ` John Hubbard
@ 2020-02-19  3:28   ` Eric Biggers
  2020-02-19  3:47     ` Matthew Wilcox
  3 siblings, 1 reply; 111+ messages in thread
From: Eric Biggers @ 2020-02-19  3:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cluster-devel, linux-ext4, linux-kernel, linux-f2fs-devel,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, Junxiao Bi,
	linux-erofs, ocfs2-devel

On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote:
> diff --git a/include/linux/mpage.h b/include/linux/mpage.h
> index 001f1fcf9836..f4f5e90a6844 100644
> --- a/include/linux/mpage.h
> +++ b/include/linux/mpage.h
> @@ -13,9 +13,9 @@
>  #ifdef CONFIG_BLOCK
>  
>  struct writeback_control;
> +struct readahead_control;
>  
> -int mpage_readpages(struct address_space *mapping, struct list_head *pages,
> -				unsigned nr_pages, get_block_t get_block);
> +void mpage_readahead(struct readahead_control *, get_block_t get_block);
>  int mpage_readpage(struct page *page, get_block_t get_block);
>  int mpage_writepages(struct address_space *mapping,
>  		struct writeback_control *wbc, get_block_t get_block);

Can you name the 'struct readahead_control *' parameter?

checkpatch.pl should warn about this.

- Eric


* Re: [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
  2020-02-17 18:46 ` [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor Matthew Wilcox
  2020-02-19  3:17   ` John Hubbard
@ 2020-02-19  3:29   ` Dave Chinner
  2020-02-19  6:04     ` Matthew Wilcox
  1 sibling, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> By putting the 'have we reached the end of the page' condition at the end
> of the loop instead of the beginning, we can remove the 'submit the last
> page' code from iomap_readpages().  Also check that iomap_readpage_actor()
> didn't return 0, which would lead to an endless loop.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 25 ++++++++++++-------------
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cb3511eb152a..44303f370b2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  		void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>  	struct iomap_readpage_ctx *ctx = data;
> -	loff_t done, ret;
> +	loff_t ret, done = 0;
>  
> -	for (done = 0; done < length; done += ret) {
> -		if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> -			if (!ctx->cur_page_in_bio)
> -				unlock_page(ctx->cur_page);
> -			put_page(ctx->cur_page);
> -			ctx->cur_page = NULL;
> -		}
> +	while (done < length) {
>  		if (!ctx->cur_page) {
>  			ctx->cur_page = iomap_next_page(inode, ctx->pages,
>  					pos, length, &done);
> @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  		}
>  		ret = iomap_readpage_actor(inode, pos + done, length - done,
>  				ctx, iomap, srcmap);
> +		if (WARN_ON(ret == 0))
> +			break;

This error case now leaks ctx->cur_page....

> +		done += ret;
> +		if (offset_in_page(pos + done) == 0) {
> +			if (!ctx->cur_page_in_bio)
> +				unlock_page(ctx->cur_page);
> +			put_page(ctx->cur_page);
> +			ctx->cur_page = NULL;
> +		}
>  	}
>  
>  	return done;
> @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
>  done:
>  	if (ctx.bio)
>  		submit_bio(ctx.bio);
> -	if (ctx.cur_page) {
> -		if (!ctx.cur_page_in_bio)
> -			unlock_page(ctx.cur_page);
> -		put_page(ctx.cur_page);
> -	}
> +	BUG_ON(ctx.cur_page);

And so will now trigger both a warn and a bug....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 14/19] ext4: Convert from readpages to readahead
  2020-02-17 18:46 ` [PATCH v6 14/19] ext4: " Matthew Wilcox
  2020-02-19  3:16   ` Dave Chinner
@ 2020-02-19  3:29   ` Eric Biggers
  1 sibling, 0 replies; 111+ messages in thread
From: Eric Biggers @ 2020-02-19  3:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cluster-devel, linux-kernel, linux-f2fs-devel, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, linux-ext4, linux-erofs,
	ocfs2-devel

On Mon, Feb 17, 2020 at 10:46:05AM -0800, Matthew Wilcox wrote:
> diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
> index c1769afbf799..e14841ade612 100644
> --- a/fs/ext4/readpage.c
> +++ b/fs/ext4/readpage.c
> @@ -7,8 +7,8 @@
>   *
>   * This was originally taken from fs/mpage.c
>   *
> - * The intent is the ext4_mpage_readpages() function here is intended
> - * to replace mpage_readpages() in the general case, not just for
> + * The ext4_mpage_readahead() function here is intended to
> + * replace mpage_readahead() in the general case, not just for
>   * encrypted files.  It has some limitations (see below), where it
>   * will fall back to read_block_full_page(), but these limitations
>   * should only be hit when page_size != block_size.
> @@ -222,8 +222,7 @@ static inline loff_t ext4_readpage_limit(struct inode *inode)
>  }

This says ext4_mpage_readahead(), but it's actually still called
ext4_mpage_readpages().

- Eric


* Re: [PATCH v6 08/19] mm: Add readahead address space operation
  2020-02-19  3:10   ` Eric Biggers
@ 2020-02-19  3:35     ` Eric Biggers
  2020-02-19 16:52     ` Matthew Wilcox
  1 sibling, 0 replies; 111+ messages in thread
From: Eric Biggers @ 2020-02-19  3:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 07:10:44PM -0800, Eric Biggers wrote:
> On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote:
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 7d4d09dd5e6d..81ab30fbe45c 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -706,6 +706,7 @@ cache in your filesystem.  The following members are defined:
> >  		int (*readpage)(struct file *, struct page *);
> >  		int (*writepages)(struct address_space *, struct writeback_control *);
> >  		int (*set_page_dirty)(struct page *page);
> > +		void (*readahead)(struct readahead_control *);
> >  		int (*readpages)(struct file *filp, struct address_space *mapping,
> >  				 struct list_head *pages, unsigned nr_pages);
> >  		int (*write_begin)(struct file *, struct address_space *mapping,
> > @@ -781,12 +782,24 @@ cache in your filesystem.  The following members are defined:
> >  	If defined, it should set the PageDirty flag, and the
> >  	PAGECACHE_TAG_DIRTY tag in the radix tree.
> >  
> > +``readahead``
> > +	Called by the VM to read pages associated with the address_space
> > +	object.  The pages are consecutive in the page cache and are
> > +	locked.  The implementation should decrement the page refcount
> > +	after starting I/O on each page.  Usually the page will be
> > +	unlocked by the I/O completion handler.  If the function does
> > +	not attempt I/O on some pages, the caller will decrement the page
> > +	refcount and unlock the pages for you.	Set PageUptodate if the
> > +	I/O completes successfully.  Setting PageError on any page will
> > +	be ignored; simply unlock the page if an I/O error occurs.
> > +
> 
> This is unclear about how "not attempting I/O" works and how that affects who is
> responsible for putting and unlocking the pages.  How does the caller know which
> pages were not attempted?  Can any arbitrary subset of pages be not attempted,
> or just the last N pages?
> 
> In the code, the caller actually uses readahead_for_each() to iterate through
> and put+unlock the pages.  That implies that ->readahead() must also use
> readahead_for_each() as well, in order for the iterator to be advanced
> correctly... Right?  And the ownership of each page is transferred to the callee
> when the callee advances the iterator past that page.
> 
> I don't see how ext4_readahead() and f2fs_readahead() can work at all, given
> that they don't advance the iterator.
> 

Yep, this patchset immediately crashes on boot with:

BUG: Bad page state in process swapper/0  pfn:02176
page:ffffea00000751d0 refcount:0 mapcount:0 mapping:ffff88807cba0400 index:0x0
ext4_da_aops name:"systemd"
flags: 0x100000000020001(locked|mappedtodisk)
raw: 0100000000020001 dead000000000100 dead000000000122 ffff88807cba0400
raw: 0000000000000000 0000000000000000 00000000ffffffff
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags: 0x1(locked)
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2-00019-g7203ed9018cb9 #18
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x7a/0xaa lib/dump_stack.c:118
 bad_page.cold+0x89/0xba mm/page_alloc.c:649
 free_pages_check_bad+0x5d/0x60 mm/page_alloc.c:1050
 free_pages_check mm/page_alloc.c:1059 [inline]
 free_pages_prepare mm/page_alloc.c:1157 [inline]
 free_pcp_prepare+0x1c1/0x200 mm/page_alloc.c:1198
 free_unref_page_prepare mm/page_alloc.c:3011 [inline]
 free_unref_page+0x16/0x70 mm/page_alloc.c:3060
 __put_single_page mm/swap.c:81 [inline]
 __put_page+0x31/0x40 mm/swap.c:115
 put_page include/linux/mm.h:1029 [inline]
 ext4_mpage_readpages+0x778/0x9b0 fs/ext4/readpage.c:405
 ext4_readahead+0x2f/0x40 fs/ext4/inode.c:3242
 read_pages+0x4c/0x200 mm/readahead.c:126
 page_cache_readahead_limit+0x224/0x250 mm/readahead.c:241
 __do_page_cache_readahead mm/readahead.c:266 [inline]
 ra_submit mm/internal.h:62 [inline]
 ondemand_readahead+0x1df/0x4d0 mm/readahead.c:544
 page_cache_sync_readahead+0x2d/0x40 mm/readahead.c:579
 generic_file_buffered_read+0x77e/0xa90 mm/filemap.c:2029
 generic_file_read_iter+0xd4/0x130 mm/filemap.c:2302
 ext4_file_read_iter fs/ext4/file.c:131 [inline]
 ext4_file_read_iter+0x53/0x180 fs/ext4/file.c:114
 call_read_iter include/linux/fs.h:1897 [inline]
 new_sync_read+0x113/0x1a0 fs/read_write.c:414
 __vfs_read+0x21/0x30 fs/read_write.c:427
 vfs_read+0xcb/0x160 fs/read_write.c:461
 kernel_read+0x2c/0x40 fs/read_write.c:440
 prepare_binprm+0x14f/0x190 fs/exec.c:1589
 __do_execve_file.isra.0+0x4c0/0x800 fs/exec.c:1806
 do_execveat_common fs/exec.c:1871 [inline]
 do_execve+0x20/0x30 fs/exec.c:1888
 run_init_process+0xc8/0xcd init/main.c:1279
 try_to_run_init_process+0x10/0x36 init/main.c:1288
 kernel_init+0xac/0xfd init/main.c:1385
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Disabling lock debugging due to kernel taint
page:ffffea00000751d0 refcount:0 mapcount:0 mapping:ffff88807cba0400 index:0x0
ext4_da_aops name:"systemd"
flags: 0x100000000020001(locked|mappedtodisk)
raw: 0100000000020001 dead000000000100 dead000000000122 ffff88807cba0400
raw: 0000000000000000 0000000000000000 00000000ffffffff
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)


I had to add:

diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index e14841ade6122..cb982088b5225 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -401,8 +401,10 @@ int ext4_mpage_readpages(struct address_space *mapping,
 		else
 			unlock_page(page);
 	next_page:
-		if (rac)
+		if (rac) {
 			put_page(page);
+			readahead_next(rac);
+		}
 	}
 	if (bio)
 		submit_bio(bio);
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 87964e4cb6b81..e16b0fe42e2e5 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2238,8 +2238,10 @@ int f2fs_mpage_readpages(struct inode *inode, struct readahead_control *rac,
 #ifdef CONFIG_F2FS_FS_COMPRESSION
 next_page:
 #endif
-		if (rac)
+		if (rac) {
 			put_page(page);
+			readahead_next(rac);
+		}
 
 #ifdef CONFIG_F2FS_FS_COMPRESSION
 		if (f2fs_compressed_file(inode)) {




* Re: [PATCH v6 18/19] iomap: Convert from readpages to readahead
  2020-02-17 18:46 ` [PATCH v6 18/19] iomap: Convert from readpages to readahead Matthew Wilcox
@ 2020-02-19  3:40   ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Mon, Feb 17, 2020 at 10:46:12AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Use the new readahead operation in iomap.  Convert XFS and ZoneFS to
> use it.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/iomap/buffered-io.c | 91 +++++++++++++++---------------------------
>  fs/iomap/trace.h       |  2 +-
>  fs/xfs/xfs_aops.c      | 13 +++---
>  fs/zonefs/super.c      |  7 ++--
>  include/linux/iomap.h  |  3 +-
>  5 files changed, 42 insertions(+), 74 deletions(-)

All pretty straight forward...

> @@ -416,6 +384,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  			break;
>  		done += ret;
>  		if (offset_in_page(pos + done) == 0) {
> +			readahead_next(ctx->rac);
>  			if (!ctx->cur_page_in_bio)
>  				unlock_page(ctx->cur_page);
>  			put_page(ctx->cur_page);

Though now I look at the addition here, this might be better
restructured to mention how we handle partial page submission such as:

		done += ret;

		/*
		 * Keep working on a partially complete page, otherwise ready
		 * the ctx for the next page to be acted on.
		 */
		if (offset_in_page(pos + done))
			continue;

		if (!ctx->cur_page_in_bio)
			unlock_page(ctx->cur_page);
		put_page(ctx->cur_page);
		ctx->cur_page = NULL;
		readahead_next(ctx->rac);
	}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path
  2020-02-17 18:46 ` [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path Matthew Wilcox
@ 2020-02-19  3:43   ` Dave Chinner
  2020-02-19  5:22     ` Matthew Wilcox
  0 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, Michal Hocko, linux-kernel, linux-f2fs-devel,
	cluster-devel, linux-mm, ocfs2-devel, linux-fsdevel, Cong Wang,
	linux-ext4, linux-erofs, linux-btrfs

On Mon, Feb 17, 2020 at 10:46:13AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Ensure that memory allocations in the readahead path do not attempt to
> reclaim file-backed pages, which could lead to a deadlock.  It is
> possible, though unlikely, that this is the root cause of a problem observed
> by Cong Wang.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/readahead.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 94d499cfb657..8f9c0dba24e7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -22,6 +22,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/blk-cgroup.h>
>  #include <linux/fadvise.h>
> +#include <linux/sched/mm.h>
>  
>  #include "internal.h"
>  
> @@ -174,6 +175,18 @@ void page_cache_readahead_limit(struct address_space *mapping,
>  		._nr_pages = 0,
>  	};
>  
> +	/*
> +	 * Partway through the readahead operation, we will have added
> +	 * locked pages to the page cache, but will not yet have submitted
> +	 * them for I/O.  Adding another page may need to allocate memory,
> +	 * which can trigger memory reclaim.  Telling the VM we're in
> +	 * the middle of a filesystem operation will cause it to not
> +	 * touch file-backed pages, preventing a deadlock.  Most (all?)
> +	 * filesystems already specify __GFP_NOFS in their mapping's
> +	 * gfp_mask, but let's be explicit here.
> +	 */
> +	unsigned int nofs = memalloc_nofs_save();
> +

So doesn't this largely remove the need for all the gfp flag futzing
in the readahead path? i.e. almost all readahead allocations are now
going to be GFP_NOFS | GFP_NORETRY | GFP_NOWARN ?

If so, shouldn't we just strip all the gfp flags and masking out of
the readahead path altogether?
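
For reference, the scoped-NOFS pattern this patch relies on looks like
the following in isolation (sketch only; do_readahead_work() is a
hypothetical stand-in):

	unsigned int nofs = memalloc_nofs_save();

	/* allocations in here implicitly behave as if __GFP_NOFS was set */
	do_readahead_work();

	memalloc_nofs_restore(nofs);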

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 00/19] Change readahead API
  2020-02-18 21:26     ` Dave Chinner
@ 2020-02-19  3:45       ` Dave Chinner
  2020-02-19  3:48         ` Matthew Wilcox
  0 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > Latest version in your git tree:
> > > 
> > > $ ▶ glo -n 5 willy/readahead
> > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > $
> > > 
> > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > 
> > > [    2.423116] ------------[ cut here ]------------
> > > [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> > > [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > > [    2.457484] Call Trace:
> > > [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > > [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > > [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> > > [    2.462787]  lru_add_drain+0x18/0x20
> > 
> > Are you sure that was 4be497096c04 ?  I ask because there was a
> 
> Yes, because it's the only version I've actually merged into my
> working tree, compiled and tried to run. :P
> 
> > version pushed to that git tree that did contain a list double-add
> > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > it, and 4be497096c04 doesn't have that problem.  I also test with
> > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > probabilistic because it'll depend on the timing between whatever other
> > list is being used and the page actually being added to the LRU.
> 
> I'll see if I can reproduce it.

Just updated to a current TOT Linus kernel and your latest branch,
and so far this is 100% reproducible.

Not sure how I'm going to debug it yet, because it's init that is
triggering it....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
  2020-02-19  3:28   ` Eric Biggers
@ 2020-02-19  3:47     ` Matthew Wilcox
  2020-02-19  3:55       ` Eric Biggers
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19  3:47 UTC (permalink / raw)
  To: Eric Biggers
  Cc: cluster-devel, linux-ext4, linux-kernel, linux-f2fs-devel,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, Junxiao Bi,
	linux-erofs, ocfs2-devel

On Tue, Feb 18, 2020 at 07:28:26PM -0800, Eric Biggers wrote:
> On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote:
> > diff --git a/include/linux/mpage.h b/include/linux/mpage.h
> > index 001f1fcf9836..f4f5e90a6844 100644
> > --- a/include/linux/mpage.h
> > +++ b/include/linux/mpage.h
> > @@ -13,9 +13,9 @@
> >  #ifdef CONFIG_BLOCK
> >  
> >  struct writeback_control;
> > +struct readahead_control;
> >  
> > -int mpage_readpages(struct address_space *mapping, struct list_head *pages,
> > -				unsigned nr_pages, get_block_t get_block);
> > +void mpage_readahead(struct readahead_control *, get_block_t get_block);
> >  int mpage_readpage(struct page *page, get_block_t get_block);
> >  int mpage_writepages(struct address_space *mapping,
> >  		struct writeback_control *wbc, get_block_t get_block);
> 
> Can you name the 'struct readahead_control *' parameter?

What good would that do?  I'm sick of seeing 'struct page *page'.
Well, no shit it's a page.  Unless there's some actual information to
convey, leave the argument unnamed.  It should be a crime to not name
an unsigned long, but not naming the struct address_space pointer is
entirely reasonable.


* Re: [PATCH v6 00/19] Change readahead API
  2020-02-19  3:45       ` Dave Chinner
@ 2020-02-19  3:48         ` Matthew Wilcox
  2020-02-19  3:57           ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19  3:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote:
> On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > > Latest version in your git tree:
> > > > 
> > > > $ ▶ glo -n 5 willy/readahead
> > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > > $
> > > > 
> > > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > > 
> > > > [    2.423116] ------------[ cut here ]------------
> > > > [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> > > > [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > > > [    2.457484] Call Trace:
> > > > [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > > > [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > > > [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > > [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> > > > [    2.462787]  lru_add_drain+0x18/0x20
> > > 
> > > Are you sure that was 4be497096c04 ?  I ask because there was a
> > 
> > Yes, because it's the only version I've actually merged into my
> > working tree, compiled and tried to run. :P
> > 
> > > version pushed to that git tree that did contain a list double-add
> > > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > > it, and 4be497096c04 doesn't have that problem.  I also test with
> > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > > probabilistic because it'll depend on the timing between whatever other
> > > list is being used and the page actually being added to the LRU.
> > 
> > I'll see if I can reproduce it.
> 
> Just updated to a current TOT Linus kernel and your latest branch,
> > and so far this is 100% reproducible.
> 
> Not sure how I'm going to debug it yet, because it's init that is
> triggering it....

Eric found it ... still not sure why I don't see it.


* Re: [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
  2020-02-19  3:47     ` Matthew Wilcox
@ 2020-02-19  3:55       ` Eric Biggers
  0 siblings, 0 replies; 111+ messages in thread
From: Eric Biggers @ 2020-02-19  3:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: cluster-devel, linux-ext4, linux-kernel, linux-f2fs-devel,
	linux-xfs, linux-mm, linux-btrfs, linux-fsdevel, Junxiao Bi,
	linux-erofs, ocfs2-devel

On Tue, Feb 18, 2020 at 07:47:41PM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 07:28:26PM -0800, Eric Biggers wrote:
> > On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote:
> > > diff --git a/include/linux/mpage.h b/include/linux/mpage.h
> > > index 001f1fcf9836..f4f5e90a6844 100644
> > > --- a/include/linux/mpage.h
> > > +++ b/include/linux/mpage.h
> > > @@ -13,9 +13,9 @@
> > >  #ifdef CONFIG_BLOCK
> > >  
> > >  struct writeback_control;
> > > +struct readahead_control;
> > >  
> > > -int mpage_readpages(struct address_space *mapping, struct list_head *pages,
> > > -				unsigned nr_pages, get_block_t get_block);
> > > +void mpage_readahead(struct readahead_control *, get_block_t get_block);
> > >  int mpage_readpage(struct page *page, get_block_t get_block);
> > >  int mpage_writepages(struct address_space *mapping,
> > >  		struct writeback_control *wbc, get_block_t get_block);
> > 
> > Can you name the 'struct readahead_control *' parameter?
> 
> What good would that do?  I'm sick of seeing 'struct page *page'.
> Well, no shit it's a page.  Unless there's some actual information to
> convey, leave the argument unnamed.  It should be a crime to not name
> an unsigned long, but not naming the struct address_space pointer is
> entirely reasonable.

It's the coding style the community has agreed on, so it's what the tools check for.

I don't care that much myself; it just appeared like this was a mistake rather
than intentional so I thought I'd point it out.

- Eric


* Re: [PATCH v6 00/19] Change readahead API
  2020-02-19  3:48         ` Matthew Wilcox
@ 2020-02-19  3:57           ` Dave Chinner
  0 siblings, 0 replies; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  3:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 07:48:32PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote:
> > On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > > > Latest version in your git tree:
> > > > > 
> > > > > $ ▶ glo -n 5 willy/readahead
> > > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > > > $
> > > > > 
> > > > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > > > 
> > > > > [    2.423116] ------------[ cut here ]------------
> > > > > [    2.424957] list_add double add: new=ffffea000efff4c8, prev=ffff8883bfffee60, next=ffffea000efff4c8.
> > > > > [    2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
> > > > > [    2.457484] Call Trace:
> > > > > [    2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > > > > [    2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > > > > [    2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > > > [    2.461712]  lru_add_drain_cpu+0x8d/0x160
> > > > > [    2.462787]  lru_add_drain+0x18/0x20
> > > > 
> > > > Are you sure that was 4be497096c04 ?  I ask because there was a
> > > 
> > > Yes, because it's the only version I've actually merged into my
> > > working tree, compiled and tried to run. :P
> > > 
> > > > version pushed to that git tree that did contain a list double-add
> > > > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > > > it, and 4be497096c04 doesn't have that problem.  I also test with
> > > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > > > probabilistic because it'll depend on the timing between whatever other
> > > > list is being used and the page actually being added to the LRU.
> > > 
> > > I'll see if I can reproduce it.
> > 
> > Just updated to a current TOT Linus kernel and your latest branch,
> > and so far this is 100% reproducible.
> > 
> > Not sure how I'm going to debug it yet, because it's init that is
> > triggering it....
> 
> Eric found it ...

Yeah, just saw that and am applying his patch to test it...

> still not sure why I don't see it.

No readahead configured on your device?


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path
  2020-02-19  3:43   ` Dave Chinner
@ 2020-02-19  5:22     ` Matthew Wilcox
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19  5:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, Michal Hocko, linux-kernel, linux-f2fs-devel,
	cluster-devel, linux-mm, ocfs2-devel, linux-fsdevel, Cong Wang,
	linux-ext4, linux-erofs, linux-btrfs

On Wed, Feb 19, 2020 at 02:43:24PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2020 at 10:46:13AM -0800, Matthew Wilcox wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > Ensure that memory allocations in the readahead path do not attempt to
> > reclaim file-backed pages, which could lead to a deadlock.  It is
> > possible, though unlikely, that this is the root cause of a problem observed
> > by Cong Wang.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/readahead.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 94d499cfb657..8f9c0dba24e7 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/mm_inline.h>
> >  #include <linux/blk-cgroup.h>
> >  #include <linux/fadvise.h>
> > +#include <linux/sched/mm.h>
> >  
> >  #include "internal.h"
> >  
> > @@ -174,6 +175,18 @@ void page_cache_readahead_limit(struct address_space *mapping,
> >  		._nr_pages = 0,
> >  	};
> >  
> > +	/*
> > +	 * Partway through the readahead operation, we will have added
> > +	 * locked pages to the page cache, but will not yet have submitted
> > +	 * them for I/O.  Adding another page may need to allocate memory,
> > +	 * which can trigger memory reclaim.  Telling the VM we're in
> > +	 * the middle of a filesystem operation will cause it to not
> > +	 * touch file-backed pages, preventing a deadlock.  Most (all?)
> > +	 * filesystems already specify __GFP_NOFS in their mapping's
> > +	 * gfp_mask, but let's be explicit here.
> > +	 */
> > +	unsigned int nofs = memalloc_nofs_save();
> > +
> 
> So doesn't this largely remove the need for all the gfp flag futzing
> in the readahead path? i.e. almost all readahead allocations are now
> going to be GFP_NOFS | GFP_NORETRY | GFP_NOWARN ?

I don't think it ensures the GFP_NORETRY | GFP_NOWARN, just the GFP_NOFS
part.  IOW, we'll still need a readahead_gfp() macro at some point ... I
don't want to add that to this already large series though.

Michal also wants to kill mapping->gfp_mask, btw.
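
If we did add one, a hypothetical readahead_gfp() along those lines
might look like this (sketch, not part of this series):

	/* hypothetical helper, sketched from the discussion above */
	static inline gfp_t readahead_gfp(struct address_space *mapping)
	{
		return mapping_gfp_mask(mapping) | __GFP_NORETRY | __GFP_NOWARN;
	}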


* Re: [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
  2020-02-19  3:17   ` John Hubbard
@ 2020-02-19  5:35     ` Matthew Wilcox
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19  5:35 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 07:17:18PM -0800, John Hubbard wrote:
> > -	for (done = 0; done < length; done += ret) {
> 
> nit: this "for" loop was perfect just the way it was. :) I'd vote here for reverting
> the change to a "while" loop. Because with this change, now the code has to 
> separately initialize "done", separately increment "done", and the beauty of a
> for loop is that the loop init and control is all clearly in one place. For things
> that follow that model (as in this case!), that's a Good Thing.
> 
> And I don't see any technical reason (even in the following patch) that requires 
> this change.

It's doing the increment in the wrong place.  We want the increment done in
the middle of the loop, before we check whether we've got to the end of
the page.  Not at the end of the loop.

> > +	BUG_ON(ctx.cur_page);
> 
> Is a full BUG_ON() definitely called for here? Seems like a WARN might suffice...

Dave made a similar comment; I'll pick this up there.


* Re: [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
  2020-02-19  3:29   ` Dave Chinner
@ 2020-02-19  6:04     ` Matthew Wilcox
  2020-02-19  6:40       ` Dave Chinner
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19  6:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Wed, Feb 19, 2020 at 02:29:00PM +1100, Dave Chinner wrote:
> On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote:
> > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> >  		}
> >  		ret = iomap_readpage_actor(inode, pos + done, length - done,
> >  				ctx, iomap, srcmap);
> > +		if (WARN_ON(ret == 0))
> > +			break;
> 
> This error case now leaks ctx->cur_page....

Yes ... and I see the consequence.  I mean, this is a "shouldn't happen",
so do we want to put effort into cleanup here ...

> > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
> >  done:
> >  	if (ctx.bio)
> >  		submit_bio(ctx.bio);
> > -	if (ctx.cur_page) {
> > -		if (!ctx.cur_page_in_bio)
> > -			unlock_page(ctx.cur_page);
> > -		put_page(ctx.cur_page);
> > -	}
> > +	BUG_ON(ctx.cur_page);
> 
> And so will now trigger both a warn and a bug....

... or do we just want to run slap bang into this bug?

Option 1: Remove the check for 'ret == 0' altogether, as we had it before.
That puts us into endless loop territory for a failure mode, and it's not
parallel with iomap_readpage().

Option 2: Remove the WARN_ON from the check.  Then we just hit the BUG_ON,
but we don't know why we did it.

Option 3: Set cur_page to NULL.  We'll hit the WARN_ON, avoid the BUG_ON,
might end up with a page in the page cache which is never unlocked.

Option 4: Do the unlock/put page dance before setting the cur_page to NULL.
We might double-unlock the page.

There are probably other options here too.


* Re: [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
  2020-02-19  6:04     ` Matthew Wilcox
@ 2020-02-19  6:40       ` Dave Chinner
  2020-02-19 17:06         ` Matthew Wilcox
  0 siblings, 1 reply; 111+ messages in thread
From: Dave Chinner @ 2020-02-19  6:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 10:04:15PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 02:29:00PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote:
> > > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> > >  		}
> > >  		ret = iomap_readpage_actor(inode, pos + done, length - done,
> > >  				ctx, iomap, srcmap);
> > > +		if (WARN_ON(ret == 0))
> > > +			break;
> > 
> > This error case now leaks ctx->cur_page....
> 
> Yes ... and I see the consequence.  I mean, this is a "shouldn't happen",
> so do we want to put effort into cleanup here ...

Well, the normal thing for XFS is that a production kernel cleans up
and handles the error gracefully with a WARN_ON_ONCE, while a debug
kernel build will chuck a tanty and burn the house down so as to make
the developers aware that there is a "should not happen" situation
occurring....
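
i.e. something like this (a sketch of the policy, not a patch):

	if (ret == 0) {
		/* production: warn once, clean up and bail gracefully */
		WARN_ON_ONCE(1);
		/* debug builds would make this fatal, e.g. XFS's ASSERT() */
		break;
	}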

> > > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
> > >  done:
> > >  	if (ctx.bio)
> > >  		submit_bio(ctx.bio);
> > > -	if (ctx.cur_page) {
> > > -		if (!ctx.cur_page_in_bio)
> > > -			unlock_page(ctx.cur_page);
> > > -		put_page(ctx.cur_page);
> > > -	}
> > > +	BUG_ON(ctx.cur_page);
> > 
> > And so will now trigger both a warn and a bug....
> 
> ... or do we just want to run slap bang into this bug?
> 
> Option 1: Remove the check for 'ret == 0' altogether, as we had it before.
> That puts us into endless loop territory for a failure mode, and it's not
> parallel with iomap_readpage().
> 
> Option 2: Remove the WARN_ON from the check.  Then we just hit the BUG_ON,
> but we don't know why we did it.
> 
> Option 3: Set cur_page to NULL.  We'll hit the WARN_ON, avoid the BUG_ON,
> might end up with a page in the page cache which is never unlocked.

None of these are appealing.

> Option 4: Do the unlock/put page dance before setting the cur_page to NULL.
> We might double-unlock the page.

why would we double unlock the page?

Oh, the readahead cursor doesn't handle the case of partial page
submission, which would result in IO completion unlocking the page.

Ok, that's what the ctx.cur_page_in_bio check is used to detect, i.e.
if we've got a page that the readahead cursor points at, and we
haven't actually added it to a bio, then we can leave it to the
read_pages() to unlock and clean up. If it's in a bio, then IO
completion will unlock it and so we only have to drop the submission
reference and move the readahead cursor forwards so read_pages()
doesn't try to unlock this page. i.e:

	/* clean up partial page submission failures */
	if (ctx.cur_page && ctx.cur_page_in_bio) {
		put_page(ctx.cur_page);
		readahead_next(rac);
	}

looks to me like it will handle the case of "ret == 0" in the actor
function just fine.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-19  0:01   ` John Hubbard
  2020-02-19  1:02     ` Matthew Wilcox
@ 2020-02-19 14:41     ` Matthew Wilcox
  2020-02-19 14:52       ` Christoph Hellwig
  1 sibling, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19 14:41 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Tue, Feb 18, 2020 at 04:01:43PM -0800, John Hubbard wrote:
> How about this instead? It uses the "for" loop fully and more naturally,
> and is easier to read. And it does the same thing:
> 
> static inline struct page *readahead_page(struct readahead_control *rac)
> {
> 	struct page *page;
> 
> 	if (!rac->_nr_pages)
> 		return NULL;
> 
> 	page = xa_load(&rac->mapping->i_pages, rac->_start);
> 	VM_BUG_ON_PAGE(!PageLocked(page), page);
> 	rac->_batch_count = hpage_nr_pages(page);
> 
> 	return page;
> }
> 
> static inline struct page *readahead_next(struct readahead_control *rac)
> {
> 	rac->_nr_pages -= rac->_batch_count;
> 	rac->_start += rac->_batch_count;
> 
> 	return readahead_page(rac);
> }
> 
> #define readahead_for_each(rac, page)			\
> 	for (page = readahead_page(rac); page != NULL;	\
> 	     page = readahead_page(rac))

I'll go you one better ... how about we do this instead:

static inline struct page *readahead_page(struct readahead_control *rac)
{
        struct page *page;

        BUG_ON(rac->_batch_count > rac->_nr_pages);
        rac->_nr_pages -= rac->_batch_count;
        rac->_index += rac->_batch_count;
        rac->_batch_count = 0;

        if (!rac->_nr_pages)
                return NULL;

        page = xa_load(&rac->mapping->i_pages, rac->_index);
        VM_BUG_ON_PAGE(!PageLocked(page), page);
        rac->_batch_count = hpage_nr_pages(page);

        return page;
}

#define readahead_for_each(rac, page)                                   \
        while ((page = readahead_page(rac)))

No more readahead_next() to forget to add to filesystems which don't use
the readahead_for_each() iterator.  Ahem.


* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-19 14:41     ` Matthew Wilcox
@ 2020-02-19 14:52       ` Christoph Hellwig
  2020-02-19 15:01         ` Matthew Wilcox
  0 siblings, 1 reply; 111+ messages in thread
From: Christoph Hellwig @ 2020-02-19 14:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, John Hubbard, linux-kernel, linux-f2fs-devel,
	cluster-devel, linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4,
	linux-erofs, linux-btrfs

On Wed, Feb 19, 2020 at 06:41:17AM -0800, Matthew Wilcox wrote:
> #define readahead_for_each(rac, page)                                   \
>         while ((page = readahead_page(rac)))
> 
> No more readahead_next() to forget to add to filesystems which don't use
> the readahead_for_each() iterator.  Ahem.

And then kill readahead_for_each and open code the above to make it
even more obvious?


* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-19 14:52       ` Christoph Hellwig
@ 2020-02-19 15:01         ` Matthew Wilcox
  2020-02-19 20:24           ` John Hubbard
  0 siblings, 1 reply; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19 15:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, John Hubbard, linux-kernel, linux-f2fs-devel,
	cluster-devel, linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4,
	linux-erofs, linux-btrfs

On Wed, Feb 19, 2020 at 06:52:46AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 19, 2020 at 06:41:17AM -0800, Matthew Wilcox wrote:
> > #define readahead_for_each(rac, page)                                   \
> >         while ((page = readahead_page(rac)))
> > 
> > No more readahead_next() to forget to add to filesystems which don't use
> > the readahead_for_each() iterator.  Ahem.
> 
> And then kill readahead_for_each and open code the above to make it
> even more obvious?

Makes sense.
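
Open-coded, that would read something like (sketch; example_readahead()
and submit_one_page() are hypothetical stand-ins):

	static void example_readahead(struct readahead_control *rac)
	{
		struct page *page;

		/* no iterator macro, so nothing to forget to advance */
		while ((page = readahead_page(rac)))
			submit_one_page(page);
	}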


* Re: [PATCH v6 08/19] mm: Add readahead address space operation
  2020-02-19  3:10   ` Eric Biggers
  2020-02-19  3:35     ` Eric Biggers
@ 2020-02-19 16:52     ` Matthew Wilcox
  1 sibling, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19 16:52 UTC (permalink / raw)
  To: Eric Biggers
  Cc: cluster-devel, linux-kernel, linux-f2fs-devel, linux-xfs,
	linux-mm, linux-btrfs, linux-fsdevel, linux-ext4, linux-erofs,
	ocfs2-devel

On Tue, Feb 18, 2020 at 07:10:44PM -0800, Eric Biggers wrote:
> > +``readahead``
> > +	Called by the VM to read pages associated with the address_space
> > +	object.  The pages are consecutive in the page cache and are
> > +	locked.  The implementation should decrement the page refcount
> > +	after starting I/O on each page.  Usually the page will be
> > +	unlocked by the I/O completion handler.  If the function does
> > +	not attempt I/O on some pages, the caller will decrement the page
> > +	refcount and unlock the pages for you.	Set PageUptodate if the
> > +	I/O completes successfully.  Setting PageError on any page will
> > +	be ignored; simply unlock the page if an I/O error occurs.
> > +
> 
> This is unclear about how "not attempting I/O" works and how that affects who is
> responsible for putting and unlocking the pages.  How does the caller know which
> pages were not attempted?  Can any arbitrary subset of pages be not attempted,
> or just the last N pages?

Changed to:

``readahead``
        Called by the VM to read pages associated with the address_space
        object.  The pages are consecutive in the page cache and are
        locked.  The implementation should decrement the page refcount
        after starting I/O on each page.  Usually the page will be
        unlocked by the I/O completion handler.  If the filesystem decides
        to stop attempting I/O before reaching the end of the readahead
        window, it can simply return.  The caller will decrement the page
        refcount and unlock the remaining pages for you.  Set PageUptodate
        if the I/O completes successfully.  Setting PageError on any page
        will be ignored; simply unlock the page if an I/O error occurs.
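
(Read as a contract, that text implies an implementation shaped roughly
like the following sketch.  This is illustrative only, not from the
series: my_readahead() and my_submit_read() are hypothetical names, and
the split of cleanup duties is assumed from the wording above.)

	/* Hedged sketch of the documented contract: start I/O on each
	 * page, drop the submission reference, and stop early by simply
	 * returning.  Pages never pulled out of the readahead_control
	 * are unlocked and released by the caller. */
	static void my_readahead(struct readahead_control *rac)
	{
		struct page *page;

		while ((page = readahead_page(rac))) {
			if (my_submit_read(page) < 0) {
				/* We pulled this page but won't do I/O
				 * on it: unlock and release it ourselves,
				 * then stop.  The core cleans up the rest
				 * of the readahead window. */
				unlock_page(page);
				put_page(page);
				return;
			}
			put_page(page);	/* completion handler unlocks */
		}
	}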


* Re: [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
  2020-02-19  6:40       ` Dave Chinner
@ 2020-02-19 17:06         ` Matthew Wilcox
  0 siblings, 0 replies; 111+ messages in thread
From: Matthew Wilcox @ 2020-02-19 17:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On Wed, Feb 19, 2020 at 05:40:05PM +1100, Dave Chinner wrote:
> Ok, that's what the ctx.cur_page_in_bio check is used to detect, i.e.
> if we've got a page that the readahead cursor points at and we haven't
> actually added it to a bio, then we can leave it to read_pages() to
> unlock and clean up.  If it's in a bio, then IO completion will unlock
> it, so we only have to drop the submission reference and move the
> readahead cursor forwards so read_pages() doesn't try to unlock this
> page, i.e.:
> 
> 	/* clean up partial page submission failures */
> 	if (ctx.cur_page && ctx.cur_page_in_bio) {
> 		put_page(ctx.cur_page);
> 		readahead_next(rac);
> 	}
> 
> looks to me like it will handle the case of "ret == 0" in the actor
> function just fine.

Here's what I ended up with:

@@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
                void *data, struct iomap *iomap, struct iomap *srcmap)
 {
        struct iomap_readpage_ctx *ctx = data;
-       loff_t done, ret;
-
-       for (done = 0; done < length; done += ret) {
-               if (ctx->cur_page && offset_in_page(pos + done) == 0) {
-                       if (!ctx->cur_page_in_bio)
-                               unlock_page(ctx->cur_page);
-                       put_page(ctx->cur_page);
-                       ctx->cur_page = NULL;
-               }
+       loff_t ret, done = 0;
+
+       while (done < length) {
                if (!ctx->cur_page) {
                        ctx->cur_page = iomap_next_page(inode, ctx->pages,
                                        pos, length, &done);
@@ -418,6 +412,20 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
                }
                ret = iomap_readpage_actor(inode, pos + done, length - done,
                                ctx, iomap, srcmap);
+               done += ret;
+
+               /* Keep working on a partial page */
+               if (ret && offset_in_page(pos + done))
+                       continue;
+
+               if (!ctx->cur_page_in_bio)
+                       unlock_page(ctx->cur_page);
+               put_page(ctx->cur_page);
+               ctx->cur_page = NULL;
+
+               /* Don't loop forever if we made no progress */
+               if (WARN_ON(!ret))
+                       break;
        }
 
        return done;
@@ -451,11 +459,7 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
 done:
        if (ctx.bio)
                submit_bio(ctx.bio);
-       if (ctx.cur_page) {
-               if (!ctx.cur_page_in_bio)
-                       unlock_page(ctx.cur_page);
-               put_page(ctx.cur_page);
-       }
+       BUG_ON(ctx.cur_page);
 
        /*
         * Check that we didn't lose a page due to the arcane calling

so we'll WARN if we get ret == 0 (matching ->readpage), and we'll
BUG if we ever see a page leaked out of readpages_actor, which should
never happen and which we definitely want to be noisy about if it does.

* Re: [PATCH v6 07/19] mm: Put readahead pages in cache earlier
  2020-02-19 15:01         ` Matthew Wilcox
@ 2020-02-19 20:24           ` John Hubbard
  0 siblings, 0 replies; 111+ messages in thread
From: John Hubbard @ 2020-02-19 20:24 UTC (permalink / raw)
  To: Matthew Wilcox, Christoph Hellwig
  Cc: linux-xfs, linux-kernel, linux-f2fs-devel, cluster-devel,
	linux-mm, ocfs2-devel, linux-fsdevel, linux-ext4, linux-erofs,
	linux-btrfs

On 2/19/20 7:01 AM, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 06:52:46AM -0800, Christoph Hellwig wrote:
>> On Wed, Feb 19, 2020 at 06:41:17AM -0800, Matthew Wilcox wrote:
>>> #define readahead_for_each(rac, page)                                   \
>>>         while ((page = readahead_page(rac)))
>>>
>>> No more readahead_next() to forget to add to filesystems which don't use
>>> the readahead_for_each() iterator.  Ahem.


Yes, this looks very clean. And less error-prone, which I definitely
appreciate too. :)


>>
>> And then kill readahead_for_each and open code the above to make it
>> even more obvious?
> 
> Makes sense.
> 

Great!


thanks,
-- 
John Hubbard
NVIDIA

Thread overview: 111+ messages
2020-02-17 18:45 [PATCH v6 00/19] Change readahead API Matthew Wilcox
2020-02-17 18:45 ` [PATCH v6 01/19] mm: Return void from various readahead functions Matthew Wilcox
2020-02-18  4:47   ` Dave Chinner
2020-02-18 21:05   ` John Hubbard
2020-02-18 21:21     ` Matthew Wilcox
2020-02-18 21:52       ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 02/19] mm: Ignore return value of ->readpages Matthew Wilcox
2020-02-18  4:48   ` Dave Chinner
2020-02-18 21:33   ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 03/19] mm: Use readahead_control to pass arguments Matthew Wilcox
2020-02-18  5:03   ` Dave Chinner
2020-02-18 13:56     ` Matthew Wilcox
2020-02-18 22:46       ` Dave Chinner
2020-02-18 22:52         ` Matthew Wilcox
2020-02-18 22:22   ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 04/19] mm: Rearrange readahead loop Matthew Wilcox
2020-02-18  5:08   ` Dave Chinner
2020-02-18 13:57     ` Matthew Wilcox
2020-02-18 22:48       ` Dave Chinner
2020-02-18 22:33   ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 04/16] mm: Tweak readahead loop slightly Matthew Wilcox
2020-02-18 22:57   ` John Hubbard
2020-02-18 23:00     ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 05/16] mm: Put readahead pages in cache earlier Matthew Wilcox
2020-02-17 18:45 ` [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop Matthew Wilcox
2020-02-18  5:14   ` Dave Chinner
2020-02-18 23:08   ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 06/16] mm: Add readahead address space operation Matthew Wilcox
2020-02-17 18:45 ` [PATCH v6 06/19] mm: rename readahead loop variable to 'i' Matthew Wilcox
2020-02-18  5:33   ` Dave Chinner
2020-02-18 23:11   ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 07/16] mm: Add page_cache_readahead_limit Matthew Wilcox
2020-02-17 18:45 ` [PATCH v6 07/19] mm: Put readahead pages in cache earlier Matthew Wilcox
2020-02-18  6:14   ` Dave Chinner
2020-02-18 15:42     ` Matthew Wilcox
2020-02-19  0:59       ` Dave Chinner
2020-02-19  0:01   ` John Hubbard
2020-02-19  1:02     ` Matthew Wilcox
2020-02-19  1:13       ` John Hubbard
2020-02-19  3:24       ` John Hubbard
2020-02-19 14:41     ` Matthew Wilcox
2020-02-19 14:52       ` Christoph Hellwig
2020-02-19 15:01         ` Matthew Wilcox
2020-02-19 20:24           ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 08/16] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
2020-02-17 18:45 ` [PATCH v6 08/19] mm: Add readahead address space operation Matthew Wilcox
2020-02-18  6:21   ` Dave Chinner
2020-02-18 16:10     ` Matthew Wilcox
2020-02-19  1:04       ` Dave Chinner
2020-02-19  0:12   ` John Hubbard
2020-02-19  3:10   ` Eric Biggers
2020-02-19  3:35     ` Eric Biggers
2020-02-19 16:52     ` Matthew Wilcox
2020-02-17 18:45 ` [PATCH v6 09/16] btrfs: Convert from readpages to readahead Matthew Wilcox
2020-02-17 18:45 ` [PATCH v6 09/19] mm: Add page_cache_readahead_limit Matthew Wilcox
2020-02-18  6:31   ` Dave Chinner
2020-02-18 19:54     ` Matthew Wilcox
2020-02-19  1:08       ` Dave Chinner
2020-02-19  1:32   ` John Hubbard
2020-02-19  2:23     ` Matthew Wilcox
2020-02-19  2:46       ` John Hubbard
2020-02-17 18:45 ` [PATCH v6 10/16] erofs: Convert uncompressed files from readpages to readahead Matthew Wilcox
2020-02-17 18:45 ` [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead Matthew Wilcox
2020-02-18  1:51   ` [Ocfs2-devel] " Joseph Qi
2020-02-18  6:37   ` Dave Chinner
2020-02-19  2:48   ` John Hubbard
2020-02-19  3:28   ` Eric Biggers
2020-02-19  3:47     ` Matthew Wilcox
2020-02-19  3:55       ` Eric Biggers
2020-02-17 18:45 ` [PATCH v6 11/19] btrfs: Convert from readpages to readahead Matthew Wilcox
2020-02-18  6:57   ` Dave Chinner
2020-02-18 21:12     ` Matthew Wilcox
2020-02-19  1:23       ` Dave Chinner
2020-02-17 18:46 ` [PATCH v6 11/16] erofs: Convert compressed files " Matthew Wilcox
2020-02-19  2:34   ` Gao Xiang
2020-02-17 18:46 ` [PATCH v6 12/19] erofs: Convert uncompressed " Matthew Wilcox
2020-02-19  2:39   ` Gao Xiang
2020-02-19  3:04   ` Dave Chinner
2020-02-17 18:46 ` [PATCH v6 12/16] ext4: Convert " Matthew Wilcox
2020-02-17 18:46 ` [PATCH v6 13/19] erofs: Convert compressed files " Matthew Wilcox
2020-02-19  3:08   ` Dave Chinner
2020-02-17 18:46 ` [PATCH v6 13/16] f2fs: Convert " Matthew Wilcox
2020-02-17 18:46 ` [PATCH v6 14/19] ext4: " Matthew Wilcox
2020-02-19  3:16   ` Dave Chinner
2020-02-19  3:29   ` Eric Biggers
2020-02-17 18:46 ` [PATCH v6 14/16] fuse: " Matthew Wilcox
2020-02-17 18:46 ` [PATCH v6 15/19] f2fs: " Matthew Wilcox
2020-02-17 18:46 ` [PATCH v6 15/16] iomap: " Matthew Wilcox
2020-02-17 18:46 ` [PATCH v6 16/19] fuse: " Matthew Wilcox
2020-02-19  3:22   ` Dave Chinner
2020-02-17 18:46 ` [PATCH v6 16/16] mm: Use memalloc_nofs_save in readahead path Matthew Wilcox
2020-02-17 18:46 ` [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor Matthew Wilcox
2020-02-19  3:17   ` John Hubbard
2020-02-19  5:35     ` Matthew Wilcox
2020-02-19  3:29   ` Dave Chinner
2020-02-19  6:04     ` Matthew Wilcox
2020-02-19  6:40       ` Dave Chinner
2020-02-19 17:06         ` Matthew Wilcox
2020-02-17 18:46 ` [PATCH v6 18/19] iomap: Convert from readpages to readahead Matthew Wilcox
2020-02-19  3:40   ` Dave Chinner
2020-02-17 18:46 ` [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path Matthew Wilcox
2020-02-19  3:43   ` Dave Chinner
2020-02-19  5:22     ` Matthew Wilcox
2020-02-17 18:48 ` [PATCH v6 00/19] Change readahead API Matthew Wilcox
2020-02-18  4:56 ` Dave Chinner
2020-02-18 13:42   ` Matthew Wilcox
2020-02-18 21:26     ` Dave Chinner
2020-02-19  3:45       ` Dave Chinner
2020-02-19  3:48         ` Matthew Wilcox
2020-02-19  3:57           ` Dave Chinner
2020-02-18 20:49 ` John Hubbard
